Remember my experiment with the interview with Sharon Zhou, CEO at Lamini, where I did all the editing myself and shared the tools I used with you? For this interview, I gave up and hired a professional video editor. Even with all the awesome AI tools, it still takes too much time if it's not your primary specialty. So, to the point of AI stealing our jobs: what AI tools actually do is accelerate professionals, and you still need to hire that professional.
In this episode of "Inference": Mati Staniszewski, ElevenLabs' co-founder
Mati's first language is Polish; mine is Russian. Right now we're conversing easily in English, simply because we can. But imagine our parents were here, too: how would they communicate? Will there soon come a time when any conversation can flow effortlessly, without language barriers?
Mati is co-founder and CEO at ElevenLabs, an AI audio company specializing in lifelike speech synthesis and multilingual dubbing, and he thinks we're close. Real-time voice translation is already working in narrow use cases such as customer support and healthcare. The harder part is preserving tone, emotion, and timing. That's where most systems still break.
The tech is pretty straightforward so far: speech-to-text, LLM translation, text-to-speech. But real-time conversations that capture nuance, emotion, and context? That's still tough.
What makes it hard? Capturing subtle emotional cues. Detecting who's speaking in noisy rooms. Keeping latency low enough for natural conversations. ElevenLabs nailed Lex Fridman's podcast dubbing, but it took weeks, not seconds.
So how soon before we chat freely across languages? Two to three years, Mati predicts. But what exactly still needs solving? And what happens next?
Watch and listen in this episode.
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription, do it here.
Subscribe to our YouTube channel, and/or read the transcript (edited for clarity, brevity, and sanity) ⬇️

You speak Polish. I speak Russian. Now we speak English, so we can do it. But if, say, our parents were here, they would not be able to talk. When will we reach the moment when two people who speak different languages can talk without any language barrier?
Mati: Such a great question. You might have heard about why we started ElevenLabs; it was almost exactly because of that. When you watch movies in Polish, all the voices, whether male or female, are narrated by just one person. That flattens the experience: the original emotion, tonality, and nuance get lost. That's something we'd really love to change.
When you think about the technology, we're already seeing it preserve much of the original voice and even some emotional intonation. Real-time translation is a bit harder. Still, I think what we're seeing already shows incredible promise; it's possible in some cases. The next stage will be figuring out how to make it real-time so that you can actually stream that content. I think this will happen this year.
Really?
Mati: And then, of course, there's the bigger question of what comes next. Ideally, you'd have some kind of device, maybe headphones, or even just your phone, where you can speak and have it live-translated to the other person. EarPods are great because they let the other person listen as you speak, so it can stream in real time.
But to make this widely adopted, we'll need smaller, faster models, more lightweight and efficient. My guess is that in the next two to three years we'll start seeing that out in the world. And hopefully, within five years, it'll be something anyone can use, anywhere.
And yeah, how incredible would that be?
I'm really excited about this; I think it's something we absolutely need. But I went to CES this year and tried a few of these devices... and they're just not there yet.
My newsletter focuses on machine learning and artificial intelligence, so the audience is fairly technical. Can you walk me through the key technical challenges that still need to be solved before this becomes possible?
Mati: Of course. And I don't know if you're familiar with The Hitchhiker's Guide to the Galaxy and the Babel Fish? That's exactly where this is headed, and I'd love to be part of making it real.
Right now, the process involves three main steps. First, you have speech-to-text: you try to understand what someone is saying. But beyond just transcribing the words, the future goal is deeper: understanding who is speaking. If there are multiple people, you need speaker diarization to distinguish between them.
And then, you also want to capture the emotions behind the speech: how something is being said, not just what is being said.
Yes, how do you understand emotions?
Mati: Exactly. There's so little data that includes not just the speech and the text, but also the metadata, like how something was said. So I think that's a real barrier: how do you create that at scale, with high quality? That's something we're working on.
So, speech-to-text is the first part. Then you need the LLM part (it could be another model, but LLMs are probably the best solution), where you translate from one language to another. But the important thing is that you want to preserve the original meaning.
And depending on how much of the sentence you take, that meaning might shift. If you have two sentences, and one explains the other, the way you should translate and deliver it can be very different. Like, you might have a sentence: "What a wonderful day." Easy enough to translate. But if it's: "What a wonderful day," I said sarcastically, that's a completely different meaning and tone.
So there's this layer in the middle. And, of course, a wider set of phrases that people might use, which need explanation. Then, at the end, you have the text-to-speech step, which also needs to carry over the emotion and voice of the original and be able to recreate that on the other side. Depending on the use case, it also needs to be roughly the same length, so it feels more seamless.
So you have those three steps: speech-to-text, LLM, text-to-speech. Where the space is today: LLMs are good for some use cases, especially flatter deliveries, but for a lot of others, especially with niche words or phrases, it's harder. Speech-to-text, I think, is in a good spot overall. We recently released our speech-to-text model, Scribe, which beat all the benchmarks; we're really happy about that. Of course that helped lower this barrier, but still, you will have a long tail of other languages that will be harder. The emotion part needs to get fixed, and it isn't yet. Text-to-speech, I think, is very good, but it will still need more context understanding from the speaker.
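For the technical readers, here is a minimal sketch of that three-step pipeline in Python. Everything in it is an assumption for illustration: the function names, the Segment fields, and the prompt are placeholders, not the ElevenLabs API or their actual architecture.
```python
# A minimal sketch of the three-step pipeline: speech-to-text (with diarization
# and emotion tags), LLM translation that sees surrounding context, and
# text-to-speech that reuses the original speaker's voice. All names here are
# hypothetical placeholders, not the ElevenLabs API.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str        # who is talking (from diarization)
    text: str           # what was said
    emotion: str        # how it was said, e.g. "sarcastic", "excited"
    duration_s: float   # length of the original audio segment

def transcribe(audio: bytes) -> list[Segment]:
    """Speech-to-text + speaker diarization + emotion labels. Placeholder."""
    raise NotImplementedError

def translate(seg: Segment, target_lang: str, context: list[Segment]) -> str:
    """LLM translation that tries to preserve meaning, tone, and rough length."""
    prompt = (
        f"Translate into {target_lang}, preserving the meaning and the tone "
        f"({seg.emotion}). Keep roughly the same spoken length.\n"
        f"Context: {' '.join(s.text for s in context)}\n"
        f"Sentence: {seg.text}"
    )
    raise NotImplementedError  # send `prompt` to the LLM of your choice

def synthesize(text: str, voice: str, emotion: str) -> bytes:
    """Text-to-speech in the original speaker's voice. Placeholder."""
    raise NotImplementedError

def dub_segmentwise(audio: bytes, target_lang: str) -> list[bytes]:
    segments = transcribe(audio)
    dubbed = []
    for i, seg in enumerate(segments):
        context = segments[max(0, i - 2): i + 2]   # a little look-behind/ahead
        translated = translate(seg, target_lang, context)
        dubbed.append(synthesize(translated, voice=seg.speaker, emotion=seg.emotion))
    return dubbed
```
The interesting engineering is exactly where Mati says it is: the emotion labels coming out of transcription and the context window passed into translation.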
How would you solve the emotion problem?
Mati: The first step would be to create a very large dataset of... well, we're working with voice coaches and others to build a larger set of audio, which we then annotate for how things are said.
So even in our conversation, let's say someone takes a sample and labels it: "In this sentence, the emotional tone is excited, calm, and the speaker stutters every so often."
That kind of labeling helps. If you have enough of those examples, the model can start to generalize and understand new, unseen examples as well.
Of course, once you go into different languages, the way you describe emotionality might vary, and it's still unclear whether you can rely on the same technology to translate it across all of them. That's still unknown. But yeah, it's a really interesting challenge.
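For readers wondering what such a labeled example might look like in practice, here is a small sketch. The schema and field names are my guess for illustration, not ElevenLabs' actual annotation format.
```python
# What a single annotated clip might look like, following the labeling Mati
# describes ("excited, calm, and the speaker stutters every so often").
# The schema and field names are assumptions, not ElevenLabs' actual format.
from dataclasses import dataclass, field

@dataclass
class AnnotatedClip:
    audio_path: str                  # the raw speech
    transcript: str                  # what was said
    language: str
    emotions: list[str] = field(default_factory=list)  # how it was said
    delivery_notes: str = ""         # free-form: pacing, stutters, emphasis

example = AnnotatedClip(
    audio_path="clips/interview_0042.wav",   # hypothetical path
    transcript="What a wonderful day.",
    language="en",
    emotions=["sarcastic", "calm"],
    delivery_notes="flat pitch, slight pause before 'wonderful'",
)
```
With enough of these, the hope is that a model generalizes to unseen clips; the open question Mati raises is whether the same label vocabulary transfers across languages.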
In your conversational AI, which combines STT, LLMs, and TTS, what level of performance have you achieved across each component? Or, in percentage terms, how far along are you?
Mati: I think for some use cases, we're already there. In certain scenarios, it can really help. Maybe to give a concrete example: we work with a company in the U.S. called Hippocratic, which is automating some of the tasks nurses don't have time for around patient appointments. Things like calling patients to ask how they're feeling, reminding them to take medication, scheduling follow-ups, that kind of stuff. And it works. That's something that's already possible.
And that's English to English, of course.
Then there's another company we work with on the customer support side, where callers speak one language and the agents only understand another. So it live-translates between them. And it works really well, because previously there was no way for those people to communicate at all. Now they can. What it doesn't fully do yet is preserve the emotional tone in the same way, but for this particular customer support use case, that wasn't the main barrier. So in that context, it's working well.
Where it doesn't work as well yet is in the more complex, high-emotion scenarios, like the real-time dubbing or translation you mentioned earlier. We're working with some media companies who want to broadcast, say, a sports event and have it translated in real time. But that's tough: there are so many names, fast speech, emotional highs, and a commentary style that's hard to match. So there's still a bit of a barrier there.
That said, I'd say for use cases like call centers, healthcare, and education, we're already seeing real deployments happening now. It's really just a matter of scaling it further.
And I think in the next 12 to 18 months, we'll start to see more of these conversational AI agents in emotionally sensitive or high-context settings, whether that's real-time dubbing or emotionally rich conversations. That's what's coming next.
That's amazing. How hard was it to do the dubbing for the Lex Fridman podcast? It sounded incredible!
Mati: Thank you. It was pretty hard, mainly because we wanted to make sure that every word, every sentence was translated the right way.
It was also a big moment for us. Given where we started and our mission, being able to work with someone like Lex, and others like him, and then seeing that work shared with such a broad audience... it felt very close to heart.
We spent a lot of time on QA and QC, both for the translation, making sure everything was as accurate as possible, and with external partners who helped us review it. We also put a lot of focus on the audio. We wanted to make sure the voices of the speakers were represented properly, with the same tone and presence.
We would literally listen side by side, comparing the original with the new version. And of course, the tricky thing with podcasts on platforms like YouTube is that the length has to match exactly.
So even if one language is naturally longer than another, you still need to fit it into the same time window. That means taking the original sentence, maybe paraphrasing it slightly, and only then generating the audio on the other side.
But then you're working with longer content, different emotions, and altered phrasing. How do you handle that? Very carefully. I don't remember the exact number, but I think English to Spanish ends up being 30-40% longer when spoken. So trying to compress that into the same time frame is hard.
That was a real challenge, but also a meaningful one. If we can one day do this kind of dubbing semi-automatically or even fully automatically (and we're getting close), that would be a powerful proof point that the tech is really there.
It's like putting the system through a high-stress test, aiming for 1000% quality.
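A rough sketch of that compress-to-fit step, for intuition. The words-per-second constant, the 5% slack, and the filler-stripping fallback are all assumptions; a production system would synthesize and measure the actual audio, or ask an LLM for a tighter paraphrase.
```python
# Sketch of fitting a translated sentence into the original time window.
# WORDS_PER_SECOND, the 5% slack, and the filler-word fallback are assumptions.
WORDS_PER_SECOND = 2.5          # rough conversational speaking rate
FILLERS = {"really", "actually", "just", "very", "quite"}

def estimate_spoken_seconds(text: str) -> float:
    return len(text.split()) / WORDS_PER_SECOND

def paraphrase_shorter(text: str) -> str:
    # Stand-in for an LLM call that rewrites the translation more concisely;
    # here we only strip filler words so the sketch stays runnable.
    return " ".join(w for w in text.split() if w.lower().strip(",.!?") not in FILLERS)

def fit_to_window(translated: str, original_seconds: float, max_attempts: int = 3) -> str:
    text = translated
    for _ in range(max_attempts):
        if estimate_spoken_seconds(text) <= original_seconds * 1.05:   # 5% slack
            return text
        text = paraphrase_shorter(text)
    return text   # still too long: hand it to a human reviewer
```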
So it took you, like, a day?
Mati: It took us longer than a day. I think it took, depending on the podcast, anywhere from a week to two weeks.
The hardest part was the translation; we worked with another team to get that exactly right. And not just the translation itself, but translation in the context of audio. So it's not just about getting the words right; you also have to match the length. That back-and-forth, refining both the meaning and the timing, took the most time.
The audio part, just generating a good voice, was relatively easier. But generating a great voice with the right emotions, and doing it within the time constraints, that was much harder.
Wow. I actually thought it was really bold to go for it.
Mati: Thank you. I'm not sure if you know this, but about three years ago, I was reading your work, and I think I even tried to reach out.
You did, you did, but I couldn't collaborate with ElevenLabs at that moment. And then I reached out to you on LinkedIn this December, asking for your predictions. But you didn't respond.
Mati: Maybe I can still do that!
We'll do it next year. But let's discuss the current situation here: someone walks in, and there's noise. How do you deal with that in conversational AI?
Mati: We have a pretty good model now that tries to detect, based on the speaker's volume and the context of what was said before, whether something is a real interruption or a fake one. Like, someone keeps speaking, and the system is trying to figure out: is this actually someone else jumping in, or just the same person continuing?
It works fairly well, but it's not a bulletproof solution.
One future direction we're excited about is something like this: say I'm speaking to an agent. Based on my first few sentences, the system encodes my voice and then uses that as a reference. So later, it can check: is this still the same voice? That would be much more accurate and fair.
Of course, you have different use cases. Sometimes you want multiple people in the conversation. So what we're thinking is: if you're speaking with an agent, you could have a setting where it auto-detects who's speaking and tailors responses just to that person.
Let's say it's Ksenia speaking. The system detects it's your voice based on an initial sample, and then keeps checking: is the same person still speaking? It does that by comparing the voice embeddings. If we can do that quickly enough, then the agent can just keep the conversation flowing naturally.
And if another speaker comes in, the system checks: does this match the previous embedding? If not, it knows someone else is speaking and switches accordingly, without cutting anyone off mid-sentence.
The real challenge is: how do you do that fast enough to keep the interaction seamless?
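In code, the check Mati describes could look something like the sketch below: enroll a reference embedding from the first few sentences, then compare each new chunk against it. The embedding model itself is left abstract, and the 0.75 threshold is an arbitrary assumption.
```python
# Sketch of the "is this still the same voice?" check: keep a reference
# embedding from the caller's first sentences and compare later chunks to it.
# The embedding model is out of scope here; the threshold is an assumption.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SpeakerGate:
    def __init__(self, threshold: float = 0.75):
        self.reference = None       # enrolled on the first embedding we see
        self.threshold = threshold

    def is_same_speaker(self, embedding: np.ndarray) -> bool:
        if self.reference is None:                  # first few sentences: enroll
            self.reference = embedding
            return True
        return cosine_similarity(self.reference, embedding) >= self.threshold
```
A real deployment would also refresh the reference over time and, as Mati notes, has to run this per audio chunk fast enough that it never adds noticeable latency.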
Yes, latency.
Mati: But we think we can do it, especially for noisy environments, which some people have specifically requested support for in our work. We're hoping to ship it next quarter, so Q2. That should really help.
And of course, sometimes you might not want it to auto-assign to just one speaker. You might prefer to allow multiple speakers. In that case, we'll keep it broader and more flexible.
How's the latency? Did you solve this problem?
Mati: Current latency, end-to-end, including everything like network delays, interruptions, even handling names, is around one to 1.2 seconds between responses, depending on the region. So, very quick.
That's not bad!
Mati: The text-to-speech part, generating audio from text, is actually the fastest. We have the quickest model out there, with a latency of just 70 milliseconds.
Where it takes more time is in the transcription and LLM stages. But the key challenge isn't just speed; it's when to jump in. Do you respond the moment someone stops speaking? What if they're just pausing mid-sentence?
In our conversational framework, we've built a number of smart mechanisms behind the scenes to handle this. We analyze the context to see if it sounds like a natural end to a sentence. We check the length of the silence that follows. All of those signals are combined, and only then do we generate a response.
Meanwhile, as we're running those checks, we're already pre-generating some of the LLM output. That way, we can start streaming the response faster once the decision is made. That's a big part of why the system feels so quick and natural end-to-end.
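As a toy illustration of combining those signals, here is a sketch. The heuristics and thresholds are made up for illustration; the actual logic Mati describes is model-based and more involved.
```python
# Toy end-of-turn check combining two signals: does the transcript look like a
# finished sentence, and how long has the silence been? The thresholds and the
# completeness heuristic are assumptions, not ElevenLabs' actual values.
SILENCE_IF_COMPLETE_S = 0.3     # a short pause after a finished sentence is enough
SILENCE_IF_INCOMPLETE_S = 1.0   # wait longer if the sentence seems unfinished

def looks_complete(transcript: str) -> bool:
    t = transcript.strip().lower()
    ends_with_terminal = t.endswith((".", "?", "!"))
    trails_off = t.rstrip(".?!").endswith((" and", " but", " so", " because"))
    return ends_with_terminal and not trails_off

def should_respond(transcript: str, silence_seconds: float) -> bool:
    needed = SILENCE_IF_COMPLETE_S if looks_complete(transcript) else SILENCE_IF_INCOMPLETE_S
    return silence_seconds >= needed
```
While should_respond is still returning False, the reply can already be generated speculatively in the background and discarded if the user keeps talking, which is the pre-generation Mati mentions.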
An alternative would be to build a true multimodal system, training all three components together. That would feel even more fluid and natural. But you lose a level of control that's really important in certain domains.
With the three-step pipeline, you can ensure the LLM stays on topic, especially in areas like customer support or healthcare, where precision matters. You want it to say exactly what you intend, nothing more or less. And that's where all the additional checks come in, which can introduce some latency, yes, but also improve safety and reliability.
Multimodal systems trade off some of that control for more naturalness. We're actively working on that too, from a research perspective. But for large enterprise customers, the three-step setup is still what we recommend: it's more stable and easier to monitor.
I was actually going to ask about that, because it seems like the next step for conversational AI, and for combining these technologies, would be moving toward something multimodal at some point.
Mati: At some point, and for certain use cases where emotional nuance or a real sense of connection is especially important, we do think multimodal will become the standard. Probably within the next year or two.
To make it more concrete: in interactions where there's no need to take direct action on the backend, like issuing a refund or canceling something, a multimodal experience could actually be more useful.
Of course, you still want to be cautious. For example, you wouldn't want a therapist AI to be multimodal unless you're absolutely certain it works reliably, every time. Otherwise, you risk hallucinations, both in the text and the audio, which makes the system less stable.
But long term, say in the next three to five years, we'll likely see more stability and strong proof points that multimodal is the way to go for those emotionally rich use cases.
And when that happens, we'll start to see broader adoption. But in the meantime, the three-step solution still gives you something valuable: something modular and controllable.
It's working.
Mati: Exactly. It's scalable, it works, and it carries emotion. You can build your knowledge base into the agents, and do it relatively simply.
You started as a research-first company. How deeply are you still involved in the technical side?
Mati: My co-founder is incredible; he's the genius and the brain behind all the models we've built. He's managed to bring together some of the best researchers in the audio space. So now we have a very strong audio research team that keeps pushing out incredible models and topping benchmarks.
We're still heavily focused on audio research and will be for years to come. Our goal is to build frontier models, the best in the field, whether that's for speech-to-text, conversational agents, or other areas of audio understanding and generation. That work will absolutely continue.
On the personal side, what we really believe is that research alone isn't enough. You also need the product layer: the end-to-end experience for users. It's not enough to have a great text-to-speech model with good narration. You also need a product or workflow that lets someone create an entire audiobook, or build a full voice agent experience, something that integrates their knowledge base, their functions, their world.
That's the part I stay close to.
And when it comes to deploying this tech to clients, one thing we do, maybe less traditionally, is have engineers work directly with them. They embed into the client's workflow, understand their needs...
Pain points.
Mati: Exactly. And then build that solution more closely around their needs.
Do you miss being involved in research more?
Mati: I was never directly involved in the research-research side of things, but I did work on some of the product aspects: actually building the product itself.
Back at my previous company, which was part of Palantir, I was much closer to the technical side, working on pipelines and helping clients with optimization problems. There are definitely parts of that I miss, especially the math and some of the engineering thinking that came with it.
But one thing that's still true, and something I really love about my current role, is that I still work closely with clients. I talk to them a lot, try to understand their problems, and figure out what solutions we can build. That feeds directly into how we shape the product.
And even though I'm not writing code myself anymore, it's still really rewarding to see how the work gets deployed.
Congrats on your recent huge round. Congrats on the collaboration with Lex Fridman, and now this week you announced the partnership with Google Cloud. ElevenLabs seems to be everywhere. And going back to when you first reached out to me, I remember thinking it was such a smart move to work with newsletters. So, what's your strategy?
Mati: One thing we try to stay true to is really understanding the problem we're trying to solve, and then building the right solution around that. But the other piece is just as important: even if you know the problem and you've built a great solution, most of the world still doesn't know it exists. Most customers don't know it's possible.
So the big question becomes: how do you tell them? How do you show people that the technology is finally here to solve these issues?
And because we started with research, and built so much of the early stack, people don't always trust or believe it when you just say, "This is the best human-like model," or, "This is the most emotional model." That's not enough. So we're always looking for ways to show the world, not just tell them, what's actually possible, with real use cases.
One part of the strategy was to open the technology to creators and developers. Let them use it in their own projects. Let them show what's possible. And honestly, even we learned things we didn't expect: use cases we hadn't thought of. So that discovery process went both ways.
At the same time, we've always believed the research is strong. We know it solves real problems, from audiobook narration and newsletter voiceovers, to podcast dubbing, to film voiceovers, and now voice agents. But instead of just claiming that, we work with creators, developers, and innovators to show the breadth. And in parallel, we partner with large companies to go deep: understand the scale, the enterprise-level requirements, the security and compliance needs.
Like last week, we were proud to announce our partnership with Deutsche Telekom. That's a very different angle. But part of why we're excited about it is that they've already seen how people engage with voice, through podcasts, through phone calls, and they've seen the quality. Now the focus is: how do we go even deeper? How do we build something truly end-to-end?
So that's how we're thinking about it. On one side, make the best technology widely accessible. On the other, build very specific, deeply integrated enterprise solutions.
You have the research, you have the product, everyone loves you, but you've never open-sourced anything. Why? And are you going to?
Mati: Great question. We actually spent a lot of time thinking about this.
Right now, because we've invested so much into our research and focused on areas that others weren't really paying attention to, if we opened it up too early, we'd honestly be giving away a lot of our advantage and IP. That would make it easier for others to recreate what we've built.
Of course, this will change as we build more of the product layer. And to be fair, over the past two years we've added a lot on that front, but we're still relatively early compared to where we want to be.
Once we feel we've nailed the right parameters around the product, something more mature and robust, then we'll be happy to open up more of the research side. That would help strengthen the mission overall.
But today, given the time, energy, and resources we've put in, we want to keep things close, for now. Especially since many of the companies in this space have far more resources than we do.
There are a lot of competitors.
Mati: Yes, exactly. And you know, some of the big companies, especially the hyperscalers, will likely enter this space at some point.
So eventually, we will open up access. But for now, it's more about building the foundation, making sure the product is where it needs to be.
Also part of the strategy. So, probably my last question: is the dubbing problem in Poland solved?
Mati: Yeah, I think we're headed in a good direction. It's not solved yet, but we're getting there.
You know, it's one thing to solve the research problem; it's another to actually get the world to know the solution exists and use it. That's where we've been spending a lot of time, working with companies to bring it into real use.
And I think in the next few years, single digits for sure, maybe even just two or three, it'll be solved.
Thank you so much for this conversation! And to our readers: thank you for watching and reading!
