Remember my experiment with the interview with Sharon Zhou, CEO at Lamini, where I did all the editing myself and shared the tools I used with you? For this interview, I gave up and hired a professional video editor. Even with all the awesome AI tools, it still takes too much time if it's not your primary specialty. So, to the point of AI stealing our jobs: what AI tools actually do is accelerate professionals, and you still need to hire that professional.
In this episode of "Inference": Mati Staniszewski, ElevenLabs' co-founder
Mati's first language is Polish; mine is Russian. Right now we're conversing easily in English, simply because we can. But imagine our parents were here, too: how would they communicate? Will there soon come a time when any conversation can flow effortlessly, without language barriers?
Mati is co-founder and CEO at ElevenLabs, an AI audio company specializing in lifelike speech synthesis and multilingual dubbing, and he thinks we're close. Real-time voice translation is already working in narrow use cases such as customer support and healthcare. The harder part is preserving tone, emotion, and timing. That's where most systems still break.
The tech is pretty straightforward so far: speech-to-text, LLM translation, text-to-speech. But real-time conversations that capture nuance, emotion, and context? That's still tough.
What makes it hard? Capturing subtle emotional cues. Detecting who's speaking in noisy rooms. Keeping latency low enough for natural conversations. ElevenLabs nailed Lex Fridman's podcast dubbing, but it took weeks, not seconds.
So how soon before we chat freely across languages? Two to three years, Mati predicts. But what exactly still needs solving? And what happens next?
Watch and listen in this episode.
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription, do it here.
Subscribe to our YouTube channel, and/or read the transcript (edited for clarity, brevity, and sanity) ⬇️

You speak Polish. I speak Russian. Now we speak English, so we can do it. But if, say, our parents were here, they would not be able to talk. When will we reach the moment when two people who speak different languages can talk without any language barrier?
Mati: Such a great question. You might have heard about why we started ElevenLabs; it was almost exactly because of that. When you watch movies in Polish, all the voices, whether male or female, are narrated by just one person. That flattens the experience: the original emotion, tonality, and nuance get lost. That's something we'd really love to change.
When you think about the technology, we're already seeing it preserve much of the original voice and even some emotional intonation. Real-time translation is a bit harder. Still, I think what we're seeing already shows incredible promise; it's possible in some cases. The next stage will be figuring out how to make it real-time so that you can actually stream that content. I think this will happen this year.
Really?
Mati: And then, of course, there's the bigger question of what comes next. Ideally, you'd have some kind of device, maybe headphones, or even just your phone, where you can speak and have it live-translated to the other person. EarPods are great because they let the other person listen as you speak, so it can stream in real time.
But to make this widely adopted, we'll need smaller, faster models, more lightweight and efficient. My guess is that in the next two to three years we'll start seeing that out in the world. And hopefully, within five years, it'll be something anyone can use, anywhere.
And yeah, how incredible would that be?
I'm really excited about this; I think it's something we absolutely need. But I went to CES this year and tried a few of these devices... and they're just not there yet.
My newsletter focuses on machine learning and artificial intelligence, so the audience is fairly technical. Can you walk me through the key technical challenges that still need to be solved before this becomes possible?
Mati: Of course. And I don't know if you're familiar with The Hitchhiker's Guide to the Galaxy and the Babel Fish? That's exactly where this is headed, and I'd love to be part of making it real.
Right now, the process involves three main steps. First, you have speech-to-text: you try to understand what someone is saying. But beyond just transcribing the words, the future goal is deeper: understanding who is speaking. If there are multiple people, you need speaker diarization to distinguish between them.
And then, you also want to capture the emotions behind the speech: how something is being said, not just what is being said.
Yes, how do you understand emotions?
Mati: Exactly. There's so little data that includes not just the speech and the text, but also the metadata, like how something was said. So I think that's a real barrier: how do you create that at scale, with high quality? That's something we're working on.
So, speech-to-text is the first part. Then you need the LLM part (it could be another model, but LLMs are probably the best solution), where you translate from one language to another. But the important thing is that you want to preserve the original meaning.
And depending on how much of the sentence you take, that meaning might shift. If you have two sentences, and one explains the other, the way you should translate and deliver it can be very different. Like, you might have a sentence: "What a wonderful day." Easy enough to translate. But if it's: "What a wonderful day," I said sarcastically, that's a completely different meaning and tone.
So there's this layer in the middle. And, of course, a wider set of phrases that people might use, which need explanation. Then, at the end, you have the text-to-speech step, which also needs to carry over the emotion and voice of the original and be able to recreate that on the other side. Depending on the use case, it also needs to be roughly the same length, so it feels more seamless.
So you have those three steps: speech-to-text, LLM, text-to-speech. Where the space is today: LLMs are good for some use cases, especially flatter deliveries, but for a lot of others, especially with niche words or phrases, it's harder. Speech-to-text, I think, is in a good spot overall. We recently released our speech-to-text model, Scribe, which beat all the benchmarks; we're really happy about that. Of course that helped lower this barrier, but still, you will have a long tail of other languages that will be harder. The emotion part needs to get fixed, and it isn't yet. Text-to-speech, I think, is very good, but it will still need more context understanding from the speaker.
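For the technical readers, here is a minimal sketch of that three-step pipeline in Python. Everything in it is an assumption for illustration: the function names, the Segment fields, and the prompt are placeholders, not the ElevenLabs API or their actual architecture.
```python
# A minimal sketch of the three-step pipeline: speech-to-text (with diarization
# and emotion tags), LLM translation that sees surrounding context, and
# text-to-speech that reuses the original speaker's voice. All names here are
# hypothetical placeholders, not the ElevenLabs API.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str        # who is talking (from diarization)
    text: str           # what was said
    emotion: str        # how it was said, e.g. "sarcastic", "excited"
    duration_s: float   # length of the original audio segment

def transcribe(audio: bytes) -> list[Segment]:
    """Speech-to-text + speaker diarization + emotion labels. Placeholder."""
    raise NotImplementedError

def translate(seg: Segment, target_lang: str, context: list[Segment]) -> str:
    """LLM translation that tries to preserve meaning, tone, and rough length."""
    prompt = (
        f"Translate into {target_lang}, preserving the meaning and the tone "
        f"({seg.emotion}). Keep roughly the same spoken length.\n"
        f"Context: {' '.join(s.text for s in context)}\n"
        f"Sentence: {seg.text}"
    )
    raise NotImplementedError  # send `prompt` to the LLM of your choice

def synthesize(text: str, voice: str, emotion: str) -> bytes:
    """Text-to-speech in the original speaker's voice. Placeholder."""
    raise NotImplementedError

def dub_segmentwise(audio: bytes, target_lang: str) -> list[bytes]:
    segments = transcribe(audio)
    dubbed = []
    for i, seg in enumerate(segments):
        context = segments[max(0, i - 2): i + 2]   # a little look-behind/ahead
        translated = translate(seg, target_lang, context)
        dubbed.append(synthesize(translated, voice=seg.speaker, emotion=seg.emotion))
    return dubbed
```
The interesting engineering is exactly where Mati says it is: the emotion labels coming out of transcription and the context window passed into translation.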
How would you solve the emotion problem?
Mati: The first step would be to create a very large dataset of... well, we're working with voice coaches and others to build a larger set of audio, which we then annotate for how things are said.
So even in our conversation, let's say someone takes a sample and labels it: "In this sentence, the emotional tone is excited, calm, and the speaker stutters every so often."
That kind of labeling helps. If you have enough of those examples, the model can start to generalize and understand new, unseen examples as well.
Of course, once you go into different languages, the way you describe emotionality might vary, and it's still unclear whether you can rely on the same technology to translate it across all of them. That's still unknown. But yeah, it's a really interesting challenge.
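For readers wondering what such a labeled example might look like in practice, here is a small sketch. The schema and field names are my guess for illustration, not ElevenLabs' actual annotation format.
```python
# What a single annotated clip might look like, following the labeling Mati
# describes ("excited, calm, and the speaker stutters every so often").
# The schema and field names are assumptions, not ElevenLabs' actual format.
from dataclasses import dataclass, field

@dataclass
class AnnotatedClip:
    audio_path: str                  # the raw speech
    transcript: str                  # what was said
    language: str
    emotions: list[str] = field(default_factory=list)  # how it was said
    delivery_notes: str = ""         # free-form: pacing, stutters, emphasis

example = AnnotatedClip(
    audio_path="clips/interview_0042.wav",   # hypothetical path
    transcript="What a wonderful day.",
    language="en",
    emotions=["sarcastic", "calm"],
    delivery_notes="flat pitch, slight pause before 'wonderful'",
)
```
With enough of these, the hope is that a model generalizes to unseen clips; the open question Mati raises is whether the same label vocabulary transfers across languages.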
In your conversational AI, which combines STT, LLMs, and TTS, what level of performance have you achieved across each component? Or, in percentage terms, how far along are you?
Mati: I think for some use cases, we're already there. In certain scenarios, it can really help. Maybe to give a concrete example: we work with a company in the U.S. called Hippocratic, which is automating some of the tasks nurses don't have time for around patient appointments. Things like calling patients to ask how they're feeling, reminding them to take medication, scheduling follow-ups, that kind of stuff. And it works. That's something that's already possible.
And that's English to English, of course.
Then there's another company we work with on the customer support side, where callers speak one language and the agents only understand another. So it live-translates between them. And it works really well, because previously there was no way for those people to communicate at all. Now they can. What it doesn't fully do yet is preserve the emotional tone in the same way, but for this particular customer support use case, that wasn't the main barrier. So in that context, it's working well.
Where it doesn't work as well yet is in the more complex, high-emotion scenarios, like the real-time dubbing or translation you mentioned earlier. We're working with some media companies who want to broadcast, say, a sports event and have it translated in real time. But that's tough: there are so many names, fast speech, emotional highs, and a commentary style that's hard to match. So there's still a bit of a barrier there.
That said, I'd say for use cases like call centers, healthcare, and education, we're already seeing real deployments happening now. It's really just a matter of scaling it further.
And I think in the next 12 to 18 months, we'll start to see more of these conversational AI agents in emotionally sensitive or high-context settings, whether that's real-time dubbing or emotionally rich conversations. That's what's coming next.
That's amazing. How hard was it to do the dubbing for the Lex Fridman podcast? It sounded incredible!
Mati: Thank you. It was pretty hard, mainly because we wanted to make sure that every word, every sentence was translated the right way.
It was also a big moment for us. Given where we started and our mission, being able to work with someone like Lex, and others like him, and then seeing that work shared with such a broad audience... it felt very close to heart.
We spent a lot of time on QA and QC, both for the translation, making sure everything was as accurate as possible, and with external partners who helped us review it. We also put a lot of focus on the audio. We wanted to make sure the voices of the speakers were represented properly, with the same tone and presence.
We would literally listen side by side, comparing the original with the new version. And of course, the tricky thing with podcasts on platforms like YouTube is that the length has to match exactly.
So even if one language is naturally longer than another, you still need to fit it into the same time window. That means taking the original sentence, maybe paraphrasing it slightly, and only then generating the audio on the other side.
But then you're working with longer content, different emotions, and altered phrasing. How do you handle that? Very carefully. I don't remember the exact number, but I think English to Spanish ends up being 30-40% longer when spoken. So trying to compress that into the same time frame is hard.
That was a real challenge, but also a meaningful one. If we can one day do this kind of dubbing semi-automatically or even fully automatically (and we're getting close), that would be a powerful proof point that the tech is really there.
It's like putting the system through a high-stress test, aiming for 1000% quality.
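A rough sketch of that compress-to-fit step, for intuition. The words-per-second constant, the 5% slack, and the filler-stripping fallback are all assumptions; a production system would synthesize and measure the actual audio, or ask an LLM for a tighter paraphrase.
```python
# Sketch of fitting a translated sentence into the original time window.
# WORDS_PER_SECOND, the 5% slack, and the filler-word fallback are assumptions.
WORDS_PER_SECOND = 2.5          # rough conversational speaking rate
FILLERS = {"really", "actually", "just", "very", "quite"}

def estimate_spoken_seconds(text: str) -> float:
    return len(text.split()) / WORDS_PER_SECOND

def paraphrase_shorter(text: str) -> str:
    # Stand-in for an LLM call that rewrites the translation more concisely;
    # here we only strip filler words so the sketch stays runnable.
    return " ".join(w for w in text.split() if w.lower().strip(",.!?") not in FILLERS)

def fit_to_window(translated: str, original_seconds: float, max_attempts: int = 3) -> str:
    text = translated
    for _ in range(max_attempts):
        if estimate_spoken_seconds(text) <= original_seconds * 1.05:   # 5% slack
            return text
        text = paraphrase_shorter(text)
    return text   # still too long: hand it to a human reviewer
```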
So it took you, like, a day?
Mati: It took us longer than a day. I think it took, depending on the podcast, anywhere from a week to two weeks.
The hardest part was the translation; we worked with another team to get that exactly right. And not just the translation itself, but translation in the context of audio. So it's not just about getting the words right; you also have to match the length. That back-and-forth, refining both the meaning and the timing, took the most time.
The audio part, just generating a good voice, was relatively easier. But generating a great voice with the right emotions, and doing it within the time constraints, that was much harder.
Wow. I actually thought it was really bold to go for it.
Mati: Thank you. I'm not sure if you know this, but about three years ago, I was reading your work, and I think I even tried to reach out.
You did, you did, but I couldn't collaborate with ElevenLabs at that moment. And then I reached out to you on LinkedIn this December, asking for your predictions. But you didn't respond.
Mati: Maybe I can still do that!
We'll do it next year. But let's discuss the current situation here: someone walks in, and there's noise. How do you deal with that in conversational AI?
Mati: We have a pretty good model now that tries to detect, based on the speaker's volume and the context of what was said before, whether something is a real interruption or a fake one. Like, someone keeps speaking, and the system is trying to figure out: is this actually someone else jumping in, or just the same person continuing?
It works fairly well, but it's not a bulletproof solution.
One future direction we're excited about is something like this: say I'm speaking to an agent. Based on my first few sentences, the system encodes my voice and then uses that as a reference. So later, it can check: is this still the same voice? That would be much more accurate and fair.
Of course, you have different use cases. Sometimes you want multiple people in the conversation. So what we're thinking is: if you're speaking with an agent, you could have a setting where it auto-detects who's speaking and tailors responses just to that person.
Let's say it's Ksenia speaking. The system detects it's your voice based on an initial sample, and then keeps checking: is the same person still speaking? It does that by comparing the voice embeddings. If we can do that quickly enough, then the agent can just keep the conversation flowing naturally.
And if another speaker comes in, the system checks: does this match the previous embedding? If not, it knows someone else is speaking and switches accordingly, without cutting anyone off mid-sentence.
The real challenge is: how do you do that fast enough to keep the interaction seamless?
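In code, the check Mati describes could look something like the sketch below: enroll a reference embedding from the first few sentences, then compare each new chunk against it. The embedding model itself is left abstract, and the 0.75 threshold is an arbitrary assumption.
```python
# Sketch of the "is this still the same voice?" check: keep a reference
# embedding from the caller's first sentences and compare later chunks to it.
# The embedding model is out of scope here; the threshold is an assumption.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SpeakerGate:
    def __init__(self, threshold: float = 0.75):
        self.reference = None       # enrolled on the first embedding we see
        self.threshold = threshold

    def is_same_speaker(self, embedding: np.ndarray) -> bool:
        if self.reference is None:                  # first few sentences: enroll
            self.reference = embedding
            return True
        return cosine_similarity(self.reference, embedding) >= self.threshold
```
A real deployment would also refresh the reference over time and, as Mati notes, has to run this per audio chunk fast enough that it never adds noticeable latency.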
Yes, latency.
Mati: But we think we can do it, especially for noisy environments, which some people have specifically requested support for in our work. We're hoping to ship it next quarter, so Q2. That should really help.
And of course, sometimes you might not want it to auto-assign to just one speaker. You might prefer to allow multiple speakers. In that case, we'll keep it broader and more flexible.
How's the latency? Did you solve this problem?
Mati: Current latency, end-to-end, including everything like network delays, interruptions, even handling names, is around one to 1.2 seconds between responses, depending on the region. So, very quick.
That's not bad!
Mati: The text-to-speech part, generating audio from text, is actually the fastest. We have the quickest model out there, with a latency of just 70 milliseconds.
Where it takes more time is in the transcription and LLM stages. But the key challenge isn't just speed; it's when to jump in. Do you respond the moment someone stops speaking? What if they're just pausing mid-sentence?
In our conversational framework, we've built a number of smart mechanisms behind the scenes to handle this. We analyze the context to see if it sounds like a natural end to a sentence. We check the length of the silence that follows. All of those signals are combined, and only then do we generate a response.
Meanwhile, as we're running those checks, we're already pre-generating some of the LLM output. That way, we can start streaming the response faster once the decision is made. That's a big part of why the system feels so quick and natural end-to-end.
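As a toy illustration of combining those signals, here is a sketch. The heuristics and thresholds are made up for illustration; the actual logic Mati describes is model-based and more involved.
```python
# Toy end-of-turn check combining two signals: does the transcript look like a
# finished sentence, and how long has the silence been? The thresholds and the
# completeness heuristic are assumptions, not ElevenLabs' actual values.
SILENCE_IF_COMPLETE_S = 0.3     # a short pause after a finished sentence is enough
SILENCE_IF_INCOMPLETE_S = 1.0   # wait longer if the sentence seems unfinished

def looks_complete(transcript: str) -> bool:
    t = transcript.strip().lower()
    ends_with_terminal = t.endswith((".", "?", "!"))
    trails_off = t.rstrip(".?!").endswith((" and", " but", " so", " because"))
    return ends_with_terminal and not trails_off

def should_respond(transcript: str, silence_seconds: float) -> bool:
    needed = SILENCE_IF_COMPLETE_S if looks_complete(transcript) else SILENCE_IF_INCOMPLETE_S
    return silence_seconds >= needed
```
While should_respond is still returning False, the reply can already be generated speculatively in the background and discarded if the user keeps talking, which is the pre-generation Mati mentions.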
An alternative would be to build a true multimodal system, training all three components together. That would feel even more fluid and natural. But you lose a level of control that's really important in certain domains.
With the three-step pipeline, you can ensure the LLM stays on topic, especially in areas like customer support or healthcare, where precision matters. You want it to say exactly what you intend, nothing more or less. And that's where all the additional checks come in, which can introduce some latency, yes, but also improve safety and reliability.
Multimodal systems trade off some of that control for more naturalness. We're actively working on that too, from a research perspective. But for large enterprise customers, the three-step setup is still what we recommend: it's more stable and easier to monitor.
I was actually going to ask about that, because it seems like the next step for conversational AI, and for combining these technologies, would be moving toward something multimodal at some point.
Mati: At some point, and for certain use cases where emotional nuance or a real sense of connection is especially important, we do think multimodal will become the standard. Probably within the next year or two.
To make it more concrete: in interactions where there's no need to take direct action on the backend, like issuing a refund or canceling something, a multimodal experience could actually be more useful.
Of course, you still want to be cautious. For example, you wouldn't want a therapist AI to be multimodal unless you're absolutely certain it works reliably, every time. Otherwise, you risk hallucinations, both in the text and the audio, which makes the system less stable.
But long term, say in the next three to five years, we'll likely see more stability and strong proof points that multimodal is the way to go for those emotionally rich use cases.
And when that happens, we'll start to see broader adoption. But in the meantime, the three-step solution still gives you something valuable: something modular and controllable.
It's working.
Mati: Exactly. It's scalable, it works, and it carries emotion. You can build your knowledge base into the agents, and do it relatively simply.
You started as a research-first company. How deeply are you still involved in the technical side?
Mati: My co-founder is incredible; he's the genius and the brain behind all the models we've built. He's managed to bring together some of the best researchers in the audio space. So now we have a very strong audio research team that keeps pushing out incredible models and topping benchmarks.
We're still heavily focused on audio research and will be for years to come. Our goal is to build frontier models, the best in the field, whether that's for speech-to-text, conversational agents, or other areas of audio understanding and generation. That work will absolutely continue.
On the personal side, what we really believe is that research alone isn't enough. You also need the product layer: the end-to-end experience for users. It's not enough to have a great text-to-speech model with good narration. You also need a product or workflow that lets someone create an entire audiobook, or build a full voice agent experience, something that integrates their knowledge base, their functions, their world.
That's the part I stay close to.
And when it comes to deploying this tech to clients, one thing we do, maybe less traditionally, is have engineers work directly with them. They embed into the client's workflow, understand their needs...
Pain points.
Mati: Exactly. And then build that solution more closely around their needs.
Do you miss being involved in research more?
Mati: I was never directly involved in the research-research side of things, but I did work on some of the product aspects: actually building the product itself.
Back at my previous company, which was part of Palantir, I was much closer to the technical side, working on pipelines and helping clients with optimization problems. There are definitely parts of that I miss, especially the math and some of the engineering thinking that came with it.
But one thing that's still true, and something I really love about my current role, is that I still work closely with clients. I talk to them a lot, try to understand their problems, and figure out what solutions we can build. That feeds directly into how we shape the product.
And even though I'm not writing code myself anymore, it's still really rewarding to see how the work gets deployed.
Congrats on your recent huge round. Congrats on the collaboration with Lex Fridman, and now this week you announced the partnership with Google Cloud. ElevenLabs seems to be everywhere. And going back to when you first reached out to me, I remember thinking it was such a smart move to work with newsletters. So, what's your strategy?
Mati: One thing we try to stay true to is really understanding the problem we're trying to solve, and then building the right solution around that. But the other piece is just as important: even if you know the problem and you've built a great solution, most of the world still doesn't know it exists. Most customers don't know it's possible.
So the big question becomes: how do you tell them? How do you show people that the technology is finally here to solve these issues?
And because we started with research, and built so much of the early stack, people don't always trust or believe it when you just say, "This is the best human-like model," or, "This is the most emotional model." That's not enough. So we're always looking for ways to show the world, not just tell them, what's actually possible, with real use cases.
One part of the strategy was to open the technology to creators and developers. Let them use it in their own projects. Let them show what's possible. And honestly, even we learned things we didn't expect: use cases we hadn't thought of. So that discovery process went both ways.
At the same time, we've always believed the research is strong. We know it solves real problems, from audiobook narration and newsletter voiceovers, to podcast dubbing, to film voiceovers, and now voice agents. But instead of just claiming that, we work with creators, developers, and innovators to show the breadth. And in parallel, we partner with large companies to go deep: understand the scale, the enterprise-level requirements, the security and compliance needs.
Like last week, we were proud to announce our partnership with Deutsche Telekom. That's a very different angle. But part of why we're excited about it is that they've already seen how people engage with voice, through podcasts, through phone calls, and they've seen the quality. Now the focus is: how do we go even deeper? How do we build something truly end-to-end?
So that's how we're thinking about it. On one side, make the best technology widely accessible. On the other, build very specific, deeply integrated enterprise solutions.
You have the research, you have the product, everyone loves you, but you've never open-sourced anything. Why? And are you going to?
Mati: Great question. We actually spent a lot of time thinking about this.
Right now, because we've invested so much into our research and focused on areas that others weren't really paying attention to, if we opened it up too early, we'd honestly be giving away a lot of our advantage and IP. That would make it easier for others to recreate what we've built.
Of course, this will change as we build more of the product layer. And to be fair, over the past two years we've added a lot on that front, but we're still relatively early compared to where we want to be.
Once we feel we've nailed the right parameters around the product, something more mature and robust, then we'll be happy to open up more of the research side. That would help strengthen the mission overall.
But today, given the time, energy, and resources we've put in, we want to keep things close, for now. Especially since many of the companies in this space have far more resources than we do.
There are a lot of competitors.
Mati: Yes, exactly. And you know, some of the big companies, especially the hyperscalers, will likely enter this space at some point.
So eventually, we will open up access. But for now, it's more about building the foundation, making sure the product is where it needs to be.
Also part of the strategy. So, probably my last question: is the dubbing problem in Poland solved?
Mati: Yeah, I think we're headed in a good direction. It's not solved yet, but we're getting there.
You know, it's one thing to solve the research problem; it's another to actually get the world to know the solution exists and use it. That's where we've been spending a lot of time, working with companies to bring it into real use.
And I think in the next few years, single digits for sure, maybe even just two or three, it'll be solved.
Thank you so much for this conversation! And to our readers: thank you for watching and reading!
