🎙️What Limits AI Today? When Inference Will Be Like Electricity? And Why Market Fit Can Kill?

A very interesting interview with Lin Qiao from Fireworks AI

Subscribe to our YouTube channel, or listen to the interviews on Spotify / Apple

What limits AI today isn’t imagination – it’s the cost of running it at scale.

In this episode of Inference, I sat down with Lin Qiao, co-founder & CEO of Fireworks AI – an inference-first company – and former head of PyTorch at Meta, where she led the rebuild of Meta’s entire AI infrastructure stack.

We talk about:

  • Why product-market fit can be the beginning of bankruptcy in GenAI

  • The iceberg problem of hidden GPU costs

  • Why inference scales with people, not researchers

  • 2025 as the year of AI agents (coding, hiring, SRE, customer service, medical, marketing)

  • Open vs closed models – and why Chinese labs are setting new precedents

  • The coming wave of 100x more efficient AI infrastructure

Watch to hear Lin’s vision for inference, alignment, and the future of AI infrastructure. And – at the end – Lin shares her very personal journey to overcome fears. Watch it now →

This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.

This transcript is edited by GPT-5. Let me know what you think. And – it’s always better to watch the full video ⬇️

Ksenia Se:
Hello everyone, and welcome back to Inference, the interview series on Turing Post. Today I’m thrilled to talk with Lin Qiao, co-founder and CEO of Fireworks AI, and former head of PyTorch at Meta, where she led the rebuild of Meta’s entire AI infrastructure stack. Welcome, Lin. Let’s start with the big question: when will inference become a solved problem? What would it take for inference to feel like electricity – reliable, cheap, and invisible? And what still stands in the way?

Lin Qiao:
Thanks for having me. That’s an interesting question. I think we’re just at the starting point of optimizing inference, and there are many dimensions to look at. I want to start from the perspective of active operators – the cohort we care most about.

There’s no question that GenAI is a revolutionary technology. It can generate content on par with – or beyond – what humans create when interacting with the real world, and that’s its biggest value. Because of this, it’s safe to predict we’ll see many generational companies emerge, defining new user experiences that never existed before. They’ll disrupt industries and change how we interact with software day to day.

Through that lens, we’re already seeing a lot of innovation. But there’s an interesting phenomenon: in traditional startups, once you hit product-market fit, you scale – and that’s how you build a viable business. With GenAI applications, hitting product-market fit and having a viable business are two different problems.

You can create a new user experience that delivers tremendous value to consumers or developers, but that doesn’t mean you can quickly scale to a viable business. The cost structure is so much higher. Everything around GPUs – the infrastructure, the operations – is orders of magnitude more expensive than building traditional apps on CPUs.

We often hear stories from companies that say: we’re confident in our product, the signals are great, we even have a waiting list of millions of users… but we can’t open the floodgates, because if we do, we’ll run out of money. In other words, in GenAI, hitting product-market fit can actually be the beginning of bankruptcy.

Ksenia: That’s very interesting. It really is something new.

Lin:
Yes, and it’s fundamental. You can visualize it like an iceberg. Right now, a huge iceberg of GenAI applications is being built, but most of it is still submerged under the waterline because infrastructure costs are so high. If those costs shrink by even 10x, the number of applications emerging above the waterline will be enormous. That’s where the future is.

When it comes to how infrastructure costs shrink, there are many approaches. I’ll share our observations from working with leading application providers.

Ksenia: So what’s your approach – how do you make this iceberg smaller?

Lin:
It boils down to a fundamental misalignment between two sets of data. On one side, you have the data used to train foundation models in research labs – whether open or closed models. Those labs define objectives, design problem statements, and curate datasets to produce the outcomes they want.

On the other side, you have application developers. Their goal is product design that maximizes user engagement. They constantly experiment with features and collect product data. That data distribution is built for a completely different purpose.

So when app developers use these foundation models to power their products, they inherit this misalignment. And that’s the root cause of gaps we see in accuracy, latency, and efficiency.

Some young companies have figured out how to close this gap – aligning their product data with the models to build systems that are faster, cheaper, and more accurate. That allows them to scale beyond product-market fit into viable businesses. But the majority still treat models as utilities – they send requests to the API without addressing the underlying misalignment.

That’s the dynamic space we see right now. And it’s where we’re trying to help – enabling application developers to close that alignment gap.

Ksenia: So you’re working with enterprises to align those two data streams. Tell me a little about your journey – you founded Fireworks in October 2022, before ChatGPT. Why then? And how did your vision and approach change once generative AI really boomed?

Lin:
Our founding team is large, and many of us worked at Meta for seven to ten years, essentially bootstrapping Meta’s AI infrastructure across both training and inference. When we started Fireworks in September 2022, we had the option to focus on either side – PyTorch is used for both. At that time, most people were focused on training: building models, calling GPUs for training, or creating training infrastructure. We made a strategic decision to go all in on inference.

Why? Because inference scales fundamentally differently. Training scales with a small pool of researchers. Inference scales with consumers and developers – with the entire world population as the upper bound. The production requirements are higher, the complexity greater, and those are the kinds of problems we wanted to solve.

Looking back, that choice set us apart. It let us build an inference toolchain sophisticated enough to make us the best provider on that side of the stack. Our approach ties back to the data alignment problem I mentioned earlier. We don’t believe in “one size fits all.” Instead, we believe in “one size fits one.” Every application workload is different, and we optimize for each one.

The analogy we use is a database. A database doesn’t treat every query the same – it runs a query optimizer that figures out the most efficient execution plan. We apply the same idea to inference, but it’s even more complex. We built what we call our 3D optimizer, which optimizes across three dimensions simultaneously: quality, speed, and cost.

The challenge is the search space – there are many underlying components, each with dozens of options, leading to hundreds of thousands of possible combinations. We’re finding the one needle in that haystack. But the good news is, we’re very good at solving these kinds of problems. Today, nearly all Fireworks customers use our 3D optimizer.
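
To make that search concrete, here is a toy sketch of picking one configuration out of a combinatorial space while trading off quality, speed, and cost. The component options, scoring heuristics, and weights are invented for illustration – Fireworks has not published the internals of its 3D optimizer.

```python
from itertools import product

# Illustrative component options only -- not Fireworks' actual search space.
SEARCH_SPACE = {
    "quantization": ["fp16", "fp8", "int4"],
    "batch_size": [1, 8, 32],
    "kv_cache": ["full", "paged"],
    "speculative_decoding": [False, True],
}

def score(config, weights=(0.5, 0.3, 0.2)):
    """Toy trade-off score over (quality, speed, cost).

    A real optimizer would measure these on the customer's own
    workload; the heuristics below are placeholders.
    """
    quality = {"fp16": 1.00, "fp8": 0.97, "int4": 0.90}[config["quantization"]]
    speed = config["batch_size"] / 32 + (0.2 if config["speculative_decoding"] else 0.0)
    cost = 1.0 - 0.1 * ["fp16", "fp8", "int4"].index(config["quantization"])
    w_quality, w_speed, w_cost = weights
    return w_quality * quality + w_speed * speed - w_cost * cost

def best_config():
    keys = list(SEARCH_SPACE)
    candidates = (dict(zip(keys, combo)) for combo in product(*SEARCH_SPACE.values()))
    return max(candidates, key=score)

print(best_config())
```

With dozens of options per component, the exhaustive enumeration above explodes into the hundreds of thousands of combinations Lin mentions – which is why a production system would prune or learn the search rather than brute-force it.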

Ksenia: It must be hard to explain that level of complexity to enterprises.

Lin:
That’s what happens under the hood. With enterprises, we map it to business value and use cases. And right now, in 2025, the big theme is agents. Startups and enterprises alike are building them.

There are coding agents – of many types – that dramatically increase developer productivity. Hiring agents that take a job profile, source candidates, run interviews, and assess performance. SRE agents that debug and triage production issues during incidents. Customer service agents – hugely popular, since some enterprises have 20,000+ human agents today. Making them more productive translates to massive cost savings.

We also see marketing agents that can automatically design outbound campaigns targeted to specific audiences. And adoption is spreading across verticals: medical, retail, education, finance, and more.

When we talk to enterprises, we frame the impact of our 3D optimizer through these case studies. That lands better than talking only in technical terms.

Ksenia: And how do you talk to them about models? What’s your approach? Do you lean toward general-purpose models, or smaller, narrower ones?

Lin:
We believe strongly in developing in the open. Our business model is mainly focused on open models because they give enterprises transparency and control – something they care about deeply.

That said, we also look at it from the user’s perspective. Their goal isn’t to make open models successful – their goal is to solve business problems and deliver impact. They’ll use whatever tool helps them do that. So in enterprise engagements, we provide a “cookbook” for building an AI gateway. It connects to whatever model providers they want, and we’re one of those providers.

We help standardize that stack. We also give them private evaluation benchmarks so they can objectively compare models for different use cases. If a closed model works better, we’ll simply show them the report and let them decide. And if they want to tune an open model for the best quality, we provide the tools. Our principle is simple: meet customers where they are, rather than forcing them into a vendor’s frame.
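
The gateway pattern Lin describes can be sketched in a few lines: one entry point, several interchangeable providers, and a private evaluation table deciding which model serves which use case. The provider names, scores, and call signature below are placeholders, not a real Fireworks API.

```python
# Sketch of an AI gateway: route each use case to whichever provider
# wins your private eval. All names and scores here are placeholders.

PRIVATE_EVAL = {  # use case -> provider -> score on your own benchmark
    "customer_service": {"fireworks/open-model": 0.91, "vendor/closed-model": 0.88},
    "coding": {"fireworks/open-model": 0.84, "vendor/closed-model": 0.90},
}

PROVIDERS = {  # stand-ins for real API clients
    "fireworks/open-model": lambda prompt: f"[open-model] {prompt}",
    "vendor/closed-model": lambda prompt: f"[closed-model] {prompt}",
}

def route(use_case: str, prompt: str) -> str:
    """Send the prompt to the provider with the best eval score."""
    scores = PRIVATE_EVAL[use_case]
    best_provider = max(scores, key=scores.get)
    return PROVIDERS[best_provider](prompt)

print(route("coding", "Write a binary search in Python."))
print(route("customer_service", "Where is my order?"))
```

Because routing is driven by the eval table rather than hard-coded, swapping in a better model – open or closed – is just a benchmark re-run and a table update, which matches the “meet customers where they are” principle.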

Ksenia: Speaking of open models – Chinese labs have been on fire lately. DeepSeek R1, Kimi K2, GLM from Zhipu… they’ve set new precedents for everyone. Why do you think they’ve been so successful?

Lin:
It’s fascinating, especially since they often have more resource constraints – fewer powerful GPUs – yet they’ve achieved remarkable results. I see this as a sign of convergence. Closed and open models are starting to look similar in quality.

At the end of the day, two factors matter: training techniques and data. On techniques, talent flows globally now. People move, research is shared. There’s less “secret sauce” than before. On data, the distributions that determine model quality are also converging. Everyone draws from similar public datasets and works with the same labeling companies. That levels the playing field.

So the real race becomes: how do you generate more and better data? Large models generate synthetic data to train smaller ones. That’s capital intensive, but synthetic data quality is improving. Everyone’s experimenting.

Another frontier is training with inference in mind. Instead of just optimizing training-time quality, labs are tweaking architectures to improve inference-time performance – making models faster and cheaper to run without losing accuracy. We’ve seen a lot of creative work here, especially from recent releases. Expect more small but important architectural tweaks rather than massive leaps.

Overall, I think base models in text are converging. But in multimodality – voice, vision, video – closed models are ahead. They’ve invested heavily there, while open models are more focused on reasoning, coding, and tool use for agents. Multimodal will catch up, but text-based models will converge faster.

Ksenia: Yes, multimodal is much more expensive, so it makes sense for closed labs to protect their inventions. But I agree – open will catch up. I’m curious about something else. Chinese companies like Zhipu (now ZAI) tie open source to their AGI vision. In the U.S., it’s different. Your former employer Meta recently suggested they might not open source everything – Mark Zuckerberg said they’d be more cautious. Why the difference?

Lin:
I don’t think Meta has made a final call. There are still healthy internal debates about priorities – whether to focus on product to drive revenue, or on ecosystem to consolidate around LLaMA. That’s just business strategy.

At the same time, Google and others in the U.S. are also driving open models. And contributing to the open community is no small task – the bar is high. Anyone releasing a new model has to show strong benchmarks against the latest open and closed models. That competition raises the quality bar for everyone.

So I remain excited. The open ecosystem will only get stronger because no one can afford to lower the bar. Each new release becomes a demonstration of research depth, talent density, and technical output – good for the company’s reputation, and good for the broader community.

Ksenia: If we talk about superintelligence and AGI – what’s your take? Is it about solving intelligence, or more about building better tools? Or do you not think about it that way at all?

Lin:
We have a strong opinion on this. Fireworks wants to provide value by making application developers shine. Our role is to build the best tools and infrastructure for them – so they can easily create a data flywheel from their product.

Product data aligns with the model, making the model better tuned to the application. A better model drives better user engagement. More engagement produces more data. More data improves the model again. That’s the virtuous cycle we want developers to build with Fireworks.

So our position is clear: we add value as tools and infrastructure. We want to empower developers to create social value that shows up in everyday life – things my mom might use and say, “this is really cool.” And I’ll be able to tell her, “that app runs on Fireworks.” That’s the impact we’re aiming for.

Ksenia: When do you think AI infrastructure will become much easier and lighter – really everywhere?

Lin:
We’ve already shown what’s possible. With our 3D optimizer we’ve accelerated inference speed and reduced cost by 4x to 10x – sometimes even improving quality at the same time. That’s the power we’ve demonstrated, and we’ll keep pushing it.

But in the bigger picture, I believe we’ll see 100x more efficient infrastructure. Look at CPUs: from single core to dual core to many-core, every generation brought better price-performance and lower manufacturing cost with scale. The same thing will happen with GPUs, ASICs, and accelerators. Hardware will get more efficient, and infrastructure around it will too.

Ksenia: Your approach is pragmatic. You’re building the AI world as it comes. What excites you most – and what concerns you most – about this world you’re building?

Lin:
What excites me is the pace. This is a generational technology shift, bigger than cloud-first or mobile-first. Every day we wake up and go to sleep thinking about how to solve problems in this new paradigm. It sparks intellectual curiosity and creativity, not just in our team but across the whole community.

What keeps me up at night is balance. Our principles are “customer first” and “high velocity of innovation.” But moving fast has a flip side: infrastructure must also be stable and reliable. Striking that balance is critical. We want to innovate quickly, scale features quickly, and support customization quickly – but always on a foundation that enterprises can trust. Managing that tension is the challenge.

Ksenia: Thank you. My last question: what is a book or idea that shaped how you think about leadership and the future?

Lin:
I can’t point to one book. It’s more my life experience – the people I’ve worked with, the internal journey of learning to think differently. Seventeen years ago I was a completely different person. I had passion, but also an inner voice saying, “others can do this better, not you.” It took me years to quiet that voice.

The change came because people pushed me – they challenged me, forced me to embrace the idea that I could accomplish things, and gave me a different imagination for what was possible. That journey shaped me deeply.

It also gave me insight I share with others: often the real limitation is your inner voice, not the external world. That’s why I believe finding people who challenge you – who make you uncomfortable – is a blessing. It’s painful, but if you face it, even if you fail sometimes, you come out a different person on the other side.

Ksenia: So your book is the people around you.

Lin:
Definitely. The people who challenge you and push you through the tunnel of discomfort – they’re the ones who shape who you become.

Ksenia: That’s very inspirational. Thank you so much for this interview.

Lin:
Thank you for having me. I had a lot of fun.

Do leave a comment
