• Turing Post

Creating a large language model that "speaks DNA"

Interview with Barry Canton from Ginkgo Bioworks about the real-life implementation of ML and AI

We are happy to introduce a new series of interviews that highlights the real-world implementation of Foundation Models (FMs) and Large Language Models (LLMs). We start with an absolutely fascinating topic that can bring so many changes into our world: Bioengineering. In this insightful conversation with Barry Canton, CTO and co-founder of Ginkgo Bioworks, we discussed the evolution of synthetic biology, the integration of AI and GenAI in their R&D, and the future of bioengineering. Let's see what it takes to create a model that 'speaks DNA'.

The History of Bioengineering and the start of Ginkgo Bioworks

What was your inspiration to start this company?

The five of us met at MIT. Four of us – me, Jason Kelly, Reshma Shetty and Austin Che – were graduate students entering this new field at the time called “synthetic biology,” and were inspired by one of the founders of that field, our professor and co-founder Tom Knight. At the time, it was clear that biology could play a role in addressing global challenges such as climate change, food security, and health care. But it was also clear that developing a biotechnology product could take on the order of 10 years and $100M, with little certainty of ultimate success. In fact, most projects failed. Tom realized that this was because the tools and technologies available to the would-be biological engineer were crude and immature. In the first half of his career, Tom had seen how investment in foundational tools and technologies had been essential to realizing the potential of computers. Our inspiration was that we could use the lessons learned in the development of earlier engineering disciplines, together with the unique abilities of biology, to build a new discipline: biological engineering.

We started the company after we graduated in 2008 and have been pursuing that vision for the last 15 years. Today, we’re immensely fortunate to have a talented team of more than 1,000 people and more than a billion dollars of capital to help us make it real.

The Synthetic Biology Working Group at MIT, initially a collective of enthusiasts in programming biology, has seen synthetic biology evolve significantly, especially with AI integration. Initially, cell programming was a manual, project-specific process. Now, it's transformed by software, robotic automation, and AI, enabling Ginkgo to manage over 100 diverse programs simultaneously, enhancing efficiency, speed, and success rates. The field's expansion into various sectors like biopharma, agriculture, and consumer electronics demonstrates its broadening scope and reaffirms the belief in biology's global technological potential, with AI emerging as a key driver in advancing cell programming.

Machine learning and Generative AI Implementation

How do you use machine learning (ML) at Ginkgo Bioworks?

We’ve been collecting large datasets for many years now and have been using state-of-the-art ML, deep learning, and now Generative AI (primarily in protein modeling) to work with this data.

But we don’t just collect and use data in the context of a single program. Rather, we accumulate data across every program we run, so that later projects start from an existing body of knowledge that we call our Codebase. As we’ve scaled our operations, that Codebase has grown rapidly. Its breadth now includes data on a range of organisms and product classes across industrial biotech, agricultural biotech, and biopharma. Where before we would leverage that Codebase using conventional statistical and mechanistic modeling approaches, now our Codebase becomes the corpus of data with which we can train AI models. We always intended this to be the case some day, and it’s very exciting to be living in the future now that AI approaches have matured so rapidly!

One of the key AI tools developed by Ginkgo is Owl, an AI platform for enzyme engineering. Owl leverages our extensive data to deliver better enzymes with fewer iterations. We call it “Owl” because it lets us see in the dark, as Jake describes in this Foundry Theory video. 

In 2023, we announced a significant partnership with Google Cloud to develop generative AI models for cell programming and biosecurity.

Tell us more about this partnership. Is there any connection between your work and Google’s work on AlphaFold?

This partnership aims to leverage Google's computing power and Ginkgo's extensive Codebase to accelerate the development of new foundation models that teach AI to speak DNA, much the same way that ChatGPT speaks human language. These foundation models help us model DNA, proteins, and ultimately entire cells in terms of both structure and function. We are also building application-specific models on top of our foundation models that will enable predictive design of enzymes (which can help us make them more active, more stable, and more effective), cell therapies (enabling more targeted and effective therapies), and nearly any application our customers need cell programming for.

We work with proteins on a daily basis, as they are the core machinery of every cell. We need to understand the structure and function of proteins as well as engineer proteins to improve their performance for our customers' needs. Tools like AlphaFold have proven quite useful to us as we work with proteins. Indeed, AlphaFold’s improvement over the state of the art in protein structure prediction really jump-started a lot of interest in the application of transformer-based AI models throughout the field.

DNA as a language and the trend toward multimodality

Drawing parallels between linguistic models and DNA sequences can be quite evocative. How does Ginkgo approach DNA as a 'language' for instructing cellular behaviors?

Yes, the parallels here are very exciting, not least because they allow us to transfer the tools that are being developed in natural language over to the biological domain. Both forms of language are the result of evolutionary processes working on sequences of characters that encode meaning. Both developed specific grammars and hierarchical units of meaning – the equivalents of words and sentences exist in DNA. Also, the languages of different organisms are often closely related, just as French and Italian are similar. So the parallels between the two forms of language are considerable.

So how does the language of DNA instruct cellular behavior? There are special proteins in the cell that can recognize and interpret the sequence of letters in the DNA as a set of instructions based on the grammar and hierarchical organization mentioned above. Instructions include what cellular machinery to make, how to process environmental signals, and when to replicate. Cells are little autonomous machines that include all of the code needed for them to operate and make copies of themselves! All of that code is found in the genome of the cell.
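One well-understood layer of this "grammar" is the standard genetic code, in which three-letter DNA words (codons) each specify one amino acid of a protein. The sketch below is a deliberately simplified illustration of reading DNA as instructions, not anything described by Ginkgo: it assumes a plain coding strand, starts at the first ATG it finds, and ignores regulation, introns, and everything else the cell's real machinery handles. The function name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order (third base varying fastest), "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Read a DNA coding strand codon by codon, from the first ATG to a stop.

    Simplifying assumption: the input contains only A, C, G, T
    (any other character raises KeyError).
    """
    dna = dna.upper()
    start = dna.find("ATG")  # ATG is the usual start codon (methionine)
    if start == -1:
        return ""  # no start codon, nothing to read
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the "sentence"
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTAAATGGTAA"))  # MAKW
```

Here the five codons ATG, GCT, AAA, TGG, TAA read as methionine, alanine, lysine, tryptophan, and stop. Real genomes layer many more signals on top of this, which is part of why learned models are attractive.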

Once we understand that DNA is a language and that it provides the cell with the instructions it needs to operate, we can see that if we read the DNA of a cell we can understand how that cell will behave. We can also see that if we can write or edit the DNA we can change how that cell will behave. Over the past 30 years or so, it has become routine to read the DNA of cells found in nature and we have decoded many of the instructions in natural DNA. We have also become relatively adept at writing or editing new DNA to provide new instructions to the cell.

So what is exciting today is that we are getting better at reading, writing, and expressing meaning in DNA code. The tools of GenAI, when applied to DNA, proteins, and cells, can speed up our decoding of natural sequences and help us generate new DNA code faster, and perhaps better, than we could before.

Though LLMs are the best-known models now, the trend is towards multimodality. How do you think about multimodal models from a bioengineering perspective?

AI researchers have been able to make significant advances in biotechnology using the current LLM approaches. This is because the language of life is coded in the linear sequences of DNA, RNA, and protein, so many of the methods developed for natural language are transferable to biology. Nevertheless, living cells are three-dimensional, dynamic, complex systems, meaning we need approaches beyond today’s LLMs to make biological models truly and deeply predictive.

When we engineer cells, we know we need to integrate multiple data types, including sequence data, proteomics, metabolites, imaging, and more. We need to predict complex multi-dimensional phenotypes as multiple proteins and other biomolecules interact to produce a valuable product. So yes, multimodal models will be increasingly important.

How do you envision generative AI's role in the endeavor to 'make biology easier to engineer'?

We think biology can solve many of the challenges faced by the world. McKinsey said as much as 60 percent of the physical inputs to the global economy could, in principle, be produced biologically.

When we talk about making biology easier to engineer, we mean that we want to make the often messy, expensive, and laborious process of making new things with biology more efficient and predictable. Today, we do that by using robotic and software automation to test more design prototypes than was previously possible, increasing the probability of finding a high-performing cell. In the future, AI can eliminate the need for some of that experimental work by learning from the Codebase we have accumulated from prior projects and generating new high-performing designs without ever setting foot in the lab. The result can be faster and more cost-effective product development cycles for our partners.

Biosecurity and other risks

How does Ginkgo navigate the delicate terrain of biosecurity, especially when melding the predictive capacities of AI with bioengineering?

Our approach to biosecurity is deeply informed by the work we have done over many years to learn how to engineer biology. Throughout that time, we have engaged with bioethicists, policy experts, and the public to put the technology in a broader context. Taken together, this has deepened our understanding of the potential and limits of the technology today and how it may evolve in the future. We are seeing a continuing and strong demand for leadership and innovation in biosecurity, both in biothreat prevention and in helping design better countermeasures (e.g. vaccines that are better targeted to the next variant).

From the very earliest days, we recognized that if we are going to work on this mission “to make biology easier to engineer,” we also need to invest in the tools that are going to protect against the misuse of biological engineering. Certainly, we need to be responsible for what's on our platform, but we also must recognize that even if our platform is really well protected, biology does not respect borders. Biology will get easier to engineer writ large, and therefore you need a biosecurity infrastructure that can do the things that we're used to in cybersecurity. Cybersecurity wasn't baked into the digital revolution until too late – we've been baking it in from the start, and we think about biosecurity at the point of design.

What deserves more consideration: the long-term risks or tackling the immediate risks of AI?

Fortunately, our understanding of biological risk is relatively mature [ref]. This is because the last 50 years have seen fundamental breakthroughs in understanding how cells work and how to manipulate them. Scientists, bioethicists, and public policy experts, among others, have long realized that the power to manipulate cells is a dual-use technology with immense power to affect human health, our food supply, and the economy more broadly. Key milestones, including the discovery of recombinant DNA, the first sequencing of a human pathogen, the publication of the human genome, and the commercialization of gene synthesis, have triggered deep engagement in questions of biosafety and biosecurity and how to balance risk with societal benefit.

We view AI as another in the list of enabling technologies that accelerate our ability to program cells. For any new enabling technology, we can’t focus on the near-term risks created by these new abilities at the expense of the long-term risks, or vice versa; we must consider both. It’s important to note that harmful uses of biotechnology generally must cross the digital-to-physical divide. Fortunately, today that means that existing regulatory and oversight frameworks for biotechnology do limit the potential for AI to be used for harm. However, this fact should not be taken as a reason for complacency. The COVID pandemic has shown us that we must continue to invest in detection and response infrastructure at a national and global level so that we can respond to biological threats regardless of whether they are products of natural evolution or of human- or AI-based design.

The future

In one of your talks, you said, “The era of Moore’s Law is coming to a close, but biology’s exponentials are just beginning.” What do you mean by that?

There are hard physical limits that semiconductors run into that have made it harder for Moore’s Law to continue. By contrast, we’re in the early days of developing the tools for programming biology. They could be miniaturized by orders of magnitude, accelerating processes and reducing expensive material costs. Since our tools are still relatively crude, the complexity and sophistication of the systems we can build today are still much less than the sophistication we see in the natural world around us. Lastly, we use biological tools throughout the process of programming biology. This means that as we get better at programming biology, we can make better tools for ourselves, just as faster computers made semiconductor design better. This positive feedback loop will drive future exponential improvements in the field.

What area of research do you keep an eye on? 

DNA synthesis. Imagine if it cost several cents every time you typed a letter of Python code. That’s the challenge faced by anyone seeking to program biology. Imagine if your Python code was limited to 5,000 characters. That’s the limitation experienced by biological designers today. We need faster, cheaper, and longer DNA. Our friends at Twist have used a very exciting technology platform to help us along this journey. I keep an eye on enzymatic DNA synthesis as an emerging technology that could potentially unlock further improvements.
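To make the analogy concrete, here is a back-of-the-envelope sketch of what those two constraints imply. The numbers are illustrative assumptions, not quoted prices: 7 cents per base stands in for "several cents", 5,000 bases is the length limit mentioned above, and 4.6 million base pairs is the approximate size of an E. coli genome. The function name `synthesis_estimate` is hypothetical, and the estimate ignores overlaps and the cost of assembling fragments.

```python
def synthesis_estimate(length_bp, price_cents_per_base=7, max_fragment_bp=5000):
    """Return (fragments_needed, cost_usd) for writing length_bp bases of DNA.

    price_cents_per_base and max_fragment_bp are assumed, illustrative values.
    """
    if length_bp < 0:
        raise ValueError("length_bp must be non-negative")
    # Integer ceiling division: how many pieces of at most max_fragment_bp
    fragments = (length_bp + max_fragment_bp - 1) // max_fragment_bp
    cost_usd = length_bp * price_cents_per_base / 100
    return fragments, cost_usd


# A typical single gene versus a whole bacterial genome (approximate size)
print(synthesis_estimate(1_500))      # (1, 105.0)
print(synthesis_estimate(4_600_000))  # (920, 322000.0)
```

Under these assumptions a single gene fits in one fragment for roughly a hundred dollars, while a bacterial genome would need hundreds of fragments and cost hundreds of thousands of dollars before any assembly work, which is the gap that cheaper and longer synthesis would close.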
