Turing Post

Token 1.4: Foundation Models – The Building Blocks

We touch upon some systematic concepts and also offer a few practical insights from Rishi Bommasani


This article is a good primer for anyone looking to get acquainted with the topic. Here we touch upon some systematic concepts and also offer a few practical insights from Rishi Bommasani, a co-author of one of the best papers on the topic, “On the Opportunities and Risks of Foundation Models.” In this episode, we will explore:

  • The definition of foundation models (FM)

  • Key characteristics that have transformed our understanding of ML applicability

  • Various types of FMs

  • Current trends in the field

  • Non-generative FMs worth noting

  • Unique challenges posed by these models

  • And, we'll wrap up with an interview on how to assess if your company should adopt a foundation model

In Token 1.1, we touched upon the paradigm shift from task-specific to task-centric machine learning (ML):

“Task-centric ML focuses on using foundation models to perform a wide range of tasks efficiently and with fewer training examples. You're no longer burdened by the never-ending thirst for new data. This eliminates the 'data bottleneck,' allowing you to iterate faster and meet evolving business needs.”

So let’s dive deeper into what those foundation models are that facilitated this profound shift in AI/ML applications.

The definition of foundation models

This term gained traction after being coined by the Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) in their seminal August 2021 paper, titled “On the Opportunities and Risks of Foundation Models”:

"A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks."

They are not “foundational” models in the sense of underpinning all of ML; rather, they provide a foundation that doesn’t need to be rebuilt every time you change the task.

There is an interesting aspect to covering foundation models: technologically, FMs are not new. They are based on deep neural networks and (most of the time) self-supervised learning, both of which have been around for decades.

"Foundation models are enabled by transfer learning and scale. The idea of transfer learning is to take the 'knowledge' learned from one task (e.g., object recognition in images) and apply it to another task (e.g., activity recognition in videos). Within deep learning, pretraining is the dominant approach to transfer learning: a model is trained on a surrogate task (often just as a means to an end) and then adapted to the downstream task of interest via fine-tuning." 

Transfer learning is the mechanism that makes foundation models feasible. By adding scale, these models become extremely potent. The OpenAI developers were puzzled by the success of ChatGPT because it was a version of an AI system that they’d had for a while. What truly brought changes into the ML world was… user experience. “We made it more aligned with what humans want to do with it. It talks to you in dialogue, it’s easily accessible in a chat interface, and it tries to be helpful. That’s amazing progress, and I think that’s what people are realizing,” explains Jan Leike, the leader of OpenAI’s alignment team.
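The pretrain-then-adapt pattern described above can be sketched in miniature. The toy classes below are purely illustrative (a real foundation model is a large neural network, not a character-frequency featurizer, and real adaptation means fine-tuning or prompting): the point is that one shared "foundation" representation is trained once and then reused, with only a lightweight task-specific layer changing per task.

```python
# Toy sketch of the pretrain-then-adapt pattern behind foundation models.
# All names here are hypothetical illustrations, not a real API.

from collections import Counter


class ToyFoundationModel:
    """Stand-in 'foundation': turns text into a generic feature vector.

    Built once (here: just a fixed vocabulary) and reused across tasks,
    mirroring how a pretrained model's representations are shared.
    """

    def __init__(self, vocab):
        self.vocab = sorted(vocab)  # fixed after "pretraining"

    def embed(self, text):
        counts = Counter(text.lower())
        total = max(len(text), 1)
        return [counts[ch] / total for ch in self.vocab]


class TaskHead:
    """Small task-specific adapter on top of the frozen foundation."""

    def __init__(self, foundation, label_fn):
        self.foundation = foundation
        self.label_fn = label_fn  # task rule standing in for fine-tuning

    def predict(self, text):
        return self.label_fn(self.foundation.embed(text))


# One shared foundation, two different "downstream tasks":
fm = ToyFoundationModel(vocab="abcdefghijklmnopqrstuvwxyz")

# Task 1: is the text vowel-heavy?
vowel_idx = [fm.vocab.index(v) for v in "aeiou"]
vowel_head = TaskHead(fm, lambda f: sum(f[i] for i in vowel_idx) > 0.5)

# Task 2: does the text contain the letter 'z'?
z_idx = fm.vocab.index("z")
z_head = TaskHead(fm, lambda f: f[z_idx] > 0)

print(vowel_head.predict("aeiea"))  # True
print(z_head.predict("puzzle"))     # True
```

Note that `fm` is never retrained between tasks; only the cheap heads differ, which is the economic argument for foundation models in a nutshell.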

This recent success in simplifying access to these models has shaken the whole industry, enabling everyone – from research labs to ML startups, to enterprises and the average Joe – to start learning about FMs/LLMs and applying them. Countless ML startups have pivoted toward LLM applications. Yet, as accessibility grows, our collective understanding of these models remains in its early stages. A systematic approach, particularly when categorizing FMs, is still evolving.

What sets foundation models apart?

  • Their vast reservoirs of training data.

  • The sheer number of parameters they possess.

  • Their unparalleled adaptability, which allows them to be fine-tuned for a diverse range of tasks.

These characteristics made some fundamental changes to how we think about the applicability of ML!

  • Foundation models are centering the value realization towards the task layer, making it easier to kickstart new solutions by leveraging core AI capabilities.

  • Better models ≠ better products. Foundation models are making it much more accessible to integrate AI capabilities, both for business and consumer applications.

  • Being a foundation. As the name suggests, you don’t have to build a new model each time; instead, you can use one foundation model for a range of different tasks.

Classifying foundation models

The realm of foundation models is vast and diverse. They can be:

  • Generative or Predictive: The former creates content, while the latter predicts or classifies based on input.

  • Transformer or Diffusion-Based: These refer to the underlying architecture.

  • Modality-Specific: They can focus on a particular medium like text, video, or audio, or be multimodal, working across multiple types.


Prominent examples of generative models include transformer-based large language models (LLMs) like the GPT series and diffusion models like Midjourney.
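The taxonomy above (generative vs. predictive, transformer vs. diffusion, modality-specific vs. multimodal) can be made concrete as a small data structure. This is a hypothetical sketch for illustration only; the example assignments (e.g., labeling CLIP as predictive and transformer-based) follow the categories discussed in this article, not any official classification.

```python
# Illustrative taxonomy of foundation models; names and assignments
# are examples drawn from the article, not a canonical registry.

from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    GENERATIVE = "generative"   # creates content
    PREDICTIVE = "predictive"   # predicts or classifies based on input


class Architecture(Enum):
    TRANSFORMER = "transformer"
    DIFFUSION = "diffusion"


@dataclass(frozen=True)
class FoundationModel:
    name: str
    output: Output
    architecture: Architecture
    modalities: tuple  # e.g. ("text",) or ("text", "image")

    @property
    def multimodal(self) -> bool:
        return len(self.modalities) > 1


catalog = [
    FoundationModel("GPT-4", Output.GENERATIVE, Architecture.TRANSFORMER,
                    ("text", "image")),
    FoundationModel("Midjourney", Output.GENERATIVE, Architecture.DIFFUSION,
                    ("text", "image")),
    FoundationModel("CLIP", Output.PREDICTIVE, Architecture.TRANSFORMER,
                    ("text", "image")),
]

multimodal_names = [m.name for m in catalog if m.multimodal]
print(multimodal_names)
```

The three axes are independent, which is why a single model can sit in several categories at once (e.g., generative, transformer-based, and multimodal).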

Trending now with a lot of potential are multimodal models (sometimes termed LMMs, though we find this label confusing). These models are designed to draw information from different sources such as text, image, audio, and video to build a more complete and accurate understanding of the underlying data. Notable examples include CLIP, DALL-E, ViT (which can also be classified under computer vision models), Flamingo, BLIP, Kosmos-1, and others. Currently, most multimodal models are primarily bimodal (e.g., text-to-image). However, with recent ChatGPT updates, users can now send prompts that combine text, images, and voice.

As Chip Huyen pointed out in her blog, "There are other data modalities we haven't explored, such as graphs and 3D assets. We also haven't delved into the formats used to represent smell and touch (haptics)." Exciting!

The application of multimodal models in robotics and healthcare appears to be particularly relevant and exciting.


While generative AI has captured mainstream attention, it's crucial to spotlight significant ongoing research in the non-generative domain. For instance, I-JEPA, developed by Meta AI, is a self-supervised learning architecture geared towards understanding images in a manner akin to human cognition. It achieves this by predicting representations of various image segments from a singular context block. Drawing inspiration from Yann LeCun's vision of AI models with internal world models, I-JEPA offers quicker learning and enhanced adaptability compared to traditional generative approaches. This architecture, which focuses on abstract representations, demonstrates impressive performance on several computer vision tasks and boasts heightened computational efficiency. Having been trained on over 100 million images, I-JEPA heralds a move toward more human-like AI. It leverages self-supervised learning to derive semantic insights from images, making it suitable for a range of applications, including self-driving cars and medical diagnoses.

This list is not exhaustive but gives a sense of the current developments.

Interview with Rishi Bommasani

Wrapping up, if your enterprise is contemplating the integration of foundation models, consider the insights from Rishi Bommasani, one of the authors of “On the Opportunities and Risks of Foundation Models” and Society Lead at the Stanford Center for Research on Foundation Models (CRFM).

Brief summary:

  1. Choosing Foundation Models: Deciding on FMs involves considering resources, benefits, costs, and the potential to free employees from mundane tasks.

  2. Foundation vs. Traditional ML: FMs are versatile, and their value often lies in adaptation. Assessment should focus on application-specific performance.

  3. Evaluating Adaptability: Evaluation of FMs is typically done by assessing performance when adapted for specific tasks, not just inherent adaptability.

  4. Importance of Dual Assessments: Both foundation models and their specialized derivatives need evaluation, catering to different stakeholder priorities.

  5. Resource Considerations: While intrinsic property evaluations cater to research, practical industry decisions require balancing accuracy with efficiency, robustness, and fairness.

Full interview:

- How can one determine whether your project or company requires a foundation model?

This is clearly a very complex decision, in that the answer will depend on many contextual factors: what resources and expertise does the company have, what is their current process, and what is the incremental benefit of a FM (e.g. greater accuracy, technical simplicity), what are incremental costs (e.g. greater latency, dependency on external foundation model providers), which of these are one-time vs. recurring, and so forth.

Perhaps a more important question in my mind is how will the foundation model be used (e.g. are employees asking the model questions or using it to classify documents or caption images for user-facing applications) and, as a result, how it will allow for the reallocation of human labor at the company. I think many of the productive applications of foundation models I see are when they change company practices to allow employees to focus attention on core decisions that require judgment and bear high stakes, which is precisely where you want human insight, and away from monotonous tasks that require little human touch.

- Foundation models differ from traditional machine learning models in that traditional ML models are trained for specific tasks, while foundation models can be versatile across numerous tasks. Can these two types of models be easily compared?

They definitely can be compared, and it should be said there is a softer continuum based on how aggressively one adapts a foundation model. While one extreme is to train a bespoke model from scratch for each task/application, for many applications we see sufficient positive transfer/leverage from the foundation model that a more relevant question is how aggressively to adapt/specialize/customize the foundation model to the application at hand. For example, should a smaller foundation model be heavily specialized through fine-tuning on application-specific data, or a larger foundation be simply prompted? Regardless, all of these models can be evaluated for how they fare on the evaluations relevant to the specific application: in most cases, this is the most socially and commercially relevant assessment.

- Foundation models possess specific features like adaptability that are not found in traditional models. Is there an established evaluation framework for assessing the quality of foundation models?

It’s an interesting question. While one could try to assess how adaptable a foundation model is, often the question is operationalized instead by asking how well the foundation model does when adapted in a specific way. E.g. how well does OpenAI’s GPT-4 do when it is prompted in a specific way, or how well does Meta’s Llama 2 do fine-tuned on specific data? Since, in particular, the best way to adapt a foundation model may differ across foundation models due to fundamental and practical constraints (e.g. Anthropic does not expose the capability to fine-tune Claude 2 at the time of writing).

- Specifically, is it necessary to evaluate both foundation models and task-specific derivatives? What is the best approach to address this?

Indeed, the best approach is to evaluate both, because evaluations will surface different insights. More fundamentally, the results of those interpretations will have different significance: a startup making a decision on which model to use in their customer service application should pay attention to different evaluations than the broad AI academic community.

- Could you share your knowledge about the practical implementation of meta-benchmarks and direct evaluation of intrinsic properties? Is this level of detail necessary for the industry, or is it more for scientists who develop new methods and foundation models?

I would say that directly attempting to evaluate the intrinsic properties of a foundation model (e.g. abstract reasoning capabilities) is more of a scientific exercise than something immediately relevant for most contexts. In that regard, I believe there is a dramatic over-emphasis on matters like emergent capabilities or legal reasoning on the Bar Exam and so forth. Researchers will hone these evaluations over time, but what is immediately business-relevant needs to be much more concrete and grounded. Of course, the hope is evaluations of abstract capabilities like analogical reasoning are predictors of the downstream performance that impacts society, but unfortunately, these predictive relationships are not well-established just yet.

- Should one consider the resources required for adapting and scaling foundational models when choosing between them?

Most likely. I think it is clear that the field of AI is dramatically over-valuing how accurate models are (e.g. raw capabilities) in a way that reflects a poor understanding of what matters in society. No matter how accurate a model is, if the costs/resources required are exorbitant, it won’t be used. Our work on the Holistic Evaluation of Language Models (HELM) embodies this perspective by presenting an integrated view that includes accuracy, efficiency, robustness, fairness, and many other matters. While accuracy may be a critical factor (and, perhaps correctly, the most important factor), we should be looking at these many dimensions to make well-informed decisions about which models to use.

