Diving into the foundational models that once reshaped machine learning is extremely exciting – we can track what was invented and how it influenced today’s AI. One of these approaches is BERT, Bidirectional Encoder Representations from Transformers, developed by researchers at Google AI in 2018 (and baked into Google Search in 2019).
This neural network architecture introduced a groundbreaking shift in natural language processing (NLP). BERT was the first to pre-train a deep Transformer in a truly bidirectional way. At its core, BERT is a pre-trained language model (LM) that learns a deep representation of text by considering both left and right context simultaneously in all its layers: in other words, you guessed it, bidirectionally. This allows the model to develop a richer understanding of word meaning and context than was possible with earlier one-directional LMs. BERT was also among the first to bring the “pre-train then fine-tune” paradigm to Transformers at scale. And it’s not just about the unique concept – BERT’s performance results firmly established it as a new standard in NLP.
Around 2020 the spotlight swung to giant autoregressive models (GPT‑3, PaLM, Llama, Gemini). They could generate text, not just label it, and the hype machine followed. In headlines and conference buzz, BERT looked passé – even though its dozens of spin‑offs (RoBERTa, DeBERTa, etc.) kept doing the heavy lifting for classification, ranking, and retrieval.
Lately BERT has been seeing quite a revival, driven by pragmatism. Research from 2024‑25 (you might have seen ModernBERT and NeoBERT) is rediscovering its sweet spot: cheap, fast, and good enough when you don’t need a 70‑billion‑parameter chatbot. BERT’s specific capabilities have sparked a wave of novel models that also understand context more deeply. So if you have been hearing a lot about BERT recently, it’s time to refresh your knowledge and discuss what is so special about BERT and what it is really capable of. We’ll also look at different models built on the BERT concept and explore how BERT is used in retrieval tasks, taking the freshest (today’s!) open-source release as an example: ConstBERT from Pinecone. We also asked one of the researchers behind it a couple of questions! It’s a fascinating journey through the foundations and current innovations – so let's go.
In today’s episode, we will cover:
The main idea behind BERT: Overcoming the limitations of GPT
How does BERT work?
How effective is BERT, actually?
Cool family of BERT-based models
BERT and Retrieval – ConstBERT
General limitations
Conclusion
Sources and further reading
The main idea behind BERT: Overcoming the limitations of GPT
Some basics: Generative Pre-trained Transformer (GPT) is a language model based on the Transformer architecture, which excels at handling sequences of data. Its main focus is generating text: writing essays, producing summaries and classifications, generating code, or continuing a conversation. It predicts the next word in a sentence given the previous words. This means that GPT models process text in only one direction, from left to right. While this works well for text generation, this unidirectional nature limits the model's ability to understand the full context of a word, because it considers only the previous words and not the ones that follow. But what if we need to look a little bit further ahead?
Some tasks, like question answering, indeed require a deeper understanding of relationships between words or sentences. That’s why back in 2018 (these 7 years feel like a century ago!), researchers from Google AI Language developed a bidirectional model called Bidirectional Encoder Representations from Transformers, or simply BERT. This model looks at both the left and right context of a word at the same time, which helps it better understand the meaning of sentences. It’s trained on a lot of text without needing labeled data, which makes it very flexible. Another notable thing is that BERT’s pre-trained language representations can be easily adapted to various NLP tasks by fine-tuning a task-specific output layer, or head.

Image Credit: BERT original paper
Each task, such as question answering or sentence classification, has its own fine-tuned version of the model, but all versions are initialized from the same pre-trained model.
Here is how this idea is realized from the technical side and what allows BERT to achieve state-of-the-art (SOTA) performance.
How does BERT work?
BERT’s main feature is that it has a unified architecture for all tasks. Its structure remains almost the same whether it is in the pre-training phase or fine-tuning. So conceptually, BERT introduced the idea of pre-training a deep Transformer network on a general language task, then fine-tuning it for specific NLP tasks, requiring just an additional output layer for a new task.
For pre-training, as we’ve already said, BERT doesn’t need labeled data. It uses two unsupervised tasks instead:
Masked Language Model (MLM)
BERT randomly selects 15% of the tokens in the input and then tries to predict the original tokens based on the surrounding context. It does not reconstruct the whole input; it predicts only the selected positions, choosing each one from the model’s vocabulary.
Interestingly, BERT doesn’t always replace the selected token with the [MASK] token, because [MASK] never appears during fine-tuning. It uses the [MASK] token in 80% of cases, replaces the token with a random one in 10% of cases, and leaves it unchanged in the remaining 10%.
Through this, BERT learns to handle situations where a word is not explicitly hidden behind a special token.
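To make the 80/10/10 rule concrete, here is a minimal sketch of the corruption step in plain Python. It is an illustration under simplifying assumptions, not BERT's actual pre-processing code: the function name `mask_tokens` is hypothetical, tokens are whole words rather than WordPiece units, and each position is selected independently with probability 15%.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=None):
    """Apply BERT-style MLM corruption; return (corrupted tokens, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue  # special tokens are never selected for prediction
        if rng.random() >= mask_rate:
            continue  # position not selected
        labels[i] = tok  # the model has to recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = "[CLS] the cat sat on the mat [SEP]".split()
toy_vocab = ["dog", "ran", "under", "a", "rug"]  # toy vocabulary for the demo
print(mask_tokens(tokens, toy_vocab, seed=0))
```

The loss is computed only at positions where `labels` is not `None`, which matches the point above that BERT predicts the selected tokens rather than the whole input.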
Next Sentence Prediction (NSP)
This task helps BERT understand the relationship between two sentences, specifically whether one follows the other. When BERT is given two sentences, half of the time the second sentence (Sentence B in the picture below) is the actual next sentence after Sentence A. The other half of the time, Sentence B is a random sentence from the corpus.

Image Credit: BERT original paper
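The 50/50 sampling behind NSP can be sketched in a few lines. This is only an illustration: `make_nsp_examples` is a hypothetical helper, the "document" is a plain list of sentences, and negatives are drawn from a small in-memory list instead of a real corpus.

```python
import random

def make_nsp_examples(sentences, corpus, seed=None):
    """Build (sentence_a, sentence_b, is_next) triples from an ordered document."""
    rng = random.Random(seed)
    examples = []
    for a, true_next in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            # Positive example: B really follows A in the document
            examples.append((a, true_next, True))
        else:
            # Negative example: B is a random sentence that is not the true next one
            candidates = [s for s in corpus if s != true_next]
            examples.append((a, rng.choice(candidates), False))
    return examples

document = ["I went to the store.", "I bought some milk.", "Then I walked home."]
other_sentences = ["Penguins cannot fly.", "The meeting starts at noon."]
for example in make_nsp_examples(document, other_sentences, seed=42):
    print(example)
```

During pre-training the model sees each pair as one input and learns a binary label (next or not next) from the representation of the first token.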
The next stage is fine-tuning, which helps adapt the acquired knowledge to real tasks.
At the core of BERT’s Transformer architecture is the self-attention mechanism. It’s what makes BERT bidirectional, letting it consider tokens in both directions of the text. Many earlier models handled text pairs in two stages, first encoding each text independently and then applying bidirectional cross attention between them. BERT uses self-attention to unify these two stages: encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences. For a complete breakdown of how self-attention, QKV, and KV cache work, read our Ultimate Guide to Attention.
For example, in the phrase "The cat sat on the mat," BERT considers how the word "sat" is related to "cat" and "mat", both before and after. Self-attention allows BERT to process all tokens at once, instead of processing them one by one. This helps gather a complete understanding of the relationships between words in the sentence.
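The difference between bidirectional and left-to-right attention comes down to a mask, which a toy example can show. The sketch below is heavily simplified and makes several assumptions: the 2-dimensional "embeddings" are invented, there are no learned query, key, or value projections, and there is only a single head.

```python
import math

def attention_weights(vectors, causal=False):
    """Scaled dot-product attention weights, using the vectors as both queries and keys.

    causal=False: every token attends to every token (BERT-style, bidirectional).
    causal=True:  token i attends only to positions <= i (GPT-style, left to right).
    """
    d = len(vectors[0])
    weights = []
    for i, q in enumerate(vectors):
        scores = []
        for j, k in enumerate(vectors):
            if causal and j > i:
                scores.append(float("-inf"))  # future positions are masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Invented toy embeddings, one 2-d vector per token
toy_vectors = [[0.1, 0.0], [1.0, 0.2], [0.8, 0.9], [0.0, 0.3], [0.1, 0.0], [0.9, 0.4]]

sat = tokens.index("sat")
print("bidirectional:", [round(w, 2) for w in attention_weights(toy_vectors)[sat]])
print("causal:       ", [round(w, 2) for w in attention_weights(toy_vectors, causal=True)[sat]])
```

In the bidirectional case the row for "sat" puts weight on both "cat" and "mat"; in the causal case everything after "sat" gets a weight of zero.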
Self-attention also helps on the sentence level. When it comes to tasks that involve pairs of sentences, BERT can explore how these sentences are related to each other, processing them simultaneously. Tasks that involve sentence pairs include:
Paraphrasing: Identifying if one sentence is a rephrasing of the other.
Question answering: Given a question and a passage, finding the correct answer from the passage.
Natural Language Inference (NLI): Determining whether one sentence entails, contradicts, or is neutral toward the other.
For example, in sentence pairs, BERT takes both sentences and connects them together into a single input sequence. A question-answer pair might be turned into:
Sentence A: "What is the capital of France?"
Sentence B: "The capital of France is Paris."
In addition to word embeddings, BERT uses special tokens to mark up the input. The first token is always [CLS], whose final hidden state is used for classification tasks, and the sentences are separated by the [SEP] token.
So BERT takes a question and a passage as a single input sequence that may look like this:
Input = [CLS] What is the capital of France? [SEP] The capital of France is Paris. [SEP]

Image Credit: BERT’s input representation. BERT original paper
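Here is a small sketch of how such a packed input could be assembled, together with the segment and position indices that BERT adds to the token embeddings. The helper `build_pair_input` is hypothetical, and it assumes the text is already split into tokens (real BERT uses a WordPiece tokenizer).

```python
def build_pair_input(tokens_a, tokens_b):
    """Pack two token lists into one BERT-style sequence.

    Returns the tokens plus segment ids (0 for sentence A, 1 for sentence B)
    and position ids, which select the segment and position embeddings.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS] and the first [SEP] belong to segment A; the final [SEP] to segment B
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

question = "what is the capital of france ?".split()
passage = "the capital of france is paris .".split()
tokens, segment_ids, position_ids = build_pair_input(question, passage)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:2d}  segment {seg}  {tok}")
```

Each input vector is then the sum of three embeddings: the token embedding, the segment embedding, and the position embedding.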
Overall, fine-tuning adjusts BERT's pre-trained parameters to make it perform well on specific tasks, and all the layers, including the self-attention layers, are updated to optimize performance for the given task.
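For a classification task, the "additional output layer" can be as small as one weight matrix applied to the final hidden state of [CLS]. The sketch below only shows the shape of that computation: the vector and weights are made-up numbers, and `classify_from_cls` is a hypothetical name. In real fine-tuning these weights are learned jointly with all of BERT's layers.

```python
import math

def classify_from_cls(cls_vector, weights):
    """Return class probabilities softmax(W h) for one input.

    cls_vector: final hidden state of the [CLS] token (length H)
    weights:    one row of H weights per class (K rows)
    """
    logits = [sum(w * h for w, h in zip(row, cls_vector)) for row in weights]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

toy_cls = [0.3, -1.2, 0.8]            # made-up hidden state with H = 3
toy_weights = [[0.5, -0.4, 0.1],      # class 0
               [-0.2, 0.9, 0.3]]      # class 1
print(classify_from_cls(toy_cls, toy_weights))
```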
How effective is BERT, actually?
Another reason why BERT drew so much attention is its breakthrough performance. In 2018, BERT’s launch reset the NLP high-score table:
GLUE: BERT‑Large scored 80.5, about 7 points above the prior SOTA and 7.7 points above GPT.
SQuAD v1.1: +1.5 F1 over the prior best; SQuAD v2.0: +5.1 F1.
SWAG: +27 points over ESIM+ELMo and +8.3 over GPT.
Those 110‑ to 340‑million‑parameter encoders became the default backbone and spawned the still‑growing “‑BERT” family powering classification, ranking, and retrieval pipelines today.
Cool family of BERT-based models
Let’s start with several classic examples (remember, all links are in the Sources and further reading section):
RoBERTa (Robustly Optimized BERT Pretraining Approach): An optimized version of BERT from Facebook AI that reshapes pre-training: it drops the Next Sentence Prediction (NSP) objective, uses dynamic instead of static masking, trains for longer with larger batch sizes, on longer sequences and larger datasets, and employs a byte-level BPE tokenizer.
ALBERT (A Lite BERT): This parameter-efficient variant of BERT introduced by Google Research is designed to improve scalability and reduce memory usage of BERT with two major architectural innovations – Factorized Embedding Parameterization and Cross-Layer Parameter Sharing. It also replaces BERT's Next Sentence Prediction (NSP) with a new self-supervised loss – Sentence Order Prediction (SOP).
DistilBERT by Hugging Face applies knowledge distillation during the pre-training phase of an LM to create a smaller model that retains most of BERT’s performance. The result is a smaller, cheaper model that retains 97% of BERT-Base’s performance; in on-device tests it ran 71% faster than BERT-Base with a footprint of 207 MB.
TinyBERT is a 7.5× smaller and 9.4× faster version of BERT built through knowledge distillation, which still achieves ~96.8% of BERT-Base's performance on the GLUE benchmark. It introduces a two-stage learning framework, including general and task-specific distillation, combined with a Transformer-specific (layer-wise internal) distillation method.
SpanBERT is designed to improve BERT’s performance on span-level prediction tasks through three key concepts: 1) span-based masking instead of token masking, 2) Span Boundary Objective (SBO) for training, and 3) Single-Segment Pretraining that removes NSP.
Several other approaches have also been developed: ELECTRA, DeBERTa, and MiniLM, along with many domain-specific BERTs, such as SciBERT, BioBERT, ClinicalBERT, BlueBERT, LegalBERT, FinBERT, and PubMedBERT. Additionally, there are multilingual and language-specific models like mBERT, XLM-RoBERTa, and CamemBERT.
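Since DistilBERT and TinyBERT both rely on knowledge distillation, here is a minimal sketch of its core ingredient: the student is trained to match the teacher's temperature-softened output distribution. This is the generic soft-target loss, not the full training objective of either model (both combine several loss terms), and the logits below are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [4.0, 1.0, 0.5]        # invented teacher logits for three classes
good_student = [3.5, 1.2, 0.4]   # close to the teacher
poor_student = [0.0, 0.0, 0.0]   # uniform guess
print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it pushes the small model toward the behavior of the large one.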
The most advanced recent BERT approaches include:
ModernBERT (2024): It has a more up-to-date Transformer architecture with Rotary Positional Embeddings (RoPE) for long context, GeGLU layers in place of the original GeLU MLP blocks, and an added layer norm after the embedding layer. It is trained on 2 trillion tokens. In MLM pre-training it uses a 30% mask rate compared to BERT’s 15%, and it doesn’t use NSP.
ModernBERT also introduces efficiency-oriented changes such as hybrid attention: every third layer uses global attention, while the layers in between use local sliding-window attention (128 tokens). It removes all padding tokens before computation, since they only waste resources, and packs sequences together using a greedy algorithm to fully utilize the input length. It’s efficient on common GPUs, running 2–4× faster than DeBERTaV3 and handling contexts of up to 8,192 tokens.
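The hybrid attention pattern is easy to picture as a per-layer schedule plus a band-shaped mask. The sketch below is an illustration with assumed details: which layer index counts as "every third", and the choice of a window centered on each token (half on each side), are assumptions rather than a description of ModernBERT's code.

```python
def layer_attention_types(num_layers, global_every=3):
    """Assign 'global' attention to every third layer and 'local' to the rest."""
    return ["global" if i % global_every == 0 else "local" for i in range(num_layers)]

def sliding_window_mask(seq_len, window=128):
    """Boolean mask where token i may attend to token j only if |i - j| <= window // 2."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]

print(layer_attention_types(6))

# Tiny example: 6 tokens with a window of 2 (one neighbor on each side)
for row in sliding_window_mask(6, window=2):
    print("".join("x" if allowed else "." for allowed in row))
```

A global layer costs time proportional to the square of the sequence length, while a local layer costs time proportional to length times window size, which is where much of the saving on long inputs comes from.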
NeoBERT (2025) is another fresh reimagining of BERT, which goes further in training: it’s trained on 2.1 trillion tokens and uses a 20% masking rate in MLM, also skipping the NSP objective. Like ModernBERT, NeoBERT uses RoPE, but it implements SwiGLU instead of GeGLU activations, along with RMSNorm, which improve its ability to model long sequences and stabilize training. For fast inference at long lengths, it uses FlashAttention.
The NeoBERT4096 model processes 4,096-token sequences at 17.2k tokens/sec, beating ModernBERT-base by ~47% in throughput. It is also 2× faster than RoBERTa-large on long inputs.

Image Credit: NeoBERT original paper
Have you noticed the tendency for most of these approaches to skip NSP pre-training? The reason is that NSP slows down training unnecessarily, adding an extra classification head and extra complexity. MLM pre-training already covers enough contextual learning, allowing BERT to learn sentence boundaries, context flow, and coherence across sentences, so NSP becomes redundant. That’s why, since the release of RoBERTa, it has been removed from most BERT variants.
What is more, one of the most interesting applications of BERT is efficient retrieval. And just today Pinecone announced the open-source ConstBERT, a multi-vector retrieval method.
BERT and Retrieval – ConstBERT
We asked one of the ConstBERT researchers, Antonio Mallia, why using BERT for retrieval is effective.
Why did you choose BERT as the foundation for your model? What makes it especially valuable for retrieval tasks?
“We chose BERT primarily because this was an academic exploration, and BERT remains a dominant benchmark in contextualized language modeling within the research community. It offers a well-understood and widely adopted architecture, which helps in ensuring reproducibility and comparability of results. Furthermore, our baseline model, ColBERT, is built on top of BERT, making it a natural choice for a fair and controlled comparison. While more modern foundation models could be leveraged in industrial deployments to improve performance, the core contribution of our work lies in the model architecture itself – particularly in how it manages token-level representations – rather than in the specific backbone used. Adapting the model to newer encoders is straightforward and orthogonal to the architectural innovations we propose.”
A couple of words about ColBERT, the baseline model for ConstBERT. It is a ranking model based on contextualized late interaction over BERT that was introduced by Stanford researchers in 2020. It combines the power of BERT-based contextual embeddings with the speed and efficiency needed for large-scale search systems. Instead of processing query-document pairs together like BERT re-rankers, this retrieval system relies on the following tricks:
ColBERT separately encodes queries and documents into contextualized token-level embeddings.
It performs a lightweight interaction between those embeddings using a late-interaction mechanism called MaxSim: for each query token it takes the maximum similarity to any document token, then sums these maxima into the document’s score.
This enables offline and reusable document encoding and fast fine-grained vector-similarity search.
ColBERT is a multi-vector method, meaning it represents each document with many smaller vectors (typically one per token) so it can match parts of the query to parts of the document. But this approach can slow systems down considerably – just imagine how much time and compute it takes to compare each query vector against the entire pool of document vectors.
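The MaxSim scoring used in late interaction fits in a couple of lines. This sketch assumes the token embeddings are already computed and uses a plain dot product as the similarity; the vectors are invented 2-d toys, and `maxsim_score` is a hypothetical name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector take its best-matching
    document vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Invented token embeddings: 2 query tokens, 3 document tokens
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.2, 0.7], [0.5, 0.5]]
print(maxsim_score(query_vecs, doc_vecs))
```

The cost of scoring one document grows with (number of query vectors) × (number of document vectors), which is why the number of vectors stored per document matters so much.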
Pinecone, together with researchers from the University of Glasgow and the University of Pisa, proposed an upgraded multi-vector approach, ConstBERT, that keeps the same number of vectors per document no matter how long the text is. This brings a clear advantage: the storage and lookup patterns stay predictable.

But how does ConstBERT "decide" which token-level information to preserve when compressing to a fixed number of vectors?
ConstBERT compresses variable-length token sequences into a fixed number of document-level vectors through a learned pooling mechanism. Instead of maintaining one embedding per token, the model learns a set of pooling operators during training that selectively aggregate token embeddings into a small, fixed set of output vectors. Each pooled vector captures different semantic aspects of the document, enabling a rich yet compact representation. This design balances retrieval efficiency with expressiveness, ensuring that critical contextual information is preserved even with aggressive dimensionality reduction.
Here is what the ConstBERT workflow looks like in simpler terms:
The original M token embeddings run through one extra, simple linear layer.
That layer outputs a fixed number of vectors, irrespective of the document’s length. Let’s call it C. This C is smaller than M.
The next stage is training that layer together with the rest of the model, so it learns which combinations of tokens best represent each summary vector.
When a query comes in, the system compares each query token to those C summary vectors.
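The steps above can be sketched as a fixed-size pooling over token embeddings. Treat this as a simplified illustration under loud assumptions: the weights here are set by hand (in ConstBERT they are learned during training), each output vector is a plain weighted mix of token vectors, the document is assumed to be padded or truncated to exactly M tokens, and `pool_to_fixed` is a hypothetical name.

```python
def pool_to_fixed(token_vecs, pooling_weights):
    """Mix M token vectors into C pooled vectors.

    pooling_weights has C rows of M weights each, so
    pooled[c] = sum over m of pooling_weights[c][m] * token_vecs[m].
    """
    dim = len(token_vecs[0])
    pooled = []
    for row in pooling_weights:  # one row of M weights per output vector
        vec = [0.0] * dim
        for w, tok in zip(row, token_vecs):
            for k in range(dim):
                vec[k] += w * tok[k]
        pooled.append(vec)
    return pooled

# M = 4 invented token embeddings, compressed to C = 2 summary vectors
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
hand_set_weights = [[0.5, 0.5, 0.0, 0.0],   # summary vector 0 mixes tokens 0 and 1
                    [0.0, 0.0, 0.5, 0.5]]   # summary vector 1 mixes tokens 2 and 3
print(pool_to_fixed(token_vecs, hand_set_weights))
```

Whatever the document length, the index stores exactly C vectors for it, and query-time late interaction runs against those C vectors instead of M token vectors.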
Thanks to these capabilities, ConstBERT fits well into a three-stage retrieval pipeline (also proposed by Pinecone) as the middle layer, placing multi-vector retrieval between the “fast but rough” and “slow but perfect” approaches. Here is how:
Stage 1: Fast first-pass retrieval
A cheap, single-vector dense retriever (or even a sparse model) grabs, for example, the top 1,000 candidates.
Stage 2: Multi-vector refinement
Rescore those 1,000 candidates with a multi-vector model and narrow them down to the top 100. The researchers propose attaching ConstBERT’s fixed vectors as metadata on each item to do a quick late-interaction pass.
Stage 3: Cross-encoder reranking
If you need the very highest precision, apply a cross-encoder just to the top 100.
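The three stages compose into a simple cascade in which each scorer only sees the survivors of the previous one. In this sketch the three scorers are hypothetical callables of the form `score(query, doc)`; the toy lambdas in the demo just stand in for a dense retriever, a multi-vector model, and a cross-encoder.

```python
def cascade_search(query, corpus, first_stage, multi_vector, cross_encoder,
                   k1=1000, k2=100, k3=10):
    """Three-stage retrieval: each stage rescores the previous stage's top results."""
    def top_k(candidates, scorer, k):
        ranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
        return ranked[:k]

    stage1 = top_k(corpus, first_stage, k1)     # cheap and rough, over everything
    stage2 = top_k(stage1, multi_vector, k2)    # late interaction on the survivors
    return top_k(stage2, cross_encoder, k3)     # most expensive, on very few items

# Toy demo: documents are numbers, relevance is closeness to the query,
# and the cheaper stages see a coarser version of the true score.
corpus = list(range(20))
rough = lambda q, d: -(abs(q - d) // 5)
finer = lambda q, d: -(abs(q - d) // 2)
exact = lambda q, d: -abs(q - d)
print(cascade_search(7, corpus, rough, finer, exact, k1=10, k2=5, k3=1))
```

The design choice is about budget: the expensive scorer runs on k2 documents instead of the whole corpus, so its cost no longer depends on collection size.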
It’s important to note that multi-vector models like ColBERT and ConstBERT aren’t built for “grabbing everything”; they work better in a cascade. Compared to ColBERT, ConstBERT performs slightly worse out of the box, but as a reranker it matches or surpasses ColBERT in accuracy while using much less storage and compute.
This example of BERT implementation as part of a retrieval system demonstrates that BERT can be more than just a model that answers questions about the context; it can also be a component of more complex systems.
We’ve explored many BERT variants and highlighted their performance gains. Yet, even though BERT revolutionized NLP, it still has a few limitations.
General limitations
Computationally expensive: BERT, especially the larger version, requires significant computational resources, powerful hardware (like TPUs or GPUs) and a lot of time for training and fine-tuning. That is why many researchers choose to distill BERT models to work with its smaller versions.
Need for massive training data: BERT relies heavily on unsupervised pre-training on large datasets, which also makes it limited by the data it's trained on.
Dependence on fine-tuning data: If the task-specific data is noisy or insufficient, BERT may not perform well, even though it was pre-trained on large, high-quality datasets.
No long-term memory: For tasks that require reasoning over long sequences of information, like multi-turn conversations or long-range dependencies in text, BERT might struggle because it processes each input independently and doesn’t “remember” anything from earlier inputs.
Limited to fixed input length: BERT has a maximum input length, usually 512 tokens. This can be a limitation for tasks that require processing longer documents, although advanced BERT variants like ModernBERT and NeoBERT are overcoming it.
BERT Today: From NLP Game-Changer to Efficient Retrieval
New approaches to model architectures encourage breakthroughs that then become foundations for future AI models. This is the story of BERT. Once created, it became the starting point for a cascade of novel models built for deeper context understanding, improving on that first BERT version introduced years ago. Combinations like BERT and retrieval are especially inspiring, showing its potential as a backbone for more complex systems.
Maybe the next stage is to use BERT as part of a more advanced, modular workflow? For a smart AI future, we need models like BERT – ones that don’t just generate answers by following an algorithm, but that look deeper into the task, the context, the problem.
Sources and further reading
Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing (Google Research blog)
ModernBERT (“Finally, a Replacement for BERT” Hugging Face blog)
Cascading retrieval with multi-vector representations: balancing efficiency and effectiveness (ConstBERT) (Pinecone blog)

