“Why is a raven like a writing desk?” – embeddings won’t answer Lewis Carroll’s famous riddle, but they can show how even a strange question becomes a matter of distance, neighborhoods, and latent structure.

In the previous series we looked at tokens from many perspectives, including technology and economics. But once we have tokens, the next question is: how does a model process them?

Tokenization breaks language into pieces a model can count. But a token ID, on its own, is still useless to a neural network. The number 14,382 does not mean “cat”, “bank”, or “##ing”; it is just an index. Words get meaning only from the contexts they appear in, and before a model can do anything that leads to a sensible response, it has to turn that token index into a meaningful form. This is where embeddings – dense vectors of numbers – step to the front.

Embeddings are the step where pieces of language move from integers into geometry, where distance represents meaning. This shift from discrete symbols to continuous space enables similarity search, contextual understanding, transfer learning, retrieval, and multimodal alignment at scale.

It’s science, but it still feels like magic.

Today we’ll dive into how tokens gain meaning through embeddings, how this works in general, and why RoPE (Rotary Position Embedding) – now used in so many systems – is so important. But let’s start with the core idea.

In today’s episode:

  • The history: How did we arrive at embeddings?

  • Core embedding concepts

  • From tokenization to embedding

  • Embeddings in Transformers

  • A little bit more about embedding evolution

  • What is RoPE and why is it so important?

  • Conclusion

  • Sources and further reading

The history: How did we arrive at embeddings?

The idea of embeddings comes from concepts and approaches that existed long before we actually started calling them “embeddings.” An important starting point was the work “Distributed Representations” by G. E. Hinton and colleagues, which raised a fundamental question: how should a system represent knowledge at all?

Representation is how a system encodes an object internally. In earlier approaches to AI and cognitive modeling, the main representational scheme was local: one concept, one unit (one feature, or dimension). But in a local scheme there is no notion of similarity unless you explicitly build it, there is no sharing between concepts, and scaling becomes awkward and unrealistic, because every new concept needs its own slot.

Distributed representations brought a richer view: a concept is represented as a pattern, a specific configuration of values across all units, where each unit can participate in many concepts at once. Once a concept is a pattern rather than a location, you can no longer treat representations as isolated discrete objects. They start behaving like points in a space (and this is exactly what today’s embeddings are about). Two concepts that share structure will activate overlapping sets of units, and that overlap becomes the definition of similarity. This works more like learning in real life: learning a fact about one animal shifts expectations about similar animals, and systems that use distributed representations make this transfer automatically.

If you translate that into modern language, you get:

  • units → dimensions

  • activations → values

  • patterns → vectors

Another big early step was the 1990 paper “Indexing by Latent Semantic Analysis” by Scott Deerwester and others. The idea was to move from searching only for the exact words in a query to applying Singular Value Decomposition (SVD) to a term-document matrix in order to build a smaller “semantic” space. SVD is a linear algebra method for breaking a matrix into simpler pieces, X = UΣVᵀ, that reveal its underlying structure: how terms relate to hidden concepts (U), how important each concept is (Σ), and how documents relate to those same concepts (V). In the semantic space, related words, documents, and queries end up near each other, so a system can find relevant documents even when they use different words. It helped with the synonym problem, but only partly with words that have multiple meanings.
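
To make the SVD step less abstract, here is a minimal sketch in Python with NumPy. The tiny term-document matrix, the vocabulary, and the choice of k = 2 latent concepts are all made up for illustration; real LSA systems work on much larger, weighted matrices.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Vocabulary and counts are invented for this illustration.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# SVD: X = U @ diag(S) @ Vt, singular values sorted from largest to smallest.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest latent "concepts".
k = 2
term_vecs = U[:, :k] * S[:k]     # one k-dimensional vector per term
doc_vecs = Vt[:k, :].T * S[:k]   # one k-dimensional vector per document

raw_sim = cosine(X[0], X[1])                          # car vs automobile, raw counts
sim_car_auto = cosine(term_vecs[0], term_vecs[1])     # same pair, latent space
sim_car_flower = cosine(term_vecs[0], term_vecs[3])   # unrelated pair, latent space

print(f"raw: {raw_sim:.2f}  latent: {sim_car_auto:.2f}  car/flower: {sim_car_flower:.2f}")
```

In the raw counts, “car” and “automobile” look only weakly related because they mostly appear in different documents. After truncation to two concepts they land on the same direction, while “flower” stays orthogonal to both, which is the synonym effect described above.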

In 2003, with the paper “A Neural Probabilistic Language Model” by Yoshua Bengio et al., the representation became a trainable part of the model itself. The paper proposed mapping each word to a vector (what we now call an embedding) and using a neural network to predict the next word from these vectors. Bengio’s language model framed this as a way to beat the curse of dimensionality: instead of treating every word as an isolated atom, learn a distributed representation for each word and a probability function over sequences expressed through those representations. That basic idea never went away. Modern models are much larger and more sophisticated, but they still begin with the same move: map symbols into vectors, then learn over the vectors.

The result is simple, but really important:

  • Similar words → similar vectors

  • Similar sentences → similar structure

If a model sees a sentence like “The cat is walking in the bedroom,” it can generalize to: “A dog is running in a room,” because cat ≈ dog and bedroom ≈ room. These words belong to the same space.

This is the “magic” of embeddings: generalization. They make the relations between words and concepts visible to the model and give it a sense of context.

So what do we have? Distributed representations introduced the idea that concepts are overlapping patterns over shared components, and embeddings turned that into a concrete story: concepts are vectors in a continuous space where geometry encodes meaning.

Let’s sort out all the terms and workflows.

Core embedding concepts

Here are definitions you need to understand to go further with embedding exploration:

  1. Vector – an ordered list of numbers (for example, [0.2,−1.3,0.7,...]) that represents an object in a mathematical space. In embeddings, a vector encodes features of meaning, where each dimension captures some latent property learned from data.

  2. Embedding – a learned dense vector representation of an object (word, sentence, image, etc.) such that similar items are mapped to nearby points in space. Embeddings are designed to capture semantic or functional relationships.

  3. Dimension – one coordinate (feature) of a vector; the total number of dimensions defines the space (e.g., it can be 768-dimensional). Each dimension typically contributes to encoding patterns in the data.

  4. Dense vectors – vectors where most values are non-zero, compactly encoding information across all dimensions. They are typical for embeddings, because they generalize better by distributing meaning across dimensions.

  5. Sparse vectors – vectors with mostly zero values and with only a few active features.

  6. Vector space – a geometric space where objects are represented as vectors and you can do operations like addition, scaling, dot products (the sum of component-wise products, often used as a similarity score), etc.

  7. Embedding space – a specific vector space learned by a model, where all embeddings live, and where operations like distance, angles, and directions encode meaning. Relationships between vectors (such as closeness or direction) correspond to relationships between the underlying concepts.

    Every embedding space is a vector space, but not every vector space is an embedding space.

  8. Latent space – a transformed feature space learned by a model where underlying structure, like semantic similarity, becomes more explicit. It captures hidden factors that explain the data, even if those factors are not directly observable or labeled. (It’s also the name of the podcast Latent Space that we’re fans of.)

  9. Semantic similarity – a measure of how close two vectors are in meaning, usually computed via cosine similarity or dot product. Items with similar meanings (even if phrased differently) end up close together in the embedding space.
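
The two similarity measures from the list above can be sketched in a few lines of Python with NumPy. The 4-dimensional vectors are invented for illustration; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import numpy as np

def dot(a, b):
    """Dot product: sum of component-wise products."""
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling both to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, not taken from any real model.
cat = np.array([0.9, 0.8, 0.1, 0.0])
kitten = np.array([0.8, 0.9, 0.2, 0.0])
bank = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, bank))    # close to 0: nearly unrelated

# Cosine ignores vector length, the dot product does not.
print(cosine_similarity(2 * cat, kitten), dot(2 * cat, kitten))
```

The design choice matters in practice: cosine similarity compares direction only, while the dot product also rewards longer vectors. When all vectors are normalized to length 1, the two measures give the same ranking.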

Now, let’s see how it all works in an embedding process →

From tokenization to embedding

After text is split into tokens and mapped to token IDs, the model still can’t do very much with those IDs directly. In practice, systems use an embedding matrix (a trainable table of vectors) to turn token IDs into continuous vectors the network can compute with.

Image Credit: What are Vector Embeddings? by Qdrant

The embedding matrix is built by assigning each token its own learnable vector. In practice, each token corresponds to one row in the matrix, so embedding is just a lookup: the token ID selects a row of this matrix, producing a dense vector. In this embedding matrix E ∈ ℝ^(|V|×d):

  • |V| is the vocabulary size: the number of unique tokens the model knows, or the number of rows.

  • d is the embedding dimension: the size of each vector, or the number of columns.

So, each row = one token, and each column = one feature of the embedding vector, and the same token always gives the same vector. Vectors start random but are updated during training so that they capture useful patterns. In each training step, the lookup passes gradients only to the rows of tokens that actually appear in the batch.
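
The lookup described above can be sketched in a few lines of Python with NumPy. The sizes, the random initialization scale, and the token IDs are arbitrary assumptions for illustration, and the “gradient” is a stand-in array of ones rather than the output of a real training step.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 10, 4                               # |V| rows, d columns (toy sizes)
E = rng.normal(scale=0.02, size=(vocab_size, d))    # random initial embedding matrix

token_ids = np.array([3, 7, 3])                     # a tiny "sentence" of token IDs
vectors = E[token_ids]                              # lookup: one row per token, shape (3, 4)

# The lookup is equivalent to multiplying one-hot vectors by E.
one_hot = np.eye(vocab_size)[token_ids]
same = np.allclose(one_hot @ E, vectors)

# Backward pass of the lookup: gradients are accumulated only into used rows.
upstream_grad = np.ones_like(vectors)               # stand-in for a real gradient
grad_E = np.zeros_like(E)
np.add.at(grad_E, token_ids, upstream_grad)         # token 3 appears twice, so it accumulates twice
touched = np.flatnonzero(np.abs(grad_E).sum(axis=1) > 0)

print(vectors.shape, same, touched)
```

Rows 3 and 7 are the only ones that receive a gradient here; every other row of E is left alone in this step.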

Interestingly, the tokenization step helps the embedding process as well, because the embedding table does not store vectors for whole linguistic words, but for the subword units produced during tokenization. This lets the model build representations for rare or unfamiliar words from smaller pieces.

Embeddings in Transformers

An embedding layer can vary depending on the system. In PyTorch’s documentation, an embedding layer is a simple lookup table that maps indices to vectors. In Transformers, these vectors serve as the inputs to the attention mechanism, and here things become more complex.

In Transformers, word order matters, which is why the initial vectors are scaled and then combined with positional encodings. The result is that the first-layer representation is not just “a word vector”, but a sum of content and position: a point in a vector space that already encodes both identity and location in the sequence.

Image Credit: Attention is All You Need
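
As a rough sketch of “content plus position”, the Python snippet below follows the sinusoidal scheme from “Attention Is All You Need”: embeddings are scaled by √d_model and summed with sine and cosine position vectors. The sizes, the random embedding matrix, and the token IDs are toy assumptions, and the sketch assumes an even d_model.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings: sine on even dims, cosine on odd dims."""
    positions = np.arange(seq_len)[:, None]           # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                 # toy sizes
E = rng.normal(size=(vocab_size, d_model))  # stand-in for a trained embedding matrix
token_ids = np.array([3, 7, 3])             # the same token appears at positions 0 and 2

pe = sinusoidal_positions(len(token_ids), d_model)
x = E[token_ids] * np.sqrt(d_model) + pe    # scaled content + position

print(x.shape)
print(np.allclose(x[0], x[2]))  # same token, different positions
```

Token 3 appears twice, but its two input vectors differ because the position term is different, which is exactly what lets the model tell the two occurrences apart.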

From there, the representation stops being static. The input vectors (token embeddings + position encodings) enter the Transformer’s attention layers, where self-attention repeatedly mixes information across tokens. Each hidden state (a context-aware vector representation of a token inside the model as it gets processed layer by layer) is transformed into queries, keys, and values, and attention computes interactions between all tokens. Although queries, keys, and values belong to the attention mechanism, they are derived from embeddings: they are just linear projections of the embeddings or hidden states. So embeddings are a key ingredient of attention (we’ll look at attention in the next episode).

The attention mechanism then determines how much token i attends to token j, and the resulting weighted sums produce new representations. This is the point where embeddings become contextual: the vector for a token is no longer fixed, but depends on the entire sequence.
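
Here is a minimal single-head self-attention sketch in Python with NumPy, meant only to show how a fixed input vector turns into a context-dependent one. The projection matrices are random stand-ins for learned weights, and there is no masking, no multi-head split, and no residual connection or normalization.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over hidden states H of shape (n, d)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v        # linear projections of the same input
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # entry (i, j): how well query i matches key j
    weights = softmax(scores, axis=-1)         # row i: how much token i attends to each token j
    return weights @ V, weights                # weighted sums of values

rng = np.random.default_rng(0)
d = 8
# Random stand-ins for learned projections, scaled to keep the scores moderate.
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# One fixed input vector placed next to two different neighbors.
token, neighbor_a, neighbor_b = rng.normal(size=(3, d))
out_a, weights_a = self_attention(np.stack([token, neighbor_a]), W_q, W_k, W_v)
out_b, weights_b = self_attention(np.stack([token, neighbor_b]), W_q, W_k, W_v)

print(weights_a.sum(axis=-1))             # each row of attention weights sums to 1
print(np.allclose(out_a[0], out_b[0]))    # same input token, different context
```

The first row of each output belongs to the same input vector, yet the two results differ because the neighbor changed. That difference is the “contextual” part of a contextual embedding.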

That is why embeddings feel a bit magical in practice. A simple example: the word “bank” will have different vectors depending on whether it appears near “river” or “finance”, because the surrounding tokens reshape its representation through attention. Embeddings also let a model treat “cat”, “kitten”, and “tiger” as nearby objects in a continuous space instead of unrelated labels.

This also leads to an important distinction that is often blurred. There is an initial token embedding (the lookup result), and there is a contextual embedding (the hidden state after several Transformer layers). They are related, but they are not the same object. In modern models the useful semantic representation is almost always the contextual one.

Another important detail is that the same embedding matrix is often reused at the output stage. The final hidden state is projected back into vocabulary space using a matrix that is shared with the input embeddings, so the model learns a single geometric space that serves both as input representation and as a basis for predicting tokens.
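
A rough sketch of this weight sharing in Python with NumPy: the same matrix E that would serve as the input lookup table is used to score every vocabulary entry against the final hidden state. The matrix is random, its rows are normalized only to make the toy example easy to read, and the “hidden state” is simply set to one token’s vector, which a real model would not do.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                            # toy sizes
E = rng.normal(size=(vocab_size, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)    # unit-length rows, for illustration only

def output_distribution(h_final, E):
    """Project a hidden state onto the vocabulary with the shared embedding matrix."""
    logits = E @ h_final                         # one score per vocabulary entry
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Pretend the model's final hidden state points exactly at token 3's embedding.
h_final = E[3]
probs = output_distribution(h_final, E)

print(probs.argmax(), probs.sum())
```

Because input and output live in the same space, a hidden state that lands near a token’s embedding gives that token the highest score.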

Outside of generation, embeddings form the basis of a broader ecosystem. Models can map entire sentences or documents into single vectors, where distance reflects semantic similarity. In these spaces, vectors are often normalized, and similarity is measured with cosine similarity or dot product. This allows embeddings to support search, clustering, recommendation, and retrieval systems. In setups like RAG, query embeddings are compared to document embeddings in a shared vector space, and nearest neighbors are retrieved as context for generation.
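
The retrieval step can be sketched as a nearest-neighbor search over normalized vectors. In the Python snippet below, the document texts and their 3-dimensional vectors are invented by hand; in a real system they would come from an embedding model and usually live in a vector database.

```python
import numpy as np

def normalize(M):
    """Scale vectors to unit length along the last axis."""
    return M / np.linalg.norm(M, axis=-1, keepdims=True)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices and cosine scores of the k documents closest to the query."""
    scores = normalize(doc_vecs) @ normalize(query_vec)   # cosine similarity per document
    order = np.argsort(-scores)[:k]                       # best first
    return order, scores[order]

# Hypothetical documents with hand-made embeddings.
docs = ["how to feed a kitten", "opening a savings account", "cat nutrition basics"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
])
query_vec = np.array([1.0, 0.2, 0.0])   # stand-in embedding of a query about cats

idx, scores = top_k(query_vec, doc_vecs, k=2)
for i, s in zip(idx, scores):
    print(f"{s:.3f}  {docs[i]}")
```

In a RAG setup, the texts behind the returned indices would be passed to the generator as context.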

Now you can clearly see that embeddings are the foundation of everything, where the key idea is geometric: meaning is encoded as position in a high-dimensional space, and relationships between texts are expressed as distances and directions within that space.

A little bit more about embedding evolution

There are several other important milestones in the evolution of embeddings that came after the “A Neural Probabilistic Language Model” paper (2003), each expanding their capabilities. Now that we’ve broken down the embedding pipeline, they should make more sense.

  • Word2Vec (2013) made embeddings practical by learning vectors directly from context using simple neural objectives: CBOW that takes surrounding words and predicts the center word, and Skip-gram which is the opposite to CBOW, learning a word vector from a center word and trying to predict the words around it. It showed that words appearing in similar contexts end up with similar vectors, and even captured relationships like “king − man + woman ≈ queen.” The big win was efficiency: it scaled well to massive datasets.

  • GloVe (2014) improved on this by combining local context-window methods, like Word2Vec, with global word-word co-occurrence statistics from the whole corpus. It trained a weighted log-bilinear regression model on the non-zero entries of a word co-occurrence matrix. This helped embeddings capture meaningful linear substructure and stronger global semantic relationships.

  • fastText (2017) extended the skip-gram Word2Vec model by representing each word as a bag of character n-grams (contiguous sequences of n items taken from a sequence). Each word is represented as the sum of its subword vectors, which helps with rare words, out-of-vocabulary words, and morphologically rich languages. It is especially useful when you don’t have perfect vocabulary coverage.

  • Then came Transformers (2017) with positional encoding.

  • ELMo (2018) introduced contextual embeddings by generating word vectors from a deep bidirectional language model (LSTM). As in Transformers, the representation depends on the entire sentence.

  • BERT (2019) used a Transformer encoder architecture that conditions on both left and right context in all layers. It is trained with two objectives: masked language modeling, which enables deep bidirectional, context-aware representations, and next sentence prediction.

  • CLIP (2021) extended embeddings beyond text. It learned a shared vector space for images and text. It is trained with contrastive learning: matching image–text pairs are pulled closer together, and mismatched pairs are pushed apart. This made things like text-to-image search and multimodal understanding real.
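
To make the fastText idea from the list above concrete, here is a small Python sketch of character n-grams with word-boundary markers. The n-gram table is a plain dictionary of random vectors created on demand, which is a simplification for illustration (the real library hashes n-grams into a fixed number of buckets and learns the vectors).

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

rng = np.random.default_rng(0)
ngram_table = {}  # hypothetical stand-in for a learned n-gram embedding table

def word_vector(word, dim=8):
    """Sum of the vectors of a word's n-grams plus the whole wrapped word."""
    units = sorted(set(char_ngrams(word)) | {f"<{word}>"})
    for unit in units:
        if unit not in ngram_table:
            ngram_table[unit] = rng.normal(size=dim)  # random stand-in, not trained
    return np.sum([ngram_table[unit] for unit in units], axis=0)

print(char_ngrams("where", 3, 3))
shared = set(char_ngrams("kitten")) & set(char_ngrams("kittens"))
print(sorted(shared))
print(word_vector("kittens").shape)
```

Even if “kittens” never appeared in training, it shares most of its n-grams with “kitten”, so its vector is built largely from pieces the model has already seen.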

Today embeddings are especially powerful in multimodality: models map text, images, audio, video, and even actions into a shared space. They also map many languages into the same space for cross-lingual search and reasoning without translation.

Embeddings have become the backbone of how models access external knowledge with RAG: systems embed queries, retrieve relevant data from vector databases, and then generate answers. This is what opened up the possibility to go beyond what models were trained on and interact with constantly changing information.

Even more, embeddings now define how systems connect to memory, tools, and data. They power search, memory, recommendations, routing, and even agent behavior, deciding what is relevant, what is similar, and what should happen next. For example, a system embeds the user request, embeds tool descriptions (search, code, API, etc.), and selects the closest match, turning similarity into action.

And one more thing stands out most →

What is RoPE and why is it so important?

In 2021, researchers from Zhuiyi Technology Co., Ltd. in Shenzhen proposed Rotary Position Embedding (RoPE), which is now used in many modern models and helps them handle long sequences.

Previously, we explained that in basic Transformers position info is added to embeddings. RoPE takes a different route: instead of adding a position vector, it rotates the vector in space by an angle that depends on its position in the sequence.

It works by splitting each embedding into pairs of numbers: [x1, x2], [x3, x4], and so on. Each pair is treated like a tiny 2D vector. Then, depending on the token’s position, RoPE rotates each of these 2D vectors by some angle.

The angle is the product of two things: the token position m and a frequency θ, which is different for each pair of dimensions. So position 0 is not rotated at all, position 1 is rotated a little, position 10 is rotated more, and so on.

Image Credit: RoFormer: Enhanced Transformer with Rotary Position Embedding

So RoPE turns position into rotation, and attention turns rotation into awareness of relative distance. Both queries and keys are rotated according to their positions. Because attention compares queries and keys using a dot product, the absolute rotations cancel out, and what remains depends on the relative offset between the two tokens. That’s how RoPE gives you relative position “for free,” without extra lookup tables.
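
The cancellation can be checked numerically. The Python sketch below rotates adjacent pairs of a vector by position × frequency, using the frequency schedule θ_i = 10000^(−2i/d) from the RoFormer paper; the query and key are random stand-ins, and the function name `rope_rotate` is our own.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate each adjacent pair (x[0], x[1]), (x[2], x[3]), ... by position * theta_i."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair, from fast to slow
    angles = position * theta
    x1, x2 = x[0::2], x[1::2]                   # first and second member of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))   # random stand-ins for one query and one key

def score(m, n):
    """Attention score between a query at position m and a key at position n."""
    return float(rope_rotate(q, m) @ rope_rotate(k, n))

print(score(5, 2), score(105, 102))   # same offset of 3, far apart in absolute position
print(score(5, 2), score(5, 4))       # different offsets
```

Both scores in the first line match, even though the absolute positions differ by 100, because only the offset m − n survives the dot product. Rotation also leaves the length of each vector unchanged, so no information about the content is lost.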

In practice, the model doesn’t use full rotation matrices. It uses a simple version: scale the vector with cosine values, and mix in a slightly shuffled version of it (where pairs are swapped and one sign is flipped) using sine values. This achieves the same rotation effect, but in a way that is fast and efficient on GPUs.
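
As a sanity check of that shortcut, this Python sketch compares the explicit block-diagonal rotation matrix with the element-wise cosine and sine form for adjacent pairs. Function names are our own and the input is random. Some libraries pair dimensions in a different layout, so treat this as one possible arrangement rather than the canonical one.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """One rotation frequency per pair of dimensions."""
    return base ** (-np.arange(0, d, 2) / d)

def rope_with_matrix(x, position):
    """Reference version: build the full block-diagonal rotation matrix."""
    d = x.shape[-1]
    R = np.zeros((d, d))
    for i, theta in enumerate(rope_frequencies(d)):
        a = position * theta
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[np.cos(a), -np.sin(a)],
                                               [np.sin(a),  np.cos(a)]]
    return R @ x

def swap_pairs_and_negate(x):
    """Turn (x1, x2, x3, x4, ...) into (-x2, x1, -x4, x3, ...)."""
    out = np.empty_like(x)
    out[0::2] = -x[1::2]
    out[1::2] = x[0::2]
    return out

def rope_elementwise(x, position):
    """Fast version: x * cos + shuffled(x) * sin, with no matrix multiplication."""
    angles = np.repeat(position * rope_frequencies(x.shape[-1]), 2)  # same angle for both pair members
    return x * np.cos(angles) + swap_pairs_and_negate(x) * np.sin(angles)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
print(np.allclose(rope_with_matrix(x, 7), rope_elementwise(x, 7)))
```

The element-wise form needs only two multiplications and one addition per component, which is why it maps well onto GPU kernels.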

One more advantage of RoPE is that different dimensions rotate at varying speeds. High-frequency dimensions rotate quickly and are good for local relationships; low-frequency dimensions rotate slowly and can preserve longer-range relationships. This allows nearby tokens to stay more aligned, while faraway tokens become less aligned.

Conclusion

Putting this all together, embedding has become the indispensable bridge between counting tokens and actually working with them, and a core interface between models and real-world data. The “magic” here is that it turns language into geometry: once meaning becomes geometry, computation gets traction. Geometry itself becomes an optimization target, and you can shape the geometry of the space so that similarity behaves better downstream.

Embeddings have made possible a lot of things in AI that we no longer think of as extraordinary: storing meaning in a form that vector databases can search, contextual meaning, transfer learning, retrieval, and multimodal alignment at scale.

But their real strength isn’t in being a standalone component:

  • Tokenization explains how language gets split into model-readable units.

  • Embedding explains how those units become learnable coordinates.

  • Attention explains how those coordinates interact.

  • Generation explains how the model turns those interactions back into language.

Tokenization alone gives a model nothing to compute with, and attention cannot work without embeddings, because they provide the representation layer, the context map, that everything else builds on. If we skip embeddings, meaning has nowhere to live. And meaning is all we need.

Sources and further reading

  • Distributed Representations by G. E. Hinton, J. L. McClelland, and D. E. Rumelhart | Paper

  • Indexing by Latent Semantic Analysis | Paper

  • A Neural Probabilistic Language Model by Yoshua Bengio et al. | Paper

  • Attention Is All You Need | Paper

  • PyTorch Embedding documentation

  • OpenAI Vector Embeddings guide

  • RoFormer: Enhanced Transformer with Rotary Position Embedding | Paper

  • Efficient Estimation of Word Representations in Vector Space | Paper

  • GloVe: Global Vectors for Word Representation | Paper

  • Enriching Word Vectors with Subword Information | Paper

  • Deep contextualized word representations | Paper

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Paper

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Paper
