“Why is a raven like a writing desk?” – embeddings won’t answer Lewis Carroll’s famous riddle, but they can show how even a strange question becomes a matter of distance, neighborhoods, and latent structure.
In a previous series we talked about tokens – from many perspectives, including technology and economics. But once we have tokens, the next question arises: how do we process them?
Tokenization breaks language into pieces a model can count. But a token ID, on its own, is still useless to a neural network. The number 14,382 does not mean “cat”, “bank”, or “##ing”; it is just an index. Words get meaning only from the contexts they appear in, and before a model can produce a coherent response, it has to turn that token index into a meaningful form. And this is where embeddings – dense vectors of numbers – step to the front.
Embeddings are the step where pieces of language move from integers into geometry, where distance represents meaning. This shift from discrete symbols to continuous space enables similarity search, contextual understanding, transfer learning, retrieval, and multimodal alignment at scale.
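To make that step concrete, here is a minimal sketch in Python of what “turning a token index into a vector” amounts to: a row lookup in a matrix. The sizes and values are made up – in a real model the matrix is learned during training, not random.

```python
import numpy as np

# Vocabulary of 50,000 tokens, each mapped to a 768-dimensional vector.
# In a real model this matrix is learned; here it is random, for illustration only.
vocab_size, embed_dim = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

token_id = 14_382                        # the bare index produced by the tokenizer
vector = embedding_matrix[token_id]      # the dense vector the model actually works with

print(vector.shape)   # (768,)
print(vector[:5])     # a few of its 768 coordinates
```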
It’s science, but it still feels like magic.
Today we’ll dive into how tokens gain meaning through embeddings, how this works in general, and why RoPE (Rotary Position Embedding) – now used in so many systems – is so important. But let’s start with the core idea.
In today’s episode:
The history: How did we arrive at embeddings?
Core embedding concepts
From tokenization to embedding
Embeddings in Transformers
A little bit more about embedding evolution
What is RoPE and why is it so important?
Conclusion
Sources and further reading
The history: How did we arrive at embeddings?
The idea of embeddings comes from concepts and approaches that existed long before we actually started calling them “embeddings.” An important starting point was the work by G. E. Hinton et al. titled “Distributed Representations,” which raised an important question: how should a system represent knowledge at all?
Representation is how a system encodes an object internally. In earlier approaches to AI and cognitive modeling, the main representational scheme was local: one concept – one unit (one feature, or dimension). In such a scheme there is no notion of similarity unless you explicitly build it, no sharing between concepts, and scaling becomes awkward and unrealistic, because every new concept needs its own slot.
Distributed representations brought a richer view of concepts: a concept is represented as a pattern – a specific configuration of values across all units – where each unit can participate in many concepts at once. Once a concept is a pattern rather than a location, you can no longer treat representations as discrete objects. They start behaving like points in a space (and interestingly, this is exactly what today’s embeddings are about). Two concepts that share structure will activate overlapping sets of units, and that overlap becomes the definition of similarity. It works more like learning in real life: learning a fact about one animal shifts expectations about similar animals, and systems built on distributed representations make this kind of transfer automatically.
If you translate that into modern language, you get:
units → dimensions
activations → values
patterns → vectors
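A tiny sketch (with invented numbers) makes the contrast concrete: one-hot “local” vectors share nothing, so their overlap is always zero, while distributed patterns overlap, and that overlap is the similarity.

```python
import numpy as np

# Local (one-hot) representation: one concept – one unit.
cat_local = np.array([1, 0, 0, 0])
dog_local = np.array([0, 1, 0, 0])
print(cat_local @ dog_local)   # 0 – no overlap, so no built-in notion of similarity

# Distributed representation: each concept is a pattern over all units,
# and each unit participates in many concepts (the values are invented).
cat_dist = np.array([0.9, 0.8, 0.1, 0.2])
dog_dist = np.array([0.8, 0.9, 0.2, 0.1])
car_dist = np.array([0.0, 0.1, 0.9, 0.8])
print(cat_dist @ dog_dist)     # 1.48 – large overlap: similar concepts
print(cat_dist @ car_dist)     # 0.33 – small overlap: dissimilar concepts
```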
Another big early step was the 1990 paper “Indexing by Latent Semantic Analysis” by Scott Deerwester and colleagues. The idea was to move beyond searching only for the exact words in a query: apply Singular Value Decomposition (SVD) to a term-document matrix to build a smaller “semantic” space. SVD is a linear algebra method that factors a matrix X = UΣV⊤ into simpler pieces that reveal its underlying structure: how terms relate to hidden concepts (U), how important each concept is (Σ), and how documents relate to those same concepts (V⊤). In the semantic space, related words, documents, and queries end up near each other, so a system can find relevant documents even when they use different words. It helped with the synonym problem, and only partly with words that have multiple meanings.
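Here is a minimal sketch of that idea on a toy term-document matrix. The counts are invented, and real LSA pipelines typically weight the matrix (for example with TF-IDF) before the SVD.

```python
import numpy as np

# Toy term-document matrix X: rows = terms, columns = documents,
# values = how often each term occurs in each document (made-up counts).
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],   # cat
    [1, 2, 0, 0],   # dog
    [1, 1, 0, 0],   # pet
    [0, 0, 2, 1],   # stock
    [0, 0, 1, 2],   # market
], dtype=float)

# SVD: X = U Σ V⊤
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest "concepts" to get a low-rank semantic space.
k = 2
term_vectors = U[:, :k] * S[:k]   # each term as a point in concept space

for term, vec in zip(terms, term_vectors):
    print(term, np.round(vec, 2))
# "cat", "dog", "pet" land near each other; "stock" and "market" form another cluster.
```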
In 2003, with the paper “A Neural Probabilistic Language Model” by Yoshua Bengio et al., the representation became a trainable part of the model itself. The paper proposed mapping each word to a vector (this is essentially the modern idea of an embedding) and using a neural network to predict the next word from these vectors. Bengio’s language model framed this as a way to beat the curse of dimensionality: instead of treating every word as an isolated atom, learn a distributed representation for each word and a probability function over sequences expressed through those representations. That basic idea never went away. Modern models are much larger and more sophisticated, but they still begin with the same move: map symbols into vectors, then learn over the vectors.
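Here is a minimal sketch of that move in PyTorch, in the spirit of Bengio’s model but with toy sizes, random data, and without the direct word-to-output connections of the original architecture: an embedding table turns token IDs into vectors, and a small network scores the next word from them.

```python
import torch
import torch.nn as nn

# Toy sizes, purely for illustration.
vocab_size, embed_dim, context, hidden_dim = 1000, 32, 3, 64

class TinyNeuralLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # the learned lookup table
        self.hidden = nn.Linear(context * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # a score for every possible next word

    def forward(self, token_ids):               # token_ids: (batch, context)
        vectors = self.embed(token_ids)         # (batch, context, embed_dim)
        flat = vectors.flatten(start_dim=1)     # concatenate the context word vectors
        return self.out(torch.tanh(self.hidden(flat)))

model = TinyNeuralLM()
fake_contexts = torch.randint(0, vocab_size, (4, context))   # 4 contexts of 3 token IDs each
logits = model(fake_contexts)
print(logits.shape)   # torch.Size([4, 1000]) – one score per candidate next word
```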
The result is simple, but really important:
Similar words → similar vectors
Similar sentences → similar structure
If a model sees a sentence like “The cat is walking in the bedroom,” it can generalize to: “A dog is running in a room,” because cat ≈ dog and bedroom ≈ room. These words belong to the same space.
This is the “magic” of embeddings – generalization: the relations between words and concepts become visible to models, forming a sense of context.
So what do we have? Distributed representations introduced the idea that concepts are overlapping patterns over shared components, and embeddings turned that into a concrete story: concepts are vectors in a continuous space where geometry encodes meaning.
Let’s sort out all the terms and workflows.
Core embedding concepts
Here are definitions you need to understand to go further with embedding exploration:
Vector – an ordered list of numbers (for example, [0.2,−1.3,0.7,...]) that represents an object in a mathematical space. In embeddings, a vector encodes features of meaning, where each dimension captures some latent property learned from data.
Embedding – a learned dense vector representation of an object (word, sentence, image, etc.) such that similar items are mapped to nearby points in space. Embeddings are designed to capture semantic or functional relationships.
Dimension – one coordinate (feature) of a vector; the total number of dimensions defines the space (e.g., it can be 768-dimensional). Each dimension typically contributes to encoding patterns in the data.
Dense vectors – vectors where most values are non-zero, compactly encoding information across all dimensions. They are the typical form for embeddings because they generalize better, distributing meaning across dimensions.
Sparse vectors – vectors with mostly zero values and only a few active features (for example, one-hot or bag-of-words representations).
Vector space – a geometric space where objects are represented as vectors and you can do operations like addition, scaling, dot products (a weighted sum of component-wise products indicating similarity), etc.
Embedding space – a specific vector space learned by a model, where all embeddings live, and where operations like distance, angles, and directions encode meaning. Relationships between vectors (such as closeness or direction) correspond to relationships between the underlying concepts.
Every embedding space is a vector space, but not every vector space is an embedding space.
Latent space – a transformed feature space learned by a model where underlying structure, like semantic similarity, becomes more explicit. It captures hidden factors that explain the data, even if those factors are not directly observable or labeled. (It’s also the name of the podcast Latent Space that we’re fans of.)
Semantic similarity – a measure of how close two vectors are in meaning, usually computed via cosine similarity or dot product. Items with similar meanings (even if phrased differently) end up close together in the embedding space.
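To tie these definitions together, here is a small sketch with invented 4-dimensional vectors; real embeddings have hundreds or thousands of dimensions and come from a trained model, but the geometry works the same way.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 = similar direction, i.e. similar meaning."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented 4-dimensional "embeddings" – the values are made up for illustration.
cat = np.array([0.9, 0.8, 0.1, 0.2])
dog = np.array([0.8, 0.9, 0.2, 0.1])
bank = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(cat, dog))    # ≈ 0.99 – nearby in the embedding space
print(cosine_similarity(cat, bank))   # ≈ 0.23 – far apart
```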
Now, let’s see how it all works in an embedding process →
From tokenization to embedding