One-sentence summary: Matrix multiplication has a geometric meaning — it projects vectors — and once you see that, the dot product at the heart of Attention is no longer mysterious.


8.1 Why Learn This Before Attention?

Matrix multiplication is the foundation of the Transformer: it appears in embeddings, Attention, FFN, and the output head

Attention is full of matrix multiplication. If you only think about it as rows times columns, the QKV mechanics will feel like symbol-pushing. If you see it as geometry, the architecture clicks.

Transformer components that rely on matrix multiplication

Here is a map of where matrix multiplication appears in the Transformer:

  1. Embedding lookup (token ID → d_model-dimensional vector)
  2. Q, K, V projection matrices in Attention
  3. FFN expand and contract layers
  4. Final vocabulary projection (LM Head)

Matrix multiplication is everywhere. Understanding it geometrically is the single highest-leverage thing you can do before Chapter 9.


8.2 Scalars, Vectors, and Matrices

Scalar is one number; vector is an ordered list; matrix is a 2D table

Before going further, let's fix the vocabulary.

8.2.1 Scalar

A scalar is a single number.

5

Temperature, learning rate, attention score at one position — these are all scalars.

8.2.2 Vector

A vector is an ordered list of numbers.

[3, 2, 9, 84]

Vectors can represent almost anything: a 3D position [x, y, z], an RGB color [255, 128, 0], or a token's semantic representation in a 4096-dimensional space. The key property is that the order matters.

8.2.3 Matrix

A matrix is a 2D table of numbers.

3 × 4 matrix (3 rows, 4 columns):
┌                 ┐
│ a₁₁ a₁₂ a₁₃ a₁₄ │
│ a₂₁ a₂₂ a₂₃ a₂₄ │
│ a₃₁ a₃₂ a₃₃ a₃₄ │
└                 ┘

You can think of a matrix as a stack of row vectors, or equivalently as a collection of column vectors.

8.2.4 In the Transformer

Object   Example
------   -------
Scalar   learning rate, temperature, one attention score
Vector   one token's embedding (shape: [d_model])
Matrix   all token embeddings at once (shape: [seq_len, d_model]) or a weight matrix [d_model, d_model]

The token representation flowing through the Transformer is fundamentally a matrix of shape [seq_len, d_model] — one row per token, one column per feature dimension.
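In code, the three objects look like this (the particular numbers and the two-token sequence are arbitrary illustrations):

```python
import numpy as np

scalar = 0.001                             # e.g. a learning rate
vector = np.array([3.0, 2.0, 9.0, 84.0])   # one token's embedding, shape [d_model]
matrix = np.stack([vector, vector * 0.5])  # two tokens at once, [seq_len, d_model]
print(vector.shape, matrix.shape)          # (4,) (2, 4)
```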


8.3 Matrix Multiplication: The Computation

8.3.1 Dimension Rule

The dimension rule for matrix multiplication:

[A, B] × [B, C] = [A, C]

The inner dimensions must match (both B). The output shape is the two outer dimensions.

8.3.2 Worked Example

Dot product calculation: first row times first column gives the top-left element of the result

Let's compute a [4, 3] × [3, 4] multiplication. The result is [4, 4].

For the first element of the result (row 0, column 0):

row 0 of left matrix: [0.2, 0.4, 0.5]
col 0 of right matrix: [2, 1, 7]

dot product: 0.2×2 + 0.4×1 + 0.5×7
           = 0.4 + 0.4 + 3.5
           = 4.3

The fundamental operation is the dot product: multiply corresponding elements and sum.

In Python/NumPy/PyTorch:

C = A @ B  # @ is the matrix multiplication operator
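The worked example can be checked numerically. Only row 0 of the left matrix and column 0 of the right matrix are given in the text; zeros stand in for the unspecified entries when demonstrating the shape rule:

```python
import numpy as np

# Row 0 of the left matrix and column 0 of the right matrix, as given in the text.
row0 = np.array([0.2, 0.4, 0.5])
col0 = np.array([2.0, 1.0, 7.0])

# The top-left entry of the [4, 4] result is their dot product.
print(row0 @ col0)  # ≈ 4.3

# The dimension rule in action: [4, 3] @ [3, 4] -> [4, 4].
A = np.zeros((4, 3))
B = np.zeros((3, 4))
print((A @ B).shape)  # (4, 4)
```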

8.3.3 Why "Dot Product"?

The name comes from the mathematical notation A · B. For two vectors of the same length:

A · B = a₁b₁ + a₂b₂ + a₃b₃ + ... + aₙbₙ

A matrix multiply is just many dot products organized into a grid.


8.4 Two Ways to Think About the Same Operation

Two views of matrix multiplication: raw dot products vs linear transformation of a vector

The same operation has two useful frames.

8.4.1 Frame One: Dot Product (Matrix × Matrix)

[4, 3] × [3, 4] = [4, 4]

Two matrices multiply. Each element of the output is a dot product between a row of the left matrix and a column of the right matrix.

This frame is useful when both operands contain data — for example, computing all pairwise similarities between token vectors.

8.4.2 Frame Two: Linear Transformation (Matrix × Vector)

[4, 3] × [3, 1] = [4, 1]

A weight matrix transforms a single vector: input dimension changes from 3 to 4.

This frame is useful when one operand is data and the other is a learned weight matrix. The weight matrix defines a learned transformation of the vector space.
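A minimal sketch of this second frame, with a random matrix standing in for a learned weight (the sizes 4 and 3 match the shapes above):

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((4, 3))  # stand-in for a learned weight matrix, [4, 3]
x = rng.standard_normal(3)       # a single input vector, [3]

y = W @ x                        # transformed vector: dimension 3 -> 4
print(x.shape, "->", y.shape)    # (3,) -> (4,)
```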

8.4.3 Linear Transformation Intuition

"Linear transformation" sounds technical. The geometric idea is simple:

A weight matrix moves a vector from one space to another — possibly changing its dimension, rotating it, stretching it, or projecting it down.

In the Transformer:

  • The embedding table maps token IDs (integers) into d_model-dimensional space.
  • The Q, K, V weight matrices move d_model-dimensional vectors into a different d_model (or d_k) space, each emphasizing different aspects.
  • The FFN expand layer moves vectors from d_model into 4 × d_model space.

Linear transformations are everywhere because the model works by comparing different learned "views" of the same data, and each view is produced by a weight matrix.


8.5 Geometric Meaning: Vector Space

Now for the part that makes Attention click.

8.5.1 Word Vectors in 3D

Token vectors as arrows in 3D space: similar tokens point in similar directions

Suppose we have a simplified vocabulary with three tokens, each represented as a 3D vector:

cat  = [7, 7, 6]
fish = [6, 4, 5]
love = [-4, -2, 1]

Plot these as arrows from the origin in 3D space:

  • cat and fish point in roughly the same direction — they are both concrete nouns.
  • love points in a very different direction — it is an abstract verb.

This direction similarity is meaningful. The model learns to place semantically related tokens in similar directions.

8.5.2 Matrix Multiplication Computes Similarity

Look at what happens when we multiply the full token matrix by a single vector:

token matrix [n, d] @ query vector [d, 1] = similarity scores [n, 1]

Each element of the output is the dot product between one token's vector and the query. The dot product is large when the two vectors point in similar directions, and small (or negative) when they point in opposite directions.

This is why matrix multiplication shows up inside Attention. One operation computes all pairwise similarities between tokens.
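Using the toy vectors from section 8.5.1, one matrix multiply scores every token against a query at once (here the query is simply cat's own vector, an illustrative choice):

```python
import numpy as np

# The toy vectors from section 8.5.1, stacked into a token matrix.
tokens = np.array([
    [ 7.0,  7.0, 6.0],   # cat
    [ 6.0,  4.0, 5.0],   # fish
    [-4.0, -2.0, 1.0],   # love
])

# Use cat's own vector as the query.
query = np.array([7.0, 7.0, 6.0])

# One matrix multiply computes all similarity scores.
scores = tokens @ query
print(scores)  # [134. 100. -36.]
```

As expected, cat scores highest against itself, fish (similar direction) scores high, and love (different direction) scores negative.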

8.5.3 The d_model Dimension

d_model annotated on the token matrix: each column is one feature dimension

d_model is the number of dimensions in each token's representation:

Model         d_model
-----         -------
GPT-2 Small   768
GPT-2 Large   1,280
GPT-3         12,288
LLaMA-7B      4,096

More dimensions means a richer representation — more "directions" available to encode distinctions. It also means larger weight matrices and more computation.


8.6 Dot Product as Cosine Similarity

8.6.1 The Angle Between Vectors

Cosine similarity: the dot product relates to the angle between two vectors

The dot product relates to the angle between vectors through a formula:

cos(θ) = (A · B) / (|A| × |B|)

Rearranging:

A · B = |A| × |B| × cos(θ)

Where:

  • |A| is the length (magnitude) of vector A.
  • |B| is the length of vector B.
  • θ is the angle between them.

8.6.2 Geometric Intuitions

Situation             cos(θ)   dot product      Interpretation
---------             ------   -----------      --------------
Same direction        ≈ 1      large positive   very similar
90° apart             0        ≈ 0              unrelated
Opposite directions   -1       negative         opposing

This gives us a clean geometric reading of the dot product: it measures how much two vectors agree in direction.

8.6.3 A Concrete Example

A = "this" = [3, 5]
B = "a"    = [1, 4]

Compute:

A · B = 3×1 + 5×4 = 3 + 20 = 23
|A| = √(9 + 25) = √34 ≈ 5.83
|B| = √(1 + 16) = √17 ≈ 4.12

cos(θ) = 23 / (5.83 × 4.12) ≈ 23 / 24.0 ≈ 0.96

These two vectors have a cosine similarity of 0.96 — nearly parallel, highly similar.
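The same computation in NumPy:

```python
import numpy as np

A = np.array([3.0, 5.0])  # "this"
B = np.array([1.0, 4.0])  # "a"

dot = A @ B
cos = dot / (np.linalg.norm(A) * np.linalg.norm(B))
print(dot)            # 23.0
print(round(cos, 2))  # 0.96
```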

8.6.4 This Is the Core of Attention

In Attention:

  • A Query vector asks: "What am I looking for?"
  • A Key vector says: "Here is what I contain."
  • Their dot product measures whether the Query's question matches the Key's advertisement.

High dot product → high similarity → high attention weight after Softmax.

Attention is dot-product similarity applied to learned vector representations. Everything else is engineering around this idea.


8.7 Projection: A Second Geometric View

8.7.1 What Projection Means

Projection: how much of vector B lies in the direction of vector A

The dot product has a second geometric interpretation: projection.

A · B = |A| × (length of B's shadow projected onto A's direction)

Or equivalently:

A · B = |B| × (length of A's shadow projected onto B's direction)

Projection asks: how much of one vector's "content" lies in the direction of another?

8.7.2 The Projection Picture

In a 2D sketch:

  • Draw vector A (red arrow).
  • Draw vector B (blue arrow).
  • Drop a perpendicular from the tip of B onto the line defined by A.
  • The length from the origin to that foot is the projection of B onto A.

The dot product equals |A| times that projection length.
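A small numeric check of this identity (the two vectors are arbitrary, chosen so the lengths come out clean):

```python
import numpy as np

A = np.array([3.0, 0.0])
B = np.array([2.0, 2.0])

# Length of B's shadow along A's direction.
proj_len = (A @ B) / np.linalg.norm(A)
print(proj_len)                      # 2.0

# |A| × (projection of B onto A) recovers the dot product.
print(np.linalg.norm(A) * proj_len)  # 6.0
print(A @ B)                         # 6.0
```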

8.7.3 Why This Matters for Language

In high-dimensional token space:

  • "king" and "monarchy" have a large projection onto each other — they strongly share a "royalty" component.
  • "king" and "algorithm" have a small projection — they share little in common.

The model doesn't have explicit dimensions labeled "royalty" or "abstractness." It learns directions in space that capture these distinctions from training data. Matrix multiplication — dot products — is how it measures alignment with those learned directions.


8.8 Connecting Back to Attention

8.8.1 The Attention Formula Preview

The core of Attention (details in Chapter 9):

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The term QK^T is a matrix multiply — it computes dot products between every Query vector and every Key vector simultaneously. The result is a matrix of similarity scores.

8.8.2 Reading Q, K, V Geometrically

Q = X WQ
K = X WK
V = X WV

Read this as: take the input X and view it through three learned geometric lenses. Each projection matrix WQ, WK, WV rotates and stretches the same data into a different coordinate system:

  • WQ projects into a "what am I looking for" space.
  • WK projects into a "what do I advertise" space.
  • WV projects into a "what information do I contribute" space.

The dot product between Q and K vectors then measures alignment between these two projected spaces. High alignment → high attention weight → the model blends more of that token's Value into the output.
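A shape-level sketch of this pipeline, with random matrices standing in for the learned weights (seq_len = 5, d_model = 8, d_k = 4 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4

X  = rng.standard_normal((seq_len, d_model))  # token representations
WQ = rng.standard_normal((d_model, d_k))      # stand-ins for learned weights
WK = rng.standard_normal((d_model, d_k))
WV = rng.standard_normal((d_model, d_k))

# View the same data through three learned lenses.
Q, K, V = X @ WQ, X @ WK, X @ WV

# Every Query dotted with every Key, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)               # [seq_len, seq_len]

# Row-wise softmax (subtracting the row max for numerical stability).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                             # blend Values by attention weight
print(scores.shape, out.shape)                # (5, 5) (5, 4)
```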

8.8.3 Summary of the Geometric Reading

Math                 Geometric meaning            Role in Attention
----                 -----------------            -----------------
A · B                similarity / projection      measures Query-Key match
matrix multiply AB   batch dot products           computes all pairwise scores at once
Softmax              normalize to probabilities   converts scores into weights

8.9 Chapter Summary

8.9.1 Key Concepts

Concept                 Meaning
-------                 -------
Scalar                  a single number
Vector                  an ordered list of numbers; represents a point or direction
Matrix                  a 2D table; represents a transformation or a batch of vectors
Dot product             element-wise multiply and sum; measures vector alignment
Linear transformation   using a weight matrix to rotate/stretch/project a vector
Cosine similarity       dot product normalized by vector lengths; pure angle measure
Projection              how much of one vector lies in the direction of another

8.9.2 Key Formulas

Dot product:

A · B = a₁b₁ + a₂b₂ + ... + aₙbₙ

Cosine similarity:

cos(θ) = (A · B) / (|A| × |B|)

Projection of B onto A:

A · B = |A| × (B projected onto A)

Matrix multiply dimension rule:

[A, B] × [B, C] = [A, C]

8.9.3 Core Takeaway

Matrix multiplication is not abstract symbol-pushing. Geometrically, it measures how much two vectors point in the same direction. Attention uses this directly: the dot product between a Query and a Key vector says whether they "match." High match → high attention weight. That is the whole idea.


Chapter Checklist

After this chapter, you should be able to:

  • State the dimension rule for matrix multiplication and compute a small example by hand.
  • Explain the dot product as a measurement of vector alignment.
  • Explain projection in plain English: how much of one vector lies in another's direction.
  • Explain why matrix multiply is the right tool for computing all pairwise similarities between token vectors.
  • Connect the dot product to the Query-Key matching inside Attention.

See You in the Next Chapter

That is the geometry. If you can draw two arrows, say "their dot product is large," and explain why that leads to a high attention weight, you are ready for Chapter 9.

Chapter 9 closes the loop: we put the geometric intuition together with the actual Attention formula, look at what attention heatmaps reveal, and answer the question of why dot product specifically — rather than some other similarity measure — became the standard choice.

Cite this page
Zhang, Wayland (2026). Chapter 8: Linear Transforms - The Geometry of Matrix Multiplication. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-08-linear-transforms
@incollection{zhang2026transformer_chapter_08_linear_transforms,
  author = {Zhang, Wayland},
  title = {Chapter 8: Linear Transforms - The Geometry of Matrix Multiplication},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-08-linear-transforms}
}