One-sentence summary: computers do not process text directly. Tokenization converts text into token IDs, which are then mapped into vectors.
4.1 Why Tokenization Exists
In the previous chapter, the first step in the Transformer map was:
text -> token IDs
This chapter explains that step.
4.1.1 Computers Need Numbers
A computer does not see a sentence the way we do. It does not know that:
The agent opened a pull request.
is made of meaningful words. It needs numeric units.
Tokenization is the process that turns text into a sequence of numbers. Each numeric unit is called a token ID.
4.1.2 Where It Sits in the Architecture
Tokenization is the entry point:
raw text -> token IDs -> embeddings -> position -> Transformer blocks
Without tokenization, the rest of the model has nothing to process.
4.2 Two Ways to Tokenize
The simplest idea is to assign a number to every character. Real LLMs usually do something smarter.
4.2.1 Method One: Character IDs
For an English sentence, a naive character-level tokenizer might assign:
T -> 1
h -> 2
e -> 3
space -> 4
a -> 5
g -> 6
n -> 7
t -> 8
...
This is easy to understand. Every character becomes a number.
But it has problems:
- Too many tokens: one word becomes many characters.
- Weak semantic units: pull request is split into letters even though it is one meaningful phrase.
- Inefficient context use: long text consumes context length quickly.
Character tokenization is not wrong, but it is rarely the best choice for modern LLMs.
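The character-ID idea is small enough to sketch directly. This toy version (the function names are illustrative, not from any real library) assigns IDs in order of first appearance, which reproduces the T -> 1, h -> 2, e -> 3 mapping shown above:

```python
def build_char_vocab(text):
    """Assign an ID to every distinct character, in order of first appearance."""
    vocab = {}
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab) + 1  # IDs start at 1, as in the example above
    return vocab

def encode_chars(text, vocab):
    """Map each character to its ID."""
    return [vocab[ch] for ch in text]

sentence = "The agent opened a pull request."
vocab = build_char_vocab(sentence)
ids = encode_chars(sentence, vocab)
print(len(sentence), len(ids))  # one token per character: both are 32
```

One 32-character sentence already costs 32 tokens, which is exactly the "too many tokens" problem.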
4.2.2 Method Two: BPE and Word Pieces
Most GPT-style tokenizers use a subword strategy such as BPE (Byte Pair Encoding).
The idea is:
- common chunks become single tokens
- rare words can still be split into smaller pieces
- the vocabulary stays finite
- the model can handle unseen text
Using OpenAI's cl100k_base tokenizer, this text:
The agent opened a pull request.
becomes:
[791, 8479, 9107, 264, 6958, 1715, 13]
The token pieces are:
791 -> "The"
8479 -> " agent"
9107 -> " opened"
264 -> " a"
6958 -> " pull"
1715 -> " request"
13 -> "."
Notice that spaces often become part of the token. That is normal.
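The merging idea behind BPE can be sketched in a few lines. This toy version is not the real cl100k_base algorithm, but it shows the core loop: count adjacent symbol pairs, merge the most frequent pair into a single symbol, repeat:

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("low lower lowest")
for _ in range(3):  # three merge steps
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)
```

After a few merges, the common chunk "low" has become a single symbol while the rare suffixes stay split, which is the BPE trade-off in miniature. A real tokenizer learns its merge order from a large corpus and then freezes it.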
4.2.3 Context Length
Context length is the number of tokens the model can process at once.
If a model supports 128,000 tokens, that does not mean 128,000 English words. It means 128,000 tokenizer units.
Different languages and writing systems have different token efficiency. English words often tokenize into familiar chunks. Some non-English text may require more tokens for the same amount of human-readable content.
That is why LLM APIs charge by token instead of by word.
4.3 From Token to Embedding
Token IDs are still not enough. The model must convert each ID into a vector.
This step is called embedding.
4.3.1 Embedding Lookup Table
The model contains a large table:
[vocab_size, d_model]
Where:
- vocab_size is the number of token IDs the tokenizer knows.
- d_model is the vector width used by the model.
For example, if:
vocab_size = 100256
d_model = 64
then the embedding table contains:
100256 x 64 = 6,416,384 numbers
Those numbers are trainable parameters.
4.3.2 Lookup Process
Take the sentence:
The agent opened a pull request.
Tokenization gives:
[791, 8479, 9107, 264, 6958, 1715, 13]
Then the model performs table lookup:
token 791 -> row 791 -> vector
token 8479 -> row 8479 -> vector
token 9107 -> row 9107 -> vector
...
The result is a matrix:
[context_length, d_model]
If the sentence has 7 tokens and d_model = 64, the matrix shape is:
[7, 64]
This matrix is the numeric representation sent into the Transformer blocks.
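The lookup is literally row indexing. Here is a minimal sketch in plain Python with a random table; the table is far smaller than a real model's (just big enough to hold these token IDs), and only the mechanics are the point:

```python
import random

# A tiny stand-in for the embedding table: [vocab_size, d_model] numbers.
# Real tables are learned; this one is random and shrunk for the sketch.
vocab_size, d_model = 10_000, 64
random.seed(0)
table = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
         for _ in range(vocab_size)]

token_ids = [791, 8479, 9107, 264, 6958, 1715, 13]
matrix = [table[t] for t in token_ids]  # one row lookup per token

print(len(matrix), len(matrix[0]))  # 7 64  -> shape [7, 64]
```

In a real framework this is a single indexing operation into a learned weight matrix, but the behavior is the same: token ID in, row of the table out.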
4.3.3 Why Use Vectors?
Why not use token IDs directly?
Because IDs have no geometry. Token ID 791 is not "closer" to token ID 792 in a meaningful semantic way.
Vectors solve that. They can encode relationships:
- agent, tool, and workflow can occupy a nearby region
- pull request and code review can be related
- pull request and pull tab can separate based on context
Embedding vectors make language available to matrix math without pretending that token IDs themselves have meaning.
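The "geometry" claim can be made concrete with cosine similarity. These 3-dimensional vectors are chosen by hand for illustration, not taken from any real model:

```python
import math

# Hand-picked toy vectors: nearby directions stand for related tokens.
vectors = {
    "agent":   [0.9, 0.1, 0.0],
    "tool":    [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["agent"], vectors["tool"]))     # close to 1: related
print(cosine(vectors["agent"], vectors["weather"]))  # close to 0: unrelated
```

Token IDs cannot support this kind of comparison; 791 and 792 are just labels. Vectors can, which is the whole reason the embedding table exists.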
4.4 Try It With tiktoken
You can inspect tokenization with OpenAI's tokenizer library:
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "The agent opened a pull request."
tokens = enc.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {enc.decode(tokens)}")
for token_id in tokens:
    print(f"{token_id} -> {enc.decode([token_id])!r}")
Expected shape of the result:
Token IDs: [791, 8479, 9107, 264, 6958, 1715, 13]
Token count: 7
Decoded: The agent opened a pull request.
791 -> 'The'
8479 -> ' agent'
...
This small experiment is worth doing. Tokenization becomes much less abstract once you see the pieces.
4.5 Parameter Count in the Embedding Layer
The embedding layer can hold a meaningful number of parameters.
4.5.1 Formula
embedding parameters = vocab_size x d_model
4.5.2 Examples
| Model | vocab | width | params |
|---|---|---|---|
| GPT-2 Small | 50,257 | 768 | about 38.6M |
| GPT-2 Large | 50,257 | 1,280 | about 64.3M |
| GPT-3 | 50,257 | 12,288 | about 618M |
| LLaMA-2-7B | 32,000 | 4,096 | about 131M |
Embedding is not a tiny pre-processing detail. It is a learned parameter table that matters.
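The table above is just the formula applied to each model's vocabulary and width, so it is easy to re-check:

```python
# Verifying the table with: embedding parameters = vocab_size * d_model
models = {
    "GPT-2 Small": (50_257, 768),
    "GPT-2 Large": (50_257, 1_280),
    "GPT-3":       (50_257, 12_288),
    "LLaMA-2-7B":  (32_000, 4_096),
}
for name, (vocab, width) in models.items():
    print(f"{name}: {vocab * width / 1e6:.1f}M embedding parameters")
```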
4.6 Chapter Summary
4.6.1 Key Concepts
| Concept | Meaning |
|---|---|
| Tokenization | converts text into tokenizer units |
| Token | a model-readable text fragment |
| Token ID | the numeric ID for a token |
| Vocab size | the number of known token IDs |
| Embedding | maps token IDs to vectors |
| d_model | the width of the model's internal vectors |
| Context length | the maximum token count processed at once |
4.6.2 Flow
"The agent opened a pull request."
|
| Tokenization
v
[791, 8479, 9107, 264, ...]
|
| Embedding lookup
v
[context_length, d_model] matrix
4.6.3 Core Takeaway
Tokenization plus embedding is how text enters the Transformer. Tokenization cuts text into model-readable units; embedding turns those units into vectors that can participate in matrix computation.
Chapter Checklist
After this chapter, you should be able to:
- Explain why tokenization is needed.
- Describe the difference between character tokenization and BPE-style tokenization.
- Explain what vocab_size, d_model, and context_length mean.
- Explain why token IDs are converted into vectors.
- Calculate the parameter count of an embedding table.
See You in the Next Chapter
That is it for Tokenization. The next time an API charges you by token, you should know exactly what it is counting.
Now text has become vectors. But one key thing is still missing: position.
The sentences:
The agent tagged the reviewer.
The reviewer tagged the agent.
contain nearly the same words but mean different things. Chapter 5 explains how the model knows where each token sits in the sequence.