One-sentence summary: A neural network layer is a learned function that transforms vectors through matrix multiplication and nonlinearity — and for understanding the Transformer, treating it as a shape-changing black box gets you most of the way there.


7.1 You Do Not Need to Be a Neural Network Expert

Neural network layers: the FFN component of each Transformer block

Let me say this plainly before going any further: you do not need to deeply understand neural networks to understand the Transformer.

Inside each Transformer block, the Feed Forward Network (FFN) is a neural network layer. But its role in the architecture is simple: it receives a vector, transforms it, and returns a vector of the same size. For the purposes of understanding how the whole Transformer works, you can treat it as a learned function with a known shape.

What you do need to know:

  1. What shape goes in.
  2. What shape comes out.
  3. Where the learnable parameters live.

If you want the deeper picture, this chapter has it. But the goal is to give you a working mental model, not to turn you into a backprop engineer.
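Those three points can be checked in a few lines. This is a toy sketch: d_model = 8 is an arbitrary size I chose, and the layer widths anticipate the FFN shape covered later in this chapter.

```python
import torch
import torch.nn as nn

# A stand-in FFN with toy sizes. For the mental model, all that matters
# is that the input shape and the output shape are the same.
d_model = 8
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)

x = torch.randn(d_model)   # one token's vector goes in...
y = ffn(x)                 # ...and a vector of the same size comes out
print(x.shape, y.shape)    # torch.Size([8]) torch.Size([8])
```

Shape in, shape out, learnable parameters inside the two Linear layers: that is the whole contract.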


7.2 The Biological Inspiration (and Why It Only Goes So Far)

Biological neuron vs artificial neuron analogy

7.2.1 Biological Neural Networks

The human brain has roughly:

  • 86 billion neurons: each one a small processing unit
  • 100 trillion synapses: connections between neurons

A neuron receives electrical signals from upstream neurons. When the total incoming signal crosses a threshold, the neuron "fires" and sends a signal downstream.

7.2.2 Artificial Neural Networks

Artificial neural networks borrow the vocabulary but not the full biology:

  • Node: models a neuron
  • Weight: models a synapse strength
  • Activation: when the weighted sum of inputs crosses a threshold, the node activates

The important disclaimer: an artificial neural network is a mathematical model, not a simulation of the brain. It borrows the name and the rough metaphor. The actual computation is matrix multiplication plus a nonlinear function.

Do not let the "neural" branding make this feel mystical. It is linear algebra with a twist.


7.3 What Neural Networks Learn to Do

MNIST digit clustering: 10,000 handwritten digits project into 2D, with same digits clustering together

7.3.1 Automatic Feature Discovery

The compelling property of neural networks is that they learn useful representations without being told what to look for.

A classic demonstration: train a network on 10,000 handwritten digit images (the MNIST dataset). Each image is 28×28 pixels = 784 dimensions. After training, project the learned representations down to 2D. What you see: the digits cluster naturally. Images of "3" cluster together, images of "7" cluster together, and so on.

Nobody told the network what a "3" looks like. It discovered the structure from the training signal.

7.3.2 The Language Parallel

The same principle applies to language. Given enough training text:

  • The model learns which words tend to co-occur.
  • It learns grammatical structure from distribution patterns.
  • It learns that "agent" and "reviewer" occupy similar syntactic slots.

After training, "pull request" and "code review" end up in nearby regions of the embedding space — not because anyone programmed that relationship, but because the training signal consistently puts them in similar contexts.


7.4 The Basic Structure of a Neural Network

Three-layer neural network: input, hidden, output

7.4.1 Three Layers

The minimal neural network has three layers:

  1. Input layer: receives the raw data — in our case, a token vector.
  2. Hidden layer: performs the transformation. This is where the learnable weights live.
  3. Output layer: returns the result.

"Hidden" just means the layer is not directly observed as input or output. It is hidden inside the computation.

7.4.2 An Example with Concrete Features

Suppose the input is a vector that represents a product listing. The hidden layer learns to extract features like:

  • price range
  • target demographic
  • category

From those extracted features, the output layer predicts a label.

The network learns:

  • Which input dimensions to combine for each feature.
  • Which feature combinations predict which labels.

Nobody handcrafted those intermediate features. The network discovered them by minimizing prediction error on the training data.


7.5 The Mathematical Core: Matrix Multiplication

A neural layer as matrix multiplication: input vector times weight matrix equals output vector

7.5.1 One Layer = One Matrix Multiply

The core computation of a single dense (fully connected) layer:

y = xW + b

Where:

  • x is the input vector.
  • W is the weight matrix — the learnable parameters.
  • b is the bias vector — also learnable.
  • y is the output vector.

Concretely, if the input is 2D and the output is 2D (writing x as a row vector, consistent with y = xW + b):

input vector × weight matrix = output vector

              [w₁ w₃]
[0.54 0.84] × [w₂ w₄] = [0.91 0.90]

The output element at position i is the dot product of the input with the i-th column of W: here 0.91 = 0.54·w₁ + 0.84·w₂ and 0.90 = 0.54·w₃ + 0.84·w₄.
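To make the column-by-column dot product concrete, here is the same computation with the w's filled in. The weight values below are illustrative numbers I chose, not values from the text.

```python
import torch

x = torch.tensor([0.54, 0.84])       # input vector
W = torch.tensor([[1.0, 0.5],        # 2x2 weight matrix (arbitrary values)
                  [0.3, 0.75]])
b = torch.zeros(2)                   # bias (zero here, for clarity)

y = x @ W + b                        # y = xW + b

# Each output element is the dot product of x with one column of W:
y0 = x[0] * W[0, 0] + x[1] * W[1, 0]   # 0.54*1.0 + 0.84*0.3  = 0.792
y1 = x[0] * W[0, 1] + x[1] * W[1, 1]   # 0.54*0.5 + 0.84*0.75 = 0.90
print(y)                               # tensor([0.7920, 0.9000])
```

During training, the values in W and b are what gets adjusted; x is the data and y is the result.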

7.5.2 Multiple Layers

Multi-layer network visualization: layer 1 to layer 4, each connection is a weight

When you stack multiple layers:

layer 1 -> layer 2 -> layer 3 -> layer 4

Each arrow is a matrix multiply. "Deep learning" is simply the name for neural networks with many such layers.

In the figure, the connections between nodes represent the weights. Every connection is one element of a weight matrix. More layers = more matrices = more learnable parameters.

7.5.3 Why This Matters for Parameter Counts

When someone says "GPT-3 has 175 billion parameters," most of those parameters are numbers inside weight matrices like these. They are not stored in a separate knowledge base. They are the learned values of W in every layer.


7.6 Activation Functions: The Nonlinear Ingredient

Matrix operations in code and their dimension changes

7.6.1 Why Nonlinearity Matters

If you stack only linear layers:

y₂ = (x W₁) W₂ = x (W₁ W₂) = x W₃

Multiple matrix multiplications collapse into a single matrix multiply. No matter how many layers you add, the whole stack is equivalent to one layer. You cannot represent complex patterns this way.

Activation functions insert nonlinearity between layers, breaking this collapse.
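The collapse is easy to demonstrate with hand-picked toy matrices (W₁ is the identity here just to keep the arithmetic visible):

```python
import torch

x  = torch.tensor([[1.0, -1.0]])
W1 = torch.tensor([[1.0, 0.0],
                   [0.0, 1.0]])
W2 = torch.tensor([[1.0, 1.0],
                   [1.0, 1.0]])

# Two linear layers collapse into one: (x W1) W2 == x (W1 W2)
two_layers = (x @ W1) @ W2
one_layer  = x @ (W1 @ W2)
print(two_layers, one_layer)    # both tensor([[0., 0.]])

# Insert a ReLU between the layers and the collapse breaks:
nonlinear = torch.relu(x @ W1) @ W2
print(nonlinear)                # tensor([[1., 1.]])
```

The ReLU zeros out the -1.0, so the second matrix sees a different vector than the linear stack ever could. That asymmetry is what lets deep stacks represent functions a single layer cannot.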

7.6.2 ReLU

The simplest widely-used activation is ReLU (Rectified Linear Unit):

ReLU(x) = max(0, x)

Positive values pass through unchanged. Negative values become zero.

import torch
import torch.nn as nn

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(nn.functional.relu(x))  # tensor([0., 0., 0., 1., 2.])

7.6.3 GELU and SwiGLU

Modern LLMs typically use smoother variants:

  • GELU (Gaussian Error Linear Unit): used in GPT-2, BERT, many others. Smoother than ReLU around zero.
  • SwiGLU: used in LLaMA and many recent models. A gated variant that empirically produces better results.

For understanding the architecture, you do not need to memorize these. The important point is: every layer has a nonlinear activation after the matrix multiply, and the specific choice of activation function affects model quality more than you might expect.
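A quick side-by-side of ReLU and GELU on the same inputs, using PyTorch's built-in implementations, shows the difference around zero:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(F.relu(x))   # hard cutoff: every negative becomes exactly 0
print(F.gelu(x))   # smooth curve: small negatives leak through slightly
```

ReLU maps -0.5 to exactly 0; GELU maps it to a small negative value (about -0.15), because GELU weights each input by how likely a standard Gaussian is to fall below it. That smoothness is the practical difference.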


7.7 PyTorch Implementation

7.7.1 A Simple Network

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 3),   # input layer:  2-dim → 3-dim
    nn.ReLU(),         # activation function
    nn.Linear(3, 1),   # output layer: 3-dim → 1-dim
)

A handful of lines, comments included. That is a complete feedforward network.

nn.Linear(in_features, out_features) creates a layer that maps an in_features-dimensional input to an out_features-dimensional output, with a bias vector of shape [out_features]. (Note that PyTorch stores the weight internally as [out_features, in_features] and computes y = xWᵀ + b.)
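One detail worth knowing before you inspect a real model: PyTorch stores the Linear weight transposed, as [out_features, in_features]. A quick check:

```python
import torch.nn as nn

layer = nn.Linear(2, 3)
print(layer.weight.shape)   # torch.Size([3, 2]) -- stored transposed
print(layer.bias.shape)     # torch.Size([3])
```

The layer still behaves as "2-dim in, 3-dim out"; the transposed storage is just PyTorch's internal convention (it computes y = xWᵀ + b).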

7.7.2 Dimension Changes

The shape of data as it passes through:

input (1, 2) @ weight (2, 3) → hidden (1, 3)
hidden (1, 3) @ weight (3, 1) → output (1, 1)

Matrix multiplication rule: (a, b) @ (b, c) = (a, c). The inner dimensions must match. The outer dimensions are the result shape.

Understanding dimension changes is the key to reading Transformer code.
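The dimension changes above can be watched directly by running a tensor through each stage of the network from 7.7.1:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
)

x = torch.randn(1, 2)    # input  (1, 2)
h = model[0](x)          # hidden (1, 3) after the first Linear
y = model(x)             # output (1, 1) after the whole stack
print(x.shape, h.shape, y.shape)
```

Printing shapes at each stage like this is the single most useful debugging habit when reading or writing Transformer code.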


7.8 The FFN in the Transformer Block

FFN position inside the Transformer block: after Attention and its residual connection

7.8.1 The Expand-Then-Contract Pattern

Inside each Transformer block, the Feed Forward Network (FFN) follows a specific shape:

[d_model] → [4 × d_model] → [d_model]

The vector expands to four times its width, passes through a nonlinear activation, then contracts back to d_model.

ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand
    nn.GELU(),                          # activation
    nn.Linear(4 * d_model, d_model),   # contract
)

This is the standard FFN for GPT-2-style models. LLaMA uses a SwiGLU variant with three matrices instead of two, but the expand-then-contract idea is the same.
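Here is a minimal sketch of that three-matrix variant. The class name and toy sizes are mine; the gate/up/down structure and the SiLU activation follow the published LLaMA architecture (for LLaMA-7B, d_ff is 11008 rather than a strict 4 × d_model).

```python
import torch
import torch.nn as nn

class SwiGLUFFN(nn.Module):
    """Sketch of a LLaMA-style FFN: three projections instead of two."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up   = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) acts as a learned, smooth gate on up(x)
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFFN(d_model=64, d_ff=172)   # toy sizes, not LLaMA's
x = torch.randn(5, 64)                  # 5 token positions
print(ffn(x).shape)                     # torch.Size([5, 64])
```

Same contract as the GPT-2-style FFN: d_model in, d_model out, with a wider hidden dimension in between.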

7.8.2 The Full Block Structure

Input
  ↓
LayerNorm
  ↓
Masked Multi-Head Attention
  ↓
Residual connection
  ↓
LayerNorm
  ↓
Feed Forward Network (FFN)    <- this is the neural network layer
  ↓
Residual connection
  ↓
Output

Attention mixes information across positions. The FFN processes each position's representation independently. The two sub-layers have complementary roles:

  • Attention asks: which other tokens in the sequence are relevant to this one?
  • FFN asks: given that context, how should this token's representation change?

7.8.3 Why Each Token Independently?

The FFN applies the same transformation to each token position without mixing across positions; mixing across positions is Attention's job. Keeping the two operations separate makes the architecture easier to scale and modify.
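Per-position independence is directly testable: running the FFN over a whole sequence gives the same result as running it on each token separately. A small sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

seq = torch.randn(5, d_model)                   # 5 token positions
out_full = ffn(seq)                             # whole sequence at once
out_each = torch.stack([ffn(t) for t in seq])   # one token at a time

# Same result either way: the FFN never looks at other positions.
print(torch.allclose(out_full, out_each, atol=1e-6))
```

An attention layer would fail this test: changing one token's neighbors changes its output. The FFN's output for a token depends only on that token's vector.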


7.9 Where Are the Parameters?

Parameter locations across the Transformer: embedding, FFN, LayerNorm, Attention, LM Head

7.9.1 Every Learnable Weight Matrix

Here is a map of where parameters live in a Transformer:

Component | Parameters
Embedding table | vocab_size × d_model
FFN (per layer) | d_model × 4d_model + 4d_model × d_model
Attention Q, K, V, O (per layer) | 4 × d_model²
LayerNorm (per layer) | 2 × d_model (γ and β)
LM Head (final projection) | d_model × vocab_size

7.9.2 Parameter Counts for a Realistic Model

For LLaMA-7B (d_model = 4096, d_ff = 11008, 32 layers, vocab_size = 32,000):

Component | Parameters
Embedding | 32,000 × 4,096 ≈ 131M
FFN per layer (SwiGLU, 3 matrices)* | 3 × 4,096 × 11,008 ≈ 135M
Attention per layer | 4 × 4,096² ≈ 67M
LayerNorm per layer | 2 × 4,096 ≈ 8K (tiny)

*LLaMA uses SwiGLU activation, which requires three projection matrices — gate, up, and down — rather than the two used in a traditional FFN.
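The table's numbers can be tallied in a few lines of arithmetic. This is a back-of-the-envelope check under the simplifications above (biases omitted; rotary position embeddings add no parameters):

```python
d_model, d_ff, n_layers, vocab = 4096, 11008, 32, 32_000

embedding = vocab * d_model          # 131,072,000  (~131M)
ffn       = 3 * d_model * d_ff       # 135,266,304  (~135M) per layer
attention = 4 * d_model * d_model    #  67,108,864   (~67M) per layer
norms     = 2 * d_model              #       8,192    (~8K) per layer
lm_head   = vocab * d_model          # 131,072,000  (~131M)

per_layer = ffn + attention + norms
total = embedding + n_layers * per_layer + lm_head
print(f"per layer: {per_layer:,}")   # per layer: 202,383,360
print(f"total:     {total:,}")       # total:     6,738,411,520
```

The tally lands at roughly 6.7 billion, which is where the "7B" in LLaMA-7B comes from.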

7.9.3 The Surprising Fact About FFN

Many people assume Attention is where most parameters live, because Attention is where the "interesting" computation happens. But look at the numbers: FFN parameters per layer are roughly twice the Attention parameters per layer.

Across 32 layers, the FFN accounts for the majority of the parameter budget.

Recent research suggests this makes sense: Attention specializes in routing information (which token attends to which), while the FFN specializes in storing factual associations learned from training data. The FFN is where the model's "knowledge" largely lives.


7.10 Chapter Summary

7.10.1 Key Concepts

Concept | Meaning
Neural network layer | matrix multiply + bias + activation function
Hidden layer | intermediate transformation layer
Activation function | nonlinear function applied after matrix multiply (ReLU, GELU, SwiGLU)
FFN | the neural network component inside each Transformer block
Expand-then-contract | FFN pattern: d_model → 4 × d_model → d_model

7.10.2 The Core Formula

output = activation(input × W₁ + b₁) × W₂ + b₂

In PyTorch:

ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

7.10.3 What You Actually Need to Know

  1. Shape: FFN takes [seq_len, d_model], expands to [seq_len, 4 × d_model], contracts back to [seq_len, d_model].
  2. Per-position: FFN processes each token independently; it does not mix across positions.
  3. Parameters: FFN holds more parameters than Attention — often 2–3× more per block.
  4. Role: Attention mixes information; FFN transforms per-token representations and stores learned associations.

The neural network layer inside a Transformer is just matrix multiplication plus nonlinearity. Treat it as a learned function that changes the shape of each token's representation. That is enough to reason about the full architecture.


Part 2 Summary

You have now finished Part 2: Core Components.

Chapter | Component | Core role
Chapter 4 | Tokenization + Embedding | text → token ID → vector
Chapter 5 | Positional Encoding | adds position information to each vector
Chapter 6 | LayerNorm + Softmax | stabilizes activations; converts scores to probabilities
Chapter 7 | Feed Forward Network (FFN) | transforms and stores per-token information

You understand all the components. Part 3 goes into the mechanism that ties them together and gives the Transformer its power: Attention.


Chapter Checklist

After this chapter, you should be able to:

  • State what y = xW + b computes and where the learnable parameters are.
  • Explain why activation functions are necessary between layers.
  • Describe the FFN expand-then-contract pattern and its dimension changes.
  • State the role of Attention vs. the role of FFN in a Transformer block.
  • Explain why FFN holds more parameters than Attention and what that suggests about where "knowledge" lives.

See You in the Next Chapter

That is the neural network layer covered. We now have all the building blocks: tokenization, embeddings, positional encoding, LayerNorm, Softmax, and the FFN.

Chapter 8 takes a short detour into geometry before we open up Attention. Specifically, it answers a question that trips up almost everyone when they first encounter Attention: what is matrix multiplication actually doing, geometrically? Once you have that picture, the dot-product at the heart of Attention makes immediate sense.

Cite this page
Zhang, Wayland (2026). Chapter 7: Neural Network Layers - Enough to Understand the Transformer. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-07-neural-network-layers
@incollection{zhang2026transformer_chapter_07_neural_network_layers,
  author = {Zhang, Wayland},
  title = {Chapter 7: Neural Network Layers - Enough to Understand the Transformer},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-07-neural-network-layers}
}