One-sentence summary: Mamba is not "Transformer is dead" — it is the first serious argument that the quadratic Attention bottleneck can be bypassed without losing language modeling quality.


32.1 The Transformer Bottleneck

32.1.1 Attention's Complexity Problem

We have spent most of this book studying Attention. The formula is elegant:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Hidden inside this formula is a problem. The QK^T computation:

  • Q has shape [N, d]
  • K has shape [N, d]
  • QK^T has shape [N, N]

Every token attends to every other token. Compute and memory both scale as O(N²).

For short sequences, this is fine. For long sequences, it becomes prohibitive.

32.1.2 The Numbers

| Sequence length N | Attention matrix | Memory (FP16) |
|-------------------|------------------|---------------|
| 1,024             | 1M elements      | 2 MB          |
| 4,096             | 16M elements     | 32 MB         |
| 32,768            | 1B elements      | 2 GB          |
| 131,072 (128K)    | 17B elements     | 34 GB         |
| 1,048,576 (1M)    | 1T elements      | 2 TB          |

Going from 4K to 128K tokens (32× longer sequence) requires 1,024× more memory for the Attention matrix alone. This is why GPT-4 launched with 8K context, and why extending to 128K required significant engineering beyond just flipping a config flag.
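
The table follows from a one-line calculation; a quick sketch (decimal units, so values differ slightly from the binary-unit table entries):

```python
def attn_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    # N x N attention scores, 2 bytes each in FP16
    return seq_len * seq_len * bytes_per_element

for n in (1_024, 4_096, 32_768, 131_072):
    print(f"N={n:>7}: {attn_matrix_bytes(n) / 1e6:,.0f} MB")
```

The quadratic term dominates everything else in the model once N is large: doubling the context quadruples this number.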

32.1.3 The KV Cache Compounds the Problem

Chapter 22 covered KV Cache: the technique of caching past keys and values to avoid recomputation during generation. KV Cache is necessary for efficient inference, but it grows linearly with sequence length:

KV Cache size = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_element

For Llama 3-70B:

  • 4K context: ~2.5 GB
  • 128K context: ~80 GB
  • 1M context: ~625 GB (requires multiple servers)

This creates a hard tradeoff: long context means fewer concurrent users on the same hardware. Extending context length is not "free" even if you have techniques to make the Attention computation itself more efficient.
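
The formula above is easy to turn into a sizing helper. A sketch with an assumed grouped-query configuration; the 80 layers, 8 KV heads, and head_dim 128 are illustrative values, not exact Llama 3-70B ones, and grouped-query attention gives a smaller cache than the MHA-style figures above:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_element: int = 2) -> int:
    # leading 2 = one K and one V tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# hypothetical FP16 grouped-query config
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=131_072)
print(f"{size / 1e9:.1f} GB")  # 42.9 GB
```

Whatever the exact head counts, the linear dependence on seq_len is what makes million-token contexts painful.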

32.1.4 The Design Question

Is there an architecture that:

  • Captures long-range dependencies (what Attention does well)
  • Processes sequences in O(N) time (not O(N²))
  • Maintains O(1) state at inference time (not an O(N) KV cache)
  • Matches Transformer quality on language modeling

State Space Models, and Mamba specifically, are the most credible answers so far.


32.2 State Space Models

32.2.1 RNN Context

Before Transformers, sequence modeling used RNNs. Their core idea: compress the entire history into a fixed-size hidden state.

h_t = f(h_{t-1}, x_t)    # update state
y_t = g(h_t)              # produce output

RNN advantages: O(N) compute, O(1) inference memory (just the hidden state).

RNN disadvantages: sequential computation (cannot parallelize), gradient vanishing/exploding (hard to capture long-range dependencies), and a fixed-size bottleneck that cannot hold all relevant context.

Transformers solved the quality problem but reintroduced the O(N²) scaling problem. SSMs try to get O(N) scaling with better quality than RNNs.

32.2.2 The State Space Equations

SSMs originate from control theory and signal processing. The continuous-time formulation:

h'(t) = Ah(t) + Bx(t)
y(t) = Ch(t) + Dx(t)

Where:

  • x(t) — input signal
  • h(t) — hidden state (the "memory")
  • y(t) — output signal
  • A, B, C, D — learned system parameters

The parameters have interpretable roles:

  • A: how the hidden state evolves on its own ("forgetting curve")
  • B: how input is written into the hidden state ("importance of new information")
  • C: how to read the output from the hidden state ("what to report")
  • D: direct pass-through from input to output (often set to zero)

32.2.3 Discretization

Computers handle discrete sequences, not continuous signals. We need to discretize the equations with a step size Δ:

h_k = \bar{A}h_{k-1} + \bar{B}x_k
y_k = Ch_k

The discrete parameters come from the zero-order hold discretization:

\bar{A} = e^{\Delta A}
\bar{B} = (e^{\Delta A} - I)A^{-1}B

The step size Δ controls how much time each token "represents." A larger Δ causes the model to update its state more aggressively and forget the past more quickly.
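
To see Δ's role concretely, here is the zero-order hold worked out for a one-dimensional state (a toy sketch, with a < 0 so the state decays):

```python
import math

def zoh_scalar(a: float, b: float, delta: float) -> tuple:
    """1-D zero-order hold: a_bar = e^(Δa), b_bar = (e^(Δa) - 1) a^(-1) b."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

# With a < 0: small Δ barely moves the state (remember), large Δ overwrites it (forget)
print(zoh_scalar(a=-1.0, b=1.0, delta=0.01)[0])  # ≈ 0.990
print(zoh_scalar(a=-1.0, b=1.0, delta=2.0)[0])   # ≈ 0.135
```

The discrete transition a_bar is a forgetting factor: close to 1 preserves the state, close to 0 erases it.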

32.2.4 Two Modes of Computation

SSMs have a key structural property: they can be computed in two equivalent ways.

Recurrent form (for inference):

h = zeros(batch_size, d_state)
for k in range(seq_len):
    h = A_bar @ h + B_bar @ x[k]    # update state
    y[k] = C @ h                     # read output

Cost: O(N) time, O(1) memory (just h).

Convolutional form (for training):

# Pre-compute the convolutional kernel
K = [C @ B_bar,
     C @ A_bar @ B_bar,
     C @ A_bar^2 @ B_bar,
     ...]   # length N

# Apply as convolution
y = conv1d(x, K)

Cost: O(N log N) via FFT, parallelizable across the sequence.

The SSM advantage: train with the parallel convolutional form, infer with the O(1)-memory recurrent form. This decouples training efficiency from inference efficiency in a way RNNs cannot.
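
The equivalence can be checked numerically. A toy demonstration with a scalar input and a tiny state (deliberately naive, not an efficient implementation of either form):

```python
import torch

torch.manual_seed(0)
d_state, seq_len = 4, 16
A = torch.randn(d_state, d_state) * 0.1   # small entries keep powers of A stable
B = torch.randn(d_state, 1)
C = torch.randn(1, d_state)
x = torch.randn(seq_len)

# Recurrent form: one state update per step
h = torch.zeros(d_state, 1)
y_rec = []
for t in range(seq_len):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())

# Convolutional form: kernel K[i] = C A^i B, then y_t = sum_{i<=t} K[t-i] x_i
K = [(C @ torch.matrix_power(A, i) @ B).item() for i in range(seq_len)]
y_conv = [sum(K[t - i] * x[i].item() for i in range(t + 1)) for t in range(seq_len)]

print(max(abs(a - b) for a, b in zip(y_rec, y_conv)))  # tiny float error: the forms agree
```

Unrolling the recurrence gives y_t = Σ_{i≤t} C A^{t-i} B x_i, which is exactly the causal convolution with kernel K.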

32.2.5 S4: The Structured State Space Model

In 2021, Albert Gu and colleagues at Stanford proposed S4 (Structured State Space Sequence Model). The key contribution: using a specific structured initialization for A (called HiPPO) that allows the hidden state to efficiently remember long-range history.

S4 demonstrated that SSMs could handle very long sequences (tens of thousands of elements) on tasks where Transformers struggled due to context length limits.

But S4 had a critical limitation: its parameters are input-independent. The A, B, C matrices are the same regardless of what token is being processed. The model cannot decide to pay more attention to some tokens and less to others.

This is fundamentally different from Attention, where the query dynamically selects what to attend to. A fixed-parameter SSM has fixed "attention" to all positions, weighted only by the learned forgetting curve.


32.3 Mamba: Selective State Spaces

32.3.1 The Paper

December 2023. Albert Gu (CMU) and Tri Dao (Princeton) publish "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." The claim: an SSM architecture that matches Transformer quality on language modeling tasks.

This was the first credible result in that direction. SSMs before Mamba were good at specific domains (audio, genomics, long-sequence classification) but fell short of Transformers on language.

32.3.2 The Core Innovation: Input-Dependent Selection

Mamba makes the SSM parameters functions of the input:

\Delta_k = \text{softplus}(\text{Linear}(x_k))
B_k = \text{Linear}(x_k)
C_k = \text{Linear}(x_k)

The step size Δ now depends on the current token. So do B (how to write new information) and C (what to read out).

The A matrix remains input-independent — this is a deliberate design choice that preserves computational structure.

What input-dependent Δ means:

  • Large Δ_k: the state updates aggressively. The model effectively "forgets" old state and focuses on the current token. Behaves like high-resolution observation.
  • Small Δ_k: the state barely changes. The model preserves memory and treats the current token as low-importance. Behaves like smoothing over many tokens.

The model learns to assign large Δ to important tokens and small Δ to filler words. This is selection: the model decides what to remember.

32.3.3 Selection Compared to Attention

Both selection (Mamba) and Attention (Transformer) are mechanisms for deciding which information to carry forward. They differ fundamentally in how they do it:

Attention:

At each position, explicitly compare this token to all previous tokens.
Retrieve a weighted sum of their values.
The weight depends on (Q, K) similarity.
Cost: O(N²), every position compared against every other position.

Selective SSM (Mamba):

At each position, decide how much to update the hidden state.
The state carries a compressed summary of all past tokens.
The decision (Δ, B, C) depends on the current token.
Cost: O(N), one state update per position.

Mamba cannot directly access any past token the way Attention can. It accesses the compressed state. This is an expressiveness tradeoff — Mamba is less expressive per position but much more compute-efficient.

32.3.4 Why Selection Matters for Language

Consider: "The agent merged the pull request, which it had submitted three days earlier."

To resolve "it," the model needs to connect to "agent" many tokens back. In Attention, this is explicit: "it" attends to "agent" with high weight.

In a non-selective SSM, "agent" is mixed into the state at its position, and by the time we reach "it," it may have been diluted by all the tokens in between. The model has no mechanism to say "this particular entity is important, preserve it."

Mamba's selection mechanism allows the model to effectively say: "When I see 'agent,' write it prominently into the state (large B). When I see 'the' or commas, update minimally (small Δ)." By the time "it" needs to resolve, "agent" is still accessible in the state.

32.3.5 Hardware-Aware Algorithm

Making parameters input-dependent breaks the convolutional computation. The kernel is no longer fixed, so we cannot precompute it and apply a single FFT.

Mamba's solution: a custom hardware-aware algorithm.

Key observations:

  1. The selective scan can be parallelized within a block using parallel prefix sums
  2. If we stay in SRAM (fast on-chip memory) rather than DRAM, the bandwidth cost is manageable
  3. Fusing the scan, A/B/C computation, and output projection into a single CUDA kernel eliminates intermediate DRAM reads/writes

The result:

  • Training throughput: ~3× faster than Flash Attention 2 at comparable sequence length
  • Inference: O(N) with constant memory per position

32.3.6 Mamba Block Structure

Input x
  │
  ├─────────────────────────────────────────┐
  ▼                                         ▼
Linear projection (expand)           Linear projection (expand)
  ▼                                         ▼
Conv1d → SiLU activation             SiLU (gate)
  ▼                                         │
Selective SSM                               │
  ▼                                         │
  × ←───────────────────────────────────────┘   element-wise multiply with gate
  ▼
Linear projection (contract)
  ▼
Output

Key design choices:

  • No Attention: completely removed
  • Expand-contract pattern: similar to FFN's up-projection / down-projection
  • Gating: the parallel branch acts as a multiplicative gate, providing nonlinearity

32.3.7 Complexity Summary

| Dimension          | Transformer           | Mamba      |
|--------------------|-----------------------|------------|
| Training compute   | O(N²d)                | O(Nd)      |
| Training memory    | O(N²) + O(N) KV       | O(N)       |
| Inference per step | O(N) (with KV cache)  | O(1)       |
| Inference memory   | O(N) KV cache         | O(1) state |

The inference row is the key. A Transformer accumulating a KV cache grows its memory linearly with every token generated. Mamba's recurrent state is constant size regardless of sequence length.
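
A back-of-envelope comparison makes the row concrete; all sizes below are assumed for illustration (FP16, d_model 1024, 24 layers, state size 16):

```python
d_model, n_layers, d_state = 1024, 24, 16

kv_per_token = 2 * n_layers * d_model * 2        # K and V vectors per layer; grows per token
ssm_state    = n_layers * d_model * d_state * 2  # fixed size, independent of length

for seq_len in (1_000, 100_000):
    print(f"{seq_len:>7} tokens: KV cache {kv_per_token * seq_len / 1e6:>8.1f} MB, "
          f"SSM state {ssm_state / 1e6:.1f} MB")
```

The KV cache scales 100× between the two rows; the SSM state does not change at all.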


32.4 Mamba-2 and State Space Duality

32.4.1 Mamba-1 Limitations

Mamba matched Transformers on language perplexity, but analysis revealed areas where Attention still held advantages:

  • In-context learning: Transformers use few-shot examples more effectively
  • Precise information retrieval: "What was the exact value mentioned in paragraph 3?" is harder for a state-based model
  • Complex multi-step reasoning: explicit position access in Attention helps multi-hop reasoning

32.4.2 State Space Duality

May 2024, Dao and Gu publish the Mamba-2 paper: "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality."

The theoretical contribution: SSMs and Attention are two instances of the same general framework.

Both can be written as structured matrix multiplications:

  • Dense matrix (no structure) = full Attention
  • Semiseparable matrix (specific low-rank structure) = SSM

This "State Space Duality" (SSD) connects the two families mathematically. It shows they are not alternatives from different paradigms — they are points on the same spectrum, differing in what structure constraints are imposed on the sequence-to-sequence transformation matrix.

Practical consequence: the SSD framework suggests intermediate points. Models that are neither fully dense Attention nor fully structured SSM, but something in between.

32.4.3 Mamba-2 Improvements

Beyond theory, Mamba-2 delivers practical gains:

  • 2–8× faster than Mamba-1 on training throughput
  • Larger state dimension: from 16 to 256 (better long-range memory)
  • Better tensor core utilization: the new algorithm maps cleanly to GPU hardware structure
| Task                  | Mamba-1 | Mamba-2        | Transformer    |
|-----------------------|---------|----------------|----------------|
| Language perplexity   | base    | slightly lower | slightly lower |
| Long-sequence (16K+)  | fast    | 2–8× faster    | slow           |
| Training throughput   | high    | higher         | medium         |

32.5 Hybrid Architectures: Jamba

32.5.1 The Motivation

Neither pure Transformer nor pure Mamba is clearly best across all tasks:

| Task type                         | Transformer advantage | Mamba advantage |
|-----------------------------------|-----------------------|-----------------|
| Short sequence, complex reasoning | Strong                | Moderate        |
| Long sequence, throughput         | Memory bottleneck     | Linear scaling  |
| Precise information retrieval     | Strong                | Weaker          |
| Token-level routing               | Strong                | Strong          |
| Inference speed (long context)    | Slow                  | Fast            |

An obvious question: can you combine them?

32.5.2 Jamba Architecture

March 2024, AI21 Labs releases Jamba — the first production-scale model combining Transformer layers, Mamba layers, and MoE.

Layer structure (repeating pattern):

Group of 4 layers:
  Layer 1: Mamba + MLP
  Layer 2: Mamba + MLP
  Layer 3: Mamba + MLP
  Layer 4: Attention + MLP   ← one Attention layer per four-layer group

The MLP in some layers is replaced by MoE.

Design logic:

  • Most layers use efficient Mamba: O(N) computation, handles the bulk of sequence processing
  • Occasional Attention layers: provide precise global information retrieval when needed
  • MoE: increases capacity without proportionally increasing inference compute
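
The interleave pattern can be sketched in a few lines. Here `nn.Linear` stands in for a Mamba block and the MoE substitution is omitted; the point is only the 1-in-4 placement of Attention:

```python
import torch.nn as nn

def build_hybrid_stack(d_model: int, n_layers: int, attn_every: int = 4) -> nn.ModuleList:
    layers = nn.ModuleList()
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            # every fourth layer: full attention for precise retrieval
            layers.append(nn.MultiheadAttention(d_model, num_heads=8, batch_first=True))
        else:
            # placeholder for an O(N) Mamba block
            layers.append(nn.Linear(d_model, d_model))
    return layers

stack = build_hybrid_stack(d_model=64, n_layers=32)
n_attn = sum(isinstance(m, nn.MultiheadAttention) for m in stack)
print(n_attn, len(stack) - n_attn)  # 8 24
```

With 32 layers and attn_every=4, the stack lands exactly on Jamba's 8 Attention / 24 Mamba split.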

32.5.3 Jamba Configuration

| Parameter         | Value | Note                           |
|-------------------|-------|--------------------------------|
| Total parameters  | 52B   |                                |
| Active parameters | 12B   | MoE reduces per-token compute  |
| Total layers      | 32    |                                |
| Attention layers  | 8     | One per four-layer group       |
| Mamba layers      | 24    | Three-quarters of all layers   |
| MoE experts       | 16    | Per MoE layer                  |
| Active experts    | 2     | Top-2 routing                  |
| Context window    | 256K  |                                |

The 1:3 Attention-to-Mamba ratio is not arbitrary. Empirically, this provides enough Attention for high-quality retrieval tasks while keeping most computation in the faster Mamba layers.

32.5.4 Why the Hybrid Can Outperform Pure Architectures

Long document processing:

[First 8K tokens]  [240K tokens in the middle]  [Last 8K tokens]

Mamba layers:   Efficiently compress the 240K middle section.
                State-based compression: no O(N²) blowup.

Attention layers: Precise retrieval across key sections.
                  "What was the deadline mentioned at the start?"

In-context learning:

[Few-shot examples]  [Query]

Mamba layers:   Process the examples into a compressed state.
Attention layers: Align query structure to example structure directly.

The combination handles tasks that neither pure architecture handles well alone.

32.5.5 Jamba Performance

Against same-class comparisons:

| Model        | Parameters       | Context | VRAM   | Throughput |
|--------------|------------------|---------|--------|------------|
| Llama 2 70B  | 70B              | 4K      | 140 GB | baseline   |
| Mixtral 8x7B | 47B              | 32K     | 94 GB  | 1.5×       |
| Jamba        | 52B (12B active) | 256K    | 100 GB |            |

Jamba processes 256K context with lower VRAM than Llama 70B processes 4K. The Mamba layers are doing the heavy lifting on the long-range compression.


32.6 Other Alternative Architectures

32.6.1 RWKV: Linear Attention via Recurrence

RWKV (Receptance Weighted Key Value) is an architecture that reformulates Attention in a recurrent-compatible way.

The core formula (simplified):

wkv_t = sum(exp(k_i + w*(t-1-i)) * v_i for i in range(t))
y_t   = sigmoid(r_t) * wkv_t

RWKV can be computed:

  • As a transformer: parallel over positions during training
  • As an RNN: sequential, constant state during inference

The state is time-decayed: information from position i is weighted by e^{w(t-1-i)}, so older tokens contribute less. This is similar to how Δ controls forgetting in Mamba, but the decay is fixed rather than input-dependent.
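
The decayed sum admits an exact recurrence: shifting from t to t+1 multiplies every existing term by e^w and adds the newest token. A toy check against the direct sum (the real RWKV formula also divides by a matching normalizer, omitted here):

```python
import math

w = -0.5                        # negative w: older tokens decay
k = [0.3, -1.2, 0.8, 0.1]
v = [1.0, 2.0, -1.0, 0.5]

def wkv_direct(t):
    # wkv_t = sum over i < t of exp(k_i + w*(t-1-i)) * v_i
    return sum(math.exp(k[i] + w * (t - 1 - i)) * v[i] for i in range(t))

state = 0.0
for t in range(1, len(k) + 1):
    state = math.exp(w) * state + math.exp(k[t - 1]) * v[t - 1]   # decay, then add
    assert abs(state - wkv_direct(t)) < 1e-9

print("recurrent state matches the direct sum at every step")
```

This recurrence is what makes RWKV an RNN at inference time: one scalar-decay update per token, no growing cache.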

RWKV has a strong open-source community (RWKV-4 through RWKV-6). It is competitive with Mamba on many language benchmarks and supports local deployment through smaller variants.

32.6.2 RetNet: Retention

RetNet (Retentive Network) from Microsoft Research (2023) uses a "retention" mechanism:

\text{Retention}(Q, K, V) = (QK^T \odot D)V

where D is a decay matrix: D_{ij} = γ^{i-j} for i ≥ j and 0 otherwise. Information from earlier positions is geometrically discounted.

RetNet offers three computation modes:

  1. Parallel: like Transformer, used for training
  2. Recurrent: like RNN, used for inference
  3. Chunked: hybrid, useful for very long sequences

The decay structure is fixed (controlled by γ), making RetNet more constrained than Mamba's input-dependent selection. On language benchmarks, RetNet is competitive with small Transformers but trails larger ones.
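
The parallel and recurrent modes compute the same thing; a toy check, where the state S accumulates decayed outer products k_iᵀv_i and the output at each step is q_i S_i:

```python
import torch

torch.manual_seed(0)
n, d = 6, 4
gamma = 0.9
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

# Parallel mode: (QK^T ⊙ D)V with D_ij = γ^(i-j) for i ≥ j, else 0
idx = torch.arange(n)
D = torch.where(idx[:, None] >= idx[None, :],
                gamma ** (idx[:, None] - idx[None, :]).float(),
                torch.zeros(n, n))
out_parallel = (Q @ K.T * D) @ V

# Recurrent mode: S_i = γ S_{i-1} + k_i^T v_i, output q_i S_i
S = torch.zeros(d, d)
out_rec = []
for i in range(n):
    S = gamma * S + K[i].unsqueeze(1) @ V[i].unsqueeze(0)   # decayed outer product
    out_rec.append(Q[i] @ S)
out_rec = torch.stack(out_rec)

print(torch.allclose(out_parallel, out_rec, atol=1e-4))  # True
```

The chunked mode mixes the two: parallel within a chunk, recurrent state carried between chunks.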

32.6.3 Hyena: Implicit Long Convolutions

Hyena (Stanford, 2023) replaces Attention with implicit long convolutions:

Input x
  ↓
Generate filter functions f1, f2, ..., fN from x via a small network
  ↓
Output y = (((x * f1) · g1) * f2) · g2 ...

The filters are generated implicitly by a small MLP, so they can be arbitrarily long without storing an N×N matrix. Cost: O(N log N) via FFT.
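
The FFT trick is standard: zero-pad to length 2N so that circular convolution equals the causal linear convolution, giving O(N log N) in place of the direct O(N²) sum. A sketch with a random filter standing in for the MLP-generated one:

```python
import torch

torch.manual_seed(0)
N = 64
x = torch.randn(N, dtype=torch.float64)
f = torch.randn(N, dtype=torch.float64)   # stands in for an implicit (MLP-generated) filter

# FFT path: pad to 2N to avoid wrap-around, keep the first N outputs
y_fft = torch.fft.irfft(torch.fft.rfft(x, 2 * N) * torch.fft.rfft(f, 2 * N), 2 * N)[:N]

# Direct O(N^2) causal convolution for comparison
y_direct = torch.tensor([sum(f[t - i].item() * x[i].item() for i in range(t + 1))
                         for t in range(N)], dtype=torch.float64)

print(torch.max(torch.abs(y_fft - y_direct)).item())  # tiny float error: identical results
```

Because the filter never materializes as an N×N matrix, the sequence length can grow without a quadratic memory cost.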

Hyena showed strong results on long-sequence classification tasks (PathX, 16K sequence). On standard language modeling benchmarks, it is competitive with small Transformers but does not reach Mamba-level language quality.

32.6.4 Architecture Comparison

| Architecture | Core mechanism        | Compute    | Language quality | Long-seq  |
|--------------|-----------------------|------------|------------------|-----------|
| Transformer  | Dense Attention       | O(N²)      | Best             | Limited   |
| Mamba        | Selective SSM         | O(N)       | Near Transformer | Excellent |
| RWKV         | Linear recurrence     | O(N)       | Good             | Excellent |
| RetNet       | Decayed retention     | O(N)       | Good             | Excellent |
| Hyena        | Implicit convolution  | O(N log N) | Moderate         | Excellent |
| Jamba        | Hybrid (Attn + Mamba) | O(N)       | Strong           | Excellent |

32.7 Code Examples

32.7.1 Minimal SSM

import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Minimal SSM for conceptual understanding."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        self.A      = nn.Parameter(torch.randn(d_state, d_state) * 0.01)
        self.B      = nn.Parameter(torch.randn(d_state, d_model) * 0.01)
        self.C      = nn.Parameter(torch.randn(d_model, d_state) * 0.01)
        self.delta  = nn.Parameter(torch.ones(1) * 0.1)

    def discretize(self):
        """Zero-order hold discretization."""
        A_bar = torch.matrix_exp(self.delta * self.A)
        B_bar = self.delta * self.B   # simplified
        return A_bar, B_bar

    def forward(self, x):
        """x: (batch, seq_len, d_model) → (batch, seq_len, d_model)"""
        batch, seq_len, _ = x.shape
        A_bar, B_bar = self.discretize()

        h = torch.zeros(batch, self.d_state, device=x.device)
        outputs = []
        for t in range(seq_len):
            h = h @ A_bar.T + x[:, t, :] @ B_bar.T   # state update
            y = h @ self.C.T                            # read output
            outputs.append(y)

        return torch.stack(outputs, dim=1)

32.7.2 Selective SSM (Mamba-Style)

class SelectiveSSM(nn.Module):
    """Mamba-style selective SSM. The key: B, C, Δ depend on the input."""

    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4):
        super().__init__()
        self.d_state = d_state
        self.in_proj = nn.Linear(d_model, d_model * 2)
        self.conv    = nn.Conv1d(d_model, d_model, kernel_size=d_conv,
                                 padding=d_conv - 1, groups=d_model)

        # Input-dependent parameters: the Mamba innovation
        self.B_proj     = nn.Linear(d_model, d_state)
        self.C_proj     = nn.Linear(d_model, d_state)
        self.delta_proj = nn.Linear(d_model, d_model)

        # A is input-independent but uses special initialization
        A = torch.arange(1, d_state + 1).float()
        self.A_log = nn.Parameter(torch.log(A))

        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, _ = x.shape

        xz   = self.in_proj(x)
        x, z = xz.chunk(2, dim=-1)

        x = x.transpose(1, 2)
        x = self.conv(x)[:, :, :seq_len]
        x = x.transpose(1, 2)
        x = torch.nn.functional.silu(x)

        # Compute input-dependent parameters
        B     = self.B_proj(x)                                        # (B, L, d_state)
        C     = self.C_proj(x)                                        # (B, L, d_state)
        delta = torch.nn.functional.softplus(self.delta_proj(x))      # (B, L, d_model)
        A     = -torch.exp(self.A_log)                                 # (d_state,)

        y = self._selective_scan(x, A, B, C, delta)
        y = y * torch.nn.functional.silu(z)
        return self.out_proj(y)

    def _selective_scan(self, x, A, B, C, delta):
        batch, seq_len, d_model = x.shape
        h = torch.zeros(batch, d_model, self.d_state, device=x.device)
        outputs = []
        for t in range(seq_len):
            delta_t = delta[:, t, :].unsqueeze(-1)           # (B, D, 1)
            A_bar   = torch.exp(delta_t * A)                  # (B, D, d_state)
            B_bar   = delta_t * B[:, t, :].unsqueeze(1)       # (B, D, d_state)
            h       = h * A_bar + x[:, t, :].unsqueeze(-1) * B_bar
            y       = (h * C[:, t, :].unsqueeze(1)).sum(-1)   # (B, D)
            outputs.append(y)
        return torch.stack(outputs, dim=1)

32.7.3 Scaling Comparison

def compare_scaling():
    """Compare wall-clock time at different sequence lengths."""
    import time
    d_model = 512
    batch   = 4

    seq_lengths = [512, 1024, 2048, 4096, 8192]
    print(f"{'seq_len':>8}  {'Attention':>10}  {'SSM':>10}  {'ratio':>8}")
    print("-" * 50)

    for seq_len in seq_lengths:
        x = torch.randn(batch, seq_len, d_model).cuda()

        attn = nn.MultiheadAttention(d_model, 8, batch_first=True).cuda()
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(10):
            attn(x, x, x)
        torch.cuda.synchronize()
        attn_time = (time.time() - t0) / 10

        ssm = SelectiveSSM(d_model).cuda()
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(10):
            ssm(x)
        torch.cuda.synchronize()
        ssm_time = (time.time() - t0) / 10

        ratio = attn_time / ssm_time
        print(f"{seq_len:>8}  {attn_time:>10.4f}s  {ssm_time:>10.4f}s  {ratio:>7.2f}×")

# Typical output:
#  seq_len   Attention         SSM     ratio
# --------------------------------------------------
#      512    0.0023s    0.0018s    1.28×
#     1024    0.0045s    0.0032s    1.41×
#     2048    0.0156s    0.0061s    2.56×
#     4096    0.0589s    0.0118s    4.99×
#     8192    0.2234s    0.0233s    9.58×

The SSM advantage grows as sequence length increases. At 8K tokens, SSM is ~10× faster. At 128K tokens, the advantage is measured in orders of magnitude.


32.8 Architecture Selection Guide

32.8.1 Transformer

Choose when:

  • Sequences under 4K tokens (Attention overhead is manageable)
  • Tasks requiring precise information retrieval (Attention handles this directly)
  • In-context learning is central (Transformers use few-shot examples better)
  • Ecosystem maturity matters (libraries, hardware optimization, pre-trained weights)

32.8.2 Mamba / SSM

Choose when:

  • Sequences over 8K tokens (the quadratic cost begins to dominate)
  • Throughput matters more than latency
  • Memory-constrained environments where KV cache is impractical
  • Signal processing, audio, genomics (SSMs have strong priors for these domains)

32.8.3 Hybrid (Jamba-style)

Choose when:

  • Context lengths from 32K to 1M+ tokens
  • Need strong language quality AND long-range efficiency
  • Both retrieval tasks and sequential processing in the same model
  • Willing to adopt a newer, less-battle-tested stack

32.8.4 Outlook

Near term (2025–2026):

  • Transformer remains the default for new projects
  • SSM layers appear in long-context hybrid models
  • Hardware vendors optimize for Mamba's scan operations

Medium term (2026–2028):

  • Hybrid architectures may become the new standard for frontier systems
  • SSMs gain adoption in production serving where long contexts are routine

Longer term:

  • Architecture search may discover better sequence modeling primitives
  • SSD theory suggests a unified framework from which both Attention and SSM emerge as special cases

32.8.5 Academic vs Industrial Perspective

Academia is excited about SSMs as theoretical objects. The SSD framework, connecting Attention and SSMs through structured matrix theory, is considered a genuine mathematical insight.

Industry remains cautious. The Transformer stack has years of optimization: Flash Attention, KV Cache compression, quantization tooling, speculative decoding, all the infrastructure that makes Transformers fast in production. Mamba needs its own stack, and that stack is still maturing.

The realistic path: hybrid architectures that mix Transformer components with SSM components, letting teams benefit from both while the SSM tooling catches up.


32.9 Chapter Summary

32.9.1 The Problem Transformer Has

O(N²) Attention: both compute and memory scale quadratically with sequence length. A 128K-context model needs 1,024× the Attention memory of a 4K-context model.

KV Cache: the O(N) cache that enables efficient generation still grows without bound. 1M tokens = 625 GB for a 70B model.

32.9.2 What SSMs Offer

Recurrent state: O(1) inference memory regardless of sequence length. The hidden state compresses all past context into a fixed-size vector.

O(N) compute: one state update per token position, not one comparison per pair.

Training parallelism: equivalent convolutional computation during training, still parallelizable.

32.9.3 What Mamba Adds

Selection: make Δ, B, and C input-dependent. The model decides per-token what to write into state and how aggressively to update it. This is the mechanism that closes the quality gap with Transformers.

Hardware-aware algorithm: custom CUDA kernel that keeps the selective scan efficient on modern GPUs.

32.9.4 Key Equations

Discrete SSM:

h_k = \bar{A}h_{k-1} + \bar{B}x_k, \quad y_k = Ch_k

Mamba selection:

\Delta_k, B_k, C_k = f(x_k)

Complexity:

|                  | Attention | SSM        |
|------------------|-----------|------------|
| Train compute    | O(N²d)    | O(Nd)      |
| Train memory     | O(N²)     | O(N)       |
| Inference memory | O(N) KV   | O(1) state |

32.9.5 My Take

Mamba is the first architecture I have seen that makes a credible argument against Transformer's dominance on language tasks — not by beating it on benchmarks (early Mamba was roughly parity), but by proving that you can reach the same quality with fundamentally better complexity.

The SSD insight is the most theoretically interesting: Attention and SSMs are not different paradigms, they are different structural constraints on the same matrix framework. That suggests the right architecture for any specific task is neither pure Transformer nor pure Mamba, but some point on the SSD spectrum — which is exactly what Jamba's hybrid approach is empirically discovering.

For practitioners: watch the hybrid architectures. The pure-SSM story is theoretically compelling but underinvested in tooling. The hybrid story is where the production wins will appear first.


Chapter Checklist

After this chapter, you should be able to:

  • Explain where the O(N²) cost in Attention comes from and why it matters at long context.
  • Describe the SSM state equations and what A, B, C each control.
  • Explain why SSMs have two computation modes (recurrent and convolutional) and what each is good for.
  • Explain what Mamba's selective mechanism does and how input-dependent Δ implements "selection."
  • Describe the State Space Duality insight and what it implies about Transformer and SSM unification.
  • Explain Jamba's architectural choice and why the 1:3 Attention-to-Mamba ratio makes sense.
  • Choose between Transformer, Mamba, and hybrid architecture for a given sequence length and task.

See You in the Next Chapter

That completes Part 9. You have now followed the Transformer stack from its core mechanisms all the way to the frontier: alignment training, sparse expert routing, extended reasoning, and the first credible post-Transformer alternatives.

If you can explain RLHF, MoE, test-time compute scaling, and why Mamba's O(N) matters — without looking at these chapters — you are ready for the appendices and for reading the original papers.

The field moves fast. These chapters represent a snapshot of April 2026. What will not move: the first-principles thinking that makes new papers comprehensible. The math and the intuition you built in Parts 1–8 are the durable part. Everything in Part 9 is the application.

Cite this page
Zhang, Wayland (2026). Chapter 32: Post-Transformer Architectures - Mamba and Hybrid Models. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-32-post-transformer-architectures
@incollection{zhang2026transformer_chapter_32_post_transformer_architectures,
  author = {Zhang, Wayland},
  title = {Chapter 32: Post-Transformer Architectures - Mamba and Hybrid Models},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-32-post-transformer-architectures}
}