
How Torn Is a Trained Mixture-of-Experts?

June 25, 2026

Difference-quotient continuity signature on released OLMoE-1B-7B. The hard top-k refinement growth sits at about 16x — the grid ratio, i.e. scaling exponent ~1, the order-0-jump signature — at every one of the 16 layers, while two continuity-guaranteed controls (a soft-edge gate and a tied-expert control) stay flat at about 1x (exponent ~0).

A Geometric Diagnostic for Routing Discontinuity in Released Weights — Zhuo Zhang, Independent Researcher · 📄 Read the full paper (PDF) →

A companion to DeepSeek V4 and Manifold Tearing. That post was the argument — loss spikes are geometric tears, not optimization blowups. This one is the measurement: I built a diagnostic, pointed it at released MoE weights, and asked how torn they actually are. This write-up is the readable version; the PDF carries the full method, related-work, and reproducibility appendices.

In the DeepSeek post I borrowed a word from Max Ma and Gen-Hua Shi's Deep Manifold framework: a Mixture-of-Experts layer doesn't bend the representation, it can tear it. Bending is continuous and benign; tearing is discrete and the pathology. It was a good story. But a metaphor is a debt — at some point you have to show the tear exists, say how big it is, and admit what it does and doesn't cause.

So I paid the debt. What follows is a measurement paper compressed into a blog post: a geometric diagnostic, run on released OLMoE and Qwen weights, with negative controls and known-answer tests.


A metaphor I owed a measurement

An MoE layer routes each token to a small top-$k$ subset of experts. That discrete choice makes the layer-to-layer map $C^0$-discontinuous: two hidden states arbitrarily close, but on opposite sides of a routing boundary, go to different experts, so the block output can jump by an $O(1)$ amount. That jump is the tear.

Prior work approaches this from two sides and stops short of measuring it on real models:

  • Continuous-routing fixes: ReMoE replaces top-$k$ with ReLU routing and explicitly frames top-$k$ as a jump discontinuity; Soft MoE and sparsemax/α-entmax are relatives. These remove the discontinuity; they don't quantify the one sitting in shipped weights.
  • Routing flip-rate: the R3 router-replay study reports that ≈10% of routers and 94% of tokens flip at least one expert between the training and inference engines. That counts how often the expert set changes, not the output jump the change induces.

Nobody had measured the quantity that actually governs the transport map on a real model: the output-jump geometry on released LLM-MoE weights — how big the jump is, where it sits, and which direction triggers it. That's the gap.

What a tear actually is

An MoE block maps $h \in \mathbb{R}^d$ to

$$y(h) = \sum_e w_e(h)\, E_e(h),$$

with router logits $g_e = W_e h$, top-$k$ selecting the largest, and $w$ the renormalized softmax over the selected logits. Order the logits $g_{(1)} \ge \cdots \ge g_{(k)} \ge g_{(k+1)}$. The $k/(k{+}1)$ active-set boundary is where

$$g_{(k)} = g_{(k+1)},$$

a hyperplane with normal $n = W_{(k)} - W_{(k+1)}$. Cross it and you swap the $k$-th expert, jumping the block output by $\lVert E_{(k)}(h) - E_{(k+1)}(h)\rVert$. That is a genuine $C^0$ discontinuity; contrast a ReLU MLP, which is $C^0$-continuous with $C^1$ kinks.
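
To make the boundary concrete, here is a minimal pure-Python sketch of the block defined above. The setup is a toy of my own choosing rather than the paper's: $d=2$, three experts with constant outputs, and $k=1$ so the renormalized weight is exactly 1. The names `moe_block`, `W`, and `experts` are illustrative, not taken from any released code.

```python
import math

def moe_block(h, W, experts, k):
    """Toy top-k MoE block: y(h) = sum_e w_e(h) E_e(h).

    w is a softmax renormalized over the k selected router logits only.
    """
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]
    top = sorted(range(len(logits)), key=lambda e: -logits[e])[:k]
    peak = max(logits[e] for e in top)
    exps = {e: math.exp(logits[e] - peak) for e in top}
    z = sum(exps.values())
    y = [0.0] * len(experts[top[0]](h))
    for e in top:
        for i, v in enumerate(experts[e](h)):
            y[i] += exps[e] / z * v
    return y, top

# Hypothetical toy setup (not from the paper): router rows W_e, constant experts.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [lambda h: [1.0, 0.0], lambda h: [0.0, 1.0], lambda h: [0.0, 0.0]]

eps = 1e-6
h_plus = [1.0 + eps, 1.0]    # just on expert 0's side of g_0 = g_1
h_minus = [1.0 - eps, 1.0]   # just on expert 1's side

y_plus, top_plus = moe_block(h_plus, W, experts, k=1)
y_minus, top_minus = moe_block(h_minus, W, experts, k=1)

input_gap = math.dist(h_plus, h_minus)    # shrinks with eps
output_jump = math.dist(y_plus, y_minus)  # ||E_0 - E_1||, independent of eps
print(top_plus, top_minus, input_gap, output_jump)
```

Shrinking `eps` shrinks `input_gap` but leaves `output_jump` unchanged, which is the order-0 behavior described above. In this sketch, for $k>1$ the swapped expert's term enters multiplied by its renormalized weight.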

The instrument is a difference quotient: walk a path that crosses the boundary, measure $\lVert \Delta y\rVert / \lVert \Delta x\rVert$, and refine the grid; I sample it at resolutions $500 \to 2000 \to 8000$. For a continuous (locally Lipschitz) map the quotient saturates; for a genuine order-0 jump it grows linearly in the resolution $T$, as $T^1$. The tell is the scaling exponent: $\approx 1$ for a jump, $\approx 0$ for a continuous map.

Here is the part that's easy to misread, so let me be blunt about it. I summarize that growth as hardG = (max quotient at 8000) / (max quotient at 500). For any nonzero $C^0$ jump that ratio is, by construction, just the grid ratio $8000/500 = 16$, independent of $k$, the expert count, or the layer. So 16× is not a severity. It is the number a true order-0 jump must produce under this protocol; it certifies "the singularity is order-0, and it's here," and nothing more. Larger would not be worse; there is no "larger." The actual severity lives elsewhere: in the per-block output jump and the expert cliff, both below. Two negative controls keep the instrument honest: a tied-expert control (the swap is a no-op) and a continuity-guaranteed soft-edge gate, both required to sit at exponent $\approx 0$ ($\approx 1\times$).
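
The hardG ratio is easy to reproduce on a one-dimensional stand-in for the path parameter. The sketch below is not the paper's probe: the gate functions, the boundary location `BOUNDARY`, and the ramp `width` are hypothetical choices made only to show that a jump returns the grid ratio while continuous controls return about 1.

```python
def max_quotient(f, T, a=-1.0, b=1.0):
    """Largest |f(x1) - f(x0)| / |x1 - x0| over a uniform T-step grid on [a, b]."""
    dx = (b - a) / T
    best = 0.0
    for i in range(T):
        x0 = a + i * dx
        x1 = a + (i + 1) * dx
        best = max(best, abs(f(x1) - f(x0)) / (x1 - x0))
    return best

def refinement_growth(f, coarse=500, fine=8000):
    """hardG-style summary: max quotient on the fine grid over the coarse grid."""
    return max_quotient(f, fine) / max_quotient(f, coarse)

BOUNDARY = 0.1234  # hypothetical boundary location, chosen to sit off-grid

def hard_gate(x):
    # Order-0 jump: the output switches from 0 to 1 at the boundary.
    return 1.0 if x > BOUNDARY else 0.0

def soft_edge_gate(x, width=0.2):
    # Continuous stand-in for a soft-edge gate: a linear ramp across the boundary.
    return min(1.0, max(0.0, (x - BOUNDARY) / width + 0.5))

def tied_expert(x):
    # Tied control: both sides of the boundary compute the same smooth map.
    return 0.3 * x

hard_g = refinement_growth(hard_gate)
soft_g = refinement_growth(soft_edge_gate)
tied_g = refinement_growth(tied_expert)
print(round(hard_g, 3), round(soft_g, 3), round(tied_g, 3))
```

Changing the jump height in `hard_gate` from 1.0 to any other nonzero value leaves `hard_g` at the grid ratio, which is why the number carries no severity information.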

A synthetic block with an exact, known $C^0$ jump nails the law down: the hard difference quotient scales as $T^{1.00}$ (fitted exponent 1.0005, $R^2 = 1.000$ across resolutions $250 \to 16000$), while the controls stay flat. That is the known-answer test: it confirms the released-weight 16× is the order-0 signature, not a probe artifact, and explains why it is pinned across layers, models, and $k$.

Continuity-signature scaling on a synthetic block with an exact, known C0 jump (median over 48 boundary paths). The hard difference quotient scales as T^1.00 (fitted exponent 1.0005, R-squared 1.000 over resolutions 250 to 16000), so the refinement growth equals the grid ratio by construction; soft-edge and tied controls stay flat at exponent ~0.
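
The exponent fit itself is a log-log least-squares slope. Here is a minimal sketch on a one-dimensional stand-in; the doubling schedule in `RESOLUTIONS` is my assumption (the source only gives the endpoints 250 and 16000), and `jump` and `smooth` are hypothetical test functions, not the synthetic block from the paper.

```python
import math

def max_quotient(f, T, a=-1.0, b=1.0):
    """Largest difference quotient of f over a uniform T-step grid on [a, b]."""
    dx = (b - a) / T
    return max(
        abs(f(a + (i + 1) * dx) - f(a + i * dx)) / dx
        for i in range(T)
    )

def fit_exponent(f, resolutions):
    """Least-squares slope of log(max quotient) against log(T)."""
    xs = [math.log(T) for T in resolutions]
    ys = [math.log(max_quotient(f, T)) for T in resolutions]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

RESOLUTIONS = [250, 500, 1000, 2000, 4000, 8000, 16000]  # assumed schedule

def jump(x):
    # Exact order-0 jump at an off-grid location.
    return 1.0 if x > 0.1234 else 0.0

def smooth(x):
    # Continuous control with a steep but finite slope.
    return math.tanh((x - 0.1234) / 0.05)

exp_jump = fit_exponent(jump, RESOLUTIONS)
exp_smooth = fit_exponent(smooth, RESOLUTIONS)
print(round(exp_jump, 4), round(exp_smooth, 4))
```

The jump should fit an exponent of essentially 1, and the smooth control should sit near 0 once the grid resolves its slope.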

Three core measurements come out of this:

  • M1 (boundary prevalence): the distribution of the margin $g_{(k)} - g_{(k+1)}$ and the fraction of tokens sitting near the boundary.
  • M2 (expert cliff): the normalized $\lVert E_{(k)} - E_{(k+1)}\rVert$, plus the cosine of the swapped pair. cos ≈ 1 means redundant (no real tear); cos ≈ 0 means non-redundant, at the $\sqrt{2}/2$ baseline; cos < 0 means genuinely specialized. This separates "outsized specialization" from "merely non-redundant."
  • M3 (continuity signature): the scaling exponent above (≈1 for a jump, ≈0 for a continuous map), summarized as hardG, with its two controls.
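
M1 and M2 can be sketched in a few lines. One loud assumption: I normalize the cliff by $\lVert a\rVert + \lVert b\rVert$, which is the choice that puts equal-norm orthogonal vectors at the $\sqrt{2}/2$ baseline; the paper's exact normalization may differ. The 0.05 near-boundary threshold is the one used in the cross-model table later in this post, and the function names and sample logits are hypothetical.

```python
import math

def margin(logits, k):
    """M1 margin: g_(k) - g_(k+1), the gap at the active-set boundary."""
    ranked = sorted(logits, reverse=True)
    return ranked[k - 1] - ranked[k]

def near_boundary_fraction(margins, threshold=0.05):
    """Fraction of tokens whose margin falls below the threshold."""
    return sum(m < threshold for m in margins) / len(margins)

def cliff_and_cosine(a, b):
    """M2-style cliff ||a - b|| / (||a|| + ||b||) and cos(a, b).

    The (||a|| + ||b||) normalization is an assumption of this sketch.
    """
    norm_a = math.hypot(*a)
    norm_b = math.hypot(*b)
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    return math.dist(a, b) / (norm_a + norm_b), cos

# Three reference cases for the swapped pair of expert outputs.
redundant = cliff_and_cosine([1.0, 0.0], [1.0, 0.0])   # same output, no tear
unrelated = cliff_and_cosine([1.0, 0.0], [0.0, 1.0])   # orthogonal outputs
opposed = cliff_and_cosine([1.0, 0.0], [-1.0, 0.0])    # anti-aligned outputs

# Hypothetical router logits for two tokens, with k = 2.
token_logits = [[3.0, 1.0, 0.98, -2.0], [2.0, 1.5, 0.2, 0.1]]
margins = [margin(g, k=2) for g in token_logits]
frac = near_boundary_fraction(margins)
print(redundant, unrelated, opposed, margins, frac)
```

Under this normalization the cliff runs from 0 (redundant) through $\sqrt{2}/2 \approx 0.707$ (unrelated) to 1 (opposed), so a measured value near 0.70 with cosine near 0 reads as non-redundant but not outsized.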

Severity, then, is not in the continuity signature. It is carried by the per-block jump and M2 — keep that split in mind for everything below.

The result: released routers are torn at every layer

The tear is not synthetic. On OLMoE-1B-7B (16 layers, 64 experts, $k=8$), the continuity signature is a genuine order-0 jump at every layer: refinement growth hardG $\approx 16\times$ (scaling exponent 1.00, range 15.93–16.01), with the difference quotient diverging at the grid-refinement rate, while both controls stay pinned at exponent $\approx 0$ ($\approx 1.0\times$). That's the headline figure at the top of this post. As established above, the 16× is the grid ratio any true jump reaches, not a severity. The severity is two other numbers: the per-block output jump $\approx 0.239$, about 24% of the block-output norm (a per-block geometric quantity, not a change in the model's output, logits, or task accuracy), and the expert cliff M2 $\approx 0.70$, cosine $\approx 0.025$, right at the $\sqrt{2}/2$ unrelated-vector baseline (non-redundant, but not outsized specialization). Near-boundary fraction is $\approx 100\%$ (median margin $\approx 1.5\times 10^{-3}$).

Table 1 — OLMoE-1B-7B per-layer diagnostic (24 texts). hardG is the refinement growth (max@8000 / max@500); for an order-0 jump it equals the grid ratio 16, i.e. exponent ≈1.00 — its near-constancy is the protocol's, not the model's. The block-jump column is the per-block output jump (fraction of block-output norm). hardG, block jump, and the two controls are medians over 8 boundary-crossing refinement paths; margin, M2, cos are per-token medians. Across layers: median hardG 15.99× (exponent 1.00), M2 0.704, cos 0.025; mean block jump 0.239; soft/tied controls ≈1.00.

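| Layer | margin | M2 cliff | cos | hardG | block jump | soft ctl | tied ctl |
|---|---|---|---|---|---|---|---|
| 0 | 0.0012 | 0.694 | 0.056 | 16.00 | 0.238 | 1.005 | 1.001 |
| 1 | 0.0009 | 0.701 | 0.033 | 15.98 | 0.309 | 1.003 | 1.002 |
| 2 | 0.0010 | 0.702 | 0.028 | 15.99 | 0.246 | 1.004 | 1.001 |
| 3 | 0.0011 | 0.704 | 0.022 | 16.00 | 0.248 | 1.004 | 1.002 |
| 4 | 0.0011 | 0.703 | 0.025 | 15.99 | 0.282 | 1.003 | 1.002 |
| 5 | 0.0012 | 0.702 | 0.030 | 16.00 | 0.288 | 1.004 | 1.002 |
| 6 | 0.0011 | 0.704 | 0.023 | 15.99 | 0.284 | 1.002 | 1.003 |
| 7 | 0.0014 | 0.704 | 0.026 | 16.01 | 0.233 | 1.003 | 1.002 |
| 8 | 0.0015 | 0.704 | 0.025 | 16.00 | 0.249 | 1.003 | 1.002 |
| 9 | 0.0019 | 0.703 | 0.026 | 16.01 | 0.266 | 1.003 | 1.002 |
| 10 | 0.0019 | 0.708 | 0.015 | 15.99 | 0.212 | 1.003 | 1.002 |
| 11 | 0.0021 | 0.709 | 0.012 | 15.98 | 0.244 | 1.003 | 1.002 |
| 12 | 0.0026 | 0.707 | 0.021 | 15.97 | 0.226 | 1.003 | 1.002 |
| 13 | 0.0023 | 0.709 | 0.015 | 15.99 | 0.157 | 1.003 | 1.002 |
| 14 | 0.0023 | 0.703 | 0.035 | 15.96 | 0.222 | 1.003 | 1.002 |
| 15 | 0.0028 | 0.600 | 0.301 | 15.93 | 0.123 | 1.003 | 1.002 |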
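
The across-layer summaries quoted in the Table 1 caption can be rechecked from the per-layer rows with the standard library. This is only a transcription check on the rounded table entries, not a rerun of the diagnostic.

```python
from statistics import mean, median

# Per-layer values transcribed from Table 1 (layers 0..15).
hard_g = [16.00, 15.98, 15.99, 16.00, 15.99, 16.00, 15.99, 16.01,
          16.00, 16.01, 15.99, 15.98, 15.97, 15.99, 15.96, 15.93]
block_jump = [0.238, 0.309, 0.246, 0.248, 0.282, 0.288, 0.284, 0.233,
              0.249, 0.266, 0.212, 0.244, 0.226, 0.157, 0.222, 0.123]
m2_cliff = [0.694, 0.701, 0.702, 0.704, 0.703, 0.702, 0.704, 0.704,
            0.704, 0.703, 0.708, 0.709, 0.707, 0.709, 0.703, 0.600]
cos_pair = [0.056, 0.033, 0.028, 0.022, 0.025, 0.030, 0.023, 0.026,
            0.025, 0.026, 0.015, 0.012, 0.021, 0.015, 0.035, 0.301]

summary = {
    "median hardG": median(hard_g),
    "hardG range": (min(hard_g), max(hard_g)),
    "mean block jump": mean(block_jump),
    "median M2": median(m2_cliff),
    "median cos": median(cos_pair),
}
for name, value in summary.items():
    print(name, value)
```

Because the inputs are already rounded to three or four figures, the medians of M2 and cos land halfway between two table entries; they agree with the caption's 0.704 and 0.025 to within that rounding.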

Layer 15 (the last block) is the lone mild outlier — a lower cliff (M2 0.60) and higher cosine (0.30 → more expert redundancy) — yet its discontinuity (hardG 15.93×) is undiminished.

This is not an OLMoE quirk. Qwen1.5-MoE-A2.7B (24 layers, 60 routed experts, $k=4$) reproduces the same pattern:

| metric | OLMoE | Qwen1.5-MoE |
|---|---|---|
| hardG (continuity signature) | ≈16× all layers | 15.99× (15.96–16.02) |
| M2 expert cliff | ≈0.70 | 0.710 |
| $\cos(E_k, E_{k+1})$ | ≈0.025 | 0.010 |
| near-boundary fraction (margin < 0.05) | ≈100% | 98.2% |
| whole-block jump | ≈0.239 | 0.368 |
| tied / soft control | ≈1.0× | 1.002× / 1.003× |

Cross-model replication of the static tear metrics on OLMoE-1B-7B and Qwen1.5-MoE-A2.7B: hardG about 16x and the M2 cliff about 0.70 on both families, with near-boundary prevalence and whole-block jump also high in both.

Is the 16× just an artifact of $k=8$? No, and there are two reasons before you even run another experiment. (i) By construction, the refinement growth equals the grid ratio 8000/500 for any nonzero $C^0$ jump, regardless of $k$; there is no algebraic path from $k$ to 16. (ii) Qwen routes $k=4$ and lands on the same growth (15.99×, exponent 1.00) as OLMoE's $k=8$, which is exactly what you'd expect if 16 is the protocol's grid ratio, and not what you'd expect if it were set by $k$. (A direct $k \in \{1,2,4,8\}$ sweep would settle it outright; I leave that as future strengthening.)
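
Reason (i) can be written out in two lines. The symbols $L$ (path length), $J$ (jump norm), $M$ (Lipschitz constant away from the jump), and $Q_T$ (max difference quotient at resolution $T$) are my notation for this sketch, and the Lipschitz assumption is mine as well.

```latex
% Path of length L sampled in T uniform steps, so \lVert \Delta x \rVert = L/T.
% Assume y has one jump of norm J > 0 on the path and is M-Lipschitz elsewhere.
\begin{align*}
Q_T &= \max_i \frac{\lVert \Delta y_i \rVert}{\lVert \Delta x_i \rVert}
     = \frac{J + O(ML/T)}{L/T}
     = \frac{J\,T}{L} + O(M), \\
\frac{Q_{T_2}}{Q_{T_1}}
    &= \frac{T_2}{T_1}\,\Bigl(1 + O\bigl(\tfrac{ML}{J\,T_1}\bigr)\Bigr)
     \;\approx\; \frac{T_2}{T_1},
\qquad \frac{8000}{500} = 16 .
\end{align*}
```

The jump norm $J$ cancels in the ratio and $k$ never appears, so under these assumptions the growth can only report the grid ratio.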

The one-line summary: training does not sew up the seam. Released routers carry the order-0 discontinuity (exponent ≈1) at every layer — with the per-block jump real at ≈24% and the cliff M2 ≈ 0.70 sitting at the unrelated-vector baseline — on both families.

The tear is directional — and that's the surprising part

Here is where the naive intuition fails. The obvious robustness test is: perturb the hidden state randomly and see if removing the tear makes the block more stable. The answer is a null — and the null is informative.

Even at a perturbation large enough to flip 67.6% of tokens' top-$k$ sets, the hard-vs-soft block jump differs by only 2.6% (0.581 vs 0.566). Killing the tear barely changes the block's response to random noise. The null isn't that experts never flip (at this magnitude 67.6% of tokens' top-$k$ sets do); it's that random flips add almost no discontinuous excess on top of the smooth response the soft gate already produces.

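The reason is geometric: random perturbations are almost always tangent to the boundary, because a random unit direction in $\mathbb{R}^d$ has only about a $1/\sqrt{d}$ component along any fixed normal. Perturb along the raw-logit boundary normal instead, at $2\times$ the per-token distance-to-tear, and the picture inverts: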

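| OLMoE layer | normal flip | tangent flip | normal jump | tangent jump | normal ΔKL | tangent ΔKL |
|---|---|---|---|---|---|---|
| 0 | 0.915 | 0.009 | 0.098 | 0.0067 | 0.0063 | −0.0003 |
| 8 | 0.945 | 0.004 | 0.179 | 0.0111 | 0.0161 | 0.0050 |
| 15 | 0.944 | 0.004 | 0.084 | 0.0081 | 0.0017 | 0.0001 |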

The distance-to-tear is tiny (median 0.24–0.34% of $\lVert h\rVert$), so the model lives on the boundary. A boundary-normal nudge of under 1% relative magnitude flips the expert ($\approx 0.9$ probability) and produces an $O(0.1)$ per-block output jump (a fraction of the block-output norm, not the model output) with measurable downstream KL, while an equal-magnitude random/tangent nudge does next to nothing. The tear is exploitable, but only along a specific low-dimensional direction.

Directional fragility on OLMoE layers 0, 8, and 15. Boundary-normal perturbations flip the k/k+1 expert (about 0.9) and induce O(0.1) block jumps with measurable downstream KL; equal-magnitude tangent perturbations do almost nothing.
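
The normal-versus-tangent asymmetry shows up even on a random linear router. The sketch below is a toy under my own assumptions (Gaussian router weights and hidden states, `d=16`, 8 experts, `k=2`), not the OLMoE experiment: it steps $2\times$ the distance-to-tear along the boundary normal, then takes an equal-length step in a random direction with the normal component projected out, and checks whether the top-$k$ set changes.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def topk_set(W, h, k):
    logits = [dot(row, h) for row in W]
    return frozenset(sorted(range(len(W)), key=lambda e: -logits[e])[:k])

def one_trial(rng, d=16, n_experts=8, k=2):
    """Return (normal_flipped, tangent_flipped) for one random router and token."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_experts)]
    h = [rng.gauss(0.0, 1.0) for _ in range(d)]
    logits = [dot(row, h) for row in W]
    order = sorted(range(n_experts), key=lambda e: -logits[e])
    a, b = order[k - 1], order[k]                  # k-th and (k+1)-th experts
    n = [wa - wb for wa, wb in zip(W[a], W[b])]    # normal W_(k) - W_(k+1)
    n_norm = math.sqrt(dot(n, n))
    step = 2.0 * (logits[a] - logits[b]) / n_norm  # 2x the distance-to-tear

    # Normal nudge: step toward and across the hyperplane g_(k) = g_(k+1).
    h_normal = [hi - step * ni / n_norm for hi, ni in zip(h, n)]

    # Tangent nudge: random direction with its normal component projected out.
    r = [rng.gauss(0.0, 1.0) for _ in range(d)]
    coef = dot(r, n) / (n_norm ** 2)
    t = [ri - coef * ni for ri, ni in zip(r, n)]
    t_norm = math.sqrt(dot(t, t))
    h_tangent = [hi + step * ti / t_norm for hi, ti in zip(h, t)]

    base = topk_set(W, h, k)
    return topk_set(W, h_normal, k) != base, topk_set(W, h_tangent, k) != base

rng = random.Random(0)
results = [one_trial(rng) for _ in range(400)]
normal_rate = sum(flip_n for flip_n, _ in results) / len(results)
tangent_rate = sum(flip_t for _, flip_t in results) / len(results)
print(normal_rate, tangent_rate)
```

In this toy the normal step reverses the sign of $g_{(k)} - g_{(k+1)}$, so the active set has to change, while the tangent step of the same length only flips when some other in/out pair happens to be nearly tied. Real routers are messier, which is presumably why the measured normal flip rate is about 0.9 rather than 1.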

And you can't cheaply re-gate it away. Continuous re-gating removes the static tear (softG $\approx 1\times$), but applied to all layers at inference it raises perplexity 10.04 → 159.94 (15.9×) at the gentlest threshold. Single-layer re-gating is mild ($\approx 1.2\times$). So a practical mitigation has to be layer-targeted or, as DeepSeek-V4 does, folded into training rather than bolted on afterward.

Training: it decomposes, it doesn't detonate

This is the honest negative, and it's the part that keeps the metaphor from overreaching. To probe training I used a controlled from-scratch GPT-MoE probe at OLMoE-like geometry (E=64, k=8, ≈308M params) — not the 7B model's pretraining.

Mechanism (real, modest): a parameter-space difference-quotient probe finds hard routing rougher than a continuity-matched soft gate — 2.0× vs 1.4×. Directionally consistent with the synthetic 4.5×, but weaker as the geometry becomes realistic.

Outcome: natural training is spiky-but-convergent, and the tear does not self-heal. Over 8000 steps there are 196 spikes >0.3 (max 0.585) yet no divergence (final loss 3.27). M2 holds 0.711 → 0.690 and hardG stays $\approx 16\times$ throughout, while the operational whole-block jump collapses early (0.431 → ≈0.15) then sits in a noisy band. Training suppresses the tear's consequence, not its topology.

Controlled E=64/k=8 from-scratch training probe. The M2 expert cliff and the hardG continuity signature persist throughout training, while the operational whole-block jump collapses early then plateaus.

Push harder with a tear-magnitude dial across two seeds at the edge of stability:

| seed | tear_level | spikes >0.3 | max spike | diverged |
|---|---|---|---|---|
| 0 | 0.0 / 0.5 / 1.0 | 3 / 2 / 4 | 0.42 / 0.42 / 4.21 | no / no / no |
| 1 | 0.0 / 0.5 / 1.0 | 7 / 5 / 6 | 0.53 / 0.52 / 5.28 | no / no / no |

No run diverges. Spike count isn't even monotone in tear level. But full tear reproducibly seeds a rare severe-but-recovered spike (max ≈4–5 vs ≈0.5). The honest statement: the tear contributes optimization roughness and rare spike severity, but is not by itself sufficient for collapse at this scale — momentum and Adam absorb it. Scale-dependence stays open, and that's exactly where DeepSeek-V4's engineering becomes relevant.

Tear-level dial across two seeds (tear_level 0.0 / 0.5 / 1.0): spike count is non-monotone and no run diverges, but full tear reproducibly seeds a rare severe-but-recovered spike (max about 4 to 5 versus about 0.5).

Reading DeepSeek-V4's three defenses geometrically

This is where it closes back on the DeepSeek post. The key caution: a routing discontinuity is not the same thing as a loss spike. The discontinuity is the first of several separable factors:

  1. Routing discontinuity: directional, the geometric entry point (see the directional-tear section above).
  2. Expert-outlier magnitude: the jump scales with it; I measure the correlation between $\lVert h\rVert$ and the $k/(k{+}1)$-swap jump at +0.437 (high-norm tokens jump more).
  3. Temporal backbone/router mismatch: untested here (future work).
  4. Cross-layer propagation gain: downstream-KL / injected-jump $\approx 0.02\text{–}0.09$ (a non-expansive residual bounds it).
  5. Optimizer absorption: the training result above.

DeepSeek-V4's three interventions map cleanly onto factors 2–4 — they control the consequences, not the topology: SwiGLU clamping (factor 2), Anticipatory Routing (factor 3), and manifold-constrained hyper-connections (factor 4 — a residual map on the Birkhoff polytope, spectral norm ≤ 1, non-expansive).
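
The non-expansiveness claim for factor 4 is a general property of the Birkhoff polytope and is easy to check numerically. The sketch below is not DeepSeek-V4's mHC implementation: it builds a doubly stochastic matrix as a convex combination of random permutation matrices (the Birkhoff-von Neumann construction) and verifies that it never lengthens a vector. The function names and sizes are hypothetical.

```python
import math
import random

def random_doubly_stochastic(n, n_perms, rng):
    """Convex combination of n_perms random n-by-n permutation matrices."""
    weights = [rng.random() + 1e-3 for _ in range(n_perms)]
    total = sum(weights)
    M = [[0.0] * n for _ in range(n)]
    for w in weights:
        perm = list(range(n))
        rng.shuffle(perm)
        for i, j in enumerate(perm):
            M[i][j] += w / total
    return M

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

rng = random.Random(0)
M = random_doubly_stochastic(n=4, n_perms=6, rng=rng)

row_sums = [sum(row) for row in M]
col_sums = [sum(col) for col in zip(*M)]

# Empirical gain ||Mx|| / ||x|| over random probes; it should never exceed 1.
gains = []
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(4)]
    gains.append(math.hypot(*matvec(M, x)) / math.hypot(*x))
max_gain = max(gains)
print(row_sums, col_sums, max_gain)
```

The bound follows from the triangle inequality, since each permutation preserves the norm and the weights sum to 1. The all-ones vector is mapped to itself, so the gain bound of 1 is attained rather than strict.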

I can verify the first one directly. A SwiGLU-clamp sweep on released OLMoE leaves the topology flat — hardG 16.03 → 16.02, M2 0.705 → 0.701 — while reducing absolute amplitude (expert cliff 2.90 → 1.97, absolute hardJump 0.214 → 0.164); the relative jump stays scale-invariant (0.310 → 0.303). Clamp caps the jump's amplitude, not its existence — the DeepSeek decomposition, measured rather than asserted.

SwiGLU-clamp decomposition on released OLMoE. Tightening the clamp leaves the topology flat (hardG and M2 barely changed) while reducing the absolute expert cliff and absolute jump, at a hook-path perplexity cost.
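
To see the split between topology and amplitude at a glance, the before/after pairs quoted above can be turned into relative changes. This is arithmetic on the reported numbers only; the dictionary keys are my labels.

```python
# (before, after) pairs quoted in the clamp-sweep paragraph above.
sweep = {
    "hardG": (16.03, 16.02),
    "M2": (0.705, 0.701),
    "expert cliff (absolute)": (2.90, 1.97),
    "hardJump (absolute)": (0.214, 0.164),
    "relative jump": (0.310, 0.303),
}

# Relative change of each quantity across the sweep.
rel_change = {
    name: (after - before) / before
    for name, (before, after) in sweep.items()
}
for name, change in rel_change.items():
    print(f"{name:>24}: {change:+.1%}")
```

The topology metrics (hardG, M2) move by well under 1%, the absolute amplitudes drop by roughly 32% and 23%, and the relative jump moves by about 2%.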

In the DeepSeek post I drew this table, arguing V4's four mechanisms all do the same kind of thing in different positions — admit that weights and activations are geometric objects:

| Layer | Mechanism | Geometric move |
|---|---|---|
| Optimizer | Muon | Project updates onto the isometry group |
| Routing | Anticipatory Routing | Decide in the source-point geometry |
| Forward residual | mHC | Constrain residual to the Birkhoff polytope (non-expansive) |
| Activation | SwiGLU clamping | Bound curvature |

The measurement here says the geometry those mechanisms respect is genuinely there, in the shipped weights — a true order-0 discontinuity at every layer — whether or not the optimizer ever happened to trip over it during training.

What this is, and what it isn't

This is a measurement/analysis result, and I want to be precise about its boundaries:

  • It does not propose a new gate, a new $C^0/C^1$ taxonomy, or claim first discovery of the discontinuity. ReMoE, Puigcerver et al. (2022), and the spline-theory line own those.
  • What it adds: a diagnostic that runs on released weights with negative controls and known-answer tests; a cross-model characterization (OLMoE + Qwen); a directional inference result with a mitigation bound; and an honest training-time decomposition that refuses to overclaim.

The bottom line: the MoE routing tear is real, measurable, and cross-model. It is a genuine $C^0$ jump (the difference quotient diverges at scaling exponent ≈1 at every layer, controls flat at ≈0) that trained routers carry rather than sew up. It is severe in the per-block output jump (≈24% of the block-output norm), not in the diagnostic's 16× (that's the grid ratio, not a magnitude), and not in expert specialization (M2 sits at the unrelated-vector baseline). Its inference consequence is directional: random inputs miss it, boundary-normal inputs of under 1% magnitude hit it, and you can't re-gate it away post-hoc for free. Its training role is a decomposition, not a single cause, which is precisely why the engineering remedies cap amplitude rather than remove the tear.

The DeepSeek post argued that treating the network as a geometric object is moving from philosophical stance to engineering default. This one supplies the number that the stance was missing.


Caveats, stated plainly. The continuity signature certifies the order of the singularity (exponent ≈1 vs controls ≈0) and where it sits — it does not measure severity; its 16× growth is the grid ratio any genuine jump must produce, not a model property and not larger-is-worse. Every number here is a geometric quantity on hidden states: the ≈24% block jump and the boundary-normal fragility are not tied to end-to-end task accuracy (GSM8K, MMLU), and the jump is measured one layer at a time, so how tears compound across stacked blocks is untested. Beyond that: two model families and 24 short prompts; a small-scale 308M training probe (the spike-severity signal may be scale-dependent); a single-path hardJump trace (multi-path resampling would give error bars); clamp quality measured within a hook path; the temporal-mismatch factor named but not measured. Deep Manifold (Ma & Shi) is used as motivation only — external prior work, not my own framework. The numeric source of truth is a result-JSON set; every figure regenerates from it with a stdlib-only SVG script.

References

The full method and reproducibility appendices are in the paper PDF.

  • Ziteng Wang, Jun Zhu, Jianfei Chen. ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing. ICLR 2025. arXiv:2412.14711
  • Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby. From Sparse to Soft Mixtures of Experts. ICLR 2024. arXiv:2308.00951
  • André F. T. Martins, Ramón Fernandez Astudillo. From Softmax to Sparsemax. ICML 2016. arXiv:1602.02068
  • Ben Peters, Vlad Niculae, André F. T. Martins. Sparse Sequence-to-Sequence Models. ACL 2019. arXiv:1905.05702
  • Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme, Pranjal Awasthi, Srinadh Bhojanapalli. On the Adversarial Robustness of Mixture of Experts. NeurIPS 2022. arXiv:2210.10253
  • Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, Fuli Luo. Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers (R3). 2025. arXiv:2510.11370
  • Boris Hanin, David Rolnick. Complexity of Linear Regions in Deep Networks. ICML 2019. arXiv:1901.09021
  • Randall Balestriero, Richard Baraniuk. A Spline Theory of Deep Networks. ICML 2018. arXiv:1805.06576
  • Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. NeurIPS 2014. arXiv:1402.1869
  • William Fedus, Barret Zoph, Noam Shazeer. Switch Transformers. JMLR 2022. arXiv:2101.03961
  • Barret Zoph et al. ST-MoE: Designing Stable and Transferable Sparse Expert Models. 2022. arXiv:2202.08906
  • Dmitry Lepikhin et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR 2021. arXiv:2006.16668
  • DeepSeek-AI. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026. arXiv:2606.19348
  • Max Y. Ma, Gen-Hua Shi. Deep Manifold Part 1: Anatomy of Neural Network Manifold. 2024. arXiv:2409.17592
  • Max Y. Ma, Gen-Hua Shi. Deep Manifold Part 2: Neural Network Mathematics. 2025. arXiv:2512.06563
  • Niklas Muennighoff et al. OLMoE: Open Mixture-of-Experts Language Models. 2024. arXiv:2409.02060
  • Qwen Team. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters. 2024. Model blog
