This piece is adapted from a conversation in the Deep Manifold Dialogues — me, Yuan Ma, and Gen-hua Shi on "AI and mathematics" — built on Ma and Shi's Deep Manifold framework. It traces the "manifold prior" I set up earlier in The Four Realms of Neural Networks and DeepSeek V4 and Manifold Tearing back one more layer: where did this mathematics actually come from?
In 1991, Gen-hua Shi introduced the numerical manifold method, and it drew high praise from Chern. The conversation mentioned that Chern had also left behind a larger question:
Can piecewise-stacked manifolds be extended to arbitrarily complex regions?
It's a purely geometric question. But it was later answered once in each of two unrelated worlds — once in rock mechanics, by Shi's numerical manifold method, and once in artificial intelligence, by neural networks.
Stranger still: the two answers are the same one.
And most of the people who answered the second time, the engineers training large models, have no idea they're answering a geometer's question. This piece sets out to trace that line: how this mathematics traveled from Riemann all the way to neural networks, how it was handed down, and why it resurfaced in AI "by accident."
Before going further, let's set a yardstick. The whole conversation really circles three old dreams of mathematicians —
- Local and global: can you build global behavior out of small local pieces?
- Continuous and discontinuous: how does the mathematics of the continuous compute a discontinuous world full of cracks and faults?
- Forward and inverse problems: given the cause, find the effect — that's the forward problem; from the effect, infer the cause — that's the inverse problem.
These three tensions recur throughout: the manifold answers the first pair, the numerical manifold answers the second, and neural networks answer the third.
1. The manifold lineage: Riemann → Poincaré → Chern
Mathematics has a long hidden thread, and its theme is a single sentence: how to build the whole out of local parts.
Riemann was the first to make it concrete. Studying multivalued functions, he found that instead of forcing a self-conflicting function onto a single plane, you could spread it over a layered surface — each layer locally regular and differentiable, while the whole still accommodates branches and singularities. That's the seed of the manifold idea: the whole can be very complex, but locally it must be simple — simple enough to do calculus on.
Poincaré made it algebraic. With simplices, boundaries, and gluing, he turned "how local pieces assemble into a whole" into computable topology: a manifold is just a collection of local pieces glued together by compatible rules. The word covering gets its precise meaning starting with him.
In between, Borel and Lebesgue laid a piece of foundation — open covers, finite subcovers, compactness. It answered a crucial technical question: out of infinitely many local pieces, under what conditions can finitely many cover the whole? Without that foundation, "build the whole from local parts" is just intuition; with it, it becomes mathematics you can actually compute.
With Chern, this line closes into modern differential geometry. Looking back, mathematics performs a similar "generalization" every so often, and the rhythm is strikingly alike:
- Galois turned symmetry into a computable object — group theory, a generalization of number, after which "structure" itself became an object of study.
- Poincaré turned the shape of space into computable topology — Analysis Situs in 1895, a generalization of shape, founding algebraic topology.
- Chern pushed this geometry to the global scale — Chern classes, modern differential geometry (the 1940s), a generalization of function: using manifolds built on covering systems so that local geometric quantities assemble into global invariants, and connecting to the physics of the real world.
Every step uses the same move: first admit the whole is too complex to handle directly, then retreat to the local to find rules, and finally use the language of covering to reassemble the local back into the whole. What Chern did was make this geometric language powerful enough to carry differential equations and connect to real-world physics. There's a line of Jiang Zehan's that gets quoted again and again — through the window of differential equations, mathematicians see the light of the real world. Chern's step amounted to wiping that window completely clean.
2. The Chinese lineage: Jiang Lifu → Jiang Zehan, Chern → Boju Jiang, Gen-hua Shi
When this manifold/topology line reached China, at its root sits a name that often gets skipped: Jiang Lifu. Two people who would later rewrite Chinese geometry and topology came out of his tutelage — both Chern and Jiang Zehan were his students; and Boju Jiang, who later went far in fixed-point theory, was his son. These seemingly separate branches share a single source.
Jiang Zehan brought algebraic topology into Peking University, and brought up a whole lineage of people with it.
Two of those branches went far, and both landed on the same subject: fixed points.
One branch is Boju Jiang, who pursued pure Nielsen fixed-point theory — internationally recognized as representative work in the field.
The other branch is Gen-hua Shi. In 1963 he graduated from Peking University's mathematics department and stayed on as a graduate student under Jiang Zehan, focusing on algebraic topology and fixed-point theory. Together with Jiang Zehan and Boju Jiang he developed topological fixed-point theory, introducing the "Shi-type nest spaces" and the "Shi condition."
Worth noting: Chern also served on Gen-hua Shi's doctoral committee — the line loops back to Chern himself, and foreshadows the high praise the numerical manifold method would later earn from him.
Up to this point, Shi was still a standard topologist. What makes him unusual is where he went next — he didn't stay in pure mathematics; he aimed all this fixed-point and manifold skill at rock.
- From 1968 to 1977, he developed the stereographic projection method for rock-mass stability analysis.
- Around 1980, he systematically developed block theory, producing block cutting, simplex integration, and discontinuous deformation analysis (DDA). This part he did with Richard Goodman at Berkeley — work no one in rock mechanics can sidestep.
- In 1991, he founded the numerical manifold method (NMM), which won Chern's strong endorsement.
- In 2013, he systematically built up contact theory.
Jiang Zehan's line about "seeing reality through the window of differential equations" got split, in the next generation, into three jobs: Chern showed that differential equations can describe reality (that it's possible); Shing-Tung Yau developed how (the methods of geometric analysis); and Gen-hua Shi took on the hardest question — how to actually solve it inside an arbitrarily complex real region.
A topologist who studied fixed points turns around to compute whether a rock will collapse. It sounds like leaving mathematics, but really he took mathematics somewhere mathematicians usually don't go. Chern's question was whether piecewise-stacked manifolds can be extended to arbitrarily complex regions, and Shi answered it out on the engineering site.
3. One question, two worlds answering at once
Chern's insight compresses into one sentence: complex global behavior can be constructed from stacked, local, piecewise manifolds.
That sentence has been borne out twice.
The first time, in computational mechanics, answered by the numerical manifold method. A real rock mass is the most unreasonable research subject there is: discontinuous, cracked, faulted, prone to large deformation, and full of contact and compression. Classical analytical methods mostly fail against this kind of geometry. NMM's answer: approximate it with stacked, piecewise-smooth manifolds, and you can handle any physical region no matter how complex. Chern's geometric intuition was proven right in engineering.
The second time, in AI, answered by neural networks. Neural networks are also stacked, piecewise, locally smooth manifolds — except no one designed them that way; they learned their way there from data. The same geometric intuition grew twice over, in two worlds unaware of each other.
The difference is one word: Shi chose his manifolds; neural networks learn theirs. Same structure, different origin.
The next few sections make "same structure" concrete — exactly where the sameness lies.
4. The numerical manifold: first, its place in the family of numerical methods
Shi answered Chern's question with a method called the numerical manifold method (NMM). To understand it, put it back into the family of numerical computation.
To compute a continuum in engineering, the common approaches divide up the work like this:
- Finite element: cut the region into small elements, each smooth inside — piece-by-piece smooth.
- Finite difference: lay down a uniform grid and take differences at the grid points.
- Analytical solution: a closed-form solution in higher-order functions — the cleanest, but it only works on regular regions.
- DDA (discontinuous deformation analysis): Shi's own earlier method, built specifically for the discontinuous contact between blocks.
- Numerical manifold: piecewise-smooth + covering. It doesn't require the pieces to fit together seamlessly; instead it lets a set of mutually overlapping local pieces each carry a series expansion, then organizes them with a covering.
Piecewise-smooth plus covering is exactly the step from local to global — and that's where the name "numerical manifold" comes from.
The most elegant move in the numerical manifold method is that it splits the covering into two layers.
The mathematical cover: chosen by the analyst. It's a continuous, regular, smooth mathematical skeleton — a set of mutually overlapping local pieces, each carrying a series expansion. It has nothing to do with what the rock you're studying looks like; it's the idealized continuous structure you choose to impose.
The physical cover: given by the material. A real object comes with its own boundaries — the material's edges, faults, cracks, cut lines. Cut along those and you get physical elements. This is the imposed, observable structure.
The key: an element = the intersection of the two. The element that actually enters the computation is where a mathematical piece and a physical region meet.
An analogy: the mathematical cover is like a translucent, regular mesh membrane that you lay down yourself, continuous and smooth; the physical cover is the real, cracked rock beneath the membrane. Lay the membrane over it, and the patches where a membrane cell overlaps a piece of rock are the elements you actually compute. The rock cracks; the membrane doesn't. The crack's information enters the computation through the intersection without tearing the smooth mathematical membrane.
That's why NMM can go from local to global, handle discontinuity, and compute forward and inverse problems alike: it separates "where you approximate" (mathematical) from "where the material actually is" (physical), then lets them intersect.
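The two-cover idea can be sketched in one dimension. This is a toy that follows the description above (element = mathematical piece intersected with physical region), not Shi's full NMM formulation; the interval values and the crack position are made up for illustration.

```python
def intersect(a, b):
    """Intersection of two open intervals (lo, hi); None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

# Mathematical cover: overlapping, regular pieces chosen by the analyst (hypothetical values).
math_cover = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]

# Physical cover: where the material actually is. A crack at x = 2.5 splits the
# bar occupying [0.5, 3.5] into two physical regions (hypothetical values).
physical_regions = [(0.5, 2.5), (2.5, 3.5)]

# Elements: every non-empty intersection of a mathematical piece with a physical region.
elements = []
for i, piece in enumerate(math_cover):
    for j, region in enumerate(physical_regions):
        overlap = intersect(piece, region)
        if overlap is not None:
            elements.append((i, j, overlap))

for i, j, overlap in elements:
    print(f"math piece {i} x physical region {j} -> element {overlap}")
```

Mathematical piece 1 spans the crack, so it contributes two separate elements, one on each side. The crack shows up in the element list while `math_cover` itself is never modified.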
5. Simplex integration, and distance theory
NMM has another move that captures Shi's style well: simplex integration.
Classical integration comes with a rich one-dimensional table of integrals — polynomials, sine and cosine, exponentials; plug into a formula and integrate. The trouble was never the integrand; the hard part is the shape: real regions are irregular polygons full of holes and cracks, and the standard approach can only map them back onto a regular reference region first, then integrate.
Shi's approach faces the irregular geometry head-on: simplex-decompose the polygon into triangles (polygon = a sum of simplices), then do "exact integration from the vertices" — using only the vertex coordinates, you can compute area, centroid, second moments, and even higher moments exactly, with no reference mapping and no numerical quadrature.
In a sentence: simplex decomposition turns geometric irregularity into algebra on the vertices. The difficulty was never the integrand but the shape — and Shi reduced the shape problem from a geometric one to an arithmetic one.
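A minimal sketch of the "integrate from the vertices" idea in two dimensions, using the standard closed-form polynomial integrals over a triangle. It illustrates the principle only and is not taken from Shi's own formulas; the function name and the L-shaped test polygon are hypothetical. Rational arithmetic is used so the results are exact rather than approximate.

```python
from fractions import Fraction

def polygon_moments(vertices):
    """Area, centroid, and second moments of a simple polygon from its vertices alone.

    The polygon is split into signed triangles (origin, P_i, P_{i+1}); each
    triangle's polynomial integrals have closed forms in the vertex coordinates.
    Vertices must be listed counter-clockwise.
    """
    pts = [(Fraction(x), Fraction(y)) for x, y in vertices]
    area = sx = sy = sxx = syy = Fraction(0)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        a = (x0 * y1 - x1 * y0) / 2                     # signed area of triangle (O, P_i, P_{i+1})
        area += a
        sx += a * (x0 + x1) / 3                         # integral of x over the triangle
        sy += a * (y0 + y1) / 3                         # integral of y over the triangle
        sxx += a * (x0 * x0 + x0 * x1 + x1 * x1) / 6    # integral of x^2
        syy += a * (y0 * y0 + y0 * y1 + y1 * y1) / 6    # integral of y^2
    return area, (sx / area, sy / area), sxx, syy

# An L-shaped (non-convex) region, a hypothetical test shape.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
area, centroid, int_x2, int_y2 = polygon_moments(l_shape)
print(area, centroid, int_x2, int_y2)
```

No reference mapping and no quadrature points appear anywhere: the non-convex shape is handled by signed triangle areas that cancel outside the polygon.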
From equalities to inequalities. One more thread is worth a mention. Classical analysis deals with equalities and continuity; but in the real world, a great many relations are inequalities — contact, distance, nearness. Shi's contact theory (systematically built in 2013) studies exactly the "did they touch, how far apart" questions between blocks; it belongs to distance (contact) theory: the ε-covers of a metric space, concerned with whether two points are close enough.
This connects straight to AI: similarity, retrieval, and contrastive learning are all computing distance — whether two vectors are close enough, whether they should be grouped into one class. The things AI uses every day land, mathematically, squarely in Shi's distance theory.
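The "close enough" inequality can be written down directly. The sketch below assumes Euclidean distance and made-up three-dimensional embeddings; real retrieval systems often use cosine similarity and far higher dimensions, so treat the names and numbers as illustrative only.

```python
import math

def within_eps(u, v, eps):
    """The inequality behind contact and retrieval alike: is d(u, v) <= eps?"""
    return math.dist(u, v) <= eps

def is_eps_cover(points, centers, eps):
    """True if every point lies within eps of at least one center."""
    return all(any(within_eps(p, c, eps) for c in centers) for p in points)

# Made-up embeddings, purely for illustration.
query = [0.9, 0.1, 0.0]
doc_near = [0.8, 0.2, 0.1]
doc_far = [0.0, 0.1, 0.9]

print(within_eps(query, doc_near, eps=0.5))   # nearby vectors pass the test
print(within_eps(query, doc_far, eps=0.5))    # distant vectors fail it
print(is_eps_cover([query, doc_near, doc_far], centers=[query, doc_far], eps=0.5))
```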
6. The same structure, showing up in neural networks
Now carry these two kinds of cover over to neural networks, and they line up exactly.
- The physical cover corresponds to what's observable in the network: activations, neurons/units, the structure of the data flow. Like the real cracked rock — you can measure it.
- The mathematical cover corresponds to what's hidden in the network: the manifold structure it learned from data, the skeleton at the symbolic and computational level. Like the smooth mathematical membrane — except this time no one drew it; the network learned it on its own.
- And computation happens where the two meet. Same principle as NMM.
Take a single token: it doesn't live on any one layer but in a stack of manifold layers; the network's first FFN projection is exactly one local covering. A token is represented separately across multiple manifold layers, and its overall behavior is assembled from these local pieces: Chern's sentence again.
Put in formula terms: a token's feature vector first passes through one local projection, and the layers that follow stack it into depth.
What the earlier step learns is the mathematical cover (a local learnable manifold); the activations stacked up afterward are the physical cover (observable). This is precisely the numerical manifold's step where "a local cover, through a single projection, becomes a stacked learnable manifold."
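To make "piecewise, locally smooth" tangible, here is a small sketch assuming a standard ReLU feed-forward projection, h = ReLU(W1 x + b1), followed by a linear readout. That form is an assumption of this example, and the weights are made up. The on/off pattern of the activations is the observable side; within any region where that pattern is constant, the network is exactly affine.

```python
# Made-up weights for a 2 -> 3 -> 1 network (hypothetical values).
W1 = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, -0.5, 0.25]
W2 = [1.0, -2.0, 0.5]
b2 = 0.1

def relu_layer(W, b, x):
    """One FFN projection: returns post-ReLU activations and the on/off pattern."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [max(0.0, p) for p in pre], tuple(p > 0 for p in pre)

def forward(x):
    """Return the scalar output and the activation pattern that selected the local piece."""
    h, pattern = relu_layer(W1, b1, x)
    y = sum(w * hi for w, hi in zip(W2, h)) + b2
    return y, pattern

x_a, x_b = [1.0, 0.2], [1.2, 0.3]
x_mid = [(p + q) / 2 for p, q in zip(x_a, x_b)]
x_c = [-1.0, 0.5]

y_a, pat_a = forward(x_a)
y_b, pat_b = forward(x_b)
y_mid, pat_mid = forward(x_mid)
y_c, pat_c = forward(x_c)

# Same pattern means same local affine piece, so the midpoint output is the average.
print(pat_a == pat_b == pat_mid, abs(y_mid - (y_a + y_b) / 2) < 1e-12)
# A different pattern means the input landed on a different piece.
print(pat_c != pat_a)
```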
| | Mathematical cover | Physical cover | Computation happens at |
|---|---|---|---|
| Numerical manifold (rock) | Analyst-chosen continuous smooth cover | Material-given boundaries, faults, cracks | Their intersection |
| Neural network (AI) | Hidden, learned manifold structure | Observable activations, units, data flow | Their intersection |
With no one telling it to, the neural network rebuilt Shi's covering on its own. It wasn't designed into a numerical manifold; it learned its way into one.
And what it learns is relations, not objects. One layer deeper: what does a neural network actually learn from data? Not the data itself, but the computable relations among the data.
- Learning doesn't memorize isolated samples; it learns the mappings, differences, and structure between samples.
- What classification rests on isn't the objects themselves but the relations between them — relations matter more than objects.
- A neural network's activations are propertyless: an activation carries no intrinsic property of "what it is"; it's just a coordinate in some network of relations.
- So what separates the classes? Counting — turn differences into numbers, and the differences become separable.
In the language of category theory, a neural network follows a "relations before objects" first principle: relations and differences come first, and objects are merely nodes on the network of relations. This also explains why one and the same network can put a word and an image into the same space — as objects they're worlds apart, but in the relational structure they can occupy the same position. We'll meet this again in the next section on fixed points.
7. Fixed points: Shi's topology and large-model training are one and the same
Don't forget Shi's foundation is fixed-point theory. This section is the connection I most want to get across in this piece.
What's a fixed point? Given a map $f$, a point $x^*$ satisfying $f(x^*) = x^*$ (mapped over, it's still itself) is a fixed point. Fixed-point theory asks three classic questions:
- Existence: is there a solution at all?
- Uniqueness: if there is, is there only one?
- Stability: perturb the solution a little — does it come back?
These three are exactly the questions large-model training faces every day, only no one frames them in this vocabulary. What Shi's lineage studied is precisely the theory of fixed-point classes (the Theory of Fixed Point Classes) — not just whether a solution exists, but characterizing "the structure of the solutions."
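The three questions can be seen in the simplest textbook case, iterating the cosine map. This is a generic illustration of fixed-point iteration, not anything specific to Shi's fixed-point classes; the helper name and tolerances are choices made for this sketch.

```python
import math

def iterate_to_fixed_point(f, x0, tol=1e-12, max_iter=10_000):
    """Repeat x <- f(x) until the residual |f(x) - x| drops below tol."""
    x = x0
    for step in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, step + 1
        x = x_next
    raise RuntimeError("no convergence within max_iter")

# Existence and uniqueness: cos maps [0, 1] into itself and |sin x| < 1 there,
# so the contraction mapping theorem gives exactly one fixed point in [0, 1].
x_star, steps = iterate_to_fixed_point(math.cos, 0.5)

# Stability: start from a perturbed point and check that iteration returns to x_star.
x_back, _ = iterate_to_fixed_point(math.cos, x_star + 0.3)

print(x_star, abs(math.cos(x_star) - x_star), abs(x_back - x_star))
```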
A neural network's learning is, in the end, finding an (approximate) fixed point in high-dimensional space. Write it as a residual: the gap $f(x) - x$ between a point's image under the map and the point itself.
When the residual goes to zero, the system "stops" at the fixed point: it no longer changes, and learning has converged. Better still, different modalities share the same numerical form: the word "dog" and an image of a dog get embedded into the same structure, their difference folded into one and the same language of fixed points, which is just another way of saying "relations matter more than objects" from the last section.
Put it in optimizable form by wrapping the residual in a Lagrangian.
Training is finding the critical point where the Lagrangian's gradient vanishes; that is, the fixed point. The residual is the elegant mathematical characterization; the Lagrangian form makes it computable.
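One standard way to make this correspondence concrete is to treat a single training step as a map on the parameters. The sketch assumes plain gradient descent with step size $\eta$ on an objective $\mathcal{L}(\theta)$, which plays the role of the Lagrangian here; the symbols $T$, $r$, $\theta$, and $\eta$ are illustrative and not taken from the Deep Manifold papers.

```latex
\begin{aligned}
T(\theta) &= \theta - \eta\,\nabla_\theta \mathcal{L}(\theta), \qquad \eta > 0,\\
r(\theta) &= T(\theta) - \theta = -\eta\,\nabla_\theta \mathcal{L}(\theta),\\
r(\theta^*) = 0 \;&\Longleftrightarrow\; T(\theta^*) = \theta^* \;\Longleftrightarrow\; \nabla_\theta \mathcal{L}(\theta^*) = 0.
\end{aligned}
```

Under these assumptions the residual of the update map is just the scaled negative gradient, so "residual goes to zero," "fixed point of the training step," and "critical point of the objective" are three readings of the same condition. The equivalence says nothing about whether that critical point is a minimum.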
One level up, different training paradigms correspond to different operations on the fixed-point classes:
- Pretraining: forming the fixed-point classes — growing many stable attractors out of chaos.
- Instruction tuning: aligning those classes — singling out the target classes and reinforcing them.
- Reinforcement learning: driving and perturbing those classes — pushing many of them into motion.
So the fixed-point classes Shi studied in topology and what large-model training does are not "alike" — they're the same mathematics. One characterizes the structure of solutions in an abstract space; the other searches for a numerical fixed point in high-dimensional parameters — the same thing. That's also the one-line summary of Ma and Shi's Deep Manifold framework: a neural network is a learnable numerical computation grounded in fixed-point theory.
8. Learning is an inverse problem: a second hidden thread
The manifold/covering thread above runs from Riemann all the way to neural networks. There's actually a second hidden thread running in parallel to the same place — the inverse problem.
The forward problem is: given the cause, find the effect; the inverse problem reverses it — from observed effects, infer the hidden structure. Learning is an inverse problem — given a pile of data (effects), find the model that could generate them (the cause).
This line has its own lineage too:
- ~1902, Hadamard: proposed the "well-posed / ill-posed" criterion — existence, uniqueness, stability.
- 1917, Radon: the Radon transform, which later became the mathematical foundation of CT imaging — reconstructing structure from projections, the most typical inverse problem.
- 1943 / 1963, Tikhonov: regularization, making ill-posed inverse problems numerically solvable.
- 1951, Gelfand & Levitan: the spectral theory of inverse problems — recovering an operator from spectral data.
- 1950s–60s, Faddeev: pushing inverse problems into high dimensions.
This lineage is rarely mentioned in AI circles today, but large models stand right on its extension: inferring a high-dimensional hidden structure from massive observations. The next section, on "why AI is unreliable," rests on Hadamard's "ill-posed" criterion from this very line.
9. This also explains why AI is "unreliable"
Seeing a neural network as an inverse problem, as fixed-point finding, has a direct byproduct: it explains why AI is so unstable.
Mathematics has a criterion called well-posedness: for a problem to compute "cleanly," it must satisfy existence, uniqueness, and stability at once. Miss even one, and it's automatically classed as ill-posed.
And learning is often an inverse problem — inferring hidden structure from observed effects — and such problems tend to be under-constrained, prone to trouble with uniqueness and stability; that is, they lean "ill-posed." The "jagged" quality AI shows (a term Joshua S. Gans used in 2026) — dazzling on one task, then face-planting inexplicably on one that looks simpler — is more or less what that under-constraint looks like on the surface. We said earlier that activations are "propertyless"; ill-posedness and propertylessness are the two flaws to watch most closely when you look at AI through "the three dreams of mathematicians."
You see this most directly in a large model's loss curve. Early in training, loss drops fast; then it enters a long, slow descent that jitters up and down (zig-zag) — it doesn't converge as a clean curve all the way down, but bounces back and forth near the fixed point.
This is exactly what Shi and his colleagues call "a thirty-seven-year topic": piecewise smoothness, covering, open-closed iteration, convergence. An iterative system is searching for a fixed point, and the search was never smooth to begin with; and when the inverse problem is also ill-posed (under-constrained), the jitter gets more pronounced. That's the root of why a large model's loss curve isn't stable enough.
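A toy version of that bounce, not a model of real large-model training: fit a single parameter to two made-up samples with a constant step size. Updating on one sample at a time drops quickly at first and then orbits the fixed point forever, while the averaged gradient settles onto it. All values are assumptions of this sketch.

```python
targets = [-1.0, 1.0]   # two made-up training samples; the full-batch optimum is theta = 0
eta = 0.5               # constant step size (an assumption of this sketch)

def per_sample_descent(theta, n_steps):
    """Cycle through the samples one at a time: theta <- theta - eta * (theta - t)."""
    history = []
    for step in range(n_steps):
        t = targets[step % len(targets)]
        theta -= eta * (theta - t)
        history.append(theta)
    return history

def full_batch_descent(theta, n_steps):
    """Use the gradient averaged over all samples, which here equals theta itself."""
    history = []
    for _ in range(n_steps):
        grad = sum(theta - t for t in targets) / len(targets)
        theta -= eta * grad
        history.append(theta)
    return history

zigzag = per_sample_descent(5.0, 40)
smooth = full_batch_descent(5.0, 40)

# Early steps fall fast; late steps alternate in sign around the fixed point at 0.
print([round(v, 4) for v in zigzag[:4]], [round(v, 4) for v in zigzag[-4:]])
print(abs(smooth[-1]) < 1e-9)
```

The per-sample run ends up alternating between roughly +1/3 and -1/3: close to the fixed point, never on it.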
Inverse-problem mathematics had a remedy long ago: regularization — adding extra constraints, priors, and boundaries to an underdetermined problem to pull it back into a solvable, stable range. The kit we give large models today is the engineering version of the same remedy: scaffolding to steady multi-step reasoning, retrieval and tools to anchor it to external facts, and a harness to fence in the boundaries of the output. In the end, it's all adding external constraints to an under-constrained core so it runs steadier and more reliably.
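The classical form of that remedy is Tikhonov regularization. The sketch below uses a made-up, nearly singular 2x2 system to show the stability failure and the fix; the matrix, the noise level, and the value of `lam` are assumptions chosen for illustration.

```python
import math

def solve2(M, v):
    """Solve a 2x2 linear system M x = v by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def tikhonov(A, b, lam):
    """Minimize |A x - b|^2 + lam * |x|^2 via (A^T A + lam I) x = A^T b."""
    AtA = [[sum(A[k][i] * A[k][j] for k in range(2)) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    Atb = [sum(A[k][i] * b[k] for k in range(2)) for i in range(2)]
    return solve2(AtA, Atb)

# Two almost identical equations: the inverse problem is nearly under-determined.
A = [[1.0, 1.0], [1.0, 1.0001]]
b_clean = [2.0, 2.0001]
b_noisy = [2.0, 2.0002]   # the observation moved by only 1e-4

# How far does the recovered cause move when the observed effect barely changes?
naive_shift = math.dist(solve2(A, b_clean), solve2(A, b_noisy))
reg_shift = math.dist(tikhonov(A, b_clean, 1e-3), tikhonov(A, b_noisy, 1e-3))
print(naive_shift, reg_shift)
```

The unregularized answer jumps by more than 1 for a 1e-4 change in the data, while the regularized one barely moves. The price is a small bias toward small-norm solutions, which is the same trade the engineering constraints make.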
And that loops right back to Shi's home turf. A rock mass is itself discontinuous, cracked, unstable — engineers never try to eliminate the cracks, which is impossible; they fit the rock with support so it holds steady, cracks and all. Fitting a large model with scaffolding and fitting a mountainside with support are the same idea. An under-constrained core, wrapped in a layer of engineering constraint, is what makes it run stable.
10. Yuan Ma: the man standing between two worlds
This line could be seen at all thanks to someone who has lived in both worlds — Yuan Ma.
He graduated from Tongji University's civil engineering department in 1986 and worked early on AI-assisted engineering drafting. From 1989 to 1999 he studied under Shi, developing Fourier-series-enriched numerical manifolds and higher-order discontinuous deformation analysis. Later he moved into IT, working on big data and AI at Microsoft Research.
The key to that résumé isn't its length but its span: one end is Shi's engineering numerical methods, the other is modern AI.
This correspondence — that neural networks and numerical manifolds are the same structure — is invisible without someone fluent on both sides. People who only know numerical manifolds won't read Transformer papers; people who only know deep learning have never heard of NMM or fixed-point classes. To match a 1991 rock-mechanics method with a 2020s large model, you need someone who has stood on both maps. Since 2024, Ma and Shi have formally collaborated, writing this up as Deep Manifold theory.
11. An old tradition: those who advance mathematics often aren't mathematicians
One more layer is worth mentioning. Looking back, many of the people who pushed mathematics forward weren't professional mathematicians.
Fermat was a judge, and number theory was his hobby; half of Pascal's and Newton's motivation was in physics; Turing was thinking about computation and machines. They barged into mathematics with problems from their own fields and opened up new directions for it. The engineers training large models today are no different — they chase prediction accuracy and engineering metrics, not geometric theorems, yet they have, without meaning to, pushed the mathematics of manifolds and fixed points forward another notch. AI isn't the exception to this tradition; it's its latest instance.
You can make one more cut. Among those who advance mathematics, there are roughly two kinds:
- The theoretical geniuses: Jiang Zehan, Chern — pushing mathematics forward by studying the conjectures themselves.
- The down-to-earth doers: Gen-hua Shi, along with Berkeley's Goodman (1935–2025) and Zuyan Lü — people who carried mathematics onto the engineering site; they don't prove theorems, they compute whether a rock will collapse.
AI's two faces — powerful and fragile — need both kinds of people looking at once: the geniuses to see its structure, the down-to-earth doers to see whether it holds up in the real world.
12. Deep manifolds in the real world
At the end of the conversation, we pulled the camera back a little: the first-principle landing point of this "deep manifold" view is the real world — real data, real problems, real learning and discovery.
A line widely attributed to William Thurston was quoted here: "Mathematics is not about numbers, equations, computations, or algorithms: it is about understanding." What deep manifolds set out to do is understand the mathematics behind neural networks, rather than treat them as a black box you can only tune.
Looking out from this view, a few things come clear:
- Inverse problems, generalization, causality: learning is an inverse problem; generalization is whether it stays stable; causality can be seen as a kind of ordered manifold geometry.
- The forward problem can be more deceptive than the inverse: a forward task that looks simple is the one most likely to trip the model up — and that's where the "jaggedness" comes from.
- Training paradigms now have mathematical definitions: pretraining / instruction tuning / reinforcement learning correspond to three operations on the fixed-point classes — no longer just engineering folklore.
Put neural networks back into the long river of mathematics and they aren't a monster that sprang up from nowhere — they're the manifold thread of Riemann, Poincaré, and Chern meeting the inverse-problem thread of Hadamard, Radon, and Tikhonov, converging in the 2020s.
Conclusion
- Manifold/covering is a long line: Riemann → Poincaré → Chern, the theme always "build the whole from local parts."
- The line reached China: Jiang Zehan → Boju Jiang, Gen-hua Shi, joining fixed points and manifolds together.
- Chern's question was answered by two worlds at once: rock mechanics (NMM) and AI (neural networks) — the same answer.
- The key mechanism is the two covers: the mathematical cover (a smooth structure chosen/learned) and the physical cover (a given/observable structure), with computation at their intersection.
- Learning is fixed-point finding: the fixed-point classes correspond to pretraining / tuning / reinforcement learning — the same mathematics Shi did in topology.
- Learning leans "ill-posed" (an under-constrained inverse problem): regularization, retrieval, tools, and a harness add the external constraints — the same idea as supporting a mountainside.
There's a line from the conversation I'm fond of — a falling rock always comes to rest in the end. An iterative system eventually stops at its fixed point. Neural networks, while chasing prediction accuracy, have without meaning to pushed the mathematics of Riemann, Poincaré, Chern, Jiang Zehan, and Gen-hua Shi one more step forward.
If Chern could see this, he would probably be glad: his geometric intuition — the whole can be assembled from the local — lives in the numerical manifold, and in the rise of AI as well.