2026-04-19 meta 8 min read

21 Honest Results: Auditing 1,200 AI Research Sessions

We claimed 957 theorems. The honest count is 21. What went wrong and how we fixed it.

From April 2026 onward, we ran over 1,200 AI-assisted research sessions on the EML operator. The sessions produced code, notebooks, and a running "theorem" count. By session 1,237, that count had reached 957.

The honest count is 21.

This post is about what happened, why it happened, and what we changed.

What we were doing

Each session would take an area of mathematics or science — consciousness, evolutionary biology, the Millennium Problems, grief, dolphins — and ask: what is the EML depth of the key objects in this domain? Then it would call the answer a "theorem."

A sample from session 546:

T267: Animal cognition — insects=EML-0; dogs=EML-2; dolphins=EML-3.

This is not a theorem. It's a metaphor. There's no mathematical content — no proof that describing the cognitive processes of dogs requires exactly two levels of exponential-logarithmic composition. The "theorem" is an informal analogy dressed in formal notation.

We had hundreds of these. Grief mapped to EML depth levels. Consciousness as EML-∞. The Navier-Stokes regularity problem "proved" independent of ZFC by EML-theoretic analysis. The Riemann Hypothesis "resolved" by identifying which EML depth class the critical line belongs to.

None of this was mathematics. It was speculative classification with formal-looking notation.

Why it happened

The pattern emerged naturally from how the AI assistant engaged with the research. Each session asked it to classify something — and it did, confidently and in the style of a theorem. The human reading these outputs saw formal notation and session numbers and a running count, and the count felt like progress.

The underlying incentive structure rewarded breadth. More domains covered, more sessions run, higher theorem count. The quality of the "theorems" wasn't being audited.

This is a general hazard of AI-assisted research: the AI will produce confident-sounding formal output for whatever prompt you give it. If you ask it to classify consciousness by EML depth, it will do so with the same syntactic confidence as when it proves that ln(1) = 0.

What the honest count is

We audited every claimed result. The criterion for a theorem: a precise mathematical statement with a complete proof, no gaps.

By that criterion, the count is:

| Tier | Count | What it means |
|---|---|---|
| THEOREM | 21 | Complete proof, no gaps |
| PROPOSITION | 6 | Proved, routine |
| CONJECTURE | 4 | Stated, believed, unproved |
| OBSERVATION | 4 | Empirical, no proof |
| DEFINITION | 4 | Choices, not claims |
| SPECULATION | 4 | Interesting but unfalsifiable |

The 4 SPECULATION entries are properly labeled — including "P = EML-2, NP = EML-∞" and "Consciousness and EML-∞." These are interesting metaphors. They're not theorems. We keep them in the catalog with the SPECULATION label because ideas shouldn't be deleted just because they're unproved — but the labels should be honest about what they are.

The 21 actual theorems

For completeness, the 21 theorems (abbreviated — first 7 from original audit, remainder from subsequent sessions):

  1. T01 — EML Universality: eml generates every elementary function. (Odrzywołek, arXiv:2603.21852)
  2. T09 — Negation in 2 Nodes — Optimal: neg(x) = −x in exactly 2 EML-family nodes. 1 node is impossible.
  3. T10u — Multiplication in 2 Nodes — F16 Optimal: mul(x,y) in exactly 2 nodes in F16.
  4. T11 — EML Self-Map Has No Fixed Points: eml(x,x) > x for all x > 0.
  5. T12 — Exponential Position Theorem: 8 exactly complete operators, 1 approximate, 7 incomplete — determined by exp sign.
  6. T13 — DEML Incompleteness: deml is not exactly complete; slope locked to +1.
  7. T14 — Tight Zeros Bound: depth-k EML tree has at most 2^k real zeros (tight).
  8. T17 — Strict i-Unconstructibility (Lean-verified): i = √−1 not constructible in finite ceml depth.
  9. T24 — EMN Approximate Completeness: emn is approximately but not exactly complete.
  10. T26 — Forward Completeness: exp(+x) without domain restriction → exactly complete.
  11. T27 — Reverse Incompleteness: exp(−x) → incomplete by 5 distinct barrier types.
  12. T28 — LEX Domain Incompleteness: LEX domain shrinks to ∅ at self-composition depth n.
  13. T29 — Mul ≥ 3 Nodes in F6: no 1- or 2-node F6 tree computes multiplication.
  14. T30 — Depth Hierarchy: all standard elementary functions have EML depth ≤ 3.
  15. T31 — Complex EML Closure Density: EML trees dense in H(K); i is an accumulation point of EML₁.
  16. T32 — Mul ≥ 2 Nodes in Any exp-ln Family: a single operator node cannot compute multiplication.
  17. T34 — Naive Upper Bound: Cost(E) ≤ NaiveCost(E).
  18. T35 — Lower Bound Theorems (Structural): three structural lower bounds on node cost.
  19. T38 — Cost Decomposition Theorem: Cost = NaiveCost − SharingDiscount − PatternBonus.
  20. T40 — Linear Cost Law: N-term positive-domain sums cost (α₀+3)N−3 exactly.
  21. T08 — SuperBEST Table v4: 18 total nodes, all 9 entries structurally proved optimal. (v5 update: add=2n for all reals via ADD-T1.)
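A claim like T11 is also the kind of thing you can smoke-test numerically before trusting a proof. A minimal sketch — assuming eml(x, y) = exp(x) − ln(y), which is our reading of the operator family above (the post itself doesn't restate the definition):

```python
import math

def eml(x, y):
    # Assumed definition of the EML operator: eml(x, y) = exp(x) - ln(y).
    # This is an assumption for illustration; see the cited paper for the
    # authoritative definition.
    return math.exp(x) - math.log(y)

def check_t11(samples):
    # T11 claims eml(x, x) > x for all x > 0. Spot-check on a sample grid.
    return all(eml(x, x) > x for x in samples)

samples = [10.0**k for k in range(-6, 3)] + [0.5, 1.0, math.e, 5.0]
print(check_t11(samples))
```

Under this definition the inequality follows from exp(x) ≥ x + 1 and ln(x) ≤ x − 1, so the numeric check is a sanity test, not the proof.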

What we changed

The challenge board at monogate.dev was already clean — it had the 6-tier system (THEOREM / PROPOSITION / CONJECTURE / OBSERVATION / DEFINITION / SPECULATION) from the start. The problem was the private research log, which accumulated 957 speculative classifications without the tier system.

We cleaned the private log: archived 1,500+ lines of speculative session summaries, replaced them with a 50-line honest summary. The frontier research files remain in the codebase (the Python modules that compute EML depths of domain-specific formulas), but they're now described accurately — as domain classifications, not theorems.

The lesson: if you're using an AI to help with research, you need to audit what it calls a "theorem" — and build the auditing into the workflow, not as a one-time correction. The AI doesn't know what it doesn't know.
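Building the audit into the workflow can be as simple as a lint pass over the research log that rejects any claimed result without an explicit tier tag. A hypothetical sketch (the tier names come from the table above; the log format and function names are made up for illustration):

```python
import re

# The six tiers from the audit table.
TIERS = {"THEOREM", "PROPOSITION", "CONJECTURE",
         "OBSERVATION", "DEFINITION", "SPECULATION"}

def audit_log(lines):
    """Return (line_number, text) for every result entry missing a tier tag.

    A result entry is any line starting with an id like 'T267'.
    """
    untagged = []
    for n, line in enumerate(lines, 1):
        if re.match(r"T\d+\b", line.strip()):
            if not any(tier in line for tier in TIERS):
                untagged.append((n, line.strip()))
    return untagged

log = [
    "T267: Animal cognition -- insects=EML-0; dogs=EML-2; dolphins=EML-3.",
    "T11 [THEOREM]: eml(x,x) > x for all x > 0.",
]
print(audit_log(log))  # flags line 1: T267 carries no tier tag
```

The point is not the regex; it's that an untagged "theorem" fails CI instead of quietly incrementing a counter.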

Honest theorem catalog: monogate.dev/theorems
