All Research

March 2, 2026 · Experiment

Research: Strutured State with Higher-Fidelity Observations Dramatically Improves Spatial Reasoning


Cover of Research: Strutured State with Higher-Fidelity Observations Dramatically Improves Spatial Reasoning
Download PDF

What we did:

Ran Niva's Manifold physics engine on a top-tier spatial reasoning benchmark from Stanford/Northwestern (Theory of Space, ICLR 2026). Ours is the first non-LLM system evaluated on it.

Headline numbers:

* At structured sensor input, Niva scores 97% on spatial reasoning tasks, whereas GPT-5.2 scores 72% and Claude-4.5 Sonnet scores 66%
* At text-equivalent input (apples-to-apples with LLMs), Manifold scores 68%, competitive with frontier models despite Manifold being a deterministic geometry engine with zero training data
* When we gave frontier LLMs the same structured sensor data Manifold uses, the LLMs' accuracy dropped to 4-10%, a 60+ point regression from their text baselines. Chain-of-thought prompting made it worse, not better.

Why it matters:

1. Sensor quality is the dominant variable: a 29-point accuracy jump resulted from improving input fidelity alone, with zero changes to the reasoning pipeline. This is the core argument for physics-native architecture over LLMs in robotics.
2. The result is architecture-specific. LLMs literally cannot use precise sensor data. This isn't a prompting problem or a fine-tuning problem. It's a fundamental computational limitation.
3. Niva's pipeline is deterministic (same input always produces same output), noise-tolerant (under 1 point drop at 10-20x real sensor noise), and runs sub-50ms. These are deployment requirements, not research metrics.

Bottom line: On the hardest published benchmark for spatial reasoning in robotics, a physics-native architecture with zero training data matches frontier LLMs at their own game and dramatically outperforms them when real sensor data is available. LLMs can't bridge this gap regardless of how they're prompted.