Found & Framed Series

July 16, 2025
There’s a certain comfort in watching a machine walk through its logic. When a language model explains its steps, corrects itself, or hesitates before answering, it feels more intelligent — more human. Large Reasoning Models (LRMs), the latest evolution in generative AI, build directly on that instinct. These systems are trained not just to produce answers, but to simulate reasoning: they generate long, structured chains of thought before arriving at a conclusion.
But as Apple researchers argue in a recent paper — titled The Illusion of Thinking — that simulation may be all it is. The paper doesn’t question whether LRMs can solve problems. It questions whether they truly understand the steps they take.
And what it finds is sobering.

The Setup: Four Puzzles and a Hypothesis
Rather than relying on existing math benchmarks (which often suffer from training contamination and ambiguous complexity), the researchers created controlled puzzle environments. These weren’t language tests — they were tests of planning, abstraction, and consistency.
A quick breakdown of the puzzles used:
- Tower of Hanoi: A classic recursive logic task involving moving disks between pegs according to size and position rules (a minimal solver is sketched after this list).
- River Crossing: A constraint-satisfaction puzzle where actors and agents must cross a river without violating specific safety constraints.
- Blocks World: A stack rearrangement challenge where blocks must be moved one at a time to match a goal configuration.
- Checker Jumping: A linear swapping puzzle that requires red and blue tokens to change positions through legal moves and jumps.
Each of these challenges can be scaled in complexity, and they were deliberately chosen because they require multi-step reasoning — not just pattern completion.
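To make that scaling concrete, here is a minimal Python sketch of the standard recursive Tower of Hanoi solution (a generic textbook procedure, not code from the study). The optimal solution for N disks takes 2^N − 1 moves, which is what makes instance size a clean dial for difficulty.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Generic recursive Tower of Hanoi solver (textbook version, for illustration).

    Returns the optimal move list, which always has length 2**n - 1, so the
    instance size alone controls how much multi-step planning is required.
    """
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # 1. clear the n-1 smaller disks onto the spare peg
        + [(source, target)]                         # 2. move the largest disk to the target
        + hanoi_moves(n - 1, spare, target, source)  # 3. restack the n-1 smaller disks on top of it
    )

if __name__ == "__main__":
    for n in (3, 7, 10):
        print(f"{n} disks -> {len(hanoi_moves(n))} moves (expected {2**n - 1})")
```

Part of the appeal of puzzles like this is that the ground-truth solution is fully checkable at any size, unlike math benchmarks muddied by contamination and ambiguous difficulty.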

The Collapse Comes in Three Acts
On these puzzles, the researchers compared two types of models: standard large language models (LLMs) that produce direct answers without intermediate steps, and their reasoning-augmented counterparts — models like Claude 3.7 Sonnet (Thinking) or DeepSeek-R1 — which generate explicit reasoning traces using Chain-of-Thought and self-reflection techniques. Across complexity levels, they observed three distinct behavioral regimes.
- Low complexity: Standard large language models — those that skip the step-by-step “thinking” — performed better. They were faster, more accurate, and more efficient.
- Medium complexity: This is where LRMs shined. Their longer reasoning chains helped them outperform their non-thinking counterparts.
- High complexity: Everything fell apart. Both model types collapsed to zero accuracy. And most curiously, the LRMs began using fewer tokens to think — even when they had plenty of token budget left.
This last detail is key. A reasoning trace's length is simply the number of tokens (roughly, word fragments) a model spends working through a problem before committing to an answer. You'd expect that as problems get harder, models would reason more, not less. But the opposite happened. Their thinking effort declined precisely when the task demanded more structure.
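For intuition, here is a minimal sketch of the kind of measurement behind that observation. Everything here is illustrative: `run_sweep`, `query_reasoning_model`, and `check_solution` are hypothetical stand-ins for whatever model API and puzzle verifier an evaluation actually uses; the point is simply that accuracy and reasoning-trace length can be recorded side by side at each complexity level.

```python
from dataclasses import dataclass

@dataclass
class TrialResult:
    complexity: int        # puzzle size, e.g. number of disks or blocks
    correct: bool          # did the final answer pass the puzzle's verifier?
    thinking_tokens: int   # tokens spent inside the reasoning trace

def run_sweep(query_reasoning_model, check_solution, complexities):
    """Record accuracy and reasoning-trace length at each complexity level.

    Both callables are hypothetical stand-ins: `query_reasoning_model(complexity)`
    should return (reasoning_trace_text, final_answer), and `check_solution`
    should verify the answer against the puzzle's rules.
    """
    results = []
    for c in complexities:
        trace, answer = query_reasoning_model(complexity=c)
        results.append(TrialResult(
            complexity=c,
            correct=check_solution(answer, complexity=c),
            thinking_tokens=len(trace.split()),  # crude whitespace proxy; a real tokenizer is better
        ))
    return results
```

Plotting `thinking_tokens` against `complexity` is what exposes the counterintuitive dip: past a certain difficulty, the traces get shorter even though the token budget isn't exhausted.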

The Trouble with Looking Smart
This wasn’t the only strange pattern. In simpler tasks, the LRMs often found the correct solution early in their reasoning trace, only to keep exploring and introduce errors. In more complex cases, they wandered through incorrect paths before stumbling onto a valid answer — if they did at all.
Even more striking: when researchers gave the models a fully correct algorithm (say, for Tower of Hanoi), the models still failed to execute it reliably. They weren’t struggling to find the answer. They were struggling to follow it.
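Following a given procedure is, at least, a mechanically checkable property. The sketch below is a minimal illustration (the helper `is_valid_hanoi_sequence` is made up here, not taken from the paper): it replays a proposed move list for Tower of Hanoi and rejects the first illegal step, which is roughly what it means to execute the algorithm reliably.

```python
def is_valid_hanoi_sequence(n, moves):
    """Illustrative verifier (hypothetical helper, not from the paper).

    Replays a proposed move list for an n-disk Tower of Hanoi and returns True
    only if every move is legal and all disks end up on peg C in order.
    """
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # disk n at the bottom of peg A
    for src, dst in moves:
        if not pegs[src]:
            return False                        # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                        # illegal: placing a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))   # solved only if the full tower sits on the target peg

# e.g. is_valid_hanoi_sequence(2, [("A", "B"), ("A", "C"), ("B", "C")]) -> True
```

Against that bar, the models in the study were handed the procedure and still drifted off it.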
This raises questions about what these models are actually doing. Are they reasoning — or just sounding like they are?

Fluency Isn’t Understanding
Part of the problem lies in our perception. When we read an AI-generated thought process — especially one that mimics reflection or caution — we attribute intelligence. But what we’re really responding to is fluency: the ability to generate grammatically correct, coherent, and stylistically appropriate language.
Fluency is a surface feature. It doesn't guarantee that the underlying reasoning is sound. A model might fluently explain a solution path while following flawed logic or hallucinating rules.
This echoes a long-standing tension in cognitive science between simulated reasoning and substantive reasoning. Simulated reasoning is when a system mimics the surface features of human thought — step-by-step progression, internal monologue, structured reflection — without actually maintaining internal logical coherence. It’s the difference between rehearsing the lines of a play and genuinely understanding the plot. Substantive reasoning, by contrast, depends on internal coherence. It tracks premises, updates beliefs, follows causal links, and adjusts to feedback. It doesn’t just output thoughts — it maintains a structure behind them.
What this study shows is that most current models — even the ones designed for deep thinking — haven’t crossed that boundary.
. . . .
A Case of Epistemic Erosion
The implications go beyond technical performance. There's a philosophical risk here: what scholars sometimes call epistemic erosion, the gradual loss of standards for what counts as a valid reason, explanation, or proof. When we accept the appearance of reasoning — just because it sounds convincing — we blur the line between knowing and seeming to know.
In high-stakes domains like healthcare, policy, or finance, this erosion could be dangerous. If we trust systems that simulate thought but don’t hold up under complexity, we risk importing incoherence into decisions that affect lives.

Apple’s Motivations: A Tactical Reveal?
Now let’s consider the meta-question: why is Apple publishing this?
It’s a rare move for Apple to be so open about technical limitations. Unlike OpenAI or Google, Apple isn’t yet seen as a GenAI frontrunner. And this paper doesn’t present a breakthrough model — it presents a critique of others. So what’s the play?
One possibility: Apple is positioning itself as the sober voice in a space full of overpromises. By highlighting structural limits and evaluation flaws, it may be staking ground for a different direction — one that values grounded reasoning over spectacle. Another read, of course, is more tactical: calling out the flaws of leading competitors at a moment when Apple is rumored to be integrating GenAI into its platforms.
Not everyone sees Apple’s paper as purely scientific. Some skeptics argue this kind of critique conveniently downplays progress made by competitors like OpenAI and Anthropic. AI researcher Gary Marcus has long criticized the hype around reasoning in language models, calling out their fragility under abstraction and unfamiliarity. Some view Apple’s move as strategic: a chance to define the conversation around robustness, rather than compete head-on in fluency arms races.
Either way, it’s a move worth watching. Especially if it signals a pivot toward reasoning architectures that go deeper than language style.
. . . .
So What Now?
The illusion, it turns out, runs deeper than performance metrics. Even as GenAI systems produce longer chains of thought and more reflective responses, they aren’t moving any closer to general intelligence. They simulate the outward form of reasoning — structured logic, internal dialogue, apparent self-correction — without the substance beneath. True AGI would require adaptability, internal consistency, and the capacity to reason across complexity without collapse. What this paper quietly suggests is that today’s models, while increasingly eloquent, remain fundamentally brittle.
Which brings us back to what we choose to measure — and what we choose to believe. The takeaway from The Illusion of Thinking isn’t that language models are useless. It’s that we need better metrics — and better metaphors. If we continue to mistake fluency for reasoning, verbosity for depth, and coherence for logic, we’ll build machines that are very good at pretending to think.
And we’ll get very good at believing them.
. . . .
References
- Shojaee et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple.
- Marcus, G. (2025). "Five Ways in Which the Last 3 Months… Have Vindicated 'Deep Learning Is Hitting a Wall'." Marcus on AI (Substack).
- Ballon et al. (2025). The Relationship Between Reasoning and Performance in Large Language Models. arXiv:2502.15631.
- Chen et al. (2025). Reasoning Models Don't Always Say What They Think. arXiv:2505.05410.
- Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547.
- Bubeck, S., et al. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. Microsoft Research.
- Mitchell, M. (2023). Why AI Is Harder Than We Think. MIT Technology Review.
- Shanahan, M. (2022). Talking About Large Language Models. arXiv:2304.13712.
. . . .
