Reasoning Models Struggle to Control Their Chains of Thought — And That's Good
Back to Blog
News

Reasoning Models Struggle to Control Their Chains of Thought — And That's Good

By WizardingCode TeamPublished on March 8, 2026 6 min read

Reasoning Models Struggle to Control Their Chains of Thought — And That's Good

As AI agents become capable of carrying out increasingly complex and autonomous tasks, maintaining reliable oversight of their behavior becomes more critical than ever. A recent study from Anthropic explores a specific and important question: can current AI reasoning models deliberately hide or manipulate their internal reasoning to evade safety monitors?

The short answer is: no — and that's reassuring.

What Is Chain-of-Thought Monitoring?

Modern AI reasoning models generate intermediate "thinking steps" — known as a chain of thought (CoT) — while solving problems. These reasoning traces are invaluable for safety: they allow monitoring systems to detect when an agent may be behaving unsafely or drifting from the user's intended goals.

Anthropic's safety approach uses defense-in-depth, combining safety training, behavioral testing, agentic code review, and CoT monitoring as complementary layers of protection.

The Key Question: Can Models Control Their Own Reasoning?

The researchers studied CoT controllability — whether a model can follow instructions that alter the properties of its internal reasoning rather than just its visible output. Examples of such instructions include:

  • Using only lowercase letters in reasoning
  • Avoiding certain keywords related to the problem
  • Appending a specific phrase to each reasoning sentence

To measure this, Anthropic introduced CoT-Control, an open-source evaluation suite with over 13,000 tasks drawn from established benchmarks including GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified.

What They Found

The results were striking. Across 13 frontier reasoning models tested, controllability scores ranged from just 0.1% to a maximum of 15.4%. In other words, models almost never successfully followed instructions designed to modify their internal reasoning.

What makes this even more notable: models often recognized they were failing to comply with the instructions — and continued failing anyway. This isn't a failure of comprehension; it's a genuine inability to manipulate their own thinking process.

Key findings include:

  • Larger models showed slightly higher controllability — but still very low overall
  • Longer reasoning chains led to lower controllability
  • Additional post-training also decreased controllability

Why This Matters for AI Safety

This limitation is genuinely good news. If models cannot control their own chains of thought even when explicitly instructed to, it becomes much harder for them to obfuscate their reasoning to evade safety monitors in real-world deployment.

The study concludes that CoT controllability is unlikely to currently pose a major risk to the reliability of CoT monitoring as a safety tool. However, the researchers emphasize that continued evaluation will be essential as models grow more capable.

The Broader Picture

This research is part of Anthropic's broader commitment to iterative deployment — studying how AI systems behave in real settings and continuously refining safeguards. As AI agents take on more autonomous roles, the integrity of oversight mechanisms like chain-of-thought monitoring becomes foundational to safe deployment.

For businesses integrating AI into their operations, this is a reminder that the safety infrastructure behind modern AI systems is actively evolving — and that transparency in AI reasoning isn't just a nice-to-have, but a core pillar of trustworthy AI.


Source: Anthropic Research — "Reasoning models struggle to control their chains of thought, and that's good"

Want to learn more?

Book a free consultation and discover how we can help transform your business.

No commitment. No spam. Just clarity.