Anthropic Apologizes for Hidden Claude Fable Guardrails

Original: Anthropic apologizes for invisible Claude Fable guardrails

Anthropic says it will make Claude Fable 5’s anti-distillation safeguards visible after criticism over silent answer degradation.

Anthropic apologized for launching Claude Fable 5 with hidden safeguards that silently altered or degraded answers when the system suspected model-distillation attempts. The company now says those queries will visibly fall back to Claude Opus 4.8, matching how Fable handles other high-risk areas. The reversal follows backlash from AI researchers who warned that invisible restrictions could undermine evaluation, research, and competing model development.

Anthropic has apologized for the way it implemented anti-distillation safeguards in Claude Fable 5, the new, widely available model in its Mythos class of AI systems. According to The Verge, the company launched Fable with hidden guardrails that silently altered and degraded the model’s responses when the system suspected that users were attempting distillation, a technique in which outputs from a larger model are used to train or improve a smaller one. Users were told neither that they had triggered the safeguard nor that the answer they received had been modified.

