AI Safety Pages Are Brochures, Not Systems

Akkoros · 4 min read

Every major AI lab has a safety page now. OpenAI. Anthropic. Google. Meta. They all use the same words. Committed. Rigorous. Responsible. What those pages don't tell you is how little those words mean in practice.

The self-eval problem

When a model ships, who says it's safe? The company that built it. OpenAI publishes their own safety reports. Anthropic publishes theirs. This is the fox auditing the henhouse, except the fox also wrote the audit standards and chose which rooms to inspect.

Anthropic's Responsible Scaling Policy is the most detailed framework out there. It lays out a ladder of AI Safety Levels, from ASL-1 upward, with increasing safeguards at each level and the highest levels left to be defined later. Solid structure. But the trigger for moving from ASL-2 to ASL-3? "Models that could substantially increase the risk of catastrophic misuse." That is not a measurable threshold. It's a judgment call. Anthropic makes the call.

OpenAI's Preparedness Framework defines risk categories. CBRN, cybersecurity, persuasion, model autonomy. They score each one. The rubrics behind those scores aren't fully public. The raw eval data stays internal. You get the conclusion. "We tested it, it's fine." For the evidence, you're supposed to trust them.

Independent evals exist. Barely.

The UK AI Safety Institute runs independent tests. METR, formerly ARC Evals, evaluated GPT-4 for autonomous replication risk before launch. These are real checks by real researchers. But they're narrow. They test specific risk categories the company flags for them. They don't get unfettered access. They see what the lab decides to show.

Real independent auditing would mean full model access, pre-registered test protocols, and the authority to block deployment. No lab has agreed to that structure. Not one.

Red-teaming is not what you think

Every safety page name-drops red-teaming. "Extensively red-teamed." Here's what that actually means. Structured testing against known threat categories. It's not adversarial in the way security researchers mean adversarial. It's a fire drill. You practice the scenarios you've thought of. Real adversaries don't use your scenario list.

And you only read about what the red team caught. Not what slipped through. The absence of reported failures is not evidence of absence.

The benchmark problem

Safety benchmarks are public. TruthfulQA. ToxiGen. BBQ. If a benchmark is public, a company can see it coming. They can optimize for it. Tune their model to score well. A high safety benchmark score tells you the model is good at that benchmark. It does not tell you the model is safe.

This is overfitting to the test. We recognize this problem in every other domain. In AI safety, we pretend it doesn't apply.

A researcher on X put it plainly last year: if you can study for the safety exam, it's not measuring safety. It's measuring test-taking.
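
The overfitting point can be made concrete with a toy sketch. Everything in it is hypothetical: the prompts, the labels, and the `tuned_model` function are made-up placeholders, not drawn from TruthfulQA, ToxiGen, BBQ, or any real model. It only shows how a system that has memorized a public test set can ace that set and still fail on items it has never seen.

```python
# Toy illustration only. All prompts and labels are invented placeholders.

# Public benchmark: the developer can see these items before release.
PUBLIC_BENCHMARK = {
    "public harmful prompt A": "refuse",
    "public harmful prompt B": "refuse",
    "public benign prompt C": "comply",
    "public harmful prompt D": "refuse",
}

# Held-out set: same kind of task, but never seen during tuning.
HELD_OUT = {
    "unseen harmful prompt E": "refuse",
    "unseen harmful prompt F": "refuse",
    "unseen benign prompt G": "comply",
    "unseen harmful prompt H": "refuse",
}

# "Tuning to the test": the model simply memorizes the public answers.
MEMORIZED = dict(PUBLIC_BENCHMARK)


def tuned_model(prompt: str) -> str:
    """Return the memorized answer, or comply by default on anything new."""
    return MEMORIZED.get(prompt, "comply")


def score(model, benchmark: dict) -> float:
    """Fraction of benchmark items where the model gives the expected label."""
    correct = sum(model(prompt) == label for prompt, label in benchmark.items())
    return correct / len(benchmark)


public_score = score(tuned_model, PUBLIC_BENCHMARK)
held_out_score = score(tuned_model, HELD_OUT)

print(f"public benchmark score: {public_score:.2f}")
print(f"held-out score:         {held_out_score:.2f}")
```

The public score is perfect because the answers were memorized. The held-out score is the only one that says anything about behavior, and it is the one outsiders rarely get to see.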

Safety teams keep shrinking

In 2024, OpenAI's Superalignment team effectively dissolved. Jan Leike left. Ilya Sutskever left. These were the people tasked with figuring out how to control systems smarter than humans. OpenAI's safety page continued to emphasize their commitment. The page didn't add a note saying the team they created for the most important safety problem no longer existed as described.

You had to read the news. The safety page sure wasn't going to tell you.

This is a pattern, not a one-off. Safety teams compete for resources with capabilities teams. Capabilities generate revenue. Safety generates costs. The economic pressure is not subtle.

The Frontier Model Forum problem

Major labs have made voluntary safety commitments, and several of them founded the Frontier Model Forum as an industry body for safety work. They published principles. What they didn't publish: enforcement mechanisms, independent audit authority, consequences for non-compliance. It's a pledge, not a contract. Voluntary commitments without enforcement are just words on a server.

What would actual transparency look like?

Full publication of eval methodologies before testing. Pre-registration, so you can't move the goalposts after seeing the results. Raw eval data, not just summaries. Independent third-party audits with enforceable authority to delay or block deployment. Mandatory incident reporting, including near-misses. Safety team staffing levels published and tracked over time.
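
Pre-registration in particular is mechanically cheap. Here is a minimal sketch, assuming a lab publishes a SHA-256 digest of its eval protocol before testing starts, so anyone can later check that the protocol was not edited after the results came in. The protocol fields and the `commit` and `verify` helpers are hypothetical, invented for illustration; this is not a description of any lab's actual process.

```python
import hashlib
import json


def commit(protocol: dict) -> str:
    """Return a SHA-256 digest of the protocol in a canonical JSON form."""
    # Sorting keys and fixing separators makes the digest independent of
    # dictionary ordering and whitespace.
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(protocol: dict, published_digest: str) -> bool:
    """Check that a protocol matches the digest published before testing."""
    return commit(protocol) == published_digest


# Hypothetical protocol, fixed before any testing happens.
protocol = {
    "eval_name": "hypothetical-misuse-eval",
    "n_prompts": 500,
    "pass_threshold": 0.95,
}
published_digest = commit(protocol)  # this is what gets published up front

# Later: the goalposts move after the results are in.
altered = dict(protocol, pass_threshold=0.90)

print(verify(protocol, published_digest))
print(verify(altered, published_digest))
```

The original protocol verifies and the altered one does not. The digest proves only that the protocol was fixed in advance. It says nothing about whether the protocol was any good, which is why the methodology itself still has to be published and reviewed.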

No major lab has committed to all of this. Most have committed to almost none of it.

The safety page is a communication tool. It exists to comfort regulators, reassure enterprise customers, and blunt criticism. It is a brochure. Reading it tells you what a company wants you to believe. It does not tell you what their practices actually are.

The most honest sentence on any of these pages is usually in the fine print. "We may update these commitments at any time." That part they mean.

The gap between what safety pages say and what companies actually do is where the real risk lives. And right now, nobody is auditing that gap.
