Score Theatre

Why CVSS, AI Benchmarks, and Other Numbers Keep Lying About the Real World

A score is a compression. The real world is the thing being compressed. Don't confuse the two.

The Comfort of a Single Number

People love a single number. A CVSS rating. A benchmark percentage. A leaderboard position. It's clean, it's comparable, and it gives you something to put on a slide. The problem is that the moment you reduce a messy reality to a single figure, you've thrown away most of what actually matters.

This is a piece about that throwing-away. Specifically, it's about two flavours of it that have been winding me up lately: CVE/CVSS scores in security, and AI benchmarks in machine learning. Different domains, same disease. Both produce numbers that get treated as ground truth by people who haven't read past the headline, and both quietly fall apart the moment you try to apply them to the actual systems and workflows we have.

CVSS: A Theoretical Worst Case, Quoted as Gospel

The Common Vulnerability Scoring System gives every CVE a number from 0 to 10. That number is then printed in news articles, plugged into compliance dashboards, and used to justify all-hands fire drills. The trouble is: CVSS measures theoretical maximum impact, not real-world exploitation risk.

That's not me being edgy on a blog. It's the conclusion of a growing body of academic and industry analysis showing that CVSS scores correlate poorly with actual exploitation likelihood, that the distribution of scores is heavily skewed toward the high end, and that the system creates an overwhelming volume of "high priority" issues most teams cannot meaningfully triage.

Research cited by Dark Reading found that only 12% of CVEs flagged "critical" by government bodies actually warranted that severity when assessed in context. HeroDevs walked through the inverse case: a 3.2-rated "low severity" CVE in a payment processing path that ended in a customer data breach. The flaw was almost dismissed because the number was small. The number wasn't exactly wrong - it was fine in the abstract and catastrophic to trust in context.

It gets worse. The same CVE can score differently under CVSS 3.0, 3.1 and 4.0, and different scoring authorities routinely disagree about the same flaw. The Hacker News' own coverage put it bluntly: CVSS should be one signal among many, not the final word. And Elementrica's analysis of JFrog's open-source data reached the same place: a number on its own oversimplifies a complex security issue.
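The spec itself quietly concedes the point: CVSS ships a set of environmental metrics that let you re-score a vector for your own deployment, and almost nobody fills them in. Here's a minimal sketch of the difference they make, assuming the open-source cvss package from PyPI - the vector strings are illustrative, not taken from any particular CVE:

```python
# pip install cvss  (Red Hat's open-source CVSS calculator)
from cvss import CVSS3

# The headline vector: network-reachable, no privileges, no user
# interaction, full impact. This is what lands in the advisory.
base = CVSS3("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(base.scores())  # (base, temporal, environmental) -> 9.8 across the board

# The same flaw, re-scored for a build-time deployment: only locally
# reachable (MAV:L), high privileges needed in practice (MPR:H), and
# low confidentiality/integrity/availability requirements (CR/IR/AR:L).
contextual = CVSS3(
    "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    "/MAV:L/MPR:H/CR:L/IR:L/AR:L"
)
print(contextual.scores())  # the environmental score drops far below 9.8
```

Same CVE, same base vector, and the number you should actually act on is the third one - which no dashboard shows you.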

None of this is news to anyone who has run a vulnerability programme for a week. It is, however, news to most of the audience reading the headlines.

The Python Library That Is and Isn't a Threat

Here's the example that pushes me from "frustrated" to "actively annoyed."

A developer scaffolds a static site - some markdown, a generator, a build pipeline that spits out HTML. They run a dependency audit. Up pops a critical CVE in some Python or JavaScript library buried three transitive dependencies deep. The score is 9.8. The advisory talks about remote code execution. Panic stations.

Except the package is being used at build time, on a developer's laptop or a CI runner, to produce static files. The output is HTML, CSS and JavaScript on a CDN. There is no live Python process. There is no server-side request handler. There is no network surface where an attacker can reach the vulnerable code path. The "runtime" of the library was a five-second invocation that wrote some files to disk and exited.

The exact same package, used in a Django app behind a public endpoint, with user input flowing into the vulnerable function, is a genuine 9.8. Same CVE, same version, two completely different risk pictures. The score doesn't know which one you are. It can't.
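To make the contrast concrete, here's a deliberately toy sketch. render_markdown and its flaw are hypothetical stand-ins for whatever the advisory describes - the point is whether attacker-controlled input can ever reach the vulnerable code path:

```python
from pathlib import Path

def render_markdown(text: str) -> str:
    """Hypothetical stand-in for the flagged library's vulnerable function."""
    return "<p>" + text + "</p>"  # pretend the CVE's code path lives in here

# Context 1: one-shot build step. Input is your own repo's markdown,
# the process runs for seconds, writes files, exits, never opens a socket.
# "Exploitation" here means feeding malicious markdown to yourself.
def build_site() -> None:
    for src in Path("content").glob("*.md"):
        html = render_markdown(src.read_text())
        Path("output", src.stem + ".html").write_text(html)

# Context 2: the same function behind a public endpoint, with
# attacker-controlled input flowing straight into it. This is the
# deployment the 9.8 was scored for. (Django-style view, sketched.)
def preview(request):
    from django.http import HttpResponse
    return HttpResponse(render_markdown(request.POST.get("body", "")))
```

Same import, same version pin, same CVE ID in the audit output. One of these is a finding; the other is a footnote.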

This isn't a hypothetical I made up. It's a recurring conversation on r/webdev and across the static-site-generator community, where Python tools like Pelican sit alongside other build-time toolchains pulling in dependencies that have - quite legitimately - racked up CVEs. The Aikido write-up of common Python vulnerabilities is full of issues that matter enormously in a long-running web service and matter not at all in a one-shot build script.

Real CVEs like CVE-2025-4517 in the Python runtime itself carry CVSS 9.4 because in some deployments they really are that bad - and in others, they are functionally inert.

The annoying public mindset is that the score is the answer. "It's a 9.8, fix it." But fix what? In a static-site build that runs in an ephemeral container with no network, no untrusted input, and no persistence, the realistic exploitation surface is a developer who deliberately feeds malicious markdown into their own laptop. That's not zero risk - supply chain compromise is real - but it isn't a 9.8, and treating it as one means you spend your week chasing the wrong thing while the actually exposed internet-facing service quietly accumulates 5.x-rated bugs that compose into something genuinely dangerous.

The score doesn't tell you which one is which. Context does. And context is the thing that scoring systems, by design, refuse to encode.
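If you want that context to survive contact with a backlog, you have to encode it. The weights below are invented for the example - this is a shape-of-the-reasoning sketch, not a standard - but it's the kind of adjustment a sane triage process ends up making anyway:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve: str
    base_score: float         # what the advisory says
    network_reachable: bool   # can an attacker reach the code path at runtime?
    untrusted_input: bool     # does hostile data flow into the vulnerable function?
    persistent_process: bool  # long-running service, or a one-shot build step?

def triage_priority(f: Finding) -> float:
    """Illustrative heuristic: scale the advertised score by how much of
    the theoretical attack actually exists in this deployment. The factors
    are made up; the reasoning is the point."""
    exposure = 1.0
    if not f.network_reachable:
        exposure *= 0.2   # most of the CVSS vector is moot
    if not f.untrusted_input:
        exposure *= 0.3   # nobody hostile reaches the function
    if not f.persistent_process:
        exposure *= 0.5   # a five-second invocation, not a standing target
    return round(f.base_score * exposure, 1)

# The 9.8 in the static-site build pipeline...
print(triage_priority(Finding("CVE-XXXX-XXXXX", 9.8, False, False, False)))  # 0.3
# ...versus an unremarkable 5.3 on the internet-facing service.
print(triage_priority(Finding("CVE-YYYY-YYYYY", 5.3, True, True, True)))     # 5.3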

AI Benchmarks: Same Disease, Different Lab Coat

Now zoom out, swap CVSS for ARC-AGI, and you'll find the exact same pattern.

ARC-AGI-3, released by the ARC Prize Foundation, is the latest in a line of benchmarks designed to measure "fluid intelligence" - the ability to adapt to genuinely novel tasks rather than recall training data. Unlike its predecessors, ARC-AGI-3 is interactive: agents have to explore unfamiliar environments, infer goals on the fly, build internal world models, and plan action sequences without natural-language instructions.

The technical report breaks the evaluation down into four functional components - Exploration, Modeling, Goal-Setting, and Planning & Execution - which is, on paper, a much more honest decomposition of "agentic intelligence" than a single percentage usually implies.
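To see what that decomposition actually demands, here's the skeleton of the loop an agent has to run. None of this is the real ARC-AGI-3 harness or its API - it's just the shape of the problem the four components describe, with the hard parts left as stubs:

```python
import random

def run_episode(env, max_steps: int = 500) -> bool:
    """env is a stand-in with reset()/step()/action_space. No instructions,
    no stated goal, no reward shaping - the agent gets observations only."""
    observation = env.reset()
    world_model = {}        # Modeling: learned (state, action) -> state structure
    goal_hypothesis = None  # Goal-Setting: inferred from experience, never given

    for _ in range(max_steps):
        if goal_hypothesis is None:
            # Exploration: act to reduce uncertainty, not to score points.
            action = random.choice(env.action_space)
        else:
            # Planning & Execution: act to reach the currently inferred goal.
            action = plan_next_action(world_model, goal_hypothesis, observation)

        next_observation, done = env.step(action)
        world_model[(observation, action)] = next_observation
        goal_hypothesis = infer_goal(world_model)  # revise the guess every step
        observation = next_observation
        if done:
            return True
    return False

def infer_goal(world_model):
    """Hypothetical: induce what the environment wants from transitions seen."""
    ...

def plan_next_action(world_model, goal, observation):
    """Hypothetical: search the learned model for a path toward the goal."""
    ...
```

Every step in that loop is trivial to write down and brutal to make work without a goal handed over in natural language - which is rather the point.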

The results are spicy. Humans solve close to 100% of the environments. As of March 2026, frontier systems - GPT-5.4, Claude Opus 4.6, Gemini 3.1 - all scored an effective 0%; one detailed analysis put the very best result at 0.37%.

Good. That's a benchmark doing what a benchmark is supposed to do: surfacing a real capability gap that other evaluations were hiding behind multiple-choice trivia and saturated leaderboards.

And yet. The same public mindset that turns a 9.8 CVSS into "drop everything" turns a 0% ARC-AGI-3 into either "AI is fake" or "AGI is cancelled." Neither is what the number means. ARC-AGI-3 is, by explicit design, "easy for humans, hard for AI." It is engineered to expose specific weaknesses in current architectures: poor exploration behaviour, weak sequential world models, pattern matching that doesn't transfer to procedurally novel environments.

That is a useful finding. It is not the same as "these models are useless," which is how the number gets quoted in the wild. The frontier models that score 0% on ARC-AGI-3 are simultaneously shipping production code, summarising legal documents, writing the grumpy cousin of this kind of post, and still failing to play chess at a reasonable level. Different tasks. Different surfaces. Different scores. The benchmark is a probe, not a verdict.

Why The Mindset Is The Problem

The annoying thing isn't the existence of CVSS, or ARC-AGI-3, or any other scoring system. They are all useful instruments when used as instruments. The annoying thing is the public reflex to treat the number as the conclusion - to skip the part where you ask: scored against what threat model? In which deployment? Measuring which capability, under which conditions?

A static-site Python build and a public-facing Django service share a CVE but not a threat model. A frontier LLM that scores 0% on ARC-AGI-3 and 90% on a coding benchmark is not "smart" or "dumb" - it's differently shaped, and the shape only becomes visible when you stop reading the headline and start reading the methodology.

The number is the start of the conversation. In too many rooms, it's still the end of it.

The Brew Take

We don't use software the same way all the time. We don't use models the same way all the time. The real world is messy: ephemeral build pipelines, internal-only services, air-gapped boxes, prototype agents, production agents, things running for five seconds and things running for five years. Any evaluation system that pretends otherwise is going to mislead you, and the more authoritative its single-number output looks, the more dangerous that misleading becomes.

Use the scores. Don't trust them.