Artificial Analysis’s “Coding Index” Doesn’t Measure Coding

If you’re trying to figure out which LLM is best at writing software, you’ll eventually land on ArtificialAnalysis.ai. It looks authoritative. Clean design, lots of models, comparison charts — exactly what you want when you’re evaluating options. They publish a “Coding Index” that ranks models on programming ability.

There’s just one problem: Artificial Analysis’s Coding Index doesn’t measure coding ability. Not even close. And by publishing it under that name, they are actively misleading every developer, team lead and decision-maker who relies on it.

What’s Actually Being Measured

The Coding Index is a composite of two benchmarks: Terminal-Bench Hard and SciCode. Let’s look at what each one actually tests.

Terminal-Bench Hard evaluates AI capabilities in terminal environments — system administration, data processing and software engineering tasks. In practice this is primarily sysadmin work. Can the model navigate a filesystem, run commands, compile things? These are useful skills in the same way that knowing how to use a screwdriver is useful to an architect. They're table stakes, not a measure of the thing that matters.

SciCode is a scientist-curated benchmark with 288 subproblems across 16 scientific disciplines. The benchmark’s own description says it “requires integrating scientific knowledge with programming skills to solve real research problems.” Read that again — it’s explicitly testing the intersection of domain-specific scientific knowledge and coding. If a model happens to know less about computational fluid dynamics but writes better production software, it scores lower on the “Coding Index.” A model that’s brilliant at architecting maintainable systems but doesn’t know the Navier-Stokes equations gets penalized on what’s supposed to be a coding benchmark.

That’s it. Two benchmarks. Sysadmin tasks and science homework. That’s what Artificial Analysis chose as the entire basis for ranking models on coding ability.
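
To make the construction concrete, here's roughly what a composite like this amounts to. Artificial Analysis doesn't publish its exact formula, so the equal weighting and the scores below are assumptions for illustration, not their real numbers:

```python
# Hypothetical sketch of a two-benchmark composite. The equal weighting
# and the example scores are assumptions, not Artificial Analysis's
# actual methodology or data.

def coding_index(terminal_bench_hard: float, scicode: float) -> float:
    """Average two benchmark scores (0-100) into a single composite."""
    return (terminal_bench_hard + scicode) / 2

# Only sysadmin-task and science-problem performance can move the
# ranking; everyday software-engineering skill never enters the math.
print(coding_index(terminal_bench_hard=45.0, scicode=38.0))  # 41.5
```

Whatever the real weights are, the structural point stands: only those two inputs exist, so nothing else can affect the ranking.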

What’s Missing

Here’s what the Coding Index doesn’t include — benchmarks that actually measure software development:

SWE-bench Verified, where models patch real bugs in real open-source repositories with real test suites. It’s the closest thing the industry has to measuring actual software engineering: read an existing codebase, understand the problem in context, produce a working fix.

Aider’s polyglot benchmarks, LiveCodeBench, BigCodeBench and EvalPlus — all of which test code generation directly against functional correctness.

None of these are perfect, but they at least involve writing code that has to work.

Beyond specific benchmarks, the Coding Index doesn't test for any agentic capabilities, complex problem solving or long-horizon task completion — all of which are directly relevant to how models are actually used for software engineering today. The ability to plan an approach, execute across multiple files, recover from errors and sustain coherent work over extended sessions is increasingly what separates useful coding models from toys. The Coding Index measures none of it.

Why This Matters

When I looked at Artificial Analysis’s Coding Index chart, it ranked Claude Sonnet 4.6 above Claude Opus 4.6 for coding. I use both models extensively every day for real software development work. That ranking is simply wrong. Sonnet is a capable model, but Opus is substantially better on complex software — multi-file refactors, subtle architectural issues, large-context reasoning across interconnected systems. The kind of work that actually defines whether a model is useful for serious development.

Artificial Analysis's index can't see any of this because its component benchmarks aren't measuring it.

The real danger isn’t that experienced developers will be misled — most of us will smell the problem quickly, just as I did. The danger is that decision-makers, team leads evaluating tools and developers earlier in their careers will see “Coding Index” on a professional-looking site and reasonably assume it reflects which model will best help them build software. Artificial Analysis is failing those people. The Coding Index is a grab bag of tangentially related capabilities wearing a label that implies something it absolutely does not deliver.

The Takeaway

If you’re evaluating models for software development, don’t use Artificial Analysis’s Coding Index. Look at SWE-bench Verified and Aider’s benchmarks. Look at what people doing real work with these models are reporting. Better yet — try them yourself on your actual tasks.

Composite indexes with authoritative-sounding names are convenient. But convenience is worthless when the underlying data doesn’t measure what the label claims. Artificial Analysis should either fix their Coding Index to include benchmarks that actually test software development or stop calling it a Coding Index. What they’re publishing right now is misinformation with a clean UI.

The AI Reskilling Delusion

Former Commerce Secretary Gina Raimondo recently published an opinion piece in the New York Times arguing that America needs “a new grand bargain between the public and private sectors” to manage AI-driven workforce disruption. Her proposal: modular credentials, employer-led training programs, wage insurance and tax incentives to help workers transition between jobs. She envisions mid-career professionals cycling back to campus for short, affordable credentials that stack over time into degrees.

It’s a perfectly reasonable policy framework for a problem that no longer exists.

Raimondo — along with a growing chorus of policymakers, editorial boards and researchers — is pattern-matching to every previous wave of technological disruption. Automation eliminated farm jobs but created factory jobs. Factories automated, and the service economy emerged. The internet killed travel agents but created web developers. The pattern has held for two centuries. Surely it will hold again.

It won’t. And the reason is straightforward once you stop and actually think about what AGI means.

This Is Not Automation

Every previous technology automated specific tasks. A loom automated weaving. A spreadsheet automated arithmetic. Even sophisticated software automates defined processes within narrow domains. These technologies couldn’t jump lanes. They were powerful but fundamentally limited in scope, which meant there were always adjacent spaces where humans remained necessary — and new categories of work that the technology itself created.

AGI is not a task-specific tool. By definition, artificial general intelligence can perform any cognitive work that any human can perform. It reasons. It plans. It learns. It adapts. It does all of this across every domain simultaneously. The closest analogy isn’t a better loom — it’s a machine that knows everything, can plan and reason, works around the clock and can be copied a thousand times over by lunch.

That distinction breaks the historical pattern completely, and anyone proposing workforce policy needs to grapple with it honestly.

The Math That Nobody Wants to Do

Let’s walk through the actual characteristics of an AGI workforce, because this is where the reskilling narrative falls apart.

An AGI works around the clock. No vacations, no sick days, no commute. It doesn’t spend half of Monday catching up on email or need annual HR compliance training. It produces output continuously.

It knows effectively everything. Not one field, not two — all of them. A human professional might spend a decade building deep expertise in a specialty. An AGI has that depth in every specialty from the moment it’s deployed.

It scales instantly. Need another team member? Spin up another instance. It has the exact same knowledge and capabilities as every other instance, immediately. No recruiting, no interviewing, no onboarding, no ramp-up period. Need fifty more? Done in minutes. Need fifty fewer? Shut them down. No severance, no unemployment insurance, no lawsuit risk.

It learns new material at machine speed. Any new job category that emerges — including ones created by AI itself — can be learned by an AGI in the time it takes to process the relevant information. Seconds to minutes, not months to years.

It costs a fraction of what a human employee costs. And unlike salaries, technology costs trend downward over time.

Now compare that to Raimondo’s mid-career accountant going back to school for a four-month credential. What exactly is she going to learn in four months that an AGI doesn’t already know or can’t acquire in thirty seconds? This isn’t a rhetorical question. It’s the question that the entire reskilling framework cannot answer.
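
The section heading promises math, so here is the barest version of it: a back-of-envelope comparison of available working hours, with every number a stated assumption (a roughly 2,000-hour work year is the standard full-time estimate; continuous operation is from the description above):

```python
# Back-of-envelope working-hours comparison. All figures are assumptions
# for illustration: ~2,000 hours/year is a standard full-time estimate,
# and continuous operation follows the always-on premise above.
HOURS_PER_YEAR = 24 * 365               # 8,760 hours in a year
HUMAN_WORK_HOURS = 40 * 50              # ~2,000 (40 h/week, 50 weeks)

instances = 50                          # "Need fifty more? Done in minutes."
agi_hours = instances * HOURS_PER_YEAR  # 438,000 hours per year
human_equivalent = agi_hours / HUMAN_WORK_HOURS

print(f"{instances} always-on instances ≈ {human_equivalent:.0f} full-time humans")
# -> 50 always-on instances ≈ 219 full-time humans, before counting any
#    speed, breadth-of-knowledge or cost advantage at all.
```

And that ratio only counts hours. Fold in machine-speed learning and per-instance cost, and the comparison stops being a comparison.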

There Are No New Jobs

The standard rebuttal is that AI will create entirely new categories of work, just as every previous technology has. And in the short term, that’s true — we’re already seeing demand for AI engineers, prompt specialists and agent orchestrators. But these roles exist specifically because current AI systems are limited and require human guidance.

The moment AI reaches general capability, those roles evaporate along with everything else. An AGI can build and orchestrate AI agents better than any human can. It can engineer prompts better. It can architect AI systems better. The transitional jobs that AI creates are themselves automatable by the very technology that created them.

This is the part of the logic chain that the optimistic frameworks refuse to engage with. They acknowledge that AI can automate existing work, propose that new work will emerge to replace it and then simply stop thinking. They never ask the obvious follow-up: can AI also do the new work? If you’re talking about AGI, the answer is yes. That’s what the “general” in “general intelligence” means.

Physical Work Is Not a Safe Harbor Either

There’s an implicit assumption in these discussions that physical jobs will remain human territory. That assumption has an expiration date. Humanoid robotics has been progressing for years — the hardware has been ahead of the software for a long time. Now that AI is providing the cognitive layer these platforms have always lacked, progress is accelerating rapidly. Physical labor jobs will likely trail knowledge work displacement by only a few years.

The handful of roles that may resist full automation are the deeply personal, high-touch ones — hairstylists, massage therapists, certain nursing roles. Jobs where human connection is the actual product. These will persist, but they cannot absorb billions of displaced workers. They’re a rounding error against the scale of the problem.

There Is One Version of This That Works

I want to be clear about something. I’m not anti-AI. I work with this technology every day and I find it genuinely remarkable. And there is actually a world in which Raimondo’s proposals make perfect sense.

That world is one where we keep AI narrow.

Narrow AI — even superhuman narrow AI — is an amplifier, not a replacement. AlphaFold revolutionized protein structure prediction. AlphaZero mastered chess and Go beyond any human. Google’s GNoME is accelerating materials science discovery. These systems are extraordinarily capable but they can’t generalize. They can’t decide to go do something else. They make human researchers and professionals dramatically more productive without threatening to replace them wholesale. In that world, modular credentials and employer-led training programs and wage insurance are sensible responses. Workers reskill to leverage powerful but bounded tools. The historical pattern of displacement and adaptation continues to hold.

Even some degree of general AI is manageable. Current large language models are useful across many domains but they still require significant human oversight and judgment. They augment more than they replace. A workforce policy built around helping people use these tools effectively would be entirely reasonable.

The problem is that nobody in a position of influence is proposing we stay here.

The frontier labs are explicitly racing toward AGI — artificial intelligence that matches or exceeds human capability across all cognitive domains and can act autonomously. That’s not a fringe interpretation. It’s their stated mission. And progress has not stalled. If anything, the pace of capability improvement is accelerating. Major leaps in model capability arrived just in the last few months, and the intervals between breakthroughs are compressing, not expanding.

There is no wall. There is no plateau. And there is no serious policy effort anywhere in the world to draw a line between “AI that amplifies human workers” and “AI that replaces them.”

That’s what makes the reskilling narrative so maddening. It’s not wrong in principle — it’s wrong about which reality it’s addressing. Raimondo is writing detailed policy for a future that requires a precondition nobody is working to establish. It’s like selling umbrellas to people living downstream of a dam that’s visibly buckling under record floodwaters.

Raimondo writes, “I refuse to accept that an unemployment crisis is inevitable.” I’d actually agree with that — it isn’t inevitable. But avoiding it requires honestly confronting the trajectory we’re on and making deliberate choices about what AI we build and what we don’t. What it does not require is another workforce training program.

The reskilling narrative isn’t a plan. It’s a security blanket. And it’s worth asking who benefits from the rest of us holding onto it.

The trajectory of AI isn’t being shaped by workforce policy committees or university credential programs. It’s being shaped by a handful of people with very specific goals and very deep pockets. That’s a conversation worth having — and something I’ll be covering in the future.
