When the Code Becomes Optional
Specifications replace code entirely
Andrej Karpathy just posted observations on X about coding with Claude agents over the past few weeks. For those who don’t know him: former Tesla AI director, OpenAI founding member, Stanford CS231n creator. When Karpathy talks about AI coding, people listen.
His headline observation: in two months he went from 80% manual coding to 80% agent-assisted.
“The biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks.”
But then he lists the problems:
“The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking.”
“They also don’t manage their confusion, they don’t seek clarifications, they don’t surface inconsistencies, they don’t present tradeoffs, they don’t push back when they should, and they are still a little too sycophantic.”
Here’s the thing: some of these are genuine model limitations. But some are symptoms of how we’re using them. And someone just demonstrated the difference.
The Library With No Code
In the comments on Karpathy’s post, he linked to Drew Breunig’s whenwords - a software library that contains no code.
Not “minimal code.” Not “generated code.” Zero code.
whenwords is a relative time formatting library. It turns timestamps into human-readable strings like “3 hours ago” or “last Tuesday.” Standard utility function stuff.
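To make the category concrete, here’s a minimal sketch of what a relative-time formatter does - a hypothetical `time_ago` helper I wrote for illustration, not the actual whenwords API or its spec:

```python
from datetime import datetime, timezone

def time_ago(then: datetime, now: datetime) -> str:
    """Render the gap between two timestamps as a human-readable phrase."""
    seconds = (now - then).total_seconds()
    if seconds < 60:
        return "just now"
    minutes = int(seconds // 60)
    if minutes < 60:
        return f"{minutes} minute{'s' if minutes != 1 else ''} ago"
    hours = int(seconds // 3600)
    if hours < 24:
        return f"{hours} hour{'s' if hours != 1 else ''} ago"
    days = int(seconds // 86400)
    return f"{days} day{'s' if days != 1 else ''} ago"

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
print(time_ago(datetime(2025, 6, 1, 9, 0, tzinfo=timezone.utc), now))  # 3 hours ago
```

The interesting part isn’t the code - it’s how many small decisions lurk in it (thresholds, pluralization, “just now” cutoffs). Those are exactly the decisions a spec must nail down.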
Here’s what the library contains:
SPEC.md: Detailed specification of how it should behave
tests.yaml: Language-agnostic test cases (input/output pairs)
INSTALL.md: Instructions for Claude/Codex/whatever
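I haven’t reproduced Breunig’s actual files here, but language-agnostic test cases of this kind typically look something like this (the field names and layout below are my own sketch, not the real tests.yaml):

```yaml
# Hypothetical sketch of input/output pairs for a relative-time function
timeago:
  - name: three_hours_ago
    input: 10800          # seconds in the past
    expected: "3 hours ago"
  - name: five_days_ago
    input: 432000
    expected: "5 days ago"
```

Because the cases are plain data, any language’s test harness can consume them - which is what makes the “one spec, infinite implementations” trick possible.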
The installation instructions are comically simple:
Implement the whenwords library in [LANGUAGE].
1. Read SPEC.md for complete behavior specification
2. Parse tests.yaml and generate a test file
3. Implement all five functions
4. Run tests until all pass
5. Place implementation in [LOCATION]
That’s it. Pick your language, paste into Claude, go.
And it works. Ruby, Python, Rust, Elixir, Swift, PHP, Bash, even Excel. One spec, infinite implementations. As Breunig notes:
“There wasn’t a single language where Claude couldn’t implement whenwords in one shot.”
This isn’t a thought experiment. It’s a working library you can use today.
What This Actually Proves
Remember Karpathy’s observation that models “make wrong assumptions on your behalf”? Breunig demonstrated what happens when you don’t give them room to assume anything.
The whenwords spec is ~500 lines. It’s exhaustive. It defines edge cases. It provides examples. It specifies behavior precisely enough that Claude can implement it in any language without making a single wrong assumption.
Two weeks ago I wrote about why AI hallucinations are often our fault. The core insight: LLMs don’t hallucinate randomly - they fill gaps in underspecified problems with plausible patterns from training data.
Breunig just took this to its logical conclusion. If you specify completely, code becomes optional. The specification IS the library.
What Specification Can’t Fix
To be clear: not all of Karpathy’s complaints are specification problems.
Sycophancy is real. Models genuinely don’t push back when they should. No amount of clear specs will make Claude say “actually, that architecture is overcomplicated” or “have you considered this simpler approach?” That’s a model behavior issue that better prompting can only partially address.
The “not surfacing inconsistencies” problem is somewhere in between. Sometimes models miss inconsistencies because the spec is genuinely unclear. Sometimes they notice but don’t mention it because they’re trained to be agreeable. You can prompt for the former (“list any ambiguities before implementing”) but the latter is baked into current model behavior.
The point isn’t that specification solves everything. It’s that we’re conflating two different failure modes: problems we’re causing by underspecifying, and genuine model limitations we need to work around. Fixing the first makes the second much more manageable.
Why This Matters Beyond Simple Libraries
whenwords is deliberately simple - five functions, well-defined standard, no complex dependencies. Breunig knows this. He asks:
“What does software engineering look like when coding is free?”
Then lists reasons you’d still want traditional code libraries:
performance-critical work
complex testing requirements
ongoing support needs
security patches
community interoperability
Fair points. But notice what’s missing: “because AI can’t implement it.”
The capability boundary isn’t “what can AI code?” anymore. It’s “what can we specify well enough?”
And here’s the thing about specification: we’ve always needed it. When you hand requirements to a junior developer, an offshore team, or a contractor, you need clear specs. The better your specification, the better the implementation.
AI just made the feedback loop instant and the cost nearly zero.
This Isn’t Theoretical For Me
I’ve been using AI-assisted development for months to build production systems:
At my day job: a data enrichment system replacing a $300K/year SaaS solution
SARK: an enterprise AI governance framework
Multiple internal tools that compressed 6-month development cycles to weeks
I built these using Czarina, my own AI orchestration system, which I created after hitting the need firsthand while building SARK. Czarina forces good specification practices - it literally doesn’t work without careful upfront planning, task decomposition, and clear boundaries. That’s why so many of the agentic tools emerging now ship with task managers and subtask decomposition: they can’t function without good specification, so they force us to get better at it.
The breakthrough wasn’t better models. It was learning to specify context properly.
Before: “Build a security review automation system”
After: “We need to automate the access control review portion of our security reviews. Currently this is a 6-hour manual process where security engineers check application configurations against our standard controls matrix. Build a system that can:
Ingest application configs
Compare them against our controls (here’s the schema)
Generate exception reports (here’s the format)
Flag changes from previous reviews”
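That level of specificity maps almost mechanically onto code. Here’s a minimal sketch of the core comparison step - the config keys and controls schema are invented for illustration, since the real ones aren’t shown in this article:

```python
# Hypothetical sketch of the "compare configs against controls" step.
# Field names are made up; the real controls matrix schema differs.
def find_exceptions(app_config: dict, controls: dict) -> list[dict]:
    """Return one exception record per control the config fails to satisfy."""
    exceptions = []
    for control_id, required in controls.items():
        actual = app_config.get(control_id)
        if actual != required:
            exceptions.append({
                "control": control_id,
                "expected": required,
                "actual": actual,
            })
    return exceptions

controls = {"mfa_required": True, "session_timeout_minutes": 15}
config = {"mfa_required": True, "session_timeout_minutes": 60}
print(find_exceptions(config, controls))
```

Notice how little is left for the model to invent once the spec names the inputs, the controls schema, and the report format. That’s the whole trick.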
Same model. Completely different results.
When Karpathy mentions that models “overcomplicate code” and implement “inefficient, bloated, brittle construction”? In my experience, that happens when you don’t specify simplicity as a constraint. His observation that models will implement something inefficient and then say “of course!” when you suggest the simpler approach? That’s the model showing you it CAN do the simple thing - you just didn’t ask for it initially.
These errors aren’t random. They’re systematic. They emerge from insufficient context, underspecified constraints, and vague success criteria. Fix those, and the remaining problems - the genuine model limitations - become much easier to spot and work around.
The Awareness Gap
Here’s what’s fascinating about Karpathy’s post: despite listing all these problems, he concludes:
“It is still a net huge improvement and it’s very difficult to imagine going back to manual coding.”
He shifted 80% of his workflow to AI assistance in weeks. This from someone who knows the limitations intimately.
Karpathy notes that “well into double digit percent of engineers” are doing this, while general population awareness is in “low single digit percent.”
This is the real bottleneck. Not model capability. Not even prompting technique. Awareness and adoption.
Breunig shipped a library with no code. Karpathy shifted his entire workflow in weeks. The models aren’t perfect. They won’t be perfect next month either. But they’re good enough to transform how we work - if we learn to work with them properly.
The gap compounds. Every month spent evaluating is a month competitors spend learning. You can’t catch up by reading whitepapers. You catch up by doing the work - making the mistakes, learning the patterns, building institutional knowledge about what actually works.
What Has Actually Changed
Software engineering has always been about translating human intent into machine instructions. We kept inventing abstraction layers to make that translation easier:
Assembly → C → Python → “tell Claude what you want in detail”
Each step required less technical precision and more conceptual clarity. Assembly required you to think in registers. C required you to manage memory. Python let you think in algorithms. AI lets you think in specifications.
Each layer traded some control for some convenience. Each layer made more people productive. Each layer got dismissed by the previous generation as “not real programming.”
Breunig’s library is just the next step: the specification layer.
When your specifications are good enough, the implementation becomes a commodity. The value isn’t in the code anymore - it’s in knowing what to build and how to specify it clearly.
I experienced this directly when building SARK. I needed Rust for performance-critical components. I’d never written Rust before. Didn’t matter. I specified what the code needed to do, the constraints it needed to satisfy, and Claude generated working Rust. The language barrier vanished because I could specify the behavior precisely enough.
Where This Leaves Us
Karpathy asks great questions:
What happens to the “10X engineer” ratio?
Do generalists increasingly outperform specialists?
How much of society is bottlenecked by digital knowledge work?
On that second question: clearly yes. The connecting tissue between disparate ideas - seeing how concepts from one domain apply to another, guiding AI agents across different problem spaces - that’s generalist work. Specialists will still matter for deep technical problems, but the ability to orchestrate AI tools across multiple domains, to spot patterns that connect them, to write specifications that bridge different fields? That’s where generalists shine.
I’d add one more question: How much of society is bottlenecked by organizational inertia in adopting tools that already work?
Because the capability is here. The question isn’t “when will AI be ready?” The question is “when will we be ready for AI that’s already here?”
And maybe more specifically: when will we realize that learning to specify context clearly is now the highest-leverage skill in software development?
The Coding Endgame
If Drew Breunig can ship a working library with zero code by writing good specifications, what does that say about the rest of software development?
Maybe this: the endgame isn’t “AI writes better code than humans.” The endgame is “code becomes an implementation detail and humans write specifications.”
whenwords might be the beginning of that future.
James Henry is a Senior Security Engineer who builds AI-augmented development systems. He created SARK (enterprise AI governance framework) and Czarina (AI orchestration system) using the specification-driven approach described in this article.
