My LinkedIn feed has been full of it for weeks. My X timeline too. People posting token spend screenshots like they're progress reports. Startup founders bragging they spent $16k on Claude Code last month and are aiming for $60k next. Leaderboards. Rankings. Titles like "Token Legend" and "AI God."
And then last week, it hit critical mass. Forbes reported on the "tokenmaxxing" movement sweeping Silicon Valley, where companies compete to see who burns the most AI tokens. Jensen Huang went on the All-In podcast and said: "That $500,000 engineer, at the end of the year, I'm going to ask him, 'How much did you spend in tokens?' If that person says '$5,000' I will go ape-something else. If that $500,000 engineer did not consume at least $250,000 worth of tokens, I am going to be deeply alarmed."
Then Fortune reported that a Meta employee had built an internal leaderboard called "Claudeonomics" tracking token consumption across the company's 85,000+ staff. Top users got titles. In a 30-day window, total usage hit 60 trillion tokens. The top individual user averaged 281 billion tokens. Mark Zuckerberg didn't even crack the top 250. Meta CTO Andrew Bosworth, meanwhile, was publicly saying his best engineer was spending his salary equivalent in tokens but was "5x to 10x more productive." "It's like, this is easy money," Bosworth said. "No limit."
I've been in software long enough to recognize what's happening here. This is "lines of code" with a much higher price tag.
We've been here before
In 2003, Martin Fowler wrote a short piece on why software productivity cannot be measured, one that should probably be required reading for every technical executive. His argument on lines of code was precise:
"One of my biggest irritations are studies of productivity based on lines of code. Any good developer knows that they can code the same stuff with huge variations in lines of code."
The problem is obvious once you say it out loud. LOC measures activity, not output. Two developers can build the same feature: one writes 1,200 lines, the other writes 80. The concise one probably built a better system. Under a LOC regime, the verbose one looks more productive.
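Fowler's point is easy to make concrete. Here's the same trivial feature, summing the even numbers in a list, written twice; the example is mine, not his, but the gap it illustrates is the whole problem:

```python
# The verbose version: more lines, more state, more places to make a mistake.
def sum_of_evens_verbose(numbers):
    evens = []
    for number in numbers:
        remainder = number % 2
        if remainder == 0:
            evens.append(number)
    total = 0
    for value in evens:
        total = total + value
    return total


# The concise version: identical behavior, a fraction of the "productivity".
def sum_of_evens(numbers):
    return sum(n for n in numbers if n % 2 == 0)


assert sum_of_evens_verbose([1, 2, 3, 4]) == sum_of_evens([1, 2, 3, 4]) == 6
```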
Teams evaluated on LOC responded rationally. They wrote more lines. They copy-pasted rather than abstracting. They avoided refactoring because deleting code would hurt their numbers. The metric shaped behavior, but not toward better software. More code. Worse systems.
Then in 2023, McKinsey published a piece claiming to have cracked objective developer productivity measurement. Gergely Orosz and Kent Beck's thorough response pointed out the same flaw: nearly every McKinsey metric was measuring effort and output, not outcomes. Kent Beck recounted watching Facebook's internal developer sentiment surveys devolve from useful feedback into managers negotiating with engineers for higher scores. That's what happens when you incentivize a proxy metric. The number improves. The thing you actually cared about does not.
You'd think we would have learned. We haven't.
Same mistake, different unit
The seductive logic of tokenmaxxing runs like this. Token consumption = AI usage. AI usage = AI adoption. Therefore, high token spend = high AI adoption = good.
It is precisely as flawed as measuring lines of code, just with a billing dashboard instead of a commit graph. And to be fair to the Forbes article, Sendbird's CEO John Kim basically said exactly that: "We've seen this movie before." He was referring to the 1990s and 2000s LOC culture. The real indicator, he noted, is how much AI-generated code actually makes it into production. Token spending "is more of a conversation starter." I agree with that. It becomes a problem when the conversation starter gets promoted to the headline KPI.
GitHub's 2024 developer survey found that 97% of enterprise developers had used AI coding tools at work at some point. Meaningful organizational adoption, though, required clear policies, workflows, and measurable outcomes tied to actual business results. Not just usage. Not just consumption.
Boris Cherny, the engineer behind Claude Code, publicly shared that he didn't open an IDE at all during one month of work, with Opus 4.5 writing around 200 PRs. That's impressive. But what makes it impressive is not the tokens those 200 PRs consumed. It's that they were 200 real merged contributions with working software on the other end.
The value is in the outcome. Tokens are the energy that got you there, nothing more.
When the metric becomes the target
There's a principle called Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The history of software development is basically a museum of Goodhart's Law in action.
Tracking tokens as an AI adoption KPI sets up the exact same dynamic. Engineering teams measured on token consumption will consume more tokens. That's just how incentives work. Want to look more productive? Run a few more agentic loops. Let the model reason at length before generating output. Wrap every task in an orchestration layer that calls four tools where one would do. Token spend goes up. Value delivered does not.
The Claudeonomics story proved this almost immediately. Fortune noted that "some employees have put AI agents to work for hours to maximize their token usage." There it is: Goodhart's Law executing in real time, inside a company that's supposed to be at the frontier of AI-driven productivity. The leaderboard had been up for maybe three weeks before it was shut down, and employees were already gaming it by running agents in loops. The metric had stopped measuring what it was supposed to measure before it was a month old.
Any developer reading this can probably think of five ways to inflate token usage metrics at no benefit to anyone. I won't list them. But if I can think of five, so can the engineers being measured on this.
Andrej Karpathy described the current moment in software engineering as a "magnitude 9 earthquake" for the profession. He's right. But earthquakes don't get measured by the electricity consumed. They get measured by what moved.
The documentation version of this problem
This isn't only a problem for engineering teams. I see the same dynamic in knowledge management, which is much closer to home for us at Rasepi.
"We published 400 documents this quarter" is a number that sounds good in a slide deck. It has nothing to say about whether those documents are accurate, whether anyone read them, or whether the information in them is still true six months later. You can hit that number with AI and no thinking whatsoever. Token-assisted noise published at scale.
The honest metric is harder to collect but much more useful: what percentage of your knowledge base actually reflects how your systems work today? How many people reached a correct answer using your documentation? How many tried, failed, and ended up asking someone on Slack instead?
Those questions don't have pretty dashboards yet. They require actual thought about what you want documentation to do for your organization. (This is, not coincidentally, exactly the problem Rasepi is built around. Forced expiry dates exist precisely so teams have to reckon with whether content is still valid, rather than letting it silently decay behind a high page-count metric.)
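To show how small that measurement actually is once you commit to it, here's a minimal sketch, assuming a hypothetical docs store where every page carries a review-by date and a verified flag (the `Doc` record and its field names are invented for illustration, not any particular product's API):

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical document record: any docs store that tracks a review-by date works.
@dataclass
class Doc:
    title: str
    review_by: date         # date by which the content must be re-verified
    verified: bool = False  # has someone confirmed it still matches reality?


def freshness(docs: list[Doc], today: date) -> float:
    """Share of the knowledge base that is verified and not past its review date."""
    if not docs:
        return 0.0
    current = [d for d in docs if d.verified and d.review_by >= today]
    return len(current) / len(docs)


docs = [
    Doc("Deploy runbook", date(2026, 3, 1), verified=True),
    Doc("Onboarding guide", date(2024, 6, 1), verified=True),   # expired
    Doc("API error codes", date(2026, 1, 15), verified=False),  # never re-verified
]
print(f"{freshness(docs, today=date(2025, 12, 1)):.0%} of the knowledge base is current")
```

The number that comes out is about the state of the content, not the volume of it.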
What to track instead
The honest answer to "is our AI investment paying off?" cannot be read from a billing dashboard.
You can approximate it with better questions: are cycle times improving? Is the ratio of features shipped to bugs reported trending in the right direction? Are engineers reporting they spend more time on judgment-heavy work and less on typing? Is your documentation staying current instead of accumulating like sediment?
These are harder to pull from an API. They require thinking about what output you actually want from your teams, which, admittedly, is the harder work. But they're the questions that matter, because they're about outcomes rather than inputs.
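Cycle time, for instance, is one of those outcome questions, and it's still cheap to compute once you've decided it matters. A rough sketch, assuming you've already exported merged pull requests with opened and merged timestamps (the `PullRequest` shape is invented for the example; pull the real data from wherever your history actually lives):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


# Invented export format: swap in PR data from your own source of truth.
@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime


def median_cycle_time_hours(prs: list[PullRequest]) -> float:
    """Median hours from opening a PR to merging it: an outcome, not an input."""
    hours = [(pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs]
    return median(hours)


prs = [
    PullRequest(datetime(2025, 11, 3, 9, 0), datetime(2025, 11, 3, 15, 30)),
    PullRequest(datetime(2025, 11, 4, 10, 0), datetime(2025, 11, 6, 11, 0)),
    PullRequest(datetime(2025, 11, 5, 14, 0), datetime(2025, 11, 5, 18, 0)),
]
print(f"median cycle time: {median_cycle_time_hours(prs):.1f} hours")
```

Track that before and after an AI rollout and you're at least measuring something the business can feel. The billing dashboard can't tell you whether it moved.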
Token spend tells you how much compute you bought. Whether that compute became something useful is an entirely separate question. Companies that don't maintain that distinction are going to build very expensive dashboards that show them almost nothing.
We spent years optimizing the wrong metric for developer productivity. We have maybe one quarter before the same mistake gets baked into every AI adoption report in the enterprise. The window to avoid this is open, but it won't stay that way.