Why is my AI coding bill so much higher than I expected?

Output tokens cost roughly 3–5× as much as input, depending on the model (5× on Claude Opus and Sonnet at the prices listed in this guide). If the agent is rewriting whole files instead of patching, output dominates your cost. The fix is to ask for diffs, not full rewrites, and to use a coding-tuned model that defaults to patches.

Does prompt caching actually help?

Yes, dramatically. Anthropic's cached read tokens cost 10% of the fresh input price on Claude Sonnet 4.6. A 200k-token context that is re-read on every turn costs $0.60 per turn fresh vs $0.06 per turn cached. Over a long session that delta is the difference between a $3 day and a $30 day.

When should I /compact a session vs start fresh?

Compact when the conversation is one continuous task and the older messages are bloating context without adding signal. Start fresh when the next task is genuinely separate — the cost of re-loading context is lower than the cost of carrying noise that confuses the model.

Are sub-agents worth the token overhead?

For parallelizable work, yes: a sub-agent that does its own research and returns a small summary saves you from carrying 50k tokens of search results in your main context. For sequential, dependent work, sub-agents are pure overhead.

Stop Burning Tokens: AI Coding Cost Guide

TL;DR

Output costs roughly 3–5× input, depending on the model. Ask for diffs, not full file rewrites.
Cached reads cost about 10% of fresh input. Structure prompts so the long parts (instructions, codebase context) are cacheable.
Sonnet for the boring 80%, Opus for the hard 20%. Picking the right tier per task is the biggest single win.
Restart > compact when the next task is unrelated. The "long session" is a habit, not a benefit.

CH 01

Why your bill is asymmetric.

Every model has three prices, not one:

Input — text you send (your prompt, the conversation history, file contents).
Output — text the model generates (code, explanations, tool calls).
Cached input — input the provider has already seen and stored.

Output is roughly 3–5× more expensive than fresh input (5× on the Claude models in the CH 02 matrix, 3–4× on the others). Cached input is roughly 10% of the price of fresh input. So a one-shot "rewrite this file" can cost more than a full hour of careful patches: rewriting forces the model to generate every line as output, while patching has it generate a tiny diff.

Mental model: token cost ≈ diff size × output-multiplier + context size × cache-discounted-input-rate. Optimize for small diffs and cache-warm context, not "the smallest possible model."
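The mental model above reduces to a few lines of arithmetic. The sketch below is illustrative only: the function name turn_cost, the token counts, and the 90% cache hit rate are hypothetical, and the prices are the Claude Sonnet 4.6 figures from this guide's pricing matrix.

```python
# Sketch of the mental model: output tokens at the output rate, plus
# context tokens at a cache-discounted input rate.
# Prices are USD per million tokens (Claude Sonnet 4.6 figures from this guide).
INPUT_PER_M = 3.00
CACHED_PER_M = 0.30
OUTPUT_PER_M = 15.00


def turn_cost(context_tokens: int, output_tokens: int, cache_hit_rate: float) -> float:
    """Return the USD cost of a single turn."""
    cached = context_tokens * cache_hit_rate
    fresh = context_tokens - cached
    total = fresh * INPUT_PER_M + cached * CACHED_PER_M + output_tokens * OUTPUT_PER_M
    return total / 1_000_000


# Hypothetical workload: a 40k-token context, 90% of it served from cache.
rewrite = turn_cost(40_000, 8_000, 0.9)  # regenerate an 8k-token file
patch = turn_cost(40_000, 400, 0.9)      # emit a 400-token diff instead

print(f"full rewrite: ${rewrite:.4f}")
print(f"patch:        ${patch:.4f}")
```

Under these assumed numbers the rewrite comes to roughly $0.14 per turn and the patch to roughly $0.03, and almost all of the gap is the output term.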

CH 02

Model pricing matrix (May 2026).

Prices per million tokens. Verify against the provider's page before you optimize anything serious — vendors adjust these often. The calculator below uses these defaults.

Model	Input / 1M	Cached / 1M	Output / 1M	Sweet spot
Claude Opus 4.7	$15	$1.50	$75	Architecture, hard bugs, refactors
Claude Sonnet 4.6	$3	$0.30	$15	Default agent loop, ~80% of work
GPT-5.3 Codex	$10	$1	$30	Codex CLI batches, long autonomous runs
GPT-5.5	$15	$1.50	$60	Tricky reasoning, planning, hard math
Gemini 3.1 Pro	$5	$0.50	$15	Long-context (2M tok), search-aware
Composer 2.5 Fast	Included	Included	Included	Quick tweaks in Cursor subscriptions

The Sonnet rule: if you can't articulate why Opus is needed for the next task, you don't need Opus. Sonnet does ~80% of real coding work for ~20% of the cost. Reserve Opus for: ambiguous architecture decisions, refactors across >10 files, debugging that's taken you longer than an hour.
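One way to make the Sonnet rule explicit is to write the escalation criteria down as a function, so that choosing Opus always comes with a stated reason. This is a minimal sketch: the Task fields and the pick_tier helper are hypothetical names, and the thresholds are simply the three criteria listed above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of the next task; every field is illustrative."""
    ambiguous_architecture: bool = False
    files_touched: int = 1
    hours_spent_debugging: float = 0.0


def pick_tier(task: Task) -> str:
    """Default to Sonnet; escalate to Opus only for one of the stated reasons."""
    if task.ambiguous_architecture:
        return "opus"  # ambiguous architecture decision
    if task.files_touched > 10:
        return "opus"  # refactor across more than 10 files
    if task.hours_spent_debugging > 1.0:
        return "opus"  # debugging that has already taken over an hour
    return "sonnet"


print(pick_tier(Task(files_touched=3)))            # sonnet
print(pick_tier(Task(files_touched=14)))           # opus
print(pick_tier(Task(hours_spent_debugging=1.5)))  # opus
```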

CH 03

Caching: the 10× lever.

Anthropic, OpenAI, and Google all cache prefix-matching input. If turn 1 sends "system prompt + project rules + file A + question 1" and turn 2 sends "system prompt + project rules + file A + question 2", the prefix is reused. You pay 10% of the input rate for the cached chunk.

This sounds like free magic. It is, but it requires shaping your prompts so the long, static parts come first and the short, varying parts come last. Reverse the order and you cache nothing.

good prompt shape (cache-friendly)

┌──────────────────────────────────────────┐
│ STATIC PREFIX (cached after 1st turn)    │
│   System prompt                          │
│   Project rules / AGENTS.md              │
│   File contents (the relevant 3-5 files) │
│   Tool definitions                       │
├──────────────────────────────────────────┤
│ VARIABLE SUFFIX (paid in full)           │
│   This turn's user message               │
│   This turn's tool call results          │
└──────────────────────────────────────────┘
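The effect of ordering can be seen without calling any provider. The sketch below builds two consecutive turns in each order and measures how much of the prompt they share from the start. It is a rough proxy under stated assumptions: real caches match on tokens at provider-defined boundaries, not on characters, and the prompt contents and helper names here are hypothetical.

```python
import os

# Hypothetical static material; in a real session this would be tens of
# thousands of tokens of instructions, rules, and file contents.
STATIC_PARTS = [
    "SYSTEM: You are a careful coding assistant.",
    "RULES: Prefer small diffs. Run tests before finishing.",
    "FILE src/auth.py:\n" + "def login(user): ...\n" * 50,
]


def cache_friendly(question: str) -> str:
    """Static parts first, the varying question last."""
    return "\n".join(STATIC_PARTS + [question])


def cache_hostile(question: str) -> str:
    """Varying question first, so every turn starts differently."""
    return "\n".join([question] + STATIC_PARTS)


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the leading run of characters the two prompts share."""
    return len(os.path.commonprefix([a, b]))


q1, q2 = "Why is login slow?", "Add a logout button."
good = shared_prefix_chars(cache_friendly(q1), cache_friendly(q2))
bad = shared_prefix_chars(cache_hostile(q1), cache_hostile(q2))
print(f"static-first prompts share {good} leading characters")
print(f"question-first prompts share {bad} leading characters")
```

With the static parts first, the whole static block is a shared prefix. With the question first, the two prompts diverge at the first character and nothing can be reused.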

In Claude Code, Cursor, and Codex CLI this happens automatically as long as you don't keep re-loading different files mid-conversation. The way to ruin caching: every turn, the agent reads a new file with cat and dumps it inline. Now the prefix changes every turn and your cache hit rate is zero.

Tactic: at the start of a session, ask the agent to read the files it needs once, summarize what it learned, and proceed. Re-load only on demand.

Pitfall · Cache TTL is 5 minutes by default

If you take a coffee break longer than the cache TTL (5 min on Anthropic, configurable up to 1 hour on the new tier), the prefix evaporates and turn N+1 pays full price for the warm-up. Either keep moving or explicitly opt into the 1-hour cache via the API headers your tool exposes.
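To see how a break interacts with the TTL, the sketch below replays a list of turn timestamps and reports which turns would find the prefix still warm. It assumes that every turn rewrites or refreshes the cache entry, which should be checked against your provider's documentation; the function name and the timestamps are hypothetical.

```python
def cache_hits(turn_times_s: list[float], ttl_s: float = 300.0) -> list[bool]:
    """For each turn, report whether the previous turn's prefix is still warm.

    Assumption: every turn (hit or miss) rewrites or refreshes the cache entry,
    so only the gap since the previous turn matters.
    """
    hits = []
    previous = None
    for t in turn_times_s:
        hits.append(previous is not None and (t - previous) <= ttl_s)
        previous = t
    return hits


# Turns at 0s, 60s, 120s, then a 10-minute coffee break, then one more turn.
times = [0, 60, 120, 720, 780]
print(cache_hits(times))                # [False, True, True, False, True]
print(cache_hits(times, ttl_s=3600.0))  # [False, True, True, True, True]
```

The first turn always misses because nothing is cached yet. With the 5-minute TTL the turn after the break misses as well; with a 1-hour TTL it does not.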

CH 04

Compact vs start fresh.

"/compact" is the most overused command in Claude Code. People run it because the UI shows context filling up — but the question is never "how full is the context", it's "does any of this old context still help with the next thing?"

Situation	Compact?	Start fresh?
Same feature, you're 40 turns in, want to keep going	Yes	No
Switching from "build auth" to "fix CSS bug"	No	Yes
Agent is repeating wrong answers across turns	No	Yes
Context is 80% but the relevant 20% is still useful	Yes	No
You can't summarize what the session is about in one sentence	No	Yes

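The rule of thumb: compacting always loses signal. If the older messages are mostly noise for the task you are still working on, that loss is cheap and compacting helps. If the old context is still useful, compacting hurts and you'll re-load the same files anyway. If the agent is confused and repeating wrong answers, a summary can carry the confusion forward, so start fresh instead.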

CH 05

When sub-agents pay off.

Sub-agents look like a free productivity boost — until you do the math. Spawning a sub-agent means duplicating the system prompt and tool definitions, then carrying the return message in your main context. Net cost is real.

The math works out when:

Research collapses into a summary. Sub-agent reads 50k tokens of docs, returns 500 tokens of answer. You spend 50k in the sub-agent's context, 500 in yours. Win.
Tasks parallelize cleanly. Four sub-agents on four independent files in 5 minutes vs four sequential turns in 20 minutes. Wall-clock win, often a cost win because each sub-agent's context is smaller than your shared one would be.
You want to throw away the working notes. The sub-agent's whole context dies when it returns. You keep only the answer.

The math fails when:

The work is sequential. Three steps where each depends on the last. A sub-agent buys you nothing.
The sub-agent needs your context to do its job. If you have to brief it for 5 minutes, you've already lost.
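The research case can be checked with rough arithmetic. The sketch below compares reading 50k tokens of docs in the main context against delegating the read to a sub-agent that returns a 500-token summary. It is a simplified model under stated assumptions: Sonnet prices from this guide, a hypothetical 5k-token overhead for the duplicated system prompt and tool definitions, a sub-agent that finishes in a single turn, and a main context that re-reads carried tokens at the cached rate on every later turn.

```python
# USD per million tokens (Claude Sonnet 4.6 figures from this guide).
INPUT_PER_M = 3.00
CACHED_PER_M = 0.30
OUTPUT_PER_M = 15.00


def inline_cost(research_tokens: int, remaining_turns: int) -> float:
    """Read the docs in the main context, then carry them for the rest of the session."""
    first_read = research_tokens * INPUT_PER_M
    carried = research_tokens * CACHED_PER_M * remaining_turns
    return (first_read + carried) / 1_000_000


def subagent_cost(research_tokens: int, summary_tokens: int,
                  overhead_tokens: int, remaining_turns: int) -> float:
    """Sub-agent reads the docs once; the main context carries only the summary."""
    spawn = (overhead_tokens + research_tokens) * INPUT_PER_M
    summary_out = summary_tokens * OUTPUT_PER_M
    carried = summary_tokens * (INPUT_PER_M + CACHED_PER_M * remaining_turns)
    return (spawn + summary_out + carried) / 1_000_000


for turns in (0, 20):
    a = inline_cost(50_000, turns)
    b = subagent_cost(50_000, 500, 5_000, turns)
    print(f"{turns:>2} remaining turns: inline ${a:.3f}, sub-agent ${b:.3f}")
```

Under these assumptions the sub-agent is slightly more expensive when nothing follows the research (the overhead is never repaid) and clearly cheaper once the main session continues for many turns, because the 50k tokens are no longer re-read on every turn.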

DEMO · INTERACTIVE

Live: session cost calculator.

Pick a model, drag the sliders to match what your day actually looks like. Numbers update live. All math runs in your browser — nothing leaves the page.

Session cost calculator. Prices in USD; verify against vendor pages before relying on the numbers.

Input	Default
Model	Claude Sonnet 4.6
Avg context per turn	40k tokens
Avg output per turn	1.5k tokens
Cache hit rate	70%
Turns per hour	30
Active hours per day	6

Things to try: bump cache from 70% to 0% (see why it matters). Switch from Sonnet to Opus on the same workload (a 5× jump at the prices in CH 02, since every Opus rate is 5× the Sonnet rate). Drop output from 1.5k to 0.5k by asking for diffs, not rewrites.
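For readers without the interactive page, the same calculation can be approximated offline. This is a sketch, not the page's actual implementation: the Session class and its field names are hypothetical, the defaults mirror the calculator inputs listed above, and the prices are the Opus and Sonnet rows from this guide's pricing matrix.

```python
from dataclasses import dataclass, replace

# USD per million tokens: (input, cached input, output), from this guide's matrix.
PRICES = {
    "Claude Opus 4.7": (15.00, 1.50, 75.00),
    "Claude Sonnet 4.6": (3.00, 0.30, 15.00),
}


@dataclass(frozen=True)
class Session:
    """Hypothetical container for the calculator inputs."""
    model: str = "Claude Sonnet 4.6"
    context_tokens: int = 40_000
    output_tokens: int = 1_500
    cache_hit_rate: float = 0.70
    turns_per_hour: int = 30
    hours_per_day: float = 6.0

    def daily_cost(self) -> float:
        """USD per day: per-turn cost times the number of turns in the day."""
        inp, cached, out = PRICES[self.model]
        hit = self.context_tokens * self.cache_hit_rate
        miss = self.context_tokens - hit
        per_turn = (miss * inp + hit * cached + self.output_tokens * out) / 1_000_000
        return per_turn * self.turns_per_hour * self.hours_per_day


base = Session()
scenarios = {
    "defaults": base,
    "cache hit rate 0%": replace(base, cache_hit_rate=0.0),
    "same workload on Opus": replace(base, model="Claude Opus 4.7"),
    "0.5k output per turn": replace(base, output_tokens=500),
}
for label, session in scenarios.items():
    print(f"{label:<24} ${session.daily_cost():.2f} / day")
```

With the default inputs this works out to roughly $12 per day on Sonnet, roughly $26 with no cache hits, and roughly $60 on Opus.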

PITFALLS

Common pitfalls.

Picking Opus for everything "to be safe"

This is the single biggest waste of money in AI coding. Opus is 5× the price of Sonnet for tasks where the quality delta is <10%. Default to Sonnet, escalate explicitly.

Letting the agent re-read the codebase every prompt

You'll see cat src/**/*.ts or 50 file reads in the agent's transcript. That kills your cache. Get it to summarize the relevant files once at the start of the session, then reference the summary.

"Rewrite the file" instead of "edit lines 40-65"

Modern tools default to patch-style edits via apply-diff tools. If yours is rewriting whole files, check whether the diff tool is enabled. Output tokens drop 5–20× the moment you switch.
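The size gap between a patch and a rewrite is easy to measure with Python's standard difflib module. The sketch below uses a synthetic 600-line file with a one-line change and compares character counts, which are only a rough stand-in for tokens; the file contents and the module.py name are made up for illustration.

```python
import difflib


def rewrite_vs_patch(old_lines: list[str], new_lines: list[str]) -> tuple[int, int]:
    """Return (characters in the full new file, characters in a unified diff)."""
    diff = "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="a/module.py", tofile="b/module.py", n=3
        )
    )
    return len("".join(new_lines)), len(diff)


# Synthetic file: 200 small functions, three lines each.
old_lines = []
for i in range(200):
    old_lines += [f"def handler_{i}():\n", f"    return {i}\n", "\n"]

# Change a single line in the middle of the file.
new_lines = list(old_lines)
new_lines[3 * 42 + 1] = "    return 42 + 1\n"

full_chars, patch_chars = rewrite_vs_patch(old_lines, new_lines)
print(f"full rewrite: {full_chars} chars, patch: {patch_chars} chars")
```

For a one-line change the diff is a small fraction of the file, and since output is the expensive side of the bill, that fraction is close to the saving.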

Forgetting that thinking tokens bill as output

Extended thinking / reasoning tokens count as output. A 30k-token thinking block on Opus is $2.25 by itself. Use thinking for hard problems, not for "summarize this email."
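The $2.25 figure follows directly from the output rate. A small sketch of the arithmetic, using the output prices from this guide's matrix (the helper name thinking_cost is hypothetical):

```python
# Output price in USD per million tokens, from this guide's pricing matrix.
OUTPUT_PER_M = {"Claude Opus 4.7": 75.00, "Claude Sonnet 4.6": 15.00}


def thinking_cost(thinking_tokens: int, model: str) -> float:
    """Thinking tokens are billed at the model's output rate."""
    return thinking_tokens * OUTPUT_PER_M[model] / 1_000_000


for model in OUTPUT_PER_M:
    print(f"30k thinking tokens on {model}: ${thinking_cost(30_000, model):.2f}")
```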