How Token Pricing Actually Works
AI token pricing charges separately for the tokens you send (input) and the tokens the model generates (output). A token is roughly four characters of English, so a thousand-word prompt is about 1,300 input tokens. The trap is the asymmetry: output tokens cost three to ten times more than input tokens, because generation is far more compute-intensive than reading. On Claude Sonnet 4.6, input runs around $3 per million tokens and output around $15 — a 5x gap.
That asymmetry, not the headline input price, drives your real cost per request. A summarisation task with a long document in and a short answer out is cheap. A reasoning or agentic task that emits thousands of output tokens per call — and chains several calls together — is expensive, even on a "cheap" model. Modelling the input-to-output ratio of your actual workload is the first step in any AI cost analysis, a discipline we apply across the pricing work in our enterprise software pricing benchmarks guide.
The 2026 Price Collapse — and Why Your Bill Rose
Token list prices fell roughly 80% between early 2025 and early 2026. GPT-4o input alone dropped from $5.00 to $2.50 per million tokens, and budget models now sit below $0.15. Yet most enterprises report their AI spend rising over the same period. The reason is volume: reasoning models emit far more output tokens, agentic workflows multiply the number of calls, and broader rollout expands the user base faster than unit prices fall.
This is the same dynamic that makes consumption-based pricing dangerous without governance — a pattern we examine in pay-per-use vs subscription. Falling unit prices give a false sense of safety; the only durable control is a token budget tied to usage volume, not list rate.
Model Pricing Benchmarks
The table below shows representative 2026 list rates per million tokens. Prices move quickly, so treat these as a benchmark band rather than a quote — but the tiering structure (budget, mid, premium) is stable across providers.
| Tier / Model | Input (per 1M) | Output (per 1M) | Typical Use |
|---|---|---|---|
| Budget (Gemini Flash Lite, GPT Nano) | $0.08–$0.15 | $0.28–$0.60 | High-volume classification, routing |
| Mid (Claude Sonnet 4.6, GPT-5.4) | $2.50–$3.00 | $15.00 | Production reasoning, drafting |
| Premium (Claude Opus 4.5) | $5.00 | $25.00 | Complex analysis, agents |
| Frontier reasoning (GPT-5 class) | $15.00 | $75.00 | Hardest tasks only |
Enterprises running a tiered routing architecture — roughly 70% of queries to a budget model, 20% to mid-tier, 10% to premium — achieve a median blended cost near $2.31 per million tokens. The single biggest cost mistake is sending every request to a frontier model "to be safe".
The Free Levers: Caching, Batch, Routing
Three mechanisms cut token costs before any contract negotiation, and most enterprises leave all three on the table. First, prompt caching: repeated input — a long system prompt, a fixed knowledge base — is cached and re-billed at 75–90% off. A $3 per million system prompt drops to roughly $0.30 on cache hits. Second, the Batch API: asynchronous workloads that tolerate a delay get a flat 50% discount on every token. Third, model routing: directing each request to the cheapest model that can handle it, rather than defaulting to the most capable.
Together these typically reduce a production AI bill by 40–70% with no loss of quality and no vendor negotiation required. They should be exhausted before you ever ask a provider for a discount — both because they are free, and because they shrink the committed volume you will eventually negotiate around. The same "fix the consumption before you commit" logic governs cloud commitments, as covered in Enterprise Agreement vs pay-as-you-go.
Negotiating an Enterprise Token Contract
Once usage is optimised and a stable volume baseline is visible, list rates become negotiable. Volume discounts and committed-use agreements typically cut list token rates by 25–50% for predictable, high-volume workloads. Providers also offer provisioned-throughput tiers that reserve dedicated capacity at a negotiated rate — useful for latency-sensitive production systems — and the largest enterprise deals are custom-quoted entirely.
The marketplace route matters too: accessing models through Amazon Bedrock or Azure AI Foundry lets you fold AI spend into an existing cloud commitment such as an AWS EDP or Azure MACC, sometimes unlocking better effective rates than a direct provider contract — though Bedrock volume pricing may require annual commitments in the $10,000–$50,000 range. The levers that move these deals are the familiar ones: a credible multi-model alternative, a committed annual volume, and timing aligned to the provider's fiscal year. We size and negotiate these agreements through our AI procurement advisory practice.
Building a Defensible Token Budget
The deliverable that protects an enterprise is a token budget: a model, per use case, of expected tokens per request, requests per day, and the routed model mix — converted into a monthly cost with a clear variance band. Without it, AI spend is governed by hope. With it, every optimisation and every negotiated discount can be measured against a baseline, and finance has a number to hold the programme to.
Treat the AI token contract like any other major procurement: benchmark before you negotiate, optimise consumption first, commit only your stable baseline, and keep a flexible tier for growth. To benchmark your token spend and model an enterprise AI contract before you commit, request a confidential briefing, or download our Price Benchmarking Report for current rate bands across vendors.