The Direct Answer: List Rate vs Effective Cost
It helps to separate two things. The published per-token rate — for example, GPT-4o at roughly $2.50 per million input tokens and $10 per million output tokens — is not something a buyer haggles down call by call in the self-serve console. But the effective cost per token your enterprise actually pays is highly negotiable, through the commercial structures that sit on top of the list rate. For sustained, high-volume workloads, the combined effect of those structures is a 30–70% reduction versus naive pay-as-you-go.
So "are token prices negotiable" is the wrong question. The right one is "what is my effective rate after committed throughput, caching, batch and credits" — and that number is very much in play. The wider framing sits in our complete guide to AI procurement, and the mechanics of the meter itself in understanding and negotiating AI token pricing.
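To make the distinction concrete, here is a minimal sketch of the arithmetic. It uses the approximate GPT-4o list rates quoted above. The monthly volumes and the invoice figure are hypothetical, and the function names are ours, chosen for illustration only.

```python
# Approximate GPT-4o list rates quoted above (USD per million tokens).
LIST_INPUT_PER_M = 2.50
LIST_OUTPUT_PER_M = 10.00


def list_spend(input_m: float, output_m: float) -> float:
    """Monthly spend at list for volumes given in millions of tokens."""
    return input_m * LIST_INPUT_PER_M + output_m * LIST_OUTPUT_PER_M


def blended_rate(spend: float, input_m: float, output_m: float) -> float:
    """USD per million tokens, input and output combined."""
    return spend / (input_m + output_m)


def reduction_vs_list(actual_spend: float, spend_at_list: float) -> float:
    """Fraction saved relative to paying list for the same tokens."""
    return 1 - actual_spend / spend_at_list


# Hypothetical month: 800M input tokens, 200M output tokens, invoiced at $2,000.
at_list = list_spend(800, 200)                # 4000.0
list_rate = blended_rate(at_list, 800, 200)   # 4.0 USD per million
paid_rate = blended_rate(2000.0, 800, 200)    # 2.0 USD per million
saving = reduction_vs_list(2000.0, at_list)   # 0.5, i.e. 50%
```

The useful number is `paid_rate`, not the list rate: it is what the invoice implies once every discount structure has been applied.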
Committed Throughput and Reservations
The largest structural lever is provisioned throughput: reserving dedicated capacity instead of metering per token. On Azure OpenAI, a Provisioned Throughput Unit (PTU) runs around $2,448 per month at the hourly rate. A one-year reservation commonly takes 40–65% off that figure, and annual commitments save roughly a further 35% versus monthly ones. For a steady workload, the effective per-token cost can fall by up to about 70%.
The trade-off is utilisation risk: reserved capacity is billed whether or not you use it, so provisioning only pays off above a clear utilisation threshold. Model your real token throughput before committing. The discipline mirrors capping consumption-based spend, which we cover in how to cap usage-based pricing.
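The utilisation threshold can be estimated with a simple break-even calculation, sketched below. The $2,448 figure is the one quoted above. The PTU count, the 50% reservation discount and the pay-as-you-go cost at full load are hypothetical inputs that you would replace with your own quote and measured throughput.

```python
# Monthly cost of one PTU at the hourly rate, as quoted above (USD).
HOURLY_RATE_MONTHLY_PER_PTU = 2448.0


def reserved_monthly_cost(ptus: int, reservation_discount: float) -> float:
    """Fixed monthly bill for reserved capacity, used or not."""
    return ptus * HOURLY_RATE_MONTHLY_PER_PTU * (1 - reservation_discount)


def break_even_utilisation(reserved_cost: float, payg_cost_at_full_load: float) -> float:
    """Share of reserved capacity you must use before reserving beats pay-as-you-go.

    payg_cost_at_full_load is what the same capacity, fully used,
    would cost if metered per token.
    """
    return reserved_cost / payg_cost_at_full_load


def reservation_pays_off(expected_utilisation: float, reserved_cost: float,
                         payg_cost_at_full_load: float) -> bool:
    """True when expected utilisation is above the break-even point."""
    return expected_utilisation > break_even_utilisation(
        reserved_cost, payg_cost_at_full_load
    )


# Hypothetical: 10 PTUs at a 50% reservation discount, against traffic that
# would cost $30,600 a month on pay-as-you-go if the capacity were fully used.
reserved = reserved_monthly_cost(10, 0.50)               # 12240.0
threshold = break_even_utilisation(reserved, 30_600.0)   # 40% utilisation
```

Under these assumed inputs, a workload expected to run at 55% utilisation justifies the reservation and one at 30% does not.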
| Lever | Typical Saving vs PAYG List | Best For |
|---|---|---|
| Batch API | ~50% | Non-urgent, asynchronous workloads |
| Prompt caching | Up to ~75% on cached context | Repeated system prompts / long context |
| Provisioned reservation (annual) | 40–70% | Steady, high-volume production traffic |
| Committed-spend agreement | 10–30% | Scaled, multi-product AI estates |
The No-Commitment Levers
Before committing to any reservation, capture the levers that cost nothing to pull. The batch API typically runs about 50% cheaper than synchronous calls for work that can tolerate a delay, such as overnight document processing, evaluation runs and bulk enrichment. Prompt caching discounts repeated context heavily, which matters whenever a large system prompt or knowledge base is sent on every call. And the most reliable lever of all is model right-sizing: routing requests to a smaller model where it suffices often beats any negotiated discount outright, because it moves those tokens onto a much lower rate instead of shaving a percentage off the higher one.
These choices are architectural as much as commercial, which is why token strategy belongs in the procurement conversation, not just the engineering one. The deployment decision also shifts the maths — running models direct versus through a hyperscaler changes both rate and data terms, as compared in Azure OpenAI vs OpenAI direct cost.
Pull the free levers first. Batch and caching can take 50–75% off the affected calls with no commitment, and right-sizing moves suitable requests onto a cheaper model altogether. Only then does a provisioned reservation make sense, because you are reserving against a workload you have already optimised, not a bloated one.
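A rough sketch of how the batch and caching discounts flow through to a whole bill is shown below. The discount levels are the approximate figures from the table above and the default rates are the approximate GPT-4o list rates. The traffic split is hypothetical, and the model deliberately assumes that batch traffic receives no additional caching discount, since whether the two stack depends on the provider.

```python
def cost_after_free_levers(
    input_m: float,              # millions of input tokens per month
    output_m: float,             # millions of output tokens per month
    batch_share: float,          # fraction of traffic moved to the batch API
    cached_input_share: float,   # fraction of synchronous input served from cache
    input_rate: float = 2.50,    # USD per million, approximate list rate
    output_rate: float = 10.00,  # USD per million, approximate list rate
    batch_discount: float = 0.50,
    cache_discount: float = 0.75,
) -> float:
    """Monthly spend after batch and caching, before any commitment.

    Assumption: batch traffic gets the batch discount only, with no
    caching discount applied on top.
    """
    sync_share = 1 - batch_share
    sync_input = (input_m * sync_share * input_rate
                  * (1 - cached_input_share * cache_discount))
    sync_output = output_m * sync_share * output_rate
    batch = ((input_m * input_rate + output_m * output_rate)
             * batch_share * (1 - batch_discount))
    return sync_input + sync_output + batch


# Hypothetical month: 800M input and 200M output tokens, a quarter of the
# traffic moved to batch, half of the synchronous input served from cache.
baseline = cost_after_free_levers(800, 200, batch_share=0.0, cached_input_share=0.0)
optimised = cost_after_free_levers(800, 200, batch_share=0.25, cached_input_share=0.5)
saving = 1 - optimised / baseline   # roughly 27% of the whole bill
```

The whole-bill saving is smaller than the headline discounts because each discount applies only to the calls it touches. That gap is the reason to measure the batchable and cacheable share of traffic before sizing any reservation.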
Folding Tokens Into a Cloud Commitment
The most overlooked lever is the one you may already own. On Azure, Azure OpenAI consumption rolls into your Enterprise Agreement or Microsoft Customer Agreement, and any negotiated Azure committed-spend discount, under a Microsoft Azure Consumption Commitment (MACC), carries straight through to AI usage. The same applies to Amazon Bedrock under an Enterprise Discount Program (EDP) and to Google Cloud models under a committed-use contract. If you already hold a cloud commitment, your AI tokens should be drawing it down at the negotiated rate, not billing separately at list.
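As a simple illustration of the pass-through, the sketch below compares AI spend billed separately at list with the same spend drawn down against an existing commitment. The 15% discount, the $1,000,000 commitment and the $4,000 monthly list spend are hypothetical. How drawdown is actually counted depends on your agreement, so treat this as an assumption to verify against the contract.

```python
from typing import Iterable


def discounted_spend(spend_at_list: float, commitment_discount: float) -> float:
    """Spend after the negotiated cloud-commitment discount is applied."""
    return spend_at_list * (1 - commitment_discount)


def remaining_commitment(commitment: float, monthly_spends: Iterable[float]) -> float:
    """Commitment left after drawdown, floored at zero once fully consumed."""
    return max(commitment - sum(monthly_spends), 0.0)


# Hypothetical: $4,000 a month of AI usage at list, a 15% negotiated
# discount, and a $1,000,000 annual commitment already in place.
monthly_at_list = 4000.0
monthly_discounted = discounted_spend(monthly_at_list, 0.15)

annual_saving = 12 * (monthly_at_list - monthly_discounted)
left_to_burn = remaining_commitment(1_000_000.0, [monthly_discounted] * 12)
```

With these inputs the AI usage costs about $7,200 a year less than the same tokens at list, and it counts towards a commitment you have already made.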
This is also where seat licences and token spend should be traded against each other rather than bought apart, the point we make in how OpenAI Enterprise is priced. The contractual terms to secure alongside the rate — data rights, rate-limit guarantees, price-change protection — are set out in the AI procurement checklist, and our AI procurement advisory negotiates the whole package.
The List-Price Trap
The expensive mistake is budgeting and buying at the published per-token rate, treating it as fixed because it is printed on a pricing page. Enterprises that do this routinely overpay by half or more on production workloads that would qualify for provisioned reservations, batch processing or MACC pass-through. The list rate is a default for experimentation, not a price a serious buyer at scale should ever pay in full.
Token cost compounds fast as AI moves from pilot to production, so the time to structure it is before the workload scales — the timing principle in planning your IT contract renewal calendar applies here too. When your AI token spend is material, request a confidential briefing — we model your effective rate and negotiate the committed structures that bring it down.