How Vertex AI Pricing Actually Works
Google Vertex AI pricing is consumption-based: you pay per million input and output tokens for Gemini models, plus separate charges for grounding, embeddings, tuning, and any dedicated compute. The published rate card is the starting point, not the enterprise price. Gemini 2.5 Flash-Lite lists at $0.10 per million input tokens and $0.40 per million output; the flagship Gemini 2.5 Pro lists at $1.25 input and $10.00 output. Output is where the bill is made — a chat-heavy or agentic workload that generates long responses can cost 8x its input on Pro.
The decisive point for any buyer: Vertex AI consumption is discounted through Google Cloud committed use discounts (CUDs), not through a standalone Vertex contract. The per-token rate becomes negotiable once you fold projected token volume into the broader Google Cloud commitment — the same mechanism that governs Compute Engine, BigQuery, and the rest of the platform. That is why a Vertex AI negotiation is really a Google Cloud commit negotiation, and why it belongs inside your wider Google Cloud commercial relationship rather than treated as an isolated AI line item.
The Real Cost Drivers (and Hidden Fees)
Three line items routinely break enterprise Vertex AI budgets — and none of them are the headline token rate. The first is Google Search grounding: $14 per 1,000 queries on Gemini 3 models and $35 per 1,000 on Gemini 2.x. For a team that grounds every response against live search, this single charge often exceeds the inference cost itself and becomes the largest line on the bill. The second is the Vertex management fee layered on top of raw compute — an NVIDIA A100 GPU is $2.93 per hour of compute plus a $0.44 per hour Vertex fee, a 15% surcharge that is easy to miss in a proof-of-concept and material at production scale. The third is the input/output asymmetry already noted: output tokens cost up to 8x input.
| Cost component | List rate (2026) | Optimisation lever |
|---|---|---|
| Gemini 2.5 Pro — input / output | $1.25 / $10.00 per M tokens | Batch mode: $0.625 / $5.00 (50% off) |
| Gemini 2.5 Flash-Lite — input / output | $0.10 / $0.40 per M tokens | Right-size model to task |
| Context caching (repeated input) | ~10% of base input rate | Up to 90% saving on repeated prompts |
| Google Search grounding | $14–$35 per 1,000 queries | Cap grounded calls; cache results |
| Vertex management fee (A100) | $0.44 / hour on top of $2.93 compute | Fold into CUD compute commit |
The optimisation levers matter before you ever negotiate a discount. Context caching charges only about 10% of the standard input rate for repeated tokens, cutting up to 90% off prompts with large fixed system context. Batch mode takes 50% off for non-urgent work — document processing, nightly analytics, content pipelines — dropping Gemini 2.5 Pro to $0.625 / $5.00. A buyer who has already engineered caching and batch into the workload negotiates from a credible, defensible volume forecast rather than an inflated one.
Committed Use Discounts: The Core Lever
The discount engine is the Google Cloud CUD. A one-year commitment typically yields 20–30% off list; a three-year commitment 40–50%, with up to 55% on Vertex AI compute specifically. The discount scales with total commit size: once Vertex spend is aggregated into the broader Google Cloud commitment, indicative net pricing runs 25–40% below list at $1M–$5M of annual spend, 35–50% below at $5M–$20M, and 45%+ above $20M. Build the forecast in tokens, not dollars — a token forecast survives model price changes, and it is the unit Google's own deal desk reasons in.
A CUD charges for the committed amount whether or not you consume it. Commit 70–80% of forecast usage and leave 20–30% on-demand for spiky or experimental workloads — then negotiate a re-forecast at the 12-month mark.
This is the same consumption-versus-commitment tension that runs through every modern AI contract. If you are weighing per-seat against usage-based models elsewhere in your stack, our analysis of seat-based versus consumption AI pricing sets out where each structure wins — and Vertex sits firmly on the consumption side, which is exactly why commit discipline matters so much.
Five Levers That Move the Price
The strongest lever is a Google Workspace renewal. Google's enterprise sales team carries different quotas for Workspace renewals versus net-new cloud spend and cares intensely about renewal retention. A two-tranche structure works: tranche one consolidates Workspace seats with a Gemini add-on at 15–20% off; tranche two commits Vertex AI API volume at 28–32%, with a two-year term on both to give Google the certainty that justifies the deeper discount. Second, a credible competitive alternative — Google will open near 25%; push to 30–32% by citing live OpenAI and Anthropic quotes. Our side-by-side on OpenAI vs Anthropic vs Google API pricing gives you the benchmark numbers to do this, and the Anthropic Claude API pricing tiers are the most direct comparison to put on the table.
Third, commit aggregation: roll Vertex into the platform-wide CUD rather than signing a standalone Vertex commit, so the AI spend earns the bracket discount of the whole account. Fourth, term length — a three-year commit roughly doubles the one-year discount, but only commit to three years where the token forecast is genuinely stable. Fifth, timing to Google's quarter-end, where the deal desk has the most latitude to approve exceptions. If your decision is between platforms rather than within Google, our Gemini Enterprise versus Microsoft Copilot comparison frames the trade-offs; for agent-heavy workloads, weigh it against Salesforce Agentforce consumption pricing before you commit.
Contract Protections to Secure
Beyond price, four protections belong in any Vertex AI commit. First, an overrun and re-forecast clause: the right to re-baseline the commitment at 12 months without penalty, so a wrong forecast does not lock you into paying for tokens you never burn. Second, data residency in writing — Vertex supports data-at-rest in 10 countries with CMEK and VPC Service Controls, but EU data residency is a paid add-on; confirm which regions and which add-ons are priced in, not assumed. Third, a price-protection clause holding your negotiated per-token rates for the full term against list-price moves. Fourth, audit and exit terms documented before signing. The contract clauses that most often catch buyers out are catalogued in our white paper on AI contract red flags — read it before you countersign.
Buyer Traps to Avoid in 2026
The defining trap is uncontrolled consumption. Uber exhausted its entire 2026 AI budget by April after usage across a 5,000-strong engineering team climbed from 32% to 84% in a single month — a reminder that an under-forecast CUD is as damaging as an over-forecast one. The second trap is treating Vertex as a standalone purchase and forfeiting the platform-wide bracket discount. The third is ignoring grounding and management fees until the first full production invoice. For the complete framework — model selection, FinOps controls, and contract structure end to end — see our complete guide to AI procurement. When the commitment is material, request a confidential briefing and we will benchmark your Vertex forecast against live market deals before you sign.