AI Model Hosting Contracts: On-Prem vs Cloud

The build-versus-buy decision for AI hosting used to be settled by capability. In 2026, with open-weight models closing the gap and GPU rates below $1.50 an hour, it is increasingly settled by contract terms — and the exit rights you negotiate up front.

By AI Practice Lead

Three Hosting Models, Three Contracts

AI model hosting in 2026 comes in three commercial shapes, and each is a different contract. Metered API access from a foundation-model vendor charges per token and bundles the infrastructure; private-cloud or dedicated-capacity hosting reserves GPU capacity in a provider's estate; and on-premise hosting runs open-weight or licensed models on hardware you control. The capability gap that once forced everyone onto frontier APIs has narrowed — open-weight models are now competitive for a wide band of enterprise workloads — so the decision turns on economics and contract terms rather than raw quality. The broader commercial context sits in the AI contract negotiation deep dive.

Where the Cost Crossover Sits

The crossover between metered API and self-hosting is a utilisation question. With specialised-cloud H100 capacity available from roughly $1.03–$2.49 per hour — after a 64–75% price collapse from the 2024 peak — self-hosting a capable open-weight model crosses below per-token API economics once a GPU is kept busy enough. Below that utilisation, metered API access wins because idle time costs nothing. The full GPU benchmark sits in negotiating AI compute costs and GPU pricing; the practical rule is that steady, high-volume, latency-tolerant workloads favour self-hosting, while spiky or frontier-capability workloads favour the API.

Model the crossover on a representative production workload, including egress and idle time — not on peak throughput. The vendor's per-token quote and the self-host hourly rate are only comparable once both are expressed as cost per thousand real requests.

API Hosting: The Terms That Matter

For metered API hosting, the rate is rarely the binding constraint — the lock-in terms are. Five clauses decide whether you stay in control: data residency and processing location; opt-in training consent so your inputs are never used to improve the model without written agreement; model-deprecation notice, because vendors retire models every few months and an unannounced deprecation can break a production system; version pinning, so you choose when to migrate; and an exit right to export prompts, fine-tunes and embeddings in a usable format. The deprecation and support side is covered in negotiating AI vendor support and SLAs, and the data-rights side in AI safety clauses in enterprise contracts.

Private-Cloud and Dedicated Capacity

Dedicated-capacity hosting sits between API and on-prem: you reserve GPU capacity, often with a foundation-model or platform vendor, and run inference in an isolated environment. It addresses data-isolation requirements without the capital commitment of on-prem, but it reintroduces the reserved-capacity problem from the compute market — a multi-year reservation in a market where rates fell 64–75% needs a downward re-rate clause. Negotiate dedicated capacity the way you would any reserved GPU commitment, reserving only the irreducible baseline and keeping burst on metered terms.

On-Prem: Sovereignty and Its Price

On-premise hosting delivers genuine data sovereignty and removes dependence on a single API provider — the decisive factor for some regulated workloads. But it is not lock-in-free: you take on hardware procurement, open-weight or commercial model licences, and the operational burden of running inference at scale. The economics only pay back at sustained high utilisation, the same crossover logic as private-cloud. And the model and tooling licences still need exit and portability terms, because an on-prem deployment built on a licensed model that changes its terms is its own form of dependence. Where retrieval is involved, the platform licensing in enterprise RAG platform licensing applies regardless of where the model runs.

Compliance, Residency and Open-Weight Licences

The hosting decision is often forced by compliance rather than cost. Where data cannot leave a jurisdiction or a tenant boundary, metered API access to a multi-tenant model may be ruled out regardless of price, pushing the workload toward dedicated capacity or on-prem. The contract has to make the residency commitment explicit — processing location, sub-processor list, and the right to be notified before either changes — rather than relying on a region setting that the vendor can revise. These terms sit alongside the safety and data-handling clauses in AI safety clauses in enterprise contracts, and they frequently determine the hosting model before the cost crossover is even calculated.

Open-weight model licences are the second compliance trap. A self-hosted deployment built on an open-weight model is only as portable as that model's licence permits — some carry acceptable-use restrictions, field-of-use limits, or clauses that change on new releases. Treat the model licence as a contract in its own right: confirm the permitted commercial scope, whether the terms can change retroactively, and what happens to your deployment if the licence is revised. The same provenance and rights questions that govern training data, covered in the broader AI contract negotiation deep dive, apply to the weights you host.

Structuring the Decision

Treat hosting as a portfolio rather than a single choice. Most large enterprises in 2026 run a mix: metered API for frontier and spiky workloads, dedicated or self-hosted capacity for high-volume stable ones, and a documented ability to move between them. That portability is itself the source of negotiating power — a credible self-hosting alternative disciplines API pricing, as set out in the multi-model AI strategy guide. For the full evaluation framework, download the AI Procurement Checklist or request a confidential briefing.

Deciding With Confidence

The hosting decision in 2026 is no longer a leap of faith about model quality — open-weight models have closed enough of the gap that the choice is an economic and contractual one. The crossover is a utilisation number: keep a GPU busy enough at $1.03–$2.49 per hour and self-hosting beats the API; leave it idle and the metered API wins. Everything else — residency, open-weight licence terms, exit rights — is a contract question you can answer before a single workload moves.

The enterprises that get this right run a deliberate portfolio rather than a single bet, and they negotiate the portability that keeps the portfolio fluid. A documented self-host option is worth more than its direct saving because it disciplines every API quote you receive, the point made across the multi-model AI strategy and compute-cost guides. Through our AI procurement advisory practice we will model the crossover on your real workloads and negotiate the hosting and exit terms that make the decision reversible.

Common Questions

AI Model Hosting: FAQ

Is self-hosting an AI model cheaper than using an API?
It can be for high-volume, stable workloads. With specialised-cloud H100 capacity from roughly $1.03–$2.49 per hour and open-weight models now competitive, self-hosting crosses below per-token API pricing once sustained utilisation is high. For spiky, low-volume or frontier-capability workloads, metered API access remains cheaper because you pay nothing when idle.
What contract terms matter most for API model hosting?
Data residency, opt-in training consent, model-deprecation notice, version pinning, and an exit right that lets you export prompts, fine-tunes and embeddings in usable form. The metered rate matters, but the deprecation and portability terms determine whether you are locked in when the vendor changes models — which they do every few months.
Does on-prem AI hosting remove vendor lock-in?
It reduces dependence on a single API provider but introduces hardware, model-licence and support commitments of its own. The sovereignty benefit is real for regulated data, but on-prem only pays back at sustained high utilisation, and the model and tooling licences still need exit and portability terms negotiated.

Get the Hosting Decision Right

Our advisors model the cost crossover and negotiate the hosting, exit and portability terms that decide whether build-versus-buy works in your favour.

Request a Confidential Briefing Explore AI Procurement Advisory

AI Procurement Intelligence

Monthly briefings on AI hosting economics, model licensing, and the contract terms that move enterprise risk — from advisors who negotiate these deals.