Azure Spot VMs Explained
Azure Spot VMs allow you to use Azure's surplus compute capacity at significantly reduced prices. When Azure needs the capacity back for on-demand or reserved customers, it can evict Spot VMs with 30 seconds' notice. This is the fundamental trade-off: deep discounts in exchange for preemptibility.
Azure Spot VMs are the Azure equivalent of AWS Spot Instances and GCP Preemptible/Spot VMs — all three major providers offer this pricing model because excess capacity is commercially inefficient, and selling it at a discount recovers marginal revenue that would otherwise be lost.
From a commercial strategy perspective, Spot pricing is one of the highest-leverage cost optimization tools available on Azure — but only when applied to the right workloads. Applying Spot to unsuitable workloads creates operational instability that costs far more to manage than the discount saves.
Spot Pricing Model and Discount Levels
Azure Spot pricing is dynamic: it varies by VM size, region, and current capacity utilization. Microsoft publishes current Spot prices through the Azure pricing calculator and the Azure Retail Prices API, and the Azure portal shows recent price history and eviction rates for each VM size and region. Understanding pricing variability is important for workload placement decisions.
| VM Series | Typical Spot Discount vs On-Demand | Eviction Rate (Low Traffic Periods) |
|---|---|---|
| D-series (general purpose) | 60-75% | 5-15% |
| F-series (compute optimised) | 65-80% | 5-10% |
| E-series (memory optimised) | 55-70% | 8-20% |
| N-series (GPU) | 70-90% | 10-25% |
| H-series (HPC) | 75-90% | 5-15% |
Eviction rates vary dramatically by region and time of day. East US, West Europe, and Southeast Asia (high-utilization regions) typically have higher eviction rates than less-constrained regions. Spot VMs deployed in off-peak hours (nights and weekends in the deployment region's time zone) experience significantly lower eviction rates — making batch jobs that can be scheduled during off-peak windows particularly well-suited to Spot pricing.
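As a minimal sketch of the off-peak scheduling idea above, a batch launcher could check whether the deployment region is currently in a night or weekend window before requesting Spot capacity. The window boundaries (20:00 to 06:00 local time) and the fixed UTC offset are illustrative assumptions, not Azure-defined values, and the function name is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

def is_off_peak(now_utc, utc_offset_hours,
                night_start=time(20, 0), night_end=time(6, 0)):
    """Return True if the region's local time is a night or weekend.

    utc_offset_hours is a fixed offset for the deployment region
    (an assumption: it ignores daylight saving changes).
    """
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    if local.weekday() >= 5:          # Saturday or Sunday in the region
        return True
    t = local.time()
    # The night window wraps past midnight, so it is an OR of two ranges.
    return t >= night_start or t < night_end

# Wednesday 15:00 UTC is 10:00 at UTC-5: a weekday daytime, so not off-peak.
print(is_off_peak(datetime(2024, 1, 10, 15, 0, tzinfo=timezone.utc), -5))
```

A scheduler could call this before submitting each batch and defer the job when it returns False.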
Workloads Suitable for Spot Pricing
The defining characteristic of a Spot-suitable workload is fault tolerance: the workload can be interrupted, its state can be preserved or its work restarted, and the overall job still completes correctly despite individual VM evictions.
Batch Data Processing
ETL pipelines, data transformation jobs, large-scale analytics queries, and log processing are ideal Spot candidates. These workloads process bounded data sets, can be designed to checkpoint progress, and produce deterministic outputs regardless of how many VMs complete which portions of the work. A Databricks cluster running on Spot VMs for nightly data transformation achieves the same output at 70% lower compute cost — as long as jobs are designed to handle node loss gracefully.
Machine Learning Training
ML training is one of the highest-value Spot use cases because training jobs are compute-intensive (hours to days), the cost savings are substantial in absolute terms, and modern ML frameworks (PyTorch, TensorFlow) natively support checkpoint-and-resume patterns. An ML training job running on 100 Standard_NC24 (GPU) VMs for 48 hours costs approximately $14,400 at on-demand pricing — on Spot at 80% discount, this drops to $2,880. Designing the training job to checkpoint every 30 minutes means at most 30 minutes of computation is lost per eviction event, a small overhead against the $11,520 saving.
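The arithmetic in the training example can be reproduced with a short calculation. The $3.00 per VM-hour rate is not stated in the text; it is the rate implied by $14,400 across 100 VMs for 48 hours, and the function name is illustrative.

```python
def training_cost(vm_count, hours, on_demand_rate, spot_discount_pct):
    """Return (on_demand_cost, spot_cost, saving) for one training run."""
    on_demand = vm_count * hours * on_demand_rate
    # Integer percentage keeps the arithmetic exact for round figures.
    spot = on_demand * (100 - spot_discount_pct) / 100
    return on_demand, spot, on_demand - spot

# Implied rate: 14,400 / (100 VMs * 48 h) = $3.00 per VM-hour (an assumption).
on_demand, spot, saving = training_cost(100, 48, 3.00, 80)
print(on_demand, spot, saving)  # 14400.0 2880.0 11520.0
```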
CI/CD and Test Environments
Build pipelines, automated testing, and integration test environments are natural Spot candidates. Eviction of a CI build simply retriggers the build from the last checkpoint or from scratch — no data loss, no production impact, and developers are accustomed to build retries. Many organizations run their entire CI/CD fleet on Spot pricing and find that the occasional eviction-induced retry adds seconds to average build times while saving 60-75% of compute costs.
Web Tier Horizontal Scaling
Stateless web application nodes that receive traffic through a load balancer can use Spot instances for burst capacity scaling. When Spot nodes are evicted, the load balancer removes them from rotation; remaining nodes (on-demand or reserved) absorb the traffic. This pattern requires sufficient baseline capacity on non-Spot VMs to handle traffic if Spot nodes are evicted simultaneously — a scenario that, while rare, must be planned for.
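One way to reason about the baseline requirement is to size the non-Spot tier for the traffic that must survive a simultaneous eviction of every Spot node, and use Spot only for the remainder. The sketch below uses hypothetical traffic figures and a hypothetical per-node throughput; it is a sizing aid, not a capacity model.

```python
import math

def plan_web_tier(must_survive_rps, burst_rps, rps_per_node):
    """Split a stateless web tier into non-Spot and Spot node counts.

    must_survive_rps: traffic the tier must serve with zero Spot nodes.
    burst_rps: extra traffic served only while Spot capacity is available.
    """
    non_spot = math.ceil(must_survive_rps / rps_per_node)
    total = math.ceil((must_survive_rps + burst_rps) / rps_per_node)
    return {"non_spot": non_spot, "spot": total - non_spot}

# Hypothetical figures: 4,000 rps must survive, 2,500 rps is burst.
print(plan_web_tier(4000, 2500, 500))  # {'non_spot': 8, 'spot': 5}
```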
Workloads That Should Never Use Spot
Applying Spot pricing to the wrong workloads creates operational risk that erodes the discount savings through increased support costs, incident management overhead, and damage to application reliability.
Databases and Stateful Services
Relational databases, NoSQL stores, and any stateful service where data consistency depends on all nodes being available should not run on Spot VMs. Even with strong backup and replication strategies, eviction of a primary database node during high-write periods can cause data loss or extended recovery procedures that cost far more than the Spot savings.
Long-Running Interactive Sessions
Development environments with long-running interactive sessions, RDP/SSH connections, or Jupyter notebooks where users are actively working are poor Spot candidates. An eviction destroys the session and any unsaved in-memory work. The damage to user experience and the cost of lost productivity exceed the compute savings.
Applications with SLA Commitments
Any production workload with committed SLA uptime obligations to customers should not have Spot VMs as the primary compute tier. A Spot eviction event during peak traffic that removes 30% of your compute capacity and causes an SLA breach carries a financial penalty that may dwarf months of Spot savings.
Real-Time Processing with Low Latency Requirements
Financial transaction processing, real-time fraud detection, healthcare monitoring applications, and any workload where processing latency is a hard requirement should not use Spot. Node removal during eviction creates processing gaps that violate latency guarantees.
Designing for Eviction Tolerance
The difference between a workload that benefits from Spot and one that's damaged by it is almost entirely architectural. Eviction-tolerant architecture is not complex — but it must be built in, not retrofitted.
The Azure Scheduled Events API, served by the Azure Instance Metadata Service, delivers eviction notices (event type "Preempt") at least 30 seconds before preemption. Applications must poll this endpoint to receive notices. The implementation pattern: every worker process polls Scheduled Events on a tight interval (every 1-5 seconds); when an eviction notice is received, the worker saves its checkpoint state, drains in-flight work, and terminates gracefully within the 30-second window.
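The polling pattern might look roughly like the following sketch. It uses only the Python standard library and the Instance Metadata Service endpoint for Scheduled Events; the function names, the `on_evict` callback, and the `max_polls` parameter are hypothetical, and the checkpoint and drain logic is left to the caller.

```python
import json
import time
import urllib.request

# Instance Metadata Service endpoint, reachable only from inside an Azure VM.
ENDPOINT = ("http://169.254.169.254/metadata/scheduledevents"
            "?api-version=2020-07-01")

def fetch_scheduled_events(timeout=2.0):
    """Fetch the current Scheduled Events document as a dict."""
    req = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def preempt_events(document):
    """Return only the Spot eviction ("Preempt") events."""
    return [e for e in document.get("Events", [])
            if e.get("EventType") == "Preempt"]

def watch(on_evict, fetch=fetch_scheduled_events, interval=1.0, max_polls=None):
    """Poll until an eviction notice arrives, then call on_evict(events).

    Returns True if an eviction was seen, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            events = preempt_events(fetch())
        except (OSError, ValueError):
            events = []   # transient network or parse failure: retry next tick
        if events:
            on_evict(events)   # caller checkpoints, drains, and exits here
            return True
        polls += 1
        time.sleep(interval)
    return False
```

A worker would typically run `watch` in a background thread and pass a callback that saves checkpoint state and stops accepting new work. The `fetch` parameter exists so the loop can be exercised off-VM with a stub.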
For Azure VM Scale Sets with Spot instances, configure: eviction policy set to "Deallocate" (keeps the VM's disks so the instance can be restarted faster, though the disks continue to incur storage charges) rather than "Delete"; automatic instance repair to replace evicted nodes; and queue-based work distribution so jobs are not lost when nodes are removed. This architecture pattern (queue-based distribution, checkpoint-on-eviction, automatic replacement) is the foundation of all high-value Spot deployments.
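To illustrate why queue-based distribution keeps jobs from being lost, the sketch below simulates a lease-style queue in memory: a received job becomes invisible for a timeout and reappears if the worker never deletes it, which is what happens when a worker is evicted mid-job. This is a stand-in for a managed queue service, not Azure SDK code; the class and its methods are hypothetical.

```python
class LeaseQueue:
    """In-memory queue where received jobs reappear unless deleted in time."""

    def __init__(self, jobs, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        # Map each job to the time at which it next becomes visible.
        self._visible_at = {job: 0.0 for job in jobs}

    def receive(self, now):
        """Lease the first visible job, hiding it until the timeout expires."""
        for job, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[job] = now + self.visibility_timeout
                return job
        return None

    def delete(self, job):
        """Acknowledge completion so the job is never redelivered."""
        self._visible_at.pop(job, None)

    def __len__(self):
        return len(self._visible_at)

queue = LeaseQueue(["job-a", "job-b"], visibility_timeout=30)
evicted_job = queue.receive(now=0)    # worker 1 takes job-a, then is evicted
finished_job = queue.receive(now=1)   # worker 2 takes job-b
queue.delete(finished_job)            # worker 2 completes and acknowledges
retried_job = queue.receive(now=31)   # job-a is visible again for a new worker
print(evicted_job, retried_job, len(queue))  # job-a job-a 1
```

The design choice is that workers acknowledge only after finishing (or after writing a checkpoint), so an eviction costs a retry rather than a lost job.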
Building a Blended Pricing Strategy
The optimal Azure pricing strategy is not "use Spot everywhere" or "use Reserved Instances everywhere" — it is a blended strategy that applies each pricing model to the workloads for which it is best suited.
| Workload Type | Recommended Pricing | Rationale |
|---|---|---|
| Production compute (stable) | Reserved Instances (1-year) | 40-60% discount, guaranteed availability |
| Production compute (variable) | Azure Savings Plans | Flexibility across VM families, 20-40% discount |
| Batch/analytical processing | Spot VMs | 60-90% discount, fault-tolerant architecture |
| Dev/test environments | Spot + Azure Dev/Test pricing | Maximum discount for non-production |
| Burst capacity | Spot (with on-demand fallback) | Deep discount with on-demand safety net |
For an Azure compute environment that would cost $1M annually at on-demand rates, a well-designed blended strategy might allocate: 50% of that spend to Reserved Instances for stable production workloads (saving $250-300K vs. on-demand); 25% to Spot for batch and analytical workloads (saving $150-200K vs. on-demand); and 25% to on-demand for genuinely variable and burst workloads. Total blended savings: $400-500K annually versus pure on-demand, a 40-50% reduction.
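The blended figures above can be checked with a few lines of arithmetic. This sketch treats the $1M as the on-demand baseline and uses the discount ranges implied by the quoted savings (50-60% for Reserved Instances, 60-80% for Spot); both are assumptions read off the numbers rather than stated rates.

```python
ON_DEMAND_BASELINE = 1_000_000  # annual spend at on-demand rates (assumed)

def saving(share_pct, discount_pct, baseline=ON_DEMAND_BASELINE):
    """Annual saving for a slice of spend, using integer percentages."""
    return baseline * share_pct * discount_pct // 10_000

# share of spend (%), (low discount %, high discount %)
allocation = {
    "reserved":  (50, (50, 60)),
    "spot":      (25, (60, 80)),
    "on_demand": (25, (0, 0)),
}

low = sum(saving(share, d[0]) for share, d in allocation.values())
high = sum(saving(share, d[1]) for share, d in allocation.values())
print(low, high)  # 400000 500000
```

Swapping in your own shares and observed discounts gives a quick first estimate before a fuller model.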
Spot vs. Azure Savings Plans vs. Reserved Instances
Choosing between Azure's three non-on-demand pricing models requires understanding the trade-offs on discount depth, flexibility, and eviction risk:
- Reserved Instances: 40-60% discount vs. on-demand. Fixed VM size and region. 1-year or 3-year term. No eviction risk. Best for stable, known workloads where size and region won't change.
- Azure Savings Plans: 20-40% discount vs. on-demand. Flexible across VM families, regions, and OS types. 1-year or 3-year commitment to an hourly spend level. No eviction risk. Best for dynamic environments where VM mix changes.
- Spot VMs: 60-90% discount vs. on-demand. Flexible sizing and regions. No commitment required. Eviction risk — suitable only for fault-tolerant workloads. Best for batch, analytics, ML training, and dev/test.
The combination of Savings Plans (for production workload baseline) and Spot (for analytical and batch workloads) often outperforms Reserved Instances alone for complex enterprise environments with mixed workload types. Model your specific workload mix against all three pricing models before committing to a strategy.
Spot Implementation Best Practices
For organizations beginning Spot adoption or optimizing an existing Spot strategy:
- Start with dev/test: The lowest-risk Spot deployment is development and test environments. Eviction is acceptable, stakes are low, and you build operational experience with Spot behavior before applying it to production-adjacent workloads.
- Use Azure VM Scale Sets: Do not run Spot as standalone VMs for production-adjacent workloads. VM Scale Sets with Spot instances provide automatic replacement, load balancing integration, and mixed Spot/on-demand configurations — all essential for resilient Spot deployments.
- Implement multi-region Spot queues: For batch processing with Spot, implement Azure Service Bus or Azure Storage Queue-based job distribution across multiple Azure regions. When a Spot VM in East US is evicted, a VM in West US picks up the next job from the queue. This pattern achieves near-continuous batch processing despite individual eviction events.
- Monitor Spot eviction rates: Use Azure Monitor to track eviction frequency by VM type and region. If eviction rates for a VM family exceed 20-25%, the overhead of job restarts and infrastructure management may be eroding your Spot savings. Shift to a different VM family or region with lower eviction pressure.
- Apply Azure Hybrid Benefit: If you have eligible Windows Server licenses with Software Assurance, apply Azure Hybrid Benefit to your Spot VMs. AHUB waives the Windows license component of the VM cost, providing an additional 20-40% discount on top of the already-discounted Spot price for Windows workloads.
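For the eviction-rate monitoring practice in the list above, a simple threshold check over exported eviction counts might look like this. The input shape (launch and eviction counts keyed by VM family and region) and the sample numbers are hypothetical, and defining the rate as evictions divided by launches is an assumption; the 20% threshold is the lower bound quoted in the list.

```python
def high_eviction_pools(stats, threshold=0.20):
    """Return (vm_family, region, rate) for pools above the threshold.

    stats maps (vm_family, region) -> (launched, evicted) counts for a period.
    """
    flagged = []
    for (vm_family, region), (launched, evicted) in stats.items():
        if launched == 0:
            continue                      # no data for this pool
        rate = evicted / launched
        if rate > threshold:
            flagged.append((vm_family, region, round(rate, 3)))
    # Worst pools first, so they are reviewed for relocation first.
    return sorted(flagged, key=lambda row: row[2], reverse=True)

sample = {
    ("D-series", "eastus"): (200, 52),     # hypothetical counts
    ("D-series", "westus3"): (180, 18),
    ("N-series", "westeurope"): (40, 9),
}
for row in high_eviction_pools(sample):
    print(row)
```

Pools that show up in the output are candidates for moving to a different VM family or region.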