At the end of 2022, using GPT-4-level AI cost $20 per million tokens. Now it's $0.40. A 50x collapse in 2 years. This isn't a simple discount — it's a structural shift that's changing how startups use AI entirely.

3-Second Summary
- LLM inference costs dropping 10x/year
- DeepSeek triggers price war
- Monthly API costs: $50K → $5K
- Startup entry barriers vanish
- AI-native businesses explode

What Is This?

a16z's Guido Appenzeller coined a name for this phenomenon — "LLMflation". At equivalent performance levels, LLM inference costs are dropping 10x every year. When GPT-3 launched in November 2021, it was $60 per million tokens. Now you can get the same performance level from Llama 3.2 3B for $0.06. A 1,000x drop in 3 years.

Epoch AI's analysis is even more dramatic. Price decline speeds vary by benchmark, with a median of 50x per year. Looking at data from January 2024 onward, prices are falling at 200x per year. The cost of achieving GPT-4-level performance on PhD-level science problems (GPQA) is dropping 40x annually.
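These headline figures are internally consistent: a 1,000x drop over 3 years works out to roughly 10x per year. A quick sanity check of the arithmetic (prices are the GPT-3 and Llama 3.2 3B figures quoted above):

```python
# Annualized price decline implied by GPT-3 ($60/1M tokens at launch)
# vs Llama 3.2 3B ($0.06/1M tokens) at the same performance level.
start_price, end_price, years = 60.00, 0.06, 3

total_drop = start_price / end_price       # total decline over the period
annual = round(total_drop ** (1 / years))  # annualized decline factor

print(f"{total_drop:,.0f}x total, ~{annual}x per year")
```

The same compounding explains why the gap between "10x/year" and Epoch AI's "50x/year" matters so much: over three years, those rates produce a 1,000x vs a 125,000x difference.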

1,000x: Same-performance cost decline over 3 years
50x/yr: Median LLM inference price decline rate
90–95%: DeepSeek vs OpenAI price gap

Why so fast? Six factors are working simultaneously. GPU performance improvements, model quantization (16-bit→4-bit), software optimization, smaller and more efficient models, instruction tuning advances, and pricing pressure from open-source models. It's much faster than semiconductors during the Moore's Law era.

The decisive trigger was DeepSeek. When DeepSeek R1 appeared in January 2025, the industry was turned upside down. Costs were 90–95% lower than OpenAI's and Anthropic's while performance was comparable, and Nvidia's stock recorded the largest single-day market-value loss in stock market history. The key detail: DeepSeek achieved this on export-compliant H800 chips, since the latest H100s couldn't be obtained under US export controls.

What Makes It Different?

The numbers make it clear. In August 2025, when OpenAI launched GPT-5, they priced it lower than GPT-4o. TechCrunch reported this as "the start of a price war." Google dropped Gemini Flash-Lite to $0.10 per million tokens, and Anthropic responded with batch processing options.

| | Early 2023 (GPT-4 Era) | March 2026 (Now) |
|---|---|---|
| Premium model cost | $30–60/1M output tokens | $8–25/1M output tokens (60–80% down) |
| Lightweight model cost | $1–2/1M tokens | $0.04–0.10/1M tokens |
| Startup monthly API budget | $50,000 | $3,000–5,000 (same workload) |
| Prompt caching | None | Up to 90% input cost savings |
| Off-peak discounts | None | Up to 75% additional discount (DeepSeek) |

Even among frontier models, the price competition is fierce. Here's a comparison of current major model pricing:

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Key Feature |
|---|---|---|---|
| DeepSeek V3 | $0.28 | $1.10 | Best value, 75% off-peak discount |
| Gemini 2.5 Flash | $0.30 | $2.50 | Google infrastructure, fast speed |
| GPT-5 (base) | $1.25 | $10.00 | Cheaper than GPT-4o with better performance |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Coding & analysis specialist |
| Claude Opus 4.6 | $5.00 | $25.00 | Peak performance premium |

The price gap between the cheapest model (DeepSeek V3) and the most expensive (Claude Opus) is over 20x. Include ultra-lightweight models like Mistral Nemo and the gap between lowest and highest exceeds 1,000x. In the past, "good AI = expensive AI." Now, depending on the use case, $0.04 is plenty.
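To make that spread concrete, here's a small sketch that prices one hypothetical monthly workload (10M input tokens, 2M output tokens) at the list rates from the table above. The workload size is a made-up example, and real bills shift with caching, batching, and discounts:

```python
# List prices from the comparison table: (input, output) in $/1M tokens.
PRICES = {
    "deepseek-v3":      (0.28, 1.10),
    "gemini-2.5-flash": (0.30, 2.50),
    "gpt-5":            (1.25, 10.00),
    "claude-opus-4.6":  (5.00, 25.00),
}

def monthly_cost(model, input_m=10, output_m=2):
    """Dollar cost for input_m / output_m million tokens at list price."""
    inp, out = PRICES[model]
    return input_m * inp + output_m * out

for model in PRICES:
    print(f"{model:>18}: ${monthly_cost(model):,.2f}")
```

The same workload runs about $5 on DeepSeek V3 and $100 on Claude Opus, which is the 20x gap the table implies.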

Déjà vu from the AWS cloud revolution

In the 2010s, AWS kept lowering cloud costs, enabling an explosion of startups that could never have afforded their own infrastructure. The AI API price war is playing exactly the same role right now: developers in Lagos, São Paulo, Jakarta, and Bangalore can now access frontier AI.

The Essentials: How to Optimize AI API Costs

  1. Route models by workload
    You don't need GPT-5 for everything. Route simple classification to lightweight models ($0.04/M), summarization to mid-tier ($0.30/M), and only complex reasoning to premium ($3–15/M).
  2. Use prompt caching
    Anthropic offers up to 90% cost savings on cached inputs. If you have repetitive system prompts, apply this immediately.
  3. Implement batch processing
    For tasks that don't need real-time responses (report generation, data classification, etc.), batch APIs can get you a 50% discount.
  4. Consider API aggregators
    Multi-provider platforms like OpenRouter and LemonData let you switch between 400+ models with a single API key. Markup is 0–10%.
  5. Consider open-source self-hosting
    DeepSeek V3 and Llama 3.3 70B deliver 90–95% of GPT-4 performance. If you have high traffic, self-hosting can save 90%+.
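The routing idea in step 1 can be sketched as a simple dispatch table. The model names here are placeholders and the prices are the per-1M-token figures from the tips above; a production router would classify tasks with something smarter than a dictionary lookup:

```python
# Tiered routing: cheapest adequate model class per task type.
# Model names are illustrative placeholders; prices are $/1M tokens.
TIERS = {
    "classification": ("light-3b",  0.04),   # simple label/extract tasks
    "summarization":  ("mid-flash", 0.30),   # routine text transforms
    "reasoning":      ("frontier",  15.00),  # complex multi-step work
}

def route(task_type: str) -> tuple[str, float]:
    """Return (model, price) for a task, defaulting to the premium tier."""
    return TIERS.get(task_type, TIERS["reasoning"])

model, price = route("classification")
print(model, price)
```

Defaulting unknown task types to the premium tier trades cost for safety: a misrouted hard task on a cheap model fails silently, while a misrouted easy task on a frontier model merely overspends.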
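Steps 2 and 3 compound. Here's a back-of-envelope estimator, assuming the 90% discount on cached input and the 50% batch discount quoted in the tips above; the 80% cache-hit share is an invented input, and whether the two discounts stack depends on the provider:

```python
def effective_cost(input_cost, output_cost, cache_hit_share=0.8, batched=True):
    """Rough bill after prompt caching (90% off cached input)
    and batch processing (50% off the remainder)."""
    cached = input_cost * cache_hit_share * 0.10   # cache hits billed at 10%
    uncached = input_cost * (1 - cache_hit_share)  # misses at full price
    total = cached + uncached + output_cost
    return total * (0.5 if batched else 1.0)       # batch discount on the rest

print(round(effective_cost(100, 100), 2))
```

Under these assumptions, a $200 list-price bill lands around $64, before any model downgrading from step 1.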

Cheap doesn't always mean good

DeepSeek keeps some API prices low through subsidies, a market-share strategy funded by its hedge-fund backer. Data privacy, regulatory compliance, and geopolitical risk all need consideration too. And beyond direct model costs, adding infrastructure, monitoring, and compliance can push actual totals 5–10x higher than the raw API bill.