Free tool
LLM Token Cost Calculator
Compare what GPT-5, Claude, Gemini, and DeepSeek will actually cost you at production volume. Pick a workload, set your traffic, see the monthly bill across 12 frontier models.
Start from a workload
~750 tokens per 500 words. RAG context, system prompts, and history all count.
Output costs 4x to 8x more than input on most of the models below. Watch this number.
Total LLM calls per day across users, agents, and retries.
Cached input bills at roughly 10% of list price on OpenAI, Anthropic, and Google. Set to 0% for a pessimistic estimate.
Providers to compare
DeepSeek V3 · Best value
Open-weight; lowest cost per token of any frontier-class chat model
$0.14 in / $0.28 out per 1M · 128K context
GPT-5 mini
Cheap, fast workhorse for high-volume routing and extraction
$0.25 in / $2.00 out per 1M · 400K context
Gemini 2.5 Flash
Cheapest big-context model on the market
$0.30 in / $2.50 out per 1M · 1M context
DeepSeek R1
Open-weight reasoning model; rivals o3 at a fraction of the cost
$0.55 in / $2.19 out per 1M · 128K context
Claude Haiku 4.5
Fast, cheap; good for classification and routing
$1.00 in / $5.00 out per 1M · 200K context
GPT-5
Frontier reasoning, highest quality on hard tasks
$1.25 in / $10.00 out per 1M · 400K context
Gemini 2.5 Pro
1M context; price doubles above 200K input tokens
$1.25 in / $10.00 out per 1M · 1M context
o3
Deep reasoning, math, code; cost scales with thinking tokens
$2.00 in / $8.00 out per 1M · 200K context
GPT-4.1
1M-token context for long documents
$2.00 in / $8.00 out per 1M · 1M context
GPT-4o
Multimodal, mature tooling
$2.50 in / $10.00 out per 1M · 128K context
Claude Sonnet 4.6
Balanced quality/cost; default for production agents
$3.00 in / $15.00 out per 1M · 200K context
Claude Opus 4.7
Best for agentic and code workflows; premium pricing
$5.00 in / $25.00 out per 1M · 200K context
Prices reflect public list pricing as of May 2026. Batch APIs cut costs by roughly 50% on OpenAI and Anthropic. Vertex AI and Bedrock use the same base prices but bill differently. Always confirm against the provider's pricing page before committing to a contract.
How to use this calculator
- Pick the workload closest to yours, or type your own input/output token counts. If you don't know the numbers, log a real request and count: most providers return token usage on every response (see the sketch after this list).
- Set realistic daily volume. Production traffic spikes 3x to 5x at peak. Plan for the peak, not the average.
- Set a cache hit rate. Static system prompts, retrieved documents, and code context are cacheable. A 60-70% hit rate is realistic for well-built RAG and agent systems.
- Compare the spread. If the cheapest and most expensive models differ by 10x at your volume, the model decision matters more than the prompt engineering.
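The usage block on the API response is the ground truth for step one. A minimal sketch with the OpenAI Python SDK; the model name is illustrative, and Anthropic's SDK exposes the same counts under `response.usage.input_tokens` / `output_tokens`:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # substitute whatever model you are evaluating
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
)

# Every response carries exact token counts -- no estimation needed.
usage = response.usage
print(f"input tokens:  {usage.prompt_tokens}")
print(f"output tokens: {usage.completion_tokens}")
print(f"total tokens:  {usage.total_tokens}")
```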
Why output tokens dominate the bill
Nearly every model on this list charges 4x to 8x more for output than input; DeepSeek V3, at 2x, is the outlier. A request that reads 5K tokens and writes 500 looks input-heavy, but at an 8x multiplier the two sides of the bill are nearly equal, and once prompt caching discounts the input side, output dominates. Cutting your average response from 800 tokens to 300 cuts the output side of the bill by more than 60% without touching the model. Add a hard max_tokens. Ask for JSON over prose. Stop generations the moment the answer is complete.
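Two of those levers are one-line request changes. A minimal sketch with the OpenAI Python SDK, assuming your model supports JSON mode; the model name, cap, and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use the model you actually bill against
    messages=[
        {"role": "system",
         "content": 'Answer as a JSON object: {"answer": "..."}. Be terse.'},
        {"role": "user", "content": "Which plan includes SSO?"},
    ],
    max_tokens=300,                           # hard ceiling on billed output
    response_format={"type": "json_object"},  # dense JSON instead of prose
)
print(response.choices[0].message.content)
```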
Where the calculator is conservative
We use list prices. Your real bill can be lower in three places:
- Prompt caching. Cached input bills at 10% of list price on OpenAI, Anthropic, and Google; the slider above models this. Anthropic also charges a one-time cache write fee of 1.25x the input rate, which the calculator ignores for simplicity (the sketch after this list folds it back in).
- Batch APIs. OpenAI Batch and Anthropic Message Batches halve token costs in exchange for up to 24-hour latency. Good for evals, embeddings, document backfills.
- Volume commits. Enterprise contracts on Azure OpenAI, Vertex, and Bedrock can cut 20-40% off list. None of those discounts apply to spot API usage.
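If you want the cache write fee in your own model rather than ignoring it, the blend is one line. A sketch under stated assumptions: reads bill at 10% of list, writes at 1.25x, and, as a worst case, every uncached token is also written to the cache.

```python
def effective_input_price(list_price: float, cache_hit_rate: float,
                          read_mult: float = 0.10,
                          write_mult: float = 1.25) -> float:
    """Blended $/1M input tokens with prompt caching.

    Hits bill at read_mult * list; misses at write_mult * list
    (Anthropic-style write fee -- pass write_mult=1.0 for providers
    that charge no cache-write premium). Treating every miss as a
    fresh cache write slightly overstates the fee.
    """
    return (cache_hit_rate * read_mult
            + (1 - cache_hit_rate) * write_mult) * list_price

# Sonnet-class input at $3.00/M with a 70% hit rate:
# 0.70 * 0.10 + 0.30 * 1.25 = 0.445 -> ~$1.34 effective vs $3.00 list.
print(effective_input_price(3.00, 0.70))
```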
Where the calculator can underestimate
- Reasoning models. o3 and DeepSeek R1 generate hidden "thinking" tokens that are billed as output. A 500-token visible answer can carry 5K tokens of internal reasoning. For these models, set output tokens 5x to 10x your visible target.
- Long context surcharges. Gemini 2.5 Pro doubles its rate above 200K input tokens. If you regularly send 500K-token prompts, model that explicitly; the sketch after this list shows one way to fold both effects in.
- Tool-calling loops. Agents make multiple LLM calls per user turn. A "simple" agentic task often runs 4-8 model calls. Multiply your requests-per-day accordingly.
- Image, audio, and video tokens. This calculator covers text only. A 1080p image is roughly 1,200 tokens on most providers; a minute of audio is roughly 1,500.
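To fold the first two effects into your own estimate, a sketch under stated assumptions: the 200K threshold and 2x surcharge mirror the Gemini 2.5 Pro note above, and the reasoning multiplier is the 5x-10x rule of thumb, not a measured number. This sketch applies the surcharge marginally; some providers reprice the whole prompt once it crosses the threshold, which is worse.

```python
def input_cost(tokens: int, price_per_m: float,
               threshold: int = 200_000, surcharge: float = 2.0) -> float:
    """Input dollars, with the rate doubled on tokens past the threshold."""
    base = min(tokens, threshold)
    excess = max(tokens - threshold, 0)
    return (base + excess * surcharge) * price_per_m / 1_000_000

def output_cost(visible_tokens: int, price_per_m: float,
                reasoning_mult: float = 7.0) -> float:
    """Output dollars including hidden reasoning tokens billed as output."""
    return visible_tokens * reasoning_mult * price_per_m / 1_000_000

# A 500K-token prompt to a Gemini-Pro-class model at $1.25/M input:
print(input_cost(500_000, 1.25))   # $1.00 vs $0.625 without the surcharge
# A 500-token visible answer from a reasoning model at $8.00/M output:
print(output_cost(500, 8.00))      # $0.028 vs $0.004 for the visible tokens
```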
FAQ
How do I count tokens before I have a working app?
Use OpenAI's tiktoken library or a free token counter. As a fast estimate: 1 token is about 4 characters, or 750 tokens per 500 English words. Code, JSON, and non-English text run roughly 1.5x to 2x the token count of equivalent English prose.
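With tiktoken installed (`pip install tiktoken`), counting is two lines. A minimal sketch; the encoding name varies by model family, and the file path is a placeholder:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer for gpt-4o-era models

text = open("sample_prompt.txt").read()    # placeholder: any real prompt
tokens = enc.encode(text)
print(f"{len(tokens)} tokens for {len(text)} characters "
      f"(~{len(text) / max(len(tokens), 1):.1f} chars/token)")
```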
Should I always pick the cheapest model?
No. The cheapest model is the right answer for high-volume, low-stakes traffic — classification, routing, simple extraction. Use a frontier model for the calls that make or break the user experience: a wrong answer to the first user question costs more than a year of token savings. The pattern that wins in production is a router: route easy traffic to Haiku or Flash, escalate hard traffic to Sonnet, Opus, or GPT-5.
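A minimal version of that router, hedged heavily: the difficulty heuristic here is a placeholder, model names are illustrative, and real routers usually escalate on a cheap classifier call or a confidence score instead.

```python
CHEAP_MODEL = "claude-haiku"      # illustrative; use real model IDs
FRONTIER_MODEL = "claude-sonnet"

def pick_model(task_type: str, input_tokens: int) -> str:
    """Route easy, high-volume traffic to the cheap tier.

    Placeholder heuristic: production routers typically use a small
    classifier or the cheap model's own confidence to decide escalation.
    """
    easy = task_type in {"classification", "routing", "extraction"}
    short = input_tokens < 2_000
    return CHEAP_MODEL if (easy and short) else FRONTIER_MODEL

print(pick_model("classification", 800))   # claude-haiku
print(pick_model("support_reply", 6_000))  # claude-sonnet
```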
Why do my real bills run higher than this calculator says?
Three reasons, in order: retries (failed tool calls and JSON-parse errors trigger reruns), runaway agents (tool loops that don't terminate), and dev/eval traffic on production keys. Set per-key budgets in your provider dashboard. Tag every call with the workflow that made it. Alert on cost-per-conversation, not just total spend.
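Tagging can be as simple as accumulating spend per workflow as responses come back. A sketch, assuming you already read per-call token counts from the usage block; prices and names are illustrative:

```python
from collections import defaultdict

PRICES = {"gpt-5-mini": (0.25, 2.00)}  # $/1M (in, out); illustrative

spend = defaultdict(float)

def record(workflow: str, model: str, in_tok: int, out_tok: int) -> None:
    """Accumulate dollar spend keyed by the workflow that made the call."""
    p_in, p_out = PRICES[model]
    spend[workflow] += (in_tok * p_in + out_tok * p_out) / 1_000_000

record("support-bot", "gpt-5-mini", 4_200, 350)
record("support-bot", "gpt-5-mini", 4_200, 350)
# Alert on cost per conversation, not just the total:
print(f"support-bot: ${spend['support-bot']:.4f}")
```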
How fresh are the prices?
Captured May 2026 from the official pricing pages of OpenAI, Anthropic, Google Gemini, and DeepSeek. We refresh whenever pricing moves materially. For procurement decisions, verify the live pricing page on the day you sign.
Does this work for self-hosted models?
Not directly. Self-hosting Llama 3, Mixtral, or DeepSeek means you pay for GPU hours, not tokens. As a rule of thumb, a single H100 running a 70B model serves roughly 200K-500K output tokens per hour at $2-$3 per GPU-hour, which works out to $0.005-$0.015 per 1K output tokens — competitive with DeepSeek V3, but only if you keep the GPU saturated. For bursty traffic, hosted APIs win.
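The break-even arithmetic is worth running with your own numbers. A sketch using the rule-of-thumb figures above; throughput and GPU price are assumptions, so measure your own:

```python
def self_hosted_price_per_1k(gpu_dollars_per_hour: float,
                             output_tokens_per_hour: float) -> float:
    """Effective $ per 1K output tokens for a GPU at the given throughput."""
    return gpu_dollars_per_hour / output_tokens_per_hour * 1_000

# H100 at $2.50/hr serving ~350K output tokens/hr when saturated:
print(self_hosted_price_per_1k(2.50, 350_000))         # ~$0.0071 per 1K
# At 20% utilization the same GPU costs 5x as much per token:
print(self_hosted_price_per_1k(2.50, 350_000 * 0.20))  # ~$0.0357 per 1K
```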
Methodology
Cost per request = (input tokens ÷ 1,000,000 × input price per million × cache multiplier) + (output tokens ÷ 1,000,000 × output price per million). The cache multiplier blends list price for the uncached portion with 10% of list price for the cached portion. Daily cost multiplies by requests per day. Monthly is daily × 30; annual is daily × 365. Prices come from each provider's public API pricing page; volume discounts, free tiers, and batch APIs are not applied. Token counts assume the OpenAI tokenizer; other tokenizers vary by 5-15%.
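The same arithmetic as a runnable sketch; it mirrors the formula above, with no batch or volume discounts, and the example numbers are illustrative:

```python
def cost_per_request(in_tok: int, out_tok: int,
                     price_in: float, price_out: float,
                     cache_hit_rate: float = 0.0,
                     cached_mult: float = 0.10) -> float:
    """Dollars per request, blending cached and uncached input."""
    cache_multiplier = (1 - cache_hit_rate) + cache_hit_rate * cached_mult
    return (in_tok / 1e6) * price_in * cache_multiplier \
         + (out_tok / 1e6) * price_out

# GPT-5-class pricing ($1.25 in / $10 out), 5K in / 500 out, 60% cache hits:
per_request = cost_per_request(5_000, 500, 1.25, 10.00, cache_hit_rate=0.60)
daily = per_request * 50_000  # requests per day
print(f"${per_request:.5f}/request, ${daily:,.0f}/day, ${daily * 30:,.0f}/month")
```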
Let's Talk
Have a challenge that needs AI? We'd love to hear about it.