Free tool
LLM Token Cost Calculator
Compare what GPT-5, Claude, Gemini, and DeepSeek will actually cost you at production volume. Pick a workload, set your traffic, see the monthly bill across 12 frontier models.
Start from a workload
~750 tokens per 500 words. RAG context, system prompts, and history all count.
Output costs 4x to 8x more than input on most of the models below. Watch this number.
Total LLM calls per day across users, agents, and retries.
Cached input bills at roughly 10% of list price on OpenAI, Anthropic, and Google. Set to 0% for a pessimistic estimate.
Providers to compare
DeepSeek V3 · Best value
Open-weight; lowest cost per token of any frontier-class chat model
$0.14 in / $0.28 out per 1M · 128K context
GPT-5 mini
Cheap, fast workhorse for high-volume routing and extraction
$0.25 in / $2.00 out per 1M · 400K context
Gemini 2.5 Flash
Cheapest big-context model on the market
$0.30 in / $2.50 out per 1M · 1M context
DeepSeek R1
Open-weight reasoning model; rivals o3 at a fraction of the cost
$0.55 in / $2.19 out per 1M · 128K context
Claude Haiku 4.5
Fast, cheap; good for classification and routing
$1.00 in / $5.00 out per 1M · 200K context
GPT-5
Frontier reasoning, highest quality on hard tasks
$1.25 in / $10.00 out per 1M · 400K context
Gemini 2.5 Pro
1M context; price doubles above 200K input tokens
$1.25 in / $10.00 out per 1M · 1M context
o3
Deep reasoning, math, code; cost scales with thinking tokens
$2.00 in / $8.00 out per 1M · 200K context
GPT-4.1
1M-token context for long documents
$2.00 in / $8.00 out per 1M · 1M context
GPT-4o
Multimodal, mature tooling
$2.50 in / $10.00 out per 1M · 128K context
Claude Sonnet 4.6
Balanced quality/cost; default for production agents
$3.00 in / $15.00 out per 1M · 200K context
Claude Opus 4.7
Best for agentic and code workflows; premium pricing
$5.00 in / $25.00 out per 1M · 200K context
Prices reflect public list pricing as of May 2026. Batch APIs cut costs by roughly 50% on OpenAI and Anthropic. Vertex AI and Bedrock use the same base prices but bill differently. Always confirm against the provider's pricing page before committing to a contract.
How to use this calculator
- Pick the workload closest to yours, or type your own input/output token counts. If you don't know the numbers, log a real request and count: most providers return token usage on every response (see the sketch after this list).
- Set realistic daily volume. Production traffic spikes 3x to 5x at peak. Plan for the peak, not the average.
- Set a cache hit rate. Static system prompts, retrieved documents, and code context are cacheable. A 60-70% hit rate is realistic for well-built RAG and agent systems.
- Compare the spread. If the cheapest and most expensive models differ by 10x at your volume, the model decision matters more than the prompt engineering.
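The usage block on the API response is the ground truth for step one. A minimal sketch with the OpenAI Python SDK; the model name is illustrative, and Anthropic's SDK exposes the same counts under `response.usage.input_tokens` / `output_tokens`:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # substitute whatever model you are evaluating
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
)

# Every response carries exact token counts -- no estimation needed.
usage = response.usage
print(f"input tokens:  {usage.prompt_tokens}")
print(f"output tokens: {usage.completion_tokens}")
print(f"total tokens:  {usage.total_tokens}")
```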
Why output tokens dominate the bill
Nearly every model on this list charges 4x to 8x more for output than input; DeepSeek V3, at 2x, is the outlier. A request that reads 5K tokens and writes 500 looks input-heavy, but at an 8x multiplier the two sides of the bill are nearly equal, and once prompt caching discounts the input side, output dominates. Cutting your average response from 800 tokens to 300 cuts the output side of the bill by more than 60% without touching the model. Add a hard max_tokens. Ask for JSON over prose. Stop generations the moment the answer is complete.
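Two of those levers are one-line request changes. A minimal sketch with the OpenAI Python SDK, assuming your model supports JSON mode; the model name, cap, and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use the model you actually bill against
    messages=[
        {"role": "system",
         "content": 'Answer as a JSON object: {"answer": "..."}. Be terse.'},
        {"role": "user", "content": "Which plan includes SSO?"},
    ],
    max_tokens=300,                           # hard ceiling on billed output
    response_format={"type": "json_object"},  # dense JSON instead of prose
)
print(response.choices[0].message.content)
```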
Where the calculator is conservative
We use list prices. Your real bill can be lower in three places:
- Prompt caching. Cached input bills at 10% of list price on OpenAI, Anthropic, and Google; the slider above models this. Anthropic also charges a one-time cache write fee of 1.25x the input rate, which the calculator ignores for simplicity (the sketch after this list folds it back in).
- Batch APIs. OpenAI Batch and Anthropic Message Batches halve token costs in exchange for up to 24-hour latency. Good for evals, embeddings, document backfills.
- Volume commits. Enterprise contracts on Azure OpenAI, Vertex, and Bedrock can cut 20-40% off list. None of those discounts apply to spot API usage.
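If you want the cache write fee in your own model rather than ignoring it, the blend is one line. A sketch under stated assumptions: reads bill at 10% of list, writes at 1.25x, and, as a worst case, every uncached token is also written to the cache.

```python
def effective_input_price(list_price: float, cache_hit_rate: float,
                          read_mult: float = 0.10,
                          write_mult: float = 1.25) -> float:
    """Blended $/1M input tokens with prompt caching.

    Hits bill at read_mult * list; misses at write_mult * list
    (Anthropic-style write fee -- pass write_mult=1.0 for providers
    that charge no cache-write premium). Treating every miss as a
    fresh cache write slightly overstates the fee.
    """
    return (cache_hit_rate * read_mult
            + (1 - cache_hit_rate) * write_mult) * list_price

# Sonnet-class input at $3.00/M with a 70% hit rate:
# 0.70 * 0.10 + 0.30 * 1.25 = 0.445 -> ~$1.34 effective vs $3.00 list.
print(effective_input_price(3.00, 0.70))
```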
Where the calculator can underestimate
- Reasoning models. o3 and DeepSeek R1 generate hidden "thinking" tokens that are billed as output. A 500-token visible answer can carry 5K tokens of internal reasoning. For these models, set output tokens 5x to 10x your visible target.
- Long context surcharges. Gemini 2.5 Pro doubles its rate above 200K input tokens. If you regularly send 500K-token prompts, model that explicitly; the sketch after this list shows one way to fold both effects in.
- Tool-calling loops. Agents make multiple LLM calls per user turn. A "simple" agentic task often runs 4-8 model calls. Multiply your requests-per-day accordingly.
- Image, audio, and video tokens. This calculator covers text only. A 1080p image is roughly 1,200 tokens on most providers; a minute of audio is roughly 1,500.
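To fold the first two effects into your own estimate, a sketch under stated assumptions: the 200K threshold and 2x surcharge mirror the Gemini 2.5 Pro note above, and the reasoning multiplier is the 5x-10x rule of thumb, not a measured number. This sketch applies the surcharge marginally; some providers reprice the whole prompt once it crosses the threshold, which is worse.

```python
def input_cost(tokens: int, price_per_m: float,
               threshold: int = 200_000, surcharge: float = 2.0) -> float:
    """Input dollars, with the rate doubled on tokens past the threshold."""
    base = min(tokens, threshold)
    excess = max(tokens - threshold, 0)
    return (base + excess * surcharge) * price_per_m / 1_000_000

def output_cost(visible_tokens: int, price_per_m: float,
                reasoning_mult: float = 7.0) -> float:
    """Output dollars including hidden reasoning tokens billed as output."""
    return visible_tokens * reasoning_mult * price_per_m / 1_000_000

# A 500K-token prompt to a Gemini-Pro-class model at $1.25/M input:
print(input_cost(500_000, 1.25))   # $1.00 vs $0.625 without the surcharge
# A 500-token visible answer from a reasoning model at $8.00/M output:
print(output_cost(500, 8.00))      # $0.028 vs $0.004 for the visible tokens
```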
FAQ
How do I count tokens before I have a working app?
Use OpenAI's tiktoken library or a free token counter. As a fast estimate: 1 token is about 4 characters, or 750 tokens per 500 English words. Code, JSON, and non-English text run roughly 1.5x to 2x the token count of equivalent English prose.
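With tiktoken installed (`pip install tiktoken`), counting is two lines. A minimal sketch; the encoding name varies by model family, and the file path is a placeholder:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer for gpt-4o-era models

text = open("sample_prompt.txt").read()    # placeholder: any real prompt
tokens = enc.encode(text)
print(f"{len(tokens)} tokens for {len(text)} characters "
      f"(~{len(text) / max(len(tokens), 1):.1f} chars/token)")
```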
Should I always pick the cheapest model?
No. The cheapest model is the right answer for high-volume, low-stakes traffic — classification, routing, simple extraction. Use a frontier model for the calls that make or break the user experience: a wrong answer to the first user question costs more than a year of token savings. The pattern that wins in production is a router: route easy traffic to Haiku or Flash, escalate hard traffic to Sonnet, Opus, or GPT-5.
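A minimal version of that router, hedged heavily: the difficulty heuristic here is a placeholder, model names are illustrative, and real routers usually escalate on a cheap classifier call or a confidence score instead.

```python
CHEAP_MODEL = "claude-haiku"      # illustrative; use real model IDs
FRONTIER_MODEL = "claude-sonnet"

def pick_model(task_type: str, input_tokens: int) -> str:
    """Route easy, high-volume traffic to the cheap tier.

    Placeholder heuristic: production routers typically use a small
    classifier or the cheap model's own confidence to decide escalation.
    """
    easy = task_type in {"classification", "routing", "extraction"}
    short = input_tokens < 2_000
    return CHEAP_MODEL if (easy and short) else FRONTIER_MODEL

print(pick_model("classification", 800))   # claude-haiku
print(pick_model("support_reply", 6_000))  # claude-sonnet
```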
Why do my real bills run higher than this calculator says?
Three reasons, in order: retries (failed tool calls and JSON-parse errors trigger reruns), runaway agents (tool loops that don't terminate), and dev/eval traffic on production keys. Set per-key budgets in your provider dashboard. Tag every call with the workflow that made it. Alert on cost-per-conversation, not just total spend.
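Tagging can be as simple as accumulating spend per workflow as responses come back. A sketch, assuming you already read per-call token counts from the usage block; prices and names are illustrative:

```python
from collections import defaultdict

PRICES = {"gpt-5-mini": (0.25, 2.00)}  # $/1M (in, out); illustrative

spend = defaultdict(float)

def record(workflow: str, model: str, in_tok: int, out_tok: int) -> None:
    """Accumulate dollar spend keyed by the workflow that made the call."""
    p_in, p_out = PRICES[model]
    spend[workflow] += (in_tok * p_in + out_tok * p_out) / 1_000_000

record("support-bot", "gpt-5-mini", 4_200, 350)
record("support-bot", "gpt-5-mini", 4_200, 350)
# Alert on cost per conversation, not just the total:
print(f"support-bot: ${spend['support-bot']:.4f}")
```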
How fresh are the prices?
Captured May 2026 from the official pricing pages of OpenAI, Anthropic, Google Gemini, and DeepSeek. We refresh whenever pricing moves materially. For procurement decisions, verify the live pricing page on the day you sign.
Does this work for self-hosted models?
Not directly. Self-hosting Llama 3, Mixtral, or DeepSeek means you pay for GPU hours, not tokens. As a rule of thumb, a single H100 running a 70B model serves roughly 200K-500K output tokens per hour at $2-$3 per GPU-hour, which works out to $0.005-$0.015 per 1K output tokens — competitive with DeepSeek V3, but only if you keep the GPU saturated. For bursty traffic, hosted APIs win.
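The break-even arithmetic is worth running with your own numbers. A sketch using the rule-of-thumb figures above; throughput and GPU price are assumptions, so measure your own:

```python
def self_hosted_price_per_1k(gpu_dollars_per_hour: float,
                             output_tokens_per_hour: float) -> float:
    """Effective $ per 1K output tokens for a GPU at the given throughput."""
    return gpu_dollars_per_hour / output_tokens_per_hour * 1_000

# H100 at $2.50/hr serving ~350K output tokens/hr when saturated:
print(self_hosted_price_per_1k(2.50, 350_000))         # ~$0.0071 per 1K
# At 20% utilization the same GPU costs 5x as much per token:
print(self_hosted_price_per_1k(2.50, 350_000 * 0.20))  # ~$0.0357 per 1K
```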
Methodology
Cost per request = (input tokens ÷ 1,000,000 × input price per million × cache multiplier) + (output tokens ÷ 1,000,000 × output price per million). The cache multiplier blends list price for the uncached portion with 10% of list price for the cached portion. Daily cost multiplies by requests per day. Monthly is daily × 30; annual is daily × 365. Prices come from each provider's public API pricing page; volume discounts, free tiers, and batch APIs are not applied. Token counts assume the OpenAI tokenizer; other tokenizers vary by 5-15%.
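The same arithmetic as a runnable sketch; it mirrors the formula above, with no batch or volume discounts, and the example numbers are illustrative:

```python
def cost_per_request(in_tok: int, out_tok: int,
                     price_in: float, price_out: float,
                     cache_hit_rate: float = 0.0,
                     cached_mult: float = 0.10) -> float:
    """Dollars per request, blending cached and uncached input."""
    cache_multiplier = (1 - cache_hit_rate) + cache_hit_rate * cached_mult
    return (in_tok / 1e6) * price_in * cache_multiplier \
         + (out_tok / 1e6) * price_out

# GPT-5-class pricing ($1.25 in / $10 out), 5K in / 500 out, 60% cache hits:
per_request = cost_per_request(5_000, 500, 1.25, 10.00, cache_hit_rate=0.60)
daily = per_request * 50_000  # requests per day
print(f"${per_request:.5f}/request, ${daily:,.0f}/day, ${daily * 30:,.0f}/month")
```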
Let's Talk
Have a challenge that needs AI? We'd love to hear about it.