AI API bills scale with tokens, and tokens add up fast. The good news: most projects waste a large share of their tokens on filler, repetition, and over-long replies. Here are ten practical ways to cut your token usage — and your bill — without hurting quality.
Measure your savings: use the token counter on our homepage to compare a prompt before and after trimming. Open it →Start with a before-and-after token audit
Before changing prompts, measure one realistic request exactly as your app sends it. Then remove filler, shorten examples, cap the output, and count again. This creates a simple baseline for token optimization, prompt cost reduction, and LLM API budget planning.
| Prompt version | Input tokens | Output target | What changed |
|---|---|---|---|
| Original | 2,400 | 1,000 tokens | Long system prompt, repeated rules, no answer limit. |
| Trimmed | 1,450 | 400 tokens | Removed duplicate examples and added a concise output format. |
| Cached | 1,450 total, much of it reusable | 400 tokens | Moved stable instructions into cacheable context where supported. |
1. Shorten your outputs
Output tokens cost two to five times more than input tokens, so this is the highest-leverage fix. Tell the model exactly how long to be: "answer in 3 bullet points" or "reply in under 50 words." You will often halve the priciest part of your bill.
2. Trim the system prompt
A bloated system prompt is sent with every request, so every wasted word is paid for repeatedly. Cut redundant instructions, examples you do not need, and polite filler. Tight instructions usually work better anyway.
3. Use prompt caching
Most providers now offer caching for repeated context (like a long system prompt or reference document). Cached input tokens are billed at a steep discount. If you reuse the same context across calls, caching can dramatically lower input costs.
4. Pick the cheapest capable model
Do not pay flagship prices for simple tasks. Use a small, fast model (like Claude Haiku or Gemini Flash-Lite) for classification, extraction, and routing, and reserve premium models for genuinely hard reasoning. See the price gaps in our cost calculator guide.
5. Summarize conversation history
In chat apps, resending the full history every turn is expensive and eats the context window. Replace old turns with a short running summary so the model keeps the gist without paying for every word again.
6. Use retrieval instead of dumping documents
Rather than pasting an entire manual into the prompt, store it in a vector database and retrieve only the few relevant chunks per query. This can cut input tokens by 90% on document-heavy workloads.
7. Strip unnecessary formatting and whitespace
Markdown tables, repeated headers, and heavy indentation all consume tokens. Send clean, minimal text. For code, remove comments and dead code the model does not need to see.
8. Batch related requests
If you ask the same model five tiny questions, you pay the system-prompt overhead five times. Combine them into one well-structured request when it makes sense, and many providers offer discounted batch endpoints for non-urgent jobs.
9. Set max output tokens
Always set a sensible max_tokens limit. It is a hard ceiling that prevents a runaway response from quietly costing you ten times what you expected.
10. Measure before and after
You cannot optimize what you do not measure. Before shipping a prompt, count its tokens; after trimming, count again and confirm the saving. Multiply the per-call difference by your request volume to see the real monthly impact. Start with how many tokens is my text.
Which optimization should you do first?
Use this priority order when you need quick savings:
| Priority | Optimization | Best for |
|---|---|---|
| 1 | Limit output length | Apps with verbose answers or high output pricing. |
| 2 | Trim repeated system prompts | Chatbots and agents that send the same instructions every call. |
| 3 | Retrieve only relevant context | Document QA, support bots, and knowledge-base apps. |
| 4 | Use a cheaper capable model | Classification, routing, extraction, and simple rewriting. |
| 5 | Apply caching or batch pricing | High-volume workloads with repeated context or non-urgent jobs. |
After each change, use the token counter and cost calculator to compare the before and after totals. Then confirm final pricing with the official provider pages: OpenAI, Anthropic, and Google Gemini.
Optimize now: paste your prompt into TokenCounter.cc, trim it, and watch the token count and cost drop in real time. Open the tool →Frequently asked questions
What uses the most tokens in an API call?
Usually the output (the model's reply) because it is priced highest, followed by large system prompts and pasted documents.
Does shorter text always cost less?
Generally yes — fewer tokens means lower cost — but quality matters, so trim filler rather than essential context.
How much can prompt caching save?
It varies by provider, but cached input tokens are billed at a large discount, which adds up fast when you reuse the same context.
What is the single biggest cost lever?
Reducing output length, because output tokens are the most expensive part of nearly every model's pricing.