
Building a Cost-Aware AI Triage API: Stop Burning Money on LLM Calls You Don't Need
A practical breakdown of building an AI support triage API that doesn't hemorrhage money. Route requests intelligently, cache aggressively, and only call expensive models when you actually need them.
Your AI API Is Probably Burning Cash Right Now
If you're running any kind of AI-powered support system — chatbot, ticket triage, auto-responder — there's a good chance you're sending every single request to your most expensive model endpoint. Every "how do I reset my password?" question is getting the full GPT-4-class treatment when a cached response or a simple pattern match would do.
DigitalOcean published a tutorial a couple days ago (May 19, 2026) on building a cost-aware AI support triage API, and while the tutorial itself is fairly straightforward, the pattern it demonstrates is something I've been pushing teams to adopt for months. The idea is dead simple: not every request deserves the same computational spend.
This isn't about being cheap. It's about being smart with finite resources so you can actually scale without your CFO having a panic attack every time they see the OpenAI invoice.
The Core Pattern: Tiered Request Routing
The architecture breaks down into a classification layer that sits in front of your actual AI processing. Before any request hits an LLM, you run it through a lightweight classifier that determines complexity and routes accordingly.
Here's the basic flow:
from enum import Enum
from dataclasses import dataclass
from typing import Optional


class TicketTier(Enum):
    CACHED = "cached"      # Known answer, no LLM needed
    SIMPLE = "simple"      # Small model, low token budget
    COMPLEX = "complex"    # Full model, higher token budget
    ESCALATE = "escalate"  # Human handoff, don't waste tokens


@dataclass
class TriageResult:
    tier: TicketTier
    confidence: float
    estimated_cost: float
    cached_response: Optional[str] = None


def triage_request(message: str, context: dict) -> TriageResult:
    # Check cache first: this is free
    cached = check_semantic_cache(message)
    if cached and cached.similarity > 0.95:
        return TriageResult(
            tier=TicketTier.CACHED,
            confidence=cached.similarity,
            estimated_cost=0.0,
            cached_response=cached.response
        )

    # Classify complexity with lightweight model
    complexity = classify_complexity(message, context)
    if complexity.needs_human:
        return TriageResult(tier=TicketTier.ESCALATE, confidence=complexity.score, estimated_cost=0.001)
    if complexity.score < 0.3:
        return TriageResult(tier=TicketTier.SIMPLE, confidence=complexity.score, estimated_cost=0.002)
    return TriageResult(tier=TicketTier.COMPLEX, confidence=complexity.score, estimated_cost=0.015)

The key insight: your classification step should cost 10-100x less than your actual generation step. If you're using an embedding model for semantic cache lookup and a tiny classifier for complexity scoring, you're spending fractions of a cent to potentially save 1-2 cents per request. At scale, that's the difference between a viable product and a money pit.
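To make the savings concrete, here is a rough back-of-the-envelope sketch. The per-tier costs reuse the illustrative `estimated_cost` values from the triage code above. The traffic mix, the flat `overhead` for the cache lookup and classifier, and the `blended_cost` helper are all assumptions for illustration, not measurements.

```python
# Illustrative per-request costs in USD, taken from the estimated_cost
# values in the triage sketch (not real provider pricing).
TIER_COST = {"cached": 0.0, "simple": 0.002, "complex": 0.015, "escalate": 0.001}


def blended_cost(mix: dict[str, float], overhead: float = 0.0) -> float:
    """Expected cost per request for a traffic mix (fractions summing to 1).

    `overhead` is a hypothetical flat cost for the embedding lookup and
    classifier that every request pays before routing.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic fractions must sum to 1")
    return overhead + sum(TIER_COST[tier] * share for tier, share in mix.items())


# Hypothetical mix: half the traffic is served from cache.
routed = blended_cost(
    {"cached": 0.5, "simple": 0.3, "complex": 0.15, "escalate": 0.05},
    overhead=0.0002,
)
# Baseline: every request goes straight to the expensive model.
baseline = blended_cost({"complex": 1.0})
print(f"routed: ${routed:.4f}/request, baseline: ${baseline:.4f}/request")
```

Under these assumed numbers the routed setup costs about a fifth of the baseline per request, even after paying the classification overhead on every call. Plug in your own mix and prices before drawing conclusions.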
The Semantic Cache Layer
This is where most of the savings come from. Support requests are incredibly repetitive. In my experience, 40-60% of incoming tickets are variations of the same 50-100 questions.
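import hashlib
from typing import Optional

import numpy as np
from redis import Redis
from redis.commands.search.query import Query


class SemanticCache:
    # Assumes a search index named "cache_idx" over the "cache:" key prefix,
    # with a FLOAT32 vector field "embedding" using the COSINE distance metric.
    def __init__(self, redis_client: Redis, embedding_model, threshold: float = 0.92):
        self.redis = redis_client
        self.model = embedding_model
        self.threshold = threshold  # minimum cosine similarity for a hit

    def _embed(self, text: str) -> bytes:
        # The vector bytes must match the index's declared type (FLOAT32 here)
        return np.asarray(self.model.encode(text), dtype=np.float32).tobytes()

    def lookup(self, query: str) -> Optional[dict]:
        # Search against stored embeddings; parameterized KNN needs dialect 2
        knn = (
            Query("*=>[KNN 3 @embedding $vec AS distance]")
            .sort_by("distance")
            .return_fields("query", "response", "distance")
            .dialect(2)
        )
        results = self.redis.ft("cache_idx").search(
            knn, query_params={"vec": self._embed(query)}
        )
        if not results.docs:
            return None
        best = results.docs[0]
        # With COSINE, Redis returns a distance (lower is closer), not a
        # similarity, so convert before comparing against the threshold
        similarity = 1.0 - float(best.distance)
        if similarity > self.threshold:
            return {
                "response": best.response,
                "similarity": similarity,
                "original_query": best.query,
            }
        return None

    def store(self, query: str, response: str, ttl: int = 86400):
        # Built-in hash() is salted per process, so use a stable digest for the key
        key = f"cache:{hashlib.sha256(query.encode('utf-8')).hexdigest()}"
        self.redis.hset(key, mapping={
            "query": query,
            "response": response,
            "embedding": self._embed(query),
        })
        self.redis.expire(key, ttl)

I'm using Redis with vector search here because it's fast and most teams already have Redis in their stack. You could swap in Pinecone, Qdrant, or even pgvector; the pattern is the same. The TTL is important: support answers go stale, and you don't want to serve cached responses about features that changed last week.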
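If you want to get a feel for the threshold and TTL behaviour before wiring up a vector store, a minimal in-memory sketch can help. Everything here is hypothetical: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `InMemorySemanticCache` is not part of any library.

```python
import math
import time
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    def __init__(self, threshold: float = 0.92, clock=time.monotonic):
        self.threshold = threshold
        self.clock = clock  # injectable so expiry can be exercised in tests
        self.entries = []   # list of (embedding, response, expires_at)

    def store(self, query: str, response: str, ttl: float = 86400) -> None:
        self.entries.append((toy_embed(query), response, self.clock() + ttl))

    def lookup(self, query: str) -> Optional[dict]:
        now = self.clock()
        # Drop expired entries so stale answers are never served
        self.entries = [e for e in self.entries if e[2] > now]
        query_vec = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine_similarity(query_vec, e[0]), default=None)
        if best is None:
            return None
        similarity = cosine_similarity(query_vec, best[0])
        if similarity > self.threshold:
            return {"response": best[1], "similarity": similarity}
        return None


cache = InMemorySemanticCache(threshold=0.8)
cache.store("how do I reset my password", "Use the reset link on the login page.")
hit = cache.lookup("How do I reset my password")  # same words, different casing
miss = cache.lookup("cancel my subscription")     # shares only one word
```

A word-count embedding only matches near-identical phrasing, so treat this as a way to test the control flow, not as a measure of how many paraphrases a real embedding model would catch.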
Cost Tracking: The Part Everyone Skips
Here's what separates a production system from a tutorial demo. You need per-request cost attribution, and you need it in real time.
import time
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    request_id: str
    tier: TicketTier
    input_tokens: int = 0
    output_tokens: int = 0
    model_used: str = ""
    latency_ms: float = 0.0
    cache_hit: bool = False
    estimated_cost_usd: float = 0.0

    def calculate_cost(self):
        # Pricing per 1K tokens (adjust to your provider)
        pricing = {
            "gpt-4o": {"input": 0.005, "output": 0.015},
            "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
            "cached": {"input": 0.0, "output": 0.0},
        }
        # Unknown models fall back to gpt-4o-mini rates, which can understate cost
        rates = pricing.get(self.model_used, pricing["gpt-4o-mini"])
        self.estimated_cost_usd = (
            (self.input_tokens / 1000) * rates["input"] +
            (self.output_tokens / 1000) * rates["output"]
        )

Log every single request with its tier, model used, token count, and calculated cost. Aggregate daily. Set alerts when your cost-per-request average drifts above your target. I've seen teams discover that a single poorly-written system prompt was inflating their token usage by 3x because it included unnecessary context on every call.
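The aggregate-and-alert step could look roughly like this. It is a sketch under assumptions: the `summarize_costs` and `should_alert` helpers, the record shape (plain dicts with `tier` and `cost_usd`), and the 20% drift tolerance are all hypothetical choices, not part of the original tutorial.

```python
from collections import defaultdict


def summarize_costs(records: list[dict]) -> dict:
    """Aggregate one day's request records into totals and per-tier averages.

    Each record is assumed to look like {"tier": "simple", "cost_usd": 0.002}.
    """
    total = sum(r["cost_usd"] for r in records)
    by_tier = defaultdict(list)
    for r in records:
        by_tier[r["tier"]].append(r["cost_usd"])
    return {
        "requests": len(records),
        "total_usd": total,
        "avg_usd": total / len(records) if records else 0.0,
        "avg_by_tier": {tier: sum(costs) / len(costs) for tier, costs in by_tier.items()},
    }


def should_alert(avg_usd: float, target_usd: float, tolerance: float = 0.2) -> bool:
    """True when the average cost per request exceeds the target by more than `tolerance`."""
    return avg_usd > target_usd * (1 + tolerance)


day = [
    {"tier": "cached", "cost_usd": 0.0},
    {"tier": "simple", "cost_usd": 0.002},
    {"tier": "complex", "cost_usd": 0.016},
]
summary = summarize_costs(day)
alert = should_alert(summary["avg_usd"], target_usd=0.004)
```

Keeping the per-tier averages next to the overall average makes it easier to tell whether a cost spike comes from a shift in tier mix or from one tier getting more expensive, for example through a bloated system prompt.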
The Budget Circuit Breaker
This is the pattern I think more teams need to adopt: a hard budget ceiling that degrades gracefully instead of just... spending more money.
class BudgetCircuitBreaker:
    def __init__(self, daily_budget_usd: float, redis_client: Redis):
        self.daily_budget = daily_budget_usd
        self.redis = redis_client

    def check_budget(self) -> Optional[TicketTier]:
        """Return a tier override when the budget is running low, else None."""
        today = time.strftime("%Y-%m-%d")
        spent = float(self.redis.get(f"budget:{today}") or 0)
        remaining_ratio = (self.daily_budget - spent) / self.daily_budget
        if remaining_ratio < 0.1:
            # Under 10% budget remaining: cache only, escalate everything else
            return TicketTier.ESCALATE
        elif remaining_ratio < 0.3:
            # Under 30%: downgrade complex to simple model
            return TicketTier.SIMPLE
        return None  # No override, proceed normally

Warning: Don't set your circuit breaker too aggressively. If it trips during peak hours and starts escalating everything to humans, you've just created a different kind of cost problem. Start with a generous budget and tighten based on actual usage data.
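The breaker returns an override rather than a final decision, so something still has to merge it with the triage result. One possible way to do that, following the comments in `check_budget`, is sketched below. `apply_budget_override` is a hypothetical helper, and `TicketTier` is redefined here only to keep the snippet self-contained.

```python
from enum import Enum
from typing import Optional


class TicketTier(Enum):
    CACHED = "cached"
    SIMPLE = "simple"
    COMPLEX = "complex"
    ESCALATE = "escalate"


def apply_budget_override(tier: TicketTier, override: Optional[TicketTier]) -> TicketTier:
    """Merge the triage tier with the circuit breaker's override (hypothetical helper)."""
    if override is None:
        return tier  # Budget is healthy, keep the triage decision
    if tier is TicketTier.CACHED:
        return tier  # Cache hits cost nothing, so they are always allowed
    if override is TicketTier.ESCALATE:
        return TicketTier.ESCALATE  # Nearly out of budget: no LLM calls at all
    if override is TicketTier.SIMPLE and tier is TicketTier.COMPLEX:
        return TicketTier.SIMPLE  # Low budget: downgrade to the cheap model
    return tier
```

Letting cached responses through in every budget state is a design choice that matches the "cache only" comment in the breaker: the cheapest tier keeps working even when everything else is throttled.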
What You Should Do Right Now
If you're running an AI-powered support system without cost-aware routing, here's your action plan:
Step 1: Instrument your current system. Before you optimize anything, you need to know where the money is going. Add token counting and cost calculation to every LLM call. Give yourself a week of data.
Step 2: Build the semantic cache. This is your highest-ROI change. Even a naive implementation with a 0.90 similarity threshold will catch 30-40% of repeat questions. Use whatever vector store you already have access to.
Step 3: Add the tiered routing. Start with two tiers — "use the cheap model" and "use the expensive model." You can add more sophistication later. The classifier can be as simple as message length + keyword detection initially.
Step 4: Set up cost dashboards and alerts. You want to see cost-per-request, cache hit rate, and tier distribution in real time. If your cache hit rate drops below 30%, something changed in your request patterns and you need to investigate.
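For Step 3, a first-pass two-tier classifier built from message length and keyword detection might look like the sketch below. The keyword list, the 400-character cutoff, and the `classify_simple` name are placeholder assumptions; tune them against your own tickets.

```python
import re

# Hypothetical keywords that hint at a harder ticket; replace with terms
# drawn from your own ticket history.
COMPLEX_KEYWORDS = {"integration", "api", "error", "migration", "webhook"}
MAX_SIMPLE_LENGTH = 400  # characters; an arbitrary starting point


def classify_simple(message: str) -> str:
    """Return "cheap" or "expensive" using only length and keywords."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    if len(message) > MAX_SIMPLE_LENGTH or words & COMPLEX_KEYWORDS:
        return "expensive"
    return "cheap"


short_question = classify_simple("How do I reset my password?")
technical_issue = classify_simple("Our webhook integration returns an error")
```

A heuristic like this will misroute some tickets, which is fine as a starting point: the instrumentation from Step 1 tells you how often, and that data is what justifies replacing it with a trained classifier later.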
The Bigger Picture
The pattern here isn't specific to support triage. Any AI application that handles heterogeneous requests — code review, content moderation, document processing — benefits from this tiered approach. The companies that will survive the current AI cost crunch aren't the ones with the biggest budgets. They're the ones that figured out how to route 60% of their traffic to a cache, 30% to a cheap model, and only 10% to the expensive one.
We're past the "just throw GPT-4 at everything" phase. The engineering challenge now is building intelligent routing layers that maximize output quality per dollar spent. That's not a cost-cutting exercise — it's an architecture decision that determines whether your AI features can actually scale to production traffic without bankrupting you.