Building a Cost-Aware AI Triage API: Stop Burning Money on LLM Calls You Don't Need
Automation Patterns · 6 min read

A practical breakdown of building an AI support triage API that doesn't hemorrhage money. Route requests intelligently, cache aggressively, and only call expensive models when you actually need them.

Your AI API Is Probably Burning Cash Right Now

If you're running any kind of AI-powered support system — chatbot, ticket triage, auto-responder — there's a good chance you're sending every single request to your most expensive model endpoint. Every "how do I reset my password?" question is getting the full GPT-4-class treatment when a cached response or a simple pattern match would do.

DigitalOcean published a tutorial a couple days ago (May 19, 2026) on building a cost-aware AI support triage API, and while the tutorial itself is fairly straightforward, the pattern it demonstrates is something I've been pushing teams to adopt for months. The idea is dead simple: not every request deserves the same computational spend.

This isn't about being cheap. It's about being smart with finite resources so you can actually scale without your CFO having a panic attack every time they see the OpenAI invoice.

The Core Pattern: Tiered Request Routing

The architecture breaks down into a classification layer that sits in front of your actual AI processing. Before any request hits an LLM, you run it through a lightweight classifier that determines complexity and routes accordingly.

Here's the basic flow:

python
from enum import Enum
from dataclasses import dataclass
from typing import Optional

class TicketTier(Enum):
    CACHED = "cached"        # Known answer, no LLM needed
    SIMPLE = "simple"        # Small model, low token budget
    COMPLEX = "complex"      # Full model, higher token budget
    ESCALATE = "escalate"    # Human handoff, don't waste tokens

@dataclass
class TriageResult:
    tier: TicketTier
    confidence: float
    estimated_cost: float
    cached_response: Optional[str] = None

def triage_request(message: str, context: dict) -> TriageResult:
    # Check cache first: this costs one embedding lookup, no LLM call
    cached = check_semantic_cache(message)
    if cached and cached.similarity > 0.95:
        return TriageResult(
            tier=TicketTier.CACHED,
            confidence=cached.similarity,
            estimated_cost=0.0,
            cached_response=cached.response
        )
    
    # Classify complexity with lightweight model
    complexity = classify_complexity(message, context)
    
    if complexity.needs_human:
        return TriageResult(tier=TicketTier.ESCALATE, confidence=complexity.score, estimated_cost=0.001)
    
    if complexity.score < 0.3:
        return TriageResult(tier=TicketTier.SIMPLE, confidence=complexity.score, estimated_cost=0.002)
    
    return TriageResult(tier=TicketTier.COMPLEX, confidence=complexity.score, estimated_cost=0.015)

The key insight: your classification step should cost 10-100x less than your actual generation step. If you're using an embedding model for semantic cache lookup and a tiny classifier for complexity scoring, you're spending fractions of a cent to potentially save 1-2 cents per request. At scale, that's the difference between a viable product and a money pit.
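To make that ratio concrete, here's a back-of-the-envelope check. It assumes the per-request estimates from the snippet above (0.002 for the small model, 0.015 for the full model) and a hypothetical triage overhead of 0.0002 per request for the embedding plus classifier. The function names and the overhead figure are illustrative assumptions, not measured numbers.

```python
def triage_net_saving(
    overhead_per_request: float,
    diverted_fraction: float,
    expensive_cost: float,
    cheap_cost: float,
) -> float:
    """Average saving per request from routing, net of the triage overhead.

    Every request pays the overhead; only the diverted fraction saves
    the difference between the expensive and the cheap path.
    """
    saving = diverted_fraction * (expensive_cost - cheap_cost)
    return saving - overhead_per_request


def break_even_fraction(
    overhead_per_request: float, expensive_cost: float, cheap_cost: float
) -> float:
    """Smallest diverted fraction at which triage pays for itself."""
    return overhead_per_request / (expensive_cost - cheap_cost)


# Assumed: $0.0002 overhead, half of all traffic diverted to the small model.
net = triage_net_saving(0.0002, 0.5, 0.015, 0.002)
threshold = break_even_fraction(0.0002, 0.015, 0.002)
print(f"net saving per request: ${net:.4f}")
print(f"break-even diverted fraction: {threshold:.1%}")
```

Under these assumed numbers, triage pays for itself once roughly 1.5% of requests are diverted away from the expensive model. Plug in your own pricing before drawing conclusions.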

The Semantic Cache Layer

This is where most of the savings come from. Support requests are incredibly repetitive. In my experience, 40-60% of incoming tickets are variations of the same 50-100 questions.

python
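import hashlib
from typing import Optional

import numpy as np
from redis import Redis
from redis.commands.search.query import Query

class SemanticCache:
    def __init__(self, redis_client: Redis, embedding_model, threshold: float = 0.92):
        self.redis = redis_client
        self.model = embedding_model
        self.threshold = threshold

    def _embed(self, text: str) -> bytes:
        # Redis expects raw float32 bytes matching the index's vector type
        return np.asarray(self.model.encode(text), dtype=np.float32).tobytes()

    def lookup(self, query: str) -> Optional[dict]:
        # KNN queries need query dialect 2. Redis returns a distance, not a
        # similarity; with a COSINE index (assumed here) similarity = 1 - distance.
        knn = (
            Query("*=>[KNN 3 @embedding $vec AS distance]")
            .sort_by("distance")
            .return_fields("query", "response", "distance")
            .dialect(2)
        )
        results = self.redis.ft("cache_idx").search(
            knn, query_params={"vec": self._embed(query)}
        )
        if not results.docs:
            return None

        best = results.docs[0]  # Closest match after sorting by distance
        similarity = 1.0 - float(best.distance)
        if similarity > self.threshold:
            return {
                "response": best.response,
                "similarity": similarity,
                "original_query": best.query,
            }
        return None

    def store(self, query: str, response: str, ttl: int = 86400):
        # hash() is salted per process for strings, so use a stable digest for keys
        key = f"cache:{hashlib.sha256(query.encode('utf-8')).hexdigest()}"
        self.redis.hset(key, mapping={
            "query": query,
            "response": response,
            "embedding": self._embed(query),
        })
        self.redis.expire(key, ttl)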

I'm using Redis with vector search here because it's fast and most teams already have Redis in their stack. Note that this needs the Redis search module (RediSearch, bundled with Redis Stack) and a vector index, here named cache_idx, created ahead of time. You could swap in Pinecone, Qdrant, or even pgvector; the pattern is the same. The TTL is important: support answers go stale, and you don't want to serve cached responses about features that changed last week.
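The "same pattern, different store" point can be shown without any vector database at all. Here's a minimal in-memory sketch, assuming embeddings arrive as plain lists of floats. `InMemorySemanticCache` and its methods are hypothetical names for illustration, and a linear scan like this only makes sense for small caches or tests.

```python
import math
import time
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    def __init__(self, threshold: float = 0.92, ttl_seconds: float = 86400):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        # Each entry is (vector, response, expires_at)
        self._entries: list[tuple[list[float], str, float]] = []

    def store(self, vector: list[float], response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries.append((vector, response, now + self.ttl_seconds))

    def lookup(self, vector: list[float], now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        # Drop expired entries so stale answers are never served
        self._entries = [e for e in self._entries if e[2] > now]
        best = max(
            self._entries,
            key=lambda e: cosine_similarity(vector, e[0]),
            default=None,
        )
        if best is None:
            return None
        similarity = cosine_similarity(vector, best[0])
        if similarity < self.threshold:
            return None
        return {"response": best[1], "similarity": similarity}


# Toy 2-dimensional "embeddings"; `now` is injected to make expiry visible.
cache = InMemorySemanticCache(threshold=0.92, ttl_seconds=60)
cache.store([1.0, 0.0], "Use the reset link on the login page.", now=0)
hit = cache.lookup([0.99, 0.05], now=10)    # near-duplicate question
miss = cache.lookup([0.0, 1.0], now=10)     # unrelated question
expired = cache.lookup([1.0, 0.0], now=120)  # same question, after the TTL
```

The three moving parts are the same in any backend: a similarity threshold, a nearest-neighbor lookup, and an expiry policy.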

Cost Tracking: The Part Everyone Skips

Here's what separates a production system from a tutorial demo. You need per-request cost attribution, and you need it in real time.

python
import time
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    request_id: str
    tier: TicketTier
    input_tokens: int = 0
    output_tokens: int = 0
    model_used: str = ""
    latency_ms: float = 0.0
    cache_hit: bool = False
    estimated_cost_usd: float = 0.0
    
    def calculate_cost(self):
        # Pricing per 1K tokens (adjust to your provider)
        pricing = {
            "gpt-4o": {"input": 0.005, "output": 0.015},
            "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
            "cached": {"input": 0.0, "output": 0.0},
        }
        # Unknown models fall back to gpt-4o-mini rates, which can understate cost
        rates = pricing.get(self.model_used, pricing["gpt-4o-mini"])
        self.estimated_cost_usd = (
            (self.input_tokens / 1000) * rates["input"] +
            (self.output_tokens / 1000) * rates["output"]
        )

Log every single request with its tier, model used, token count, and calculated cost. Aggregate daily. Set alerts when your cost-per-request average drifts above your target. I've seen teams discover that a single poorly written system prompt was inflating their token usage by 3x because it included unnecessary context on every call.
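A sketch of that daily aggregation and drift alert, assuming each logged request has been reduced to a (day, tier, cost) record. `summarize_by_day`, `cost_alert`, and the 20% tolerance are hypothetical choices for illustration, not part of the original tutorial.

```python
from collections import Counter, defaultdict


def summarize_by_day(records):
    """Aggregate (day, tier, cost_usd) records into per-day totals."""
    summary = defaultdict(
        lambda: {"requests": 0, "cost_usd": 0.0, "tiers": Counter()}
    )
    for day, tier, cost_usd in records:
        entry = summary[day]
        entry["requests"] += 1
        entry["cost_usd"] += cost_usd
        entry["tiers"][tier] += 1
    for entry in summary.values():
        entry["avg_cost_usd"] = entry["cost_usd"] / entry["requests"]
    return dict(summary)


def cost_alert(avg_cost_usd: float, target_usd: float, tolerance: float = 0.2) -> bool:
    """True when the average cost per request exceeds the target by more than `tolerance`."""
    return avg_cost_usd > target_usd * (1 + tolerance)


# Made-up sample records for one day
records = [
    ("2026-05-20", "cached", 0.0),
    ("2026-05-20", "simple", 0.002),
    ("2026-05-20", "complex", 0.015),
    ("2026-05-20", "cached", 0.0),
]
day = summarize_by_day(records)["2026-05-20"]
alert = cost_alert(day["avg_cost_usd"], target_usd=0.003)
print(day["requests"], dict(day["tiers"]), alert)
```

In this made-up sample the average lands above the assumed target plus tolerance, so the alert fires. The tier counter doubles as the tier distribution you'd want on a dashboard.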

The Budget Circuit Breaker

This is the pattern I think more teams need to adopt: a hard budget ceiling that degrades gracefully instead of just... spending more money.

python
class BudgetCircuitBreaker:
    def __init__(self, daily_budget_usd: float, redis_client: Redis):
        self.daily_budget = daily_budget_usd
        self.redis = redis_client
    
    def check_budget(self) -> Optional[TicketTier]:
        # Returns a tier override, or None when no override is needed
        today = time.strftime("%Y-%m-%d")
        spent = float(self.redis.get(f"budget:{today}") or 0)
        remaining_ratio = (self.daily_budget - spent) / self.daily_budget
        
        if remaining_ratio < 0.1:
            # Under 10% budget remaining — cache only, escalate everything else
            return TicketTier.ESCALATE
        elif remaining_ratio < 0.3:
            # Under 30% — downgrade complex to simple model
            return TicketTier.SIMPLE
        return None  # No override, proceed normally

Warning: Don't set your circuit breaker too aggressively. If it trips during peak hours and starts escalating everything to humans, you've just created a different kind of cost problem. Start with a generous budget and tighten based on actual usage data.
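The breaker returns an override tier, but the snippet leaves open how that override is combined with the triage result. One possible reading, sketched under that assumption: cache hits always pass through, an ESCALATE override escalates everything else, and a SIMPLE override only downgrades COMPLEX. `apply_budget_override` is a hypothetical helper, and the enum is repeated so the block runs on its own.

```python
from enum import Enum
from typing import Optional


class TicketTier(Enum):
    CACHED = "cached"
    SIMPLE = "simple"
    COMPLEX = "complex"
    ESCALATE = "escalate"


def apply_budget_override(
    tier: TicketTier, override: Optional[TicketTier]
) -> TicketTier:
    """Combine the triage tier with the circuit breaker's override."""
    if override is None or tier is TicketTier.CACHED:
        # Cache hits involve no LLM call, so they are never overridden
        return tier
    if override is TicketTier.ESCALATE:
        return TicketTier.ESCALATE
    if override is TicketTier.SIMPLE and tier is TicketTier.COMPLEX:
        return TicketTier.SIMPLE
    return tier


# With a SIMPLE override in force, complex tickets get the cheaper model
print(apply_budget_override(TicketTier.COMPLEX, TicketTier.SIMPLE))
```

Keeping this as a pure function makes the degradation policy easy to unit test separately from Redis and from the models themselves.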

What You Should Do Right Now

If you're running an AI-powered support system without cost-aware routing, here's your action plan:

Step 1: Instrument your current system. Before you optimize anything, you need to know where the money is going. Add token counting and cost calculation to every LLM call. Give yourself a week of data.

Step 2: Build the semantic cache. This is your highest-ROI change. Even a naive implementation with a 0.90 similarity threshold will catch 30-40% of repeat questions. Use whatever vector store you already have access to.

Step 3: Add the tiered routing. Start with two tiers — "use the cheap model" and "use the expensive model." You can add more sophistication later. The classifier can be as simple as message length + keyword detection initially.

Step 4: Set up cost dashboards and alerts. You want to see cost-per-request, cache hit rate, and tier distribution in real time. If your cache hit rate drops below 30%, something changed in your request patterns and you need to investigate.
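Step 3's "message length + keyword detection" starting point could look something like this. It's a deliberately naive sketch: the keyword lists, the 400-character cutoff, and the score increments are placeholder assumptions to tune against your own tickets, and `Complexity` / `classify_complexity` only mirror the names used in the routing snippet earlier.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder keyword lists; plain substring matching will produce false positives
ESCALATION_KEYWORDS = ("refund", "legal", "chargeback", "cancel my account")
COMPLEX_KEYWORDS = ("error", "integration", "api", "crash", "data loss")


@dataclass
class Complexity:
    score: float       # 0.0 = trivial, 1.0 = hard
    needs_human: bool


def classify_complexity(message: str, context: Optional[dict] = None) -> Complexity:
    # `context` is accepted for signature compatibility but unused in this sketch
    text = message.lower()
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return Complexity(score=1.0, needs_human=True)

    score = 0.1
    if len(text) > 400:
        score += 0.3  # Long messages tend to describe multi-step problems
    hits = sum(keyword in text for keyword in COMPLEX_KEYWORDS)
    score += min(hits, 3) * 0.2
    return Complexity(score=min(score, 1.0), needs_human=False)


simple = classify_complexity("How do I reset my password?")
hard = classify_complexity("The API integration returns an error after the upgrade")
human = classify_complexity("I want a refund for last month")
```

With the 0.3 cutoff from the routing snippet, the first message would go to the small model, the second to the full model, and the third to a human. Once you have a week of labeled traffic, a small trained classifier is the natural replacement.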

The Bigger Picture

The pattern here isn't specific to support triage. Any AI application that handles heterogeneous requests — code review, content moderation, document processing — benefits from this tiered approach. The companies that will survive the current AI cost crunch aren't the ones with the biggest budgets. They're the ones that figured out how to route 60% of their traffic to a cache, 30% to a cheap model, and only 10% to the expensive one.
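As a worked example of that 60/30/10 split, here's the blended cost per request, using the per-request estimates from the routing snippet (0 for a cache hit, 0.002 for the small model, 0.015 for the full model) as assumed costs. `blended_cost` is a hypothetical helper, and triage overhead is left out.

```python
def blended_cost(mix: dict[str, float], cost_per_tier: dict[str, float]) -> float:
    """Expected cost per request for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_tier[tier] for tier, share in mix.items())


# Assumed per-request costs, taken from the estimates in the routing snippet
costs = {"cached": 0.0, "simple": 0.002, "complex": 0.015}
routed = blended_cost({"cached": 0.6, "simple": 0.3, "complex": 0.1}, costs)
unrouted = blended_cost({"complex": 1.0}, costs)
print(f"routed: ${routed:.4f}  unrouted: ${unrouted:.4f}  ratio: {unrouted / routed:.1f}x")
```

Under those assumptions the routed mix costs about 0.0021 per request against 0.015 for sending everything to the full model, roughly a 7x difference.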

We're past the "just throw GPT-4 at everything" phase. The engineering challenge now is building intelligent routing layers that maximize output quality per dollar spent. That's not a cost-cutting exercise — it's an architecture decision that determines whether your AI features can actually scale to production traffic without bankrupting you.

Written by Eko

