Embedding Pipeline Resilience
A technical proposal for Jack & Jill AI. I hit this unhandled exception as a user. Here's what it told me, what it means for your infrastructure, and exactly how I'd fix it.
Prepared by Dave Cockson
davidcockson.com
The Error (What You Showed Me)
This is what a user sees in the chat interface. This topology leak should never reach a frontend client.
The Error (Parsed & Annotated)
Performed 7 actions
Search failed
litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}No fallback model group found for original model_group=embedding_1536. Fallbacks=[{: []}, {: [, ]}, {: [, ]}, {: ['gemini/gemini-2.5-flash-lite']}, {'jack-gemini-livechat': []}, {'jack-gemini-livechat-alt': []}, {'gemini-3-flash-lb': ['gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-2.5-flash-lb': ['gemini/gemini-2.5-flash-lite']}, {'gemini-3-flash-lb-alt': ['gemini-2.5-flash-lb-alt', 'gemini-3-flash-lb', 'gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-2.5-flash-lb-alt': ['gemini-3-flash-lb', 'gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-3-flash-lb': ['gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-2.5-flash-lb': ['gemini/gemini-2.5-flash-lite']}, {'jack-web-search': ['jack-web-search-alt', 'jack-web-search-2.5', 'jack-web-search-2.5-alt']}, {'jack-web-search-alt': ['jack-web-search-2.5', 'jack-web-search-2.5-alt']}, {'jack-web-search-2.5': ['jack-web-search-2.5-alt']}, {'gemini/gemini-3-flash-preview': ['gemini/gemini-2.5-flash', 'gemini/gemini-2.5-flash-lite']}, {: ['gemini-2.0-flash-thinking', ]}]. Received Model Group=embedding_1536 Available Model Group Fallbacks=None Error doing the fallback: litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details...', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}No fallback model group found for original model_group=embedding_1536. Fallbacks=[...] LiteLLM Retried: 2 times, LiteLLM Max Retries: 2 LiteLLM Retried: 2 times, LiteLLM Max Retries: 2
embedding_1536
Reveals the OpenAI embedding model group in use.
429 (type: 'insufficient_quota')
Not a transient rate limit. This is a hard spend cap (OpenAI project cap or LiteLLM budget control).
No fallback model group found
Shows no degradation path for embeddings. Chat models have fallbacks; embeddings are a single point of failure.
Full fallback chain exposed
Massive topology leak exposing model groups, provider names, and internal naming conventions.
LiteLLM Retried: 2 times
Retried a non-retryable error twice. This simply wasted latency and dumped the trace repeatedly to the user.
What This Reveals (Architecture Inference)
- → LiteLLM as gateway (confirmed from error trace).
- → Tiered chat model fallbacks working correctly: SMALL → MEDIUM → LARGE → Gemini.
- → Named internal groups: jack-gemini-livechat, jack-web-search.
- → Load-balanced Gemini pools: gemini-3-flash-lb, gemini-2.5-flash-lb.
- → Reasoning tier: REASONING → gemini-2.0-flash-thinking.
- → Langfuse likely as observability layer (standard LiteLLM pairing; would be unusual not to).
The Critical Gap
The embedding pipeline has a single provider, no fallback, no queue, and no cache.
It is the only pipeline without resilience.
Root Cause Hypothesis
The error body is specific: type: 'insufficient_quota'. A $20M-funded startup isn't broke. So why this error?
A. Global Spend Cap Hit
OpenAI dashboard has a configured monthly spend limit. Embedding call volume blew through it.
Least likely: too blunt an instrument.
B. Per-Customer Project Quotas
Revenue model (10% of salary) requires capping per-customer costs. One customer exhausts their cap → their search breaks, everyone else is fine.
Highly likely. Identical error output.
C. Per-Customer Rate Shaping
LiteLLM throttles per-customer to stay under global TPM ceiling. One customer's burst eats headroom → OpenAI rejects subsequent calls.
Highly likely. Deliberate cost controls.
Most likely: B or C. All three converge on the same solutions: reduce call volume via cache, shape traffic via queue, monitor per-customer usage.
The Fix: Cache + Queue
The real pressure isn't 14M job postings on ingest (predictable batch load). It's the chat side: 230,000+ candidates generating real-time embedding calls every conversation turn. Each call counts against a budget.
1. Duplication is the Multiplier
- Semantically identical queries across users ("remote Python London" vs "London remote dev role").
- Same user refining across turns ("Python roles" → "Python remote" → "Python remote Manchester").
- Candidate profiles potentially re-embedded per session instead of once at creation.
- Conversational context re-embedded every turn in RAG flows.
2. Semantic Query Cache
Normalise queries (e.g. lower-case, drop filler words, sort tokens), then hash them. Cache the embedding AND the top-K search results (TTL 5–10 mins). "Remote Python London" and "London remote Python role" hit the same cache entry.
LiteLLM has built-in redis-semantic / qdrant-semantic cache modes. This requires a vector-capable store (Redis Stack/Qdrant) but directly extends per-customer embedding budgets without raising caps.
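A minimal sketch of the normalise-and-hash layer, assuming an in-process dictionary in place of Redis. The filler-word list, `cache_key`, `TTLCache` and `get_or_compute` are hypothetical names for illustration, not LiteLLM APIs. This is exact-match caching after normalisation; genuinely paraphrased queries still need the semantic cache modes above.

```python
import hashlib
import re
import time

# Hypothetical filler-word list; tune it against real query logs.
FILLER_WORDS = {"role", "roles", "job", "jobs", "a", "an", "the", "in", "for"}


def cache_key(query: str) -> str:
    """Lower-case, tokenise, drop filler words, sort tokens, then hash."""
    tokens = re.findall(r"[a-z0-9+#]+", query.lower())
    kept = sorted(set(tokens) - FILLER_WORDS)
    return hashlib.sha256(" ".join(kept).encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


def get_or_compute(query: str, cache: TTLCache, compute_fn):
    """Return the cached value for a normalised query, computing it on a miss.

    compute_fn is whatever costs budget: the embedding call, or the
    embedding call plus the top-K vector search.
    """
    key = cache_key(query)
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = compute_fn(query)
    cache.set(key, value)
    return value
```

With a 5-minute TTL, `get_or_compute("Remote Python London", ...)` followed by `get_or_compute("London remote Python role", ...)` makes one paid call instead of two.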
3. Lane Separation via Queue
Batch traffic (job ingest, CV parses, profile updates) goes through an async queue (Redis Streams or BullMQ) so it never competes with real-time chat queries for the same budget/rate ceiling.
For the async lane, utilize the OpenAI Batch API. This is a configuration-level change yielding a 50% cost reduction and higher throughput allowance.
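The lane idea can be sketched in-process before committing to Redis Streams or BullMQ. The sketch below assumes a fixed token budget per window and a hypothetical `batch_share` cap; `LaneScheduler` and its parameters are illustrative names, not part of any library.

```python
from collections import deque


class LaneScheduler:
    """Two lanes sharing one token budget; real-time work always goes first."""

    def __init__(self, tokens_per_window: int, batch_share: float = 0.3):
        self.tokens_per_window = tokens_per_window
        # Batch work may never consume more than this slice of the window.
        self.batch_cap = int(tokens_per_window * batch_share)
        self.used_total = 0
        self.used_batch = 0
        self.realtime = deque()
        self.batch = deque()

    def submit(self, lane: str, job_id: str, tokens: int) -> None:
        queue = self.realtime if lane == "realtime" else self.batch
        queue.append((job_id, tokens))

    def next_job(self):
        """Return the next job id to run, or None if nothing fits right now."""
        remaining = self.tokens_per_window - self.used_total
        if self.realtime:
            job_id, tokens = self.realtime[0]
            if tokens > remaining:
                return None  # wait for the next window; batch must not jump ahead
            self.realtime.popleft()
            self.used_total += tokens
            return job_id
        if self.batch:
            job_id, tokens = self.batch[0]
            if tokens <= remaining and self.used_batch + tokens <= self.batch_cap:
                self.batch.popleft()
                self.used_total += tokens
                self.used_batch += tokens
                return job_id
        return None

    def reset_window(self) -> None:
        """Call once per budget window (e.g. every minute)."""
        self.used_total = 0
        self.used_batch = 0
```

The design choice is that batch jobs are capped rather than merely deprioritised, so a large ingest run cannot drain the headroom that chat queries need later in the same window.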
⏱️ Timeline: 1 sprint (1–2 weeks)
The Safety Net: Circuit Breaker + BM25
When the embedding API still fails (outage, unexpected spike beyond queue capacity), the user experience shouldn't break.
The Fixed UX
- Circuit breaker trips: Search falls back to BM25/keyword index. Results are less precise, but the user gets something. Zero downtime, no stack trace.
- State machine: closed → open (on N failures) → half-open (probe recovery).
- Low overhead: BM25 index synced alongside vector store.
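A minimal sketch of the closed → open → half-open state machine with a keyword fallback. `CircuitBreaker`, `search`, and the threshold and recovery values are assumptions for illustration; `vector_search` and `keyword_search` stand in for the real embedding-backed and BM25 search functions.

```python
import time


class CircuitBreaker:
    """closed -> open after N consecutive failures -> half-open after a cool-down."""

    def __init__(self, failure_threshold=3, recovery_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._recovery = recovery_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def allow_request(self) -> bool:
        # After the cool-down, let a probe request through.
        if self.state == "open" and self._clock() - self._opened_at >= self._recovery:
            self.state = "half-open"
        return self.state != "open"

    def record_success(self) -> None:
        self.state = "closed"
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self._failures >= self._threshold:
            self.state = "open"
            self._opened_at = self._clock()


def search(query, breaker, vector_search, keyword_search):
    """Try vector search while the breaker allows it; otherwise use keywords."""
    if breaker.allow_request():
        try:
            results = vector_search(query)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return results
    return keyword_search(query)
```

This simplified version lets every request through while half-open; a production breaker would usually admit a single probe at a time.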
⏱️ Timeline: 2–3 days (bolted onto queue work)
Hardening: Quick Wins (Priority 1)
Error Sanitisation
Current state: raw LiteLLM exception → chat. Topology exposed (model group names, provider identifiers, fallback chains). This is a security risk: it hands an attacker what they need for targeted rate-limit abuse.
- ↳ Internal: full stack trace → structured log
- ↳ User: "Search temporarily unavailable."
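A minimal sketch of that split, assuming a plain Python wrapper around the search call. `safe_search`, `sanitise_error` and the `incident_id` field are hypothetical names; the real hook point depends on how the chat backend surfaces tool errors.

```python
import logging
import uuid

logger = logging.getLogger("search")

USER_MESSAGE = "Search temporarily unavailable."


def sanitise_error(exc: Exception) -> dict:
    """Log the full exception internally; return only a generic message."""
    incident_id = uuid.uuid4().hex[:8]
    # The stack trace and provider details go to the log, keyed by incident id.
    logger.error("search failed incident_id=%s", incident_id, exc_info=exc)
    return {"ok": False, "message": USER_MESSAGE, "incident_id": incident_id}


def safe_search(query: str, search_fn) -> dict:
    """Run search_fn and make sure no raw exception text reaches the client."""
    try:
        return {"ok": True, "results": search_fn(query)}
    except Exception as exc:
        return sanitise_error(exc)
```

The incident id gives support a way to find the full trace in the logs without showing any of it to the user.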
⏱️ Timeline: 1 day. Ship this first.
Retry Policy
Current behaviour: LiteLLM retried insufficient_quota twice. It will never succeed on retry. The user waits through doomed retries, and the error block appears twice.
- ✕ insufficient_quota: Do not retry → trip breaker.
- ↻ rate_limit_exceeded: Retry w/ exponential backoff.
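A minimal sketch of that policy, assuming the provider's error code is available on the exception. `ProviderError` is a hypothetical stand-in for whatever the gateway raises, and `call_with_retry` is not a LiteLLM function.

```python
import random
import time

# Codes that can never succeed on retry; fail fast so the breaker can trip.
NON_RETRYABLE_CODES = {"insufficient_quota"}


class ProviderError(Exception):
    """Hypothetical stand-in for a gateway exception carrying the error code."""

    def __init__(self, status: int, code: str):
        super().__init__(f"{status} {code}")
        self.status = status
        self.code = code


def call_with_retry(fn, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying retryable errors with exponential backoff plus jitter."""
    attempt = 0
    while True:
        try:
            return fn()
        except ProviderError as err:
            if err.code in NON_RETRYABLE_CODES or attempt >= max_retries:
                raise
            # 0.5s, 1s, 2s, ... plus jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
            attempt += 1
```

Both cases arrive as HTTP 429, so the decision has to be made on the error code, not the status.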
⏱️ Timeline: Hours. Ship alongside sanitisation.
Belt & Braces: Monitoring
If you're on Langfuse (and you likely are), you already have evaluation infrastructure detecting low match quality and stale embeddings. Currently, these signals are siloed. Let's make them operational.
The Killer Addition: Per-Customer Dashboards
Track each customer's embedding budget consumption in real time. Alert at 80% threshold → proactive cap increase or cache tuning before hard failure. Correlate customer embedding spend with revenue (10% of hire salary) for unit economics visibility.
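A minimal sketch of the 80% alert, assuming per-customer caps are known in USD and each embedding call reports its cost. `BudgetTracker` and its fields are hypothetical; in practice the figures would come from gateway spend logs and the alert would fire in Grafana.

```python
from collections import defaultdict


class BudgetTracker:
    """Accumulate per-customer embedding spend and flag the alert threshold."""

    def __init__(self, caps_usd: dict, alert_ratio: float = 0.8):
        self._caps = caps_usd
        self._alert_ratio = alert_ratio
        self._spent = defaultdict(float)
        self._alerted = set()

    def record(self, customer_id: str, cost_usd: float) -> bool:
        """Add a cost; return True the first time the threshold is crossed."""
        self._spent[customer_id] += cost_usd
        threshold = self._caps[customer_id] * self._alert_ratio
        if self._spent[customer_id] >= threshold and customer_id not in self._alerted:
            self._alerted.add(customer_id)
            return True
        return False

    def usage_ratio(self, customer_id: str) -> float:
        """Fraction of the cap consumed so far, for dashboard display."""
        return self._spent[customer_id] / self._caps[customer_id]
```

Alerting once per crossing keeps the signal actionable: one notification per customer per budget period, not one per call after the threshold.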
Loki
Structured log aggregation. All LiteLLM callbacks, eval outputs, 429 events, queue depth.
Tempo
Distributed tracing. Trace query → embed → vector search → rerank. Pinpoint failure location.
Grafana
Single pane: embedding health, cache hits, search latency P99. Alerts at 80% quota.
- Low match score → trigger re-embedding via async queue.
- Quota approaching limit → proactive cache warming for that customer.
⏱️ Timeline: Weekend deployment (OSS & self-hostable).
Implementation Roadmap
| Priority | What | Effort | Impact |
|---|---|---|---|
| P1 | Error sanitisation middleware | 1 day | Stops topology leak immediately |
| P1 | Retry policy: distinguish 429s | Hours | Stops wasting latency on doomed retries |
| P1 | Embedding cache + message queue | 1–2 weeks | Eliminates quota exhaustion at source |
| P2 | Circuit breaker + BM25 fallback | 2–3 days | Graceful degradation when all else fails |
| P2 | OpenAI Batch API for async | Hours (config) | 50% cost reduction + higher throughput |
| P3 | Loki + Tempo + Grafana observability | 1 weekend | Pre-emptive alerting, budget tracking |
| P3 | Langfuse eval → infra loops | 2–3 days | Evals drive infrastructure behaviour |
P1 alone (error sanitisation + retry policy + cache/queue) fixes the user-facing problem in under 2 weeks.
How I Know This
I diagnosed your architecture from a user-facing error dump. I've built this exact observability stack and handled these failure modes.
evalui (ui.davidcockson.com)
Dual-backend trace comparison, Langfuse + Phoenix integration, fully OTel instrumented.
Homelab Operations
Loki + Tempo + Grafana running in Docker on Contabo, behind Cloudflare Access. (Grafana community dashboard with 100+ downloads).
Project Immunity
Engineered a multi-model ensemble system with deterministic rule-checks — utilizing the exact same "belt and braces" resilience thinking.
8 Years Governance
Building incident response and governance frameworks for regulated systems that simply cannot fail silently.
Let's Talk.
Happy to walk through this live or build a proof of concept.