Production Post-Mortem

Embedding Pipeline Resilience

A technical proposal for Jack & Jill AI. I hit this unhandled exception as a user. Here's what it told me, what it means for your infrastructure, and exactly how I'd fix it.

Prepared by Dave Cockson

davidcockson.com

The Error (What You Showed Me)

This is what a user sees in the chat interface. This topology leak should never reach a frontend client.

The Error (Parsed & Annotated)

Raw Trace Dump

Performed 7 actions
Search failed
litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}No fallback model group found for original model_group=embedding_1536. Fallbacks=[{: []}, {: [, ]}, {: [, ]}, {: ['gemini/gemini-2.5-flash-lite']}, {'jack-gemini-livechat': []}, {'jack-gemini-livechat-alt': []}, {'gemini-3-flash-lb': ['gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-2.5-flash-lb': ['gemini/gemini-2.5-flash-lite']}, {'gemini-3-flash-lb-alt': ['gemini-2.5-flash-lb-alt', 'gemini-3-flash-lb', 'gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-2.5-flash-lb-alt': ['gemini-3-flash-lb', 'gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-3-flash-lb': ['gemini-2.5-flash-lb', 'gemini/gemini-2.5-flash-lite']}, {'gemini-2.5-flash-lb': ['gemini/gemini-2.5-flash-lite']}, {'jack-web-search': ['jack-web-search-alt', 'jack-web-search-2.5', 'jack-web-search-2.5-alt']}, {'jack-web-search-alt': ['jack-web-search-2.5', 'jack-web-search-2.5-alt']}, {'jack-web-search-2.5': ['jack-web-search-2.5-alt']}, {'gemini/gemini-3-flash-preview': ['gemini/gemini-2.5-flash', 'gemini/gemini-2.5-flash-lite']}, {: ['gemini-2.0-flash-thinking', ]}]. Received Model Group=embedding_1536 Available Model Group Fallbacks=None Error doing the fallback: litellm.RateLimitError: RateLimitError: OpenAIException - Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details...', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}No fallback model group found for original model_group=embedding_1536. Fallbacks=[...] LiteLLM Retried: 2 times, LiteLLM Max Retries: 2 LiteLLM Retried: 2 times, LiteLLM Max Retries: 2

embedding_1536

Reveals the OpenAI embedding model group in use.

429 (type: 'insufficient_quota')

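Not a transient rate limit. This is a hard spend cap. The OpenAIException wrapper suggests the rejection came from OpenAI itself (a project or organisation budget), though a LiteLLM budget control upstream may still be shaping which key was used.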

No fallback model group found

Shows no degradation path for embeddings. Chat models have fallbacks; embeddings are a single point of failure.

Full fallback chain exposed

Massive topology leak exposing model groups, provider names, and internal naming conventions.

LiteLLM Retried: 2 times

Retried a non-retryable error twice. This simply wasted latency and dumped the trace repeatedly to the user.

What This Reveals (Architecture Inference)

→ LiteLLM as gateway (confirmed from error trace).
→ Tiered chat model fallbacks working correctly: SMALL → MEDIUM → LARGE → Gemini.
→ Named internal groups: jack-gemini-livechat, jack-web-search.
→ Load-balanced Gemini pools: gemini-3-flash-lb, gemini-2.5-flash-lb.
→ Reasoning tier: REASONING → gemini-2.0-flash-thinking.
→ Langfuse likely as observability layer (standard LiteLLM pairing; would be unusual not to).

The Critical Gap: The embedding pipeline has a single provider, no fallback, no queue, and no cache. It is the only pipeline without resilience.
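A gap like this can be caught before production with a simple coverage check over the gateway config. The sketch below is a minimal illustration, not LiteLLM's own tooling: `groups_without_fallback` is a hypothetical helper, and the group list is only the subset visible in the trace above, since the full production config is unknown.

```python
from typing import Dict, List


def groups_without_fallback(
    model_groups: List[str],
    fallbacks: List[Dict[str, List[str]]],
) -> List[str]:
    """Return model groups that have no non-empty fallback list."""
    covered = set()
    for entry in fallbacks:
        for group, targets in entry.items():
            if targets:  # an empty list offers no degradation path
                covered.add(group)
    return [group for group in model_groups if group not in covered]


# Subset of the groups visible in the leaked trace (same list-of-dicts shape).
FALLBACKS = [
    {"gemini-3-flash-lb": ["gemini-2.5-flash-lb", "gemini/gemini-2.5-flash-lite"]},
    {"gemini-2.5-flash-lb": ["gemini/gemini-2.5-flash-lite"]},
    {"jack-web-search": ["jack-web-search-alt", "jack-web-search-2.5"]},
    {"jack-gemini-livechat": []},
]
MODEL_GROUPS = [
    "embedding_1536",
    "gemini-3-flash-lb",
    "gemini-2.5-flash-lb",
    "jack-web-search",
    "jack-gemini-livechat",
]

unprotected = groups_without_fallback(MODEL_GROUPS, FALLBACKS)
print(unprotected)
```

Run in CI against the real config, a non-empty result would fail the build. One caveat: an embedding fallback must serve the same model and vector space (for example a second deployment of the same model), otherwise its vectors are not comparable with the stored index.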

Root Cause Hypothesis

The error body is specific: type: 'insufficient_quota'. A $20M-funded startup isn't broke. So why this error?

Hypothesis A

Global Spend Cap Hit

OpenAI dashboard has a configured monthly spend limit. Embedding call volume blew through it.

Least likely — too blunt an instrument.

Hypothesis B

Per-Customer Project Quotas

Revenue model (10% of salary) requires capping per-customer costs. One customer exhausts their cap → their search breaks, everyone else is fine.

Highly likely. Identical error output.

Hypothesis C

Per-Customer Rate Shaping

LiteLLM throttles per-customer to stay under global TPM ceiling. One customer's burst eats headroom → OpenAI rejects subsequent calls.

Highly likely. Deliberate cost controls.

Most likely: B or C. All three converge on the same solutions: reduce call volume via cache, shape traffic via queue, monitor per-customer usage.

Priority 1

The Fix: Cache + Queue

The real pressure isn't 14M job postings on ingest (predictable batch load). It's the chat side: 230,000+ candidates generating real-time embedding calls every conversation turn. Each call counts against a budget.

1. Duplication is the Multiplier

Semantically identical queries across users ("remote Python London" vs "London remote dev role").
Same user refining across turns ("Python roles" → "Python remote" → "Python remote Manchester").
Candidate profiles potentially re-embedded per session instead of once at creation.
Conversational context re-embedded every turn in RAG flows.

2. Semantic Query Cache

Normalise + hash queries. Cache the embedding AND the top-K search results (TTL 5–10 mins). "Remote Python London" and "London remote Python role" hit the same cache entry.

LiteLLM has built-in redis-semantic / qdrant-semantic cache modes. This requires a vector-capable store (Redis Stack/Qdrant) but directly extends per-customer embedding budgets without raising caps.

Rule: Profile embeddings computed once at creation/update, stored in vector DB. Never per-session.
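The normalise-and-hash step could look roughly like the sketch below. It is an exact-match cache after normalisation, not a semantic one: word order, case and filler words are ignored, but synonyms ("dev" vs "Python") still need the semantic modes mentioned above. The stopword list, `cache_key`, and `EmbeddingCache` are all hypothetical names, and the in-memory dict stands in for Redis.

```python
import hashlib
import re
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical filler words with no search intent; tune against real query logs.
STOPWORDS = {"a", "an", "the", "in", "for", "role", "roles", "job", "jobs"}


def cache_key(query: str) -> str:
    """Lowercase, tokenise, drop filler words, sort, then hash."""
    tokens = re.findall(r"[a-z0-9+#]+", query.lower())
    meaningful = sorted(set(tokens) - STOPWORDS)
    return hashlib.sha256(" ".join(meaningful).encode("utf-8")).hexdigest()


class EmbeddingCache:
    """In-memory TTL cache keyed by the normalised query."""

    def __init__(
        self,
        embed_fn: Callable[[str], List[float]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed_fn = embed_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[float]]] = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, query: str) -> List[float]:
        key = cache_key(query)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        self.misses += 1
        vector = self._embed_fn(query)  # the only place budget is spent
        self._store[key] = (now, vector)
        return vector
```

With this key function, "Remote Python London" and "London remote Python role" resolve to the same entry, so the second query costs nothing. The same key could also index the cached top-K results.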

3. Lane Separation via Queue

Batch traffic (job ingest, CV parses, profile updates) goes through an async queue (Redis Streams or BullMQ) so it never competes with real-time chat queries for the same budget/rate ceiling.

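For the async lane, use the OpenAI Batch API. It is a small integration change rather than a re-architecture, and it yields a 50% cost reduction and a separate, higher rate-limit allowance. The trade-off is asynchronous completion (within a 24-hour window), which suits ingest but not chat.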
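As a rough illustration of the async lane, the sketch below turns queued ingest items into the JSONL request lines the Batch API expects for the embeddings endpoint. `build_batch_lines` and the item IDs are hypothetical, and the model name is an assumption (the trace only reveals a 1536-dimension group, not the model behind it).

```python
import json
from typing import Iterable, List, Tuple


def build_batch_lines(
    items: Iterable[Tuple[str, str]],
    model: str = "text-embedding-3-small",  # assumption: not confirmed by the trace
) -> List[str]:
    """Serialise (custom_id, text) pairs as Batch API request lines."""
    lines = []
    for custom_id, text in items:
        request = {
            "custom_id": custom_id,  # echoed back so results can be matched up
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


# Hypothetical ingest items drained from the batch queue.
pending = [
    ("job-1", "Senior Python engineer, remote, London"),
    ("job-2", "Data analyst, hybrid, Manchester"),
]
jsonl_payload = "\n".join(build_batch_lines(pending))
print(jsonl_payload)
```

The resulting file would then be uploaded with purpose `batch` and submitted as a batch job against `/v1/embeddings` with a `24h` completion window; that network step is left out here so the sketch stays runnable offline.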

⏱️ Timeline: 1 sprint (1–2 weeks)

Priority 2

The Safety Net: Circuit Breaker + BM25

When the embedding API still fails (outage, unexpected spike beyond queue capacity), the user experience shouldn't break.

The Fixed UX

User Searches

→

Embedding Fails Silently

→

BM25 Kicks In

→

Results Appear

• Circuit breaker trips: Search falls back to BM25/keyword index. Results are less precise, but the user gets something. Zero downtime, no stack trace.
• State machine: closed → open (on N failures) → half-open (probe recovery).
• Low overhead: BM25 index synced alongside vector store.
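The state machine above can be sketched in a few lines. This is a minimal single-process illustration under stated assumptions: `CircuitBreaker`, `search`, and the threshold and timeout values are hypothetical, and `vector_search` and `keyword_search` stand in for the real embedding-backed and BM25-backed search functions.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """closed -> open after N consecutive failures -> half-open after a cool-off."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"  # cool-off elapsed: allow a probe call
        return "open"

    def allow(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._opened_at = self._clock()  # probe failed: reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def search(
    query: str,
    breaker: CircuitBreaker,
    vector_search: Callable[[str], List[str]],
    keyword_search: Callable[[str], List[str]],
) -> List[str]:
    """Try semantic search; fall back to keyword search on failure or open breaker."""
    if breaker.allow():
        try:
            results = vector_search(query)
            breaker.record_success()
            return results
        except Exception:
            breaker.record_failure()
    return keyword_search(query)
```

While the breaker is open, the embedding API is not called at all, so a quota failure costs the user no extra latency. A production version would share breaker state across workers (for example in Redis) and limit half-open probes to one in flight.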

⏱️ Timeline: 2–3 days (bolted onto queue work)

Hardening: Quick Wins (Priority 1)

Error Sanitisation

Current state: raw LiteLLM exception → chat. Topology exposed (model group names, provider identifiers, fallback chains). This is a security risk for targeted rate limit abuse.

Fix: Middleware at API boundary

↳ Internal: full stack trace → structured log
↳ User: "Search temporarily unavailable."
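A framework-agnostic sketch of that boundary is shown below. `sanitise_error`, the logger name, and the `incident_id` field are hypothetical; the incident ID is an optional extra that lets support correlate a user report with the internal log line without exposing anything about the gateway.

```python
import logging
import uuid
from typing import Dict

logger = logging.getLogger("search.errors")  # hypothetical logger name

USER_MESSAGE = "Search temporarily unavailable."


def sanitise_error(exc: BaseException) -> Dict[str, str]:
    """Log the full exception internally and return a safe payload for the client."""
    incident_id = uuid.uuid4().hex[:12]
    # Full message and traceback go to the structured log only.
    logger.error("search failed incident_id=%s", incident_id, exc_info=exc)
    # Nothing derived from the exception text is returned to the user.
    return {"error": USER_MESSAGE, "incident_id": incident_id}


try:
    raise RuntimeError(
        "No fallback model group found for original model_group=embedding_1536"
    )
except RuntimeError as err:
    payload = sanitise_error(err)

print(payload["error"])
```

In a real service this function would sit in the API layer's exception handler, so every unhandled gateway error passes through it before a response is written.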

⏱️ Timeline: 1 day. Ship this first.

Retry Policy

Current behaviour: LiteLLM retried insufficient_quota twice. It will never succeed on retry. The user waits through doomed retries, and the error block appears twice.

Fix: Check error type field before retry

✕ insufficient_quota: Do not retry → trip breaker.
↻ rate_limit_exceeded: Retry w/ exponential backoff.
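The rule can be expressed as a small wrapper. This is a sketch under stated assumptions rather than LiteLLM configuration: `ProviderError` is a hypothetical stand-in for whatever exception the gateway raises, assumed to carry the provider's error type, and the delay values are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Hypothetical stand-in for a gateway exception carrying the error type."""

    def __init__(self, error_type: str) -> None:
        super().__init__(error_type)
        self.error_type = error_type


# Only transient errors are worth retrying; insufficient_quota is not listed.
RETRYABLE = {"rate_limit_exceeded"}


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient errors with exponential backoff plus jitter; fail fast otherwise."""
    attempt = 0
    while True:
        try:
            return fn()
        except ProviderError as err:
            if err.error_type not in RETRYABLE or attempt >= max_retries:
                raise  # caller records the failure on the circuit breaker
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
            attempt += 1
```

A quota error now surfaces after a single attempt, so the user is not kept waiting on retries that cannot succeed.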

⏱️ Timeline: Hours. Ship alongside sanitisation.

Priority 3

Belt & Braces: Monitoring

If you're on Langfuse (and you likely are), you already have evaluation infrastructure detecting low match quality and stale embeddings. Currently, these signals are siloed. Let's make them operational.

The Killer Addition: Per-Customer Dashboards

Track each customer's embedding budget consumption in real time. Alert at 80% threshold → proactive cap increase or cache tuning before hard failure. Correlate customer embedding spend with revenue (10% of hire salary) for unit economics visibility.
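The 80% rule reduces to a small utilisation check. The sketch below is illustrative only: `BudgetStatus`, `customers_to_alert`, and the customer IDs and figures are hypothetical, and real spend would come from the gateway's usage logs rather than hard-coded values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class BudgetStatus:
    customer_id: str
    spent_usd: float
    cap_usd: float

    @property
    def utilisation(self) -> float:
        # Treat a zero or missing cap as fully consumed so it is always flagged.
        return self.spent_usd / self.cap_usd if self.cap_usd > 0 else 1.0


def customers_to_alert(
    statuses: Iterable[BudgetStatus], threshold: float = 0.8
) -> List[str]:
    """Return customers at or above the alert threshold."""
    return [s.customer_id for s in statuses if s.utilisation >= threshold]


# Hypothetical snapshot of per-customer embedding spend.
snapshot = [
    BudgetStatus("acme", spent_usd=85.0, cap_usd=100.0),
    BudgetStatus("globex", spent_usd=20.0, cap_usd=100.0),
    BudgetStatus("initech", spent_usd=5.0, cap_usd=0.0),
]
flagged = customers_to_alert(snapshot)
print(flagged)
```

The flagged list is what would feed an alert rule or trigger cache warming for those customers before the hard cap is reached.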

Loki

Structured log aggregation. All LiteLLM callbacks, eval outputs, 429 events, queue depth.

Tempo

Distributed tracing. Trace query → embed → vector search → rerank. Pinpoint failure location.

Grafana

Single pane: embedding health, cache hits, search latency P99. Alerts at 80% quota.

Evaluation Feedback Loops (Closing the loop):

Low match score → trigger re-embedding via async queue.
Quota approaching limit → proactive cache warming for that customer.

⏱️ Timeline: Weekend deployment (OSS & self-hostable).

Implementation Roadmap

Priority	What	Effort	Impact
P1	Error sanitisation middleware	1 day	Stops topology leak immediately
P1	Retry policy: distinguish 429s	Hours	Stops wasting latency on doomed retries
P1	Embedding cache + message queue	1–2 weeks	Eliminates quota exhaustion at source
P2	Circuit breaker + BM25 fallback	2–3 days	Graceful degradation when all else fails
P2	OpenAI Batch API for async	Hours (config)	50% cost reduction + higher throughput
P3	Loki + Tempo + Grafana observability	1 weekend	Pre-emptive alerting, budget tracking
P3	Langfuse eval → infra loops	2–3 days	Evals drive infrastructure behaviour

Total: ~3 weeks from zero to full resilience stack.
P1 alone (error sanitisation + retry policy + cache/queue) fixes the user-facing problem in about 2 weeks.

How I Know This

I diagnosed your architecture from a user-facing error dump. I've built this exact observability stack and handled these failure modes.

evalui (ui.davidcockson.com)

Dual-backend trace comparison, Langfuse + Phoenix integration, fully OTel instrumented.

Homelab Operations

Loki + Tempo + Grafana running in Docker on Contabo, behind Cloudflare Access. (Grafana community dashboard with 100+ downloads).

Project Immunity

Engineered a multi-model ensemble system with deterministic rule-checks, using the exact same "belt and braces" resilience thinking.

8 Years Governance

Building incident response and governance frameworks for regulated systems that simply cannot fail silently.

Let's Talk.

Happy to walk through this live or build a proof of concept.

davidcockson.com ui.davidcockson.com