- React 18
- TypeScript
- Vite
- SSE streaming
A production self-hosted control plane for a personal AI workflow: a FastAPI back end and React front end in front of a home-lab GPU box, with a filesystem-backed job queue, hybrid RAG retrieval over a private knowledge vault, and end-to-end observability in Grafana Cloud, all delivered as version-controlled infrastructure.
A private control surface for running LLM jobs against models that live on hardware I own, exposed safely to the public internet, and observable end-to-end. Built deliberately to exercise the full lifecycle: infrastructure, application, UI, observability, and operations.
control.davidcockson.com is the front door to a small but complete LLM platform. The browser talks to a FastAPI service running on a Contabo VPS; submitted prompts become jobs on a filesystem-backed queue, are executed by a separate worker process, and stream responses back over Server-Sent Events. Chat history, job metadata, and per-job token usage are persisted to SQLite.
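To make the streaming leg concrete, here is a minimal sketch of how token events could be framed as Server-Sent Events. The event names (`token`, `done`) and the payload fields are assumptions for illustration; the text above does not specify the actual event schema.

```python
import json
from typing import Iterable, Iterator


def sse_frame(event: str, data: dict) -> str:
    """Encode one SSE frame: an event name, one data line, then a blank line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_job(tokens: Iterable[str], job_id: str) -> Iterator[str]:
    """Yield one frame per token, then a terminal frame carrying the token count.

    The event names and fields here are hypothetical, not the platform's schema.
    """
    count = 0
    for tok in tokens:
        count += 1
        yield sse_frame("token", {"job": job_id, "text": tok})
    yield sse_frame("done", {"job": job_id, "tokens": count})
```

In a FastAPI handler, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, and the final count is the sort of figure that could feed the per-job token usage record.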
Heavy inference is offloaded to Davas, a home-lab GPU box reached over a Tailscale-secured tunnel — keeping latency low and GPU costs at zero, while public access is mediated by a Cloudflare Tunnel so nothing inbound is ever exposed to the open internet.
The platform also includes an MCP tool server (search, fetch, vault, sandboxed code) and a hybrid RAG service that combines Qdrant vector search with a hand-rolled Neo4j knowledge graph. Agent Skills, small user-authored packages, can be selected per job to shape the model's behavior.
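How the vector and graph results are merged is not described here. One common way to combine two ranked lists is reciprocal rank fusion, sketched below purely as an assumption about what "hybrid" could mean in practice. The document IDs and the constant `k = 60` are illustrative, and no Qdrant or Neo4j client calls are shown.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists standing in for a vector search and a graph traversal.
vector_hits = ["a", "b", "c"]
graph_hits = ["b", "d"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```

A document that both retrievers return (here `b`) rises to the top even if neither ranked it first, which is the usual reason to fuse by rank rather than by raw score.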
Every service is declared in docker-compose, every
secret resolved through Infisical at run-time, every host
bootstrapped by Ansible, and every byte of state-bearing
infrastructure tracked in Terraform with S3-backed state. The
whole platform emits logs, metrics, and OpenTelemetry traces to
Grafana Cloud.
Constraint that shaped everything: built solo on a fixed budget — a single VPS plus existing home-lab hardware. No managed services where a self-hosted one would do. The result is a system small enough to fully understand, but production-shaped in every operational dimension.
A browser request crosses four trust boundaries before a token comes back: Cloudflare's edge, the VPS, a Tailscale tunnel, and finally the home-lab GPU. Each hop adds observability; none adds an open port.
1. Edge. The browser hits a Cloudflare-managed hostname. Cloudflare handles TLS termination, DDoS mitigation, and rate limiting; cloudflared opens an outbound-only tunnel from the VPS, so there is no inbound port on the host.
2. Web tier. A FastAPI process serves the React SPA and exposes the JSON + SSE API. A request that needs work is written as a JSON job file to _queue/; the SSE endpoint then streams events as the worker progresses.
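3. Worker. A separate process moves jobs through _queue → _active → _completed / _failed with a single shutil.move call per transition, never copy + delete. Each move is an atomic rename as long as the queue directories share a filesystem; across filesystems shutil.move falls back to copy-then-delete. A crash leaves a half-done job in _active/, which the worker requeues on startup.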
4. Inference. The worker calls Davas over Tailscale for local models, or a cloud provider for frontier models. Either way, the token stream is forwarded straight through to the browser SSE.
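The worker's state machine can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the actual worker: the function names are hypothetical, jobs are assumed to be `*.json` files, and all four directories are assumed to live on one filesystem so that each `shutil.move` resolves to an atomic rename.

```python
import shutil
from pathlib import Path
from typing import Optional

STATES = ("_queue", "_active", "_completed", "_failed")


def ensure_dirs(root: Path) -> None:
    """Create the four state directories under one root (one filesystem)."""
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def claim_next(root: Path) -> Optional[Path]:
    """Move the oldest-named job from _queue to _active; None if the queue is empty."""
    for job in sorted((root / "_queue").glob("*.json")):
        target = root / "_active" / job.name
        try:
            shutil.move(str(job), str(target))  # a rename when on the same filesystem
        except FileNotFoundError:
            continue  # the file vanished between listing and moving
        return target
    return None


def finish(job: Path, root: Path, ok: bool) -> Path:
    """Move an active job to its terminal directory."""
    dest = root / ("_completed" if ok else "_failed") / job.name
    shutil.move(str(job), str(dest))
    return dest


def requeue_orphans(root: Path) -> int:
    """On startup, return anything left in _active (a crashed run) to _queue."""
    moved = 0
    for job in (root / "_active").glob("*.json"):
        shutil.move(str(job), str(root / "_queue" / job.name))
        moved += 1
    return moved
```

Because a job file only ever exists in one directory at a time, the directory name is the job's state, and recovery after a crash is a directory listing rather than a reconciliation step.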
After two UI iterations — a fast HTMX prototype, then a typed React rewrite — the interface settled on a calm phosphor-green aesthetic. JetBrains Mono throughout, three columns, keyboard-first.
Twelve containers behind a single tunnel. Stateful stores sit beside their consumers; Alloy is the only outbound agent for telemetry; all secrets are materialised at run-time from Infisical, never committed.
The original plan was P0 → P9. What shipped includes those, plus four unplanned phase blocks added in flight when the work demanded them — polish (QoL), user-defined skills, and two UI rewrites.
Engineering lesson worth keeping: the unplanned phases all came from using the system, not designing it. Throwing away two UI iterations was cheaper than guessing right the first time — because the queue, worker, and SSE contract underneath stayed unchanged through every rewrite.
Deliberate choices: typed everywhere, declarative where possible, no managed-service lock-in. Everything in this list is either open-source or has a self-hosted equivalent.
The work was run as a tightly-scoped, spec-driven, AI-augmented engineering loop. Every change was specified, sized, executed, tested, and logged before the session closed. The numbers below are pulled directly from the repository and from the Claude Code session logs.
Tooling. Built entirely with Claude Code as the
driver, running primarily against claude-sonnet-4-6
for implementation work, with claude-opus-4-7 reserved
for architecture decisions, complex debugging, and any session that
touched the cutover-sensitive path.
Process. Three durable artefacts drove the entire ten-day
build: SPEC.md as the contract, tasks.md
as the work breakdown (≈120 atomic tasks across 14 phases), and an
append-only RUN-SUMMARY.md as the cross-session
ledger — every session ended with a log entry covering tasks
worked, outcome, what changed, what surfaced, and what the next
session should pick up. This made resuming mid-phase from either
machine trivial.
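As an illustration of the ledger discipline, the sketch below appends one entry per session and refuses entries that skip a field. The field keys and the Markdown layout are assumptions; only the five topics (tasks worked, outcome, what changed, what surfaced, next pickup) come from the description above.

```python
from datetime import date
from pathlib import Path

# Hypothetical keys for the five topics each session entry covers.
FIELDS = ("tasks", "outcome", "changed", "surfaced", "next")


def append_entry(ledger: Path, day: date, entry: dict[str, str]) -> None:
    """Append one session entry to the ledger; existing content is never rewritten."""
    missing = [f for f in FIELDS if f not in entry]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    lines = [f"## {day.isoformat()}"] + [f"- {f}: {entry[f]}" for f in FIELDS]
    # Mode "a" only ever writes at the end of the file.
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n\n")
```

Rejecting incomplete entries is one way to keep the "what should the next session pick up" field from quietly going missing.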
Quality gate. A 10-stage make pre-push pipeline ran on every push, with stages including lint, type-check (mypy strict), unit + integration tests, Docker build, and a container health smoke test. A push only landed when all ten stages were green, typically in 100-200 seconds.
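The real gate is a make target. The sketch below only illustrates its fail-fast shape in Python: stages run in order and the first non-zero exit stops the run. The stage names and commands in `EXAMPLE_STAGES` are hypothetical stand-ins, not the project's actual Makefile recipes.

```python
import subprocess
import sys
import time
from typing import Sequence

Stage = tuple[str, Sequence[str]]

# Hypothetical stages; the real pipeline has ten and is driven by make.
EXAMPLE_STAGES: list[Stage] = [
    ("type-check", [sys.executable, "-m", "mypy", "--strict", "."]),
    ("tests", [sys.executable, "-m", "pytest", "-q"]),
]


def run_gate(stages: Sequence[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order; return (all_green, names of stages that passed)."""
    passed: list[str] = []
    for name, cmd in stages:
        started = time.monotonic()
        result = subprocess.run(cmd)
        elapsed = time.monotonic() - started
        print(f"{name}: exit {result.returncode} in {elapsed:.1f}s")
        if result.returncode != 0:
            return False, passed  # fail fast: later stages never run
        passed.append(name)
    return True, passed
```

Calling `run_gate(EXAMPLE_STAGES)` from a Git pre-push hook and exiting non-zero on failure would block the push, which is the behavior the gate describes.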
Discipline. No commit without a passing pre-push gate. No phase ticked complete without its test coverage. No deferred work without an explicit follow-up task. Roughly half of the eventual scope was discovered during execution — the unplanned QoL, Skills, and UI phases — and was absorbed through the same loop, not by changing the loop.