Case Study · Self-hosted Platform

LLM job runner & chat platform · control.davidcockson.com

A production self-hosted control plane for a personal AI workflow: a FastAPI + React front end sitting in front of a home-lab GPU box, with a filesystem-backed job queue, hybrid RAG retrieval over a private knowledge vault, and end-to-end observability in Grafana Cloud, all delivered as version-controlled infrastructure.

● Live · 10 days end-to-end · 14 phases delivered · 110+ tracked tasks · 2 full UI rewrites · solo

Delivery · 10 days · spec → live production
Stack depth · 12 services · across 2 hosts + cloud
UI rewrites · 2× · HTMX → TUI → React/TS
Infra as code · 100% · Terraform + Ansible

01 · Overview

What it is, and why I built it

A private control surface for running LLM jobs against models that live on hardware I own, exposed safely to the public internet, and observable end-to-end. Built deliberately to exercise the full lifecycle: infrastructure, application, UI, observability, and operations.

control.davidcockson.com is the front door to a small but complete LLM platform. The browser talks to a FastAPI service running on a Contabo VPS; submitted prompts become jobs on a filesystem-backed queue, are executed by a separate worker process, and stream responses back over Server-Sent Events. Chat history, job metadata, and per-job token usage are persisted to SQLite.

Heavy inference is offloaded to Davas, a home-lab GPU box reached over a Tailscale-secured tunnel — keeping latency low and GPU costs at zero, while public access is mediated by a Cloudflare Tunnel so nothing inbound is ever exposed to the open internet.

The platform also includes an MCP tool server (search, fetch, vault, sandboxed code) and a hybrid RAG service combining Qdrant vector search with a hand-rolled Neo4j knowledge graph. Agent Skills, small user-authored packages, can be selected per job to shape the model's behavior.
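
The text doesn't say how Qdrant hits and knowledge-graph hits are merged. One widely used way to fuse two ranked lists is reciprocal rank fusion (RRF); the sketch below shows that technique, assuming each retriever returns document IDs best-first. It illustrates RRF in general and is not necessarily what the platform's LangGraph pipeline does.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several best-first rankings of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1; k=60 is the commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

RRF uses only ranks, not raw scores, so it sidesteps the fact that vector similarities and graph-query relevance signals live on incomparable scales.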

Every service is declared in docker-compose, every secret resolved through Infisical at run-time, every host bootstrapped by Ansible, and every byte of state-bearing infrastructure tracked in Terraform with S3-backed state. The whole platform emits logs, metrics, and OpenTelemetry traces to Grafana Cloud.

Constraint that shaped everything: built solo on a fixed budget — a single VPS plus existing home-lab hardware. No managed services where a self-hosted one would do. The result is a system small enough to fully understand, but production-shaped in every operational dimension.

02 · Project Flow

How a request travels through the system

A browser request crosses four trust boundaries before a token comes back: Cloudflare's edge, the VPS, a Tailscale tunnel, and finally the home-lab GPU. Each hop adds observability; none adds an open port.

Figure 1. End-to-end request flow, no open inbound ports: Browser (React + SSE) → HTTPS → Cloudflare Edge → cloudflared tunnel (TLS) → Contabo VPS · Docker (web · FastAPI :8000 · SSE · auth; worker · queue executor; filesystem queue; rag · mcp; qdrant · neo4j · tools) → Tailscale private tunnel → Davas · Ollama (home-lab GPU · models).

1. Edge. The browser hits a Cloudflare-managed hostname. Cloudflare handles TLS termination, DDoS protection, and rate limiting; cloudflared opens an outbound-only tunnel from the VPS, so the host exposes no inbound port.

2. Web tier. A FastAPI process serves the React SPA and exposes the JSON + SSE API. Requests that create work write a JSON job file into _queue/, and an SSE stream pushes events back to the browser as the worker progresses.

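3. Worker. A separate process moves each job through _queue → _active → _completed / _failed with a single shutil.move call per transition, never copy + delete. That move is atomic only when source and destination share a filesystem (shutil.move then reduces to a plain rename), so the queue directories must live on one volume. A crash leaves a half-done job in _active/, which the worker requeues on startup.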

4. Inference. The worker calls Davas over Tailscale for local models, or a cloud provider for frontier models. Either way, the token stream is forwarded straight through to the browser over SSE.

03 · Interface

The terminal-inspired UI

After two rewrites, from an HTMX prototype to a terminal-style TUI redesign and then to a typed React port, the interface settled on a calm phosphor-green aesthetic: JetBrains Mono throughout, three columns, keyboard-first.

Screenshot · control.davidcockson.com/chat
Chat view: session list (left), streaming chat (center), and an inspector with model, tokens, and job state (right). SSE keeps tokens flowing as the worker yields them.

Screenshot · control.davidcockson.com/job/…
Job detail: Markdown-rendered output with metadata, token usage, and skill provenance. Every job is addressable, replayable, and diffable against sibling runs.

04 · Architecture

Services and data flow inside the VPS

Twelve containers behind a single tunnel. Stateful stores sit beside their consumers; Alloy is the only outbound agent for telemetry; all secrets are materialised at run-time from Infisical, never committed.

Figure 2. Runtime topology: services, stores, and external dependencies (Docker Compose · 12 services)
Contabo VPS · Docker Compose
  • cloudflared · tunnel · TLS
  • web · FastAPI :8000 · React · SSE
  • mcp · FastMCP :8001 · search · fetch · code
  • vault · Obsidian · markdown
  • worker · poller · executor
  • rag · LangGraph :8002 · hybrid retrieve
  • queue · filesystem · _queue · _active · _completed · _failed
  • qdrant · :6333 · vector store
  • neo4j · knowledge graph · :7687 · entities · facts
  • sqlite · chat history · stats
  • infisical agent · runtime secret resolution
  • alloy · logs · metrics · traces
  • reindexer · nightly cron
  • syncthing · vault propagation
External · Cloud
  • Davas · Ollama · home-lab GPU
  • Grafana Cloud · Loki · Mimir · Tempo + 4 alerts
  • Discord · job · disk · Davas alerts
  • Infisical (EU) · homelab-rebuild · prod
  • AWS S3 · us-east-1 · tfstate · vault snapshots · artifacts
05 · Delivery

Plan vs reality — ten days, fourteen phases

The original plan was P0 → P9. What shipped includes those, plus four unplanned phase blocks added in flight when the work demanded them — polish (QoL), user-defined skills, and two UI rewrites.

P0 · May 08 · 2d
  • Planned: Infisical, machine identities, AWS + S3, GitLab repos with baseline CI.
  • Delivered: all 6 sub-tasks done; a reconciliation block bridged spec vs. live infra.
P1 · May 10 · 1d
  • Planned: Terraform S3 backend; Contabo + home-lab bootstrapped via Ansible.
  • Delivered: all 6 done in one push; Contabo firewall captured as an Ansible role.
P2 · May 11 → 14
  • Planned: Alloy on both hosts → Grafana Cloud; dashboards + Discord alerts.
  • Delivered: pipeline up; 4 dashboards + 4 alerts (spec called for 3).
P3 · May 10 → 11
  • Planned: settings, providers, OTel tracing, docker-compose, infisical-run.
  • Delivered: all 7 done; default model corrected mid-flight.
P4 · May 11 · 1d
  • Planned: atomic queue poller, executor, handlers, Discord, OTel spans.
  • Delivered: all 6 done; executor later refactored to own notifications cleanly.
P5 · May 11 · 1d
  • Planned: FastAPI factory, /health, SSE chat, HTMX UI, Cloudflare tunnel swap.
  • Delivered: UI live behind control.davidcockson.com, later rebuilt twice.
P6 · May 11 → 14
  • Planned: FastMCP server; tools: search, fetch, vault, sandboxed code-exec.
  • Delivered: all 6 done; a health-route bug caught and fixed.
P7 · May 13 → 14
  • Planned: Qdrant + graphiti KG over Neo4j; LangGraph hybrid pipeline.
  • Delivered: graphiti dropped after 3 cloud failures → hand-rolled extractor.
P8 · May 14
  • Planned: token tracker + briefing / research / memory agents.
  • Delivered: 3 of 4 held (no concrete trigger); only the token tracker shipped.
P9 · May 15 · 1d
  • Planned: vault snapshot, decommission old runner, runbook, security audit.
  • Delivered: cutover done; old runner archived; security audit passed.
+QoL · May 13 → 14 · added in flight
  • Delivered: 18 polish tasks: Discord regression, deploy wrapper, secrets manifest, RAG context-only mode.
+Skills · May 14 · added in flight
  • Delivered: Agent Skills, user-authored skill packages selectable per job and live-reloaded.
+UI-1 · May 14 → 15 · added in flight
  • Delivered: terminal/TUI redesign with JetBrains Mono, a 3-pane grid, and a keyboard layer.
+UI-2 · May 16 → 17 · added in flight
  • Delivered: full React port (Vite + TS, virtualised queue); ~3,000 LOC of old UI deleted.
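
The +Skills phase mentions skills that are selectable per job and live-reloaded. A minimal sketch of that idea follows, with every detail assumed rather than taken from the text: each skill is a directory holding a SKILL.md file, live reload means re-reading the file when its modification time or size changes, and selected skills are appended to a base system prompt. `SkillRegistry` and `compose_system_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Skill:
    name: str
    body: str


class SkillRegistry:
    """Loads skills from <root>/<name>/SKILL.md and reloads any whose file changed."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._cache: dict[str, tuple[tuple[int, int], Skill]] = {}

    def get(self, name: str) -> Skill:
        path = self.root / name / "SKILL.md"
        st = path.stat()  # raises FileNotFoundError for an unknown skill
        stamp = (st.st_mtime_ns, st.st_size)
        cached = self._cache.get(name)
        if cached is None or cached[0] != stamp:
            # New or changed since the last read: reload from disk.
            self._cache[name] = (stamp, Skill(name, path.read_text(encoding="utf-8")))
        return self._cache[name][1]

    def compose_system_prompt(self, base: str, selected: list[str]) -> str:
        """Append each selected skill's instructions to a base system prompt."""
        return "\n\n".join([base] + [self.get(name).body for name in selected])
```

Checking `stat()` at lookup time keeps the sketch dependency-free; a file-watcher library would push changes instead, at the cost of another dependency.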

Engineering lesson worth keeping: the unplanned phases all came from using the system, not designing it. Throwing away two UI iterations was cheaper than guessing right the first time — because the queue, worker, and SSE contract underneath stayed unchanged through every rewrite.

06 · Stack

Technologies, top to bottom

Deliberate choices: typed everywhere, declarative where possible, no managed-service lock-in. Everything in this list is either open-source or has a self-hosted equivalent.

Front-end
  • React 18
  • TypeScript
  • Vite
  • SSE streaming
Back-end
  • Python 3.12
  • FastAPI
  • FastMCP
  • LangGraph
Data & AI
  • Ollama (self-hosted)
  • Qdrant (vectors)
  • Neo4j (knowledge graph)
  • SQLite
Infrastructure
  • Terraform
  • Ansible
  • Docker Compose
  • AWS S3 (tfstate)
Networking
  • Cloudflare Tunnel
  • Tailscale
  • iptables-nft
Secrets
  • Infisical (EU)
  • Machine identities
  • Runtime resolution
Observability
  • Grafana Alloy
  • Loki · Mimir · Tempo
  • OpenTelemetry
  • Discord alerts
Workflow
  • GitLab CI
  • Syncthing
  • Obsidian vault
  • Claude Code
07 · Methodology

How it was actually built

The work was run as a tightly scoped, spec-driven, AI-augmented engineering loop. Every change was specified, sized, executed, tested, and logged before the session closed. The numbers below are pulled directly from the repository and from the Claude Code session logs.

Tooling. Built entirely with Claude Code as the driver, running primarily against claude-sonnet-4-6 for implementation work, with claude-opus-4-7 reserved for architecture decisions, complex debugging, and any session that touched the cutover-sensitive path.

Process. Three durable artefacts drove the entire ten-day build: SPEC.md as the contract, tasks.md as the work breakdown (≈120 atomic tasks across 14 phases), and an append-only RUN-SUMMARY.md as the cross-session ledger — every session ended with a log entry covering tasks worked, outcome, what changed, what surfaced, and what the next session should pick up. This made resuming mid-phase from either machine trivial.

Quality gate. A 10-stage make pre-push pipeline ran on every push; its stages included lint, type-check (mypy strict), unit and integration tests, a Docker build, and a container health smoke test. A push landed only when all ten stages were green, typically within 100–200 seconds.
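
The gate itself is a make target; the Python sketch below models the same fail-fast behavior purely for illustration: stages run in order, the first non-zero exit stops the gate, and the total time is reported. The stage commands (ruff, pytest, the test paths, the image tag) are assumptions, and the example list is deliberately shorter than the real ten stages.

```python
import subprocess
import sys
import time
from collections.abc import Sequence

Stage = tuple[str, Sequence[str]]


def run_gate(stages: Sequence[Stage]) -> bool:
    """Run stages in order, stopping at the first failure as a make chain would."""
    start = time.monotonic()
    for name, argv in stages:
        result = subprocess.run(list(argv))
        if result.returncode != 0:
            print(f"FAIL {name} (exit {result.returncode})", file=sys.stderr)
            return False
        print(f"ok   {name}")
    print(f"all {len(stages)} stages green in {time.monotonic() - start:.0f}s")
    return True


# Illustrative stages only (assumed commands); the real gate has ten.
EXAMPLE_STAGES: list[Stage] = [
    ("lint", ["ruff", "check", "."]),
    ("type-check", ["mypy", "--strict", "src"]),
    ("unit tests", ["pytest", "tests/unit"]),
    ("integration tests", ["pytest", "tests/integration"]),
    ("docker build", ["docker", "build", "-t", "control-web:prepush", "."]),
]
```

Ordering cheap stages first means most failing pushes are rejected in seconds, before the slower Docker build and health smoke stages run.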

Discipline. No commit without a passing pre-push gate. No phase ticked complete without its test coverage. No deferred work without an explicit follow-up task. Roughly half of the eventual scope was discovered during execution — the unplanned QoL, Skills, and UI phases — and was absorbed through the same loop, not by changing the loop.

Claude sessions · 107 · across 9 working days
Assistant turns · 10k+ · 10,031 across Sonnet + Opus
Tokens processed · 727M · in + out + cache, all sessions
Run-log entries · 119 · at least one per session, append-only
Models used
  • Sonnet 4.6 · 62.5%
  • Opus 4.7 · 37.5%
Source artefacts
  • SPEC.md · 512 lines
  • tasks.md · 453 lines
  • RUN-SUMMARY.md · ~210 KB
Test surface
  • 518 tests at cutover
  • 10-stage pre-push gate
  • mypy strict type checking
Build & deploy
  • Docker multi-stage builds
  • Container health smoke test
  • One-command deploy.sh
  • Atomic git commits