One day, four stages, one job: prove I can drop into an unscoped problem and ship something a domain expert actually uses. PortSwigger is going all-in on AI, and the Pioneer is the roving, tribe-agnostic builder who makes that real. This is the plan — logistics, panel, culture, and the live build.
Booths Park isn't one office block — it's parkland with historic and modern buildings. Don't let the scale eat the clock.
Call one was contracts, call two was recruitment — and the final task arrived with no brief at all. That's the test: the role is roving and tribe-agnostic, so they'll almost certainly hand me a surprise problem I've never seen. So I don't rehearse a domain. I bring a method that survives any domain — a repeatable, hyper-accelerated cycle I can drop onto whatever they put in front of me.
I don't walk in empty-handed. I bring a working library of agent flows mapped to the Double Diamond and an EARS spec gate — the exact method below, already running.
System access, historical logs, tickets, previous work. Define the problem scope and identify the historical ground truth: what "good" already looked like.
Curate historical data without hand-creating it. Build an LLM-as-a-Judge straight from EARS requirements, seeding a 100-pair gold dataset from history. Gate = spec.ears.md.
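A minimal sketch of how the EARS lines could become a judge rubric. It assumes spec.ears.md holds one requirement per line in the standard EARS templates (ubiquitous, event-driven, state-driven, optional-feature, unwanted-behaviour). `Requirement`, `parse_ears` and `build_rubric` are hypothetical names for illustration, not the toolkit's actual code, and the rubric wording is my own assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Standard EARS templates, checked in order. Turning them into rubric
# lines is an assumption of this sketch, not a published convention.
_TAIL = r",\s*the\s+(?P<system>.+?)\s+shall\s+(?P<response>.+?)\.?$"
EARS_PATTERNS = [
    ("unwanted", re.compile(r"^If\s+(?P<cond>.+?),\s*then\s+the\s+(?P<system>.+?)"
                            r"\s+shall\s+(?P<response>.+?)\.?$", re.I)),
    ("event", re.compile(r"^When\s+(?P<cond>.+?)" + _TAIL, re.I)),
    ("state", re.compile(r"^While\s+(?P<cond>.+?)" + _TAIL, re.I)),
    ("optional", re.compile(r"^Where\s+(?P<cond>.+?)" + _TAIL, re.I)),
    ("ubiquitous", re.compile(r"^The\s+(?P<system>.+?)\s+shall\s+(?P<response>.+?)\.?$", re.I)),
]


@dataclass
class Requirement:
    kind: str
    condition: str  # empty for ubiquitous requirements
    response: str


def parse_ears(line: str) -> Optional[Requirement]:
    """Return a Requirement if the line matches an EARS template, else None."""
    text = line.strip().lstrip("-* ").strip()  # tolerate markdown bullets
    for kind, pattern in EARS_PATTERNS:
        match = pattern.match(text)
        if match:
            cond = match.groupdict().get("cond") or ""
            return Requirement(kind, cond, match.group("response"))
    return None


def build_rubric(spec_text: str) -> str:
    """Turn every parseable EARS line into one numbered rubric criterion."""
    reqs = [r for r in map(parse_ears, spec_text.splitlines()) if r]
    lines = []
    for i, req in enumerate(reqs, start=1):
        scope = f"when: {req.condition}" if req.condition else "always"
        lines.append(f"{i}. [{req.kind}] ({scope}) output must: {req.response}")
    return "\n".join(lines)
```

The rubric text would then be pasted into the judge prompt, so every score traces back to a numbered requirement. Lines that don't fit an EARS template are skipped rather than guessed at.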
The domain expert audits only the hardest edge cases. Align the automated judge's scores to human judgment until agreement reaches ≥95%, then let the judge scale.
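The 95% gate can be sketched as a plain percent-agreement check. This assumes both the judge and the expert give a pass/fail verdict per item. The function names are hypothetical, and raw agreement is my assumption: a chance-corrected statistic could be swapped in if the labels are heavily skewed.

```python
from typing import List, Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items where the judge verdict equals the human verdict."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def disagreements(judge: Sequence[bool], human: Sequence[bool]) -> List[int]:
    """Indices to hand back to the domain expert for another look."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]


def judge_is_aligned(judge: Sequence[bool], human: Sequence[bool],
                     threshold: float = 0.95) -> bool:
    """True once agreement meets the threshold (0.95 mirrors the gate above)."""
    return agreement_rate(judge, human) >= threshold
```

The disagreement indices are the useful part: they tell the expert exactly which cases to re-audit before the judge prompt is revised.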
Run final regression checks against the seeded gold dataset, verify the ROI, and integrate the full pipeline into production, with monitoring, runbooks and handoff included.
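A rough sketch of the regression check, assuming the gold dataset is a list of (input, reference) pairs and that the pipeline and the judge are plain callables. `pass_rate`, `regression_ok` and the tolerance parameter are hypothetical, chosen only to show the shape of the gate.

```python
from typing import Callable, Sequence, Tuple

GoldPair = Tuple[str, str]  # (input, reference output)


def pass_rate(gold: Sequence[GoldPair],
              pipeline: Callable[[str], str],
              judge: Callable[[str, str, str], bool]) -> float:
    """Run the pipeline on every gold input and return the judged pass rate."""
    if not gold:
        raise ValueError("gold dataset is empty")
    passed = sum(
        1 for prompt, reference in gold
        if judge(prompt, reference, pipeline(prompt))
    )
    return passed / len(gold)


def regression_ok(candidate_rate: float, baseline_rate: float,
                  tolerance: float = 0.0) -> bool:
    """The candidate may not fall more than `tolerance` below the baseline."""
    return candidate_rate >= baseline_rate - tolerance
```

In practice the baseline rate would come from the last released version, so the check fails only when a change makes the gold set worse.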
| Layer | Method | Cadence | Cost · speed |
|---|---|---|---|
| Unit tests | Deterministic — regex, JSON, schema | Daily | Fast & free |
| Integrated evals | LLM-as-a-Judge — golden dataset | Every PR | Moderate |
| Human review | HITL — critique shadowing | Weekly / spot | Slow & costly |
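
The cheapest layer in the table can be sketched with the standard library alone. This assumes the model returns a JSON object with `verdict`, `score` and `reason` fields; that schema is hypothetical, since the real fields depend on whatever problem lands on the day.

```python
import json
import re
from typing import List

# Hypothetical output schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"verdict": str, "score": (int, float), "reason": str}
VERDICT_RE = re.compile(r"pass|fail")


def check_output(raw: str) -> List[str]:
    """Return a list of deterministic failures; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")

    verdict = data.get("verdict")
    if isinstance(verdict, str) and not VERDICT_RE.fullmatch(verdict):
        errors.append("verdict must be 'pass' or 'fail'")
    return errors
```

Checks like these run in milliseconds and cost nothing, which is why they sit on the daily cadence and filter out malformed output before the judge is ever called.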
Both must land an enthusiastic yes. Everything I say and do today is read against these two.