Multi-agent CI orchestration across the Provenance org
Each PR fans out to the right model for the change: thread-safety patterns route to Opus, everything else stays on Sonnet. Cursor and Kimi listen on the same dispatch surface, so swapping the underlying model is a one-line change. The case study has the architecture; the workflow files have the proof.
240p Test Suite — AI as primary author on a platform with no training data
If the agent pattern only worked on well-trodden stacks, this project would have failed. Instead it shipped — proof that orchestrated AI development generalizes when you put the right scaffolding around it (a libretro smoke-test CI, a Docker SDK image, and adversarial review).
AI tooling at Wayfair
App Infrastructure is the platform team that the rest of mobile builds on top of. AI tooling here means leverage for 60+ engineers, not novelty for one.
- Custom MCP servers wrapping internal tooling — auth-aware, audit-logged, rate-limited.
- Claude skills + plugins automating engineering workflows for the platform team (code review, scaffolding, refactor recipes).
- n8n workflows integrating Jira + GitHub for cross-team automation — issue triage, PR routing, release coordination.
- Datadog RUM instrumentation for performance work — bootup tracing, SwiftUI hot-path identification, MP4 caching strategy that shipped ~$2M/year in bandwidth savings.
- Supports 60+ mobile engineers from the App Infrastructure platform team — the audience is the rest of mobile, not just one product.
Enterprise work, no public links — referenced here as proof that these patterns apply at internal scale. Specifics under NDA.
The architecture I keep reaching for
Patterns that have survived contact with production, across multiple projects and agent stacks. Each one is in shipping code somewhere above; minimal sketches of a few follow the list.
- Reviewer-model auto-selection by static diff analysis. grep for thread-safety / concurrency patterns in the diff to route between Sonnet (default) and Opus (deep reasoning). Cheap, deterministic, observable.
- Shared dispatch vocabulary across heterogeneous agents. One set of action verbs — implement_issue, fix_ai_review, fix_rebase_conflict, ai_approved — that Claude, Cursor, and Kimi all speak. Swap the model, keep the orchestration.
- Anthropic-API-compatible model fallback. Kimi Code via an ANTHROPIC_BASE_URL override — the same agent runtime, a different upstream. Useful for cost ceilings, regional availability, and red-team comparisons.
- Scheduled poller bypass for GitHub bot-actor approval gates. GitHub won't let bot reviews count toward branch protection; a cron-driven poller re-projects bot signal as a status check the gate respects.
- Bounded review-cycle limits. Max 3 AI ↔ AI review cycles before a human is paged. Prevents runaway loops; makes "the agents are arguing" a measurable event.
- Diff-aware partial work. Translate only the strings that changed in this PR, not the whole localization file. Same for refactors, doc updates, test scaffolding — operate on the diff, not the universe.
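A minimal sketch of the diff-based reviewer routing, assuming the PR checkout is available to a CI step. The script name, the base ref, and the concurrency patterns are illustrative, not the exact production set.

```python
# route_model.py: pick a reviewer model from the PR diff (illustrative sketch).
import re
import subprocess
import sys

# Signals that suggest concurrency / thread-safety work (illustrative, not exhaustive).
DEEP_REASONING_PATTERNS = re.compile(
    r"DispatchQueue|NSLock|Mutex|atomic|Sendable|@MainActor|\bactor\b|synchronized"
)

def pick_model(base_ref: str = "origin/main") -> str:
    diff = subprocess.run(
        ["git", "diff", base_ref, "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Only inspect added/changed lines, not context or removals.
    added = "\n".join(l for l in diff.splitlines() if l.startswith("+"))
    return "opus" if DEEP_REASONING_PATTERNS.search(added) else "sonnet"

if __name__ == "__main__":
    # Emit the choice so a later workflow step can pass it to the agent.
    print(pick_model(sys.argv[1] if len(sys.argv) > 1 else "origin/main"))
```

Because the decision is a grep, not a model call, the routing is reproducible from the diff alone and shows up in the workflow logs.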
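The shared vocabulary can be exercised the same way no matter which agent sits behind the workflow. A sketch against the GitHub REST workflow_dispatch endpoint; the repo, the workflow file name, and an "action" input are assumptions for illustration.

```python
# dispatch.py: send one of the shared action verbs to an agent workflow (sketch).
import os
import requests

ACTIONS = {"implement_issue", "fix_ai_review", "fix_rebase_conflict", "ai_approved"}

def dispatch(repo: str, workflow: str, action: str, ref: str = "main", **inputs) -> None:
    if action not in ACTIONS:
        raise ValueError(f"unknown action verb: {action}")
    # workflow_dispatch is the same surface for every agent; only the target file differs.
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/actions/workflows/{workflow}/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": ref, "inputs": {"action": action, **inputs}},
        timeout=30,
    )
    resp.raise_for_status()

# e.g. dispatch("some-org/some-repo", "claude-agent.yml", "fix_ai_review", pr="123")
```

Swapping Claude for Cursor or Kimi is then a different workflow file behind the same verbs, which is what keeps the model change to one line.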
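A sketch of the scheduled poller that re-projects a bot approval as a status check, assuming a GITHUB_TOKEN with access to the repo; the bot login and the status context name are illustrative.

```python
# approval_poller.py: re-project a bot approval as a commit status (sketch).
# Branch protection ignores bot reviews, so a cron-driven job polls for them and
# surfaces the signal as a status check the protection rule can require.
import os
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
BOT_LOGIN = "ai-reviewer[bot]"  # illustrative bot account name

def project_bot_approval(repo: str, pr_number: int) -> None:
    reviews = requests.get(
        f"{API}/repos/{repo}/pulls/{pr_number}/reviews", headers=HEADERS, timeout=30
    ).json()
    head_sha = requests.get(
        f"{API}/repos/{repo}/pulls/{pr_number}", headers=HEADERS, timeout=30
    ).json()["head"]["sha"]
    approved = any(
        r["user"]["login"] == BOT_LOGIN and r["state"] == "APPROVED" for r in reviews
    )
    if approved:
        requests.post(
            f"{API}/repos/{repo}/statuses/{head_sha}",
            headers=HEADERS,
            json={"state": "success", "context": "ai-review/approved"},
            timeout=30,
        ).raise_for_status()
```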
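The review-cycle bound is just a counter with a hard ceiling, applied before each dispatch. A sketch, assuming the orchestrator can see the bot's prior reviews on the PR; the function name and return values are illustrative.

```python
# review_budget.py: bound AI ↔ AI review loops and make the loop count observable (sketch).
MAX_AI_REVIEW_CYCLES = 3  # beyond this, page a human instead of dispatching another agent

def next_action(bot_reviews: list[dict]) -> str:
    """bot_reviews: reviews the AI reviewer has already left on this PR."""
    cycles = sum(1 for r in bot_reviews if r["state"] == "CHANGES_REQUESTED")
    if cycles >= MAX_AI_REVIEW_CYCLES:
        return "escalate_to_human"  # "the agents are arguing" becomes a pageable event
    return "fix_ai_review"          # one more bounded round through the dispatch surface
```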
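For the localization case, diff-aware scoping means parsing the diff for added string entries and handing only those to the agent. A sketch under the assumption that strings live in Apple .strings files; the regex is simplified and ignores escaped quotes.

```python
# diff_scope.py: operate on the strings that changed in this PR, not the whole file (sketch).
import re
import subprocess

STRINGS_LINE = re.compile(r'^\+\s*"(?P<key>[^"]+)"\s*=\s*"(?P<value>[^"]*)";')  # .strings format

def changed_strings(base_ref: str = "origin/main") -> dict[str, str]:
    """Return only the localization entries added or edited on this branch."""
    diff = subprocess.run(
        ["git", "diff", base_ref, "--unified=0", "--", "*.strings"],
        capture_output=True, text=True, check=True,
    ).stdout
    changed = {}
    for line in diff.splitlines():
        m = STRINGS_LINE.match(line)
        if m:
            changed[m.group("key")] = m.group("value")
    return changed  # hand just these to the translation agent, not the whole catalog
```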
What's in the toolbox
Compact map of what's in production right now. Replaceable parts where it matters; opinionated where it should be.
- Claude Sonnet 4.6
- Claude Opus 4.6
- Claude Haiku
- Cursor
- Kimi Code
- Claude Code
- Cursor Agent
- Kimi Agent
- GitHub Actions
- workflow_dispatch
- Anthropic API
- MCP
- n8n
- Datadog RUM
How I think about AI engineering leadership
Production engineering experience matters more than ML research credentials for shipping AI systems in 2026. The hardest part of an LLM system in production isn't the model — it's making it reliable. Retries, idempotency, audit logs, rate limits, cost ceilings, graceful degradation, eval CI, observability that lets you reproduce a regression a week later. Those are SRE problems wearing an LLM hat. The teams that nail them ship; the teams that don't write blog posts about why their PoC was great in October.
AI agents are workers, not magic. The orchestration matters more than which agent. A mediocre model on a great pipeline beats a great model on a mediocre one — every single time. Build the pipeline first: the dispatch vocabulary, the bounded review cycles, the human-escalation hooks, the diff-aware scope. Then plug in whichever model has the best price/quality this quarter, and keep the swap easy.
The right metric for AI tooling is "how much time does it save the engineer who'd otherwise do this manually." Not benchmarks, not leaderboards, not vibes. If a workflow saves 15 minutes per PR across 60 engineers, that's the ROI conversation. If it saves 30 seconds and adds a flaky failure mode, kill it. Measure the human cycle, not the model output.
Treat eval like CI. If you can't reproduce a regression on demand, you don't have evals — you have demos. Eval datasets get versioned, regressions block merges, and "the model felt worse this week" becomes a deterministic line on a graph instead of a Slack thread. This is the part most teams skip and the part that decides whether the system survives its first model upgrade.
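A minimal sketch of what that looks like in practice: a versioned dataset checked into the repo and a pass-rate threshold enforced as a test that blocks the merge. The dataset path, the threshold, and the run_case hook are assumptions; the scoring logic is whatever fits the system under test.

```python
# test_evals.py: a regression gate, not a demo (sketch).
import json
import pathlib

EVAL_DATASET = pathlib.Path("evals/review_quality_v3.jsonl")  # versioned with the repo
MIN_PASS_RATE = 0.92  # last known-good rate; dropping below it blocks the merge

def run_case(case: dict) -> bool:
    """Hook: call the agent on case['input'] and score the output against case['expected']."""
    raise NotImplementedError  # wire this to the system under test

def test_no_eval_regression():
    cases = [json.loads(line) for line in EVAL_DATASET.read_text().splitlines() if line]
    pass_rate = sum(run_case(c) for c in cases) / len(cases)
    assert pass_rate >= MIN_PASS_RATE, (
        f"eval pass rate {pass_rate:.2%} fell below {MIN_PASS_RATE:.0%}"
    )
```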
Want to talk about shipping AI in production?
Whether it's a fixed-fee audit, an AI-platform leadership conversation, or just a senior outside read on a thorny architecture call — reach out.