AI Engineering Practice

AI in production, not in slideware.

A consolidated index of how Joseph Mattiello uses, builds, and ships AI tooling — with links to actual code, actual workflows, and actual outcomes. Updated as new work ships.

Multi-agent CI orchestration across the Provenance org

3 autonomous agents in production
Claude (Sonnet + Opus, auto-routed by static diff analysis), Cursor Agent, and Kimi Code — three heterogeneous agents sharing one dispatch vocabulary, routed by branch, label, and PR signal across the Provenance organization.

Each PR fans out to the right model for the change: thread-safety patterns route to Opus, everything else stays on Sonnet. Cursor and Kimi listen on the same dispatch surface, so swapping the underlying model is a one-line change. The case study has the architecture; the workflow files have the proof.
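The routing step above fits in a few lines. A minimal sketch of the idea — the pattern list and model names here are illustrative stand-ins, not the production set, which lives in the workflow files:

```python
import re

# Patterns suggesting the diff touches thread safety / concurrency.
# Illustrative list; the real one is tuned per-repo.
DEEP_REASONING_PATTERNS = [
    r"\bDispatchQueue\b", r"\bNSLock\b", r"\bactor\b",
    r"\bpthread_", r"\batomic\b", r"\bmutex\b",
]

def pick_reviewer_model(diff_text: str) -> str:
    """Route concurrency-heavy diffs to Opus; everything else stays on Sonnet."""
    for pattern in DEEP_REASONING_PATTERNS:
        if re.search(pattern, diff_text):
            return "opus"   # deep-reasoning model
    return "sonnet"         # cheaper default
```

Because the router is a pure function of the diff text, the routing decision is deterministic and trivially observable in CI logs.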

Sonnet vs Opus auto-selection by static diff analysis
Cursor Agent on the shared dispatch vocabulary
Kimi Code as Anthropic-API-compatible fallback
Scheduled poller that bypasses GitHub's bot-actor approval gate
Diff-aware Claude Haiku localization (11 languages)
Claude self-heals spell-check & link-check failures

→ Read the full case study on the main site

240p Test Suite — AI as primary author on a platform with no training data

36 PRs in 10 days · ~84% AI co-authored
Port of Artemio Urbina's 240p Test Suite to Atari Jaguar — bare-metal m68k + RISC, a platform with effectively zero LLM training data. Built end-to-end with Claude Code as primary author and Copilot + Qodo as adversarial reviewers. 12 tagged releases including PAL/NTSC region switching and 7 hardware probes (93C46 EEPROM, Jaguar CD/Butch, JagLink, TOM registers).

If the agent pattern only worked on well-trodden stacks, this project would have failed. Instead it shipped — proof that orchestrated AI development generalizes when you put the right scaffolding around it (a libretro smoke-test CI, a Docker SDK image, and adversarial review).
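One piece of that scaffolding is simple to illustrate: a smoke test only counts if the booted ROM actually drew something. A hypothetical version of the framebuffer check (the real CI boots each PR's ROM in a libretro core and uploads the dump; this gate is a sketch, not the shipped code):

```python
def framebuffer_looks_alive(pixels: bytes, min_distinct: int = 2) -> bool:
    """Reject an all-black (or any single-color) framebuffer dump.

    A crashed or hung ROM typically leaves the screen one uniform color;
    a booted test suite draws its menu. Counting distinct byte values is
    a cheap proxy for "something rendered", not a pixel-perfect check.
    """
    if not pixels:
        return False
    return len(set(pixels)) >= min_distinct
```

The value of a gate this dumb is that it fails loudly on the most common failure mode (boot hang) without needing golden images for a platform nobody has reference renders for.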

Repository — 12 releases, v0.6.4 → v1.3.1
Full PAL/NTSC region-switching support
libretro-based smoke-test CI — boots every PR's ROM, uploads framebuffer artifact
Reproducible Jaguar SDK Docker image build

AI tooling at Wayfair

App Infrastructure is the platform team that the rest of mobile builds on top of. AI tooling here means leverage for 60+ engineers, not novelty for one.

  • Custom MCP servers wrapping internal tooling — auth-aware, audit-logged, rate-limited.
  • Claude skills + plugins automating engineering workflows for the platform team (code review, scaffolding, refactor recipes).
  • n8n workflows integrating Jira + GitHub for cross-team automation — issue triage, PR routing, release coordination.
  • Datadog RUM instrumentation for performance work — bootup tracing, SwiftUI hot-path identification, MP4 caching strategy that shipped ~$2M/year in bandwidth savings.
  • Supports 60+ mobile engineers on the App Infrastructure team — the audience is the rest of mobile, not just one product.

Enterprise work with no public links, referenced here as proof that these patterns hold at internal scale. Specifics are under NDA.
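The specifics are under NDA, but the wrapper pattern behind those MCP servers is generic: every internal tool call goes through one guard that audit-logs it and rate-limits bursts. A minimal sketch with illustrative names and limits (the hypothetical `lookup_build_status` stands in for any internal tool):

```python
import time
import logging
from functools import wraps

log = logging.getLogger("mcp.audit")

def guarded_tool(max_calls: int, per_seconds: float):
    """Wrap an internal tool: audit-log every call, rate-limit a sliding window."""
    def decorator(fn):
        calls: list[float] = []  # timestamps of recent calls

        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Keep only timestamps inside the sliding window.
            calls[:] = [t for t in calls if now - t < per_seconds]
            if len(calls) >= max_calls:
                log.warning("rate limit hit: %s", fn.__name__)
                raise RuntimeError(f"rate limit exceeded for {fn.__name__}")
            calls.append(now)
            log.info("call %s args=%r", fn.__name__, args)  # audit trail
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@guarded_tool(max_calls=5, per_seconds=60.0)
def lookup_build_status(build_id: str) -> str:  # hypothetical internal tool
    return f"status for {build_id}"
```

The point of the decorator shape: auth, auditing, and limits live in one place, so every tool an agent can reach gets them for free.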

The architecture I keep reaching for

Patterns that have survived contact with production, across multiple projects and agent stacks. Each one is in shipping code somewhere above.

  • Reviewer-model auto-selection by static diff analysis. A grep for thread-safety and concurrency patterns in the diff routes between Sonnet (default) and Opus (deep reasoning). Cheap, deterministic, observable.
  • Shared dispatch vocabulary across heterogeneous agents. One set of action verbs — implement_issue, fix_ai_review, fix_rebase_conflict, ai_approved — that Claude, Cursor, and Kimi all speak. Swap the model, keep the orchestration.
  • Anthropic-API-compatible model fallback. Kimi Code via ANTHROPIC_BASE_URL override — the same agent runtime, a different upstream. Useful for cost ceilings, regional availability, and red-team comparisons.
  • Scheduled poller bypass for GitHub bot-actor approval gates. GitHub won't let bot reviews count toward branch protection; a cron-driven poller re-projects bot signal as a status check the gate respects.
  • Bounded review-cycle limits. Max 3 AI ↔ AI review cycles before a human is paged. Prevents runaway loops; makes "the agents are arguing" a measurable event.
  • Diff-aware partial work. Translate only the strings that changed in this PR, not the whole localization file. Same for refactors, doc updates, test scaffolding — operate on the diff, not the universe.
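The dispatch vocabulary and the bounded review cycles compose into a few lines of routing logic. A hedged sketch — the verb set is the real one listed above; the handler wiring and the escalation token are illustrative:

```python
MAX_REVIEW_CYCLES = 3  # after this, a human gets paged

# The shared verb set all three agents understand.
ACTIONS = {"implement_issue", "fix_ai_review", "fix_rebase_conflict", "ai_approved"}

def dispatch(action: str, agent_handlers: dict, cycle_count: int) -> str:
    """Route one action verb to whichever agent backend is plugged in."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "fix_ai_review" and cycle_count >= MAX_REVIEW_CYCLES:
        return "escalate_to_human"  # bounded AI <-> AI loop
    return agent_handlers[action](cycle_count)
```

Swapping Claude for Kimi is a change to `agent_handlers`, not to `dispatch` — which is the whole argument for keeping the vocabulary model-agnostic.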

What's in the toolbox

Compact map of what's in production right now. Replaceable parts where it matters; opinionated where it should be.

Models
  • Claude Sonnet 4.6
  • Claude Opus 4.6
  • Claude Haiku
  • Cursor
  • Kimi Code
Agents
  • Claude Code
  • Cursor Agent
  • Kimi Agent
Infrastructure
  • GitHub Actions
  • workflow_dispatch
  • Anthropic API
  • MCP
  • n8n
Observability
  • Datadog RUM

How I think about AI engineering leadership

Production engineering experience matters more than ML research credentials for shipping AI systems in 2026. The hardest part of an LLM system in production isn't the model — it's making it reliable. Retries, idempotency, audit logs, rate limits, cost ceilings, graceful degradation, eval CI, observability that lets you reproduce a regression a week later. Those are SRE problems wearing an LLM hat. The teams that nail them ship; the teams that don't end up writing blog posts about why their PoC looked great in October.

AI agents are workers, not magic. The orchestration matters more than which agent. A mediocre model on a great pipeline beats a great model on a mediocre one — every single time. Build the pipeline first: the dispatch vocabulary, the bounded review cycles, the human-escalation hooks, the diff-aware scope. Then plug in whichever model has the best price/quality this quarter, and keep the swap easy.

The right metric for AI tooling is "how much time does it save the engineer who'd otherwise do this manually." Not benchmarks, not leaderboards, not vibes. If a workflow saves 15 minutes per PR across 60 engineers, that's the ROI conversation. If it saves 30 seconds and adds a flaky failure mode, kill it. Measure the human cycle, not the model output.

Treat eval like CI. If you can't reproduce a regression on demand, you don't have evals — you have demos. Eval datasets get versioned, regressions block merges, and "the model felt worse this week" becomes a deterministic line on a graph instead of a Slack thread. This is the part most teams skip and the part that decides whether the system survives its first model upgrade.

Want to talk about shipping AI in production?

Whether it's a fixed-fee audit, an AI-platform leadership conversation, or just a senior outside read on a thorny architecture call — reach out.