AI Engineering Practice

AI in production, not in slideware.

A consolidated index of how Joseph Mattiello uses, builds, and ships AI tooling — with links to actual code, actual workflows, and actual outcomes. Updated as new work ships.

Multi-agent CI orchestration across the Provenance org

3 autonomous agents in production
Claude (Sonnet + Opus, auto-routed by static diff analysis), Cursor Agent, and Kimi Code — three heterogeneous agents sharing one dispatch vocabulary, routed by branch, label, and PR signal across the Provenance organization.

Each PR fans out to the right model for the change: thread-safety patterns route to Opus, everything else stays on Sonnet. Cursor and Kimi listen on the same dispatch surface, so swapping the underlying model is a one-line change. The case study has the architecture; the workflow files have the proof.
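The routing step above fits in a few lines. A minimal sketch of the idea — the pattern list and model names here are illustrative stand-ins, not the production set, which lives in the workflow files:

```python
import re

# Patterns suggesting the diff touches thread safety / concurrency.
# Illustrative list; the real one is tuned per-repo.
DEEP_REASONING_PATTERNS = [
    r"\bDispatchQueue\b", r"\bNSLock\b", r"\bactor\b",
    r"\bpthread_", r"\batomic\b", r"\bmutex\b",
]

def pick_reviewer_model(diff_text: str) -> str:
    """Route concurrency-heavy diffs to Opus; everything else stays on Sonnet."""
    for pattern in DEEP_REASONING_PATTERNS:
        if re.search(pattern, diff_text):
            return "opus"   # deep-reasoning model
    return "sonnet"         # cheaper default
```

Because the router is a pure function of the diff text, the routing decision is deterministic and trivially observable in CI logs.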

Sonnet vs Opus auto-selection by static diff analysis
Cursor Agent on the shared dispatch vocabulary
Kimi Code as Anthropic-API-compatible fallback
Scheduled poller that bypasses GitHub's bot-actor approval gate
Diff-aware Claude Haiku localization (11 languages)
Claude self-heals spell-check & link-check failures

→ Read the full case study on the main site

240p Test Suite — AI as primary author on a platform with no training data

36 PRs in 10 days · ~84% AI co-authored
Port of Artemio Urbina's 240p Test Suite to Atari Jaguar — bare-metal m68k + RISC, a platform with effectively zero LLM training data. Built end-to-end with Claude Code as primary author and Copilot + Qodo as adversarial reviewers. 12 tagged releases including PAL/NTSC region switching and 7 hardware probes (93C46 EEPROM, Jaguar CD/Butch, JagLink, TOM registers).

If the agent pattern only worked on well-trodden stacks, this project would have failed. Instead it shipped — proof that orchestrated AI development generalizes when you put the right scaffolding around it (a libretro smoke-test CI, a Docker SDK image, and adversarial review).
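One piece of that scaffolding is simple to illustrate: a smoke test only counts if the booted ROM actually drew something. A hypothetical version of the framebuffer check (the real CI boots each PR's ROM in a libretro core and uploads the dump; this gate is a sketch, not the shipped code):

```python
def framebuffer_looks_alive(pixels: bytes, min_distinct: int = 2) -> bool:
    """Reject an all-black (or any single-color) framebuffer dump.

    A crashed or hung ROM typically leaves the screen one uniform color;
    a booted test suite draws its menu. Counting distinct byte values is
    a cheap proxy for "something rendered", not a pixel-perfect check.
    """
    if not pixels:
        return False
    return len(set(pixels)) >= min_distinct
```

The value of a gate this dumb is that it fails loudly on the most common failure mode (boot hang) without needing golden images for a platform nobody has reference renders for.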

Repository — 12 releases, v0.6.4 → v1.3.1
Full PAL/NTSC region-switching support
libretro-based smoke-test CI — boots every PR's ROM, uploads framebuffer artifact
Reproducible Jaguar SDK Docker image build

AI tooling at Wayfair

App Infrastructure is the platform team that the rest of mobile builds on top of. AI tooling here means leverage for 60+ engineers, not novelty for one.

  • Custom MCP servers wrapping internal tooling — auth-aware, audit-logged, rate-limited.
  • Claude skills + plugins automating engineering workflows for the platform team (code review, scaffolding, refactor recipes).
  • n8n workflows integrating Jira + GitHub for cross-team automation — issue triage, PR routing, release coordination.
  • Datadog RUM instrumentation for performance work — bootup tracing, SwiftUI hot-path identification, MP4 caching strategy that shipped ~$2M/year in bandwidth savings.
  • Supports 60+ mobile engineers on the App Infrastructure team — the audience is the rest of mobile, not just one product.

Enterprise work with no public links, referenced here as proof that these patterns hold at internal scale. Specifics are under NDA.
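The specifics are under NDA, but the wrapper pattern behind those MCP servers is generic: every internal tool call goes through one guard that audit-logs it and rate-limits bursts. A minimal sketch with illustrative names and limits (the hypothetical `lookup_build_status` stands in for any internal tool):

```python
import time
import logging
from functools import wraps

log = logging.getLogger("mcp.audit")

def guarded_tool(max_calls: int, per_seconds: float):
    """Wrap an internal tool: audit-log every call, rate-limit a sliding window."""
    def decorator(fn):
        calls: list[float] = []  # timestamps of recent calls

        @wraps(fn)
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Keep only timestamps inside the sliding window.
            calls[:] = [t for t in calls if now - t < per_seconds]
            if len(calls) >= max_calls:
                log.warning("rate limit hit: %s", fn.__name__)
                raise RuntimeError(f"rate limit exceeded for {fn.__name__}")
            calls.append(now)
            log.info("call %s args=%r", fn.__name__, args)  # audit trail
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@guarded_tool(max_calls=5, per_seconds=60.0)
def lookup_build_status(build_id: str) -> str:  # hypothetical internal tool
    return f"status for {build_id}"
```

The point of the decorator shape: auth, auditing, and limits live in one place, so every tool an agent can reach gets them for free.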

The architecture I keep reaching for

Patterns that have survived contact with production, across multiple projects and agent stacks. Each one is in shipping code somewhere above.

  • Reviewer-model auto-selection by static diff analysis. A grep for thread-safety and concurrency patterns in the diff routes between Sonnet (default) and Opus (deep reasoning). Cheap, deterministic, observable.
  • Shared dispatch vocabulary across heterogeneous agents. One set of action verbs — implement_issue, fix_ai_review, fix_rebase_conflict, ai_approved — that Claude, Cursor, and Kimi all speak. Swap the model, keep the orchestration.
  • Anthropic-API-compatible model fallback. Kimi Code via ANTHROPIC_BASE_URL override — the same agent runtime, a different upstream. Useful for cost ceilings, regional availability, and red-team comparisons.
  • Scheduled poller bypass for GitHub bot-actor approval gates. GitHub won't let bot reviews count toward branch protection; a cron-driven poller re-projects bot signal as a status check the gate respects.
  • Bounded review-cycle limits. Max 3 AI ↔ AI review cycles before a human is paged. Prevents runaway loops; makes "the agents are arguing" a measurable event.
  • Diff-aware partial work. Translate only the strings that changed in this PR, not the whole localization file. Same for refactors, doc updates, test scaffolding — operate on the diff, not the universe.
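The dispatch vocabulary and the bounded review cycles compose into a few lines of routing logic. A hedged sketch — the verb set is the real one listed above; the handler wiring and the escalation token are illustrative:

```python
MAX_REVIEW_CYCLES = 3  # after this, a human gets paged

# The shared verb set all three agents understand.
ACTIONS = {"implement_issue", "fix_ai_review", "fix_rebase_conflict", "ai_approved"}

def dispatch(action: str, agent_handlers: dict, cycle_count: int) -> str:
    """Route one action verb to whichever agent backend is plugged in."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "fix_ai_review" and cycle_count >= MAX_REVIEW_CYCLES:
        return "escalate_to_human"  # bounded AI <-> AI loop
    return agent_handlers[action](cycle_count)
```

Swapping Claude for Kimi is a change to `agent_handlers`, not to `dispatch` — which is the whole argument for keeping the vocabulary model-agnostic.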

What's in the toolbox

Compact map of what's in production right now. Replaceable parts where it matters; opinionated where it should be.

Models
  • Claude Sonnet 4.6
  • Claude Opus 4.6
  • Claude Haiku
  • Cursor
  • Kimi Code
Agents
  • Claude Code
  • Cursor Agent
  • Kimi Agent
Infrastructure
  • GitHub Actions
  • workflow_dispatch
  • Anthropic API
  • MCP
  • n8n
Observability
  • Datadog RUM

How I think about AI engineering leadership

Production engineering experience matters more than ML research credentials for shipping AI systems in 2026. The hardest part of an LLM system in production isn't the model — it's making it reliable. Retries, idempotency, audit logs, rate limits, cost ceilings, graceful degradation, eval CI, observability that lets you reproduce a regression a week later. Those are SRE problems wearing an LLM hat. The teams that nail them ship; the teams that don't end up writing blog posts about why their PoC looked great in October.

AI agents are workers, not magic. The orchestration matters more than which agent. A mediocre model on a great pipeline beats a great model on a mediocre one — every single time. Build the pipeline first: the dispatch vocabulary, the bounded review cycles, the human-escalation hooks, the diff-aware scope. Then plug in whichever model has the best price/quality this quarter, and keep the swap easy.

The right metric for AI tooling is "how much time does it save the engineer who'd otherwise do this manually." Not benchmarks, not leaderboards, not vibes. If a workflow saves 15 minutes per PR across 60 engineers, that's the ROI conversation. If it saves 30 seconds and adds a flaky failure mode, kill it. Measure the human cycle, not the model output.

Treat eval like CI. If you can't reproduce a regression on demand, you don't have evals — you have demos. Eval datasets get versioned, regressions block merges, and "the model felt worse this week" becomes a deterministic line on a graph instead of a Slack thread. This is the part most teams skip and the part that decides whether the system survives its first model upgrade.

Want to talk about shipping AI in production?

Whether it's a fixed-fee audit, an AI-platform leadership conversation, or just a senior outside read on a thorny architecture call — reach out.