Multi-agent CI for Provenance EMU: routing, not models

3 agents. 9 workflow files in .github/workflows. One shared dispatch vocabulary. Reviewer model picked by grep over the diff. A cron poller that gets around GitHub’s bot-actor approval gate. The orchestration layer behind Provenance EMU, in production today.

AI engineering · Open source · GitHub Actions · Multi-agent · Claude · Cursor · Kimi

What this is

I maintain Provenance EMU: 6,300 stars, 79 contributors, App Store since January 2025. Day job at Wayfair on App Infrastructure. The project hit the usual solo-maintainer wall around year five. PRs piled up. Localization rotted between sprints. Wiki spell-check broke and stayed broken. Code review gated everything else.

The fix wasn’t one more agent. It was a routing layer in front of three agents: Claude Code, Cursor, and Kimi. Nine workflow files in .github/workflows. One shared workflow_dispatch vocabulary. A reviewer model picked by grep over the diff, and a cron poller that gets around GitHub’s bot-actor approval gate. All of it public on the Provenance org.

Total Anthropic spend, including the wiki self-heal and the localization runs, is in the low double digits of dollars per month. The routing is what makes that possible. The model choice is downstream of the router.

The architecture in one diagram

            ┌──────────────────────────────────────────────┐
            │              PR opens / is updated           │
            └──────────────────────┬───────────────────────┘
                                   ▼
                  ┌─────────────────────────────────────┐
                  │  router workflow (cheap, fast):     │
                  │  - inspect branch name suffix       │
                  │  - inspect PR labels                │
                  │  - grep diff for risk patterns      │
                  └────────┬────────────┬───────────────┘
                           │            │
            ┌──────────────┘            └──────────────┐
            ▼                                          ▼
   ┌──────────────────┐                       ┌──────────────────┐
   │  Claude Code     │  (default lane)       │  Cursor Agent    │  (-cursor branch)
   │  Sonnet | Opus   │                       │  background mode │
   │  auto-selected   │                       │                  │
   └────────┬─────────┘                       └────────┬─────────┘
            │                                          │
            └──────────────────┬───────────────────────┘
                               ▼
                  ┌─────────────────────────────┐
                  │  shared dispatch vocabulary │
                  │  - implement_issue          │
                  │  - fix_ai_review            │
                  │  - fix_rebase_conflict      │
                  │  - ai_approved              │
                  └────────────┬────────────────┘
                               ▼
                  ┌─────────────────────────────┐
                  │  Copilot review → cron      │
                  │  poller picks up review →   │
                  │  next workflow → human gate │
                  └─────────────────────────────┘

            Fallback: Anthropic credits exhausted →
                      ANTHROPIC_BASE_URL = api.kimi.com/coding
                      → same Claude Code binary → Kimi K2 model

The router is dumb on purpose. if statements over branch name, labels, and a handful of grep patterns. All three agents (Claude Code, Cursor, Kimi via the Anthropic-compatible endpoint) consume the same workflow_dispatch inputs. Copilot does the actual review pass, but a cron job, not the review event itself, fires the next stage. That last point is the one that took longest to figure out.

Reviewer model picked by grep

The headliner is twelve lines of bash in ai-review.yml.

Sonnet 4.5 is fast and cheap. You can run it all day on small reviews and barely move the bill. Opus is meaningfully smarter on hard problems, but the per-token cost is roughly an order of magnitude higher. Pick by gut and you either shortchange genuinely tricky code with a Sonnet review or pay Opus prices for UI tweaks. Neither is acceptable.

So the router greps the diff for patterns that, in this codebase, reliably mark the kind of change where Opus pays for itself. Thread safety, persistence boundaries, Swift concurrency, the C++ emulation core.

# inside ai-review.yml
DIFF=$(gh pr diff "$PR_NUMBER")

# patterns where Opus consistently beats Sonnet in this codebase:
#  - Realm: thread-confined object model, easy to corrupt
#  - @Model: SwiftData, similar story
#  - actor / @MainActor: Swift concurrency boundaries
#  - @synchronized / NSLock: explicit locks; subtle deadlocks
#  - DispatchQueue: GCD; race conditions are non-obvious
#  - .cpp / .mm: the emulation core has real concurrency

# note: diff lines carry a leading '+', '-', or space, so the 'actor'
# anchor has to step past that prefix
if echo "$DIFF" | grep -qE 'Realm|@Model|@synchronized|^[-+ ]\s*actor\s|@MainActor|DispatchQueue|NSLock|\.cpp$|\.mm$'; then
    echo "model=claude-opus-4-5" >> "$GITHUB_OUTPUT"
    echo "reason=high-risk-pattern-detected" >> "$GITHUB_OUTPUT"
else
    echo "model=claude-sonnet-4-5" >> "$GITHUB_OUTPUT"
    echo "reason=default" >> "$GITHUB_OUTPUT"
fi

The reason output gets posted as a PR comment so I can audit the routing later. A few times I’ve added new patterns after a Sonnet review missed something I felt Opus would have caught. The dictionary grows with the codebase.
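
Posting that comment is one more step in the same job. A sketch, assuming the grep step above carries the id route; that id, and the step itself, are my addition, not lifted from the real workflow:

# hypothetical step in ai-review.yml; the id "route" on the grep step
# is an assumption
- name: Post routing decision as PR comment
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    PR_NUMBER: ${{ github.event.pull_request.number }}
  run: |
    gh pr comment "$PR_NUMBER" \
      --body "AI review model: ${{ steps.route.outputs.model }} (reason: ${{ steps.route.outputs.reason }})"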

Thread-safety and persistence-boundary bugs need a lot of context. The lifecycle of an object, which queue a method runs on, what other actors hold a reference. UI tweaks and dependency bumps need none of that. The diff itself carries enough signal to predict, statically, which class of review you’re about to do. No router LLM. No model in the loop to make the routing decision. grep is fine.

The pattern transfers. If you’re a Rails shop, the trip-wires are after_commit, has_many :through, anything inside ActiveRecord::Base.transaction. Go shop: go func, channel ops, anything in sync. Look at your last twenty production incidents and ask whether the diff that caused each one was distinguishable in advance by file extension or grep pattern. The answer is almost always yes.
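
Transplanted to a Rails repo, the same router might look like this. The patterns are illustrative only, not vetted trip-wires; derive yours from your own incident history.

# hypothetical Rails variant of the router; patterns are examples only
DIFF=$(gh pr diff "$PR_NUMBER")

if echo "$DIFF" | grep -qE 'after_commit|has_many :through|\.transaction\b|with_lock'; then
    echo "model=claude-opus-4-5" >> "$GITHUB_OUTPUT"
    echo "reason=high-risk-pattern-detected" >> "$GITHUB_OUTPUT"
else
    echo "model=claude-sonnet-4-5" >> "$GITHUB_OUTPUT"
    echo "reason=default" >> "$GITHUB_OUTPUT"
fi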

Routing payoff. Roughly 70% of PRs land on Sonnet, 30% on Opus. Before the router existed I paid Opus prices for everything, just to be safe; the bill dropped to about a third. So far, no production bug has slipped past a Sonnet review that Opus would plausibly have caught. That sentence is true today; ask me in six months.

Three agents, one action vocabulary

The annoying default in multi-agent setups is a separate workflow tree per agent, three copies of the same orchestration logic. Change one thing, change it three times.

The pattern that landed (after reverting two earlier shapes) is one shared dispatch vocabulary, agents as workers. The orchestration layer doesn’t know which agent it’s calling. It knows which action it wants performed. Routing to a specific agent happens at the worker level.

# every agent worker accepts these workflow_dispatch inputs
on:
  workflow_dispatch:
    inputs:
      action:
        type: choice
        options:
          - implement_issue       # take an issue, open a PR
          - fix_ai_review         # respond to a Copilot review with fixes
          - fix_rebase_conflict   # resolve merge conflicts on a stale branch
          - ai_approved           # signal review complete; ready for human gate
      issue_number:
        type: string
      pr_number:
        type: string
      model:
        type: string
        default: claude-sonnet-4-5

Routing is by branch suffix and label. fix/issue-1234-cursor goes to cursor-agent.yml. No suffix, no special label, goes to Claude. The label ai:fallback-kimi forces kimi-agent.yml. The router does that mapping in about ten lines of bash. Adding a fourth agent (I’ve been eyeing Aider) is a copy-paste of the worker workflow plus one new branch suffix.
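
That mapping, sketched. The worker file names come from the text above; $ACTION and $PR_NUMBER are assumed to be resolved earlier in the router.

# sketch of the router's agent mapping; variable setup is assumed
BRANCH=$(gh pr view "$PR_NUMBER" --json headRefName --jq .headRefName)
LABELS=$(gh pr view "$PR_NUMBER" --json labels --jq '.labels[].name')

if echo "$LABELS" | grep -qx 'ai:fallback-kimi'; then
    WORKER=kimi-agent.yml
elif [[ "$BRANCH" == *-cursor ]]; then
    WORKER=cursor-agent.yml
else
    WORKER=claude-worker.yml
fi

gh workflow run "$WORKER" -f action="$ACTION" -f pr_number="$PR_NUMBER"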

When I add a fifth verb (fix_localization_drift is on the list), I touch one input definition and propagate the case to each worker. The orchestration above doesn’t care. Agents are interchangeable.
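
On the worker side, consuming the vocabulary is a case statement. A sketch of the Claude worker's shape; the prompts and the label name are placeholders, not the production ones:

# hypothetical step inside claude-worker.yml; prompts and the label
# name are illustrative, and ANTHROPIC_API_KEY setup is omitted
- name: Perform requested action
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    ACTION: ${{ inputs.action }}
    ISSUE: ${{ inputs.issue_number }}
    PR: ${{ inputs.pr_number }}
    MODEL: ${{ inputs.model }}
  run: |
    case "$ACTION" in
      implement_issue)     claude --print "Implement issue #$ISSUE and open a PR" --model "$MODEL" ;;
      fix_ai_review)       claude --print "Address the review feedback on PR #$PR" --model "$MODEL" ;;
      fix_rebase_conflict) claude --print "Resolve the merge conflicts on PR #$PR" --model "$MODEL" ;;
      ai_approved)         gh pr edit "$PR" --add-label ai-approved ;;
    esac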

Kimi as Anthropic-compatible fallback

Claude Code is a binary distributed by Anthropic. It hardcodes Anthropic-SDK behavior. So how do you fail over when credits run dry, or when Anthropic has a partial outage, without rewriting the worker workflow?

Moonshot’s Kimi exposes an Anthropic-compatible coding endpoint. Same API surface, different model behind it. The fallback is one environment variable.

- name: Run Claude Code (fallback to Kimi K2)
  env:
    ANTHROPIC_API_KEY: ${{ secrets.KIMI_API_KEY }}
    ANTHROPIC_BASE_URL: https://api.kimi.com/coding
  run: |
    claude --print "..." --model kimi-k2

Same Claude Code binary, no other workflow changes, talking to Kimi. The trigger in production is rate-limit or credit-exhaustion errors from the Anthropic endpoint. Those bubble up as a non-zero exit, the next attempt sets the Kimi env vars, and the work continues. It’s fired a handful of times in the last quarter. Each time the PR finished without me noticing until I looked at the workflow log.
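
The failover shape, sketched. continue-on-error plus an outcome check is the standard Actions idiom for a two-attempt step; the ids and the retry granularity here are my simplification, and the prompt elision mirrors the step above.

# sketch of the two-attempt failover; step names and ids are illustrative
- name: Run Claude Code (Anthropic first)
  id: anthropic_attempt
  continue-on-error: true
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    claude --print "..." --model claude-sonnet-4-5

- name: Retry via Kimi on failure
  if: steps.anthropic_attempt.outcome == 'failure'
  env:
    ANTHROPIC_API_KEY: ${{ secrets.KIMI_API_KEY }}
    ANTHROPIC_BASE_URL: https://api.kimi.com/coding
  run: |
    claude --print "..." --model kimi-k2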

The point: most multi-agent setups assume the agents have wildly different SDKs and design around that. In practice, OpenAI-compatible and Anthropic-compatible endpoints are the lingua franca. You can swap models behind your existing tooling for the price of a base URL.

The cron poller that bypasses GitHub’s bot-actor gate

This is the second headliner. Nobody talks about it. If you’ve never tried to chain a GitHub bot’s output into another workflow, the next two paragraphs are why.

Setup. When Copilot finishes reviewing a PR, the Claude worker should read that review and address the feedback. The intuitive wiring is a pull_request_review event. GitHub fires it whenever a review is submitted.

Reality. GitHub silently gates event-triggered workflows when the actor is a bot. copilot[bot], github-actions[bot], and friends. The downstream run sits in action_required until a human clicks approve. The button lives deep in the UI. There’s no error. The workflow just doesn’t fire. Bot-to-bot chaining is impossible by default. The gate is a sensible security policy in the abstract; it shuts down this whole pattern in practice.

Workarounds I tried first. pull_request_target doesn’t help. The actor gate still applies. Manually approving each one defeats the automation. Service accounts get re-flagged as bots after enough activity. None of these scale.

What works. copilot-review-poller.yml runs on cron every ten minutes, finds new Copilot reviews, and dispatches the next workflow itself. The scheduled run’s actor is github-actions[bot], which is the trusted one for this purpose. The gate doesn’t apply.

name: Poll Copilot reviews
on:
  schedule:
    - cron: '*/10 * * * *'    # every 10 min
  workflow_dispatch:           # manual kick

# WHY THIS EXISTS, READ THIS BEFORE DELETING:
# GitHub silently gates event-triggered workflows when the actor is
# copilot[bot] (action_required approval). pull_request_review
# events from Copilot do NOT fire downstream workflows by default.
# A scheduled run carries the github-actions[bot] actor and is NOT
# subject to that gate. So we poll instead of subscribing.

permissions:
  pull-requests: write
  contents: read
  actions: write

jobs:
  poll:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Find PRs with new Copilot reviews
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # one PR number per line; any() avoids emitting a PR once per
          # matching review (dedup of already-handled reviews is elided)
          gh pr list --json number,reviews \
            --jq '.[] | select(any(.reviews[]?; .author.login == "copilot[bot]")) | .number' \
          | while read -r pr_number; do
              gh workflow run claude-worker.yml \
                -f action=fix_ai_review \
                -f pr_number="$pr_number"
            done

Ten-minute latency is fine. The goal isn’t sub-second turnaround. The goal is closing the loop without me in it. The poller is the bridge between “Copilot finished reviewing” and “Claude starts addressing feedback,” and it’s the missing piece in every multi-agent CI tutorial I’ve read. Scheduled actors bypass the bot-actor gate. Polling is the only reliable way to chain bot output into another workflow on GitHub today.

Self-healing wiki CI

The Provenance wiki lives in a separate repo. Spell check, link check, structural validation. It used to break every couple of weeks because someone (often me) merged a doc edit with a typo or a dead link, and the failure would sit there until I noticed.

It now self-heals. auto-fix-ci.yml on the wiki repo fires on CI failure, calls Claude via anthropics/claude-code-action, Claude opens a PR with the fix, and (the part that delights me) requests Copilot review on its own PR. The poller from the previous section picks that review up. If Copilot approves, the PR auto-merges via a branch-protection bypass token. Loop closes without me.
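
For reference, the trigger shape. A sketch of auto-fix-ci.yml's skeleton, assuming the wiki's CI workflow is literally named CI and that the action accepts a prompt input; the workflow name and prompt text are assumptions, not copied from the real file.

# sketch of the self-heal trigger; workflow name and prompt are assumptions
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  auto-fix:
    if: github.event.workflow_run.conclusion == 'failure'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            CI failed on the wiki. Diagnose the spell-check, link-check,
            or validation failure and open a PR with the fix.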

Most wiki CI failures self-resolve within an hour. The rest are real (broken external links, structural changes the doc tooling doesn’t handle) and bubble up via a Slack notification. The signal-to-noise on those notifications is finally what I want.

Diff-aware Haiku localization

Provenance ships strings localized into eleven target languages. The naive approach (the one I had for years) is to re-translate the entire file every time en.lproj/Localizable.strings changes. Slow, wasteful, and it stomps on any human edits the localizers made downstream.

Diff-aware is better. auto-translate.yml runs on every push that touches the English strings file (a sketch of the key-extraction step follows the list):

  1. Diff en.lproj/Localizable.strings between HEAD~1 and HEAD.
  2. Extract keys that are new or whose values changed.
  3. Send only those keys to Claude Haiku with the existing target-language file as context.
  4. Apply Haiku’s output as a patch to each of the eleven target files.
  5. Open a PR labeled ai:localization for human spot-check.
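
Steps 1 and 2 reduce to a few lines of git and grep. A minimal sketch, assuming the standard "key" = "value"; format of .strings files:

# sketch of changed-key extraction; added/changed lines in the diff
# start with '+' followed by the quoted key
git diff HEAD~1 HEAD -- en.lproj/Localizable.strings \
  | grep '^+"' \
  | grep -oE '^\+"[^"]+"' \
  | sed 's/^+//' \
  > changed_keys.txt

Deleted keys surface as minus lines and are ignored, which matches the new-or-changed rule above.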

Haiku is the right model for two reasons. It’s cheap enough to run on every push without thinking about it. And translation of short UI strings is the kind of task where Sonnet’s extra capability buys nothing measurable. A full localization run drops from dollars per language to single-digit cents, and finishes in under a minute. The shape works for any string-keyed pipeline. Same loop fits Android strings.xml, gettext .po, JSON i18n bundles. Anything with stable keys.

Real outcomes

The stack has been running in roughly its current form for about nine weeks.

  • PR throughput. Noticeably faster on the routine PRs because I’m no longer in the first-review path. Median time-to-first-review dropped from days to hours.
  • Anthropic spend. Low double digits of dollars per month across the org. That covers routing, the wiki self-heal, the localization pipeline, the Claude Code workers, and the occasional Opus review. The diff-grep router is the biggest single cost lever.
  • What didn’t work. File-extension routing. I tried sending all .swift to Cursor and all .cpp to Claude. Killed it within two weeks. File type is a poor proxy for which agent does better. The diff-pattern signal beats it.
  • What surprised me. The AI reviewers catch real bugs, especially around threading and Realm transactions. They also flag documentation drift (a doc-comment that no longer matches the function signature), which I never asked them to do. False-positive rate is non-trivial. A quick triage pass is still cheaper than missing a real bug.
  • What still needs me. Design decisions, public API changes, anything in the C++ emulation core that touches sync timing, anything affecting save-state compatibility with prior releases. Those PRs get the AI review for completeness. The merge decision is mine.

Takeaway

Multi-agent CI isn’t about replacing engineers. It removes the routine review-and-triage layer that, for a solo maintainer, is the actual rate-limiting step. The orchestration doesn’t make the agents smarter. It puts the right agent on the right work, automatically, without a human in the routing loop.

Three things worth stealing. Route by static diff analysis, not by gut and not by router LLM. Build one shared dispatch vocabulary and treat agents as interchangeable workers behind it. Use a scheduled poller for any chain that crosses the GitHub bot-actor gate, because subscribing to the event will silently fail and you’ll spend a weekend figuring out why.

The nine workflow files are public on the Provenance org, among them ai-review.yml, cursor-agent.yml, kimi-agent.yml, copilot-review-poller.yml, auto-translate.yml, and repo-health.yml, plus the wiki's auto-fix-ci.yml. Copy them. Adapt the dispatch vocabulary to your stack. The orchestration is the part worth stealing. Whichever agent you put behind it is replaceable.