142 acid tests for a 30-year-old emulator, in one PR

Strict hardware-conformance tests written in 68K assembly that run inside the emulator and write a pass/fail signature to a known RAM address. 122/142 pass; the 19 failures are real bugs in the core, each one now a checked-in regression gate. PR #130 against libretro/virtualjaguar-libretro.

What landed

PR #130 against libretro/virtualjaguar-libretro: 142 strict hardware-conformance tests across 13 categories, all written in 68K assembly that runs inside the emulator. Each test writes a pass/fail signature to a known RAM address. A host-side runner reads the signature back. 122/142 pass on develop today.

The 19 failures are real emulator bugs (or unmodelled hardware behavior). Each FAILing test is a checked-in regression gate. When somebody fixes the underlying bug, the test goes GREEN automatically.

Suite runtime: 12 seconds. Was 30+ minutes before a one-line fix in run.c. The branch is feature/acid-test-roms. The architecture doc is test/acid/README.md.

Acid tests fail until the emulator is correct

Most regression tests are written to pass on the current implementation, so they only catch regressions from current behavior, not deviations from correct behavior. Acid tests invert that. Each test makes a strict claim about what real Jaguar hardware does, with exact expected bytes. If the emulator diverges, the test FAILs. The checked-in test is a description of the bug.

That framing matters because the Jaguar core is deliberately not cycle-accurate. The Object Processor’s (OP) timing budget is loose. Bus contention isn’t modelled. The HLE BIOS doesn’t match the real BIOS in many places. We expect failures. Each FAIL is more useful than prose in a TODO doc.

The working motto came out of an early design conversation:

we’re not trying to write tests that work with our emulator, we’re trying to write accurate tests that will fail unless our emulator is correct

I had every blitter command bit wrong

The first batch of blitter tests all FAILed identically. The README I wrote at the time confidently documented a “blitter source-data routing bug.” Then GitHub Copilot’s PR review pointed out that the entire B_COMMAND encoding I was using was bogus. I had $0001C000 thinking that was “LFU=copy source”; bits 14-15 are unused and the LFU function actually lives in bits 21-24. I’d also confused DSTEN (bit 3, $08) with DSTWRZ (bit 5, $20).
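
A quick way to see the mistake is to list which bits a literal actually sets. The sketch below is illustrative Python, not part of the suite: the helper names are made up, and only the bit positions quoted above (LFU in bits 21-24, DSTEN at bit 3, DSTWRZ at bit 5) come from the text.

```python
# Illustrative only: helper names are hypothetical; the bit positions are the
# ones quoted in the paragraph above.
LFU_SHIFT = 21                 # LFU field occupies bits 21-24
LFU_MASK = 0xF << LFU_SHIFT
DSTEN = 1 << 3                 # $08
DSTWRZ = 1 << 5                # $20


def set_bits(value: int) -> list[int]:
    """Return the indices of all bits set in a 32-bit command word."""
    return [bit for bit in range(32) if value & (1 << bit)]


def lfu_field(value: int) -> int:
    """Extract the 4-bit LFU function from a command word."""
    return (value & LFU_MASK) >> LFU_SHIFT


bad = 0x0001C000
print(set_bits(bad))            # [14, 15, 16]
print(lfu_field(bad))           # 0: the LFU field was never touched
print(hex(0xE << LFU_SHIFT))    # 0x1c00000: where an LFU of $E actually lands
```

The literal sets three bits, none of them inside the LFU field, so every test built on it was asking the blitter for an LFU of zero.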

13 of the 14 SRC-reading blitter tests recovered from FAIL to PASS in one round of fixes. The “blitter bug” I had documented didn’t exist. It was my wrong test code all along.

This is the single most important lesson from the work. An acid test is only as good as its encoding correctness. Tests are software too, and they have bugs. Volume of tests doesn’t matter if the encodings are wrong; you’d just be measuring a fixed point between two different bugs.

Generating the constants from the source

Two pieces of infrastructure landed to make that class of mistake mechanically impossible.

The constants oracle. A Python script (test/acid/scripts/gen-jaguar-regs.py) parses the actual emulator C sources and headers (src/tom/blitter.c, tom.h, gpu.h, jerry.h, dsp.h) and emits a single vasm-friendly equates file at test/acid/include/jaguar_regs.s. Tests reference registers and fields by name (BCOMPEN, IRQ2_TIMER1, B_COMMAND) and get the right address or bit, mechanically. Single source of truth. If the C source changes a bit position, the oracle picks it up the next time you run make.
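
The idea can be sketched in a few lines of Python. This is not the real gen-jaguar-regs.py: the sample input, the regular expression, and the output column layout are all assumptions, and the sketch only handles plain hex or decimal #define lines.

```python
import re

# Hypothetical input. The real headers and the real script's parsing rules
# are not shown in this post.
SAMPLE_C = """
#define DSTEN   0x00000008
#define DSTWRZ  0x00000020
#define DERIVED (1 << 3)   /* expressions are skipped by this sketch */
"""

# Matches `#define NAME <hex-or-decimal literal>` with an optional comment.
DEFINE_RE = re.compile(
    r"^\s*#define\s+([A-Za-z_]\w*)\s+(0[xX][0-9A-Fa-f]+|\d+)\s*(?:/\*.*|//.*)?$"
)


def parse_literal(text: str) -> int:
    """Parse a C hex or decimal literal (octal is not handled)."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def emit_equates(c_text: str) -> list[str]:
    """Turn simple #define lines into `NAME equ $HEX` assembler equates."""
    lines = []
    for raw in c_text.splitlines():
        match = DEFINE_RE.match(raw)
        if match:
            name, literal = match.groups()
            lines.append(f"{name:<16}equ ${parse_literal(literal):08X}")
    return lines


for line in emit_equates(SAMPLE_C):
    print(line)
```

Because the equates are regenerated from the C source rather than typed by hand, a test can only get a bit position wrong by picking the wrong name, which is a much smaller class of mistake.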

The lint pass. test/acid/scripts/lint-acid.py walks every test’s B_COMMAND literals and warns if any bit is set outside BLIT_CMD_VALID_BITS, the OR of all defined fields. It also checks LFU operand consistency: an LFU=$E (S|D) operation without DSTEN set reads D as zero, so the test silently FAILs for the wrong reason. And it flags hard-coded $F02238-style MMIO literals where a symbolic name exists.

Run via make -C test/acid lint. Currently clean across the suite.
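
The first two checks reduce to mask arithmetic. Here is a minimal sketch, assuming a deliberately partial field table that contains only the fields named in this post; the real linter derives its mask from the generated equates, and the function name and warning text here are invented for illustration.

```python
# Partial, hypothetical field table: only the fields this post names.
FIELDS = {
    "DSTEN": 1 << 3,
    "DSTWRZ": 1 << 5,
    "LFU": 0xF << 21,
}

# Stand-in for BLIT_CMD_VALID_BITS: the OR of every defined field.
VALID_BITS = 0
for field_mask in FIELDS.values():
    VALID_BITS |= field_mask


def lint_command(value: int) -> list[str]:
    """Return warnings for one B_COMMAND literal."""
    warnings = []
    unknown = value & ~VALID_BITS & 0xFFFFFFFF
    if unknown:
        warnings.append(f"unknown bits set: ${unknown:08X}")
    lfu = (value & FIELDS["LFU"]) >> 21
    if lfu == 0xE and not (value & FIELDS["DSTEN"]):
        warnings.append("LFU=$E (S|D) without DSTEN: D reads as zero")
    return warnings


print(lint_command(0x0000C000))                      # bits 14-15 are unused
print(lint_command(0xE << 21))                       # S|D with no dest read
print(lint_command((0xE << 21) | FIELDS["DSTEN"]))   # clean
```

Note what this cannot catch: a literal made entirely of valid bits that is still the wrong command for the test.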

Four sub-agents, one plan

With the oracle and linter as safety rails, I wrote a 600-line COVERAGE_PLAN.md partitioned into 11 chunks small enough for one sub-agent each. Then I dispatched four LLM agents in parallel via Claude Code’s Task tool:

  • Agent A: blitter chunks 1+2+3 (tighten existing + pixsize matrix + LFU completion). 28 tests touched.
  • Agent B: GPU + DSP opcodes, chunks 5+6. 35 tests.
  • Agent C: OP scenarios + bus tests, chunks 7+9. 10 tests.
  • Agent D: strict timing + 68K coverage, chunks 10+11. 8 tests.

Each agent got the COVERAGE_PLAN.md and the oracle as context. Each was instructed to run make -C test/acid lint before claiming done. Each was told that failures are valuable data, not a bad outcome. Each reported back which files passed, which failed, and what the diagnostic codes meant.

~70 new tests in the time it took the slowest agent to finish (~25 minutes wall time). The agents also surfaced corrections to the prompt: one flagged that I’d written the STORE-opcode encoding wrong (rm in bits 9..5, not rn); another discovered that the OP modifies the BITMAP p0 phrase in place every halfline. Those corrections went back into the plan.

The guard rails are doing the work. The agents weren’t writing assembly from scratch with hardware docs open in another tab. They were filling slots in a constrained problem with a checker that ran before they declared done. That distinction is the whole point. Without the linter, I’d have ~70 confidently-broken tests checked in and the suite would be lying to me.

What the suite found

19 FAILing tests, each one a checked-in regression gate that flips GREEN when the underlying bug is fixed. Six findings worth calling out specifically.

Postscript pointer (2026-05-03): see the postscript at the bottom of this post. 8 of these 19 failures turned out to be test bugs, not emulator bugs, and three of the six findings below got retracted within 24 hours of this post going live. The original section is left intact below as a snapshot of what I thought I knew when I hit publish.

Event-clock vs 68K-instruction-clock divergence. Three strict-timing tests fail with a consistent ~1.7-2.0x ratio. This is almost certainly the root cause of the long-standing Doom-plays-1.5x-too-fast bug (#131). If the event clock advances faster than 68K instructions execute, games that use VBlank or PIT IRQs as their tick source see “more time” passing per actual second of CPU work. Their internal gametic advances faster, enemies move faster, demos play faster. Audio (clocked off SCLK independently) stays correct. That matches every reported symptom. The pinned test is test/acid/tests/timing/vblank_60hz_exact.s.

GPU/DSP control-register read shadowing. GPUReadLong (gpu.c:338-342) intercepts long-aligned reads in the $F02100..$F0211F range as register-bank reads before checking the control-RAM range, so 68K long reads of G_PC, G_CTRL, and G_FLAGS return wrong data. The DSP has the same shape. Caught by gpu_basic_run.s and dsp_basic_run.s.

BlitterMidsummer2 hangs forever on 1bpp / 2bpp blits. The hang reproduces for inner counts 4, 16, 64, and 256. test/acid/tests/blitter/copy_pix2_phrase.s and a sibling are checked in as deliberate FAIL placeholders that document the hang. Once the hang is fixed, they’ll be replaced with real assertions.

DSP IRQ delivery to 68K is broken. The JERRY pending bit gets set when DSP raises a CPU IRQ. The 68K never enters its handler at autovector $68. Real path bug.

DIVL zero-divide trap doesn’t fire. The 68020 MULL/DIVL HLE path that landed in v2.2.0 handles the math correctly but doesn’t trap on divisor == 0.

Narrow-pixel blitter copies pick the wrong byte per pixel in partial pixel-mode copies at 1/2/4 bpp.

Six real bugs surfaced by infrastructure that didn’t exist a few weeks ago. None of them got “fixed” in this PR; that’s not what the PR is for. The PR is the gate.

The numbers

Category   Tests   Pass
smoke          1      1
memory        10     10
timing        13     10
irq            7      6
blitter       35     27
gpu           18     17
dsp           21     20
op            10      9
bus            5      2
hle            6      6
quirks        11     10
stress         3      3
perf           3      3
Total        142    122

Updated 2026-05-03: PRs #135 + #136 (merged) and #139 (open) recover 8 more tests via test-side fixes. Pending #139’s merge, the rate is 130/142. See the postscript.

~1.5K lines of test code, plus ~700 lines of harness, scripts, and docs. Suite runtime 12 seconds.

The 30-minute-to-12-second drop is the fix in run.c: poll the ACID_RESULT word each frame, break out of the frame loop when it’s no longer zero. Tests that used to consume the full 600-frame budget waiting for the runner to give up now exit on the frame they actually complete on. One-line patch, ~150x speedup on the suite. The slow path was correct; it was just doing 600 frames of work per test for no reason.
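
The effect of that early exit is easy to model. The following is a Python model of the loop, not the C patch itself: the function name and the way per-frame results are passed in are assumptions, while the 600-frame budget and the "nonzero means done" rule come from the paragraph above.

```python
FRAME_BUDGET = 600  # per-test frame cap quoted above


def frames_consumed(result_by_frame: list[int], early_exit: bool) -> int:
    """Count the frames the runner executes for one test.

    result_by_frame holds the value of the ACID_RESULT word after each
    frame; 0 means the test has not written its signature yet.
    """
    frames = 0
    for result in result_by_frame[:FRAME_BUDGET]:
        frames += 1
        if early_exit and result != 0:
            break  # the one-line change: stop once a signature appears
    return frames


# A hypothetical test that writes its signature during frame 3.
trace = [0, 0, 1] + [1] * (FRAME_BUDGET - 3)
print(frames_consumed(trace, early_exit=False))  # 600
print(frames_consumed(trace, early_exit=True))   # 3

# Suite-level speedup from the figures in this post: 30 minutes vs 12 seconds.
print((30 * 60) / 12)                            # 150.0
```

Both paths read the same result word, so the verdict for a given test is unchanged; only the number of wasted frames differs.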

What’s next

The Doom timing test points at the fix path. Whoever writes the patch (me, a contributor, somebody who shows up out of nowhere on the libretro discord) will know they got it right when vblank_60hz_exact.s and the two adjacent timing tests flip from FAIL to PASS without any test changes. That’s the contract.

Same shape for the other five findings. Each one has a test pinned to it. The PR description in #130 has the full list.

If you maintain an emulator and you don’t have an acid suite yet, the cost is real but bounded. Two weeks of evening work got the first 50 tests written, plus the oracle and linter. Less per-test after that, because the patterns repeat. The marginal cost of test #143 is something like 20 minutes, most of which is reading the hardware doc to confirm what the expected bytes should be.

If the LLM-agent angle is the part that interests you: the agents were genuinely useful, and the work would have taken substantially longer without them. They were also wrong about specific things, and would have written confidently broken tests at scale if the linter hadn’t been there to reject the output. The shape of the win is “fast labor for a constrained, mechanically-checkable task,” not “judgment about whether a test is worth writing.”

PR is at github.com/libretro/virtualjaguar-libretro/pull/130.

Postscript (2026-05-03): 8 of the 19 failures were the tests, not the emulator

A day after this post went live, three follow-up PRs recovered 8 of the 19 originally-failing tests by fixing the test code. The emulator code paths in all three PRs were verified correct via tracing.

  • #135 (timing, 3 tests). Busy-loop cycle counts were wrong. Each test had a hand-counted estimate of how many 68K cycles its counter loop took per iteration, and each estimate was off by enough that the loop measured the wrong wall-clock window. vblank_60hz_exact assumed 10 cycles for a subq.l/bne.s pair; it’s 18. halfline_period_us undercounted by ~25% because UAE 68K’s MMIO timing for move.w abs.L,Dn charges extra bus cycles per access. pit_countdown_rate had the same bug plus stale arithmetic from before PR #134 fixed the actual PIT clock rate.
  • #136 (blitter + OP, 2 tests). bcompen_basic had a comment that ended in ? because I wasn’t sure whether PATDSEL should be set; the linter couldn’t help, since PATDSEL is a valid bit and was just missing. Without it, BCOMPEN’s mask gated which dest pixels got written but the data was the source byte itself, not the pattern. op_gpu_int_object was reading the GPU IRQ latch from the wrong register: gpu_flags (Z/N/C condition codes) instead of gpu_control (the IRQ latches live at offset +$14).
  • #139 (gpu/dsp, 3 tests). gpu_basic_run and dsp_basic_run had a slab-overflow bug: the tests filled GPU/DSP local RAM with NOPs and spun the 68K for 500 iterations, but the GPU runs at ~2x the 68K rate plus higher IPC, so PC walked off the end of the slab into RAM initialized with JaguarRand(). Random opcodes decoded as JUMP/JR with bogus targets, landing PC at addresses like 0x9F0E. Fix: bigger slabs (full 4KB / 8KB), shorter spin (20-50 iterations). dsp_irq_to_68k installed its handler at 68K-architectural autovector $68; on Jaguar, TOM/JERRY return user vector 64 ($100) on the data bus for all hardware IRQs, so the handler at $68 was never reached. Plus the test asserted that the handler ran AND that the JERRY pending bit was still set, but the handler explicitly ack’d the pending bit on entry, so the second assertion could never pass even with a correct IRQ chain.
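
Two of the numbers in that list can be checked by hand. The sketch below assumes the loop body is subq.l on a data register followed by a taken bne.s and uses the standard 68000 cycle counts for those forms; the post does not show the exact operands, and the emulator's UAE core may charge differently elsewhere. The vector arithmetic is the ordinary 68000 rule that a vector's address is its number times four.

```python
# Standard 68000 timings, assuming a data-register operand and a taken branch.
SUBQ_L_DN = 8       # subq.l #imm,Dn
BNE_S_TAKEN = 10    # bne.s, branch taken

assumed_cycles = 10                      # what vblank_60hz_exact budgeted
actual_cycles = SUBQ_L_DN + BNE_S_TAKEN  # what the loop really costs

# The loop runs this many times longer than the test believed, so the test
# sees that many times more events per "expected" window.
apparent_ratio = actual_cycles / assumed_cycles
print(actual_cycles, apparent_ratio)     # 18 1.8


def vector_address(vector_number: int) -> int:
    """Address of a 68000 exception vector (4 bytes per entry)."""
    return vector_number * 4


LEVEL2_AUTOVECTOR = 24 + 2   # autovectors for levels 1-7 are vectors 25-31
JAGUAR_USER_VECTOR = 64      # vector number quoted in the #139 item

print(hex(vector_address(LEVEL2_AUTOVECTOR)))   # 0x68
print(hex(vector_address(JAGUAR_USER_VECTOR)))  # 0x100
```

An 18-versus-10 miscount gives a 1.8x apparent speedup, which sits inside the 1.7-2.0x band the timing tests reported.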

That’s 8 of 19. Three retractions to the “What the suite found” list above:

  • The GPU/DSP control-register read shadowing claim was wrong. Those two tests were failing on slab overflow, not on a read-shadowing bug. The emulator path was verified correct via tracing in #139.
  • The “DSP IRQ delivery to 68K is broken” claim was wrong. The IRQ chain works correctly. checkForIRQToHandle is consumed at the very next instruction in the same m68k_execute slice. A prior agent’s hypothesis about a deferred-IRQ-during-spin-loop bug was disproved by tracing.
  • The Doom event-clock divergence framing was wrong. The 1.7-2.0x ratio I reported was the timing tests’ wrong busy-loop cycle estimate (10 vs the actual 18), not the event clock running fast. Whether the Doom-plays-1.5x-too-fast bug has a different root cause is now an open question; vblank_60hz_exact.s no longer points at it.

What’s left: the DIVL zero-divide trap, the narrow-pixel blitter copies, and the BlitterMidsummer2 hang are still real emulator bugs. PR #134 (separate work, before this writeup landed) did fix a real PIT clock bug surfaced by pit_countdown_rate. So 4 of the 19 confirmed real, 8 confirmed test bugs, 7 still pending diagnosis.

The original post made the right point about “tests are software too” but understated it. The full picture: the failure rate on this suite isn’t 100% signal. As of today, 4 of the 19 failures (~21%) are confirmed real and 8 (~42%) are confirmed test bugs, with the remaining 7 pending diagnosis; among the 12 diagnosed so far, that is 1 in 3 real. The oracle-and-linter rails catch one class of test bug (wrong bit positions, wrong MMIO addresses) but not these:

  • Cycle-count math. Counting cycles from the M68K manual is approximate; UAE has its own quirks. The linter can’t know what wall-clock window your busy loop actually produces.
  • Wrong-register reads. Both gpu_flags and gpu_control are valid register names with valid offsets. The linter can’t know which one your test intends to read.
  • Slab sizing. Whether your NOP slab is big enough for your spin window is a runtime property of the target chip’s execution rate, not a syntactic property of the test.
  • Vector convention. Both $68 (68K autovector) and $100 (Jaguar HW IACK return) are valid 68K vectors. The linter can’t know which the platform uses.
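
The slab-sizing item is the easiest of these to put numbers on. This is a back-of-envelope model, not something the suite runs: it assumes 18 cycles per 68K spin iteration (borrowed from the timing-test loop, since the actual spin loop is not shown), a 2x clock ratio, and one NOP retired per GPU cycle. The only hard fact is that a GPU or DSP NOP is a 16-bit instruction.

```python
GPU_INSN_BYTES = 2  # Jaguar GPU/DSP instructions are 16 bits wide


def slab_holds_spin(
    slab_bytes: int,
    spin_iterations: int,
    cycles_per_iteration: int = 18,  # assumption: same cost as the timing loop
    clock_ratio: float = 2.0,        # "~2x the 68K rate" from the #139 item
    ipc: float = 1.0,                # assumption: one NOP per GPU cycle
) -> bool:
    """True if the NOP slab outlasts the 68K spin window."""
    nops_available = slab_bytes // GPU_INSN_BYTES
    nops_executed = spin_iterations * cycles_per_iteration * clock_ratio * ipc
    return nops_executed <= nops_available


print(slab_holds_spin(4096, 500))  # False: PC runs off the end of the slab
print(slab_holds_spin(4096, 50))   # True
print(slab_holds_spin(8192, 50))   # True
```

Under these assumptions even a full 4KB slab cannot survive a 500-iteration spin, which fits the fix needing both a bigger slab and a shorter spin. None of this is visible to a linter that only reads the source text.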

GitHub Copilot’s PR review caught real issues across all three follow-up PRs: comment vs. code mismatches, contradictory headers (file says ±5% tolerance, code enforces ±10%), command-bit breakdowns inconsistent with the actual literal value. Each got addressed in a follow-up commit. The original post implied the Copilot save was a one-time event; it isn’t, it’s structural. Layered review (linter, then Copilot, then a human PR pass) catches more than any single layer.

None of this changes the architecture. It calibrates the expectation: when you’re writing acid tests in assembly for a platform you don’t have hardware for, the first ~40% of FAIL results in any new bucket should be treated as test bugs until proven otherwise. The path forward is the same as it was. Each FAIL is a checked-in claim: retract or fix the test in PR review when the claim is wrong, and fix the emulator when the claim is right.

PRs: #135 (merged), #136 (merged), #139 (open).