A repeatable benchmark suite for chip revisions

We’ve got a chart progression story (P59 → P61 → P62 → P63 → P64), but every column on it comes from boot milestones in a Linux kernel run. That’s a fine story, since it does measure end-to-end CPU + paging + memory performance, but it’s noisy, expensive, and currently broken (the kernel parks in the CRNG silent phase before reaching Run /init). We need benchmarks that are smaller, cheaper, and stable across revisions.

So this commit adds a benchmark suite to projects/64_prefetch_under_wb/bench/ with two flavours:

FreeRTOS micro-benchmarks — bare-metal-RTOS workloads that build on top of P43’s FreeRTOS port. Same chip, same timer, same MMIO. The app prints BENCH key=value lines on UART and halts cleanly via MMIO_HALT.
Linux userspace micro-benchmarks — static RV32 ELFs that drop into the existing initramfs in place of userspace/hello. Same flow as P60.

A new scripts/bench_run.py (using uv) drives either simulator, parses BENCH lines from the UART transcript, and writes a benchmark.json with the same shape the site already ingests for charts.
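As a rough illustration of the parsing step, here is a minimal sketch (not the actual scripts/bench_run.py). It assumes each result appears on its own line as `BENCH <key>=<integer>`. The function name, the regex, and the flat JSON dict are all hypothetical; the real schema the site ingests is not reproduced here, and the sample values are synthetic placeholders, not measurements.

```python
import json
import re

# Assumed line shape: "BENCH <key>=<integer>". The real format may differ.
BENCH_RE = re.compile(r"^BENCH\s+(\w+)=(-?\d+)$")


def parse_bench_lines(transcript: str) -> dict[str, int]:
    """Collect BENCH key=value pairs from a UART transcript, skipping other output."""
    results: dict[str, int] = {}
    for line in transcript.splitlines():  # splitlines() also handles \r\n from a UART
        match = BENCH_RE.match(line.strip())
        if match:
            results[match.group(1)] = int(match.group(2))
    return results


if __name__ == "__main__":
    # Synthetic transcript for illustration only; the values are not measurements.
    sample = "boot banner\nBENCH ctxswitch_cycles=111\nnoise\nBENCH alu_loop_iters_per_kcyc=222\n"
    print(json.dumps(parse_bench_lines(sample), indent=2, sort_keys=True))
```

Anchoring the pattern to the whole line means unrelated boot chatter that happens to contain the word BENCH is ignored rather than half-parsed.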

What’s measured

FreeRTOS bench, all in core clocks (because mtime increments by 1 per cycle on this chip):

key	what
`ctxswitch_cycles`	avg cycles per context switch (semaphore ping-pong)
`semaphore_rt_cycles`	uncontended give+take round trip (no scheduler call)
`tick_overhead_cycles`	per-tick overhead vs. ideal `vTaskDelay`
`alu_loop_iters_per_kcyc`	xorshift32 iterations per 1000 cycles

Linux bench, also in core clocks:

key	what
`memcpy_bytes_per_kcyc`	bytes/1000-cycles for a naive 4 KiB word memcpy
`syscall_rt_ns_avg`	avg cycles per `getpid()` round trip (cycles, despite the `ns` in the name)
`pagefault_cycles_avg`	avg cycles per first-touch fault on anonymous mmap
`alu_loop_iters_per_kcyc`	the same xorshift32 loop as the FreeRTOS bench

Sharing the ALU baseline across both flavours is deliberate — it gives us a CPU-bound workload that should match between FreeRTOS and Linux (modulo trap entry / privilege-mode cost) once both run end-to-end. That cross-check is one of the things the suite is supposed to expose.
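To make the shared workload concrete, here is a Python model of one xorshift32 step and of the iterations-per-1000-cycles arithmetic. This is a sketch under assumptions: it uses Marsaglia's common 13/17/5 shift triple, which may not be the constants the bench apps use, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # keep Python's unbounded ints to 32 bits, like a uint32_t


def xorshift32(state: int) -> int:
    """One xorshift32 step using the 13/17/5 shift triple (assumed constants)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def iters_per_kcyc(iterations: int, elapsed_cycles: int) -> int:
    """Iterations per 1000 cycles, using integer arithmetic only."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    return (iterations * 1000) // elapsed_cycles
```

Scaling by 1000 before the division keeps useful resolution without floating point, which is presumably why the key is expressed per kilocycle rather than per cycle.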

What ran (status, per CLAUDE.md honesty rules)

flavour	builds	tb elaborates	end-to-end run
FreeRTOS	yes	yes (iverilog `-t null`)	`NOT RUN`
Linux	yes	n/a (reuses P60 tb)	`NOT RUN`, blocked: CRNG silent phase
`scripts/bench_run.py`	n/a	n/a	parse-replay tested on synthetic stdin (`PASS`)

bench/freertos/app/main.c was compiled with riscv64-elf-gcc against stub FreeRTOS headers as a syntax check, and it compiles clean. bench/linux/bench.c was compiled to a .o with riscv64-elf-gcc -march=rv32ima_zicsr -mabi=ilp32. It is UNKNOWN whether the glibc-flavour riscv64-unknown-linux-gnu-gcc from the flake produces a smaller static ELF, because that part of the toolchain isn’t on PATH outside the nix shell on this host. No simulation was run end-to-end; the chart-progression value of this suite is in being repeatable, not in the numbers from this one commit.

Known gaps / TODO

Linux bench is gated on the kernel actually booting. The current state (as of 2026-05-04, the same place we left P58/P59) is that the kernel prints Linux version, runs through architecture init, and parks somewhere in kernel/random.c’s BLAKE2s/CRNG init. PC samples land at 0xc015c2** and never advance. Both verilator and iverilog reproduce this. The Linux bench is wired so it’ll Just Work once that’s resolved; until then, the FreeRTOS bench is the honest comparison metric.
syscall_rt_ns_avg is misnamed. The value is cycles, not nanoseconds. The name keeps the LMbench label so charts look familiar; an honest cleanup would be syscall_rt_cycles. Filed as a TODO in bench/README.md. We can rename once the suite has actually produced output and we know what naming pairs read well.
No FreeRTOS-on-P64-RTL run yet. The Makefile defaults to P43’s src/top.sv because it’s the proven FreeRTOS host and elaborates faster. Setting SRC=../../../src/top.sv should run the same image on P64’s bigger Sv32+S-mode RTL — both expose the identical MMIO map. Worth verifying once we have baseline numbers from P43.
No bench dashboard yet. Each project’s benchmark.json will drop alongside the existing one in test/benchmark.json. The site’s <MilestoneCompare> / <CpiCompare> components were built for Linux-boot benchmarks; they’ll need a new sibling component to plot bench values across revs, or the existing ones can be widened. Not in this commit.
rdcycle vs. rdtime in U-mode. The Linux bench uses rdtime (counter 1) because mcounteren.TM is set by stage-0. If a future bring-up zeroes mcounteren, U-mode rdtime traps and the bench silently produces nonsense (it’ll print zeros). A comment in bench.c flags this and points at the SYS_clock_gettime fallback path, which is coded but not wired in.
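One cheap defence against the silent-zeros failure described in the last item is a sanity check on the host side. The sketch below is hypothetical and not part of this commit: it takes a dict of parsed BENCH values and lists any that are zero or negative, so a runner could fail loudly instead of recording them.

```python
def suspicious_keys(results: dict[str, int]) -> list[str]:
    """Return keys whose value is zero or negative, sorted for stable output."""
    return sorted(key for key, value in results.items() if value <= 0)


def require_plausible(results: dict[str, int]) -> None:
    """Raise if any metric looks like a broken timer rather than a measurement."""
    bad = suspicious_keys(results)
    if bad:
        raise ValueError("implausible benchmark values for: " + ", ".join(bad))
```

A zero is never a legitimate result for any of the keys listed earlier, so rejecting it costs nothing and catches the case where the timer read returned garbage.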

Why this rung, in plain language

We’ve been hardening a chip for forty-odd projects and the only thing that answers “did it get faster?” is whether the kernel boot prints land sooner. That’s a decent integration test, but it’s a blunt instrument for measuring small wins. The FreeRTOS bench gives us four numbers that should change in known directions when the next pipeline / cache / branch-predictor / ALU upgrade lands:

ctxswitch → drops with faster fetch and faster CSR access
semaphore_rt → drops with cheaper branch + load forwarding
tick_overhead → drops with cheaper trap entry
alu_loop → climbs with anything that improves CPI on ALU-bound code

When P65 ships, we run make bench-json on both P64 and P65 and the table tells us if it was a win, a wash, or a regression. That’s the artifact this commit adds.
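The win / wash / regression call can be sketched as a small comparison over two result dicts. This is an illustration only: the 2% tolerance band and the function names are assumptions, and the direction table simply encodes the four expectations listed above.

```python
# True means a lower value is better, per the expected directions listed above.
LOWER_IS_BETTER = {
    "ctxswitch_cycles": True,
    "semaphore_rt_cycles": True,
    "tick_overhead_cycles": True,
    "alu_loop_iters_per_kcyc": False,
}


def classify(key: str, old: int, new: int, tolerance: float = 0.02) -> str:
    """Return 'win', 'wash', or 'regression' for one metric.

    tolerance is a hypothetical relative noise band, not a measured figure.
    """
    if old == 0:
        raise ValueError(f"baseline for {key} is zero")
    change = (new - old) / old
    if abs(change) <= tolerance:
        return "wash"
    improved = change < 0 if LOWER_IS_BETTER[key] else change > 0
    return "win" if improved else "regression"


def compare(old_results: dict[str, int], new_results: dict[str, int]) -> dict[str, str]:
    """Classify every known metric that is present in both runs."""
    return {
        key: classify(key, old_results[key], new_results[key])
        for key in LOWER_IS_BETTER
        if key in old_results and key in new_results
    }
```

Keeping the direction per key in one table avoids the easy mistake of treating a rising `alu_loop_iters_per_kcyc` as a regression.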