journal 2026-05-04

A repeatable benchmark suite for chip revisions

Tags: bench, freertos, linux, p64, infrastructure

We’ve got a chart progression story (P59 → P61 → P62 → P63 → P64), but every column on it comes from boot milestones in a Linux kernel run. That’s a fine story — it does measure end-to-end CPU + paging + memory performance — but it’s noisy, expensive, and currently broken (the kernel parks in CRNG silent phase before reaching Run /init). We need benchmarks that are smaller, cheaper, and stable across revisions.

So this commit adds a benchmark suite to projects/64_prefetch_under_wb/bench/ with two flavours: a FreeRTOS bench (bench/freertos/) and a Linux bench (bench/linux/).

A new scripts/bench_run.py (using uv) drives either simulator, parses BENCH lines from the UART transcript, and writes a benchmark.json with the same shape the site already ingests for charts.
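As a rough illustration of the parse step, here is a minimal Python sketch. The `BENCH key=value` line format, the name `parse_bench_lines`, and the flat JSON layout are all assumptions made for illustration; this entry does not show the actual transcript format or the benchmark.json shape that scripts/bench_run.py uses. The sample values are made up, not measurements.

```python
import json
import re

# Assumed line format: "BENCH <key>=<number>". The real format may differ.
BENCH_RE = re.compile(r"^BENCH\s+(\w+)=(-?\d+(?:\.\d+)?)\s*$")


def parse_bench_lines(transcript: str) -> dict:
    """Collect key/value pairs from BENCH lines, ignoring other UART output."""
    results = {}
    for line in transcript.splitlines():
        match = BENCH_RE.match(line.strip())  # strip() also drops a trailing \r
        if match:
            key, raw = match.group(1), match.group(2)
            results[key] = float(raw) if "." in raw else int(raw)
    return results


# Synthetic transcript with invented numbers, for illustration only.
sample = """\
booting...
BENCH ctxswitch_cycles=100
some unrelated log line
BENCH alu_loop_iters_per_kcyc=12.5
"""

parsed = parse_bench_lines(sample)
print(json.dumps(parsed, indent=2, sort_keys=True))
```

Anchoring the pattern at the start of the line keeps boot chatter that merely mentions the word BENCH from being picked up as a result.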

What’s measured

FreeRTOS bench, all in core clocks (because mtime increments by 1 per cycle on this chip):

| key | what |
| --- | --- |
| ctxswitch_cycles | avg cycles per context switch (semaphore ping-pong) |
| semaphore_rt_cycles | uncontended give+take round trip (no scheduler call) |
| tick_overhead_cycles | per-tick overhead vs. ideal vTaskDelay |
| alu_loop_iters_per_kcyc | xorshift32 iterations per 1000 cycles |
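Because mtime advances once per core cycle, an mtime delta can be read directly as a cycle count, and the keys above reduce to simple ratios. The sketch below shows that arithmetic in Python; the helper names and the use of integer division are assumptions, and the real bench code may round differently.

```python
def avg_cycles(total_cycles: int, events: int) -> int:
    """Average cycles per event, e.g. per context switch."""
    if events <= 0:
        raise ValueError("events must be positive")
    return total_cycles // events


def per_kcycle(count: int, total_cycles: int) -> int:
    """Events per 1000 cycles, e.g. loop iterations per kilocycle."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    # Multiply first so integer division does not throw away precision.
    return count * 1000 // total_cycles


# Invented numbers, for illustration only.
print(avg_cycles(250_000, 1000))   # 250 cycles per event
print(per_kcycle(10_000, 80_000))  # 125 iterations per 1000 cycles
```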

Linux bench, also in core clocks:

| key | what |
| --- | --- |
| memcpy_bytes_per_kcyc | bytes per 1000 cycles for a naive 4 KiB word memcpy |
| syscall_rt_ns_avg | avg cycles per getpid() round trip (reported in cycles despite the ns in the key name) |
| pagefault_cycles_avg | avg cycles per first-touch fault on anonymous mmap |
| alu_loop_iters_per_kcyc | the same xorshift32 loop as the FreeRTOS bench |

Sharing the ALU baseline across both flavours is deliberate — it gives us a CPU-bound workload that should match between FreeRTOS and Linux (modulo trap entry / privilege-mode cost) once both run end-to-end. That cross-check is one of the things the suite is supposed to expose.
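For readers unfamiliar with the shared workload, here is a minimal Python model of an xorshift32 loop. It assumes the classic 13/17/5 shift triple and a seed of 1; this entry does not state which shifts or seed the bench uses, and the real loop is written in C, so treat this only as a reference for the arithmetic.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound


def xorshift32(state: int) -> int:
    """One xorshift32 step using the classic 13/17/5 shifts (an assumption)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def run_alu_loop(seed: int, iterations: int) -> int:
    """Chain the steps so each iteration depends on the previous result."""
    state = seed
    for _ in range(iterations):
        state = xorshift32(state)
    return state


print(xorshift32(1))  # 270369
```

The loop is pure register arithmetic with a serial dependency and no memory traffic, which is why it makes a reasonable CPU-bound baseline to compare across the two flavours.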

What ran (status, per CLAUDE.md honesty rules)

| flavour | builds | tb elaborates | end-to-end run |
| --- | --- | --- | --- |
| FreeRTOS | yes | yes (iverilog -t null) | NOT RUN |
| Linux | yes | n/a (reuses P60 tb) | NOT RUN, blocked: CRNG silent phase |
| scripts/bench_run.py | n/a | n/a | parse-replay tested on synthetic stdin (PASS) |

bench/freertos/app/main.c was syntax-checked with riscv64-elf-gcc against stub FreeRTOS headers and compiles clean. bench/linux/bench.c was compiled with riscv64-elf-gcc -march=rv32ima_zicsr -mabi=ilp32 to a .o (UNKNOWN whether glibc-flavour riscv64-unknown-linux-gnu-gcc from the flake produces a smaller static ELF — that part of the toolchain isn’t on PATH outside the nix shell on this host). No simulation was actually run end-to-end; the chart-progression value of this suite is in being repeatable, not in the numbers from this one commit.

Known gaps / TODO

Why this rung, in plain language

We’ve been hardening a chip for forty-odd projects, and the only thing that answers “did it get faster?” is whether the kernel boot prints land sooner. That’s a decent integration test, but it’s a dull instrument for measuring small wins. The FreeRTOS bench gives us four numbers that should change in known directions when the next pipeline / cache / branch-predictor / ALU upgrade lands.

When P65 ships, we run make bench-json on both P64 and P65 and the table tells us if it was a win, a wash, or a regression. That’s the artifact this commit adds.
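The win / wash / regression call could be sketched in Python as follows. Everything here is hypothetical: the function names, the 2% tolerance, and the rule that keys ending in `_per_kcyc` are higher-is-better while all other keys are lower-is-better. The numbers are invented, since no end-to-end run has produced real ones yet.

```python
def higher_is_better(key: str) -> bool:
    # Assumed convention: throughput keys end in "_per_kcyc";
    # everything else is a cost (cycles), where lower is better.
    return key.endswith("_per_kcyc")


def classify(key: str, old: float, new: float, tolerance: float = 0.02) -> str:
    """Label one metric as 'win', 'wash', or 'regression'."""
    if old == 0:
        raise ValueError(f"{key}: old value is zero, cannot compare")
    change = (new - old) / old
    if not higher_is_better(key):
        change = -change  # a drop in cycles counts as an improvement
    if change > tolerance:
        return "win"
    if change < -tolerance:
        return "regression"
    return "wash"


def compare(old_results: dict, new_results: dict) -> dict:
    """Classify every key present in both result sets."""
    shared = sorted(old_results.keys() & new_results.keys())
    return {k: classify(k, old_results[k], new_results[k]) for k in shared}


# Invented values standing in for two revisions' results.
old_rev = {"ctxswitch_cycles": 200, "alu_loop_iters_per_kcyc": 100, "tick_overhead_cycles": 50}
new_rev = {"ctxswitch_cycles": 180, "alu_loop_iters_per_kcyc": 100, "tick_overhead_cycles": 60}
verdicts = compare(old_rev, new_rev)
for key, verdict in verdicts.items():
    print(f"{key}: {verdict}")
```

A tolerance band is worth keeping even for cycle-exact simulation, because it separates changes worth investigating from one-cycle shuffles.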