We’ve got a chart progression story (P59 → P61 → P62 → P63 → P64), but
every column on it comes from boot milestones in a Linux kernel run.
That’s a fine story — it does measure end-to-end CPU + paging + memory
performance — but it’s noisy, expensive, and currently broken (the
kernel parks in CRNG silent phase before reaching Run /init). We need
benchmarks that are smaller, cheaper, and stable across revisions.
So this commit adds a benchmark suite to `projects/64_prefetch_under_wb/bench/`
with two flavours:

- FreeRTOS micro-benchmarks: bare-metal-RTOS workloads that build
  on top of P43’s FreeRTOS port. Same chip, same timer, same MMIO. The
  app prints `BENCH key=value` lines on UART and halts cleanly via
  `MMIO_HALT`.
- Linux userspace micro-benchmarks: static RV32 ELFs that drop
  into the existing initramfs in place of `userspace/hello`. Same flow
  as P60.

A new scripts/bench_run.py (using uv) drives either simulator,
parses BENCH lines from the UART transcript, and writes a
benchmark.json with the same shape the site already ingests for
charts.
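The parsing step can be sketched roughly as follows. This is not the contents of `scripts/bench_run.py`: the function name `parse_bench_lines`, the integer-only value format, and the flat output dictionary are all assumptions made for illustration, and the sample values are placeholders rather than measured results. The real `benchmark.json` shape is whatever the site already ingests and is not reproduced here.

```python
import json
import re

# Assumed line format: "BENCH key=value" with an integer value, one metric per line.
BENCH_RE = re.compile(r"^BENCH\s+(\w+)=(-?\d+)\s*$")


def parse_bench_lines(transcript: str) -> dict[str, int]:
    """Collect BENCH key=value pairs from a UART transcript, ignoring other output."""
    metrics: dict[str, int] = {}
    for line in transcript.splitlines():
        match = BENCH_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = int(match.group(2))
    return metrics


# Placeholder transcript; the numbers are made up, not results from any run.
sample = """\
FreeRTOS bench starting
BENCH ctxswitch_cycles=100
BENCH alu_loop_iters_per_kcyc=200
some unrelated boot noise
"""
parsed = parse_bench_lines(sample)
print(json.dumps(parsed, indent=2, sort_keys=True))
```

Anchoring the pattern to the start of the line keeps kernel or RTOS log output that merely mentions `BENCH` from being picked up as a metric.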
What’s measured
FreeRTOS bench, all in core clocks (because mtime increments by 1
per cycle on this chip):
| key | what |
|---|---|
| `ctxswitch_cycles` | avg cycles per context switch (semaphore ping-pong) |
| `semaphore_rt_cycles` | uncontended give+take round trip (no scheduler call) |
| `tick_overhead_cycles` | per-tick overhead vs. ideal `vTaskDelay` |
| `alu_loop_iters_per_kcyc` | xorshift32 iterations per 1000 cycles |

Linux bench, also in core clocks:
| key | what |
|---|---|
| `memcpy_bytes_per_kcyc` | bytes/1000-cycles for a naive 4 KiB word memcpy |
| `syscall_rt_ns_avg` | avg cycles per `getpid()` round trip |
| `pagefault_cycles_avg` | avg cycles per first-touch fault on anonymous mmap |
| `alu_loop_iters_per_kcyc` | the same xorshift32 loop as the FreeRTOS bench |

Sharing the ALU baseline across both flavours is deliberate — it gives us a CPU-bound workload that should match between FreeRTOS and Linux (modulo trap entry / privilege-mode cost) once both run end-to-end. That cross-check is one of the things the suite is supposed to expose.
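To make the `alu_loop_iters_per_kcyc` unit concrete, here is a host-side sketch of the arithmetic. The actual benchmark loop is C running on the target; this Python version only illustrates the idea. The 13/17/5 shift triple is the commonly used xorshift32 variant and is an assumption about what the bench uses, as is the floor-division rounding. The iteration and cycle counts in the example are placeholders, not measurements.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(x: int) -> int:
    """One xorshift32 step using the 13/17/5 shift triple (assumed variant)."""
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x & MASK32


def iters_per_kcyc(iterations: int, elapsed_cycles: int) -> int:
    """Iterations completed per 1000 cycles, using integer arithmetic."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    return iterations * 1000 // elapsed_cycles


# Placeholder numbers: 10,000 iterations over 40,000 cycles.
print(iters_per_kcyc(10_000, 40_000))
```

Multiplying before dividing keeps precision without floating point, which matters if the same formula runs on a target with no FPU.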
What ran (status, per CLAUDE.md honesty rules)
| flavour | builds | tb elaborates | end-to-end run |
|---|---|---|---|
| FreeRTOS | yes | yes (iverilog -t null) | NOT RUN |
| Linux | yes | n/a (reuses P60 tb) | NOT RUN, blocked: CRNG silent phase |
| `scripts/bench_run.py` | n/a | n/a | parse-replay tested on synthetic stdin (PASS) |

bench/freertos/app/main.c was syntax-checked with riscv64-elf-gcc
against stub FreeRTOS headers and compiles clean. bench/linux/bench.c
was compiled with riscv64-elf-gcc -march=rv32ima_zicsr -mabi=ilp32
to a .o (UNKNOWN whether glibc-flavour
riscv64-unknown-linux-gnu-gcc from the flake produces a smaller
static ELF — that part of the toolchain isn’t on PATH outside the nix
shell on this host). No simulation was actually run end-to-end; the
chart-progression value of this suite is in being repeatable, not in
the numbers from this one commit.
Known gaps / TODO

- Linux bench is gated on the kernel actually booting. The current state (per `2026-05-04`, the same place we left P58/P59) is that the kernel prints `Linux version`, runs through architecture init, and parks somewhere in `kernel/random.c`’s BLAKE2s/CRNG init. PC samples land at `0xc015c2**` and never advance. Both verilator and iverilog reproduce this. The Linux bench is wired so it’ll Just Work once that’s resolved; until then, the FreeRTOS bench is the honest comparison metric.
- `syscall_rt_ns_avg` is misnamed. The value is cycles, not nanoseconds. The name keeps the LMbench label so charts look familiar; an honest cleanup would be `syscall_rt_cycles`. Filed as a TODO in `bench/README.md`. We can rename once the suite has actually produced output and we know what naming pairs read well.
- No FreeRTOS-on-P64-RTL run yet. The Makefile defaults to P43’s `src/top.sv` because it’s the proven FreeRTOS host and elaborates faster. Setting `SRC=../../../src/top.sv` should run the same image on P64’s bigger Sv32+S-mode RTL; both expose the identical MMIO map. Worth verifying once we have baseline numbers from P43.
- No bench dashboard yet. Each project’s `benchmark.json` will drop alongside the existing one in `test/benchmark.json`. The site’s `<MilestoneCompare>` / `<CpiCompare>` components were built for Linux-boot benchmarks; they’ll need a new sibling component to plot bench values across revs, or the existing ones can be widened. Not in this commit.
- `rdcycle` vs. `rdtime` in U-mode. The Linux bench uses `rdtime` (counter 1) because `mcounteren.TM` is set by stage-0. If a future bring-up zeroes `mcounteren`, U-mode `rdtime` traps and the bench silently produces nonsense (it’ll print zeros). A comment in `bench.c` flags this and points at the `SYS_clock_gettime` fallback path that is coded but not wired in.

Why this rung, in plain language
We’ve been hardening a chip for forty-odd projects and the only thing that answers “did it get faster?” is whether the kernel boot prints land sooner. That’s a decent integration test, but it’s a blunt instrument for measuring small wins. The FreeRTOS bench gives us four numbers that should change in known directions when the next pipeline / cache / branch-predictor / ALU upgrade lands:

- `ctxswitch` → drops with faster fetch and faster CSR access
- `semaphore_rt` → drops with cheaper branch + load forwarding
- `tick_overhead` → drops with cheaper trap entry
- `alu_loop` → climbs with anything that improves CPI on ALU-bound code

When P65 ships, we run make bench-json on both P64 and P65 and the
table tells us if it was a win, a wash, or a regression. That’s the
artifact this commit adds.
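
A minimal sketch of how such a win / wash / regression verdict could be computed from two sets of parsed metrics. Nothing here is part of the commit: the function names, the flat `key -> value` dictionaries, and the 2% tolerance band are assumptions for illustration, and the sample values are placeholders. The improvement direction per metric follows the expectations listed above (cycle counts should drop, `alu_loop_iters_per_kcyc` should climb).

```python
# Metrics where a larger value is an improvement; everything else is
# treated as a cycle count, where smaller is better.
HIGHER_IS_BETTER = {"alu_loop_iters_per_kcyc"}


def classify(key: str, old: float, new: float, tolerance: float = 0.02) -> str:
    """Return 'win', 'wash', or 'regression' for one metric.

    The 2% default tolerance is an arbitrary placeholder, not a measured noise floor.
    """
    if old <= 0:
        raise ValueError(f"{key}: baseline must be positive, got {old}")
    change = (new - old) / old
    if abs(change) <= tolerance:
        return "wash"
    improved = change > 0 if key in HIGHER_IS_BETTER else change < 0
    return "win" if improved else "regression"


def compare(old_run: dict[str, float], new_run: dict[str, float]) -> dict[str, str]:
    """Classify every metric present in both runs, in sorted key order."""
    shared = sorted(old_run.keys() & new_run.keys())
    return {key: classify(key, old_run[key], new_run[key]) for key in shared}


# Placeholder values, not results from P64 or P65.
p64 = {"ctxswitch_cycles": 100, "tick_overhead_cycles": 50, "alu_loop_iters_per_kcyc": 200}
p65 = {"ctxswitch_cycles": 90, "tick_overhead_cycles": 50, "alu_loop_iters_per_kcyc": 180}
verdicts = compare(p64, p65)
for key, verdict in verdicts.items():
    print(f"{key}: {verdict}")
```

Rejecting a non-positive baseline is deliberate: a zero in the old run more likely means the counter read failed than that the operation was free.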