No. 64 / project of 147 on the ladder

Prefetch-under-writeback + Verilator dual-target sim (~270× faster iteration)

introduces — prefetch path overlapped with WB; second simulator (Verilator) sharing the chip RTL with iverilog

harden state: last run 2026-05-04
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P63 cut S_FETCH to one cycle on TLB hits. P64 takes the next bite: overlap the next fetch with the current writeback so back-to-back register-only instructions don’t pay a fetch cycle at all. The chip runs through the full Linux boot path to userspace and a clean panic. The bigger story on this rung, though, is the simulator upgrade.

milestone                                   cycles
Linux version                            1,078,147
Switched to clocksource                 12,603,459
Run /init as init                       89,432,784
userspace hello                         90,246,176
Attempted to kill init (clean panic)    90,505,484

That’s ~90.5M post-load cycles to userspace + return — 21 seconds of Verilator wall time at 4.7 MHz effective on this host. The same boot under iverilog takes ~50–60 minutes, and is currently the only way to cross-check chip behaviour.

The captured transcript (raw, benchmark.json) ends with:

===========================================
  hello from userspace on a homemade chip!
===========================================

  this binary:
    - cross-compiled for rv32ima/ilp32
    - statically linked, no libc
    - reaches the linux SYS_write syscall
    - is running as PID 1 from initramfs

  the chip below us:
    - 16 MiB DRAM, 25 MHz
    - rv32ima_zba, sv32 paging, m+s priv
    - SBI v0.1 firmware in stage-0

  exiting cleanly...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000

The “panic” line is the expected exit — when PID 1 returns, Linux calls panic. exitcode=0x00000000 confirms a clean userspace exit.

Boot milestone progression: P62 → P63 → P64

milestone progression · 3 runs · cycles
milestone               P62 Zbb rotates + U-mode    P63 fetch fast-path    P64 prefetch-under-WB    Δ (P64 vs P63)
“Linux version”                        1,611,173              1,346,691                1,078,147            −19.9%
clocksource switched                  18,426,917             15,530,502               12,603,459            −18.8%
“Run /init”                          130,054,364            109,755,383               89,432,784            −18.5%
userspace hello                      131,241,874            110,746,765               90,246,176            −18.5%
init exit (panic)                    131,610,501            111,063,721               90,505,484            −18.5%

P64 reaches userspace hello in 90.25M cycles versus P63’s 110.75M: about 18.5% fewer cycles to userspace and clean panic, on the same kernel image and the same boot blob. The savings stack: P63’s fetch fast-path saved ~16% over P62; P64’s prefetch-under-WB saves another ~18.5% over P63.
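
The percentage quoted for a milestone can be recomputed from its cycle counts. The sketch below does that for the userspace-hello milestone; the helper name percent_saved and the constant names are illustrative, not part of the project's tooling.

```cpp
#include <cassert>
#include <cstdint>

// Percentage of cycles saved when a milestone moves from `before` to `after`.
// (Hypothetical helper, named here only for illustration.)
inline double percent_saved(std::uint64_t before, std::uint64_t after) {
    assert(before > 0 && after <= before);
    return 100.0 * static_cast<double>(before - after) /
           static_cast<double>(before);
}

// "userspace hello" cycle counts copied from the milestone figures.
constexpr std::uint64_t kP63UserspaceHello = 110'746'765;
constexpr std::uint64_t kP64UserspaceHello = 90'246'176;
```

Calling percent_saved(kP63UserspaceHello, kP64UserspaceHello) gives roughly 18.5, matching the P64-vs-P63 figure above.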

CPI comparison

cpi compare · 3 runs · best 4.29 CPI · worst 6.24 CPI
run                          CPI     Δ
P62 Zbb rotates + U-mode     6.24    baseline
P63 fetch fast-path          5.27    −0.97 vs P62
P64 prefetch-under-WB        4.29    −0.98 vs P63

CPI is cycles per retired instruction; lower is better. Every run executes the same workload, so a CPI that is 50% lower means the boot takes half as many cycles.

P64 retires the same ~21M instructions as P63 in 92M cycles instead of 112M — CPI drops from 5.27 → 4.29 (~18.6% fewer cycles per instruction). That’s the prefetch-under-WB hypothesis paid out cleanly: when the pipeline can issue the next fetch during S_WB, the common-case “register-only” path no longer pays a fetch cycle at all.
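
As a sanity check on the CPI figures, here is a minimal sketch of the arithmetic. It assumes the "~21M instructions" figure equals the 21,437,063 S_DECODE cycles reported in the state breakdown (one decode cycle per retired instruction); that equivalence is an assumption, and the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Cycles per retired instruction.
inline double cpi(std::uint64_t cycles, std::uint64_t retired) {
    assert(retired > 0);
    return static_cast<double>(cycles) / static_cast<double>(retired);
}

// Relative drop from `before` to `after`, in percent.
inline double relative_drop_percent(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

constexpr std::uint64_t kP64Cycles = 92'000'000;
// ASSUMPTION: retired instructions == S_DECODE cycles from the breakdown.
constexpr std::uint64_t kP64Retired = 21'437'063;
```

cpi(kP64Cycles, kP64Retired) lands at about 4.29, and relative_drop_percent(5.27, 4.29) at about 18.6.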

Where the cycles went: state-cycle breakdown

state distribution · 3 runs · scale: % of post-load cycles
run                          total cycles
P62 Zbb rotates + U-mode      132,610,502
P63 fetch fast-path           112,063,722
P64 prefetch-under-WB          92,000,000

The single most striking change: S_FETCH cycles drop from ~21.8M (P63) to ~1.08M (P64), a ~95% reduction. P63 still spent one cycle in S_FETCH on every TLB hit; P64 collapses that into the previous instruction’s S_WB cycle whenever the prefetch is allowed to fire (TLB hit + no halt + no pending interrupt + a non-jump retirement). The fetch itself still happens; it is hidden inside S_WB rather than eliminated, so the breakdown shows an unchanged S_DECODE / S_EXECUTE / S_MEM / S_WB profile, just with fewer total cycles.
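
The firing condition in the parenthesis above can be written as a single predicate. This is a behavioural sketch in C++, not the RTL: the struct, its field names, and the helper functions are all hypothetical, and only the four conditions themselves come from the text.

```cpp
#include <cstddef>
#include <vector>

// What is known about the machine when an instruction reaches S_WB.
// All field names are hypothetical.
struct Retirement {
    bool next_fetch_tlb_hit;  // translation for the next PC hits in the TLB
    bool halt_requested;      // a halt is pending
    bool interrupt_pending;   // an interrupt is waiting to be taken
    bool is_jump;             // the retiring instruction redirects the PC
};

// Prefetch-under-WB may fire only when all four conditions line up.
constexpr bool prefetch_may_fire(const Retirement& r) {
    return r.next_fetch_tlb_hit && !r.halt_requested &&
           !r.interrupt_pending && !r.is_jump;
}

// Number of retirements in a trace whose following instruction still
// needs a standalone fetch because the prefetch could not fire.
inline std::size_t standalone_fetches(const std::vector<Retirement>& trace) {
    std::size_t n = 0;
    for (const Retirement& r : trace) {
        if (!prefetch_may_fire(r)) ++n;
    }
    return n;
}
```

On a trace dominated by register-only instructions with TLB hits, standalone_fetches stays small, which is the effect the S_FETCH count shows.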

state breakdown · 92,000,000 cycles · CPI 4.29
state        share        cycles
fetch         1.2%     1,084,166
decode       23.3%    21,437,063
execute      23.3%    21,437,063
mem          23.1%    21,293,966
walker        2.6%     2,375,026
writeback    23.3%    21,429,846
mul/div       3.2%     2,942,870

Walker activity

walker activity · flushes 6,156 · L1 hits 9,392 · L0 hits 1,182,817
translation type         total    TLB hit      walks
fetch                  538,486         0%    538,486
load/store/amo      11,397,613      94.3%    653,723

Walker behaviour is essentially unchanged from P63 — same fetch-miss / load-miss / store-miss profile, same megapage vs 4 KiB-page hit ratio. The prefetch path piggybacks on the walker’s existing TLB-hit shortcut; it doesn’t change the miss rate or the walk-completion behaviour.
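
The TLB hit percentages in the walker figures are consistent with a simple relation between translations and walks. The sketch below assumes every translation that misses the TLB triggers exactly one walk; that assumption and the names used are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// TLB hit rate in percent.
// ASSUMPTION: every miss causes exactly one walk.
inline double tlb_hit_percent(std::uint64_t translations, std::uint64_t walks) {
    assert(translations > 0 && walks <= translations);
    return 100.0 * static_cast<double>(translations - walks) /
           static_cast<double>(translations);
}

// Counts copied from the walker activity figures.
constexpr std::uint64_t kFetchTranslations = 538'486;
constexpr std::uint64_t kFetchWalks = 538'486;
constexpr std::uint64_t kLoadStoreTranslations = 11'397'613;
constexpr std::uint64_t kLoadStoreWalks = 653'723;
```

Under that assumption the load/store/amo row comes out at about 94.3% and the fetch row at 0%, matching the reported values.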

Where time goes inside the kernel

hot functions · 97,656 samples · one sample every 1,024 cycles
 #   function                    where        share    samples
 1   blake2s_compress_generic    kernel       14.7%     14,342
 2   machine_restart             kernel        9.4%      9,222
 3   memset                      kernel        6.8%      6,663
 4   memcpy                      kernel        6.8%      6,623
 5   format_decode               kernel          5%      4,874
 6   vsnprintf                   kernel        3.4%      3,286
 7   number                      kernel        2.4%      2,312
 8   vruntime_eligible           kernel        2.3%      2,214
 9   memcmp                      kernel        2.2%      2,132
10   __slab_alloc_node.isra.0    kernel        1.6%      1,567
11   strlen                      kernel        1.4%      1,403
12   avg_vruntime                kernel        1.3%      1,313
13   string_nocheck              kernel        1.1%      1,094
14   chacha_permute              kernel        1.1%      1,022
15   add_uevent_var              kernel          1%        924
16   (remaining)                 remaining    31.3%     30,561

blake2s_compress_generic and machine_restart dominate, which matches every prior rung: the kernel’s CRNG init does a lot of BLAKE2s rounds, and after PID 1 returns, machine_restart is where Linux spins waiting for a reboot we don’t implement. Compute-bound code (memset, memcpy, the OS-agnostic string helpers) sits behind those.

Headline: added a Verilator harness alongside the existing iverilog flow. Same chip RTL, two simulators. Effective speed went from ~23 kHz (iverilog interpreter) to ~6 MHz (Verilator native C++). That’s the difference between minutes and seconds per kernel-boot iteration.

Why two simulators

iverilog interprets SystemVerilog on every event. It’s perfectly correct but it’s slow — the P63 site charts came from a profile run that took 53 minutes to reach the boot panic. That round-trip is too long for the kind of “change one line, re-run, look at the trace” iteration we want for the pipeline work coming up (P65 3-stage, P66 5-stage with forwarding).

Verilator compiles the entire SystemVerilog design into C++ and links it against a hand-rolled harness. It runs natively. The same RTL — projects/64_prefetch_under_wb/src/top.sv, no forks, no ifdefs — produces a binary at verilator-obj/Vtop that simulates the chip at C++ speed.

simulator    binary                   effective clock    120M-cycle wall time
iverilog     tb_freertos_demo.vvp     ~23 kHz            ~90 minutes
verilator    Vtop                     ~6 MHz             ~20 seconds

That is roughly a 270× speedup on the same chip, on the same host, doing the same boot-blob workload. We had hoped for 100×; we’ll take 270×.
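
The wall-time and speedup figures follow from dividing the cycle count by the effective clock. Here is a minimal sketch using the rounded numbers from the simulator comparison; the function and constant names are illustrative.

```cpp
#include <cassert>

// Wall-clock seconds needed to simulate `cycles` at `effective_hz`.
inline double wall_seconds(double cycles, double effective_hz) {
    assert(effective_hz > 0.0);
    return cycles / effective_hz;
}

// Rounded figures from the simulator comparison.
constexpr double kBootCycles = 120e6;   // 120M-cycle boot
constexpr double kIverilogHz = 23e3;    // ~23 kHz effective
constexpr double kVerilatorHz = 6e6;    // ~6 MHz effective
```

With these inputs iverilog needs about 5,217 s (roughly 87 minutes) and Verilator 20 s. The ratio of the rounded clocks is about 261×, and the ratio of the rounded wall times (90 minutes to 20 seconds) is 270×, so "roughly 270×" holds either way.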

What’s new in the build

  • tools/verilator/Makefile.inc — per-project include that adds make verilator, make verilator-run, and make verilator-profile targets. Each project’s test/Makefile includes it once it’s defined SRC, BOOT_BLOB, BOOT_MEMH, and PROJECT_ID.
  • tools/verilator/harness.cpp — the C++ replacement for tb_freertos_demo.sv. Owns a 16 MiB memory model that mirrors p58_external_memory.sv exactly (size-aware byte/halfword alignment included — see “What broke and how we fixed it” below), drives the load-mode preload sequence, streams every UART TX byte to stdout, runs the milestone substring matchers in C++ (no $display overhead), and emits the same benchmark.json shape the iverilog flow does so site charts can ingest either.
  • The chip top RTL now exposes a small trap-debug bundle (dbg_trap_strobe, dbg_trap_cause, dbg_trap_epc, dbg_trap_tval, dbg_trap_to_priv), set inside enter_trap. The Verilator harness logs every trap when given +log_traps. iverilog’s tb_freertos_demo.sv doesn’t connect them (named-port instantiation, so they just dangle).
# fast iteration loop with verilator
cd projects/64_prefetch_under_wb/test
make verilator
./verilator-obj/Vtop +memh=../app/build/p58_boot.memh \
                    +image_bytes=$(wc -c < ../app/build/p58_boot.bin) \
                    +preload_image +profile +log_traps \
                    +max_cycles=300000000

# canonical iverilog flow still works exactly the same
make profile
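
The harness description mentions milestone substring matchers that run in C++ on the UART stream. One way such a matcher could look is sketched below. The class name, its interface, and the choice of a sliding tail buffer are all assumptions for illustration; the real harness.cpp may do this differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Records the cycle at which each milestone string first appears in the
// UART TX byte stream. Hypothetical sketch, not the project's harness code.
class MilestoneMatcher {
public:
    explicit MilestoneMatcher(std::vector<std::string> patterns) {
        for (std::string& p : patterns) {
            max_len_ = std::max(max_len_, p.size());
            entries_.push_back(Entry{std::move(p), 0, false});
        }
    }

    // Feed one UART byte together with the current cycle count.
    void on_uart_byte(char c, std::uint64_t cycle) {
        tail_.push_back(c);
        // Keep only as many trailing bytes as the longest pattern needs.
        if (tail_.size() > max_len_) tail_.erase(0, tail_.size() - max_len_);
        for (Entry& e : entries_) {
            const std::size_t n = e.pattern.size();
            if (e.seen || n == 0 || tail_.size() < n) continue;
            // A pattern matches when the tail ends with it.
            if (tail_.compare(tail_.size() - n, n, e.pattern) == 0) {
                e.seen = true;
                e.cycle = cycle;
            }
        }
    }

    bool seen(std::size_t i) const { return entries_.at(i).seen; }
    std::uint64_t cycle(std::size_t i) const { return entries_.at(i).cycle; }

private:
    struct Entry {
        std::string pattern;
        std::uint64_t cycle;
        bool seen;
    };
    std::vector<Entry> entries_;
    std::string tail_;
    std::size_t max_len_ = 0;
};
```

A harness loop would construct it with strings such as "Linux version" or "Run /init as init" and call on_uart_byte for every transmitted byte. The recorded cycle is the one at which the last byte of the pattern arrived.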

What broke and how we fixed it

Three real bugs surfaced while bringing Verilator up, all the same shape: the C++ memory model didn’t mirror p58_external_memory.sv.

  1. Byte loads at non-zero offsets read the wrong byte. The chip’s load_value picks LB/LBU bytes from rdata[7:0] and trusts the memory model to lane-shift the addressed byte into position. iverilog’s memory model does the shift; the original harness just returned the raw word, so reading 'P' at offset 1 of a string returned '\n' from offset 0 instead. Symptom: stage-0’s \nP58 stage-0 firmware\n came back as \n\n\n\n then garbled.
  2. Halfword misalignment check was too strict. The first-pass fix used addr[1:0] != 0 for halfwords, which rejects legal accesses at addr[1:0] == 2. The RTL only flags addr[0] != 0. The resulting spurious store-access faults triggered a kernel oops, dump_instr, BUG_ON, and a permanent re-trap loop in handle_exception.
  3. Byte stores at non-zero offsets wrote zero into the lane. The chip’s mem_wstrb is “size-shaped” — 4'b0001 for SB, 4'b0011 for SH, 4'b1111 for SW — and is not pre-shifted by addr[1:0]. The harness was using wstrb directly as the lane mask. iverilog’s memory model derives the lane from size + addr[1:0] and ignores wstrb; the harness has to do the same. Symptom: kernel printk formatted into a buffer using SB at varying offsets, read it back, sent each byte to SBI putchar — every non-aligned byte came back as \x00. The kernel saw garbage strings, eventually called panic() on a recoverable error, and spun forever in panic_blink’s udelay loop. PC-sample histograms made it look like a udelay calibration hang; it was always a string-corruption symptom.

The lesson is the same all three times, sharper each iteration: when you write a second simulator’s memory model, copy the first one’s behaviour literally. The chip-side wstrb lies; the address is the source of truth.
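
The three fixes can be summarised in one small memory model. This is a sketch under stated assumptions, not a copy of harness.cpp or p58_external_memory.sv. It assumes store data arrives in the low bits of wdata (unshifted, like the size-shaped wstrb) and that word accesses are misaligned when addr[1:0] != 0. The class and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Misalignment rule per access size in bytes (1, 2 or 4).
inline bool is_misaligned(std::uint32_t addr, unsigned size) {
    switch (size) {
        case 1: return false;
        case 2: return (addr & 1u) != 0;  // bug 2: addr[1:0] == 2 is legal
        case 4: return (addr & 3u) != 0;  // ASSUMPTION: usual word rule
        default: return true;             // unsupported size
    }
}

class WordMemory {
public:
    explicit WordMemory(std::size_t bytes) : words_(bytes / 4, 0u) {}

    // Bug 1: shift the addressed lane down to bit 0, because the chip
    // picks LB/LBU data from rdata[7:0].
    std::uint32_t load(std::uint32_t addr) const {
        const unsigned shift = 8u * (addr & 3u);
        return words_.at(addr >> 2) >> shift;
    }

    // Bug 3: derive the lane from size + addr[1:0]; wstrb is not consulted.
    // ASSUMPTION: wdata carries the value in its low bits.
    void store(std::uint32_t addr, unsigned size, std::uint32_t wdata) {
        assert(!is_misaligned(addr, size));
        const unsigned shift = 8u * (addr & 3u);
        const std::uint32_t size_mask =
            size == 1 ? 0xFFu : size == 2 ? 0xFFFFu : 0xFFFFFFFFu;
        const std::uint32_t lane_mask = size_mask << shift;
        std::uint32_t& word = words_.at(addr >> 2);
        word = (word & ~lane_mask) | ((wdata << shift) & lane_mask);
    }

private:
    std::vector<std::uint32_t> words_;
};
```

Storing the bytes of "\nP58" at offsets 0 to 3 and then loading offset 1 returns 'P' in the low byte, which is the case the original harness got wrong.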

What’s next on this rung

  • Regenerate the per-project chart pack now that boot lands and add a P63 → P64 comparison column. Quick under Verilator — about a minute per profile run.
  • Then on to P65: real 3-stage pipeline (decode separated), where the 270× speedup will actually matter for tracking per-stage cycle effects.