Prefetch-under-writeback + Verilator dual-target sim (~270× faster iteration)

P63 cut S_FETCH to one cycle on TLB hits. P64 takes the next bite: overlap the next fetch with the current writeback so back-to-back register-only instructions don’t pay a fetch cycle at all. The chip runs through the full Linux boot path to userspace + clean panic. But the bigger story of this rung is the simulator upgrade.

milestone	cycles
`Linux version`	1,078,147
`Switched to clocksource`	12,603,459
`Run /init as init`	89,432,784
userspace `hello`	90,246,176
`Attempted to kill init` (clean panic)	90,505,484

That’s ~90.5M post-load cycles to userspace and back: 21 seconds of Verilator wall time at 4.7 MHz effective on this host. The same boot under iverilog takes ~50–60 minutes, and an iverilog run is currently the only way to cross-check chip behaviour.

The captured transcript (raw, benchmark.json) ends with:

===========================================
  hello from userspace on a homemade chip!
===========================================

  this binary:
    - cross-compiled for rv32ima/ilp32
    - statically linked, no libc
    - reaches the linux SYS_write syscall
    - is running as PID 1 from initramfs

  the chip below us:
    - 16 MiB DRAM, 25 MHz
    - rv32ima_zba, sv32 paging, m+s priv
    - SBI v0.1 firmware in stage-0

  exiting cleanly...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000

The “panic” line is the expected exit — when PID 1 returns, Linux calls panic. exitcode=0x00000000 confirms a clean userspace exit.

Boot milestone progression: P62 → P63 → P64

Cycles to each milestone across the three runs:

milestone	P62 Zbb rotates + U-mode	P63 fetch fast-path	P64 prefetch-under-WB	Δ P64 vs P63
`Linux version`	1,611,173	1,346,691	1,078,147	−19.9%
`Switched to clocksource`	18,426,917	15,530,502	12,603,459	−18.8%
`Run /init as init`	130,054,364	109,755,383	89,432,784	−18.5%
userspace `hello`	131,241,874	110,746,765	90,246,176	−18.5%
init exit (panic)	131,610,501	111,063,721	90,505,484	−18.5%

P64 reaches userspace hello in 90.25M cycles versus P63’s 110.75M: about 18.5% fewer cycles to userspace and clean panic, on the same kernel image and the same boot blob. The savings stack: P63’s fetch fast-path saved ~16% over P62 (131.24M → 110.75M); P64’s prefetch-under-WB saves another ~18% over P63.

CPI comparison

run	CPI	Δ vs previous run
P62 Zbb rotates + U-mode	6.24	baseline
P63 fetch fast-path	5.27	−0.97 vs P62
P64 prefetch-under-WB	4.29	−0.98 vs P63

CPI is cycles per retired instruction; lower is better. Best of the three runs is 4.29, worst is 6.24.

P64 retires the same ~21M instructions as P63 in 92M cycles instead of 112M, so CPI drops from 5.27 to 4.29 (~18.6% fewer cycles per instruction). That is the prefetch-under-WB hypothesis paying out cleanly: when the pipeline can issue the next fetch during S_WB, the common-case “register-only” path no longer pays a fetch cycle at all.
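The CPI figures can be re-derived from the cycle totals. The sketch below is illustrative only: the function names are made up here, and because the post gives the retired-instruction count only as "~21M", it assumes the per-run S_DECODE cycle count is a fair stand-in (one decode cycle per retired instruction).

```cpp
#include <cassert>
#include <cstdint>

// Cycles per retired instruction.
double cpi(std::uint64_t cycles, std::uint64_t retired) {
    assert(retired > 0);
    return static_cast<double>(cycles) / static_cast<double>(retired);
}

// Fractional saving going from `before` to `after`, e.g. 0.186 for 18.6%.
double saving(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before;
}
```

Under that assumption, `cpi(92000000, 21437063)` comes out at about 4.29 for P64 and `cpi(112063722, 21271018)` at about 5.27 for P63, and `saving(5.27, 4.29)` is about 0.186.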

Where the cycles went: state-cycle breakdown

Share of post-load cycles spent in each state, per run (cycle counts in parentheses):

state	P62 Zbb rotates + U-mode	P63 fetch fast-path	P64 prefetch-under-WB
total cycles	132,610,502	112,063,722	92,000,000
fetch	32% (42,437,741)	19.5% (21,809,761)	1.2% (1,084,166)
decode	16% (21,249,591)	19% (21,271,018)	23.3% (21,437,063)
execute	16% (21,249,591)	19% (21,271,018)	23.3% (21,437,063)
mem	15.9% (21,106,222)	18.9% (21,127,815)	23.1% (21,293,966)
walker	1.8% (2,377,514)	2.1% (2,376,234)	2.6% (2,375,026)
writeback	16% (21,242,373)	19% (21,263,802)	23.3% (21,429,846)
mul/div	2.2% (2,947,470)	2.6% (2,944,074)	3.2% (2,942,870)

The single most striking change: S_FETCH cycles drop from ~21.8M (P63) to ~1.08M (P64), a ~95% reduction. P63 still spent one cycle in S_FETCH on every TLB hit; P64 collapses that into the previous instruction’s S_WB cycle whenever the prefetch is allowed to fire (TLB hit + no halt + no pending interrupt + a non-jump retirement). The saved cycle does not reappear elsewhere: the S_DECODE / S_EXECUTE / S_MEM / S_WB counts are essentially unchanged between P63 and P64, and only the total shrinks.
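The gating condition in parentheses can be written as a single predicate. This is a minimal sketch, not the RTL: the struct, its field names, and both function names are hypothetical, chosen only to mirror the four conditions listed above.

```cpp
#include <cassert>

// Hypothetical snapshot of the core at the start of S_WB.
// Field names are illustrative and not taken from top.sv.
struct WbState {
    bool next_fetch_tlb_hit;  // translation for the next PC hits the TLB
    bool halted;              // core is halted
    bool irq_pending;         // an interrupt is waiting to be taken
    bool retiring_jump;       // retiring instruction redirects the PC
};

// True when the next fetch may be issued during the current S_WB cycle.
bool prefetch_allowed(const WbState& s) {
    return s.next_fetch_tlb_hit && !s.halted && !s.irq_pending && !s.retiring_jump;
}

// S_FETCH cycles the following instruction pays when its fetch hits the TLB:
// one under the P63 fast path, none when the prefetch fired during S_WB.
unsigned fetch_cycles_after(const WbState& s) {
    assert(s.next_fetch_tlb_hit);  // page-walk cost on a miss is out of scope here
    return prefetch_allowed(s) ? 0u : 1u;
}
```

A retiring jump, a pending interrupt, or a halt each force the ordinary one-cycle S_FETCH, which is consistent with a small residue of fetch cycles remaining in the P64 breakdown.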

P64 state breakdown (92,000,000 cycles, CPI 4.29):

state	share	cycles
fetch	1.2%	1,084,166
decode	23.3%	21,437,063
execute	23.3%	21,437,063
mem	23.1%	21,293,966
walker	2.6%	2,375,026
writeback	23.3%	21,429,846
mul/div	3.2%	2,942,870

Walker activity

Walker counters: 6,156 flushes, 9,392 L1 hits, 1,182,817 L0 hits.

translation type	total	walks	TLB hit rate
fetch	538,486	538,486	0%
load/store/amo	11,397,613	653,723	94.3%

Walker behaviour is essentially unchanged from P63 — same fetch-miss / load-miss / store-miss profile, same megapage vs 4 KiB-page hit ratio. The prefetch path piggybacks on the walker’s existing TLB-hit shortcut; it doesn’t change the miss rate or the walk-completion behaviour.
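The hit rates in the walker counters follow from the totals and walk counts. A small sketch, assuming (this is not stated in the post) that every counted translation that did not start a walk was a TLB hit; the function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of counted translations served without a page walk.
double tlb_hit_rate(std::uint64_t translations, std::uint64_t walks) {
    assert(translations > 0 && walks <= translations);
    return 1.0 - static_cast<double>(walks) / static_cast<double>(translations);
}
```

With the load/store/amo counters, `tlb_hit_rate(11397613, 653723)` is about 0.943; the fetch counters give exactly 0, because every counted fetch translation was a walk.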

Where time goes inside the kernel

Hot functions: 97,656 PC samples, one every 1,024 cycles.

function	image	share of samples	samples
blake2s_compress_generic	kernel	14.7%	14,342
machine_restart	kernel	9.4%	9,222
memset	kernel	6.8%	6,663
memcpy	kernel	6.8%	6,623
format_decode	kernel	5%	4,874
vsnprintf	kernel	3.4%	3,286
number	kernel	2.4%	2,312
vruntime_eligible	kernel	2.3%	2,214
memcmp	kernel	2.2%	2,132
__slab_alloc_node.isra.0	kernel	1.6%	1,567
strlen	kernel	1.4%	1,403
avg_vruntime	kernel	1.3%	1,313
string_nocheck	kernel	1.1%	1,094
chacha_permute	kernel	1.1%	1,022
add_uevent_var	kernel	1%	924
(remaining)		31.3%	30,561

blake2s_compress_generic and machine_restart dominate, which matches every prior rung: the kernel’s CRNG init does a lot of BLAKE2s rounds, and after PID 1 returns, machine_restart is where Linux spins waiting for a reboot we don’t implement. Compute-bound code (memset, memcpy, the OS-agnostic string helpers) sits behind those.

Headline: added a Verilator harness alongside the existing iverilog flow. Same chip RTL, two simulators. Effective speed went from ~23 kHz (iverilog interpreter) to ~6 MHz (Verilator native C++). That’s the difference between about an hour and about twenty seconds per kernel-boot iteration.

Why two simulators

iverilog interprets SystemVerilog on every event. It’s perfectly correct but it’s slow — the P63 site charts came from a profile run that took 53 minutes to reach the boot panic. That round-trip is too long for the kind of “change one line, re-run, look at the trace” iteration we want for the pipeline work coming up (P65 3-stage, P66 5-stage with forwarding).

Verilator compiles the entire SystemVerilog design into C++ and links it against a hand-rolled harness. It runs natively. The same RTL — projects/64_prefetch_under_wb/src/top.sv, no forks, no ifdefs — produces a binary at verilator-obj/Vtop that simulates the chip at C++ speed.

simulator	binary	effective clock	120M-cycle wall time
iverilog	`tb_freertos_demo.vvp`	~23 kHz	~90 minutes
verilator	`Vtop`	~6 MHz	~20 seconds

That is roughly a 270× speedup on the same chip, on the same host, doing the same boot-blob workload. Long-term we’d love it to be 100×; we’ll take 270×.
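The wall-time column is just cycles divided by effective clock. A minimal sketch of that arithmetic, with hypothetical function names and the rounded clock figures from the table as inputs:

```cpp
#include <cassert>

// Wall-clock seconds needed to simulate `cycles` at an effective clock rate.
double wall_seconds(double cycles, double effective_hz) {
    assert(effective_hz > 0.0);
    return cycles / effective_hz;
}

// Speedup of the fast simulator over the slow one, from effective clocks.
double speedup(double slow_hz, double fast_hz) {
    assert(slow_hz > 0.0);
    return fast_hz / slow_hz;
}
```

`wall_seconds(120e6, 23e3)` is about 5,217 s (roughly 87 minutes) and `wall_seconds(120e6, 6e6)` is 20 s. The ratio of the rounded clocks, `speedup(23e3, 6e6)`, is about 261; the 270× figure corresponds to the rounded wall times (90 minutes against 20 seconds).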

What’s new in the build

tools/verilator/Makefile.inc — per-project include that adds make verilator, make verilator-run, and make verilator-profile targets. Each project’s test/Makefile includes it once it’s defined SRC, BOOT_BLOB, BOOT_MEMH, and PROJECT_ID.
tools/verilator/harness.cpp — the C++ replacement for tb_freertos_demo.sv. Owns a 16 MiB memory model that mirrors p58_external_memory.sv exactly (size-aware byte/halfword alignment included — see “What broke and how we fixed it” below), drives the load-mode preload sequence, streams every UART TX byte to stdout, runs the milestone substring matchers in C++ (no $display overhead), and emits the same benchmark.json shape the iverilog flow does so site charts can ingest either.
The chip top RTL now exposes a small trap-debug bundle — dbg_trap_strobe, dbg_trap_cause, dbg_trap_epc, dbg_trap_tval, dbg_trap_to_priv — set inside enter_trap. The Verilator harness logs every trap when given +log_traps. iverilog’s tb_freertos_demo.sv doesn’t connect them (named-port instantiation, so they just dangle).
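One way the C++ milestone matching could work is a streaming substring matcher fed one UART byte at a time. The sketch below is an assumption about the approach, not the code in harness.cpp: the class, struct, and method names are hypothetical, and it records the cycle at which the last byte of each milestone string arrives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Milestone {
    std::string needle;   // substring to look for in the UART stream
    bool hit;             // set once the substring has been seen
    std::uint64_t cycle;  // cycle at which its last byte arrived
};

class MilestoneMatcher {
public:
    explicit MilestoneMatcher(const std::vector<std::string>& needles) {
        for (const std::string& n : needles) {
            assert(!n.empty());
            max_len_ = std::max(max_len_, n.size());
            milestones_.push_back(Milestone{n, false, 0});
        }
    }

    // Call once per UART TX byte. Keeps only the last max_len_ bytes,
    // so memory use stays bounded however long the boot log gets.
    void on_uart_byte(char c, std::uint64_t cycle) {
        tail_.push_back(c);
        if (tail_.size() > max_len_) tail_.erase(0, tail_.size() - max_len_);
        for (Milestone& m : milestones_) {
            const std::size_t n = m.needle.size();
            if (m.hit || tail_.size() < n) continue;
            if (tail_.compare(tail_.size() - n, n, m.needle) == 0) {
                m.hit = true;
                m.cycle = cycle;
            }
        }
    }

    const std::vector<Milestone>& milestones() const { return milestones_; }

private:
    std::vector<Milestone> milestones_;
    std::string tail_;
    std::size_t max_len_ = 0;
};
```

Constructing it with strings such as `Linux version` and `Run /init as init` and calling `on_uart_byte` from the UART TX path would yield a per-milestone cycle count like the ones in the milestone table, with no formatted printing on the hot path.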

# fast iteration loop with verilator
cd projects/64_prefetch_under_wb/test
make verilator
./verilator-obj/Vtop +memh=../app/build/p58_boot.memh \
                    +image_bytes=$(wc -c < ../app/build/p58_boot.bin) \
                    +preload_image +profile +log_traps \
                    +max_cycles=300000000

# canonical iverilog flow still works exactly the same
make profile

What broke and how we fixed it

Three real bugs while bringing Verilator up, all the same shape: the C++ memory model didn’t mirror p58_external_memory.sv.

Byte loads at non-zero offsets read the wrong byte. The chip’s load_value picks LB/LBU bytes from rdata[7:0] and trusts the memory model to lane-shift the addressed byte into position. iverilog’s memory model does the shift; the original harness just returned the raw word, so reading 'P' at offset 1 of a string returned '\n' from offset 0 instead. Symptom: stage-0’s \nP58 stage-0 firmware\n came back as \n\n\n\n then garbled.
Halfword misalignment check was too strict. The first-pass fix used addr[1:0] != 0 for halfwords, which rejects legal accesses at addr[1:0] == 2. The RTL only flags addr[0] != 0. The resulting spurious store-access faults triggered a kernel oops, dump_instr, BUG_ON, and a permanent re-trap loop in handle_exception.
Byte stores at non-zero offsets wrote zero into the lane. The chip’s mem_wstrb is “size-shaped” — 4'b0001 for SB, 4'b0011 for SH, 4'b1111 for SW — and is not pre-shifted by addr[1:0]. The harness was using wstrb directly as the lane mask. iverilog’s memory model derives the lane from size + addr[1:0] and ignores wstrb; the harness has to do the same. Symptom: kernel printk formatted into a buffer using SB at varying offsets, read it back, sent each byte to SBI putchar — every non-aligned byte came back as \x00. The kernel saw garbage strings, eventually called panic() on a recoverable error, and spun forever in panic_blink’s udelay loop. PC-sample histograms made it look like a udelay calibration hang; it was always a string-corruption symptom.

The lesson is the same all three times, sharper each iteration: when you write a second simulator’s memory model, copy the first one’s behaviour literally. The chip-side wstrb lies; the address is the source of truth.
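The three fixes can be summarised in a small word-addressed memory model. This is a sketch under stated assumptions, not a copy of harness.cpp or p58_external_memory.sv: the type and function names are hypothetical, store data is assumed to arrive in the low bits of wdata, the word-alignment rule (addr[1:0] != 0) is assumed, and whether the real model masks the upper bits of a shifted load is not stated in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Size { Byte, Half, Word };

// Halfwords fault only on addr[0] != 0, so an access at addr[1:0] == 2 is legal.
// The word rule (addr[1:0] != 0) is an assumption.
bool misaligned(std::uint32_t addr, Size sz) {
    switch (sz) {
        case Size::Byte: return false;
        case Size::Half: return (addr & 1u) != 0;
        case Size::Word: return (addr & 3u) != 0;
    }
    return false;
}

class WordMemory {
public:
    explicit WordMemory(std::size_t bytes) : words_(bytes / 4, 0) {}

    // Load: shift the addressed lane down so it starts at rdata[7:0].
    // Upper bits are left as-is; the chip side selects the bits it needs.
    std::uint32_t load(std::uint32_t addr, Size sz) const {
        assert(!misaligned(addr, sz));
        const unsigned shift = (addr & 3u) * 8;
        return words_.at(addr >> 2) >> shift;
    }

    // Store: derive the lane mask from size + addr[1:0] and ignore wstrb,
    // because the chip's wstrb is size-shaped and not shifted by the address.
    void store(std::uint32_t addr, Size sz, std::uint32_t wdata) {
        assert(!misaligned(addr, sz));
        const unsigned shift = (addr & 3u) * 8;
        const std::uint32_t size_mask =
            sz == Size::Word ? 0xFFFFFFFFu : sz == Size::Half ? 0xFFFFu : 0xFFu;
        const std::uint32_t lane_mask = size_mask << shift;
        std::uint32_t& word = words_.at(addr >> 2);
        word = (word & ~lane_mask) | ((wdata << shift) & lane_mask);
    }

private:
    std::vector<std::uint32_t> words_;  // little-endian byte lanes within each word
};
```

After storing the word `0x44332211` at address 0, a byte load at offset 1 returns `0x22` in its low byte, and a byte store of `0xAA` at offset 3 changes the word to `0xAA332211`. Those are the two cases the original harness got wrong.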

What’s next on this rung

Regenerate the per-project chart pack now that boot lands and add a P63 → P64 comparison column. Quick under Verilator — about a minute per profile run.
Then on to P65: real 3-stage pipeline (decode separated), where the 270× speedup will actually matter for tracking per-stage cycle effects.