P63 cut S_FETCH to one cycle on TLB hits. P64 takes the next bite:
overlap the next fetch with the current writeback so back-to-back
register-only instructions don’t pay a fetch cycle at all. The chip
runs through the full Linux boot path to userspace + clean panic.
But the bigger story this rung is the simulator upgrade.
| milestone | cycles |
|---|---|
| Linux version | 1,078,147 |
| Switched to clocksource | 12,603,459 |
| Run /init as init | 89,432,784 |
| userspace hello | 90,246,176 |
| Attempted to kill init (clean panic) | 90,505,484 |
That’s ~90.5M post-load cycles to userspace + return — 21 seconds of Verilator wall time at 4.7 MHz effective on this host. The same boot under iverilog takes ~50–60 minutes, and is currently the only way to cross-check chip behaviour.
The captured transcript (raw, benchmark.json) ends with:
===========================================
hello from userspace on a homemade chip!
===========================================
this binary:
- cross-compiled for rv32ima/ilp32
- statically linked, no libc
- reaches the linux SYS_write syscall
- is running as PID 1 from initramfs
the chip below us:
- 16 MiB DRAM, 25 MHz
- rv32ima_zba, sv32 paging, m+s priv
- SBI v0.1 firmware in stage-0
exiting cleanly...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
The “panic” line is the expected exit — when PID 1 returns, Linux
calls panic. exitcode=0x00000000 confirms a clean userspace exit.
Boot milestone progression: P62 → P63 → P64
| milestone | P62 Zbb rotates + U-mode | P63 fetch fast-path | P64 prefetch-under-WB | Δ last col |
|---|---|---|---|---|
| “Linux version” | | | 1,078,147 | −19.9% |
| clocksource switched | | | 12,603,459 | −18.8% |
| “Run /init” | | | 89,432,784 | −18.5% |
| userspace hello | | | 90,246,176 | −18.5% |
| init exit (panic) | | | 90,505,484 | −18.5% |
P64 reaches userspace hello in 90.25M cycles versus P63’s
110.75M: about 18.5% fewer cycles to userspace and clean
panic, on the same kernel image and the same boot blob. The
savings stack: P63’s fetch fast-path saved ~17% over
P62; P64’s prefetch-under-WB saves another ~18% over P63.
CPI comparison
CPI is cycles per retired instruction; lower is better. Each bar’s length is its CPI as a fraction of the worst run, so a bar half as long means half the cycles per instruction on the same workload.
P64 retires the same ~21M instructions as P63 in 92M cycles instead of 112M — CPI drops from 5.27 → 4.29 (~18.6% fewer cycles per instruction). That’s the prefetch-under-WB hypothesis paid out cleanly: when the pipeline can issue the next fetch during S_WB, the common-case “register-only” path no longer pays a fetch cycle at all.
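As a quick check on that arithmetic, here is a minimal sketch that recomputes the P64 CPI and the relative drop. It assumes the rounded 92M-cycle figure, and it treats the P64 S_DECODE count from the state-cycle breakdown (21,437,063) as a stand-in for retired instructions. That equivalence is an assumption made for illustration, not something the profiler output states.

```python
def cpi(cycles: int, retired: int) -> float:
    """Cycles per retired instruction."""
    return cycles / retired


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Assumption: one S_DECODE cycle per retired instruction, so the decode
# count stands in for the retired-instruction count.
P64_CYCLES = 92_000_000
P64_RETIRED = 21_437_063

p64_cpi = cpi(P64_CYCLES, P64_RETIRED)
drop = relative_drop(5.27, 4.29)  # CPI values quoted in the text

print(f"P64 CPI  = {p64_cpi:.2f}")
print(f"CPI drop = {drop:.1%}")
```

Under those assumptions the script reproduces the quoted 4.29 CPI and the ~18.6% drop.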
Where the cycles went: state-cycle breakdown
The single most striking column: S_FETCH cycles drop from
~21.8M (P63) to ~1.08M (P64), a ~95% reduction. P63 still
spent one cycle in S_FETCH on every TLB hit; P64 collapses that
into the previous instruction’s S_WB cycle whenever the prefetch
is allowed to fire (TLB hit + no halt + no pending interrupt + a
non-jump retirement). The fetch itself still happens; it just
overlaps with S_WB instead of taking a cycle of its own. In the
breakdown that shows up as an unchanged S_DECODE / S_EXECUTE /
S_MEM / S_WB profile, just with fewer total cycles.
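The gating condition can be illustrated with a toy cycle model. This is a sketch only: the `Retire` fields and function names are hypothetical, it is not the RTL, and it ignores walker cycles on TLB misses. It counts how many S_FETCH cycles an instruction stream would pay if every instruction whose predecessor was allowed to prefetch skips its own fetch cycle.

```python
from dataclasses import dataclass


@dataclass
class Retire:
    """Conditions at the moment an instruction reaches S_WB (hypothetical)."""
    next_fetch_tlb_hit: bool
    halted: bool
    irq_pending: bool
    is_jump: bool


def prefetch_allowed(r: Retire) -> bool:
    # The four conditions from the text: TLB hit, no halt,
    # no pending interrupt, and a non-jump retirement.
    return (r.next_fetch_tlb_hit and not r.halted
            and not r.irq_pending and not r.is_jump)


def fetch_cycles(stream: list[Retire]) -> int:
    """Count S_FETCH cycles in the toy model: the first instruction always
    fetches; each later one pays a fetch cycle only when the previous
    retirement could not prefetch."""
    cycles = 1 if stream else 0
    for prev in stream[:-1]:
        if not prefetch_allowed(prev):
            cycles += 1
    return cycles


alu = Retire(next_fetch_tlb_hit=True, halted=False, irq_pending=False, is_jump=False)
jump = Retire(next_fetch_tlb_hit=True, halted=False, irq_pending=False, is_jump=True)
print(fetch_cycles([alu, alu, jump, alu]))
```

In this model a run of register-only instructions pays a single fetch cycle for the whole run, and each jump costs one more for the instruction that follows it.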
| state | share | cycles |
|---|---|---|
| fetch | 1.2% | 1,084,166 |
| decode | 23.3% | 21,437,063 |
| execute | 23.3% | 21,437,063 |
| mem | 23.1% | 21,293,966 |
| walker | 2.6% | 2,375,026 |
| writeback | 23.3% | 21,429,846 |
| mul/div | 3.2% | 2,942,870 |
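The shares in the breakdown above can be recomputed from the raw counts. The sketch below uses only the numbers listed there; the dictionary keys are just labels chosen for this example.

```python
# Per-state cycle counts copied from the P64 breakdown.
state_cycles = {
    "fetch": 1_084_166,
    "decode": 21_437_063,
    "execute": 21_437_063,
    "mem": 21_293_966,
    "walker": 2_375_026,
    "writeback": 21_429_846,
    "mul/div": 2_942_870,
}

total = sum(state_cycles.values())
shares = {name: 100 * n / total for name, n in state_cycles.items()}

print(f"total cycles: {total:,}")
for name, pct in shares.items():
    print(f"{name:<10}{pct:5.1f}%")
```

The counts sum to 92,000,000 cycles, which matches the 92M figure used in the CPI comparison.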
Walker activity
Walker behaviour is essentially unchanged from P63 — same fetch-miss / load-miss / store-miss profile, same megapage vs 4 KiB-page hit ratio. The prefetch path piggybacks on the walker’s existing TLB-hit shortcut; it doesn’t change the miss rate or the walk-completion behaviour.
Where time goes inside the kernel
- 14.7% of samples (14,342 samples)
- 9.4% of samples (9,222 samples)
- 6.8% of samples (6,663 samples)
- 6.8% of samples (6,623 samples)
- 5% of samples (4,874 samples)
- 3.4% of samples (3,286 samples)
- 2.4% of samples (2,312 samples)
- 2.3% of samples (2,214 samples)
- 2.2% of samples (2,132 samples)
- 1.6% of samples (1,567 samples)
- 1.4% of samples (1,403 samples)
- 1.3% of samples (1,313 samples)
- 1.1% of samples (1,094 samples)
- 1.1% of samples (1,022 samples)
- 1% of samples (924 samples)
- 31.3% of samples (30,561 samples)
blake2s_compress_generic and machine_restart dominate, which
matches every prior rung: the kernel’s CRNG init does a lot of
BLAKE2s rounds, and after PID 1 returns, machine_restart is
where Linux spins waiting for a reboot we don’t implement.
Compute-bound code (memset, memcpy, the OS-agnostic string
helpers) sits behind those.
Headline: added a Verilator harness alongside the existing iverilog flow. Same chip RTL, two simulators. Effective speed went from ~23 kHz (iverilog interpreter) to ~6 MHz (Verilator native C++). That’s the difference between minutes and seconds per kernel-boot iteration.
Why two simulators
iverilog interprets SystemVerilog on every event. It’s perfectly correct but it’s slow — the P63 site charts came from a profile run that took 53 minutes to reach the boot panic. That round-trip is too long for the kind of “change one line, re-run, look at the trace” iteration we want for the pipeline work coming up (P65 3-stage, P66 5-stage with forwarding).
Verilator compiles the entire SystemVerilog design into C++ and
links it against a hand-rolled harness. It runs natively. The same
RTL — projects/64_prefetch_under_wb/src/top.sv, no forks, no
ifdefs — produces a binary at verilator-obj/Vtop that simulates
the chip at C++ speed.
| simulator | binary | effective clock | 120M-cycle wall time |
|---|---|---|---|
| iverilog | tb_freertos_demo.vvp | ~23 kHz | ~90 minutes |
| verilator | Vtop | ~6 MHz | ~20 seconds |
That is roughly a 270× speedup on the same chip, on the same host, doing the same boot-blob workload. Long-term we’d love it to be 100×; we’ll take 270×.
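The table’s wall times follow from the effective clocks. The sketch below redoes that arithmetic from the rounded figures in the table; the function name is only for this example.

```python
def wall_seconds(cycles: int, effective_hz: float) -> float:
    """Wall-clock time needed to simulate `cycles` at an effective clock."""
    return cycles / effective_hz


CYCLES = 120_000_000
IVERILOG_HZ = 23e3    # ~23 kHz, rounded value from the table
VERILATOR_HZ = 6e6    # ~6 MHz, rounded value from the table

t_iverilog = wall_seconds(CYCLES, IVERILOG_HZ)
t_verilator = wall_seconds(CYCLES, VERILATOR_HZ)

print(f"iverilog : {t_iverilog / 60:.0f} min")
print(f"verilator: {t_verilator:.0f} s")
print(f"speedup from clocks    : {t_iverilog / t_verilator:.0f}x")
print(f"speedup from wall times: {90 * 60 / 20:.0f}x")
```

The rounded clocks give a ratio of about 261×, while the rounded wall times (90 minutes against 20 seconds) give 270×. Both agree with the quoted figure to within the rounding of the inputs.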
What’s new in the build
- tools/verilator/Makefile.inc: per-project include that adds make verilator, make verilator-run, and make verilator-profile targets. Each project’s test/Makefile includes it once it has defined SRC, BOOT_BLOB, BOOT_MEMH, and PROJECT_ID.
- tools/verilator/harness.cpp: the C++ replacement for tb_freertos_demo.sv. Owns a 16 MiB memory model that mirrors p58_external_memory.sv exactly (size-aware byte/halfword alignment included; see “What broke and how we fixed it” below), drives the load-mode preload sequence, streams every UART TX byte to stdout, runs the milestone substring matchers in C++ (no $display overhead), and emits the same benchmark.json shape the iverilog flow does so site charts can ingest either.
- The chip top RTL now exposes a small trap-debug bundle (dbg_trap_strobe, dbg_trap_cause, dbg_trap_epc, dbg_trap_tval, dbg_trap_to_priv), set inside enter_trap. The Verilator harness logs every trap when given +log_traps. iverilog’s tb_freertos_demo.sv doesn’t connect them (named-port instantiation, so they just dangle).
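To make the milestone matching concrete, here is a minimal sketch of a streaming substring matcher that records the cycle at which each milestone string first completes in the UART byte stream. The real harness does this in C++; this is a Python model of the idea, and the class and method names are hypothetical rather than taken from harness.cpp.

```python
class MilestoneMatcher:
    """Records the cycle at which each milestone substring first appears
    in a byte stream that is fed one byte at a time."""

    def __init__(self, milestones: list[str]):
        self.pending = [m.encode() for m in milestones]
        self.max_len = max((len(m) for m in self.pending), default=0)
        self.window = bytearray()
        self.hits: dict[str, int] = {}

    def feed(self, byte: int, cycle: int) -> None:
        self.window.append(byte)
        # Keep only as many trailing bytes as the longest pattern needs.
        if len(self.window) > self.max_len:
            del self.window[0]
        for m in list(self.pending):
            if self.window.endswith(m):
                self.hits[m.decode()] = cycle
                self.pending.remove(m)


# Usage: the byte index stands in for the cycle counter here.
matcher = MilestoneMatcher(["Linux version", "Run /init as init"])
for cycle, b in enumerate(b"..Linux version..Run /init as init.."):
    matcher.feed(b, cycle)
print(matcher.hits)
```

Each milestone is reported once, at the byte that completes it, so the per-byte cost stays small and no simulator-side printing is involved.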
```sh
# fast iteration loop with verilator
cd projects/64_prefetch_under_wb/test
make verilator
./verilator-obj/Vtop +memh=../app/build/p58_boot.memh \
    +image_bytes=$(wc -c < ../app/build/p58_boot.bin) \
    +preload_image +profile +log_traps \
    +max_cycles=300000000

# canonical iverilog flow still works exactly the same
make profile
```
What broke and how we fixed it
Three real bugs while bringing Verilator up, all the same shape:
the C++ memory model didn’t mirror p58_external_memory.sv.
- Byte loads at non-zero offsets read the wrong byte. The chip’s load_value picks LB/LBU bytes from rdata[7:0] and trusts the memory model to lane-shift the addressed byte into position. iverilog’s memory model does the shift; the original harness just returned the raw word, so reading 'P' at offset 1 of a string returned '\n' from offset 0 instead. Symptom: stage-0’s \nP58 stage-0 firmware\n came back as \n\n\n\n then garbled.
- Halfword misaligned check was too strict. First-pass fix used addr[1:0] != 0 for halfwords, which rejects legal accesses at addr[1:0] == 2. The RTL only flags addr[0] != 0. Faked store-access faults triggered kernel oops, dump_instr, BUG_ON, and a permanent re-trap loop in handle_exception.
- Byte stores at non-zero offsets wrote zero into the lane. The chip’s mem_wstrb is “size-shaped” (4'b0001 for SB, 4'b0011 for SH, 4'b1111 for SW) and is not pre-shifted by addr[1:0]. The harness was using wstrb directly as the lane mask. iverilog’s memory model derives the lane from size + addr[1:0] and ignores wstrb; the harness has to do the same. Symptom: kernel printk formatted into a buffer using SB at varying offsets, read it back, and sent each byte to SBI putchar; every non-aligned byte came back as \x00. The kernel saw garbage strings, eventually called panic() on a recoverable error, and spun forever in panic_blink’s udelay loop. PC-sample histograms made it look like a udelay calibration hang; it was always a string-corruption symptom.
The lesson is the same all three times, sharper each iteration:
when you write a second simulator’s memory model, copy the first
one’s behaviour literally. The chip-side wstrb lies; the
address is the source of truth.
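The three fixes can be summarised in a small behavioural model. This is a Python sketch under stated assumptions, not the harness’s C++ or the SystemVerilog in p58_external_memory.sv: the class and constant names are hypothetical, store data is assumed to arrive right-justified in the low bits of wdata, and the load result is masked to the access size for convenience.

```python
SIZE_BYTE, SIZE_HALF, SIZE_WORD = 1, 2, 4


class Misaligned(Exception):
    """Raised for accesses the model treats as misaligned."""


class WordMemory:
    """Little-endian memory stored as 32-bit words (toy stand-in)."""

    def __init__(self, n_words: int):
        self.words = [0] * n_words

    @staticmethod
    def _check(addr: int, size: int) -> None:
        # Fix 2: halfwords only require addr[0] == 0, so addr[1:0] == 2 is legal.
        if (size == SIZE_HALF and addr & 1) or (size == SIZE_WORD and addr & 3):
            raise Misaligned(hex(addr))

    def load(self, addr: int, size: int) -> int:
        self._check(addr, size)
        shift = 8 * (addr & 3)
        mask = (1 << (8 * size)) - 1
        # Fix 1: lane-shift so the addressed byte lands in bits [7:0].
        return (self.words[addr >> 2] >> shift) & mask

    def store(self, addr: int, size: int, wdata: int) -> None:
        self._check(addr, size)
        shift = 8 * (addr & 3)
        mask = ((1 << (8 * size)) - 1) << shift
        # Fix 3: the lane comes from size + addr[1:0]; wstrb is not consulted.
        word = self.words[addr >> 2]
        self.words[addr >> 2] = (word & ~mask & 0xFFFFFFFF) | ((wdata << shift) & mask)


mem = WordMemory(4)
for offset, byte in enumerate(b"\nP58"):
    mem.store(offset, SIZE_BYTE, byte)
print(chr(mem.load(1, SIZE_BYTE)))
```

With the lane derived from the address, a byte stored at offset 1 reads back as the same byte, which is the round trip that the printk buffer depended on.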
What’s next on this rung
- Regenerate the per-project chart pack now that boot lands and add a P63 → P64 comparison column. Quick under Verilator — about a minute per profile run.
- Then on to P65: real 3-stage pipeline (decode separated), where the 270× speedup will actually matter for tracking per-stage cycle effects.