Project No. 88 of 147 on the ladder

Memory stall attribution

Introduces: memory-bus request attribution; BusyBox shell stall split; open-core architecture comparison

harden state: last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P88 does not make the CPU faster. It makes the profiler less vague. The BusyBox shell still runs through the P87 PTY/direct-UART bridge, but the Verilator harness now breaks memory-bus waits into fetch, load, store, AMO, and page-walk buckets.

That matters because “memory is slow” is not an actionable CPU feature. “Instruction fetch accounts for about two thirds of memory stalls” is.

Result

| metric | P87 direct UART | P88 attribution run | delta |
|---|---|---|---|
| post-load cycles | 222,825,777 | 221,748,021 | -0.48% |
| retired instructions | 86,750,479 | 86,435,211 | -0.36% |
| CPI | 2.5686 | 2.5655 | -0.12% |
| memory handshakes | 33,187,764 | 32,917,717 | -0.81% |
| memory stall cycles | 88,210,458 | 87,892,031 | -0.36% |

That small movement is not a speedup claim. The RTL behavior is the same shape as P87; P88 is a better measurement pass.
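The CPI and delta columns follow directly from the cycle and instruction counts. A minimal sketch that recomputes them, with the figures copied from the table above (the helper names `cpi` and `pct_delta` are illustrative, not part of the harness):

```python
# Figures copied from the result table above.
P87 = {"cycles": 222_825_777, "instructions": 86_750_479}
P88 = {"cycles": 221_748_021, "instructions": 86_435_211}


def cpi(run):
    """Cycles per retired instruction for one run."""
    return run["cycles"] / run["instructions"]


def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    print(f"P87 CPI {cpi(P87):.4f}, P88 CPI {cpi(P88):.4f}")
    print(f"cycle delta {pct_delta(P87['cycles'], P88['cycles']):.2f}%")
    print(f"CPI delta   {pct_delta(cpi(P87), cpi(P88)):.2f}%")
```

Both cycles and retired instructions shrink by a similar fraction, which is why CPI barely moves.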

Memory Stalls

Memory stalls by request kind, P88 BusyBox shell workload: 87,892,031 stall cycles across 32,917,717 handshakes.

| request kind | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 58,870,166 | 66.98% | 29,224,093 |
| data load | 14,576,692 | 16.58% | 969,755 |
| data store | 11,955,460 | 13.60% | 216,442 |
| AMO | 157,725 | 0.18% | 183,819 |
| page walk for fetch | 1,117,409 | 1.27% | 1,111,255 |
| page walk for load/store | 1,212,863 | 1.38% | 1,212,353 |
| other | 1,716 | 0.00% | 0 |
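
The share figures can be recomputed from the raw counts, and dividing stall cycles by request count gives a rough average wait per request. A small sketch using the numbers reported above (the dictionary layout and function names are assumptions for illustration, not the harness's `memory_bus.by_kind` format):

```python
# (stall cycles, requests) per request kind, copied from the figures above.
BY_KIND = {
    "instruction fetch": (58_870_166, 29_224_093),
    "data load": (14_576_692, 969_755),
    "data store": (11_955_460, 216_442),
    "AMO": (157_725, 183_819),
    "page walk for fetch": (1_117_409, 1_111_255),
    "page walk for load/store": (1_212_863, 1_212_353),
    "other": (1_716, 0),
}

TOTAL_STALLS = sum(stalls for stalls, _ in BY_KIND.values())
TOTAL_REQUESTS = sum(reqs for _, reqs in BY_KIND.values())


def share(kind):
    """Percentage of all memory stall cycles attributed to this kind."""
    return BY_KIND[kind][0] / TOTAL_STALLS * 100.0


def stalls_per_request(kind):
    """Average stall cycles per request, or None when there were no requests."""
    stalls, reqs = BY_KIND[kind]
    return stalls / reqs if reqs else None


if __name__ == "__main__":
    for kind in BY_KIND:
        per_req = stalls_per_request(kind)
        per_req_text = "n/a" if per_req is None else f"{per_req:.2f}"
        print(f"{kind:26s} {share(kind):6.2f}%  {per_req_text} stalls/req")
```

On these numbers a single store waits far longer on average than a single fetch, but fetch dominates the total because there are so many more fetch requests.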

The first version of this harness mislabeled most fetch waits as other, because P64/P66 can launch the next fetch during S_WB. P88 counts S_WB memory requests as instruction fetches and labels the P70 FP high-half states as load/store.
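
The labeling fix can be sketched as a state-to-kind lookup applied on every cycle with an outstanding request. This is a Python sketch for brevity, not the Verilator harness code: only `S_WB` is named in this write-up, every other state name below is hypothetical, and it assumes a stall cycle is one where the request is valid but not yet ready.

```python
from collections import Counter

# Only S_WB appears in the write-up; all other state names are hypothetical.
KIND_BY_STATE = {
    "S_FETCH": "fetch",
    "S_WB": "fetch",             # next fetch may launch during writeback
    "S_LOAD": "load",
    "S_STORE": "store",
    "S_FP_HI_LOAD": "load",      # FP high-half access counted as load
    "S_FP_HI_STORE": "store",    # FP high-half access counted as store
    "S_AMO": "amo",
    "S_PTW_FETCH": "page_walk_fetch",
    "S_PTW_DATA": "page_walk_data",
}


def classify(state):
    """Map an FSM state to a request kind; unknown states fall into 'other'."""
    return KIND_BY_STATE.get(state, "other")


def attribute(trace):
    """Count stall cycles and completed requests per kind.

    trace yields (state, req_valid, req_ready) once per cycle.
    """
    stalls, requests = Counter(), Counter()
    for state, valid, ready in trace:
        if not valid:
            continue  # no memory request outstanding this cycle
        kind = classify(state)
        if ready:
            requests[kind] += 1  # handshake completes
        else:
            stalls[kind] += 1    # request is waiting on the bus
    return stalls, requests
```

Without the `S_WB` entry, those cycles would fall through to the default and show up as other, which is the mislabeling described above.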

Shell Phases

Shell phases, P88 shell workload: 221,748,021 cycles, CPI 2.57.
  1. kernel banner to /init: 117,614,758 cycles (53.2%)
  2. /init to shell banner: 1,092,067 cycles (0.5%)
  3. shell banner to first command: 36,133,324 cycles (16.3%)
  4. echo command: 1,598 cycles (0%)
  5. uname -a: 2,446,206 cycles (1.1%)
  6. ls /bin /usr/share: 32,252,107 cycles (14.6%)
  7. cat sample file: 2,745,025 cycles (1.2%)
  8. touch/write/cat/rm /tmp file: 11,581,923 cycles (5.2%)
  9. 8x ash loop with file I/O: 16,297,942 cycles (7.4%)
  10. final marker: 955,006 cycles (0.4%)

The visible workload is still the shell: boot to /init, shell setup, uname, ls, cat, /tmp file work, and a small ash loop. Boot and shell setup take most of the cycles, but among the shell commands the slow phase is still ls /bin /usr/share, which is exactly the kind of filesystem and userspace formatting path that pounds instruction fetch.
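
The phase percentages can be recomputed from the cycle counts. A short sketch with the figures copied from the list above (the `PHASES` layout is illustrative); the listed percentages are consistent with a denominator equal to the sum of the ten phases, 221,119,956 cycles, rather than the full 221,748,021-cycle run:

```python
# Cycle counts per phase, copied from the shell phase list above.
PHASES = [
    ("kernel banner to /init", 117_614_758),
    ("/init to shell banner", 1_092_067),
    ("shell banner to first command", 36_133_324),
    ("echo command", 1_598),
    ("uname -a", 2_446_206),
    ("ls /bin /usr/share", 32_252_107),
    ("cat sample file", 2_745_025),
    ("touch/write/cat/rm /tmp file", 11_581_923),
    ("8x ash loop with file I/O", 16_297_942),
    ("final marker", 955_006),
]

PHASE_TOTAL = sum(cycles for _, cycles in PHASES)


def phase_share(name):
    """Percentage of the summed phase cycles spent in the named phase."""
    cycles = dict(PHASES)[name]
    return cycles / PHASE_TOTAL * 100.0


if __name__ == "__main__":
    for name, cycles in PHASES:
        print(f"{name:32s} {cycles:>12,d}  {phase_share(name):5.1f}%")
```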

Cycle Shape

State breakdown, P88 memory attribution workload: 221,748,021 cycles, CPI 2.57.
  1. fetch: 8,316,547 cycles (3.8%)
  2. execute: 86,460,157 cycles (39%)
  3. mem: 28,059,893 cycles (12.7%)
  4. walker: 4,653,880 cycles (2.1%)
  5. writeback: 86,435,211 cycles (39%)
  6. mul/div: 7,820,617 cycles (3.5%)

The state chart is useful as a cross-check. P88 did not add a cache or a new pipeline stage, so the high-level state distribution should not dramatically move from P87.
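
A sketch of that cross-check using the reported numbers (the variable and function names are illustrative). Two things fall out of the arithmetic: the writeback count equals the retired-instruction count, and the six listed states sum to 1,716 cycles short of the run total. That residual has the same value as the other stall bucket, though the write-up does not say whether they are the same cycles.

```python
TOTAL_CYCLES = 221_748_021
RETIRED_INSTRUCTIONS = 86_435_211

# Cycles per FSM state group, copied from the state breakdown above.
STATE_CYCLES = {
    "fetch": 8_316_547,
    "execute": 86_460_157,
    "mem": 28_059_893,
    "walker": 4_653_880,
    "writeback": 86_435_211,
    "mul/div": 7_820_617,
}


def unaccounted_cycles():
    """Cycles in the run total that no listed state claims."""
    return TOTAL_CYCLES - sum(STATE_CYCLES.values())


def writeback_matches_retired():
    """True when there is exactly one writeback cycle per retired instruction."""
    return STATE_CYCLES["writeback"] == RETIRED_INSTRUCTIONS


if __name__ == "__main__":
    print("unaccounted cycles:", unaccounted_cycles())
    print("writeback == retired:", writeback_matches_retired())
```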

Hot Functions

Hot functions, P88 BusyBox shell symbols: 64,726 samples, one every 1,024 cycles.
  1. printf_core (busybox): 5.6%, 3,612 samples
  2. memset (kernel): 5.1%, 3,269 samples
  3. memcpy (busybox): 3.6%, 2,344 samples
  4. vruntime_eligible (kernel): 3.3%, 2,151 samples
  5. blake2s_compress_generic (kernel): 2.8%, 1,809 samples
  6. memcpy (kernel): 2.6%, 1,707 samples
  7. __fwritex (busybox): 2.6%, 1,690 samples
  8. handle_exception (kernel): 1.7%, 1,120 samples
  9. unmap_page_range (kernel): 1.7%, 1,089 samples
  10. avg_vruntime (kernel): 1.4%, 886 samples
  11. memset (busybox): 1.3%, 844 samples
  12. n_tty_write (kernel): 1.3%, 842 samples
  13. ret_from_exception (kernel): 1.2%, 749 samples
  14. next_uptodate_folio (kernel): 1.1%, 696 samples
  15. n_tty_read (kernel): 1%, 660 samples
  16. (remaining): 55.5%, 35,946 samples

The top symbols match the stall split: BusyBox formatting (printf_core, memcpy, __fwritex) and kernel memory/scheduler/exception work stay hot. The direct-UART bridge removed the old SBI console leaf path in P87, but the remaining work is broader than terminal output.
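
To see how the named symbols divide between BusyBox and the kernel, the sample counts can be grouped by image. A small sketch over the top 15 entries listed above (the tuple layout and function name are illustrative, not the profiler's output format):

```python
from collections import defaultdict

TOTAL_SAMPLES = 64_726

# (symbol, image, samples) for the top 15 entries listed above.
TOP_SYMBOLS = [
    ("printf_core", "busybox", 3_612),
    ("memset", "kernel", 3_269),
    ("memcpy", "busybox", 2_344),
    ("vruntime_eligible", "kernel", 2_151),
    ("blake2s_compress_generic", "kernel", 1_809),
    ("memcpy", "kernel", 1_707),
    ("__fwritex", "busybox", 1_690),
    ("handle_exception", "kernel", 1_120),
    ("unmap_page_range", "kernel", 1_089),
    ("avg_vruntime", "kernel", 886),
    ("memset", "busybox", 844),
    ("n_tty_write", "kernel", 842),
    ("ret_from_exception", "kernel", 749),
    ("next_uptodate_folio", "kernel", 696),
    ("n_tty_read", "kernel", 660),
]


def samples_by_image(symbols):
    """Sum sample counts per image (busybox or kernel)."""
    totals = defaultdict(int)
    for _symbol, image, samples in symbols:
        totals[image] += samples
    return dict(totals)


if __name__ == "__main__":
    for image, samples in samples_by_image(TOP_SYMBOLS).items():
        print(f"{image:8s} {samples:>6,d} samples  {samples / TOTAL_SAMPLES * 100:.1f}%")
```

Within the named top 15, kernel symbols collect more samples than BusyBox symbols, which fits the observation that the remaining work is broader than terminal output.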

Compared With Open Cores

This project's core is a teaching ASIC core first. In current project terms it is an RV32 Linux-capable, single-issue, in-order, FSM-style core with M-mode/S-mode, Sv32, SBI boot support, an 8-entry unified TLB, simple valid/ready memory, and no real instruction/data caches yet. Feature work in this ladder has covered RV32I base tests, M, A, selected Zba/Zbb, Zicsr/Zifencei, Zicntr, compressed-instruction support, and F/D support for later Linux/AtomVM experiments. P88 itself did not rerun full architectural compliance; it ran the BusyBox shell smoke and profile workload.

| core | architectural shape | how ours differs |
|---|---|---|
| Rocket | 5-stage in-order RV64GC generator with MMU, L1 I/D caches, branch prediction, and Rocket Chip tile/SoC integration | Rocket is the mature application-class in-order baseline. Ours is RV32, hand-written teaching RTL, no cache hierarchy, and a much simpler memory system. |
| BOOM | parameterized RV64 out-of-order generator with rename, issue queues, ROB, LSU, and Rocket Chip ecosystem reuse | BOOM chases IPC with speculative out-of-order execution. Ours retires in order and spends effort on making each Linux bring-up feature understandable. |
| CVA6/Ariane | 6-stage single-issue in-order application core with M/S/U privilege support, MMU, caches, and scoreboard behavior | CVA6 is the closest philosophical neighbor: simple enough to reason about, but Linux-class. Ours is much smaller and less mature, with no cache and less verification. |
| Ibex | compact 2-stage or optional 3-stage RV32 embedded core with strong verification and no Linux MMU target | Ibex is cleaner and more production-quality for microcontroller work. Ours is less verified but has Sv32/S-mode/Linux experiments that Ibex intentionally avoids. |
| VexRiscv | highly configurable RV32 SpinalHDL core, 2 to 5+ stages, optional caches, MMU, FPU, debug, and Linux-capable configurations | VexRiscv is a plugin generator. Ours is a fixed pedagogical RTL line where each feature lands as a visible project step. |
| PicoRV32 | size-optimized RV32IMC-capable core with simple valid/ready memory and optional IRQ/PCPI | PicoRV32 is much smaller and cleaner as an embedded helper CPU. Ours is bigger and slower to close, because it carries privilege, MMU, Linux, and shell experiments. |
| SERV | bit-serial RV32 core optimized for minimum area | SERV is what you pick when gates matter more than throughput. Ours is not bit-serial; it is already large enough to boot Linux slowly. |
| XiangShan | high-performance open RV64 application-processor project with modern large-core microarchitecture work | XiangShan is at the opposite end: high-performance, team-scale, modern application processor design. Ours is a notebook-scale bring-up core for understanding the path. |

The blunt comparison: our core is closest to a stripped-down, educational CVA6/VexRiscv-Linux-class experiment, not to BOOM or XiangShan. The next architectural gap is obvious from P88: every Linux-capable core in the table above has, or can be configured with, an instruction cache. Ours does not.

Honest Status

| check | status |
|---|---|
| Memory request attribution in harness | PASS |
| BusyBox shell workload runs | PASS |
| memory_bus.by_kind emitted into chart data | PASS |
| Open-core architecture comparison recorded | PASS |
| LibreLane hardening | NOT RUN |

Next

P89 should be a CPU feature, not another chart. The strongest candidate is an instruction cache or a tiny line-fill buffer, with P84/P86/P87/P88 as the regression benchmark line.
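
A back-of-envelope bound on what such a feature could buy, using the P88 numbers. This is a sketch under a strong assumption that the source does not establish: every fetch stall cycle removed shortens the run by exactly one cycle. The `removed_fraction` parameter is hypothetical and is not a measured hit rate.

```python
# Figures from the P88 attribution run.
CYCLES = 221_748_021
INSTRUCTIONS = 86_435_211
FETCH_STALLS = 58_870_166


def projected_cpi(removed_fraction):
    """CPI if the given fraction of fetch stall cycles disappeared.

    Assumes (optimistically) that each removed stall cycle shortens the
    run by exactly one cycle and that nothing else changes.
    """
    if not 0.0 <= removed_fraction <= 1.0:
        raise ValueError("removed_fraction must be between 0 and 1")
    return (CYCLES - removed_fraction * FETCH_STALLS) / INSTRUCTIONS


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 0.9, 1.0):
        print(f"remove {fraction:4.0%} of fetch stalls -> CPI {projected_cpi(fraction):.2f}")
```

Under that assumption, even removing every fetch stall leaves CPI near 1.88, so the number is a ceiling for judging P89 results, not a prediction.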