Memory stall attribution · librelane-playground

P88 does not make the CPU faster. It makes the profiler less vague. The BusyBox shell still runs through the P87 PTY/direct-UART bridge, but the Verilator harness now breaks memory-bus waits into fetch, load, store, AMO, and page-walk buckets.

That matters because “memory is slow” is not an actionable CPU feature. “Instruction fetch accounts for about two thirds of memory stalls” is.

Result

metric	P87 direct UART	P88 attribution run	delta
post-load cycles	222,825,777	221,748,021	-0.48%
retired instructions	86,750,479	86,435,211	-0.36%
CPI	2.5686	2.5655	-0.12%
memory handshakes	33,187,764	32,917,717	-0.81%
memory stall cycles	88,210,458	87,892,031	-0.36%

That small movement is not a speedup claim. The RTL behavior is the same shape as P87; P88 is a better measurement pass.
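The delta column and the CPI row can be re-derived from the raw counters. The sketch below is plain arithmetic on the table values; the dictionary layout and the helper name `pct_delta` are illustrative and not part of the harness.

```python
# Raw counters copied from the result table: (P87, P88).
counters = {
    "post-load cycles": (222_825_777, 221_748_021),
    "retired instructions": (86_750_479, 86_435_211),
    "memory handshakes": (33_187_764, 32_917_717),
    "memory stall cycles": (88_210_458, 87_892_031),
}


def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


deltas = {
    name: round(pct_delta(p87, p88), 2)
    for name, (p87, p88) in counters.items()
}

# CPI is cycles divided by retired instructions.
cpi_p87 = counters["post-load cycles"][0] / counters["retired instructions"][0]
cpi_p88 = counters["post-load cycles"][1] / counters["retired instructions"][1]
cpi_delta = round(pct_delta(cpi_p87, cpi_p88), 2)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
print(f"CPI: {cpi_p87:.4f} -> {cpi_p88:.4f} ({cpi_delta:+.2f}%)")
```

Cycles fell slightly more than retired instructions did, which is why CPI moves by only about a tenth of a percent.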

Memory Stalls

P88 BusyBox shell workload: 87,892,031 memory stall cycles across 32,917,717 handshakes.

request kind	stall cycles	share	requests
instruction fetch	58,870,166	66.98%	29,224,093
data load	14,576,692	16.58%	969,755
data store	11,955,460	13.60%	216,442
AMO	157,725	0.18%	183,819
page walk for fetch	1,117,409	1.27%	1,111,255
page walk for load/store	1,212,863	1.38%	1,212,353
other	1,716	0.00%	0
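As a sanity check on the split, the per-kind numbers can be summed back to the run totals and turned into a rough stall cost per request. This is a minimal sketch over the reported counts; the `by_kind` dictionary here is an illustrative layout, not the harness's actual `memory_bus.by_kind` format.

```python
# (stall cycles, requests) per request kind, copied from the P88 run.
by_kind = {
    "instruction fetch": (58_870_166, 29_224_093),
    "data load": (14_576_692, 969_755),
    "data store": (11_955_460, 216_442),
    "AMO": (157_725, 183_819),
    "page walk for fetch": (1_117_409, 1_111_255),
    "page walk for load/store": (1_212_863, 1_212_353),
    "other": (1_716, 0),
}

total_stalls = sum(stalls for stalls, _ in by_kind.values())
total_requests = sum(reqs for _, reqs in by_kind.values())

shares = {}
for kind, (stalls, reqs) in by_kind.items():
    shares[kind] = stalls / total_stalls * 100.0
    # "other" has no requests, so a per-request figure is undefined there.
    per_req = f"{stalls / reqs:.2f}" if reqs else "n/a"
    print(f"{kind:26s} {shares[kind]:6.2f}%  {per_req} stall cycles/request")

print(f"total: {total_stalls:,} stall cycles, {total_requests:,} requests")
```

On these counts, fetch dominates by volume rather than by per-request cost: it issues about 29.2 million of the 32.9 million requests at roughly two stall cycles each, while loads and stores are far fewer but wait longer per request (about 15 and 55 stall cycles).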

The first version of this harness mislabeled most fetch waits as `other`, because P64/P66 can launch the next fetch during S_WB. P88 now counts S_WB memory requests as instruction fetches and labels the P70 FP high-half states as load/store.
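The labeling rule can be modeled as a lookup from FSM state to request kind. The sketch below is a Python model, not the Verilator harness code. Only `S_WB` is a state name taken from this write-up; every other state name is a hypothetical placeholder, and the stall definition (request valid but not yet ready) is an assumption based on the valid/ready memory interface.

```python
from collections import Counter

# Only S_WB comes from the write-up. All other state names are hypothetical.
STATE_TO_KIND = {
    "S_FETCH": "instruction fetch",
    "S_WB": "instruction fetch",        # next fetch launched during writeback
    "S_LOAD": "data load",
    "S_STORE": "data store",
    "S_FP_LOAD_HI": "data load",        # FP high-half access counted as load
    "S_FP_STORE_HI": "data store",      # FP high-half access counted as store
    "S_AMO": "AMO",
    "S_WALK_FETCH": "page walk for fetch",
    "S_WALK_DATA": "page walk for load/store",
}


def classify(state):
    """Map an FSM state to a request kind; unknown states land in 'other'."""
    return STATE_TO_KIND.get(state, "other")


def attribute(trace):
    """Count stalls and handshakes per kind.

    trace yields one (state, valid, ready) tuple per cycle. Assumption:
    valid without ready is a stall cycle, valid with ready is a handshake.
    """
    stalls, handshakes = Counter(), Counter()
    for state, valid, ready in trace:
        if not valid:
            continue
        kind = classify(state)
        if ready:
            handshakes[kind] += 1
        else:
            stalls[kind] += 1
    return stalls, handshakes


# Tiny made-up trace: a fetch issued in S_WB waits one cycle, then completes.
demo = [("S_WB", True, False), ("S_WB", True, True), ("S_EXEC", False, False)]
demo_stalls, demo_handshakes = attribute(demo)
print(dict(demo_stalls), dict(demo_handshakes))
```

Without the `S_WB` entry, both cycles of the demo fetch would fall through to `other`, which is the mislabeling described above.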

Shell Phases

P88 shell workload: 221,748,021 cycles, CPI 2.57.

phase	cycles	share
kernel banner to /init	117,614,758	53.2%
/init to shell banner	1,092,067	0.5%
shell banner to first command	36,133,324	16.3%
echo command	1,598	0%
uname -a	2,446,206	1.1%
ls /bin /usr/share	32,252,107	14.6%
cat sample file	2,745,025	1.2%
touch/write/cat/rm /tmp file	11,581,923	5.2%
8x ash loop with file I/O	16,297,942	7.4%
final marker	955,006	0.4%

The visible workload is still the shell: boot to /init, shell setup, uname, ls, cat, /tmp file work, and a small ash loop. Kernel boot dominates the run, but the slowest single command is still ls /bin /usr/share at 32,252,107 cycles (14.6%), which is exactly the kind of filesystem and userspace formatting path that pounds instruction fetch.

Cycle Shape

P88 memory attribution workload: 221,748,021 cycles, CPI 2.57.

state	cycles	share
fetch	8,316,547	3.8%
execute	86,460,157	39%
mem	28,059,893	12.7%
walker	4,653,880	2.1%
writeback	86,435,211	39%
mul/div	7,820,617	3.5%

The state chart is useful as a cross-check. P88 did not add a cache or a new pipeline stage, so the high-level state distribution should not move much from P87.
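One way to run that cross-check is to compare the per-state cycle counts against the run totals reported earlier. The sketch below uses only numbers from this page; the variable names are illustrative.

```python
# Per-state cycle counts from the P88 state breakdown.
state_cycles = {
    "fetch": 8_316_547,
    "execute": 86_460_157,
    "mem": 28_059_893,
    "walker": 4_653_880,
    "writeback": 86_435_211,
    "mul/div": 7_820_617,
}
total_cycles = 221_748_021          # post-load cycles for the P88 run
retired_instructions = 86_435_211   # retired instructions for the P88 run

accounted = sum(state_cycles.values())
unaccounted = total_cycles - accounted

print(f"accounted cycles:   {accounted:,}")
print(f"unaccounted cycles: {unaccounted:,}")
print("writeback == retired:", state_cycles["writeback"] == retired_instructions)
```

The six states cover all but 1,716 of the run's cycles, and the writeback count equals the retired-instruction count. The latter is what one would expect if each retired instruction spends exactly one cycle in writeback, though that is an inference from the numbers rather than something stated here.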

Hot Functions

P88 BusyBox shell symbols: 64,726 samples, one sample every 1,024 cycles.

symbol	image	share of samples	samples
printf_core	busybox	5.6%	3,612
memset	kernel	5.1%	3,269
memcpy	busybox	3.6%	2,344
vruntime_eligible	kernel	3.3%	2,151
blake2s_compress_generic	kernel	2.8%	1,809
memcpy	kernel	2.6%	1,707
__fwritex	busybox	2.6%	1,690
handle_exception	kernel	1.7%	1,120
unmap_page_range	kernel	1.7%	1,089
avg_vruntime	kernel	1.4%	886
memset	busybox	1.3%	844
n_tty_write	kernel	1.3%	842
ret_from_exception	kernel	1.2%	749
next_uptodate_folio	kernel	1.1%	696
n_tty_read	kernel	1%	660
(remaining)	-	55.5%	35,946

The top symbols match the stall split: BusyBox formatting (printf_core, memcpy, __fwritex) and kernel memory/scheduler/exception work stay hot. The direct-UART bridge removed the old SBI console leaf path in P87, but the remaining work is broader than terminal output.

Compared With Open Cores

This project core is a teaching ASIC core first. In current project terms it is an RV32 Linux-capable, single-issue, in-order, FSM-style core with M-mode/S-mode, Sv32, SBI boot support, an 8-entry unified TLB, simple valid/ready memory, and no real instruction/data caches yet. Feature work in this ladder has covered RV32I base tests, M, A, selected Zba/Zbb, Zicsr/Zifencei, Zicntr, compressed-instruction support, and F/D support for later Linux/AtomVM experiments. P88 itself did not rerun full architectural compliance; it ran the BusyBox shell smoke and profile workload.

core	architectural shape	how ours differs
Rocket	5-stage in-order RV64GC generator with MMU, L1 I/D caches, branch prediction, and Rocket Chip tile/SoC integration	Rocket is the mature application-class in-order baseline. Ours is RV32, hand-written teaching RTL, no cache hierarchy, and a much simpler memory system.
BOOM	parameterized RV64 out-of-order generator with rename, issue queues, ROB, LSU, and Rocket Chip ecosystem reuse	BOOM chases IPC with speculative out-of-order execution. Ours retires in order and spends effort on making each Linux bring-up feature understandable.
CVA6/Ariane	6-stage single-issue in-order application core with M/S/U privilege support, MMU, caches, and scoreboard behavior	CVA6 is the closest philosophical neighbor: simple enough to reason about, but Linux-class. Ours is much smaller and less mature, with no cache and less verification.
Ibex	compact 2-stage or optional 3-stage RV32 embedded core with strong verification and no Linux MMU target	Ibex is cleaner and more production-quality for microcontroller work. Ours is less verified but has Sv32/S-mode/Linux experiments that Ibex intentionally avoids.
VexRiscv	highly configurable RV32 SpinalHDL core, 2 to 5+ stages, optional caches, MMU, FPU, debug, and Linux-capable configurations	VexRiscv is a plugin generator. Ours is a fixed pedagogical RTL line where each feature lands as a visible project step.
PicoRV32	size-optimized RV32IMC-capable core with simple valid/ready memory and optional IRQ/PCPI	PicoRV32 is much smaller and cleaner as an embedded helper CPU. Ours is bigger and slower to close, because it carries privilege, MMU, Linux, and shell experiments.
SERV	bit-serial RV32 core optimized for minimum area	SERV is what you pick when gates matter more than throughput. Ours is not bit-serial; it is already large enough to boot Linux slowly.
XiangShan	high-performance open RV64 application-processor project with modern large-core microarchitecture work	XiangShan is at the opposite end: high-performance, team-scale, modern application processor design. Ours is a notebook-scale bring-up core for understanding the path.

The blunt comparison: our core is closest to a stripped-down, educational CVA6/VexRiscv-Linux-class experiment, not to BOOM or XiangShan. The next architectural gap is obvious from P88: every Linux-capable core in the table above has, or can be configured with, an instruction cache. Ours has none.

Honest Status

check	status
Memory request attribution in harness	PASS
BusyBox shell workload runs	PASS
`memory_bus.by_kind` emitted into chart data	PASS
Open-core architecture comparison recorded	PASS
LibreLane hardening	NOT RUN

P89 should be a CPU feature, not another chart. The strongest candidate is an instruction cache or a tiny line-fill buffer, with P84/P86/P87/P88 as the regression benchmark line.
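A rough upper bound on what an instruction-side fix could buy follows from the P88 numbers. The sketch below assumes that every fetch stall cycle removed shortens the run by exactly one cycle, that the instruction count stays the same, and that nothing else changes. It ignores miss penalties and any added latency, so treat it as a ceiling, not a prediction. The function name and the intermediate fractions are illustrative.

```python
total_cycles = 221_748_021          # P88 post-load cycles
retired_instructions = 86_435_211   # P88 retired instructions
fetch_stall_cycles = 58_870_166     # P88 instruction-fetch stall cycles


def estimate(stall_fraction_removed):
    """Cycles, CPI, and speedup if this fraction of fetch stalls vanished."""
    saved = fetch_stall_cycles * stall_fraction_removed
    cycles = total_cycles - saved
    return cycles, cycles / retired_instructions, total_cycles / cycles


# Illustrative fractions, not measurements; 1.0 is the ideal ceiling.
for fraction in (0.5, 0.9, 1.0):
    cycles, cpi, speedup = estimate(fraction)
    print(f"{fraction:4.0%} removed: {cycles:,.0f} cycles, "
          f"CPI {cpi:.4f}, speedup {speedup:.2f}x")
```

Under those assumptions, removing every fetch stall would bring CPI from about 2.57 to about 1.88, a ceiling of roughly 1.36x on this workload.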
