P88 does not make the CPU faster. It makes the profiler less vague. The BusyBox shell still runs through the P87 PTY/direct-UART bridge, but the Verilator harness now breaks memory-bus waits into fetch, load, store, AMO, and page-walk buckets.
That matters because “memory is slow” is not an actionable CPU feature. “Instruction fetch accounts for about two thirds of memory stalls” is.
Result
| metric | P87 direct UART | P88 attribution run | delta |
|---|---|---|---|
| post-load cycles | 222,825,777 | 221,748,021 | -0.48% |
| retired instructions | 86,750,479 | 86,435,211 | -0.36% |
| CPI | 2.5686 | 2.5655 | -0.12% |
| memory handshakes | 33,187,764 | 32,917,717 | -0.81% |
| memory stall cycles | 88,210,458 | 87,892,031 | -0.36% |
That small movement is not a speedup claim. The RTL behavior is the same shape as P87; P88 is a better measurement pass.
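The CPI and delta columns follow directly from the raw counters. A minimal sketch of that arithmetic, using only the cycle and instruction counts from the table (the dictionary and function names are illustrative, not harness identifiers):

```python
# Raw counters copied from the result table above.
P87 = {"cycles": 222_825_777, "retired": 86_750_479}
P88 = {"cycles": 221_748_021, "retired": 86_435_211}

def cpi(run):
    """Cycles per retired instruction for one run."""
    return run["cycles"] / run["retired"]

def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

cpi_p87 = cpi(P87)
cpi_p88 = cpi(P88)
cpi_delta = pct_delta(cpi_p87, cpi_p88)
cycle_delta = pct_delta(P87["cycles"], P88["cycles"])

print(f"CPI P87 {cpi_p87:.4f}, P88 {cpi_p88:.4f}, delta {cpi_delta:+.2f}%")
print(f"post-load cycles delta {cycle_delta:+.2f}%")
```

Cycles and retired instructions shrink by similar fractions, which is why CPI barely moves.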
Memory Stalls
| request kind | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 58,870,166 | 66.98% | 29,224,093 |
| data load | 14,576,692 | 16.58% | 969,755 |
| data store | 11,955,460 | 13.60% | 216,442 |
| AMO (atomic memory op) | 157,725 | 0.18% | 183,819 |
| page walk for fetch | 1,117,409 | 1.27% | 1,111,255 |
| page walk for load/store | 1,212,863 | 1.38% | 1,212,353 |
| other | 1,716 | 0.00% | 0 |
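
The share column is each bucket divided by the sum of all buckets. The sketch below recomputes it and also divides stall cycles by request count to get a rough average wait per request. It assumes the request figure counts bus requests of that kind; the variable names are illustrative.

```python
# (stall cycles, requests) per request kind, copied from the figures above.
STALLS = {
    "instruction fetch":        (58_870_166, 29_224_093),
    "data load":                (14_576_692,    969_755),
    "data store":               (11_955_460,    216_442),
    "AMO":                      (   157_725,    183_819),
    "page walk for fetch":      ( 1_117_409,  1_111_255),
    "page walk for load/store": ( 1_212_863,  1_212_353),
    "other":                    (     1_716,          0),
}

total_stall = sum(cycles for cycles, _ in STALLS.values())

report = {}
for kind, (cycles, requests) in STALLS.items():
    share = 100.0 * cycles / total_stall
    # Average stall per request is undefined when no requests were counted.
    per_request = cycles / requests if requests else None
    report[kind] = (share, per_request)

for kind, (share, per_request) in report.items():
    avg = "n/a" if per_request is None else f"{per_request:.2f}"
    print(f"{kind:26s} {share:6.2f}%  avg stall/request {avg}")
```

On these numbers the buckets sum exactly to the 87,892,031 memory stall cycles in the result table, and fetch waits average about two cycles per request, so the fetch total comes from request volume rather than from long individual waits.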
The first version of this harness mislabeled most fetch waits as
other, because P64/P66 can launch the next fetch during S_WB. P88
counts S_WB memory requests as instruction fetches and labels the P70
FP high-half states as load/store.
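The relabeling rule can be sketched as a small classifier keyed on FSM state. Only S_WB comes from this write-up. Every other state name here (S_FETCH, S_MEM, S_FP_HI, S_WALK) and every flag name is a hypothetical stand-in, and the real harness is not written in Python.

```python
from enum import Enum, auto

class Kind(Enum):
    FETCH = auto()
    LOAD = auto()
    STORE = auto()
    AMO = auto()
    WALK_FETCH = auto()
    WALK_DATA = auto()
    OTHER = auto()

# Hypothetical state names, except S_WB which the text names explicitly.
FETCH_STATES = {"S_FETCH", "S_WB"}   # S_WB can launch the next fetch early
DATA_STATES = {"S_MEM", "S_FP_HI"}   # FP high-half access counts as load/store

def classify(state, is_write=False, is_amo=False, walk_for_fetch=False):
    """Attribute one memory request to a bucket based on the FSM state."""
    if state in FETCH_STATES:
        return Kind.FETCH
    if state == "S_WALK":
        return Kind.WALK_FETCH if walk_for_fetch else Kind.WALK_DATA
    if state in DATA_STATES:
        if is_amo:
            return Kind.AMO
        return Kind.STORE if is_write else Kind.LOAD
    return Kind.OTHER
```

Dropping S_WB from the fetch set reproduces the original bug in this model: requests issued during writeback fall through to the other bucket.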
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,614,758 | 53.2% |
| /init to shell banner | 1,092,067 | 0.5% |
| shell banner to first command | 36,133,324 | 16.3% |
| echo command | 1,598 | 0.0% |
| uname -a | 2,446,206 | 1.1% |
| ls /bin /usr/share | 32,252,107 | 14.6% |
| cat sample file | 2,745,025 | 1.2% |
| touch/write/cat/rm /tmp file | 11,581,923 | 5.2% |
| 8x ash loop with file I/O | 16,297,942 | 7.4% |
| final marker | 955,006 | 0.4% |
The visible workload is still the shell: boot to /init, shell setup,
uname, ls, cat, /tmp file work, and a small ash loop. Boot dominates the
cycle count, but the slowest single command is still ls /bin /usr/share
at 14.6% of cycles, which is exactly the kind of filesystem and
userspace formatting path that pounds instruction fetch.
Cycle Shape
| state | cycles | share |
|---|---|---|
| fetch | 8,316,547 | 3.8% |
| execute | 86,460,157 | 39.0% |
| mem | 28,059,893 | 12.7% |
| walker | 4,653,880 | 2.1% |
| writeback | 86,435,211 | 39.0% |
| mul/div | 7,820,617 | 3.5% |
The state chart is useful as a cross-check. P88 did not add a cache or a new pipeline stage, so the high-level state distribution should not dramatically move from P87.
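Two cross-checks fall out of the numbers already on this page. The sketch below compares the state totals against the post-load cycle count and the retired-instruction count from the result table. Reading the match between writeback cycles and retired instructions as "one writeback cycle per retired instruction" is an assumption, since the write-up does not state it.

```python
# Cycles per FSM state, copied from the cycle-shape figures above.
STATE_CYCLES = {
    "fetch":      8_316_547,
    "execute":   86_460_157,
    "mem":       28_059_893,
    "walker":     4_653_880,
    "writeback": 86_435_211,
    "mul/div":    7_820_617,
}

POST_LOAD_CYCLES = 221_748_021   # P88 column of the result table
RETIRED = 86_435_211             # P88 retired instructions

accounted = sum(STATE_CYCLES.values())
unaccounted = POST_LOAD_CYCLES - accounted
writeback_matches_retired = STATE_CYCLES["writeback"] == RETIRED

print(f"accounted {accounted:,} of {POST_LOAD_CYCLES:,} cycles")
print(f"unaccounted {unaccounted:,} cycles")
print(f"writeback == retired: {writeback_matches_retired}")
```

The listed states cover all but 1,716 of the post-load cycles. That remainder equals the size of the other stall bucket, though whether those are the same cycles is not something these totals can confirm.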
Hot Functions
- 5.6% of samples (3,612 samples)
- 5.1% of samples (3,269 samples)
- 3.6% of samples (2,344 samples)
- 3.3% of samples (2,151 samples)
- 2.8% of samples (1,809 samples)
- 2.6% of samples (1,707 samples)
- 2.6% of samples (1,690 samples)
- 1.7% of samples (1,120 samples)
- 1.7% of samples (1,089 samples)
- 1.4% of samples (886 samples)
- 1.3% of samples (844 samples)
- 1.3% of samples (842 samples)
- 1.2% of samples (749 samples)
- 1.1% of samples (696 samples)
- 1% of samples (660 samples)
- 55.5% of samples (35,946 samples)
The top symbols match the stall split: BusyBox formatting (printf_core,
memcpy, __fwritex) and kernel memory/scheduler/exception work stay
hot. The direct-UART bridge removed the old SBI console leaf path in P87,
but the remaining work is broader than terminal output.
Compared With Open Cores
This project core is a teaching ASIC core first. In current project terms it is an RV32 Linux-capable, single-issue, in-order, FSM-style core with M-mode/S-mode, Sv32, SBI boot support, an 8-entry unified TLB, simple valid/ready memory, and no real instruction/data caches yet. Feature work in this ladder has covered RV32I base tests, M, A, selected Zba/Zbb, Zicsr/Zifencei, Zicntr, compressed-instruction support, and F/D support for later Linux/AtomVM experiments. P88 itself did not rerun full architectural compliance; it ran the BusyBox shell smoke and profile workload.
| core | architectural shape | how ours differs |
|---|---|---|
| Rocket | 5-stage in-order RV64GC generator with MMU, L1 I/D caches, branch prediction, and Rocket Chip tile/SoC integration | Rocket is the mature application-class in-order baseline. Ours is RV32, hand-written teaching RTL, no cache hierarchy, and a much simpler memory system. |
| BOOM | parameterized RV64 out-of-order generator with rename, issue queues, ROB, LSU, and Rocket Chip ecosystem reuse | BOOM chases IPC with speculative out-of-order execution. Ours retires in order and spends effort on making each Linux bring-up feature understandable. |
| CVA6/Ariane | 6-stage single-issue in-order application core with M/S/U privilege support, MMU, caches, and scoreboard behavior | CVA6 is the closest philosophical neighbor: simple enough to reason about, but Linux-class. Ours is much smaller and less mature, with no cache and less verification. |
| Ibex | compact 2-stage or optional 3-stage RV32 embedded core with strong verification and no Linux MMU target | Ibex is cleaner and more production-quality for microcontroller work. Ours is less verified but has Sv32/S-mode/Linux experiments that Ibex intentionally avoids. |
| VexRiscv | highly configurable RV32 SpinalHDL core, 2 to 5+ stages, optional caches, MMU, FPU, debug, and Linux-capable configurations | VexRiscv is a plugin generator. Ours is a fixed pedagogical RTL line where each feature lands as a visible project step. |
| PicoRV32 | size-optimized RV32IMC-capable core with simple valid/ready memory and optional IRQ/PCPI | PicoRV32 is much smaller and cleaner as an embedded helper CPU. Ours is bigger and slower to close, because it carries privilege, MMU, Linux, and shell experiments. |
| SERV | bit-serial RV32 core optimized for minimum area | SERV is what you pick when gates matter more than throughput. Ours is not bit-serial; it is already large enough to boot Linux slowly. |
| XiangShan | high-performance open RV64 application-processor project with modern large-core microarchitecture work | XiangShan is at the opposite end: high-performance, team-scale, modern application processor design. Ours is a notebook-scale bring-up core for understanding the path. |
The blunt comparison: our core is closest to a stripped-down, educational CVA6/VexRiscv-Linux-class experiment, not to BOOM or XiangShan. The next architectural gap is obvious from P88: every Linux-capable core in the table above runs Linux with an instruction cache. We do not.
Honest Status
| check | status |
|---|---|
| Memory request attribution in harness | PASS |
| BusyBox shell workload runs | PASS |
| memory_bus.by_kind emitted into chart data | PASS |
| Open-core architecture comparison recorded | PASS |
| LibreLane hardening | NOT RUN |
Next
P89 should be a CPU feature, not another chart. The strongest candidate is an instruction cache or a tiny line-fill buffer, with P84/P86/P87/P88 as the regression benchmark line.
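A rough ceiling on what fetch-side work could buy can be estimated from the P88 counters. This is a back-of-envelope sketch, not a prediction. It assumes every removed fetch stall cycle is a cycle saved, ignores any added hit latency, and ignores fetch page walks. The function name and the `fraction_removed` parameter are illustrative.

```python
# P88 counters from the tables above.
CYCLES = 221_748_021       # post-load cycles
RETIRED = 86_435_211       # retired instructions
FETCH_STALL = 58_870_166   # instruction-fetch stall cycles

def cpi_if_fetch_stalls_cut(fraction_removed):
    """Estimated CPI if this fraction of fetch stall cycles disappeared."""
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be between 0 and 1")
    saved = FETCH_STALL * fraction_removed
    return (CYCLES - saved) / RETIRED

for fraction in (0.0, 0.5, 0.9, 1.0):
    print(f"{fraction:4.0%} of fetch stalls removed -> "
          f"CPI {cpi_if_fetch_stalls_cut(fraction):.3f}")
```

Under those assumptions, even a perfect instruction cache leaves CPI near 1.88, so the remaining cycles in execute, writeback, and data-side stalls still matter after P89.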