P88 kept the P87 BusyBox shell workload and added memory-bus attribution
to the Verilator harness. The first classifier was too naive: it
treated only S_FETCH as instruction fetch, but since the P64/P66
prefetch work the core can also issue the next fetch from S_WB. As a
result most waits landed in the "other" bucket: counted, but useless
for attribution.
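
The classifier fix can be sketched roughly as follows. This is not the harness code: only the state names S_FETCH and S_WB come from these notes. The other states, the BusSample fields (fetch_req, is_write, is_ptw), and the single PageWalk bucket are hypothetical stand-ins for whatever the harness actually samples.

```cpp
#include <cstdint>

// Hypothetical FSM encoding; only S_FETCH and S_WB are named in the notes.
enum class CoreState : std::uint8_t { S_FETCH, S_DECODE, S_EXEC, S_MEM, S_WB };

// Attribution buckets (page walks collapsed into one bucket for brevity).
enum class BusKind : std::uint8_t { IFetch, Load, Store, PageWalk, Other };

// Hypothetical per-cycle snapshot taken while the core waits on the bus.
struct BusSample {
    CoreState state;
    bool fetch_req;  // assumed: the outstanding request is an instruction fetch
    bool is_write;   // assumed: the outstanding request is a store
    bool is_ptw;     // assumed: the request comes from the page-table walker
};

// First attempt: only S_FETCH counts as instruction fetch.
BusKind classify_naive(const BusSample& s) {
    if (s.is_ptw) return BusKind::PageWalk;
    if (s.state == CoreState::S_FETCH) return BusKind::IFetch;
    if (s.state == CoreState::S_MEM) {
        return s.is_write ? BusKind::Store : BusKind::Load;
    }
    return BusKind::Other;
}

// Fixed version: a fetch issued early from S_WB is also instruction fetch.
BusKind classify(const BusSample& s) {
    if (s.state == CoreState::S_WB && s.fetch_req && !s.is_ptw) {
        return BusKind::IFetch;
    }
    return classify_naive(s);
}
```

Under these assumptions, a wait sampled in S_WB with fetch_req set is Other for classify_naive and IFetch for classify, which is the misattribution described above.
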
After fixing the classifier, the workload passed and emitted
memory_bus.by_kind into the benchmark JSON. The split is clear:
instruction fetch accounts for 58,870,166 of 87,892,031 memory stall
cycles, about 67%. Data loads are about 17%, stores about 14%, and page
walks are only about 3% combined.
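
For reference, the 67% figure is the fetch stall count divided by the total, rounded to the nearest whole percent. A minimal helper for that arithmetic is shown here; percent_of is a hypothetical name, not something from the harness.

```cpp
#include <cstdint>

// Share of `total` taken by `part`, rounded to the nearest whole percent.
// Returns 0 when total is 0 to avoid dividing by zero.
constexpr std::uint64_t percent_of(std::uint64_t part, std::uint64_t total) {
    return total == 0 ? 0 : (part * 100 + total / 2) / total;
}

// Instruction fetch share from the numbers above: 58,870,166 / 87,892,031.
static_assert(percent_of(58870166ULL, 87892031ULL) == 67,
              "fetch share should round to 67%");
```
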
That changes the next optimization target. Page walks have already been
squeezed by the 8-entry TLB, and the console path has been trimmed with
direct UART output. The remaining obvious architectural gap is on the
fetch side: an instruction cache, line fills, burst reads, or a more
serious fetch queue.
Status: PASS for the profiled shell workload, NOT RUN for LibreLane hardening.