Memory stall attribution for the BusyBox shell

P88 kept the P87 BusyBox shell workload and added memory-bus attribution to the Verilator harness. The first classifier was too naive: it treated only S_FETCH as instruction fetch, but since the P64/P66 prefetch work the core can also issue the next fetch from S_WB. As a result most waits landed in other, so the cycles were counted but the attribution was useless.
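The classifier fix can be illustrated with a small Python model. This is only a sketch of the idea, not the harness code: S_FETCH, S_WB and the other bucket come from the notes above, while S_MEM, the is_fetch_req / is_write / is_page_walk flags, and the ifetch / load / store / page_walk labels are hypothetical names chosen for the example.

```python
from collections import Counter

def classify_wait(state, is_fetch_req, is_write, is_page_walk, *, count_wb_fetch=True):
    """Attribute one memory wait cycle to a kind.

    All argument names are hypothetical; the real harness signals may differ.
    count_wb_fetch=False models the first (naive) classifier, which keyed
    instruction fetch on S_FETCH alone.
    """
    if is_page_walk:
        return "page_walk"
    if state == "S_FETCH":
        return "ifetch"
    if count_wb_fetch and state == "S_WB" and is_fetch_req:
        # Fetch for the next instruction issued early from write-back.
        return "ifetch"
    if is_fetch_req:
        # A fetch the classifier does not recognise: counted, but unattributed.
        return "other"
    return "store" if is_write else "load"

# Made-up wait cycles for illustration: (state, is_fetch_req, is_write, is_page_walk)
sample_waits = [
    ("S_FETCH", True, False, False),
    ("S_WB", True, False, False),
    ("S_WB", True, False, False),
    ("S_MEM", False, False, False),  # S_MEM is a hypothetical data-access state
    ("S_MEM", False, True, False),
]

naive = Counter(classify_wait(*w, count_wb_fetch=False) for w in sample_waits)
fixed = Counter(classify_wait(*w) for w in sample_waits)
print("naive:", dict(naive))
print("fixed:", dict(fixed))
```

With the naive setting the two S_WB fetches fall into other; with the fix they are attributed to instruction fetch, which is the behaviour change described above.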

After fixing the classifier, the workload passed and emitted memory_bus.by_kind into the benchmark JSON. The split is clear: instruction fetch accounts for 58,870,166 of 87,892,031 memory stall cycles, about 67%. Data loads are about 17%, stores about 14%, and page walks are only about 3% combined.
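The fetch share can be rechecked from the two cycle counts quoted above. The snippet below is a minimal sketch that hard-codes those counts rather than reading the benchmark JSON, since the exact layout of memory_bus.by_kind is not reproduced here; the helper name share_percent is made up for the example.

```python
# Cycle counts copied from the text above.
IFETCH_STALL_CYCLES = 58_870_166
TOTAL_STALL_CYCLES = 87_892_031

def share_percent(part, total):
    """Return part as a percentage of total (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total

ifetch_share = share_percent(IFETCH_STALL_CYCLES, TOTAL_STALL_CYCLES)
non_fetch_cycles = TOTAL_STALL_CYCLES - IFETCH_STALL_CYCLES

print(f"instruction fetch: {ifetch_share:.1f}% of memory stall cycles")
print(f"everything else:   {non_fetch_cycles:,} cycles")
```

The remaining cycles are what loads, stores, and page walks share between them.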

That changes the next optimization target. We have already squeezed page walks with the 8-entry TLB and trimmed the console path with direct UART output. The remaining obvious architectural hole is fetch-side memory access, and the candidates are an instruction cache, line fills, burst reads, or a more serious fetch queue.

Status: PASS for the profiled shell workload, NOT RUN for LibreLane hardening.