Project No. 103 of 147 on the ladder

Store-buffer trace and repair

introduces — store-buffer transaction trace; grant-qualified store-buffer drain; BusyBox shell correctness repair

harden state: last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P103 fixes the P102 store-buffer failure. P102 got Linux to /init, but BusyBox died before the shell prompt after only 79 buffered user stores. P103 adds a transaction trace, uses it to find the bug, and gets the full BusyBox shell smoke test passing again.

The bug was a request/grant mixup. The core requested a store-buffer drain, but prefetch or background fill could win the actual memory port. The sequential logic then saw mem_ready and cleared the store buffer as if the write had completed.

Result

check                                                 result
make check-tools                                      PASS
Verilator build                                       PASS
Linux reaches /init                                   PASS
BusyBox prompt                                        PASS
BusyBox shell workload reaches P103-FILE-OK           PASS
Store-buffer trace captures grant-qualified drains    PASS
Shell-window speedup versus P101                      FAIL
Hardened layout                                       NOT RUN

Timing

metric                     P101 split TLB baseline    P103 repaired store buffer
post-load cycles           217,929,367                218,842,451
shell window cycles        64,084,050                 64,809,989
retired instructions       86,116,803                 86,218,075
CPI                        2.5306                     2.5382
BusyBox ready milestone    0                          118,415,663
shell FILE-OK milestone    217,929,510                218,842,594
kernel panic milestone     0                          0

This is an RTL correctness pass, not a performance pass. P103 is 1.13% slower in the shell window than the P101 baseline run (64,809,989 versus 64,084,050 cycles). That is the cost of the conservative policy: accept one user word store, drain it on the next cycle, then continue.
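The slowdown and CPI figures can be re-derived from the table. The sketch below uses only the reported numbers; the variable and function names are illustrative, and it assumes CPI is post-load cycles divided by retired instructions, which is the ratio that reproduces the table values.

```python
# Numbers copied from the Timing table; names are illustrative only.
p101 = {"post_load_cycles": 217_929_367, "shell_cycles": 64_084_050, "retired": 86_116_803}
p103 = {"post_load_cycles": 218_842_451, "shell_cycles": 64_809_989, "retired": 86_218_075}


def cpi(run):
    """Cycles per retired instruction (assumed: post-load cycles / retired)."""
    return run["post_load_cycles"] / run["retired"]


def slowdown_pct(base, new, key="shell_cycles"):
    """Percent increase in cycles of `new` over `base` for the given window."""
    return 100.0 * (new[key] - base[key]) / base[key]


print(f"P101 CPI {cpi(p101):.4f}, P103 CPI {cpi(p103):.4f}")
print(f"shell-window slowdown {slowdown_pct(p101, p103):.2f}%")
# P101 CPI 2.5306, P103 CPI 2.5382
# shell-window slowdown 1.13%
```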

Store Buffer Counters

counter               value
accepts               1,156,624
drains                1,156,624
forwards              0
valid cycles          1,156,624
full-wait cycles      0
order-wait cycles     0
drain issue cycles    1,156,624
drain stall cycles    0

No forwards occurred because the policy drains before the next fetch. That is intentionally boring. First make the store visible correctly; then relax the policy.
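The counters are consistent with the accept-then-drain-next-cycle policy: every accepted store drains, each entry is valid for one cycle, and no drain stalls. A minimal sketch of that consistency check follows; the dictionary keys and the function name are hypothetical, and the invariants are my reading of the policy rather than checks taken from the harness.

```python
# Counter values copied from the table above; key names are hypothetical.
counters = {
    "accepts": 1_156_624,
    "drains": 1_156_624,
    "forwards": 0,
    "valid_cycles": 1_156_624,
    "full_wait_cycles": 0,
    "order_wait_cycles": 0,
    "drain_issue_cycles": 1_156_624,
    "drain_stall_cycles": 0,
}


def failed_invariants(c):
    """Return the names of the drain-next-cycle invariants that do not hold."""
    checks = {
        "every accept drains": c["accepts"] == c["drains"],
        "one valid cycle per entry": c["valid_cycles"] == c["accepts"],
        "one issue cycle per drain": c["drain_issue_cycles"] == c["drains"],
        "no drain stalls": c["drain_stall_cycles"] == 0,
        "no waits on accept": c["full_wait_cycles"] == 0 and c["order_wait_cycles"] == 0,
        "no forwards": c["forwards"] == 0,
    }
    return [name for name, ok in checks.items() if not ok]


print(failed_invariants(counters))  # []
```

Once the policy is relaxed, several of these checks are expected to stop holding (for example, forwards becoming nonzero), so they describe this rung only.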

The Repair

P102 had:

mem_req_storebuf && mem_ready -> clear store-buffer entry

That was wrong because mem_req_storebuf meant “the buffer wants the port”, not “the buffer got the port.” P103 adds:

wire mem_storebuf_grant =
    mem_req_storebuf && (mem_arb_class == MEMC_STORE) && mem_we;

The clear condition is now:

mem_storebuf_grant && mem_ready && !mem_error

The memory priority also gives a pending store-buffer drain the port before fetch, prefetch, or background fill.
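To see the difference between the two clear conditions in isolation, here is a small behavioral sketch in Python (not the RTL). It evaluates both conditions for a single memory cycle. The signal names follow the text; the prefetch class code is an assumed value for illustration, and only the store class code 6 comes from the trace.

```python
# Behavioral sketch of the two clear conditions for one memory cycle.
MEMC_STORE = 6      # store class, matching class=6 in the trace
MEMC_PREFETCH = 2   # ASSUMED value, for illustration only


def clear_p102(sig):
    """P102: clear on request + ready, regardless of who owns the port."""
    return bool(sig["mem_req_storebuf"] and sig["mem_ready"])


def clear_p103(sig):
    """P103: clear only when the store buffer actually holds the grant."""
    grant = (sig["mem_req_storebuf"]
             and sig["mem_arb_class"] == MEMC_STORE
             and sig["mem_we"])
    return bool(grant and sig["mem_ready"] and not sig["mem_error"])


# The buffer requests a drain, but a prefetch read owns the port this cycle.
prefetch_wins = dict(mem_req_storebuf=1, mem_arb_class=MEMC_PREFETCH,
                     mem_we=0, mem_ready=1, mem_error=0)
# The buffer requests a drain and its write owns the port.
store_wins = dict(mem_req_storebuf=1, mem_arb_class=MEMC_STORE,
                  mem_we=1, mem_ready=1, mem_error=0)

for name, sig in [("prefetch wins", prefetch_wins), ("store wins", store_wins)]:
    print(f"{name}: P102 clears={clear_p102(sig)} P103 clears={clear_p103(sig)}")
# prefetch wins: P102 clears=True P103 clears=False
# store wins: P102 clears=True P103 clears=True
```

In the first case P102 drops a store that never reached memory, which is the failure described above. P103 keeps the entry until a cycle where the store class holds the port.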

Trace Proof

The new +storebuf_trace=PATH harness option writes a CSV with PC, physical address, write data, memory class, store-buffer request, and store-buffer grant.

Early BusyBox startup now shows real store grants:

117899729 accept pc=0x10110 pa=0xa0ddb8 data=0x0 grant=0
117899730 drain  pc=0x10110 class=6 req=1 grant=1
117899733 accept pc=0x10112 pa=0xa0ddbc data=0x0 grant=0
117899734 drain  pc=0x10112 class=6 req=1 grant=1

class=6 is the P94/P103 store class. The trace hit its 200,000-row cap after /init, and the fixed shell run continued through P103-FILE-OK.
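A trace like this can be checked mechanically. The sketch below parses the four lines exactly as printed above and verifies that each accept is followed one cycle later by a granted store-class drain from the same PC. It works on the rendered excerpt, not on the CSV file, whose column names are not given here; the function names are hypothetical.

```python
EXCERPT = """\
117899729 accept pc=0x10110 pa=0xa0ddb8 data=0x0 grant=0
117899730 drain  pc=0x10110 class=6 req=1 grant=1
117899733 accept pc=0x10112 pa=0xa0ddbc data=0x0 grant=0
117899734 drain  pc=0x10112 class=6 req=1 grant=1
"""


def parse_line(line):
    """Split '<cycle> <event> key=value ...' into a dict of ints."""
    cycle, event, *fields = line.split()
    rec = {"cycle": int(cycle), "event": event}
    for field in fields:
        key, value = field.split("=", 1)
        rec[key] = int(value, 0)  # base 0 accepts both 0x... and decimal
    return rec


def check_accept_drain_pairs(records, store_class=6):
    """Return a list of problems; empty means every accept drained correctly."""
    problems, pending = [], None
    for rec in records:
        if rec["event"] == "accept":
            if pending is not None:
                problems.append(f"accept at {pending['cycle']} was never drained")
            pending = rec
        elif rec["event"] == "drain":
            if pending is None:
                problems.append(f"drain at {rec['cycle']} has no accept")
                continue
            if rec["cycle"] != pending["cycle"] + 1 or rec["pc"] != pending["pc"]:
                problems.append(f"drain at {rec['cycle']} does not match accept")
            if rec.get("grant") != 1 or rec.get("class") != store_class:
                problems.append(f"drain at {rec['cycle']} is not a granted store")
            pending = None
    return problems


records = [parse_line(line) for line in EXCERPT.splitlines() if line.strip()]
print(check_accept_drain_pairs(records))  # []
```

A trailing accept with no drain is deliberately not reported, since a capped trace can end between the two events.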

What Other Cores Do

Simple in-order cores can dodge this bug by not decoupling stores. The Ibex LSU stalls the pipeline for loads and stores until the data-side response arrives.

Application-class cores decouple, but they pay for a real ordering contract. CVA6 keeps stores in a buffer and has loads check that buffer for potential aliasing. BOOM uses load/store queues, store dependency masks, store-to-load forwarding, and replay for memory-ordering failures.

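The ISA rule underneath this is RVWMO. The RISC-V unprivileged spec allows relaxed write-to-read ordering, but the load value axiom still requires each loaded byte to come from the store to that byte that is latest in global memory order among two groups: stores that precede the load in global memory order, and stores that precede it in the same hart’s program order. The second group is what lets a hart read its own buffered store before that store is globally visible.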

P103 is deliberately closer to Ibex than BOOM. It decouples one store, then immediately drains it. That is conservative, but it stops lying about when the write happened.

Memory Stalls

Memory stalls, P103 repaired store buffer workload: stalls 58,733,876, handshakes 65,798,743
  1. instruction fetch 27,358,959 46.6% 46,800,888 req
  2. data load 11,641,047 19.8% 558,961 req
  3. data store 10,903,340 18.6% 76,998 req
  4. atomic memory op 173,118 0.3% 166,870 req
  5. page walk for fetch 678,827 1.2% 672,673 req
  6. page walk for load/store 670,646 1.1% 664,454 req
  7. other 7,307,939 12.4% 16,857,899 req

The fixed buffer does not reduce the main memory-stall buckets yet. It mostly changes where correctness is enforced: store-buffer requests are now real data-side grants.

Shell Phases

Shell phases, P103 shell workload: cycles 218,842,451, CPI 2.54
  1. kernel banner to /init 116,722,143 53.5%
  2. /init to shell banner 1,065,310 0.5%
  3. shell banner to first command 35,616,942 16.3%
  4. echo command 1,649 0%
  5. uname -a 2,503,630 1.2%
  6. ls /bin /usr/share 31,456,454 14.4%
  7. cat sample file 4,153,621 1.9%
  8. touch/write/cat/rm /tmp file 10,608,449 4.9%
  9. 8x ash loop with file I/O 16,085,506 7.4%
  10. final marker 680 0%

The full BusyBox script reaches P103-FILE-OK, including uname, ls, cat, touch, and the looped temp-file workload.

Cycle Shape

State breakdown, P103 repaired store buffer workload: cycles 218,842,451, CPI 2.54
  1. fetch 3.7% 8,110,602
  2. execute 39.4% 86,242,837
  3. mem 12.8% 27,969,974
  4. walker 1.2% 2,686,600
  5. writeback 39.4% 86,218,075
  6. mul/div 3.5% 7,612,647

P103 retires 86.22M instructions at CPI 2.5382. The cost is acceptable for a correctness repair, but it is not the final store-buffer design.

Hot Functions

Hot functions, P103 BusyBox shell symbols: 63,291 samples, one every 1,024 cycles
  1. printf_core (busybox) 5.7% 3,580
  2. memset (kernel) 5.2% 3,298
  3. memcpy (busybox) 3.7% 2,313
  4. vruntime_eligible (kernel) 3.3% 2,099
  5. blake2s_compress_generic (kernel) 2.9% 1,803
  6. memcpy (kernel) 2.7% 1,688
  7. __fwritex (busybox) 2.6% 1,658
  8. handle_exception (kernel) 1.8% 1,115
  9. unmap_page_range (kernel) 1.6% 1,008
  10. memset (busybox) 1.4% 896
  11. n_tty_write (kernel) 1.4% 856
  12. avg_vruntime (kernel) 1.4% 855
  13. ret_from_exception (kernel) 1.2% 771
  14. next_uptodate_folio (kernel) 1.1% 688
  15. n_tty_read (kernel) 1% 632
  16. (remaining) 54.9% 34,751

The workload is the same shell script used by the preceding rungs. The important result is no panic and a complete file smoke milestone.

Next

P104 should keep the Harvard arc moving by exposing lower-memory banking or conflict data. If we stay on the store-buffer side, the next useful micro-test is same-address forwarding: a targeted SW followed by an LW to the same address, run before loosening the drain-before-fetch policy.
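As a starting point for that micro-test, here is a minimal reference model of what same-address forwarding should return. It is a sketch under stated assumptions, not the P103 design: a single-entry buffer, word-granular addresses, memory that reads as 0 where unwritten, and a class name and methods that are all hypothetical.

```python
class OneEntryStoreBuffer:
    """Reference model: one buffered word store in front of a word memory."""

    def __init__(self):
        self.mem = {}        # address -> word; unwritten words read as 0 (assumed)
        self.entry = None    # (addr, data) while a store is buffered
        self.forwards = 0    # loads served from the buffer

    def store(self, addr, data):
        """SW: buffer the store, draining any older entry first."""
        self.drain()
        self.entry = (addr, data)

    def load(self, addr):
        """LW: forward from the buffer on an address match, else read memory."""
        if self.entry is not None and self.entry[0] == addr:
            self.forwards += 1
            return self.entry[1]
        return self.mem.get(addr, 0)

    def drain(self):
        """Write the buffered store to memory and free the entry."""
        if self.entry is not None:
            addr, data = self.entry
            self.mem[addr] = data
            self.entry = None


sb = OneEntryStoreBuffer()
sb.store(0xA0DDB8, 0x1234)      # SW
print(hex(sb.load(0xA0DDB8)))   # LW before the drain: 0x1234, forwarded
sb.drain()
print(hex(sb.load(0xA0DDB8)))   # LW after the drain: 0x1234, from memory
print(sb.forwards)              # 1
```

An RTL run of the same SW; LW sequence could be compared against this model: the loaded value must match in both cases, and only the forward count should depend on whether the drain-before-fetch policy is still in place.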