Store-buffer trace and repair · librelane-playground

P103 fixes the P102 store-buffer failure. P102 got Linux to /init, but BusyBox died before the shell prompt after only 79 buffered user stores. P103 added a transaction trace, found the bug, and made the full BusyBox shell smoke test pass again.

The bug was a request/grant mixup. The core requested a store-buffer drain, but prefetch or background fill could win the actual memory port. The sequential logic then saw mem_ready and cleared the store buffer as if the write had completed.

Result

check	result
`make check-tools`	PASS
Verilator build	PASS
Linux reaches `/init`	PASS
BusyBox prompt	PASS
BusyBox shell workload reaches `P103-FILE-OK`	PASS
Store-buffer trace captures grant-qualified drains	PASS
Shell-window speedup versus P101	FAIL
Hardened layout	NOT RUN

Timing

metric	P101 split TLB baseline	P103 repaired store buffer
post-load cycles	217,929,367	218,842,451
shell window cycles	64,084,050	64,809,989
retired instructions	86,116,803	86,218,075
CPI	2.5306	2.5382
BusyBox ready milestone	0	118,415,663
shell `FILE-OK` milestone	217,929,510	218,842,594
kernel panic milestone	0	0

This is an RTL correctness PASS, not a performance pass. P103 is 1.13% slower in the shell window than the P101 baseline running the same workload. That is the cost of the conservative policy: accept one user word store, drain it on the next cycle, then continue.
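The slowdown and CPI figures follow directly from the Timing table. A minimal sketch of the arithmetic, assuming CPI is post-load cycles divided by retired instructions (the page does not state the formula, but the numbers agree with it):

```python
# Figures copied from the Timing table.
P101 = {"post_load": 217_929_367, "shell": 64_084_050, "retired": 86_116_803}
P103 = {"post_load": 218_842_451, "shell": 64_809_989, "retired": 86_218_075}


def percent_increase(baseline: int, candidate: int) -> float:
    """Percent by which candidate exceeds baseline."""
    return (candidate - baseline) / baseline * 100.0


def cpi(run: dict) -> float:
    """Cycles per instruction, assumed here to be post-load cycles / retired."""
    return run["post_load"] / run["retired"]


shell_slowdown = percent_increase(P101["shell"], P103["shell"])
p101_cpi = cpi(P101)
p103_cpi = cpi(P103)

print(f"shell window slowdown: {shell_slowdown:.2f}%")
print(f"CPI P101 {p101_cpi:.4f}, P103 {p103_cpi:.4f}")
```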

Store Buffer Counters

counter	value
accepts	1,156,624
drains	1,156,624
forwards	0
valid cycles	1,156,624
full-wait cycles	0
order-wait cycles	0
drain issue cycles	1,156,624
drain stall cycles	0

No forwards occurred because the policy drains before the next fetch. That is intentionally boring. First make the store visible correctly; then relax the policy.
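The counter table can be read as a set of invariants for the one-entry, drain-next-cycle policy. The sketch below checks them against the reported values; the dictionary keys are names chosen for this example, not identifiers from the RTL.

```python
# Values copied from the Store Buffer Counters table; key names are illustrative.
COUNTERS = {
    "accepts": 1_156_624,
    "drains": 1_156_624,
    "forwards": 0,
    "valid_cycles": 1_156_624,
    "full_wait_cycles": 0,
    "order_wait_cycles": 0,
    "drain_issue_cycles": 1_156_624,
    "drain_stall_cycles": 0,
}
POST_LOAD_CYCLES = 218_842_451  # P103, from the Timing table

# Every accepted store was drained.
every_store_drained = COUNTERS["accepts"] == COUNTERS["drains"]

# Each entry was valid for exactly one cycle on average.
cycles_per_entry = COUNTERS["valid_cycles"] / COUNTERS["accepts"]

# Nothing ever waited on a full buffer, on ordering, or on a stalled drain.
never_waited = all(
    COUNTERS[key] == 0
    for key in ("full_wait_cycles", "order_wait_cycles", "drain_stall_cycles")
)

# Share of post-load cycles in which the buffer held a store.
valid_share = COUNTERS["valid_cycles"] / POST_LOAD_CYCLES * 100.0

print(every_store_drained, cycles_per_entry, never_waited, f"{valid_share:.2f}%")
```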

The Repair

P102 had:

mem_req_storebuf && mem_ready -> clear store-buffer entry

That was wrong because mem_req_storebuf meant “the buffer wants the port”, not “the buffer got the port.” P103 adds:

wire mem_storebuf_grant =
    mem_req_storebuf && (mem_arb_class == MEMC_STORE) && mem_we;

The clear condition is now:

mem_storebuf_grant && mem_ready && !mem_error

The memory priority also gives a pending store-buffer drain the port before fetch, prefetch, or background fill.
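The difference between the two clear conditions can be shown with a small behavioral model. This is a Python sketch of one memory-port cycle, not the project's RTL: the `Cycle` record and the `MEMC_PREFETCH` code are hypothetical, and `MEMC_STORE = 6` is inferred from the `class=6` value printed in the trace.

```python
from dataclasses import dataclass

MEMC_STORE = 6      # inferred from class=6 in the trace
MEMC_PREFETCH = 2   # hypothetical code, chosen only for this sketch


@dataclass
class Cycle:
    mem_req_storebuf: bool   # the buffer wants the port
    mem_arb_class: int       # class that actually won the port
    mem_we: bool             # the granted request is a write
    mem_ready: bool          # the granted request completed this cycle
    mem_error: bool = False


def p102_clear(c: Cycle) -> bool:
    # Buggy rule: any ready while the buffer is requesting clears the entry.
    return c.mem_req_storebuf and c.mem_ready


def p103_clear(c: Cycle) -> bool:
    # Repaired rule: the buffer must own the port and the write must succeed.
    grant = c.mem_req_storebuf and c.mem_arb_class == MEMC_STORE and c.mem_we
    return grant and c.mem_ready and not c.mem_error


# Prefetch wins the port while the store buffer is still requesting.
stolen = Cycle(True, MEMC_PREFETCH, False, True)
# The store buffer wins the port and its write completes.
granted = Cycle(True, MEMC_STORE, True, True)

for name, c in (("stolen", stolen), ("granted", granted)):
    print(name, "P102 clears:", p102_clear(c), "P103 clears:", p103_clear(c))
```

In the stolen-port cycle the P102 rule drops a store that never reached memory, while the P103 rule keeps the entry until the buffer itself is granted.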

Trace Proof

The new +storebuf_trace=PATH harness option writes a CSV with PC, physical address, write data, memory class, store-buffer request, and store-buffer grant.

Early BusyBox startup now shows real store grants:

117899729 accept pc=0x10110 pa=0xa0ddb8 data=0x0 grant=0
117899730 drain  pc=0x10110 class=6 req=1 grant=1
117899733 accept pc=0x10112 pa=0xa0ddbc data=0x0 grant=0
117899734 drain  pc=0x10112 class=6 req=1 grant=1

class=6 is the P94/P103 store class. The trace hit its 200,000-row cap after /init, and the fixed shell run continued through P103-FILE-OK.
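A trace like this is easy to check mechanically. The sketch below parses the four rows exactly as printed above (the real file is a CSV whose column layout is not shown here, so the parser is an assumption) and flags any accept that is not followed one cycle later by a granted class-6 drain from the same PC.

```python
SAMPLE = """\
117899729 accept pc=0x10110 pa=0xa0ddb8 data=0x0 grant=0
117899730 drain  pc=0x10110 class=6 req=1 grant=1
117899733 accept pc=0x10112 pa=0xa0ddbc data=0x0 grant=0
117899734 drain  pc=0x10112 class=6 req=1 grant=1
"""


def parse_row(line: str) -> dict:
    """Split 'cycle event key=value ...' into a dict of integers."""
    cycle, event, *fields = line.split()
    row = {"cycle": int(cycle), "event": event}
    for field in fields:
        key, value = field.split("=", 1)
        row[key] = int(value, 0)  # base 0 accepts both 0x... and decimal
    return row


def unpaired_accepts(rows: list) -> list:
    """Return accepts not followed one cycle later by a granted store drain."""
    bad = []
    pending = None
    for row in rows:
        if row["event"] == "accept":
            if pending is not None:
                bad.append(pending)
            pending = row
        elif row["event"] == "drain" and pending is not None:
            ok = (
                row["cycle"] == pending["cycle"] + 1
                and row["pc"] == pending["pc"]
                and row["class"] == 6
                and row["grant"] == 1
            )
            if not ok:
                bad.append(pending)
            pending = None
    if pending is not None:
        bad.append(pending)
    return bad


rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
violations = unpaired_accepts(rows)
print(len(rows), "rows,", len(violations), "unpaired accepts")
```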

What Other Cores Do

Simple in-order cores can dodge this bug by not decoupling stores. The Ibex LSU stalls the pipeline for loads and stores until the data-side response arrives.

Application-class cores decouple, but they pay for a real ordering contract. CVA6 keeps stores in a buffer and has loads check that buffer for potential aliasing. BOOM uses load/store queues, store dependency masks, store-to-load forwarding, and replay for memory-ordering failures.

The ISA rule underneath this is RVWMO. The RISC-V unprivileged spec allows relaxed write-to-read ordering, but the load-value axiom still requires each loaded byte to come from the newest matching store in global memory order or from an earlier matching store in the same hart’s program order.

P103 is deliberately closer to Ibex than BOOM. It decouples one store, then immediately drains it. That is conservative, but it stops lying about when the write happened.
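For comparison, here is roughly what a load has to do once a store is allowed to sit in the buffer across a later load. This is a minimal sketch of a single-entry, word-granular alias check, not the CVA6 or BOOM implementation; `PendingStore` and `load_word` are names invented for the example, and the data values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingStore:
    addr: int   # word-aligned physical address
    data: int   # 32-bit store data


def load_word(addr: int, pending: Optional[PendingStore], memory: dict) -> int:
    """Return the value a same-hart load should observe.

    If the buffered store targets the same word, the load takes the buffered
    data (store-to-load forwarding); otherwise it reads memory.
    """
    if pending is not None and pending.addr == addr:
        return pending.data
    return memory.get(addr, 0)


memory = {0xA0DDB8: 0x11, 0xA0DDBC: 0x33}
pending = PendingStore(addr=0xA0DDB8, data=0x22)  # store not yet drained

same_word = load_word(0xA0DDB8, pending, memory)   # aliases the buffer
other_word = load_word(0xA0DDBC, pending, memory)  # no alias, reads memory
print(hex(same_word), hex(other_word))
```

P103 never reaches the forwarding branch because the entry is drained before the next fetch, which is why the forwards counter stays at zero.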

Memory Stalls

P103 repaired store buffer workload: 58,733,876 stalls, 65,798,743 handshakes.

class	stalls	share of stalls	requests
instruction fetch	27,358,959	46.6%	46,800,888
data load	11,641,047	19.8%	558,961
data store	10,903,340	18.6%	76,998
atomic memory op	173,118	0.3%	166,870
page walk for fetch	678,827	1.2%	672,673
page walk for load/store	670,646	1.1%	664,454
other	7,307,939	12.4%	16,857,899

The fixed buffer does not reduce the main memory-stall buckets yet. It mostly changes where correctness is enforced: store-buffer requests are now real data-side grants.
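The stall buckets can be re-derived from the raw counts. The sketch below recomputes each class's share of stalls and adds a stalls-per-request ratio, which is a derived number that does not appear in the original table.

```python
# (stalls, requests) per class, copied from the memory-stall figures.
STALLS = {
    "instruction fetch": (27_358_959, 46_800_888),
    "data load": (11_641_047, 558_961),
    "data store": (10_903_340, 76_998),
    "atomic memory op": (173_118, 166_870),
    "page walk for fetch": (678_827, 672_673),
    "page walk for load/store": (670_646, 664_454),
    "other": (7_307_939, 16_857_899),
}

total_stalls = sum(stalls for stalls, _ in STALLS.values())
total_requests = sum(requests for _, requests in STALLS.values())

# Derived per-class metrics: percent of all stalls, and stalls per request.
share = {name: s / total_stalls * 100.0 for name, (s, _) in STALLS.items()}
per_request = {name: s / r for name, (s, r) in STALLS.items()}

for name in STALLS:
    print(f"{name:26s} {share[name]:5.1f}%  {per_request[name]:8.2f} stalls/req")
```

On these figures, data stores carry roughly 142 stalls per request and data loads roughly 21, while instruction fetch stays below one.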

Shell Phases

P103 shell workload: 218,842,451 cycles, CPI 2.54.

phase	cycles	share
kernel banner to /init	116,722,143	53.5%
/init to shell banner	1,065,310	0.5%
shell banner to first command	35,616,942	16.3%
echo command	1,649	0%
uname -a	2,503,630	1.2%
ls /bin /usr/share	31,456,454	14.4%
cat sample file	4,153,621	1.9%
touch/write/cat/rm /tmp file	10,608,449	4.9%
8x ash loop with file I/O	16,085,506	7.4%
final marker	680	0%

The full BusyBox script reaches P103-FILE-OK, including uname, ls, cat, touch, and the looped temp-file workload.
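The phase figures can be cross-checked against the Timing table. The sketch below sums the command phases from echo through the final marker. Treating that span as the shell window is an inference from the arithmetic, not something the page states.

```python
# Cycles per phase, copied from the shell-phase figures, in run order.
PHASES = [
    ("kernel banner to /init", 116_722_143),
    ("/init to shell banner", 1_065_310),
    ("shell banner to first command", 35_616_942),
    ("echo command", 1_649),
    ("uname -a", 2_503_630),
    ("ls /bin /usr/share", 31_456_454),
    ("cat sample file", 4_153_621),
    ("touch/write/cat/rm /tmp file", 10_608_449),
    ("8x ash loop with file I/O", 16_085_506),
    ("final marker", 680),
]
SHELL_WINDOW_CYCLES = 64_809_989  # P103, from the Timing table

# Find where the first shell command starts and sum everything from there on.
first_command = next(i for i, (name, _) in enumerate(PHASES) if name == "echo command")
command_cycles = sum(cycles for _, cycles in PHASES[first_command:])

print(command_cycles, command_cycles == SHELL_WINDOW_CYCLES)
```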

Cycle Shape

P103 repaired store buffer workload: 218,842,451 cycles, CPI 2.54.

state	share	cycles
fetch	3.7%	8,110,602
execute	39.4%	86,242,837
mem	12.8%	27,969,974
walker	1.2%	2,686,600
writeback	39.4%	86,218,075
mul/div	3.5%	7,612,647

P103 retires 86.22M instructions at CPI 2.5382. The cost is acceptable for a correctness repair, but it is not the final store-buffer design.

Hot Functions

P103 BusyBox shell symbols: 63,291 samples, one every 1,024 cycles.

function	image	share of samples	samples
printf_core	busybox	5.7%	3,580
memset	kernel	5.2%	3,298
memcpy	busybox	3.7%	2,313
vruntime_eligible	kernel	3.3%	2,099
blake2s_compress_generic	kernel	2.9%	1,803
memcpy	kernel	2.7%	1,688
__fwritex	busybox	2.6%	1,658
handle_exception	kernel	1.8%	1,115
unmap_page_range	kernel	1.6%	1,008
memset	busybox	1.4%	896
n_tty_write	kernel	1.4%	856
avg_vruntime	kernel	1.4%	855
ret_from_exception	kernel	1.2%	771
next_uptodate_folio	kernel	1.1%	688
n_tty_read	kernel	1%	632
(remaining)	remaining	54.9%	34,751

The workload is the same shell script used by the preceding rungs. The important result is no panic and a complete file smoke milestone.
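Sample counts convert to approximate cycles by multiplying by the sampling period. The sketch below does that for the top entry and for the whole profile. Comparing the profile span with the shell-window cycle count from the Timing table is a consistency check added here, not a claim made by the page.

```python
SAMPLE_PERIOD = 1_024             # cycles between samples
TOTAL_SAMPLES = 63_291
SHELL_WINDOW_CYCLES = 64_809_989  # P103, from the Timing table


def samples_to_cycles(samples: int) -> int:
    """Approximate cycles attributed to a symbol, one sample per period."""
    return samples * SAMPLE_PERIOD


def sample_share(samples: int) -> float:
    """Percent of all profile samples."""
    return samples / TOTAL_SAMPLES * 100.0


printf_core_cycles = samples_to_cycles(3_580)
profile_span = samples_to_cycles(TOTAL_SAMPLES)
gap = SHELL_WINDOW_CYCLES - profile_span

print(printf_core_cycles, f"{sample_share(3_580):.1f}%", profile_span, gap)
```

The profile span lands within one sampling period of the shell-window cycle count, which suggests the samples cover the shell window rather than the whole boot.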

Next

P104 should keep the Harvard arc moving by exposing lower-memory banking or conflict data. If we stay on the store-buffer side, the next useful micro-test is same-address forwarding with a targeted SW-then-LW program, run before loosening the drain-before-fetch policy.
