P103 fixes the P102 store-buffer failure. P102 got Linux to /init,
then BusyBox died before the shell prompt after only 79 buffered user
stores. P103 added a transaction trace, found the bug, and made the full
BusyBox shell smoke pass again.
The bug was a request/grant mixup. The core requested a store-buffer
drain, but prefetch or background fill could win the actual memory port.
The sequential logic then saw mem_ready and cleared the store buffer as
if the write had completed.
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P103-FILE-OK | PASS |
| Store-buffer trace captures grant-qualified drains | PASS |
| Shell-window speedup versus P101 | FAIL |
| Hardened layout | NOT RUN |
Timing
| metric | P101 split TLB baseline | P103 repaired store buffer |
|---|---|---|
| post-load cycles | 217,929,367 | 218,842,451 |
| shell window cycles | 64,084,050 | 64,809,989 |
| retired instructions | 86,116,803 | 86,218,075 |
| CPI | 2.5306 | 2.5382 |
| BusyBox ready milestone | 0 | 118,415,663 |
| shell FILE-OK milestone | 217,929,510 | 218,842,594 |
| kernel panic milestone | 0 | 0 |
This is an RTL correctness pass, not a performance pass. P103 is 1.13% slower in the shell window than the P101 baseline on the same workload (64,809,989 versus 64,084,050 cycles). That is the cost of the conservative policy: accept one user word store, drain it on the next cycle, then continue.
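The slowdown and CPI figures can be rechecked from the table. A minimal Python sketch, using only the numbers above; the dictionary keys and helper names are mine, and it assumes CPI is post-load cycles divided by retired instructions, which is what the table values are consistent with.

```python
# Figures copied from the Timing table.
p101 = {"post_load": 217_929_367, "shell": 64_084_050, "retired": 86_116_803}
p103 = {"post_load": 218_842_451, "shell": 64_809_989, "retired": 86_218_075}

def cpi(run):
    """Cycles per instruction: post-load cycles over retired instructions."""
    return run["post_load"] / run["retired"]

def slowdown_pct(base, new, key):
    """Percentage increase of `key` in `new` relative to `base`."""
    return 100.0 * (new[key] - base[key]) / base[key]

shell_slowdown = slowdown_pct(p101, p103, "shell")
print(f"P101 CPI {cpi(p101):.4f}, P103 CPI {cpi(p103):.4f}")
print(f"shell-window slowdown {shell_slowdown:.2f}%")
```

The shell-window delta is 725,939 cycles on a 64.08M-cycle baseline, which is where the 1.13% comes from.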
Store Buffer Counters
| counter | value |
|---|---|
| accepts | 1,156,624 |
| drains | 1,156,624 |
| forwards | 0 |
| valid cycles | 1,156,624 |
| full-wait cycles | 0 |
| order-wait cycles | 0 |
| drain issue cycles | 1,156,624 |
| drain stall cycles | 0 |
No forwards occurred because the policy drains before the next fetch. That is intentionally boring. First make the store visible correctly; then relax the policy.
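A toy model of that policy shows why the counters line up so neatly. This is a Python sketch, not the RTL: the class and counter names are hypothetical, and it assumes every drain is granted on the cycle right after its accept, which is what the zero stall and wait counters suggest.

```python
from dataclasses import dataclass, field

@dataclass
class OneEntryStoreBuffer:
    """Toy model: accept one store, drain it on the next cycle."""
    valid: bool = False
    counters: dict = field(default_factory=lambda: {
        "accepts": 0, "drains": 0, "forwards": 0, "valid_cycles": 0})

    def accept(self):
        # Models the cycle in which the store is buffered.
        assert not self.valid, "policy never accepts while an entry is pending"
        self.valid = True
        self.counters["accepts"] += 1

    def drain(self):
        # Models the following cycle: the entry is valid and is written out.
        assert self.valid, "nothing to drain"
        self.counters["valid_cycles"] += 1
        self.counters["drains"] += 1
        self.valid = False

buf = OneEntryStoreBuffer()
for _ in range(1000):  # 1000 stores, each drained before anything else runs
    buf.accept()
    buf.drain()
print(buf.counters)
```

Because no load ever sees a valid entry in this model, the forward counter cannot move, matching the measured zero.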
The Repair
P102 had:
mem_req_storebuf && mem_ready -> clear store-buffer entry
That was wrong because mem_req_storebuf meant “the buffer wants the
port”, not “the buffer got the port.” P103 adds:
wire mem_storebuf_grant =
mem_req_storebuf && (mem_arb_class == MEMC_STORE) && mem_we;
The clear condition is now:
mem_storebuf_grant && mem_ready && !mem_error
Memory arbitration also gives a pending store-buffer drain priority over fetch, prefetch, and background fill.
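The difference between the two clear conditions can be shown with a small truth-table check. This is a Python sketch of the predicates only, not the RTL. MEMC_STORE = 6 follows the class value seen in the trace; MEMC_PREFETCH = 3 is a made-up value for illustration.

```python
MEMC_STORE = 6       # matches class=6 in the trace
MEMC_PREFETCH = 3    # hypothetical value, for illustration only

def p102_clear(mem_req_storebuf, mem_ready, **_):
    """P102 rule: clears on any ready while the buffer is requesting."""
    return mem_req_storebuf and mem_ready

def p103_clear(mem_req_storebuf, mem_arb_class, mem_we, mem_ready, mem_error):
    """P103 rule: clears only when the buffer actually owns the port."""
    grant = mem_req_storebuf and mem_arb_class == MEMC_STORE and mem_we
    return grant and mem_ready and not mem_error

# Buffer wants the port, but a prefetch read won arbitration this cycle.
lost = dict(mem_req_storebuf=True, mem_arb_class=MEMC_PREFETCH,
            mem_we=False, mem_ready=True, mem_error=False)
# Buffer wants the port and the store write was granted.
won = dict(lost, mem_arb_class=MEMC_STORE, mem_we=True)

print(p102_clear(**lost), p103_clear(**lost))  # True False
print(p102_clear(**won), p103_clear(**won))    # True True
```

The first row is the P102 failure: mem_ready belonged to the prefetch, yet the old rule dropped the buffered store.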
Trace Proof
The new +storebuf_trace=PATH harness option writes a CSV with PC,
physical address, write data, memory class, store-buffer request, and
store-buffer grant.
Early BusyBox startup now shows real store grants:
117899729 accept pc=0x10110 pa=0xa0ddb8 data=0x0 grant=0
117899730 drain pc=0x10110 class=6 req=1 grant=1
117899733 accept pc=0x10112 pa=0xa0ddbc data=0x0 grant=0
117899734 drain pc=0x10112 class=6 req=1 grant=1
class=6 is the P94/P103 store class. The trace hit its 200,000-row cap
after /init, and the fixed shell run continued through P103-FILE-OK.
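A quick way to audit an excerpt like this is to pair each accept with the drain that follows it. The Python sketch below parses the four lines as printed above, not the CSV file itself, whose column layout is not shown here. The function names are hypothetical, and the strict accept/drain alternation is an assumption that holds for this excerpt.

```python
EXCERPT = """\
117899729 accept pc=0x10110 pa=0xa0ddb8 data=0x0 grant=0
117899730 drain pc=0x10110 class=6 req=1 grant=1
117899733 accept pc=0x10112 pa=0xa0ddbc data=0x0 grant=0
117899734 drain pc=0x10112 class=6 req=1 grant=1
"""

def parse_line(line):
    """Split one excerpt line into cycle, event, and key=value fields."""
    cycle, event, *pairs = line.split()
    # int(v, 0) accepts both hex (0x...) and decimal field values.
    fields = {k: int(v, 0) for k, v in (p.split("=") for p in pairs)}
    return {"cycle": int(cycle), "event": event, **fields}

def check_accept_drain_pairs(rows):
    """Return (accept, drain) cycle pairs that break the expected pattern."""
    problems = []
    for acc, drn in zip(rows[0::2], rows[1::2]):
        ok = (acc["event"] == "accept" and drn["event"] == "drain"
              and drn["cycle"] == acc["cycle"] + 1   # drained on next cycle
              and drn["pc"] == acc["pc"]             # same store instruction
              and drn["grant"] == 1 and drn["class"] == 6)
        if not ok:
            problems.append((acc["cycle"], drn["cycle"]))
    return problems

rows = [parse_line(line) for line in EXCERPT.splitlines() if line.strip()]
problems = check_accept_drain_pairs(rows)
print(problems)
```

An empty list means every accept in the excerpt was drained one cycle later with a store-class grant.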
What Other Cores Do
Simple in-order cores can dodge this bug by not decoupling stores. The Ibex LSU stalls the pipeline for loads and stores until the data-side response arrives.
Application-class cores decouple, but they pay for a real ordering contract. CVA6 keeps stores in a buffer and has loads check that buffer for potential aliasing. BOOM uses load/store queues, store dependency masks, store-to-load forwarding, and replay for memory-ordering failures.
The ISA rule underneath this is RVWMO. The RISC-V unprivileged spec allows relaxed write-to-read ordering, but the load-value axiom still requires each loaded byte to come from the newest matching store in global memory order or from an earlier matching store in the same hart’s program order.
P103 is deliberately closer to Ibex than BOOM. It decouples one store, then immediately drains it. That is conservative, but it stops lying about when the write happened.
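For contrast, here is roughly what a forwarding lookup has to do once stores are allowed to linger. This Python sketch is a deliberately narrow model, not how CVA6 or BOOM implement it: it assumes a single hart and aligned whole-word accesses only, so partial overlaps never arise. The function and variable names are hypothetical.

```python
def load_word(addr, store_buffer, memory):
    """Return (value, source) an aligned word load must observe.

    store_buffer: list of (addr, data) in program order, oldest first.
    memory: dict of word-aligned address -> value already drained.
    """
    # The youngest buffered store to the same word wins over memory.
    for st_addr, st_data in reversed(store_buffer):
        if st_addr == addr:
            return st_data, "forwarded"
    return memory.get(addr, 0), "memory"

memory = {0x1000: 0x11111111}
pending = [(0x1000, 0xAAAAAAAA), (0x1004, 0xBBBBBBBB), (0x1000, 0xCCCCCCCC)]

print(load_word(0x1000, pending, memory))  # newest matching store
print(load_word(0x1008, pending, memory))  # no alias, falls through to memory
```

Under the P103 policy the buffer is already empty by the time a load issues, so the lookup would always fall through to memory. That is why the forward counter stays at zero.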
Memory Stalls
- instruction fetch: 27,358,959 cycles (46.6%), 46,800,888 req
- data load: 11,641,047 cycles (19.8%), 558,961 req
- data store: 10,903,340 cycles (18.6%), 76,998 req
- atomic memory op: 173,118 cycles (0.3%), 166,870 req
- page walk for fetch: 678,827 cycles (1.2%), 672,673 req
- page walk for load/store: 670,646 cycles (1.1%), 664,454 req
- other: 7,307,939 cycles (12.4%), 16,857,899 req
The fixed buffer does not reduce the main memory-stall buckets yet. It mostly changes where correctness is enforced: a store-buffer drain now counts as complete only when it holds a real data-side grant.
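The percentages in the list are shares of the summed stall total, which a short Python check confirms. It also derives a cost per request for each bucket. That ratio is my own derived figure, and it assumes the first number in each row is stall cycles attributed to the bucket and the second is the request count.

```python
stalls = {  # bucket: (cycles, requests), copied from the list above
    "instruction fetch": (27_358_959, 46_800_888),
    "data load": (11_641_047, 558_961),
    "data store": (10_903_340, 76_998),
    "atomic memory op": (173_118, 166_870),
    "page walk for fetch": (678_827, 672_673),
    "page walk for load/store": (670_646, 664_454),
    "other": (7_307_939, 16_857_899),
}

total = sum(cycles for cycles, _ in stalls.values())
shares = {name: 100.0 * cycles / total for name, (cycles, _) in stalls.items()}

for name, (cycles, reqs) in stalls.items():
    print(f"{name:26s} {shares[name]:5.1f}%  {cycles / reqs:8.2f} cycles/req")
print(f"total stall cycles: {total:,}")
```

Fetch dominates by volume, while data stores and loads cost far more per request.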
Shell Phases
- kernel banner to /init: 116,722,143 cycles (53.5%)
- /init to shell banner: 1,065,310 cycles (0.5%)
- shell banner to first command: 35,616,942 cycles (16.3%)
- echo command: 1,649 cycles (0%)
- uname -a: 2,503,630 cycles (1.2%)
- ls /bin /usr/share: 31,456,454 cycles (14.4%)
- cat sample file: 4,153,621 cycles (1.9%)
- touch/write/cat/rm /tmp file: 10,608,449 cycles (4.9%)
- 8x ash loop with file I/O: 16,085,506 cycles (7.4%)
- final marker: 680 cycles (0%)
The full BusyBox script reaches P103-FILE-OK, including uname, ls,
cat, touch, and the looped temp-file workload.
Cycle Shape
- fetch: 8,110,602 cycles (3.7%)
- execute: 86,242,837 cycles (39.4%)
- mem: 27,969,974 cycles (12.8%)
- walker: 2,686,600 cycles (1.2%)
- writeback: 86,218,075 cycles (39.4%)
- mul/div: 7,612,647 cycles (3.5%)
P103 retires 86.22M instructions at CPI 2.5382. The cost is acceptable for a correctness repair, but it is not the final store-buffer design.
Hot Functions
- 5.7% of samples (3,580 samples)
- 5.2% of samples (3,298 samples)
- 3.7% of samples (2,313 samples)
- 3.3% of samples (2,099 samples)
- 2.9% of samples (1,803 samples)
- 2.7% of samples (1,688 samples)
- 2.6% of samples (1,658 samples)
- 1.8% of samples (1,115 samples)
- 1.6% of samples (1,008 samples)
- 1.4% of samples (896 samples)
- 1.4% of samples (856 samples)
- 1.4% of samples (855 samples)
- 1.2% of samples (771 samples)
- 1.1% of samples (688 samples)
- 1% of samples (632 samples)
- 54.9% of samples (34,751 samples)
The workload is the same shell script used by the preceding rungs. The important result is no panic and a complete file smoke milestone.
Next
P104 should keep the Harvard arc moving by exposing lower-memory banking
or conflict data. If we stay on the store-buffer side, the next useful
micro-test is same-address forwarding with a targeted SW; LW program
before loosening the drain-before-fetch policy.