Project No. 95 of 147 on the ladder

Store buffer v0

Introduces: one-entry external-RAM store buffer; store accept/drain counters; measured store-buffer negative result

harden state: last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P95 adds a one-entry store buffer at the SoC boundary. Ordinary external-RAM stores are accepted immediately when the buffer is empty, then drained through the shared RAM port before later external CPU requests are allowed through.
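The accept-then-drain behavior described above can be sketched as a small cycle-level model. This is a toy illustration, not the P95 RTL: the names `Request`, `OneEntryStoreBuffer`, and `run` are hypothetical, and it assumes a drain occupies the shared RAM port for exactly one cycle and that non-external (for example MMIO) requests are not blocked by a drain.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Request:
    kind: str              # "fetch", "load", or "store"
    external: bool = True  # True if the request targets external RAM


class OneEntryStoreBuffer:
    """Toy model of a one-entry store buffer with strict drain ordering."""

    def __init__(self) -> None:
        self.valid = False      # a store is waiting to drain
        self.accepts = 0
        self.drains = 0
        self.block_cycles = 0

    def step(self, req: Optional[Request]) -> bool:
        """Advance one cycle; return True if `req` was accepted."""
        if self.valid:
            # Assumption: the drain occupies the shared RAM port for one cycle.
            self.valid = False
            self.drains += 1
            if req is not None and req.external:
                self.block_cycles += 1  # later external request waits
                return False
            return req is not None      # assumption: non-external traffic is unaffected
        if req is None:
            return False
        if req.kind == "store" and req.external:
            self.valid = True           # accepted immediately, drained next cycle
            self.accepts += 1
        return True                     # empty buffer: everything else passes through


def run(trace: List[Request]) -> Tuple[OneEntryStoreBuffer, int]:
    """Replay a trace, retrying blocked requests; return the buffer and cycle count."""
    buf, cycles = OneEntryStoreBuffer(), 0
    for req in trace:
        cycles += 1
        while not buf.step(req):
            cycles += 1
    if buf.valid:                       # drain a trailing store
        buf.step(None)
        cycles += 1
    return buf, cycles


buf, cycles = run([Request("store"), Request("fetch"), Request("store"), Request("load")])
print(buf.accepts, buf.drains, buf.block_cycles, cycles)
```

In this trace each store is accepted at once, but the external request that follows it loses one cycle to the drain. The store's wait has moved onto its successor rather than disappeared.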

It works functionally. It is slower.

Result

| metric | P94 arbiter | P95 store buffer |
|---|---|---|
| post-load cycles | 222,459,202 | 241,494,238 |
| shell window cycles | 67,050,374 | 77,821,976 |
| retired instructions | 86,664,089 | 88,851,638 |
| CPI | 2.5669 | 2.7179 |
| memory stall cycles | 60,032,329 | 60,797,627 |
| fetch stall cycles | 23,549,359 | 35,563,846 |
| I-cache hits | 42,662,028 | 43,429,434 |
| fetch queue fills | 53,967,748 | 55,096,088 |

| comparison | result |
|---|---|
| shell window vs P94 | +16.06% |
| post-load cycles vs P94 | +8.56% |
| memory stalls vs P94 | +1.27% |
| fetch stalls vs P94 | +51.02% |
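The comparison rows can be re-derived from the raw counts. The sketch below assumes CPI is post-load cycles divided by retired instructions, which reproduces the reported values. The dictionary keys and helper names are illustrative only.

```python
# Raw counts copied from the result table.
P94 = {
    "post_load_cycles": 222_459_202,
    "shell_window_cycles": 67_050_374,
    "retired_instructions": 86_664_089,
    "memory_stall_cycles": 60_032_329,
    "fetch_stall_cycles": 23_549_359,
}
P95 = {
    "post_load_cycles": 241_494_238,
    "shell_window_cycles": 77_821_976,
    "retired_instructions": 88_851_638,
    "memory_stall_cycles": 60_797_627,
    "fetch_stall_cycles": 35_563_846,
}


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(metrics: dict) -> float:
    """Assumed definition: post-load cycles per retired instruction."""
    return metrics["post_load_cycles"] / metrics["retired_instructions"]


for key in ("shell_window_cycles", "post_load_cycles",
            "memory_stall_cycles", "fetch_stall_cycles"):
    print(f"{key}: {pct_change(P94[key], P95[key]):+.2f}%")
print(f"CPI: {cpi(P94):.4f} -> {cpi(P95):.4f}")
```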

Store Buffer Counters

| counter | value |
|---|---|
| accepts | 12,743,615 |
| drains | 12,743,615 |
| valid cycles | 12,743,615 |
| block cycles | 12,145,646 |

The buffer did accept and drain stores. The problem is the strict ordering policy: while the buffer drains, later external CPU requests wait. Store stalls nearly vanish (11,994,173 cycles in P94, 1,692 in P95), but fetch and writeback prefetch pay for it.
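A quick sanity pass over the counters makes the ordering cost concrete. The reading of the counter names here is an assumption: "valid cycles" is taken as cycles in which the buffer held a store, and "block cycles" as cycles in which a later request was held back by a drain.

```python
# Counter values copied from the table; the key names are illustrative.
counters = {
    "accepts": 12_743_615,
    "drains": 12_743_615,
    "valid_cycles": 12_743_615,
    "block_cycles": 12_145_646,
}

# Every accepted store should eventually drain.
balanced = counters["accepts"] == counters["drains"]

# Average buffer occupancy and blocking cost per accepted store.
valid_per_store = counters["valid_cycles"] / counters["accepts"]
block_per_store = counters["block_cycles"] / counters["accepts"]

print(f"accepts == drains: {balanced}")
print(f"valid cycles per store: {valid_per_store:.3f}")
print(f"block cycles per store: {block_per_store:.3f}")
```

Under that reading, each store occupies the buffer for one cycle on average, and the drains block another request for roughly 0.95 cycles per store.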

What Moved

| class | P94 stall cycles | P95 stall cycles |
|---|---|---|
| fetch | 23,549,359 | 35,563,846 |
| load | 14,632,992 | 15,144,900 |
| store | 11,994,173 | 1,692 |
| writeback prefetch | 323,591 | 12,500,307 |
| execute prefetch | 7,351,200 | 7,489,611 |

This is exactly the failure mode worth documenting. A buffer can remove the store instruction’s local wait without improving the machine if the drain policy steals the frontend’s useful slots.
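The per-class deltas make the trade explicit. The following sketch only subtracts the two columns of the table; the dictionary layout is illustrative.

```python
# (P94, P95) stall cycles per class, copied from the table.
stalls = {
    "fetch": (23_549_359, 35_563_846),
    "load": (14_632_992, 15_144_900),
    "store": (11_994_173, 1_692),
    "writeback prefetch": (323_591, 12_500_307),
    "execute prefetch": (7_351_200, 7_489_611),
}

# Positive delta means P95 stalls more than P94 in that class.
deltas = {name: p95 - p94 for name, (p94, p95) in stalls.items()}
net = sum(deltas.values())

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:20s} {delta:+,d}")
print(f"{'net':20s} {net:+,d}")
```

The store class gives back about 12.0 million stall cycles, while fetch and writeback prefetch each take on about 12 million more, so the listed classes end up roughly 12.8 million stall cycles worse overall.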

Memory Stalls

Memory stalls, P95 store-buffer workload: 60,797,627 stall cycles, 85,300,956 handshakes

| # | kind | stall cycles | share | requests |
|---|---|---|---|---|
| 1 | instruction fetch | 35,563,846 | 58.5% | 51,766,489 |
| 2 | data load | 15,144,900 | 24.9% | 1,020,802 |
| 3 | data store | 1,692 | 0% | 12,584,006 |
| 4 | atomic memory op | 136,668 | 0.2% | 219,056 |
| 5 | page walk for fetch | 1,180,811 | 1.9% | 1,174,657 |
| 6 | page walk for load/store | 1,280,087 | 2.1% | 1,279,569 |
| 7 | other | 7,489,623 | 12.3% | 17,256,377 |

The memory-kind view shows the trade: store stalls collapse, but fetch stalls rise enough to dominate the result.
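Dividing stall cycles by request count gives a rough per-request cost for each kind, and summing the rows checks them against the reported totals. This is plain arithmetic on the figures above; the tuple layout is illustrative.

```python
# (stall cycles, requests) per memory kind, copied from the P95 breakdown.
kinds = {
    "instruction fetch": (35_563_846, 51_766_489),
    "data load": (15_144_900, 1_020_802),
    "data store": (1_692, 12_584_006),
    "atomic memory op": (136_668, 219_056),
    "page walk for fetch": (1_180_811, 1_174_657),
    "page walk for load/store": (1_280_087, 1_279_569),
    "other": (7_489_623, 17_256_377),
}

total_stalls = sum(stall for stall, _ in kinds.values())
total_requests = sum(req for _, req in kinds.values())

for name, (stall, req) in kinds.items():
    share = stall / total_stalls * 100.0
    per_request = stall / req
    print(f"{name:26s} {share:5.1f}%  {per_request:7.3f} stall cycles/request")
print(f"totals: {total_stalls:,} stall cycles, {total_requests:,} requests")
```

Loads remain the most expensive per request, at roughly 15 stall cycles each. Fetch dominates the total because of its volume: well under one stall cycle per request, but across more than 51 million requests.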

Shell Phases

Shell phases, P95 shell workload: 241,494,238 cycles, CPI 2.72

| # | phase | cycles | share |
|---|---|---|---|
| 1 | kernel banner to /init | 123,204,999 | 51.2% |
| 2 | /init to shell banner | 1,128,402 | 0.5% |
| 3 | shell banner to first command | 38,652,740 | 16.1% |
| 4 | echo command | 1,598 | 0% |
| 5 | uname -a | 2,164,048 | 0.9% |
| 6 | ls /bin /usr/share | 34,815,180 | 14.5% |
| 7 | cat sample file | 5,283,101 | 2.2% |
| 8 | touch/write/cat/rm /tmp file | 10,852,334 | 4.5% |
| 9 | 8x ash loop with file I/O | 23,171,954 | 9.6% |
| 10 | final marker | 1,533,761 | 0.6% |

Every shell phase completes, but every meaningful phase finishes later than it did in P94. The loop phase is especially exposed because it alternates small file writes and reads.

Cycle Shape

State breakdown, P95 store-buffer workload: 241,494,238 cycles, CPI 2.72

| # | state | share | cycles |
|---|---|---|---|
| 1 | fetch | 8.5% | 20,601,809 |
| 2 | execute | 36.8% | 88,877,988 |
| 3 | mem | 12.1% | 29,105,432 |
| 4 | walker | 2% | 4,915,124 |
| 5 | writeback | 36.8% | 88,851,638 |
| 6 | mul/div | 3.8% | 9,138,851 |

The architectural state machine did not gain a new CPU state; the buffer is outside the core. The extra time shows up through blocked memory service.

Hot Functions

Hot functions, P95 BusyBox shell symbols: 75,998 samples, one sample every 1,024 cycles

| # | symbol | origin | share | samples |
|---|---|---|---|---|
| 1 | memset | kernel | 5.5% | 4,163 |
| 2 | printf_core | busybox | 4.8% | 3,646 |
| 3 | vruntime_eligible | kernel | 4.1% | 3,123 |
| 4 | memcpy | busybox | 3.2% | 2,397 |
| 5 | memcpy | kernel | 2.4% | 1,850 |
| 6 | blake2s_compress_generic | kernel | 2.4% | 1,822 |
| 7 | __fwritex | busybox | 2.3% | 1,779 |
| 8 | handle_exception | kernel | 1.9% | 1,455 |
| 9 | avg_vruntime | kernel | 1.8% | 1,358 |
| 10 | unmap_page_range | kernel | 1.5% | 1,137 |
| 11 | update_curr | kernel | 1.2% | 917 |
| 12 | n_tty_write | kernel | 1.2% | 872 |
| 13 | ret_from_exception | kernel | 1.1% | 857 |
| 14 | memset | busybox | 1.1% | 856 |
| 15 | n_tty_read | kernel | 1.1% | 800 |
| 16 | (remaining) | | 56% | 42,525 |

The hot-symbol mix remains the same kind of BusyBox and kernel work. The slowdown is not a new software path; it is memory-service policy.

Honest Status

| check | status |
|---|---|
| One-entry external-RAM store buffer | PASS |
| MMIO stores left unbuffered | PASS |
| BusyBox shell workload runs | PASS |
| Store-buffer counters captured | PASS |
| Forwarding from buffer to loads | NOT RUN |
| Store merge / coalescing | NOT RUN |
| Shell-window speedup vs P94 | FAIL |
| LibreLane hardening | NOT RUN |

Next

P96 should not blindly make this buffer bigger. The next useful rung is either a tiny D-cache for load hits or a smarter store buffer with forwarding and a drain policy that does not block instruction delivery so aggressively.