Store buffer v0 · librelane-playground

P95 adds a one-entry store buffer at the SoC boundary. Ordinary external-RAM stores are accepted immediately when the buffer is empty, then drained through the shared RAM port before later external CPU requests are allowed through.

It works functionally. It is slower: the shell window takes 16.06% more cycles than under the P94 arbiter.
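The accept-then-drain behaviour described above can be sketched as a small behavioural model. This is Python rather than RTL, and the class name `OneEntryStoreBuffer`, the request kinds, and the single-call drain are all assumptions made for illustration; the real design's signals and timing are not shown in this log.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OneEntryStoreBuffer:
    """Toy model (hypothetical): one pending external-RAM store, strict ordering."""
    entry: Optional[Tuple[int, int]] = None  # (address, data), or None when empty
    accepts: int = 0
    drains: int = 0
    block_cycles: int = 0

    def cpu_request(self, kind: str, addr: int, data: int = 0) -> bool:
        """Return True if the external request is accepted this cycle."""
        if self.entry is not None:
            # Strict ordering: no later external request passes a pending store.
            self.block_cycles += 1
            return False
        if kind == "store":
            self.entry = (addr, data)  # accepted immediately, drained later
            self.accepts += 1
        return True  # fetches and loads go straight to the RAM port

    def drain(self, ram: dict) -> None:
        """Give the shared RAM port to the buffer and write the pending store."""
        if self.entry is not None:
            addr, data = self.entry
            ram[addr] = data
            self.entry = None
            self.drains += 1


ram = {}
sb = OneEntryStoreBuffer()
assert sb.cpu_request("store", 0x1000, 42)     # accepted: buffer was empty
assert not sb.cpu_request("fetch", 0x2000)     # blocked behind the pending store
sb.drain(ram)                                  # store reaches RAM
assert sb.cpu_request("fetch", 0x2000)         # fetch proceeds after the drain
print(sb.accepts, sb.drains, sb.block_cycles)  # prints: 1 1 1
```

Even in this toy, the store itself never waits, while the fetch behind it does. That is the shape of the trade measured below.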

Result

metric	P94 arbiter	P95 store buffer
post-load cycles	222,459,202	241,494,238
shell window cycles	67,050,374	77,821,976
retired instructions	86,664,089	88,851,638
CPI	2.5669	2.7179
memory stall cycles	60,032,329	60,797,627
fetch stall cycles	23,549,359	35,563,846
I-cache hits	42,662,028	43,429,434
fetch queue fills	53,967,748	55,096,088

comparison	result
shell window vs P94	+16.06%
post-load cycles vs P94	+8.56%
memory stalls vs P94	+1.27%
fetch stalls vs P94	+51.02%
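The comparison figures follow directly from the raw counters. A short sketch that recomputes them, with the dictionary keys and helper names chosen here for illustration only:

```python
# Raw counters copied from the Result table above.
p94 = {
    "post_load_cycles": 222_459_202,
    "shell_window_cycles": 67_050_374,
    "retired_instructions": 86_664_089,
    "memory_stall_cycles": 60_032_329,
    "fetch_stall_cycles": 23_549_359,
}
p95 = {
    "post_load_cycles": 241_494_238,
    "shell_window_cycles": 77_821_976,
    "retired_instructions": 88_851_638,
    "memory_stall_cycles": 60_797_627,
    "fetch_stall_cycles": 35_563_846,
}


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(run: dict) -> float:
    """Cycles per instruction over the post-load window."""
    return run["post_load_cycles"] / run["retired_instructions"]


for key in ("shell_window_cycles", "post_load_cycles",
            "memory_stall_cycles", "fetch_stall_cycles"):
    print(f"{key}: {pct_change(p94[key], p95[key]):+.2f}%")
print(f"CPI: {cpi(p94):.4f} -> {cpi(p95):.4f}")
```

The CPI values in the table match post-load cycles divided by retired instructions, which is why `cpi` uses those two counters.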

Store Buffer Counters

counter	value
accepts	12,743,615
drains	12,743,615
valid cycles	12,743,615
block cycles	12,145,646

The buffer did accept and drain stores: every accepted store was drained. The problem is the strict ordering policy: while the buffer drains, later external CPU requests wait. Store stalls almost disappear, but fetch and writeback prefetch pay for it.

What Moved

class	P94 stall cycles	P95 stall cycles
fetch	23,549,359	35,563,846
load	14,632,992	15,144,900
store	11,994,173	1,692
writeback prefetch	323,591	12,500,307
execute prefetch	7,351,200	7,489,611

This is exactly the failure mode worth documenting. A buffer can remove the store instruction’s local wait without improving the machine if the drain policy steals the frontend’s useful slots.
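To see how the stall cycles moved rather than disappeared, the per-class differences can be tallied. A minimal sketch using the numbers from the table above; the variable names are illustrative:

```python
# (P94 stall cycles, P95 stall cycles) per class, from the "What Moved" table.
stalls = {
    "fetch": (23_549_359, 35_563_846),
    "load": (14_632_992, 15_144_900),
    "store": (11_994_173, 1_692),
    "writeback prefetch": (323_591, 12_500_307),
    "execute prefetch": (7_351_200, 7_489_611),
}

deltas = {name: after - before for name, (before, after) in stalls.items()}
net = sum(deltas.values())

# Largest saving first, largest regression last.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:20s} {delta:+,}")
print(f"{'net':20s} {net:+,}")
```

The store class saves about 12.0 million stall cycles, while fetch and writeback prefetch each gain about 12 million, so the five classes together end up roughly 12.8 million cycles worse.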

Memory Stalls

P95 store-buffer workload: 60,797,627 memory stall cycles, 85,300,956 handshakes

kind	stall cycles	share of stalls	requests
instruction fetch	35,563,846	58.5%	51,766,489
data load	15,144,900	24.9%	1,020,802
data store	1,692	0%	12,584,006
atomic memory op	136,668	0.2%	219,056
page walk for fetch	1,180,811	1.9%	1,174,657
page walk for load/store	1,280,087	2.1%	1,279,569
other	7,489,623	12.3%	17,256,377

The memory-kind view shows the trade: store stalls collapse, but fetch stalls rise enough to dominate the result.
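One way to read the memory-kind numbers is stall cycles per request. The sketch below assumes the per-kind request counts are the handshakes reported in the header; that reading is an inference from the totals matching, not something the log states.

```python
# (stall cycles, requests) per memory kind, from the Memory Stalls figures.
rows = {
    "instruction fetch": (35_563_846, 51_766_489),
    "data load": (15_144_900, 1_020_802),
    "data store": (1_692, 12_584_006),
    "atomic memory op": (136_668, 219_056),
    "page walk for fetch": (1_180_811, 1_174_657),
    "page walk for load/store": (1_280_087, 1_279_569),
    "other": (7_489_623, 17_256_377),
}

total_stalls = sum(stall for stall, _ in rows.values())
total_requests = sum(reqs for _, reqs in rows.values())

for kind, (stall, reqs) in rows.items():
    share = stall / total_stalls
    per_request = stall / reqs
    print(f"{kind:26s} {share:6.1%} {per_request:8.3f} stall cycles/request")
print(f"totals: {total_stalls:,} stall cycles, {total_requests:,} requests")
```

Seen this way, data loads are rare but expensive (about 14.8 stall cycles each), while fetch costs less than one stall cycle per request and dominates through sheer volume.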

Shell Phases

P95 shell workload: 241,494,238 cycles, CPI 2.72

phase	cycles	share
kernel banner to /init	123,204,999	51.2%
/init to shell banner	1,128,402	0.5%
shell banner to first command	38,652,740	16.1%
echo command	1,598	0%
uname -a	2,164,048	0.9%
ls /bin /usr/share	34,815,180	14.5%
cat sample file	5,283,101	2.2%
touch/write/cat/rm /tmp file	10,852,334	4.5%
8x ash loop with file I/O	23,171,954	9.6%
final marker	1,533,761	0.6%

Every shell phase completes, but every meaningful phase takes longer than in P94. The loop phase is especially exposed because it alternates small file writes and reads.
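The log does not define the "shell window" explicitly. The phase counts from the echo command through the final marker sum to exactly the shell window figure in the Result table, so the window appears to be that span. The sketch below checks the arithmetic; treating the window this way is an inference, not a stated definition.

```python
# Phase cycle counts from the Shell Phases figures, in order.
phases = [
    ("kernel banner to /init", 123_204_999),
    ("/init to shell banner", 1_128_402),
    ("shell banner to first command", 38_652_740),
    ("echo command", 1_598),
    ("uname -a", 2_164_048),
    ("ls /bin /usr/share", 34_815_180),
    ("cat sample file", 5_283_101),
    ("touch/write/cat/rm /tmp file", 10_852_334),
    ("8x ash loop with file I/O", 23_171_954),
    ("final marker", 1_533_761),
]

names = [name for name, _ in phases]
first_command = names.index("echo command")

# Sum every phase from the first shell command to the end of the run.
shell_window = sum(cycles for _, cycles in phases[first_command:])
print(f"{shell_window:,}")  # prints: 77,821,976
```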

Cycle Shape

P95 store-buffer workload: 241,494,238 cycles, CPI 2.72

state	cycles	share
fetch	20,601,809	8.5%
execute	88,877,988	36.8%
mem	29,105,432	12.1%
walker	4,915,124	2%
writeback	88,851,638	36.8%
mul/div	9,138,851	3.8%

The architectural state machine did not gain a new CPU state; the buffer is outside the core. The extra time shows up through blocked memory service.

Hot Functions

P95 BusyBox shell symbols: 75,998 samples, one sample every 1,024 cycles

symbol	image	share of samples	samples
memset	kernel	5.5%	4,163
printf_core	busybox	4.8%	3,646
vruntime_eligible	kernel	4.1%	3,123
memcpy	busybox	3.2%	2,397
memcpy	kernel	2.4%	1,850
blake2s_compress_generic	kernel	2.4%	1,822
__fwritex	busybox	2.3%	1,779
handle_exception	kernel	1.9%	1,455
avg_vruntime	kernel	1.8%	1,358
unmap_page_range	kernel	1.5%	1,137
update_curr	kernel	1.2%	917
n_tty_write	kernel	1.2%	872
ret_from_exception	kernel	1.1%	857
memset	busybox	1.1%	856
n_tty_read	kernel	1.1%	800
(remaining)	-	56%	42,525

The hot-symbol mix remains the same kind of BusyBox and kernel work. The slowdown is not a new software path; it is memory-service policy.
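Sample counts convert to approximate cycles by multiplying by the sampling period. A small sketch, with the helper name `estimated_cycles` made up for illustration, also shows that the profile covers the shell window almost exactly:

```python
SAMPLES = 75_998                  # total samples in the hot-function profile
PERIOD_CYCLES = 1_024             # one sample every 1,024 cycles
SHELL_WINDOW_CYCLES = 77_821_976  # P95 shell window from the Result table


def estimated_cycles(samples: int, period: int = PERIOD_CYCLES) -> int:
    """Approximate cycles attributed to a symbol from its sample count."""
    return samples * period


covered = estimated_cycles(SAMPLES)
print(covered, SHELL_WINDOW_CYCLES - covered)  # prints: 77821952 24
print(estimated_cycles(4_163))                 # kernel memset, prints: 4262912
```

The 24-cycle remainder is less than one sampling period, which suggests the profile spans the shell window rather than the whole post-load run.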

Honest Status

check	status
One-entry external-RAM store buffer	PASS
MMIO stores left unbuffered	PASS
BusyBox shell workload runs	PASS
Store-buffer counters captured	PASS
Forwarding from buffer to loads	NOT RUN
Store merge / coalescing	NOT RUN
Shell-window speedup vs P94	FAIL
LibreLane hardening	NOT RUN

Next

P96 should not blindly make this buffer bigger. The next useful rung is either a tiny D-cache for load hits or a smarter store buffer with forwarding and a drain policy that does not block instruction delivery so aggressively.
