Memory arbiter v0 · librelane-playground

P94 turns the core’s single external-memory mux into an explicit arbiter with named request classes. The external RAM is still shared. The point is visibility: fetch, prefetch, background I-cache fill, loads, stores, FP memory, AMOs, and page-table walks now report who wanted service and who got it.

This is a structure rung, not a speed rung.

Result

metric	P93 predictor	P94 arbiter
post-load cycles	221,863,586	222,459,202
shell window cycles	66,342,842	67,050,374
retired instructions	86,469,444	86,664,089
CPI	2.5658	2.5669
memory stall cycles	59,886,452	60,032,329
fetch stall cycles	23,503,650	23,549,359
I-cache hits	42,594,442	42,662,028
fetch queue fills	53,869,769	53,967,748

P94 is slightly slower than P93:

comparison	result
shell window vs P93	+1.07%
post-load cycles vs P93	+0.27%
memory stalls vs P93	+0.24%
fetch stalls vs P93	+0.19%

The slowdown is acceptable for this rung because the arbiter gives the next experiments specific traffic classes to attack instead of one global memory-stall number.
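The percentages in the comparison table can be reproduced from the Result table. The sketch below is only a check on the arithmetic. It assumes CPI is post-load cycles divided by retired instructions, which matches the published values but is not stated explicitly. The names `RESULT`, `RETIRED`, and `pct_change` are made up for this example.

```python
# Values copied from the Result table: (P93 predictor, P94 arbiter).
RESULT = {
    "shell window cycles": (66_342_842, 67_050_374),
    "post-load cycles": (221_863_586, 222_459_202),
    "memory stall cycles": (59_886_452, 60_032_329),
    "fetch stall cycles": (23_503_650, 23_549_359),
}
RETIRED = (86_469_444, 86_664_089)  # retired instructions, P93 and P94


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


deltas = {name: pct_change(p93, p94) for name, (p93, p94) in RESULT.items()}

# Assumption: CPI = post-load cycles / retired instructions.
cpi = tuple(
    cycles / instrs for cycles, instrs in zip(RESULT["post-load cycles"], RETIRED)
)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
print(f"CPI: P93 {cpi[0]:.4f}, P94 {cpi[1]:.4f}")
```

Every delta is below 1.1%, which is why the rung reads as a small cost paid for attribution rather than a regression to chase.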

Arbiter Attribution

class	want cycles	handshakes	stall cycles	denied cycles
fetch	2,723,143	1,738,381	984,762	0
execute prefetch	24,125,041	16,773,841	7,351,200	0
writeback prefetch	18,813,785	18,490,194	323,591	0
I-cache background fill	56,215,442	29,432,432	22,241,006	4,542,004
load	15,607,728	974,736	14,632,992	0
store	12,211,224	217,051	11,994,173	0
FP load	24	12	12	0
FP store	3,408	1,704	1,704	0
AMO	343,153	184,727	158,426	0
fetch PTW	2,241,820	1,117,833	1,123,987	0
LSU PTW	2,440,449	1,219,973	1,220,476	0

Total contention cycles: 4,542,004.
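The rows of the attribution table appear to obey a simple accounting identity: want cycles equal handshakes plus stall cycles plus denied cycles. That reading is inferred from the numbers, not from the RTL, so treat the sketch below as a consistency check under that assumption. The names `ARBITER` and `residual` are hypothetical.

```python
# Rows copied from the Arbiter Attribution table:
# class -> (want cycles, handshakes, stall cycles, denied cycles)
ARBITER = {
    "fetch": (2_723_143, 1_738_381, 984_762, 0),
    "execute prefetch": (24_125_041, 16_773_841, 7_351_200, 0),
    "writeback prefetch": (18_813_785, 18_490_194, 323_591, 0),
    "I-cache background fill": (56_215_442, 29_432_432, 22_241_006, 4_542_004),
    "load": (15_607_728, 974_736, 14_632_992, 0),
    "store": (12_211_224, 217_051, 11_994_173, 0),
    "FP load": (24, 12, 12, 0),
    "FP store": (3_408, 1_704, 1_704, 0),
    "AMO": (343_153, 184_727, 158_426, 0),
    "fetch PTW": (2_241_820, 1_117_833, 1_123_987, 0),
    "LSU PTW": (2_440_449, 1_219_973, 1_220_476, 0),
}


def residual(row: tuple[int, int, int, int]) -> int:
    """Want cycles not explained by handshakes, stalls, or denials."""
    want, handshakes, stall, denied = row
    return want - (handshakes + stall + denied)


# Assumed identity: want == handshakes + stall + denied for every class.
mismatches = {name: residual(row) for name, row in ARBITER.items() if residual(row)}
total_denied = sum(row[3] for row in ARBITER.values())

print("rows that do not balance:", mismatches)
print("total denied cycles:", total_denied)
```

If the identity holds, the total contention figure is just the sum of the denied column, and a nonzero residual in a future rung would point at a counter bug rather than a performance change.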

Only background I-cache line fill is denied service in this policy; every other class shows zero denied cycles. Foreground fetch, loads, stores, AMOs, and page walks are not losing arbitration; they are paying the latency of the single shared port. That matters for P95: loads and stores together stall for about 26.6 million cycles, so a store buffer or tiny D-cache is a better next target than more blind frontend tweaking.
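One rough way to see the latency cost is stall cycles per handshake for each class. This is an average derived from the attribution table, not a measured per-request latency, and the names `CLASSES` and `stalls_per_handshake` are invented for this sketch.

```python
# Subset of the Arbiter Attribution table: class -> (handshakes, stall cycles).
CLASSES = {
    "fetch": (1_738_381, 984_762),
    "I-cache background fill": (29_432_432, 22_241_006),
    "load": (974_736, 14_632_992),
    "store": (217_051, 11_994_173),
    "AMO": (184_727, 158_426),
}


def stalls_per_handshake(handshakes: int, stall: int) -> float:
    """Average stall cycles paid for each completed handshake."""
    return stall / handshakes


ratios = {
    name: stalls_per_handshake(handshakes, stall)
    for name, (handshakes, stall) in CLASSES.items()
}

# Print the classes from most to least expensive per handshake.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f} stall cycles per handshake")
```

Under this measure the data side stands out: loads and stores wait far longer per completed request than any fetch-side class, which is consistent with aiming P95 at a store buffer or small D-cache.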

Memory Stalls

Memory stalls by kind, P94 memory-arbiter workload. Total stalls: 60,032,329. Total handshakes: 70,150,884.

kind	stall cycles	share of stalls	requests
instruction fetch	23,549,359	39.2%	49,661,007
data load	14,632,992	24.4%	974,748
data store	11,994,173	20%	218,755
atomic memory op	158,426	0.3%	184,727
page walk for fetch	1,123,987	1.9%	1,117,833
page walk for load/store	1,220,476	2%	1,219,973
other	7,352,916	12.2%	16,773,841

The older memory-kind view still says the same broad thing: fetch and data traffic both matter, and page-table walks are nontrivial but not the biggest slice.
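The two views can be lined up numerically. The grouping below is inferred purely by matching sums: the source does not say which arbiter classes feed which memory kind, so `GROUPS` is an assumption that happens to reproduce the stall column exactly. Only stall cycles are checked here.

```python
# Stall cycles per arbiter class, from the Arbiter Attribution table.
STALL = {
    "fetch": 984_762,
    "execute prefetch": 7_351_200,
    "writeback prefetch": 323_591,
    "I-cache background fill": 22_241_006,
    "load": 14_632_992,
    "store": 11_994_173,
    "FP load": 12,
    "FP store": 1_704,
    "AMO": 158_426,
    "fetch PTW": 1_123_987,
    "LSU PTW": 1_220_476,
}

# Assumed mapping from memory kind to arbiter classes (inferred, not documented).
GROUPS = {
    "instruction fetch": ["fetch", "writeback prefetch", "I-cache background fill"],
    "data load": ["load"],
    "data store": ["store"],
    "atomic memory op": ["AMO"],
    "page walk for fetch": ["fetch PTW"],
    "page walk for load/store": ["LSU PTW"],
    "other": ["execute prefetch", "FP load", "FP store"],
}

by_kind = {kind: sum(STALL[c] for c in classes) for kind, classes in GROUPS.items()}
total_stalls = sum(by_kind.values())

for kind, cycles in by_kind.items():
    print(f"{kind}: {cycles:,} ({cycles / total_stalls:.1%})")
print(f"total: {total_stalls:,}")
```

If this mapping is right, most of the "instruction fetch" slice is background I-cache fill waiting on the port, and the "other" slice is almost entirely execute prefetch.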

Shell Phases

Shell phases, P94 shell workload. Total cycles: 222,459,202. CPI: 2.57.

phase	cycles	share of cycles
kernel banner to /init	117,615,610	53%
/init to shell banner	1,075,033	0.5%
shell banner to first command	36,090,120	16.3%
echo command	1,598	0%
uname -a	2,572,368	1.2%
ls /bin /usr/share	32,428,238	14.6%
cat sample file	2,767,249	1.3%
touch/write/cat/rm /tmp file	10,434,293	4.7%
8x ash loop with file I/O	17,891,364	8.1%
final marker	955,264	0.4%

The benchmark reaches the same final file marker. The shell window is 1.07% slower than P93, so the honest status stays FAIL for speedup.
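The shell-window figure in the Result table can be recovered from the phase list. The sketch assumes the window runs from the echo command through the final marker. That boundary is inferred from the arithmetic rather than documented, and `PHASES` and `WINDOW_START` are names made up for the example.

```python
# Phase cycles copied from the Shell Phases list, in run order.
PHASES = [
    ("kernel banner to /init", 117_615_610),
    ("/init to shell banner", 1_075_033),
    ("shell banner to first command", 36_090_120),
    ("echo command", 1_598),
    ("uname -a", 2_572_368),
    ("ls /bin /usr/share", 32_428_238),
    ("cat sample file", 2_767_249),
    ("touch/write/cat/rm /tmp file", 10_434_293),
    ("8x ash loop with file I/O", 17_891_364),
    ("final marker", 955_264),
]
WINDOW_START = "echo command"  # assumed first phase of the shell window

names = [name for name, _ in PHASES]
start = names.index(WINDOW_START)

boot_cycles = sum(cycles for _, cycles in PHASES[:start])
shell_window = sum(cycles for _, cycles in PHASES[start:])

print(f"before first command: {boot_cycles:,}")
print(f"shell window: {shell_window:,}")
```

Under that assumption the sum matches the 67,050,374 shell window cycles reported for P94, so the FAIL status is judged on the command phases only, not on kernel boot.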

Cycle Shape

State breakdown, P94 memory-arbiter workload. Total cycles: 222,459,202. CPI: 2.57.

state	cycles	share of cycles
fetch	8,333,015	3.7%
execute	86,689,285	39%
mem	28,163,821	12.7%
walker	4,682,269	2.1%
writeback	86,664,089	39%
mul/div	7,925,007	3.6%

The arbiter does not add a new architectural state. It names the memory requests that were already passing through the shared port.
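The state shares follow directly from the cycle counts. This small check recomputes them and reports how many cycles the six listed states leave unaccounted for. `STATES` and the other names are illustrative, and nothing here is taken from the harness itself.

```python
TOTAL_CYCLES = 222_459_202  # P94 post-load cycles
RETIRED = 86_664_089        # P94 retired instructions, from the Result table

# Cycle counts copied from the state breakdown.
STATES = {
    "fetch": 8_333_015,
    "execute": 86_689_285,
    "mem": 28_163_821,
    "walker": 4_682_269,
    "writeback": 86_664_089,
    "mul/div": 7_925_007,
}

shares = {state: cycles / TOTAL_CYCLES * 100.0 for state, cycles in STATES.items()}
unattributed = TOTAL_CYCLES - sum(STATES.values())

for state, share in shares.items():
    print(f"{state}: {share:.1f}%")
print(f"cycles outside the listed states: {unattributed:,}")
print("writeback cycles equal retired instructions:", STATES["writeback"] == RETIRED)
```

The writeback count matching retired instructions suggests one writeback cycle per retired instruction, which fits the claim that the arbiter renames existing traffic rather than adding a state.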

Hot Functions

Hot functions, P94 BusyBox shell symbols. Samples: 65,479, one every 1,024 cycles.

symbol	image	samples	share of samples
printf_core	busybox	3,590	5.5%
memset	kernel	3,320	5.1%
memcpy	busybox	2,341	3.6%
vruntime_eligible	kernel	2,257	3.5%
blake2s_compress_generic	kernel	1,802	2.8%
memcpy	kernel	1,698	2.6%
__fwritex	busybox	1,697	2.6%
handle_exception	kernel	1,175	1.8%
unmap_page_range	kernel	1,122	1.7%
avg_vruntime	kernel	904	1.4%
memset	busybox	848	1.3%
n_tty_write	kernel	830	1.3%
ret_from_exception	kernel	784	1.2%
n_tty_read	kernel	681	1%
next_uptodate_folio	kernel	673	1%
(remaining)	-	36,372	55.6%

The hot-symbol mix remains BusyBox plus kernel memory, scheduler, and exception paths. P94’s value is that future speed rungs can line those symbols up with traffic classes.
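To line symbols up with traffic classes, sample counts first need to become approximate cycle counts. The sketch multiplies samples by the 1,024-cycle period. It also checks the guess that the profile covers the shell window, which is an inference from the totals and not something the page states. `HOT` and `estimated_cycles` are hypothetical names.

```python
SAMPLE_PERIOD = 1_024       # cycles between samples, from the profile label
TOTAL_SAMPLES = 65_479      # from the profile label
SHELL_WINDOW = 67_050_374   # P94 shell window cycles, from the Result table

# Top three entries of the hot-function list: (symbol, image) -> samples.
HOT = {
    ("printf_core", "busybox"): 3_590,
    ("memset", "kernel"): 3_320,
    ("memcpy", "busybox"): 2_341,
}


def estimated_cycles(samples: int) -> int:
    """Approximate cycles represented by a number of periodic samples."""
    return samples * SAMPLE_PERIOD


covered = estimated_cycles(TOTAL_SAMPLES)
gap = abs(covered - SHELL_WINDOW)

for (symbol, image), samples in HOT.items():
    share = samples / TOTAL_SAMPLES
    print(f"{symbol} ({image}): ~{estimated_cycles(samples):,} cycles, {share:.1%}")
print(f"profile covers ~{covered:,} cycles; gap to shell window: {gap} cycles")
```

The gap is smaller than one sample period, so the profile very likely spans the shell window only. Per-symbol estimates carry sampling error of at least one period each and should be read as rough magnitudes.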

Honest Status

check	status
Named memory request classes in RTL	PASS
Public want/grant/class counters exported to harness	PASS
BusyBox shell workload runs	PASS
Arbiter contention counters captured	PASS
Separate physical instruction/data RAM ports	NOT RUN
Nonblocking data cache	NOT RUN
Shell-window speedup vs P93	FAIL
LibreLane hardening	NOT RUN

Next

P95 should target data-side latency. The low-risk first step is a store buffer; the more ambitious step is a tiny D-cache. P94 gives us enough traffic attribution to tell whether either one actually reduces the foreground load/store pressure.
