I-cache fill buffer · librelane-playground

P91 fixes the P90 line-fill policy. The cache is still 64 direct-mapped lines with 4 words per line, but an I-cache miss no longer parks the core in S_IC_FILL. The returned critical word is executed immediately; the rest of the line is filled later from a one-entry background descriptor when S_WB has a spare memory slot.
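The policy described above can be sketched as a small behavioural model. This is only an illustration, not the core's RTL: the class and method names, the per-word valid bits, the 4-byte word size, the ascending fill order, and the rule that a newer miss replaces the single descriptor are all assumptions made for the sketch. Only the 64-line, 4-word, direct-mapped shape and the one-entry descriptor come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LINES = 64           # direct-mapped lines (from the text)
WORDS_PER_LINE = 4   # words per line (from the text)


@dataclass
class FillDescriptor:
    """Hypothetical one-entry record of a partly filled line."""
    index: int
    pending: List[int]  # word offsets still to be fetched


@dataclass
class ICacheModel:
    tags: List[Optional[int]] = field(default_factory=lambda: [None] * LINES)
    valid: List[List[bool]] = field(
        default_factory=lambda: [[False] * WORDS_PER_LINE for _ in range(LINES)]
    )
    desc: Optional[FillDescriptor] = None

    @staticmethod
    def split(addr: int):
        """Split a byte address into (tag, index, offset); assumes 4-byte words."""
        word = addr >> 2
        line = word // WORDS_PER_LINE
        return line // LINES, line % LINES, word % WORDS_PER_LINE

    def fetch(self, addr: int) -> str:
        """Demand fetch: a miss returns only the critical word, never the whole line."""
        tag, index, offset = self.split(addr)
        if self.tags[index] == tag and self.valid[index][offset]:
            return "hit"
        if self.tags[index] != tag:
            # New line: claim it and queue the other words for background fill.
            # Assumption: a newer miss simply replaces the single descriptor.
            self.tags[index] = tag
            self.valid[index] = [False] * WORDS_PER_LINE
            rest = [o for o in range(WORDS_PER_LINE) if o != offset]
            self.desc = FillDescriptor(index, rest)
        elif self.desc and self.desc.index == index and offset in self.desc.pending:
            # The demand fetch beat the background fill to this word.
            self.desc.pending.remove(offset)
            if not self.desc.pending:
                self.desc = None
        self.valid[index][offset] = True
        return "miss"

    def background_step(self) -> bool:
        """Called when a spare memory slot exists; fills at most one pending word."""
        if self.desc is None:
            return False
        offset = self.desc.pending.pop(0)
        self.valid[self.desc.index][offset] = True
        if not self.desc.pending:
            self.desc = None
        return True
```

In this model a miss never waits for the rest of the line. Neighbouring words become hits only after `background_step` has been granted enough spare slots, which is the trade the fill-buffer counters later in the page measure.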

This is still not a real XiangShan-style frontend. It is the first measured step away from a blocking frontend.

Result

metric	P89 word I-cache	P90 blocking line fill	P91 fill buffer
post-load cycles	222,317,206	245,417,593	221,327,811
shell window cycles	66,957,620	84,084,195	65,985,297
retired instructions	86,601,839	89,490,840	86,295,205
CPI	2.5671	2.7424	2.5648
memory handshakes	31,289,313	37,457,672	31,944,860
memory stall cycles	83,361,545	86,350,457	84,488,165
fetch handshakes	27,560,784	33,504,298	28,263,597
fetch stall cycles	54,266,192	55,950,261	55,533,555
I-cache hits	6,434,333	12,352,686	8,855,599

P91 is a recovery from P90 and a small win over P89:

comparison	result
shell window vs P90	-21.52%
post-load cycles vs P90	-9.82%
shell window vs P89	-1.45%
post-load cycles vs P89	-0.45%
fetch stalls vs P89	+2.34%

The last row matters. P91 wins shell time, but it does not yet beat P89 on fetch-stall cycles.
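The comparison rows can be reproduced from the Result table with plain relative-change arithmetic. The sketch below copies the table figures; the dictionary layout and function names are illustrative only.

```python
# Figures copied from the Result table (all values are cycle or instruction counts).
RUNS = {
    "P89": {"post_load": 222_317_206, "shell": 66_957_620,
            "retired": 86_601_839, "fetch_stall": 54_266_192},
    "P90": {"post_load": 245_417_593, "shell": 84_084_195,
            "retired": 89_490_840, "fetch_stall": 55_950_261},
    "P91": {"post_load": 221_327_811, "shell": 65_985_297,
            "retired": 86_295_205, "fetch_stall": 55_533_555},
}


def pct_change(metric: str, new: str, old: str) -> float:
    """Relative change of `metric` from run `old` to run `new`, in percent."""
    a, b = RUNS[new][metric], RUNS[old][metric]
    return (a - b) / b * 100.0


def cpi(run: str) -> float:
    """Cycles per instruction over the post-load window."""
    return RUNS[run]["post_load"] / RUNS[run]["retired"]


for metric, base in [("shell", "P90"), ("post_load", "P90"),
                     ("shell", "P89"), ("post_load", "P89"),
                     ("fetch_stall", "P89")]:
    print(f"{metric} vs {base}: {pct_change(metric, 'P91', base):+.2f}%")
print(f"P91 CPI: {cpi('P91'):.4f}")
```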

Fill Buffer

counter	value
background active cycles	195,614,189
background issue cycles	4,698,933
background fills	846,917

The descriptor is active for most of the run but only occasionally gets an idle bus slot. That is enough to erase P90’s blocking penalty, but it is not enough to make instruction delivery feel like a modern frontend.
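To put rough numbers on "most of the run" and "only occasionally", the counters can be turned into two ratios. This assumes the background counters cover the same post-load window as the 221,327,811-cycle total, which the page does not state explicitly.

```python
POST_LOAD_CYCLES = 221_327_811  # P91 post-load cycles (Result table)
BG_ACTIVE_CYCLES = 195_614_189  # cycles with a live background descriptor
BG_ISSUE_CYCLES = 4_698_933     # cycles in which the descriptor issued a request

# Share of the run during which a descriptor was waiting or working.
active_share = BG_ACTIVE_CYCLES / POST_LOAD_CYCLES

# Share of those active cycles in which the descriptor actually got the bus.
issue_share = BG_ISSUE_CYCLES / BG_ACTIVE_CYCLES

print(f"descriptor active: {active_share:.1%} of post-load cycles")
print(f"descriptor issuing: {issue_share:.1%} of active cycles")
```

Under that assumption the descriptor is live for roughly 88% of the run but issues in only about 2.4% of its active cycles.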

Memory Stalls

P91 fill-buffer workload: 84,488,165 memory stall cycles across 31,944,860 handshakes.

source	stall cycles	share	requests
instruction fetch	55,533,555	65.7%	28,263,597
data load	14,549,605	17.2%	966,517
data store	11,923,432	14.1%	215,684
atomic memory op	157,302	0.2%	183,178
page walk for fetch	1,113,908	1.3%	1,107,754
page walk for load/store	1,208,647	1.4%	1,208,130
other	1,716	0%	0

Fetch is still the largest memory-stall bucket. The fill buffer changes where the pain lands; it does not remove the instruction-delivery problem.
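One way to read the stall table is to divide each bucket's stall cycles by its request count. The sketch below does that for the three largest buckets. The per-request average is a derived figure, not something the page reports, and it hides the spread between hits and misses.

```python
# (stall cycles, requests) copied from the memory-stall table.
STALLS = {
    "instruction fetch": (55_533_555, 28_263_597),
    "data load": (14_549_605, 966_517),
    "data store": (11_923_432, 215_684),
    "atomic memory op": (157_302, 183_178),
    "page walk for fetch": (1_113_908, 1_107_754),
    "page walk for load/store": (1_208_647, 1_208_130),
    "other": (1_716, 0),
}

total_stalls = sum(cycles for cycles, _ in STALLS.values())
total_requests = sum(reqs for _, reqs in STALLS.values())


def share(name: str) -> float:
    """Bucket's share of all memory stall cycles, in percent."""
    return STALLS[name][0] / total_stalls * 100.0


def stalls_per_request(name: str) -> float:
    """Average stall cycles per request for one bucket."""
    cycles, reqs = STALLS[name]
    return cycles / reqs


for name in ("instruction fetch", "data load", "data store"):
    print(f"{name}: {share(name):.1f}% of stalls, "
          f"{stalls_per_request(name):.2f} stall cycles per request")
```

On these figures fetch dominates by volume rather than by cost per request: each fetch request stalls for about two cycles on average, while loads and stores stall far longer per request but are far rarer.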

Shell Phases

P91 shell workload: 221,327,811 cycles, CPI 2.56.

phase	cycles	share
kernel banner to /init	117,613,988	53.3%
/init to shell banner	1,085,500	0.5%
shell banner to first command	36,014,961	16.3%
echo command	1,598	0%
uname -a	2,599,477	1.2%
ls /bin /usr/share	32,253,703	14.6%
cat sample file	3,183,166	1.4%
touch/write/cat/rm /tmp file	10,684,061	4.8%
8x ash loop with file I/O	16,308,284	7.4%
final marker	955,008	0.4%

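The shell window is the practical win. The phases from the echo command through the final marker sum to the 65,985,297 shell-window cycles in the Result table, so this run gets through the same BusyBox commands faster than both P89 and P90.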
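The shell-window figure can be cross-checked against the phase list. Treating the window as "echo command through final marker" is an inference from the arithmetic, not a definition given on the page.

```python
# Phase cycle counts copied from the shell-phase list, in run order.
PHASES = [
    ("kernel banner to /init", 117_613_988),
    ("/init to shell banner", 1_085_500),
    ("shell banner to first command", 36_014_961),
    ("echo command", 1_598),
    ("uname -a", 2_599_477),
    ("ls /bin /usr/share", 32_253_703),
    ("cat sample file", 3_183_166),
    ("touch/write/cat/rm /tmp file", 10_684_061),
    ("8x ash loop with file I/O", 16_308_284),
    ("final marker", 955_008),
]

# Assumed window: everything from the first shell command onward.
first = next(i for i, (name, _) in enumerate(PHASES) if name == "echo command")
shell_window = sum(cycles for _, cycles in PHASES[first:])

print(f"shell window: {shell_window:,} cycles")
```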

Cycle Shape

P91 fill-buffer workload: 221,327,811 cycles, CPI 2.56.

state	cycles	share
fetch	8,307,587	3.8%
execute	86,320,041	39%
mem	27,995,718	12.6%
walker	4,638,439	2.1%
writeback	86,295,205	39%
mul/div	7,769,105	3.5%

The important absence is S_IC_FILL. P91 keeps the 4-word line cache without spending a visible state bucket waiting for future words.
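A quick consistency check on the state breakdown: the six listed buckets should account for essentially the whole run if no hidden fill state exists. The sketch below sums them. The state names are the labels from the breakdown, not necessarily the RTL state names.

```python
TOTAL_CYCLES = 221_327_811
RETIRED = 86_295_205  # P91 retired instructions (Result table)

# Cycle counts copied from the state breakdown.
STATES = {
    "fetch": 8_307_587,
    "execute": 86_320_041,
    "mem": 27_995_718,
    "walker": 4_638_439,
    "writeback": 86_295_205,
    "mul/div": 7_769_105,
}

accounted = sum(STATES.values())
unaccounted = TOTAL_CYCLES - accounted

print(f"accounted: {accounted:,} cycles, unaccounted: {unaccounted:,}")
print(f"writeback cycles per retired instruction: {STATES['writeback'] / RETIRED:.3f}")
```

The listed states cover all but 1,716 cycles of the run, and the writeback count equals the retired-instruction count, which is consistent with one writeback cycle per retired instruction.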

Hot Functions

P91 BusyBox shell symbols: 64,439 samples, one sample every 1,024 cycles.

symbol	module	share	samples
printf_core	busybox	5.4%	3,504
memset	kernel	5.2%	3,318
memcpy	busybox	3.6%	2,334
vruntime_eligible	kernel	3.4%	2,206
blake2s_compress_generic	kernel	2.8%	1,798
memcpy	kernel	2.6%	1,704
__fwritex	busybox	2.6%	1,670
handle_exception	kernel	1.8%	1,156
unmap_page_range	kernel	1.7%	1,103
n_tty_write	kernel	1.4%	884
avg_vruntime	kernel	1.3%	829
memset	busybox	1.3%	822
ret_from_exception	kernel	1.2%	777
next_uptodate_folio	kernel	1%	655
n_tty_read	kernel	1%	640
(remaining)	-	55.5%	35,730

The hot-symbol view remains a mix of BusyBox, kernel memory routines, scheduler/exception paths, and shell machinery. Frontend work helps, but Linux shell performance is not a single-knob problem.

Honest Status

check	status
Critical-word-first miss handling	PASS
Background I-cache fill descriptor	PASS
BusyBox shell workload runs	PASS
Shell-window speedup vs P90	PASS
Shell-window speedup vs P89	PASS
Fetch-stall speedup vs P89	FAIL
LibreLane hardening	NOT RUN

Next

P92 should add a fetch queue. P91 makes line fill less foolish, but execute still sees frontend memory timing directly. A queue is the next decoupling point.
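Why a queue decouples the two sides can be shown with a toy timing model. Nothing here describes P92 itself, which this page only proposes: the function names, the per-instruction latency lists, and the queue depth are hypothetical, and the model ignores branches, flushes, and bus sharing.

```python
from typing import List


def coupled_cycles(fetch_lat: List[int], exec_lat: List[int]) -> int:
    """Blocking frontend: each instruction is fetched, then executed, in series."""
    return sum(f + e for f, e in zip(fetch_lat, exec_lat))


def queued_cycles(fetch_lat: List[int], exec_lat: List[int], depth: int) -> int:
    """Fetch runs ahead of execute through a FIFO holding up to `depth` entries."""
    assert depth >= 1
    fetch_done: List[int] = []
    exec_start: List[int] = []
    exec_done: List[int] = []
    for i, (f, e) in enumerate(zip(fetch_lat, exec_lat)):
        prev_fetch = fetch_done[i - 1] if i else 0
        # A queue slot frees up once instruction i - depth has started executing.
        slot_free = exec_start[i - depth] if i >= depth else 0
        fetch_done.append(max(prev_fetch, slot_free) + f)
        prev_exec = exec_done[i - 1] if i else 0
        exec_start.append(max(fetch_done[i], prev_exec))
        exec_done.append(exec_start[i] + e)
    return exec_done[-1] if exec_done else 0


# Made-up latencies: one slow fetch (4 cycles) and two slow executes (3 cycles).
fetch_lat = [1, 4, 1, 1]
exec_lat = [3, 1, 1, 3]
print("coupled:", coupled_cycles(fetch_lat, exec_lat))
print("queued :", queued_cycles(fetch_lat, exec_lat, depth=2))
```

In the coupled model every fetch latency adds directly to run time. With the queue, the slow fetch overlaps the preceding slow execute, which is the sense in which execute stops seeing frontend memory timing directly.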
