I-cache line fill · librelane-playground

P90 does the obvious follow-up to P89: turn the word I-cache into a 4-word line cache. The implementation is deliberately simple and blocking. On a miss, the core fetches the requested word, enters S_IC_FILL, fills the rest of the line, then returns to S_FETCH.
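The miss path described above can be sketched as a toy software model. This is not the P90 RTL: the class name, the direct-mapped organisation, the number of lines, and the one-cycle-per-word memory are all assumptions made for illustration. Only the 4-word line and the state names S_FETCH and S_IC_FILL come from the write-up.

```python
from enum import Enum, auto

LINE_WORDS = 4  # 4-word line, as in P90


class State(Enum):
    S_FETCH = auto()
    S_IC_FILL = auto()


class BlockingLineCache:
    """Toy direct-mapped line cache (hypothetical sizes, 1 cycle per word)."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # line address held by each slot
        self.valid = [[False] * LINE_WORDS for _ in range(num_lines)]
        self.state = State.S_FETCH
        self.hits = 0
        self.mem_requests = 0
        self.fill_cycles = 0  # cycles spent in S_IC_FILL

    def fetch(self, word_addr):
        """Fetch one instruction word; return the cycles the core waited."""
        line_addr, offset = divmod(word_addr, LINE_WORDS)
        slot = line_addr % self.num_lines
        if self.tags[slot] == line_addr and self.valid[slot][offset]:
            self.hits += 1
            return 0
        # Miss: replace the slot and fetch the requested word first.
        self.tags[slot] = line_addr
        self.valid[slot] = [False] * LINE_WORDS
        self.valid[slot][offset] = True
        self.mem_requests += 1
        waited = 1
        # Then block in S_IC_FILL until the rest of the line is present.
        self.state = State.S_IC_FILL
        for i in range(LINE_WORDS):
            if not self.valid[slot][i]:
                self.valid[slot][i] = True
                self.mem_requests += 1
                self.fill_cycles += 1
                waited += 1
        self.state = State.S_FETCH
        return waited


cache = BlockingLineCache()
straight_line = [cache.fetch(addr) for addr in range(8)]
print(straight_line, cache.hits, cache.fill_cycles)
```

In this model, straight-line code pays one long wait per line and then hits. A fetch pattern that touches one word per line pays the full line cost every time, which is the behavior the rest of this page measures.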

That simplicity is useful because the result is clear: this version is wrong for the shell workload.

Result

metric	P89 word I-cache	P90 blocking line fill	delta
post-load cycles	222,317,206	245,417,593	+10.39%
shell window cycles	66,957,620	84,084,195	+25.58%
retired instructions	86,601,839	89,490,840	+3.34%
CPI	2.5671	2.7424	+6.83%
memory handshakes	31,289,313	37,457,672	+19.71%
memory stall cycles	83,361,545	86,350,457	+3.59%
fetch handshakes	27,560,784	33,504,298	+21.57%
fetch stall cycles	54,266,192	55,950,261	+3.10%
I-cache hits	6,434,333	12,352,686	+91.98%

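The cache gets far more hits, but it manufactures them by stalling the core. S_IC_FILL accounts for 11,224,132 cycles: about 4.6% of all post-load cycles, and equal in size to nearly half of the 23,100,387-cycle increase over P89.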
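The delta column and the CPI row can be reproduced from the raw counters. The sketch below only re-does arithmetic on numbers copied from the table; the variable names are illustrative.

```python
# (P89, P90) pairs copied from the Result table.
metrics = {
    "post-load cycles": (222_317_206, 245_417_593),
    "shell window cycles": (66_957_620, 84_084_195),
    "retired instructions": (86_601_839, 89_490_840),
    "memory handshakes": (31_289_313, 37_457_672),
    "memory stall cycles": (83_361_545, 86_350_457),
    "fetch handshakes": (27_560_784, 33_504_298),
    "fetch stall cycles": (54_266_192, 55_950_261),
    "I-cache hits": (6_434_333, 12_352_686),
}
S_IC_FILL_CYCLES = 11_224_132


def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for name, (p89, p90) in metrics.items():
    print(f"{name:22s} {pct_delta(p89, p90):+7.2f}%")

# CPI is post-load cycles divided by retired instructions.
cycles_p89, cycles_p90 = metrics["post-load cycles"]
instr_p89, instr_p90 = metrics["retired instructions"]
cpi_p89 = cycles_p89 / instr_p89
cpi_p90 = cycles_p90 / instr_p90

# How large is the fill state relative to the extra cycles P90 spends?
extra_cycles = cycles_p90 - cycles_p89
fill_share_of_increase = S_IC_FILL_CYCLES / extra_cycles
print(f"CPI {cpi_p89:.4f} -> {cpi_p90:.4f}, fill share {fill_share_of_increase:.1%}")
```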

Memory Stalls

P90 blocking line-fill workload: 86,350,457 memory stall cycles, 37,457,672 handshakes

source	stall cycles	share	requests
instruction fetch	55,950,261	64.8%	33,504,298
data load	15,291,828	17.7%	1,034,172
data store	12,441,554	14.4%	234,015
atomic memory op	166,469	0.2%	193,202
page walk for fetch	1,200,622	1.4%	1,194,468
page walk for load/store	1,298,019	1.5%	1,297,517
other	1,704	0%	0

Compared with P88, fetch stalls are still lower. Compared with P89, they are worse. That tells us P89’s tiny cache was directionally right, but P90’s blocking fill policy is too blunt.
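One way to read the breakdown is stall cycles per request rather than total stall cycles. The sketch below derives that ratio from the figures above; the ratio itself is not reported in the write-up, so treat it as a derived view of the same counters.

```python
# (stall cycles, requests) per source, copied from the Memory Stalls figures.
stalls = {
    "instruction fetch": (55_950_261, 33_504_298),
    "data load": (15_291_828, 1_034_172),
    "data store": (12_441_554, 234_015),
    "atomic memory op": (166_469, 193_202),
    "page walk for fetch": (1_200_622, 1_194_468),
    "page walk for load/store": (1_298_019, 1_297_517),
    "other": (1_704, 0),
}

total_stalls = sum(cycles for cycles, _ in stalls.values())
total_requests = sum(reqs for _, reqs in stalls.values())

per_request = {}
for name, (cycles, reqs) in stalls.items():
    share = 100.0 * cycles / total_stalls
    # "other" has no requests, so a per-request figure is undefined for it.
    per_request[name] = cycles / reqs if reqs else None
    ratio = f"{per_request[name]:.2f}" if reqs else "n/a"
    print(f"{name:26s} {share:5.1f}%  {ratio} stall cycles/request")
```

Instruction fetch dominates the total because it issues by far the most requests, not because each individual fetch is slow.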

Shell Phases

P90 shell workload: 245,417,593 cycles, CPI 2.74

phase	cycles	share
kernel banner to /init	119,143,502	48.7%
/init to shell banner	1,156,340	0.5%
shell banner to first command	40,405,326	16.5%
echo command	2,538	0%
uname -a	2,207,729	0.9%
ls /bin /usr/share	39,732,205	16.2%
cat sample file	5,276,294	2.2%
touch/write/cat/rm /tmp file	10,985,502	4.5%
8x ash loop with file I/O	24,353,026	10%
final marker	1,526,901	0.6%

The shell window gets hit hard: ls, cat, and the ash loop all pay for the blocking frontend. Linux shell work jumps around enough that waiting for every line to fill before executing is a bad trade.
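The phase list can be tied back to the "shell window cycles" row in the Result table. The sketch below assumes the shell window spans the phases from the echo command through the final marker. That boundary is inferred from the numbers, not stated in the write-up.

```python
# Phase cycle counts copied from the Shell Phases figures, in order.
phases = [
    ("kernel banner to /init", 119_143_502),
    ("/init to shell banner", 1_156_340),
    ("shell banner to first command", 40_405_326),
    ("echo command", 2_538),
    ("uname -a", 2_207_729),
    ("ls /bin /usr/share", 39_732_205),
    ("cat sample file", 5_276_294),
    ("touch/write/cat/rm /tmp file", 10_985_502),
    ("8x ash loop with file I/O", 24_353_026),
    ("final marker", 1_526_901),
]
SHELL_WINDOW_CYCLES = 84_084_195  # P90 value from the Result table

# Assumed window: everything from the first shell command onward.
names = [name for name, _ in phases]
window = phases[names.index("echo command"):]
window_total = sum(cycles for _, cycles in window)
window_share = {name: cycles / window_total for name, cycles in window}

for name, share in window_share.items():
    print(f"{name:30s} {share:6.1%}")
```

Under that assumption the phase sum matches the table exactly, and ls plus the ash loop make up roughly three quarters of the window.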

Cycle Shape

P90 line-fill workload: 245,417,593 cycles, CPI 2.74

state	cycles	share
fetch	11,323,240	4.6%
execute	89,518,022	36.5%
mem	29,361,240	12%
walker	4,990,626	2%
writeback	89,490,840	36.5%
mul/div	9,507,789	3.9%

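The new state exists for a reason, and it is not cheap. S_IC_FILL is not broken out in the figures above, but at 11,224,132 cycles (about 4.6% of the workload) it is nearly as large as the entire fetch state. That is the bug, not a side detail.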
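The listed states do not add up to the workload total, and the gap lines up with the S_IC_FILL figure from the Result section. The sketch below is a consistency check on the published counters. Reading the final 1,704 cycles as the "other" bucket from the memory-stall breakdown is an assumption based on the matching value.

```python
# State cycle counts copied from the Cycle Shape figures.
states = {
    "fetch": 11_323_240,
    "execute": 89_518_022,
    "mem": 29_361_240,
    "walker": 4_990_626,
    "writeback": 89_490_840,
    "mul/div": 9_507_789,
}
TOTAL_CYCLES = 245_417_593
S_IC_FILL_CYCLES = 11_224_132   # from the Result section
RETIRED_INSTRUCTIONS = 89_490_840

listed = sum(states.values())
unlisted = TOTAL_CYCLES - listed            # cycles not in any listed state
leftover = unlisted - S_IC_FILL_CYCLES      # what remains after S_IC_FILL

print(f"listed {listed:,}  unlisted {unlisted:,}  leftover {leftover:,}")
print(f"S_IC_FILL share of workload: {S_IC_FILL_CYCLES / TOTAL_CYCLES:.2%}")
print(f"S_IC_FILL vs fetch state: {S_IC_FILL_CYCLES / states['fetch']:.2f}x")
```

The writeback count also equals the retired-instruction count, which is consistent with one writeback cycle per retired instruction.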

Hot Functions

P90 BusyBox shell symbols: 82,113 samples, one sample every 1,024 cycles

symbol	image	share of samples	samples
printf_core	busybox	5.6%	4,566
memset	kernel	4.4%	3,581
vruntime_eligible	kernel	4.2%	3,427
memcpy	busybox	3.7%	3,070
__fwritex	busybox	2.4%	1,969
blake2s_compress_generic	kernel	2.2%	1,804
memcpy	kernel	2.1%	1,739
handle_exception	kernel	1.8%	1,433
avg_vruntime	kernel	1.6%	1,341
memset	busybox	1.6%	1,339
unmap_page_range	kernel	1.4%	1,120
update_curr	kernel	1.1%	929
n_tty_write	kernel	1.1%	859
ret_from_exception	kernel	1%	851
n_tty_read	kernel	1%	813
(remaining)	-	56.3%	46,251

The hot-symbol picture does not magically change. Frontend stalls are important, but BusyBox formatting, kernel memory routines, scheduler work, and exception paths still matter.
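Sample counts can be turned into rough cycle estimates by multiplying by the sampling period. The sketch below assumes the profile covers the shell window, which is inferred from the sample count and period nearly matching the P90 shell window cycles. Per-symbol figures are estimates with sampling error, not measurements.

```python
SAMPLE_PERIOD = 1_024              # cycles between samples
TOTAL_SAMPLES = 82_113
SHELL_WINDOW_CYCLES = 84_084_195   # P90 value from the Result table

# (symbol, image) -> samples, copied from the Hot Functions list.
top_symbols = {
    ("printf_core", "busybox"): 4_566,
    ("memset", "kernel"): 3_581,
    ("vruntime_eligible", "kernel"): 3_427,
    ("memcpy", "busybox"): 3_070,
    ("__fwritex", "busybox"): 1_969,
    ("blake2s_compress_generic", "kernel"): 1_804,
    ("memcpy", "kernel"): 1_739,
    ("handle_exception", "kernel"): 1_433,
    ("avg_vruntime", "kernel"): 1_341,
    ("memset", "busybox"): 1_339,
    ("unmap_page_range", "kernel"): 1_120,
    ("update_curr", "kernel"): 929,
    ("n_tty_write", "kernel"): 859,
    ("ret_from_exception", "kernel"): 851,
    ("n_tty_read", "kernel"): 813,
}

sampled_cycles = TOTAL_SAMPLES * SAMPLE_PERIOD
estimated_cycles = {key: n * SAMPLE_PERIOD for key, n in top_symbols.items()}
top_share = sum(top_symbols.values()) / TOTAL_SAMPLES

print(f"cycles covered by samples: {sampled_cycles:,}")
print(f"printf_core estimate: {estimated_cycles[('printf_core', 'busybox')]:,} cycles")
print(f"top 15 symbols: {top_share:.1%} of samples")
```

The 15 named symbols cover only about a third of the samples, which fits the point above: no single function dominates.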

Toward XiangShan

This is a small but important architectural lesson on the way toward a more XiangShan-like frontend: line fill has to be decoupled. Modern application cores do not stop the machine just to make future fetches pretty. They return the critical word, keep fill buffers in flight, feed a queue, and let prediction keep the frontend pointed at useful code.

step	result
P89 tiny word cache	fetch stalls down, whole workload flat
P90 blocking line fill	more hits, whole workload much worse
P91 target	nonblocking or critical-word-first line fill

Honest Status

check	status
4-word I-cache line storage	PASS
Blocking line-fill FSM	PASS
BusyBox shell workload runs	PASS
Whole workload speedup vs P89	FAIL
LibreLane hardening	NOT RUN

Next

P91 should keep the line cache but remove the blocking behavior. The critical word should execute immediately; the rest of the line should be filled only when the fetch side can spare the bus.
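To make the difference concrete, here is a toy comparison of the two policies on a fetch trace. It is a sketch under strong assumptions, not a P91 design: the cache never evicts, memory delivers one word per bus cycle, each instruction leaves `idle_per_instr` bus-idle cycles (a hypothetical parameter), and a new demand miss abandons any fill still in flight. All function names are illustrative.

```python
LINE_WORDS = 4


def blocking_stalls(trace):
    """Stall cycles when every miss waits for the whole line (P90-style)."""
    lines = set()
    stalls = 0
    for addr in trace:
        line = addr // LINE_WORDS
        if line not in lines:
            lines.add(line)
            stalls += LINE_WORDS  # critical word plus the rest of the line
    return stalls


def critical_word_first_stalls(trace, idle_per_instr=2):
    """Stall cycles when only the requested word blocks the core.

    The rest of the line arrives in the background, one word per idle bus
    cycle, in wrap-around order starting after the critical word.
    """
    present = set()   # word addresses already in the cache
    pending = []      # words still to fill for the most recent miss
    stalls = 0
    for addr in trace:
        if addr not in present:
            stalls += 1  # wait for the critical word only
            present.add(addr)
            base = addr - addr % LINE_WORDS
            offset = addr % LINE_WORDS
            order = [base + (offset + i) % LINE_WORDS for i in range(1, LINE_WORDS)]
            pending = [w for w in order if w not in present]
        # While the instruction executes, the bus is free for background fill.
        for _ in range(idle_per_instr):
            if pending:
                present.add(pending.pop(0))
    return stalls


straight = list(range(8))            # sequential code, two full lines
jumpy = [0, 16, 32, 48] * 2          # one word per line, revisited once

for name, trace in (("straight", straight), ("jumpy", jumpy)):
    print(name, blocking_stalls(trace), critical_word_first_stalls(trace))
```

In this model the blocking policy pays four cycles per line regardless of how many words are used, while the nonblocking policy pays one cycle per miss and hides the rest behind execution. Real gains depend on bus contention with loads and stores, which the sketch ignores.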
