Tiny I-cache fetch · librelane-playground

P88 said the next architectural feature should hit fetch. P89 adds the first fetch-side storage point: a 256-word direct-mapped instruction cache with physical tags.

This is not yet a modern frontend. It is the smallest measured step from “single-word external fetch every time” toward a real application-core frontend.
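To make the shape of that storage point concrete, here is a minimal behavioural sketch in Python. It is not the RTL. It assumes 4-byte words, word-aligned physical addresses, and an index taken from the address bits just above the byte offset; the class and method names are hypothetical. The `invalidate_all` method mirrors the whole-cache invalidation on stores that this post describes.

```python
class TinyICache:
    """Behavioural model: direct-mapped, one word per line, physical tags."""

    def __init__(self, words=256):
        self.words = words
        self.valid = [False] * words
        self.tags = [0] * words
        self.data = [0] * words

    def _split(self, paddr):
        word = paddr >> 2                      # assumption: 4-byte words
        return word % self.words, word // self.words   # (index, tag)

    def lookup(self, paddr):
        """Return the cached word, or None on a miss."""
        index, tag = self._split(paddr)
        if self.valid[index] and self.tags[index] == tag:
            return self.data[index]
        return None

    def refill(self, paddr, word):
        """Install one word fetched from memory."""
        index, tag = self._split(paddr)
        self.valid[index] = True
        self.tags[index] = tag
        self.data[index] = word

    def invalidate_all(self):
        """Drop every entry, as a store does in this design."""
        self.valid = [False] * self.words


cache = TinyICache()
first = cache.lookup(0x80000000)        # cold miss
cache.refill(0x80000000, 0x00000013)    # 0x00000013 encodes addi x0, x0, 0
second = cache.lookup(0x80000000)       # hit
alias = cache.lookup(0x80000400)        # same index, different tag: miss
cache.invalidate_all()
after_store = cache.lookup(0x80000000)  # miss again after invalidation
```

With one word per line there is no spatial locality to exploit: each hit requires that the same address was fetched before and that no store has happened since.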

Result

metric	P88 no I-cache	P89 tiny I-cache	delta
post-load cycles	221,748,021	222,317,206	+0.26%
shell window cycles	66,279,807	66,957,620	+1.02%
retired instructions	86,435,211	86,601,839	+0.19%
CPI	2.5655	2.5671	+0.06%
memory handshakes	32,917,717	31,289,313	-4.95%
memory stall cycles	87,892,031	83,361,545	-5.15%
fetch handshakes	29,224,093	27,560,784	-5.69%
fetch stall cycles	58,870,166	54,266,192	-7.82%

The cache works in the narrow sense: fetch refills and fetch stalls go down. It does not improve the full shell workload yet. That is the useful part of this result, not an embarrassment to hide.
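The delta column can be re-derived from the two measured columns. The sketch below does that arithmetic; the numbers are copied from the table, and the CPI line treats CPI as post-load cycles divided by retired instructions, which is consistent with the values shown but is an assumption about how the row was produced.

```python
# (P88 no I-cache, P89 tiny I-cache) pairs copied from the Result table.
rows = {
    "post-load cycles": (221_748_021, 222_317_206),
    "shell window cycles": (66_279_807, 66_957_620),
    "retired instructions": (86_435_211, 86_601_839),
    "memory handshakes": (32_917_717, 31_289_313),
    "memory stall cycles": (87_892_031, 83_361_545),
    "fetch handshakes": (29_224_093, 27_560_784),
    "fetch stall cycles": (58_870_166, 54_266_192),
}


def pct_delta(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


deltas = {name: round(pct_delta(b, a), 2) for name, (b, a) in rows.items()}

# Assumption: CPI = post-load cycles / retired instructions.
cpi_p88 = rows["post-load cycles"][0] / rows["retired instructions"][0]
cpi_p89 = rows["post-load cycles"][1] / rows["retired instructions"][1]

for name, delta in deltas.items():
    print(f"{name:22s} {delta:+.2f}%")
print(f"CPI {cpi_p88:.4f} -> {cpi_p89:.4f} ({pct_delta(cpi_p88, cpi_p89):+.2f}%)")
```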

I-Cache Counters

counter	value
total hits	6,434,333
hits from `S_FETCH`	3,528,678
hits from `S_WB` prefetch	2,905,655
miss refills	27,560,784

The hits are real, but the miss count is still huge: 27,560,784 refills against 6,434,333 hits. With a one-word line, every sequential instruction that is not already cached needs its own refill, and whole-cache invalidation on every store is intentionally brutal.
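A rough way to put the counters in proportion is sketched below. It assumes that every lookup ends as either a hit or a refill, so lookups = hits + refills; the counters may not be defined that way (prefetch lookups from `S_WB` could be counted differently), so treat the ratio as an estimate rather than a measured hit rate.

```python
total_hits = 6_434_333
hits_s_fetch = 3_528_678
hits_s_wb = 2_905_655
miss_refills = 27_560_784

# The two hit sources add up to the reported total.
hits_consistent = (hits_s_fetch + hits_s_wb == total_hits)

# Assumption: lookups = hits + refills.
lookups = total_hits + miss_refills
naive_hit_rate = total_hits / lookups

# Share of hits that came from the S_WB prefetch path.
wb_share_of_hits = hits_s_wb / total_hits

print(f"hit sources consistent: {hits_consistent}")
print(f"naive hit rate:         {naive_hit_rate:.1%}")
print(f"S_WB share of hits:     {wb_share_of_hits:.1%}")
```

Under that assumption fewer than one lookup in five hits, which fits the picture of a cache that is emptied often and holds only one word per line.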

Memory Stalls

P89 tiny I-cache workload: 83,361,545 memory stall cycles, 31,289,313 handshakes

source	stall cycles	share	requests
instruction fetch	54,266,192	65.1%	27,560,784
data load	14,615,626	17.5%	979,239
data store	11,967,986	14.4%	220,079
atomic memory op	158,190	0.2%	184,053
page walk for fetch	1,125,783	1.4%	1,119,629
page walk for load/store	1,226,052	1.5%	1,225,529
other	1,716	0%	0

Fetch remains the largest memory-stall bucket, but it moved from 58,870,166 cycles in P88 to 54,266,192 cycles here. The next frontend step should be line fill, not another tiny single-word cache tweak.
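Why line fill is the natural next step can be shown with a toy model. The sketch below counts refills for a direct-mapped cache over a synthetic trace of straight-line code; the trace, the 4-word line size, and the function name are illustrative assumptions, not measurements from this core.

```python
def count_refills(trace, lines, words_per_line):
    """Count line refills for a direct-mapped cache over word addresses."""
    tags = [None] * lines          # full line address stored as the tag
    refills = 0
    for word_addr in trace:
        line_addr = word_addr // words_per_line
        index = line_addr % lines
        if tags[index] != line_addr:
            tags[index] = line_addr
            refills += 1
    return refills


# Toy trace: 4096 sequential instruction words, executed once.
trace = list(range(4096))

# Same 256-word capacity, organised two ways.
one_word = count_refills(trace, lines=256, words_per_line=1)
four_word = count_refills(trace, lines=64, words_per_line=4)

print(f"one-word lines:  {one_word} refills")
print(f"four-word lines: {four_word} refills")
```

The model only counts refill requests. Each four-word refill still moves four words, so any saving in stall cycles depends on the memory interface delivering a line in fewer cycles than four separate single-word handshakes.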

Shell Phases

P89 shell workload: 222,317,206 cycles, CPI 2.57

phase	cycles	share
kernel banner to /init	117,614,394	53.1%
/init to shell banner	1,087,274	0.5%
shell banner to first command	36,029,853	16.3%
echo command	1,600	0%
uname -a	1,990,527	0.9%
ls /bin /usr/share	33,547,374	15.1%
cat sample file	2,757,580	1.2%
touch/write/cat/rm /tmp file	11,707,774	5.3%
8x ash loop with file I/O	16,952,102	7.7%
final marker	663	0%

The shell-window number regressed by about 1%. That can happen while fetch stalls improve because the workload has several other large costs: filesystem work, scheduler paths, exceptions/syscalls, BusyBox formatting, and cache invalidations caused by stores.
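The phase numbers can be tied back to the shell-window figure. In the sketch below, the phases from the echo command through the final marker are summed; the result matches the 66,957,620 shell-window cycles reported under Result, which suggests (as an inference, not a documented definition) that the window starts at the first command.

```python
# Phase cycle counts copied from the Shell Phases data.
phases = [
    ("kernel banner to /init", 117_614_394),
    ("/init to shell banner", 1_087_274),
    ("shell banner to first command", 36_029_853),
    ("echo command", 1_600),
    ("uname -a", 1_990_527),
    ("ls /bin /usr/share", 33_547_374),
    ("cat sample file", 2_757_580),
    ("touch/write/cat/rm /tmp file", 11_707_774),
    ("8x ash loop with file I/O", 16_952_102),
    ("final marker", 663),
]

names = [name for name, _ in phases]
start = names.index("echo command")

# Assumption: the shell window covers the commands, not boot or prompt wait.
window = sum(cycles for _, cycles in phases[start:])
ls_share = dict(phases)["ls /bin /usr/share"] / window

print(f"command phases sum to {window:,} cycles")
print(f"ls share of that window: {ls_share:.1%}")
```

Read that way, `ls /bin /usr/share` alone is about half of the shell window, so its filesystem and formatting cost weighs heavily on the shell-window number.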

Cycle Shape

P89 tiny I-cache workload: 222,317,206 cycles, CPI 2.57

state	cycles	share
fetch	8,335,661	3.7%
execute	86,626,877	39%
mem	28,125,173	12.7%
walker	4,696,993	2.1%
writeback	86,601,839	39%
mul/div	7,928,947	3.6%

The interesting state-level change is that memory wait is reduced, not that the core suddenly has a different pipeline. P89 is still the same single-issue in-order FSM core.
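Dividing each state's cycles by the retired-instruction count turns the breakdown into a per-instruction CPI budget. The sketch below uses the P89 numbers from this post; the observation that writeback cycles equal retired instructions is read off the data, not taken from the RTL.

```python
retired = 86_601_839  # P89 retired instructions, from the Result table

# State cycle counts copied from the Cycle Shape data.
states = {
    "fetch": 8_335_661,
    "execute": 86_626_877,
    "mem": 28_125_173,
    "walker": 4_696_993,
    "writeback": 86_601_839,
    "mul/div": 7_928_947,
}

# Cycles spent in each state per retired instruction.
per_instr = {name: cycles / retired for name, cycles in states.items()}
cpi_from_states = sum(per_instr.values())

for name, value in per_instr.items():
    print(f"{name:10s} {value:.3f} cycles/instr")
print(f"sum        {cpi_from_states:.3f}")
```

Execute and writeback each cost about one cycle per instruction, so roughly two of the 2.57 cycles are fixed by the FSM structure and only the remainder is exposed to memory-side improvements.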

Hot Functions

P89 BusyBox shell symbols: 65,388 samples, one sample every 1,024 cycles

symbol	binary	share	samples
printf_core	busybox	5.5%	3,598
memset	kernel	5.1%	3,341
memcpy	busybox	3.5%	2,304
vruntime_eligible	kernel	3.4%	2,246
blake2s_compress_generic	kernel	2.8%	1,811
__fwritex	busybox	2.6%	1,697
memcpy	kernel	2.6%	1,690
handle_exception	kernel	1.9%	1,208
unmap_page_range	kernel	1.7%	1,095
avg_vruntime	kernel	1.4%	913
n_tty_write	kernel	1.3%	860
memset	busybox	1.3%	830
ret_from_exception	kernel	1.2%	776
next_uptodate_folio	kernel	1%	675
do_trap_ecall_u	kernel	1%	639
(remaining)	-	55.5%	36,297

The hot-symbol picture is mostly unchanged: BusyBox formatting and kernel memory/scheduler/exception work remain visible. That says frontend work is necessary but not sufficient.
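Two quick checks on the sample data are sketched below: how many cycles the samples cover, and how the named symbols split between BusyBox and the kernel. The sample counts are copied from this post; reading the covered span as the shell window is an inference from the numbers lining up.

```python
from collections import Counter

samples_total = 65_388
period_cycles = 1_024
shell_window_cycles = 66_957_620  # P89 shell window, from the Result table

# (symbol, binary, samples) for the named hot functions.
symbols = [
    ("printf_core", "busybox", 3_598), ("memset", "kernel", 3_341),
    ("memcpy", "busybox", 2_304), ("vruntime_eligible", "kernel", 2_246),
    ("blake2s_compress_generic", "kernel", 1_811),
    ("__fwritex", "busybox", 1_697), ("memcpy", "kernel", 1_690),
    ("handle_exception", "kernel", 1_208),
    ("unmap_page_range", "kernel", 1_095), ("avg_vruntime", "kernel", 913),
    ("n_tty_write", "kernel", 860), ("memset", "busybox", 830),
    ("ret_from_exception", "kernel", 776),
    ("next_uptodate_folio", "kernel", 675), ("do_trap_ecall_u", "kernel", 639),
]

# Cycles covered by the sampler: one sample per period.
covered_cycles = samples_total * period_cycles

# Named-symbol samples grouped by binary.
by_binary = Counter()
for _name, binary, count in symbols:
    by_binary[binary] += count

print(f"covered cycles: {covered_cycles:,}")
print(f"gap to shell window: {shell_window_cycles - covered_cycles} cycles")
for binary, count in by_binary.items():
    print(f"{binary:8s} {count:,} samples ({count / samples_total:.1%})")
```

The covered span falls within one sampling period of the shell-window cycle count, and among the named symbols the kernel accounts for more samples than BusyBox.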

Toward XiangShan

XiangShan is a modern high-performance RV64 application processor project. P89 is not trying to jump there in one move. The staged path is more like:

step	feature	why it moves us closer
P89	tiny physical I-cache	first frontend storage point
next	cache line fill	reduce sequential-code refill traffic
soon	better I-cache invalidation	stop throwing away the whole frontend on ordinary stores
soon	fetch queue	decouple frontend memory timing from execute timing
later	BTB / branch predictor	stop paying full redirect cost on hot control flow
later	D-cache / LSU cleanup	separate instruction and data locality problems
much later	deeper pipeline, scoreboard, rename, ROB	the long road toward BOOM/XiangShan-class machinery

The important constraint is pacing: every one of those steps needs a benchmark like P88/P89 attached to it, or we are just adding hardware because it sounds adult.

Honest Status

check	status
256-word direct-mapped I-cache RTL	PASS
BusyBox shell workload runs	PASS
Fetch memory stalls improve	PASS
Whole shell workload improves	FAIL
LibreLane hardening	NOT RUN

Next

P90 should turn the word cache into a small line-fill I-cache. If that does not move total shell time, the evidence will be pointing away from raw fetch storage and toward branch/control flow, store invalidation, or D-side cache/LSU work.
