D-cache line-fill · librelane-playground

P97 extends P96’s one-word D-cache into four-word lines. Demand loads are still critical-word-first: the requested word returns immediately, then the cache tries to fill the other words through a background request class.
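The behaviour described above can be modelled in a few lines. This is a minimal sketch, not the P97 RTL: the direct-mapped layout, the per-word valid bits, the line count, and every name in it (`LineFillCache`, `background_step`, and so on) are assumptions made for illustration. Stores and invalidations are left out.

```python
from collections import deque

WORDS_PER_LINE = 4  # four-word lines, as in P97


class LineFillCache:
    """Toy direct-mapped cache with per-word valid bits (assumed layout)."""

    def __init__(self, num_lines=8):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.valid = [[False] * WORDS_PER_LINE for _ in range(num_lines)]
        self.data = [[0] * WORDS_PER_LINE for _ in range(num_lines)]
        self.background = deque()  # word addresses waiting for a background fill

    def _split(self, word_addr):
        line_addr, offset = divmod(word_addr, WORDS_PER_LINE)
        tag, index = divmod(line_addr, self.num_lines)
        return index, tag, offset

    def _fill_word(self, word_addr, memory):
        index, _, offset = self._split(word_addr)
        self.data[index][offset] = memory[word_addr]
        self.valid[index][offset] = True

    def load(self, word_addr, memory):
        index, tag, offset = self._split(word_addr)
        if self.tags[index] == tag and self.valid[index][offset]:
            return self.data[index][offset], "hit"
        if self.tags[index] != tag:
            # Allocate the line and queue the three sibling words.
            self.tags[index] = tag
            self.valid[index] = [False] * WORDS_PER_LINE
            base = word_addr - offset
            for i in range(WORDS_PER_LINE):
                if i != offset:
                    self.background.append(base + i)
        # Critical word first: only the requested word uses the demand path.
        self._fill_word(word_addr, memory)
        return self.data[index][offset], "miss"

    def background_step(self, memory):
        """Service at most one useful background fill; return True if one ran."""
        while self.background:
            word_addr = self.background.popleft()
            index, tag, offset = self._split(word_addr)
            if self.tags[index] == tag and not self.valid[index][offset]:
                self._fill_word(word_addr, memory)
                return True
        return False


memory = [1000 + i for i in range(64)]
cache = LineFillCache()
print(cache.load(5, memory))   # demand miss on word 5
while cache.background_step(memory):
    pass                       # drain the queued sibling words
print(cache.load(6, memory))   # sibling word after the background fill
```

In this model every `background_step` call stands for one extra request to the shared memory, which is where the cost discussed below comes from.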

It works functionally. It is slower than P96.

Result

metric	P94 arbiter	P96 D-cache v0	P97 line-fill
post-load cycles	222,459,202	221,522,958	222,850,787
shell window cycles	67,050,374	66,084,155	67,369,576
retired instructions	86,664,089	86,344,929	86,777,980
CPI	2.5669	2.5656	2.5681
memory stall cycles	60,032,329	59,418,375	60,295,642
load stall cycles	14,632,992	10,976,902	10,387,310
fetch stall cycles	23,549,359	26,676,104	29,593,757

comparison	result
shell window vs P96	+1.95%
post-load cycles vs P96	+0.60%
memory stalls vs P96	+1.48%
load stalls vs P96	-5.37%
fetch stalls vs P96	+10.94%
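The comparison rows and the CPI column can be reproduced from the raw counters in the first table. A small sketch, with the dictionary keys and the `pct_change` helper being names chosen here rather than anything from the measurement scripts:

```python
def pct_change(new, old):
    """Relative change of new against old, in percent."""
    return (new - old) / old * 100.0


P96 = {
    "shell window": 66_084_155,
    "post-load cycles": 221_522_958,
    "memory stalls": 59_418_375,
    "load stalls": 10_976_902,
    "fetch stalls": 26_676_104,
}
P97 = {
    "shell window": 67_369_576,
    "post-load cycles": 222_850_787,
    "memory stalls": 60_295_642,
    "load stalls": 10_387_310,
    "fetch stalls": 29_593_757,
}

for name in P96:
    print(f"{name} vs P96: {pct_change(P97[name], P96[name]):+.2f}%")

# CPI is post-load cycles divided by retired instructions.
p97_cpi = P97["post-load cycles"] / 86_777_980
print(f"P97 CPI: {p97_cpi:.4f}")
```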

D-cache Counters

counter	P96	P97
load hits	3,656,064	4,370,122
load misses	6,354,876	5,746,602
demand fills	6,354,876	5,746,602
background fills	0	3,419,006
background active cycles	0	85,257,787
store updates	10,473,803	10,547,848
invalidations	1,873,327	1,874,674

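The cache geometry helps the data side: load hits rise from 3.66M to 4.37M and load misses fall by about 9.6%. The fill policy hurts the machine: 3.42M background fills add traffic to the shared memory path, and fetch stalls rise by 10.94%. That is the useful result.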
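The counters can be reduced to a few ratios that make the trade-off easier to see. A sketch using only the numbers in the table; the function name is made up here, and reading "background active cycles" as cycles with a background fill in flight is an assumption, not something the counter definition confirms.

```python
def hit_rate(hits, misses):
    """Fraction of loads that hit in the D-cache."""
    return hits / (hits + misses)


p96_rate = hit_rate(3_656_064, 6_354_876)
p97_rate = hit_rate(4_370_122, 5_746_602)
print(f"P96 load hit rate: {p96_rate:.1%}")
print(f"P97 load hit rate: {p97_rate:.1%}")

# Relative drop in load misses from P96 to P97.
miss_drop = (6_354_876 - 5_746_602) / 6_354_876
print(f"load misses removed: {miss_drop:.1%}")

# Average background-active cycles per completed background fill
# (assumes "active" means a background fill is in flight).
cycles_per_fill = 85_257_787 / 3_419_006
print(f"background active cycles per fill: {cycles_per_fill:.1f}")
```

If that reading of the counter holds, each background fill stays active for roughly 25 cycles on average, which is a lot of time to spend competing with instruction fetch.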

Memory Stalls

P97 D-cache line-fill workload: 60,295,642 stall cycles, 69,126,542 handshakes.

source	stall cycles	share	requests
instruction fetch	29,593,757	49.1%	47,122,281
data load	10,387,310	17.2%	875,234
data store	12,008,121	19.9%	219,405
atomic memory op	158,700	0.3%	184,997
page walk for fetch	1,130,883	1.9%	1,124,729
page walk for load/store	1,229,718	2%	1,229,207
other	5,787,153	9.6%	18,370,689

Load stalls drop again, but fetch stalls rise enough to lose the P96 shell-window win.

Shell Phases

P97 shell workload: 222,850,787 cycles, CPI 2.57.

phase	cycles	share
kernel banner to /init	117,614,831	52.9%
/init to shell banner	1,081,377	0.5%
shell banner to first command	36,156,938	16.3%
echo command	1,598	0%
uname -a	2,616,228	1.2%
ls /bin /usr/share	31,715,496	14.3%
cat sample file	4,087,721	1.8%
touch/write/cat/rm /tmp file	11,430,280	5.1%
8x ash loop with file I/O	16,108,796	7.3%
final marker	1,409,457	0.6%

The full BusyBox shell script reaches P97-FILE-OK. The shell window is 67.37M cycles, slower than both P96 and P94.
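The shell-window figure can be cross-checked against the phase table. Summing the phases from the echo command through the final marker gives exactly the 67,369,576 cycles reported for P97, which suggests the window runs from the first command to the final marker. That boundary is inferred from the arithmetic, not stated anywhere on this page.

```python
# Phase cycle counts from the P97 shell phases table,
# starting at the first command.
PHASES = {
    "echo command": 1_598,
    "uname -a": 2_616_228,
    "ls /bin /usr/share": 31_715_496,
    "cat sample file": 4_087_721,
    "touch/write/cat/rm /tmp file": 11_430_280,
    "8x ash loop with file I/O": 16_108_796,
    "final marker": 1_409_457,
}

shell_window = sum(PHASES.values())
print(f"shell window: {shell_window:,} cycles")
```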

Cycle Shape

P97 D-cache line-fill workload: 222,850,787 cycles, CPI 2.57.

state	cycles	share
fetch	8,335,018	3.7%
execute	86,803,158	39%
mem	28,203,889	12.7%
walker	4,714,537	2.1%
writeback	86,777,980	38.9%
mul/div	8,014,501	3.6%

P97 does not add a blocking line-fill state. The cost appears as more shared-memory service pressure while the normal state machine runs.

Hot Functions

P97 BusyBox shell symbols: 65,790 samples, one sample every 1,024 cycles.

symbol	image	share	samples
printf_core	busybox	5.5%	3,630
memset	kernel	5.1%	3,376
vruntime_eligible	kernel	3.5%	2,318
memcpy	busybox	3.5%	2,305
blake2s_compress_generic	kernel	2.8%	1,808
memcpy	kernel	2.5%	1,668
__fwritex	busybox	2.5%	1,630
handle_exception	kernel	1.8%	1,182
unmap_page_range	kernel	1.6%	1,071
avg_vruntime	kernel	1.5%	956
n_tty_write	kernel	1.3%	850
memset	busybox	1.3%	832
ret_from_exception	kernel	1.2%	791
do_trap_ecall_u	kernel	1.1%	688
next_uptodate_folio	kernel	1%	652
(remaining)		55.7%	36,611

The hot-symbol mix still reflects the same BusyBox shell workload. What changed is the memory-system policy, not the software.

Honest Status

check	status
Four-word D-cache line storage	PASS
Critical-word-first demand load	PASS
Background D-cache fill descriptor	PASS
`dcache_background` arbiter class	PASS
BusyBox shell workload runs	PASS
D-cache line-fill counters captured	PASS
Shell-window speedup vs P96	FAIL
Smarter fill throttling	NOT RUN
True split I/D RAM ports	NOT RUN
LibreLane hardening	NOT RUN

Next

P98 should keep the four-word line shape but throttle background D-cache fill much more aggressively. The obvious rule is to let it run only when the frontend is genuinely not waiting.
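One way to state that rule is as a grant function for the shared-memory arbiter. This is a sketch of the idea only: `dcache_background` is the class named in the status table, while the demand class names, their priority order, and the `frontend_waiting` signal are assumptions invented for this example. What counts as "genuinely not waiting" is the open design question for P98.

```python
# Hypothetical demand classes in priority order; only
# "dcache_background" is a name taken from the P97 arbiter.
DEMAND_CLASSES = ("fetch", "load", "store")


def pick_grant(pending, frontend_waiting):
    """Return the request class to grant this cycle, or None.

    pending: set of class names with a request waiting.
    frontend_waiting: True when the frontend is stalled on memory.
    """
    for cls in DEMAND_CLASSES:
        if cls in pending:
            return cls
    # Background fill runs only when nothing else wants the memory
    # and the frontend is not waiting on it.
    if "dcache_background" in pending and not frontend_waiting:
        return "dcache_background"
    return None


print(pick_grant({"fetch", "dcache_background"}, frontend_waiting=True))
print(pick_grant({"dcache_background"}, frontend_waiting=True))
print(pick_grant({"dcache_background"}, frontend_waiting=False))
```

The separate `frontend_waiting` input matters because the frontend can be stalled without having a request visible at the arbiter in that cycle, for example between a page walk and the fetch it unblocks.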
