D-cache v0 · librelane-playground

P96 adds the first data cache to the Linux-capable core. It is small on purpose: 64 direct-mapped entries, one 32-bit word per entry, and physical address tags. Aligned integer `LW` hits bypass the external memory access, and aligned integer `SW` stores write through to memory while updating or allocating the matching entry.

This is not a real nonblocking cache yet. It is the first proof that data-side caching is worth continuing.
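A minimal behavioral sketch of the cache described above may help make the policy concrete. It is not the RTL. The index/tag split (low word-address bits as index), the class and method names, and the dictionary standing in for memory are all assumptions for illustration; invalidations are not modelled because the trigger is not described here.

```python
# Behavioral sketch only; index/tag split and all names are assumptions.
ENTRIES = 64       # from the text: 64 direct-mapped entries
WORD_BYTES = 4     # from the text: one 32-bit word per entry


class WordCache:
    def __init__(self):
        self.valid = [False] * ENTRIES
        self.tag = [0] * ENTRIES
        self.data = [0] * ENTRIES
        self.counters = {"load_hits": 0, "load_misses": 0,
                         "fills": 0, "store_updates": 0}

    @staticmethod
    def split(paddr):
        """Assumed mapping: low word-address bits index, the rest is the tag."""
        word = paddr // WORD_BYTES
        return word % ENTRIES, word // ENTRIES

    def load_word(self, paddr, memory):
        assert paddr % WORD_BYTES == 0, "only aligned LW is modelled"
        index, tag = self.split(paddr)
        if self.valid[index] and self.tag[index] == tag:
            self.counters["load_hits"] += 1      # hit: no memory access
            return self.data[index]
        self.counters["load_misses"] += 1        # miss: fetch, then fill
        value = memory.get(paddr, 0)
        self._install(index, tag, value)
        self.counters["fills"] += 1
        return value

    def store_word(self, paddr, value, memory):
        assert paddr % WORD_BYTES == 0, "only aligned SW is modelled"
        value &= 0xFFFFFFFF
        memory[paddr] = value                    # write-through to memory
        index, tag = self.split(paddr)
        self._install(index, tag, value)         # update or allocate entry
        self.counters["store_updates"] += 1

    def _install(self, index, tag, value):
        self.valid[index], self.tag[index], self.data[index] = True, tag, value


memory = {}
cache = WordCache()
cache.store_word(0x1000, 0xDEADBEEF, memory)   # allocates index 0
first = cache.load_word(0x1000, memory)        # hit on the stored word
cache.load_word(0x1100, memory)                # same index, new tag: miss
cache.load_word(0x1000, memory)                # evicted above: miss again
print(first == 0xDEADBEEF, cache.counters)
```

In this model every load miss produces exactly one fill, which is at least consistent with the reported counters, where load misses and fills are equal.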

Result

metric	P94 arbiter	P95 store buffer	P96 D-cache
post-load cycles	222,459,202	241,494,238	221,522,958
shell window cycles	67,050,374	77,821,976	66,084,155
retired instructions	86,664,089	88,851,638	86,344,929
CPI	2.5669	2.7179	2.5656
memory stall cycles	60,032,329	60,797,627	59,418,375
load stall cycles	14,632,992	15,144,900	10,976,902
fetch stall cycles	23,549,359	35,563,846	26,676,104

comparison	result
shell window vs P94	-1.44%
post-load cycles vs P94	-0.42%
memory stalls vs P94	-1.02%
load stalls vs P94	-24.99%
fetch stalls vs P94	+13.28%
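
The comparison rows can be reproduced from the P94 and P96 columns of the result table. The sketch below uses only the published numbers; the dictionary keys and the helper name `pct_change` are illustrative.

```python
# P94 and P96 values copied from the result table above.
p94 = {
    "shell window": 67_050_374,
    "post-load cycles": 222_459_202,
    "memory stalls": 60_032_329,
    "load stalls": 14_632_992,
    "fetch stalls": 23_549_359,
}
p96 = {
    "shell window": 66_084_155,
    "post-load cycles": 221_522_958,
    "memory stalls": 59_418_375,
    "load stalls": 10_976_902,
    "fetch stalls": 26_676_104,
}


def pct_change(new, old):
    """Relative change of new against old, in percent."""
    return 100.0 * (new - old) / old


deltas = {name: pct_change(p96[name], p94[name]) for name in p94}
for name, delta in deltas.items():
    print(f"{name} vs P94: {delta:+.2f}%")
```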

D-cache Counters

counter	value
load hits	3,656,064
load misses	6,354,876
fills	6,354,876
store updates	10,473,803
invalidations	1,873,327

The one-word cache cuts data-side stalls, but the workload still misses more often than it hits: 3,656,064 load hits against 6,354,876 load misses, a load hit rate of about 36.5%. A line-based D-cache is the obvious next test, provided it avoids the blocking-fill mistake P90 made on the instruction side.
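
The hit rate follows directly from the counters. As a side observation, the drop in load stall cycles from P94 to P96 is within a few dozen cycles of the load hit count; the sketch below shows the arithmetic. Reading that as roughly one stall cycle saved per hit is an interpretation, not something the counters prove.

```python
# Counter values copied from the D-cache counter and result tables.
load_hits = 3_656_064
load_misses = 6_354_876
p94_load_stalls = 14_632_992
p96_load_stalls = 10_976_902

hit_rate = load_hits / (load_hits + load_misses)
stall_cycles_saved = p94_load_stalls - p96_load_stalls

print(f"load hit rate: {hit_rate:.1%}")
print(f"load stall cycles saved vs P94: {stall_cycles_saved:,}")
print(f"saved cycles minus load hits: {stall_cycles_saved - load_hits}")
```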

Memory Stalls

P96 D-cache workload: 59,418,375 memory stall cycles, 66,576,563 handshakes

source	stall cycles	share of stalls	requests
instruction fetch	26,676,104	44.9%	46,243,128
data load	10,976,902	18.5%	890,777
data store	11,935,094	20.1%	218,712
atomic memory op	157,413	0.3%	183,483
page walk for fetch	1,122,386	1.9%	1,116,232
page walk for load/store	1,221,734	2.1%	1,221,224
other	7,328,742	12.3%	16,703,007

Load stall cycles drop from P94’s 14.63M to 10.98M. Fetch stalls rise from 23.55M to 26.68M, which means the memory system still behaves as a shared-port negotiation rather than independent instruction and data service.
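
As a consistency check, the per-source stall cycles and request counts can be summed and compared with the reported totals. This sketch uses only the published figures; the variable names are illustrative.

```python
# (stall cycles, requests) per source, copied from the P96 memory-stall data.
stalls = {
    "instruction fetch": (26_676_104, 46_243_128),
    "data load": (10_976_902, 890_777),
    "data store": (11_935_094, 218_712),
    "atomic memory op": (157_413, 183_483),
    "page walk for fetch": (1_122_386, 1_116_232),
    "page walk for load/store": (1_221_734, 1_221_224),
    "other": (7_328_742, 16_703_007),
}

total_stalls = sum(cycles for cycles, _ in stalls.values())
total_requests = sum(reqs for _, reqs in stalls.values())
shares = {name: 100.0 * cycles / total_stalls
          for name, (cycles, _) in stalls.items()}

print(f"total stall cycles: {total_stalls:,}")
print(f"total requests: {total_requests:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```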

Shell Phases

P96 shell workload: 221,522,958 cycles, CPI 2.57

phase	cycles	share
kernel banner to /init	117,615,769	53.3%
/init to shell banner	1,069,519	0.5%
shell banner to first command	36,125,450	16.4%
echo command	1,598	0%
uname -a	1,991,318	0.9%
ls /bin /usr/share	32,798,794	14.9%
cat sample file	4,516,475	2%
touch/write/cat/rm /tmp file	10,556,885	4.8%
8x ash loop with file I/O	16,218,422	7.3%
final marker	663	0%

The full BusyBox shell script reaches P96-FILE-OK. The shell window is 66.08M cycles, a 1.44% improvement over P94 and a clear recovery from P95’s store-buffer regression.
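
The report does not define the shell window explicitly. Summing the phases from the echo command through the final marker reproduces the 66,084,155-cycle figure exactly, so that appears to be the window; treat the definition as an inference from the numbers rather than a stated fact.

```python
# Command phases copied from the shell-phase data (echo through final marker).
command_phases = {
    "echo command": 1_598,
    "uname -a": 1_991_318,
    "ls /bin /usr/share": 32_798_794,
    "cat sample file": 4_516_475,
    "touch/write/cat/rm /tmp file": 10_556_885,
    "8x ash loop with file I/O": 16_218_422,
    "final marker": 663,
}

shell_window = sum(command_phases.values())
reported_shell_window = 66_084_155   # from the result table

print(f"sum of command phases: {shell_window:,}")
print(f"matches reported shell window: {shell_window == reported_shell_window}")
```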

Cycle Shape

P96 D-cache workload: 221,522,958 cycles, CPI 2.57

state	share	cycles
fetch	3.8%	8,323,856
execute	39%	86,369,781
mem	12.6%	28,018,445
walker	2.1%	4,681,576
writeback	39%	86,344,929
mul/div	3.5%	7,782,655

P96 does not add a new architectural state. The speedup shows up as fewer external data-memory waits from the existing S_MEM path.

Hot Functions

P96 BusyBox shell symbols: 64,536 samples, one every 1,024 cycles

symbol	image	share of samples	samples
printf_core	busybox	5.5%	3,545
memset	kernel	5.1%	3,311
memcpy	busybox	3.6%	2,320
vruntime_eligible	kernel	3.3%	2,137
blake2s_compress_generic	kernel	2.8%	1,815
__fwritex	busybox	2.6%	1,690
memcpy	kernel	2.6%	1,673
handle_exception	kernel	1.7%	1,121
unmap_page_range	kernel	1.7%	1,073
memset	busybox	1.3%	864
avg_vruntime	kernel	1.3%	857
n_tty_write	kernel	1.3%	841
ret_from_exception	kernel	1.2%	784
next_uptodate_folio	kernel	1%	651
do_trap_ecall_u	kernel	1%	646
(remaining)	remaining	55.5%	35,846

The symbol mix remains the same shell/kernel workload. The improvement comes from memory behavior, not from running a different software path.
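
The sample count lines up with the shell window: one sample every 1,024 cycles over 66,084,155 cycles gives 64,536 samples when rounded up. The sketch below shows that arithmetic and how a sample count maps to a share; that the profiler covers exactly the shell window is an inference from the numbers.

```python
import math

shell_window_cycles = 66_084_155   # from the result table
sample_period = 1_024              # cycles per sample, from the profile header
samples_reported = 64_536          # from the profile header

expected_samples = math.ceil(shell_window_cycles / sample_period)
printf_core_share = 100.0 * 3_545 / samples_reported

print(f"expected samples: {expected_samples:,}")
print(f"printf_core share: {printf_core_share:.1f}%")
```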

Honest Status

check	status
Direct-mapped word D-cache in RTL	PASS
Aligned `LW` hit bypass	PASS
Aligned `SW` write-through update/allocate	PASS
BusyBox shell workload runs	PASS
D-cache counters captured	PASS
Subword store merge	NOT RUN
Multi-word line fill	NOT RUN
Nonblocking D-cache miss handling	NOT RUN
Shell-window speedup vs P94	PASS
LibreLane hardening	NOT RUN

Next

P97 should try the data-cache policy that P96 points toward: four-word lines, critical-word-first response, and background fill through the existing P94 arbiter counters. The constraint is clear: do not block the core just to fill the rest of a line.
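
One way to picture critical-word-first on a four-word line is as a fill order that starts at the requested word and wraps around the line. The sketch below assumes that wrap-around order and 4-byte words; the function name `fill_order` is hypothetical, and P97 may well sequence the fill differently.

```python
WORD_BYTES = 4
WORDS_PER_LINE = 4   # from the text: four-word lines


def fill_order(paddr, words_per_line=WORDS_PER_LINE):
    """Word addresses to fetch, requested word first, then wrapping.

    The wrap-around order is an assumption; the text only asks for
    critical-word-first response with the rest filled in the background.
    """
    line_bytes = WORD_BYTES * words_per_line
    base = paddr - (paddr % line_bytes)            # start of the line
    first = (paddr % line_bytes) // WORD_BYTES     # requested word slot
    return [base + ((first + i) % words_per_line) * WORD_BYTES
            for i in range(words_per_line)]


order = fill_order(0x1008)
critical_word = order[0]        # core can resume once this word returns
background_words = order[1:]    # fetched later without stalling the core
print([hex(addr) for addr in order])
```

The point of the split is the stated constraint: only the first entry sits on the core's critical path, and the remaining three are candidates for background fill.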
