D-cache throttle · librelane-playground

P98 keeps P97’s four-word D-cache line structure, but stops treating background data-line fill as free. The fill descriptor can only issue when the frontend already has useful work queued and the I-cache fill path is not active.
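The gating rule described above can be sketched as a small predicate. This is only an illustrative model, not the project's RTL: the names `FrontendState`, `queued_useful_work`, `icache_fill_active`, and `background_fill_may_issue` are hypothetical, and only the background fill path is modeled.

```python
from dataclasses import dataclass


@dataclass
class FrontendState:
    # Hypothetical signal: the frontend already has useful work queued.
    queued_useful_work: bool
    # Hypothetical signal: the I-cache fill path is currently active.
    icache_fill_active: bool


def background_fill_may_issue(fe: FrontendState, fill_pending: bool) -> bool:
    """Return True when a pending background D-cache line fill may issue.

    Assumed rule, taken from the prose: issue only if the frontend has
    queued work and the I-cache fill path is idle.
    """
    return fill_pending and fe.queued_useful_work and not fe.icache_fill_active
```

Under this sketch, a starved frontend or an active I-cache fill each block the background fill on their own, which is the sense in which the fill is no longer treated as free.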

It works functionally. It recovers the P96 shell-window timing. It also shows why the next arc needs a real Harvard-style instruction/data split instead of more shared-port negotiation.

Result

metric	P94 arbiter	P96 D-cache v0	P97 line-fill	P98 throttle
post-load cycles	222,459,202	221,522,958	222,850,787	221,452,591
shell window cycles	67,050,374	66,084,155	67,369,576	66,055,345
retired instructions	86,664,089	86,344,929	86,777,980	86,329,983
CPI	2.5669	2.5656	2.5681	2.5652
memory stall cycles	60,032,329	59,418,375	60,295,642	59,683,338
load stall cycles	14,632,992	10,976,902	10,387,310	10,697,962
fetch stall cycles	23,549,359	26,676,104	29,593,757	27,286,526

comparison	result
shell window vs P96	-0.04%
post-load cycles vs P96	-0.03%
memory stalls vs P96	+0.45%
load stalls vs P96	-2.54%
fetch stalls vs P96	+2.29%
shell window vs P97	-1.95%
fetch stalls vs P97	-7.80%
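The comparison rows appear to be plain relative changes computed from the result table. A minimal sketch, assuming each row is (P98 - baseline) / baseline; the helper name `pct_change` is illustrative:

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of `new` against baseline `old`, in percent."""
    return (new - old) / old * 100.0


# Values copied from the result table above.
P96 = {"shell": 66_084_155, "post_load": 221_522_958,
       "mem": 59_418_375, "load": 10_976_902, "fetch": 26_676_104}
P97 = {"shell": 67_369_576, "fetch": 29_593_757}
P98 = {"shell": 66_055_345, "post_load": 221_452_591,
       "mem": 59_683_338, "load": 10_697_962, "fetch": 27_286_526}

vs_p96 = {k: round(pct_change(P98[k], P96[k]), 2) for k in P96}
vs_p97 = {k: round(pct_change(P98[k], P97[k]), 2) for k in P97}

if __name__ == "__main__":
    print("vs P96:", vs_p96)
    print("vs P97:", vs_p97)
```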

D-cache Counters

counter	P96	P97	P98
load hits	3,656,064	4,370,122	3,945,531
load misses	6,354,876	5,746,602	6,060,778
demand fills	6,354,876	5,746,602	6,060,778
background fills	0	3,419,006	377,930
background active cycles	0	85,257,787	102,335,320
store updates	10,473,803	10,547,848	10,477,277
invalidations	1,873,327	1,874,674	1,873,376

The throttle cut background fill grants sharply, from 3,419,006 in P97 to 377,930. That gives up some of P97’s data locality (load hits fall from 4,370,122 to 3,945,531), but it avoids most of P97’s frontend damage.
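To put numbers on the locality trade, the counters can be reduced to a load hit rate and a background-fill reduction. This sketch assumes hit rate means load hits divided by load hits plus load misses; the page does not define it.

```python
# Counters copied from the D-cache table above.
counters = {
    "P96": {"hits": 3_656_064, "misses": 6_354_876, "bg_fills": 0},
    "P97": {"hits": 4_370_122, "misses": 5_746_602, "bg_fills": 3_419_006},
    "P98": {"hits": 3_945_531, "misses": 6_060_778, "bg_fills": 377_930},
}


def load_hit_rate(c: dict) -> float:
    """Assumed definition: hits / (hits + misses)."""
    return c["hits"] / (c["hits"] + c["misses"])


hit_rates = {name: load_hit_rate(c) for name, c in counters.items()}

# Fraction of P97's background fills that the P98 throttle removed.
bg_fill_cut = 1 - counters["P98"]["bg_fills"] / counters["P97"]["bg_fills"]

if __name__ == "__main__":
    for name, rate in hit_rates.items():
        print(f"{name}: load hit rate {rate:.1%}")
    print(f"background fills cut by {bg_fill_cut:.1%}")
```

Under that definition the load hit rate is roughly 36.5% for P96, 43.2% for P97, and 39.4% for P98, while background fills drop by about 88.9% from P97 to P98.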

Memory Stalls

P98 D-cache throttle workload: 59,683,338 memory stall cycles, 66,358,456 handshakes.

source	stall cycles	share of stalls	requests
instruction fetch	27,286,526	45.7%	45,996,088
data load	10,697,962	17.9%	875,585
data store	11,941,385	20%	216,019
atomic memory op	157,331	0.3%	183,415
page walk for fetch	1,118,488	1.9%	1,112,334
page walk for load/store	1,213,747	2%	1,213,221
other	7,267,899	12.2%	16,761,794

P98 is still worse than P96 on fetch stalls, but much better than P97. That is the narrow win: less data-side opportunism on the one RAM port.

Shell Phases

P98 shell workload: 221,452,591 cycles, CPI 2.57.

phase	cycles	share
kernel banner to /init	117,616,704	53.3%
/init to shell banner	1,084,530	0.5%
shell banner to first command	36,067,947	16.3%
echo command	1,598	0%
uname -a	2,432,864	1.1%
ls /bin /usr/share	31,670,845	14.3%
cat sample file	4,549,496	2.1%
touch/write/cat/rm /tmp file	11,060,226	5%
8x ash loop with file I/O	16,339,653	7.4%
final marker	663	0%

The full BusyBox shell script reaches P98-FILE-OK. The shell window is 66.06M cycles, slightly faster than P96 and 1.95% faster than P97.
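The shell-window figure can be reproduced from the phase table. The sketch below assumes the window runs from the echo command through the final marker; the page does not state that boundary, but the sum matches the reported 66,055,345 cycles.

```python
# Phase cycle counts copied from the shell phases table, in order.
phases = [
    ("kernel banner to /init", 117_616_704),
    ("/init to shell banner", 1_084_530),
    ("shell banner to first command", 36_067_947),
    ("echo command", 1_598),
    ("uname -a", 2_432_864),
    ("ls /bin /usr/share", 31_670_845),
    ("cat sample file", 4_549_496),
    ("touch/write/cat/rm /tmp file", 11_060_226),
    ("8x ash loop with file I/O", 16_339_653),
    ("final marker", 663),
]

# Assumption: the shell window starts at the first command.
WINDOW_START = "echo command"

start = next(i for i, (name, _) in enumerate(phases) if name == WINDOW_START)
shell_window = sum(cycles for _, cycles in phases[start:])

if __name__ == "__main__":
    print(f"shell window: {shell_window:,} cycles")
```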

Cycle Shape

P98 D-cache throttle workload: 221,452,591 cycles, CPI 2.57.

state	cycles	share
fetch	8,315,386	3.8%
execute	86,354,801	39%
mem	28,017,228	12.7%
walker	4,657,790	2.1%
writeback	86,329,983	39%
mul/div	7,775,687	3.5%

There is no new blocking cache-fill state. The change is request gating before the shared memory arbiter.
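As a quick consistency check on the state breakdown, CPI and the per-state shares can be recomputed from the raw counts. This is a sketch using only figures from this page; it also shows that the writeback count equals the retired-instruction count in the result table.

```python
# Totals from the result table (P98 column).
POST_LOAD_CYCLES = 221_452_591
RETIRED_INSTRUCTIONS = 86_329_983

# Per-state cycle counts from the state breakdown.
states = {
    "fetch": 8_315_386,
    "execute": 86_354_801,
    "mem": 28_017_228,
    "walker": 4_657_790,
    "writeback": 86_329_983,
    "mul/div": 7_775_687,
}

cpi = POST_LOAD_CYCLES / RETIRED_INSTRUCTIONS
shares = {name: cycles / POST_LOAD_CYCLES * 100 for name, cycles in states.items()}

# Cycles not attributed to any listed state.
unattributed = POST_LOAD_CYCLES - sum(states.values())

if __name__ == "__main__":
    print(f"CPI: {cpi:.4f}")
    for name, share in shares.items():
        print(f"{name}: {share:.1f}%")
    print(f"unattributed cycles: {unattributed:,}")
```

The listed states sum to 221,450,875 cycles, so 1,716 cycles of the total are not attributed to any listed state.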

Hot Functions

P98 BusyBox shell symbols: 64,507 samples, one sample every 1,024 cycles.

function	image	share of samples	samples
printf_core	busybox	5.6%	3,624
memset	kernel	5.1%	3,293
memcpy	busybox	3.7%	2,357
vruntime_eligible	kernel	3.4%	2,196
blake2s_compress_generic	kernel	2.8%	1,808
__fwritex	busybox	2.7%	1,708
memcpy	kernel	2.6%	1,674
unmap_page_range	kernel	1.7%	1,125
handle_exception	kernel	1.7%	1,119
avg_vruntime	kernel	1.4%	873
n_tty_write	kernel	1.3%	831
memset	busybox	1.3%	810
ret_from_exception	kernel	1.1%	708
next_uptodate_folio	kernel	1%	664
do_trap_ecall_u	kernel	1%	632
(remaining)	remaining	55.4%	35,739

The software workload stayed the same. The measured change is memory policy.
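The sample counts can be converted back into approximate cycles by multiplying by the sampling period. A small sketch, assuming a fixed period of 1,024 cycles per sample; the helper name `estimated_cycles` is illustrative:

```python
SAMPLES = 64_507           # total samples reported for the profile
PERIOD = 1_024             # cycles between samples
SHELL_WINDOW = 66_055_345  # P98 shell window cycles from the result table


def estimated_cycles(samples: int) -> int:
    """Approximate cycles represented by a number of profile samples."""
    return samples * PERIOD


covered = estimated_cycles(SAMPLES)

if __name__ == "__main__":
    print(f"profile covers about {covered:,} cycles")
    print(f"shell window minus covered: {SHELL_WINDOW - covered:,} cycles")
    print(f"printf_core: about {estimated_cycles(3_624):,} cycles")
```

The total comes to 66,055,168 cycles, within one sampling period of the 66,055,345-cycle shell window, which suggests the profile covers the shell window rather than the whole post-load run.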

Honest Status

check	status
Four-word D-cache line storage	PASS
Critical-word-first demand load	PASS
Frontend-aware background-fill throttle	PASS
BusyBox shell workload runs	PASS
D-cache throttle counters captured	PASS
Shell-window speedup vs P96	PASS
True split I/D RAM ports	NOT RUN
Split ITLB/DTLB	NOT RUN
Nonblocking miss machinery	NOT RUN
LibreLane hardening	NOT RUN

Next

P99 should stop trying to make one port polite and instead map the Harvard split clearly: what an instruction path owns, what a data path owns, where translation lives, and where the lower shared memory system is allowed to reappear.
