Project No. 98 of 147 on the ladder

D-cache throttle

Introduces: frontend-aware throttling for background D-cache line fill; measured recovery from the P97 shared-port regression; setup for the Harvard I/D service arc

harden state (last run 2026-05-05)
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P98 keeps P97’s four-word D-cache line structure, but stops treating background data-line fill as free. The fill descriptor can only issue when the frontend already has useful work queued and the I-cache fill path is not active.
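The gating rule described here can be sketched as a single predicate. This is a minimal Python model under stated assumptions, not the project's RTL: the signal names (`fill_pending`, `frontend_queue_depth`, `icache_fill_active`) and the `min_queued` threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThrottleInputs:
    """Hypothetical per-cycle signals; names are illustrative, not from the RTL."""
    fill_pending: bool          # a background D-cache line fill is waiting
    frontend_queue_depth: int   # instructions already queued in the frontend
    icache_fill_active: bool    # the I-cache fill path is currently busy


def may_issue_background_fill(s: ThrottleInputs, min_queued: int = 1) -> bool:
    """Allow a background fill only when the frontend has work and no I-cache fill runs."""
    frontend_has_work = s.frontend_queue_depth >= min_queued
    return s.fill_pending and frontend_has_work and not s.icache_fill_active


# Frontend is empty: hold the fill so fetch can use the shared port.
print(may_issue_background_fill(ThrottleInputs(True, 0, False)))  # False
# Frontend has queued work and no I-cache fill is running: the fill may issue.
print(may_issue_background_fill(ThrottleInputs(True, 2, False)))  # True
# I-cache fill in flight: hold the fill regardless of queue depth.
print(may_issue_background_fill(ThrottleInputs(True, 2, True)))   # False
```

Only background fills pass through this predicate in the sketch; demand fills are assumed to bypass it.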

It works functionally. It recovers the P96 shell-window timing. It also shows why the next arc needs a real Harvard-style instruction/data split instead of more shared-port negotiation.

Result

| metric | P94 arbiter | P96 D-cache v0 | P97 line-fill | P98 throttle |
| --- | --- | --- | --- | --- |
| post-load cycles | 222,459,202 | 221,522,958 | 222,850,787 | 221,452,591 |
| shell window cycles | 67,050,374 | 66,084,155 | 67,369,576 | 66,055,345 |
| retired instructions | 86,664,089 | 86,344,929 | 86,777,980 | 86,329,983 |
| CPI | 2.5669 | 2.5656 | 2.5681 | 2.5652 |
| memory stall cycles | 60,032,329 | 59,418,375 | 60,295,642 | 59,683,338 |
| load stall cycles | 14,632,992 | 10,976,902 | 10,387,310 | 10,697,962 |
| fetch stall cycles | 23,549,359 | 26,676,104 | 29,593,757 | 27,286,526 |

| comparison | result |
| --- | --- |
| shell window vs P96 | -0.04% |
| post-load cycles vs P96 | -0.03% |
| memory stalls vs P96 | +0.45% |
| load stalls vs P96 | -2.54% |
| fetch stalls vs P96 | +2.29% |
| shell window vs P97 | -1.95% |
| fetch stalls vs P97 | -7.80% |
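The comparison percentages are plain relative changes of the P98 column against P96 or P97. The short script below recomputes them from the table values; the dictionary keys are shorthand introduced here, not names from the project.

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


# Values copied from the Result table.
p96 = {"shell": 66_084_155, "post_load": 221_522_958, "mem": 59_418_375,
       "load": 10_976_902, "fetch": 26_676_104}
p97 = {"shell": 67_369_576, "fetch": 29_593_757}
p98 = {"shell": 66_055_345, "post_load": 221_452_591, "mem": 59_683_338,
       "load": 10_697_962, "fetch": 27_286_526}

for key in p96:
    print(f"{key} vs P96: {pct_change(p98[key], p96[key]):+.2f}%")
for key in p97:
    print(f"{key} vs P97: {pct_change(p98[key], p97[key]):+.2f}%")
```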

D-cache Counters

| counter | P96 | P97 | P98 |
| --- | --- | --- | --- |
| load hits | 3,656,064 | 4,370,122 | 3,945,531 |
| load misses | 6,354,876 | 5,746,602 | 6,060,778 |
| demand fills | 6,354,876 | 5,746,602 | 6,060,778 |
| background fills | 0 | 3,419,006 | 377,930 |
| background active cycles | 0 | 85,257,787 | 102,335,320 |
| store updates | 10,473,803 | 10,547,848 | 10,477,277 |
| invalidations | 1,873,327 | 1,874,674 | 1,873,376 |

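The throttle cut background fill grants sharply: background fills fall from 3,419,006 in P97 to 377,930 in P98. That gives up some of P97’s data locality (load hits drop from 4,370,122 to 3,945,531, still above P96’s 3,656,064), but it avoids most of P97’s frontend damage.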
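To put the locality trade-off in one number per project, the sketch below derives a load hit rate from the counters. The definition used, hits / (hits + misses), is an assumption made here; the source reports only the raw counters.

```python
# Counter values copied from the D-cache Counters table.
counters = {
    "P96": {"hits": 3_656_064, "misses": 6_354_876, "bg_fills": 0},
    "P97": {"hits": 4_370_122, "misses": 5_746_602, "bg_fills": 3_419_006},
    "P98": {"hits": 3_945_531, "misses": 6_060_778, "bg_fills": 377_930},
}


def load_hit_rate(c: dict) -> float:
    """Assumed definition: load hits over all load lookups (hits + misses)."""
    return c["hits"] / (c["hits"] + c["misses"])


for name, c in counters.items():
    print(f"{name}: load hit rate {load_hit_rate(c):.1%}, "
          f"background fills {c['bg_fills']:,}")

# Fraction of P97's background fills that the P98 throttle removed.
bg_cut = 1 - counters["P98"]["bg_fills"] / counters["P97"]["bg_fills"]
print(f"background fills cut vs P97: {bg_cut:.1%}")
```

Under that definition P98 lands between P96 and P97 on hit rate while issuing roughly a ninth of P97's background fills.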

Memory Stalls

Memory stalls, P98 D-cache throttle workload: 59,683,338 stall cycles, 66,358,456 handshakes
  1. instruction fetch: 27,286,526 stall cycles (45.7%), 45,996,088 req
  2. data load: 10,697,962 stall cycles (17.9%), 875,585 req
  3. data store: 11,941,385 stall cycles (20%), 216,019 req
  4. atomic memory op: 157,331 stall cycles (0.3%), 183,415 req
  5. page walk for fetch: 1,118,488 stall cycles (1.9%), 1,112,334 req
  6. page walk for load/store: 1,213,747 stall cycles (2%), 1,213,221 req
  7. other: 7,267,899 stall cycles (12.2%), 16,761,794 req
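As a consistency check, the per-category figures in the list above can be summed and compared against the two totals in its header. The script also derives stall cycles per request, a ratio that is not reported in the source and is shown only as one possible way to read the list.

```python
# (stall cycles, requests) per category, copied from the list above.
stalls = {
    "instruction fetch":        (27_286_526, 45_996_088),
    "data load":                (10_697_962, 875_585),
    "data store":               (11_941_385, 216_019),
    "atomic memory op":         (157_331, 183_415),
    "page walk for fetch":      (1_118_488, 1_112_334),
    "page walk for load/store": (1_213_747, 1_213_221),
    "other":                    (7_267_899, 16_761_794),
}

total_stalls = sum(s for s, _ in stalls.values())
total_reqs = sum(r for _, r in stalls.values())
print(f"total stall cycles: {total_stalls:,}")
print(f"total requests:     {total_reqs:,}")

for name, (s, r) in stalls.items():
    # Share of all memory stall cycles, and derived stall cycles per request.
    print(f"{name}: {s / total_stalls:.1%} of stalls, {s / r:.2f} stall cycles/req")
```

Both sums match the header totals (59,683,338 stall cycles and 66,358,456 handshakes).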

P98 is still worse than P96 on fetch stalls (+2.29%), but much better than P97 (-7.80%). That is the narrow win: less data-side opportunism on the one RAM port.

Shell Phases

Shell phases, P98 shell workload: 221,452,591 cycles, CPI 2.57
  1. kernel banner to /init: 117,616,704 cycles (53.3%)
  2. /init to shell banner: 1,084,530 cycles (0.5%)
  3. shell banner to first command: 36,067,947 cycles (16.3%)
  4. echo command: 1,598 cycles (0%)
  5. uname -a: 2,432,864 cycles (1.1%)
  6. ls /bin /usr/share: 31,670,845 cycles (14.3%)
  7. cat sample file: 4,549,496 cycles (2.1%)
  8. touch/write/cat/rm /tmp file: 11,060,226 cycles (5%)
  9. 8x ash loop with file I/O: 16,339,653 cycles (7.4%)
  10. final marker: 663 cycles (0%)

The full BusyBox shell script reaches P98-FILE-OK. The shell window is 66.06M cycles, 0.04% faster than P96 and 1.95% faster than P97.
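The shell-window figure lines up with the phase list: summing phases 4 through 10 (the echo command through the final marker) gives exactly 66,055,345 cycles. That reading of the window boundary is inferred from the numbers, not stated in the source.

```python
# Phase cycle counts copied from the Shell Phases list, in order.
phases = [
    ("kernel banner to /init", 117_616_704),
    ("/init to shell banner", 1_084_530),
    ("shell banner to first command", 36_067_947),
    ("echo command", 1_598),
    ("uname -a", 2_432_864),
    ("ls /bin /usr/share", 31_670_845),
    ("cat sample file", 4_549_496),
    ("touch/write/cat/rm /tmp file", 11_060_226),
    ("8x ash loop with file I/O", 16_339_653),
    ("final marker", 663),
]

# Assumed window: everything from the first shell command onward (phases 4-10).
shell_window = sum(cycles for _, cycles in phases[3:])
print(f"shell window: {shell_window:,} cycles")  # 66,055,345
```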

Cycle Shape

State breakdown, P98 D-cache throttle workload: 221,452,591 cycles, CPI 2.57
  1. fetch: 3.8% (8,315,386 cycles)
  2. execute: 39% (86,354,801 cycles)
  3. mem: 12.7% (28,017,228 cycles)
  4. walker: 2.1% (4,657,790 cycles)
  5. writeback: 39% (86,329,983 cycles)
  6. mul/div: 3.5% (7,775,687 cycles)
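The headline CPI follows directly from these figures: total cycles divided by retired instructions. The writeback count equals the retired-instruction count in the Result table, which suggests one writeback cycle per retired instruction; that interpretation is an inference, not a statement from the source.

```python
# Cycle counts per state, copied from the state breakdown.
states = {
    "fetch": 8_315_386,
    "execute": 86_354_801,
    "mem": 28_017_228,
    "walker": 4_657_790,
    "writeback": 86_329_983,
    "mul/div": 7_775_687,
}
total_cycles = 221_452_591       # post-load cycles for P98
retired = 86_329_983             # retired instructions for P98

cpi = total_cycles / retired
print(f"CPI: {cpi:.4f}")         # 2.5652

for name, cycles in states.items():
    print(f"{name}: {cycles / total_cycles:.1%}")

# The listed states account for all but a small remainder of the total.
print(f"unattributed cycles: {total_cycles - sum(states.values()):,}")
```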

There is no new blocking cache-fill state. The change is request gating before the shared memory arbiter.

Hot Functions

Hot functions, P98 BusyBox shell symbols: 64,507 samples, one sample every 1,024 cycles
  1. printf_core (busybox): 5.6%, 3,624 samples
  2. memset (kernel): 5.1%, 3,293 samples
  3. memcpy (busybox): 3.7%, 2,357 samples
  4. vruntime_eligible (kernel): 3.4%, 2,196 samples
  5. blake2s_compress_generic (kernel): 2.8%, 1,808 samples
  6. __fwritex (busybox): 2.7%, 1,708 samples
  7. memcpy (kernel): 2.6%, 1,674 samples
  8. unmap_page_range (kernel): 1.7%, 1,125 samples
  9. handle_exception (kernel): 1.7%, 1,119 samples
  10. avg_vruntime (kernel): 1.4%, 873 samples
  11. n_tty_write (kernel): 1.3%, 831 samples
  12. memset (busybox): 1.3%, 810 samples
  13. ret_from_exception (kernel): 1.1%, 708 samples
  14. next_uptodate_folio (kernel): 1%, 664 samples
  15. do_trap_ecall_u (kernel): 1%, 632 samples
  16. (remaining): 55.4%, 35,739 samples

The software workload stayed the same. The measured change is memory policy.
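The sample count is consistent with the profiler covering the shell window: 64,507 samples at one per 1,024 cycles spans 66,055,168 cycles, within one sampling period of the 66,055,345-cycle shell window. That the profile is scoped to the shell window is an inference from this arithmetic, not something the source says.

```python
samples = 64_507
period_cycles = 1_024
shell_window_cycles = 66_055_345   # P98 shell window from the Result table

covered = samples * period_cycles
print(f"cycles covered by samples: {covered:,}")          # 66,055,168
print(f"gap to shell window: {shell_window_cycles - covered} cycles")

# Share of samples for the top symbol (printf_core, 3,624 samples).
print(f"printf_core share: {3_624 / samples:.1%}")        # 5.6%
```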

Honest Status

| check | status |
| --- | --- |
| Four-word D-cache line storage | PASS |
| Critical-word-first demand load | PASS |
| Frontend-aware background-fill throttle | PASS |
| BusyBox shell workload runs | PASS |
| D-cache throttle counters captured | PASS |
| Shell-window speedup vs P96 | PASS |
| True split I/D RAM ports | NOT RUN |
| Split ITLB/DTLB | NOT RUN |
| Nonblocking miss machinery | NOT RUN |
| LibreLane hardening | NOT RUN |

Next

P99 should stop trying to make one port polite and instead map the Harvard split clearly: what an instruction path owns, what a data path owns, where translation lives, and where the lower shared memory system is allowed to reappear.