Project No. 97 of 147 on the ladder

D-cache line-fill

Introduces: four-word D-cache lines; critical-word-first demand loads; background D-cache fill request class; measured line-fill negative result

harden state
last run: 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P97 extends P96’s one-word D-cache into four-word lines. Demand loads are still critical-word-first: the requested word returns immediately, then the cache tries to fill the other words through a background request class.
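The policy described above can be sketched as a small behavioral model. This is not the P97 RTL: the names `DCacheLine`, `demand_load`, and `background_fill_step` are hypothetical, and the model assumes word addressing, a single line, and one word filled per background request.

```python
LINE_WORDS = 4  # four-word lines, as in P97


class DCacheLine:
    """One cache line with a valid bit per word (illustrative model only)."""

    def __init__(self):
        self.tag = None
        self.valid = [False] * LINE_WORDS
        self.data = [0] * LINE_WORDS


def demand_load(line, memory, word_addr):
    """Return (value, pending); pending lists the word addresses left to fill."""
    tag, offset = divmod(word_addr, LINE_WORDS)
    if line.tag == tag and line.valid[offset]:
        return line.data[offset], []           # hit: nothing to fetch
    if line.tag != tag:                         # different line: drop old words
        line.tag = tag
        line.valid = [False] * LINE_WORDS
    # Critical word first: fetch only the requested word before returning.
    line.data[offset] = memory[word_addr]
    line.valid[offset] = True
    pending = [tag * LINE_WORDS + i
               for i in range(LINE_WORDS) if not line.valid[i]]
    return line.data[offset], pending


def background_fill_step(line, memory, pending):
    """Fill one pending word, standing in for one background request."""
    if not pending:
        return
    word_addr = pending.pop(0)
    tag, offset = divmod(word_addr, LINE_WORDS)
    if line.tag == tag:                         # skip if the line was replaced
        line.data[offset] = memory[word_addr]
        line.valid[offset] = True


# Usage: load word 6, then let the background class fill words 4, 5 and 7.
memory = list(range(100, 116))
line = DCacheLine()
value, pending = demand_load(line, memory, 6)
first_pending = list(pending)
while pending:
    background_fill_step(line, memory, pending)
```

The point of the split is that the load completes after a single word transfer; whether the remaining three transfers are worth their bus time is exactly what the measurements below test.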

It works functionally, but it is slower than P96: the shell window grows by 1.95%.

Result

metric                 P94 arbiter    P96 D-cache v0   P97 line-fill
post-load cycles       222,459,202    221,522,958      222,850,787
shell window cycles    67,050,374     66,084,155       67,369,576
retired instructions   86,664,089     86,344,929       86,777,980
CPI                    2.5669         2.5656           2.5681
memory stall cycles    60,032,329     59,418,375       60,295,642
load stall cycles      14,632,992     10,976,902       10,387,310
fetch stall cycles     23,549,359     26,676,104       29,593,757

comparison               result
shell window vs P96      +1.95%
post-load cycles vs P96  +0.60%
memory stalls vs P96     +1.48%
load stalls vs P96       -5.37%
fetch stalls vs P96      +10.94%
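The comparison rows follow directly from the P96 and P97 columns of the Result table. A minimal sketch of that arithmetic, with the counter values copied from the table and the helper name `pct_change` chosen here for illustration:

```python
# Counters copied from the Result table (P96 D-cache v0 vs P97 line-fill).
P96 = {
    "post-load cycles": 221_522_958,
    "shell window cycles": 66_084_155,
    "memory stall cycles": 59_418_375,
    "load stall cycles": 10_976_902,
    "fetch stall cycles": 26_676_104,
}
P97 = {
    "post-load cycles": 222_850_787,
    "shell window cycles": 67_369_576,
    "memory stall cycles": 60_295_642,
    "load stall cycles": 10_387_310,
    "fetch stall cycles": 29_593_757,
}


def pct_change(new, old):
    """Relative change of new against old, in percent."""
    return (new - old) / old * 100.0


deltas = {name: pct_change(P97[name], P96[name]) for name in P96}

for name, delta in deltas.items():
    print(f"{name:22s} {delta:+.2f}%")

# CPI is post-load cycles divided by retired instructions.
cpi_p97 = 222_850_787 / 86_777_980
```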

D-cache Counters

counter                    P96          P97
load hits                  3,656,064    4,370,122
load misses                6,354,876    5,746,602
demand fills               6,354,876    5,746,602
background fills           0            3,419,006
background active cycles   0            85,257,787
store updates              10,473,803   10,547,848
invalidations              1,873,327    1,874,674

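The cache geometry helps the data side: load hits rise from 3,656,064 to 4,370,122 and load misses fall from 6,354,876 to 5,746,602. The fill policy hurts the machine: 3,419,006 background fills keep the background class active for 85,257,787 cycles. That is the useful result.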
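To put the counters on a common scale, the sketch below derives load hit rates and an average background cost per fill. It assumes that load hits plus load misses is the total number of cached loads, and that "background active cycles" counts cycles in which a background fill was outstanding; neither definition is stated on this page.

```python
# Counters copied from the D-cache Counters table.
COUNTERS = {
    "P96": {"load_hits": 3_656_064, "load_misses": 6_354_876,
            "background_fills": 0, "background_active_cycles": 0},
    "P97": {"load_hits": 4_370_122, "load_misses": 5_746_602,
            "background_fills": 3_419_006,
            "background_active_cycles": 85_257_787},
}


def hit_rate(c):
    """Fraction of loads that hit, assuming hits + misses covers all loads."""
    return c["load_hits"] / (c["load_hits"] + c["load_misses"])


def cycles_per_background_fill(c):
    """Average active cycles per background fill, or None if there were none."""
    if c["background_fills"] == 0:
        return None
    return c["background_active_cycles"] / c["background_fills"]


for name, c in COUNTERS.items():
    print(name, f"hit rate {hit_rate(c):.2%}",
          "cycles/fill", cycles_per_background_fill(c))
```

Under those assumptions the hit rate moves from roughly 36.5% to roughly 43.2%, while each background fill is associated with about 25 active cycles on average.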

Memory Stalls

Memory stalls, P97 D-cache line-fill workload: 60,295,642 stall cycles, 69,126,542 handshakes

  #   source                     stall cycles   share   requests
  1.  instruction fetch          29,593,757     49.1%   47,122,281
  2.  data load                  10,387,310     17.2%   875,234
  3.  data store                 12,008,121     19.9%   219,405
  4.  atomic memory op           158,700        0.3%    184,997
  5.  page walk for fetch        1,130,883      1.9%    1,124,729
  6.  page walk for load/store   1,229,718      2%      1,229,207
  7.  other                      5,787,153      9.6%    18,370,689

Load stalls drop again, but fetch stalls rise enough to lose the P96 shell-window win.
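The trade can be checked against the numbers. The sketch below recomputes the stall shares from the breakdown above and compares the fetch-stall increase with the load-stall decrease relative to P96; the P96 values come from the Result table, and the variable names are illustrative.

```python
# (stall cycles, requests) per source, copied from the P97 breakdown.
STALLS = {
    "instruction fetch": (29_593_757, 47_122_281),
    "data load": (10_387_310, 875_234),
    "data store": (12_008_121, 219_405),
    "atomic memory op": (158_700, 184_997),
    "page walk for fetch": (1_130_883, 1_124_729),
    "page walk for load/store": (1_229_718, 1_229_207),
    "other": (5_787_153, 18_370_689),
}

total_stalls = sum(s for s, _ in STALLS.values())
total_requests = sum(r for _, r in STALLS.values())
shares = {name: round(100.0 * s / total_stalls, 1)
          for name, (s, _) in STALLS.items()}

# P96 reference values from the Result table.
P96_FETCH_STALLS = 26_676_104
P96_LOAD_STALLS = 10_976_902

fetch_increase = STALLS["instruction fetch"][0] - P96_FETCH_STALLS
load_decrease = P96_LOAD_STALLS - STALLS["data load"][0]

print(f"fetch stalls +{fetch_increase:,}, load stalls -{load_decrease:,}")
```

The fetch-side increase is about five times the load-side saving, which is why the data-side improvement does not survive at the shell-window level.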

Shell Phases

Shell phases, P97 shell workload: 222,850,787 cycles, CPI 2.57

  #    phase                             cycles        share
  1.   kernel banner to /init            117,614,831   52.9%
  2.   /init to shell banner             1,081,377     0.5%
  3.   shell banner to first command     36,156,938    16.3%
  4.   echo command                      1,598         0%
  5.   uname -a                          2,616,228     1.2%
  6.   ls /bin /usr/share                31,715,496    14.3%
  7.   cat sample file                   4,087,721     1.8%
  8.   touch/write/cat/rm /tmp file      11,430,280    5.1%
  9.   8x ash loop with file I/O         16,108,796    7.3%
  10.  final marker                      1,409,457     0.6%

The full BusyBox shell script reaches P97-FILE-OK. The shell window is 67.37M cycles, slower than both P96 and P94.
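The 67.37M figure can be reproduced from the phase list. Summing phases 4 through 10 (echo command through final marker) gives exactly the shell-window count from the Result table, so the sketch below treats that range as the shell window. That mapping is inferred from the numbers, not stated on this page.

```python
# Phase cycle counts copied from the Shell Phases list, in order.
PHASES = [
    ("kernel banner to /init", 117_614_831),
    ("/init to shell banner", 1_081_377),
    ("shell banner to first command", 36_156_938),
    ("echo command", 1_598),
    ("uname -a", 2_616_228),
    ("ls /bin /usr/share", 31_715_496),
    ("cat sample file", 4_087_721),
    ("touch/write/cat/rm /tmp file", 11_430_280),
    ("8x ash loop with file I/O", 16_108_796),
    ("final marker", 1_409_457),
]

# Assumption: the shell window starts at the first command (phase 4).
shell_window = sum(cycles for _, cycles in PHASES[3:])

P96_SHELL_WINDOW = 66_084_155
P94_SHELL_WINDOW = 67_050_374

print(f"shell window: {shell_window:,} cycles")
print(f"vs P96: +{shell_window - P96_SHELL_WINDOW:,}")
print(f"vs P94: +{shell_window - P94_SHELL_WINDOW:,}")
```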

Cycle Shape

State breakdown, P97 D-cache line-fill workload: 222,850,787 cycles, CPI 2.57

  #   state       share   cycles
  1.  fetch       3.7%    8,335,018
  2.  execute     39%     86,803,158
  3.  mem         12.7%   28,203,889
  4.  walker      2.1%    4,714,537
  5.  writeback   38.9%   86,777,980
  6.  mul/div     3.6%    8,014,501

P97 does not add a blocking line-fill state. The cost appears as more shared-memory service pressure while the normal state machine runs.

Hot Functions

Hot functions, P97 BusyBox shell symbols: 65,790 samples, one every 1,024 cycles

  #    symbol                     image     share   samples
  1.   printf_core                busybox   5.5%    3,630
  2.   memset                     kernel    5.1%    3,376
  3.   vruntime_eligible          kernel    3.5%    2,318
  4.   memcpy                     busybox   3.5%    2,305
  5.   blake2s_compress_generic   kernel    2.8%    1,808
  6.   memcpy                     kernel    2.5%    1,668
  7.   __fwritex                  busybox   2.5%    1,630
  8.   handle_exception           kernel    1.8%    1,182
  9.   unmap_page_range           kernel    1.6%    1,071
  10.  avg_vruntime               kernel    1.5%    956
  11.  n_tty_write                kernel    1.3%    850
  12.  memset                     busybox   1.3%    832
  13.  ret_from_exception         kernel    1.2%    791
  14.  do_trap_ecall_u            kernel    1.1%    688
  15.  next_uptodate_folio        kernel    1%      652
  16.  (remaining)                          55.7%   36,611

The hot-symbol mix is still that of the same BusyBox shell workload. What changed is memory-system policy, not the software being run.

Honest Status

check                                 status
Four-word D-cache line storage        PASS
Critical-word-first demand load       PASS
Background D-cache fill descriptor    PASS
dcache_background arbiter class       PASS
BusyBox shell workload runs           PASS
D-cache line-fill counters captured   PASS
Shell-window speedup vs P96           FAIL
Smarter fill throttling               NOT RUN
True split I/D RAM ports              NOT RUN
LibreLane hardening                   NOT RUN

Next

P98 should keep the four-word line shape but throttle background D-cache fill much more aggressively. The obvious rule is to grant a background fill only in cycles when the frontend has no fetch waiting on memory.
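One way such a throttle could look is sketched below as a cycle-level behavioral model. Nothing here is the planned P98 design: the class name `BackgroundFillThrottle`, the `min_idle_cycles` parameter, and the choice to also hold off while a demand data access is waiting are all assumptions made for illustration.

```python
class BackgroundFillThrottle:
    """Allow background fills only after the frontend has been idle a while.

    Hypothetical model: min_idle_cycles is an illustrative tuning knob.
    """

    def __init__(self, min_idle_cycles=2):
        self.min_idle_cycles = min_idle_cycles
        self.idle_streak = 0

    def allow(self, fetch_waiting, demand_waiting):
        """Call once per cycle; returns True if a background fill may issue."""
        if fetch_waiting or demand_waiting:
            # Any foreground request resets the streak and blocks the fill.
            self.idle_streak = 0
            return False
        self.idle_streak += 1
        return self.idle_streak >= self.min_idle_cycles


# Usage: one (fetch_waiting, demand_waiting) pair per cycle.
throttle = BackgroundFillThrottle(min_idle_cycles=2)
trace = [(True, False), (False, False), (False, False),
         (True, False), (False, False)]
grants = [throttle.allow(f, d) for f, d in trace]
```

Setting `min_idle_cycles` to 1 gives the plain "only when the frontend is not waiting" rule; larger values trade background fill throughput for less contention with fetch.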