Project No. 96 of 147 on the ladder

D-cache v0

introduces — direct-mapped word D-cache; aligned LW hit bypass; write-through SW update; D-cache performance counters

harden state
last run: 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P96 adds the first data cache to the Linux-capable core. It is small on purpose: 64 direct-mapped entries, one 32-bit word per entry, physical address tags, aligned integer LW hit bypass, and aligned integer SW write-through update/allocate.

This is not a real nonblocking cache yet. It is the first proof that data-side caching is worth continuing.
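The behavior described above can be sketched as a small software model. This is a minimal illustration, not the P96 RTL: the class name `WordDCache`, the address split (bits [7:2] as index, bits [31:8] as tag), the dictionary standing in for external memory, and the `invalidate` entry point are all assumptions made for the example. The source does not say what triggers an invalidation in hardware.

```python
from dataclasses import dataclass, field

ENTRIES = 64      # from the text: 64 direct-mapped entries, one word each
INDEX_BITS = 6    # log2(ENTRIES)


@dataclass
class WordDCache:
    """Behavioral sketch of a one-word-per-entry, write-through D-cache."""
    valid: list = field(default_factory=lambda: [False] * ENTRIES)
    tags: list = field(default_factory=lambda: [0] * ENTRIES)
    data: list = field(default_factory=lambda: [0] * ENTRIES)
    counters: dict = field(default_factory=lambda: {
        "load_hits": 0, "load_misses": 0, "fills": 0,
        "store_updates": 0, "invalidations": 0})

    @staticmethod
    def split(paddr):
        # Assumed split: [1:0] byte offset, [7:2] index, remaining bits tag.
        assert paddr % 4 == 0, "only aligned word accesses use the cache"
        word = paddr >> 2
        return word & (ENTRIES - 1), word >> INDEX_BITS

    def _install(self, index, tag, value):
        self.valid[index], self.tags[index], self.data[index] = True, tag, value

    def load_word(self, paddr, memory):
        index, tag = self.split(paddr)
        if self.valid[index] and self.tags[index] == tag:
            self.counters["load_hits"] += 1    # hit: no external access
            return self.data[index]
        self.counters["load_misses"] += 1      # miss: read memory, then fill
        value = memory.get(paddr, 0)
        self._install(index, tag, value)
        self.counters["fills"] += 1
        return value

    def store_word(self, paddr, value, memory):
        index, tag = self.split(paddr)
        value &= 0xFFFFFFFF
        memory[paddr] = value                  # write-through: memory is always written
        self._install(index, tag, value)       # update on match, allocate otherwise
        self.counters["store_updates"] += 1

    def invalidate(self, paddr):
        # Hypothetical hook; the hardware trigger is not stated in the source.
        index, tag = self.split(paddr & ~3)
        if self.valid[index] and self.tags[index] == tag:
            self.valid[index] = False
            self.counters["invalidations"] += 1
```

In this model two addresses 256 bytes apart share an index, so a store to one evicts the other. With one word per entry there is no spatial locality to exploit, which is one plausible reason a real workload can see more misses than hits.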

Result

metric               | P94 arbiter | P95 store buffer | P96 D-cache
post-load cycles     | 222,459,202 | 241,494,238      | 221,522,958
shell window cycles  | 67,050,374  | 77,821,976       | 66,084,155
retired instructions | 86,664,089  | 88,851,638       | 86,344,929
CPI                  | 2.5669      | 2.7179           | 2.5656
memory stall cycles  | 60,032,329  | 60,797,627       | 59,418,375
load stall cycles    | 14,632,992  | 15,144,900       | 10,976,902
fetch stall cycles   | 23,549,359  | 35,563,846       | 26,676,104

comparison              | result
shell window vs P94     | -1.44%
post-load cycles vs P94 | -0.42%
memory stalls vs P94    | -1.02%
load stalls vs P94      | -24.99%
fetch stalls vs P94     | +13.28%
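The comparison figures follow directly from the P94 and P96 columns. A short check, using only numbers from the table (the dictionary keys and the helper name `pct_change` are chosen for this example):

```python
# Cycle counts copied from the Result table.
p94 = {
    "shell window": 67_050_374,
    "post-load cycles": 222_459_202,
    "memory stalls": 60_032_329,
    "load stalls": 14_632_992,
    "fetch stalls": 23_549_359,
}
p96 = {
    "shell window": 66_084_155,
    "post-load cycles": 221_522_958,
    "memory stalls": 59_418_375,
    "load stalls": 10_976_902,
    "fetch stalls": 26_676_104,
}


def pct_change(new, old):
    """Relative change of new against old, in percent."""
    return (new - old) / old * 100.0


for name in p94:
    print(f"{name} vs P94: {pct_change(p96[name], p94[name]):+.2f}%")
```

Negative values are improvements, since every metric here is a cycle count.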

D-cache Counters

counter       | value
load hits     | 3,656,064
load misses   | 6,354,876
fills         | 6,354,876
store updates | 10,473,803
invalidations | 1,873,327

The one-word cache cuts data-side stalls, but the workload still has more load misses (6,354,876) than load hits (3,656,064). A line-based D-cache is the obvious next test, provided it avoids the blocking-fill mistake P90 made on the instruction side.
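The counters give a hit rate directly. The sketch below also sets the P94-to-P96 drop in load stall cycles next to the hit count. The two are close, which would be consistent with roughly one stall cycle saved per hit, but the source does not state that relationship, so treat it as an observation only.

```python
# Counter values from the D-cache Counters table.
load_hits = 3_656_064
load_misses = 6_354_876
fills = 6_354_876

lookups = load_hits + load_misses
hit_rate = load_hits / lookups

# Load stall cycles from the Result table.
p94_load_stalls = 14_632_992
p96_load_stalls = 10_976_902
stall_drop = p94_load_stalls - p96_load_stalls

print(f"counted load lookups: {lookups:,}")
print(f"hit rate: {hit_rate:.1%}")
print(f"every miss produced a fill: {fills == load_misses}")
print(f"load stall drop vs P94: {stall_drop:,} cycles for {load_hits:,} hits")
```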

Memory Stalls

memory stalls (P96 D-cache workload): 59,418,375 stall cycles, 66,576,563 handshakes
  1. instruction fetch: 26,676,104 stall cycles (44.9%), 46,243,128 req
  2. data load: 10,976,902 stall cycles (18.5%), 890,777 req
  3. data store: 11,935,094 stall cycles (20.1%), 218,712 req
  4. atomic memory op: 157,413 stall cycles (0.3%), 183,483 req
  5. page walk for fetch: 1,122,386 stall cycles (1.9%), 1,116,232 req
  6. page walk for load/store: 1,221,734 stall cycles (2.1%), 1,221,224 req
  7. other: 7,328,742 stall cycles (12.3%), 16,703,007 req

Load stall cycles drop from P94’s 14.63M to 10.98M. Fetch stalls rise from 23.55M to 26.68M over the same comparison, which suggests the memory system is still a shared-port negotiation rather than independent instruction/data service.

Shell Phases

shell phases (P96 shell workload): 221,522,958 cycles, CPI 2.57
  1. kernel banner to /init: 117,615,769 cycles (53.3%)
  2. /init to shell banner: 1,069,519 cycles (0.5%)
  3. shell banner to first command: 36,125,450 cycles (16.4%)
  4. echo command: 1,598 cycles (0%)
  5. uname -a: 1,991,318 cycles (0.9%)
  6. ls /bin /usr/share: 32,798,794 cycles (14.9%)
  7. cat sample file: 4,516,475 cycles (2%)
  8. touch/write/cat/rm /tmp file: 10,556,885 cycles (4.8%)
  9. 8x ash loop with file I/O: 16,218,422 cycles (7.3%)
  10. final marker: 663 cycles (0%)

The full BusyBox shell script reaches P96-FILE-OK. The shell window is 66.08M cycles, a 1.44% improvement over P94 and a clear recovery from P95’s store-buffer regression.

Cycle Shape

state breakdown (P96 D-cache workload): 221,522,958 cycles, CPI 2.57
  1. fetch: 3.8% (8,323,856 cycles)
  2. execute: 39% (86,369,781 cycles)
  3. mem: 12.6% (28,018,445 cycles)
  4. walker: 2.1% (4,681,576 cycles)
  5. writeback: 39% (86,344,929 cycles)
  6. mul/div: 3.5% (7,782,655 cycles)

P96 does not add a new architectural state. The speedup shows up as fewer external data-memory waits from the existing S_MEM path.
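Dividing each state's cycle count by the retired-instruction count splits the CPI into per-state contributions. This is arithmetic on the published numbers only; the dictionary keys mirror the labels in the breakdown and are not taken from the RTL.

```python
# P96 figures from the Result table and the state breakdown.
total_cycles = 221_522_958
retired = 86_344_929
state_cycles = {
    "fetch": 8_323_856,
    "execute": 86_369_781,
    "mem": 28_018_445,
    "walker": 4_681_576,
    "writeback": 86_344_929,
    "mul/div": 7_782_655,
}

cpi = total_cycles / retired
print(f"CPI: {cpi:.4f}")

# Cycles spent in each state per retired instruction.
for state, cycles in state_cycles.items():
    print(f"{state:>9}: {cycles / retired:.4f} cycles/instruction")

# Cycles not attributed to any listed state.
unattributed = total_cycles - sum(state_cycles.values())
print(f"unattributed cycles: {unattributed:,}")
```

Writeback cycles equal retired instructions exactly, and execute is within 0.03% of that count, so in these numbers each instruction spends about one cycle in each of those states. The mem state adds roughly 0.32 cycles per instruction, which is the term a better D-cache can shrink.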

Hot Functions

hot functions (P96 BusyBox shell symbols): 64,536 samples, one every 1,024 cycles
  1. printf_core (busybox): 5.5%, 3,545 samples
  2. memset (kernel): 5.1%, 3,311 samples
  3. memcpy (busybox): 3.6%, 2,320 samples
  4. vruntime_eligible (kernel): 3.3%, 2,137 samples
  5. blake2s_compress_generic (kernel): 2.8%, 1,815 samples
  6. __fwritex (busybox): 2.6%, 1,690 samples
  7. memcpy (kernel): 2.6%, 1,673 samples
  8. handle_exception (kernel): 1.7%, 1,121 samples
  9. unmap_page_range (kernel): 1.7%, 1,073 samples
  10. memset (busybox): 1.3%, 864 samples
  11. avg_vruntime (kernel): 1.3%, 857 samples
  12. n_tty_write (kernel): 1.3%, 841 samples
  13. ret_from_exception (kernel): 1.2%, 784 samples
  14. next_uptodate_folio (kernel): 1%, 651 samples
  15. do_trap_ecall_u (kernel): 1%, 646 samples
  16. (remaining): 55.5%, 35,846 samples

The symbol mix remains the same shell/kernel workload. The improvement comes from memory behavior, not from running a different software path.
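A quick consistency check on the profile: with one sample every 1,024 cycles, the sample count should match the length of the profiled window. The numbers below suggest the profile covers the 66.08M-cycle shell window rather than the full post-load run, though the source does not say so explicitly.

```python
samples = 64_536
period_cycles = 1_024
shell_window_cycles = 66_084_155   # P96 shell window from the Result table

covered_cycles = samples * period_cycles
# Ceiling division: samples needed to span the shell window.
expected_samples = -(-shell_window_cycles // period_cycles)

print(f"cycles covered by samples: {covered_cycles:,}")
print(f"samples expected for shell window: {expected_samples:,}")

# Share of the top symbol, printf_core, from its sample count.
print(f"printf_core share: {3_545 / samples:.1%}")
```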

Honest Status

check                                    | status
Direct-mapped word D-cache in RTL        | PASS
Aligned LW hit bypass                    | PASS
Aligned SW write-through update/allocate | PASS
BusyBox shell workload runs              | PASS
D-cache counters captured                | PASS
Subword store merge                      | NOT RUN
Multi-word line fill                     | NOT RUN
Nonblocking D-cache miss handling        | NOT RUN
Shell-window speedup vs P94              | PASS
LibreLane hardening                      | NOT RUN

Next

P97 should try the data-cache policy that P96 points toward: four-word lines, critical-word-first response, and background fill through the existing P94 arbiter counters. The constraint is clear: do not block the core just to fill the rest of a line.
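Critical-word-first ordering for a four-word line can be sketched as follows. The wrap-around order, the 16-byte line size implied by four 32-bit words, and the function name `fill_order` are assumptions for illustration. P97's actual fill sequence is not specified here.

```python
WORD_BYTES = 4
WORDS_PER_LINE = 4                      # proposed for P97: four-word lines
LINE_BYTES = WORD_BYTES * WORDS_PER_LINE


def fill_order(miss_addr):
    """Word addresses for a line fill, starting with the word that missed.

    Assumes a wrapping order: the critical word first, then the following
    words in the line, wrapping back to the start of the line.
    """
    line_base = miss_addr & ~(LINE_BYTES - 1)
    critical = (miss_addr // WORD_BYTES) % WORDS_PER_LINE
    return [
        line_base + ((critical + step) % WORDS_PER_LINE) * WORD_BYTES
        for step in range(WORDS_PER_LINE)
    ]


order = fill_order(0x1008)
# The core could resume once order[0] arrives; the rest fill in the background.
print([hex(addr) for addr in order])
```

Only the first address in the list is on the core's critical path. The remaining three are the background fill that must not hold the core in a stall.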