Project No. 107 of 147 on the ladder

Banked auxiliary D-cache fill

introduces: auxiliary lower-bank response path; D-cache background-fill consumer; aux response consumption counters

harden state: last run 2026-05-06
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P107 is the first lower-bank rung where the core consumes the auxiliary response. It keeps the consumer deliberately narrow: only blocked D-cache background fills use the new response. Demand loads, demand fetch, page-table walks, AMOs, and stores still use the original memory path.

Result

check                                                      result
make check-tools                                           PASS
Verilator build                                            PASS
Linux reaches /init                                        PASS
BusyBox prompt                                             PASS
BusyBox shell workload reaches P107-FILE-OK                PASS
Auxiliary response consumed by D-cache background fill     PASS
Auxiliary read errors                                      PASS
Hardened layout                                            NOT RUN

Timing

metric                      P106 contract    P107 aux D-cache fill
post-load cycles            219,613,584      219,407,400
shell window cycles         65,558,077       65,269,213
retired instructions        86,478,207       86,411,402
CPI                         2.5395           2.5391
BusyBox ready milestone     118,413,096      118,427,145
shell FILE-OK milestone     219,613,727      219,407,543
kernel panic milestone      0                0

This is a small but real speedup over P106: the shell window drops from 65,558,077 to 65,269,213 cycles, a saving of 288,864 cycles or about 0.44%. It is not enough to declare the memory arc solved.
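The speedup figure can be reproduced from the timing numbers. The following is a minimal Python sketch with values copied from the Timing table; the variable names are illustrative and not part of the project tooling. It also assumes that the CPI row is post-load cycles divided by retired instructions, which is inferred from the numbers rather than stated by the source.

```python
# Values copied from the Timing table (P106 contract vs. P107 aux D-cache fill).
p106 = {"post_load": 219_613_584, "shell_window": 65_558_077, "retired": 86_478_207}
p107 = {"post_load": 219_407_400, "shell_window": 65_269_213, "retired": 86_411_402}

# Shell-window saving in cycles and as a percentage of the P106 window.
saved = p106["shell_window"] - p107["shell_window"]
speedup_pct = 100.0 * saved / p106["shell_window"]

# Assumption: CPI = post-load cycles / retired instructions.
cpi_p106 = p106["post_load"] / p106["retired"]
cpi_p107 = p107["post_load"] / p107["retired"]

print(f"shell window saved: {saved:,} cycles ({speedup_pct:.2f}%)")
print(f"CPI: P106 {cpi_p106:.4f}, P107 {cpi_p107:.4f}")
```

Under that assumption the computed CPI values round to the table's 2.5395 and 2.5391.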

Auxiliary Consumer

The P106 lane now has response inputs:

banked_aux_ready, banked_aux_rdata, banked_aux_error

P107 consumes the response only when the blocked auxiliary request is the D-cache background-fill descriptor. On a valid response, the core marks the current D-cache fill word valid, stores banked_aux_rdata, advances the fill pointer, and increments an auxiliary fill counter.
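The consume rule described above can be sketched as a small behavioral model. This is a Python illustration only, not the project's RTL: the request tags, the class name, the line size, and the reading of "valid response" as "ready and no error" are all assumptions. Only the three `banked_aux_*` signal names come from the source.

```python
DCACHE_BG_FILL = "dcache_bg_fill"  # hypothetical tag for the background-fill descriptor
DEMAND_LOAD = "demand_load"        # hypothetical tag for any other blocked request


class AuxFillConsumer:
    """Toy model of the P107 auxiliary-response consumer."""

    def __init__(self, line_words=8):  # line size is an assumption
        self.fill_data = [0] * line_words
        self.fill_valid = [False] * line_words
        self.fill_ptr = 0
        self.aux_fill_count = 0

    def step(self, blocked_kind, banked_aux_ready, banked_aux_rdata, banked_aux_error):
        """Model one cycle; return True if the auxiliary response was consumed."""
        valid = banked_aux_ready and not banked_aux_error  # assumed meaning of "valid"
        if not valid or blocked_kind != DCACHE_BG_FILL:
            return False
        if self.fill_ptr >= len(self.fill_data):
            return False  # line already complete, nothing left to fill
        self.fill_valid[self.fill_ptr] = True           # mark current fill word valid
        self.fill_data[self.fill_ptr] = banked_aux_rdata  # store the response data
        self.fill_ptr += 1                               # advance the fill pointer
        self.aux_fill_count += 1                         # bump the auxiliary fill counter
        return True


consumer = AuxFillConsumer()
consumed = [
    consumer.step(DCACHE_BG_FILL, True, 0x1234, False),  # consumed
    consumer.step(DEMAND_LOAD, True, 0x5678, False),     # ignored: not a background fill
    consumer.step(DCACHE_BG_FILL, True, 0x9ABC, True),   # ignored: error flagged
]
print(consumed, consumer.fill_ptr, consumer.aux_fill_count)
```

The point of the narrow gate is visible in the model: any blocked request other than the background-fill descriptor leaves the fill state and the counter untouched.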

counter                                       value
auxiliary instruction reads serviced          488,267
auxiliary data reads serviced                 10,065,241
auxiliary reads serviced total                10,553,508
shell-window auxiliary reads                  4,426,763
auxiliary read errors                         0
D-cache aux background fills consumed         10,065,241
shell-window D-cache aux fills consumed       4,099,429
core aux-fill counter                         10,065,241
auxiliary read checksum                       2,929,617,952

The consumed-fill count and the core counter match exactly.
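The counter relationships can be checked mechanically. The following is a hedged Python sketch using the reported values; the dictionary keys are illustrative names, not the project's actual counter identifiers.

```python
# Counter values copied from the report; key names are illustrative.
counters = {
    "aux_instr_reads": 488_267,
    "aux_data_reads": 10_065_241,
    "aux_reads_total": 10_553_508,
    "aux_read_errors": 0,
    "dcache_aux_fills_consumed": 10_065_241,
    "core_aux_fill_counter": 10_065_241,
    "shell_aux_reads": 4_426_763,
    "shell_dcache_aux_fills": 4_099_429,
}

# Instruction plus data reads should add up to the reported total.
total_ok = (
    counters["aux_instr_reads"] + counters["aux_data_reads"]
    == counters["aux_reads_total"]
)

# Consumed fills, the core-side counter, and serviced data reads should agree.
fills_ok = (
    counters["dcache_aux_fills_consumed"]
    == counters["core_aux_fill_counter"]
    == counters["aux_data_reads"]
)

# Share of shell-window auxiliary reads that ended up as D-cache fills.
shell_fill_share = counters["shell_dcache_aux_fills"] / counters["shell_aux_reads"]

print(total_ok, fills_ok, f"{100 * shell_fill_share:.1f}%")
```

Both equalities hold for the reported numbers, so every serviced auxiliary data read is accounted for as a consumed background fill.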

D-cache Shape

counter                     P106           P107
load hits                   4,605,965      4,603,565
load misses                 5,417,820      5,417,845
demand fills                5,417,820      5,417,845
background fills            438,883        10,241,841
aux background fills        0              10,065,241
background active cycles    55,759,373     55,537,629

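P107 turns the auxiliary lane into a large volume of cache maintenance work: background fills rise from 438,883 to 10,241,841, of which 10,065,241 arrive over the auxiliary lane. Demand load misses barely move (5,417,820 to 5,417,845). That explains the modest speedup.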
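The contrast between background-fill volume and demand misses is easy to quantify. The following Python sketch uses the D-cache counters reported above; the variable names are illustrative.

```python
# D-cache counters copied from the report (P106 vs. P107).
p106 = {"load_misses": 5_417_820, "bg_fills": 438_883, "aux_bg_fills": 0}
p107 = {"load_misses": 5_417_845, "bg_fills": 10_241_841, "aux_bg_fills": 10_065_241}

# Demand misses are essentially unchanged between the two rungs.
miss_delta = p107["load_misses"] - p106["load_misses"]

# Background fills grow by more than an order of magnitude.
bg_growth = p107["bg_fills"] / p106["bg_fills"]

# Almost all P107 background fills come from the auxiliary lane.
aux_share_pct = 100.0 * p107["aux_bg_fills"] / p107["bg_fills"]
non_aux_bg_fills = p107["bg_fills"] - p107["aux_bg_fills"]

print(f"load miss delta: {miss_delta:+d}")
print(f"background fill growth: {bg_growth:.1f}x")
print(f"aux share of background fills: {aux_share_pct:.1f}% ({non_aux_bg_fills:,} non-aux)")
```

Roughly 23 times more background fills with a miss count that changes by 25 suggests the extra fills are mostly not landing on lines that demand loads would otherwise have missed.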

Memory Stalls

memory stalls: P107 banked auxiliary D-cache-fill workload
total stalls 58,493,110; handshakes 64,791,143

  #  source                       stalls        share    requests
  1. instruction fetch            28,293,238    48.4%    45,777,433
  2. data load                    10,378,588    17.7%    559,131
  3. data store                   10,940,802    18.7%    77,582
  4. atomic memory op             173,764       0.3%     167,687
  5. page walk for fetch          681,848       1.2%     675,694
  6. page walk for load/store     673,456       1.2%     667,265
  7. other                        7,351,414     12.6%    16,866,351

The main shared response path is still the path that demand work uses. P107 only consumes auxiliary data for background fill.
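To see where the remaining stalls sit, the breakdown can be re-derived from the raw counts. This is a minimal Python sketch using the reported stall and request numbers; the stalls-per-request ratio is an added illustrative metric, not one the report itself defines.

```python
# (source, stalls, requests) copied from the memory stall breakdown.
rows = [
    ("instruction fetch", 28_293_238, 45_777_433),
    ("data load", 10_378_588, 559_131),
    ("data store", 10_940_802, 77_582),
    ("atomic memory op", 173_764, 167_687),
    ("page walk for fetch", 681_848, 675_694),
    ("page walk for load/store", 673_456, 667_265),
    ("other", 7_351_414, 16_866_351),
]

total_stalls = sum(stalls for _, stalls, _ in rows)
total_requests = sum(reqs for _, _, reqs in rows)

# Share of all stalls per source, plus stalls per request as a rough cost indicator.
shares = {name: 100.0 * stalls / total_stalls for name, stalls, _ in rows}
per_request = {name: stalls / reqs for name, stalls, reqs in rows}

for name, stalls, reqs in rows:
    print(f"{name:26s} {shares[name]:5.1f}%  {per_request[name]:8.2f} stalls/req")
print(f"total stalls {total_stalls:,}, total requests {total_requests:,}")
```

The per-source stalls and requests sum to the reported totals (58,493,110 stalls and 64,791,143 handshakes), and instruction fetch alone accounts for just under half of all stalls.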

Shell Phases

shell phases: P107 shell workload
cycles 219,407,400; CPI 2.54

  #   phase                               cycles         share
  1.  kernel banner to /init              116,720,762    53.4%
  2.  /init to shell banner               1,078,173      0.5%
  3.  shell banner to first command       35,711,185     16.3%
  4.  echo command                        1,649          0%
  5.  uname -a                            2,560,223      1.2%
  6.  ls /bin /usr/share                  32,199,160     14.7%
  7.  cat sample file                     3,161,684      1.5%
  8.  touch/write/cat/rm /tmp file        10,996,952     5%
  9.  8x ash loop with file I/O           16,348,865     7.5%
  10. final marker                        680            0%

The shell script reaches P107-FILE-OK.

Cycle Shape

state breakdown: P107 banked auxiliary D-cache-fill workload
cycles 219,407,400; CPI 2.54

  #  state        share    cycles
  1. fetch        3.7%     8,118,654
  2. execute      39.4%    86,436,270
  3. mem          12.8%    28,058,018
  4. walker       1.2%     2,698,263
  5. writeback    39.4%    86,411,402
  6. mul/div      3.5%     7,683,077

P107 retires 86.41M instructions at CPI 2.5391.

Hot Functions

hot functions: P107 BusyBox shell symbols
samples 63,740; period every 1,024 cycles

  #   symbol                      image      share    samples
  1.  printf_core                 busybox    5.5%     3,506
  2.  memset                      kernel     5.2%     3,296
  3.  memcpy                      busybox    3.8%     2,418
  4.  vruntime_eligible           kernel     3.3%     2,079
  5.  blake2s_compress_generic    kernel     2.8%     1,799
  6.  __fwritex                   busybox    2.7%     1,696
  7.  memcpy                      kernel     2.6%     1,677
  8.  handle_exception            kernel     1.7%     1,063
  9.  unmap_page_range            kernel     1.6%     1,020
  10. avg_vruntime                kernel     1.3%     857
  11. n_tty_write                 kernel     1.3%     812
  12. memset                      busybox    1.3%     794
  13. ret_from_exception          kernel     1.2%     739
  14. next_uptodate_folio         kernel     1%       656
  15. vfprintf                    busybox    1%       619
  16. (remaining)                            55.6%    35,462

The software workload is the same shell script. The experiment is the hardware response path.

Next

P108 should consume auxiliary data in a path that can directly shorten a stall. Instruction-side prefetch/background fill is the safer next step; a tagged demand fetch/load response path is more valuable but needs proper ownership and cancellation rules.