Project No. 109 of 147 on the ladder

Banked auxiliary demand prefetch

Introduces: demand-visible auxiliary prefetch bypass; S_WB store-buffer drain overlap; auxiliary demand-prefetch counter

harden state: last run 2026-05-06
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P109 makes the auxiliary instruction response demand-visible for the writeback-prefetch case. P108 used the auxiliary bank to fill I-cache while the main port drained a store-buffer word. P109 also decodes that returned prefetch word immediately and advances to S_EXECUTE.
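The difference between the two projects can be sketched as a next-state rule for leaving S_WB. This is a behavioral illustration only, not the RTL: the function name `state_after_wb` and both flags are hypothetical, and it assumes that in P108 the core still passed through S_FETCH after the auxiliary bank filled the I-cache.

```python
from enum import Enum, auto


class State(Enum):
    S_FETCH = auto()
    S_EXECUTE = auto()
    S_WB = auto()


def state_after_wb(aux_word_returned: bool, demand_visible: bool) -> State:
    """Hypothetical next-state rule when leaving S_WB.

    aux_word_returned: the auxiliary bank returned the next-PC word while
        the main port drained a store-buffer word.
    demand_visible: True models P109 (the returned word is decoded at once),
        False models P108 (the word only fills the I-cache).
    Ordinary I-cache writeback-prefetch hits are ignored in this sketch.
    """
    if aux_word_returned and demand_visible:
        # P109 bypass: decode the returned word and skip the fetch state.
        return State.S_EXECUTE
    # Assumed P108 path: go through S_FETCH, which then hits the filled line.
    return State.S_FETCH
```

Under this model, each bypass removes one pass through S_FETCH, which is the bubble the experiment targets.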

Result

check                                             result
make check-tools                                  PASS
Verilator build                                   PASS
Linux reaches /init                               PASS
BusyBox prompt                                    PASS
BusyBox shell workload reaches P109-FILE-OK       PASS
Auxiliary demand-prefetch bypass counter nonzero  PASS
Auxiliary read errors                             PASS
Hardened layout                                   NOT RUN

Timing

metric                   P108 aux I-cache fill  P109 demand prefetch
post-load cycles         218,960,570            218,922,720
shell window cycles      64,823,658             65,023,598
retired instructions     86,260,233             86,402,301
CPI                      2.5384                 2.5338
S_FETCH cycles           8,109,410              7,627,570
BusyBox ready milestone  118,420,395            118,416,748
shell FILE-OK milestone  218,960,713            218,922,863
kernel panic milestone   0                      0

The local frontend effect is clear: S_FETCH drops by 481,840 cycles. The full post-load run is 37,850 cycles shorter than P108. The shell window is 199,940 cycles longer in this run, so this is a mixed workload result, not a clean shell speedup.
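The deltas quoted above can be reproduced from the Timing table. The sketch below copies the table values into two dictionaries (the key names are illustrative) and assumes that CPI is post-load cycles divided by retired instructions, which matches the reported figures to four decimal places.

```python
# Values copied from the Timing table; key names are illustrative.
p108 = {
    "post_load_cycles": 218_960_570,
    "shell_window_cycles": 64_823_658,
    "retired_instructions": 86_260_233,
    "s_fetch_cycles": 8_109_410,
}
p109 = {
    "post_load_cycles": 218_922_720,
    "shell_window_cycles": 65_023_598,
    "retired_instructions": 86_402_301,
    "s_fetch_cycles": 7_627_570,
}

# Negative delta means P109 used fewer cycles (or retired fewer instructions).
delta = {key: p109[key] - p108[key] for key in p108}


def cpi(run: dict) -> float:
    # Assumption: CPI = post-load cycles / retired instructions.
    return run["post_load_cycles"] / run["retired_instructions"]


for key, value in delta.items():
    print(f"{key:22s} {value:+,d}")
print(f"CPI P108 {cpi(p108):.4f}  P109 {cpi(p109):.4f}")
```

The retired-instruction counts differ between the two runs, which is one more reason to read the shell-window number as a mixed result rather than a like-for-like comparison.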

Auxiliary Consumers

consumer                         consumed    shell-window consumed
S_WB demand prefetch bypass      488,086     327,181
plain S_FETCH demand fetch       0           0
I-cache background fill          0           0
D-cache background fill          10,062,512  4,105,433

The plain demand-fetch case staying at 0 is the important negative result. The store buffer prevents S_FETCH from becoming the blocked banked instruction request. The useful overlap is S_WB store-buffer drain plus next-PC writeback prefetch.

counter                              value
auxiliary instruction reads serviced 488,086
auxiliary data reads serviced        10,062,512
auxiliary reads serviced total       10,550,598
shell-window auxiliary reads         4,432,614
auxiliary read errors                0
auxiliary read checksum              1,857,269,570
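As a consistency check, the per-consumer counts should add up to the serviced totals. The short sketch below uses only numbers from the two tables above; the variable names are illustrative.

```python
# (consumed, shell-window consumed) per consumer, copied from the tables.
consumers = {
    "S_WB demand prefetch bypass": (488_086, 327_181),
    "plain S_FETCH demand fetch": (0, 0),
    "I-cache background fill": (0, 0),
    "D-cache background fill": (10_062_512, 4_105_433),
}

total_consumed = sum(full for full, _ in consumers.values())
shell_consumed = sum(shell for _, shell in consumers.values())

# Reported counters for comparison.
reported_total = 10_550_598
reported_shell_window = 4_432_614

print(total_consumed == reported_total)
print(shell_consumed == reported_shell_window)
```

Both sums match the reported counters, so every serviced auxiliary read is attributed to exactly one consumer.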

Cache Shape

counter                           value
I-cache hits                      42,828,957
I-cache fetch-state hits          4,451,195
I-cache writeback-prefetch hits   38,377,762
I-cache miss refills              45,284,972
I-cache aux demand prefetches     488,086
D-cache load hits                 4,607,962
D-cache load misses               5,416,558
D-cache aux background fills      10,062,512

Memory Stalls

Memory stalls, P109 banked auxiliary demand-prefetch workload: stalls 58,492,090, handshakes 64,295,979

     source                    stalls      share  requests
  1. instruction fetch         28,295,901  48.4%  45,284,972
  2. data load                 10,380,077  17.7%  556,535
  3. data store                10,937,181  18.7%  77,513
  4. atomic memory op          173,631     0.3%   167,445
  5. page walk for fetch       681,489     1.2%   675,335
  6. page walk for load/store  671,969     1.1%   665,776
  7. other                     7,351,842   12.6%  16,868,403

The lower-memory system still has a single main port plus an auxiliary read lane. P109 proves the aux lane can advance frontend state, but the remaining stalls want a more general response owner.
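One way to see where the remaining stalls concentrate is to divide each source's stall count by its request count. The sketch below does that with the Memory Stalls figures; the names are illustrative, and it assumes the stall column counts cycles.

```python
# (stalls, requests) per source, copied from the Memory Stalls breakdown.
stalls = {
    "instruction fetch": (28_295_901, 45_284_972),
    "data load": (10_380_077, 556_535),
    "data store": (10_937_181, 77_513),
    "atomic memory op": (173_631, 167_445),
    "page walk for fetch": (681_489, 675_335),
    "page walk for load/store": (671_969, 665_776),
    "other": (7_351_842, 16_868_403),
}

total_stalls = sum(s for s, _ in stalls.values())
total_requests = sum(r for _, r in stalls.values())

for name, (s, r) in stalls.items():
    share = 100.0 * s / total_stalls
    per_request = s / r
    print(f"{name:26s} {share:5.1f}%  {per_request:8.2f} stalls/request")
```

Instruction fetch dominates by volume, while data loads and stores are far more expensive per request, which fits the argument for a response path that is not specific to the frontend.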

Shell Phases

Shell phases, P109 shell workload: cycles 218,922,720, CPI 2.53

      phase                            cycles       share
   1. kernel banner to /init           116,719,999  53.5%
   2. /init to shell banner            1,068,539    0.5%
   3. shell banner to first command    35,482,517   16.3%
   4. echo command                     1,649        0%
   5. uname -a                         2,496,533    1.1%
   6. ls /bin /usr/share               31,832,322   14.6%
   7. cat sample file                  2,663,980    1.2%
   8. touch/write/cat/rm /tmp file     11,795,687   5.4%
   9. 8x ash loop with file I/O        16,232,747   7.4%
  10. final marker                     680          0%

The shell script reaches P109-FILE-OK.

Cycle Shape

State breakdown, P109 banked auxiliary demand-prefetch workload: cycles 218,922,720, CPI 2.53

     state      share  cycles
  1. fetch      3.5%   7,627,570
  2. execute    39.5%  86,427,071
  3. mem        12.8%  28,056,808
  4. walker     1.2%   2,694,569
  5. writeback  39.5%  86,402,301
  6. mul/div    3.5%   7,712,685

P109 retires 86.40M instructions at CPI 2.5338.

Hot Functions

Hot functions, P109 BusyBox shell symbols: 63,499 samples, one every 1,024 cycles

      symbol                    image    share  samples
   1. printf_core               busybox  5.6%   3,555
   2. memset                    kernel   5.2%   3,305
   3. memcpy                    busybox  3.6%   2,304
   4. vruntime_eligible         kernel   3.5%   2,197
   5. blake2s_compress_generic  kernel   2.8%   1,806
   6. memcpy                    kernel   2.7%   1,739
   7. __fwritex                 busybox  2.7%   1,680
   8. handle_exception          kernel   1.7%   1,096
   9. unmap_page_range          kernel   1.6%   1,026
  10. n_tty_write               kernel   1.3%   851
  11. memset                    busybox  1.3%   842
  12. avg_vruntime              kernel   1.3%   821
  13. ret_from_exception        kernel   1.2%   761
  14. n_tty_read                kernel   1%     654
  15. next_uptodate_folio       kernel   1%     643
  16. (remaining)               -        55.1%  34,977

The software workload is unchanged; the experiment is whether an auxiliary response can skip a frontend bubble.

Next

P110 should replace the ad hoc consumers with a tiny tagged auxiliary response slot: owner, physical word address, data, valid, and cancel. That gives fetch, prefetch, background fill, and later load-miss service one shared response rule.
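As a rough illustration of what such a slot could look like, here is a behavioral sketch with the five fields named above. Everything beyond those field names is an assumption made for the example: the `Owner` members, the `deliver` function, and the rule that a cancelled response frees the slot without reaching its owner are hypothetical and are not taken from P110.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple


class Owner(Enum):
    # Hypothetical owner tags for the consumers named in the text.
    FETCH = auto()
    PREFETCH = auto()
    BACKGROUND_FILL = auto()
    LOAD_MISS = auto()


@dataclass
class AuxResponseSlot:
    owner: Owner
    paddr_word: int      # physical word address of the request
    data: int = 0        # returned word
    valid: bool = False  # set when the auxiliary lane returns data
    cancel: bool = False # set if the request became stale (e.g. a redirect)


def deliver(slot: AuxResponseSlot) -> Optional[Tuple[Owner, int, int]]:
    """Assumed shared response rule: one path for every owner.

    Returns (owner, address, data) if the response should be consumed,
    or None if the slot is empty or the response was cancelled.
    The slot is freed whenever it held a valid response.
    """
    if not slot.valid:
        return None
    slot.valid = False  # free the slot for the next request
    if slot.cancel:
        return None     # drop stale data instead of handing it to the owner
    return (slot.owner, slot.paddr_word, slot.data)
```

For example, a slot created as `AuxResponseSlot(Owner.PREFETCH, 0x80000040, data=0x13, valid=True)` is delivered once to the prefetch owner and then reads as empty. The point of the single `deliver` rule is that fetch, prefetch, fill, and load-miss consumers would differ only in the owner tag.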