Project No. 108 of 147 on the ladder

Banked auxiliary I-cache fill

introduces: auxiliary instruction prefetch consumer; I-cache aux fill counter; instruction-side lower-bank response use

harden state: last run 2026-05-06
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P108 adds an instruction-side consumer for the auxiliary lower-bank response. P107 used this second response only for D-cache background fill. P108 also uses it when the I-cache writeback prefetch is blocked by data-side work: in that case the auxiliary response fills the target I-cache word.

Result

check                                          result
make check-tools                               PASS
Verilator build                                PASS
Linux reaches /init                            PASS
BusyBox prompt                                 PASS
BusyBox shell workload reaches P108-FILE-OK    PASS
Auxiliary I-cache prefetch response consumed   PASS
Auxiliary read errors                          PASS
Hardened layout                                NOT RUN

Timing

metric                    P107 aux D-cache fill    P108 aux I-cache fill
post-load cycles          219,407,400              218,960,570
shell window cycles       65,269,213               64,823,658
retired instructions      86,411,402               86,260,233
CPI                       2.5391                   2.5384
BusyBox ready milestone   118,427,145              118,420,395
shell FILE-OK milestone   219,407,543              218,960,713
kernel panic milestone    0                        0

P108 improves the shell window by 445,555 cycles versus P107, about 0.68%. Against P106, the two auxiliary-response consumers have cut 734,419 shell-window cycles.
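
The percentages follow directly from the Timing table. A minimal sketch of the arithmetic, using only the figures quoted above: the variable names are illustrative, and the split of the P106 saving between P107 and P108 is derived here rather than reported.

```python
# Shell-window cycles from the Timing table.
p107_shell = 65_269_213
p108_shell = 64_823_658

saved_vs_p107 = p107_shell - p108_shell           # 445,555 cycles
pct_vs_p107 = 100 * saved_vs_p107 / p107_shell    # about 0.68%

# The total saving versus P106 is quoted in the text; the share that
# belongs to P107 alone is derived, not taken from a P106 report.
saved_vs_p106 = 734_419
saved_by_p107 = saved_vs_p106 - saved_vs_p107     # 288,864 cycles

print(f"P108 vs P107: {saved_vs_p107:,} cycles ({pct_vs_p107:.2f}%)")
print(f"P107 vs P106 (derived): {saved_by_p107:,} cycles")
```

On these numbers, the instruction-side consumer accounts for the larger part of the combined saving.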

Auxiliary Consumers

P108 consumes auxiliary data in two places; the third tracked consumer, I-cache background fill, records no fills:

consumer                      consumed fills    shell-window fills
I-cache writeback prefetch    488,027           327,106
D-cache background fill       10,034,869        4,069,556
I-cache background fill       0                 0

I-cache background fill staying at 0 is not a bug in the counter. With the current priority policy, I-cache background fill usually wins before data background work. Blocked writeback prefetch is the real instruction-side auxiliary opportunity.

counter                                value
auxiliary instruction reads serviced   488,027
auxiliary data reads serviced          10,034,869
auxiliary reads serviced total         10,522,896
shell-window auxiliary reads           4,396,662
auxiliary read errors                  0
auxiliary read checksum                1,087,009,691
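
The counters above can be cross-checked against the per-consumer fills. The following is a small consistency sketch, assuming the serviced-read counters and the consumed-fill counters count the same events; the variable names are illustrative.

```python
# Whole-run counters.
aux_instr_reads = 488_027
aux_data_reads = 10_034_869
aux_total = 10_522_896

# Shell-window fills per consumer.
shell_icache_wb_prefetch = 327_106
shell_dcache_bg = 4_069_556
shell_icache_bg = 0
shell_aux_reads = 4_396_662

# Instruction and data reads should add up to the reported total.
assert aux_instr_reads + aux_data_reads == aux_total

# Per-consumer shell-window fills should add up to the shell-window reads.
assert (shell_icache_wb_prefetch + shell_dcache_bg + shell_icache_bg
        == shell_aux_reads)

# Derived figure (not in the report): instruction share of auxiliary reads.
instr_share_pct = 100 * aux_instr_reads / aux_total
print(f"instruction share of auxiliary reads: {instr_share_pct:.1f}%")
```

Both sums match, and the instruction side accounts for under 5% of all auxiliary reads.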

Cache Shape

counter                           value
I-cache hits                      43,271,616
I-cache fetch-state hits          4,938,617
I-cache writeback-prefetch hits   38,332,999
I-cache miss refills              45,205,080
I-cache aux prefetch fills        488,027
D-cache load hits                 4,579,787
D-cache load misses               5,404,513
D-cache aux background fills      10,034,869

The speedup is real but still modest because P108 only fills cache state; it does not let an auxiliary response directly complete a stalled demand fetch or demand load.
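
To put the fill count in proportion, the sketch below checks the I-cache hit decomposition and compares auxiliary prefetch fills with miss refills. The ratio is a derived figure, not one the report states, and comparing the two counters directly is an assumption.

```python
icache_hits = 43_271_616
icache_fetch_state_hits = 4_938_617
icache_wb_prefetch_hits = 38_332_999
icache_miss_refills = 45_205_080
icache_aux_prefetch_fills = 488_027

# The two hit categories account for every reported I-cache hit.
assert icache_fetch_state_hits + icache_wb_prefetch_hits == icache_hits

# Derived ratio: auxiliary prefetch fills relative to miss refills.
aux_fill_pct = 100 * icache_aux_prefetch_fills / icache_miss_refills
print(f"aux prefetch fills / miss refills: {aux_fill_pct:.2f}%")
```

Auxiliary prefetch fills are roughly 1% of the miss-refill count, which is consistent with a small shell-window gain.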

Memory Stalls

Memory stalls, P108 banked auxiliary I-cache-fill workload: 58,447,457 stalls, 64,128,237 handshakes
  1. instruction fetch: 28,239,593 stalls (48.3%), 45,205,080 req
  2. data load: 10,362,110 stalls (17.7%), 561,037 req
  3. data store: 10,914,944 stalls (18.7%), 77,240 req
  4. atomic memory op: 173,730 stalls (0.3%), 166,815 req
  5. page walk for fetch: 678,767 stalls (1.2%), 672,613 req
  6. page walk for load/store: 677,027 stalls (1.2%), 670,851 req
  7. other: 7,401,286 stalls (12.7%), 16,774,601 req

The lower-memory conflict is smaller, but demand-visible stalls remain: instruction fetch alone still accounts for 48.3% of all stalls.
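
The breakdown can be re-derived from the per-source rows. The sketch below checks the totals and adds one derived metric, average stalls per request, which the report does not state; the dictionary and its name are illustrative.

```python
# (stalls, requests) per source, copied from the memory-stall breakdown.
stall_table = {
    "instruction fetch":        (28_239_593, 45_205_080),
    "data load":                (10_362_110, 561_037),
    "data store":               (10_914_944, 77_240),
    "atomic memory op":         (173_730, 166_815),
    "page walk for fetch":      (678_767, 672_613),
    "page walk for load/store": (677_027, 670_851),
    "other":                    (7_401_286, 16_774_601),
}

total_stalls = sum(stalls for stalls, _ in stall_table.values())
total_reqs = sum(reqs for _, reqs in stall_table.values())

# The per-source rows should add up to the reported totals.
assert total_stalls == 58_447_457
assert total_reqs == 64_128_237

# Derived metric (not in the report): average stalls per request.
stalls_per_req = {name: s / r for name, (s, r) in stall_table.items()}

for name, (s, r) in stall_table.items():
    share = 100 * s / total_stalls
    print(f"{name:26s} {share:5.1f}%  {stalls_per_req[name]:8.2f} stalls/req")
```

On these figures instruction fetch dominates by volume at under one stall per request, while data loads and stores are far fewer but each costs many more stalls.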

Shell Phases

Shell phases, P108 shell workload: 218,960,570 cycles, CPI 2.54
  1. kernel banner to /init: 116,720,090 cycles (53.5%)
  2. /init to shell banner: 1,072,095 cycles (0.5%)
  3. shell banner to first command: 35,716,660 cycles (16.4%)
  4. echo command: 1,649 cycles (0%)
  5. uname -a: 2,407,721 cycles (1.1%)
  6. ls /bin /usr/share: 32,106,012 cycles (14.7%)
  7. cat sample file: 3,070,534 cycles (1.4%)
  8. touch/write/cat/rm /tmp file: 11,008,246 cycles (5%)
  9. 8x ash loop with file I/O: 16,228,816 cycles (7.4%)
  10. final marker: 680 cycles (0%)

The shell script reaches P108-FILE-OK.
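
The phase list also appears to explain the shell-window figure in the Timing table: summing the phases from the echo command onward reproduces it exactly. This is an observation from the numbers, not a documented definition of the shell window, and the list name below is illustrative.

```python
# (phase, cycles) in execution order, copied from the phase list.
phases = [
    ("kernel banner to /init", 116_720_090),
    ("/init to shell banner", 1_072_095),
    ("shell banner to first command", 35_716_660),
    ("echo command", 1_649),
    ("uname -a", 2_407_721),
    ("ls /bin /usr/share", 32_106_012),
    ("cat sample file", 3_070_534),
    ("touch/write/cat/rm /tmp file", 11_008_246),
    ("8x ash loop with file I/O", 16_228_816),
    ("final marker", 680),
]

# Phases 4 through 10 (index 3 onward) add up to the shell window.
shell_window = sum(cycles for _, cycles in phases[3:])
assert shell_window == 64_823_658

# The listed percentages match shares of the phase sum, which is
# slightly below the 218,960,570 post-load cycle total.
phase_sum = sum(cycles for _, cycles in phases)
boot_share_pct = 100 * phases[0][1] / phase_sum
print(f"phase sum: {phase_sum:,}  boot share: {boot_share_pct:.1f}%")
```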

Cycle Shape

State breakdown, P108 banked auxiliary I-cache-fill workload: 218,960,570 cycles, CPI 2.54
  1. fetch: 8,109,410 cycles (3.7%)
  2. execute: 86,284,961 cycles (39.4%)
  3. mem: 27,992,010 cycles (12.8%)
  4. walker: 2,699,258 cycles (1.2%)
  5. writeback: 86,260,233 cycles (39.4%)
  6. mul/div: 7,612,982 cycles (3.5%)

P108 retires 86.26M instructions at CPI 2.5384.

Hot Functions

Hot functions, P108 BusyBox shell symbols: 63,304 samples, one every 1,024 cycles
  1. printf_core (busybox): 5.7%, 3,576 samples
  2. memset (kernel): 5.2%, 3,300 samples
  3. memcpy (busybox): 3.7%, 2,353 samples
  4. vruntime_eligible (kernel): 3.4%, 2,121 samples
  5. blake2s_compress_generic (kernel): 2.8%, 1,791 samples
  6. __fwritex (busybox): 2.7%, 1,721 samples
  7. memcpy (kernel): 2.7%, 1,719 samples
  8. handle_exception (kernel): 1.7%, 1,067 samples
  9. unmap_page_range (kernel): 1.6%, 1,036 samples
  10. memset (busybox): 1.4%, 862 samples
  11. avg_vruntime (kernel): 1.4%, 856 samples
  12. n_tty_write (kernel): 1.3%, 821 samples
  13. ret_from_exception (kernel): 1.2%, 761 samples
  14. next_uptodate_folio (kernel): 1%, 648 samples
  15. n_tty_read (kernel): 1%, 647 samples
  16. (remaining): 55.1%, 34,850 samples

The software workload is unchanged; the experiment is the instruction prefetch response path.
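
The sample count is consistent with a profile taken over the shell window only. A quick check, assuming one sample per 1,024 cycles with no samples dropped:

```python
samples = 63_304
period_cycles = 1_024
shell_window_cycles = 64_823_658  # from the Timing table

# Cycles covered by the profile at one sample per period.
covered_cycles = samples * period_cycles
print(f"profile covers about {covered_cycles:,} cycles")

# A shell window of this length yields exactly this many whole periods.
assert shell_window_cycles // period_cycles == samples
```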

Next

P109 should add a tiny ownership tracker for a demand-visible auxiliary response. Demand fetch is probably the first target: tag the auxiliary word by physical address, cancel it on PC or translation changes, and only then let it advance architectural fetch.
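
The proposed tracker can be sketched as a behavioral model. This is only an illustration of the three rules named above (tag by physical address, cancel on PC or translation change, then allow use), not the planned RTL: the class `AuxFetchTracker`, its fields, and its method names are all hypothetical, and a single-entry tracker is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuxFetchTracker:
    """Hypothetical single-entry owner record for one auxiliary fetch word."""

    valid: bool = False
    paddr: int = 0
    word: int = 0

    def capture(self, paddr: int, word: int) -> None:
        """Record an auxiliary response, tagged by its physical address."""
        self.valid, self.paddr, self.word = True, paddr, word

    def cancel(self) -> None:
        """Drop the entry; call on a PC redirect or a translation change."""
        self.valid = False

    def claim(self, demand_paddr: int) -> Optional[int]:
        """Hand the word to demand fetch only if the tag matches."""
        if self.valid and self.paddr == demand_paddr:
            self.valid = False  # each captured word may advance fetch once
            return self.word
        return None


tracker = AuxFetchTracker()

# Matching physical address: the word may advance architectural fetch.
tracker.capture(paddr=0x8000_1000, word=0x13)  # arbitrary example values
assert tracker.claim(0x8000_1000) == 0x13
assert tracker.claim(0x8000_1000) is None      # already consumed

# Redirect or translation change between capture and use: word is dropped.
tracker.capture(paddr=0x8000_2000, word=0x13)
tracker.cancel()
assert tracker.claim(0x8000_2000) is None

# Address mismatch: the entry is not consumed by the wrong fetch.
tracker.capture(paddr=0x8000_3000, word=0x13)
assert tracker.claim(0x8000_3004) is None
assert tracker.claim(0x8000_3000) == 0x13
```

Keeping cancellation separate from the claim check means a stale word can never reach fetch, even if its physical address happens to match a later demand.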