No. 111 / project of 147 on the ladder

Nonblocking load aux

introduces: aux-owned aligned load miss; load/prefetch split-bank overlap; D-cache aux load-fill counters

harden state: last run 2026-05-06
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P111 lets one aligned integer load miss consume the tagged auxiliary response while the main lower-memory port fetches a safe next-PC instruction word. It is the first live data-load owner for the P110 response slot.

Result

| check | result |
| --- | --- |
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P111-FILE-OK | PASS |
| AUX_OWNER_LOAD responses nonzero | PASS |
| Auxiliary read errors | PASS |
| Speedup against P110 | FAIL |
| Hardened layout | NOT RUN |

Timing

| metric | P110 tagged response | P111 nonblocking load aux |
| --- | --- | --- |
| post-load cycles | 217,717,374 | 218,643,837 |
| shell window cycles | 63,761,231 | 64,766,712 |
| retired instructions | 86,014,057 | 86,315,546 |
| CPI | 2.5312 | 2.5331 |
| S_FETCH cycles | 7,613,966 | 7,626,319 |
| S_MEM cycles | 27,608,346 | 27,724,605 |
| shell FILE-OK milestone | 217,717,517 | 218,643,980 |
| kernel panic milestone | 0 | 0 |

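This is a functionality PASS and a performance regression: post-load cycles rise by 926,463 (about 0.43%) and shell-window cycles by 1,005,481 (about 1.58%) relative to P110. The tagged load path works, but the first policy is too eager for the shell workload.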
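The comparison can be reproduced from the Timing table. The sketch below is plain arithmetic on the reported figures; the dictionary keys and function names are illustrative and are not part of the project's tooling.

```python
# Figures copied from the Timing table (P110 vs P111).
P110 = {"post_load_cycles": 217_717_374,
        "shell_window_cycles": 63_761_231,
        "retired": 86_014_057}
P111 = {"post_load_cycles": 218_643_837,
        "shell_window_cycles": 64_766_712,
        "retired": 86_315_546}


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent (positive means slower)."""
    return 100.0 * (new - old) / old


def cpi(run: dict) -> float:
    """Cycles per instruction over the post-load window."""
    return run["post_load_cycles"] / run["retired"]


for key in ("post_load_cycles", "shell_window_cycles"):
    delta = P111[key] - P110[key]
    print(f"{key}: {delta:+,} cycles ({pct_change(P110[key], P111[key]):+.2f}%)")
print(f"CPI: {cpi(P110):.4f} -> {cpi(P111):.4f}")
```

Dividing post-load cycles by retired instructions gives the CPI values listed in the table, which suggests the table's CPI is computed over the post-load window.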

Load Owner

P111 fires only when the load miss and next-PC prefetch target different lower-memory banks. The main port fills the fetch queue and I-cache; the auxiliary response fills the D-cache critical word and the architectural load result.
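The firing condition can be sketched as a predicate. The source does not give the lower-memory bank mapping, so the bank-select bit, the two-bank layout, the word-alignment rule, and every name below are assumptions made for illustration only.

```python
BANK_BIT = 2     # hypothetical: bank chosen by address bit 2 (word-interleaved)
WORD_BYTES = 4   # hypothetical: "aligned" taken to mean 32-bit word alignment


def bank_of(addr: int) -> int:
    """Bank index of a byte address under the assumed two-bank interleave."""
    return (addr >> BANK_BIT) & 1


def aux_load_may_fire(load_addr: int, next_pc: int, load_is_miss: bool) -> bool:
    """True when an aligned load miss and the next-PC prefetch target different banks."""
    aligned = load_addr % WORD_BYTES == 0
    return load_is_miss and aligned and bank_of(load_addr) != bank_of(next_pc)


# A miss at 0x1004 (bank 1) can overlap a prefetch at 0x2000 (bank 0);
# a miss at 0x1000 (bank 0) would conflict with it and must not fire.
print(aux_load_may_fire(0x1004, 0x2000, True))
print(aux_load_may_fire(0x1000, 0x2000, True))
```

Under this model the predicate is purely combinational, so any tuning of when the overlap is worth taking would have to come from extra inputs rather than from the bank check itself.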

| owner | P110 responses | P111 responses |
| --- | --- | --- |
| writeback prefetch | 488,037 | 488,110 |
| data load | 0 | 3,545,688 |
| D-cache background | 9,984,598 | 10,384,721 |
| errors | 0 | 0 |
| cancels | 0 | 0 |

D-cache Effect

| counter | P110 | P111 |
| --- | --- | --- |
| load hits | 4,549,121 | 4,878,488 |
| load misses | 5,376,326 | 5,129,479 |
| auxiliary load fills | 0 | 3,545,688 |
| auxiliary background fills | 9,984,598 | 10,384,721 |
| invalidations | 3,031,969 | 3,033,217 |

The load path is real: 3,545,688 load misses (about 3.55M) complete through the auxiliary owner. The total workload still slows down, which points at scheduling policy and response buffering rather than missing functionality.
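Two derived ratios make the D-cache counters easier to compare. This is a minimal sketch that assumes the hit, miss, and auxiliary-fill counters cover the same window and that every auxiliary load fill corresponds to one counted load miss; the names are illustrative.

```python
# Counters copied from the D-cache Effect table.
P110 = {"load_hits": 4_549_121, "load_misses": 5_376_326, "aux_load_fills": 0}
P111 = {"load_hits": 4_878_488, "load_misses": 5_129_479, "aux_load_fills": 3_545_688}


def hit_rate(c: dict) -> float:
    """Load hits as a fraction of all counted loads."""
    return c["load_hits"] / (c["load_hits"] + c["load_misses"])


def aux_share(c: dict) -> float:
    """Fraction of load misses completed through the auxiliary owner."""
    return c["aux_load_fills"] / c["load_misses"]


for name, c in (("P110", P110), ("P111", P111)):
    print(f"{name}: hit rate {hit_rate(c):.1%}, aux share of misses {aux_share(c):.1%}")
```

Under those assumptions the load hit rate moves from roughly 45.8% to 48.7%, and about 69% of P111 load misses are served by the auxiliary owner.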

Memory Stalls

Memory stalls (P111 nonblocking-load workload): 58,736,857 stalls, 62,400,730 handshakes
  1. instruction fetch: 28,806,378 stalls (49%), 44,160,113 req
  2. data load: 10,088,549 stalls (17.2%), 559,937 req
  3. data store: 10,901,554 stalls (18.6%), 78,132 req
  4. atomic memory op: 173,799 stalls (0.3%), 166,836 req
  5. page walk for fetch: 680,798 stalls (1.2%), 674,644 req
  6. page walk for load/store: 673,615 stalls (1.1%), 667,441 req
  7. other: 7,412,164 stalls (12.6%), 16,093,627 req

P111 adds useful overlap but not yet enough control over when that overlap is worth taking.
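The stall breakdown can be cross-checked and normalised per request. The stalls-per-request ratio below is a derived figure, not one reported by the source, and it assumes stalls and requests are counted over the same window; the names are illustrative.

```python
# (label, stalls, requests) copied from the Memory Stalls breakdown.
ROWS = [
    ("instruction fetch", 28_806_378, 44_160_113),
    ("data load", 10_088_549, 559_937),
    ("data store", 10_901_554, 78_132),
    ("atomic memory op", 173_799, 166_836),
    ("page walk for fetch", 680_798, 674_644),
    ("page walk for load/store", 673_615, 667_441),
    ("other", 7_412_164, 16_093_627),
]
TOTAL_STALLS = 58_736_857
TOTAL_HANDSHAKES = 62_400_730


def stalls_per_request(stalls: int, requests: int) -> float:
    """Average stalls attributed to each request of a given class."""
    return stalls / requests


for label, stalls, reqs in ROWS:
    share = 100.0 * stalls / TOTAL_STALLS
    print(f"{label:26s} {share:5.1f}%  {stalls_per_request(stalls, reqs):8.2f} stalls/req")
```

The rows sum exactly to the reported totals. Per request, data loads cost roughly 18 stalls and data stores roughly 140, while instruction fetch stays below one, so the data side is expensive per access even though fetch dominates the total.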

Shell Phases

Shell phases (P111 shell workload): 218,643,837 cycles, CPI 2.53
  1. kernel banner to /init: 116,719,007 cycles (53.5%)
  2. /init to shell banner: 1,062,143 cycles (0.5%)
  3. shell banner to first command: 35,467,908 cycles (16.3%)
  4. echo command: 1,649 cycles (0%)
  5. uname -a: 1,944,631 cycles (0.9%)
  6. ls /bin /usr/share: 32,960,913 cycles (15.1%)
  7. cat sample file: 3,150,246 cycles (1.4%)
  8. touch/write/cat/rm /tmp file: 10,880,234 cycles (5%)
  9. 8x ash loop with file I/O: 15,828,359 cycles (7.3%)
  10. final marker: 680 cycles (0%)

The shell script reaches P111-FILE-OK.

Cycle Shape

State breakdown (P111 nonblocking-load workload): 218,643,837 cycles, CPI 2.53
  1. fetch: 7,626,319 cycles (3.5%)
  2. execute: 86,340,348 cycles (39.5%)
  3. mem: 28,003,831 cycles (12.8%)
  4. walker: 2,696,498 cycles (1.2%)
  5. writeback: 86,315,546 cycles (39.5%)
  6. mul/div: 7,659,579 cycles (3.5%)

P111 retires 86.32M instructions (86,315,546) at CPI 2.5331.

Hot Functions

Hot functions (P111 BusyBox shell symbols): 63,249 samples, one every 1,024 cycles
  1. printf_core (busybox): 5.5%, 3,491 samples
  2. memset (kernel): 5.2%, 3,306 samples
  3. memcpy (busybox): 3.7%, 2,364 samples
  4. vruntime_eligible (kernel): 3.4%, 2,125 samples
  5. blake2s_compress_generic (kernel): 2.8%, 1,792 samples
  6. memcpy (kernel): 2.7%, 1,693 samples
  7. __fwritex (busybox): 2.5%, 1,606 samples
  8. handle_exception (kernel): 1.7%, 1,092 samples
  9. unmap_page_range (kernel): 1.6%, 1,034 samples
  10. memset (busybox): 1.4%, 887 samples
  11. avg_vruntime (kernel): 1.4%, 868 samples
  12. n_tty_write (kernel): 1.3%, 832 samples
  13. ret_from_exception (kernel): 1.3%, 791 samples
  14. next_uptodate_folio (kernel): 1.1%, 666 samples
  15. n_tty_read (kernel): 1%, 644 samples
  16. (remaining): 55.1%, 34,834 samples

The software workload is unchanged; the architectural experiment is the data-side load owner.
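Sample counts can be converted back into approximate cycle counts using the 1,024-cycle sampling period. The sketch below does that and checks the total against the P111 shell window from the Timing table; that the profiler samples exactly the shell window is an inference from how closely the numbers match, not something the source states.

```python
SAMPLE_PERIOD = 1_024              # cycles between samples, from the profile header
TOTAL_SAMPLES = 63_249             # from the profile header
SHELL_WINDOW_CYCLES = 64_766_712   # P111 shell window cycles, from the Timing table


def samples_to_cycles(samples: int) -> int:
    """Approximate cycles represented by a number of profile samples."""
    return samples * SAMPLE_PERIOD


covered = samples_to_cycles(TOTAL_SAMPLES)
print(f"profile covers ~{covered:,} cycles; shell window is {SHELL_WINDOW_CYCLES:,}")
print(f"printf_core (busybox): ~{samples_to_cycles(3_491):,} cycles")
```

The covered cycle count differs from the shell window by less than one sampling period, which is consistent with the profile spanning that window.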

Next

P112 should keep the load owner but add a tiny response queue or MSHR-like record so the policy can distinguish useful overlap from fill traffic that only perturbs the frontend.
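One possible shape for such a record is sketched below as a behavioural model. Nothing here is taken from the P112 design: the field names, the queue depth, and the in-order completion rule are all hypothetical and only illustrate what "a tiny response queue or MSHR-like record" could track.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuxLoadRecord:
    """Hypothetical MSHR-like record for one outstanding aux-owned load."""
    addr: int         # aligned miss address
    dest_reg: int     # architectural destination register
    issue_cycle: int  # cycle on which the auxiliary request was issued


class AuxResponseQueue:
    """Tiny in-order queue of outstanding auxiliary loads (assumed depth)."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth
        self.entries: deque = deque()

    def can_accept(self) -> bool:
        return len(self.entries) < self.depth

    def issue(self, rec: AuxLoadRecord) -> bool:
        """Track a new request; refuse it when the queue is full."""
        if not self.can_accept():
            return False
        self.entries.append(rec)
        return True

    def complete(self, addr: int) -> Optional[AuxLoadRecord]:
        """Retire the oldest record if the response address matches it."""
        if self.entries and self.entries[0].addr == addr:
            return self.entries.popleft()
        return None


q = AuxResponseQueue(depth=1)
print(q.issue(AuxLoadRecord(addr=0x1000, dest_reg=5, issue_cycle=10)))  # accepted
print(q.issue(AuxLoadRecord(addr=0x1008, dest_reg=6, issue_cycle=11)))  # refused: full
```

Keeping the issue cycle in the record is one way a policy could later measure how long a response sat in the queue and decide whether a given overlap was worth taking.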