Project No. 112 of 147 on the ladder

Aux response queue

Introduces: one-entry aux load response queue; MSHR-shaped load response record; aux queue counters

harden state
last run 2026-05-06
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P112 adds a one-entry registered queue for AUX_OWNER_LOAD. P111 proved the aux lane can carry architectural load data; P112 makes that response a pending object with enqueue/dequeue/drop counters.
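
As a rough illustration of the mechanism described above, here is a minimal behavioral sketch in Python of a one-entry queue with enqueue, dequeue, and drop counters. It is not the project's RTL. The class and field names (`LoadResponse`, `OneEntryAuxQueue`, `rd`, `data`, `error`) are hypothetical, and the rule that a response arriving at a full slot is dropped and counted is an assumption inferred from the "full drops" counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadResponse:
    """Hypothetical MSHR-shaped record; the field names are illustrative."""
    rd: int              # destination register index
    data: int            # load data returned on the aux lane
    error: bool = False  # response carried an error


class OneEntryAuxQueue:
    """Behavioral model of a one-entry queue with event counters."""

    def __init__(self) -> None:
        self.slot: Optional[LoadResponse] = None
        self.enqueues = 0
        self.dequeues = 0
        self.full_drops = 0

    def enqueue(self, resp: LoadResponse) -> bool:
        # Assumption: a response that finds the slot occupied is dropped.
        if self.slot is not None:
            self.full_drops += 1
            return False
        self.slot = resp
        self.enqueues += 1
        return True

    def dequeue(self) -> Optional[LoadResponse]:
        # Hand the pending response to the consumer and free the slot.
        if self.slot is None:
            return None
        resp, self.slot = self.slot, None
        self.dequeues += 1
        return resp


q = OneEntryAuxQueue()
q.enqueue(LoadResponse(rd=5, data=0x1234))
q.enqueue(LoadResponse(rd=6, data=0x5678))  # slot is full, so this is dropped
first = q.dequeue()
print(first.rd, q.enqueues, q.dequeues, q.full_drops)  # prints: 5 1 1 1
```

In this model the response becomes a pending object between `enqueue` and `dequeue`, which is the boundary the registered queue makes explicit.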

Result

check                                         result
make check-tools                              PASS
Verilator build                               PASS
Linux reaches /init                           PASS
BusyBox prompt                                PASS
BusyBox shell workload reaches P112-FILE-OK   PASS
Aux queue enqueues/dequeues nonzero           PASS
Aux queue full drops                          PASS
Speedup against P111                          FAIL
Hardened layout                               NOT RUN

Timing

metric                    P111 load aux   P112 queued load aux
post-load cycles            218,643,837            227,966,087
shell window cycles          64,766,712             71,950,542
retired instructions         86,315,546             88,125,232
CPI                              2.5331                 2.5868
S_FETCH cycles                7,626,319              7,694,368
S_MEM cycles                 27,724,605             32,192,993
shell FILE-OK milestone     218,643,980            227,966,230
kernel panic milestone                0                      0

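This is a mechanism PASS and a performance regression. The queue is correctly exercised, but adding a registered completion cycle to every eager aux load is too expensive: post-load cycles rise from 218,643,837 to 227,966,087 (about 4.3%) and CPI rises from 2.5331 to 2.5868.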
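
The regression can be recomputed from the Timing figures. The short Python sketch below derives CPI and the relative change for two of the metrics. The dictionary keys and helper names are illustrative, and the numbers are copied from the table.

```python
# Values copied from the Timing table (P111 vs P112).
p111 = {"cycles": 218_643_837, "instret": 86_315_546, "s_mem": 27_724_605}
p112 = {"cycles": 227_966_087, "instret": 88_125_232, "s_mem": 32_192_993}


def cpi(run):
    """Cycles per retired instruction over the post-load window."""
    return run["cycles"] / run["instret"]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


print(f"CPI: P111 {cpi(p111):.4f}, P112 {cpi(p112):.4f}")
print(f"post-load cycles: {pct_change(p111['cycles'], p112['cycles']):+.2f}%")
print(f"S_MEM cycles:     {pct_change(p111['s_mem'], p112['s_mem']):+.2f}%")
```

S_MEM grows proportionally far more than total cycles (roughly 16% against roughly 4%), which is consistent with the added completion cycle landing in the memory state.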

Queue Counters

counter                  value
aux load responses   3,677,395
queue enqueues       3,677,395
queue dequeues       3,677,395
load dequeues        3,677,395
full drops                   0
aux errors                   0
aux cancels                  0
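
The counters above can be cross-checked mechanically. The sketch below encodes three conservation rules in Python. The rules themselves are assumptions about what the counters mean (every response is either enqueued or dropped, the queue ends the run empty, and only load responses use the queue), not documented properties of the design.

```python
# Values copied from the Queue Counters table.
counters = {
    "aux load responses": 3_677_395,
    "queue enqueues": 3_677_395,
    "queue dequeues": 3_677_395,
    "load dequeues": 3_677_395,
    "full drops": 0,
    "aux errors": 0,
    "aux cancels": 0,
}


def check_queue_counters(c):
    """Return a list of violated invariants (empty means consistent)."""
    problems = []
    # Assumed: every response is either enqueued or dropped on a full queue.
    if c["queue enqueues"] + c["full drops"] != c["aux load responses"]:
        problems.append("responses != enqueues + full drops")
    # Assumed: the queue is empty at the end of the run.
    if c["queue enqueues"] != c["queue dequeues"]:
        problems.append("queue not drained")
    # Assumed: only load responses travel through this queue.
    if c["load dequeues"] != c["queue dequeues"]:
        problems.append("non-load entries were dequeued")
    return problems


print(check_queue_counters(counters))  # prints: []
```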

Memory Stalls

memory stalls: P112 queued-aux workload (stalls 63,147,391; handshakes 60,726,515)
  category                         stalls   share     requests
  1. instruction fetch         32,536,349   51.5%   42,038,004
  2. data load                 10,257,734   16.2%      579,526
  3. data store                11,234,239   17.8%       81,302
  4. atomic memory op             179,382    0.3%      171,842
  5. page walk for fetch          711,744    1.1%      705,589
  6. page walk for load/store     699,768    1.1%      693,590
  7. other                      7,528,175   11.9%   16,456,662

P112 makes the pending-response boundary explicit. It does not yet let independent work run past that pending load.
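
One way to read the stall breakdown is to divide each category's stall count by its request count, which gives a rough per-request cost. The Python sketch below does that with the figures copied from the list above. It assumes the "req" figure counts completed handshakes for the same category, which the source does not state explicitly.

```python
# (category, stalls, requests) copied from the Memory Stalls breakdown.
stalls = [
    ("instruction fetch", 32_536_349, 42_038_004),
    ("data load", 10_257_734, 579_526),
    ("data store", 11_234_239, 81_302),
    ("atomic memory op", 179_382, 171_842),
    ("page walk for fetch", 711_744, 705_589),
    ("page walk for load/store", 699_768, 693_590),
    ("other", 7_528_175, 16_456_662),
]

total_stalls = sum(s for _, s, _ in stalls)
total_reqs = sum(r for _, _, r in stalls)


def stalls_per_request(name):
    """Average stall count per request for one category."""
    for cat, s, r in stalls:
        if cat == name:
            return s / r
    raise KeyError(name)


for cat, s, r in stalls:
    share = 100.0 * s / total_stalls
    print(f"{cat:26s} {share:5.1f}%  {s / r:8.2f} stalls/req")
```

The category sums reproduce the reported totals (63,147,391 stalls and 60,726,515 handshakes), so the breakdown is internally consistent.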

Shell Phases

shell phases: P112 shell workload (cycles 227,966,087; CPI 2.59)
  phase                                    cycles   share
  1. kernel banner to /init           117,955,941   51.9%
  2. /init to shell banner              1,093,301    0.5%
  3. shell banner to first command     36,336,899     16%
  4. echo command                           1,649      0%
  5. uname -a                           2,029,963    0.9%
  6. ls /bin /usr/share                34,198,815     15%
  7. cat sample file                    2,769,009    1.2%
  8. touch/write/cat/rm /tmp file       9,632,564    4.2%
  9. 8x ash loop with file I/O         21,920,850    9.6%
  10. final marker                      1,397,692    0.6%

The shell script reaches P112-FILE-OK.

Cycle Shape

state breakdown: P112 queued-aux workload (cycles 227,966,087; CPI 2.59)
  state          share       cycles
  1. fetch        3.4%    7,694,368
  2. execute     38.7%   88,150,944
  3. mem         14.2%   32,479,725
  4. walker       1.2%    2,810,691
  5. writeback   38.7%   88,125,232
  6. mul/div      3.8%    8,703,423

S_MEM grows sharply, from 27,724,605 cycles in P111 to 32,192,993 in P112 (about 16%, per the Timing table), which is the cost of queueing without a better issue policy.

Hot Functions

hot functions: P112 BusyBox shell symbols (samples 70,264; one sample every 1,024 cycles)
  symbol                         image     share   samples
  1. printf_core                 busybox    5.1%     3,595
  2. memset                      kernel     4.9%     3,425
  3. vruntime_eligible           kernel     3.9%     2,740
  4. memcpy                      busybox    3.4%     2,358
  5. blake2s_compress_generic    kernel     2.6%     1,797
  6. memcpy                      kernel     2.5%     1,763
  7. __fwritex                   busybox    2.5%     1,730
  8. handle_exception            kernel     1.7%     1,209
  9. avg_vruntime                kernel     1.7%     1,175
  10. unmap_page_range           kernel     1.6%     1,088
  11. ret_from_exception         kernel     1.3%       910
  12. update_curr                kernel     1.2%       842
  13. n_tty_write                kernel     1.2%       832
  14. memset                     busybox    1.2%       825
  15. n_tty_read                 kernel       1%       710
  16. (remaining)                            56%    39,343
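
Because the profile takes one sample every 1,024 cycles, sample counts convert directly to approximate cycle counts. The Python sketch below shows the conversion with the figures from the listing above. The helper names are illustrative.

```python
SAMPLE_PERIOD = 1_024    # cycles between samples, from the profile header
TOTAL_SAMPLES = 70_264   # total samples, from the profile header


def samples_to_cycles(samples, period=SAMPLE_PERIOD):
    """Approximate cycles attributed to a symbol with this many samples."""
    return samples * period


def share(samples, total=TOTAL_SAMPLES):
    """Percentage of all samples that landed in the symbol."""
    return 100.0 * samples / total


print(samples_to_cycles(3_595), f"{share(3_595):.1f}%")  # printf_core
print(samples_to_cycles(TOTAL_SAMPLES))                  # whole profile
```

The whole profile comes to 71,950,336 cycles, within one sample period of the 71,950,542 shell window cycles in the Timing table, which suggests the sampler covers the shell window rather than the full post-load run.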

Next

P113 should keep the queue but reduce the number of aux-load issues. The policy needs to account for frontend pressure, D-cache line state, and background-fill debt.
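
To make the shape of such a policy concrete, here is a purely hypothetical Python sketch of an issue gate that consults the three inputs named above. None of the field names, the threshold, or the decision order come from the project. They are placeholders chosen only to show how the inputs could combine into a single issue decision.

```python
from dataclasses import dataclass


@dataclass
class IssueContext:
    """Hypothetical snapshot of the inputs an issue policy might see."""
    frontend_stalled: bool   # assumed signal: fetch is waiting on memory
    dcache_line_valid: bool  # assumed signal: target line is already cached
    fill_debt: int           # assumed count of outstanding background fills


MAX_FILL_DEBT = 2  # illustrative threshold, not a measured value


def should_issue_aux_load(ctx: IssueContext) -> bool:
    """Decide whether to issue an eager aux load under the assumed policy."""
    # Frontend pressure: do not compete with a stalled fetch.
    if ctx.frontend_stalled:
        return False
    # D-cache line state: a valid line can serve the load without the aux lane.
    if ctx.dcache_line_valid:
        return False
    # Background-fill debt: stop issuing once too many fills are outstanding.
    return ctx.fill_debt < MAX_FILL_DEBT


print(should_issue_aux_load(IssueContext(False, False, 0)))  # prints: True
print(should_issue_aux_load(IssueContext(True, False, 0)))   # prints: False
```

Whether each condition should suppress or encourage an aux load is a design question for P113. The sketch only fixes one plausible reading.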