Project No. 110 of 147 on the ladder

Tagged auxiliary response

introduces: tagged auxiliary response record; owner-counted aux lane; explicit aux cancel/error fields

harden state
last run   2026-05-06
signoff
  • DRC      NOT RUN
  • LVS      NOT RUN
  • antenna  NOT RUN

P110 turns the auxiliary lower-bank response into a tagged microarchitectural record. P109 proved the response can advance frontend state. P110 gives that response owner, address, data, error, and cancel fields so later nonblocking paths do not have to infer ownership from local candidate wires.

Result

check                                             result
make check-tools                                  PASS
Verilator build                                   PASS
Linux reaches /init                               PASS
BusyBox prompt                                    PASS
BusyBox shell workload reaches P110-FILE-OK       PASS
Tagged auxiliary response owner counters nonzero  PASS
Auxiliary read errors                             PASS
Hardened layout                                   NOT RUN

Timing

metric                    P109 demand prefetch  P110 tagged response
post-load cycles          218,922,720           217,717,374
shell window cycles       65,023,598            63,761,231
retired instructions      86,402,301            86,014,057
CPI                       2.5338                2.5312
S_FETCH cycles            7,627,570             7,613,966
BusyBox ready milestone   118,416,748           118,428,463
shell FILE-OK milestone   218,922,863           217,717,517
kernel panic milestone    0                     0

P110 is primarily a contract rung. The measured shell window is 1,262,367 cycles (about 1.9%) faster than P109, but the important result is that the P109 behavior now flows through one tagged response shape.
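
The comparison can be re-derived from the Timing table. The sketch below is only a cross-check: the dictionary keys are illustrative names, and it assumes CPI is post-load cycles divided by retired instructions, which matches the reported values but is not stated explicitly in the report.

```python
# Values copied from the Timing table; key names are illustrative only.
p109 = {"post_load": 218_922_720, "shell_window": 65_023_598, "retired": 86_402_301}
p110 = {"post_load": 217_717_374, "shell_window": 63_761_231, "retired": 86_014_057}


def cpi(run):
    """Cycles per instruction, assuming CPI = post-load cycles / retired instructions."""
    return run["post_load"] / run["retired"]


# Shell-window improvement of P110 over P109, in cycles and as a percentage.
shell_delta = p109["shell_window"] - p110["shell_window"]
shell_gain_pct = 100.0 * shell_delta / p109["shell_window"]

print(f"shell window delta: {shell_delta:,} cycles ({shell_gain_pct:.2f}%)")
print(f"CPI: P109 {cpi(p109):.4f}, P110 {cpi(p110):.4f}")
```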

Response Slot

field   meaning
valid   the auxiliary memory model returned a response
owner   which blocked request class owns the response
addr    physical word address of the auxiliary read
data    returned word
error   memory model reported an error
cancel  response was invalidated by frontend cancellation rules
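
As a software model, the slot could be written roughly as follows. This is a minimal sketch, not the project's RTL: the field names come from the table above and the owner classes from the owner-count table, while the class names, the Python types, and the `usable` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Owner(Enum):
    """Request classes that can own a response (names from the owner-count table)."""
    FETCH = auto()
    EXECUTE_PREFETCH = auto()
    WRITEBACK_PREFETCH = auto()
    ICACHE_BACKGROUND = auto()
    FETCH_PAGE_TABLE_WALK = auto()
    DATA_LOAD = auto()
    FP_LOAD = auto()
    DATA_PAGE_TABLE_WALK = auto()
    DCACHE_BACKGROUND = auto()


@dataclass(frozen=True)
class AuxResponse:
    """Hypothetical model of the tagged auxiliary response slot."""
    valid: bool           # the auxiliary memory model returned a response
    owner: Owner          # which blocked request class owns the response
    addr: int             # physical word address of the auxiliary read
    data: int             # returned word
    error: bool = False   # memory model reported an error
    cancel: bool = False  # invalidated by frontend cancellation rules

    def usable(self) -> bool:
        """Assumed consumer rule: accept only valid, error-free, uncancelled responses."""
        return self.valid and not self.error and not self.cancel
```

With a record like this, a consumer checks `owner` and `usable()` on the response itself instead of inferring ownership from local candidate wires.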

Owner counts from the measured run:

owner                  responses
fetch                  0
execute prefetch       0
writeback prefetch     488,037
I-cache background     0
fetch page-table walk  0
data load              0
FP load                0
data page-table walk   0
D-cache background     9,984,598
errors                 0
cancels                0

Auxiliary Consumers

consumer                      consumed    shell-window consumed
S_WB demand prefetch bypass   488,037     327,112
plain S_FETCH demand fetch    0           0
I-cache background fill       0           0
D-cache background fill       9,984,598   4,024,110

counter                              value
auxiliary instruction reads serviced 488,037
auxiliary data reads serviced        9,984,598
auxiliary reads serviced total       10,472,635
shell-window auxiliary reads         4,351,222
auxiliary read errors                0
auxiliary read checksum              302,646,632
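
The owner counts, consumer counts, and service counters above are mutually consistent. The sketch below re-checks that; pairing the writeback-prefetch owner with the S_WB bypass consumer, and the D-cache background owner with the D-cache fill consumer, is an inference from the matching counts rather than something the report states.

```python
# Nonzero owner counts from the measured run.
owner_responses = {
    "writeback prefetch": 488_037,
    "D-cache background": 9_984_598,
}

# Nonzero consumers as (consumed, shell-window consumed).
consumers = {
    "S_WB demand prefetch bypass": (488_037, 327_112),
    "D-cache background fill": (9_984_598, 4_024_110),
}

total_consumed = sum(total for total, _ in consumers.values())
shell_consumed = sum(shell for _, shell in consumers.values())

# Every owned response should be consumed exactly once if the lane is lossless.
print("owned == consumed:", sum(owner_responses.values()) == total_consumed)
print(f"total {total_consumed:,}, shell window {shell_consumed:,}")
```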

Memory Stalls

memory stalls: P110 tagged auxiliary-response workload; stalls 58,219,792; handshakes 63,975,890

 #  source                    stalls      share  requests
 1  instruction fetch         28,166,726  48.4%  45,065,903
 2  data load                 10,338,257  17.8%  554,572
 3  data store                10,872,486  18.7%  76,749
 4  atomic memory op          172,382     0.3%   166,671
 5  page walk for fetch       675,086     1.2%   668,932
 6  page walk for load/store  667,552     1.1%   661,383
 7  other                     7,327,303   12.6%  16,781,680
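
The shares in the breakdown above can be recomputed from the raw counts. The following sketch also derives stalls per request, which is not a figure the report gives; it is included only as one assumed way to read the two columns together.

```python
# (source, stalls, requests) rows copied from the memory-stall breakdown.
rows = [
    ("instruction fetch", 28_166_726, 45_065_903),
    ("data load", 10_338_257, 554_572),
    ("data store", 10_872_486, 76_749),
    ("atomic memory op", 172_382, 166_671),
    ("page walk for fetch", 675_086, 668_932),
    ("page walk for load/store", 667_552, 661_383),
    ("other", 7_327_303, 16_781_680),
]

total_stalls = sum(stalls for _, stalls, _ in rows)
total_requests = sum(reqs for _, _, reqs in rows)

for name, stalls, reqs in rows:
    share = 100.0 * stalls / total_stalls
    per_request = stalls / reqs  # derived ratio, not reported in the source
    print(f"{name:26s} {share:5.1f}%  {per_request:8.2f} stalls/request")

print(f"totals: {total_stalls:,} stalls, {total_requests:,} requests")
```

Read this way, data stores and data loads carry far more stalls per request than instruction fetch, even though instruction fetch has the largest absolute share.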

The lower-memory lane is still same-cycle in the Verilator contract. P110 makes the response contract explicit enough for a later queued response or load-miss owner.

Shell Phases

shell phases: P110 shell workload; cycles 217,717,374; CPI 2.53

  #  phase                            cycles       share
  1  kernel banner to /init           116,722,764  53.8%
  2  /init to shell banner            1,077,489    0.5%
  3  shell banner to first command    35,527,823   16.4%
  4  echo command                     1,649        0%
  5  uname -a                         2,516,026    1.2%
  6  ls /bin /usr/share               31,857,432   14.7%
  7  cat sample file                  2,844,771    1.3%
  8  touch/write/cat/rm /tmp file     10,527,184   4.9%
  9  8x ash loop with file I/O        16,013,487   7.4%
 10  final marker                     682          0%

The shell script reaches P110-FILE-OK.

Cycle Shape

state breakdown: P110 tagged auxiliary-response workload; cycles 217,717,374; CPI 2.53

 #  state      share  cycles
 1  fetch      3.5%   7,613,966
 2  execute    39.5%  86,038,725
 3  mem        12.8%  27,886,634
 4  walker     1.2%   2,672,953
 5  writeback  39.5%  86,014,057
 6  mul/div    3.4%   7,489,323

P110 retires 86.01M instructions at CPI 2.5312.
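
Two cross-checks on the state breakdown are cheap to run. The sketch below uses only numbers from this page; reading the writeback count as one writeback cycle per retired instruction is an interpretation of the matching figures, not a statement from the report.

```python
# Per-state cycle counts copied from the state breakdown.
states = {
    "fetch": 7_613_966,
    "execute": 86_038_725,
    "mem": 27_886_634,
    "walker": 2_672_953,
    "writeback": 86_014_057,
    "mul/div": 7_489_323,
}
total_cycles = 217_717_374  # post-load cycles for P110
retired = 86_014_057        # retired instructions for P110

# Cycles not attributed to any of the six listed states.
residual = total_cycles - sum(states.values())

print(f"unattributed cycles: {residual:,}")
print("writeback cycles == retired instructions:", states["writeback"] == retired)
```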

Hot Functions

hot functions: P110 BusyBox shell symbols; 62,267 samples, one every 1,024 cycles

  #  symbol                    image    share  samples
  1  printf_core               busybox  5.8%   3,605
  2  memset                    kernel   5.3%   3,269
  3  memcpy                    busybox  3.7%   2,295
  4  vruntime_eligible         kernel   3.2%   2,008
  5  blake2s_compress_generic  kernel   2.9%   1,806
  6  __fwritex                 busybox  2.8%   1,710
  7  memcpy                    kernel   2.7%   1,695
  8  handle_exception          kernel   1.7%   1,077
  9  unmap_page_range          kernel   1.6%   1,013
 10  n_tty_write               kernel   1.3%   833
 11  memset                    busybox  1.3%   824
 12  avg_vruntime              kernel   1.3%   806
 13  ret_from_exception        kernel   1.2%   743
 14  next_uptodate_folio       kernel   1.1%   700
 15  n_tty_read                kernel   1%     590
 16  (remaining)                        54.9%  34,211
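
The sample count can be related to the cycle counts reported earlier. This sketch multiplies samples by the sampling period and compares the result with the P110 shell-window cycle count; that the profile covers exactly the shell window is an inference from how closely the numbers agree, not something the page states.

```python
samples = 62_267            # total samples in the hot-function profile
period = 1_024              # one sample every 1,024 cycles
shell_window = 63_761_231   # P110 shell window cycles from the Timing section

covered_cycles = samples * period
gap = covered_cycles - shell_window

# Share of one symbol, e.g. printf_core in busybox with 3,605 samples.
printf_core_share = 100.0 * 3_605 / samples

print(f"cycles covered by samples: {covered_cycles:,}")
print(f"difference from shell window: {gap:,} cycles (less than one period: {abs(gap) < period})")
print(f"printf_core share: {printf_core_share:.1f}%")
```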

The software workload is unchanged; the architecture change is response ownership.

Next

P111 should use the tagged slot for a real nonblocking data-side path: probably an aligned integer load miss first, with explicit cancellation for traps, stores to the same word, and D-cache invalidation.
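
One way to picture those cancellation rules is a small predicate over an outstanding load. Everything in this sketch is hypothetical: P111 does not exist on this page, and the names `PendingLoad` and `must_cancel`, the event parameters, and the word-address comparison are assumptions chosen to mirror the three cases listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PendingLoad:
    """Hypothetical outstanding aligned integer load, keyed by physical word address."""
    word_addr: int


def must_cancel(
    pending: PendingLoad,
    *,
    trap: bool = False,
    store_word_addr: Optional[int] = None,
    dcache_invalidate: bool = False,
) -> bool:
    """Return True if the in-flight response should be marked cancelled.

    Assumed rules, taken from the three cases named for P111:
    a trap, a store to the same word, or a D-cache invalidation.
    """
    if trap or dcache_invalidate:
        return True
    if store_word_addr is not None:
        return store_word_addr == pending.word_addr
    return False


load = PendingLoad(word_addr=0x2000)
print(must_cancel(load, store_word_addr=0x2000))  # store to the same word
print(must_cancel(load, store_word_addr=0x2001))  # store to a different word
```

A rule set like this would set the `cancel` field on the tagged response rather than dropping it silently, so the cancel counter stays meaningful.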