P112 adds a one-entry registered queue for AUX_OWNER_LOAD. P111 proved
the aux lane can carry architectural load data; P112 makes that response
a pending object with enqueue/dequeue/drop counters.
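
The behavior described above can be sketched as a small cycle-level model. This is a minimal illustration, not the RTL: the class name, field names, and the choice to let a push reuse a slot freed by a pop in the same cycle are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OneEntryQueue:
    """Hypothetical behavioral model of a one-entry registered queue."""
    slot: Optional[int] = None  # the single pending response, if any
    enqueues: int = 0
    dequeues: int = 0
    full_drops: int = 0

    def step(self, push: Optional[int], pop: bool) -> Optional[int]:
        """Advance one clock cycle.

        `push` is a response arriving this cycle (or None); `pop` means the
        consumer is ready. Only a value registered in an earlier cycle can
        be returned, so a response is never visible in its arrival cycle.
        """
        out = None
        if pop and self.slot is not None:
            out = self.slot
            self.slot = None
            self.dequeues += 1
        if push is not None:
            if self.slot is None:
                self.slot = push
                self.enqueues += 1
            else:
                # Assumption: a response arriving at a full queue is dropped.
                self.full_drops += 1
        return out


q = OneEntryQueue()
first = q.step(push=0xAB, pop=True)   # None: nothing was registered yet
second = q.step(push=None, pop=True)  # 0xAB: visible one cycle later
```

The one-cycle gap between `first` and `second` is the registered completion cycle that every queued response pays in this model.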
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P112-FILE-OK | PASS |
| Aux queue enqueues/dequeues nonzero | PASS |
| Aux queue full drops | PASS |
| Speedup against P111 | FAIL |
| Hardened layout | NOT RUN |
Timing
| metric | P111 load aux | P112 queued load aux |
|---|---|---|
| post-load cycles | 218,643,837 | 227,966,087 |
| shell window cycles | 64,766,712 | 71,950,542 |
| retired instructions | 86,315,546 | 88,125,232 |
| CPI | 2.5331 | 2.5868 |
| S_FETCH cycles | 7,626,319 | 7,694,368 |
| S_MEM cycles | 27,724,605 | 32,192,993 |
| shell FILE-OK milestone | 218,643,980 | 227,966,230 |
| kernel panic milestone | 0 | 0 |
This is a mechanism PASS and a performance regression: post-load cycles rise 4.3% and shell window cycles rise 11.1% against P111. The queue is correctly exercised, but adding a registered completion cycle to every eager aux load is too expensive.
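
The size of the regression can be recomputed from the Timing table. A minimal sketch of that arithmetic, with values copied from the table; the `pct_change` helper is illustrative and not part of the project tooling.

```python
# (P111, P112) values copied from the Timing table.
timing = {
    "post-load cycles": (218_643_837, 227_966_087),
    "shell window cycles": (64_766_712, 71_950_542),
    "retired instructions": (86_315_546, 88_125_232),
    "S_MEM cycles": (27_724_605, 32_192_993),
}


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for name, (p111, p112) in timing.items():
    print(f"{name}: {pct_change(p111, p112):+.1f}%")
# post-load cycles: +4.3%
# shell window cycles: +11.1%
# retired instructions: +2.1%
# S_MEM cycles: +16.1%

# CPI is post-load cycles divided by retired instructions.
cpi_p111 = timing["post-load cycles"][0] / timing["retired instructions"][0]
cpi_p112 = timing["post-load cycles"][1] / timing["retired instructions"][1]
print(f"CPI: {cpi_p111:.4f} -> {cpi_p112:.4f}")  # CPI: 2.5331 -> 2.5868
```

Cycles grow faster than retired instructions, which is why CPI rises rather than staying flat.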
Queue Counters
| counter | value |
|---|---|
| aux load responses | 3,677,395 |
| queue enqueues | 3,677,395 |
| queue dequeues | 3,677,395 |
| load dequeues | 3,677,395 |
| full drops | 0 |
| aux errors | 0 |
| aux cancels | 0 |
Memory Stalls
- instruction fetch: 32,536,349 (51.5%), 42,038,004 req
- data load: 10,257,734 (16.2%), 579,526 req
- data store: 11,234,239 (17.8%), 81,302 req
- atomic memory op: 179,382 (0.3%), 171,842 req
- page walk for fetch: 711,744 (1.1%), 705,589 req
- page walk for load/store: 699,768 (1.1%), 693,590 req
- other: 7,528,175 (11.9%), 16,456,662 req
P112 makes the pending-response boundary explicit. It does not yet let independent work run past that pending load.
Shell Phases
- kernel banner to /init: 117,955,941 (51.9%)
- /init to shell banner: 1,093,301 (0.5%)
- shell banner to first command: 36,336,899 (16%)
- echo command: 1,649 (0%)
- uname -a: 2,029,963 (0.9%)
- ls /bin /usr/share: 34,198,815 (15%)
- cat sample file: 2,769,009 (1.2%)
- touch/write/cat/rm /tmp file: 9,632,564 (4.2%)
- 8x ash loop with file I/O: 21,920,850 (9.6%)
- final marker: 1,397,692 (0.6%)
The shell script reaches P112-FILE-OK.
Cycle Shape
- fetch: 7,694,368 (3.4%)
- execute: 88,150,944 (38.7%)
- mem: 32,479,725 (14.2%)
- walker: 2,810,691 (1.2%)
- writeback: 88,125,232 (38.7%)
- mul/div: 8,703,423 (3.8%)
S_MEM grows from 27,724,605 cycles in P111 to 32,192,993 in P112 (+16.1%, per the Timing table), which is the cost of queueing without a better issue policy.
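
As a rough cross-check, the S_MEM growth can be divided by the number of aux load responses from the Queue Counters table. This is only an estimate: the two runs retire different instruction counts, so not all of the growth is necessarily due to the queue.

```python
# Values copied from the Timing and Queue Counters tables.
s_mem_p111 = 27_724_605
s_mem_p112 = 32_192_993
aux_load_responses = 3_677_395

extra_s_mem = s_mem_p112 - s_mem_p111
extra_per_aux_load = extra_s_mem / aux_load_responses

print(extra_s_mem)                   # 4468388
print(round(extra_per_aux_load, 2))  # 1.22
```

About 1.2 extra S_MEM cycles per aux load is in line with one added registered completion cycle per queued response.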
Hot Functions
- 5.1% of samples (3,595 samples)
- 4.9% of samples (3,425 samples)
- 3.9% of samples (2,740 samples)
- 3.4% of samples (2,358 samples)
- 2.6% of samples (1,797 samples)
- 2.5% of samples (1,763 samples)
- 2.5% of samples (1,730 samples)
- 1.7% of samples (1,209 samples)
- 1.7% of samples (1,175 samples)
- 1.6% of samples (1,088 samples)
- 1.3% of samples (910 samples)
- 1.2% of samples (842 samples)
- 1.2% of samples (832 samples)
- 1.2% of samples (825 samples)
- 1% of samples (710 samples)
- 56% of samples (39,343 samples)
Next
P113 should keep the queue but reduce the number of aux-load issues. The policy needs to account for frontend pressure, D-cache line state, and background-fill debt.
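
One possible shape for such a policy is sketched below. Everything here is hypothetical: the field names, the line-state encoding, the rule ordering, and the `max_fill_debt` threshold are assumptions for illustration, not a description of what P113 will implement.

```python
from dataclasses import dataclass


@dataclass
class IssueContext:
    """Hypothetical inputs to an aux-load issue decision."""
    fetch_waiting: bool  # frontend has a fetch stalled on memory
    line_state: str      # "hit", "miss", or "filling" (assumed encoding)
    fill_debt: int       # outstanding background-fill beats (assumed unit)


def should_issue_aux_load(ctx: IssueContext, max_fill_debt: int = 4) -> bool:
    """Return True if an eager aux load looks worth issuing."""
    if ctx.fetch_waiting:
        # Frontend pressure: do not compete with a stalled fetch.
        return False
    if ctx.line_state == "hit":
        # Assumption: the D-cache can serve the load without the aux lane.
        return False
    if ctx.fill_debt > max_fill_debt:
        # Too much background fill already owed; defer.
        return False
    return True


idle_miss = IssueContext(fetch_waiting=False, line_state="miss", fill_debt=0)
busy_front = IssueContext(fetch_waiting=True, line_state="miss", fill_debt=0)
print(should_issue_aux_load(idle_miss))   # True
print(should_issue_aux_load(busy_front))  # False
```

Checking frontend pressure first reflects that instruction fetch is the largest entry in the Memory Stalls breakdown (51.5%); the remaining rules and the threshold would need tuning against measurements.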