P111 lets one aligned integer load miss consume the tagged auxiliary response while the main lower-memory port fetches a safe next-PC instruction word. It is the first live data-load owner for the P110 response slot.
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P111-FILE-OK | PASS |
| AUX_OWNER_LOAD responses nonzero | PASS |
| Auxiliary read errors | PASS |
| Speedup against P110 | FAIL |
| Hardened layout | NOT RUN |
Timing
| metric | P110 tagged response | P111 nonblocking load aux |
|---|---|---|
| post-load cycles | 217,717,374 | 218,643,837 |
| shell window cycles | 63,761,231 | 64,766,712 |
| retired instructions | 86,014,057 | 86,315,546 |
| CPI | 2.5312 | 2.5331 |
| S_FETCH cycles | 7,613,966 | 7,626,319 |
| S_MEM cycles | 27,608,346 | 27,724,605 |
| shell FILE-OK milestone | 217,717,517 | 218,643,980 |
| kernel panic milestone | 0 | 0 |
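This is a functionality PASS and a performance regression: relative to P110, post-load cycles rise by 926,463 (about 0.43%) and shell window cycles rise by 1,005,481 (about 1.58%). The tagged load path works, but the first policy is too eager for the shell workload.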
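The size of the regression can be recomputed directly from the Timing table. The following is a minimal sketch that uses only the table values; the dictionary keys and function names are illustrative and do not come from the project's tooling.

```python
# Cycle and instruction counts copied from the Timing table.
P110 = {
    "post_load_cycles": 217_717_374,
    "shell_window_cycles": 63_761_231,
    "retired": 86_014_057,
}
P111 = {
    "post_load_cycles": 218_643_837,
    "shell_window_cycles": 64_766_712,
    "retired": 86_315_546,
}


def regression(metric):
    """Return (extra cycles, percent slowdown) of P111 relative to P110."""
    delta = P111[metric] - P110[metric]
    return delta, 100.0 * delta / P110[metric]


def cpi(run):
    """Cycles per instruction over the post-load window."""
    return run["post_load_cycles"] / run["retired"]


for metric in ("post_load_cycles", "shell_window_cycles"):
    delta, pct = regression(metric)
    print(f"{metric}: +{delta:,} cycles ({pct:.2f}%)")
print(f"CPI: {cpi(P110):.4f} -> {cpi(P111):.4f}")
```

The shell window regresses more in relative terms than the full post-load run, which is consistent with the observation that the policy is too eager for the shell workload specifically.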
Load Owner
P111 fires only when the load miss and next-PC prefetch target different lower-memory banks. The main port fills the fetch queue and I-cache; the auxiliary response fills the D-cache critical word and the architectural load result.
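The firing condition can be sketched as a small predicate. This is a behavioral model under stated assumptions, not the RTL: the bank count, the word-interleaved bank mapping, and all names (`NUM_BANKS`, `bank_of`, `aux_load_may_fire`) are hypothetical, since the report does not specify how lower memory is banked.

```python
# Hypothetical parameters: the report does not give the bank count or interleave.
WORD_BYTES = 4   # assumed 32-bit memory words
NUM_BANKS = 2    # assumed number of lower-memory banks


def bank_of(addr: int) -> int:
    """Assumed word-interleaved mapping from byte address to bank index."""
    return (addr // WORD_BYTES) % NUM_BANKS


def is_aligned(addr: int, size: int) -> bool:
    """True when the access is naturally aligned to its size in bytes."""
    return addr % size == 0


def aux_load_may_fire(load_addr: int, load_size: int, next_pc: int) -> bool:
    """True when an aligned load miss and the next-PC prefetch hit different banks."""
    return is_aligned(load_addr, load_size) and bank_of(load_addr) != bank_of(next_pc)
```

Under these assumptions, `aux_load_may_fire(0x1004, 4, 0x2000)` is true because the two addresses map to different banks, while `aux_load_may_fire(0x1000, 4, 0x2000)` is false because both map to bank 0.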
| owner | P110 responses | P111 responses |
|---|---|---|
| writeback prefetch | 488,037 | 488,110 |
| data load | 0 | 3,545,688 |
| D-cache background | 9,984,598 | 10,384,721 |
| errors | 0 | 0 |
| cancels | 0 | 0 |
D-cache Effect
| counter | P110 | P111 |
|---|---|---|
| load hits | 4,549,121 | 4,878,488 |
| load misses | 5,376,326 | 5,129,479 |
| auxiliary load fills | 0 | 3,545,688 |
| auxiliary background fills | 9,984,598 | 10,384,721 |
| invalidations | 3,031,969 | 3,033,217 |
The load path is real: 3,545,688 load misses, about 69% of the 5,129,479 P111 load misses, complete through the auxiliary owner. The total workload still slows down, which points at scheduling policy and response buffering rather than missing functionality.
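The D-cache counters can be reduced to two ratios: the load hit rate and the share of load misses served by the auxiliary owner. The sketch below uses only the table values; the key names and helper functions are illustrative, and it assumes that auxiliary load fills are a subset of load misses, as the text above describes.

```python
# Counters copied from the D-cache Effect table.
counters = {
    "P110": {"load_hits": 4_549_121, "load_misses": 5_376_326, "aux_load_fills": 0},
    "P111": {"load_hits": 4_878_488, "load_misses": 5_129_479, "aux_load_fills": 3_545_688},
}


def hit_rate(c):
    """Fraction of loads that hit in the D-cache."""
    return c["load_hits"] / (c["load_hits"] + c["load_misses"])


def aux_share(c):
    """Fraction of load misses completed through the auxiliary owner."""
    return c["aux_load_fills"] / c["load_misses"]


for name, c in counters.items():
    print(f"{name}: hit rate {100 * hit_rate(c):.1f}%, "
          f"aux share of misses {100 * aux_share(c):.1f}%")
```

The hit rate improves by roughly three percentage points from P110 to P111 while total cycles still rise, so the regression is not explained by the D-cache hit rate alone.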
Memory Stalls
- instruction fetch 28,806,378 49% 44,160,113 req
- data load 10,088,549 17.2% 559,937 req
- data store 10,901,554 18.6% 78,132 req
- atomic memory op 173,799 0.3% 166,836 req
- page walk for fetch 680,798 1.2% 674,644 req
- page walk for load/store 673,615 1.1% 667,441 req
- other 7,412,164 12.6% 16,093,627 req
P111 adds useful overlap but not yet enough control over when that overlap is worth taking.
Shell Phases
- kernel banner to /init 116,719,007 53.5%
- /init to shell banner 1,062,143 0.5%
- shell banner to first command 35,467,908 16.3%
- echo command 1,649 0%
- uname -a 1,944,631 0.9%
- ls /bin /usr/share 32,960,913 15.1%
- cat sample file 3,150,246 1.4%
- touch/write/cat/rm /tmp file 10,880,234 5%
- 8x ash loop with file I/O 15,828,359 7.3%
- final marker 680 0%
The shell script reaches P111-FILE-OK.
Cycle Shape
- fetch 3.5% 7,626,319
- execute 39.5% 86,340,348
- mem 12.8% 28,003,831
- walker 1.2% 2,696,498
- writeback 39.5% 86,315,546
- mul/div 3.5% 7,659,579
P111 retires 86.32M instructions at CPI 2.5331.
Hot Functions
- 5.5% of samples (3,491 samples)
- 5.2% of samples (3,306 samples)
- 3.7% of samples (2,364 samples)
- 3.4% of samples (2,125 samples)
- 2.8% of samples (1,792 samples)
- 2.7% of samples (1,693 samples)
- 2.5% of samples (1,606 samples)
- 1.7% of samples (1,092 samples)
- 1.6% of samples (1,034 samples)
- 1.4% of samples (887 samples)
- 1.4% of samples (868 samples)
- 1.3% of samples (832 samples)
- 1.3% of samples (791 samples)
- 1.1% of samples (666 samples)
- 1% of samples (644 samples)
- 55.1% of samples (34,834 samples)
The software workload is unchanged; the architectural experiment is the data-side load owner.
Next
P112 should keep the load owner but add a tiny response queue or MSHR-like record so the policy can distinguish useful overlap from fill traffic that only perturbs the frontend.
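One way to picture the proposed record is a small fixed-capacity queue that tracks each outstanding auxiliary load and reports its latency on completion. The sketch below is a software model only and is purely hypothetical: the field names, the capacity of 2, and the `ResponseQueue` interface are assumptions for illustration, not a P112 design.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingLoad:
    """Hypothetical MSHR-like record for one outstanding auxiliary load."""
    addr: int          # miss address
    dest_reg: int      # architectural destination register
    issued_cycle: int  # cycle on which the auxiliary request was sent


class ResponseQueue:
    """Tiny fixed-capacity queue; the default capacity of 2 is an arbitrary assumption."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.entries = deque()

    def can_accept(self) -> bool:
        """True when another auxiliary load can be tracked."""
        return len(self.entries) < self.capacity

    def issue(self, record: PendingLoad) -> bool:
        """Track a new auxiliary load; refuse when the queue is full."""
        if not self.can_accept():
            return False
        self.entries.append(record)
        return True

    def complete(self, cycle: int):
        """Retire the oldest record and return it with its latency in cycles."""
        record = self.entries.popleft()
        return record, cycle - record.issued_cycle
```

A policy built on such a record could, for example, decline to issue when the queue is full or use the observed latencies to decide when overlap is worth taking; which signals actually matter is left open by the report.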