P109 makes the auxiliary instruction response demand-visible for the writeback-prefetch case. P108 used the auxiliary bank to fill the I-cache while the main port drained a store-buffer word. P109 also decodes the returned prefetch word immediately and advances to S_EXECUTE.
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P109-FILE-OK | PASS |
| Auxiliary demand-prefetch bypass counter nonzero | PASS |
| Auxiliary read errors zero | PASS |
| Hardened layout | NOT RUN |
Timing
| metric | P108 aux I-cache fill | P109 demand prefetch |
|---|---|---|
| post-load cycles | 218,960,570 | 218,922,720 |
| shell window cycles | 64,823,658 | 65,023,598 |
| retired instructions | 86,260,233 | 86,402,301 |
| CPI | 2.5384 | 2.5338 |
| S_FETCH cycles | 8,109,410 | 7,627,570 |
| BusyBox ready milestone | 118,420,395 | 118,416,748 |
| shell FILE-OK milestone | 218,960,713 | 218,922,863 |
| kernel panic milestone | 0 | 0 |
The local frontend effect is clear: S_FETCH drops by 481,840 cycles (about 5.9%). The full post-load run is 37,850 cycles shorter than P108 (under 0.02%). The shell window, however, is 199,940 cycles longer in this run (about 0.3%), so this is a mixed workload result, not a clean shell speedup.
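The deltas quoted above can be reproduced directly from the Timing table. This is a minimal sketch; the dictionary layout and the `delta` helper are illustrative names, not part of the test harness.

```python
# Figures copied from the Timing table as (P108, P109) pairs.
timing = {
    "post-load cycles": (218_960_570, 218_922_720),
    "shell window cycles": (64_823_658, 65_023_598),
    "retired instructions": (86_260_233, 86_402_301),
    "S_FETCH cycles": (8_109_410, 7_627_570),
}


def delta(metric):
    """Return (absolute change, percent change) going from P108 to P109."""
    p108, p109 = timing[metric]
    change = p109 - p108
    return change, 100.0 * change / p108


for name in timing:
    change, pct = delta(name)
    print(f"{name:22s} {change:+12,d} {pct:+8.3f}%")
```

Negative values mean P109 spent fewer cycles than P108 on that metric.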
Auxiliary Consumers
| consumer | consumed | shell-window consumed |
|---|---|---|
| S_WB demand prefetch bypass | 488,086 | 327,181 |
| plain S_FETCH demand fetch | 0 | 0 |
| I-cache background fill | 0 | 0 |
| D-cache background fill | 10,062,512 | 4,105,433 |
The plain demand-fetch case staying at 0 is the important negative result. The store buffer prevents S_FETCH from becoming the blocked banked instruction request, so the auxiliary lane never has a plain demand fetch to service. The useful overlap is the S_WB store-buffer drain running alongside the next-PC writeback prefetch.
| counter | value |
|---|---|
| auxiliary instruction reads serviced | 488,086 |
| auxiliary data reads serviced | 10,062,512 |
| auxiliary reads serviced total | 10,550,598 |
| shell-window auxiliary reads | 4,432,614 |
| auxiliary read errors | 0 |
| auxiliary read checksum | 1,857,269,570 |
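As a sanity check, the per-consumer counts should add up to the serviced-read totals in the counter table. A minimal sketch, with the numbers copied from the two tables above and variable names chosen only for illustration:

```python
# (consumed, shell-window consumed) per auxiliary consumer.
consumers = {
    "S_WB demand prefetch bypass": (488_086, 327_181),
    "plain S_FETCH demand fetch": (0, 0),
    "I-cache background fill": (0, 0),
    "D-cache background fill": (10_062_512, 4_105_433),
}

# Totals reported by the counter table.
reported_total = 10_550_598
reported_shell_window = 4_432_614

total = sum(consumed for consumed, _ in consumers.values())
shell_window = sum(shell for _, shell in consumers.values())

# Share of all auxiliary reads that went to the instruction side.
instr_share = 100.0 * consumers["S_WB demand prefetch bypass"][0] / total

print(f"total reads:        {total:,} (reported {reported_total:,})")
print(f"shell-window reads: {shell_window:,} (reported {reported_shell_window:,})")
print(f"instruction share:  {instr_share:.2f}%")
```

Both sums match the reported totals, and the instruction side accounts for under 5% of auxiliary reads; the D-cache background fill dominates the lane.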
Cache Shape
| counter | value |
|---|---|
| I-cache hits | 42,828,957 |
| I-cache fetch-state hits | 4,451,195 |
| I-cache writeback-prefetch hits | 38,377,762 |
| I-cache miss refills | 45,284,972 |
| I-cache aux demand prefetches | 488,086 |
| D-cache load hits | 4,607,962 |
| D-cache load misses | 5,416,558 |
| D-cache aux background fills | 10,062,512 |
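The cache counters can be turned into rough hit rates. This sketch assumes that hits plus miss refills (for the I-cache) and load hits plus load misses (for the D-cache) approximate the total lookups; the report does not state that, so treat the rates as indicative only.

```python
# Counters copied from the Cache Shape table.
icache_fetch_hits = 4_451_195
icache_wb_prefetch_hits = 38_377_762
icache_hits = 42_828_957
icache_miss_refills = 45_284_972
dcache_load_hits = 4_607_962
dcache_load_misses = 5_416_558


def hit_rate(hits, misses):
    """Fraction of lookups that hit, assuming lookups = hits + misses."""
    return hits / (hits + misses)


# The two hit sources should account for every I-cache hit.
hits_add_up = icache_fetch_hits + icache_wb_prefetch_hits == icache_hits

icache_rate = hit_rate(icache_hits, icache_miss_refills)
dcache_rate = hit_rate(dcache_load_hits, dcache_load_misses)
wb_prefetch_share = icache_wb_prefetch_hits / icache_hits

print(f"I-cache hit sources add up: {hits_add_up}")
print(f"I-cache hit rate (assumed denominator): {icache_rate:.3f}")
print(f"D-cache load hit rate (assumed denominator): {dcache_rate:.3f}")
print(f"writeback-prefetch share of I-cache hits: {wb_prefetch_share:.3f}")
```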
Memory Stalls
| source | cycles | share | requests |
|---|---|---|---|
| instruction fetch | 28,295,901 | 48.4% | 45,284,972 |
| data load | 10,380,077 | 17.7% | 556,535 |
| data store | 10,937,181 | 18.7% | 77,513 |
| atomic memory op | 173,631 | 0.3% | 167,445 |
| page walk for fetch | 681,489 | 1.2% | 675,335 |
| page walk for load/store | 671,969 | 1.1% | 665,776 |
| other | 7,351,842 | 12.6% | 16,868,403 |
The lower-memory system still has a single main port plus an auxiliary read lane. P109 shows the aux lane can advance frontend state, but the remaining stalls call for a more general response owner.
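The stall shares can be recomputed from the raw figures, and dividing by the request counts gives a rough cost per request. This sketch assumes the first numeric column in the Memory Stalls list is stall cycles; the report does not label it.

```python
# (cycles, requests) per stall source, copied from the Memory Stalls list.
stalls = {
    "instruction fetch": (28_295_901, 45_284_972),
    "data load": (10_380_077, 556_535),
    "data store": (10_937_181, 77_513),
    "atomic memory op": (173_631, 167_445),
    "page walk for fetch": (681_489, 675_335),
    "page walk for load/store": (671_969, 665_776),
    "other": (7_351_842, 16_868_403),
}

total_cycles = sum(cycles for cycles, _ in stalls.values())


def share(source):
    """Percent of all listed stall cycles attributed to this source."""
    return 100.0 * stalls[source][0] / total_cycles


def cycles_per_request(source):
    """Average cycles per request (assumes column 1 is stall cycles)."""
    cycles, requests = stalls[source]
    return cycles / requests


for source in stalls:
    print(f"{source:26s} {share(source):5.1f}% "
          f"{cycles_per_request(source):8.2f} cycles/req")
```

Under that assumption, instruction fetch is cheap per request but dominates by volume, while data stores are rare and expensive.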
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 116,719,999 | 53.5% |
| /init to shell banner | 1,068,539 | 0.5% |
| shell banner to first command | 35,482,517 | 16.3% |
| echo command | 1,649 | 0% |
| uname -a | 2,496,533 | 1.1% |
| ls /bin /usr/share | 31,832,322 | 14.6% |
| cat sample file | 2,663,980 | 1.2% |
| touch/write/cat/rm /tmp file | 11,795,687 | 5.4% |
| 8x ash loop with file I/O | 16,232,747 | 7.4% |
| final marker | 680 | 0% |
The shell script reaches P109-FILE-OK.
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 3.5% | 7,627,570 |
| execute | 39.5% | 86,427,071 |
| mem | 12.8% | 28,056,808 |
| walker | 1.2% | 2,694,569 |
| writeback | 39.5% | 86,402,301 |
| mul/div | 3.5% | 7,712,685 |
P109 retires 86.40M instructions at CPI 2.5338.
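The CPI figure and a per-state breakdown follow from the cycle counts. A minimal sketch using the P109 numbers from this report; the `per_state_cpi` name is illustrative.

```python
# P109 cycles per CPU state, copied from the Cycle Shape list.
state_cycles = {
    "fetch": 7_627_570,
    "execute": 86_427_071,
    "mem": 28_056_808,
    "walker": 2_694_569,
    "writeback": 86_402_301,
    "mul/div": 7_712_685,
}
retired = 86_402_301            # retired instructions (Timing table)
post_load_cycles = 218_922_720  # post-load cycles (Timing table)

# Overall CPI as reported: post-load cycles per retired instruction.
cpi = post_load_cycles / retired

# Contribution of each state to CPI, in cycles per retired instruction.
per_state_cpi = {state: cycles / retired for state, cycles in state_cycles.items()}

# The listed states cover almost, but not exactly, the post-load total.
unattributed = post_load_cycles - sum(state_cycles.values())

print(f"CPI: {cpi:.4f}")
for state, value in per_state_cpi.items():
    print(f"  {state:10s} {value:.4f}")
print(f"cycles not in any listed state: {unattributed:,}")
```

Writeback contributes exactly one cycle per retired instruction and execute just over one, so the remaining roughly 0.53 cycles per instruction come from fetch, mem, walker, and mul/div.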
Hot Functions
| share of samples | samples |
|---|---|
| 5.6% | 3,555 |
| 5.2% | 3,305 |
| 3.6% | 2,304 |
| 3.5% | 2,197 |
| 2.8% | 1,806 |
| 2.7% | 1,739 |
| 2.7% | 1,680 |
| 1.7% | 1,096 |
| 1.6% | 1,026 |
| 1.3% | 851 |
| 1.3% | 842 |
| 1.3% | 821 |
| 1.2% | 761 |
| 1% | 654 |
| 1% | 643 |
| 55.1% | 34,977 |
The software workload is unchanged; the experiment is whether an auxiliary response can skip a frontend bubble.
Next
P110 should replace the ad hoc consumers with a tiny tagged auxiliary response slot: owner, physical word address, data, valid, and cancel. That gives fetch, prefetch, background fill, and later load-miss service one shared response rule.
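One way to picture the proposed slot is a small behavioral model. The sketch below is not RTL and not the P110 design: the field names follow the list above, but the `Owner` values, the method names, and the exact cancel semantics (a cancelled response is dropped when it arrives) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Owner(Enum):
    """Hypothetical owner tags for the consumers named in the text."""
    FETCH = auto()
    PREFETCH = auto()
    BACKGROUND_FILL = auto()
    LOAD_MISS = auto()


@dataclass
class AuxResponseSlot:
    owner: Optional[Owner] = None  # None means the slot is free
    phys_word_addr: int = 0
    data: int = 0
    valid: bool = False            # response data has arrived
    cancel: bool = False           # request was squashed while in flight

    def issue(self, owner: Owner, phys_word_addr: int) -> bool:
        """Claim the slot for one outstanding read; refuse if it is busy."""
        if self.owner is not None:
            return False
        self.owner, self.phys_word_addr = owner, phys_word_addr
        self.valid = self.cancel = False
        return True

    def squash(self) -> None:
        """Mark the in-flight request cancelled (assumed: e.g. on a redirect)."""
        if self.owner is not None:
            self.cancel = True

    def respond(self, data: int) -> None:
        """Auxiliary lane returns data for the outstanding request."""
        if self.owner is not None:
            self.data, self.valid = data, True

    def consume(self, owner: Owner, phys_word_addr: int) -> Optional[int]:
        """Shared rule: take data only if valid, not cancelled, and tags match."""
        if not self.valid:
            return None
        if self.cancel:
            self._free()  # drop the stale response
            return None
        if self.owner is not owner or self.phys_word_addr != phys_word_addr:
            return None
        data = self.data
        self._free()
        return data

    def _free(self) -> None:
        self.owner, self.valid, self.cancel = None, False, False
```

In this model every consumer calls the same `consume` check, so adding load-miss service later would mean adding an owner tag rather than another ad hoc path.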