P109 tried to turn the auxiliary lower-bank response into a demand-visible frontend result.
The first target was plain demand fetch in S_FETCH, which measured 0 candidates in the BusyBox shell workload. The actual overlap was one state earlier: S_WB can drain a pending store-buffer word on the main lower-memory port while writeback prefetch is blocked on the instruction side.
The final RTL keeps the P108 I-cache fill and also decodes the auxiliary writeback-prefetch response immediately:

    banked_aux_i_wb_pf_fill_fire
    and mem_req_storebuf_for_idle
    and mem_storebuf_grant
    and mem_ready
    and !mem_error

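
The condition is a plain conjunction of five terms. As a rough behavioral sketch (not the RTL itself), it can be modeled in Python, assuming every signal is a single-bit value sampled in the same cycle. The signal names are taken from the expression above; the function name `aux_wb_prefetch_bypass` is hypothetical.

```python
def aux_wb_prefetch_bypass(
    banked_aux_i_wb_pf_fill_fire: bool,
    mem_req_storebuf_for_idle: bool,
    mem_storebuf_grant: bool,
    mem_ready: bool,
    mem_error: bool,
) -> bool:
    """Model of the decode condition: every term must hold in the same cycle."""
    return (
        banked_aux_i_wb_pf_fill_fire
        and mem_req_storebuf_for_idle
        and mem_storebuf_grant
        and mem_ready
        and not mem_error  # a lower-memory error suppresses the bypass
    )


# The bypass fires only when all enables are set and no error is flagged.
print(aux_wb_prefetch_bypass(True, True, True, True, False))
print(aux_wb_prefetch_bypass(True, True, True, True, True))
```

Any single deasserted enable, or an asserted `mem_error`, is enough to block the bypass in this model.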
Result:
| metric | value |
|---|---|
| post-load cycles | 218,922,720 |
| shell-window cycles | 65,023,598 |
| retired instructions | 86,402,301 |
| CPI | 2.5338 |
| S_FETCH cycles | 7,627,570 |
| aux demand prefetch bypasses | 488,086 |
| shell-window aux demand prefetch bypasses | 327,181 |
| aux read errors | 0 |
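
The derived figures in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes CPI is computed as post-load cycles divided by retired instructions, which is consistent with the reported value; the variable names are illustrative only.

```python
# Values copied from the result table.
post_load_cycles = 218_922_720
retired_instructions = 86_402_301
aux_bypasses_total = 488_086
aux_bypasses_shell = 327_181

# Assumption: CPI is measured over the post-load window.
cpi = post_load_cycles / retired_instructions

# Fraction of all aux demand prefetch bypasses that fall inside the shell window.
shell_share = aux_bypasses_shell / aux_bypasses_total

print(f"CPI = {cpi:.4f}")
print(f"shell-window share of bypasses = {shell_share:.1%}")
```

Under that assumption the script reproduces the tabulated CPI, and it shows that about two thirds of the bypasses occur inside the shell window.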
Compared with P108, S_FETCH drops by 481,840 cycles and the post-load cycle count drops by 37,850. The shell window is 199,940 cycles slower in this run, so the honest conclusion is mixed: the microarchitectural bubble is gone, but the workload-level phase timing did not turn into a clean shell speedup.
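
To put those deltas in proportion, the P108 baselines can be reconstructed from the P109 values and the stated differences. This is a sketch: the baselines below are implied by the numbers in this note, not copied from a P108 report, and the dictionary keys are illustrative.

```python
# P109 values from the result table.
p109 = {"s_fetch": 7_627_570, "post_load": 218_922_720, "shell_window": 65_023_598}

# Signed change versus P108 as stated in the text (negative = fewer cycles in P109).
delta = {"s_fetch": -481_840, "post_load": -37_850, "shell_window": +199_940}

implied_p108 = {}
pct_change = {}
for name, cycles in p109.items():
    implied_p108[name] = cycles - delta[name]  # implied baseline, not a measurement
    pct_change[name] = 100.0 * delta[name] / implied_p108[name]
    print(f"{name:13s} P108={implied_p108[name]:>11,d} "
          f"P109={cycles:>11,d} change={pct_change[name]:+.3f}%")
```

On these numbers S_FETCH falls by roughly 5.9%, while the post-load and shell-window changes are both well under 1% and point in opposite directions.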
The next step should be less one-off: add a tiny tagged auxiliary response slot with owner, physical word address, data, valid, and cancel. That should let the design stop threading every aux response through a different FSM special case.
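
One way to picture such a slot is the behavioral model below. It is only a sketch under stated assumptions: the five fields come from the list above, but the `Owner` values, the method names, and the policy of freeing a cancelled entry on the next lookup are all hypothetical and not part of the existing RTL.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Owner(Enum):
    """Hypothetical requesters; the real owner encoding is not specified."""
    DEMAND_FETCH = auto()
    WB_PREFETCH = auto()


@dataclass
class AuxResponseSlot:
    """One-entry tagged slot holding an auxiliary lower-bank response."""
    owner: Optional[Owner] = None
    phys_word_addr: int = 0
    data: int = 0
    valid: bool = False
    cancel: bool = False

    def fill(self, owner: Owner, phys_word_addr: int, data: int) -> None:
        """Capture a response and mark it valid."""
        self.owner, self.phys_word_addr, self.data = owner, phys_word_addr, data
        self.valid, self.cancel = True, False

    def squash(self) -> None:
        """Mark the held response as cancelled so no consumer can use it."""
        if self.valid:
            self.cancel = True

    def take(self, owner: Owner, phys_word_addr: int) -> Optional[int]:
        """Return the data if owner and address match and it was not cancelled."""
        hit = (self.valid and not self.cancel
               and self.owner is owner
               and self.phys_word_addr == phys_word_addr)
        data = self.data if hit else None
        if hit or self.cancel:
            # Free the slot on a successful consume or when the entry was cancelled.
            self.valid = False
            self.cancel = False
        return data


slot = AuxResponseSlot()
slot.fill(Owner.WB_PREFETCH, 0x1000, 0xDEADBEEF)
print(slot.take(Owner.DEMAND_FETCH, 0x1000))    # wrong owner: no data, slot kept
print(hex(slot.take(Owner.WB_PREFETCH, 0x1000)))  # matching tag: data returned
```

With the owner and address carried alongside the data, each consumer only needs a tag compare, rather than its own FSM-specific decode of the auxiliary response.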