P109 banked aux demand prefetch

P109 tried to turn the auxiliary lower-bank response into a demand-visible frontend result.

The first target was a plain S_FETCH demand fetch, but that measured 0 candidates in the BusyBox shell workload. The actual overlap was one state earlier: S_WB can drain a pending store-buffer word on the main lower-memory port while writeback prefetch is blocked on the instruction side.

The final RTL keeps the P108 I-cache fill and also decodes the auxiliary writeback-prefetch response immediately when all of the following hold:

banked_aux_i_wb_pf_fill_fire
and mem_req_storebuf_for_idle
and mem_storebuf_grant
and mem_ready
and !mem_error
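As a rough behavioral sketch, the condition above can be read as a plain conjunction of five signals. The signal names are taken from the listing; the `AuxSignals` container and the `aux_demand_bypass` function are hypothetical names used only for illustration, and the model says nothing about timing or how each signal is generated in the RTL.

```python
from dataclasses import dataclass


@dataclass
class AuxSignals:
    """One-cycle snapshot of the signals named in the condition (hypothetical container)."""
    banked_aux_i_wb_pf_fill_fire: bool
    mem_req_storebuf_for_idle: bool
    mem_storebuf_grant: bool
    mem_ready: bool
    mem_error: bool


def aux_demand_bypass(s: AuxSignals) -> bool:
    """Return True when every term of the conjunction holds and no error is flagged."""
    return (
        s.banked_aux_i_wb_pf_fill_fire
        and s.mem_req_storebuf_for_idle
        and s.mem_storebuf_grant
        and s.mem_ready
        and not s.mem_error
    )


if __name__ == "__main__":
    ok = AuxSignals(True, True, True, True, False)
    print(aux_demand_bypass(ok))
```

Any single deasserted term, or an asserted `mem_error`, suppresses the bypass in this model.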

Result:

metric	value
post-load cycles	218,922,720
shell-window cycles	65,023,598
retired instructions	86,402,301
CPI	2.5338
S_FETCH cycles	7,627,570
aux demand prefetch bypasses	488,086
shell-window aux demand prefetch bypasses	327,181
aux read errors	0

Compared with P108, S_FETCH drops by 481,840 cycles and post-load cycles drop by 37,850. The shell window, however, is 199,940 cycles slower in this run, so the honest conclusion is mixed: the microarchitectural bubble is gone, but at the workload level the phase timing did not turn into a clean shell speedup.
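The arithmetic behind these figures can be checked with a short script. It uses only the numbers reported here; the P108 baselines are not quoted in this note, so they are back-computed from the stated deltas and should be read as implied values. The CPI definition (post-load cycles divided by retired instructions) is an assumption that happens to reproduce the reported 2.5338. All variable and function names are illustrative.

```python
# Reported P109 metrics (from the result table).
POST_LOAD_CYCLES = 218_922_720
SHELL_WINDOW_CYCLES = 65_023_598
RETIRED_INSTRUCTIONS = 86_402_301
S_FETCH_CYCLES = 7_627_570
BYPASSES = 488_086
SHELL_BYPASSES = 327_181

# Stated changes relative to P108 (negative = fewer cycles than P108).
DELTA_S_FETCH = -481_840
DELTA_POST_LOAD = -37_850
DELTA_SHELL_WINDOW = +199_940


def implied_baseline(current: int, delta: int) -> int:
    """Back-compute the P108 value from the P109 value and the stated delta."""
    return current - delta


def relative_change(current: int, delta: int) -> float:
    """Change as a fraction of the implied P108 baseline."""
    return delta / implied_baseline(current, delta)


# Assumed CPI definition: post-load cycles per retired instruction.
cpi = POST_LOAD_CYCLES / RETIRED_INSTRUCTIONS
shell_bypass_share = SHELL_BYPASSES / BYPASSES

if __name__ == "__main__":
    print(f"CPI: {cpi:.4f}")
    print(f"shell-window share of bypasses: {shell_bypass_share:.1%}")
    for name, cur, d in [
        ("S_FETCH", S_FETCH_CYCLES, DELTA_S_FETCH),
        ("post-load", POST_LOAD_CYCLES, DELTA_POST_LOAD),
        ("shell window", SHELL_WINDOW_CYCLES, DELTA_SHELL_WINDOW),
    ]:
        print(f"{name}: implied P108 {implied_baseline(cur, d):,}, "
              f"change {relative_change(cur, d):+.3%}")
```

Putting the three deltas on a common relative scale makes the mixed result easier to see: the S_FETCH reduction is large relative to its own baseline, while the post-load and shell-window changes are small fractions of theirs.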

The next step should be less of a one-off: add a tiny tagged auxiliary response slot with owner, physical word address, data, valid, and cancel fields. That should let the design stop threading every aux response through a different FSM special case.
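A minimal behavioral sketch of such a slot is given below, purely to make the proposal concrete. Only the five fields (owner, physical word address, data, valid, cancel) come from the text. The `Owner` values, the method names `fill`, `squash`, and `take`, and the policy that a consumer must match both owner and address are all assumptions; the real design might choose differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Owner(Enum):
    """Hypothetical requester tags; the real set of owners is not specified."""
    NONE = 0
    WB_PREFETCH = 1
    ICACHE_FILL = 2


@dataclass
class AuxResponseSlot:
    """Single-entry tagged holding slot for an auxiliary response (sketch)."""
    owner: Owner = Owner.NONE
    addr: int = 0          # physical word address
    data: int = 0
    valid: bool = False
    cancel: bool = False

    def fill(self, owner: Owner, addr: int, data: int) -> None:
        """Capture a response and tag it with its requester and address."""
        self.owner, self.addr, self.data = owner, addr, data
        self.valid = True
        self.cancel = False

    def squash(self) -> None:
        """Mark a held response as cancelled (e.g. after a redirect)."""
        if self.valid:
            self.cancel = True

    def take(self, owner: Owner, addr: int) -> Optional[int]:
        """Consume the response only on an owner and address match; else None."""
        hit = (self.valid and not self.cancel
               and self.owner is owner and self.addr == addr)
        if not hit:
            return None
        self.valid = False  # single use: the slot frees on consumption
        return self.data


if __name__ == "__main__":
    slot = AuxResponseSlot()
    slot.fill(Owner.WB_PREFETCH, 0x1000, 0xDEADBEEF)
    print(slot.take(Owner.WB_PREFETCH, 0x1000))
```

In this sketch every consumer goes through the same `take` check, which is the property the proposal is after: the match logic lives in one place instead of in per-state FSM special cases.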