journal 2026-05-06

P109 banked aux demand prefetch

P109 tried to turn the auxiliary lower-bank response into a demand-visible frontend result.

The first target was the plain S_FETCH demand fetch, which measured 0 candidates in the BusyBox shell workload. The actual overlap was one state earlier: S_WB can drain a pending store-buffer word on the main lower-memory port while writeback prefetch is blocked on the instruction side.

The final RTL keeps the P108 I-cache fill and also decodes the auxiliary writeback-prefetch response immediately:

    banked_aux_i_wb_pf_fill_fire
      and mem_req_storebuf_for_idle
      and mem_storebuf_grant
      and mem_ready
      and !mem_error
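
As a minimal behavioral sketch (Python, not the RTL), the condition above reads as a plain conjunction of the five signals. The signal names come from the list above; the function name aux_demand_bypass and the choice to model each signal as a boolean are assumptions made for illustration only.

```python
def aux_demand_bypass(
    banked_aux_i_wb_pf_fill_fire: bool,
    mem_req_storebuf_for_idle: bool,
    mem_storebuf_grant: bool,
    mem_ready: bool,
    mem_error: bool,
) -> bool:
    """Hypothetical model: true only when every term of the conjunction holds."""
    return (
        banked_aux_i_wb_pf_fill_fire
        and mem_req_storebuf_for_idle
        and mem_storebuf_grant
        and mem_ready
        and not mem_error  # corresponds to the !mem_error term
    )
```

Any single deasserted term, or an asserted mem_error, blocks the bypass in this model.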

Result:

metric                                           value
post-load cycles                           218,922,720
shell-window cycles                         65,023,598
retired instructions                        86,402,301
CPI                                             2.5338
S_FETCH cycles                               7,627,570
aux demand prefetch bypasses                   488,086
shell-window aux demand prefetch bypasses      327,181
aux read errors                                      0
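
A quick arithmetic check on the figures above, as a sketch: the reported CPI matches post-load cycles divided by retired instructions. That this is the window the CPI was computed over is an assumption; the journal does not say so explicitly. The variable names are illustrative.

```python
post_load_cycles = 218_922_720
retired_instructions = 86_402_301
aux_bypasses_total = 488_086
aux_bypasses_shell = 327_181

# Assumption: CPI is taken over the post-load window.
cpi = post_load_cycles / retired_instructions
cpi_text = f"{cpi:.4f}"

# Fraction of all aux demand prefetch bypasses that fall inside the shell window.
shell_bypass_share = aux_bypasses_shell / aux_bypasses_total

print(cpi_text, f"{shell_bypass_share:.1%}")
```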

Compared with P108, S_FETCH drops by 481,840 cycles and post-load time drops by 37,850 cycles. The shell window is 199,940 cycles slower in this run, so the honest conclusion is mixed: the microarchitectural bubble is gone, but the workload-level phase timing did not turn into a clean shell speedup.
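
The P108 baselines implied by those deltas can be recovered by simple subtraction. The sketch below derives them from the P109 numbers and the stated differences only; the dictionary keys are illustrative names, and no P108 figure here comes from a separate measurement.

```python
# P109 measurements from the result table.
p109 = {"s_fetch": 7_627_570, "post_load": 218_922_720, "shell_window": 65_023_598}

# delta = P109 - P108; negative means P109 spent fewer cycles.
delta = {"s_fetch": -481_840, "post_load": -37_850, "shell_window": 199_940}

# Implied P108 baselines and the relative change in percent.
p108 = {k: p109[k] - delta[k] for k in p109}
pct = {k: 100.0 * delta[k] / p108[k] for k in p109}

for k in p109:
    print(f"{k:13s} P108={p108[k]:>12,d} P109={p109[k]:>12,d} change={pct[k]:+.3f}%")
```

Seen as percentages, the S_FETCH reduction is several percent, while both the post-load gain and the shell-window loss are well under one percent of their windows.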

The next step should be less one-off: add a tiny tagged auxiliary response slot with owner, physical word address, data, valid, and cancel. That should let the design stop threading every aux response through a different FSM special case.
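
One possible shape for that slot, as a behavioral Python sketch rather than RTL. The five fields are the ones named above; everything else is an assumption for illustration: the class and method names, the integer owner encoding, and the semantics that a cancelled entry is dropped on the next consume attempt and that a hit clears the slot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuxResponseSlot:
    """Hypothetical single-entry tagged slot for an auxiliary response."""
    owner: int = 0        # which consumer requested the word (encoding assumed)
    paddr_word: int = 0   # physical word address used as the tag
    data: int = 0
    valid: bool = False
    cancel: bool = False

    def fill(self, owner: int, paddr_word: int, data: int) -> None:
        """Capture a new response, overwriting whatever was held."""
        self.owner, self.paddr_word, self.data = owner, paddr_word, data
        self.valid, self.cancel = True, False

    def mark_cancel(self) -> None:
        """Flag a held response as stale (for example after a redirect)."""
        if self.valid:
            self.cancel = True

    def consume(self, owner: int, paddr_word: int) -> Optional[int]:
        """Return the data on a tag match; drop cancelled entries instead."""
        if not self.valid:
            return None
        if self.cancel:
            self.valid = self.cancel = False
            return None
        if self.owner == owner and self.paddr_word == paddr_word:
            self.valid = False
            return self.data
        return None  # mismatch: leave the entry for its real owner
```

With a slot like this, each FSM state would only need to present its owner and address and check for a hit, instead of carrying its own response-decode special case.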