P98: D-cache throttle

P98 kept P97’s four-word D-cache lines but throttled background fill so data-line repair only runs in frontend-safe slots.
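A minimal sketch of what such a throttle could look like as a per-cycle gate. The notes do not define "frontend-safe", so this assumes it means a cycle in which instruction fetch is not requesting the shared port. The function name, its parameters, and the unthrottled (P97-style) branch are all hypothetical and not taken from the core's actual logic.

```python
def background_fill_allowed(fetch_wants_port: bool,
                            demand_load_pending: bool,
                            throttled: bool) -> bool:
    """Decide whether a background D-cache fill may use the single port this cycle.

    Assumptions (not from the source): demand loads always outrank
    background fills, and a slot is "frontend-safe" when fetch is idle.
    """
    if demand_load_pending:
        # A stalled load needs the port more than a speculative line repair.
        return False
    if throttled and fetch_wants_port:
        # Throttled policy: never take a slot the frontend is asking for.
        return False
    return True


# Same cycle, fetch is requesting the port, no demand load outstanding.
unthrottled = background_fill_allowed(True, False, throttled=False)
throttled = background_fill_allowed(True, False, throttled=True)
print(unthrottled, throttled)
```

Under these assumptions the only difference between the two policies is the cycle where fetch and a background fill both want the port, which is where a fetch-stall increase would be expected to come from.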

Functional result: PASS. Speed result: PASS versus P96, but barely (28,810 fewer shell window cycles, about 0.04%, and 70,367 fewer post-load cycles, about 0.03%). The important result is that P98 recovers from P97’s regression without pretending shared-port scheduling is the final architecture.

metric	P96	P97	P98
shell window cycles	66,084,155	67,369,576	66,055,345
post-load cycles	221,522,958	222,850,787	221,452,591
memory stall cycles	59,418,375	60,295,642	59,683,338
load stall cycles	10,976,902	10,387,310	10,697,962
fetch stall cycles	26,676,104	29,593,757	27,286,526
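To make "barely" concrete, the deltas against P96 can be recomputed from the table. This is a small helper script, not part of the measurement flow. The dictionary layout and function name are illustrative, and the numbers are copied from the table above.

```python
# Cycle counts copied from the table: (P96, P97, P98).
CYCLES = {
    "shell window cycles": (66_084_155, 67_369_576, 66_055_345),
    "post-load cycles": (221_522_958, 222_850_787, 221_452_591),
    "memory stall cycles": (59_418_375, 60_295_642, 59_683_338),
    "load stall cycles": (10_976_902, 10_387_310, 10_697_962),
    "fetch stall cycles": (26_676_104, 29_593_757, 27_286_526),
}


def pct_change(base: int, new: int) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


for name, (p96, p97, p98) in CYCLES.items():
    print(f"{name:20s} "
          f"P97 {p97 - p96:+11,d} ({pct_change(p96, p97):+.3f}%)  "
          f"P98 {p98 - p96:+11,d} ({pct_change(p96, p98):+.3f}%)")
```

Negative values mean fewer cycles than P96. The script only restates the table as differences and does not attribute any delta to a specific cause.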

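The D-cache counters show the trade. P98 issues roughly 11% of P97’s background fills and keeps roughly 40% of P97’s load-hit gain over P96: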

counter	P96	P97	P98
load hits	3,656,064	4,370,122	3,945,531
load misses	6,354,876	5,746,602	6,060,778
background fills	0	3,419,006	377,930
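The same counters can be turned into load hit rates and a rough "extra hits per background fill" figure. The sketch below uses the table values as-is. The per-fill ratio is an illustrative proxy introduced here, not a metric the runs report, and it assumes all of the hit gain over P96 comes from background fills.

```python
# Counters copied from the table: (load hits, load misses, background fills).
COUNTERS = {
    "P96": (3_656_064, 6_354_876, 0),
    "P97": (4_370_122, 5_746_602, 3_419_006),
    "P98": (3_945_531, 6_060_778, 377_930),
}


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of loads that hit in the D-cache."""
    return hits / (hits + misses)


base_hits = COUNTERS["P96"][0]
for name, (hits, misses, fills) in COUNTERS.items():
    extra_hits = hits - base_hits
    line = f"{name}: hit rate {hit_rate(hits, misses):.4f}, extra hits {extra_hits:,}"
    if fills:
        # Proxy only: assumes the whole gain over P96 is due to background fills.
        line += f", extra hits per fill {extra_hits / fills:.3f}"
    print(line)
```

Read this way, the throttled policy spends far fewer port slots per extra hit than P97 did, which is consistent with the smaller fetch-stall penalty in the cycle table.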

So P98 is the last useful one-port policy rung. Next up should be the Harvard architecture map: what separate instruction and data service would actually look like for this core, which pieces we lack, and how to phase it in without hand-waving.