journal 2026-05-05

P98: D-cache throttle

P98 kept P97’s four-word D-cache lines but throttled background fill so data-line repair only runs in frontend-safe slots.
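As a rough illustration of what "only runs in frontend-safe slots" could mean on a single shared port, here is a toy per-cycle arbiter. Everything in it is an assumption for illustration: the names `Grant` and `arbitrate`, the three request signals, and the fetch-before-load priority are hypothetical and not taken from the P98 design.

```python
from enum import Enum


class Grant(Enum):
    FETCH = "fetch"
    LOAD = "load"
    BACKGROUND_FILL = "background_fill"
    IDLE = "idle"


def arbitrate(fetch_req: bool, load_req: bool, fill_pending: bool) -> Grant:
    """Pick who owns the shared memory port this cycle (hypothetical policy).

    Assumption: a slot is "frontend-safe" only when neither instruction
    fetch nor a demand load wants the port, so background fill is strictly
    lowest priority and never delays either of them.
    """
    if fetch_req:
        return Grant.FETCH            # assumed: frontend always wins
    if load_req:
        return Grant.LOAD             # demand miss outranks background work
    if fill_pending:
        return Grant.BACKGROUND_FILL  # only reached in an otherwise idle slot
    return Grant.IDLE


# A pending fill is deferred while the frontend is busy...
print(arbitrate(fetch_req=True, load_req=False, fill_pending=True))
# ...and only proceeds once the port is otherwise unused.
print(arbitrate(fetch_req=False, load_req=False, fill_pending=True))
```

Under this sketch an unthrottled policy would differ only in letting `fill_pending` win over `fetch_req` in some cycles, which is the kind of contention the fetch-stall counter would pick up.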

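Functional result: PASS. Speed result: PASS versus P96, but barely: the shell window is 28,810 cycles shorter (about 0.04%) and post-load is 70,367 cycles shorter (about 0.03%). The important result is that P98 recovers from P97’s regression without pretending shared-port scheduling is the final architecture.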

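| metric              |         P96 |         P97 |         P98 |
|---------------------|------------:|------------:|------------:|
| shell window cycles |  66,084,155 |  67,369,576 |  66,055,345 |
| post-load cycles    | 221,522,958 | 222,850,787 | 221,452,591 |
| memory stall cycles |  59,418,375 |  60,295,642 |  59,683,338 |
| load stall cycles   |  10,976,902 |  10,387,310 |  10,697,962 |
| fetch stall cycles  |  26,676,104 |  29,593,757 |  27,286,526 |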
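To make "barely" concrete, the deltas against P96 can be recomputed from the table above. This is a small sketch; the dictionary layout and the helper name `delta_vs` are my own, and only the cycle counts come from the runs.

```python
# Cycle counts copied from the table above, keyed by metric and build.
CYCLES = {
    "shell window": {"P96": 66_084_155, "P97": 67_369_576, "P98": 66_055_345},
    "post-load":    {"P96": 221_522_958, "P97": 222_850_787, "P98": 221_452_591},
    "memory stall": {"P96": 59_418_375, "P97": 60_295_642, "P98": 59_683_338},
    "load stall":   {"P96": 10_976_902, "P97": 10_387_310, "P98": 10_697_962},
    "fetch stall":  {"P96": 26_676_104, "P97": 29_593_757, "P98": 27_286_526},
}


def delta_vs(baseline: str, build: str, metric: str) -> tuple[int, float]:
    """Return (absolute cycle delta, percent delta) of build against baseline."""
    base = CYCLES[metric][baseline]
    diff = CYCLES[metric][build] - base
    return diff, 100.0 * diff / base


for metric in CYCLES:
    for build in ("P97", "P98"):
        diff, pct = delta_vs("P96", build, metric)
        print(f"{metric:13s} {build}: {diff:+12,d} cycles ({pct:+.3f}%)")
```

On these numbers P98 lands below P96 on shell window, post-load, and load stall cycles, while memory stall (+264,963) and fetch stall (+610,422) are still above P96.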

The D-cache counters show the trade:

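| counter          |       P96 |       P97 |       P98 |
|------------------|----------:|----------:|----------:|
| load hits        | 3,656,064 | 4,370,122 | 3,945,531 |
| load misses      | 6,354,876 | 5,746,602 | 6,060,778 |
| background fills |         0 | 3,419,006 |   377,930 |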
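One way to read the trade is to turn the counters into a load hit rate and a rough "extra hits bought per background fill" figure. The sketch below does only that arithmetic; the helper names are hypothetical, and the per-fill ratio is a derived estimate, since total loads differ slightly between runs and the extra hits are simply measured against P96.

```python
# D-cache counters copied from the table above.
DCACHE = {
    "P96": {"hits": 3_656_064, "misses": 6_354_876, "bg_fills": 0},
    "P97": {"hits": 4_370_122, "misses": 5_746_602, "bg_fills": 3_419_006},
    "P98": {"hits": 3_945_531, "misses": 6_060_778, "bg_fills": 377_930},
}


def hit_rate(build: str) -> float:
    """Load hits divided by all loads (hits + misses)."""
    c = DCACHE[build]
    return c["hits"] / (c["hits"] + c["misses"])


def extra_hits_per_fill(build: str, baseline: str = "P96"):
    """Hits gained over the baseline per background fill; None if no fills ran."""
    fills = DCACHE[build]["bg_fills"]
    if fills == 0:
        return None
    return (DCACHE[build]["hits"] - DCACHE[baseline]["hits"]) / fills


for build in DCACHE:
    print(build, f"hit rate {hit_rate(build):.1%}",
          "extra hits/fill:", extra_hits_per_fill(build))
```

Read this way, the hit rate is roughly 36.5% for P96, 43.2% for P97, and 39.4% for P98, and each P98 background fill buys about 0.77 extra hits against about 0.21 for P97: fewer fills, but better-spent ones.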

So P98 is the last useful one-port policy rung. Next up should be the Harvard architecture map: what separate instruction and data service would actually look like for this core, which pieces we lack, and how to phase it in without hand-waving.