P98 kept P97’s four-word D-cache lines but throttled background fill so data-line repair only runs in frontend-safe slots.
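
A minimal sketch of what that throttle could look like as a one-port arbiter, modelled in Python rather than RTL. Everything here is hypothetical: the `Grant` names, the `arbitrate` signature, the fetch-over-load priority, and the idea that "frontend-safe" reduces to a single boolean are assumptions for illustration, not the P98 implementation.

```python
from enum import Enum


class Grant(Enum):
    """Who owns the single shared memory port this cycle (hypothetical names)."""
    FETCH = "fetch"
    DEMAND_LOAD = "demand_load"
    BACKGROUND_FILL = "background_fill"
    IDLE = "idle"


def arbitrate(fetch_req: bool, load_req: bool,
              fill_pending: bool, frontend_safe: bool) -> Grant:
    """Pick one requester for the shared port.

    Assumed priority: instruction fetch, then demand load, then background
    fill. The fill is additionally gated on `frontend_safe`, so it never
    takes a slot the frontend could have used.
    """
    if fetch_req:
        return Grant.FETCH
    if load_req:
        return Grant.DEMAND_LOAD
    if fill_pending and frontend_safe:
        return Grant.BACKGROUND_FILL
    return Grant.IDLE


if __name__ == "__main__":
    # A pending fill only wins when nothing else wants the port
    # and the slot is marked frontend-safe.
    print(arbitrate(False, False, True, True))
    print(arbitrate(False, False, True, False))
```

In this model an unthrottled policy in the style of P97 would correspond to dropping the `frontend_safe` gate, which is one plausible way for fills to end up competing with fetch.
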
Functional result: PASS. Speed result: PASS versus P96, but barely (28,810 fewer shell-window cycles, about 0.04%). The important result is that P98 recovers from P97’s regression without pretending shared-port scheduling is the final architecture.

| metric | P96 | P97 | P98 |
|---|---|---|---|
| shell window cycles | 66,084,155 | 67,369,576 | 66,055,345 |
| post-load cycles | 221,522,958 | 222,850,787 | 221,452,591 |
| memory stall cycles | 59,418,375 | 60,295,642 | 59,683,338 |
| load stall cycles | 10,976,902 | 10,387,310 | 10,697,962 |
| fetch stall cycles | 26,676,104 | 29,593,757 | 27,286,526 |
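
To make "barely" concrete, the deltas against P96 can be recomputed from the table. The snippet below is only a convenience sketch: the `CYCLES` dictionary and the `delta` helper are names introduced here, and the numbers are copied from the table rather than read from any log.

```python
# Cycle counts copied from the metric table; dictionary layout is illustrative.
CYCLES = {
    "shell window": {"P96": 66_084_155, "P97": 67_369_576, "P98": 66_055_345},
    "post-load": {"P96": 221_522_958, "P97": 222_850_787, "P98": 221_452_591},
    "memory stall": {"P96": 59_418_375, "P97": 60_295_642, "P98": 59_683_338},
    "load stall": {"P96": 10_976_902, "P97": 10_387_310, "P98": 10_697_962},
    "fetch stall": {"P96": 26_676_104, "P97": 29_593_757, "P98": 27_286_526},
}


def delta(metric: str, build: str, baseline: str = "P96") -> tuple[int, float]:
    """Return (absolute cycles, percent) change of `build` versus `baseline`."""
    base = CYCLES[metric][baseline]
    diff = CYCLES[metric][build] - base
    return diff, 100.0 * diff / base


if __name__ == "__main__":
    for metric in CYCLES:
        for build in ("P97", "P98"):
            diff, pct = delta(metric, build)
            print(f"{metric:13s} {build}: {diff:+12,d} cycles ({pct:+.3f}%)")
```

Read this way, P98 wins on shell window, post-load, and load stalls, but still carries more memory-stall and fetch-stall cycles than P96.
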
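
The D-cache counters show the trade: P98 gives up some of P97’s extra load hits in exchange for roughly one-ninth as many background fills.
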
| counter | P96 | P97 | P98 |
|---|---|---|---|
| load hits | 3,656,064 | 4,370,122 | 3,945,531 |
| load misses | 6,354,876 | 5,746,602 | 6,060,778 |
| background fills | 0 | 3,419,006 | 377,930 |
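
The same counters can be turned into hit rates and a rough "extra hits per background fill" figure. This is a back-of-envelope sketch: `COUNTERS`, `hit_rate`, and `extra_hits_per_fill` are names introduced here, and the per-fill ratio is only approximate because total load counts differ slightly between runs.

```python
# Counter values copied from the D-cache table; layout is illustrative.
COUNTERS = {
    "P96": {"hits": 3_656_064, "misses": 6_354_876, "bg_fills": 0},
    "P97": {"hits": 4_370_122, "misses": 5_746_602, "bg_fills": 3_419_006},
    "P98": {"hits": 3_945_531, "misses": 6_060_778, "bg_fills": 377_930},
}


def hit_rate(build: str) -> float:
    """Load hit rate = hits / (hits + misses)."""
    c = COUNTERS[build]
    return c["hits"] / (c["hits"] + c["misses"])


def extra_hits_per_fill(build: str, baseline: str = "P96"):
    """Load hits gained over `baseline` per background fill issued.

    Returns None when the build issued no background fills.
    """
    fills = COUNTERS[build]["bg_fills"]
    if fills == 0:
        return None
    gained = COUNTERS[build]["hits"] - COUNTERS[baseline]["hits"]
    return gained / fills


if __name__ == "__main__":
    for build in COUNTERS:
        per_fill = extra_hits_per_fill(build)
        per_fill_text = "n/a" if per_fill is None else f"{per_fill:.2f}"
        print(f"{build}: hit rate {hit_rate(build):.1%}, "
              f"extra hits per fill {per_fill_text}")
```

By this measure the load hit rate goes from about 36.5% (P96) to 43.2% (P97) and 39.4% (P98), while extra hits per fill rise from roughly 0.21 in P97 to roughly 0.77 in P98. That fits the reading that throttling discards mostly low-value fills.
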

So P98 is the last useful one-port policy rung. Next up should be the Harvard architecture map: what separate instruction and data service would actually look like for this core, which pieces we lack, and how to phase it in without hand-waving.