P97 tried the obvious D-cache follow-up: four-word lines with critical-word-first demand loads and background fill for the rest of the line.
Functional result: PASS. Speed result: FAIL versus P96 (shell window cycles up about 1.9%).
| metric | P96 | P97 |
|---|---|---|
| shell window cycles | 66,084,155 | 67,369,576 |
| post-load cycles | 221,522,958 | 222,850,787 |
| memory stall cycles | 59,418,375 | 60,295,642 |
| load stall cycles | 10,976,902 | 10,387,310 |
| fetch stall cycles | 26,676,104 | 29,593,757 |
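
To make the direction and size of each change explicit, here is a minimal sketch that recomputes the P96 to P97 deltas from the table above. The numbers are copied from the table; the `METRICS` dictionary and the `delta` helper are illustrative names, not part of the project's tooling.

```python
# Cycle counts copied from the table above, as (P96, P97) pairs.
METRICS = {
    "shell window cycles": (66_084_155, 67_369_576),
    "post-load cycles": (221_522_958, 222_850_787),
    "memory stall cycles": (59_418_375, 60_295_642),
    "load stall cycles": (10_976_902, 10_387_310),
    "fetch stall cycles": (26_676_104, 29_593_757),
}


def delta(p96: int, p97: int) -> tuple[int, float]:
    """Return (absolute change, percent change) going from P96 to P97."""
    diff = p97 - p96
    return diff, 100.0 * diff / p96


if __name__ == "__main__":
    for name, (p96, p97) in METRICS.items():
        diff, pct = delta(p96, p97)
        # Positive values mean P97 spent more cycles than P96.
        print(f"{name:22s} {diff:+12,d} {pct:+6.2f}%")
```

Load stall cycles are the only row that drops (by 589,592 cycles); fetch stall cycles rise by 2,917,653.
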
The local D-cache counters improved, with load hits up about 19.5% and load misses down about 9.6%:
| counter | P96 | P97 |
|---|---|---|
| load hits | 3,656,064 | 4,370,122 |
| load misses | 6,354,876 | 5,746,602 |
| background fills | 0 | 3,419,006 |
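
The same counters can be read as a load hit rate. The sketch below assumes the hit rate is simply hits divided by hits plus misses; `load_hit_rate` is an illustrative helper, not a counter the design exposes.

```python
def load_hit_rate(hits: int, misses: int) -> float:
    """Fraction of counted loads that hit, assuming loads = hits + misses."""
    return hits / (hits + misses)


# Counter values copied from the table above.
P96_RATE = load_hit_rate(3_656_064, 6_354_876)
P97_RATE = load_hit_rate(4_370_122, 5_746_602)

if __name__ == "__main__":
    print(f"P96 load hit rate: {P96_RATE:.1%}")
    print(f"P97 load hit rate: {P97_RATE:.1%}")
```

Under that assumption the hit rate moves from roughly 36.5% to roughly 43.2%.
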
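So the line-fill geometry is useful: load stall cycles fell by about 0.59M. But the background policy is too eager for a one-port memory system: fetch stall cycles rose by about 2.9M, more than cancelling that gain, which is consistent with background fills occupying the port when the frontend needs it. P98 should make background fill conditional on real frontend idleness.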
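
One way the P98 gating could look is sketched below as a small cycle-level Python model, not RTL. Everything in it is an assumption made for illustration: the `PortArbiter` class, the `idle_threshold` parameter, the choice to define "real idleness" as a run of consecutive cycles with no fetch request, and the priority of demand loads over fetch are all hypothetical and may differ from the actual design.

```python
from dataclasses import dataclass
from enum import Enum


class Grant(Enum):
    NONE = "none"
    DEMAND_LOAD = "demand_load"
    FETCH = "fetch"
    BACKGROUND_FILL = "background_fill"


@dataclass
class PortArbiter:
    """Hypothetical arbiter for a single memory port."""

    idle_threshold: int = 2  # assumed: fetch-free cycles required before a fill
    idle_cycles: int = 0     # consecutive cycles without a fetch request

    def step(self, demand_load: bool, fetch_req: bool, fill_pending: bool) -> Grant:
        """Decide who owns the port this cycle."""
        # Track how long the frontend has gone without asking for the port.
        if fetch_req:
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1

        # Assumed priority: demand load, then fetch, then background fill.
        if demand_load:
            return Grant.DEMAND_LOAD
        if fetch_req:
            return Grant.FETCH
        # Background fill only runs once the frontend has been idle long enough.
        if fill_pending and self.idle_cycles >= self.idle_threshold:
            return Grant.BACKGROUND_FILL
        return Grant.NONE
```

With `idle_threshold=2`, a pending fill waits through one fetch-free cycle and is granted on the second, so a fetch arriving right after a single idle cycle is not blocked by a fill. Raising the threshold trades fewer background fills for less fetch interference.
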