P96 took the data-side path after P95’s store-buffer miss. The new RTL starts from P94, adds a tiny direct-mapped word D-cache, and leaves stores ordered and write-through.
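
As a rough behavioral illustration of what a direct-mapped, one-word-per-line, write-through cache does, here is a minimal Python sketch. It is not the P96 RTL. The class name `WordDCache`, the 64-line size, the word-addressed dictionary used as backing memory, and the store policy (update the line on a hit, do not allocate on a miss) are all assumptions made for the example, and invalidations are not modeled at all.

```python
class WordDCache:
    """Hypothetical model: direct-mapped, one word per line, write-through."""

    def __init__(self, memory, num_lines=64):
        self.memory = memory              # dict: word address -> value (backing store)
        self.num_lines = num_lines        # assumed size; the log only says "tiny"
        self.lines = [None] * num_lines   # each entry is (tag, value) or None
        self.hits = 0
        self.misses = 0
        self.store_updates = 0

    def _split(self, word_addr):
        # Low bits select the line, the remaining bits form the tag.
        return word_addr % self.num_lines, word_addr // self.num_lines

    def load(self, word_addr):
        index, tag = self._split(word_addr)
        line = self.lines[index]
        if line is not None and line[0] == tag:
            self.hits += 1
            return line[1]
        # Miss: fetch the word from memory and fill the line.
        self.misses += 1
        value = self.memory.get(word_addr, 0)
        self.lines[index] = (tag, value)
        return value

    def store(self, word_addr, value):
        # Write-through: memory is always updated, in program order.
        self.memory[word_addr] = value
        index, tag = self._split(word_addr)
        line = self.lines[index]
        if line is not None and line[0] == tag:
            # Assumed policy: keep a matching line coherent, no allocate on miss.
            self.lines[index] = (tag, value)
            self.store_updates += 1
```

Two word addresses that differ by a multiple of `num_lines` map to the same line and evict each other, which is the conflict behavior a direct-mapped design trades for its simplicity.
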
The shell profile passed:

| metric | P94 | P96 |
|---|---|---|
| post-load cycles | 222,459,202 | 221,522,958 |
| shell window cycles | 67,050,374 | 66,084,155 |
| CPI | 2.5669 | 2.5656 |
| memory stall cycles | 60,032,329 | 59,418,375 |
| load stall cycles | 14,632,992 | 10,976,902 |
| fetch stall cycles | 23,549,359 | 26,676,104 |

D-cache counters:

| counter | value |
|---|---|
| load hits | 3,656,064 |
| load misses | 6,354,876 |
| fills | 6,354,876 |
| store updates | 10,473,803 |
| invalidations | 1,873,327 |
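
This is a modest speed win and a useful signal: shell window cycles fall by about 1.4% and memory stall cycles by about 1.0%. Load stalls drop by about 25% (roughly 3.66M cycles), but fetch stalls rise by about 13% (roughly 3.13M cycles), which gives back most of that gain. The load hit rate is also only about 36.5% (3,656,064 hits against 6,354,876 misses). The next data-cache rung should therefore use multi-word lines with critical-word-first service and background fill. Blocking fill is already known-bad from P90.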
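
The percentages quoted above can be rechecked directly from the two tables. The short Python sketch below does that arithmetic; the dictionary keys and the helper name `pct_change` are illustrative choices, and the numbers are copied from the tables as given.

```python
# Stall and cycle counts copied from the P94 / P96 shell profile table.
p94 = {
    "shell_window_cycles": 67_050_374,
    "memory_stall_cycles": 60_032_329,
    "load_stall_cycles": 14_632_992,
    "fetch_stall_cycles": 23_549_359,
}
p96 = {
    "shell_window_cycles": 66_084_155,
    "memory_stall_cycles": 59_418_375,
    "load_stall_cycles": 10_976_902,
    "fetch_stall_cycles": 26_676_104,
}

# D-cache load counters from the P96 run.
load_hits = 3_656_064
load_misses = 6_354_876


def pct_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


changes = {name: pct_change(p94[name], p96[name]) for name in p94}
hit_rate = 100.0 * load_hits / (load_hits + load_misses)

for name, delta in changes.items():
    print(f"{name:22s} {delta:+6.2f}%")
print(f"{'load hit rate':22s} {hit_rate:6.2f}%")
```

Run as written, this gives roughly -25.0% for load stalls, +13.3% for fetch stalls, -1.0% for memory stalls, -1.4% for shell window cycles, and a load hit rate of about 36.5%.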