P95 added a one-entry external-RAM store buffer at the SoC boundary. MMIO stays unbuffered. Ordinary RAM stores are accepted into the buffer, then drained through the shared external RAM port.
The shell smoke passed:
P95 direct UART console + memory attribution smoke PASS
The performance result is a regression against P94, which is useful:
| metric | P94 | P95 |
|---|---|---|
| post-load cycles | 222,459,202 | 241,494,238 |
| shell window cycles | 67,050,374 | 77,821,976 |
| fetch stall cycles | 23,549,359 | 35,563,846 |
| store stall cycles | 11,994,173 | 1,692 |
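Store stalls all but disappeared (11,994,173 to 1,692 cycles), but fetch stalls grew by 12,014,487 cycles, about 51%. The buffer accepted and drained 12,743,615 stores, and blocked later external requests for 12,145,646 cycles, which is close to the fetch-stall increase.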
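The per-metric deltas can be recomputed from the table. This is a small sketch that only uses the four rows above; the `metrics` dict and the `deltas` helper are names made up for this note, not part of the test harness.

```python
# (P94, P95) cycle counts copied from the table above.
metrics = {
    "post-load cycles": (222_459_202, 241_494_238),
    "shell window cycles": (67_050_374, 77_821_976),
    "fetch stall cycles": (23_549_359, 35_563_846),
    "store stall cycles": (11_994_173, 1_692),
}

def deltas(table):
    """Return P95 minus P94 for every metric (hypothetical helper)."""
    return {name: p95 - p94 for name, (p94, p95) in table.items()}

for name, (p94, p95) in metrics.items():
    d = p95 - p94
    # Signed absolute change plus change relative to P94.
    print(f"{name:22s} {d:+15,d} ({d / p94:+.1%})")
```

The store-stall reduction and the fetch-stall increase come out within about 22,000 cycles of each other, which is what "the stall moved" looks like in numbers.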
So the lesson is not “store buffers are bad.” The lesson is that this specific one-entry, drain-before-next-external-request policy just moves the stall from stores to the frontend. P96 needs either load hits via a tiny D-cache, or a smarter buffer with forwarding and a less blunt drain policy.
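A toy model makes the stall shift easy to see. This is not the P95 RTL or its simulator: `RAM_LATENCY`, `run`, and the trace format are all assumptions invented for this sketch, and it treats every fetch as an external RAM access. It only models the policy described above, where a buffered store is accepted immediately and the next external request waits for the drain.

```python
RAM_LATENCY = 4  # assumed external RAM access time in cycles (made-up value)

def run(trace, buffered):
    """Return (total_cycles, stall_cycles_by_kind) for a list of ops.

    Ops are "fetch" and "store" (both use the external RAM port)
    or "alu" (no memory access, one cycle).
    """
    now = 0            # current cycle
    port_free_at = 0   # cycle at which the shared RAM port is idle again
    stalls = {"fetch": 0, "store": 0}
    for op in trace:
        if op == "alu":
            now += 1
            continue
        # Any external request first waits for an in-flight drain to finish.
        wait = max(0, port_free_at - now)
        if op == "store" and buffered:
            # The buffer accepts the store in one cycle; the drain then
            # holds the port for RAM_LATENCY cycles in the background.
            stalls["store"] += wait
            now += wait + 1
            port_free_at = now + RAM_LATENCY
        else:
            # Blocking access: the core waits for the port, then for RAM.
            stalls[op] += wait + RAM_LATENCY
            now += wait + RAM_LATENCY + 1
            port_free_at = now
    return now, stalls

trace = ["store", "fetch"] * 3  # worst case: a fetch right behind every store
for buffered in (False, True):
    print("buffered" if buffered else "unbuffered", run(trace, buffered))
```

On this trace both runs take the same 30 cycles. The unbuffered run splits its stalls evenly between stores and fetches; the buffered run reports zero store stalls and charges all of them to fetch. Putting a few `"alu"` ops between each store and the next fetch gives the drain time to finish in the background, which is the only case where this policy actually saves cycles in the model.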