P95: store buffer v0

P95 added a one-entry external-RAM store buffer at the SoC boundary. MMIO stays unbuffered. Ordinary RAM stores are accepted into the buffer, then drained through the shared external RAM port.

The shell smoke passed:

P95 direct UART console + memory attribution smoke PASS

The performance result is bad, which is useful:

metric	P94	P95
post-load cycles	222,459,202	241,494,238
shell window cycles	67,050,374	77,821,976
fetch stall cycles	23,549,359	35,563,846
store stall cycles	11,994,173	1,692
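
The per-counter change between the two runs can be checked with a short script. This is only a sketch: the numbers are copied from the table above, and the names `counters` and `deltas` are made up for illustration and are not part of the P95 tooling.

```python
# Counter values copied from the P94/P95 table above: (P94, P95).
counters = {
    "post-load cycles": (222_459_202, 241_494_238),
    "shell window cycles": (67_050_374, 77_821_976),
    "fetch stall cycles": (23_549_359, 35_563_846),
    "store stall cycles": (11_994_173, 1_692),
}


def deltas(table):
    """Return {metric: (absolute change, percent change)} from P94 to P95."""
    return {
        name: (p95 - p94, 100.0 * (p95 - p94) / p94)
        for name, (p94, p95) in table.items()
    }


if __name__ == "__main__":
    for name, (diff, pct) in deltas(counters).items():
        print(f"{name:22s} {diff:+14,d} {pct:+8.2f}%")
```

The two stall rows nearly cancel: fetch stalls rise by 12,014,487 cycles while store stalls fall by 11,992,481.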

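Store stalls all but disappeared (11,994,173 to 1,692 cycles), but fetch stalls grew by 12,014,487 cycles, about 51%. The buffer accepted and drained 12,743,615 stores, and blocked later external requests for 12,145,646 cycles, which roughly matches the fetch stall increase.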

So the lesson is not “store buffers are bad.” The lesson is that this specific one-entry, drain-before-next-external-request policy just moves the stall from stores to the frontend. P96 needs either load hits via a tiny D-cache, or a smarter buffer with forwarding and a less blunt drain policy.
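
The stall-moving effect can be reproduced in a toy cycle model. This is a sketch under stated assumptions, not the P95 RTL: every op costs one base cycle, an external RAM access costs `ram_latency` extra cycles, the drain starts the cycle after the store is accepted, and the trace alphabet ('S' store, 'F' fetch, '.' no external access) and the function name `simulate` are invented for illustration.

```python
def simulate(trace, ram_latency, buffered):
    """Toy model of a one-entry store buffer in front of a shared RAM port.

    trace: string of 'S' (external store), 'F' (external fetch),
           '.' (one cycle with no external access).
    Returns (total_cycles, stall cycles attributed to stores and fetches).
    """
    cycle = 0
    busy_until = 0  # cycle at which the buffered store finishes draining
    stalls = {"store": 0, "fetch": 0}

    for op in trace:
        if op == ".":
            cycle += 1
            continue

        kind = "store" if op == "S" else "fetch"

        # Drain-before-next-external-request: any external access
        # waits until the pending buffered store has left the buffer.
        if cycle < busy_until:
            stalls[kind] += busy_until - cycle
            cycle = busy_until

        if op == "S" and buffered:
            # Accepted in one cycle; the drain occupies the port afterwards.
            busy_until = cycle + 1 + ram_latency
            cycle += 1
        else:
            # Unbuffered access: the requester stalls for the full latency.
            stalls[kind] += ram_latency
            cycle += 1 + ram_latency

    return cycle, stalls


if __name__ == "__main__":
    for trace in ("SFSF", "S..F"):
        for buffered in (False, True):
            print(trace, buffered, simulate(trace, 4, buffered))
```

With back-to-back stores and fetches ("SFSF", latency 4) the model gives the same total cycle count with and without the buffer, and all store stall cycles reappear as fetch stall cycles. With slack after the store ("S..F") the buffered run is two cycles shorter, since part of the drain overlaps with other work. The model does not attempt to explain why total cycles rose in P95.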