P95 adds a one-entry store buffer at the SoC boundary. Ordinary external-RAM stores are accepted immediately when the buffer is empty, then drained through the shared RAM port before later external CPU requests are allowed through.
It works functionally. It is also slower than P94.
Result
| metric | P94 arbiter | P95 store buffer |
|---|---|---|
| post-load cycles | 222,459,202 | 241,494,238 |
| shell window cycles | 67,050,374 | 77,821,976 |
| retired instructions | 86,664,089 | 88,851,638 |
| CPI | 2.5669 | 2.7179 |
| memory stall cycles | 60,032,329 | 60,797,627 |
| fetch stall cycles | 23,549,359 | 35,563,846 |
| I-cache hits | 42,662,028 | 43,429,434 |
| fetch queue fills | 53,967,748 | 55,096,088 |

| comparison | result |
|---|---|
| shell window vs P94 | +16.06% |
| post-load cycles vs P94 | +8.56% |
| memory stalls vs P94 | +1.27% |
| fetch stalls vs P94 | +51.02% |
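The comparison rows and the CPI values follow directly from the raw counters. A minimal sketch of that arithmetic, using only the numbers in the tables above; the helper name `pct_change` and the dictionary keys are ours and are not part of the project tooling:

```python
# Recompute the derived rows from the raw counters in the tables above.
# `pct_change` and the key names are hypothetical, chosen for this sketch.

def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

P94 = {
    "post_load_cycles": 222_459_202,
    "shell_window_cycles": 67_050_374,
    "memory_stall_cycles": 60_032_329,
    "fetch_stall_cycles": 23_549_359,
}
P95 = {
    "post_load_cycles": 241_494_238,
    "shell_window_cycles": 77_821_976,
    "memory_stall_cycles": 60_797_627,
    "fetch_stall_cycles": 35_563_846,
}

changes = {name: pct_change(P94[name], P95[name]) for name in P94}

# CPI is post-load cycles divided by retired instructions.
cpi_p94 = P94["post_load_cycles"] / 86_664_089
cpi_p95 = P95["post_load_cycles"] / 88_851_638

for name, change in changes.items():
    print(f"{name:22s} {change:+.2f}%")
print(f"CPI P94 {cpi_p94:.4f}  CPI P95 {cpi_p95:.4f}")
```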
Store Buffer Counters
| counter | value |
|---|---|
| accepts | 12,743,615 |
| drains | 12,743,615 |
| valid cycles | 12,743,615 |
| block cycles | 12,145,646 |
The buffer did accept and drain stores: every accepted store was drained. The problem is the strict ordering policy: while the buffer drains, later external CPU requests wait. Store stalls all but disappear, but fetch and writeback prefetch pay for it.
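To make the policy concrete, here is a toy trace-driven model of a one-entry buffer in front of a single RAM port. It is a sketch under assumptions, not the P95 RTL: the function `run`, the fixed `ram_latency`, and the alternating store/fetch trace are all invented for illustration.

```python
# Toy model of a one-entry store buffer with strict ordering.
# All names, the latency, and the trace are illustrative assumptions.

def run(trace, ram_latency, buffered):
    """Return (total_cycles, stall_cycles_by_kind, blocked_cycles).

    trace is a list of (kind, gap) pairs, where gap is the number of
    cycles of CPU work before the request is issued.
    """
    now = 0
    port_free_at = 0  # cycle at which the RAM port finishes its current job
    stalls = {"fetch": 0, "load": 0, "store": 0}
    blocked = 0       # cycles spent waiting behind a draining store
    for kind, gap in trace:
        now += gap
        wait = max(0, port_free_at - now)  # strict ordering: wait for the drain
        blocked += wait
        if buffered and kind == "store":
            # The buffer is empty once the port is free, so the store is
            # accepted at that point and drains in the background.
            stalls[kind] += wait
            now += wait
            port_free_at = now + ram_latency
        else:
            # The CPU waits for any drain and then for its own access.
            stalls[kind] += wait + ram_latency
            now += wait + ram_latency
            port_free_at = now
    return now, stalls, blocked

trace = [("store", 0), ("fetch", 0)] * 3
base = run(trace, ram_latency=4, buffered=False)
buf = run(trace, ram_latency=4, buffered=True)
print("unbuffered:", base)
print("buffered:  ", buf)
```

In this back-to-back trace the total cycle count is identical in both runs; the buffered run reports zero store stalls only because the same wait is charged to the following fetch. The model does not try to reproduce the measured slowdown, which depends on timing details that are not described here.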
What Moved
| class | P94 stall cycles | P95 stall cycles |
|---|---|---|
| fetch | 23,549,359 | 35,563,846 |
| load | 14,632,992 | 15,144,900 |
| store | 11,994,173 | 1,692 |
| writeback prefetch | 323,591 | 12,500,307 |
| execute prefetch | 7,351,200 | 7,489,611 |
This is exactly the failure mode worth documenting. A buffer can remove the store instruction’s local wait without improving the machine if the drain policy steals the frontend’s useful slots.
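The per-class movement can be read off the table by subtraction. A small sketch using only the table values; we have not checked whether these stall classes overlap, so the net figure is a sum over this table only and should not be read as the change in total memory stall cycles:

```python
# Per-class stall deltas (P95 minus P94), taken from the table above.
stalls = {
    "fetch": (23_549_359, 35_563_846),
    "load": (14_632_992, 15_144_900),
    "store": (11_994_173, 1_692),
    "writeback prefetch": (323_591, 12_500_307),
    "execute prefetch": (7_351_200, 7_489_611),
}

deltas = {name: p95 - p94 for name, (p94, p95) in stalls.items()}
net = sum(deltas.values())

for name, delta in deltas.items():
    print(f"{name:20s} {delta:+,d}")
print(f"{'net over table':20s} {net:+,d}")
```

The store class gives back about 12.0 million cycles, while fetch and writeback prefetch each take on roughly 12 million more.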
Memory Stalls
| kind | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 35,563,846 | 58.5% | 51,766,489 |
| data load | 15,144,900 | 24.9% | 1,020,802 |
| data store | 1,692 | 0% | 12,584,006 |
| atomic memory op | 136,668 | 0.2% | 219,056 |
| page walk for fetch | 1,180,811 | 1.9% | 1,174,657 |
| page walk for load/store | 1,280,087 | 2.1% | 1,279,569 |
| other | 7,489,623 | 12.3% | 17,256,377 |
The memory-kind view shows the trade: store stalls collapse, but fetch stalls rise enough to dominate the result.
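The shares in this view are each kind's stall cycles over the sum of all kinds, and that sum matches the P95 memory stall total reported in the Result table. A short sketch that recomputes the shares and adds one derived figure, stall cycles per request, which does not appear in the original output:

```python
# (stall cycles, requests) per memory kind, from the figures above.
rows = {
    "instruction fetch": (35_563_846, 51_766_489),
    "data load": (15_144_900, 1_020_802),
    "data store": (1_692, 12_584_006),
    "atomic memory op": (136_668, 219_056),
    "page walk for fetch": (1_180_811, 1_174_657),
    "page walk for load/store": (1_280_087, 1_279_569),
    "other": (7_489_623, 17_256_377),
}

total = sum(cycles for cycles, _ in rows.values())
share = {kind: cycles / total * 100.0 for kind, (cycles, _) in rows.items()}
# Derived here for illustration; not a counter reported by the run.
per_request = {kind: cycles / reqs for kind, (cycles, reqs) in rows.items()}

print(f"total memory stall cycles: {total:,d}")
for kind in rows:
    print(f"{kind:26s} {share[kind]:5.1f}%  {per_request[kind]:8.4f} stall cycles/req")
```

Per request, a buffered store now costs almost nothing, while a data load still stalls for more than 14 cycles on average.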
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 123,204,999 | 51.2% |
| /init to shell banner | 1,128,402 | 0.5% |
| shell banner to first command | 38,652,740 | 16.1% |
| echo command | 1,598 | 0% |
| uname -a | 2,164,048 | 0.9% |
| ls /bin /usr/share | 34,815,180 | 14.5% |
| cat sample file | 5,283,101 | 2.2% |
| touch/write/cat/rm /tmp file | 10,852,334 | 4.5% |
| 8x ash loop with file I/O | 23,171,954 | 9.6% |
| final marker | 1,533,761 | 0.6% |
Every shell phase completes, but every meaningful phase finishes later than it did in P94. The loop phase is especially exposed because it alternates small file writes and reads.
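The phase figures can be cross-checked against the Result table. The sketch below assumes the shell window spans the command phases from `echo` through the final marker; that reading is inferred from the totals matching, not stated anywhere in the run output.

```python
# Phase cycle counts from the Shell Phases figures above, in order.
phases = [
    ("kernel banner to /init", 123_204_999),
    ("/init to shell banner", 1_128_402),
    ("shell banner to first command", 38_652_740),
    ("echo command", 1_598),
    ("uname -a", 2_164_048),
    ("ls /bin /usr/share", 34_815_180),
    ("cat sample file", 5_283_101),
    ("touch/write/cat/rm /tmp file", 10_852_334),
    ("8x ash loop with file I/O", 23_171_954),
    ("final marker", 1_533_761),
]

all_phases = sum(cycles for _, cycles in phases)
# Assumption: the shell window starts at the first command (index 3).
shell_window = sum(cycles for _, cycles in phases[3:])
boot_share = phases[0][1] / all_phases * 100.0

print(f"sum of all phases:    {all_phases:,d}")
print(f"echo to final marker: {shell_window:,d}")
print(f"kernel boot share:    {boot_share:.1f}%")
```

Under that assumption the command phases add up to exactly the 77,821,976 shell window cycles reported for P95.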
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 8.5% | 20,601,809 |
| execute | 36.8% | 88,877,988 |
| mem | 12.1% | 29,105,432 |
| walker | 2% | 4,915,124 |
| writeback | 36.8% | 88,851,638 |
| mul/div | 3.8% | 9,138,851 |
The architectural state machine did not gain a new CPU state; the buffer is outside the core. The extra time shows up through blocked memory service.
Hot Functions
- 5.5% of samples (4,163 samples)
- 4.8% of samples (3,646 samples)
- 4.1% of samples (3,123 samples)
- 3.2% of samples (2,397 samples)
- 2.4% of samples (1,850 samples)
- 2.4% of samples (1,822 samples)
- 2.3% of samples (1,779 samples)
- 1.9% of samples (1,455 samples)
- 1.8% of samples (1,358 samples)
- 1.5% of samples (1,137 samples)
- 1.2% of samples (917 samples)
- 1.2% of samples (872 samples)
- 1.1% of samples (857 samples)
- 1.1% of samples (856 samples)
- 1.1% of samples (800 samples)
- 56% of samples (42,525 samples)
The hot-symbol mix remains the same kind of BusyBox and kernel work. The slowdown is not a new software path; it is memory-service policy.
Honest Status
| check | status |
|---|---|
| One-entry external-RAM store buffer | PASS |
| MMIO stores left unbuffered | PASS |
| BusyBox shell workload runs | PASS |
| Store-buffer counters captured | PASS |
| Forwarding from buffer to loads | NOT RUN |
| Store merge / coalescing | NOT RUN |
| Shell-window speedup vs P94 | FAIL |
| LibreLane hardening | NOT RUN |
Next
P96 should not blindly make this buffer bigger. The next useful rung is either a tiny D-cache for load hits or a smarter store buffer with forwarding and a drain policy that does not block instruction delivery so aggressively.