P94 turns the core’s single external-memory mux into an explicit arbiter with named request classes. The external RAM is still shared. The point is visibility: fetch, prefetch, background I-cache fill, loads, stores, FP memory, AMOs, and page-table walks now report who wanted service and who got it.
This is a structure rung, not a speed rung.
Result
| metric | P93 predictor | P94 arbiter |
|---|---|---|
| post-load cycles | 221,863,586 | 222,459,202 |
| shell window cycles | 66,342,842 | 67,050,374 |
| retired instructions | 86,469,444 | 86,664,089 |
| CPI | 2.5658 | 2.5669 |
| memory stall cycles | 59,886,452 | 60,032,329 |
| fetch stall cycles | 23,503,650 | 23,549,359 |
| I-cache hits | 42,594,442 | 42,662,028 |
| fetch queue fills | 53,869,769 | 53,967,748 |
P94 is slightly slower than P93:
| comparison | result |
|---|---|
| shell window vs P93 | +1.07% |
| post-load cycles vs P93 | +0.27% |
| memory stalls vs P93 | +0.24% |
| fetch stalls vs P93 | +0.19% |
That is acceptable for this rung because it gives the next experiments specific traffic classes to attack instead of one global memory-stall number.
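The percentages and CPI values above can be rechecked from the raw counters. The following is a minimal sketch, assuming CPI is post-load cycles divided by retired instructions (the tables do not state the formula, but the figures agree with it). The dictionary keys and helper names are illustrative, not harness identifiers.

```python
# Counters copied from the Result table (P93 predictor vs P94 arbiter).
P93 = {
    "post_load_cycles": 221_863_586,
    "shell_window_cycles": 66_342_842,
    "retired_instructions": 86_469_444,
    "memory_stall_cycles": 59_886_452,
    "fetch_stall_cycles": 23_503_650,
}
P94 = {
    "post_load_cycles": 222_459_202,
    "shell_window_cycles": 67_050_374,
    "retired_instructions": 86_664_089,
    "memory_stall_cycles": 60_032_329,
    "fetch_stall_cycles": 23_549_359,
}


def pct_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(run: dict) -> float:
    """Assumed definition: post-load cycles per retired instruction."""
    return run["post_load_cycles"] / run["retired_instructions"]


COMPARED = (
    "shell_window_cycles",
    "post_load_cycles",
    "memory_stall_cycles",
    "fetch_stall_cycles",
)
deltas = {key: round(pct_change(P93[key], P94[key]), 2) for key in COMPARED}

for key, value in deltas.items():
    print(f"{key}: {value:+.2f}%")
print(f"CPI P93 {cpi(P93):.4f}, P94 {cpi(P94):.4f}")
```

Rounded to two decimals, the deltas match the comparison table, which suggests the table was derived the same way.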
Arbiter Attribution
| class | want cycles | handshakes | stall cycles | denied cycles |
|---|---|---|---|---|
| fetch | 2,723,143 | 1,738,381 | 984,762 | 0 |
| execute prefetch | 24,125,041 | 16,773,841 | 7,351,200 | 0 |
| writeback prefetch | 18,813,785 | 18,490,194 | 323,591 | 0 |
| I-cache background fill | 56,215,442 | 29,432,432 | 22,241,006 | 4,542,004 |
| load | 15,607,728 | 974,736 | 14,632,992 | 0 |
| store | 12,211,224 | 217,051 | 11,994,173 | 0 |
| FP load | 24 | 12 | 12 | 0 |
| FP store | 3,408 | 1,704 | 1,704 | 0 |
| AMO | 343,153 | 184,727 | 158,426 | 0 |
| fetch PTW | 2,241,820 | 1,117,833 | 1,123,987 | 0 |
| LSU PTW | 2,440,449 | 1,219,973 | 1,220,476 | 0 |
Total contention cycles: 4,542,004.
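Every row of the attribution table is consistent with want = handshakes + stall + denied. The source does not state that identity, so treat it as an inference from the numbers rather than a documented counter contract. A small sketch that checks it and recomputes the contention total (the key names are illustrative):

```python
# (want, handshakes, stall, denied) per class, copied from the table above.
ARBITER = {
    "fetch": (2_723_143, 1_738_381, 984_762, 0),
    "execute_prefetch": (24_125_041, 16_773_841, 7_351_200, 0),
    "writeback_prefetch": (18_813_785, 18_490_194, 323_591, 0),
    "icache_background_fill": (56_215_442, 29_432_432, 22_241_006, 4_542_004),
    "load": (15_607_728, 974_736, 14_632_992, 0),
    "store": (12_211_224, 217_051, 11_994_173, 0),
    "fp_load": (24, 12, 12, 0),
    "fp_store": (3_408, 1_704, 1_704, 0),
    "amo": (343_153, 184_727, 158_426, 0),
    "fetch_ptw": (2_241_820, 1_117_833, 1_123_987, 0),
    "lsu_ptw": (2_440_449, 1_219_973, 1_220_476, 0),
}


def unbalanced(rows: dict) -> list:
    """Return classes where want != handshakes + stall + denied."""
    return [
        name
        for name, (want, handshakes, stall, denied) in rows.items()
        if want != handshakes + stall + denied
    ]


total_denied = sum(row[3] for row in ARBITER.values())

print("unbalanced classes:", unbalanced(ARBITER))
print("total denied cycles:", total_denied)
```

A check like this is cheap to run on every capture and would flag a counter that drifted out of step after an RTL change.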
Only background I-cache line fill is denied service in this policy. Foreground fetch, loads, stores, AMOs, and page walks are not losing arbitration; they are paying the latency of the single shared service. That matters for P95: load and store stalls together are 26,627,165 cycles, about 44% of all memory stall cycles, so a store buffer or tiny D-cache is a better next target than more blind frontend tweaking.
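The difference between a stall and a denial can be illustrated with a toy cycle model. This is not the P94 RTL: the two-class priority list, the `mem_ready` flag, and the trace are all hypothetical, and the sketch assumes that "denied" means losing arbitration while "stall" means winning it but waiting on the shared memory.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    want: int = 0
    handshakes: int = 0
    stall: int = 0
    denied: int = 0


def step(wanting, mem_ready, priority, counters):
    """Advance the toy arbiter by one cycle and update per-class counters."""
    # Fixed priority: the first class in `priority` that wants service wins.
    winner = next((cls for cls in priority if cls in wanting), None)
    for cls in wanting:
        row = counters[cls]
        row.want += 1
        if cls != winner:
            row.denied += 1      # lost arbitration this cycle
        elif mem_ready:
            row.handshakes += 1  # won, and memory accepted the request
        else:
            row.stall += 1       # won, but memory is still busy
    return winner


# Hypothetical policy: foreground load outranks background I-cache fill.
PRIORITY = ["load", "icache_fill"]
counters = {cls: Counters() for cls in PRIORITY}

# Hypothetical trace of (classes wanting service, memory ready?) per cycle.
trace = [
    ({"load", "icache_fill"}, False),
    ({"load", "icache_fill"}, True),
    ({"icache_fill"}, False),
    ({"icache_fill"}, True),
]
for wanting, ready in trace:
    step(wanting, ready, PRIORITY, counters)

for cls, row in counters.items():
    print(cls, row)
```

In this toy trace the load is never denied yet still stalls for a cycle, while the background fill accumulates both denied and stall cycles. That is the same shape as the attribution table.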
Memory Stalls
| kind | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 23,549,359 | 39.2% | 49,661,007 |
| data load | 14,632,992 | 24.4% | 974,748 |
| data store | 11,994,173 | 20% | 218,755 |
| atomic memory op | 158,426 | 0.3% | 184,727 |
| page walk for fetch | 1,123,987 | 1.9% | 1,117,833 |
| page walk for load/store | 1,220,476 | 2% | 1,219,973 |
| other | 7,352,916 | 12.2% | 16,773,841 |
The older memory-kind view still says the same broad thing: fetch and data traffic both matter, and page-table walks are nontrivial but not the biggest slice.
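The memory-kind stall cycles can be reconciled with the arbiter classes. The grouping below is inferred from the numbers, not taken from the RTL or harness: fetch, writeback prefetch, and background I-cache fill stalls sum to the instruction-fetch figure, and execute prefetch plus the two FP classes sum to "other". Only stall cycles are reconciled here.

```python
# Stall cycles per arbiter class, from the Arbiter Attribution table.
CLASS_STALL = {
    "fetch": 984_762,
    "execute_prefetch": 7_351_200,
    "writeback_prefetch": 323_591,
    "icache_background_fill": 22_241_006,
    "load": 14_632_992,
    "store": 11_994_173,
    "fp_load": 12,
    "fp_store": 1_704,
    "amo": 158_426,
    "fetch_ptw": 1_123_987,
    "lsu_ptw": 1_220_476,
}

# Stall cycles per memory kind, from the Memory Stalls section.
KIND_STALL = {
    "instruction fetch": 23_549_359,
    "data load": 14_632_992,
    "data store": 11_994_173,
    "atomic memory op": 158_426,
    "page walk for fetch": 1_123_987,
    "page walk for load/store": 1_220_476,
    "other": 7_352_916,
}

# Inferred mapping from memory kind to arbiter classes (an assumption).
GROUPING = {
    "instruction fetch": ["fetch", "writeback_prefetch", "icache_background_fill"],
    "data load": ["load"],
    "data store": ["store"],
    "atomic memory op": ["amo"],
    "page walk for fetch": ["fetch_ptw"],
    "page walk for load/store": ["lsu_ptw"],
    "other": ["execute_prefetch", "fp_load", "fp_store"],
}

mismatches = {
    kind: (KIND_STALL[kind], sum(CLASS_STALL[c] for c in classes))
    for kind, classes in GROUPING.items()
    if KIND_STALL[kind] != sum(CLASS_STALL[c] for c in classes)
}
total_stall = sum(KIND_STALL.values())

print("mismatches:", mismatches)
print("total memory stall cycles:", total_stall)
```

Under that grouping, every kind matches exactly and the total equals the 60,032,329 memory stall cycles in the Result table.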
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,615,610 | 53% |
| /init to shell banner | 1,075,033 | 0.5% |
| shell banner to first command | 36,090,120 | 16.3% |
| echo command | 1,598 | 0% |
| uname -a | 2,572,368 | 1.2% |
| ls /bin /usr/share | 32,428,238 | 14.6% |
| cat sample file | 2,767,249 | 1.3% |
| touch/write/cat/rm /tmp file | 10,434,293 | 4.7% |
| 8x ash loop with file I/O | 17,891,364 | 8.1% |
| final marker | 955,264 | 0.4% |
The benchmark reaches the same final file marker. The shell-window slowdown is visible, so the honest status stays FAIL for speedup.
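The phase figures also show what the shell window covers. The source does not define the window. The sketch below assumes it runs from the echo command through the final marker, which is the span whose sum matches the shell-window cycle count in the Result table.

```python
# Cycles per phase, in order, copied from the Shell Phases list.
PHASES = [
    ("kernel banner to /init", 117_615_610),
    ("/init to shell banner", 1_075_033),
    ("shell banner to first command", 36_090_120),
    ("echo command", 1_598),
    ("uname -a", 2_572_368),
    ("ls /bin /usr/share", 32_428_238),
    ("cat sample file", 2_767_249),
    ("touch/write/cat/rm /tmp file", 10_434_293),
    ("8x ash loop with file I/O", 17_891_364),
    ("final marker", 955_264),
]


def window_cycles(phases, first_phase):
    """Sum cycles from `first_phase` through the last phase, inclusive."""
    names = [name for name, _ in phases]
    start = names.index(first_phase)
    return sum(cycles for _, cycles in phases[start:])


# Assumed window start: the first shell command issued by the benchmark.
shell_window = window_cycles(PHASES, "echo command")
print("shell window cycles:", shell_window)
```

Because the boot phases fall outside this span, the +1.07% shell-window figure and the +0.27% post-load figure measure different things and should not be compared directly.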
Cycle Shape
| stage | cycles | share |
|---|---|---|
| fetch | 8,333,015 | 3.7% |
| execute | 86,689,285 | 39% |
| mem | 28,163,821 | 12.7% |
| walker | 4,682,269 | 2.1% |
| writeback | 86,664,089 | 39% |
| mul/div | 7,925,007 | 3.6% |
The arbiter does not add new architectural state. It names the memory requests that were already passing through the shared port.
Hot Functions
- 5.5% of samples (3,590 samples)
- 5.1% of samples (3,320 samples)
- 3.6% of samples (2,341 samples)
- 3.5% of samples (2,257 samples)
- 2.8% of samples (1,802 samples)
- 2.6% of samples (1,698 samples)
- 2.6% of samples (1,697 samples)
- 1.8% of samples (1,175 samples)
- 1.7% of samples (1,122 samples)
- 1.4% of samples (904 samples)
- 1.3% of samples (848 samples)
- 1.3% of samples (830 samples)
- 1.2% of samples (784 samples)
- 1% of samples (681 samples)
- 1% of samples (673 samples)
- 55.6% of samples (36,372 samples)
The hot-symbol mix remains BusyBox plus kernel memory, scheduler, and exception paths. P94’s value is that future speed rungs can line those symbols up with traffic classes.
Honest Status
| check | status |
|---|---|
| Named memory request classes in RTL | PASS |
| Public want/grant/class counters exported to harness | PASS |
| BusyBox shell workload runs | PASS |
| Arbiter contention counters captured | PASS |
| Separate physical instruction/data RAM ports | NOT RUN |
| Nonblocking data cache | NOT RUN |
| Shell-window speedup vs P93 | FAIL |
| LibreLane hardening | NOT RUN |
Next
P95 should target data-side latency. The low-risk first step is a store buffer; the more ambitious step is a tiny D-cache. P94 gives us enough traffic attribution to tell whether either one actually reduces the foreground load/store pressure.
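One way to judge a P95 candidate is stall cycles per completed handshake for the load and store classes. This is only a sketch: the metric and the helper names are suggestions, not something the harness exports, and no P95 counters exist yet, so only the P94 baseline is filled in.

```python
# (stall cycles, handshakes) for the data-side classes, from the P94
# Arbiter Attribution table.
P94_DATA_SIDE = {
    "load": (14_632_992, 974_736),
    "store": (11_994_173, 217_051),
}


def stall_per_handshake(stall_cycles: int, handshakes: int) -> float:
    """Average stall cycles paid per completed request."""
    return stall_cycles / handshakes


def report(run: dict) -> dict:
    """Stall cycles per handshake for each class, rounded to 2 decimals."""
    return {
        name: round(stall_per_handshake(stall, handshakes), 2)
        for name, (stall, handshakes) in run.items()
    }


def stall_delta(baseline: dict, candidate: dict) -> dict:
    """Change in raw stall cycles per class (negative means less pressure)."""
    return {name: candidate[name][0] - baseline[name][0] for name in baseline}


baseline = report(P94_DATA_SIDE)
print(baseline)
```

Once a P95 capture exists, passing its load and store counters to `stall_delta` alongside `P94_DATA_SIDE` would show directly whether the store buffer or D-cache reduced foreground stall cycles.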