Project No. 94 of 147 on the ladder

Memory arbiter v0

Introduces: explicit memory request classes; arbiter want/grant counters; contention attribution behind the shared RAM port

harden state (last run 2026-05-05)
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P94 turns the core’s single external-memory mux into an explicit arbiter with named request classes. The external RAM is still shared. The point is visibility: fetch, prefetch, background I-cache fill, loads, stores, FP memory, AMOs, and page-table walks now report who wanted service and who got it.

This is a structure rung, not a speed rung.
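
The want/grant accounting can be illustrated with a toy behavioral model. This is a minimal Python sketch under stated assumptions, not the P94 RTL: the class `ArbiterModel`, the `PRIORITY` order, the reduced set of class names, and the rule that a granted request either completes a handshake or stalls on the RAM are all hypothetical.

```python
from collections import Counter

# Hypothetical priority order, highest first. The real policy is not given here;
# the only property borrowed from the results is that background fill can lose.
PRIORITY = ["fetch", "load", "store", "icache_bg_fill"]


class ArbiterModel:
    """Toy single-port arbiter counting want/handshake/stall/denied per class."""

    def __init__(self):
        self.want = Counter()
        self.handshakes = Counter()
        self.stall = Counter()
        self.denied = Counter()

    def step(self, wanting, ram_ready):
        """Advance one cycle.

        wanting:   set of class names requesting the port this cycle.
        ram_ready: True if the shared RAM completes a transfer this cycle.
        Returns the granted class, or None if nobody asked.
        """
        granted = next((c for c in PRIORITY if c in wanting), None)
        for c in wanting:
            self.want[c] += 1
            if c != granted:
                self.denied[c] += 1      # lost arbitration this cycle
            elif ram_ready:
                self.handshakes[c] += 1  # owned the port, transfer completed
            else:
                self.stall[c] += 1       # owned the port, waited on the RAM
        return granted


arb = ArbiterModel()
arb.step({"load", "icache_bg_fill"}, ram_ready=False)  # load stalls, fill denied
arb.step({"load", "icache_bg_fill"}, ram_ready=True)   # load completes, fill denied
arb.step({"icache_bg_fill"}, ram_ready=True)           # fill is finally served

# Every want cycle is accounted for as a handshake, a stall, or a denial.
for c in PRIORITY:
    assert arb.want[c] == arb.handshakes[c] + arb.stall[c] + arb.denied[c]
```

In this toy trace the background fill wants the port for three cycles, is denied twice, and is served once. The load is never denied; it only waits on the RAM.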

Result

| metric | P93 predictor | P94 arbiter |
|---|---|---|
| post-load cycles | 221,863,586 | 222,459,202 |
| shell window cycles | 66,342,842 | 67,050,374 |
| retired instructions | 86,469,444 | 86,664,089 |
| CPI | 2.5658 | 2.5669 |
| memory stall cycles | 59,886,452 | 60,032,329 |
| fetch stall cycles | 23,503,650 | 23,549,359 |
| I-cache hits | 42,594,442 | 42,662,028 |
| fetch queue fills | 53,869,769 | 53,967,748 |

P94 is slightly slower than P93:

| comparison | result |
|---|---|
| shell window vs P93 | +1.07% |
| post-load cycles vs P93 | +0.27% |
| memory stalls vs P93 | +0.24% |
| fetch stalls vs P93 | +0.19% |

That cost is acceptable for this rung because the arbiter gives the next experiments specific traffic classes to attack instead of one global memory-stall number.
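
The percentages and CPI figures follow directly from the raw counters. A short sketch that recomputes them; the dictionary and function names (`P93`, `P94`, `pct_change`) are illustrative and not part of the harness.

```python
# Raw counters transcribed from the result table.
P93 = {
    "post-load cycles": 221_863_586,
    "shell window cycles": 66_342_842,
    "retired instructions": 86_469_444,
    "memory stall cycles": 59_886_452,
    "fetch stall cycles": 23_503_650,
}
P94 = {
    "post-load cycles": 222_459_202,
    "shell window cycles": 67_050_374,
    "retired instructions": 86_664_089,
    "memory stall cycles": 60_032_329,
    "fetch stall cycles": 23_549_359,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(run):
    """Cycles per instruction over the post-load region."""
    return run["post-load cycles"] / run["retired instructions"]


deltas = {name: round(pct_change(P93[name], P94[name]), 2) for name in P93}

for name, delta in deltas.items():
    print(f"{name:22s} {delta:+.2f}%")
print(f"CPI P93 {cpi(P93):.4f}  P94 {cpi(P94):.4f}")
```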

Arbiter Attribution

| class | want cycles | handshakes | stall cycles | denied cycles |
|---|---|---|---|---|
| fetch | 2,723,143 | 1,738,381 | 984,762 | 0 |
| execute prefetch | 24,125,041 | 16,773,841 | 7,351,200 | 0 |
| writeback prefetch | 18,813,785 | 18,490,194 | 323,591 | 0 |
| I-cache background fill | 56,215,442 | 29,432,432 | 22,241,006 | 4,542,004 |
| load | 15,607,728 | 974,736 | 14,632,992 | 0 |
| store | 12,211,224 | 217,051 | 11,994,173 | 0 |
| FP load | 24 | 12 | 12 | 0 |
| FP store | 3,408 | 1,704 | 1,704 | 0 |
| AMO | 343,153 | 184,727 | 158,426 | 0 |
| fetch PTW | 2,241,820 | 1,117,833 | 1,123,987 | 0 |
| LSU PTW | 2,440,449 | 1,219,973 | 1,220,476 | 0 |

Total contention cycles: 4,542,004.

Only background I-cache line fill is denied service in this policy. Foreground fetch, loads, stores, AMOs, and page walks are not losing arbitration; they are paying the latency of the single shared service. That matters for P95. A store buffer or tiny D-cache is a better next target than more blind frontend tweaking.
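
The attribution numbers can be sanity-checked mechanically. The sketch below transcribes the rows above and checks that, for every class, want cycles equal handshakes plus stall cycles plus denied cycles. That identity is not stated as a rule anywhere in this writeup; it is an observation about these numbers. The name `ROWS` and the derived stall-per-handshake ratio are illustrative.

```python
# class: (want cycles, handshakes, stall cycles, denied cycles)
ROWS = {
    "fetch": (2_723_143, 1_738_381, 984_762, 0),
    "execute prefetch": (24_125_041, 16_773_841, 7_351_200, 0),
    "writeback prefetch": (18_813_785, 18_490_194, 323_591, 0),
    "I-cache background fill": (56_215_442, 29_432_432, 22_241_006, 4_542_004),
    "load": (15_607_728, 974_736, 14_632_992, 0),
    "store": (12_211_224, 217_051, 11_994_173, 0),
    "FP load": (24, 12, 12, 0),
    "FP store": (3_408, 1_704, 1_704, 0),
    "AMO": (343_153, 184_727, 158_426, 0),
    "fetch PTW": (2_241_820, 1_117_833, 1_123_987, 0),
    "LSU PTW": (2_440_449, 1_219_973, 1_220_476, 0),
}

# Each want cycle should land in exactly one of the other three columns.
for name, (want, handshakes, stall, denied) in ROWS.items():
    assert want == handshakes + stall + denied, name

total_denied = sum(row[3] for row in ROWS.values())
denied_classes = [name for name, row in ROWS.items() if row[3] > 0]

# Average stall cycles paid per completed handshake, per class.
stall_per_handshake = {
    name: row[2] / row[1] for name, row in ROWS.items()
}

print("total contention cycles:", total_denied)
print("classes that lost arbitration:", denied_classes)
for name in ("load", "store"):
    print(f"{name}: {stall_per_handshake[name]:.1f} stall cycles per handshake")
```

Read this way, loads pay roughly 15 stall cycles per handshake and stores roughly 55, without ever losing arbitration, which is the data-side latency the P95 suggestion is aimed at.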

Memory Stalls

Memory stalls, P94 memory-arbiter workload: 60,032,329 stall cycles, 70,150,884 handshakes.

| # | kind | stall cycles | share | requests |
|---|---|---|---|---|
| 1 | instruction fetch | 23,549,359 | 39.2% | 49,661,007 |
| 2 | data load | 14,632,992 | 24.4% | 974,748 |
| 3 | data store | 11,994,173 | 20% | 218,755 |
| 4 | atomic memory op | 158,426 | 0.3% | 184,727 |
| 5 | page walk for fetch | 1,123,987 | 1.9% | 1,117,833 |
| 6 | page walk for load/store | 1,220,476 | 2% | 1,219,973 |
| 7 | other | 7,352,916 | 12.2% | 16,773,841 |

The older memory-kind view still says the same broad thing: fetch and data traffic both matter, and page-table walks are nontrivial but not the biggest slice.
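
The memory-kind view and the arbiter view can be cross-checked against each other. A small sketch, using the per-class stall and handshake counts from the arbiter attribution and the per-kind counts listed above: it confirms that both views sum to the same totals and recomputes the percentage shares. How individual arbiter classes map onto memory kinds is not stated here, so only totals are compared.

```python
# Arbiter view: per-class stall cycles and handshakes, in attribution-table order.
arbiter_stalls = [984_762, 7_351_200, 323_591, 22_241_006, 14_632_992,
                  11_994_173, 12, 1_704, 158_426, 1_123_987, 1_220_476]
arbiter_handshakes = [1_738_381, 16_773_841, 18_490_194, 29_432_432, 974_736,
                      217_051, 12, 1_704, 184_727, 1_117_833, 1_219_973]

# Memory-kind view: kind -> (stall cycles, requests).
kinds = {
    "instruction fetch": (23_549_359, 49_661_007),
    "data load": (14_632_992, 974_748),
    "data store": (11_994_173, 218_755),
    "atomic memory op": (158_426, 184_727),
    "page walk for fetch": (1_123_987, 1_117_833),
    "page walk for load/store": (1_220_476, 1_219_973),
    "other": (7_352_916, 16_773_841),
}

TOTAL_STALLS = 60_032_329
TOTAL_HANDSHAKES = 70_150_884

kind_stall_total = sum(stalls for stalls, _ in kinds.values())
kind_request_total = sum(reqs for _, reqs in kinds.values())

# Both views should account for the same stall cycles and the same handshakes.
assert sum(arbiter_stalls) == kind_stall_total == TOTAL_STALLS
assert sum(arbiter_handshakes) == kind_request_total == TOTAL_HANDSHAKES

shares = {
    kind: round(100.0 * stalls / TOTAL_STALLS, 1)
    for kind, (stalls, _) in kinds.items()
}
for kind, share in shares.items():
    print(f"{kind:26s} {share:5.1f}%")
```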

Shell Phases

Shell phases, P94 shell workload: 222,459,202 cycles, CPI 2.57.

| # | phase | cycles | share |
|---|---|---|---|
| 1 | kernel banner to /init | 117,615,610 | 53% |
| 2 | /init to shell banner | 1,075,033 | 0.5% |
| 3 | shell banner to first command | 36,090,120 | 16.3% |
| 4 | echo command | 1,598 | 0% |
| 5 | uname -a | 2,572,368 | 1.2% |
| 6 | ls /bin /usr/share | 32,428,238 | 14.6% |
| 7 | cat sample file | 2,767,249 | 1.3% |
| 8 | touch/write/cat/rm /tmp file | 10,434,293 | 4.7% |
| 9 | 8x ash loop with file I/O | 17,891,364 | 8.1% |
| 10 | final marker | 955,264 | 0.4% |

The benchmark reaches the same final file marker. The shell-window slowdown is visible, so the honest status stays FAIL for speedup.
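
The phase counts also show where the shell-window number comes from. The writeup does not define the window boundaries; the sketch below only demonstrates that the reported 67,050,374 shell-window cycles equal the sum of the phases from the echo command through the final marker. The variable names are illustrative.

```python
# (phase name, cycles) in run order, transcribed from the shell-phase breakdown.
phases = [
    ("kernel banner to /init", 117_615_610),
    ("/init to shell banner", 1_075_033),
    ("shell banner to first command", 36_090_120),
    ("echo command", 1_598),
    ("uname -a", 2_572_368),
    ("ls /bin /usr/share", 32_428_238),
    ("cat sample file", 2_767_249),
    ("touch/write/cat/rm /tmp file", 10_434_293),
    ("8x ash loop with file I/O", 17_891_364),
    ("final marker", 955_264),
]

SHELL_WINDOW = 67_050_374   # reported shell window cycles for P94
POST_LOAD = 222_459_202     # reported post-load cycles for P94

# Assumption inferred from the arithmetic: the window opens at the first command.
first_cmd = next(i for i, (name, _) in enumerate(phases) if name == "echo command")
window_cycles = sum(cycles for _, cycles in phases[first_cmd:])

# Post-load cycles that are not attributed to any listed phase.
unattributed = POST_LOAD - sum(cycles for _, cycles in phases)

print("window cycles from phases:", window_cycles)
print("cycles outside listed phases:", unattributed)
```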

Cycle Shape

State breakdown, P94 memory-arbiter workload: 222,459,202 cycles, CPI 2.57.

| # | state | cycles | share |
|---|---|---|---|
| 1 | fetch | 8,333,015 | 3.7% |
| 2 | execute | 86,689,285 | 39% |
| 3 | mem | 28,163,821 | 12.7% |
| 4 | walker | 4,682,269 | 2.1% |
| 5 | writeback | 86,664,089 | 39% |
| 6 | mul/div | 7,925,007 | 3.6% |

The arbiter does not add a new architectural state. It names the memory requests that were already passing through the shared port.

Hot Functions

Hot functions, P94 BusyBox shell symbols: 65,479 samples, one sample every 1,024 cycles.

| # | symbol | image | share | samples |
|---|---|---|---|---|
| 1 | printf_core | busybox | 5.5% | 3,590 |
| 2 | memset | kernel | 5.1% | 3,320 |
| 3 | memcpy | busybox | 3.6% | 2,341 |
| 4 | vruntime_eligible | kernel | 3.5% | 2,257 |
| 5 | blake2s_compress_generic | kernel | 2.8% | 1,802 |
| 6 | memcpy | kernel | 2.6% | 1,698 |
| 7 | __fwritex | busybox | 2.6% | 1,697 |
| 8 | handle_exception | kernel | 1.8% | 1,175 |
| 9 | unmap_page_range | kernel | 1.7% | 1,122 |
| 10 | avg_vruntime | kernel | 1.4% | 904 |
| 11 | memset | busybox | 1.3% | 848 |
| 12 | n_tty_write | kernel | 1.3% | 830 |
| 13 | ret_from_exception | kernel | 1.2% | 784 |
| 14 | n_tty_read | kernel | 1% | 681 |
| 15 | next_uptodate_folio | kernel | 1% | 673 |
| 16 | (remaining) | remaining | 55.6% | 36,372 |

The hot-symbol mix remains BusyBox plus kernel memory, scheduler, and exception paths. P94’s value is that future speed rungs can line those symbols up with traffic classes.
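
Sample counts convert to approximate cycle costs through the sampling period, which is what makes them comparable with the per-class stall counters. A minimal sketch, assuming one sample per 1,024 cycles as reported; the helper `estimated_cycles` is hypothetical, and the estimates carry the usual sampling error.

```python
SAMPLE_PERIOD = 1_024        # cycles between samples, as reported
TOTAL_SAMPLES = 65_479
SHELL_WINDOW = 67_050_374    # reported shell window cycles for P94

# (symbol, image): samples, for the top three entries.
samples = {
    ("printf_core", "busybox"): 3_590,
    ("memset", "kernel"): 3_320,
    ("memcpy", "busybox"): 2_341,
}


def estimated_cycles(sample_count, period=SAMPLE_PERIOD):
    """Approximate cycles spent in a symbol, given its sample count."""
    return sample_count * period


# The sampled span agrees with the shell window to within one sampling period.
sampled_span = estimated_cycles(TOTAL_SAMPLES)
assert abs(sampled_span - SHELL_WINDOW) < SAMPLE_PERIOD

for (symbol, image), count in samples.items():
    share = 100.0 * count / TOTAL_SAMPLES
    print(f"{symbol} ({image}): ~{estimated_cycles(count):,} cycles, {share:.1f}%")
```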

Honest Status

| check | status |
|---|---|
| Named memory request classes in RTL | PASS |
| Public want/grant/class counters exported to harness | PASS |
| BusyBox shell workload runs | PASS |
| Arbiter contention counters captured | PASS |
| Separate physical instruction/data RAM ports | NOT RUN |
| Nonblocking data cache | NOT RUN |
| Shell-window speedup vs P93 | FAIL |
| LibreLane hardening | NOT RUN |

Next

P95 should target data-side latency. The low-risk first step is a store buffer; the more ambitious step is a tiny D-cache. P94 gives us enough traffic attribution to tell whether either one actually reduces the foreground load/store pressure.