No. 105 of 147 projects on the ladder

Banked lower service model

Introduces: read-like lower-bank service estimator; conservative side-effect classification; projected banked-memory shell speedup

harden state: last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P105 keeps the real lower memory single-port, but models what a four-bank service could grant in parallel. This matters because P104 showed plenty of different-bank instruction/data overlap, but did not separate safe read-like overlap from ordering-sensitive traffic.

Result

check                                        result
make check-tools                             PASS
Verilator build                              PASS
Linux reaches /init                          PASS
BusyBox prompt                               PASS
BusyBox shell workload reaches P105-FILE-OK  PASS
Banked service model emitted                 PASS
Hardened layout                              NOT RUN

Timing

metric                    P104 bank counters  P105 banked service model
post-load cycles          219,172,843         218,480,625
shell window cycles       65,062,462          64,438,096
retired instructions      86,339,942          86,106,731
CPI                       2.5385              2.5373
BusyBox ready milestone   118,418,832         118,422,909
shell FILE-OK milestone   219,172,986         218,480,768
kernel panic milestone    0                   0

The actual run is slightly faster than P104 (about 0.3% fewer post-load cycles), but P105 does not claim a hardware speedup: the lower memory port is still one real lane. The meaningful result is the model block below.
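The comparison with P104 can be checked with a few lines of arithmetic. This is a minimal sketch using the Timing table values; it assumes CPI is post-load cycles divided by retired instructions, which reproduces the table's figures. The dictionary keys and helper names are illustrative, not taken from the project.

```python
# Values copied from the Timing table; key names are hypothetical.
P104 = {"post_load_cycles": 219_172_843, "shell_window_cycles": 65_062_462,
        "retired": 86_339_942}
P105 = {"post_load_cycles": 218_480_625, "shell_window_cycles": 64_438_096,
        "retired": 86_106_731}


def cpi(run):
    """Cycles per instruction, assuming post-load cycles / retired instructions."""
    return run["post_load_cycles"] / run["retired"]


def pct_fewer(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


for key in ("post_load_cycles", "shell_window_cycles"):
    delta = P104[key] - P105[key]
    print(f"{key}: {delta:,} fewer cycles ({pct_fewer(P104[key], P105[key]):.2f}%)")

print(f"CPI P104 {cpi(P104):.4f}, P105 {cpi(P105):.4f}")
```

The difference is well under 1% on both counters, which fits the reading that this is run-to-run variation in the instrumented design rather than a hardware gain.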

Model Policy

P105 adds side-effect flags to the Harvard service taps. It models an extra grant only when instruction and data want different lower banks and the blocked side is read-like. Fetch, prefetch, cache fills, loads, and read-only page-table walks qualify. Stores, FP stores, AMOs, store-buffer drains, and PTE A/D writes do not.
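The policy above can be sketched as a per-cycle classifier. This is an illustrative model only, not the project's RTL: the `Op` names, the `classify_cycle` function, and the three result labels are assumptions chosen to mirror the counters reported in this page.

```python
from enum import Enum, auto


class Op(Enum):
    # Read-like operations (qualify for a modeled extra grant).
    FETCH = auto()
    PREFETCH = auto()
    CACHE_FILL = auto()
    LOAD = auto()
    PTW_READ = auto()            # read-only page-table walk
    # Side-effecting operations (never get a modeled extra grant).
    STORE = auto()
    FP_STORE = auto()
    AMO = auto()
    STORE_BUFFER_DRAIN = auto()
    PTE_AD_WRITE = auto()        # PTE accessed/dirty update


READ_LIKE = frozenset({Op.FETCH, Op.PREFETCH, Op.CACHE_FILL, Op.LOAD, Op.PTW_READ})


def classify_cycle(granted_bank: int, blocked_bank: int, blocked_op: Op) -> str:
    """Classify one cycle in which one side was granted and the other blocked."""
    if granted_bank == blocked_bank:
        return "same_bank_serialized"       # banking cannot help here
    if blocked_op in READ_LIKE:
        return "modeled_extra_grant"        # safe candidate for parallel service
    return "unsafe_split_serialized"        # different bank, but ordering-sensitive


print(classify_cycle(0, 1, Op.FETCH))   # different banks, read-like blocked side
print(classify_cycle(2, 2, Op.LOAD))    # same bank
print(classify_cycle(0, 3, Op.AMO))     # different banks, side-effecting blocked side
```

Only the blocked side is inspected, because the granted side proceeds exactly as it does on the single-port hardware.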

Modeled Service

counter                                                    value
split-bank wants                                           20,460,163
modeled extra instruction grants                           488,212
modeled extra data grants                                  19,971,951
modeled extra grants total                                 20,460,163
shell-window extra grants                                  8,290,549
same-bank cycles left serialized                           8,199,513
unsafe split cycles left serialized                        0
actual shell window                                        64,438,096
projected shell window if each extra grant saves one cycle 56,147,547
idealized shell-window reduction                           12.87%

The surprising part is unsafe_split_cycles = 0 for this run. The single-port policy usually grants the side-effecting data operation first when one exists, so the request left blocked on a different bank is read-like instruction traffic. In the cycles where data was the blocked side (the 19,971,951 modeled extra data grants), the blocked data operation was also read-like. Under P105’s conservative model, every split-bank blocked cycle therefore becomes a candidate extra grant.
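The projection in the Modeled Service table is simple arithmetic and can be reproduced as follows. This is a sketch under the table's own stated assumption that each extra grant saves exactly one cycle; the function and constant names are illustrative.

```python
# Values copied from the Modeled Service table.
EXTRA_INSTRUCTION_GRANTS = 488_212
EXTRA_DATA_GRANTS = 19_971_951
SPLIT_BANK_WANTS = 20_460_163
ACTUAL_SHELL_WINDOW = 64_438_096
SHELL_WINDOW_EXTRA_GRANTS = 8_290_549


def project_window(actual_cycles, extra_grants, cycles_saved_per_grant=1):
    """Idealized projection: every modeled extra grant removes a fixed cycle count."""
    projected = actual_cycles - extra_grants * cycles_saved_per_grant
    reduction_pct = 100.0 * (actual_cycles - projected) / actual_cycles
    return projected, reduction_pct


# With zero unsafe split cycles, every split-bank want becomes an extra grant.
assert EXTRA_INSTRUCTION_GRANTS + EXTRA_DATA_GRANTS == SPLIT_BANK_WANTS

projected, pct = project_window(ACTUAL_SHELL_WINDOW, SHELL_WINDOW_EXTRA_GRANTS)
print(f"projected shell window: {projected:,} cycles ({pct:.2f}% reduction)")
```

The result is an upper bound on the benefit: a real implementation would save less wherever a blocked request was not on the critical path.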

Bank Distribution

bank  I want      D want      I grant     D grant
0     27,608,896  18,463,913  27,430,941  6,359,291
1     23,344,445  13,601,194  23,222,337  6,389,211
2     24,054,837  12,984,541  23,920,432  6,529,923
3     23,573,246  13,630,263  23,376,587  7,117,260

Bank 0 remains the hottest by demand on both the instruction and data sides, but the model still finds enough split-bank overlap to justify a real implementation experiment.
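To see how uneven the bank demand is, the per-bank wants can be summed and expressed as shares. This is an illustrative calculation on the Bank Distribution numbers; combining I want and D want into a single "demand" figure is an assumption made here for the comparison, not a counter from the run.

```python
# bank -> (I want, D want, I grant, D grant), copied from the Bank Distribution table.
BANKS = {
    0: (27_608_896, 18_463_913, 27_430_941, 6_359_291),
    1: (23_344_445, 13_601_194, 23_222_337, 6_389_211),
    2: (24_054_837, 12_984_541, 23_920_432, 6_529_923),
    3: (23_573_246, 13_630_263, 23_376_587, 7_117_260),
}

# Hypothetical combined demand per bank: instruction wants plus data wants.
demand = {bank: i_want + d_want for bank, (i_want, d_want, _, _) in BANKS.items()}
total = sum(demand.values())
share = {bank: 100.0 * d / total for bank, d in demand.items()}
hottest = max(demand, key=demand.get)

for bank in sorted(BANKS):
    print(f"bank {bank}: {demand[bank]:,} wants ({share[bank]:.1f}% of total)")
print(f"hottest bank: {hottest}")
```

A perfectly even spread would put 25% on each bank, so the gap between bank 0's share and 25% is a rough measure of how much same-bank contention the interleaving leaves behind.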

Memory Stalls

memory stalls: P105 banked lower-service workload (stalls 58,657,484; handshakes 65,688,498)
  1. instruction fetch: 27,340,309 stalls (46.6%), 46,723,707 req
  2. data load: 11,613,521 stalls (19.8%), 556,407 req
  3. data store: 10,885,814 stalls (18.6%), 76,888 req
  4. atomic memory op: 173,144 stalls (0.3%), 166,597 req
  5. page walk for fetch: 677,148 stalls (1.2%), 670,994 req
  6. page walk for load/store: 666,477 stalls (1.1%), 660,291 req
  7. other: 7,301,071 stalls (12.4%), 16,833,614 req

The stall chart still reflects the single-port hardware. Use the banked_service_model numbers to read what could change.
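One way to read the stall chart is to divide stalls by requests per category. The sketch below does that with the chart's numbers. The stalls-per-request ratio is a derived figure introduced here for illustration, and it assumes the stall and request counts in each row describe the same set of operations.

```python
# (category, stalls, requests), copied from the memory stalls chart.
STALLS = [
    ("instruction fetch", 27_340_309, 46_723_707),
    ("data load", 11_613_521, 556_407),
    ("data store", 10_885_814, 76_888),
    ("atomic memory op", 173_144, 166_597),
    ("page walk for fetch", 677_148, 670_994),
    ("page walk for load/store", 666_477, 660_291),
    ("other", 7_301_071, 16_833_614),
]

total_stalls = sum(stalls for _, stalls, _ in STALLS)
total_reqs = sum(reqs for _, _, reqs in STALLS)

# Derived metric (not in the original chart): average stalls per request.
per_req = {name: stalls / reqs for name, stalls, reqs in STALLS}
worst = max(per_req, key=per_req.get)

for name, stalls, _ in STALLS:
    pct = 100.0 * stalls / total_stalls
    print(f"{name}: {pct:.1f}% of stalls, {per_req[name]:.2f} stalls/req")
print(f"highest stalls per request: {worst}")
```

Instruction fetch dominates total stalls because it issues the most requests, while loads and stores are far more expensive per request. That split matters when deciding which side benefits from an extra bank grant.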

Shell Phases

shell phases: P105 shell workload (cycles 218,480,625; CPI 2.54)
  1. kernel banner to /init: 116,719,059 cycles (53.6%)
  2. /init to shell banner: 1,075,640 cycles (0.5%)
  3. shell banner to first command: 35,619,763 cycles (16.4%)
  4. echo command: 1,649 cycles (0%)
  5. uname -a: 2,233,445 cycles (1%)
  6. ls /bin /usr/share: 32,300,369 cycles (14.8%)
  7. cat sample file: 2,857,053 cycles (1.3%)
  8. touch/write/cat/rm /tmp file: 10,994,749 cycles (5.1%)
  9. 8x ash loop with file I/O: 16,050,151 cycles (7.4%)
  10. final marker: 680 cycles (0%)

The same BusyBox script reaches P105-FILE-OK.

Cycle Shape

state breakdown: P105 banked lower-service workload (cycles 218,480,625; CPI 2.54)
  1. fetch: 8,108,238 cycles (3.7%)
  2. execute: 86,131,517 cycles (39.4%)
  3. mem: 27,922,270 cycles (12.8%)
  4. walker: 2,674,910 cycles (1.2%)
  5. writeback: 86,106,731 cycles (39.4%)
  6. mul/div: 7,535,243 cycles (3.4%)

P105 retires 86.11M instructions at CPI 2.5373.

Hot Functions

hot functions: P105 BusyBox shell symbols (samples 62,927; one sample every 1,024 cycles)
  1. printf_core (busybox): 3,626 samples (5.8%)
  2. memset (kernel): 3,249 samples (5.2%)
  3. memcpy (busybox): 2,374 samples (3.8%)
  4. vruntime_eligible (kernel): 2,011 samples (3.2%)
  5. blake2s_compress_generic (kernel): 1,811 samples (2.9%)
  6. memcpy (kernel): 1,712 samples (2.7%)
  7. __fwritex (busybox): 1,671 samples (2.7%)
  8. handle_exception (kernel): 1,059 samples (1.7%)
  9. unmap_page_range (kernel): 1,026 samples (1.6%)
  10. memset (busybox): 856 samples (1.4%)
  11. n_tty_write (kernel): 832 samples (1.3%)
  12. avg_vruntime (kernel): 778 samples (1.2%)
  13. ret_from_exception (kernel): 758 samples (1.2%)
  14. next_uptodate_folio (kernel): 644 samples (1%)
  15. do_trap_ecall_u (kernel): 641 samples (1%)
  16. (remaining): 34,665 samples (55.1%)

The workload is unchanged. The new result is an architecture estimate: about 8.29M shell-window cycles of blocked read-like lower-memory service are candidates for real different-bank parallelism.

Next

P106 should change the memory contract instead of adding another estimator: two near-core lanes, four lower banks, same-cycle different-bank grants, deterministic same-bank priority, and conservative serialization for stores, AMOs, and PTE updates until ordering tests exist.
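One possible reading of that contract, sketched as a behavioral arbiter rather than RTL. Everything here is hypothetical: the `Request` type, the `arbitrate` function, the rule that ordering-sensitive requests go first, and the data-before-instruction tie-break are assumptions picked to illustrate "deterministic same-bank priority" and "conservative serialization". The actual P106 policy may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_BANKS = 4


@dataclass(frozen=True)
class Request:
    lane: str                  # "I" (instruction lane) or "D" (data lane)
    bank: int                  # lower bank index, 0..NUM_BANKS-1
    ordering_sensitive: bool   # store, AMO, or PTE update


def arbitrate(i_req: Optional[Request], d_req: Optional[Request]) -> List[Request]:
    """Return the requests granted this cycle, highest priority first."""
    # Assumed tie-break: data lane before instruction lane.
    reqs = [r for r in (d_req, i_req) if r is not None]
    if len(reqs) < 2:
        return reqs
    # Ordering-sensitive requests go first; the stable sort keeps the tie-break.
    first, second = sorted(reqs, key=lambda r: not r.ordering_sensitive)
    # A second grant is allowed only on a different bank and only if read-like.
    if first.bank != second.bank and not second.ordering_sensitive:
        return [first, second]
    return [first]   # same bank, or conservative serialization


fetch = Request("I", 0, False)
store = Request("D", 1, True)
load_same_bank = Request("D", 0, False)

print([r.lane for r in arbitrate(fetch, store)])           # both lanes granted
print([r.lane for r in arbitrate(fetch, load_same_bank)])  # same bank: one grant
```

Keeping the arbiter a pure function of the two pending requests makes the priority deterministic by construction, which is convenient when writing the ordering tests the plan calls for.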