Banked lower service model · librelane-playground

P105 keeps the real lower memory single-port, but models what a four-bank service could grant in parallel. This matters because P104 showed plenty of different-bank instruction/data overlap, but did not separate safe read-like overlap from ordering-sensitive traffic.

Result

check	result
`make check-tools`	PASS
Verilator build	PASS
Linux reaches `/init`	PASS
BusyBox prompt	PASS
BusyBox shell workload reaches `P105-FILE-OK`	PASS
Banked service model emitted	PASS
Hardened layout	NOT RUN

Timing

metric	P104 bank counters	P105 banked service model
post-load cycles	219,172,843	218,480,625
shell window cycles	65,062,462	64,438,096
retired instructions	86,339,942	86,106,731
CPI	2.5385	2.5373
BusyBox ready milestone	118,418,832	118,422,909
shell `FILE-OK` milestone	219,172,986	218,480,768
kernel panic milestone	0	0

The actual run is slightly faster than P104 (about 0.3% fewer post-load cycles and about 1% fewer shell-window cycles), but P105 does not claim a hardware speedup. The lower memory port is still one real lane. The meaningful result is the modeled service counters below.
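The CPI row in the Timing table can be reproduced from the rows above it. The sketch below assumes CPI is post-load cycles divided by retired instructions, which matches the published values; the dictionary layout and function name are illustrative only.

```python
# Cycle and instruction counts copied from the Timing table.
runs = {
    "P104": {"post_load_cycles": 219_172_843, "retired": 86_339_942},
    "P105": {"post_load_cycles": 218_480_625, "retired": 86_106_731},
}


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per retired instruction (assumed definition)."""
    return cycles / instructions


for name, run in runs.items():
    print(name, round(cpi(run["post_load_cycles"], run["retired"]), 4))

# Difference in post-load cycles between the two runs.
delta = runs["P104"]["post_load_cycles"] - runs["P105"]["post_load_cycles"]
print("post-load cycle delta:", delta)
```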

Model Policy

P105 adds side-effect flags to the Harvard service taps. It models an extra grant only when instruction and data want different lower banks and the blocked side is read-like. Fetch, prefetch, cache fills, loads, and read-only page-table walks qualify. Stores, FP stores, AMOs, store-buffer drains, and PTE A/D writes do not.
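The policy above can be read as a single predicate. The following is a minimal sketch, not the project's RTL: the `Kind` enum, `READ_LIKE` set, and `models_extra_grant` function are hypothetical names chosen for illustration, and only the classification of request kinds comes from the text.

```python
from enum import Enum, auto


class Kind(Enum):
    # Read-like kinds named in the policy.
    FETCH = auto()
    PREFETCH = auto()
    CACHE_FILL = auto()
    LOAD = auto()
    PTW_READ = auto()  # read-only page-table walk
    # Ordering-sensitive kinds named in the policy.
    STORE = auto()
    FP_STORE = auto()
    AMO = auto()
    STORE_BUFFER_DRAIN = auto()
    PTE_AD_WRITE = auto()


READ_LIKE = frozenset(
    {Kind.FETCH, Kind.PREFETCH, Kind.CACHE_FILL, Kind.LOAD, Kind.PTW_READ}
)


def models_extra_grant(granted_bank: int, blocked_bank: int, blocked_kind: Kind) -> bool:
    """True when the model would count an extra grant for the blocked side.

    Both conditions from the policy must hold: the two sides target
    different lower banks, and the blocked request is read-like.
    """
    return granted_bank != blocked_bank and blocked_kind in READ_LIKE
```

For example, `models_extra_grant(0, 1, Kind.LOAD)` is true, while a blocked `Kind.AMO` or a same-bank request stays serialized.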

Modeled Service

counter	value
split-bank wants	20,460,163
modeled extra instruction grants	488,212
modeled extra data grants	19,971,951
modeled extra grants total	20,460,163
shell-window extra grants	8,290,549
same-bank cycles left serialized	8,199,513
unsafe split cycles left serialized	0
actual shell window	64,438,096
projected shell window if each extra grant saves one cycle	56,147,547
idealized shell-window reduction	12.87%

The surprising part is unsafe_split_cycles = 0 for this run. The single-port policy usually grants the side-effecting data operation first when one exists, so in those cycles the blocked different-bank request is read-like instruction traffic. With no unsafe split cycles counted, every split-bank blocked cycle becomes a candidate extra grant under P105’s conservative model: 20,460,163 in total, of which 19,971,951 are on the data side.
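The projection in the Modeled Service table is simple arithmetic on the listed counters. The sketch below reproduces it under the table's own stated assumption that each extra grant saves exactly one cycle; the variable names are illustrative.

```python
# Counters copied from the Modeled Service table.
extra_instruction_grants = 488_212
extra_data_grants = 19_971_951
split_bank_wants = 20_460_163
shell_window_extra_grants = 8_290_549
actual_shell_window = 64_438_096

# With zero unsafe split cycles, every split-bank want becomes an extra grant.
extra_grants_total = extra_instruction_grants + extra_data_grants

# Idealized projection: one saved cycle per extra grant in the shell window.
projected_shell_window = actual_shell_window - shell_window_extra_grants
reduction_pct = 100 * shell_window_extra_grants / actual_shell_window

print(extra_grants_total == split_bank_wants)
print(projected_shell_window)
print(f"{reduction_pct:.2f}%")
```

This is an upper bound in spirit: a real implementation would only reach it if no other stall source hid or replaced the saved cycles.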

Bank Distribution

bank	I want	D want	I grant	D grant
0	27,608,896	18,463,913	27,430,941	6,359,291
1	23,344,445	13,601,194	23,222,337	6,389,211
2	24,054,837	12,984,541	23,920,432	6,529,923
3	23,573,246	13,630,263	23,376,587	7,117,260

Bank 0 remains hottest, but the model still finds enough split-bank overlap to justify a real implementation experiment.
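To put a number on "hottest", the per-bank share of requests can be computed from the table. This sketch assumes that adding the I want and D want columns is a reasonable measure of bank pressure; that choice and the names used are illustrative, not something the model itself reports.

```python
# Counters copied from the Bank Distribution table:
# bank -> (I want, D want, I grant, D grant).
BANKS = {
    0: (27_608_896, 18_463_913, 27_430_941, 6_359_291),
    1: (23_344_445, 13_601_194, 23_222_337, 6_389_211),
    2: (24_054_837, 12_984_541, 23_920_432, 6_529_923),
    3: (23_573_246, 13_630_263, 23_376_587, 7_117_260),
}


def want_share(banks):
    """Fraction of all instruction plus data wants that target each bank."""
    totals = {b: i_want + d_want for b, (i_want, d_want, _, _) in banks.items()}
    grand_total = sum(totals.values())
    return {b: t / grand_total for b, t in totals.items()}


shares = want_share(BANKS)
hottest = max(shares, key=shares.get)
for bank, share in sorted(shares.items()):
    print(f"bank {bank}: {share:.1%} of wants")
print("hottest bank:", hottest)
```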

Memory Stalls

Memory stalls, P105 banked lower-service workload: 58,657,484 stalls, 65,688,498 handshakes.

category	stalls	share of stalls	requests
instruction fetch	27,340,309	46.6%	46,723,707
data load	11,613,521	19.8%	556,407
data store	10,885,814	18.6%	76,888
atomic memory op	173,144	0.3%	166,597
page walk for fetch	677,148	1.2%	670,994
page walk for load/store	666,477	1.1%	660,291
other	7,301,071	12.4%	16,833,614

The stall chart still describes the single-port hardware. Use the banked_service_model numbers to read what could change.

Shell Phases

Shell phases, P105 shell workload: 218,480,625 cycles, CPI 2.54.

phase	cycles	share
kernel banner to /init	116,719,059	53.6%
/init to shell banner	1,075,640	0.5%
shell banner to first command	35,619,763	16.4%
echo command	1,649	0%
uname -a	2,233,445	1%
ls /bin /usr/share	32,300,369	14.8%
cat sample file	2,857,053	1.3%
touch/write/cat/rm /tmp file	10,994,749	5.1%
8x ash loop with file I/O	16,050,151	7.4%
final marker	680	0%

The same BusyBox script reaches P105-FILE-OK.

Cycle Shape

State breakdown, P105 banked lower-service workload: 218,480,625 cycles, CPI 2.54.

state	share	cycles
fetch	3.7%	8,108,238
execute	39.4%	86,131,517
mem	12.8%	27,922,270
walker	1.2%	2,674,910
writeback	39.4%	86,106,731
mul/div	3.4%	7,535,243

P105 retires 86.11M instructions at CPI 2.5373.

Hot Functions

Hot functions, P105 BusyBox shell symbols: 62,927 samples, one every 1,024 cycles.

symbol	image	share of samples	samples
printf_core	busybox	5.8%	3,626
memset	kernel	5.2%	3,249
memcpy	busybox	3.8%	2,374
vruntime_eligible	kernel	3.2%	2,011
blake2s_compress_generic	kernel	2.9%	1,811
memcpy	kernel	2.7%	1,712
__fwritex	busybox	2.7%	1,671
handle_exception	kernel	1.7%	1,059
unmap_page_range	kernel	1.6%	1,026
memset	busybox	1.4%	856
n_tty_write	kernel	1.3%	832
avg_vruntime	kernel	1.2%	778
ret_from_exception	kernel	1.2%	758
next_uptodate_folio	kernel	1%	644
do_trap_ecall_u	kernel	1%	641
(remaining)	-	55.1%	34,665

The workload is unchanged. The new result is an architecture estimate: about 8.29M shell-window cycles of blocked read-like lower-memory service are candidates for real different-bank parallelism.

Next

P106 should change the memory contract instead of adding another estimator: two near-core lanes, four lower banks, same-cycle different-bank grants, deterministic same-bank priority, and conservative serialization for stores, AMOs, and PTE updates until ordering tests exist.
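The proposed contract can be sketched as a one-cycle arbitration function. This is a behavioural illustration only, not the planned hardware: `Request`, `arbitrate`, and the `ordering_sensitive` flag are hypothetical names, and letting the data side win every serialized cycle is an assumed priority order, since the text only requires that the priority be deterministic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NUM_BANKS = 4  # four lower banks, as proposed for P106


@dataclass(frozen=True)
class Request:
    bank: int                 # target lower bank, 0..NUM_BANKS-1
    ordering_sensitive: bool  # store, AMO, or PTE update


def arbitrate(instr: Optional[Request], data: Optional[Request]) -> Tuple[bool, bool]:
    """Return (grant_instr, grant_data) for one cycle."""
    # Only one lane is requesting: grant it.
    if instr is None or data is None:
        return instr is not None, data is not None

    same_bank = instr.bank == data.bank
    must_serialize = instr.ordering_sensitive or data.ordering_sensitive

    # Different banks and nothing ordering-sensitive: grant both this cycle.
    if not same_bank and not must_serialize:
        return True, True

    # Otherwise serialize. Data-first is an assumption made for this sketch.
    return False, True
```

Under these assumptions, `arbitrate(Request(0, False), Request(1, False))` grants both lanes, while a same-bank pair or any pair involving a store grants only the data lane in that cycle.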
