Split I/D service · librelane-playground

P100 is the first measured Harvard-service rung. It keeps ISA-visible behavior intact, classifies each memory client as instruction-side or data-side service intent, and counts where the existing shared lower-memory policy still serializes the two sides.

The short version: this is not two physical RAM ports yet. It is the boundary and the counters that make the next split honest.

Result

metric	P94 arbiter	P98 throttle	P99 map	P100 split service
post-load cycles	222,459,202	221,452,591	222,509,604	221,990,140
shell window cycles	67,050,374	66,055,345	66,998,698	66,518,626
retired instructions	86,664,089	86,329,983	86,648,693	86,512,027
CPI	2.5669	2.5652	2.5680	2.5660
memory stall cycles	60,032,329	59,683,338	59,928,278	59,819,129
fetch stall cycles	23,549,359	27,286,526	27,399,253	27,346,150
load stall cycles	14,632,992	10,697,962	10,718,661	10,729,427
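
The CPI row can be reproduced from the two rows above it. The sketch below assumes CPI here means post-load cycles divided by retired instructions, which matches all four columns to four decimal places; the dictionary and function names are illustrative only.

```python
# (post-load cycles, retired instructions) copied from the Result table.
RUNS = {
    "P94": (222_459_202, 86_664_089),
    "P98": (221_452_591, 86_329_983),
    "P99": (222_509_604, 86_648_693),
    "P100": (221_990_140, 86_512_027),
}


def cpi(post_load_cycles: int, retired_instructions: int) -> float:
    """Cycles per retired instruction over the post-load window (assumed definition)."""
    return post_load_cycles / retired_instructions


for name, (cycles, retired) in RUNS.items():
    print(f"{name}: CPI {cpi(cycles, retired):.4f}")
```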

comparison	result
shell window vs P99	-0.72%
post-load cycles vs P99	-0.23%
memory stalls vs P99	-0.18%
fetch stalls vs P99	-0.19%
load stalls vs P99	+0.10%
shell window vs P98	+0.70%
shell window vs P94	-0.79%
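
The comparison rows are plain relative changes against the earlier rung. A minimal sketch, assuming each result is (new - old) / old expressed as a percentage; `pct_change` is a hypothetical helper name, and the inputs are the table values above.

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# P100 and P99 values copied from the Result table.
P100 = {
    "shell window": 66_518_626,
    "post-load cycles": 221_990_140,
    "memory stalls": 59_819_129,
    "fetch stalls": 27_346_150,
    "load stalls": 10_729_427,
}
P99 = {
    "shell window": 66_998_698,
    "post-load cycles": 222_509_604,
    "memory stalls": 59_928_278,
    "fetch stalls": 27_399_253,
    "load stalls": 10_718_661,
}

for metric, new in P100.items():
    print(f"{metric} vs P99: {pct_change(new, P99[metric]):+.2f}%")

# Shell-window cycles for P98 and P94, also from the Result table.
print(f"shell window vs P98: {pct_change(P100['shell window'], 66_055_345):+.2f}%")
print(f"shell window vs P94: {pct_change(P100['shell window'], 67_050_374):+.2f}%")
```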

P100 is a functional PASS. The 0.72% shell-window improvement versus P99 is useful, but the main claim is structural: the old one-port service policy is now visible as separate instruction-side and data-side demand, plus the conflict between them at the lower shared service.

Split Service Counters

side	want cycles	grant cycles	not granted by lower policy
instruction	99,366,119	99,366,119	0
data	59,349,365	26,952,797	32,396,568

lower shared service	cycles
both sides wanted service	28,051,030
lower handshakes	66,499,787
lower stall cycles	59,819,129

In this policy, instruction-side traffic always wins when it wants the lower service: all 99.37M of its want cycles are granted. Data-side traffic wants service for 59.35M cycles and is not granted in 32.40M of them (54.6%). That is the number the next rungs should either reduce or move into a better-explained lower-memory bucket.
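
To make the counter definitions concrete, here is a toy per-cycle model of a fixed-priority policy in which the instruction side always wins. It is a sketch under stated assumptions, not the P100 RTL: the class and field names are hypothetical, and the model denies a data grant only when the instruction side also wants service. The measured data-side not-granted count (32.40M) is larger than the both-wanted count (28.05M), so the real lower policy evidently withholds data grants in some cycles this model does not capture.

```python
from dataclasses import dataclass, field


@dataclass
class SideCounters:
    want: int = 0
    grant: int = 0
    not_granted: int = 0


@dataclass
class SplitService:
    """Toy fixed-priority arbiter: instruction side wins every contested cycle."""

    instruction: SideCounters = field(default_factory=SideCounters)
    data: SideCounters = field(default_factory=SideCounters)
    both_wanted: int = 0

    def step(self, instr_wants: bool, data_wants: bool) -> None:
        """Account for one cycle of service intent."""
        if instr_wants and data_wants:
            self.both_wanted += 1
        if instr_wants:
            self.instruction.want += 1
            self.instruction.grant += 1  # never denied in this policy
        if data_wants:
            self.data.want += 1
            if instr_wants:
                self.data.not_granted += 1  # lost the cycle to the instruction side
            else:
                self.data.grant += 1


svc = SplitService()
for instr_wants, data_wants in [(True, True), (True, False), (False, True), (False, False)]:
    svc.step(instr_wants, data_wants)

# The same bookkeeping identity holds for the measured P100 data-side counters.
assert svc.data.want == svc.data.grant + svc.data.not_granted
assert 59_349_365 == 26_952_797 + 32_396_568
```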

Request Split

class	P100 side
`fetch`	instruction
`execute_prefetch`	instruction
`writeback_prefetch`	instruction
`icache_background`	instruction
`ptw_fetch`	instruction
`load`	data
`store`	data
`fp_load`	data
`fp_store`	data
`amo`	data
`ptw_lsu`	data
`dcache_background`	data

The old lower `mem_valid` path still exists underneath this split. That is why P100 can measure conflict at the lower service without also changing cache, translation, store, AMO, or fence behavior.
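
The classification in the table amounts to a fixed lookup from request class to side. A minimal sketch of that mapping follows; the class names come from the table, while `REQUEST_SIDE` and `side_of` are hypothetical names used only for illustration.

```python
INSTRUCTION = "instruction"
DATA = "data"

# Request class -> P100 side, copied from the Request Split table.
REQUEST_SIDE = {
    "fetch": INSTRUCTION,
    "execute_prefetch": INSTRUCTION,
    "writeback_prefetch": INSTRUCTION,
    "icache_background": INSTRUCTION,
    "ptw_fetch": INSTRUCTION,
    "load": DATA,
    "store": DATA,
    "fp_load": DATA,
    "fp_store": DATA,
    "amo": DATA,
    "ptw_lsu": DATA,
    "dcache_background": DATA,
}


def side_of(request_class: str) -> str:
    """Return the service side for a request class; unknown classes raise KeyError."""
    return REQUEST_SIDE[request_class]


# Page-table walks are split by the access that triggered them.
print(side_of("ptw_fetch"), side_of("ptw_lsu"))
```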

Memory Stalls

Memory stalls, P100 split I/D service workload: 59,819,129 stall cycles, 66,499,787 handshakes.

class	stall cycles	share of stalls	requests
instruction fetch	27,346,150	45.7%	46,090,879
data load	10,729,427	17.9%	879,270
data store	11,963,949	20%	217,262
atomic memory op	157,854	0.3%	183,986
page walk for fetch	1,122,943	1.9%	1,116,789
page walk for load/store	1,218,084	2%	1,217,565
other	7,280,722	12.2%	16,794,036

Load stalls remain far below P94 (10.73M versus 14.63M cycles, about 27% lower) thanks to the D-cache work. Fetch stalls remain elevated versus P94 (27.35M versus 23.55M cycles, about 16% higher) because the frontend still competes for the lower shared service whenever instruction-side work misses the near-core structures.
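
The breakdown can be cross-checked against the lower-service totals, and the per-class ratio of stall cycles to requests gives a rough cost per request. This is a sketch that assumes the "req" figures are lower handshakes attributed to each class (they do sum to the 66,499,787 handshake total); the ratio is a derived number, not one reported by the run.

```python
# (stall cycles, requests) per class, copied from the Memory Stalls breakdown.
STALLS = {
    "instruction fetch": (27_346_150, 46_090_879),
    "data load": (10_729_427, 879_270),
    "data store": (11_963_949, 217_262),
    "atomic memory op": (157_854, 183_986),
    "page walk for fetch": (1_122_943, 1_116_789),
    "page walk for load/store": (1_218_084, 1_217_565),
    "other": (7_280_722, 16_794_036),
}

total_stalls = sum(stall for stall, _ in STALLS.values())
total_requests = sum(reqs for _, reqs in STALLS.values())


def stalls_per_request(label: str) -> float:
    """Average stall cycles charged per request of the given class."""
    stall, reqs = STALLS[label]
    return stall / reqs


for label, (stall, _) in STALLS.items():
    print(f"{label}: {stall / total_stalls:.1%} of stalls, "
          f"{stalls_per_request(label):.2f} stall cycles per request")
```

On these numbers, stores cost roughly 55 stall cycles per request and loads roughly 12, while fetch averages under one, so the data side is expensive per request even though fetch dominates the total.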

Shell Phases

Shell phases, P100 shell workload: 221,990,140 cycles, CPI 2.57.

phase	cycles	share
kernel banner to /init	117,624,921	53.1%
/init to shell banner	1,093,314	0.5%
shell banner to first command	36,125,214	16.3%
echo command	1,649	0%
uname -a	2,535,944	1.2%
ls /bin /usr/share	31,727,690	14.3%
cat sample file	4,129,183	1.9%
touch/write/cat/rm /tmp file	11,905,538	5.4%
8x ash loop with file I/O	16,217,942	7.3%
final marker	680	0%

The full BusyBox shell script reaches P100-FILE-OK.
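
The phase list also ties back to the shell-window metric in the Result table: the phases from the echo command through the final marker sum exactly to 66,518,626 cycles. The sketch below checks that; treating the shell window as "first command onward" is an inference from the arithmetic, not something the run output states.

```python
# (phase, cycles) copied from the Shell Phases breakdown, in order.
PHASES = [
    ("kernel banner to /init", 117_624_921),
    ("/init to shell banner", 1_093_314),
    ("shell banner to first command", 36_125_214),
    ("echo command", 1_649),
    ("uname -a", 2_535_944),
    ("ls /bin /usr/share", 31_727_690),
    ("cat sample file", 4_129_183),
    ("touch/write/cat/rm /tmp file", 11_905_538),
    ("8x ash loop with file I/O", 16_217_942),
    ("final marker", 680),
]

phase_total = sum(cycles for _, cycles in PHASES)

# Everything from the first shell command to the final marker.
command_phases = PHASES[3:]
shell_window = sum(cycles for _, cycles in command_phases)

print(f"phase total:  {phase_total:,}")
print(f"shell window: {shell_window:,}")
print(f"boot share:   {PHASES[0][1] / phase_total:.1%}")
```

The listed percentages appear to be shares of the phase total (221,362,075 cycles) rather than of the 221,990,140 post-load cycles: the first row is 53.1% of the former and 53.0% of the latter.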

Cycle Shape

State breakdown, P100 split I/D service workload: 221,990,140 cycles, CPI 2.57.

state	share	cycles
fetch	3.8%	8,346,661
execute	39%	86,537,001
mem	12.7%	28,085,707
walker	2.1%	4,675,381
writeback	39%	86,512,027
mul/div	3.5%	7,831,647

No ISA-visible execution state was added. The new work is counters and service classification around the existing memory arbiter.

Hot Functions

Hot functions, P100 BusyBox shell symbols: 64,960 samples, one every 1,024 cycles.

symbol	image	share of samples	samples
printf_core	busybox	5.6%	3,662
memset	kernel	5%	3,253
memcpy	busybox	3.7%	2,385
vruntime_eligible	kernel	3.3%	2,151
blake2s_compress_generic	kernel	2.8%	1,806
memcpy	kernel	2.7%	1,760
__fwritex	busybox	2.6%	1,688
handle_exception	kernel	1.8%	1,160
unmap_page_range	kernel	1.7%	1,102
avg_vruntime	kernel	1.4%	895
n_tty_write	kernel	1.3%	858
memset	busybox	1.3%	830
ret_from_exception	kernel	1.1%	738
do_trap_ecall_u	kernel	1%	652
next_uptodate_folio	kernel	1%	641
(remaining)		55.4%	35,999

The software mix is still the BusyBox shell workload. P100 changes how we explain the hardware service path underneath it.
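
The sample count and per-symbol shares follow from the sampling period. A minimal sketch, assuming the profiler takes one sample every 1,024 cycles across the 66,518,626-cycle shell window; that the sampled window is the shell window is an inference from the matching sample count, and the helper names are hypothetical.

```python
import math

SHELL_WINDOW_CYCLES = 66_518_626  # from the Result table
SAMPLE_PERIOD = 1_024             # cycles between samples
TOTAL_SAMPLES = 64_960            # reported sample count

# One sample per period, counting the partial final period.
expected_samples = math.ceil(SHELL_WINDOW_CYCLES / SAMPLE_PERIOD)


def share(samples: int) -> float:
    """Fraction of all samples attributed to a symbol."""
    return samples / TOTAL_SAMPLES


def approx_cycles(samples: int) -> int:
    """Rough cycle estimate for a symbol: each sample stands for one period."""
    return samples * SAMPLE_PERIOD


print(f"expected samples: {expected_samples:,}")
print(f"printf_core: {share(3_662):.1%}, about {approx_cycles(3_662):,} cycles")
```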

Honest Status

check	status
`make check-tools`	PASS
Verilator build	PASS
BusyBox userspace/initramfs build	PASS
Linux image rebuilt with P100 initramfs	PASS
BusyBox shell workload reaches `P100-FILE-OK`	PASS
P100 chart data captured	PASS
Instruction/data service intent counters	PASS
Split physical RAM ports	NOT RUN
Split ITLB/DTLB storage	NOT RUN
Data-side write buffer with forwarding	NOT RUN
Nonblocking miss machinery	NOT RUN
LibreLane hardening	NOT RUN

Next

P101 should keep the P100 service counters and split translation next: separate ITLB/DTLB lookup/refill accounting, with the shared page-table walker still underneath at first. That tells us whether the next block of lost cycles is translation interference or lower memory service.
