Harvard I/D map · librelane-playground

P99 is a map rung, not a performance rung. It keeps the P98 core behavior, reruns the BusyBox shell profile under P99 labels, and writes down exactly what the Harvard instruction/data split needs to mean for this core.

The short version: we have I-cache and D-cache structures. We do not yet have independent instruction and data service.

Current Shape

area	current P99 reality	Harvard target
instruction cache	64-line, 4-word direct-mapped I-cache	independent instruction-side L1 service
data cache	64-line, 4-word write-through D-cache for aligned RAM `LW`/`SW`	independent data-side L1 service
fetch queue	one-entry next-instruction queue	enough frontend buffering to cover data hiccups
translation	one 8-entry unified TLB with separate fetch/LSU lookup wires	split ITLB/DTLB lookup and refill accounting
page-table walker	one walker tagged by `ptw_for_fetch_q`	split request queues or explicit lower arbitration
stores	write-through on the shared path	data-side write buffer with forwarding
lower memory	one selected `mem_arb_class` drives `mem_valid`	lower shared memory or banks with conflict counters
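As a reading aid for the cache rows, here is a minimal sketch of how a 64-line, 4-word direct-mapped cache splits an address. It assumes 32-bit words and byte addressing, neither of which this page states; `split_address` and `conflicts` are illustrative names, not signals from the RTL.

```python
# Geometry from the table above; the word size is an assumption (32-bit words).
WORD_BYTES = 4
WORDS_PER_LINE = 4
NUM_LINES = 64

LINE_BYTES = WORD_BYTES * WORDS_PER_LINE      # 16 bytes per line
CAPACITY_BYTES = LINE_BYTES * NUM_LINES       # 1024 bytes per cache
OFFSET_BITS = LINE_BYTES.bit_length() - 1     # 4 bits select a byte in the line
INDEX_BITS = NUM_LINES.bit_length() - 1       # 6 bits select the line


def split_address(addr):
    """Return (tag, index, offset) for a byte address."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def conflicts(a, b):
    """True when two addresses map to the same line but carry different tags."""
    tag_a, index_a, _ = split_address(a)
    tag_b, index_b, _ = split_address(b)
    return index_a == index_b and tag_a != tag_b


print(split_address(0x80000430))
print(conflicts(0x80000000, 0x80000400))
```

Under these assumptions each cache holds 1 KiB, so any two addresses exactly 1 KiB apart compete for the same line.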

Result

metric	P94 arbiter	P98 throttle	P99 map
post-load cycles	222,459,202	221,452,591	222,509,604
shell window cycles	67,050,374	66,055,345	66,998,698
retired instructions	86,664,089	86,329,983	86,648,693
CPI	2.5669	2.5652	2.5680
memory stall cycles	60,032,329	59,683,338	59,928,278
fetch stall cycles	23,549,359	27,286,526	27,399,253
load stall cycles	14,632,992	10,697,962	10,718,661

comparison	result
shell window vs P98	+1.43%
post-load cycles vs P98	+0.48%
memory stalls vs P98	+0.41%
fetch stalls vs P98	+0.41%
load stalls vs P98	+0.19%
shell window vs P94	-0.08%
load stalls vs P94	-26.75%
fetch stalls vs P94	+16.35%
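The comparison rows are plain relative changes computed from the metric table. A minimal sketch that recomputes them, with the counter values copied from that table (the `RUNS` layout and helper names are illustrative, not part of the project's tooling):

```python
# Counter values copied from the metric table above.
RUNS = {
    "P94": {"post_load": 222_459_202, "shell_window": 67_050_374,
            "retired": 86_664_089, "mem_stall": 60_032_329,
            "fetch_stall": 23_549_359, "load_stall": 14_632_992},
    "P98": {"post_load": 221_452_591, "shell_window": 66_055_345,
            "retired": 86_329_983, "mem_stall": 59_683_338,
            "fetch_stall": 27_286_526, "load_stall": 10_697_962},
    "P99": {"post_load": 222_509_604, "shell_window": 66_998_698,
            "retired": 86_648_693, "mem_stall": 59_928_278,
            "fetch_stall": 27_399_253, "load_stall": 10_718_661},
}


def pct_change(metric, new="P99", old="P98"):
    """Relative change of `metric` from run `old` to run `new`, in percent."""
    base = RUNS[old][metric]
    return (RUNS[new][metric] - base) / base * 100.0


def cpi(run):
    """Post-load cycles divided by retired instructions."""
    return RUNS[run]["post_load"] / RUNS[run]["retired"]


for metric in ("shell_window", "post_load", "mem_stall", "fetch_stall", "load_stall"):
    print(f"{metric} vs P98: {pct_change(metric):+.2f}%")
for metric in ("shell_window", "load_stall", "fetch_stall"):
    print(f"{metric} vs P94: {pct_change(metric, old='P94'):+.2f}%")
for run in RUNS:
    print(f"{run} CPI: {cpi(run):.4f}")
```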

P99 is a functional PASS. It is not a speed PASS versus P98: the shell window is 1.43% slower and post-load cycles are 0.48% higher. That is expected, because P99 does not implement the split yet.

Current Request Classes

class	future side
`fetch`	instruction
`execute_prefetch`	instruction
`writeback_prefetch`	instruction
`icache_background`	instruction
`load`	data
`store`	data
`fp_load`	data
`fp_store`	data
`amo`	data
`ptw_fetch`	instruction translation
`ptw_lsu`	data translation
`dcache_background`	data

P94 gave these clients names. P100 should stop forcing them through one final near-core port before the memory model can see them.
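One way to picture the table is as a lookup from request class to near-core port. The sketch below is hypothetical: the class names come from the table, but `SIDE_OF_CLASS`, `port_for`, and the choice to route each page-walk class through the port of the side that triggered it are assumptions, not the P100 design.

```python
# Class names and sides copied from the table above.
SIDE_OF_CLASS = {
    "fetch": "instruction",
    "execute_prefetch": "instruction",
    "writeback_prefetch": "instruction",
    "icache_background": "instruction",
    "load": "data",
    "store": "data",
    "fp_load": "data",
    "fp_store": "data",
    "amo": "data",
    "ptw_fetch": "instruction translation",
    "ptw_lsu": "data translation",
    "dcache_background": "data",
}


def port_for(request_class):
    """Return 'instruction' or 'data' for a request class.

    Assumption: a translation request uses the port of the side that caused it.
    """
    return SIDE_OF_CLASS[request_class].split()[0]


by_port = {"instruction": [], "data": []}
for name in SIDE_OF_CLASS:
    by_port[port_for(name)].append(name)

for port, classes in by_port.items():
    print(port, len(classes), classes)
```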

Memory Stalls

P99 Harvard I/D map workload: 59,928,278 memory stall cycles, 66,683,777 handshakes.

stall source	stall cycles	share of stalls	requests
instruction fetch	27,399,253	45.7%	46,182,344
data load	10,718,661	17.9%	892,084
data store	11,987,867	20%	222,648
atomic memory op	158,393	0.3%	184,717
page walk for fetch	1,135,398	1.9%	1,129,244
page walk for load/store	1,243,664	2.1%	1,243,154
other	7,285,042	12.2%	16,829,586

The profile still shows the same basic shape: load stalls are 26.75% below P94 thanks to the D-cache work, but fetch stalls are 16.35% above P94 because the frontend remains tied to the shared service point.
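Dividing each class's stall cycles by its request count gives a rough cost per request. This ratio is derived here, not reported by the profiler, and it assumes the stall and handshake counters cover the same events; treat it as a sketch for comparing classes only.

```python
# (stall cycles, requests) per class, copied from the memory stall figures above.
STALLS = {
    "instruction fetch": (27_399_253, 46_182_344),
    "data load": (10_718_661, 892_084),
    "data store": (11_987_867, 222_648),
    "atomic memory op": (158_393, 184_717),
    "page walk for fetch": (1_135_398, 1_129_244),
    "page walk for load/store": (1_243_664, 1_243_154),
    "other": (7_285_042, 16_829_586),
}

total_stalls = sum(stall for stall, _ in STALLS.values())
total_requests = sum(req for _, req in STALLS.values())


def stall_per_request(name):
    """Average stall cycles charged to one request of the given class."""
    stall, req = STALLS[name]
    return stall / req


for name, (stall, req) in STALLS.items():
    share = stall / total_stalls * 100.0
    print(f"{name:26s} {share:5.1f}%  {stall_per_request(name):6.2f} stall cycles/request")
```

By this measure stores are the most expensive class per request even though fetch dominates the total, which is consistent with the write-through store path listed under Current Shape.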

Shell Phases

P99 shell workload: 222,509,604 cycles, CPI 2.57.

phase	cycles	share of cycles
kernel banner to /init	117,615,427	53%
/init to shell banner	1,076,089	0.5%
shell banner to first command	36,191,325	16.3%
echo command	1,598	0%
uname -a	2,604,026	1.2%
ls /bin /usr/share	32,177,153	14.5%
cat sample file	2,923,000	1.3%
touch/write/cat/rm /tmp file	11,447,584	5.2%
8x ash loop with file I/O	16,330,121	7.4%
final marker	1,515,216	0.7%

The full BusyBox shell script reaches P99-FILE-OK.
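The phase figures also line up with the shell window metric in the Result section: the seven phases from the echo command through the final marker sum to exactly 66,998,698 cycles. That reading of the window boundary is inferred from the numbers, not stated anywhere on this page. A small check:

```python
# Phase cycle counts copied from the shell phase figures above.
PHASES = [
    ("kernel banner to /init", 117_615_427),
    ("/init to shell banner", 1_076_089),
    ("shell banner to first command", 36_191_325),
    ("echo command", 1_598),
    ("uname -a", 2_604_026),
    ("ls /bin /usr/share", 32_177_153),
    ("cat sample file", 2_923_000),
    ("touch/write/cat/rm /tmp file", 11_447_584),
    ("8x ash loop with file I/O", 16_330_121),
    ("final marker", 1_515_216),
]
SHELL_WINDOW_CYCLES = 66_998_698  # P99 value from the Result table

# Assumption: the window opens at the first command (index 3, "echo command").
window = sum(cycles for _, cycles in PHASES[3:])
print(window, window == SHELL_WINDOW_CYCLES)
```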

Cycle Shape

P99 Harvard I/D map workload: 222,509,604 cycles, CPI 2.57.

state	share of cycles	cycles
fetch	3.8%	8,352,672
execute	39%	86,673,821
mem	12.7%	28,152,093
walker	2.1%	4,751,460
writeback	38.9%	86,648,693
mul/div	3.6%	7,929,149

No new execution state was added. P99 validates that the mapped core still runs Linux userspace.
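The state breakdown can be read as a CPI decomposition by dividing each state's cycles by retired instructions. The sketch below uses only the figures on this page; the per-state CPI values are derived here, not reported by the profiler.

```python
# State cycle counts copied from the breakdown above.
STATE_CYCLES = {
    "fetch": 8_352_672,
    "execute": 86_673_821,
    "mem": 28_152_093,
    "walker": 4_751_460,
    "writeback": 86_648_693,
    "mul/div": 7_929_149,
}
RETIRED = 86_648_693  # P99 retired instructions from the Result table

cpi_by_state = {state: cycles / RETIRED for state, cycles in STATE_CYCLES.items()}
total_cpi = sum(cpi_by_state.values())

for state, value in cpi_by_state.items():
    print(f"{state:10s} {value:.4f} cycles/instruction")
print(f"{'total':10s} {total_cpi:.4f}")
```

Writeback cycles equal the retired instruction count exactly and execute is within 0.03% of it, so about two of the 2.57 cycles per instruction are the fixed execute and writeback states; the remainder is fetch, mem, walker, and mul/div time.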

Hot Functions

P99 BusyBox shell symbols: 65,428 samples, one sample every 1,024 cycles.

symbol	image	share of samples	samples
printf_core	busybox	5.4%	3,548
memset	kernel	5.1%	3,317
memcpy	busybox	3.6%	2,348
vruntime_eligible	kernel	3.4%	2,252
blake2s_compress_generic	kernel	2.7%	1,795
__fwritex	busybox	2.7%	1,783
memcpy	kernel	2.6%	1,678
handle_exception	kernel	1.8%	1,188
unmap_page_range	kernel	1.7%	1,086
avg_vruntime	kernel	1.4%	923
n_tty_write	kernel	1.3%	858
memset	busybox	1.2%	808
ret_from_exception	kernel	1.2%	805
n_tty_read	kernel	1%	671
next_uptodate_folio	kernel	1%	664
(remaining)	remaining	55.7%	36,422

The software mix remains the same shell workload. The P99 page is about hardware service boundaries, not a new application.
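The sample count is consistent with sampling over the shell window only: 65,428 samples at one per 1,024 cycles cover 66,998,272 cycles, which is within one sampling period of the 66,998,698-cycle shell window. That the sampler runs only inside the shell window is an inference from this arithmetic, not something the page states.

```python
SAMPLES = 65_428                  # from the hot functions figures above
PERIOD_CYCLES = 1_024             # one sample every 1,024 cycles
SHELL_WINDOW_CYCLES = 66_998_698  # P99 value from the Result table

covered = SAMPLES * PERIOD_CYCLES
gap = SHELL_WINDOW_CYCLES - covered


def share(samples):
    """Percentage of all samples attributed to one symbol."""
    return samples / SAMPLES * 100.0


print(covered, gap)
print(f"printf_core: {share(3_548):.1f}%")
```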

Honest Status

check	status
P98 RTL cloned and relabeled for P99	PASS
Verilator build	PASS
BusyBox shell workload runs	PASS
P99 chart data captured	PASS
Harvard I/D map written	PASS
Shell-window speedup vs P98	FAIL
True split I/D RAM ports	NOT RUN
Split ITLB/DTLB	NOT RUN
Data-side write buffer with forwarding	NOT RUN
Nonblocking miss machinery	NOT RUN
LibreLane hardening	NOT RUN

Next

P100 should implement the first split-port model: instruction-side and data-side service intents near the core, with the old shared memory model kept underneath as a measured lower conflict point.
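To make "measured lower conflict point" concrete, here is a toy model: two near-core intents per cycle, one shared lower port, and a counter for cycles where both sides want service. Everything in it is hypothetical, including the `SharedLowerModel` name and the fixed data-first priority; a real model would also hold the losing request and retry it.

```python
from collections import Counter


class SharedLowerModel:
    """Toy shared lower memory: grants one side per cycle and counts conflicts."""

    def __init__(self):
        self.counters = Counter()

    def step(self, i_intent, d_intent):
        """Take one request class name (or None) per side; return the granted side."""
        if i_intent is not None and d_intent is not None:
            self.counters["conflict"] += 1
            winner = "data"  # assumption: placeholder fixed priority, not a P100 policy
        elif i_intent is not None:
            winner = "instruction"
        elif d_intent is not None:
            winner = "data"
        else:
            self.counters["idle"] += 1
            return None
        self.counters["grant_" + winner] += 1
        return winner


# Illustrative trace of (instruction-side intent, data-side intent) per cycle.
trace = [("fetch", None), ("fetch", "load"), (None, "store"),
         (None, None), ("fetch", "amo")]

model = SharedLowerModel()
grants = [model.step(i, d) for i, d in trace]
print(grants)
print(dict(model.counters))
```

The conflict counter is the figure to watch: it separates stalls caused by true I/D contention at the lower level from stalls caused by funneling every class through one near-core port.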
