Split ITLB/DTLB · librelane-playground

P101 splits translation storage. P100 separated instruction-side and data-side memory-service intent, but both sides still shared one tiny 8-entry TLB. P101 replaces that with an 8-entry ITLB and an 8-entry DTLB while keeping the page-table walker shared.

This one is not just cleaner architecture. It is faster: the shell window shrinks 4.12% against P100.

Result

metric	P94 arbiter	P100 split service	P101 split TLB
post-load cycles	222,459,202	221,990,140	217,630,965
shell window cycles	67,050,374	66,518,626	63,777,267
retired instructions	86,664,089	86,512,027	86,031,234
CPI	2.5669	2.5660	2.5297
memory stall cycles	60,032,329	59,819,129	58,999,994
fetch stall cycles	23,549,359	27,346,150	27,158,844
load stall cycles	14,632,992	10,729,427	10,999,370
fetch page walks	1,123,987	1,122,943	674,678
data page walks	1,220,476	1,218,084	666,522

comparison	result
shell window vs P100	-4.12%
post-load cycles vs P100	-1.96%
memory stalls vs P100	-1.37%
fetch stalls vs P100	-0.68%
load stalls vs P100	+2.52%
fetch walks vs P100	-39.92%
data walks vs P100	-45.28%
shell window vs P94	-4.88%
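
The comparison rows can be re-derived from the counter table. A minimal sketch, assuming each row is the plain relative change of P101 against the baseline; the dictionary keys and `pct_change` are names made up for this example, and the values are copied from the table above.

```python
# Counter values copied from the Result table.
P94 = {"shell_window": 67_050_374}
P100 = {
    "shell_window": 66_518_626,
    "post_load": 221_990_140,
    "memory_stalls": 59_819_129,
    "fetch_stalls": 27_346_150,
    "load_stalls": 10_729_427,
    "fetch_walks": 1_122_943,
    "data_walks": 1_218_084,
}
P101 = {
    "shell_window": 63_777_267,
    "post_load": 217_630_965,
    "memory_stalls": 58_999_994,
    "fetch_stalls": 27_158_844,
    "load_stalls": 10_999_370,
    "fetch_walks": 674_678,
    "data_walks": 666_522,
}


def pct_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old


# One entry per "vs P100" row, rounded to two decimals like the table.
deltas_vs_p100 = {k: round(pct_change(P101[k], P100[k]), 2) for k in P100}
shell_vs_p94 = round(pct_change(P101["shell_window"], P94["shell_window"]), 2)

for name, delta in deltas_vs_p100.items():
    print(f"{name} vs P100: {delta:+.2f}%")
print(f"shell_window vs P94: {shell_vs_p94:+.2f}%")
```

Negative values are improvements for every row here, since all of them count cycles or walks.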

P101 is a speed PASS. The load-stall bucket rises 2.52%, but the shell workload wins because translation walks fall sharply: 39.92% fewer fetch walks and 45.28% fewer data walks than P100.

Split TLB Counters

bank	activity counter	entries	count	misses	fills	replacement index
ITLB	fetch-hit cycles	8	6,107,447	674,678	674,678	5
ITLB	prefetch-hit cycles	8	139,274,063	674,678	674,678	5
DTLB	LSU hits	8	27,081,725	666,522	666,522	6

Shared walker:

walker metric	count
fetch walks	674,678
data walks	666,522
A/D writebacks	96
TLB flushes	13,447

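The split removes replacement interference between fetch and data translations, not walker serialization: ITLB and DTLB misses still take turns on the one shared walker. That distinction matters for the next rung.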
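
To make "replacement interference" concrete, here is a toy Python model. It is not the RTL. It assumes round-robin replacement (the counters only expose a replacement index, so the policy is a guess) and a deliberately adversarial synthetic stream that alternates between six code pages and six data pages. `RoundRobinTLB`, `run`, and the page numbers are invented for the sketch.

```python
class RoundRobinTLB:
    """Fully associative TLB model with a rotating victim pointer (assumed policy)."""

    def __init__(self, entries=8):
        self.slots = [None] * entries
        self.next_victim = 0
        self.walks = 0  # every miss costs one page walk

    def translate(self, vpn):
        """Return True on a hit; on a miss, count a walk and fill one slot."""
        if vpn in self.slots:
            return True
        self.walks += 1
        self.slots[self.next_victim] = vpn
        self.next_victim = (self.next_victim + 1) % len(self.slots)
        return False


def run(fetch_tlb, data_tlb, rounds=100):
    """Alternate fetch and data accesses over two 6-page working sets."""
    code_pages = [0x100 + i for i in range(6)]
    data_pages = [0x200 + i for i in range(6)]
    for _ in range(rounds):
        for code_vpn, data_vpn in zip(code_pages, data_pages):
            fetch_tlb.translate(code_vpn)
            data_tlb.translate(data_vpn)


# P100-style: both sides share one 8-entry TLB.
shared = RoundRobinTLB(8)
run(shared, shared)
shared_walks = shared.walks

# P101-style: an 8-entry ITLB and an 8-entry DTLB.
itlb, dtlb = RoundRobinTLB(8), RoundRobinTLB(8)
run(itlb, dtlb)
split_walks = itlb.walks + dtlb.walks

print("shared TLB walks:", shared_walks)
print("split TLB walks:", split_walks)
```

Twelve hot pages cycling through eight shared slots miss on every access, while each six-page set fits in its own bank and only pays cold misses. The measured workload is far less extreme than this stream. The model also only counts walks; it says nothing about walker latency, which is the part the split leaves serialized.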

Translation Shape

access	P101 bank
current PC fetch	ITLB
`next_pc` prefetch	ITLB
load/store effective address	DTLB
AMO effective address	DTLB

Both banks are flushed on reset, `satp` writes, and `sfence.vma`.
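
The routing table and the flush rule can be written down as a small behavioral sketch. This is a Python illustration under assumptions, not the hardware: `Access`, `BANK_FOR`, `SplitTLB`, and the event strings are hypothetical names, eviction is modeled as oldest-first, and every flush event clears both banks completely, as stated above.

```python
from enum import Enum


class Access(Enum):
    FETCH = "current PC fetch"
    PREFETCH = "next_pc prefetch"
    LOAD_STORE = "load/store effective address"
    AMO = "AMO effective address"


# Which bank serves which access type, following the table above.
BANK_FOR = {
    Access.FETCH: "ITLB",
    Access.PREFETCH: "ITLB",
    Access.LOAD_STORE: "DTLB",
    Access.AMO: "DTLB",
}

# Events that flush both banks (string names are assumptions for this sketch).
FLUSH_EVENTS = {"reset", "satp_write", "sfence_vma"}


class SplitTLB:
    def __init__(self, entries=8):
        self.entries = entries
        self.banks = {"ITLB": {}, "DTLB": {}}
        self.flushes = 0

    def lookup(self, access, vpn):
        """Return the cached PPN, or None when the owning bank misses."""
        return self.banks[BANK_FOR[access]].get(vpn)

    def fill(self, access, vpn, ppn):
        """Shared-walker result lands only in the bank that asked for it."""
        bank = self.banks[BANK_FOR[access]]
        if vpn not in bank and len(bank) >= self.entries:
            del bank[next(iter(bank))]  # evict the oldest entry (assumed policy)
        bank[vpn] = ppn

    def on_event(self, event):
        """Clear both banks on any flush event; ignore everything else."""
        if event in FLUSH_EVENTS:
            for bank in self.banks.values():
                bank.clear()
            self.flushes += 1


demo = SplitTLB()
demo.fill(Access.FETCH, 0x10, 0x80)
print(demo.lookup(Access.PREFETCH, 0x10))    # same bank as FETCH, so it hits
print(demo.lookup(Access.LOAD_STORE, 0x10))  # DTLB never saw this page
```

The point of the sketch is the asymmetry: a fetch-side fill is visible to the prefetch path but invisible to the data side, and a flush is the only event that touches both banks at once.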

Memory Stalls

P101 split ITLB/DTLB workload: 58,999,994 memory stall cycles, 64,560,204 handshakes.

source	stall cycles	share of stalls	requests
instruction fetch	27,158,844	46%	45,623,132
data load	10,999,370	18.6%	550,464
data store	12,019,517	20.4%	83,798
atomic memory op	172,099	0.3%	166,763
page walk for fetch	674,678	1.1%	668,524
page walk for load/store	666,522	1.1%	660,352
other	7,308,964	12.4%	16,807,171

Memory stalls fall 1.37% versus P100. The more interesting movement is inside the page-walk traffic: fetch-side and data-side PTE work both drop because each side keeps its own translations longer.
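
The stall breakdown can be checked and extended with a little arithmetic. The sketch below recomputes each bucket's share and adds a derived ratio, stall cycles per handshake. That ratio assumes the request column counts completed handshakes for the same bucket; it is not a measured latency. The names `BUCKETS`, `share_pct`, and `stalls_per_request` are invented for the example.

```python
# (stall cycles, requests) per bucket, copied from the Memory Stalls data.
BUCKETS = {
    "instruction fetch": (27_158_844, 45_623_132),
    "data load": (10_999_370, 550_464),
    "data store": (12_019_517, 83_798),
    "atomic memory op": (172_099, 166_763),
    "page walk for fetch": (674_678, 668_524),
    "page walk for load/store": (666_522, 660_352),
    "other": (7_308_964, 16_807_171),
}

total_stalls = sum(stalls for stalls, _ in BUCKETS.values())
total_requests = sum(reqs for _, reqs in BUCKETS.values())

# Share of all memory stall cycles, rounded like the chart labels.
share_pct = {
    name: round(100.0 * stalls / total_stalls, 1)
    for name, (stalls, _) in BUCKETS.items()
}

# Derived ratio: average stall cycles charged per handshake in each bucket.
stalls_per_request = {
    name: stalls / reqs for name, (stalls, reqs) in BUCKETS.items()
}

print("total stalls:", total_stalls, "total requests:", total_requests)
for name in BUCKETS:
    print(f"{name}: {share_pct[name]}% of stalls, "
          f"{stalls_per_request[name]:.2f} stall cycles per request")
```

Read this way, stores carry roughly 143 stall cycles per handshake against roughly 20 for loads, while the two page-walk buckets sit close to one stall cycle per request.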

Shell Phases

P101 shell workload: 217,630,965 cycles, CPI 2.53.

phase	cycles	share
kernel banner to /init	116,724,467	53.8%
/init to shell banner	1,073,233	0.5%
shell banner to first command	35,427,931	16.3%
echo command	1,649	0%
uname -a	2,387,886	1.1%
ls /bin /usr/share	31,716,614	14.6%
cat sample file	3,042,641	1.4%
touch/write/cat/rm /tmp file	10,791,609	5%
8x ash loop with file I/O	15,836,188	7.3%
final marker	680	0%

The full BusyBox shell script reaches `P101-FILE-OK`.

Cycle Shape

P101 split ITLB/DTLB workload: 217,630,965 cycles, CPI 2.53.

state	cycles	share
fetch	7,465,407	3.4%
execute	86,055,752	39.5%
mem	27,891,331	12.8%
walker	2,670,076	1.2%
writeback	86,031,234	39.5%
mul/div	7,515,449	3.5%

The CPI improvement is visible here: 2.5660 in P100, 2.5297 in P101, about 1.4% lower.
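
The CPI figures follow directly from the Result table, assuming CPI is computed as post-load cycles divided by retired instructions. That assumption reproduces all three published values. The `RUNS` dictionary and the `cpi` helper are names made up for this sketch.

```python
# (post-load cycles, retired instructions) per rung, from the Result table.
RUNS = {
    "P94": (222_459_202, 86_664_089),
    "P100": (221_990_140, 86_512_027),
    "P101": (217_630_965, 86_031_234),
}


def cpi(cycles, instructions):
    """Cycles per instruction."""
    return cycles / instructions


cpi_by_run = {name: round(cpi(c, i), 4) for name, (c, i) in RUNS.items()}

# Relative CPI change of P101 against P100, in percent.
cpi_delta_pct = 100.0 * (cpi(*RUNS["P101"]) - cpi(*RUNS["P100"])) / cpi(*RUNS["P100"])

for name, value in cpi_by_run.items():
    print(f"{name}: CPI {value:.4f}")
print(f"P101 vs P100: {cpi_delta_pct:+.2f}%")
```

One detail worth noticing in the state breakdown: the writeback count, 86,031,234, is exactly the retired-instruction count for P101.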

Hot Functions

P101 BusyBox shell symbols: 62,283 samples, one every 1,024 cycles.

symbol	image	share of samples	samples
printf_core	busybox	5.6%	3,500
memset	kernel	5.1%	3,198
memcpy	busybox	3.7%	2,319
vruntime_eligible	kernel	3.2%	1,984
blake2s_compress_generic	kernel	2.9%	1,802
memcpy	kernel	2.8%	1,726
__fwritex	busybox	2.7%	1,685
unmap_page_range	kernel	1.6%	1,019
handle_exception	kernel	1.6%	1,004
n_tty_write	kernel	1.3%	830
memset	busybox	1.3%	785
avg_vruntime	kernel	1.2%	749
ret_from_exception	kernel	1.1%	701
next_uptodate_folio	kernel	1.1%	680
n_tty_read	kernel	1.1%	661
(remaining)		55.3%	34,438

The software workload did not change. The speedup comes from fewer translation walks underneath the same shell script.
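
Sample counts convert to rough cycle estimates by multiplying by the sampling period. A minimal sketch, assuming each sample stands for one full 1,024-cycle period; the constant names and helper functions are invented here, and only the first three symbols are included.

```python
SAMPLE_PERIOD_CYCLES = 1_024
TOTAL_SAMPLES = 62_283
SHELL_WINDOW_CYCLES = 63_777_267  # P101 shell window from the Result table

# (symbol, image) -> samples, first three rows of the hot-function list.
TOP_SYMBOLS = {
    ("printf_core", "busybox"): 3_500,
    ("memset", "kernel"): 3_198,
    ("memcpy", "busybox"): 2_319,
}


def share_pct(samples):
    """Share of all samples, rounded to one decimal like the chart."""
    return round(100.0 * samples / TOTAL_SAMPLES, 1)


def approx_cycles(samples):
    """Cycle estimate: each sample is taken to represent one sampling period."""
    return samples * SAMPLE_PERIOD_CYCLES


covered_cycles = approx_cycles(TOTAL_SAMPLES)

for (symbol, image), samples in TOP_SYMBOLS.items():
    print(f"{symbol} ({image}): {share_pct(samples)}%, "
          f"~{approx_cycles(samples):,} cycles")
print(f"cycles covered by all samples: ~{covered_cycles:,}")
```

The total comes to about 63.78 million cycles, within one sampling period of the P101 shell window. That suggests, without proving it, that the profile covers the shell window and not the whole post-load run.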

Honest Status

check	status
`make check-tools`	PASS
Verilator build	PASS
BusyBox userspace/initramfs build	PASS
Linux image rebuilt with P101 initramfs	PASS
BusyBox shell workload reaches `P101-FILE-OK`	PASS
P101 chart data captured	PASS
Separate ITLB and DTLB storage	PASS
Shared walker fills side-specific bank	PASS
Shell-window speedup vs P100	PASS
Parallel page walkers	NOT RUN
Data-side write buffer with forwarding	NOT RUN
Nonblocking miss machinery	NOT RUN
LibreLane hardening	NOT RUN

Next

P102 should move back to the data side: a write buffer with same-address forwarding and clear fence/AMO ordering, or an MSHR-lite miss tracker if we want to attack blocking loads first.
