8-entry TLB shell perf · librelane-playground

P86 is the first CPU-side speed round after profiling the shell. The change is intentionally small: the unified Sv32 TLB grows from four entries to eight, and the same BusyBox shell workload from P84 runs again.
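To make the mechanism concrete, here is a toy model of why a few extra entries can remove a large share of page walks. It is only a sketch: the RTL is not shown in this write-up, so the fully associative organization, the FIFO replacement policy, the `ToyTLB` class, and the synthetic trace are all assumptions made for illustration.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # Sv32 base pages are 4 KiB


class ToyTLB:
    """Fully associative TLB with FIFO replacement (both are assumptions)."""

    def __init__(self, entries):
        self.entries = entries
        self.slots = OrderedDict()  # vpn -> True, oldest entry first
        self.walks = 0

    def access(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.slots:
            return True  # hit: no page walk needed
        self.walks += 1  # miss: the walker would run here
        if len(self.slots) >= self.entries:
            self.slots.popitem(last=False)  # evict the oldest entry
        self.slots[vpn] = True
        return False


def count_walks(entries, trace):
    """Replay a list of virtual addresses and return the number of walks."""
    tlb = ToyTLB(entries)
    for vaddr in trace:
        tlb.access(vaddr)
    return tlb.walks


# Synthetic trace (not from the workload): loop over six pages, 100 times.
trace = [page << PAGE_SHIFT for _ in range(100) for page in range(6)]
walks_4 = count_walks(4, trace)
walks_8 = count_walks(8, trace)
print(walks_4, walks_8)
```

A six-page loop is a deliberately harsh case: it thrashes a four-entry FIFO table and fits entirely in eight entries. The real shell workload has a mixed working set, so its measured reductions are smaller than this toy suggests.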

Result

metric	P84 4-entry TLB	P86 8-entry TLB	delta
post-load cycles	239,533,716	223,777,049	-6.58%
CPI	2.6615	2.5615	-3.76%
fetch walks	2,263,038	1,117,037	-50.64%
load walks	2,267,672	973,288	-57.08%
store walks	601,266	199,592	-66.80%
memory handshakes	39,642,301	33,111,189	-16.48%
memory stall cycles	91,814,540	88,823,193	-3.26%

The larger TLB does what it should: page walks fall by 51% to 67% depending on access type. Whole-workload cycles improve by 6.58%, which is meaningful for a one-line RTL parameter change.
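The delta column can be reproduced from the raw counters. The sketch below copies the P84 and P86 values from the result table; the `pct_delta` helper name is just an illustrative choice.

```python
# (P84, P86) counter pairs copied from the result table.
results = {
    "post-load cycles": (239_533_716, 223_777_049),
    "fetch walks": (2_263_038, 1_117_037),
    "load walks": (2_267_672, 973_288),
    "store walks": (601_266, 199_592),
    "memory handshakes": (39_642_301, 33_111_189),
    "memory stall cycles": (91_814_540, 88_823_193),
}


def pct_delta(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


deltas = {name: pct_delta(b, a) for name, (b, a) in results.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

Keeping the raw pairs in one place makes it easy to add the P87 column later without recomputing percentages by hand.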

Shell Phases

P86 shell workload: 223,777,049 cycles, CPI 2.56

phase	cycles	share
kernel banner to /init	117,604,269	52.7%
/init to shell banner	1,092,051	0.5%
shell banner to first command	36,090,719	16.2%
echo command	20,376	0.0%
uname -a	2,512,591	1.1%
ls /bin /usr/share/p84	34,108,367	15.3%
cat sample file	3,033,425	1.4%
touch/write/cat/rm /tmp file	12,040,697	5.4%
8x ash loop with file I/O	16,637,629	7.5%
final marker	8,860	0.0%

phase	P84 cycles	P86 cycles	delta
kernel banner to `/init`	120,446,463	117,604,269	-2.36%
shell setup to first command	37,525,853	36,090,719	-3.82%
`ls /bin /usr/share`	36,947,459	34,108,367	-7.68%
`cat` sample file	5,484,333	3,033,425	-44.69%
`/tmp` file create/read/remove	9,997,660	12,040,697	+20.44%
8x ash loop with file I/O	23,440,310	16,637,629	-29.02%

The /tmp phase regressing by 20.44% is a useful warning. This is a single run, not a statistical benchmark, and once walks drop, the visible bottleneck can move to scheduler, filesystem, or console behavior.
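One way to keep single-run numbers from being over-read is to make the comparison refuse to judge them. The helper below is a hypothetical sketch, not part of the project's tooling: `compare_runs` and its "twice the summed standard deviation" threshold are arbitrary illustrative choices.

```python
from statistics import mean, stdev


def compare_runs(before, after):
    """Return (delta_pct, verdict) for two lists of per-run cycle counts."""
    delta_pct = (mean(after) - mean(before)) / mean(before) * 100.0
    if len(before) < 2 or len(after) < 2:
        # With one run per side there is no spread to compare against.
        return delta_pct, "single run: noise unknown"
    # Crude noise estimate: summed run-to-run spread, as % of the baseline.
    noise_pct = (stdev(before) + stdev(after)) / mean(before) * 100.0
    if abs(delta_pct) > 2 * noise_pct:
        return delta_pct, "outside noise"
    return delta_pct, "within noise"


# The /tmp phase as measured: one P84 run and one P86 run.
delta, verdict = compare_runs([9_997_660], [12_040_697])
print(f"{delta:+.2f}% ({verdict})")
```

With repeated runs of each configuration, the same call would classify the /tmp change as within or outside the measured spread instead of leaving it open.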

Cycle Shape

P86 8-entry TLB shell workload: 223,777,049 cycles, CPI 2.56

state	share	cycles
fetch	3.7%	8,259,292
execute	39.1%	87,409,326
mem	12.6%	28,254,762
walker	2.1%	4,661,932
writeback	39%	87,361,454
mul/div	3.5%	7,828,567

The walker states shrink from about 10.4M cycles in P84 to about 4.66M cycles in P86. That is the cleanest evidence that the larger TLB is actually doing work.
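As a rough cross-check, the walker cycles can be divided by the walk counts from the result table. This assumes the walker-state counter and the three walk counters cover the same events, which this write-up does not confirm; the P84 figure uses the rounded "about 10.4M" value, so it is approximate.

```python
# P86: fetch + load + store walks from the result table.
walks_p86 = 1_117_037 + 973_288 + 199_592
walker_cycles_p86 = 4_661_932
cycles_per_walk_p86 = walker_cycles_p86 / walks_p86

# P84: walk counts are exact, walker cycles are only quoted as "about 10.4M".
walks_p84 = 2_263_038 + 2_267_672 + 601_266
approx_cycles_per_walk_p84 = 10.4e6 / walks_p84

print(f"P86: {cycles_per_walk_p86:.2f} walker cycles per walk")
print(f"P84: about {approx_cycles_per_walk_p84:.2f} walker cycles per walk")
```

Both configurations come out at roughly two walker-state cycles per walk. That would be consistent with a two-level Sv32 walk, and it suggests the saving comes from doing fewer walks rather than cheaper ones, though the exact counter definition is a guess here.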

Hot Functions

P86 BusyBox shell symbols: 66,760 samples, one sample every 1,024 cycles

symbol	image	share	samples
printf_core	busybox	5.3%	3,567
memset	kernel	4.7%	3,167
memcpy	busybox	3.5%	2,349
vruntime_eligible	kernel	3.1%	2,039
memcpy	kernel	2.7%	1,823
blake2s_compress_generic	kernel	2.7%	1,803
__fwritex	busybox	2.5%	1,670
n_tty_write	kernel	2.4%	1,613
unmap_page_range	kernel	1.6%	1,097
handle_exception	kernel	1.6%	1,066
avg_vruntime	kernel	1.4%	943
memset	busybox	1.3%	857
ret_from_exception	kernel	1.1%	729
next_uptodate_folio	kernel	1.1%	706
sortcmp	busybox	0.9%	606
(remaining)	-	55.7%	37,208

The BusyBox-symbolized shell window still points at formatting and terminal output: printf_core, memcpy, __fwritex, and kernel n_tty_write remain visible.
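Sample counts convert to approximate cycles by multiplying by the sampling period. The sketch below does that for the four output-path symbols named above, using the counts from the hot-function list. Treating the BusyBox `memcpy` (rather than the kernel one) as the output-path copy is an assumption, and sampled estimates are only as precise as the 1,024-cycle period allows.

```python
SAMPLE_PERIOD = 1_024   # cycles between samples
TOTAL_SAMPLES = 66_760  # samples in the symbolized shell window

# Output-path symbols and their sample counts from the hot-function list.
output_path = {
    "printf_core (busybox)": 3_567,
    "memcpy (busybox)": 2_349,  # assumption: the busybox copy is the one meant
    "__fwritex (busybox)": 1_670,
    "n_tty_write (kernel)": 1_613,
}


def est_cycles(samples):
    """Approximate cycles represented by a number of samples."""
    return samples * SAMPLE_PERIOD


window_cycles = est_cycles(TOTAL_SAMPLES)
path_samples = sum(output_path.values())
path_share = path_samples / TOTAL_SAMPLES * 100.0

print(f"window: about {window_cycles:,} cycles")
print(f"output path: {path_samples:,} samples, {path_share:.1f}% of the window")
```

Summing related symbols this way gives a single number to track if a later round targets console output.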

Honest Status

check	status
8-entry unified TLB RTL change	PASS
BusyBox shell workload runs	PASS
P84/P86 benchmark comparison staged	PASS
BusyBox-symbolized hot-function profile staged	PASS
LibreLane hardening	NOT RUN

Next

P87 should do the next feature round with P84/P86 as the regression benchmark. The candidates now are console batching, syscall/trap cleanup, or separating instruction/data TLB behavior instead of simply growing the unified table again.
