No. 86 / project of 147 on the ladder

8-entry TLB shell perf

introduces — 8-entry unified TLB; shell workload before/after; BusyBox-symbolized perf regression test

harden state
last run: 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P86 is the first CPU-side speed round after profiling the shell. The change is intentionally small: the unified Sv32 TLB grows from four entries to eight, and the same BusyBox shell workload from P84 runs again.
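To illustrate why doubling a small TLB can cut walks sharply, here is a toy model, not the P86 RTL. It assumes a fully associative table with LRU replacement and a synthetic trace that loops over six 4 KiB pages. The replacement policy, the trace, and the names `ToyTLB` and `count_walks` are all hypothetical; the real design's policy is not stated here.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KiB base pages, as in Sv32


class ToyTLB:
    """Fully associative TLB model with LRU replacement (policy is an assumption)."""

    def __init__(self, entries):
        self.entries = entries
        self.slots = OrderedDict()  # vpn -> True, ordered oldest to newest
        self.walks = 0

    def access(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.slots:
            self.slots.move_to_end(vpn)  # hit: mark as most recently used
            return True
        self.walks += 1  # miss: a page walk would be needed
        if len(self.slots) >= self.entries:
            self.slots.popitem(last=False)  # evict the least recently used entry
        self.slots[vpn] = True
        return False


def count_walks(entries, trace):
    tlb = ToyTLB(entries)
    for vaddr in trace:
        tlb.access(vaddr)
    return tlb.walks


# Synthetic trace: 100 passes over 6 distinct pages (hypothetical working set)
trace = [page << PAGE_SHIFT for _ in range(100) for page in range(6)]
walks_4 = count_walks(4, trace)
walks_8 = count_walks(8, trace)
print(f"4 entries: {walks_4} walks, 8 entries: {walks_8} walks")
```

In this sketch the working set fits in eight entries but not in four, so the four-entry table misses on every access. The real workload is far less regular, which is why the measured reduction is smaller than the toy case suggests.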

Result

metric                P84 4-entry TLB   P86 8-entry TLB   delta
post-load cycles      239,533,716       223,777,049       -6.58%
CPI                   2.6615            2.5615            -3.76%
fetch walks           2,263,038         1,117,037         -50.64%
load walks            2,267,672         973,288           -57.08%
store walks           601,266           199,592           -66.80%
memory handshakes     39,642,301        33,111,189        -16.48%
memory stall cycles   91,814,540        88,823,193        -3.26%

The larger TLB does what it should: page walks drop by roughly 51% (fetch) to 67% (store). Whole-workload cycles improve by 6.58%, which is meaningful for a one-line RTL parameter change.
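The delta column can be reproduced from the raw counters. The sketch below is a minimal check, assuming delta means (P86 - P84) / P84; the helper name `pct_delta` is illustrative and not part of the project tooling.

```python
def pct_delta(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (P84 4-entry, P86 8-entry) values copied from the result table
results = {
    "post-load cycles": (239_533_716, 223_777_049),
    "fetch walks": (2_263_038, 1_117_037),
    "load walks": (2_267_672, 973_288),
    "store walks": (601_266, 199_592),
    "memory handshakes": (39_642_301, 33_111_189),
    "memory stall cycles": (91_814_540, 88_823_193),
}

deltas = {name: round(pct_delta(b, a), 2) for name, (b, a) in results.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}%")
```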

Shell Phases

P86 shell workload phases (223,777,049 cycles, CPI 2.56)

phase                                  cycles        share
1. kernel banner to /init              117,604,269   52.7%
2. /init to shell banner               1,092,051     0.5%
3. shell banner to first command       36,090,719    16.2%
4. echo command                        20,376        0%
5. uname -a                            2,512,591     1.1%
6. ls /bin /usr/share/p84              34,108,367    15.3%
7. cat sample file                     3,033,425     1.4%
8. touch/write/cat/rm /tmp file        12,040,697    5.4%
9. 8x ash loop with file I/O           16,637,629    7.5%
10. final marker                       8,860         0%

phase                            P84 cycles    P86 cycles    delta
kernel banner to /init           120,446,463   117,604,269   -2.36%
shell setup to first command     37,525,853    36,090,719    -3.82%
ls /bin /usr/share               36,947,459    34,108,367    -7.68%
cat sample file                  5,484,333     3,033,425     -44.69%
/tmp file create/read/remove     9,997,660     12,040,697    +20.44%
8x ash loop with file I/O        23,440,310    16,637,629    -29.02%

The /tmp phase going backwards is a useful warning. A single run is not a statistical benchmark, and once walks drop, the visible bottleneck can move to scheduler, filesystem, or console behavior.
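A per-phase regression check along these lines would catch the /tmp slowdown even when the whole-workload number improves. This is a sketch, not the project's regression test: the 5% tolerance and the name `find_regressions` are hypothetical, and with single runs any threshold is a judgment call.

```python
# (P84 cycles, P86 cycles) per phase, copied from the comparison table
phases = {
    "kernel banner to /init": (120_446_463, 117_604_269),
    "shell setup to first command": (37_525_853, 36_090_719),
    "ls /bin /usr/share": (36_947_459, 34_108_367),
    "cat sample file": (5_484_333, 3_033_425),
    "/tmp file create/read/remove": (9_997_660, 12_040_697),
    "8x ash loop with file I/O": (23_440_310, 16_637_629),
}

THRESHOLD_PCT = 5.0  # hypothetical tolerance for run-to-run noise


def find_regressions(phases, threshold_pct):
    """Return (phase, delta %) for phases that got slower by more than the threshold."""
    flagged = []
    for name, (before, after) in phases.items():
        delta = (after - before) / before * 100.0
        if delta > threshold_pct:
            flagged.append((name, round(delta, 2)))
    return flagged


regressions = find_regressions(phases, THRESHOLD_PCT)
for name, delta in regressions:
    print(f"REGRESSION {name}: {delta:+.2f}%")
```

Repeating the run several times and comparing medians would make such a check more trustworthy than a fixed tolerance on one sample.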

Cycle Shape

P86 8-entry TLB state breakdown (shell workload, 223,777,049 cycles, CPI 2.56)

state          share   cycles
1. fetch       3.7%    8,259,292
2. execute     39.1%   87,409,326
3. mem         12.6%   28,254,762
4. walker      2.1%    4,661,932
5. writeback   39%     87,361,454
6. mul/div     3.5%    7,828,567

The walker states shrink from about 10.4M cycles in P84 to about 4.66M cycles in P86. That is the cleanest evidence that the larger TLB is actually doing work.
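As a consistency check, the drop in walker-state cycles can be compared with the drop in total walk count. The arithmetic below uses the table values for walks and the P86 state breakdown. The P84 walker figure is only the rounded "about 10.4M" from the text, so that result is approximate. The cycles-per-walk figure assumes the walker counter covers only cycles spent in walker states, which is not confirmed here.

```python
# Walk counts (fetch + load + store) from the result table
walks_p84 = 2_263_038 + 2_267_672 + 601_266
walks_p86 = 1_117_037 + 973_288 + 199_592

# Walker-state cycles: P86 is exact, P84 is the rounded figure from the text
walker_cycles_p86 = 4_661_932
walker_cycles_p84_approx = 10_400_000
total_cycles_p86 = 223_777_049

walker_share_p86 = walker_cycles_p86 / total_cycles_p86 * 100.0
cycles_per_walk_p86 = walker_cycles_p86 / walks_p86
walk_count_change = (walks_p86 - walks_p84) / walks_p84 * 100.0
walker_cycle_change = (
    (walker_cycles_p86 - walker_cycles_p84_approx) / walker_cycles_p84_approx * 100.0
)

print(f"walker share of P86 cycles: {walker_share_p86:.1f}%")
print(f"walker-state cycles per walk (P86): {cycles_per_walk_p86:.2f}")
print(f"walk count change: {walk_count_change:+.1f}%")
print(f"walker cycle change (approx): {walker_cycle_change:+.1f}%")
```

The two reductions land within a fraction of a percentage point of each other, which fits the reading that the walker savings come from fewer walks rather than cheaper ones.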

Hot Functions

P86 BusyBox shell hot functions (66,760 samples, one sample every 1,024 cycles)

rank  symbol                      image     share   samples
1     printf_core                 busybox   5.3%    3,567
2     memset                      kernel    4.7%    3,167
3     memcpy                      busybox   3.5%    2,349
4     vruntime_eligible           kernel    3.1%    2,039
5     memcpy                      kernel    2.7%    1,823
6     blake2s_compress_generic    kernel    2.7%    1,803
7     __fwritex                   busybox   2.5%    1,670
8     n_tty_write                 kernel    2.4%    1,613
9     unmap_page_range            kernel    1.6%    1,097
10    handle_exception            kernel    1.6%    1,066
11    avg_vruntime                kernel    1.4%    943
12    memset                      busybox   1.3%    857
13    ret_from_exception          kernel    1.1%    729
14    next_uptodate_folio         kernel    1.1%    706
15    sortcmp                     busybox   0.9%    606
16    (remaining)                           55.7%   37,208

The BusyBox-symbolized shell window still points at formatting and terminal output: printf_core, memcpy, __fwritex, and kernel n_tty_write remain visible.

Honest Status

check                                             status
8-entry unified TLB RTL change                    PASS
BusyBox shell workload runs                       PASS
P84/P86 benchmark comparison staged               PASS
BusyBox-symbolized hot-function profile staged    PASS
LibreLane hardening                               NOT RUN

Next

P87 should do the next feature round with P84/P86 as the regression benchmark. The candidates now are console batching, syscall/trap cleanup, or separating instruction/data TLB behavior instead of simply growing the unified table again.