P86 is the first CPU-side speed round after profiling the shell. The change is intentionally small: the unified Sv32 TLB grows from four entries to eight, and the same BusyBox shell workload from P84 runs again.
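To show why doubling the entry count can cut walks so sharply, here is a minimal behavioral sketch of a unified TLB. It is not the RTL. It assumes full associativity, LRU replacement, 4 KiB pages only (Sv32 megapages are ignored), and no ASIDs. The post does not say which replacement policy the hardware uses, and UnifiedTLB, translate, and count_walks are hypothetical names for illustration. Every miss counts as one page walk.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # Sv32 base pages are 4 KiB; megapages are ignored in this sketch


class UnifiedTLB:
    """Fully associative TLB with LRU replacement (assumed policy, not the RTL's)."""

    def __init__(self, entries: int) -> None:
        self.entries = entries
        self.lines: OrderedDict[int, int] = OrderedDict()  # VPN -> PPN, oldest first
        self.walks = 0  # number of misses, i.e. page walks that would run

    def translate(self, va: int, page_table: dict[int, int]) -> int:
        vpn = va >> PAGE_SHIFT
        if vpn in self.lines:
            self.lines.move_to_end(vpn)  # hit: mark as most recently used
        else:
            self.walks += 1  # miss: the hardware walker would run here
            if len(self.lines) >= self.entries:
                self.lines.popitem(last=False)  # evict the least recently used entry
            self.lines[vpn] = page_table[vpn]
        ppn = self.lines[vpn]
        return (ppn << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))


def count_walks(entries: int, trace: list[int]) -> int:
    """Replay a trace of virtual addresses and return how many walks it causes."""
    # Hypothetical identity-plus-offset mapping, only so lookups succeed.
    page_table = {va >> PAGE_SHIFT: (va >> PAGE_SHIFT) + 0x80000 for va in trace}
    tlb = UnifiedTLB(entries)
    for va in trace:
        tlb.translate(va, page_table)
    return tlb.walks


# Toy trace (hypothetical, not the BusyBox workload): 6 pages touched round-robin, 100 times.
trace = [page << PAGE_SHIFT for _ in range(100) for page in range(6)]
print(count_walks(4, trace), count_walks(8, trace))  # prints "600 6"
```

With six pages in rotation, the four-entry LRU table evicts each page just before it is needed again and misses on every access, while the eight-entry table misses only on the first touch of each page. Real traces are messier, but a working set that sits between four and eight pages is one plausible way a 2x table can remove more than half the walks.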
Result
| metric | P84 4-entry TLB | P86 8-entry TLB | delta |
|---|---|---|---|
| post-load cycles | 239,533,716 | 223,777,049 | -6.58% |
| CPI | 2.6615 | 2.5615 | -3.76% |
| fetch walks | 2,263,038 | 1,117,037 | -50.64% |
| load walks | 2,267,672 | 973,288 | -57.08% |
| store walks | 601,266 | 199,592 | -66.80% |
| memory handshakes | 39,642,301 | 33,111,189 | -16.48% |
| memory stall cycles | 91,814,540 | 88,823,193 | -3.26% |
The larger TLB does what it should: page walks drop hard (about -51% for fetch, -57% for load, and -67% for store). Whole-workload cycles improve by 6.58%, which is meaningful for a one-line RTL parameter change.
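Cycles fall by 6.58% but CPI only by 3.76%, so the two runs cannot have retired the same number of instructions. Since CPI = cycles / instructions, the instruction counts can be recovered from the Result table. This is a small arithmetic sketch; pct_delta is a hypothetical helper name.

```python
# Values copied from the Result table (post-load cycles and CPI).
runs = {
    "P84": {"cycles": 239_533_716, "cpi": 2.6615},
    "P86": {"cycles": 223_777_049, "cpi": 2.5615},
}


def pct_delta(old: float, new: float) -> float:
    """Percent change from old to new."""
    return (new - old) / old * 100.0


for name, r in runs.items():
    # CPI = cycles / retired instructions, so instructions = cycles / CPI.
    r["instructions"] = r["cycles"] / r["cpi"]
    print(f"{name}: ~{r['instructions'] / 1e6:.2f}M instructions retired")

print(f"cycles       {pct_delta(runs['P84']['cycles'], runs['P86']['cycles']):+.2f}%")
print(f"CPI          {pct_delta(runs['P84']['cpi'], runs['P86']['cpi']):+.2f}%")
print(f"instructions {pct_delta(runs['P84']['instructions'], runs['P86']['instructions']):+.2f}%")
```

This gives roughly 90.00M instructions for P84 and 87.36M for P86, about 2.93% fewer. The 6.58% cycle drop therefore splits into a CPI improvement and a smaller instruction count, which suggests CPI is the cleaner measure of the TLB's per-instruction effect.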
Shell Phases
| P86 phase | cycles | share of phase total |
|---|---|---|
| kernel banner to /init | 117,604,269 | 52.7% |
| /init to shell banner | 1,092,051 | 0.5% |
| shell banner to first command | 36,090,719 | 16.2% |
| echo command | 20,376 | <0.1% |
| uname -a | 2,512,591 | 1.1% |
| ls /bin /usr/share/p84 | 34,108,367 | 15.3% |
| cat sample file | 3,033,425 | 1.4% |
| touch/write/cat/rm /tmp file | 12,040,697 | 5.4% |
| 8x ash loop with file I/O | 16,637,629 | 7.5% |
| final marker | 8,860 | <0.1% |
| phase | P84 cycles | P86 cycles | delta |
|---|---|---|---|
| kernel banner to /init | 120,446,463 | 117,604,269 | -2.36% |
| shell setup to first command | 37,525,853 | 36,090,719 | -3.82% |
| ls /bin /usr/share | 36,947,459 | 34,108,367 | -7.68% |
| cat sample file | 5,484,333 | 3,033,425 | -44.69% |
| /tmp file create/read/remove | 9,997,660 | 12,040,697 | +20.44% |
| 8x ash loop with file I/O | 23,440,310 | 16,637,629 | -29.02% |
The /tmp phase going backwards (+20.44%) is a useful warning. A single run is
not a statistical benchmark, and once walks drop, the visible bottleneck
can move to scheduler, filesystem, or console behavior.
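To put the /tmp regression in proportion, the phase comparison table can be turned into a small cycle ledger. This sketch only re-does the table's arithmetic; the phases, change, saved, and lost names are illustrative.

```python
# (P84 cycles, P86 cycles) copied from the phase comparison table.
phases = {
    "kernel banner to /init": (120_446_463, 117_604_269),
    "shell setup to first command": (37_525_853, 36_090_719),
    "ls /bin /usr/share": (36_947_459, 34_108_367),
    "cat sample file": (5_484_333, 3_033_425),
    "/tmp file create/read/remove": (9_997_660, 12_040_697),
    "8x ash loop with file I/O": (23_440_310, 16_637_629),
}

# Per-phase change in cycles; negative means P86 is faster.
change = {name: new - old for name, (old, new) in phases.items()}
saved = -sum(c for c in change.values() if c < 0)
lost = sum(c for c in change.values() if c > 0)

for name, c in sorted(change.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} {c:+,d} cycles")
print(f"saved {saved:,d}, regressed {lost:,d}, net {saved - lost:,d}")
```

The five improving phases save about 16.37M cycles, and the /tmp phase gives back about 2.04M, so these six phases net about 14.33M of the 15.76M-cycle whole-workload reduction. A regression of that size in one run is exactly the kind of number that repeated runs would need to confirm or dismiss.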
Cycle Shape
| P86 state | cycles | share |
|---|---|---|
| fetch | 8,259,292 | 3.7% |
| execute | 87,409,326 | 39.1% |
| mem | 28,254,762 | 12.6% |
| walker | 4,661,932 | 2.1% |
| writeback | 87,361,454 | 39.0% |
| mul/div | 7,828,567 | 3.5% |
The walker states shrink from about 10.4M cycles in P84 to about 4.66M cycles in P86. That is the cleanest evidence that the larger TLB is actually doing work.
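A quick ratio check supports that reading: dividing walker-state cycles by total walks gives nearly the same cost per walk in both runs, so walker time tracks the walk count. The P84 walker total is only given as "about 10.4M", so its ratio and the savings share below are approximate. Variable names are illustrative.

```python
# Walk counts from the Result table (fetch + load + store).
p84_walks = 2_263_038 + 2_267_672 + 601_266
p86_walks = 1_117_037 + 973_288 + 199_592

p84_walker_cycles = 10.4e6  # approximate, as stated in the text
p86_walker_cycles = 4_661_932  # from the cycle-shape breakdown

print(f"P84: {p84_walks:,d} walks, ~{p84_walker_cycles / p84_walks:.2f} walker cycles/walk")
print(f"P86: {p86_walks:,d} walks, ~{p86_walker_cycles / p86_walks:.2f} walker cycles/walk")

# How much of the whole-workload reduction the walker states account for.
saved = p84_walker_cycles - p86_walker_cycles
total_saved = 239_533_716 - 223_777_049
print(f"walker cycles saved ~{saved / 1e6:.2f}M of {total_saved / 1e6:.2f}M ({saved / total_saved:.0%})")
```

Both runs come out at roughly 2.0 walker-state cycles per walk, and the walker states account for about 5.74M, or roughly 36%, of the 15.76M-cycle reduction. The remainder is not attributed to walker states in this breakdown.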
Hot Functions
- 5.3% of samples (3,567 samples)
- 4.7% of samples (3,167 samples)
- 3.5% of samples (2,349 samples)
- 3.1% of samples (2,039 samples)
- 2.7% of samples (1,823 samples)
- 2.7% of samples (1,803 samples)
- 2.5% of samples (1,670 samples)
- 2.4% of samples (1,613 samples)
- 1.6% of samples (1,097 samples)
- 1.6% of samples (1,066 samples)
- 1.4% of samples (943 samples)
- 1.3% of samples (857 samples)
- 1.1% of samples (729 samples)
- 1.1% of samples (706 samples)
- 0.9% of samples (606 samples)
- 55.7% of samples (37,208 samples)
The BusyBox-symbolized shell window still points at formatting and
terminal output: printf_core, memcpy, __fwritex, and the kernel's
n_tty_write remain visible.
Honest Status
| check | status |
|---|---|
| 8-entry unified TLB RTL change | PASS |
| BusyBox shell workload runs | PASS |
| P84/P86 benchmark comparison staged | PASS |
| BusyBox-symbolized hot-function profile staged | PASS |
| LibreLane hardening | NOT RUN |
Next
P87 should do the next feature round with P84/P86 as the regression benchmark. The candidates now are console batching, syscall/trap cleanup, or separating instruction/data TLB behavior instead of simply growing the unified table again.