P87 keeps the P86 core and changes the shell bridge. Once BusyBox ash is
running on its Linux PTY, console_sh writes PTY output directly to the
platform UART MMIO register instead of writing it back through
/dev/console.
That bypasses the guest Linux HVC/SBI output path for shell text. Early kernel printk still uses the normal SBI console.
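The shape of the direct bridge can be sketched as a small copy loop. This is a minimal illustration, not the console_sh source: `drain_to_uart`, `src_fd`, and `write_tx_byte` are hypothetical names, the file descriptor stands in for the PTY master, and the callback stands in for one store to the UART transmit register. A real bridge would likely also check the UART's transmit-ready status before each store.

```python
import errno
import os


def drain_to_uart(src_fd, write_tx_byte, chunk_size=256):
    """Copy bytes from src_fd to a UART transmit callback until EOF.

    src_fd stands in for the PTY master (hypothetical name).
    write_tx_byte stands in for a single store to the UART transmit
    register (hypothetical name); it receives one int in 0..255.
    Returns the number of bytes forwarded.
    """
    total = 0
    while True:
        try:
            chunk = os.read(src_fd, chunk_size)
        except OSError as exc:
            # A Linux PTY master reports EIO once the slave side closes.
            if exc.errno == errno.EIO:
                break
            raise
        if not chunk:
            # Plain EOF, e.g. when src_fd is a pipe.
            break
        for byte in chunk:
            write_tx_byte(byte)
        total += len(chunk)
    return total
```

The point of the sketch is that each output byte goes to the device without passing back through /dev/console; nothing here models the kernel side of the old path.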
Result
| metric | P86 /dev/console bridge | P87 direct UART bridge | delta |
|---|---|---|---|
| post-load cycles | 223,777,049 | 222,825,777 | -0.43% |
| shell window cycles | 68,361,945 | 67,266,772 | -1.60% |
| retired instructions | 87,361,454 | 86,750,479 | -0.70% |
| CPI | 2.5615 | 2.5686 | +0.28% |
| memory stall cycles | 88,823,193 | 88,210,458 | -0.69% |
| n_tty_write samples | 1,613 | 847 | -47.49% |
| hvc_sbi_tty_put samples | 264 | 0 | -100.00% |
| sbi_console_putchar samples | 173 | 0 | -100.00% |
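The delta and CPI columns can be recomputed from the raw counts in the table. The sketch below uses only the post-load cycle and retired-instruction figures above; the helper names `pct_delta` and `cpi` are illustrative. It also shows why CPI rises slightly even though cycles fall: retired instructions dropped faster than cycles did.

```python
def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(cycles, instructions):
    """Cycles per retired instruction."""
    return cycles / instructions


# Post-load cycles and retired instructions from the table above.
p86 = {"cycles": 223_777_049, "instructions": 87_361_454}
p87 = {"cycles": 222_825_777, "instructions": 86_750_479}

cycle_delta = pct_delta(p86["cycles"], p87["cycles"])
instr_delta = pct_delta(p86["instructions"], p87["instructions"])
cpi_delta = pct_delta(cpi(**p86), cpi(**p87))

print(f"cycles {cycle_delta:+.2f}%  instructions {instr_delta:+.2f}%  CPI {cpi_delta:+.2f}%")
# prints: cycles -0.43%  instructions -0.70%  CPI +0.28%
```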
The profile moved in the intended direction. The old HVC/SBI output
symbols disappear from the shell-window folded profile, and
n_tty_write roughly halves (1,613 to 847 samples). The wall-cycle gain
is smaller, at -1.60% for the shell window and -0.43% post-load, which
is also useful data: the final console byte path was not the only
bottleneck.
Shell Phases
- kernel banner to /init 117,614,359 52.9%
- /init to shell banner 1,085,555 0.5%
- shell banner to first command 36,231,026 16.3%
- echo command 1,598 0%
- uname -a 2,544,990 1.2%
- ls /bin /usr/share 31,752,029 14.3%
- cat sample file 4,837,758 2.2%
- touch/write/cat/rm /tmp file 10,838,537 4.9%
- 8x ash loop with file I/O 16,336,480 7.4%
- final marker 955,380 0.4%
| phase | P86 cycles | P87 cycles | delta |
|---|---|---|---|
| shell setup to first command | 36,090,719 | 36,231,026 | +0.39% |
| echo marker | 20,376 | 1,598 | -92.16% |
| uname -a | 2,512,591 | 2,544,990 | +1.29% |
| ls /bin /usr/share | 34,108,367 | 31,752,029 | -6.91% |
| cat sample file | 3,033,425 | 4,837,758 | +59.48% |
| /tmp file create/read/remove | 12,040,697 | 10,838,537 | -9.98% |
| 8x ash loop with file I/O | 16,637,629 | 16,336,480 | -1.81% |
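The phase movement is mixed. ls (-6.91%) and the /tmp work (-9.98%)
improve, cat gets 59.48% slower in this single run, and the final
marker is noisy. The echo marker's -92.16% comes from a phase of only
about 20,000 cycles, so it barely moves the total. The aggregate shell
window is the better number here.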
Cycle Shape
- fetch 3.8% 8,359,801
- execute 38.9% 86,775,673
- mem 12.7% 28,211,557
- walker 2.1% 4,774,268
- writeback 38.9% 86,750,479
- mul/div 3.6% 7,952,283
P87 did not change the core pipeline, so the state chart should look a lot like P86. That is expected. The experiment is about removing a Linux console path, not about reducing page walks or execute/writeback cycles.
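The percentages in the cycle-shape list can be reproduced from the per-state counts. This sketch assumes each share is the state's cycles divided by the sum of the six listed states; that sum comes within about 1,700 cycles of the post-load cycle count, so either denominator gives the same one-decimal figures.

```python
# Per-state cycle counts from the P87 cycle-shape list above.
states = {
    "fetch": 8_359_801,
    "execute": 86_775_673,
    "mem": 28_211_557,
    "walker": 4_774_268,
    "writeback": 86_750_479,
    "mul/div": 7_952_283,
}

total = sum(states.values())
shares = {name: round(cycles / total * 100, 1) for name, cycles in states.items()}

for name, share in shares.items():
    print(f"{name:<10} {share:5.1f}%  {states[name]:>11,}")

# Execute and writeback together account for most cycles, which is why a
# console-path change alone leaves the overall shape almost untouched.
pipeline_share = round((states["execute"] + states["writeback"]) / total * 100, 1)
```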
Hot Functions
- 5.5% of samples (3,593 samples)
- 5% of samples (3,286 samples)
- 3.6% of samples (2,339 samples)
- 3.5% of samples (2,274 samples)
- 2.8% of samples (1,806 samples)
- 2.7% of samples (1,775 samples)
- 2.6% of samples (1,699 samples)
- 1.8% of samples (1,186 samples)
- 1.7% of samples (1,093 samples)
- 1.4% of samples (893 samples)
- 1.3% of samples (847 samples)
- 1.3% of samples (846 samples)
- 1.2% of samples (781 samples)
- 1% of samples (684 samples)
- 1% of samples (658 samples)
- 55.6% of samples (36,532 samples)
BusyBox formatting remains the dominant named userspace cost:
printf_core, memcpy, and __fwritex are still high. The console
driver symbols are much less prominent, which means P87 answered the
specific question it asked.
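A per-symbol sample count like the ones quoted here can be derived from a folded profile with a few lines of code. The sketch below assumes the common collapsed-stack layout used by flame graph tooling, where each line is a semicolon-separated stack followed by a space and a sample count; the write-up does not confirm the exact format, and `self_samples` is an illustrative name. It counts self samples, that is, samples attributed to the leaf frame only.

```python
from collections import Counter


def self_samples(folded_lines):
    """Sum sample counts per leaf symbol from collapsed-stack lines.

    Each non-empty line is expected to look like 'frame_a;frame_b;leaf 123'
    (an assumed format, not confirmed by the write-up).
    """
    counts = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; the stack is the rest.
        stack, _, count = line.rpartition(" ")
        leaf = stack.split(";")[-1]
        counts[leaf] += int(count)
    return counts
```

Comparing the resulting counters for two runs would give per-symbol deltas of the same kind as the n_tty_write row in the Result table, provided both profiles cover the same window.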
Honest Status
| check | status |
|---|---|
| Direct UART MMIO output path in console_sh | PASS |
| BusyBox shell workload runs | PASS |
| P86/P87 benchmark comparison staged | PASS |
| BusyBox-symbolized hot-function profile staged | PASS |
| LibreLane hardening | NOT RUN |
Next
The next feature round should target a bigger remaining cost: memory latency, exception/syscall overhead, or a more serious terminal device model. P87 says console output mattered, but it was not the whole story.