P84 profiles normal BusyBox shell use on the chip. The host feeds a
scripted interactive session into the running guest through the same
host-input path used by screen and the PTY bridge, while the Verilator
harness records cycle counters and PC samples.
The point is not one heroic benchmark. It is a stable little workload we can rerun after each performance change.
Workload
The shell runs echo, uname -a, ls, cat, /tmp file
create/read/remove, and an eight-iteration ash loop that does shell
arithmetic plus file I/O. Each command phase prints a marker after it
finishes, so the harness can timestamp completed guest work.
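The marker scheme can be sketched in a few lines. This is a minimal illustration, not the harness code: the `(cycle, line)` event format, the `phase_durations` helper, and the cycle values are all assumptions made up for the example. Only the marker names come from this run.

```python
# Hypothetical harness log: (cycle_count, console_line) pairs in arrival order.
# The cycle values below are invented for illustration.
def phase_durations(events, markers):
    """Return {marker: cycles elapsed since the previous marker was seen}."""
    durations = {}
    seen = set()
    previous_cycle = None
    for cycle, line in events:
        for marker in markers:
            if marker in line and marker not in seen:
                seen.add(marker)
                # The first marker only sets the baseline; later ones close a phase.
                if previous_cycle is not None:
                    durations[marker] = cycle - previous_cycle
                previous_cycle = cycle
    return durations


events = [
    (1_000, "P84-SHELL-START"),
    (1_250, "(guest output)"),
    (1_400, "P84-ECHO-DONE"),
    (9_400, "(guest output)"),
    (9_900, "P84-UNAME-DONE"),
]
markers = ["P84-SHELL-START", "P84-ECHO-DONE", "P84-UNAME-DONE"]
print(phase_durations(events, markers))
```

Because each marker prints after its command finishes, the difference between two marker timestamps covers the whole command, including its console output.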
- kernel banner to /init: 120,446,463 cycles (50.4%)
- /init to shell banner: 1,133,019 cycles (0.5%)
- shell banner to first command: 37,525,853 cycles (15.7%)
- echo command: 22,448 cycles (0%)
- uname -a: 2,328,911 cycles (1%)
- ls /bin /usr/share/p84: 36,947,459 cycles (15.5%)
- cat sample file: 5,484,333 cycles (2.3%)
- touch/write/cat/rm /tmp file: 9,997,660 cycles (4.2%)
- 8x ash loop with file I/O: 23,440,310 cycles (9.8%)
- final marker: 1,579,195 cycles (0.7%)
The numbers from this run are already opinionated:
| phase | cycles |
|---|---|
| kernel banner to /init | 120,446,463 |
| shell setup to first command | 37,525,853 |
| uname -a | 2,328,911 |
| ls /bin /usr/share/p84 | 36,947,459 |
| cat sample file | 5,484,333 |
| /tmp file create/read/remove | 9,997,660 |
| 8x ash loop with file I/O | 23,440,310 |
The ls /bin phase is deliberately heavy. BusyBox installed a large
applet set, and colored ls produced a lot of console output, so this
phase stresses directory walking and terminal writes at the same time.
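The percentages in the phase list appear to be each phase's cycle count divided by the sum of all listed phases. A small sketch that reproduces them from the counts reported for this run (the dictionary layout is ours, not the schema of any emitted file):

```python
# Cycle counts per phase, copied from the phase list for this run.
phases = {
    "kernel banner to /init": 120_446_463,
    "/init to shell banner": 1_133_019,
    "shell banner to first command": 37_525_853,
    "echo command": 22_448,
    "uname -a": 2_328_911,
    "ls /bin /usr/share/p84": 36_947_459,
    "cat sample file": 5_484_333,
    "touch/write/cat/rm /tmp file": 9_997_660,
    "8x ash loop with file I/O": 23_440_310,
    "final marker": 1_579_195,
}

total = sum(phases.values())
for name, cycles in phases.items():
    share = 100 * cycles / total
    print(f"{name:32s} {cycles:>12,d} {share:5.1f}%")
```

Rerunning this after a performance change gives a quick per-phase comparison without opening the charts.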
Cycle Shape
The same run emits benchmark.json, so we still get the familiar
core-state view:
- fetch: 10,502,094 cycles (4.4%)
- execute: 90,049,522 cycles (37.6%)
- mem: 29,391,812 cycles (12.3%)
- walker: 10,406,117 cycles (4.3%)
- writeback: 90,000,090 cycles (37.6%)
- mul/div: 9,182,377 cycles (3.8%)
The useful reading here is structural. If fetch/decode/execute dominate, the in-order core is still the problem. If walker states are visible, the shell workload is pushing the tiny TLB. If memory and console-heavy kernel functions dominate, batching and memory bandwidth deserve more attention than another ALU tweak.
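One way to make that reading mechanical is to fold the core states into the three buckets the paragraph talks about. The grouping below is an assumption for illustration (in particular, counting writeback and mul/div as core pipeline time is our choice), and the counts are hard-coded from the list above rather than read from benchmark.json, whose schema is not shown here.

```python
# Core-state cycle counts from this run.
core_states = {
    "fetch": 10_502_094,
    "execute": 90_049_522,
    "mem": 29_391_812,
    "walker": 10_406_117,
    "writeback": 90_000_090,
    "mul/div": 9_182_377,
}

# Assumed grouping: which states count as pipeline, memory, or translation.
groups = {
    "pipeline": ["fetch", "execute", "writeback", "mul/div"],
    "memory": ["mem"],
    "translation": ["walker"],
}

total = sum(core_states.values())
group_cycles = {
    name: sum(core_states[state] for state in states)
    for name, states in groups.items()
}
for name, cycles in sorted(group_cycles.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {cycles:>12,d} {100 * cycles / total:5.1f}%")
```

A different grouping will shift the shares, so the buckets should be fixed before comparing runs.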
Flamegraph-Style View
- 21.4% of samples (49,975 samples)
- 6% of samples (14,081 samples)
- 4.9% of samples (11,528 samples)
- 4.6% of samples (10,848 samples)
- 2.2% of samples (5,100 samples)
- 1.2% of samples (2,855 samples)
- 1.1% of samples (2,594 samples)
- 0.9% of samples (2,117 samples)
- 0.9% of samples (2,107 samples)
- 0.8% of samples (1,841 samples)
- 0.8% of samples (1,788 samples)
- 0.7% of samples (1,677 samples)
- 0.7% of samples (1,577 samples)
- 0.6% of samples (1,485 samples)
- 0.6% of samples (1,438 samples)
- 52.5% of samples (122,908 samples)
This is a flat PC-sample profile, not a stack-unwound flamegraph.
Kernel PCs are symbolized with vmlinux; userspace and low physical
addresses are still bucketed by raw address. That limitation is now
visible enough to justify a follow-up: symbolize BusyBox userspace too.
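For readers who have not built a flat profile before, here is a minimal sketch of the bucketing step. Everything in it is hypothetical: the symbol names, addresses, `TEXT_END`, and the choice to bucket unsymbolized PCs by 4 KiB page are assumptions for illustration, not the layout of the real vmlinux or the actual profiler.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start_address, name), sorted by address.
symbols = [
    (0xC0001000, "func_a"),
    (0xC0001200, "func_b"),
    (0xC0001800, "func_c"),
]
starts = [start for start, _ in symbols]
TEXT_END = 0xC0002000  # assumed end of the symbolized text region


def bucket(pc):
    """Map a sampled PC to a symbol name, or to a raw 4 KiB page bucket."""
    if pc < starts[0] or pc >= TEXT_END:
        return f"raw:0x{pc & ~0xFFF:08x}"
    # The rightmost symbol whose start address is <= pc owns the sample.
    return symbols[bisect.bisect_right(starts, pc) - 1][1]


samples = [0xC0001004, 0xC0001010, 0xC0001204, 0x00010ABC, 0xC00017FC]
profile = Counter(bucket(pc) for pc in samples)
for name, count in profile.most_common():
    print(f"{name:16s} {count}")
```

Each sample lands in exactly one bucket, which is why the result is flat: there is no caller information to stack. Symbolizing userspace would amount to adding a second table for the BusyBox binary.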
Verification
The passing run checked these markers:
P84-SHELL-START
P84-ECHO-DONE
P84-UNAME-DONE
P84-LS-DONE
P84-CAT-DONE
P84-FS-DONE
P84-LOOP-DONE
P84-FILE-OK
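A marker check like this can be very small. The sketch below only tests for presence in the captured console text; the `missing_markers` helper and the sample capture string are hypothetical, and the real harness may check more (ordering or timeouts, for example).

```python
# Marker names from the passing run.
EXPECTED_MARKERS = [
    "P84-SHELL-START",
    "P84-ECHO-DONE",
    "P84-UNAME-DONE",
    "P84-LS-DONE",
    "P84-CAT-DONE",
    "P84-FS-DONE",
    "P84-LOOP-DONE",
    "P84-FILE-OK",
]


def missing_markers(console_text, expected=EXPECTED_MARKERS):
    """Return the expected markers that never appear in the captured output."""
    return [marker for marker in expected if marker not in console_text]


# Invented capture that stops after the uname phase.
partial_capture = "P84-SHELL-START\nP84-ECHO-DONE\nP84-UNAME-DONE\n"
print(missing_markers(partial_capture))
```

An empty list means every marker was seen. A non-empty list names the first phase worth looking at.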
make profile-shell produced:
projects/84_shell_profile_flamegraph/test/benchmark.json
projects/84_shell_profile_flamegraph/test/pc_samples.txt
projects/84_shell_profile_flamegraph/test/captured/2026-05-05-p84-shell-flamegraph.folded
site/src/data/charts/84_shell_profile_flamegraph/benchmark.json
site/src/data/charts/84_shell_profile_flamegraph/hot_functions.json
site/src/data/charts/84_shell_profile_flamegraph/shell_profile.json
site/src/data/charts/84_shell_profile_flamegraph/flamegraph.folded
Honest Status
| check | status |
|---|---|
| BusyBox shell boots on guest Linux | PASS |
| Host input drives a scripted shell workload | PASS |
| benchmark.json emitted | PASS |
| PC samples emitted and symbolized against kernel vmlinux | PASS |
| Shell phase timing staged for the site | PASS |
| Userspace BusyBox symbol profile | NOT RUN |
| Stack-unwound flamegraph | NOT RUN |
| LibreLane hardening | NOT RUN |
Next
P85 should use this benchmark as the before/after test. The two most useful directions are a better profiler that knows BusyBox symbols, or a speed feature picked directly from this run: larger TLB, batched console MMIO, or memory throughput work.