Project No. 84 of 147 on the ladder

Shell profile flamegraph

introduces: BusyBox shell workload benchmark; command milestone timing; PC-sample flamegraph-style profile

harden state
last run: 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P84 profiles normal BusyBox shell use on the chip. The host feeds a scripted interactive session into the running guest through the same host-input path used by screen and the PTY bridge, while the Verilator harness records cycle counters and PC samples.

The point is not one heroic benchmark. It is a stable little workload we can rerun after each performance change.

Workload

The shell runs echo, uname -a, ls, cat, /tmp file create/read/remove, and an eight-iteration ash loop that does shell arithmetic plus file I/O. Each command phase prints a marker after it finishes, so the harness can timestamp completed guest work.
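Turning "marker seen at cycle N" into per-phase costs is just a running difference. The sketch below is a minimal illustration of that idea, not the harness code: the function name, the (marker, cycle) event format, and the cycle values are all hypothetical; only the marker names come from this run.

```python
# Hypothetical input: (marker, cycle) pairs in the order the harness saw them.
# Marker names are from the run; the cycle values here are made up.
def phase_cycles(events, start_marker):
    """Return {marker: cycles elapsed since the previous marker}.

    Events before start_marker are ignored; start_marker itself only
    sets the first reference point.
    """
    durations = {}
    prev_cycle = None
    for marker, cycle in events:
        if prev_cycle is None:
            if marker == start_marker:
                prev_cycle = cycle
            continue
        durations[marker] = cycle - prev_cycle
        prev_cycle = cycle
    return durations


events = [
    ("P84-SHELL-START", 1_000),
    ("P84-ECHO-DONE", 1_250),
    ("P84-UNAME-DONE", 4_000),
    ("P84-LS-DONE", 9_500),
]
for marker, cycles in phase_cycles(events, "P84-SHELL-START").items():
    print(f"{marker}: {cycles:,} cycles")
```

Because each marker is printed after its phase finishes, the difference charged to a marker covers the work that completed just before it, including the console output of that phase.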

Shell phases (P84 shell workload, 239,533,716 cycles, CPI 2.66)

 #  phase                                cycles  share
 1  kernel banner to /init          120,446,463  50.4%
 2  /init to shell banner             1,133,019   0.5%
 3  shell banner to first command    37,525,853  15.7%
 4  echo command                         22,448     0%
 5  uname -a                          2,328,911     1%
 6  ls /bin /usr/share/p84           36,947,459  15.5%
 7  cat sample file                   5,484,333   2.3%
 8  touch/write/cat/rm /tmp file      9,997,660   4.2%
 9  8x ash loop with file I/O        23,440,310   9.8%
10  final marker                      1,579,195   0.7%
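The shares in the chart above can be reproduced from the cycle counts. One detail worth knowing when comparing runs: the arithmetic suggests the percentages are taken against the sum of the ten listed phases, not against the 239,533,716-cycle run total. That is an inference from the numbers, not something the harness documents, so treat the sketch below as a cross-check rather than a description of how the chart is built.

```python
# Cycle counts copied from the shell-phases chart.
phases = {
    "kernel banner to /init": 120_446_463,
    "/init to shell banner": 1_133_019,
    "shell banner to first command": 37_525_853,
    "echo command": 22_448,
    "uname -a": 2_328_911,
    "ls /bin /usr/share/p84": 36_947_459,
    "cat sample file": 5_484_333,
    "touch/write/cat/rm /tmp file": 9_997_660,
    "8x ash loop with file I/O": 23_440_310,
    "final marker": 1_579_195,
}
RUN_TOTAL = 239_533_716  # total cycles reported for the run

phase_sum = sum(phases.values())
for name, cycles in phases.items():
    share = 100 * cycles / phase_sum  # assumed denominator: sum of listed phases
    print(f"{name:<30} {cycles:>12,} {share:5.1f}%")

print(f"sum of phases: {phase_sum:,}")
print(f"cycles outside the listed phases: {RUN_TOTAL - phase_sum:,}")
```

The ls phase is the easiest one to tell the two denominators apart: it rounds to 15.5% against the phase sum and to 15.4% against the run total, and the chart shows 15.5%.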

The numbers from this run are already opinionated:

phase                                cycles
kernel banner to /init          120,446,463
shell setup to first command     37,525,853
uname -a                          2,328,911
ls /bin /usr/share/p84           36,947,459
cat sample file                   5,484,333
/tmp file create/read/remove      9,997,660
8x ash loop with file I/O        23,440,310

The ls /bin phase is deliberately heavy. BusyBox installed a large applet set, and colored ls produced a lot of console output, so this phase stresses directory walking and terminal writes at the same time.

Cycle Shape

The same run emits benchmark.json, so we still get the familiar core-state view:

State breakdown (P84 shell workload, 239,533,716 cycles, CPI 2.66)

 #  state      share      cycles
 1  fetch       4.4%  10,502,094
 2  execute    37.6%  90,049,522
 3  mem        12.3%  29,391,812
 4  walker      4.3%  10,406,117
 5  writeback  37.6%  90,000,090
 6  mul/div     3.8%   9,182,377

The useful reading here is structural. If fetch/decode/execute dominate, the in-order core is still the problem. If walker states are visible, the shell workload is pushing the tiny TLB. If memory and console-heavy kernel functions dominate, batching and memory bandwidth deserve more attention than another ALU tweak.
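To make that reading mechanical, the state counters can be folded into the three buckets the paragraph talks about. The grouping below is an assumption made for illustration: the breakdown has no separate decode state, and counting writeback and mul/div as core pipeline time is a choice, not something benchmark.json defines.

```python
# Cycle counts copied from the state breakdown.
state_cycles = {
    "fetch": 10_502_094,
    "execute": 90_049_522,
    "mem": 29_391_812,
    "walker": 10_406_117,
    "writeback": 90_000_090,
    "mul/div": 9_182_377,
}

# Assumed grouping (hypothetical, see the note above).
groups = {
    "core pipeline": ("fetch", "execute", "writeback", "mul/div"),
    "memory": ("mem",),
    "page-table walker": ("walker",),
}

total = sum(state_cycles.values())
shares = {
    group: sum(state_cycles[s] for s in states) / total
    for group, states in groups.items()
}

# Print the buckets from largest to smallest share.
for group, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:<18} {share:6.1%}")
```

Rerunning the same grouping after a change gives a quick check on whether time moved between buckets or simply shrank everywhere.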

Flamegraph-Style View

Hot functions (P84 shell workload, 233,919 samples, one sample every 1,024 cycles)

 #  function                    where      share  samples
 1  inflate_fast                kernel     21.4%   49,975
 2  blake2s_compress_generic    kernel        6%   14,081
 3  memset                      kernel      4.9%   11,528
 4  memcpy                      kernel      4.6%   10,848
 5  vruntime_eligible           kernel      2.2%    5,100
 6  format_decode               kernel      1.2%    2,855
 7  avg_vruntime                kernel      1.1%    2,594
 8  vsnprintf                   kernel      0.9%    2,117
 9  n_tty_write                 kernel      0.9%    2,107
10  unmap_page_range            kernel      0.8%    1,841
11  __slab_alloc_node.isra.0    kernel      0.8%    1,788
12  handle_exception            kernel      0.7%    1,677
13  zlib_inflate                kernel      0.7%    1,577
14  update_curr                 kernel      0.6%    1,485
15  zlib_inflate_table          kernel      0.6%    1,438
16  (remaining)                 remaining  52.5%  122,908

This is a flat PC-sample profile, not a stack-unwound flamegraph. Kernel PCs are symbolized with vmlinux; userspace and low physical addresses are still bucketed by raw address. That limitation is now visible enough to justify a follow-up: symbolize BusyBox userspace too.
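A flat profile of this kind boils down to mapping each sampled PC to the symbol whose address range contains it and counting hits. The sketch below shows one common way to do that with a sorted symbol table and binary search. It is an illustration under assumptions, not the project's profiler: the function name, the (start, size, name) symbol tuples, the addresses, and the one-bucket-per-raw-PC fallback are all hypothetical.

```python
import bisect
from collections import Counter


def symbolize(pcs, symbols):
    """Count samples per symbol.

    symbols: iterable of (start_address, size, name) tuples.
    PCs that fall outside every symbol are bucketed by raw address.
    """
    table = sorted(symbols)
    starts = [start for start, _size, _name in table]
    counts = Counter()
    for pc in pcs:
        # Index of the last symbol starting at or before this PC.
        i = bisect.bisect_right(starts, pc) - 1
        if i >= 0:
            start, size, name = table[i]
            if pc < start + size:
                counts[name] += 1
                continue
        counts[f"[0x{pc:08x}]"] += 1  # unsymbolized: raw-address bucket
    return counts


# Made-up addresses for illustration only.
symbols = [(0xC0001000, 0x200, "inflate_fast"), (0xC0002000, 0x80, "memset")]
pcs = [0xC0001004, 0xC00011FC, 0xC0002010, 0x00010074, 0xC0001300]

# One "name count" pair per line, the single-frame form of a folded profile.
for name, n in symbolize(pcs, symbols).most_common():
    print(name, n)
```

The same loop would cover BusyBox once its symbol table is loaded alongside the kernel's. As a sanity check on the sampling itself, 233,919 samples at one per 1,024 cycles is 239,533,056 cycles, within one sample period of the 239,533,716-cycle run total.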

Verification

The passing run checked these markers:

P84-SHELL-START
P84-ECHO-DONE
P84-UNAME-DONE
P84-LS-DONE
P84-CAT-DONE
P84-FS-DONE
P84-LOOP-DONE
P84-FILE-OK
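A marker check like this can be sketched in a few lines. The version below is an assumption about how one might write it, not the project's test: the function name and the sample console text are hypothetical, and it only checks that each marker appears on a line of its own, which avoids counting the echoed command line that contains the marker.

```python
# Marker names copied from the passing run.
EXPECTED_MARKERS = [
    "P84-SHELL-START",
    "P84-ECHO-DONE",
    "P84-UNAME-DONE",
    "P84-LS-DONE",
    "P84-CAT-DONE",
    "P84-FS-DONE",
    "P84-LOOP-DONE",
    "P84-FILE-OK",
]


def missing_markers(console_text, expected=EXPECTED_MARKERS):
    """Return the expected markers that never appear as a whole line."""
    seen = {line.strip() for line in console_text.splitlines()}
    return [marker for marker in expected if marker not in seen]


# Hypothetical console capture: the echoed command is not a marker line,
# the shell's output on the next line is.
sample = "\n".join([
    "P84-SHELL-START",
    "/ # echo P84-ECHO-DONE",
    "P84-ECHO-DONE",
])
print(missing_markers(sample))
```

An empty list means every marker was seen; anything else names the phase that did not complete.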

make profile-shell produced:

  • projects/84_shell_profile_flamegraph/test/benchmark.json
  • projects/84_shell_profile_flamegraph/test/pc_samples.txt
  • projects/84_shell_profile_flamegraph/test/captured/2026-05-05-p84-shell-flamegraph.folded
  • site/src/data/charts/84_shell_profile_flamegraph/benchmark.json
  • site/src/data/charts/84_shell_profile_flamegraph/hot_functions.json
  • site/src/data/charts/84_shell_profile_flamegraph/shell_profile.json
  • site/src/data/charts/84_shell_profile_flamegraph/flamegraph.folded

Honest Status

check                                                     status
BusyBox shell boots on guest Linux                        PASS
Host input drives a scripted shell workload               PASS
benchmark.json emitted                                    PASS
PC samples emitted and symbolized against kernel vmlinux  PASS
Shell phase timing staged for the site                    PASS
Userspace BusyBox symbol profile                          NOT RUN
Stack-unwound flamegraph                                  NOT RUN
LibreLane hardening                                       NOT RUN

Next

P85 should use this benchmark as the before/after test. The two most useful directions are a better profiler that knows BusyBox symbols, or a speed feature picked directly from this run: larger TLB, batched console MMIO, or memory throughput work.