Shell profile flamegraph · librelane-playground

P84 profiles normal BusyBox shell use on the chip. The host feeds a scripted interactive session into the running guest through the same host-input path used by screen and the PTY bridge, while the Verilator harness records cycle counters and PC samples.

The point is not one heroic benchmark. It is a stable little workload we can rerun after each performance change.

Workload

The shell runs echo, uname -a, ls, cat, /tmp file create/read/remove, and an eight-iteration ash loop that does shell arithmetic plus file I/O. Each command phase prints a marker after it finishes, so the harness can timestamp completed guest work.
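As a minimal sketch of how marker timestamps could become phase timings: if the harness records the cycle counter each time a marker appears, each phase is the difference between consecutive markers. The function name `phase_cycles` and the cycle values are hypothetical and for illustration only; they are not taken from the P84 harness or from this run.

```python
def phase_cycles(marker_events):
    """Turn (marker, cycle_at_marker) pairs into per-phase cycle counts.

    Each phase is charged the cycles between the previous marker and its own.
    """
    phases = []
    for (_, prev_cycle), (name, cycle) in zip(marker_events, marker_events[1:]):
        phases.append((name, cycle - prev_cycle))
    return phases


# Illustrative values only; these are not measurements from the P84 run.
events = [
    ("P84-SHELL-START", 1_000),
    ("P84-ECHO-DONE", 1_250),
    ("P84-UNAME-DONE", 4_000),
]
result = phase_cycles(events)
print(result)
```

Because a marker prints only after its phase finishes, the delta covers completed guest work, including the console output of the marker itself.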

Shell phases, P84 shell workload: 239,533,716 cycles, CPI 2.66.

phase	cycles	share
kernel banner to `/init`	120,446,463	50.4%
`/init` to shell banner	1,133,019	0.5%
shell banner to first command	37,525,853	15.7%
`echo` command	22,448	0%
`uname -a`	2,328,911	1%
`ls /bin /usr/share/p84`	36,947,459	15.5%
`cat` sample file	5,484,333	2.3%
touch/write/cat/rm `/tmp` file	9,997,660	4.2%
8x ash loop with file I/O	23,440,310	9.8%
final marker	1,579,195	0.7%
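A quick arithmetic check, sketched in Python from the phase figures above. The percentages listed for each phase appear to be computed against the sum of the ten listed phases (238,905,651 cycles) rather than the full 239,533,716-cycle run. That is an inference from the numbers, not something the harness documents. The remaining 628,065 cycles fall outside every listed phase.

```python
RUN_CYCLES = 239_533_716

PHASE_CYCLES = {
    "kernel banner to /init": 120_446_463,
    "/init to shell banner": 1_133_019,
    "shell banner to first command": 37_525_853,
    "echo command": 22_448,
    "uname -a": 2_328_911,
    "ls /bin /usr/share/p84": 36_947_459,
    "cat sample file": 5_484_333,
    "touch/write/cat/rm /tmp file": 9_997_660,
    "8x ash loop with file I/O": 23_440_310,
    "final marker": 1_579_195,
}

phase_total = sum(PHASE_CYCLES.values())
unattributed = RUN_CYCLES - phase_total

# Share of each phase relative to the sum of listed phases, one decimal place.
shares = {
    name: round(100 * cycles / phase_total, 1)
    for name, cycles in PHASE_CYCLES.items()
}

for name, share in shares.items():
    print(f"{name:32s} {PHASE_CYCLES[name]:>12,} {share:5.1f}%")
print(f"cycles outside listed phases: {unattributed:,}")
```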

The numbers from this run are already opinionated:

phase	cycles
kernel banner to `/init`	120,446,463
shell setup to first command	37,525,853
`uname -a`	2,328,911
`ls /bin /usr/share/p84`	36,947,459
`cat` sample file	5,484,333
`/tmp` file create/read/remove	9,997,660
8x ash loop with file I/O	23,440,310

The ls /bin phase is deliberately heavy, and at 36,947,459 cycles it is the most expensive command phase in the run. BusyBox installs a large applet set, and colored ls produces a lot of console output, so this phase stresses directory walking and terminal writes at the same time.

Cycle Shape

The same run emits benchmark.json, so we still get the familiar core-state view:

State breakdown, P84 shell workload: 239,533,716 cycles, CPI 2.66.

state	share	cycles
fetch	4.4%	10,502,094
execute	37.6%	90,049,522
mem	12.3%	29,391,812
walker	4.3%	10,406,117
writeback	37.6%	90,000,090
mul/div	3.8%	9,182,377

The useful reading here is structural. If fetch/decode/execute dominate, the in-order core is still the problem. If walker states are visible, the shell workload is pushing the tiny TLB. If memory and console-heavy kernel functions dominate, batching and memory bandwidth deserve more attention than another ALU tweak.
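One way to make that reading concrete is to fold the per-state cycle counts into a few coarse groups. The sketch below uses the state figures from this run. The grouping itself (`GROUPS`) is an assumption made for illustration, not something `benchmark.json` defines, and the state breakdown has no separate decode state to include.

```python
STATE_CYCLES = {
    "fetch": 10_502_094,
    "execute": 90_049_522,
    "mem": 29_391_812,
    "walker": 10_406_117,
    "writeback": 90_000_090,
    "mul/div": 9_182_377,
}
TOTAL_CYCLES = 239_533_716

# Hypothetical grouping for the structural reading; adjust to taste.
GROUPS = {
    "core pipeline": ("fetch", "execute", "writeback", "mul/div"),
    "memory": ("mem",),
    "translation": ("walker",),
}


def group_shares(state_cycles, groups, total):
    """Percentage of total cycles spent in each group of core states."""
    return {
        group: round(100 * sum(state_cycles[s] for s in members) / total, 1)
        for group, members in groups.items()
    }


grouped = group_shares(STATE_CYCLES, GROUPS, TOTAL_CYCLES)
print(grouped)
```

Under this grouping the core pipeline states account for roughly 83% of cycles, with memory near 12% and the page-table walker near 4%.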

Flamegraph-Style View

Hot functions, P84 shell workload: 233,919 samples, one every 1,024 cycles.

function	bucket	share	samples
`inflate_fast`	kernel	21.4%	49,975
`blake2s_compress_generic`	kernel	6%	14,081
`memset`	kernel	4.9%	11,528
`memcpy`	kernel	4.6%	10,848
`vruntime_eligible`	kernel	2.2%	5,100
`format_decode`	kernel	1.2%	2,855
`avg_vruntime`	kernel	1.1%	2,594
`vsnprintf`	kernel	0.9%	2,117
`n_tty_write`	kernel	0.9%	2,107
`unmap_page_range`	kernel	0.8%	1,841
`__slab_alloc_node.isra.0`	kernel	0.8%	1,788
`handle_exception`	kernel	0.7%	1,677
`zlib_inflate`	kernel	0.7%	1,577
`update_curr`	kernel	0.6%	1,485
`zlib_inflate_table`	kernel	0.6%	1,438
(remaining)	remaining	52.5%	122,908
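Since one PC sample is taken every 1,024 cycles, a sample count converts to an approximate cycle cost by simple multiplication. The sketch below applies that to the sample counts above; the helper name `estimated_cycles` is hypothetical.

```python
SAMPLE_PERIOD_CYCLES = 1_024
RUN_CYCLES = 239_533_716


def estimated_cycles(samples, period=SAMPLE_PERIOD_CYCLES):
    """Approximate cycles represented by a number of PC samples."""
    return samples * period


total_estimate = estimated_cycles(233_919)   # all samples in the run
inflate_estimate = estimated_cycles(49_975)  # inflate_fast samples

print(f"all samples:  {total_estimate:,} cycles")
print(f"inflate_fast: {inflate_estimate:,} cycles")
print(f"gap to run total: {RUN_CYCLES - total_estimate:,} cycles")
```

The total lands within one sample period of the 239,533,716 cycles reported for the run, which suggests the sampler covered the whole workload. The per-function figures are estimates with sampling error, not exact cycle counts.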

This is a flat PC-sample profile, not a stack-unwound flamegraph. Kernel PCs are symbolized with vmlinux; userspace and low physical addresses are still bucketed by raw address. That limitation is now visible enough to justify a follow-up: symbolize BusyBox userspace too.
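A minimal sketch of flat PC-sample symbolization, assuming a sorted table of function start addresses (such as one derived from a symbol listing of `vmlinux`). The helper names, the addresses, and the single `text_end` bound are hypothetical; the real harness and the format of `pc_samples.txt` may differ. PCs outside the symbolized range are bucketed by raw address, and the output uses single-frame folded lines (`name count`).

```python
from bisect import bisect_right
from collections import Counter


def build_symbolizer(symbols, text_end):
    """symbols: iterable of (start_address, name); text_end: first address past them."""
    table = sorted(symbols)
    starts = [addr for addr, _ in table]

    def symbolize(pc):
        # Anything outside the known range stays a raw-address bucket.
        if not table or pc < starts[0] or pc >= text_end:
            return f"0x{pc:08x}"
        # Last symbol whose start address is <= pc.
        return table[bisect_right(starts, pc) - 1][1]

    return symbolize


def fold(pcs, symbolize):
    """Count samples per bucket and emit single-frame folded lines."""
    counts = Counter(symbolize(pc) for pc in pcs)
    return [f"{name} {n}" for name, n in counts.most_common()]


# Made-up addresses for illustration only.
symbolize = build_symbolizer(
    [(0x80001000, "inflate_fast"), (0x80002000, "memset")],
    text_end=0x80003000,
)
folded = fold([0x80001004, 0x80001FF0, 0x80002010, 0x00010400], symbolize)
print("\n".join(folded))
```

Symbolizing BusyBox would need a second table for the userspace binary, plus a way to tell which address space each sampled PC belongs to.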

Verification

The passing run checked these markers:

P84-SHELL-START
P84-ECHO-DONE
P84-UNAME-DONE
P84-LS-DONE
P84-CAT-DONE
P84-FS-DONE
P84-LOOP-DONE
P84-FILE-OK
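A presence check over the captured console output could look like the sketch below. The function name and the sample log text are hypothetical. The check does not enforce marker order, because the source does not say where in the session `P84-FILE-OK` is printed.

```python
REQUIRED_MARKERS = (
    "P84-SHELL-START",
    "P84-ECHO-DONE",
    "P84-UNAME-DONE",
    "P84-LS-DONE",
    "P84-CAT-DONE",
    "P84-FS-DONE",
    "P84-LOOP-DONE",
    "P84-FILE-OK",
)


def missing_markers(log_text, required=REQUIRED_MARKERS):
    """Return the required markers that never appear in the captured log."""
    return [marker for marker in required if marker not in log_text]


# Hypothetical, truncated capture used only to exercise the check.
sample_log = "P84-SHELL-START\nhello\nP84-ECHO-DONE\n"
missing = missing_markers(sample_log)
print("PASS" if not missing else f"FAIL, missing: {missing}")
```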

make profile-shell produced:

projects/84_shell_profile_flamegraph/test/benchmark.json
projects/84_shell_profile_flamegraph/test/pc_samples.txt
projects/84_shell_profile_flamegraph/test/captured/2026-05-05-p84-shell-flamegraph.folded
site/src/data/charts/84_shell_profile_flamegraph/benchmark.json
site/src/data/charts/84_shell_profile_flamegraph/hot_functions.json
site/src/data/charts/84_shell_profile_flamegraph/shell_profile.json
site/src/data/charts/84_shell_profile_flamegraph/flamegraph.folded

Honest Status

check	status
BusyBox shell boots on guest Linux	PASS
Host input drives a scripted shell workload	PASS
`benchmark.json` emitted	PASS
PC samples emitted and symbolized against kernel `vmlinux`	PASS
Shell phase timing staged for the site	PASS
Userspace BusyBox symbol profile	NOT RUN
Stack-unwound flamegraph	NOT RUN
LibreLane hardening	NOT RUN

Next

P85 should use this benchmark as the before/after test. The two most useful directions are a better profiler that knows BusyBox symbols and a speed feature picked directly from this run: a larger TLB, batched console MMIO, or memory throughput work.
