Fetch queue · librelane-playground

P92 adds a one-entry fetch queue between instruction delivery and execute. A safe subset of S_EXECUTE cycles prefetches the instruction at the fall-through next_pc into the queue; S_WB then consumes that queued instruction first and only falls back to the older writeback prefetch path when the queue cannot supply it.

This is the first real frontend/backend decoupling point in the Linux shell core. It is also a useful failure: the queue works, but it does not make the workload faster yet.
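
The fill-in-execute, consume-in-writeback flow can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions, not the RTL: the names `FetchQueue`, `fill`, `consume`, and `run` are hypothetical, every instruction is assumed to be 4 bytes, and "safe" is reduced to "the current instruction is not a jump".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchQueue:
    """One-entry queue holding a prefetched (pc, instruction) pair."""
    pc: Optional[int] = None
    insn: Optional[int] = None
    fills: int = 0
    consumes: int = 0

    def fill(self, pc: int, insn: int) -> None:
        # Called from the execute state when prefetching next_pc is safe.
        self.pc, self.insn = pc, insn
        self.fills += 1

    def consume(self, wanted_pc: int) -> Optional[int]:
        # Called from the writeback state. Returns the queued instruction
        # only if it matches the PC the core actually wants next.
        hit = self.insn if self.pc == wanted_pc else None
        if hit is not None:
            self.consumes += 1
        self.pc = self.insn = None  # an unused entry is simply dropped
        return hit


def run(program: dict, start_pc: int, steps: int):
    """Toy core: instructions are ("nop",) or ("jump", target)."""
    q, pc, fallback = FetchQueue(), start_pc, 0
    for _ in range(steps):
        op = program[pc]
        next_pc = op[1] if op[0] == "jump" else pc + 4
        # Execute: prefetch only when the fall-through is known to be right
        # (assumed safety rule: the current instruction is not a jump).
        if op[0] != "jump" and pc + 4 in program:
            q.fill(pc + 4, pc + 4)  # the PC stands in for the instruction word
        # Writeback: prefer the queue, else use the older prefetch path.
        if q.consume(next_pc) is None:
            fallback += 1
        pc = next_pc
    return q.fills, q.consumes, fallback


loop = {0: ("nop",), 4: ("nop",), 8: ("jump", 0)}
print(run(loop, start_pc=0, steps=6))  # (fills, consumes, fallbacks)
```

In this toy model every fill is consumed, and each jump forces the fallback path, which is the behavior a fall-through-only queue is limited to.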

Result

metric	P89 word I-cache	P91 fill buffer	P92 fetch queue
post-load cycles	222,317,206	221,327,811	222,624,131
shell window cycles	66,957,620	65,985,297	67,206,635
retired instructions	86,601,839	86,295,205	86,687,669
CPI	2.5671	2.5648	2.5681
memory handshakes	31,289,313	31,944,860	70,222,426
memory stall cycles	83,361,545	84,488,165	60,050,776
fetch stall cycles	54,266,192	55,533,555	23,555,005
I-cache hits	6,434,333	8,855,599	42,665,352
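
The CPI row follows from the two rows above it. A minimal check, assuming CPI here means post-load cycles divided by retired instructions (which is what the numbers are consistent with):

```python
# Post-load cycles and retired instructions, copied from the table above.
runs = {
    "P89": (222_317_206, 86_601_839),
    "P91": (221_327_811, 86_295_205),
    "P92": (222_624_131, 86_687_669),
}


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per retired instruction."""
    return cycles / instructions


for name, (cycles, insns) in runs.items():
    print(f"{name}: CPI {cpi(cycles, insns):.4f}")
```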

P92 changes the shape of instruction delivery but does not beat P91:

comparison	result
shell window vs P91	+1.85%
post-load cycles vs P91	+0.59%
shell window vs P89	+0.37%
fetch stalls vs P91	-57.58%

That is the honest result. The queue cuts measured fetch stall cycles by more than half (-57.58% vs P91), but the shell window is 1.85% slower than P91 and the full post-load run is 0.59% slower.
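
The comparison percentages can be reproduced from the Result table. A short sketch, where `pct_change` is a helper name introduced here for illustration:

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


# Values copied from the Result table.
shell_window = {"P89": 66_957_620, "P91": 65_985_297, "P92": 67_206_635}
post_load = {"P91": 221_327_811, "P92": 222_624_131}
fetch_stalls = {"P91": 55_533_555, "P92": 23_555_005}

print(f"shell window vs P91: {pct_change(shell_window['P92'], shell_window['P91']):+.2f}%")
print(f"post-load vs P91:    {pct_change(post_load['P92'], post_load['P91']):+.2f}%")
print(f"shell window vs P89: {pct_change(shell_window['P92'], shell_window['P89']):+.2f}%")
print(f"fetch stalls vs P91: {pct_change(fetch_stalls['P92'], fetch_stalls['P91']):+.2f}%")
```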

Queue Counters

counter	value
queue valid cycles	53,982,463
queue fills	53,982,463
queue consumes	53,982,463
execute-prefetch cycles	53,982,463

The queue is not dead. It fills and consumes tens of millions of instructions. The missing piece is not activity; it is enough frontend independence to make that activity profitable.
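
To put the counters in proportion, they can be set against the P92 totals from the Result table. This sketch assumes the queue counters cover the same post-load window as those totals, which the page does not state explicitly:

```python
queue = {
    "valid_cycles": 53_982_463,
    "fills": 53_982_463,
    "consumes": 53_982_463,
    "execute_prefetch_cycles": 53_982_463,
}
retired_instructions = 86_687_669  # P92, from the Result table
post_load_cycles = 222_624_131     # P92, from the Result table

# In this run every fill was consumed.
assert queue["fills"] == queue["consumes"]

# Assumption: counters and totals were measured over the same window.
served = queue["consumes"] / retired_instructions
occupied = queue["valid_cycles"] / post_load_cycles
print(f"instructions delivered via the queue: {served:.1%}")
print(f"cycles with a valid queue entry:      {occupied:.1%}")
```

Under that assumption the queue delivers roughly 62% of retired instructions while holding a valid entry in roughly 24% of cycles.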

Memory Stalls

P92 fetch-queue workload: 60,050,776 memory stall cycles across 70,222,426 handshakes.

source	stall cycles	share of stalls	requests
instruction fetch	23,555,005	39.2%	49,683,780
data load	14,625,563	24.4%	987,610
data store	11,991,847	20%	221,694
atomic memory op	158,522	0.3%	184,825
page walk for fetch	1,130,996	1.9%	1,124,842
page walk for load/store	1,236,316	2.1%	1,235,813
other	7,352,527	12.2%	16,783,862

Fetch-class stall time falls hard, but the single memory path still has to carry instruction fetch, data traffic, AMOs, and page walks. P92 moves pressure around more than it removes pressure.
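
One way to read the breakdown above is stall cycles per request for each source. The sketch below derives that ratio; it is an average computed here from the published numbers, not a counter the core reports, and it assumes the request figures count completed handshakes per source.

```python
# (stall cycles, requests) per source, copied from the breakdown above.
stalls = {
    "instruction fetch": (23_555_005, 49_683_780),
    "data load": (14_625_563, 987_610),
    "data store": (11_991_847, 221_694),
    "atomic memory op": (158_522, 184_825),
    "page walk for fetch": (1_130_996, 1_124_842),
    "page walk for load/store": (1_236_316, 1_235_813),
    "other": (7_352_527, 16_783_862),
}

total_stalls = sum(s for s, _ in stalls.values())
total_requests = sum(r for _, r in stalls.values())

for source, (cycles, requests) in stalls.items():
    share = cycles / total_stalls
    per_request = cycles / requests
    print(f"{source:26s} {share:6.1%} {per_request:7.2f} stall cycles/request")
```

Averaged this way, a data store request carries roughly 54 stall cycles and a data load roughly 15, while instruction fetch averages under half a stall cycle per request.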

Shell Phases

P92 shell workload: 222,624,131 cycles, CPI 2.57.

phase	cycles	share
kernel banner to /init	117,615,946	53%
/init to shell banner	1,085,876	0.5%
shell banner to first command	36,087,609	16.3%
echo command	1,598	0%
uname -a	1,991,539	0.9%
ls /bin /usr/share	33,283,552	15%
cat sample file	3,445,788	1.6%
touch/write/cat/rm /tmp file	10,681,546	4.8%
8x ash loop with file I/O	16,364,812	7.4%
final marker	1,437,800	0.7%

The shell benchmark still reaches the same final file marker. The slower window is the important part: functional progress is not the same as performance progress.
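
The phase numbers can be cross-checked against the shell window figure in the Result table. This sketch assumes the shell window spans the first command through the final marker; the page does not define the window, but the sums agree with that reading.

```python
# Cycles per phase, copied from the shell phase breakdown.
phases = {
    "kernel banner to /init": 117_615_946,
    "/init to shell banner": 1_085_876,
    "shell banner to first command": 36_087_609,
    "echo command": 1_598,
    "uname -a": 1_991_539,
    "ls /bin /usr/share": 33_283_552,
    "cat sample file": 3_445_788,
    "touch/write/cat/rm /tmp file": 10_681_546,
    "8x ash loop with file I/O": 16_364_812,
    "final marker": 1_437_800,
}

# Assumed window: every phase from the first command onward.
names = list(phases)
window_phases = names[names.index("echo command"):]
shell_window = sum(phases[name] for name in window_phases)

print(f"shell window cycles: {shell_window:,}")
```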

Cycle Shape

P92 fetch-queue workload: 222,624,131 cycles, CPI 2.57.

state	cycles	share
fetch	8,335,372	3.7%
execute	86,712,815	39%
mem	28,170,061	12.7%
walker	4,727,967	2.1%
writeback	86,687,669	38.9%
mul/div	7,988,543	3.6%

The old S_IC_FILL cliff is still gone. P92’s cost is subtler: more frontend work happens earlier, but the core is still effectively single-lane around memory service and control flow.
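
A quick consistency check on the state breakdown, as a sketch. The observation that writeback cycles match the retired-instruction count comes straight from the published numbers; the variable names are illustrative.

```python
# Cycles per state, copied from the state breakdown.
states = {
    "fetch": 8_335_372,
    "execute": 86_712_815,
    "mem": 28_170_061,
    "walker": 4_727_967,
    "writeback": 86_687_669,
    "mul/div": 7_988_543,
}
total_cycles = 222_624_131         # P92 post-load cycles
retired_instructions = 86_687_669  # P92, from the Result table

unaccounted = total_cycles - sum(states.values())
print(f"cycles outside the listed states: {unaccounted:,}")

# One writeback cycle per retired instruction in this run.
print(states["writeback"] == retired_instructions)

for state, cycles in states.items():
    print(f"{state:10s} {cycles / total_cycles:6.1%}")
```

Execute and writeback together account for about 78% of cycles, so with one writeback cycle per retired instruction the floor for this state machine is close to two cycles per instruction before any memory stall is counted.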

Hot Functions

P92 BusyBox shell symbols: 65,631 samples, one sample every 1,024 cycles.

symbol	image	samples	share of samples
printf_core	busybox	3,595	5.5%
memset	kernel	3,306	5%
memcpy	busybox	2,326	3.5%
vruntime_eligible	kernel	2,310	3.5%
blake2s_compress_generic	kernel	1,797	2.7%
__fwritex	busybox	1,706	2.6%
memcpy	kernel	1,686	2.6%
unmap_page_range	kernel	1,102	1.7%
handle_exception	kernel	1,098	1.7%
avg_vruntime	kernel	890	1.4%
n_tty_write	kernel	860	1.3%
memset	busybox	805	1.2%
ret_from_exception	kernel	795	1.2%
n_tty_read	kernel	692	1.1%
next_uptodate_folio	kernel	665	1%
(remaining)	-	36,633	55.8%

The hot-symbol profile remains a mix of BusyBox shell code, kernel memory paths, scheduler/exception overhead, and libc-style routines. Frontend work helps only if it stops showing up as contention elsewhere.
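
Sample counts convert to approximate cycle counts through the sampling period. The sketch below also checks that the profile spans the shell window from the Result table; `estimate` is a helper name introduced here for illustration.

```python
SAMPLE_PERIOD = 1_024             # cycles between samples
total_samples = 65_631
shell_window_cycles = 67_206_635  # P92, from the Result table

# The sampled span matches the shell window to within one period.
covered = total_samples * SAMPLE_PERIOD
assert 0 <= shell_window_cycles - covered < SAMPLE_PERIOD


def estimate(samples: int):
    """Return (approximate cycles, share of all samples) for a symbol."""
    return samples * SAMPLE_PERIOD, samples / total_samples


cycles, share = estimate(3_595)   # printf_core in busybox
print(f"printf_core: about {cycles:,} cycles ({share:.1%} of samples)")
```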

Honest Status

check	status
One-entry fetch queue	PASS
Execute-stage safe next-PC prefetch	PASS
BusyBox shell workload runs	PASS
Shell-window speedup vs P91	FAIL
Fetch-stall reduction vs P91	PASS
LibreLane hardening	NOT RUN

Next

P93 should add the tiny predictor: BTB, 2-bit direction counters, and a small return-address stack. A fetch queue that only trusts fall-through has limited room to run ahead.
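
For readers unfamiliar with 2-bit direction counters, here is a generic textbook sketch. It is not the P93 design: the class name, table size, and indexing are hypothetical, and the index assumes 4-byte aligned instructions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries: int = 64):
        self.entries = entries
        # Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
        self.counters = [1] * entries

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte aligned instructions, so drop the low 2 bits.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(pred: TwoBitPredictor, pc: int, outcomes) -> int:
    """Count wrong predictions for one branch over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if pred.predict(pc) != taken:
            misses += 1
        pred.update(pc, taken)
    return misses


# A loop branch taken nine times, then not taken, run twice.
loop = ([True] * 9 + [False]) * 2
print(mispredictions(TwoBitPredictor(), 0x8000_0040, loop))
```

The second bit is what keeps a single loop exit from flipping the prediction for the next run of the loop, which is the property that would let a fetch queue keep running ahead across taken branches.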
