Project No. 92 of 147 on the ladder

Fetch queue

Introduces: one-entry fetch queue; execute-stage next-PC prefetch; frontend decoupling measurement

harden state | last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P92 adds a one-entry fetch queue between instruction delivery and execute. It lets a safe subset of S_EXECUTE cycles prepare the fall-through next_pc, then lets S_WB consume that queued instruction before using the older writeback prefetch path.
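The queue behaviour described above can be sketched as a small behavioural model. This is a Python illustration, not the core's RTL: the class name `OneEntryFetchQueue` and the methods `fill` and `consume` are hypothetical, and the rule the core uses to decide which S_EXECUTE cycles are "safe" is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OneEntryFetchQueue:
    """Hypothetical model: holds at most one prefetched (pc, instruction) pair."""
    pc: Optional[int] = None      # address of the queued instruction
    instr: Optional[int] = None   # queued instruction word

    @property
    def valid(self) -> bool:
        return self.pc is not None

    def fill(self, pc: int, instr: int) -> bool:
        """Execute stage: store the fall-through instruction if the slot is free."""
        if self.valid:
            return False
        self.pc, self.instr = pc, instr
        return True

    def consume(self, next_pc: int) -> Optional[int]:
        """Writeback: return the entry if it matches next_pc; always empty the slot."""
        hit = self.instr if self.pc == next_pc else None
        self.pc = self.instr = None
        return hit


# Illustrative instruction memory; the words are placeholders, not real encodings.
imem = {0x0: 0x13, 0x4: 0x93, 0x8: 0x6F}
queue = OneEntryFetchQueue()

queue.fill(0x4, imem[0x4])             # executing pc 0x0 prefetches pc + 4
fall_through = queue.consume(0x4)      # next_pc is the fall-through: queue hit

queue.fill(0x8, imem[0x8])             # executing pc 0x4 prefetches pc + 4
redirected = queue.consume(0x40)       # next_pc was redirected: entry discarded
```

On a miss the model returns `None`, which stands in for falling back to the older writeback prefetch path.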

This is the first real frontend/backend decoupling point in the Linux shell core. It is also a useful failure: the queue works, but it does not make the workload faster yet.

Result

metric | P89 word I-cache | P91 fill buffer | P92 fetch queue
post-load cycles | 222,317,206 | 221,327,811 | 222,624,131
shell window cycles | 66,957,620 | 65,985,297 | 67,206,635
retired instructions | 86,601,839 | 86,295,205 | 86,687,669
CPI | 2.5671 | 2.5648 | 2.5681
memory handshakes | 31,289,313 | 31,944,860 | 70,222,426
memory stall cycles | 83,361,545 | 84,488,165 | 60,050,776
fetch stall cycles | 54,266,192 | 55,533,555 | 23,555,005
I-cache hits | 6,434,333 | 8,855,599 | 42,665,352

P92 changes the shape of instruction delivery but does not beat P91:

comparison | result
shell window vs P91 | +1.85%
post-load cycles vs P91 | +0.59%
shell window vs P89 | +0.37%
fetch stalls vs P91 | -57.58%

That is the honest result. The queue cuts measured fetch stall cycles by 57.58%, but the shell window is still 1.85% slower than P91.
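The comparison percentages follow directly from the result table. A short Python check, using only the numbers reported above (the helper name `pct_change` is illustrative):

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


P89 = {"shell_window": 66_957_620}
P91 = {"shell_window": 65_985_297, "post_load": 221_327_811, "fetch_stalls": 55_533_555}
P92 = {"shell_window": 67_206_635, "post_load": 222_624_131, "fetch_stalls": 23_555_005}

comparisons = {
    "shell window vs P91": pct_change(P92["shell_window"], P91["shell_window"]),
    "post-load cycles vs P91": pct_change(P92["post_load"], P91["post_load"]),
    "shell window vs P89": pct_change(P92["shell_window"], P89["shell_window"]),
    "fetch stalls vs P91": pct_change(P92["fetch_stalls"], P91["fetch_stalls"]),
}

for name, value in comparisons.items():
    print(f"{name:26s} {value:+.2f}%")
```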

Queue Counters

counter | value
queue valid cycles | 53,982,463
queue fills | 53,982,463
queue consumes | 53,982,463
execute-prefetch cycles | 53,982,463

The queue is not dead. It fills and consumes nearly 54 million instructions, and the fill and consume counts match exactly. The missing piece is not activity; it is enough frontend independence to make that activity profitable.
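To put the counters in proportion, the sketch below relates queue consumes to retired instructions. It uses only figures from this page. Treating one consume as one delivered instruction is an assumption about what the counter means.

```python
QUEUE_FILLS = 53_982_463
QUEUE_CONSUMES = 53_982_463
RETIRED_INSTRUCTIONS = 86_687_669   # P92 column of the result table

# Assumption: each consume hands exactly one instruction to the backend.
queue_share = QUEUE_CONSUMES / RETIRED_INSTRUCTIONS
unconsumed_fills = QUEUE_FILLS - QUEUE_CONSUMES

print(f"instructions delivered through the queue: {queue_share:.1%}")
print(f"fills never consumed: {unconsumed_fills}")
```

Under that assumption, a bit over six in ten retired instructions arrive through the queue, so low activity is not why the speedup is missing.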

Memory Stalls

Memory stalls (P92 fetch-queue workload): 60,050,776 stall cycles, 70,222,426 handshakes

# | class | stall cycles | share | requests
1 | instruction fetch | 23,555,005 | 39.2% | 49,683,780
2 | data load | 14,625,563 | 24.4% | 987,610
3 | data store | 11,991,847 | 20% | 221,694
4 | atomic memory op | 158,522 | 0.3% | 184,825
5 | page walk for fetch | 1,130,996 | 1.9% | 1,124,842
6 | page walk for load/store | 1,236,316 | 2.1% | 1,235,813
7 | other | 7,352,527 | 12.2% | 16,783,862

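Fetch stall cycles fall hard (from 55,533,555 in P91 to 23,555,005), but memory handshakes more than double (from 31,944,860 to 70,222,426), and the single memory path still has to carry instruction fetch, data traffic, AMOs, and page walks. P92 moves pressure around more than it removes it.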
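One way to read the breakdown is stall cycles per request for each class. The Python sketch below derives that ratio from the figures above. The ratio is not a metric the project reports, and the dictionary name `STALLS` is illustrative.

```python
# (stall cycles, requests) per class, copied from the P92 memory-stall breakdown
STALLS = {
    "instruction fetch": (23_555_005, 49_683_780),
    "data load": (14_625_563, 987_610),
    "data store": (11_991_847, 221_694),
    "atomic memory op": (158_522, 184_825),
    "page walk for fetch": (1_130_996, 1_124_842),
    "page walk for load/store": (1_236_316, 1_235_813),
    "other": (7_352_527, 16_783_862),
}

total_stalls = sum(stalls for stalls, _ in STALLS.values())
total_requests = sum(reqs for _, reqs in STALLS.values())
stalls_per_request = {name: s / r for name, (s, r) in STALLS.items()}

for name, (stalls, _) in STALLS.items():
    share = stalls / total_stalls
    print(f"{name:26s} {share:6.1%} of stalls  "
          f"{stalls_per_request[name]:6.2f} stall cycles/request")
```

The class totals add up to the reported 60,050,776 stall cycles and 70,222,426 handshakes. Instruction fetch averages under one stall cycle per request, while each data store costs tens of cycles.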

Shell Phases

Shell phases (P92 shell workload): 222,624,131 cycles, CPI 2.57

# | phase | cycles | share
1 | kernel banner to /init | 117,615,946 | 53%
2 | /init to shell banner | 1,085,876 | 0.5%
3 | shell banner to first command | 36,087,609 | 16.3%
4 | echo command | 1,598 | 0%
5 | uname -a | 1,991,539 | 0.9%
6 | ls /bin /usr/share | 33,283,552 | 15%
7 | cat sample file | 3,445,788 | 1.6%
8 | touch/write/cat/rm /tmp file | 10,681,546 | 4.8%
9 | 8x ash loop with file I/O | 16,364,812 | 7.4%
10 | final marker | 1,437,800 | 0.7%

The shell benchmark still reaches the same final file marker. The slower window is the important part: functional progress is not the same as performance progress.
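The page does not define the shell window, but the phase figures suggest a reading: phases 4 through 10 sum exactly to the reported 67,206,635 shell window cycles. The Python check below shows this; treating those phases as "the window" is an inference from the arithmetic, not a stated definition.

```python
PHASE_CYCLES = [
    ("kernel banner to /init", 117_615_946),
    ("/init to shell banner", 1_085_876),
    ("shell banner to first command", 36_087_609),
    ("echo command", 1_598),
    ("uname -a", 1_991_539),
    ("ls /bin /usr/share", 33_283_552),
    ("cat sample file", 3_445_788),
    ("touch/write/cat/rm /tmp file", 10_681_546),
    ("8x ash loop with file I/O", 16_364_812),
    ("final marker", 1_437_800),
]
SHELL_WINDOW_CYCLES = 67_206_635   # P92 column of the result table

# Phases 4..10 (list indices 3..9) are the command phases after the first prompt.
command_phase_cycles = sum(cycles for _, cycles in PHASE_CYCLES[3:])
all_phase_cycles = sum(cycles for _, cycles in PHASE_CYCLES)

print(command_phase_cycles == SHELL_WINDOW_CYCLES)
print(f"all phases: {all_phase_cycles:,}")
```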

Cycle Shape

State breakdown (P92 fetch-queue workload): 222,624,131 cycles, CPI 2.57

# | state | share | cycles
1 | fetch | 3.7% | 8,335,372
2 | execute | 39% | 86,712,815
3 | mem | 12.7% | 28,170,061
4 | walker | 2.1% | 4,727,967
5 | writeback | 38.9% | 86,687,669
6 | mul/div | 3.6% | 7,988,543

The old S_IC_FILL cliff is still gone. P92’s cost is subtler: more frontend work happens earlier, but the core is still effectively single-lane around memory service and control flow.
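Dividing each state's cycles by retired instructions shows where the CPI comes from. The Python sketch below uses only the state breakdown and the retired-instruction count from this page. The variable names are illustrative.

```python
STATE_CYCLES = {
    "fetch": 8_335_372,
    "execute": 86_712_815,
    "mem": 28_170_061,
    "walker": 4_727_967,
    "writeback": 86_687_669,
    "mul/div": 7_988_543,
}
RETIRED_INSTRUCTIONS = 86_687_669

# Cycles each state contributes per retired instruction
per_instruction = {s: c / RETIRED_INSTRUCTIONS for s, c in STATE_CYCLES.items()}
sequential_floor = per_instruction["execute"] + per_instruction["writeback"]

for state, cpi_part in per_instruction.items():
    print(f"{state:10s} {cpi_part:.4f} cycles/instruction")
print(f"execute + writeback: {sequential_floor:.4f}")
```

Writeback cycles equal retired instructions exactly, and execute is within a fraction of a percent of the same count. Together they account for about 2.0 of the 2.57 CPI. That fits the single-lane description: each instruction still passes through execute and writeback one after the other.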

Hot Functions

Hot functions (P92 BusyBox shell symbols): 65,631 samples, one sample every 1,024 cycles

# | symbol | image | share | samples
1 | printf_core | busybox | 5.5% | 3,595
2 | memset | kernel | 5% | 3,306
3 | memcpy | busybox | 3.5% | 2,326
4 | vruntime_eligible | kernel | 3.5% | 2,310
5 | blake2s_compress_generic | kernel | 2.7% | 1,797
6 | __fwritex | busybox | 2.6% | 1,706
7 | memcpy | kernel | 2.6% | 1,686
8 | unmap_page_range | kernel | 1.7% | 1,102
9 | handle_exception | kernel | 1.7% | 1,098
10 | avg_vruntime | kernel | 1.4% | 890
11 | n_tty_write | kernel | 1.3% | 860
12 | memset | busybox | 1.2% | 805
13 | ret_from_exception | kernel | 1.2% | 795
14 | n_tty_read | kernel | 1.1% | 692
15 | next_uptodate_folio | kernel | 1% | 665
16 | (remaining) | remaining | 55.8% | 36,633

The hot-symbol profile remains a mix of BusyBox shell code, kernel memory paths, scheduler/exception overhead, and libc-style routines. Frontend work helps only if it stops showing up as contention elsewhere.
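Sample counts convert to approximate cycles by multiplying by the sampling period. The Python sketch below does that for the top three symbols. It assumes the profile covers the shell window, which the page does not state outright, although the sampled span agrees with the reported window to within one period.

```python
SAMPLE_PERIOD_CYCLES = 1_024
TOTAL_SAMPLES = 65_631
SHELL_WINDOW_CYCLES = 67_206_635   # P92 column of the result table

TOP_SYMBOLS = {
    "printf_core (busybox)": 3_595,
    "memset (kernel)": 3_306,
    "memcpy (busybox)": 2_326,
}

# Cycles spanned by the whole profile, and a per-symbol cycle estimate
sampled_cycles = TOTAL_SAMPLES * SAMPLE_PERIOD_CYCLES
estimated_cycles = {sym: n * SAMPLE_PERIOD_CYCLES for sym, n in TOP_SYMBOLS.items()}

print(f"profile spans {sampled_cycles:,} cycles")
for symbol, cycles in estimated_cycles.items():
    print(f"{symbol:24s} ~{cycles:,} cycles")
```

These are sampling estimates, so each per-symbol figure is only as fine as the 1,024-cycle period allows.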

Honest Status

check | status
One-entry fetch queue | PASS
Execute-stage safe next-PC prefetch | PASS
BusyBox shell workload runs | PASS
Shell-window speedup vs P91 | FAIL
Fetch-stall reduction vs P91 | PASS
LibreLane hardening | NOT RUN

Next

P93 should add a tiny predictor: a BTB, 2-bit direction counters, and a small return-address stack. A fetch queue that only trusts the fall-through path has limited room to run ahead.
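For reference, a 2-bit direction counter is the textbook saturating scheme sketched below in Python. This is a generic illustration, not the P93 design: the class name, the initial state, and the loop-branch trace are all assumptions, and the BTB and return-address stack are not modelled.

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, state: int = 1):   # assumed start: weakly not-taken
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


# Hypothetical trace: a loop branch taken 9 times, then falling out, run twice.
trace = ([True] * 9 + [False]) * 2

counter = TwoBitCounter()
mispredictions = 0
for taken in trace:
    if counter.predict() != taken:
        mispredictions += 1
    counter.update(taken)

print(f"{mispredictions} mispredictions in {len(trace)} branches")
```

On this trace the counter misses the first branch while it warms up and then only the loop exit on each pass. A single not-taken outcome does not flip the prediction, which is what the second bit buys.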