P92 adds a one-entry fetch queue between instruction delivery and
execute. It lets a safe subset of S_EXECUTE cycles prepare the
fall-through next_pc, then lets S_WB consume that queued instruction
before using the older writeback prefetch path.
This is the first real frontend/backend decoupling point in the Linux shell core. It is also a useful failure: the queue works, but it does not make the workload faster yet.
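The handoff described above can be sketched as a small behavioral model. This is not the core's RTL: the class and function names are hypothetical, and the rule that a mismatched entry is discarded at writeback is an assumption, since the exact "safe subset" conditions are not spelled out here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class FetchQueue:
    """One-entry queue holding at most one prefetched (pc, instruction) pair."""
    pc: Optional[int] = None
    instr: Optional[int] = None

    @property
    def valid(self) -> bool:
        return self.pc is not None

    def fill(self, pc: int, instr: int) -> bool:
        """Called from the execute state; refuses to overwrite a valid entry."""
        if self.valid:
            return False
        self.pc, self.instr = pc, instr
        return True

    def consume(self, next_pc: int) -> Optional[int]:
        """Called from writeback; always leaves the queue empty.

        Returns the queued instruction when it matches next_pc, else None
        (assumed behavior: a stale fall-through entry is simply dropped).
        """
        hit = self.instr if self.pc == next_pc else None
        self.pc = self.instr = None
        return hit


def writeback_fetch(queue: FetchQueue, next_pc: int,
                    fallback_fetch: Callable[[int], int]) -> Tuple[int, str]:
    """Prefer the queued instruction; otherwise use the older prefetch path."""
    instr = queue.consume(next_pc)
    if instr is not None:
        return instr, "queue"
    return fallback_fetch(next_pc), "writeback-prefetch"
```

In this model a taken branch leaves the queue holding the wrong instruction, so writeback falls back to the older path. That is the limitation a fall-through-only queue runs into.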
Result
| metric | P89 word I-cache | P91 fill buffer | P92 fetch queue |
|---|---|---|---|
| post-load cycles | 222,317,206 | 221,327,811 | 222,624,131 |
| shell window cycles | 66,957,620 | 65,985,297 | 67,206,635 |
| retired instructions | 86,601,839 | 86,295,205 | 86,687,669 |
| CPI | 2.5671 | 2.5648 | 2.5681 |
| memory handshakes | 31,289,313 | 31,944,860 | 70,222,426 |
| memory stall cycles | 83,361,545 | 84,488,165 | 60,050,776 |
| fetch stall cycles | 54,266,192 | 55,533,555 | 23,555,005 |
| I-cache hits | 6,434,333 | 8,855,599 | 42,665,352 |
P92 changes the shape of instruction delivery but does not beat P91:
| comparison | result |
|---|---|
| shell window vs P91 | +1.85% |
| post-load cycles vs P91 | +0.59% |
| shell window vs P89 | +0.37% |
| fetch stalls vs P91 | -57.58% |
That is the honest result. The queue cuts measured fetch-class stalls by 57.58% against P91, but the shell window is still 1.85% slower.
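The comparison percentages can be reproduced from the raw counters in the result table. The sketch below only restates that arithmetic; the dictionary keys are illustrative names, not counter names from the core.

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


# Values copied from the result table above.
p89 = {"shell_window": 66_957_620}
p91 = {"post_load": 221_327_811, "shell_window": 65_985_297,
       "handshakes": 31_944_860, "mem_stalls": 84_488_165,
       "fetch_stalls": 55_533_555}
p92 = {"post_load": 222_624_131, "shell_window": 67_206_635,
       "handshakes": 70_222_426, "mem_stalls": 60_050_776,
       "fetch_stalls": 23_555_005}

deltas = {
    "shell window vs P91": pct_change(p92["shell_window"], p91["shell_window"]),
    "post-load cycles vs P91": pct_change(p92["post_load"], p91["post_load"]),
    "shell window vs P89": pct_change(p92["shell_window"], p89["shell_window"]),
    "fetch stalls vs P91": pct_change(p92["fetch_stalls"], p91["fetch_stalls"]),
    "memory handshakes vs P91": pct_change(p92["handshakes"], p91["handshakes"]),
    "memory stalls vs P91": pct_change(p92["mem_stalls"], p91["mem_stalls"]),
}

for name, value in deltas.items():
    print(f"{name}: {value:+.2f}%")
```

The last two rows are not in the comparison table but follow from the same counters: memory handshakes more than double while total memory stall cycles drop by roughly 29%, and the shell window still gets slower.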
Queue Counters
| counter | value |
|---|---|
| queue valid cycles | 53,982,463 |
| queue fills | 53,982,463 |
| queue consumes | 53,982,463 |
| execute-prefetch cycles | 53,982,463 |
The queue is not dead. It fills and consumes about 54 million instructions, and the fill and consume counts match exactly. The missing piece is not activity; it is enough frontend independence to make that activity profitable.
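One rough way to size that activity is to compare queue consumes with retired instructions. This assumes each consume delivers one instruction that goes on to retire, which the counters suggest but do not prove.

```python
# Counters copied from the tables above.
queue_consumes = 53_982_463
retired_instructions = 86_687_669

# Share of retired instructions that arrived through the fetch queue,
# under the assumption of one retired instruction per consume.
queue_share = queue_consumes / retired_instructions
print(f"queue-delivered share: {queue_share:.1%}")
```

Under that assumption, a bit over three fifths of retired instructions come through the queue, so the lack of speedup is not explained by the queue sitting idle.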
Memory Stalls
| source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 23,555,005 | 39.2% | 49,683,780 |
| data load | 14,625,563 | 24.4% | 987,610 |
| data store | 11,991,847 | 20.0% | 221,694 |
| atomic memory op | 158,522 | 0.3% | 184,825 |
| page walk for fetch | 1,130,996 | 1.9% | 1,124,842 |
| page walk for load/store | 1,236,316 | 2.1% | 1,235,813 |
| other | 7,352,527 | 12.2% | 16,783,862 |
Fetch-class stall time falls hard, but the single memory path still has to carry instruction fetch, data traffic, AMOs, and page walks. P92 moves pressure around more than it removes pressure.
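Dividing stall cycles by request count gives a rough per-request stall cost for each source. This is an average derived from the figures above, not a measured latency, and the tuple layout is only illustrative.

```python
# (source, stall cycles, requests), copied from the memory stall breakdown.
stalls = [
    ("instruction fetch", 23_555_005, 49_683_780),
    ("data load", 14_625_563, 987_610),
    ("data store", 11_991_847, 221_694),
    ("atomic memory op", 158_522, 184_825),
    ("page walk for fetch", 1_130_996, 1_124_842),
    ("page walk for load/store", 1_236_316, 1_235_813),
    ("other", 7_352_527, 16_783_862),
]

total_stall_cycles = sum(cycles for _, cycles, _ in stalls)
total_requests = sum(reqs for _, _, reqs in stalls)

# Average stall cycles charged to each request, per source.
per_request = {name: cycles / reqs for name, cycles, reqs in stalls}

for name, cycles, reqs in stalls:
    share = cycles / total_stall_cycles
    print(f"{name:26s} {share:6.1%}  {per_request[name]:6.2f} stall cycles/request")
```

The totals match the result table (60,050,776 stall cycles and 70,222,426 handshakes). Instruction fetch now averages under half a stall cycle per request, while data loads and stores average roughly 15 and 54, so the remaining stall time is concentrated in the data side of the shared memory path.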
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,615,946 | 53% |
| /init to shell banner | 1,085,876 | 0.5% |
| shell banner to first command | 36,087,609 | 16.3% |
| echo command | 1,598 | 0% |
| uname -a | 1,991,539 | 0.9% |
| ls /bin /usr/share | 33,283,552 | 15% |
| cat sample file | 3,445,788 | 1.6% |
| touch/write/cat/rm /tmp file | 10,681,546 | 4.8% |
| 8x ash loop with file I/O | 16,364,812 | 7.4% |
| final marker | 1,437,800 | 0.7% |
The shell benchmark still reaches the same final file marker. The slower window is the important part: functional progress is not the same as performance progress.
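The phase numbers can be tied back to the headline metric. Summing the phases from the echo command through the final marker reproduces the P92 shell window figure exactly, which suggests that is how the window is bounded; that boundary is inferred from the arithmetic, not stated by the benchmark.

```python
# (phase, cycles), copied from the shell phase breakdown.
phases = [
    ("kernel banner to /init", 117_615_946),
    ("/init to shell banner", 1_085_876),
    ("shell banner to first command", 36_087_609),
    ("echo command", 1_598),
    ("uname -a", 1_991_539),
    ("ls /bin /usr/share", 33_283_552),
    ("cat sample file", 3_445_788),
    ("touch/write/cat/rm /tmp file", 10_681_546),
    ("8x ash loop with file I/O", 16_364_812),
    ("final marker", 1_437_800),
]

names = [name for name, _ in phases]
first_command = names.index("echo command")

# Assumed window: first shell command through the final marker.
shell_window = sum(cycles for _, cycles in phases[first_command:])
all_phases = sum(cycles for _, cycles in phases)

print(f"shell window: {shell_window:,}")
print(f"all phases:   {all_phases:,}")
```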
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 3.7% | 8,335,372 |
| execute | 39% | 86,712,815 |
| mem | 12.7% | 28,170,061 |
| walker | 2.1% | 4,727,967 |
| writeback | 38.9% | 86,687,669 |
| mul/div | 3.6% | 7,988,543 |
The old S_IC_FILL cliff is still gone. P92’s cost is subtler: more
frontend work happens earlier, but the core is still effectively
single-lane around memory service and control flow.
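Dividing each state's cycle count by retired instructions shows how the CPI is built up. The sketch below uses only the figures above; the state keys are illustrative labels.

```python
RETIRED = 86_687_669
POST_LOAD_CYCLES = 222_624_131

# Cycles spent per state, copied from the cycle shape breakdown.
cycles = {
    "fetch": 8_335_372,
    "execute": 86_712_815,
    "mem": 28_170_061,
    "walker": 4_727_967,
    "writeback": 86_687_669,
    "mul/div": 7_988_543,
}

# Contribution of each state to cycles per retired instruction.
cpi_parts = {state: count / RETIRED for state, count in cycles.items()}
cpi_total = sum(cpi_parts.values())

for state, part in cpi_parts.items():
    print(f"{state:10s} {part:.4f}")
print(f"{'total':10s} {cpi_total:.4f}")
```

Writeback cycles equal retired instructions exactly and execute is within a fraction of a percent of that, so those two states alone account for about 2.0 of the 2.568 CPI. As long as every instruction passes through both states in sequence, shaving fetch stalls can only act on the remaining 0.57 or so.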
Hot Functions
- 5.5% of samples (3,595 samples)
- 5% of samples (3,306 samples)
- 3.5% of samples (2,326 samples)
- 3.5% of samples (2,310 samples)
- 2.7% of samples (1,797 samples)
- 2.6% of samples (1,706 samples)
- 2.6% of samples (1,686 samples)
- 1.7% of samples (1,102 samples)
- 1.7% of samples (1,098 samples)
- 1.4% of samples (890 samples)
- 1.3% of samples (860 samples)
- 1.2% of samples (805 samples)
- 1.2% of samples (795 samples)
- 1.1% of samples (692 samples)
- 1% of samples (665 samples)
- 55.8% of samples (36,633 samples)
The hot-symbol profile remains a mix of BusyBox shell code, kernel memory paths, scheduler/exception overhead, and libc-style routines. Frontend work helps only if it stops showing up as contention elsewhere.
Honest Status
| check | status |
|---|---|
| One-entry fetch queue | PASS |
| Execute-stage safe next-PC prefetch | PASS |
| BusyBox shell workload runs | PASS |
| Shell-window speedup vs P91 | FAIL |
| Fetch-stall reduction vs P91 | PASS |
| LibreLane hardening | NOT RUN |
Next
P93 should add a tiny predictor: a BTB, 2-bit direction counters, and a small return-address stack. A fetch queue that only trusts fall-through has limited room to run ahead.
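As a rough illustration of those three pieces, here is a behavioral sketch. It is not a P93 design: the table size, indexing, initial counter state, and the assumption of 4-byte instructions are all hypothetical choices made for the example.

```python
from typing import List, Optional, Tuple


class TinyPredictor:
    """Direct-mapped BTB with 2-bit counters plus a small return-address stack."""

    def __init__(self, entries: int = 16, ras_depth: int = 4) -> None:
        self.entries = entries
        self.counters: List[int] = [1] * entries  # 0..3, start weakly not-taken
        self.btb: List[Optional[Tuple[int, int]]] = [None] * entries
        self.ras: List[int] = []
        self.ras_depth = ras_depth

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict(self, pc: int) -> int:
        """Predicted next PC: the BTB target if predicted taken, else pc + 4."""
        i = self._index(pc)
        entry = self.btb[i]
        if entry is not None and entry[0] == pc and self.counters[i] >= 2:
            return entry[1]
        return pc + 4

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Train the counter and BTB with the resolved branch outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btb[i] = (pc, target)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

    def push_return(self, return_pc: int) -> None:
        """Record a call's return address, dropping the oldest when full."""
        if len(self.ras) == self.ras_depth:
            self.ras.pop(0)
        self.ras.append(return_pc)

    def pop_return(self) -> Optional[int]:
        """Predicted return target, or None when the stack is empty."""
        return self.ras.pop() if self.ras else None
```

With something like this in front of the queue, the execute-stage prefetch could target `predict(pc)` instead of always assuming fall-through.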