P91 fixes the P90 line-fill policy. The cache is still 64 direct-mapped
lines with 4 words per line, but an I-cache miss no longer parks the
core in S_IC_FILL. The returned critical word is executed immediately;
the rest of the line is filled later from a one-entry background
descriptor when S_WB has a spare memory slot.
This is still not a real XiangShan-style frontend. It is the first measured step away from a blocking frontend.
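The miss policy described above can be sketched as a small behavioral model. This is only an illustration, not the RTL: the class name `ICacheModel`, the per-word valid bits, and the rule that a newer miss overwrites the single descriptor are all assumptions, since the text only states the geometry (64 lines, 4 words), critical-word-first delivery, and a one-entry background descriptor.

```python
from dataclasses import dataclass, field

LINES = 64          # direct-mapped lines (from the text)
WORDS_PER_LINE = 4  # words per line (from the text)


@dataclass
class ICacheModel:
    """Toy model: per-word valid bits plus a one-entry background descriptor."""
    tags: list = field(default_factory=lambda: [None] * LINES)
    valid: list = field(
        default_factory=lambda: [[False] * WORDS_PER_LINE for _ in range(LINES)]
    )
    pending: tuple = None  # (index, tag) of the line still being filled

    @staticmethod
    def split(word_addr):
        # word address -> (tag, line index, word offset inside the line)
        offset = word_addr % WORDS_PER_LINE
        index = (word_addr // WORDS_PER_LINE) % LINES
        tag = word_addr // (WORDS_PER_LINE * LINES)
        return tag, index, offset

    def fetch(self, word_addr):
        """Return 'hit' or 'miss'. A miss installs only the critical word."""
        tag, index, offset = self.split(word_addr)
        if self.tags[index] == tag and self.valid[index][offset]:
            return "hit"
        if self.tags[index] != tag:
            # Conflict or cold miss: drop the old line contents.
            self.tags[index] = tag
            self.valid[index] = [False] * WORDS_PER_LINE
        self.valid[index][offset] = True  # critical word is usable at once
        # Assumption: a newer miss simply overwrites the single descriptor.
        self.pending = (index, tag)
        return "miss"

    def spare_slot(self):
        """Memory port is idle: fill one missing word of the pending line."""
        if self.pending is None:
            return False
        index, tag = self.pending
        if self.tags[index] == tag:
            for w in range(WORDS_PER_LINE):
                if not self.valid[index][w]:
                    self.valid[index][w] = True
                    if all(self.valid[index]):
                        self.pending = None  # line complete
                    return True
        self.pending = None  # stale or already complete
        return False
```

In this model the core never waits for the rest of the line: `fetch` returns after the critical word, and the remaining words arrive only when something calls `spare_slot`, which stands in for an idle memory slot in S_WB.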
Result
| metric | P89 word I-cache | P90 blocking line fill | P91 fill buffer |
|---|---|---|---|
| post-load cycles | 222,317,206 | 245,417,593 | 221,327,811 |
| shell window cycles | 66,957,620 | 84,084,195 | 65,985,297 |
| retired instructions | 86,601,839 | 89,490,840 | 86,295,205 |
| CPI | 2.5671 | 2.7424 | 2.5648 |
| memory handshakes | 31,289,313 | 37,457,672 | 31,944,860 |
| memory stall cycles | 83,361,545 | 86,350,457 | 84,488,165 |
| fetch handshakes | 27,560,784 | 33,504,298 | 28,263,597 |
| fetch stall cycles | 54,266,192 | 55,950,261 | 55,533,555 |
| I-cache hits | 6,434,333 | 12,352,686 | 8,855,599 |
P91 is a recovery from P90 and a small win over P89:
| comparison | result |
|---|---|
| shell window vs P90 | -21.52% |
| post-load cycles vs P90 | -9.82% |
| shell window vs P89 | -1.45% |
| post-load cycles vs P89 | -0.45% |
| fetch stalls vs P89 | +2.34% |
The last row matters. P91 wins shell time, but it does not yet beat P89 on fetch-stall cycles.
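The CPI row and the comparison percentages follow directly from the raw counters. A minimal sketch of that arithmetic, with the values copied from the Result table (the helper names `cpi` and `pct_change` are illustrative):

```python
# Counters copied from the Result table.
post_load = {"P89": 222_317_206, "P90": 245_417_593, "P91": 221_327_811}
shell = {"P89": 66_957_620, "P90": 84_084_195, "P91": 65_985_297}
retired = {"P89": 86_601_839, "P90": 89_490_840, "P91": 86_295_205}
fetch_stall = {"P89": 54_266_192, "P90": 55_950_261, "P91": 55_533_555}


def cpi(run):
    """Cycles per instruction over the post-load window."""
    return post_load[run] / retired[run]


def pct_change(new, old):
    """Signed percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


for run in ("P89", "P90", "P91"):
    print(f"{run} CPI = {cpi(run):.4f}")

print(f"shell window vs P90: {pct_change(shell['P91'], shell['P90']):+.2f}%")
print(f"post-load vs P90:    {pct_change(post_load['P91'], post_load['P90']):+.2f}%")
print(f"shell window vs P89: {pct_change(shell['P91'], shell['P89']):+.2f}%")
print(f"post-load vs P89:    {pct_change(post_load['P91'], post_load['P89']):+.2f}%")
print(f"fetch stalls vs P89: {pct_change(fetch_stall['P91'], fetch_stall['P89']):+.2f}%")
```

Negative values are improvements for cycle counts, so the positive fetch-stall figure is the one regression against P89.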
Fill Buffer
| counter | value |
|---|---|
| background active cycles | 195,614,189 |
| background issue cycles | 4,698,933 |
| background fills | 846,917 |
The descriptor is active for about 88% of post-load cycles (195.6M of 221.3M) but gets an idle bus slot in only about 2.4% of those active cycles (4.7M). That is enough to erase P90’s blocking penalty, but it is not enough to make instruction delivery feel like a modern frontend.
Memory Stalls
| bucket | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 55,533,555 | 65.7% | 28,263,597 |
| data load | 14,549,605 | 17.2% | 966,517 |
| data store | 11,923,432 | 14.1% | 215,684 |
| atomic memory op | 157,302 | 0.2% | 183,178 |
| page walk for fetch | 1,113,908 | 1.3% | 1,107,754 |
| page walk for load/store | 1,208,647 | 1.4% | 1,208,130 |
| other | 1,716 | 0% | 0 |
Fetch is still the largest memory-stall bucket. The fill buffer changes where the pain lands; it does not remove the instruction-delivery problem.
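The bucket figures can be cross-checked against the Result table, and dividing stall cycles by requests gives a rough per-request cost. A small sketch using the numbers listed above (the helper names are illustrative, and "stall cycles per request" is a derived ratio, not a counter the run reports):

```python
# (stall cycles, requests) per bucket, copied from the breakdown above.
buckets = {
    "instruction fetch": (55_533_555, 28_263_597),
    "data load": (14_549_605, 966_517),
    "data store": (11_923_432, 215_684),
    "atomic memory op": (157_302, 183_178),
    "page walk for fetch": (1_113_908, 1_107_754),
    "page walk for load/store": (1_208_647, 1_208_130),
    "other": (1_716, 0),
}

total_stall = sum(stall for stall, _ in buckets.values())
total_req = sum(req for _, req in buckets.values())


def share(name):
    """Bucket's share of all memory stall cycles, in percent."""
    return 100.0 * buckets[name][0] / total_stall


def stall_per_request(name):
    """Average stall cycles per request, or None when there were no requests."""
    stall, req = buckets[name]
    return stall / req if req else None


print(f"total stall cycles: {total_stall:,}")
print(f"total requests:     {total_req:,}")
for name in buckets:
    per_req = stall_per_request(name)
    per_req_text = "n/a" if per_req is None else f"{per_req:.2f}"
    print(f"{name:26s} {share(name):5.1f}%  {per_req_text} stall cycles/req")
```

The totals match the memory stall cycles and memory handshakes reported for P91 in the Result table. On these numbers fetch dominates by volume at roughly 2 stall cycles per request, while loads and stores are far fewer but cost roughly 15 and 55 stall cycles per request.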
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,613,988 | 53.3% |
| /init to shell banner | 1,085,500 | 0.5% |
| shell banner to first command | 36,014,961 | 16.3% |
| echo command | 1,598 | 0% |
| uname -a | 2,599,477 | 1.2% |
| ls /bin /usr/share | 32,253,703 | 14.6% |
| cat sample file | 3,183,166 | 1.4% |
| touch/write/cat/rm /tmp file | 10,684,061 | 4.8% |
| 8x ash loop with file I/O | 16,308,284 | 7.4% |
| final marker | 955,008 | 0.4% |
The shell window is the practical win. This run gets through the same BusyBox commands faster than both P89 and P90.
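The phase counts also reconcile with the shell window figure in the Result table. The sketch below assumes the shell window spans the phases from the `echo` command through the final marker; that boundary is inferred from the numbers, not stated in the report.

```python
# Phase cycle counts copied from the Shell Phases breakdown.
phases = [
    ("kernel banner to /init", 117_613_988),
    ("/init to shell banner", 1_085_500),
    ("shell banner to first command", 36_014_961),
    ("echo command", 1_598),
    ("uname -a", 2_599_477),
    ("ls /bin /usr/share", 32_253_703),
    ("cat sample file", 3_183_166),
    ("touch/write/cat/rm /tmp file", 10_684_061),
    ("8x ash loop with file I/O", 16_308_284),
    ("final marker", 955_008),
]

total = sum(cycles for _, cycles in phases)

# Assumption: the shell window starts at the first command (echo).
start = [name for name, _ in phases].index("echo command")
shell_window = sum(cycles for _, cycles in phases[start:])

print(f"all phases:   {total:,}")
print(f"shell window: {shell_window:,}")
for name, cycles in phases:
    print(f"{name:30s} {100.0 * cycles / total:5.1f}%")
```

Under that assumption the command phases sum to 65,985,297 cycles, the same value reported as the P91 shell window, and boot (kernel banner to the first command) accounts for the remaining 70% or so of the measured phases.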
Cycle Shape
| state | cycles | share |
|---|---|---|
| fetch | 8,307,587 | 3.8% |
| execute | 86,320,041 | 39% |
| mem | 27,995,718 | 12.6% |
| walker | 4,638,439 | 2.1% |
| writeback | 86,295,205 | 39% |
| mul/div | 7,769,105 | 3.5% |
The important absence is S_IC_FILL. P91 keeps the 4-word line cache
without spending a visible state bucket waiting for future words.
Hot Functions
- 5.4% of samples (3,504 samples)
- 5.2% of samples (3,318 samples)
- 3.6% of samples (2,334 samples)
- 3.4% of samples (2,206 samples)
- 2.8% of samples (1,798 samples)
- 2.6% of samples (1,704 samples)
- 2.6% of samples (1,670 samples)
- 1.8% of samples (1,156 samples)
- 1.7% of samples (1,103 samples)
- 1.4% of samples (884 samples)
- 1.3% of samples (829 samples)
- 1.3% of samples (822 samples)
- 1.2% of samples (777 samples)
- 1% of samples (655 samples)
- 1% of samples (640 samples)
- 55.5% of samples (35,730 samples)
The hot-symbol view remains a mix of BusyBox, kernel memory routines, scheduler/exception paths, and shell machinery. Frontend work helps, but Linux shell performance is not a single-knob problem.
Honest Status
| check | status |
|---|---|
| Critical-word-first miss handling | PASS |
| Background I-cache fill descriptor | PASS |
| BusyBox shell workload runs | PASS |
| Shell-window speedup vs P90 | PASS |
| Shell-window speedup vs P89 | PASS |
| Fetch-stall speedup vs P89 | FAIL |
| LibreLane hardening | NOT RUN |
Next
P92 should add a fetch queue. P91 makes line fill less foolish, but execute still sees frontend memory timing directly. A queue is the next decoupling point.
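To illustrate why a queue is a decoupling point, here is a toy timing model. It is a sketch under strong assumptions, not a description of P92: `total_cycles`, the latency lists, and the `depth` parameter are all hypothetical, and the model ignores contention for the shared memory port, which the real core would have to arbitrate.

```python
def total_cycles(fetch_lat, exec_lat, depth):
    """Cycle at which the last instruction finishes executing.

    fetch_lat[i]: cycles memory needs to deliver instruction i
    exec_lat[i]:  cycles execute spends on instruction i
    depth:        fetch-queue entries; 0 means fetch and execute take turns
    """
    n = len(fetch_lat)
    fetch_done = [0] * n
    exec_start = [0] * n
    exec_done = [0] * n
    for i in range(n):
        prev_exec_done = exec_done[i - 1] if i > 0 else 0
        if depth == 0:
            # Coupled frontend: fetch waits for the previous instruction.
            fetch_start = prev_exec_done
        else:
            prev_fetch_done = fetch_done[i - 1] if i > 0 else 0
            # A slot frees up when instruction i - depth leaves the queue.
            slot_free = exec_start[i - depth] if i >= depth else 0
            fetch_start = max(prev_fetch_done, slot_free)
        fetch_done[i] = fetch_start + fetch_lat[i]
        exec_start[i] = max(fetch_done[i], prev_exec_done)
        exec_done[i] = exec_start[i] + exec_lat[i]
    return exec_done[-1] if n else 0


# One slow fetch (8 cycles) in a stream of otherwise fast fetches.
fetch_lat = [1, 1, 8, 1, 1, 1]
exec_lat = [3, 3, 3, 3, 3, 3]
for depth in (0, 1, 4):
    print(depth, total_cycles(fetch_lat, exec_lat, depth))
```

With no queue, every fetch latency lands on the critical path. With even one entry, fetch overlaps the previous instruction's execution, and a deeper queue lets fetch run further ahead so a single slow access is partly hidden behind work already queued.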