No. 91 of 147 projects on the ladder

I-cache fill buffer

Introduces: critical-word-first I-cache miss handling; one-entry background fill descriptor; P90 recovery

harden state
last run: 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P91 fixes the P90 line-fill policy. The cache is still 64 direct-mapped lines with 4 words per line, but an I-cache miss no longer parks the core in S_IC_FILL. The returned critical word is executed immediately; the rest of the line is filled later from a one-entry background descriptor when S_WB has a spare memory slot.

This is still not a real XiangShan-style frontend. It is the first measured step away from a blocking frontend.
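The miss policy described above can be sketched as a small behavioral model. This is not the project's RTL: the per-word valid bits, the rule that a newer miss overwrites the pending descriptor, the one-word-per-spare-slot fill, and every name here (`ICacheModel`, `FillDescriptor`, `spare_slot`) are assumptions made for illustration. Only the geometry (64 direct-mapped lines, 4 words per line) comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LINES = 64           # direct-mapped lines (from the text)
WORDS_PER_LINE = 4   # words per line (from the text)


@dataclass
class Line:
    tag: Optional[int] = None
    # Assumption: one valid bit per word, so a line can be partly filled.
    valid: List[bool] = field(default_factory=lambda: [False] * WORDS_PER_LINE)


@dataclass
class FillDescriptor:
    index: int  # line still waiting for its remaining words
    tag: int    # tag the line had when the miss happened


class ICacheModel:
    def __init__(self) -> None:
        self.lines = [Line() for _ in range(LINES)]
        self.pending: Optional[FillDescriptor] = None  # one-entry descriptor

    @staticmethod
    def split(word_addr: int):
        offset = word_addr % WORDS_PER_LINE
        index = (word_addr // WORDS_PER_LINE) % LINES
        tag = word_addr // (WORDS_PER_LINE * LINES)
        return tag, index, offset

    def fetch(self, word_addr: int) -> str:
        """Return "hit" or "miss". A miss delivers only the critical word."""
        tag, index, offset = self.split(word_addr)
        line = self.lines[index]
        if line.tag == tag and line.valid[offset]:
            return "hit"
        if line.tag != tag:
            # Conflict: reuse the line for the new tag, nothing valid yet.
            line.tag = tag
            line.valid = [False] * WORDS_PER_LINE
        line.valid[offset] = True  # critical word arrives and is used at once
        # Assumption: a newer miss simply replaces any older descriptor.
        self.pending = FillDescriptor(index, tag)
        return "miss"

    def spare_slot(self) -> bool:
        """Use one idle memory slot to fill one missing word, if any."""
        d = self.pending
        if d is None:
            return False
        line = self.lines[d.index]
        if line.tag == d.tag:
            for i in range(WORDS_PER_LINE):
                if not line.valid[i]:
                    line.valid[i] = True
                    if all(line.valid):
                        self.pending = None  # line complete
                    return True
        self.pending = None  # stale or already complete descriptor
        return False
```

In this sketch the core never waits for the rest of the line: `fetch` returns as soon as the critical word is available, and `spare_slot` is the only path that brings in the other three words.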

Result

metric                  P89 word I-cache   P90 blocking line fill   P91 fill buffer
post-load cycles        222,317,206        245,417,593              221,327,811
shell window cycles     66,957,620         84,084,195               65,985,297
retired instructions    86,601,839         89,490,840               86,295,205
CPI                     2.5671             2.7424                   2.5648
memory handshakes       31,289,313         37,457,672               31,944,860
memory stall cycles     83,361,545         86,350,457               84,488,165
fetch handshakes        27,560,784         33,504,298               28,263,597
fetch stall cycles      54,266,192         55,950,261               55,533,555
I-cache hits            6,434,333          12,352,686               8,855,599

P91 is a recovery from P90 and a small win over P89:

comparison                  result
shell window vs P90         -21.52%
post-load cycles vs P90     -9.82%
shell window vs P89         -1.45%
post-load cycles vs P89     -0.45%
fetch stalls vs P89         +2.34%

The last row matters. P91 wins shell time, but it does not yet beat P89 on fetch-stall cycles.
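The comparison rows and the CPI row can be reproduced from the raw counters in the Result table. The sketch below assumes each percentage is a plain relative change, (new - old) / old, and that CPI is post-load cycles divided by retired instructions; the function names are illustrative.

```python
# Figures copied from the Result table; tuple order is P89, P90, P91.
RESULT = {
    "post_load_cycles": (222_317_206, 245_417_593, 221_327_811),
    "shell_window_cycles": (66_957_620, 84_084_195, 65_985_297),
    "retired_instructions": (86_601_839, 89_490_840, 86_295_205),
    "fetch_stall_cycles": (54_266_192, 55_950_261, 55_533_555),
}
P89, P90, P91 = 0, 1, 2


def pct_change(metric: str, new: int, old: int) -> float:
    """Relative change of `metric` from project `old` to project `new`, in percent."""
    a, b = RESULT[metric][new], RESULT[metric][old]
    return (a - b) / b * 100.0


def cpi(project: int) -> float:
    """Cycles per instruction over the post-load window."""
    return RESULT["post_load_cycles"][project] / RESULT["retired_instructions"][project]


if __name__ == "__main__":
    for metric, base, name in [
        ("shell_window_cycles", P90, "shell window vs P90"),
        ("post_load_cycles", P90, "post-load cycles vs P90"),
        ("shell_window_cycles", P89, "shell window vs P89"),
        ("post_load_cycles", P89, "post-load cycles vs P89"),
        ("fetch_stall_cycles", P89, "fetch stalls vs P89"),
    ]:
        print(f"{name}: {pct_change(metric, P91, base):+.2f}%")
    print(f"P91 CPI: {cpi(P91):.4f}")
```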

Fill Buffer

counter                     value
background active cycles    195,614,189
background issue cycles     4,698,933
background fills            846,917

The descriptor is active for most of the run (195.6M of 221.3M post-load cycles) but issues in only 4.7M of those cycles, because it only occasionally gets an idle bus slot. That is enough to erase P90’s blocking penalty, but it is not enough to make instruction delivery resemble a modern frontend.
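The duty cycle behind that observation can be worked out from the three counters. The sketch assumes the counters cover the same post-load window as the 221,327,811-cycle total; the page does not say whether "background fills" counts words or whole lines, so the last ratio is only issue cycles per counted fill.

```python
POST_LOAD_CYCLES = 221_327_811  # P91 post-load cycles (Result table)
BG_ACTIVE = 195_614_189         # background active cycles
BG_ISSUE = 4_698_933            # background issue cycles
BG_FILLS = 846_917              # background fills

# Fraction of the run with a descriptor waiting (assumes the same window).
active_share = BG_ACTIVE / POST_LOAD_CYCLES
# Fraction of active cycles in which the descriptor actually issued.
issue_share = BG_ISSUE / BG_ACTIVE
# Issue cycles spent per counted fill.
issue_per_fill = BG_ISSUE / BG_FILLS

if __name__ == "__main__":
    print(f"descriptor active: {active_share:.1%} of post-load cycles")
    print(f"issuing:           {issue_share:.2%} of active cycles")
    print(f"issue cycles per fill: {issue_per_fill:.2f}")
```

Under those assumptions the descriptor is waiting for roughly 88% of the run and issuing in about 2.4% of the cycles it is active.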

Memory Stalls

Memory stalls, P91 fill-buffer workload: 84,488,165 stall cycles, 31,944,860 handshakes

rank  cause                       stall cycles   share    requests
1     instruction fetch           55,533,555     65.7%    28,263,597
2     data load                   14,549,605     17.2%    966,517
3     data store                  11,923,432     14.1%    215,684
4     atomic memory op            157,302        0.2%     183,178
5     page walk for fetch         1,113,908      1.3%     1,107,754
6     page walk for load/store    1,208,647      1.4%     1,208,130
7     other                       1,716          0%       0

Fetch is still the largest memory-stall bucket. The fill buffer changes where the pain lands; it does not remove the instruction-delivery problem.
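The bucket shares, and a rough stall cost per request, follow directly from the figures above. Stall cycles per request is a derived ratio that the page itself does not report, so treat it as an illustrative reading of the counters.

```python
# (stall cycles, requests) per cause, copied from the memory-stall breakdown.
STALLS = {
    "instruction fetch": (55_533_555, 28_263_597),
    "data load": (14_549_605, 966_517),
    "data store": (11_923_432, 215_684),
    "atomic memory op": (157_302, 183_178),
    "page walk for fetch": (1_113_908, 1_107_754),
    "page walk for load/store": (1_208_647, 1_208_130),
    "other": (1_716, 0),
}

TOTAL_STALLS = sum(cycles for cycles, _ in STALLS.values())
TOTAL_REQUESTS = sum(reqs for _, reqs in STALLS.values())


def share(cause: str) -> float:
    """Share of all memory-stall cycles attributed to `cause`, in percent."""
    return STALLS[cause][0] / TOTAL_STALLS * 100.0


def cycles_per_request(cause: str):
    """Average stall cycles per request, or None when there were no requests."""
    cycles, reqs = STALLS[cause]
    return cycles / reqs if reqs else None


if __name__ == "__main__":
    for cause in STALLS:
        per_req = cycles_per_request(cause)
        per_req_text = "n/a" if per_req is None else f"{per_req:.2f}"
        print(f"{cause:26s} {share(cause):5.1f}%  {per_req_text} cycles/req")
```

Read this way, fetch dominates by volume rather than by cost per request: it averages about two stall cycles per request, but issues far more requests than any other bucket.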

Shell Phases

Shell phases, P91 shell workload: 221,327,811 cycles, CPI 2.56

rank  phase                              cycles         share
1     kernel banner to /init             117,613,988    53.3%
2     /init to shell banner              1,085,500      0.5%
3     shell banner to first command      36,014,961     16.3%
4     echo command                       1,598          0%
5     uname -a                           2,599,477      1.2%
6     ls /bin /usr/share                 32,253,703     14.6%
7     cat sample file                    3,183,166      1.4%
8     touch/write/cat/rm /tmp file       10,684,061     4.8%
9     8x ash loop with file I/O          16,308,284     7.4%
10    final marker                       955,008        0.4%

The shell window is the practical win. This run gets through the same BusyBox commands faster than both P89 and P90.
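The phase list and the shell-window figure in the Result table can be cross-checked. The page does not define where the shell window starts; the sketch below assumes it begins at the first command (phase 4, `echo`), which is the reading under which the numbers agree.

```python
# Phase cycle counts copied from the P91 shell-phase breakdown.
PHASES = [
    ("kernel banner to /init", 117_613_988),
    ("/init to shell banner", 1_085_500),
    ("shell banner to first command", 36_014_961),
    ("echo command", 1_598),
    ("uname -a", 2_599_477),
    ("ls /bin /usr/share", 32_253_703),
    ("cat sample file", 3_183_166),
    ("touch/write/cat/rm /tmp file", 10_684_061),
    ("8x ash loop with file I/O", 16_308_284),
    ("final marker", 955_008),
]

# Assumption: the shell window runs from the first command to the final marker.
FIRST_COMMAND = 3  # zero-based index of "echo command"

shell_window = sum(cycles for _, cycles in PHASES[FIRST_COMMAND:])
all_phases = sum(cycles for _, cycles in PHASES)

if __name__ == "__main__":
    print(f"shell window: {shell_window:,} cycles")
    for name, cycles in PHASES:
        print(f"{name:32s} {cycles / all_phases:6.1%}")
```

Summing phases 4 through 10 gives exactly the 65,985,297 shell-window cycles reported for P91. The listed shares appear to be taken against the sum of the ten phases (220,699,746 cycles) rather than against the 221,327,811-cycle total.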

Cycle Shape

State breakdown, P91 fill-buffer workload: 221,327,811 cycles, CPI 2.56

rank  state        share    cycles
1     fetch        3.8%     8,307,587
2     execute      39%      86,320,041
3     mem          12.6%    27,995,718
4     walker       2.1%     4,638,439
5     writeback    39%      86,295,205
6     mul/div      3.5%     7,769,105

The important absence is S_IC_FILL. P91 keeps the 4-word line cache without spending a visible state bucket waiting for future words.
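A quick consistency check on the state buckets, using only numbers already on this page. That the writeback count equals the retired-instruction count suggests one writeback cycle per retired instruction, but that interpretation is an assumption, not something the page states.

```python
# Cycle counts per state, copied from the P91 state breakdown.
STATES = {
    "fetch": 8_307_587,
    "execute": 86_320_041,
    "mem": 27_995_718,
    "walker": 4_638_439,
    "writeback": 86_295_205,
    "mul/div": 7_769_105,
}
TOTAL_CYCLES = 221_327_811  # P91 post-load cycles
RETIRED = 86_295_205        # P91 retired instructions

# Cycles not covered by the six listed states.
residual = TOTAL_CYCLES - sum(STATES.values())
# Execute cycles in excess of retired instructions.
extra_execute = STATES["execute"] - RETIRED

if __name__ == "__main__":
    for name, cycles in STATES.items():
        print(f"{name:10s} {cycles / TOTAL_CYCLES:6.1%}")
    print(f"unlisted cycles: {residual:,}")
    print(f"execute cycles beyond retired instructions: {extra_execute:,}")
```

The six buckets leave 1,716 cycles unaccounted for, the same value as the "other" row in the memory-stall breakdown.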

Hot Functions

Hot functions, P91 BusyBox shell symbols: 64,439 samples, one every 1,024 cycles

rank  symbol                      binary     share    samples
1     printf_core                 busybox    5.4%     3,504
2     memset                      kernel     5.2%     3,318
3     memcpy                      busybox    3.6%     2,334
4     vruntime_eligible           kernel     3.4%     2,206
5     blake2s_compress_generic    kernel     2.8%     1,798
6     memcpy                      kernel     2.6%     1,704
7     __fwritex                   busybox    2.6%     1,670
8     handle_exception            kernel     1.8%     1,156
9     unmap_page_range            kernel     1.7%     1,103
10    n_tty_write                 kernel     1.4%     884
11    avg_vruntime                kernel     1.3%     829
12    memset                      busybox    1.3%     822
13    ret_from_exception          kernel     1.2%     777
14    next_uptodate_folio         kernel     1%       655
15    n_tty_read                  kernel     1%       640
16    (remaining)                            55.5%    35,730

The hot-symbol view remains a mix of BusyBox, kernel memory routines, scheduler/exception paths, and shell machinery. Frontend work helps, but Linux shell performance is not a single-knob problem.
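Sample counts can be turned back into approximate cycle counts through the sampling period. The sketch assumes every sample stands for one full 1,024-cycle period; `est_cycles` is an illustrative helper, not part of the profiler.

```python
import math

PERIOD = 1_024             # cycles between samples
SAMPLES = 64_439           # total samples reported
SHELL_WINDOW = 65_985_297  # P91 shell window cycles (Result table)

# Samples per listed row, in the order shown; the last entry is "(remaining)".
LISTED = [3_504, 3_318, 2_334, 2_206, 1_798, 1_704, 1_670, 1_156,
          1_103, 884, 829, 822, 777, 655, 640, 35_730]


def est_cycles(samples: int) -> int:
    """Approximate cycles represented by `samples` (one period per sample)."""
    return samples * PERIOD


def share(samples: int) -> float:
    """Share of all samples, in percent."""
    return samples / SAMPLES * 100.0


# Samples that do not appear in any listed row.
unlisted = SAMPLES - sum(LISTED)

if __name__ == "__main__":
    print(f"profiled span: about {est_cycles(SAMPLES):,} cycles")
    print(f"printf_core: {share(3_504):.1f}%, about {est_cycles(3_504):,} cycles")
    print(f"samples not in any listed row: {unlisted:,}")
```

The sample total matches the shell window sampled once per period (65,985,297 / 1,024, rounded up, is 64,439), which suggests the profile covers the shell window rather than the whole post-load run.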

Honest Status

check                                   status
Critical-word-first miss handling       PASS
Background I-cache fill descriptor      PASS
BusyBox shell workload runs             PASS
Shell-window speedup vs P90             PASS
Shell-window speedup vs P89             PASS
Fetch-stall speedup vs P89              FAIL
LibreLane hardening                     NOT RUN

Next

P92 should add a fetch queue. P91 makes line fill less foolish, but execute still sees frontend memory timing directly. A queue is the next decoupling point.