Project No. 90 of 147 on the ladder

I-cache line fill

introduces: 4-word I-cache lines; blocking refill state; negative frontend result

harden state: last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P90 does the obvious follow-up to P89: turn the word I-cache into a 4-word line cache. The implementation is deliberately simple and blocking. On a miss, the core fetches the requested word, enters S_IC_FILL, fills the rest of the line, then returns to S_FETCH.
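The miss path described above can be sketched as a toy software model. This is not the RTL: the class name, the direct-mapped layout, the `num_lines` size, and the one-step-per-word timing are all assumptions made for illustration. Only the 4-word line and the S_FETCH / S_IC_FILL state names come from the text.

```python
from enum import Enum

WORDS_PER_LINE = 4  # from the text: 4-word I-cache lines


class State(Enum):
    S_FETCH = "S_FETCH"
    S_IC_FILL = "S_IC_FILL"


class BlockingLineCache:
    """Toy model: one bus handshake per word, one step per word (assumed)."""

    def __init__(self, num_lines=16):
        self.num_lines = num_lines        # hypothetical cache size
        self.tags = [None] * num_lines    # line address held by each slot
        self.fill_cycles = 0              # steps spent in S_IC_FILL
        self.handshakes = 0               # words requested from memory
        self.hits = 0

    def fetch(self, word_addr):
        """Fetch one instruction word; return the list of states visited."""
        line_addr = word_addr // WORDS_PER_LINE
        slot = line_addr % self.num_lines
        trace = [State.S_FETCH]
        if self.tags[slot] == line_addr:
            self.hits += 1
            return trace
        # Miss: the requested word is fetched first...
        self.handshakes += 1
        # ...then the core blocks while the rest of the line is filled.
        for _ in range(WORDS_PER_LINE - 1):
            self.handshakes += 1
            self.fill_cycles += 1
            trace.append(State.S_IC_FILL)
        self.tags[slot] = line_addr
        trace.append(State.S_FETCH)       # back to normal fetching
        return trace


# Straight-line code reuses each filled line; jumpy code does not.
straight = BlockingLineCache()
for addr in range(8):
    straight.fetch(addr)

jumpy = BlockingLineCache()
for addr in (0, 40, 80, 120):
    jumpy.fetch(addr)

print(straight.hits, straight.handshakes, straight.fill_cycles)
print(jumpy.hits, jumpy.handshakes, jumpy.fill_cycles)
```

In this model the straight-line run pays for two fills and gets six hits back, while the jumpy run pays for four fills and gets none. That asymmetry is the whole risk of a blocking fill.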

That simplicity is useful because it makes the result unambiguous: this version is a regression on the shell workload.

Result

metric                  P89 word I-cache   P90 blocking line fill   delta
post-load cycles        222,317,206        245,417,593              +10.39%
shell window cycles     66,957,620         84,084,195               +25.58%
retired instructions    86,601,839         89,490,840               +3.34%
CPI                     2.5671             2.7424                   +6.83%
memory handshakes       31,289,313         37,457,672               +19.71%
memory stall cycles     83,361,545         86,350,457               +3.59%
fetch handshakes        27,560,784         33,504,298               +21.57%
fetch stall cycles      54,266,192         55,950,261               +3.10%
I-cache hits            6,434,333          12,352,686               +91.98%

I-cache hits nearly double, but the cache manufactures them by stalling the core: fetch handshakes rise 21.57%, and S_IC_FILL alone accounts for 11,224,132 cycles, about 4.6% of P90's post-load cycles.
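As a quick check, the delta column and the CPI row can be recomputed from the raw counts in the result table. The helper name `pct_delta` and the dictionary layout are just illustrative choices; every number is copied from the table.

```python
# (P89, P90) raw counts copied from the result table.
rows = {
    "post-load cycles": (222_317_206, 245_417_593),
    "shell window cycles": (66_957_620, 84_084_195),
    "retired instructions": (86_601_839, 89_490_840),
    "fetch handshakes": (27_560_784, 33_504_298),
    "I-cache hits": (6_434_333, 12_352_686),
}
S_IC_FILL_CYCLES = 11_224_132  # from the text


def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# CPI is post-load cycles divided by retired instructions.
cpi_p89 = rows["post-load cycles"][0] / rows["retired instructions"][0]
cpi_p90 = rows["post-load cycles"][1] / rows["retired instructions"][1]
fill_share = S_IC_FILL_CYCLES / rows["post-load cycles"][1] * 100

for name, (old, new) in rows.items():
    print(f"{name:22s} {pct_delta(old, new):+.2f}%")
print(f"CPI {cpi_p89:.4f} -> {cpi_p90:.4f} ({pct_delta(cpi_p89, cpi_p90):+.2f}%)")
print(f"S_IC_FILL share of post-load cycles: {fill_share:.2f}%")
```

The point of the exercise is that instruction count barely moves while cycles do, so the regression shows up almost entirely as CPI.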

Memory Stalls

Memory stalls, P90 blocking line-fill workload: 86,350,457 stall cycles, 37,457,672 handshakes
  1. instruction fetch: 55,950,261 stall cycles (64.8%), 33,504,298 req
  2. data load: 15,291,828 stall cycles (17.7%), 1,034,172 req
  3. data store: 12,441,554 stall cycles (14.4%), 234,015 req
  4. atomic memory op: 166,469 stall cycles (0.2%), 193,202 req
  5. page walk for fetch: 1,200,622 stall cycles (1.4%), 1,194,468 req
  6. page walk for load/store: 1,298,019 stall cycles (1.5%), 1,297,517 req
  7. other: 1,704 stall cycles (0%), 0 req

Compared with P88, fetch stalls are still lower. Compared with P89, they are worse: 55,950,261 cycles against 54,266,192, or +3.10%. That tells us P89’s tiny cache was directionally right, but P90’s blocking fill policy is too blunt.
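One way to read the stall breakdown is to divide stall cycles by requests for each source. The sketch below does that with the numbers from the list above; the dictionary layout is an illustrative choice, and the per-request figure is a derived average, not something the profiler reported.

```python
# (stall cycles, requests) per source, copied from the P90 breakdown.
stalls = {
    "instruction fetch": (55_950_261, 33_504_298),
    "data load": (15_291_828, 1_034_172),
    "data store": (12_441_554, 234_015),
    "atomic memory op": (166_469, 193_202),
    "page walk for fetch": (1_200_622, 1_194_468),
    "page walk for load/store": (1_298_019, 1_297_517),
    "other": (1_704, 0),
}

total_stall = sum(cycles for cycles, _ in stalls.values())
total_req = sum(reqs for _, reqs in stalls.values())

# Average stall cycles per request; sources with no requests are skipped.
per_request = {
    name: cycles / reqs for name, (cycles, reqs) in stalls.items() if reqs
}

print(total_stall, total_req)
for name, avg in per_request.items():
    print(f"{name:26s} {avg:6.2f} stall cycles per request")
```

The rows add up to the stated totals. Per request, an instruction fetch stalls for under two cycles on average while a data store stalls for more than fifty, which suggests fetch dominates through sheer request volume rather than per-request cost. That is consistent with extra fill traffic being the problem.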

Shell Phases

Shell phases, P90 shell workload: 245,417,593 cycles, CPI 2.74
  1. kernel banner to /init: 119,143,502 cycles (48.7%)
  2. /init to shell banner: 1,156,340 cycles (0.5%)
  3. shell banner to first command: 40,405,326 cycles (16.5%)
  4. echo command: 2,538 cycles (0%)
  5. uname -a: 2,207,729 cycles (0.9%)
  6. ls /bin /usr/share: 39,732,205 cycles (16.2%)
  7. cat sample file: 5,276,294 cycles (2.2%)
  8. touch/write/cat/rm /tmp file: 10,985,502 cycles (4.5%)
  9. 8x ash loop with file I/O: 24,353,026 cycles (10%)
  10. final marker: 1,526,901 cycles (0.6%)

The shell window gets hit hard, at +25.58% cycles over P89: ls, cat, and the ash loop all pay for the blocking frontend. Linux shell work jumps around enough that waiting for every line to fill before executing is a bad trade.
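The phase list can be tied back to the "shell window cycles" row of the result table. The source does not define the window explicitly; the sketch below assumes it spans phases 4 through 10 (echo to final marker), an assumption that happens to reproduce the reported figure exactly.

```python
# Cycles per phase, copied from the shell-phase list (phases 4 to 10).
shell_window_phases = {
    "echo command": 2_538,
    "uname -a": 2_207_729,
    "ls /bin /usr/share": 39_732_205,
    "cat sample file": 5_276_294,
    "touch/write/cat/rm /tmp file": 10_985_502,
    "8x ash loop with file I/O": 24_353_026,
    "final marker": 1_526_901,
}
REPORTED_SHELL_WINDOW = 84_084_195  # from the result table

window_total = sum(shell_window_phases.values())

# Share of the shell window taken by each phase, in percent.
window_share = {
    name: cycles / window_total * 100
    for name, cycles in shell_window_phases.items()
}

print(window_total == REPORTED_SHELL_WINDOW)
for name, share in sorted(window_share.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {share:5.1f}%")
```

Under that assumption, ls alone is close to half of the shell window and the ash loop is more than a quarter, so those two phases are where a frontend fix has to show up.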

Cycle Shape

State breakdown, P90 line-fill workload: 245,417,593 cycles, CPI 2.74
  1. fetch: 11,323,240 cycles (4.6%)
  2. execute: 89,518,022 cycles (36.5%)
  3. mem: 29,361,240 cycles (12%)
  4. walker: 4,990,626 cycles (2%)
  5. writeback: 89,490,840 cycles (36.5%)
  6. mul/div: 9,507,789 cycles (3.9%)

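The new state is tracked separately for a reason. At 11,224,132 cycles, S_IC_FILL is about as large as the entire fetch state (11,323,240 cycles). If S_IC_FILL is visually large, that is the bug, not a side detail.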
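The six states in the breakdown do not add up to the workload total, and the gap is informative. The sketch below computes the residual and compares it with the S_IC_FILL figure quoted earlier; reading the residual as "S_IC_FILL plus a small remainder" is an inference from the arithmetic, not something the source states.

```python
TOTAL_CYCLES = 245_417_593      # from the breakdown caption
S_IC_FILL_CYCLES = 11_224_132   # from the Result section

# Cycles per listed state, copied from the breakdown.
states = {
    "fetch": 11_323_240,
    "execute": 89_518_022,
    "mem": 29_361_240,
    "walker": 4_990_626,
    "writeback": 89_490_840,
    "mul/div": 9_507_789,
}

listed = sum(states.values())
residual = TOTAL_CYCLES - listed          # cycles outside the listed states
unexplained = residual - S_IC_FILL_CYCLES  # what S_IC_FILL does not cover

print(f"listed states: {listed:,}")
print(f"residual:      {residual:,} ({residual / TOTAL_CYCLES:.1%})")
print(f"unexplained:   {unexplained:,}")
```

The residual is within 1,704 cycles of the S_IC_FILL count. That remainder equals the "other" row of the memory-stall breakdown, though whether the two are the same cycles is not confirmed by the source.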

Hot Functions

Hot functions, P90 BusyBox shell symbols: 82,113 samples, one every 1,024 cycles
  1. printf_core (busybox): 5.6%, 4,566 samples
  2. memset (kernel): 4.4%, 3,581 samples
  3. vruntime_eligible (kernel): 4.2%, 3,427 samples
  4. memcpy (busybox): 3.7%, 3,070 samples
  5. __fwritex (busybox): 2.4%, 1,969 samples
  6. blake2s_compress_generic (kernel): 2.2%, 1,804 samples
  7. memcpy (kernel): 2.1%, 1,739 samples
  8. handle_exception (kernel): 1.8%, 1,433 samples
  9. avg_vruntime (kernel): 1.6%, 1,341 samples
  10. memset (busybox): 1.6%, 1,339 samples
  11. unmap_page_range (kernel): 1.4%, 1,120 samples
  12. update_curr (kernel): 1.1%, 929 samples
  13. n_tty_write (kernel): 1.1%, 859 samples
  14. ret_from_exception (kernel): 1%, 851 samples
  15. n_tty_read (kernel): 1%, 813 samples
  16. (remaining): 56.3%, 46,251 samples

The hot-symbol picture does not magically change. Frontend stalls are important, but BusyBox formatting, kernel memory routines, scheduler work, and exception paths still matter.
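Two small sanity checks on the sample data are sketched below: whether the sampling covers the shell window, and how the top 15 symbols split between kernel and BusyBox. All counts are copied from the list above and from the result table; the grouping into two images is just the label shown next to each symbol.

```python
SAMPLES = 82_113
PERIOD_CYCLES = 1_024
SHELL_WINDOW_CYCLES = 84_084_195  # from the result table

# (symbol, image, samples) for the top 15 entries.
top15 = [
    ("printf_core", "busybox", 4_566), ("memset", "kernel", 3_581),
    ("vruntime_eligible", "kernel", 3_427), ("memcpy", "busybox", 3_070),
    ("__fwritex", "busybox", 1_969),
    ("blake2s_compress_generic", "kernel", 1_804),
    ("memcpy", "kernel", 1_739), ("handle_exception", "kernel", 1_433),
    ("avg_vruntime", "kernel", 1_341), ("memset", "busybox", 1_339),
    ("unmap_page_range", "kernel", 1_120), ("update_curr", "kernel", 929),
    ("n_tty_write", "kernel", 859), ("ret_from_exception", "kernel", 851),
    ("n_tty_read", "kernel", 813),
]

# Cycles covered by the samples versus the reported shell window.
covered = SAMPLES * PERIOD_CYCLES
gap = SHELL_WINDOW_CYCLES - covered

# Top-15 samples grouped by image.
by_image = {}
for _symbol, image, count in top15:
    by_image[image] = by_image.get(image, 0) + count

print(f"covered {covered:,} cycles, {gap} short of the shell window")
print(by_image)
```

The sampled span falls less than one sampling period short of the shell window, so the profile appears to cover that window rather than the whole boot. Within the top 15, kernel symbols outweigh BusyBox ones, which fits the point that frontend stalls are only part of the cost.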

Toward XiangShan

This is a small but important architectural lesson on the way toward a more XiangShan-like frontend: line fill has to be decoupled. Modern application cores do not stop the machine just to make future fetches pretty. They return the critical word, keep fill buffers in flight, feed a queue, and let prediction keep the frontend pointed at useful code.

step                     result
P89 tiny word cache      fetch stalls down, whole workload flat
P90 blocking line fill   more hits, whole workload much worse
P91 target               nonblocking or critical-word-first line fill

Honest Status

check                             status
4-word I-cache line storage       PASS
Blocking line-fill FSM            PASS
BusyBox shell workload runs       PASS
Whole workload speedup vs P89     FAIL
LibreLane hardening               NOT RUN

Next

P91 should keep the line cache but remove the blocking behavior. The critical word should execute immediately; the rest of the line should be filled only when the fetch side can spare the bus.
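To make the P91 idea concrete, here is a toy comparison of the two policies on address traces. It is a sketch under heavy assumptions, not a P91 design: one cycle per bus word, no eviction, a fixed number of idle bus slots between fetches (`idle_slots`), and an unfinished fill that is simply dropped when a new miss arrives. The function names are hypothetical and the counts are not predictions for the real core.

```python
WORDS_PER_LINE = 4  # from the text


def blocking_stalls(trace):
    """Every miss stalls for the whole line: 1 critical word + 3 fill words."""
    valid = set()  # word addresses present in the cache (no eviction)
    stalls = 0
    for addr in trace:
        if addr in valid:
            continue
        base = (addr // WORDS_PER_LINE) * WORDS_PER_LINE
        valid.update(range(base, base + WORDS_PER_LINE))
        stalls += WORDS_PER_LINE
    return stalls


def critical_word_first_stalls(trace, idle_slots=2):
    """A miss stalls for one word; the rest fills in otherwise idle bus slots."""
    valid = set()
    pending = []  # words of the current line still waiting to be filled
    stalls = 0
    for addr in trace:
        if addr not in valid:
            # The demand fetch wins the bus; an unfinished fill is dropped.
            valid.add(addr)
            stalls += 1
            base = (addr // WORDS_PER_LINE) * WORDS_PER_LINE
            offset = addr - base
            # Fill order wraps around, starting after the critical word.
            order = [base + (offset + i) % WORDS_PER_LINE
                     for i in range(1, WORDS_PER_LINE)]
            pending = [w for w in order if w not in valid]
        # While the instruction executes, the fill uses the idle slots.
        for _ in range(idle_slots):
            if pending:
                valid.add(pending.pop(0))
    return stalls


sequential = list(range(16))
jumpy = [0, 40, 80, 120, 160, 200]

print(blocking_stalls(sequential), critical_word_first_stalls(sequential))
print(blocking_stalls(jumpy), critical_word_first_stalls(jumpy))
```

In this model the jumpy trace is where the policies diverge most: the blocking version pays for three words per miss that are never used. With `idle_slots=0` the decoupled version degrades to one miss per word on sequential code, which is a reminder that the benefit depends on the bus actually having spare slots.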