No. 89 / project of 147 on the ladder

Tiny I-cache fetch

introduces — direct-mapped instruction cache; fetch-stall reduction; first frontend step toward application-core shape

harden state: last run 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P88 said the next architectural feature should hit fetch. P89 adds the first fetch-side storage point: a 256-word direct-mapped instruction cache with physical tags.

This is not yet a modern frontend. It is the smallest measured step from “single-word external fetch every time” toward a real application-core frontend.

Result

metric                  P88 no I-cache   P89 tiny I-cache   delta
post-load cycles           221,748,021        222,317,206   +0.26%
shell window cycles         66,279,807         66,957,620   +1.02%
retired instructions        86,435,211         86,601,839   +0.19%
CPI                             2.5655             2.5671   +0.06%
memory handshakes           32,917,717         31,289,313   -4.95%
memory stall cycles         87,892,031         83,361,545   -5.15%
fetch handshakes            29,224,093         27,560,784   -5.69%
fetch stall cycles          58,870,166         54,266,192   -7.82%

The cache works in the narrow sense: fetch refills and fetch stalls go down. It does not improve the full shell workload yet. That is the useful part of this result, not an embarrassment to hide.
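The deltas in the table can be rechecked from the raw counters. The sketch below is only a sanity check, not part of the project tooling. It assumes CPI is post-load cycles divided by retired instructions, which the source does not state but which reproduces the table values; the dictionary keys are names chosen here for illustration.

```python
# Raw counters copied from the result table (P88 = no I-cache, P89 = tiny I-cache).
# The key names are illustrative, not identifiers from the project.
P88 = {
    "post_load_cycles": 221_748_021,
    "shell_window_cycles": 66_279_807,
    "retired_instructions": 86_435_211,
    "fetch_stall_cycles": 58_870_166,
}
P89 = {
    "post_load_cycles": 222_317_206,
    "shell_window_cycles": 66_957_620,
    "retired_instructions": 86_601_839,
    "fetch_stall_cycles": 54_266_192,
}


def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(run):
    """Assumed definition: post-load cycles per retired instruction."""
    return run["post_load_cycles"] / run["retired_instructions"]


for key in P88:
    print(f"{key:22s} {pct_delta(P88[key], P89[key]):+.2f}%")
print(f"CPI {cpi(P88):.4f} -> {cpi(P89):.4f}")
```

Fetch stalls drop by several percent while total cycles rise slightly, which is the shape the paragraph above describes.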

I-Cache Counters

counter                         value
total hits                  6,434,333
hits from S_FETCH           3,528,678
hits from S_WB prefetch     2,905,655
miss refills               27,560,784

The hit count is real but the miss count is still huge. A one-word line means sequential code still creates refill pressure, and whole-cache invalidation on every store is intentionally brutal.
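A toy model makes both effects visible. This is a minimal sketch, not the P89 RTL: it treats fetch addresses as already-physical word addresses, counts refills rather than cycles, and the names `TinyICache`, `run_loop`, `line_words`, and `store_every` are hypothetical. `line_words=1` mimics the one-word line described above; a larger value stands in for a line-fill design.

```python
class TinyICache:
    """Toy direct-mapped I-cache model (illustrative only)."""

    def __init__(self, num_words=256, line_words=1):
        self.line_words = line_words
        self.num_lines = num_words // line_words
        self.tags = [None] * self.num_lines  # None marks an invalid line
        self.hits = 0
        self.refills = 0

    def fetch(self, word_addr):
        line_addr = word_addr // self.line_words
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            # Miss: refill the whole line and remember its tag.
            self.tags[index] = tag
            self.refills += 1

    def store(self):
        # Whole-cache invalidation on every store, as described above.
        self.tags = [None] * self.num_lines


def run_loop(cache, body_words=64, iterations=4, store_every=None):
    """Fetch a straight-line loop body repeatedly; optionally issue a
    store after every `store_every` fetched words."""
    fetched = 0
    for _ in range(iterations):
        for word_addr in range(body_words):
            cache.fetch(word_addr)
            fetched += 1
            if store_every and fetched % store_every == 0:
                cache.store()
    return cache.hits, cache.refills


print(run_loop(TinyICache(line_words=1)))                  # (192, 64)
print(run_loop(TinyICache(line_words=1), store_every=16))  # (0, 256)
print(run_loop(TinyICache(line_words=4)))                  # (240, 16)
```

In this model a one-word line misses on the first touch of every word, and a store every 16 fetches removes all reuse in a 64-word loop. A four-word line cuts refill requests by four on sequential code, though each refill would move more data, which the model does not account for.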

Memory Stalls

Memory stalls, P89 tiny I-cache workload: 83,361,545 stall cycles, 31,289,313 handshakes

 #  bucket                      stall cycles   share     requests
 1  instruction fetch             54,266,192   65.1%   27,560,784
 2  data load                     14,615,626   17.5%      979,239
 3  data store                    11,967,986   14.4%      220,079
 4  atomic memory op                 158,190    0.2%      184,053
 5  page walk for fetch            1,125,783    1.4%    1,119,629
 6  page walk for load/store       1,226,052    1.5%    1,225,529
 7  other                              1,716      0%            0

Fetch remains the largest memory-stall bucket, but it moved from 58,870,166 cycles in P88 to 54,266,192 cycles here. The next frontend step should be line fill, not another tiny single-word cache tweak.
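Dividing each bucket's stall cycles by its request count gives a rough average stall per request. The sketch below is plain arithmetic on the breakdown above. The ratio is only an average and says nothing about where inside a request the cycles are spent, and the variable names are chosen here for illustration.

```python
# (stall cycles, requests) per bucket, copied from the memory-stall breakdown.
BUCKETS = {
    "instruction fetch": (54_266_192, 27_560_784),
    "data load": (14_615_626, 979_239),
    "data store": (11_967_986, 220_079),
    "atomic memory op": (158_190, 184_053),
    "page walk for fetch": (1_125_783, 1_119_629),
    "page walk for load/store": (1_226_052, 1_225_529),
}
TOTAL_STALL_CYCLES = 83_361_545


def stalls_per_request(stall_cycles, requests):
    """Average stall cycles attributed to one request in a bucket."""
    return stall_cycles / requests


for name, (stalls, requests) in BUCKETS.items():
    share = 100.0 * stalls / TOTAL_STALL_CYCLES
    avg = stalls_per_request(stalls, requests)
    print(f"{name:26s} {share:5.1f}%  {avg:6.2f} stall cycles/request")
```

Fetch dominates because of its request count, not its per-request cost: it averages about 2 stall cycles per request. That is consistent with the argument that the next step should cut the number of refill requests.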

Shell Phases

Shell phases, P89 shell workload: 222,317,206 cycles, CPI 2.57

 #  phase                                   cycles   share
 1  kernel banner to /init             117,614,394   53.1%
 2  /init to shell banner                1,087,274    0.5%
 3  shell banner to first command       36,029,853   16.3%
 4  echo command                             1,600      0%
 5  uname -a                             1,990,527    0.9%
 6  ls /bin /usr/share                  33,547,374   15.1%
 7  cat sample file                      2,757,580    1.2%
 8  touch/write/cat/rm /tmp file        11,707,774    5.3%
 9  8x ash loop with file I/O           16,952,102    7.7%
10  final marker                               663      0%

The shell-window number regressed by about 1%. That can happen while fetch stalls improve because the workload has several other large costs: filesystem work, scheduler paths, exceptions/syscalls, BusyBox formatting, and cache invalidations caused by stores.
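The phase list can be tied back to the shell-window figure with a little arithmetic. The sketch below sums the listed phases. Reading the shell window as phases 4 through 10 is an inference from the numbers, not something the source states, and `PHASES` is a name chosen here.

```python
# (phase, cycles) copied from the shell-phase list, in order.
PHASES = [
    ("kernel banner to /init", 117_614_394),
    ("/init to shell banner", 1_087_274),
    ("shell banner to first command", 36_029_853),
    ("echo command", 1_600),
    ("uname -a", 1_990_527),
    ("ls /bin /usr/share", 33_547_374),
    ("cat sample file", 2_757_580),
    ("touch/write/cat/rm /tmp file", 11_707_774),
    ("8x ash loop with file I/O", 16_952_102),
    ("final marker", 663),
]

phase_total = sum(cycles for _, cycles in PHASES)
# Assumption: the "shell window" covers the command phases, i.e. items 4..10.
command_window = sum(cycles for _, cycles in PHASES[3:])

print(phase_total)     # 221689141
print(command_window)  # 66957620

for name, cycles in PHASES:
    print(f"{name:32s} {100.0 * cycles / phase_total:5.1f}%")
```

The command phases sum to exactly the reported P89 shell-window count of 66,957,620 cycles, and the listed percentages match shares of the phase total rather than of the 222,317,206-cycle headline. Within that window, `ls /bin /usr/share` accounts for about half of the cycles.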

Cycle Shape

State breakdown, P89 tiny I-cache workload: 222,317,206 cycles, CPI 2.57

 #  state       share       cycles
 1  fetch        3.7%    8,335,661
 2  execute       39%   86,626,877
 3  mem         12.7%   28,125,173
 4  walker       2.1%    4,696,993
 5  writeback     39%   86,601,839
 6  mul/div      3.6%    7,928,947

The interesting state-level change is that memory wait is reduced, not that the core suddenly has a different pipeline. P89 is still the same single-issue in-order FSM core.

Hot Functions

Hot functions, P89 BusyBox shell symbols: 65,388 samples, one sample every 1,024 cycles

 #  symbol                     image     share   samples
 1  printf_core                busybox    5.5%     3,598
 2  memset                     kernel     5.1%     3,341
 3  memcpy                     busybox    3.5%     2,304
 4  vruntime_eligible          kernel     3.4%     2,246
 5  blake2s_compress_generic   kernel     2.8%     1,811
 6  __fwritex                  busybox    2.6%     1,697
 7  memcpy                     kernel     2.6%     1,690
 8  handle_exception           kernel     1.9%     1,208
 9  unmap_page_range           kernel     1.7%     1,095
10  avg_vruntime               kernel     1.4%       913
11  n_tty_write                kernel     1.3%       860
12  memset                     busybox    1.3%       830
13  ret_from_exception         kernel     1.2%       776
14  next_uptodate_folio        kernel       1%       675
15  do_trap_ecall_u            kernel       1%       639
16  (remaining)                          55.5%    36,297

The hot-symbol picture is mostly unchanged: BusyBox formatting and kernel memory/scheduler/exception work remain visible. That says frontend work is necessary but not sufficient.
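Sample counts convert to approximate cycles by multiplying by the sampling period. The sketch below does that for a few of the listed symbols. It assumes each sample stands for one full 1,024-cycle period, which is the usual reading of a fixed-period profile but is not spelled out in the source.

```python
SAMPLE_PERIOD_CYCLES = 1_024
TOTAL_SAMPLES = 65_388

# A few entries copied from the hot-function list: (symbol, image) -> samples.
TOP_SYMBOLS = {
    ("printf_core", "busybox"): 3_598,
    ("memset", "kernel"): 3_341,
    ("memcpy", "busybox"): 2_304,
    ("vruntime_eligible", "kernel"): 2_246,
}


def estimated_cycles(samples):
    """Approximate cycles covered by a number of profile samples."""
    return samples * SAMPLE_PERIOD_CYCLES


print(estimated_cycles(TOTAL_SAMPLES))  # 66957312
for (symbol, image), samples in TOP_SYMBOLS.items():
    share = 100.0 * samples / TOTAL_SAMPLES
    print(f"{symbol:20s} {image:8s} {share:4.1f}%  ~{estimated_cycles(samples):,} cycles")
```

The profile covers about 66,957,312 cycles, within one sampling period of the 66,957,620-cycle shell window, so the hot-function shares describe the shell window rather than the whole boot.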

Toward XiangShan

XiangShan is a modern high-performance RV64 application processor project. P89 is not trying to jump there in one move. The staged path is more like:

step         feature                                      why it moves us closer
P89          tiny physical I-cache                        first frontend storage point
next         cache line fill                              reduce sequential-code refill traffic
soon         better I-cache invalidation                  stop throwing away the whole frontend on ordinary stores
soon         fetch queue                                  decouple frontend memory timing from execute timing
later        BTB / branch predictor                       stop paying full redirect cost on hot control flow
later        D-cache / LSU cleanup                        separate instruction and data locality problems
much later   deeper pipeline, scoreboard, rename, ROB     the long road toward BOOM/XiangShan-class machinery

The important constraint is pacing: every one of those steps needs a benchmark like P88/P89 attached to it, or we are just adding hardware because it sounds adult.

Honest Status

check                                   status
256-word direct-mapped I-cache RTL      PASS
BusyBox shell workload runs             PASS
Fetch memory stalls improve             PASS
Whole shell workload improves           FAIL
LibreLane hardening                     NOT RUN

Next

P90 should turn the word cache into a small line-fill I-cache. If that does not move total shell time, the evidence will point away from raw fetch storage and toward branch/control flow, store invalidation, or D-cache/LSU work.