P88 said the next architectural feature should hit fetch. P89 adds the first fetch-side storage point: a 256-word direct-mapped instruction cache with physical tags.
This is not yet a modern frontend. It is the smallest measured step from “single-word external fetch every time” toward a real application-core frontend.
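To make the geometry concrete, here is a minimal sketch of how a physical address could split into index and tag for a 256-entry direct-mapped cache with one-word lines. The 4-byte word size and the name `split_paddr` are assumptions for illustration; the actual RTL may slice the address differently.

```python
# Assumed geometry: 256 one-word lines, 4-byte words.
# The word size is an assumption; the post only says "256-word".
WORD_BYTES = 4
NUM_LINES = 256


def split_paddr(paddr: int) -> tuple[int, int]:
    """Return (index, tag) for a physical byte address."""
    word_addr = paddr // WORD_BYTES   # drop the byte offset inside the word
    index = word_addr % NUM_LINES     # low bits select one of 256 entries
    tag = word_addr // NUM_LINES      # remaining high bits are stored as the tag
    return index, tag


if __name__ == "__main__":
    for addr in (0x8000_0000, 0x8000_0004, 0x8000_0400):
        index, tag = split_paddr(addr)
        print(f"paddr={addr:#x} index={index} tag={tag:#x}")
```

Under these assumptions, two addresses 1 KiB apart map to the same entry with different tags, so they evict each other. That is the usual direct-mapped conflict cost.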
Result
| metric | P88 no I-cache | P89 tiny I-cache | delta |
|---|---|---|---|
| post-load cycles | 221,748,021 | 222,317,206 | +0.26% |
| shell window cycles | 66,279,807 | 66,957,620 | +1.02% |
| retired instructions | 86,435,211 | 86,601,839 | +0.19% |
| CPI | 2.5655 | 2.5671 | +0.06% |
| memory handshakes | 32,917,717 | 31,289,313 | -4.95% |
| memory stall cycles | 87,892,031 | 83,361,545 | -5.15% |
| fetch handshakes | 29,224,093 | 27,560,784 | -5.69% |
| fetch stall cycles | 58,870,166 | 54,266,192 | -7.82% |
The cache works in the narrow sense: fetch handshakes drop 5.69% and fetch stall cycles drop 7.82%. It does not improve the full shell workload yet: shell-window cycles rise 1.02%. That is the useful part of this result, not an embarrassment to hide.
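The CPI and delta columns can be reproduced from the raw counters. This is a small sketch of that arithmetic; the helper names `cpi` and `pct_delta` are mine, and CPI is taken as post-load cycles divided by retired instructions, which matches the table values.

```python
def cpi(cycles: int, retired: int) -> float:
    """Cycles per retired instruction."""
    return cycles / retired


def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Raw counters copied from the result table.
P88 = {"post_load": 221_748_021, "shell": 66_279_807,
       "retired": 86_435_211, "fetch_stall": 58_870_166}
P89 = {"post_load": 222_317_206, "shell": 66_957_620,
       "retired": 86_601_839, "fetch_stall": 54_266_192}

if __name__ == "__main__":
    cpi88 = cpi(P88["post_load"], P88["retired"])
    cpi89 = cpi(P89["post_load"], P89["retired"])
    print(f"CPI P88={cpi88:.4f} P89={cpi89:.4f} delta={pct_delta(cpi88, cpi89):+.2f}%")
    for key in ("post_load", "shell", "fetch_stall"):
        print(f"{key}: {pct_delta(P88[key], P89[key]):+.2f}%")
```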
I-Cache Counters
| counter | value |
|---|---|
| total hits | 6,434,333 |
| hits from S_FETCH | 3,528,678 |
| hits from S_WB prefetch | 2,905,655 |
| miss refills | 27,560,784 |
The hit count is real but the miss count is still huge. A one-word line means sequential code still creates refill pressure, and whole-cache invalidation on every store is intentionally brutal.
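A toy model shows why both design choices hurt. It is a sketch under assumptions, not the RTL: 4-byte words, a hypothetical `TinyICache` class, and a hit rate computed as hits / (hits + miss refills), which assumes every lookup lands in one of those two counters.

```python
WORD_BYTES = 4    # assumed word size
NUM_LINES = 256   # 256 one-word entries


class TinyICache:
    """Direct-mapped, one word per line, flushed entirely on any store."""

    def __init__(self):
        self.tags = [None] * NUM_LINES  # None marks an invalid entry
        self.hits = 0
        self.misses = 0

    def fetch(self, paddr: int) -> bool:
        word = paddr // WORD_BYTES
        index, tag = word % NUM_LINES, word // NUM_LINES
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag          # refill exactly one word
        self.misses += 1
        return False

    def store(self) -> None:
        self.tags = [None] * NUM_LINES  # whole-cache invalidation


def run_loop(iterations: int, body_words: int, store_each_iteration: bool):
    """Fetch a straight-line loop body repeatedly; return (hits, misses)."""
    cache = TinyICache()
    for _ in range(iterations):
        for i in range(body_words):
            cache.fetch(0x8000_0000 + i * WORD_BYTES)
        if store_each_iteration:
            cache.store()
    return cache.hits, cache.misses


if __name__ == "__main__":
    print("no stores:     ", run_loop(3, 8, False))
    print("store per iter:", run_loop(3, 8, True))
    hits, refills = 6_434_333, 27_560_784   # counters from the table above
    print(f"measured hit rate: {hits / (hits + refills):.1%}")
```

In this model a loop with one store per iteration never hits at all, and even a store-free loop misses on every word of its first pass. Under the stated counting assumption, the measured counters work out to a hit rate just under 19%.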
Memory Stalls
| bucket | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 54,266,192 | 65.1% | 27,560,784 |
| data load | 14,615,626 | 17.5% | 979,239 |
| data store | 11,967,986 | 14.4% | 220,079 |
| atomic memory op | 158,190 | 0.2% | 184,053 |
| page walk for fetch | 1,125,783 | 1.4% | 1,119,629 |
| page walk for load/store | 1,226,052 | 1.5% | 1,225,529 |
| other | 1,716 | 0% | 0 |
Fetch remains the largest memory-stall bucket, but it moved from 58,870,166 cycles in P88 to 54,266,192 cycles here. The next frontend step should be line fill, not another tiny single-word cache tweak.
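The case for line fill can be sketched with a refill counter. The model below is illustrative only: it keeps total capacity at 256 words, varies the line size, and counts refill requests for straight-line code. The function `count_refills` and the line sizes tried are assumptions, not a P90 design.

```python
def count_refills(word_addrs, line_words: int, total_words: int = 256) -> int:
    """Count refill requests for a direct-mapped cache of fixed capacity."""
    num_lines = total_words // line_words
    tags = [None] * num_lines           # None marks an invalid line
    refills = 0
    for w in word_addrs:
        line_addr = w // line_words     # which line this word belongs to
        index = line_addr % num_lines
        tag = line_addr // num_lines
        if tags[index] != tag:
            tags[index] = tag           # one request fills the whole line
            refills += 1
    return refills


if __name__ == "__main__":
    straight_line = range(64)           # 64 sequential instruction words
    for line_words in (1, 4, 8):
        print(f"{line_words}-word lines: "
              f"{count_refills(straight_line, line_words)} refill requests")
```

Each refill with a longer line moves more words, so the number of words transferred for cold sequential code stays the same. What drops is the number of separate requests, which only helps if the memory interface serves a multi-word fill more cheaply than the same words fetched one request at a time.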
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,614,394 | 53.1% |
| /init to shell banner | 1,087,274 | 0.5% |
| shell banner to first command | 36,029,853 | 16.3% |
| echo command | 1,600 | 0% |
| uname -a | 1,990,527 | 0.9% |
| ls /bin /usr/share | 33,547,374 | 15.1% |
| cat sample file | 2,757,580 | 1.2% |
| touch/write/cat/rm /tmp file | 11,707,774 | 5.3% |
| 8x ash loop with file I/O | 16,952,102 | 7.7% |
| final marker | 663 | 0% |
The shell-window number regressed by about 1%. That can happen while fetch stalls improve because the workload has several other large costs: filesystem work, scheduler paths, exceptions/syscalls, BusyBox formatting, and cache invalidations caused by stores.
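The phase numbers can be cross-checked against the headline shell-window figure. In the sketch below, the phases from the echo command through the final marker sum exactly to the 66,957,620 shell-window cycles reported for P89. Reading the shell window as exactly those phases is my inference from that match, not something stated above.

```python
# (phase, cycles) pairs copied from the Shell Phases list.
PHASES = [
    ("kernel banner to /init", 117_614_394),
    ("/init to shell banner", 1_087_274),
    ("shell banner to first command", 36_029_853),
    ("echo command", 1_600),
    ("uname -a", 1_990_527),
    ("ls /bin /usr/share", 33_547_374),
    ("cat sample file", 2_757_580),
    ("touch/write/cat/rm /tmp file", 11_707_774),
    ("8x ash loop with file I/O", 16_952_102),
    ("final marker", 663),
]

# Assumption: the shell window starts at the first command (echo).
FIRST_COMMAND = "echo command"


def shell_window(phases):
    """Return the phases from the first command through the final marker."""
    names = [name for name, _ in phases]
    return phases[names.index(FIRST_COMMAND):]


if __name__ == "__main__":
    window = shell_window(PHASES)
    window_cycles = sum(cycles for _, cycles in window)
    print(f"shell window cycles: {window_cycles:,}")
    for name, cycles in window:
        print(f"  {name:<30} {cycles / window_cycles:6.1%}")
```

Seen this way, the `ls` phase alone accounts for about half of the shell window, so a 1% swing there can come from a small change in one command.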
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 3.7% | 8,335,661 |
| execute | 39% | 86,626,877 |
| mem | 12.7% | 28,125,173 |
| walker | 2.1% | 4,696,993 |
| writeback | 39% | 86,601,839 |
| mul/div | 3.6% | 7,928,947 |
The interesting state-level change is that memory wait is reduced, not that the core suddenly has a different pipeline. P89 is still the same single-issue in-order FSM core.
Hot Functions
- 5.5% of samples (3,598 samples)
- 5.1% of samples (3,341 samples)
- 3.5% of samples (2,304 samples)
- 3.4% of samples (2,246 samples)
- 2.8% of samples (1,811 samples)
- 2.6% of samples (1,697 samples)
- 2.6% of samples (1,690 samples)
- 1.9% of samples (1,208 samples)
- 1.7% of samples (1,095 samples)
- 1.4% of samples (913 samples)
- 1.3% of samples (860 samples)
- 1.3% of samples (830 samples)
- 1.2% of samples (776 samples)
- 1% of samples (675 samples)
- 1% of samples (639 samples)
- 55.5% of samples (36,297 samples)
The hot-symbol picture is mostly unchanged: BusyBox formatting and kernel memory/scheduler/exception work remain visible. That says frontend work is necessary but not sufficient.
Toward XiangShan
XiangShan is a modern high-performance RV64 application processor project. P89 is not trying to jump there in one move. The staged path is more like:
| step | feature | why it moves us closer |
|---|---|---|
| P89 | tiny physical I-cache | first frontend storage point |
| next | cache line fill | reduce sequential-code refill traffic |
| soon | better I-cache invalidation | stop throwing away the whole frontend on ordinary stores |
| soon | fetch queue | decouple frontend memory timing from execute timing |
| later | BTB / branch predictor | stop paying full redirect cost on hot control flow |
| later | D-cache / LSU cleanup | separate instruction and data locality problems |
| much later | deeper pipeline, scoreboard, rename, ROB | the long road toward BOOM/XiangShan-class machinery |
The important constraint is pacing: every one of those steps needs a benchmark like P88/P89 attached to it, or we are just adding hardware because it sounds adult.
Honest Status
| check | status |
|---|---|
| 256-word direct-mapped I-cache RTL | PASS |
| BusyBox shell workload runs | PASS |
| Fetch memory stalls improve | PASS |
| Whole shell workload improves | FAIL |
| LibreLane hardening | NOT RUN |
Next
P90 should turn the word cache into a small line-fill I-cache. If that does not move total shell time, the evidence will be pointing away from raw fetch storage and toward branch/control flow, store invalidation, or D-cache/LSU work.