P90 does the obvious follow-up to P89: turn the word I-cache into a
4-word line cache. The implementation is deliberately simple and
blocking. On a miss, the core fetches the requested word, enters
S_IC_FILL, fills the rest of the line, then returns to S_FETCH.
That simplicity is useful because the result is clear: this version is wrong for the shell workload, whose window takes 25.58% more cycles than under P89.
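The miss path described above can be sketched as a small trace-driven model. This is a toy in Python, not the core's RTL: the state names `S_FETCH` and `S_IC_FILL` and the 4-word line come from the text, while the direct-mapped organization, `NUM_LINES`, `MEM_LATENCY`, and the function name `run_blocking_fill` are assumptions made only for illustration.

```python
WORDS_PER_LINE = 4  # from the text: 4-word line cache
NUM_LINES = 8       # assumption: toy direct-mapped cache with 8 lines
MEM_LATENCY = 2     # assumption: cycles per memory handshake


def run_blocking_fill(trace):
    """Replay a list of word addresses through a blocking line-fill cache."""
    tags = [None] * NUM_LINES
    hits = 0
    handshakes = 0
    cycles = {"S_FETCH": 0, "S_IC_FILL": 0}
    for addr in trace:
        line = addr // WORDS_PER_LINE
        idx = line % NUM_LINES
        if tags[idx] == line:
            # Hit: the word is served from the cache in one S_FETCH cycle.
            hits += 1
            cycles["S_FETCH"] += 1
            continue
        # Miss: the requested word is fetched from memory first.
        handshakes += 1
        cycles["S_FETCH"] += MEM_LATENCY
        # The core then sits in S_IC_FILL until the other words arrive.
        rest = WORDS_PER_LINE - 1
        handshakes += rest
        cycles["S_IC_FILL"] += rest * MEM_LATENCY
        tags[idx] = line
    return {"hits": hits, "handshakes": handshakes, "cycles": cycles}


straight = run_blocking_fill(list(range(8)))  # sequential code
jumpy = run_blocking_fill([0, 32, 64, 96])    # every fetch lands on a new line
print(straight)
print(jumpy)
```

In this model, straight-line code turns the fill into later hits, while a jumpy trace pays the full fill on every fetch and gets no hits back. That is the trade the rest of this write-up measures on a real workload.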
Result
| metric | P89 word I-cache | P90 blocking line fill | delta |
|---|---|---|---|
| post-load cycles | 222,317,206 | 245,417,593 | +10.39% |
| shell window cycles | 66,957,620 | 84,084,195 | +25.58% |
| retired instructions | 86,601,839 | 89,490,840 | +3.34% |
| CPI | 2.5671 | 2.7424 | +6.83% |
| memory handshakes | 31,289,313 | 37,457,672 | +19.71% |
| memory stall cycles | 83,361,545 | 86,350,457 | +3.59% |
| fetch handshakes | 27,560,784 | 33,504,298 | +21.57% |
| fetch stall cycles | 54,266,192 | 55,950,261 | +3.10% |
| I-cache hits | 6,434,333 | 12,352,686 | +91.98% |
The cache gets far more hits, but it manufactures them by stalling the
core. S_IC_FILL accounts for 11,224,132 cycles.
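The "manufactured hits" reading can be checked with plain arithmetic on the table. The sketch below only uses numbers quoted above; the helper `pct_delta` and the variable names are hypothetical.

```python
# Values copied from the Result table (P89 vs P90) and the text.
p89 = {"fetch_handshakes": 27_560_784, "icache_hits": 6_434_333}
p90 = {"fetch_handshakes": 33_504_298, "icache_hits": 12_352_686}
S_IC_FILL_CYCLES = 11_224_132


def pct_delta(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


extra_hits = p90["icache_hits"] - p89["icache_hits"]
extra_handshakes = p90["fetch_handshakes"] - p89["fetch_handshakes"]
fill_cycles_per_extra_hit = S_IC_FILL_CYCLES / extra_hits

print(f"extra I-cache hits:        {extra_hits:,}")
print(f"extra fetch handshakes:    {extra_handshakes:,}")
print(f"fill cycles per extra hit: {fill_cycles_per_extra_hit:.2f}")
print(f"hit delta:                 {pct_delta(p89['icache_hits'], p90['icache_hits']):+.2f}%")
```

The extra hits (5,918,353) and the extra fetch handshakes (5,943,514) are almost the same number. On this reading, each added hit was bought with roughly one extra memory fetch and just under two cycles in S_IC_FILL.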
Memory Stalls
| source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 55,950,261 | 64.8% | 33,504,298 |
| data load | 15,291,828 | 17.7% | 1,034,172 |
| data store | 12,441,554 | 14.4% | 234,015 |
| atomic memory op | 166,469 | 0.2% | 193,202 |
| page walk for fetch | 1,200,622 | 1.4% | 1,194,468 |
| page walk for load/store | 1,298,019 | 1.5% | 1,297,517 |
| other | 1,704 | 0% | 0 |
Compared with P88, fetch stalls are still lower. Compared with P89, they are 3.10% higher (55,950,261 vs 54,266,192 cycles). That tells us P89’s tiny cache was directionally right, but P90’s blocking fill policy is too blunt.
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 119,143,502 | 48.7% |
| /init to shell banner | 1,156,340 | 0.5% |
| shell banner to first command | 40,405,326 | 16.5% |
| echo command | 2,538 | 0% |
| uname -a | 2,207,729 | 0.9% |
| ls /bin /usr/share | 39,732,205 | 16.2% |
| cat sample file | 5,276,294 | 2.2% |
| touch/write/cat/rm /tmp file | 10,985,502 | 4.5% |
| 8x ash loop with file I/O | 24,353,026 | 10% |
| final marker | 1,526,901 | 0.6% |
The shell window gets hit hard: ls, cat, and the ash loop all pay for
the blocking frontend. Linux shell work jumps around enough that waiting
for every line to fill before executing is a bad trade.
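To see how the phases relate to the shell window figure in the Result table, the phase counts can be summed directly. The sketch assumes the window runs from the echo command through the final marker. That boundary is not stated in the write-up; it is inferred from the sum matching the reported 84,084,195 cycles.

```python
SHELL_WINDOW_CYCLES = 84_084_195  # P90 shell window cycles, from the Result table

# Phase cycle counts from the Shell Phases breakdown.
# Assumption: these are the phases that make up the shell window.
window_phases = {
    "echo command": 2_538,
    "uname -a": 2_207_729,
    "ls /bin /usr/share": 39_732_205,
    "cat sample file": 5_276_294,
    "touch/write/cat/rm /tmp file": 10_985_502,
    "8x ash loop with file I/O": 24_353_026,
    "final marker": 1_526_901,
}

window_sum = sum(window_phases.values())
print(f"sum of window phases: {window_sum:,}")

# Share of the shell window taken by each phase, largest first.
for name, cycles in sorted(window_phases.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {100 * cycles / window_sum:5.1f}%")

ls_share = 100 * window_phases["ls /bin /usr/share"] / window_sum
```

Under that assumption, ls alone is close to half of the shell window and the ash loop is most of the rest.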
Cycle Shape
| state | cycles | share of cycles |
|---|---|---|
| fetch | 11,323,240 | 4.6% |
| execute | 89,518,022 | 36.5% |
| mem | 29,361,240 | 12% |
| walker | 4,990,626 | 2% |
| writeback | 89,490,840 | 36.5% |
| mul/div | 9,507,789 | 3.9% |
S_IC_FILL is the new state, and it is not small: its 11,224,132 cycles are
about 4.6% of post-load cycles, nearly as many as the fetch state itself.
That is the bug, not a side detail.
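The Cycle Shape breakdown does not list S_IC_FILL itself, so it is worth checking that the fill cycles quoted earlier account for the gap. A minimal accounting sketch, using only numbers from this write-up (variable names are hypothetical):

```python
POST_LOAD_CYCLES = 245_417_593  # P90 post-load cycles, from the Result table
S_IC_FILL_CYCLES = 11_224_132   # from the text

# Per-state cycle counts from the Cycle Shape breakdown.
state_cycles = {
    "fetch": 11_323_240,
    "execute": 89_518_022,
    "mem": 29_361_240,
    "walker": 4_990_626,
    "writeback": 89_490_840,
    "mul/div": 9_507_789,
}

listed = sum(state_cycles.values())
unlisted = POST_LOAD_CYCLES - listed    # cycles not in any listed state
residual = unlisted - S_IC_FILL_CYCLES  # what is left after S_IC_FILL
fill_share = 100 * S_IC_FILL_CYCLES / POST_LOAD_CYCLES

print(f"listed states: {listed:,}")
print(f"unlisted:      {unlisted:,}")
print(f"residual:      {residual:,}")
print(f"S_IC_FILL:     {fill_share:.1f}% of post-load cycles")
```

The six listed states plus S_IC_FILL leave 1,704 cycles unexplained, which equals the "other" entry in the memory stall breakdown. Whether those are the same cycles is not stated here, but the totals close.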
Hot Functions
- 5.6% of samples (4,566 samples)
- 4.4% of samples (3,581 samples)
- 4.2% of samples (3,427 samples)
- 3.7% of samples (3,070 samples)
- 2.4% of samples (1,969 samples)
- 2.2% of samples (1,804 samples)
- 2.1% of samples (1,739 samples)
- 1.8% of samples (1,433 samples)
- 1.6% of samples (1,341 samples)
- 1.6% of samples (1,339 samples)
- 1.4% of samples (1,120 samples)
- 1.1% of samples (929 samples)
- 1.1% of samples (859 samples)
- 1% of samples (851 samples)
- 1% of samples (813 samples)
- 56.3% of samples (46,251 samples)
The hot-symbol picture does not magically change. Frontend stalls are important, but BusyBox formatting, kernel memory routines, scheduler work, and exception paths still matter.
Toward XiangShan
This is a small but important architectural lesson on the way toward a more XiangShan-like frontend: line fill has to be decoupled. Modern application cores do not stop the machine just to make future fetches pretty. They return the critical word, keep fill buffers in flight, feed a queue, and let prediction keep the frontend pointed at useful code.
| step | result |
|---|---|
| P89 tiny word cache | fetch stalls down, whole workload flat |
| P90 blocking line fill | more hits, whole workload much worse |
| P91 target | nonblocking or critical-word-first line fill |
Honest Status
| check | status |
|---|---|
| 4-word I-cache line storage | PASS |
| Blocking line-fill FSM | PASS |
| BusyBox shell workload runs | PASS |
| Whole workload speedup vs P89 | FAIL |
| LibreLane hardening | NOT RUN |
Next
P91 should keep the line cache but remove the blocking behavior. The critical word should execute immediately; the rest of the line should be filled only when the fetch side can spare the bus.
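As a rough illustration of why that should help, the two policies can be compared on fetch stall cycles in a toy model. Everything here is an assumption rather than the planned P91 design: a direct-mapped cache with `NUM_LINES` lines, a fixed `MEM_LATENCY` per word, per-word valid bits, and background fill that lands one word for free on each hit cycle (the bus is idle then). Contention with data accesses is ignored.

```python
from collections import deque

WORDS_PER_LINE = 4  # from the text
NUM_LINES = 8       # assumption: toy direct-mapped cache
MEM_LATENCY = 3     # assumption: cycles per word fetched over the bus


def run_blocking(trace):
    """P90-style: a line miss stalls until all words of the line arrive."""
    tags = [None] * NUM_LINES
    stall = 0
    for addr in trace:
        line = addr // WORDS_PER_LINE
        idx = line % NUM_LINES
        if tags[idx] != line:
            tags[idx] = line
            stall += MEM_LATENCY * WORDS_PER_LINE
    return stall


def run_critical_word_first(trace):
    """Sketch of the P91 idea: stall only for the requested word."""
    tags = [None] * NUM_LINES
    valid = [[False] * WORDS_PER_LINE for _ in range(NUM_LINES)]
    pending = deque()  # background fills waiting for an idle bus slot
    stall = 0
    for addr in trace:
        line, word = divmod(addr, WORDS_PER_LINE)
        idx = line % NUM_LINES
        if tags[idx] == line and valid[idx][word]:
            # Hit: fetch does not need the bus, so one pending word can land.
            while pending:
                p_idx, p_line, p_word = pending.popleft()
                if tags[p_idx] == p_line and not valid[p_idx][p_word]:
                    valid[p_idx][p_word] = True
                    break  # one word per idle slot; stale entries are skipped
            continue
        if tags[idx] != line:
            # New line: claim the slot and queue the other words in wrap order.
            tags[idx] = line
            valid[idx] = [False] * WORDS_PER_LINE
            for k in range(1, WORDS_PER_LINE):
                pending.append((idx, line, (word + k) % WORDS_PER_LINE))
        # Demand fetch of the critical word is the only stall.
        valid[idx][word] = True
        stall += MEM_LATENCY
    return stall


jumpy = [0, 32, 64, 96]       # every fetch maps to the same slot, new line
loop = [0, 1, 0, 1, 0, 1]     # tight two-word loop
for name, trace in (("jumpy", jumpy), ("loop", loop)):
    print(name, run_blocking(trace), run_critical_word_first(trace))
```

In this model the critical-word-first policy never stalls longer than the blocking one, and the gap is largest on jumpy traces, where the blocking fill fetches words that are never used. A real design also has to arbitrate the bus against loads, stores, and page walks, which this sketch leaves out.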