P97 extends P96’s one-word D-cache into four-word lines. Demand loads are still critical-word-first: the requested word returns immediately, then the cache tries to fill the other words through a background request class.
It works functionally. It is slower than P96.
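The load path described above can be sketched as a small behavioral model. This is only an illustration, not the P97 RTL: the class name `DCacheModel`, the single-line capacity, the per-word valid tracking, and the ascending fill order are all assumptions made to keep the example short.

```python
WORDS_PER_LINE = 4  # P97 line size in words


class DCacheModel:
    """Toy one-line cache: demand word first, other words filled in the background."""

    def __init__(self, memory):
        self.memory = memory                  # backing store, one entry per word
        self.tag = None                       # tag of the resident line, if any
        self.words = [None] * WORDS_PER_LINE  # None marks a word not yet filled
        self.pending = []                     # word indices awaiting background fill

    def load(self, addr):
        tag, idx = divmod(addr, WORDS_PER_LINE)
        if self.tag == tag and self.words[idx] is not None:
            return self.words[idx], "hit"
        if self.tag != tag:
            # Allocate a new line and queue the three non-critical words.
            self.tag = tag
            self.words = [None] * WORDS_PER_LINE
            self.pending = [i for i in range(WORDS_PER_LINE) if i != idx]
        else:
            # Same line, word not filled yet: the demand load takes it over.
            self.pending.remove(idx)
        self.words[idx] = self.memory[addr]   # critical word returns immediately
        return self.words[idx], "miss"

    def background_step(self):
        """Fill one queued word; returns False when nothing is left to fill."""
        if not self.pending:
            return False
        idx = self.pending.pop(0)
        self.words[idx] = self.memory[self.tag * WORDS_PER_LINE + idx]
        return True
```

In this model a later load to the same line hits only if `background_step` got to that word first, which is the dependency the rest of the post measures.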
Result
| metric | P94 arbiter | P96 D-cache v0 | P97 line-fill |
|---|---|---|---|
| post-load cycles | 222,459,202 | 221,522,958 | 222,850,787 |
| shell window cycles | 67,050,374 | 66,084,155 | 67,369,576 |
| retired instructions | 86,664,089 | 86,344,929 | 86,777,980 |
| CPI | 2.5669 | 2.5656 | 2.5681 |
| memory stall cycles | 60,032,329 | 59,418,375 | 60,295,642 |
| load stall cycles | 14,632,992 | 10,976,902 | 10,387,310 |
| fetch stall cycles | 23,549,359 | 26,676,104 | 29,593,757 |
| comparison | result |
|---|---|
| shell window vs P96 | +1.95% |
| post-load cycles vs P96 | +0.60% |
| memory stalls vs P96 | +1.48% |
| load stalls vs P96 | -5.37% |
| fetch stalls vs P96 | +10.94% |
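
The comparison rows can be recomputed from the P96 and P97 columns of the first table. The sketch below assumes CPI is post-load cycles divided by retired instructions, which is not stated in the post but matches the CPI row; the dictionary keys are illustrative names.

```python
# Values copied from the Result table (P96 and P97 columns).
P96 = {
    "shell_window": 66_084_155,
    "post_load": 221_522_958,
    "retired": 86_344_929,
    "memory_stall": 59_418_375,
    "load_stall": 10_976_902,
    "fetch_stall": 26_676_104,
}
P97 = {
    "shell_window": 67_369_576,
    "post_load": 222_850_787,
    "retired": 86_777_980,
    "memory_stall": 60_295_642,
    "load_stall": 10_387_310,
    "fetch_stall": 29_593_757,
}


def pct_change(new, old):
    """Relative change of new against old, in percent."""
    return (new - old) / old * 100.0


def cpi(run):
    """Assumed definition: post-load cycles per retired instruction."""
    return run["post_load"] / run["retired"]


# One delta per metric; retired instructions are only used for CPI.
deltas = {k: pct_change(P97[k], P96[k]) for k in P96 if k != "retired"}

if __name__ == "__main__":
    for name, delta in deltas.items():
        print(f"{name:13s} {delta:+.2f}%")
    print(f"CPI P96 {cpi(P96):.4f}  P97 {cpi(P97):.4f}")
```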
D-cache Counters
| counter | P96 | P97 |
|---|---|---|
| load hits | 3,656,064 | 4,370,122 |
| load misses | 6,354,876 | 5,746,602 |
| demand fills | 6,354,876 | 5,746,602 |
| background fills | 0 | 3,419,006 |
| background active cycles | 0 | 85,257,787 |
| store updates | 10,473,803 | 10,547,848 |
| invalidations | 1,873,327 | 1,874,674 |
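The cache geometry helps the data side: load hits rise 19.5% and load misses fall 9.6% relative to P96. The fill policy hurts the machine: the background fill class is active for 85.3M cycles, and fetch stalls grow. That is the useful result.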
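A few derived ratios make the counter table easier to read. The sketch below assumes the load hit rate is hits over hits plus misses, and treats "background active cycles" as time attributable to background fills. The post does not say whether that counter includes arbitration wait or whether a background fill is counted per word or per line, so the per-fill figure is only a rough average.

```python
def hit_rate(hits, misses):
    """Fraction of loads served from the D-cache."""
    return hits / (hits + misses)


# Counter values copied from the D-cache table.
p96_rate = hit_rate(3_656_064, 6_354_876)
p97_rate = hit_rate(4_370_122, 5_746_602)

background_fills = 3_419_006
background_active_cycles = 85_257_787
post_load_cycles = 222_850_787  # P97 post-load cycles from the Result table

# Average active cycles per completed background fill (rough, see caveats).
cycles_per_background_fill = background_active_cycles / background_fills

# Share of the post-load run during which the background class was active.
background_share = background_active_cycles / post_load_cycles

if __name__ == "__main__":
    print(f"load hit rate  P96 {p96_rate:.1%}  P97 {p97_rate:.1%}")
    print(f"active cycles per background fill  {cycles_per_background_fill:.1f}")
    print(f"background active share of run     {background_share:.1%}")
```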
Memory Stalls
| source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 29,593,757 | 49.1% | 47,122,281 |
| data load | 10,387,310 | 17.2% | 875,234 |
| data store | 12,008,121 | 19.9% | 219,405 |
| atomic memory op | 158,700 | 0.3% | 184,997 |
| page walk for fetch | 1,130,883 | 1.9% | 1,124,729 |
| page walk for load/store | 1,229,718 | 2.0% | 1,229,207 |
| other | 5,787,153 | 9.6% | 18,370,689 |
Load stalls drop again, but fetch stalls rise enough to lose the P96 shell-window win.
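The trade can be put in absolute cycles using the stall rows of the Result table. The sketch below is plain arithmetic on those published numbers; the "rest" figure is inferred by subtraction, since the post gives no P96 breakdown for stores, page walks, or the other category.

```python
# Stall cycles from the Result table: (P96, P97).
load_stall = (10_976_902, 10_387_310)
fetch_stall = (26_676_104, 29_593_757)
memory_stall = (59_418_375, 60_295_642)

load_saved = load_stall[0] - load_stall[1]        # cycles gained on the data side
fetch_added = fetch_stall[1] - fetch_stall[0]     # cycles lost on the fetch side
total_delta = memory_stall[1] - memory_stall[0]   # net change in memory stalls

# Whatever is not explained by load and fetch must come from the remaining
# categories (store, atomics, page walks, other), taken together.
rest_delta = total_delta - (fetch_added - load_saved)

if __name__ == "__main__":
    print(f"load stalls saved   {load_saved:>10,}")
    print(f"fetch stalls added  {fetch_added:>10,}")
    print(f"net memory stalls   {total_delta:>+10,}")
    print(f"remaining classes   {rest_delta:>+10,}")
```

Each load-stall cycle saved costs roughly five fetch-stall cycles under these numbers.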
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,614,831 | 52.9% |
| /init to shell banner | 1,081,377 | 0.5% |
| shell banner to first command | 36,156,938 | 16.3% |
| echo command | 1,598 | 0.0% |
| uname -a | 2,616,228 | 1.2% |
| ls /bin /usr/share | 31,715,496 | 14.3% |
| cat sample file | 4,087,721 | 1.8% |
| touch/write/cat/rm /tmp file | 11,430,280 | 5.1% |
| 8x ash loop with file I/O | 16,108,796 | 7.3% |
| final marker | 1,409,457 | 0.6% |
The full BusyBox shell script reaches P97-FILE-OK. The shell window is 67.37M cycles, slower than both P96 and P94.
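The shell-window figure can be cross-checked against the phase list. The post does not define the window explicitly; the sketch below relies on the observation that the phases from the echo command through the final marker sum exactly to the reported shell window, and that the percentages are shares of the sum of all ten phases.

```python
# Phase cycle counts copied from the Shell Phases list, in order.
phases = [
    ("kernel banner to /init", 117_614_831),
    ("/init to shell banner", 1_081_377),
    ("shell banner to first command", 36_156_938),
    ("echo command", 1_598),
    ("uname -a", 2_616_228),
    ("ls /bin /usr/share", 31_715_496),
    ("cat sample file", 4_087_721),
    ("touch/write/cat/rm /tmp file", 11_430_280),
    ("8x ash loop with file I/O", 16_108_796),
    ("final marker", 1_409_457),
]

total = sum(cycles for _, cycles in phases)

# Assumed window: everything from the first command onward.
first_command = [name for name, _ in phases].index("echo command")
shell_window = sum(cycles for _, cycles in phases[first_command:])

if __name__ == "__main__":
    print(f"all phases    {total:,}")
    print(f"shell window  {shell_window:,}")
    for name, cycles in phases:
        print(f"{name:32s} {cycles / total:6.1%}")
```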
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 3.7% | 8,335,018 |
| execute | 39.0% | 86,803,158 |
| mem | 12.7% | 28,203,889 |
| walker | 2.1% | 4,714,537 |
| writeback | 38.9% | 86,777,980 |
| mul/div | 3.6% | 8,014,501 |
P97 does not add a blocking line-fill state, so no new state appears in this breakdown. The cost shows up instead as extra pressure on the shared memory service while the normal state machine runs.
Hot Functions
- 5.5% of samples (3,630 samples)
- 5.1% of samples (3,376 samples)
- 3.5% of samples (2,318 samples)
- 3.5% of samples (2,305 samples)
- 2.8% of samples (1,808 samples)
- 2.5% of samples (1,668 samples)
- 2.5% of samples (1,630 samples)
- 1.8% of samples (1,182 samples)
- 1.6% of samples (1,071 samples)
- 1.5% of samples (956 samples)
- 1.3% of samples (850 samples)
- 1.3% of samples (832 samples)
- 1.2% of samples (791 samples)
- 1.1% of samples (688 samples)
- 1.0% of samples (652 samples)
- 55.7% of samples (36,611 samples)
The hot-symbol mix remains the same BusyBox shell workload. The change is memory-system policy, not different software.
Honest Status
| check | status |
|---|---|
| Four-word D-cache line storage | PASS |
| Critical-word-first demand load | PASS |
| Background D-cache fill descriptor | PASS |
| `dcache_background` arbiter class | PASS |
| BusyBox shell workload runs | PASS |
| D-cache line-fill counters captured | PASS |
| Shell-window speedup vs P96 | FAIL |
| Smarter fill throttling | NOT RUN |
| True split I/D RAM ports | NOT RUN |
| LibreLane hardening | NOT RUN |
Next
P98 should keep the four-word line shape but throttle background D-cache fill much more aggressively. The obvious rule is to let it run only when the frontend is genuinely not waiting.
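The proposed rule can be sketched as an arbiter decision function. This is a hypothetical model, not the P98 design: the `Req` classes other than `DCACHE_BACKGROUND`, their priority order, and the `frontend_waiting` signal are assumptions chosen for illustration.

```python
from enum import IntEnum


class Req(IntEnum):
    """Request classes; a lower value means higher priority (assumed order)."""
    FETCH = 0
    DEMAND_LOAD = 1
    STORE = 2
    DCACHE_BACKGROUND = 3


def grant(pending, frontend_waiting):
    """Pick the request class to serve this cycle, or None to idle.

    Background fill is served only when no foreground class is pending and
    the frontend is not stalled waiting on memory.
    """
    foreground = sorted(r for r in pending if r != Req.DCACHE_BACKGROUND)
    if foreground:
        return foreground[0]
    if Req.DCACHE_BACKGROUND in pending and not frontend_waiting:
        return Req.DCACHE_BACKGROUND
    return None
```

For example, `grant({Req.DCACHE_BACKGROUND}, frontend_waiting=True)` returns `None`: the memory port stays free so that the next fetch request is not queued behind a fill already in flight. Whether idling is better than filling in that case is exactly what a P98 measurement would have to show.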