P98 keeps P97’s four-word D-cache line structure, but stops treating background data-line fill as free. The fill descriptor can only issue when the frontend already has useful work queued and the I-cache fill path is not active.
It works functionally. It recovers the P96 shell-window timing. It also shows why the next arc needs a real Harvard-style instruction/data split instead of more shared-port negotiation.
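The throttle rule described above can be sketched as a small behavioral model. This is only an illustration, not the RTL: `FrontendState`, `queued_instructions`, `icache_fill_active`, and the `min_queued` threshold are hypothetical names chosen for the sketch, and the real design may define "useful work queued" differently.

```python
from dataclasses import dataclass


@dataclass
class FrontendState:
    """Hypothetical snapshot of the signals the throttle looks at."""
    queued_instructions: int   # entries already waiting in the fetch queue
    icache_fill_active: bool   # True while the I-cache fill path is busy


def background_fill_may_issue(fe: FrontendState, min_queued: int = 1) -> bool:
    """Allow a background D-cache line fill only when it should not starve fetch.

    Assumption: "useful work queued" means at least `min_queued` entries.
    """
    has_useful_work = fe.queued_instructions >= min_queued
    return has_useful_work and not fe.icache_fill_active


# The fill is granted only when both conditions hold.
print(background_fill_may_issue(FrontendState(2, False)))
print(background_fill_may_issue(FrontendState(0, False)))
print(background_fill_may_issue(FrontendState(2, True)))
```

Demand fills are not modeled here; the gate applies only to the opportunistic background request before it reaches the shared memory arbiter.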
Result
| metric | P94 arbiter | P96 D-cache v0 | P97 line-fill | P98 throttle |
|---|---|---|---|---|
| post-load cycles | 222,459,202 | 221,522,958 | 222,850,787 | 221,452,591 |
| shell window cycles | 67,050,374 | 66,084,155 | 67,369,576 | 66,055,345 |
| retired instructions | 86,664,089 | 86,344,929 | 86,777,980 | 86,329,983 |
| CPI | 2.5669 | 2.5656 | 2.5681 | 2.5652 |
| memory stall cycles | 60,032,329 | 59,418,375 | 60,295,642 | 59,683,338 |
| load stall cycles | 14,632,992 | 10,976,902 | 10,387,310 | 10,697,962 |
| fetch stall cycles | 23,549,359 | 26,676,104 | 29,593,757 | 27,286,526 |
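The CPI row is consistent with post-load cycles divided by retired instructions. A minimal check, assuming that is how the column was derived:

```python
# (post-load cycles, retired instructions, reported CPI) copied from the table
runs = {
    "P94": (222_459_202, 86_664_089, 2.5669),
    "P96": (221_522_958, 86_344_929, 2.5656),
    "P97": (222_850_787, 86_777_980, 2.5681),
    "P98": (221_452_591, 86_329_983, 2.5652),
}


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per retired instruction."""
    return cycles / instructions


for name, (cycles, instructions, reported) in runs.items():
    print(f"{name}: computed {cpi(cycles, instructions):.4f}, reported {reported}")
```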
| comparison | result |
|---|---|
| shell window vs P96 | -0.04% |
| post-load cycles vs P96 | -0.03% |
| memory stalls vs P96 | +0.45% |
| load stalls vs P96 | -2.54% |
| fetch stalls vs P96 | +2.29% |
| shell window vs P97 | -1.95% |
| fetch stalls vs P97 | -7.80% |
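The comparison percentages follow from the metric table as simple relative changes. A short sketch that recomputes them from the raw counts (the helper name `pct_change` is just for this example):

```python
def pct_change(new: int, old: int) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


# (label, P98 value, baseline value) copied from the metric table
comparisons = [
    ("shell window vs P96", 66_055_345, 66_084_155),
    ("post-load cycles vs P96", 221_452_591, 221_522_958),
    ("memory stalls vs P96", 59_683_338, 59_418_375),
    ("load stalls vs P96", 10_697_962, 10_976_902),
    ("fetch stalls vs P96", 27_286_526, 26_676_104),
    ("shell window vs P97", 66_055_345, 67_369_576),
    ("fetch stalls vs P97", 27_286_526, 29_593_757),
]

deltas = {label: pct_change(new, old) for label, new, old in comparisons}
for label, delta in deltas.items():
    print(f"{label}: {delta:+.2f}%")
```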
D-cache Counters
| counter | P96 | P97 | P98 |
|---|---|---|---|
| load hits | 3,656,064 | 4,370,122 | 3,945,531 |
| load misses | 6,354,876 | 5,746,602 | 6,060,778 |
| demand fills | 6,354,876 | 5,746,602 | 6,060,778 |
| background fills | 0 | 3,419,006 | 377,930 |
| background active cycles | 0 | 85,257,787 | 102,335,320 |
| store updates | 10,473,803 | 10,547,848 | 10,477,277 |
| invalidations | 1,873,327 | 1,874,674 | 1,873,376 |
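The throttle cut background fill grants sharply, from 3,419,006 in P97 to 377,930 in P98. That gives up some of P97’s data locality (load hits fall from 4,370,122 to 3,945,531, still above P96’s 3,656,064), but it avoids most of P97’s frontend damage.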
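To put the counters on a common scale, the sketch below derives a load hit rate and the reduction in background fills. It assumes the hit rate is hits divided by hits plus misses; the post does not define it, so treat the ratio as an interpretation of the counters.

```python
# (load hits, load misses, background fills) copied from the counter table
counters = {
    "P96": (3_656_064, 6_354_876, 0),
    "P97": (4_370_122, 5_746_602, 3_419_006),
    "P98": (3_945_531, 6_060_778, 377_930),
}


def load_hit_rate(hits: int, misses: int) -> float:
    """Assumed definition: fraction of counted loads that hit."""
    return hits / (hits + misses)


hit_rate = {name: load_hit_rate(h, m) for name, (h, m, _) in counters.items()}

# Fraction of P97's background fills that the P98 throttle removed.
bg_cut = 1 - counters["P98"][2] / counters["P97"][2]

for name, rate in hit_rate.items():
    print(f"{name}: load hit rate {rate:.1%}")
print(f"background fills cut vs P97: {bg_cut:.1%}")
```

Under that definition P98 lands between P96 and P97 on hit rate while issuing only about a ninth of P97's background fills.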
Memory Stalls
| stall source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 27,286,526 | 45.7% | 45,996,088 |
| data load | 10,697,962 | 17.9% | 875,585 |
| data store | 11,941,385 | 20% | 216,019 |
| atomic memory op | 157,331 | 0.3% | 183,415 |
| page walk for fetch | 1,118,488 | 1.9% | 1,112,334 |
| page walk for load/store | 1,213,747 | 2% | 1,213,221 |
| other | 7,267,899 | 12.2% | 16,761,794 |
P98 is still worse than P96 on fetch stalls, but much better than P97. That is the narrow win: less data-side opportunism on the one RAM port.
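The stall breakdown can be cross-checked against the P98 memory stall total from the Result table (59,683,338). A small sketch that sums the categories and recomputes each share:

```python
# Stall cycles per source, copied from the Memory Stalls breakdown
stalls = {
    "instruction fetch": 27_286_526,
    "data load": 10_697_962,
    "data store": 11_941_385,
    "atomic memory op": 157_331,
    "page walk for fetch": 1_118_488,
    "page walk for load/store": 1_213_747,
    "other": 7_267_899,
}

MEMORY_STALL_TOTAL = 59_683_338  # P98 "memory stall cycles" from the Result table

total = sum(stalls.values())
shares = {name: cycles / total * 100.0 for name, cycles in stalls.items()}

print("categories sum to reported total:", total == MEMORY_STALL_TOTAL)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```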
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,616,704 | 53.3% |
| /init to shell banner | 1,084,530 | 0.5% |
| shell banner to first command | 36,067,947 | 16.3% |
| echo command | 1,598 | 0% |
| uname -a | 2,432,864 | 1.1% |
| ls /bin /usr/share | 31,670,845 | 14.3% |
| cat sample file | 4,549,496 | 2.1% |
| touch/write/cat/rm /tmp file | 11,060,226 | 5% |
| 8x ash loop with file I/O | 16,339,653 | 7.4% |
| final marker | 663 | 0% |
The full BusyBox shell script reaches P98-FILE-OK. The shell window is 66.06M cycles, 0.04% faster than P96 and 1.95% faster than P97.
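The phase counts line up with the reported shell window if the window is taken to run from the echo command through the final marker. That boundary is inferred from the arithmetic, not stated in the text, so the sketch below is a consistency check under that assumption:

```python
# (phase, cycles) copied from the Shell Phases breakdown, in order
phases = [
    ("kernel banner to /init", 117_616_704),
    ("/init to shell banner", 1_084_530),
    ("shell banner to first command", 36_067_947),
    ("echo command", 1_598),
    ("uname -a", 2_432_864),
    ("ls /bin /usr/share", 31_670_845),
    ("cat sample file", 4_549_496),
    ("touch/write/cat/rm /tmp file", 11_060_226),
    ("8x ash loop with file I/O", 16_339_653),
    ("final marker", 663),
]

SHELL_WINDOW = 66_055_345  # P98 "shell window cycles" from the Result table

# Assumed window start: the first scripted command.
first = next(i for i, (name, _) in enumerate(phases) if name == "echo command")
window_sum = sum(cycles for _, cycles in phases[first:])

print("window phases sum to reported shell window:", window_sum == SHELL_WINDOW)
```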
Cycle Shape
| stage | share | cycles |
|---|---|---|
| fetch | 3.8% | 8,315,386 |
| execute | 39% | 86,354,801 |
| mem | 12.7% | 28,017,228 |
| walker | 2.1% | 4,657,790 |
| writeback | 39% | 86,329,983 |
| mul/div | 3.5% | 7,775,687 |
There is no new blocking cache-fill state. The change is request gating before the shared memory arbiter.
Hot Functions
- 5.6% of samples (3,624 samples)
- 5.1% of samples (3,293 samples)
- 3.7% of samples (2,357 samples)
- 3.4% of samples (2,196 samples)
- 2.8% of samples (1,808 samples)
- 2.7% of samples (1,708 samples)
- 2.6% of samples (1,674 samples)
- 1.7% of samples (1,125 samples)
- 1.7% of samples (1,119 samples)
- 1.4% of samples (873 samples)
- 1.3% of samples (831 samples)
- 1.3% of samples (810 samples)
- 1.1% of samples (708 samples)
- 1% of samples (664 samples)
- 1% of samples (632 samples)
- 55.4% of samples (35,739 samples)
The software workload stayed the same. The measured change is memory policy.
Honest Status
| check | status |
|---|---|
| Four-word D-cache line storage | PASS |
| Critical-word-first demand load | PASS |
| Frontend-aware background-fill throttle | PASS |
| BusyBox shell workload runs | PASS |
| D-cache throttle counters captured | PASS |
| Shell-window speedup vs P96 | PASS |
| True split I/D RAM ports | NOT RUN |
| Split ITLB/DTLB | NOT RUN |
| Nonblocking miss machinery | NOT RUN |
| LibreLane hardening | NOT RUN |
Next
P99 should stop trying to make one port polite and instead map the Harvard split clearly: what the instruction path owns, what the data path owns, where translation lives, and where the lower shared memory system is allowed to reappear.