P96 adds the first data cache to the Linux-capable core. It is small on
purpose: 64 direct-mapped entries, one 32-bit word per entry, physical
address tags, aligned integer LW hit bypass, and aligned integer SW
write-through update/allocate.
This is not a real nonblocking cache yet. It is the first proof that data-side caching is worth continuing.
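The policy described above can be sketched as a small behavioral model. This is not the RTL: the class name `WordDCache`, the dict-backed `memory`, and the index/tag split (byte offset in bits [1:0], index in bits [7:2], tag above that) are assumptions made for illustration. Subword accesses and invalidation are left out.

```python
from dataclasses import dataclass, field

ENTRIES = 64      # direct-mapped entries, one 32-bit word each
INDEX_BITS = 6    # log2(ENTRIES)


@dataclass
class WordDCache:
    """Behavioral sketch of a one-word-per-entry, write-through D-cache."""
    valid: list = field(default_factory=lambda: [False] * ENTRIES)
    tags: list = field(default_factory=lambda: [0] * ENTRIES)
    data: list = field(default_factory=lambda: [0] * ENTRIES)
    hits: int = 0
    misses: int = 0

    @staticmethod
    def split(paddr):
        # Assumed split of the physical address into (index, tag).
        word = paddr >> 2
        return word & (ENTRIES - 1), word >> INDEX_BITS

    def load_word(self, paddr, memory):
        """Aligned LW: return the cached word on a hit, else fill the entry."""
        assert paddr % 4 == 0, "only aligned word loads use the cache"
        index, tag = self.split(paddr)
        if self.valid[index] and self.tags[index] == tag:
            self.hits += 1
            return self.data[index]
        self.misses += 1
        value = memory.get(paddr, 0)
        self.valid[index], self.tags[index], self.data[index] = True, tag, value
        return value

    def store_word(self, paddr, value, memory):
        """Aligned SW: write through to memory, then update/allocate the entry."""
        assert paddr % 4 == 0, "only aligned word stores use the cache"
        value &= 0xFFFFFFFF
        memory[paddr] = value
        index, tag = self.split(paddr)
        self.valid[index], self.tags[index], self.data[index] = True, tag, value
```

In this model, two addresses 256 bytes apart (for example `0x100` and `0x200`) map to the same entry and evict each other, which is the kind of conflict a 64-entry direct-mapped cache is expected to show.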
Result
| metric | P94 arbiter | P95 store buffer | P96 D-cache |
|---|---|---|---|
| post-load cycles | 222,459,202 | 241,494,238 | 221,522,958 |
| shell window cycles | 67,050,374 | 77,821,976 | 66,084,155 |
| retired instructions | 86,664,089 | 88,851,638 | 86,344,929 |
| CPI | 2.5669 | 2.7179 | 2.5656 |
| memory stall cycles | 60,032,329 | 60,797,627 | 59,418,375 |
| load stall cycles | 14,632,992 | 15,144,900 | 10,976,902 |
| fetch stall cycles | 23,549,359 | 35,563,846 | 26,676,104 |

| comparison | result |
|---|---|
| shell window vs P94 | -1.44% |
| post-load cycles vs P94 | -0.42% |
| memory stalls vs P94 | -1.02% |
| load stalls vs P94 | -24.99% |
| fetch stalls vs P94 | +13.28% |
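The comparison figures follow from the P94 and P96 columns of the Result table. A short check, assuming each delta is computed as `(P96 - P94) / P94`; the helper name `pct_change` is illustrative:

```python
def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old


# (P94 arbiter, P96 D-cache) pairs copied from the Result table.
rows = {
    "shell window": (67_050_374, 66_084_155),
    "post-load cycles": (222_459_202, 221_522_958),
    "memory stalls": (60_032_329, 59_418_375),
    "load stalls": (14_632_992, 10_976_902),
    "fetch stalls": (23_549_359, 26_676_104),
}

deltas = {name: pct_change(p96, p94) for name, (p94, p96) in rows.items()}
for name, delta in deltas.items():
    print(f"{name} vs P94: {delta:+.2f}%")
```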
D-cache Counters
| counter | value |
|---|---|
| load hits | 3,656,064 |
| load misses | 6,354,876 |
| fills | 6,354,876 |
| store updates | 10,473,803 |
| invalidations | 1,873,327 |
The one-word cache cuts data-side stalls, but the workload still has more misses than hits. A line-based D-cache is the obvious next test, provided it avoids the blocking-fill mistake P90 made on the instruction side.
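The "more misses than hits" observation can be put as a hit rate. The sketch below assumes that load hits plus load misses cover every load that consulted the cache, and that the P94 and P96 runs are comparable enough to divide the load-stall difference by the hit count; both are assumptions, not statements about the counters' definitions.

```python
# D-cache counters copied from the table above.
load_hits = 3_656_064
load_misses = 6_354_876
fills = 6_354_876

loads_seen = load_hits + load_misses
hit_rate = load_hits / loads_seen

# Load-stall cycles from the Result table (P94 arbiter vs P96 D-cache).
load_stalls_p94 = 14_632_992
load_stalls_p96 = 10_976_902
saved_per_hit = (load_stalls_p94 - load_stalls_p96) / load_hits

print(f"loads that consulted the cache: {loads_seen:,}")
print(f"hit rate: {hit_rate:.1%}")
print(f"fills per miss: {fills / load_misses:.2f}")
print(f"load-stall cycles saved per hit: {saved_per_hit:.3f}")
```

Under those assumptions the hit rate is a little over one third, every miss produced exactly one fill, and the load-stall reduction works out to roughly one cycle per hit.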
Memory Stalls
| source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 26,676,104 | 44.9% | 46,243,128 |
| data load | 10,976,902 | 18.5% | 890,777 |
| data store | 11,935,094 | 20.1% | 218,712 |
| atomic memory op | 157,413 | 0.3% | 183,483 |
| page walk for fetch | 1,122,386 | 1.9% | 1,116,232 |
| page walk for load/store | 1,221,734 | 2.1% | 1,221,224 |
| other | 7,328,742 | 12.3% | 16,703,007 |
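As a consistency check, the per-source stall cycles above add up to the P96 memory stall total from the Result table, and each share is that source's cycles divided by the total. The dictionary name `stalls` is illustrative:

```python
# Stall cycles per source, copied from the breakdown above.
stalls = {
    "instruction fetch": 26_676_104,
    "data load": 10_976_902,
    "data store": 11_935_094,
    "atomic memory op": 157_413,
    "page walk for fetch": 1_122_386,
    "page walk for load/store": 1_221_734,
    "other": 7_328_742,
}

total = sum(stalls.values())
shares = {name: 100.0 * cycles / total for name, cycles in stalls.items()}

print(f"total memory stall cycles: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```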
Load stall cycles drop from P94’s 14.63M to 10.98M. Fetch stalls rise from 23.55M to 26.68M, which suggests the memory system is still a shared-port negotiation rather than independent instruction/data service.
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,615,769 | 53.3% |
| /init to shell banner | 1,069,519 | 0.5% |
| shell banner to first command | 36,125,450 | 16.4% |
| echo command | 1,598 | 0% |
| uname -a | 1,991,318 | 0.9% |
| ls /bin /usr/share | 32,798,794 | 14.9% |
| cat sample file | 4,516,475 | 2% |
| touch/write/cat/rm /tmp file | 10,556,885 | 4.8% |
| 8x ash loop with file I/O | 16,218,422 | 7.3% |
| final marker | 663 | 0% |
The full BusyBox shell script reaches P96-FILE-OK. The shell window is
66.08M cycles, a 1.44% improvement over P94 and a clear recovery from
P95’s store-buffer regression.
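The 66.08M figure matches the sum of the phases from the echo command through the final marker. Treating that span as the definition of the shell window is an inference from the numbers, not something the report states:

```python
# Phase cycle counts copied from the Shell Phases section, from the
# first command through the final marker.
command_phases = {
    "echo command": 1_598,
    "uname -a": 1_991_318,
    "ls /bin /usr/share": 32_798_794,
    "cat sample file": 4_516_475,
    "touch/write/cat/rm /tmp file": 10_556_885,
    "8x ash loop with file I/O": 16_218_422,
    "final marker": 663,
}

shell_window = sum(command_phases.values())
largest = max(command_phases, key=command_phases.get)

print(f"shell window cycles: {shell_window:,}")
print(f"largest command phase: {largest}")
```

The `ls /bin /usr/share` phase accounts for about half of that window, so it is the phase where data-side changes have the most room to show up.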
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 3.8% | 8,323,856 |
| execute | 39% | 86,369,781 |
| mem | 12.6% | 28,018,445 |
| walker | 2.1% | 4,681,576 |
| writeback | 39% | 86,344,929 |
| mul/div | 3.5% | 7,782,655 |
P96 does not add a new architectural state. The speedup shows up as
fewer external data-memory waits from the existing S_MEM path.
Hot Functions
- 5.5% of samples (3,545 samples)
- 5.1% of samples (3,311 samples)
- 3.6% of samples (2,320 samples)
- 3.3% of samples (2,137 samples)
- 2.8% of samples (1,815 samples)
- 2.6% of samples (1,690 samples)
- 2.6% of samples (1,673 samples)
- 1.7% of samples (1,121 samples)
- 1.7% of samples (1,073 samples)
- 1.3% of samples (864 samples)
- 1.3% of samples (857 samples)
- 1.3% of samples (841 samples)
- 1.2% of samples (784 samples)
- 1% of samples (651 samples)
- 1% of samples (646 samples)
- 55.5% of samples (35,846 samples)
The symbol mix remains the same shell/kernel workload. The improvement comes from memory behavior, not from running a different software path.
Honest Status
| check | status |
|---|---|
| Direct-mapped word D-cache in RTL | PASS |
| Aligned LW hit bypass | PASS |
| Aligned SW write-through update/allocate | PASS |
| BusyBox shell workload runs | PASS |
| D-cache counters captured | PASS |
| Subword store merge | NOT RUN |
| Multi-word line fill | NOT RUN |
| Nonblocking D-cache miss handling | NOT RUN |
| Shell-window speedup vs P94 | PASS |
| LibreLane hardening | NOT RUN |
Next
P97 should try the data-cache policy that P96 points toward: four-word lines, critical-word-first response, and background fill through the existing P94 arbiter counters. The constraint is clear: do not block the core just to fill the rest of a line.
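A minimal sketch of the critical-word-first ordering proposed for P97, assuming 16-byte lines (four 32-bit words) and a wrap-around fill order after the requested word. The function name `fill_order` and the wrap-around choice are assumptions; the plan above only fixes the line size and the critical-word-first response.

```python
LINE_WORDS = 4                        # four-word lines, per the P97 plan
WORD_BYTES = 4                        # 32-bit words
LINE_BYTES = LINE_WORDS * WORD_BYTES  # 16-byte lines


def fill_order(miss_paddr):
    """Word addresses to fetch for a miss, requested word first.

    The first address is the one the core is waiting on, so the core can
    resume once it returns; the remaining three belong to the background
    fill. Wrap-around order after the critical word is an assumption.
    """
    line_base = miss_paddr & ~(LINE_BYTES - 1)
    critical = (miss_paddr // WORD_BYTES) % LINE_WORDS
    return [
        line_base + ((critical + step) % LINE_WORDS) * WORD_BYTES
        for step in range(LINE_WORDS)
    ]


order = fill_order(0x1008)
print([hex(addr) for addr in order])
```

For a miss at `0x1008` the order is `0x1008`, `0x100c`, `0x1000`, `0x1004`: the core would resume after the first response, and the other three requests are the ones that must not block it.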