P108 adds an instruction-side consumer for the auxiliary lower-bank response. P107 used the second response for D-cache background fill. P108 also uses it when writeback prefetch is blocked by data-side work: the auxiliary response fills the target I-cache word.
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P108-FILE-OK | PASS |
| Auxiliary I-cache prefetch response consumed | PASS |
| Zero auxiliary read errors | PASS |
| Hardened layout | NOT RUN |
Timing
| metric | P107 aux D-cache fill | P108 aux I-cache fill |
|---|---|---|
| post-load cycles | 219,407,400 | 218,960,570 |
| shell window cycles | 65,269,213 | 64,823,658 |
| retired instructions | 86,411,402 | 86,260,233 |
| CPI | 2.5391 | 2.5384 |
| BusyBox ready milestone | 118,427,145 | 118,420,395 |
| shell FILE-OK milestone | 219,407,543 | 218,960,713 |
| kernel panic milestone | 0 | 0 |
P108 improves the shell window by 445,555 cycles versus P107, about 0.68%. Against P106, the two auxiliary-response consumers have cut 734,419 shell-window cycles.
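The shell-window comparison can be reproduced from the Timing table. The sketch below is only a worked check of that arithmetic; the constant and function names are illustrative and not part of the project's tooling.

```python
# Shell-window cycle counts copied from the Timing table.
P107_SHELL_WINDOW = 65_269_213
P108_SHELL_WINDOW = 64_823_658


def cycle_savings(before: int, after: int) -> tuple[int, float]:
    """Return (cycles saved, percent saved relative to `before`)."""
    if before <= 0:
        raise ValueError("baseline cycle count must be positive")
    saved = before - after
    return saved, 100.0 * saved / before


saved, pct = cycle_savings(P107_SHELL_WINDOW, P108_SHELL_WINDOW)
print(f"{saved:,} cycles saved ({pct:.2f}%)")  # 445,555 cycles saved (0.68%)
```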
Auxiliary Consumers
P108 consumes auxiliary data in two places; the third candidate, I-cache background fill, recorded no fills:
| consumer | consumed fills | shell-window fills |
|---|---|---|
| I-cache writeback prefetch | 488,027 | 327,106 |
| D-cache background fill | 10,034,869 | 4,069,556 |
| I-cache background fill | 0 | 0 |
The zero for I-cache background fill is not a counter bug. Under the current priority policy, I-cache background fill usually wins arbitration ahead of data background work, so it does not end up waiting on the auxiliary response. Blocked writeback prefetch is therefore the real instruction-side opportunity for the auxiliary path.
| counter | value |
|---|---|
| auxiliary instruction reads serviced | 488,027 |
| auxiliary data reads serviced | 10,034,869 |
| auxiliary reads serviced total | 10,522,896 |
| shell-window auxiliary reads | 4,396,662 |
| auxiliary read errors | 0 |
| auxiliary read checksum | 1,087,009,691 |
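The per-consumer fills and the serviced-read counters should agree with each other. A minimal consistency check, assuming the counters are related by plain sums (the dictionary keys and function name are hypothetical):

```python
# Counter values copied from the two Auxiliary Consumers tables.
# Each entry is (consumed fills, shell-window fills).
CONSUMED = {
    "icache_writeback_prefetch": (488_027, 327_106),
    "dcache_background_fill": (10_034_869, 4_069_556),
    "icache_background_fill": (0, 0),
}
REPORTED_TOTAL = 10_522_896
REPORTED_SHELL_WINDOW = 4_396_662


def check_totals(consumed, total, shell_total):
    """Return a list of mismatch descriptions; an empty list means consistent."""
    problems = []
    all_sum = sum(fills for fills, _ in consumed.values())
    shell_sum = sum(shell for _, shell in consumed.values())
    if all_sum != total:
        problems.append(f"total: consumers sum to {all_sum:,}, counter says {total:,}")
    if shell_sum != shell_total:
        problems.append(f"shell window: consumers sum to {shell_sum:,}, counter says {shell_total:,}")
    return problems


problems = check_totals(CONSUMED, REPORTED_TOTAL, REPORTED_SHELL_WINDOW)
print(problems or "consistent")  # consistent
```

A check like this would catch a consumer that takes an auxiliary response without bumping its own counter.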
Cache Shape
| counter | value |
|---|---|
| I-cache hits | 43,271,616 |
| I-cache fetch-state hits | 4,938,617 |
| I-cache writeback-prefetch hits | 38,332,999 |
| I-cache miss refills | 45,205,080 |
| I-cache aux prefetch fills | 488,027 |
| D-cache load hits | 4,579,787 |
| D-cache load misses | 5,404,513 |
| D-cache aux background fills | 10,034,869 |
The speedup is real but modest because P108 only fills cache state: an auxiliary response cannot yet directly complete a stalled demand fetch or demand load.
Memory Stalls
| stall source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 28,239,593 | 48.3% | 45,205,080 |
| data load | 10,362,110 | 17.7% | 561,037 |
| data store | 10,914,944 | 18.7% | 77,240 |
| atomic memory op | 173,730 | 0.3% | 166,815 |
| page walk for fetch | 678,767 | 1.2% | 672,613 |
| page walk for load/store | 677,027 | 1.2% | 670,851 |
| other | 7,401,286 | 12.7% | 16,774,601 |
The lower-memory conflict is smaller, but demand-visible stalls remain.
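To see where the demand-visible stalls concentrate, the Memory Stalls figures can be turned into shares and an average cost per request. This sketch assumes the first number in each entry is stall cycles and the `req` figure is a request count; the names are illustrative.

```python
# (stall cycles, requests) per source, copied from the Memory Stalls figures.
STALLS = {
    "instruction fetch": (28_239_593, 45_205_080),
    "data load": (10_362_110, 561_037),
    "data store": (10_914_944, 77_240),
    "atomic memory op": (173_730, 166_815),
    "page walk for fetch": (678_767, 672_613),
    "page walk for load/store": (677_027, 670_851),
    "other": (7_401_286, 16_774_601),
}


def stall_shares(stalls):
    """Percent of all stall cycles attributed to each source."""
    total = sum(cycles for cycles, _ in stalls.values())
    return {name: 100.0 * cycles / total for name, (cycles, _) in stalls.items()}


def cycles_per_request(stalls):
    """Average stall cycles per request for each source."""
    return {name: cycles / reqs for name, (cycles, reqs) in stalls.items()}


shares = stall_shares(STALLS)
per_req = cycles_per_request(STALLS)
for name in STALLS:
    print(f"{name:26s} {shares[name]:5.1f}%  {per_req[name]:8.2f} cycles/req")
```

Under these assumptions, instruction fetch dominates by volume while data stores are by far the most expensive per request, which is one way to read the choice of demand fetch as the next target.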
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 116,720,090 | 53.5% |
| /init to shell banner | 1,072,095 | 0.5% |
| shell banner to first command | 35,716,660 | 16.4% |
| echo command | 1,649 | 0% |
| uname -a | 2,407,721 | 1.1% |
| ls /bin /usr/share | 32,106,012 | 14.7% |
| cat sample file | 3,070,534 | 1.4% |
| touch/write/cat/rm /tmp file | 11,008,246 | 5% |
| 8x ash loop with file I/O | 16,228,816 | 7.4% |
| final marker | 680 | 0% |
The shell script reaches P108-FILE-OK.
Cycle Shape
| stage | share | cycles |
|---|---|---|
| fetch | 3.7% | 8,109,410 |
| execute | 39.4% | 86,284,961 |
| mem | 12.8% | 27,992,010 |
| walker | 1.2% | 2,699,258 |
| writeback | 39.4% | 86,260,233 |
| mul/div | 3.5% | 7,612,982 |
P108 retires 86.26M instructions at CPI 2.5384.
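The reported CPI is consistent with dividing post-load cycles by retired instructions. The sketch below checks that for both runs; treating post-load cycles as the numerator is an assumption that happens to match the published figures, and the names are illustrative.

```python
def cpi(cycles: int, retired: int) -> float:
    """Cycles per retired instruction."""
    if retired <= 0:
        raise ValueError("retired instruction count must be positive")
    return cycles / retired


# (post-load cycles, retired instructions) from the Timing table.
RUNS = {
    "P107": (219_407_400, 86_411_402),
    "P108": (218_960_570, 86_260_233),
}

for name, (cycles, retired) in RUNS.items():
    print(f"{name}: CPI {cpi(cycles, retired):.4f}")
# P107: CPI 2.5391
# P108: CPI 2.5384
```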
Hot Functions
- 5.7% of samples (3,576 samples)
- 5.2% of samples (3,300 samples)
- 3.7% of samples (2,353 samples)
- 3.4% of samples (2,121 samples)
- 2.8% of samples (1,791 samples)
- 2.7% of samples (1,721 samples)
- 2.7% of samples (1,719 samples)
- 1.7% of samples (1,067 samples)
- 1.6% of samples (1,036 samples)
- 1.4% of samples (862 samples)
- 1.4% of samples (856 samples)
- 1.3% of samples (821 samples)
- 1.2% of samples (761 samples)
- 1% of samples (648 samples)
- 1% of samples (647 samples)
- 55.1% of samples (34,850 samples)
The software workload is unchanged; the experiment is the instruction prefetch response path.
Next
P109 should add a tiny ownership tracker for a demand-visible auxiliary response. Demand fetch is probably the first target: tag the auxiliary word by physical address, cancel it on PC or translation changes, and only then let it advance architectural fetch.
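One possible shape for that tracker, written as a small behavioral model rather than RTL. Everything here is a hypothetical sketch: the class, method names, and the single-outstanding-request assumption are not taken from the P109 design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuxFetchTag:
    phys_addr: int  # physical address of the word requested on the auxiliary path
    fetch_pc: int   # PC the demand fetch was issued for


class AuxOwnershipTracker:
    """Tracks at most one demand-visible auxiliary fetch (assumed limit)."""

    def __init__(self) -> None:
        self.pending: Optional[AuxFetchTag] = None

    def issue(self, phys_addr: int, fetch_pc: int) -> None:
        """Record ownership when a demand fetch is sent down the auxiliary path."""
        self.pending = AuxFetchTag(phys_addr, fetch_pc)

    def on_pc_change(self, new_pc: int) -> None:
        """Cancel ownership if fetch has moved away from the tagged PC."""
        if self.pending is not None and self.pending.fetch_pc != new_pc:
            self.pending = None

    def on_translation_change(self) -> None:
        """Cancel ownership on any translation change; the physical tag may be stale."""
        self.pending = None

    def on_response(self, phys_addr: int, word: int) -> Optional[int]:
        """Return the word only if it may advance architectural fetch.

        A None result means the response is not demand-visible; in this model
        the caller would fall back to a plain cache fill, as P108 does today.
        """
        tag = self.pending
        if tag is None or tag.phys_addr != phys_addr:
            return None
        self.pending = None
        return word
```

A short usage example: after `issue(0x8000_1000, 0xC000_1000)`, a matching `on_response(0x8000_1000, word)` returns `word`, while the same response arriving after `on_pc_change(0xC000_2000)` or `on_translation_change()` returns `None`. Clearing the tag on cancel keeps the model small, at the cost of never recovering a response whose PC is later revisited.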