P107 is the first lower-bank rung where the core consumes the auxiliary response. It keeps the consumer deliberately narrow: only blocked D-cache background fills use the new response. Demand loads, demand fetch, page-table walks, AMOs, and stores still use the original memory path.
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P107-FILE-OK | PASS |
| Auxiliary response consumed by D-cache background fill | PASS |
| Auxiliary read errors | PASS |
| Hardened layout | NOT RUN |
Timing
| metric | P106 contract | P107 aux D-cache fill |
|---|---|---|
| post-load cycles | 219,613,584 | 219,407,400 |
| shell window cycles | 65,558,077 | 65,269,213 |
| retired instructions | 86,478,207 | 86,411,402 |
| CPI | 2.5395 | 2.5391 |
| BusyBox ready milestone | 118,413,096 | 118,427,145 |
| shell FILE-OK milestone | 219,613,727 | 219,407,543 |
| kernel panic milestone | 0 | 0 |
This is a small but real speedup over P106: the shell window drops by 288,864 cycles (65,558,077 to 65,269,213), or about 0.44%. It is not enough to declare the memory arc solved.
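The speedup figure can be reproduced directly from the Timing table. A minimal sketch; the variable names are only for illustration:

```python
# Shell-window cycle counts copied from the Timing table.
p106_shell_cycles = 65_558_077
p107_shell_cycles = 65_269_213

# Absolute and relative saving of P107 against the P106 baseline.
saved_cycles = p106_shell_cycles - p107_shell_cycles
saved_pct = 100.0 * saved_cycles / p106_shell_cycles

print(f"saved {saved_cycles:,} cycles ({saved_pct:.2f}%)")
```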
Auxiliary Consumer
The P106 lane now has response inputs:
banked_aux_ready, banked_aux_rdata, banked_aux_error
P107 consumes the response only when the blocked auxiliary request is
the D-cache background-fill descriptor. On a valid response, the core
marks the current D-cache fill word valid, stores banked_aux_rdata,
advances the fill pointer, and increments an auxiliary fill counter.
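The consume rule above can be sketched as a small behavioral model. This is not the RTL: only banked_aux_ready, banked_aux_rdata, and banked_aux_error come from the design. The class name, the blocked_is_dcache_bg_fill flag, the line size, and the error handling (count the error and consume nothing) are assumptions made for illustration.

```python
class DcacheAuxFillModel:
    """Toy per-cycle model of the P107 consumer (hypothetical names)."""

    def __init__(self, words_per_line=8):  # line size is an assumption
        self.words_per_line = words_per_line
        self.word_valid = [False] * words_per_line
        self.word_data = [0] * words_per_line
        self.fill_ptr = 0
        self.aux_fill_count = 0
        self.aux_error_count = 0

    def step(self, blocked_is_dcache_bg_fill, banked_aux_ready,
             banked_aux_rdata, banked_aux_error):
        """Evaluate one cycle; return True if the response was consumed."""
        # Only the D-cache background-fill descriptor may consume a response.
        if not (blocked_is_dcache_bg_fill and banked_aux_ready):
            return False
        # Assumed policy: an errored response is counted and dropped.
        if banked_aux_error:
            self.aux_error_count += 1
            return False
        # Nothing left to fill in this line.
        if self.fill_ptr >= self.words_per_line:
            return False
        # Mark the word valid, store the data, advance, and count the fill.
        self.word_valid[self.fill_ptr] = True
        self.word_data[self.fill_ptr] = banked_aux_rdata
        self.fill_ptr += 1
        self.aux_fill_count += 1
        return True


model = DcacheAuxFillModel(words_per_line=4)
model.step(False, True, 0x1111, False)  # other descriptor: ignored
model.step(True, True, 0x2222, False)   # background fill: consumed
print(model.fill_ptr, model.aux_fill_count)
```

In this model the auxiliary fill counter only moves when a word is actually written, which is the property the counter table below checks.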
| counter | value |
|---|---|
| auxiliary instruction reads serviced | 488,267 |
| auxiliary data reads serviced | 10,065,241 |
| auxiliary reads serviced total | 10,553,508 |
| shell-window auxiliary reads | 4,426,763 |
| auxiliary read errors | 0 |
| D-cache aux background fills consumed | 10,065,241 |
| shell-window D-cache aux fills consumed | 4,099,429 |
| core aux-fill counter | 10,065,241 |
| auxiliary read checksum | 2,929,617,952 |
The consumed-fill count and the core counter match exactly at 10,065,241, which is also the number of auxiliary data reads serviced.
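The counter table is internally consistent, and the checks are easy to script. A minimal sketch using the values above; the dictionary keys are shorthand, not real signal or counter names:

```python
# Values copied from the counter table.
counters = {
    "aux_instr_reads": 488_267,
    "aux_data_reads": 10_065_241,
    "aux_reads_total": 10_553_508,
    "dcache_aux_fills_consumed": 10_065_241,
    "core_aux_fill_counter": 10_065_241,
    "aux_read_errors": 0,
}

# Instruction plus data reads should add up to the serviced total.
total_ok = (counters["aux_instr_reads"] + counters["aux_data_reads"]
            == counters["aux_reads_total"])

# The fills seen outside the core should equal the core's own counter.
fills_ok = (counters["dcache_aux_fills_consumed"]
            == counters["core_aux_fill_counter"])

print("total consistent:", total_ok)
print("fill counters match:", fills_ok)
```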
D-cache Shape
| counter | P106 | P107 |
|---|---|---|
| load hits | 4,605,965 | 4,603,565 |
| load misses | 5,417,820 | 5,417,845 |
| demand fills | 5,417,820 | 5,417,845 |
| background fills | 438,883 | 10,241,841 |
| aux background fills | 0 | 10,065,241 |
| background active cycles | 55,759,373 | 55,537,629 |
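P107 turns the auxiliary lane into a lot of cache maintenance work: background fills rise from 438,883 to 10,241,841, of which 10,065,241 come through the auxiliary lane. Demand load misses barely move (5,417,820 to 5,417,845), which explains the modest speedup.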
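The per-counter deltas make the imbalance explicit. A small sketch over the D-cache table; the dictionary layout is just a convenience:

```python
# (P106, P107) pairs copied from the D-cache table.
dcache = {
    "load hits": (4_605_965, 4_603_565),
    "load misses": (5_417_820, 5_417_845),
    "demand fills": (5_417_820, 5_417_845),
    "background fills": (438_883, 10_241_841),
    "aux background fills": (0, 10_065_241),
    "background active cycles": (55_759_373, 55_537_629),
}

# Delta is P107 minus P106; positive means the counter grew.
delta = {name: p107 - p106 for name, (p106, p107) in dcache.items()}

# Share of P107 background fills that arrived over the auxiliary lane.
aux_share_pct = (100.0 * dcache["aux background fills"][1]
                 / dcache["background fills"][1])

for name, d in delta.items():
    print(f"{name:26s} {d:+,}")
print(f"aux share of background fills: {aux_share_pct:.1f}%")
```

Background fills grow by millions while load misses change by 25, so almost none of the extra fill traffic converts a demand miss into a hit.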
Memory Stalls
- instruction fetch 28,293,238 48.4% 45,777,433 req
- data load 10,378,588 17.7% 559,131 req
- data store 10,940,802 18.7% 77,582 req
- atomic memory op 173,764 0.3% 167,687 req
- page walk for fetch 681,848 1.2% 675,694 req
- page walk for load/store 673,456 1.2% 667,265 req
- other 7,351,414 12.6% 16,866,351 req
The main shared response path is still the path that demand work uses. P107 only consumes auxiliary data for background fill.
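The percentage column in the stall list can be recomputed from the first numeric column. A sketch, assuming that column is the stall count per source and that the shares are taken over the sum of the listed rows:

```python
# First numeric column of the Memory Stalls list, with the listed share.
stalls = {
    "instruction fetch": (28_293_238, 48.4),
    "data load": (10_378_588, 17.7),
    "data store": (10_940_802, 18.7),
    "atomic memory op": (173_764, 0.3),
    "page walk for fetch": (681_848, 1.2),
    "page walk for load/store": (673_456, 1.2),
    "other": (7_351_414, 12.6),
}

total_stalls = sum(count for count, _ in stalls.values())

# Recompute each share and compare against the listed value.
shares = {name: round(100.0 * count / total_stalls, 1)
          for name, (count, _) in stalls.items()}

for name, (count, listed) in stalls.items():
    print(f"{name:26s} {shares[name]:5.1f}% (listed {listed}%)")
```

Demand-side sources (fetch, load, store, AMO, page walks) account for everything except the "other" row, and none of them use the auxiliary response yet.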
Shell Phases
- kernel banner to /init 116,720,762 53.4%
- /init to shell banner 1,078,173 0.5%
- shell banner to first command 35,711,185 16.3%
- echo command 1,649 0%
- uname -a 2,560,223 1.2%
- ls /bin /usr/share 32,199,160 14.7%
- cat sample file 3,161,684 1.5%
- touch/write/cat/rm /tmp file 10,996,952 5%
- 8x ash loop with file I/O 16,348,865 7.5%
- final marker 680 0%
The shell script reaches P107-FILE-OK.
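The phase list ties back to the Timing table: summing the phases from the echo command through the final marker reproduces the P107 shell window cycle count. A short check, with phase labels copied from the list above:

```python
# Cycle counts for the command phases, copied from the Shell Phases list.
command_phases = {
    "echo command": 1_649,
    "uname -a": 2_560_223,
    "ls /bin /usr/share": 32_199_160,
    "cat sample file": 3_161_684,
    "touch/write/cat/rm /tmp file": 10_996_952,
    "8x ash loop with file I/O": 16_348_865,
    "final marker": 680,
}

shell_window_cycles = 65_269_213  # P107 value from the Timing table

phase_sum = sum(command_phases.values())
print(phase_sum, phase_sum == shell_window_cycles)
```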
Cycle Shape
- fetch 3.7% 8,118,654
- execute 39.4% 86,436,270
- mem 12.8% 28,058,018
- walker 1.2% 2,698,263
- writeback 39.4% 86,411,402
- mul/div 3.5% 7,683,077
P107 retires 86.41M instructions at CPI 2.5391.
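The CPI values follow from the Timing table, assuming CPI here is post-load cycles divided by retired instructions (that definition reproduces both table entries):

```python
# (post-load cycles, retired instructions) from the Timing table.
runs = {
    "P106": (219_613_584, 86_478_207),
    "P107": (219_407_400, 86_411_402),
}

# CPI = cycles per retired instruction, rounded as in the table.
cpi = {name: round(cycles / instrs, 4)
       for name, (cycles, instrs) in runs.items()}

for name, value in cpi.items():
    print(f"{name}: CPI {value:.4f}")
```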
Hot Functions
- 5.5% of samples (3,506 samples)
- 5.2% of samples (3,296 samples)
- 3.8% of samples (2,418 samples)
- 3.3% of samples (2,079 samples)
- 2.8% of samples (1,799 samples)
- 2.7% of samples (1,696 samples)
- 2.6% of samples (1,677 samples)
- 1.7% of samples (1,063 samples)
- 1.6% of samples (1,020 samples)
- 1.3% of samples (857 samples)
- 1.3% of samples (812 samples)
- 1.3% of samples (794 samples)
- 1.2% of samples (739 samples)
- 1% of samples (656 samples)
- 1% of samples (619 samples)
- 55.6% of samples (35,462 samples)
The software workload is the same shell script. The experiment is the hardware response path.
Next
P108 should consume auxiliary data in a path that can directly shorten a stall. Instruction-side prefetch/background fill is the safer next step; a tagged demand fetch/load response path is more valuable but needs proper ownership and cancellation rules.