P104 measures the lower-memory part of the Harvard problem. The core now has instruction-side and data-side service counters, split ITLB/DTLB storage, and a repaired one-entry store buffer. The remaining question is where those separated requests collapse back into one shared lower port.
This rung does not add a faster RAM. It overlays a hypothetical four-bank, word-interleaved lower-memory map on the existing single-port arbiter and records whether simultaneous instruction/data requests would hit the same bank or different banks.
Result
| check | result |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| Linux reaches /init | PASS |
| BusyBox prompt | PASS |
| BusyBox shell workload reaches P104-FILE-OK | PASS |
| Lower-bank conflict counters emitted | PASS |
| Hardened layout | NOT RUN |
Timing
| metric | P103 repaired store buffer | P104 bank counters |
|---|---|---|
| post-load cycles | 218,842,451 | 219,172,843 |
| shell window cycles | 64,809,989 | 65,062,462 |
| retired instructions | 86,218,075 | 86,339,942 |
| CPI | 2.5382 | 2.5385 |
| BusyBox ready milestone | 118,415,663 | 118,418,832 |
| shell FILE-OK milestone | 218,842,594 | 219,172,986 |
| kernel panic milestone | 0 | 0 |
P104 is a functional and measurement PASS, not a speed PASS. The shell window is 0.39% slower than P103, which is acceptable for simulator-side instrumentation but not a performance claim.
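The slowdown and CPI figures can be rechecked from the Timing table. This is a minimal sketch, assuming CPI is post-load cycles divided by retired instructions. The report does not state that definition, but it reproduces the table values. The dictionary and function names are illustrative only.

```python
# Cycle and instruction counts copied from the Timing table.
P103 = {"post_load": 218_842_451, "shell_window": 64_809_989, "retired": 86_218_075}
P104 = {"post_load": 219_172_843, "shell_window": 65_062_462, "retired": 86_339_942}


def slowdown_pct(old: int, new: int) -> float:
    """Percent increase of `new` over `old`."""
    return (new - old) / old * 100.0


def cpi(cycles: int, retired: int) -> float:
    """Cycles per retired instruction (assumed: post-load cycles / retired)."""
    return cycles / retired


shell_slowdown = slowdown_pct(P103["shell_window"], P104["shell_window"])
p103_cpi = cpi(P103["post_load"], P103["retired"])
p104_cpi = cpi(P104["post_load"], P104["retired"])

print(f"shell window slowdown: {shell_slowdown:.2f}%")
print(f"P103 CPI: {p103_cpi:.4f}")
print(f"P104 CPI: {p104_cpi:.4f}")
```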
Bank Model
The bank mapping is intentionally plain:
bank = physical_address[3:2]
That gives four banks interleaved at aligned-word granularity. Demand fetch, prefetch, I-cache fill, and instruction-side page-table walks are instruction-side traffic. Store-buffer drains, loads, stores, FP memory, AMOs, D-cache fill, and data-side page-table walks are data-side traffic.
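The mapping above can be sketched in a few lines. This is a software illustration only, not the counter logic in the core. The addresses are made up for the example, and `bank_of` and `classify` are hypothetical helper names.

```python
NUM_BANKS = 4  # four word-interleaved banks, as in the P104 overlay


def bank_of(physical_address: int) -> int:
    """Return address bits [3:2], the bank index for aligned 4-byte words."""
    return (physical_address >> 2) & (NUM_BANKS - 1)


def classify(i_addr: int, d_addr: int) -> str:
    """Label a simultaneous instruction/data want as same-bank or split-bank."""
    return "same-bank" if bank_of(i_addr) == bank_of(d_addr) else "split-bank"


# Consecutive aligned words rotate through banks 0..3 and wrap every 16 bytes.
# These addresses are hypothetical and chosen only to show the rotation.
for addr in (0x8000_0000, 0x8000_0004, 0x8000_0008, 0x8000_000C, 0x8000_0010):
    print(hex(addr), "-> bank", bank_of(addr))
```

Bytes within one aligned word share a bank because bits [1:0] are ignored, so two requests collide only when their word addresses match modulo four.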
Bank Conflicts
| counter | value |
|---|---|
| both I/D want lower memory | 28,801,693 |
| same-bank wants | 8,236,861 |
| split-bank wants | 20,564,832 |
| same-bank share | 28.60% |
| split-bank share | 71.40% |
| split-bank parallel opportunity | 20,564,832 |
| instruction blocked by lower memory | 631,885 |
| data blocked by lower memory | 32,443,689 |
| data blocked during split-bank wants | 20,076,006 |
| data blocked during same-bank wants | 8,093,802 |
This is the useful result. In 71.40% of simultaneous instruction/data lower-memory wants, the two sides target different word-interleaved banks. The current implementation still serializes those cycles through one lower port.
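The shares follow directly from the raw counters. A short sketch that rechecks them is below. The variable names are illustrative, and the leftover term is simply what the table does not attribute to either category, with no further interpretation implied.

```python
# Counters copied from the Bank Conflicts table.
both_want = 28_801_693
same_bank = 8_236_861
split_bank = 20_564_832
data_blocked = 32_443_689
data_blocked_split = 20_076_006
data_blocked_same = 8_093_802

# Every simultaneous want is either same-bank or split-bank.
assert same_bank + split_bank == both_want

same_share = same_bank / both_want * 100.0
split_share = split_bank / both_want * 100.0

# Fraction of data-side blocked cycles that fall in split-bank wants.
data_blocked_split_share = data_blocked_split / data_blocked * 100.0

# Data-side blocked cycles outside both categories in the table.
data_blocked_other = data_blocked - data_blocked_split - data_blocked_same

print(f"same-bank share:  {same_share:.2f}%")
print(f"split-bank share: {split_share:.2f}%")
print(f"data blocked in split-bank wants: {data_blocked_split_share:.1f}% of data blocked")
print(f"data blocked, uncategorized: {data_blocked_other:,}")
```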
Per-Bank Demand
| bank | I want | D want | I grant | D grant |
|---|---|---|---|---|
| 0 | 27,689,968 | 18,558,571 | 27,511,963 | 6,390,833 |
| 1 | 23,411,126 | 13,605,291 | 23,288,613 | 6,371,689 |
| 2 | 24,122,967 | 13,046,541 | 23,988,410 | 6,556,003 |
| 3 | 23,642,775 | 13,715,086 | 23,445,965 | 7,163,275 |
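Bank 0 sees the most demand on both sides (27.69M instruction wants and 18.56M data wants, against 23.41M to 24.12M and 13.05M to 13.72M for banks 1 to 3), but the workload is not so skewed that banking looks pointless. The next experiment has a real target: convert different-bank I/D overlap into independent lower-memory progress.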
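To put a number on the skew, the per-bank table can be reduced to a demand share and a grant-to-want ratio per side. This is a rough sketch. The table does not say whether wants are counted per cycle or per request, so the ratios are only a relative comparison between banks, not an acceptance rate.

```python
# Rows copied from the Per-Bank Demand table.
banks = {
    0: {"i_want": 27_689_968, "d_want": 18_558_571, "i_grant": 27_511_963, "d_grant": 6_390_833},
    1: {"i_want": 23_411_126, "d_want": 13_605_291, "i_grant": 23_288_613, "d_grant": 6_371_689},
    2: {"i_want": 24_122_967, "d_want": 13_046_541, "i_grant": 23_988_410, "d_grant": 6_556_003},
    3: {"i_want": 23_642_775, "d_want": 13_715_086, "i_grant": 23_445_965, "d_grant": 7_163_275},
}

total_want = sum(c["i_want"] + c["d_want"] for c in banks.values())

summary = {}
for bank, c in banks.items():
    summary[bank] = {
        # Share of all I+D wants landing on this bank (25% would be uniform).
        "demand_share": (c["i_want"] + c["d_want"]) / total_want,
        "i_ratio": c["i_grant"] / c["i_want"],
        "d_ratio": c["d_grant"] / c["d_want"],
    }

for bank, s in summary.items():
    print(
        f"bank {bank}: demand {s['demand_share']:.1%}, "
        f"I grant/want {s['i_ratio']:.3f}, D grant/want {s['d_ratio']:.3f}"
    )
```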
What Other Cores Do
XiangShan/Kunminghu does not treat this as one shared blocking pipe. Its L1 instruction cache has prefetching and configurable MSHRs for fetch and prefetch misses, and its data cache is a larger write-back/write-allocate structure with miss, AMO, writeback, and coherence machinery. The XiangShan load-unit documentation also shows load pipeline stages checking D-cache, TLB, LoadQueue, and StoreQueue/sbuffer state instead of waiting on a single memory state.
The XiangShan store queue is the grown-up version of our P102/P103 store-buffer lesson: it tracks address/data validity, committed state, store-to-load forwarding, and store-buffer handoff. CVA6 is more modest but still has the same separation pressure: the OpenHW docs describe a six-stage application core with fast instruction cache hits, longer data cache hits, and a configurable write-through data-cache write-buffer depth.
P104 is not copying those cores. It is measuring the smallest local version of the same architectural move: split near-core service, then make lower-level conflicts explicit.
Memory Stalls
- instruction fetch: 27,412,460 (46.6%), 46,865,172 req
- data load: 11,644,533 (19.8%), 559,598 req
- data store: 10,926,460 (18.6%), 77,367 req
- atomic memory op: 174,147 (0.3%), 167,306 req
- page walk for fetch: 682,484 (1.2%), 676,330 req
- page walk for load/store: 670,095 (1.1%), 663,907 req
- other: 7,315,546 (12.4%), 16,881,346 req
The lower-port stall bucket does not improve yet because the hardware is still one port. P104 tells us how much of that bucket is plausibly different-bank work.
Shell Phases
- kernel banner to /init 116,716,932 53.4%
- /init to shell banner 1,073,690 0.5%
- shell banner to first command 35,691,692 16.3%
- echo command 1,649 0%
- uname -a 2,381,786 1.1%
- ls /bin /usr/share 31,886,575 14.6%
- cat sample file 3,180,895 1.5%
- touch/write/cat/rm /tmp file 9,627,799 4.4%
- 8x ash loop with file I/O 17,983,078 8.2%
- final marker 680 0%
The BusyBox script reaches P104-FILE-OK, including uname, ls,
cat, touch, and the temp-file loop.
Cycle Shape
- fetch 3.7% 8,126,711
- execute 39.4% 86,365,066
- mem 12.8% 28,021,522
- walker 1.2% 2,692,816
- writeback 39.4% 86,339,942
- mul/div 3.5% 7,625,070
P104 retires 86.34M instructions at CPI 2.5385. Treat this as the baseline for the next lower-memory service experiment.
Hot Functions
- 5.6% of samples (3,563 samples)
- 5.2% of samples (3,332 samples)
- 3.6% of samples (2,309 samples)
- 3.3% of samples (2,107 samples)
- 2.8% of samples (1,798 samples)
- 2.7% of samples (1,725 samples)
- 2.7% of samples (1,695 samples)
- 1.7% of samples (1,102 samples)
- 1.6% of samples (995 samples)
- 1.4% of samples (885 samples)
- 1.3% of samples (847 samples)
- 1.3% of samples (844 samples)
- 1.2% of samples (764 samples)
- 1.1% of samples (668 samples)
- 1% of samples (644 samples)
- 55.3% of samples (35,120 samples)
The hot-symbol shape remains the shell workload. The architectural result
is in the new lower_banks block, not in a new user program.
Next
P105 should move beyond counting. Add a bank-aware lower-memory service model that can grant instruction and data requests in the same cycle when their bank numbers differ, while preserving single-bank arbitration and store/order rules. Then rerun the exact BusyBox shell profile and see whether the 20.56M split-bank opportunity cycles turn into lower CPI.
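A minimal sketch of the grant rule P105 would need, written as a software model rather than RTL. Everything here is an assumption for illustration: the fixed instruction-first priority on conflicts, the `Grant` type, the function names, and the toy trace. Store/order rules and retry of blocked requests are not modeled. The last function gives an optimistic CPI ceiling that assumes every split-bank opportunity cycle saves exactly one cycle, which the real core may not achieve.

```python
from dataclasses import dataclass
from typing import Optional

NUM_BANKS = 4


def bank_of(physical_address: int) -> int:
    """Bank index from address bits [3:2]."""
    return (physical_address >> 2) & (NUM_BANKS - 1)


@dataclass(frozen=True)
class Grant:
    instruction: bool
    data: bool


def single_port_grant(i_addr: Optional[int], d_addr: Optional[int],
                      prefer_data: bool = False) -> Grant:
    """One shared lower port: at most one side is granted per cycle."""
    if i_addr is not None and d_addr is not None:
        # Assumed fixed priority; the real arbiter policy may differ.
        return Grant(instruction=not prefer_data, data=prefer_data)
    return Grant(instruction=i_addr is not None, data=d_addr is not None)


def banked_grant(i_addr: Optional[int], d_addr: Optional[int],
                 prefer_data: bool = False) -> Grant:
    """Grant both sides when they target different banks, else arbitrate."""
    both = i_addr is not None and d_addr is not None
    if both and bank_of(i_addr) != bank_of(d_addr):
        return Grant(instruction=True, data=True)
    return single_port_grant(i_addr, d_addr, prefer_data)


def optimistic_cpi(cycles: int, retired: int, overlap_cycles: int) -> float:
    """CPI if every overlap cycle were removed (a ceiling, not a prediction)."""
    return (cycles - overlap_cycles) / retired


# Hypothetical per-cycle trace of (instruction address, data address).
trace = [
    (0x1000, 0x2004),  # banks 0 and 1: split
    (0x1004, 0x2014),  # banks 1 and 1: same
    (0x1008, None),    # instruction only
    (None, 0x200C),    # data only
    (0x100C, 0x2000),  # banks 3 and 0: split
]
dual = sum(1 for i, d in trace if banked_grant(i, d) == Grant(True, True))
best_case = optimistic_cpi(219_172_843, 86_339_942, 20_564_832)

print("dual-grant cycles in toy trace:", dual)
print(f"optimistic CPI ceiling: {best_case:.4f}")
```

Comparing the measured P105 CPI against both 2.5385 and this ceiling would show how much of the split-bank opportunity actually converts into progress.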