No. 104 / project of 147 on the ladder

Banked lower memory conflict counters

Introduces: word-interleaved lower-memory bank model; instruction/data same-bank conflict counters; split-bank opportunity accounting.

harden state (last run 2026-05-05)
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P104 measures the lower-memory part of the Harvard problem. The core now has instruction-side and data-side service counters, split ITLB/DTLB storage, and a repaired one-entry store buffer. The remaining question is where those separated requests collapse back into one shared lower port.

This rung does not add a faster RAM. It overlays a hypothetical four-bank, word-interleaved lower-memory map on the existing single-port arbiter and records whether simultaneous instruction/data requests would hit the same bank or different banks.

Result

check                                        result
make check-tools                             PASS
Verilator build                              PASS
Linux reaches /init                          PASS
BusyBox prompt                               PASS
BusyBox shell workload reaches P104-FILE-OK  PASS
Lower-bank conflict counters emitted         PASS
Hardened layout                              NOT RUN

Timing

metric                    P103 repaired store buffer  P104 bank counters
post-load cycles          218,842,451                 219,172,843
shell window cycles       64,809,989                  65,062,462
retired instructions      86,218,075                  86,339,942
CPI                       2.5382                      2.5385
BusyBox ready milestone   118,415,663                 118,418,832
shell FILE-OK milestone   218,842,594                 219,172,986
kernel panic milestone    0                           0

P104 is a functional and measurement PASS, not a speed PASS. The shell window is 0.39% slower than P103 (65,062,462 vs. 64,809,989 cycles). That overhead is acceptable for simulator-side instrumentation, but it is not a performance claim.
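The slowdown and CPI figures can be rechecked from the Timing table. The sketch below copies the table values by hand; the function and variable names are illustrative and are not part of the project's tooling.

```python
def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def cpi(cycles: int, retired: int) -> float:
    """Cycles per retired instruction."""
    return cycles / retired


# Values copied from the Timing table.
P103 = {"post_load": 218_842_451, "shell": 64_809_989, "retired": 86_218_075}
P104 = {"post_load": 219_172_843, "shell": 65_062_462, "retired": 86_339_942}

shell_slowdown = percent_change(P103["shell"], P104["shell"])
p103_cpi = cpi(P103["post_load"], P103["retired"])
p104_cpi = cpi(P104["post_load"], P104["retired"])

print(f"shell window slowdown: {shell_slowdown:.2f}%")
print(f"P103 CPI: {p103_cpi:.4f}  P104 CPI: {p104_cpi:.4f}")
```

The CPI values in the table are reproduced by dividing post-load cycles by retired instructions, so that is the definition assumed here.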

Bank Model

The bank mapping is intentionally plain:

bank = physical_address[3:2]

That gives four banks interleaved at aligned-word granularity. Demand fetch, prefetch, I-cache fill, and instruction-side page-table walks are instruction-side traffic. Store-buffer drains, loads, stores, FP memory, AMOs, D-cache fill, and data-side page-table walks are data-side traffic.
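As a minimal sketch of the mapping and of the same-bank versus split-bank classification, the following Python model selects the bank from address bits [3:2] and counts cycles in which both sides want lower memory. The class and field names are hypothetical and do not come from the RTL.

```python
from dataclasses import dataclass
from typing import Optional

NUM_BANKS = 4  # two address bits select the bank


def bank_of(physical_address: int) -> int:
    """Return bits [3:2] of the address: the word-interleaved bank index."""
    return (physical_address >> 2) & (NUM_BANKS - 1)


@dataclass
class LowerBankCounters:
    """Hypothetical counter set; field names are illustrative only."""
    both_want: int = 0
    same_bank: int = 0
    split_bank: int = 0

    def observe(self, i_addr: Optional[int], d_addr: Optional[int]) -> None:
        """Record one cycle; None means that side has no lower-memory request."""
        if i_addr is None or d_addr is None:
            return  # only simultaneous I/D wants are classified
        self.both_want += 1
        if bank_of(i_addr) == bank_of(d_addr):
            self.same_bank += 1
        else:
            self.split_bank += 1


counters = LowerBankCounters()
counters.observe(0x8000_0000, 0x8000_1004)  # banks 0 and 1: split
counters.observe(0x8000_0008, 0x8000_2018)  # banks 2 and 2: same
counters.observe(0x8000_000C, None)         # instruction side only: ignored
print(counters)
```

Consecutive aligned 4-byte words rotate through banks 0, 1, 2, 3, so any aligned 16-byte region touches every bank exactly once.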

Bank Conflicts

counter                                 value
both I/D want lower memory              28,801,693
same-bank wants                         8,236,861
split-bank wants                        20,564,832
same-bank share                         28.60%
split-bank share                        71.40%
split-bank parallel opportunity         20,564,832
instruction blocked by lower memory     631,885
data blocked by lower memory            32,443,689
data blocked during split-bank wants    20,076,006
data blocked during same-bank wants     8,093,802

This is the useful result. In 71.40% of simultaneous instruction/data lower-memory wants, the two sides target different word-interleaved banks. The current implementation still serializes those cycles through one lower port.
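The two shares follow directly from the raw counters. The sketch below recomputes them from the table values; the variable names are illustrative. The last figure, the fraction of data-blocked cycles that fall in split-bank wants, is a derived number that the report does not state, so treat it as a rough indicator only.

```python
# Values copied from the Bank Conflicts table.
both_want = 28_801_693
same_bank = 8_236_861
split_bank = 20_564_832
data_blocked = 32_443_689
data_blocked_split = 20_076_006

# Every simultaneous want is either same-bank or split-bank.
assert same_bank + split_bank == both_want

same_share = same_bank / both_want * 100.0
split_share = split_bank / both_want * 100.0

# Derived here, not reported in the table.
blocked_in_split = data_blocked_split / data_blocked * 100.0

print(f"same-bank share:  {same_share:.2f}%")
print(f"split-bank share: {split_share:.2f}%")
print(f"data-blocked cycles during split-bank wants: {blocked_in_split:.1f}%")
```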

Per-Bank Demand

bank  I want      D want      I grant     D grant
0     27,689,968  18,558,571  27,511,963  6,390,833
1     23,411,126  13,605,291  23,288,613  6,371,689
2     24,122,967  13,046,541  23,988,410  6,556,003
3     23,642,775  13,715,086  23,445,965  7,163,275

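Bank 0 is the busiest bank on both sides (27.69M instruction wants and 18.56M data wants, against 23.4M to 24.1M and 13.0M to 13.7M for the other banks), but the workload is not so skewed that banking looks pointless. The next experiment has a real target: convert different-bank I/D overlap into independent lower-memory progress.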
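To put a number on the skew, the per-bank want counts can be turned into shares and compared with the 25% a perfectly uniform four-bank spread would give. This sketch uses the table values; the grant-rate calculation is a derived figure that the report does not state.

```python
# Rows copied from the Per-Bank Demand table:
# (I want, D want, I grant, D grant), indexed by bank number.
BANKS = [
    (27_689_968, 18_558_571, 27_511_963, 6_390_833),
    (23_411_126, 13_605_291, 23_288_613, 6_371_689),
    (24_122_967, 13_046_541, 23_988_410, 6_556_003),
    (23_642_775, 13_715_086, 23_445_965, 7_163_275),
]

total_i_want = sum(row[0] for row in BANKS)
total_d_want = sum(row[1] for row in BANKS)

i_share = [row[0] / total_i_want * 100.0 for row in BANKS]
d_share = [row[1] / total_d_want * 100.0 for row in BANKS]

# Fraction of want cycles that were granted, per bank (derived here).
i_grant_rate = [row[2] / row[0] for row in BANKS]
d_grant_rate = [row[3] / row[1] for row in BANKS]

for bank in range(len(BANKS)):
    print(
        f"bank {bank}: I share {i_share[bank]:.1f}%  D share {d_share[bank]:.1f}%  "
        f"I grant rate {i_grant_rate[bank]:.3f}  D grant rate {d_grant_rate[bank]:.3f}"
    )
```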

What Other Cores Do

XiangShan/Kunminghu does not treat this as one shared blocking pipe. Its L1 instruction cache has prefetching and configurable MSHRs for fetch and prefetch misses, and its data cache is a larger write-back/write-allocate structure with miss, AMO, writeback, and coherence machinery. The XiangShan load-unit documentation also shows load pipeline stages checking D-cache, TLB, LoadQueue, and StoreQueue/sbuffer state instead of waiting on a single memory state.

The XiangShan store queue is the grown-up version of our P102/P103 store-buffer lesson: it tracks address/data validity, committed state, store-to-load forwarding, and store-buffer handoff. CVA6 is more modest but still has the same separation pressure: the OpenHW docs describe a six-stage application core with fast instruction cache hits, longer data cache hits, and a configurable write-through data-cache write-buffer depth.

P104 is not copying those cores. It is measuring the smallest local version of the same architectural move: split near-core service, then make lower-level conflicts explicit.

Memory Stalls

Memory stalls, P104 banked lower-memory workload: 58,825,725 stall cycles, 65,891,026 handshakes (per source: stall cycles, share of stalls, requests)
  1. instruction fetch 27,412,460 46.6% 46,865,172 req
  2. data load 11,644,533 19.8% 559,598 req
  3. data store 10,926,460 18.6% 77,367 req
  4. atomic memory op 174,147 0.3% 167,306 req
  5. page walk for fetch 682,484 1.2% 676,330 req
  6. page walk for load/store 670,095 1.1% 663,907 req
  7. other 7,315,546 12.4% 16,881,346 req

The lower-port stall bucket does not improve yet because the hardware is still one port. P104 tells us how much of that bucket is plausibly different-bank work.

Shell Phases

Shell phases, P104 shell workload: 219,172,843 cycles, CPI 2.54
  1. kernel banner to /init 116,716,932 53.4%
  2. /init to shell banner 1,073,690 0.5%
  3. shell banner to first command 35,691,692 16.3%
  4. echo command 1,649 0%
  5. uname -a 2,381,786 1.1%
  6. ls /bin /usr/share 31,886,575 14.6%
  7. cat sample file 3,180,895 1.5%
  8. touch/write/cat/rm /tmp file 9,627,799 4.4%
  9. 8x ash loop with file I/O 17,983,078 8.2%
  10. final marker 680 0%

The BusyBox script reaches P104-FILE-OK, including uname, ls, cat, touch, and the temp-file loop.

Cycle Shape

State breakdown, P104 banked lower-memory workload: 219,172,843 cycles, CPI 2.54
  1. fetch 3.7% 8,126,711
  2. execute 39.4% 86,365,066
  3. mem 12.8% 28,021,522
  4. walker 1.2% 2,692,816
  5. writeback 39.4% 86,339,942
  6. mul/div 3.5% 7,625,070

P104 retires 86.34M instructions at CPI 2.5385. Treat this as the baseline for the next lower-memory service experiment.

Hot Functions

Hot functions, P104 BusyBox shell symbols: 63,537 samples, one every 1,024 cycles
  1. printf_core busybox 5.6% 3,563
  2. memset kernel 5.2% 3,332
  3. memcpy busybox 3.6% 2,309
  4. vruntime_eligible kernel 3.3% 2,107
  5. blake2s_compress_generic kernel 2.8% 1,798
  6. __fwritex busybox 2.7% 1,725
  7. memcpy kernel 2.7% 1,695
  8. handle_exception kernel 1.7% 1,102
  9. unmap_page_range kernel 1.6% 995
  10. memset busybox 1.4% 885
  11. n_tty_write kernel 1.3% 847
  12. avg_vruntime kernel 1.3% 844
  13. ret_from_exception kernel 1.2% 764
  14. n_tty_read kernel 1.1% 668
  15. next_uptodate_folio kernel 1% 644
  16. (remaining) 55.3% 35,120

The hot-symbol profile is still that of the shell workload. The architectural result is in the new lower_banks block, not in a new user program.

Next

P105 should do more than count. Add a bank-aware lower-memory service model that can grant instruction and data requests in the same cycle when their bank numbers differ, while preserving single-bank arbitration and store/order rules. Then rerun the same BusyBox shell profile and see whether the 20.56M split-bank opportunity cycles turn into lower CPI.
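One possible shape for that grant decision is sketched below in Python. It is an assumption-laden illustration, not the planned P105 design: the `grant` function and its `data_first` priority flag are hypothetical, the existing arbiter's actual priority rule is not stated here, and store/order rules are deliberately left out.

```python
from typing import Optional, Tuple


def bank_of(physical_address: int) -> int:
    """Bank index from address bits [3:2], as in the P104 bank model."""
    return (physical_address >> 2) & 0x3


def grant(
    i_addr: Optional[int],
    d_addr: Optional[int],
    data_first: bool = False,  # hypothetical tie-break for same-bank conflicts
) -> Tuple[bool, bool]:
    """Return (instruction_granted, data_granted) for one cycle.

    None means that side has no lower-memory request this cycle.
    Store/order rules are NOT modeled in this sketch.
    """
    i_want = i_addr is not None
    d_want = d_addr is not None
    if i_want and d_want and bank_of(i_addr) == bank_of(d_addr):
        # Same bank: fall back to single-winner arbitration.
        return (not data_first, data_first)
    # Different banks, or only one requester: grant whoever is asking.
    return (i_want, d_want)


print(grant(0x0, 0x4))                     # different banks
print(grant(0x0, 0x10))                    # same bank, instruction wins
print(grant(0x0, 0x10, data_first=True))   # same bank, data wins
```

In this model the split-bank case is the only one that changes behavior relative to a single shared port, which is the case the split-bank opportunity counter measures.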