Banked lower memory conflict counters

P104 measures the lower-memory part of the Harvard problem. The core now has instruction-side and data-side service counters, split ITLB/DTLB storage, and a repaired one-entry store buffer. The remaining question is where those separated requests collapse back into one shared lower port.

This rung does not add a faster RAM. It overlays a hypothetical four-bank, word-interleaved lower-memory map on the existing single-port arbiter and records whether simultaneous instruction/data requests would hit the same bank or different banks.

Result

check	result
`make check-tools`	PASS
Verilator build	PASS
Linux reaches `/init`	PASS
BusyBox prompt	PASS
BusyBox shell workload reaches `P104-FILE-OK`	PASS
Lower-bank conflict counters emitted	PASS
Hardened layout	NOT RUN

Timing

metric	P103 repaired store buffer	P104 bank counters
post-load cycles	218,842,451	219,172,843
shell window cycles	64,809,989	65,062,462
retired instructions	86,218,075	86,339,942
CPI	2.5382	2.5385
BusyBox ready milestone	118,415,663	118,418,832
shell `FILE-OK` milestone	218,842,594	219,172,986
kernel panic milestone	0	0

P104 is a functional and measurement PASS, not a speed PASS. The shell window is 0.39% slower than P103 (65,062,462 vs 64,809,989 cycles). That overhead is acceptable for simulator-side instrumentation, and no performance gain is claimed.
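The CPI and slowdown figures can be rechecked from the Timing table with a few lines of Python. This is a minimal sketch, not part of the test harness; it assumes CPI is post-load cycles divided by retired instructions, which matches the reported values. The names `RUNS`, `cpi`, and `slowdown_pct` are illustrative.

```python
# Cycle and retirement counts copied from the Timing table.
RUNS = {
    "P103": {"post_load_cycles": 218_842_451, "shell_cycles": 64_809_989, "retired": 86_218_075},
    "P104": {"post_load_cycles": 219_172_843, "shell_cycles": 65_062_462, "retired": 86_339_942},
}


def cpi(run: dict) -> float:
    """Cycles per retired instruction over the post-load window (assumed definition)."""
    return run["post_load_cycles"] / run["retired"]


def slowdown_pct(new: int, old: int) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for name, run in RUNS.items():
    print(f"{name}: CPI {cpi(run):.4f}")

shell_delta = slowdown_pct(RUNS["P104"]["shell_cycles"], RUNS["P103"]["shell_cycles"])
print(f"shell window slowdown: {shell_delta:.2f}%")
```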

Bank Model

The bank mapping is intentionally plain:

bank = physical_address[3:2]

That gives four banks interleaved at aligned-word granularity. Demand fetch, prefetch, I-cache fill, and instruction-side page-table walks are instruction-side traffic. Store-buffer drains, loads, stores, FP memory, AMOs, D-cache fill, and data-side page-table walks are data-side traffic.
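A minimal Python sketch of this mapping, for illustration only (the real counters live in the simulator, and the names `bank_of` and `classify_pair` plus the example addresses are hypothetical):

```python
NUM_BANKS = 4   # four word-interleaved banks, per the model above
WORD_SHIFT = 2  # aligned 4-byte words: drop address bits [1:0]


def bank_of(physical_address: int) -> int:
    """Return physical_address[3:2], the word-interleaved bank index."""
    return (physical_address >> WORD_SHIFT) & (NUM_BANKS - 1)


def classify_pair(i_addr: int, d_addr: int) -> str:
    """Classify a simultaneous instruction/data want as same-bank or split-bank."""
    return "same-bank" if bank_of(i_addr) == bank_of(d_addr) else "split-bank"


# Consecutive words rotate through banks 0, 1, 2, 3 and then wrap to 0.
consecutive = [bank_of(a) for a in range(0x8000_0000, 0x8000_0014, 4)]
print(consecutive)

# Hypothetical address pairs: the first pair shares bank 0, the second does not.
print(classify_pair(0x8000_0000, 0x8000_1010))
print(classify_pair(0x8000_0004, 0x8000_1008))
```

Only bits 3:2 matter, so two addresses in completely different pages still conflict when their low word index matches.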

Bank Conflicts

counter	value
both I/D want lower memory	28,801,693
same-bank wants	8,236,861
split-bank wants	20,564,832
same-bank share	28.60%
split-bank share	71.40%
split-bank parallel opportunity	20,564,832
instruction blocked by lower memory	631,885
data blocked by lower memory	32,443,689
data blocked during split-bank wants	20,076,006
data blocked during same-bank wants	8,093,802

This is the useful result. In 71.40% of simultaneous instruction/data lower-memory wants, the two sides target different word-interleaved banks. The current implementation still serializes those cycles through one lower port.
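The shares follow directly from the three counters. A short Python check, using only values from the table (variable names are illustrative):

```python
both_want = 28_801_693   # both I and D want lower memory
same_bank = 8_236_861    # both sides target the same bank
split_bank = 20_564_832  # the two sides target different banks

# The two classes partition the simultaneous wants.
assert same_bank + split_bank == both_want

same_share = 100 * same_bank / both_want
split_share = 100 * split_bank / both_want
print(f"same-bank share:  {same_share:.2f}%")
print(f"split-bank share: {split_share:.2f}%")
```

For comparison, two independent, uniformly distributed bank indices over four banks would collide 25% of the time, so the measured same-bank share sits slightly above that idealized figure.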

Per-Bank Demand

bank	I want	D want	I grant	D grant
0	27,689,968	18,558,571	27,511,963	6,390,833
1	23,411,126	13,605,291	23,288,613	6,371,689
2	24,122,967	13,046,541	23,988,410	6,556,003
3	23,642,775	13,715,086	23,445,965	7,163,275

Bank 0 sees the most instruction and data demand, but the workload is not so skewed that banking looks pointless. The next experiment has a real target: convert different-bank I/D overlap into independent lower-memory progress.
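One way to quantify the skew is each bank's share of total instruction-side and data-side wants, where a perfectly even spread would be 25% per bank. The sketch below uses only the table values. The grant/want ratio is included as a raw quotient, and its interpretation depends on how the counters are sampled, which is not described here.

```python
# (I want, D want, I grant, D grant) per bank, copied from the table above.
PER_BANK = {
    0: (27_689_968, 18_558_571, 27_511_963, 6_390_833),
    1: (23_411_126, 13_605_291, 23_288_613, 6_371_689),
    2: (24_122_967, 13_046_541, 23_988_410, 6_556_003),
    3: (23_642_775, 13_715_086, 23_445_965, 7_163_275),
}

total_i_want = sum(row[0] for row in PER_BANK.values())
total_d_want = sum(row[1] for row in PER_BANK.values())

for bank, (i_want, d_want, i_grant, d_grant) in PER_BANK.items():
    print(
        f"bank {bank}: "
        f"I share {100 * i_want / total_i_want:5.1f}%  "
        f"D share {100 * d_want / total_d_want:5.1f}%  "
        f"I grant/want {100 * i_grant / i_want:5.1f}%  "
        f"D grant/want {100 * d_grant / d_want:5.1f}%"
    )
```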

What Other Cores Do

XiangShan/Kunminghu does not treat this as one shared blocking pipe. Its L1 instruction cache has prefetching and configurable MSHRs for fetch and prefetch misses, and its data cache is a larger write-back/write-allocate structure with miss, AMO, writeback, and coherence machinery. The XiangShan load-unit documentation also shows load pipeline stages checking D-cache, TLB, LoadQueue, and StoreQueue/sbuffer state instead of waiting on a single memory state.

The XiangShan store queue is the grown-up version of our P102/P103 store-buffer lesson: it tracks address/data validity, committed state, store-to-load forwarding, and store-buffer handoff. CVA6 is more modest but still has the same separation pressure: the OpenHW docs describe a six-stage application core with fast instruction cache hits, longer data cache hits, and a configurable write-through data-cache write-buffer depth.

P104 is not copying those cores. It is measuring the smallest local version of the same architectural move: split near-core service, then make lower-level conflicts explicit.

Memory Stalls

P104 banked lower-memory workload: 58,825,725 stall cycles, 65,891,026 handshakes

source	stall cycles	share of stalls	requests
instruction fetch	27,412,460	46.6%	46,865,172
data load	11,644,533	19.8%	559,598
data store	10,926,460	18.6%	77,367
atomic memory op	174,147	0.3%	167,306
page walk for fetch	682,484	1.2%	676,330
page walk for load/store	670,095	1.1%	663,907
other	7,315,546	12.4%	16,881,346

The lower-port stall bucket does not improve yet because the hardware is still one port. P104 tells us how much of that bucket is plausibly different-bank work.
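As a rough way to size that portion, the data-side blocked counters from the Bank Conflicts table can be split into split-bank, same-bank, and leftover cycles. This is a sketch under a stated assumption: the leftover bucket is taken to be data-blocked cycles outside simultaneous I/D wants, which the counters do not spell out. The split-bank fraction is a ceiling on what banking could address, not a predicted speedup.

```python
data_blocked_total = 32_443_689  # data blocked by lower memory
data_blocked_split = 20_076_006  # ...during split-bank wants
data_blocked_same = 8_093_802    # ...during same-bank wants

# Assumption: whatever is left was blocked outside simultaneous I/D wants.
data_blocked_other = data_blocked_total - data_blocked_split - data_blocked_same


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


split_pct = pct(data_blocked_split, data_blocked_total)
same_pct = pct(data_blocked_same, data_blocked_total)
other_pct = pct(data_blocked_other, data_blocked_total)

print(f"split-bank: {split_pct:.2f}%")
print(f"same-bank:  {same_pct:.2f}%")
print(f"other:      {other_pct:.2f}%")
```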

Shell Phases

P104 shell workload: 219,172,843 cycles, CPI 2.54

phase	cycles	share
kernel banner to `/init`	116,716,932	53.4%
`/init` to shell banner	1,073,690	0.5%
shell banner to first command	35,691,692	16.3%
echo command	1,649	0%
`uname -a`	2,381,786	1.1%
`ls /bin /usr/share`	31,886,575	14.6%
cat sample file	3,180,895	1.5%
touch/write/cat/rm `/tmp` file	9,627,799	4.4%
8x ash loop with file I/O	17,983,078	8.2%
final marker	680	0%

The BusyBox script reaches `P104-FILE-OK`, including `uname`, `ls`, `cat`, `touch`, and the temp-file loop.

Cycle Shape

P104 banked lower-memory workload: 219,172,843 cycles, CPI 2.54

state	share	cycles
fetch	3.7%	8,126,711
execute	39.4%	86,365,066
mem	12.8%	28,021,522
walker	1.2%	2,692,816
writeback	39.4%	86,339,942
mul/div	3.5%	7,625,070

P104 retires 86.34M instructions at CPI 2.5385. Treat this as the baseline for the next lower-memory service experiment.

Hot Functions

P104 BusyBox shell symbols: 63,537 samples, one sample every 1,024 cycles

symbol	image	share of samples	samples
printf_core	busybox	5.6%	3,563
memset	kernel	5.2%	3,332
memcpy	busybox	3.6%	2,309
vruntime_eligible	kernel	3.3%	2,107
blake2s_compress_generic	kernel	2.8%	1,798
__fwritex	busybox	2.7%	1,725
memcpy	kernel	2.7%	1,695
handle_exception	kernel	1.7%	1,102
unmap_page_range	kernel	1.6%	995
memset	busybox	1.4%	885
n_tty_write	kernel	1.3%	847
avg_vruntime	kernel	1.3%	844
ret_from_exception	kernel	1.2%	764
n_tty_read	kernel	1.1%	668
next_uptodate_folio	kernel	1%	644
(remaining)	remaining	55.3%	35,120

The hot-symbol profile is still that of the shell workload. The architectural result is in the new `lower_banks` block, not in a new user program.

Next

P105 should move from counting to modeling. Add a bank-aware lower-memory service model that can grant instruction and data requests in the same cycle when their bank numbers differ, while preserving single-bank arbitration and store/order rules. Then rerun the same BusyBox shell profile and see whether the 20.56M split-bank opportunity cycles turn into lower CPI.
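The grant rule proposed for P105 can be sketched in a few lines of Python. This is a hypothetical model, not the planned RTL: the function name `grant`, the `prefer_data` tie-break, and the use of `None` for "no request" are all assumptions, and store/order rules are deliberately not modeled.

```python
from typing import Optional, Tuple


def bank_of(physical_address: int) -> int:
    """Word-interleaved bank index: physical_address[3:2]."""
    return (physical_address >> 2) & 0x3


def grant(
    i_addr: Optional[int],
    d_addr: Optional[int],
    prefer_data: bool = True,  # hypothetical same-bank tie-break policy
) -> Tuple[bool, bool]:
    """Return (instruction_granted, data_granted) for one cycle.

    None means that side has no lower-memory request this cycle.
    """
    # Zero or one requester: nothing to arbitrate.
    if i_addr is None or d_addr is None:
        return (i_addr is not None, d_addr is not None)
    # Different banks: both sides make progress in the same cycle.
    if bank_of(i_addr) != bank_of(d_addr):
        return (True, True)
    # Same bank: fall back to single-winner arbitration.
    return (not prefer_data, prefer_data)


print(grant(0x0, 0x4))    # split-bank pair
print(grant(0x0, 0x10))   # same-bank pair, data preferred
print(grant(0x8, None))   # instruction side only
```

Under this rule only the same-bank wants still serialize. Whether the split-bank wants actually convert into retired work is what the rerun would have to show.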
