Project No. 101 of 147 on the ladder

Split ITLB/DTLB

Introduces: separate 8-entry ITLB and DTLB banks; side-specific TLB replacement; split translation profiling

harden state, last run: 2026-05-05
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P101 splits translation storage. P100 separated instruction-side and data-side memory-service intent, but both sides still shared one tiny 8-entry TLB. P101 replaces that with an 8-entry ITLB and an 8-entry DTLB while keeping the page-table walker shared.
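To make the interference argument concrete, here is a minimal sketch of a shared 8-entry TLB next to split 8-entry banks. It is a toy model, not the P101 RTL: the round-robin replacement pointer, the `Tlb` class, the `count_walks` helper, and the synthetic trace are all assumptions chosen for illustration.

```python
class Tlb:
    """Fully associative TLB with a round-robin victim pointer (assumed policy)."""

    def __init__(self, entries=8):
        self.slots = [None] * entries
        self.next_victim = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.slots:
            return True
        # A miss costs one walk on the shared walker, then fills this bank.
        self.misses += 1
        self.slots[self.next_victim] = vpn
        self.next_victim = (self.next_victim + 1) % len(self.slots)
        return False


def count_walks(trace, split):
    """trace is a list of (side, vpn) pairs, with side "I" or "D"."""
    itlb = Tlb()
    dtlb = Tlb() if split else itlb  # shared case: both sides use one bank
    for side, vpn in trace:
        (itlb if side == "I" else dtlb).lookup(vpn)
    return itlb.misses + (dtlb.misses if split else 0)


# Synthetic trace: fetch cycles through 6 code pages while data
# accesses cycle through 6 different data pages.
trace = []
for i in range(1000):
    trace.append(("I", 0x100 + i % 6))
    trace.append(("D", 0x200 + i % 6))

shared_walks = count_walks(trace, split=False)
split_walks = count_walks(trace, split=True)
print("shared:", shared_walks, "split:", split_walks)
```

In this toy trace the two sides together touch 12 pages, which thrashes a single 8-entry bank, while each side's 6 pages fit in its own bank. Real workloads are far less regular, so the measured reduction is smaller than this worst case.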

This one is not just cleaner architecture. It is faster.

Result

| metric | P94 arbiter | P100 split service | P101 split TLB |
|---|---|---|---|
| post-load cycles | 222,459,202 | 221,990,140 | 217,630,965 |
| shell window cycles | 67,050,374 | 66,518,626 | 63,777,267 |
| retired instructions | 86,664,089 | 86,512,027 | 86,031,234 |
| CPI | 2.5669 | 2.5660 | 2.5297 |
| memory stall cycles | 60,032,329 | 59,819,129 | 58,999,994 |
| fetch stall cycles | 23,549,359 | 27,346,150 | 27,158,844 |
| load stall cycles | 14,632,992 | 10,729,427 | 10,999,370 |
| fetch page walks | 1,123,987 | 1,122,943 | 674,678 |
| data page walks | 1,220,476 | 1,218,084 | 666,522 |

| comparison | result |
|---|---|
| shell window vs P100 | -4.12% |
| post-load cycles vs P100 | -1.96% |
| memory stalls vs P100 | -1.37% |
| fetch stalls vs P100 | -0.68% |
| load stalls vs P100 | +2.52% |
| fetch walks vs P100 | -39.92% |
| data walks vs P100 | -45.28% |
| shell window vs P94 | -4.88% |
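The comparison rows follow from the metric rows as plain relative changes. The sketch below recomputes them from the published counts; the helper name `pct_change` and the dictionary keys are illustrative choices, not names from the project's tooling.

```python
def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


# Counts copied from the Result table.
p100 = {
    "shell window": 66_518_626,
    "post-load cycles": 221_990_140,
    "memory stalls": 59_819_129,
    "fetch stalls": 27_346_150,
    "load stalls": 10_729_427,
    "fetch walks": 1_122_943,
    "data walks": 1_218_084,
}
p101 = {
    "shell window": 63_777_267,
    "post-load cycles": 217_630_965,
    "memory stalls": 58_999_994,
    "fetch stalls": 27_158_844,
    "load stalls": 10_999_370,
    "fetch walks": 674_678,
    "data walks": 666_522,
}
p94_shell_window = 67_050_374

deltas = {name: pct_change(p101[name], p100[name]) for name in p100}
for name, delta in deltas.items():
    print(f"{name} vs P100: {delta:+.2f}%")
print(f"shell window vs P94: {pct_change(p101['shell window'], p94_shell_window):+.2f}%")
```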

P101 is a speed PASS. The load-stall bucket rises 2.52%, but the shell workload still wins because translation walks fall sharply: fetch walks drop 39.92% and data walks drop 45.28% versus P100.

Split TLB Counters

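| bank | entries | activity | misses | fills | replacement index |
|---|---|---|---|---|---|
| ITLB fetch-hit cycles | 8 | 6,107,447 | 674,678 | 674,678 | 5 |
| ITLB prefetch-hit cycles | 8 | 139,274,063 | 674,678 | 674,678 | 5 |
| DTLB LSU hits | 8 | 27,081,725 | 666,522 | 666,522 | 6 |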

Shared walker:

| walker metric | count |
|---|---|
| fetch walks | 674,678 |
| data walks | 666,522 |
| A/D writebacks | 96 |
| TLB flushes | 13,447 |

The split removes replacement interference, not walker serialization. That distinction matters for the next rung.
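A small timing sketch may help separate the two effects. It assumes a fixed, made-up walk latency of 20 cycles and a hypothetical `finish_times` helper; the tie-break between simultaneous requests is an artifact of sorting, not the real arbiter's priority.

```python
def finish_times(requests, walk_latency, walkers=1):
    """requests: list of (arrival_cycle, side). Returns [(side, done_cycle), ...]."""
    free_at = [0] * walkers  # cycle at which each walker becomes idle
    done = []
    for arrival, side in sorted(requests):
        # Pick the walker that frees up first.
        w = min(range(walkers), key=lambda i: free_at[i])
        start = max(arrival, free_at[w])
        free_at[w] = start + walk_latency
        done.append((side, free_at[w]))
    return done


# An ITLB miss and a DTLB miss raised in the same cycle.
reqs = [(0, "fetch"), (0, "data")]
print(finish_times(reqs, walk_latency=20))             # one shared walker
print(finish_times(reqs, walk_latency=20, walkers=2))  # hypothetical parallel walkers
```

With one walker the second miss waits for the first walk to finish, no matter which bank it fills. Split banks reduce how many walks are issued; they do not shorten the wait when two walks collide.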

Translation Shape

| access | P101 bank |
|---|---|
| current PC fetch | ITLB |
| next_pc prefetch | ITLB |
| load/store effective address | DTLB |
| AMO effective address | DTLB |

Both banks are flushed on reset, satp writes, and sfence.vma.
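The routing and flush rules above can be sketched as a lookup table plus a flush handler. The key strings, the `SplitTlbState` class, and its method names are hypothetical; only the access-to-bank mapping and the three flush triggers come from the text.

```python
# Access kind -> bank, as listed in the Translation Shape table.
BANK_FOR_ACCESS = {
    "current_pc_fetch": "ITLB",
    "next_pc_prefetch": "ITLB",
    "load_store": "DTLB",
    "amo": "DTLB",
}

# Events that invalidate both banks.
FLUSH_EVENTS = {"reset", "satp_write", "sfence_vma"}


class SplitTlbState:
    def __init__(self, entries=8):
        self.banks = {"ITLB": [None] * entries, "DTLB": [None] * entries}

    def fill(self, access, vpn, slot):
        """The shared walker writes its result into the requesting side's bank only."""
        self.banks[BANK_FOR_ACCESS[access]][slot] = vpn

    def on_event(self, event):
        if event in FLUSH_EVENTS:
            for bank in self.banks.values():
                bank[:] = [None] * len(bank)  # drop every entry in both banks


state = SplitTlbState()
state.fill("amo", 0x42, slot=3)         # lands in the DTLB, ITLB untouched
state.fill("next_pc_prefetch", 0x10, slot=0)
state.on_event("sfence_vma")            # both banks are empty again
```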

Memory Stalls

Memory stalls, P101 split ITLB/DTLB workload: 58,999,994 stalls, 64,560,204 handshakes

| # | bucket | stalls | share | req |
|---|---|---|---|---|
| 1 | instruction fetch | 27,158,844 | 46% | 45,623,132 |
| 2 | data load | 10,999,370 | 18.6% | 550,464 |
| 3 | data store | 12,019,517 | 20.4% | 83,798 |
| 4 | atomic memory op | 172,099 | 0.3% | 166,763 |
| 5 | page walk for fetch | 674,678 | 1.1% | 668,524 |
| 6 | page walk for load/store | 666,522 | 1.1% | 660,352 |
| 7 | other | 7,308,964 | 12.4% | 16,807,171 |

Memory stalls fall 1.37% versus P100. The more interesting movement is inside the page-walk traffic: fetch-side and data-side PTE work both drop because each side keeps its own translations longer.
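One way to read the stall buckets is stall cycles per request. The sketch below derives that ratio from the published counts, assuming the "req" column counts handshakes for each bucket (the columns do sum to the two totals). The `buckets` and `per_request` names are illustrative.

```python
# (stall cycles, requests) per bucket, copied from the memory-stall breakdown.
buckets = {
    "instruction fetch": (27_158_844, 45_623_132),
    "data load": (10_999_370, 550_464),
    "data store": (12_019_517, 83_798),
    "atomic memory op": (172_099, 166_763),
    "page walk for fetch": (674_678, 668_524),
    "page walk for load/store": (666_522, 660_352),
    "other": (7_308_964, 16_807_171),
}

total_stalls = sum(stalls for stalls, _ in buckets.values())
total_reqs = sum(reqs for _, reqs in buckets.values())

# Stall cycles spent per handshake in each bucket.
per_request = {name: stalls / reqs for name, (stalls, reqs) in buckets.items()}

for name, (stalls, reqs) in buckets.items():
    share = stalls / total_stalls
    print(f"{name:26s} {share:6.1%} {per_request[name]:8.2f} stall cycles/request")
```

On these counts, stores cost roughly 143 stall cycles per request against about 20 for loads, if the per-bucket request counts mean what they appear to mean.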

Shell Phases

Shell phases, P101 shell workload: 217,630,965 cycles, CPI 2.53

| # | phase | cycles | share |
|---|---|---|---|
| 1 | kernel banner to /init | 116,724,467 | 53.8% |
| 2 | /init to shell banner | 1,073,233 | 0.5% |
| 3 | shell banner to first command | 35,427,931 | 16.3% |
| 4 | echo command | 1,649 | 0% |
| 5 | uname -a | 2,387,886 | 1.1% |
| 6 | ls /bin /usr/share | 31,716,614 | 14.6% |
| 7 | cat sample file | 3,042,641 | 1.4% |
| 8 | touch/write/cat/rm /tmp file | 10,791,609 | 5% |
| 9 | 8x ash loop with file I/O | 15,836,188 | 7.3% |
| 10 | final marker | 680 | 0% |

The full BusyBox shell script reaches P101-FILE-OK.
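The phase counts also show what the "shell window" metric covers. Summing phases 4 through 10 (the echo command up to the final marker) reproduces the 63,777,267 shell window cycles reported under Result. That reading is inferred from the arithmetic rather than stated by the source, and the `phases` list below is just the published data in a convenient form.

```python
# (phase, cycles) in list order, copied from the shell-phase breakdown.
phases = [
    ("kernel banner to /init", 116_724_467),
    ("/init to shell banner", 1_073_233),
    ("shell banner to first command", 35_427_931),
    ("echo command", 1_649),
    ("uname -a", 2_387_886),
    ("ls /bin /usr/share", 31_716_614),
    ("cat sample file", 3_042_641),
    ("touch/write/cat/rm /tmp file", 10_791_609),
    ("8x ash loop with file I/O", 15_836_188),
    ("final marker", 680),
]

# Phases 4..10 are the commands run after the first prompt.
shell_window = sum(cycles for _, cycles in phases[3:])
print(f"shell window cycles: {shell_window:,}")
```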

Cycle Shape

State breakdown, P101 split ITLB/DTLB workload: 217,630,965 cycles, CPI 2.53

| # | state | share | cycles |
|---|---|---|---|
| 1 | fetch | 3.4% | 7,465,407 |
| 2 | execute | 39.5% | 86,055,752 |
| 3 | mem | 12.8% | 27,891,331 |
| 4 | walker | 1.2% | 2,670,076 |
| 5 | writeback | 39.5% | 86,031,234 |
| 6 | mul/div | 3.5% | 7,515,449 |

The CPI improvement is visible here: 2.5660 in P100, 2.5297 in P101.
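The CPI figures are post-load cycles divided by retired instructions. A short check against the Result table, with the `runs` dictionary as an illustrative container for the published counts:

```python
# (post-load cycles, retired instructions) per rung, from the Result table.
runs = {
    "P94": (222_459_202, 86_664_089),
    "P100": (221_990_140, 86_512_027),
    "P101": (217_630_965, 86_031_234),
}

cpi = {name: cycles / retired for name, (cycles, retired) in runs.items()}
for name, value in cpi.items():
    print(f"{name}: CPI {value:.4f}")
```

Retired instructions also fall slightly from P100 to P101, so the CPI gain is a little smaller than the raw cycle reduction.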

Hot Functions

Hot functions, P101 BusyBox shell symbols: 62,283 samples, one every 1,024 cycles

| # | symbol | image | share | samples |
|---|---|---|---|---|
| 1 | printf_core | busybox | 5.6% | 3,500 |
| 2 | memset | kernel | 5.1% | 3,198 |
| 3 | memcpy | busybox | 3.7% | 2,319 |
| 4 | vruntime_eligible | kernel | 3.2% | 1,984 |
| 5 | blake2s_compress_generic | kernel | 2.9% | 1,802 |
| 6 | memcpy | kernel | 2.8% | 1,726 |
| 7 | __fwritex | busybox | 2.7% | 1,685 |
| 8 | unmap_page_range | kernel | 1.6% | 1,019 |
| 9 | handle_exception | kernel | 1.6% | 1,004 |
| 10 | n_tty_write | kernel | 1.3% | 830 |
| 11 | memset | busybox | 1.3% | 785 |
| 12 | avg_vruntime | kernel | 1.2% | 749 |
| 13 | ret_from_exception | kernel | 1.1% | 701 |
| 14 | next_uptodate_folio | kernel | 1.1% | 680 |
| 15 | n_tty_read | kernel | 1.1% | 661 |
| 16 | (remaining) | remaining | 55.3% | 34,438 |

The software workload did not change. The speedup comes from fewer translation walks underneath the same shell script.

Honest Status

| check | status |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| BusyBox userspace/initramfs build | PASS |
| Linux image rebuilt with P101 initramfs | PASS |
| BusyBox shell workload reaches P101-FILE-OK | PASS |
| P101 chart data captured | PASS |
| Separate ITLB and DTLB storage | PASS |
| Shared walker fills side-specific bank | PASS |
| Shell-window speedup vs P100 | PASS |
| Parallel page walkers | NOT RUN |
| Data-side write buffer with forwarding | NOT RUN |
| Nonblocking miss machinery | NOT RUN |
| LibreLane hardening | NOT RUN |

Next

P102 should move back to the data side: a write buffer with same-address forwarding and clear fence/AMO ordering, or an MSHR-lite miss tracker if we want to attack blocking loads first.