P101 splits translation storage. P100 separated instruction-side and data-side memory-service intent, but both sides still shared one tiny 8-entry TLB. P101 replaces that with an 8-entry ITLB and an 8-entry DTLB while keeping the page-table walker shared.
This one is not just cleaner architecture. It is faster.
Result
| metric | P94 arbiter | P100 split service | P101 split TLB |
|---|---|---|---|
| post-load cycles | 222,459,202 | 221,990,140 | 217,630,965 |
| shell window cycles | 67,050,374 | 66,518,626 | 63,777,267 |
| retired instructions | 86,664,089 | 86,512,027 | 86,031,234 |
| CPI | 2.5669 | 2.5660 | 2.5297 |
| memory stall cycles | 60,032,329 | 59,819,129 | 58,999,994 |
| fetch stall cycles | 23,549,359 | 27,346,150 | 27,158,844 |
| load stall cycles | 14,632,992 | 10,729,427 | 10,999,370 |
| fetch page walks | 1,123,987 | 1,122,943 | 674,678 |
| data page walks | 1,220,476 | 1,218,084 | 666,522 |
Deltas for P101:
| comparison | change |
|---|---|
| shell window vs P100 | -4.12% |
| post-load cycles vs P100 | -1.96% |
| memory stalls vs P100 | -1.37% |
| fetch stalls vs P100 | -0.68% |
| load stalls vs P100 | +2.52% |
| fetch walks vs P100 | -39.92% |
| data walks vs P100 | -45.28% |
| shell window vs P94 | -4.88% |
P101 is a speed PASS. The load-stall bucket rises slightly, but the shell workload wins because translation walks fall sharply.
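The percentages in the comparison rows can be rechecked from the raw counts. The sketch below recomputes them from the numbers in the result table; the dictionary keys and the `pct_change` helper are names chosen for this example, not identifiers from the project.

```python
# Counts copied from the result table above.
P94 = {"shell_window": 67_050_374}
P100 = {
    "shell_window": 66_518_626,
    "post_load": 221_990_140,
    "memory_stalls": 59_819_129,
    "fetch_stalls": 27_346_150,
    "load_stalls": 10_729_427,
    "fetch_walks": 1_122_943,
    "data_walks": 1_218_084,
}
P101 = {
    "shell_window": 63_777_267,
    "post_load": 217_630_965,
    "memory_stalls": 58_999_994,
    "fetch_stalls": 27_158_844,
    "load_stalls": 10_999_370,
    "fetch_walks": 674_678,
    "data_walks": 666_522,
}


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# P101 versus P100 for every metric present in both runs.
deltas = {name: round(pct_change(P101[name], P100[name]), 2) for name in P100}
for name, delta in deltas.items():
    print(f"{name:14s} vs P100: {delta:+.2f}%")

# The one comparison against the older P94 arbiter rung.
shell_vs_p94 = round(pct_change(P101["shell_window"], P94["shell_window"]), 2)
print(f"{'shell_window':14s} vs P94:  {shell_vs_p94:+.2f}%")
```

Negative values mean P101 spent fewer cycles or did fewer walks than the baseline.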
Split TLB Counters
| bank and counter | entries | count | bank misses | bank fills | replacement index |
|---|---|---|---|---|---|
| ITLB fetch-hit cycles | 8 | 6,107,447 | 674,678 | 674,678 | 5 |
| ITLB prefetch-hit cycles | 8 | 139,274,063 | 674,678 | 674,678 | 5 |
| DTLB LSU hits | 8 | 27,081,725 | 666,522 | 666,522 | 6 |
Shared walker:
| walker metric | count |
|---|---|
| fetch walks | 674,678 |
| data walks | 666,522 |
| A/D writebacks | 96 |
| TLB flushes | 13,447 |
The split removes replacement interference, not walker serialization. That distinction matters for the next rung.
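Replacement interference is easy to see in a toy model. The sketch below is not the P101 RTL: it assumes a fully associative TLB with a rotating replacement pointer (the "replacement index" column hints at this, but the policy is an assumption), and the trace is deliberately contrived so that six code pages and six data pages interleave. `ToyTLB`, `make_trace`, `run_shared`, and `run_split` are hypothetical names.

```python
class ToyTLB:
    """Fully associative TLB with a rotating replacement index (assumed policy)."""

    def __init__(self, entries=8):
        self.slots = [None] * entries
        self.next_victim = 0
        self.misses = 0

    def access(self, vpn):
        """Return True on a hit; on a miss, fill the slot at the victim pointer."""
        if vpn in self.slots:
            return True
        self.misses += 1
        self.slots[self.next_victim] = vpn
        self.next_victim = (self.next_victim + 1) % len(self.slots)
        return False


def make_trace(rounds, pages_per_side=6):
    """Interleave fetch and data accesses over two small page sets."""
    trace = []
    for _ in range(rounds):
        for i in range(pages_per_side):
            trace.append(("fetch", 0x1000 + i))
            trace.append(("data", 0x8000 + i))
    return trace


def run_shared(trace):
    """One 8-entry TLB serves both sides, as before P101."""
    tlb = ToyTLB(8)
    for _, vpn in trace:
        tlb.access(vpn)
    return tlb.misses


def run_split(trace):
    """Separate 8-entry ITLB and DTLB, as in P101."""
    itlb, dtlb = ToyTLB(8), ToyTLB(8)
    for kind, vpn in trace:
        (itlb if kind == "fetch" else dtlb).access(vpn)
    return itlb.misses, dtlb.misses


trace = make_trace(rounds=100)
print("shared misses:", run_shared(trace))
print("split misses (ITLB, DTLB):", run_split(trace))
```

With twelve pages cycling through eight shared entries, every access evicts a translation the other side is about to need, so the shared TLB never hits. Each split bank holds its six pages after the cold misses. The real workload is far less extreme, which is why the measured walk reduction is 40 to 45 percent rather than near total. In both models each miss still costs one walk on the single walker, so walker serialization is untouched.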
Translation Shape
| access | P101 bank |
|---|---|
| current PC fetch | ITLB |
| next_pc prefetch | ITLB |
| load/store effective address | DTLB |
| AMO effective address | DTLB |
Both banks are flushed on reset, satp writes, and sfence.vma.
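The routing and flush rules above can be written down as a small behavioral model. This is a sketch under stated assumptions, not the hardware: the access and event names are made up for the example, capacity and replacement are ignored, and `sfence.vma` is modeled as a full flush of both banks with no address or ASID selectivity.

```python
# Hypothetical access and event names, chosen to mirror the table above.
ITLB_ACCESSES = {"pc_fetch", "next_pc_prefetch"}
DTLB_ACCESSES = {"load", "store", "amo"}
FLUSH_EVENTS = {"reset", "satp_write", "sfence_vma"}


class SplitTLBModel:
    """Behavioral model: two translation banks, one set of flush events."""

    def __init__(self):
        # Each bank maps virtual page number -> physical page number.
        self.banks = {"ITLB": {}, "DTLB": {}}

    @staticmethod
    def bank_for(access):
        """Pick the bank that serves a given access type."""
        if access in ITLB_ACCESSES:
            return "ITLB"
        if access in DTLB_ACCESSES:
            return "DTLB"
        raise ValueError(f"unknown access type: {access}")

    def fill(self, access, vpn, ppn):
        """A completed walk fills only the bank on the requesting side."""
        self.banks[self.bank_for(access)][vpn] = ppn

    def lookup(self, access, vpn):
        """Return the cached ppn, or None if the side-specific bank misses."""
        return self.banks[self.bank_for(access)].get(vpn)

    def event(self, name):
        """Flush both banks on reset, satp writes, and sfence.vma."""
        if name in FLUSH_EVENTS:
            for bank in self.banks.values():
                bank.clear()


tlb = SplitTLBModel()
tlb.fill("load", vpn=0x80, ppn=0x1234)
print(tlb.lookup("amo", 0x80))       # DTLB hit: same bank as the load
print(tlb.lookup("pc_fetch", 0x80))  # ITLB miss: the fill never touched it
tlb.event("sfence_vma")
print(tlb.lookup("load", 0x80))      # miss again after the flush
```

The point of the model is the asymmetry: a data-side fill is invisible to fetch, while a flush event clears both sides together.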
Memory Stalls
| bucket | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 27,158,844 | 46% | 45,623,132 |
| data load | 10,999,370 | 18.6% | 550,464 |
| data store | 12,019,517 | 20.4% | 83,798 |
| atomic memory op | 172,099 | 0.3% | 166,763 |
| page walk for fetch | 674,678 | 1.1% | 668,524 |
| page walk for load/store | 666,522 | 1.1% | 660,352 |
| other | 7,308,964 | 12.4% | 16,807,171 |
Memory stalls fall 1.37% versus P100. The more interesting movement is inside the page-walk traffic: fetch-side and data-side PTE work both drop because each side keeps its own translations longer.
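As a consistency check, the stall buckets above should add up to the memory stall total in the result table, and the shares should follow from the same counts. A short sketch, with bucket names copied from the breakdown and variable names chosen for the example:

```python
# Stall cycles per bucket, copied from the memory-stall breakdown above.
stall_cycles = {
    "instruction fetch": 27_158_844,
    "data load": 10_999_370,
    "data store": 12_019_517,
    "atomic memory op": 172_099,
    "page walk for fetch": 674_678,
    "page walk for load/store": 666_522,
    "other": 7_308_964,
}

total = sum(stall_cycles.values())
shares = {name: round(100.0 * cycles / total, 1)
          for name, cycles in stall_cycles.items()}

print(f"total memory stall cycles: {total:,}")
for name, share in shares.items():
    print(f"{name:26s} {share:5.1f}%")
```

The buckets sum to 58,999,994, which matches the P101 memory stall cycles reported earlier.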
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 116,724,467 | 53.8% |
| /init to shell banner | 1,073,233 | 0.5% |
| shell banner to first command | 35,427,931 | 16.3% |
| echo command | 1,649 | 0% |
| uname -a | 2,387,886 | 1.1% |
| ls /bin /usr/share | 31,716,614 | 14.6% |
| cat sample file | 3,042,641 | 1.4% |
| touch/write/cat/rm /tmp file | 10,791,609 | 5% |
| 8x ash loop with file I/O | 15,836,188 | 7.3% |
| final marker | 680 | 0% |
The full BusyBox shell script reaches P101-FILE-OK.
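The phase breakdown above also appears to explain where the shell window number comes from. The text does not define the window, so treat this as an observation rather than a definition: summing the phases from the echo command through the final marker reproduces the 63,777,267 shell window cycles in the result table. Variable names are chosen for the example.

```python
# Phase cycle counts in order, copied from the phase breakdown above.
phases = [
    ("kernel banner to /init", 116_724_467),
    ("/init to shell banner", 1_073_233),
    ("shell banner to first command", 35_427_931),
    ("echo command", 1_649),
    ("uname -a", 2_387_886),
    ("ls /bin /usr/share", 31_716_614),
    ("cat sample file", 3_042_641),
    ("touch/write/cat/rm /tmp file", 10_791_609),
    ("8x ash loop with file I/O", 15_836_188),
    ("final marker", 680),
]

names = [name for name, _ in phases]
first_command = names.index("echo command")

# Assumed window: first shell command through the final marker.
shell_window = sum(cycles for _, cycles in phases[first_command:])
all_phases = sum(cycles for _, cycles in phases)

print(f"shell window cycles: {shell_window:,}")
print(f"all phases:          {all_phases:,}")
```

The listed shares are consistent with each phase divided by the sum of all phases, 217,002,898 cycles.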
Cycle Shape
| category | cycles | share |
|---|---|---|
| fetch | 7,465,407 | 3.4% |
| execute | 86,055,752 | 39.5% |
| mem | 27,891,331 | 12.8% |
| walker | 2,670,076 | 1.2% |
| writeback | 86,031,234 | 39.5% |
| mul/div | 7,515,449 | 3.5% |
CPI falls from 2.5660 in P100 to 2.5297 in P101, about 1.4% lower.
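The CPI figures follow directly from the result table: post-load cycles divided by retired instructions. A minimal check, with run labels as dictionary keys chosen for the example:

```python
# (post-load cycles, retired instructions) per run, from the result table.
runs = {
    "P94": (222_459_202, 86_664_089),
    "P100": (221_990_140, 86_512_027),
    "P101": (217_630_965, 86_031_234),
}

# CPI = cycles per retired instruction.
cpi = {name: cycles / retired for name, (cycles, retired) in runs.items()}

for name, value in cpi.items():
    print(f"{name}: CPI = {value:.4f}")

change = (cpi["P101"] - cpi["P100"]) / cpi["P100"] * 100.0
print(f"P101 vs P100: {change:+.2f}%")
```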
Hot Functions
- 5.6% of samples (3,500 samples)
- 5.1% of samples (3,198 samples)
- 3.7% of samples (2,319 samples)
- 3.2% of samples (1,984 samples)
- 2.9% of samples (1,802 samples)
- 2.8% of samples (1,726 samples)
- 2.7% of samples (1,685 samples)
- 1.6% of samples (1,019 samples)
- 1.6% of samples (1,004 samples)
- 1.3% of samples (830 samples)
- 1.3% of samples (785 samples)
- 1.2% of samples (749 samples)
- 1.1% of samples (701 samples)
- 1.1% of samples (680 samples)
- 1.1% of samples (661 samples)
- 55.3% of samples (34,438 samples)
The software workload did not change. The speedup comes from fewer translation walks underneath the same shell script.
Honest Status
| check | status |
|---|---|
| make check-tools | PASS |
| Verilator build | PASS |
| BusyBox userspace/initramfs build | PASS |
| Linux image rebuilt with P101 initramfs | PASS |
| BusyBox shell workload reaches P101-FILE-OK | PASS |
| P101 chart data captured | PASS |
| Separate ITLB and DTLB storage | PASS |
| Shared walker fills side-specific bank | PASS |
| Shell-window speedup vs P100 | PASS |
| Parallel page walkers | NOT RUN |
| Data-side write buffer with forwarding | NOT RUN |
| Nonblocking miss machinery | NOT RUN |
| LibreLane hardening | NOT RUN |
Next
P102 should move back to the data side: a write buffer with
same-address forwarding and clear fence/AMO ordering, or an MSHR-lite
miss tracker if we want to attack blocking loads first.
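To make the first option concrete, here is a behavioral sketch of a write buffer with same-address forwarding. Nothing here is a P102 design: `ToyWriteBuffer`, its depth of four, the word-granular addresses, and the choice to model fences (and, by assumption, AMOs) as a full drain are all illustrative.

```python
from collections import deque


class ToyWriteBuffer:
    """Stores queue up in order; loads forward from the youngest matching store."""

    def __init__(self, memory, depth=4):
        self.memory = memory      # backing store: dict of addr -> value
        self.depth = depth        # assumed buffer depth
        self.pending = deque()    # (addr, value) pairs, oldest first

    def store(self, addr, value):
        """Queue a store, draining the oldest entry first if the buffer is full."""
        if len(self.pending) == self.depth:
            self.drain_one()
        self.pending.append((addr, value))

    def load(self, addr):
        """Forward from the youngest pending store to addr, else read memory."""
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def drain_one(self):
        """Commit the oldest pending store to memory."""
        addr, value = self.pending.popleft()
        self.memory[addr] = value

    def fence(self):
        """Ordering point: commit every pending store before continuing."""
        while self.pending:
            self.drain_one()


memory = {}
wb = ToyWriteBuffer(memory)
wb.store(0x100, 1)
wb.store(0x100, 2)
print(wb.load(0x100))    # forwarded from the youngest pending store
print(0x100 in memory)   # nothing has reached memory yet
wb.fence()
print(memory[0x100])     # committed in order, so the last store wins
```

Searching youngest-first is what keeps a load coherent with the hart's own program order while the stores are still in flight.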