P99 is a map rung. It keeps the P98 core behavior, reruns the BusyBox shell profile under P99 labels, and writes down exactly what the Harvard instruction/data split needs to mean for this core.
The short version: we have I-cache and D-cache structures. We do not yet have independent instruction and data service.
Current Shape
| area | current P99 reality | Harvard target |
|---|---|---|
| instruction cache | 64-line, 4-word direct-mapped I-cache | independent instruction-side L1 service |
| data cache | 64-line, 4-word write-through D-cache for aligned RAM LW/SW | independent data-side L1 service |
| fetch queue | one-entry next-instruction queue | enough frontend buffering to cover data hiccups |
| translation | one 8-entry unified TLB with separate fetch/LSU lookup wires | split ITLB/DTLB lookup and refill accounting |
| page-table walker | one walker tagged by ptw_for_fetch_q | split request queues or explicit lower arbitration |
| stores | write-through on the shared path | data-side write buffer with forwarding |
| lower memory | one selected mem_arb_class drives mem_valid | lower shared memory or banks with conflict counters |
Result
| metric | P94 arbiter | P98 throttle | P99 map |
|---|---|---|---|
| post-load cycles | 222,459,202 | 221,452,591 | 222,509,604 |
| shell window cycles | 67,050,374 | 66,055,345 | 66,998,698 |
| retired instructions | 86,664,089 | 86,329,983 | 86,648,693 |
| CPI | 2.5669 | 2.5652 | 2.5680 |
| memory stall cycles | 60,032,329 | 59,683,338 | 59,928,278 |
| fetch stall cycles | 23,549,359 | 27,286,526 | 27,399,253 |
| load stall cycles | 14,632,992 | 10,697,962 | 10,718,661 |
| comparison | result |
|---|---|
| shell window vs P98 | +1.43% |
| post-load cycles vs P98 | +0.48% |
| memory stalls vs P98 | +0.41% |
| fetch stalls vs P98 | +0.41% |
| load stalls vs P98 | +0.19% |
| shell window vs P94 | -0.08% |
| load stalls vs P94 | -26.75% |
| fetch stalls vs P94 | +16.35% |
P99 is a functional PASS. It is not a speed PASS versus P98: the shell window is 1.43% slower (66,998,698 vs 66,055,345 cycles). That is expected because P99 only maps the split and does not implement it yet.
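The comparison rows above can be reproduced from the raw counters. The following is a minimal sketch, assuming the percentages are signed changes relative to the baseline and that CPI is post-load cycles divided by retired instructions (both assumptions are consistent with the table values). The names `RUNS`, `pct_delta`, and `cpi` are illustrative and not part of the project's tooling.

```python
# Raw counters copied from the Result table.
RUNS = {
    "P94": {"post_load": 222_459_202, "shell_window": 67_050_374,
            "retired": 86_664_089, "mem_stall": 60_032_329,
            "fetch_stall": 23_549_359, "load_stall": 14_632_992},
    "P98": {"post_load": 221_452_591, "shell_window": 66_055_345,
            "retired": 86_329_983, "mem_stall": 59_683_338,
            "fetch_stall": 27_286_526, "load_stall": 10_697_962},
    "P99": {"post_load": 222_509_604, "shell_window": 66_998_698,
            "retired": 86_648_693, "mem_stall": 59_928_278,
            "fetch_stall": 27_399_253, "load_stall": 10_718_661},
}


def pct_delta(new: int, base: int) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def cpi(run: dict) -> float:
    """Cycles per retired instruction over the post-load window (assumed definition)."""
    return run["post_load"] / run["retired"]


if __name__ == "__main__":
    p99 = RUNS["P99"]
    keys = ("shell_window", "post_load", "mem_stall", "fetch_stall", "load_stall")
    for base_name in ("P98", "P94"):
        base = RUNS[base_name]
        for key in keys:
            print(f"{key} vs {base_name}: {pct_delta(p99[key], base[key]):+.2f}%")
    for name, run in RUNS.items():
        print(f"{name} CPI: {cpi(run):.4f}")
```

Positive deltas mean P99 spent more cycles than the baseline, so a positive shell-window delta is a slowdown.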
Current Request Classes
| class | future side |
|---|---|
| fetch | instruction |
| execute_prefetch | instruction |
| writeback_prefetch | instruction |
| icache_background | instruction |
| load | data |
| store | data |
| fp_load | data |
| fp_store | data |
| amo | data |
| ptw_fetch | instruction translation |
| ptw_lsu | data translation |
| dcache_background | data |
P94 gave these clients names. P100 should stop forcing them through one final near-core port before the memory model can see them.
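The class-to-side table can be written down as a lookup that a split-port model might consult. This is a sketch only: the class names come from the table, but the `Side` enum, the `near_core_port` helper, and the port names `iport` and `dport` are hypothetical. Folding the two translation classes onto the instruction and data ports is one possible choice, not something P99 decides.

```python
from enum import Enum


class Side(Enum):
    INSTRUCTION = "instruction"
    DATA = "data"
    INSTRUCTION_TRANSLATION = "instruction translation"
    DATA_TRANSLATION = "data translation"


# Request class -> future side, copied from the table.
FUTURE_SIDE = {
    "fetch": Side.INSTRUCTION,
    "execute_prefetch": Side.INSTRUCTION,
    "writeback_prefetch": Side.INSTRUCTION,
    "icache_background": Side.INSTRUCTION,
    "load": Side.DATA,
    "store": Side.DATA,
    "fp_load": Side.DATA,
    "fp_store": Side.DATA,
    "amo": Side.DATA,
    "ptw_fetch": Side.INSTRUCTION_TRANSLATION,
    "ptw_lsu": Side.DATA_TRANSLATION,
    "dcache_background": Side.DATA,
}


def near_core_port(request_class: str) -> str:
    """Pick a hypothetical near-core port for a request class.

    Assumption: page walks travel on the port of the side that triggered them.
    """
    side = FUTURE_SIDE[request_class]
    if side in (Side.INSTRUCTION, Side.INSTRUCTION_TRANSLATION):
        return "iport"
    return "dport"


if __name__ == "__main__":
    for name in FUTURE_SIDE:
        print(f"{name:20s} -> {near_core_port(name)}")
```

Keeping the four-way `Side` distinction separate from the two-way port choice leaves room for the walker to get its own request queue later without renaming the classes.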
Memory Stalls
| stall source | stall cycles | share of memory stalls | requests |
|---|---|---|---|
| instruction fetch | 27,399,253 | 45.7% | 46,182,344 |
| data load | 10,718,661 | 17.9% | 892,084 |
| data store | 11,987,867 | 20.0% | 222,648 |
| atomic memory op | 158,393 | 0.3% | 184,717 |
| page walk for fetch | 1,135,398 | 1.9% | 1,129,244 |
| page walk for load/store | 1,243,664 | 2.1% | 1,243,154 |
| other | 7,285,042 | 12.2% | 16,829,586 |
The profile still shows the same basic shape: load stalls are 26.75% below P94 thanks to the D-cache work, but fetch stalls are 16.35% above P94 because the frontend remains tied to the shared service point.
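The shares in the stall breakdown can be recomputed from the cycle counts, and dividing stall cycles by requests gives a rough per-request cost. The sketch below assumes each entry pairs stall cycles with a request count for the same source. The per-request ratio is a derived figure that this page does not report, and its interpretation depends on how the counters are defined. `MEMORY_STALLS`, `share`, and `stall_per_request` are illustrative names.

```python
# (stall cycles, requests) per source, copied from the Memory Stalls breakdown.
MEMORY_STALLS = {
    "instruction fetch": (27_399_253, 46_182_344),
    "data load": (10_718_661, 892_084),
    "data store": (11_987_867, 222_648),
    "atomic memory op": (158_393, 184_717),
    "page walk for fetch": (1_135_398, 1_129_244),
    "page walk for load/store": (1_243_664, 1_243_154),
    "other": (7_285_042, 16_829_586),
}

TOTAL_STALL_CYCLES = sum(cycles for cycles, _ in MEMORY_STALLS.values())


def share(source: str) -> float:
    """Percentage of all memory stall cycles attributed to `source`."""
    return MEMORY_STALLS[source][0] / TOTAL_STALL_CYCLES * 100.0


def stall_per_request(source: str) -> float:
    """Average stall cycles per request for `source` (derived, not from the page)."""
    cycles, requests = MEMORY_STALLS[source]
    return cycles / requests


if __name__ == "__main__":
    print(f"total memory stall cycles: {TOTAL_STALL_CYCLES:,}")
    for name in MEMORY_STALLS:
        print(f"{name:26s} {share(name):5.1f}%  "
              f"{stall_per_request(name):7.2f} cycles/request")
```

Under these assumptions, fetch dominates by total cycles because of its request volume, while stores cost the most per request, which fits the write-through path listed under Current Shape.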
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,615,427 | 53% |
| /init to shell banner | 1,076,089 | 0.5% |
| shell banner to first command | 36,191,325 | 16.3% |
| echo command | 1,598 | 0% |
| uname -a | 2,604,026 | 1.2% |
| ls /bin /usr/share | 32,177,153 | 14.5% |
| cat sample file | 2,923,000 | 1.3% |
| touch/write/cat/rm /tmp file | 11,447,584 | 5.2% |
| 8x ash loop with file I/O | 16,330,121 | 7.4% |
| final marker | 1,515,216 | 0.7% |
The full BusyBox shell script reaches P99-FILE-OK.
Cycle Shape
| state | cycles | share |
|---|---|---|
| fetch | 8,352,672 | 3.8% |
| execute | 86,673,821 | 39% |
| mem | 28,152,093 | 12.7% |
| walker | 4,751,460 | 2.1% |
| writeback | 86,648,693 | 38.9% |
| mul/div | 7,929,149 | 3.6% |
No new execution state was added. P99 validates that the mapped core still runs Linux userspace.
Hot Functions
- 5.4% of samples (3,548 samples)
- 5.1% of samples (3,317 samples)
- 3.6% of samples (2,348 samples)
- 3.4% of samples (2,252 samples)
- 2.7% of samples (1,795 samples)
- 2.7% of samples (1,783 samples)
- 2.6% of samples (1,678 samples)
- 1.8% of samples (1,188 samples)
- 1.7% of samples (1,086 samples)
- 1.4% of samples (923 samples)
- 1.3% of samples (858 samples)
- 1.2% of samples (808 samples)
- 1.2% of samples (805 samples)
- 1% of samples (671 samples)
- 1% of samples (664 samples)
- 55.7% of samples (36,422 samples)
The software mix remains the same shell workload. The P99 page is about hardware service boundaries, not a new application.
Honest Status
| check | status |
|---|---|
| P98 RTL cloned and relabeled for P99 | PASS |
| Verilator build | PASS |
| BusyBox shell workload runs | PASS |
| P99 chart data captured | PASS |
| Harvard I/D map written | PASS |
| Shell-window speedup vs P98 | FAIL |
| True split I/D RAM ports | NOT RUN |
| Split ITLB/DTLB | NOT RUN |
| Data-side write buffer with forwarding | NOT RUN |
| Nonblocking miss machinery | NOT RUN |
| LibreLane hardening | NOT RUN |
Next
P100 should implement the first split-port model: instruction-side and data-side service intents near the core, with the old shared memory model kept underneath as a measured lower conflict point.
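One way to picture that first split-port model is a toy cycle loop in which each side posts its own service intents and a shared lower memory serves one per cycle while counting conflicts. Everything here is hypothetical: the `LowerMemory` class, the `run` helper, the fixed data-first priority, and the unbounded pending counters are assumptions chosen for illustration, not the P100 design.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class LowerMemory:
    """Shared lower memory that serves at most one request per cycle."""
    served_instruction: int = 0
    served_data: int = 0
    conflicts: int = 0

    def step(self, i_req: bool, d_req: bool, prefer_data: bool = True) -> Optional[str]:
        """Serve one side this cycle; count a conflict when both sides ask."""
        if i_req and d_req:
            self.conflicts += 1
            winner = "data" if prefer_data else "instruction"
        elif i_req:
            winner = "instruction"
        elif d_req:
            winner = "data"
        else:
            return None
        if winner == "instruction":
            self.served_instruction += 1
        else:
            self.served_data += 1
        return winner


def run(trace: Iterable[Tuple[bool, bool]], prefer_data: bool = True):
    """Replay (new_instruction_intent, new_data_intent) pairs, one pair per cycle.

    Intents that lose arbitration stay pending and retry on the next cycle.
    Returns the memory counters plus the intents still pending at the end.
    """
    mem = LowerMemory()
    i_pending = 0
    d_pending = 0
    for i_new, d_new in trace:
        i_pending += int(i_new)
        d_pending += int(d_new)
        winner = mem.step(i_pending > 0, d_pending > 0, prefer_data)
        if winner == "instruction":
            i_pending -= 1
        elif winner == "data":
            d_pending -= 1
    return mem, i_pending, d_pending


if __name__ == "__main__":
    demo = [(True, True), (True, False), (False, False), (False, True)]
    mem, i_left, d_left = run(demo)
    print(mem, i_left, d_left)
```

The point of the conflict counter is measurement: with the intents separated near the core, the cycles in which both sides compete for the lower memory become a visible number instead of being hidden inside one arbiter.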