No. 99 of 147 projects on the ladder

Harvard I/D map

Introduces: an explicit Harvard instruction/data service map, the current shared-port gap list, and the P100 split-port acceptance criteria.

Harden state: last run 2026-05-05
Signoff:
  • DRC: NOT RUN
  • LVS: NOT RUN
  • Antenna: NOT RUN

P99 is a map rung. It keeps the P98 core behavior, reruns the BusyBox shell profile under P99 labels, and writes down exactly what the Harvard instruction/data split needs to mean for this core.

The short version: we have I-cache and D-cache structures. We do not yet have independent instruction and data service.

Current Shape

area | current P99 reality | Harvard target
instruction cache | 64-line, 4-word direct-mapped I-cache | independent instruction-side L1 service
data cache | 64-line, 4-word write-through D-cache for aligned RAM LW/SW | independent data-side L1 service
fetch queue | one-entry next-instruction queue | enough frontend buffering to cover data hiccups
translation | one 8-entry unified TLB with separate fetch/LSU lookup wires | split ITLB/DTLB lookup and refill accounting
page-table walker | one walker tagged by ptw_for_fetch_q | split request queues or explicit lower arbitration
stores | write-through on the shared path | data-side write buffer with forwarding
lower memory | one selected mem_arb_class drives mem_valid | lower shared memory or banks with conflict counters
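The "stores" row asks for a data-side write buffer with forwarding. Below is a minimal Python sketch of that idea, not the planned RTL. It assumes word-granular addresses with no byte masks, a hypothetical 4-entry depth, and a dict standing in for lower memory; the class and method names are illustrative. Stores enter a FIFO and drain in order, and a load first checks the buffer for the youngest pending store to the same address.

```python
from collections import deque


class WriteBuffer:
    """Hypothetical data-side write buffer with store-to-load forwarding.

    Assumptions (not from the P99 RTL): word-granular addresses, no byte
    masks, a fixed depth, and a dict standing in for lower memory.
    """

    def __init__(self, lower_memory: dict[int, int], depth: int = 4) -> None:
        self.lower = lower_memory
        self.depth = depth
        # Pending stores as (addr, data), oldest on the left.
        self.pending: deque[tuple[int, int]] = deque()

    def store(self, addr: int, data: int) -> bool:
        """Accept a store unless the buffer is full (the caller must then stall)."""
        if len(self.pending) >= self.depth:
            return False
        self.pending.append((addr, data))
        return True

    def load(self, addr: int) -> int:
        """Forward from the youngest matching pending store, else read lower memory."""
        for pending_addr, data in reversed(self.pending):
            if pending_addr == addr:
                return data
        return self.lower.get(addr, 0)

    def drain_one(self) -> None:
        """Write the oldest pending store to lower memory (in order, one per call)."""
        if self.pending:
            addr, data = self.pending.popleft()
            self.lower[addr] = data
```

The youngest-first search is what keeps a load correct when two stores to the same word are still in flight; the in-order drain keeps lower memory seeing stores in program order.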

Result

metric | P94 arbiter | P98 throttle | P99 map
post-load cycles | 222,459,202 | 221,452,591 | 222,509,604
shell window cycles | 67,050,374 | 66,055,345 | 66,998,698
retired instructions | 86,664,089 | 86,329,983 | 86,648,693
CPI | 2.5669 | 2.5652 | 2.5680
memory stall cycles | 60,032,329 | 59,683,338 | 59,928,278
fetch stall cycles | 23,549,359 | 27,286,526 | 27,399,253
load stall cycles | 14,632,992 | 10,697,962 | 10,718,661

comparison | result
shell window vs P98 | +1.43%
post-load cycles vs P98 | +0.48%
memory stalls vs P98 | +0.41%
fetch stalls vs P98 | +0.41%
load stalls vs P98 | +0.19%
shell window vs P94 | -0.08%
load stalls vs P94 | -26.75%
fetch stalls vs P94 | +16.35%
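The comparison rows are plain percent changes over the Result table counters, and CPI is post-load cycles divided by retired instructions. A short Python sketch that recomputes them; the counters are copied from the table, and the helper names are illustrative only.

```python
# Counters copied from the Result table: (P94 arbiter, P98 throttle, P99 map).
RESULTS = {
    "post-load cycles": (222_459_202, 221_452_591, 222_509_604),
    "shell window cycles": (67_050_374, 66_055_345, 66_998_698),
    "retired instructions": (86_664_089, 86_329_983, 86_648_693),
    "memory stall cycles": (60_032_329, 59_683_338, 59_928_278),
    "fetch stall cycles": (23_549_359, 27_286_526, 27_399_253),
    "load stall cycles": (14_632_992, 10_697_962, 10_718_661),
}


def pct_delta(new: int, old: int) -> float:
    """Signed percent change from old to new (positive means more cycles)."""
    return (new - old) / old * 100.0


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per retired instruction."""
    return cycles / instructions


p94, p98, p99 = RESULTS["shell window cycles"]
print(f"shell window vs P98: {pct_delta(p99, p98):+.2f}%")  # +1.43%
print(f"shell window vs P94: {pct_delta(p99, p94):+.2f}%")  # -0.08%

# Every metric against both baselines.
for metric, (old94, old98, new) in RESULTS.items():
    print(f"{metric}: {pct_delta(new, old98):+.2f}% vs P98, "
          f"{pct_delta(new, old94):+.2f}% vs P94")
```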

P99 is a functional PASS. It is not a speed PASS versus P98: the shell window is 1.43% slower. That is expected, because P99 does not implement the split yet.

Current Request Classes

class | future side
fetch | instruction
execute_prefetch | instruction
writeback_prefetch | instruction
icache_background | instruction
load | data
store | data
fp_load | data
fp_store | data
amo | data
ptw_fetch | instruction translation
ptw_lsu | data translation
dcache_background | data
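The class table can be read as a routing function: every named client belongs to exactly one side, and the two page-walk classes follow the side they translate for. A sketch in Python, assuming that reading; the class names come from the table, while `Side`, `route_class`, and `classes_on` are hypothetical helpers, not identifiers from the RTL.

```python
from enum import Enum


class Side(Enum):
    """Future near-core service point for a request class (hypothetical names)."""
    INSTRUCTION = "instruction"
    DATA = "data"


# Request class -> (future side, is it a page-table-walk class?), from the table above.
REQUEST_CLASSES: dict[str, tuple[Side, bool]] = {
    "fetch": (Side.INSTRUCTION, False),
    "execute_prefetch": (Side.INSTRUCTION, False),
    "writeback_prefetch": (Side.INSTRUCTION, False),
    "icache_background": (Side.INSTRUCTION, False),
    "load": (Side.DATA, False),
    "store": (Side.DATA, False),
    "fp_load": (Side.DATA, False),
    "fp_store": (Side.DATA, False),
    "amo": (Side.DATA, False),
    "ptw_fetch": (Side.INSTRUCTION, True),
    "ptw_lsu": (Side.DATA, True),
    "dcache_background": (Side.DATA, False),
}


def route_class(name: str) -> Side:
    """Return the side that serves a class. Unknown classes raise, so a new
    client cannot silently fall back to a shared port."""
    try:
        return REQUEST_CLASSES[name][0]
    except KeyError:
        raise ValueError(f"unmapped request class: {name}") from None


def classes_on(side: Side) -> list[str]:
    """All classes a given side must serve, in table order."""
    return [name for name, (s, _) in REQUEST_CLASSES.items() if s is side]
```

Raising on an unmapped class is a design choice for the sketch: it turns "P100 should stop forcing them through one final port" into a check that fails loudly when a client has no side.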

P94 gave these clients names. P100 should stop forcing them through one final near-core port before the memory model can see them.

Memory Stalls

Memory stalls, P99 Harvard I/D map workload: 59,928,278 stall cycles over 66,683,777 handshakes

rank | source | stall cycles | share | requests
1 | instruction fetch | 27,399,253 | 45.7% | 46,182,344
2 | data load | 10,718,661 | 17.9% | 892,084
3 | data store | 11,987,867 | 20.0% | 222,648
4 | atomic memory op | 158,393 | 0.3% | 184,717
5 | page walk for fetch | 1,135,398 | 1.9% | 1,129,244
6 | page walk for load/store | 1,243,664 | 2.1% | 1,243,154
7 | other | 7,285,042 | 12.2% | 16,829,586

The profile keeps the same basic shape. Load stalls are 26.75% below P94 thanks to the D-cache work, but fetch stalls are 16.35% above P94 because the frontend is still tied to the shared service point.
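Regrouping the memory-stall rows by the future side from the request-class table gives a rough picture of how much stall time each port would own after the split. This is an accounting sketch only. It assumes "page walk for fetch" belongs to the instruction side and "page walk for load/store" to the data side, and it leaves "other" unassigned because the profile does not say which classes it contains.

```python
# (stall cycles, requests) per row of the memory-stall chart.
STALLS = {
    "instruction fetch": (27_399_253, 46_182_344),
    "data load": (10_718_661, 892_084),
    "data store": (11_987_867, 222_648),
    "atomic memory op": (158_393, 184_717),
    "page walk for fetch": (1_135_398, 1_129_244),
    "page walk for load/store": (1_243_664, 1_243_154),
    "other": (7_285_042, 16_829_586),
}

# Assumed future side for each row; "other" stays unassigned.
SIDE = {
    "instruction fetch": "instruction",
    "page walk for fetch": "instruction",
    "data load": "data",
    "data store": "data",
    "atomic memory op": "data",
    "page walk for load/store": "data",
    "other": "unassigned",
}


def stalls_by_side() -> dict[str, int]:
    """Sum stall cycles per assumed side."""
    totals: dict[str, int] = {}
    for row, (cycles, _requests) in STALLS.items():
        totals[SIDE[row]] = totals.get(SIDE[row], 0) + cycles
    return totals


def stalls_per_request(row: str) -> float:
    """Average stall cycles charged to each request in one row."""
    cycles, requests = STALLS[row]
    return cycles / requests


totals = stalls_by_side()
grand_total = sum(totals.values())  # equals the 59,928,278 chart total
for side, cycles in totals.items():
    print(f"{side}: {cycles:,} cycles ({cycles / grand_total:.1%})")
```

Under those assumptions the instruction side carries 28,534,651 stall cycles (47.6%) and the data side 24,108,585 (40.2%), with 12.2% left in "other".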

Shell Phases

Shell phases, P99 shell workload: 222,509,604 cycles, CPI 2.57

rank | phase | cycles | share
1 | kernel banner to /init | 117,615,427 | 53%
2 | /init to shell banner | 1,076,089 | 0.5%
3 | shell banner to first command | 36,191,325 | 16.3%
4 | echo command | 1,598 | 0%
5 | uname -a | 2,604,026 | 1.2%
6 | ls /bin /usr/share | 32,177,153 | 14.5%
7 | cat sample file | 2,923,000 | 1.3%
8 | touch/write/cat/rm /tmp file | 11,447,584 | 5.2%
9 | 8x ash loop with file I/O | 16,330,121 | 7.4%
10 | final marker | 1,515,216 | 0.7%

The full BusyBox shell script reaches P99-FILE-OK.
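The shell-window counter in the Result table can be cross-checked against the phase list: phases 4 through 10 (echo command through final marker) add up to exactly 66,998,698 cycles. Reading the shell window as "first command through final marker" is an inference from that match, not something the page states. A Python sketch of the check; `window_cycles` is a hypothetical helper.

```python
# Phase cycles from the shell-phase list, in order.
PHASES = [
    ("kernel banner to /init", 117_615_427),
    ("/init to shell banner", 1_076_089),
    ("shell banner to first command", 36_191_325),
    ("echo command", 1_598),
    ("uname -a", 2_604_026),
    ("ls /bin /usr/share", 32_177_153),
    ("cat sample file", 2_923_000),
    ("touch/write/cat/rm /tmp file", 11_447_584),
    ("8x ash loop with file I/O", 16_330_121),
    ("final marker", 1_515_216),
]
POST_LOAD_CYCLES = 222_509_604    # Result table, P99 column
SHELL_WINDOW_CYCLES = 66_998_698  # Result table, P99 column


def window_cycles(first: str, last: str) -> int:
    """Sum the phases from `first` through `last`, inclusive."""
    names = [name for name, _ in PHASES]
    start, end = names.index(first), names.index(last)
    return sum(cycles for _, cycles in PHASES[start:end + 1])


print(window_cycles("echo command", "final marker") == SHELL_WINDOW_CYCLES)  # True
print(POST_LOAD_CYCLES - sum(c for _, c in PHASES))  # 628065 cycles outside the listed phases
```

The ten listed phases sum to 221,881,539 cycles, so about 0.6M post-load cycles fall outside the labelled phases.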

Cycle Shape

State breakdown, P99 Harvard I/D map workload: 222,509,604 cycles, CPI 2.57

state | share | cycles
fetch | 3.8% | 8,352,672
execute | 39% | 86,673,821
mem | 12.7% | 28,152,093
walker | 2.1% | 4,751,460
writeback | 38.9% | 86,648,693
mul/div | 3.6% | 7,929,149

No new execution state was added. P99 validates that the mapped core still runs Linux userspace.
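One detail in the state breakdown lines up with the Result table: the writeback count, 86,648,693, equals the retired-instruction count exactly, which suggests one writeback cycle per retired instruction. The Python sketch below treats that reading as an assumption and uses it to recover CPI from the state counts, then prints each state's share of total cycles.

```python
STATE_CYCLES = {
    "fetch": 8_352_672,
    "execute": 86_673_821,
    "mem": 28_152_093,
    "walker": 4_751_460,
    "writeback": 86_648_693,
    "mul/div": 7_929_149,
}
TOTAL_CYCLES = 222_509_604         # chart header and Result table
RETIRED_INSTRUCTIONS = 86_648_693  # Result table, P99 column


def shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Fraction of total cycles spent in each state."""
    return {state: cycles / total for state, cycles in counts.items()}


# Assumption: exactly one writeback cycle per retired instruction.
assert STATE_CYCLES["writeback"] == RETIRED_INSTRUCTIONS
cpi = TOTAL_CYCLES / STATE_CYCLES["writeback"]
print(f"CPI = {cpi:.4f}")  # CPI = 2.5680

for state, share in shares(STATE_CYCLES, TOTAL_CYCLES).items():
    print(f"{state}: {share:.1%}")
```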

Hot Functions

Hot functions, P99 BusyBox shell symbols: 65,428 samples, one sample every 1,024 cycles

rank | symbol | binary | share | samples
1 | printf_core | busybox | 5.4% | 3,548
2 | memset | kernel | 5.1% | 3,317
3 | memcpy | busybox | 3.6% | 2,348
4 | vruntime_eligible | kernel | 3.4% | 2,252
5 | blake2s_compress_generic | kernel | 2.7% | 1,795
6 | __fwritex | busybox | 2.7% | 1,783
7 | memcpy | kernel | 2.6% | 1,678
8 | handle_exception | kernel | 1.8% | 1,188
9 | unmap_page_range | kernel | 1.7% | 1,086
10 | avg_vruntime | kernel | 1.4% | 923
11 | n_tty_write | kernel | 1.3% | 858
12 | memset | busybox | 1.2% | 808
13 | ret_from_exception | kernel | 1.2% | 805
14 | n_tty_read | kernel | 1% | 671
15 | next_uptodate_folio | kernel | 1% | 664
16 | (remaining) | n/a | 55.7% | 36,422

The software mix remains the same shell workload. The P99 page is about hardware service boundaries, not a new application.
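Because the hot-function profile takes one sample every 1,024 cycles, its sample counts convert back to cycles. 65,428 samples × 1,024 = 66,998,272 cycles, within one sampling period of the 66,998,698-cycle shell window, which suggests the profile covers the shell window rather than the whole post-load run. That is an inference from the numbers, not a statement on the page. A small Python sketch of the conversion; the helper names are illustrative.

```python
SAMPLE_PERIOD_CYCLES = 1_024
TOTAL_SAMPLES = 65_428
SHELL_WINDOW_CYCLES = 66_998_698  # Result table, P99 column


def samples_to_cycles(samples: int, period: int = SAMPLE_PERIOD_CYCLES) -> int:
    """Approximate cycles represented by a sample count (one sample per period)."""
    return samples * period


def share_of_samples(samples: int, total: int = TOTAL_SAMPLES) -> float:
    """Fraction of all samples attributed to one symbol."""
    return samples / total


covered = samples_to_cycles(TOTAL_SAMPLES)
print(covered)                           # 66998272
print(SHELL_WINDOW_CYCLES - covered)     # 426, less than one period
print(f"{share_of_samples(3_548):.1%}")  # printf_core: 5.4%
print(samples_to_cycles(3_548))          # 3633152 cycles (approximate) in printf_core
```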

Honest Status

check | status
P98 RTL cloned and relabeled for P99 | PASS
Verilator build | PASS
BusyBox shell workload runs | PASS
P99 chart data captured | PASS
Harvard I/D map written | PASS
Shell-window speedup vs P98 | FAIL
True split I/D RAM ports | NOT RUN
Split ITLB/DTLB | NOT RUN
Data-side write buffer with forwarding | NOT RUN
Nonblocking miss machinery | NOT RUN
LibreLane hardening | NOT RUN

Next

P100 should implement the first split-port model: instruction-side and data-side service intents near the core, with the old shared memory model kept underneath as a measured lower conflict point.
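A toy accounting model of the difference P100 is after, written in Python. It is a sketch under strong assumptions, not the planned RTL. Each cycle the core may present one instruction-side and one data-side request. With a single near-core port, any same-cycle pair costs one stall. With split service, both are handled near the core, and only a same-cycle pair of misses reaches the shared lower memory, where it is counted as one conflict. Miss latency, queueing, and prefetch traffic are ignored, and the `Cycle` trace format is hypothetical.

```python
from typing import NamedTuple, Sequence


class Cycle(NamedTuple):
    """What the core asks for in one cycle (hypothetical trace format)."""
    fetch: bool       # instruction side has a request
    fetch_miss: bool  # ...and it misses the near-core instruction service
    data: bool        # data side has a request
    data_miss: bool   # ...and it misses the near-core data service


def shared_port_stalls(trace: Sequence[Cycle]) -> int:
    """One final near-core port: a same-cycle I/D pair costs one stall,
    whether the accesses hit or miss."""
    return sum(1 for c in trace if c.fetch and c.data)


def split_port_conflicts(trace: Sequence[Cycle]) -> int:
    """Independent I/D service near the core: only cycles where both sides
    miss collide at the shared lower memory (the measured conflict point)."""
    return sum(
        1 for c in trace
        if c.fetch and c.fetch_miss and c.data and c.data_miss
    )


trace = [
    Cycle(fetch=True, fetch_miss=False, data=True, data_miss=False),   # both hit
    Cycle(fetch=True, fetch_miss=True, data=True, data_miss=False),    # I-side miss only
    Cycle(fetch=True, fetch_miss=True, data=True, data_miss=True),     # both miss
    Cycle(fetch=True, fetch_miss=False, data=False, data_miss=False),  # fetch only
]
print(shared_port_stalls(trace))    # 3
print(split_port_conflicts(trace))  # 1
```

Keeping the old shared memory underneath, with a conflict counter, is what makes the remaining I/D collisions measurable instead of hiding them inside one near-core port.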