section 06/roadmap

The climb

From hardened teaching chips through Linux bring-up, then toward a XiangShan-inspired frontend and memory system without pretending this notebook core is a Kunminghu-class machine.

current: P147 · strict execute-prefetch guard
last hardened: P45 · A extension, plus earlier hardened rungs
next architecture arc: P148 · pivot away from execute-prefetch repair toward the next frontend/memory bottleneck
north star: XiangShan-style shape, notebook-scale steps

§ Milestone reached: FreeRTOS on hardware

P43 closes the FreeRTOS arc: an unmodified FreeRTOS V11.1.0 kernel, three application tasks plus idle, queue-based message passing, timer-driven preemption, and a clean halt convention - all running on a hardened sky130A GDS at 0 errors across DRC, LVS, antenna, routing, setup, hold, and fanout.

P43 - FreeRTOS multi-task demo, hardened
  UART output : S a b c d e f g h D
  Cycles      : 5,100,751
  Halt        : MMIO_HALT(0x10001ff8) <- 1, halted=1
  GDS         : projects/43_freertos_hardened/librelane/runs/RUN_2026-05-03_03-20-20/final/gds/top.gds
  Std cells   : 28,079    Setup slack : 6.568 ns
  DRC/LVS/ANT : PASS      Hold slack  : 0.107 ns

That makes this the first hardened ladder rung where a real third-party RTOS runs on the chip we built. Twelve rungs from P32 through P43 brought us here: RV32M, MMIO platform, trap frame, Zicntr counters, FreeRTOS port, scheduler bring-up, and finally an MMIO halt port that fixed a real RTL bug found because real software ran on the chip.

That’s the FreeRTOS milestone. From here the climb gets steeper.

§ Where this page changes

Earlier versions of this page treated FreeRTOS and then Linux as future milestones. Those arcs have now been walked far enough to boot Linux 6.12.85, run a userspace PID 1, boot BusyBox, profile shell workloads, and measure frontend experiments. The next arc is architectural: make the core less blocking, starting with the frontend and memory system.

The older Linux plan is preserved below as history because it explains why the core has S-mode, Sv32, SBI, BusyBox, AtomVM, and profiling infrastructure. The live path starts here.

§ What the last backend rungs bought us

The recent backend rungs have mostly been proof work, not speed work. That is why they can feel abstract.

P123 found 40.40M cycles where the frontend had queued work while the backend was still busy. That was the reason to look at dispatch and issue at all. P124 through P129 then asked whether independent ready work already existed behind the old FSM. The answer was basically no: integer, memory, and control records stayed one-deep, and dual-ready cycles stayed at zero.

So P130 through P133 changed the goal from “make it faster right now” to “make the boundary real enough to modify safely.” We now have an explicit valid/ready/fire contract, module-owned queue state, module-owned payload records, and 22.66M payload-class audit checks with zero mismatches under Linux and BusyBox.
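
To make the valid/ready/fire wording concrete, here is a minimal Python sketch of a one-entry queue with that kind of handshake. It is a behavioral model under our own assumptions, not the repo's RTL: the class name `OneEntryQueue`, its counters, and the rule that the slot may refill in the same cycle it drains are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OneEntryQueue:
    """Hypothetical one-entry dispatch queue model (not the repo's module)."""
    payload: Optional[dict] = None
    arrival_fires: int = 0
    service_fires: int = 0
    backpressure_cycles: int = 0
    max_occupancy: int = 0

    def step(self, in_valid: bool, in_payload: Optional[dict],
             out_ready: bool) -> Optional[dict]:
        """Advance one cycle; return the payload serviced this cycle, if any."""
        out_valid = self.payload is not None
        service_fire = out_valid and out_ready
        # Assumption: the slot accepts when empty or when draining this cycle.
        in_ready = (not out_valid) or service_fire
        arrival_fire = in_valid and in_ready

        serviced = self.payload if service_fire else None
        if service_fire:
            self.payload = None
            self.service_fires += 1
        if arrival_fire:
            self.payload = in_payload
            self.arrival_fires += 1
        elif in_valid:
            # Producer had work but the queue could not take it.
            self.backpressure_cycles += 1
        occupancy = 1 if self.payload is not None else 0
        self.max_occupancy = max(self.max_occupancy, occupancy)
        return serviced


q = OneEntryQueue()
q.step(True, {"pc": 0x80000000}, out_ready=False)   # arrival fires
q.step(True, {"pc": 0x80000004}, out_ready=False)   # backpressure, no fire
first = q.step(False, None, out_ready=True)          # service fires
```

A fire only happens when valid and ready are both high in the same cycle, so counting arrival fires, service fires, and backpressure cycles gives the kind of accounting that can be checked against older counters.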

The breakthrough has not happened yet. The core is still a serialized backend with max queue occupancy 1 in every class. The useful decision was to pivot back to frontend and memory work, where previous rungs produced measured shell-window wins.

P134 does that: it reopens guarded aux-load issue, completes 139,881 load misses through the auxiliary lane, and trims the shell window by 360,275 cycles versus P133. P135 keeps that policy and audits the remaining blocks. After the frontend prefetch guard passes, only 196K candidates are blocked by D-cache background fill alone, while 1.62M are blocked by I-cache background fill alone and 1.86M are blocked by both background paths. P136 tests that target by preempting I-cache background fill. It issues 1.77M aux loads with 0 drops, errors, or cancels, but the shell window worsens by 1.98M cycles versus P135. The current lesson is concrete: unbounded preemption is too aggressive; the next rung needs bounded arbitration or a rollback to the quiet-I-cache guard.

P137 adds the first bound: one I-cache-background preemption may issue, then the next otherwise-safe I-cache-only candidate defers. It issues 1.67M aux loads with 0 drops, errors, or cancels and recovers the shell window to 65.36M cycles, a tiny 55K-cycle win over P135. P138 replaces that burst cap with explicit repair debt. It keeps 1.67M aux-load issues, balances 1.50M preemptions with 1.50M debt paydowns, and improves the shell window to 64.64M cycles. That beats P137 and P135, but still misses P134 by 416K cycles.

P139 tags background-repaired I-cache words and counts later fetch use. It records 52.85M repair word fills but only 1.93M first later fetch hits and 0.83M repeat fetch hits, so the repair stream is fair but mostly not promptly useful. P140 makes that policy value-aware with a one-word adjacent repair budget. It cuts repair fills to 32.05M and improves 673K cycles versus P139, but still trails P138 and P134 because it loses too much line locality. P141 grants a second repair word to demand-fetch lines while keeping prefetch-only lines at one word. It keeps repair fills near P140, beats P140 by 746K cycles, beats P138 by 348K cycles, and lands only 67.6K cycles behind P134. P142 gives that second repair word to prefetch fills that are immediately consumed by the frontend. It spends more repair traffic, but converts enough of it into I-cache hits to beat P141 by 362.6K cycles and P134 by 295.0K cycles.

P143 classifies those repaired words by source. It finds the broad P142 bucket is mostly execute-prefetch traffic: 39.10M repair fills with only a 2.83% first+repeat hit ratio. Demand, writeback-prefetch, and aux-prefetch repair have much stronger payback. P144 removes the second repair word from execute-prefetch fills by default. It cuts 10.01M repair fills and beats P143 by 564K cycles, but remains 266K cycles slower than P142, so the class-wide throttle is too blunt. P145 restores that second word only when the prefetched instruction’s sequential next PC is the adjacent word in the same I-cache line. It passes RTL and the BusyBox shell workload, but it is a speed FAIL: 908K cycles slower than P144.

P146 rolls the active policy back to P144 and shadow-counts several candidate predicates. The P145 seq_adjacent predicate produces 20,552 fills and zero later fetch hits; the best single broader predicate, predicted_not_taken, is still only 2.91% first+repeat hits per fill. That makes one more strict composite-guard experiment the limit before we pivot away from this execute-prefetch repair bucket. P147 runs that final strict guard: predicted_not_taken && word_not_last && quiet_backend. It passes Linux and BusyBox and beats P146 by 850K cycles, but it still loses to P144 by 620K cycles and P142 by 886K cycles. That closes the execute-prefetch second-word repair thread.
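
The repair-debt idea from P138 can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above: a preemption adds debt, and I-cache background service pays it down. The class name `DebtArbiter`, the `debt_limit` parameter, and the exact grant priority are hypothetical and are not taken from the RTL.

```python
class DebtArbiter:
    """Hypothetical repair-debt arbiter for one shared memory slot."""

    def __init__(self, debt_limit: int = 1):
        self.debt_limit = debt_limit  # assumed bound, not a value from the repo
        self.debt = 0
        self.preemptions = 0
        self.paydowns = 0
        self.deferrals = 0

    def grant(self, aux_load_wants: bool, icache_bg_wants: bool) -> str:
        """Return the winner this cycle: 'aux_load', 'icache_bg', or 'idle'."""
        if aux_load_wants and icache_bg_wants:
            if self.debt < self.debt_limit:
                # Preempt background repair, but remember that we owe it a slot.
                self.debt += 1
                self.preemptions += 1
                return "aux_load"
            # Too much debt: the aux load defers and repair gets the slot.
            self.deferrals += 1
        if icache_bg_wants:
            if self.debt > 0:
                self.debt -= 1
                self.paydowns += 1
            return "icache_bg"
        if aux_load_wants:
            return "aux_load"
        return "idle"


arb = DebtArbiter(debt_limit=1)
wants = [(True, True), (True, True), (True, True), (False, True), (True, False)]
grants = [arb.grant(a, i) for a, i in wants]
```

With a limit of 1 this behaves much like a one-preempt/one-defer burst cap; a larger limit lets preemptions run ahead of repair before the debt has to be paid, which is the knob a debt counter adds over a fixed burst rule.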

§ Phase 6 - XiangShan-inspired frontend and memory arc (P91-P147)

This does not mean cloning XiangShan. XiangShan/Kunminghu is a team-scale RV64 out-of-order application core with wide fetch/decode, branch prediction, nonblocking caches and TLBs, rename, schedulers, a ROB, physical register files, vector/FPU, and major verification infrastructure. This repo is still a paced RV32 teaching core.

The useful lesson from XiangShan is the shape: decouple fetch from execute, make instruction and data memory service nonblocking where possible, measure the bottlenecks, and only then consider out-of-order machinery. XiangShan’s public Kunminghu V2R2 guide frames the big machine as IFU, IDU, rename, out-of-order dispatch/issue, integer/FP/vector execution, LSU, ROB, MMU, PMP/PMA, and L2 cache subsystems; our teaching path is deliberately extracting the smallest checkable ideas from that shape.

P90 proved the immediate mistake. A 4-word I-cache line gets more hits, but a blocking S_IC_FILL state made the BusyBox shell workload slower. P91 fixed that policy. P92 then added a one-entry fetch queue; it cut fetch-class stalls but did not yet improve shell-window time. P93 added a shadow predictor and measured whether the workload has useful control-flow regularity before letting prediction steer fetch.

P94 split the shared memory request path into named clients so the next data-side experiments can stop guessing. P95 tried the conservative one-entry store buffer and proved that merely moving store wait into a blocking drain policy makes the shell workload worse. P96 added the first word-only D-cache, cut load stalls by 24.99%, and produced a modest 1.44% shell-window speedup over P94 while also exposing the need for better line-fill policy. P97 tried four-word D-cache lines with critical-word-first demand loads. It improved local D-cache hit behavior but lost the shell window because background data fills stole shared RAM service from fetch. P98 throttled that background fill so it only runs in frontend-safe slots. It recovered the P96 shell timing, but the result is still one-port policy work.

The P99 map defines the Harvard instruction/data split directly. P100 turns that map into measured instruction/data service counters while leaving the lower shared memory path in place. P101 splits the unified TLB into ITLB and DTLB banks, cutting translation walks hard enough to make the shell workload faster. P102 then tried a core-local store buffer and produced a useful correctness failure: Linux reaches /init, but BusyBox faults before the shell prompt after only 79 buffered user stores. P103 traced that failure and repaired the transaction boundary: a store-buffer entry is now cleared only when the store-buffer request actually wins the memory grant and memory accepts it.
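
The P103 rule is easy to state and easy to get wrong, so here is a small Python sketch of a one-entry store buffer that clears its entry only when the request wins the grant and memory accepts it. It is a toy model under stated assumptions: `StoreBuffer`, `cycle`, and `forward` are hypothetical names, memory is a plain dict, and the real RTL signals are not shown on this page.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StoreBuffer:
    """Hypothetical one-entry store buffer; entry is (address, data)."""
    entry: Optional[Tuple[int, int]] = None
    accepts: int = 0
    drains: int = 0

    def accept(self, addr: int, data: int) -> bool:
        """Buffer a store if the single slot is free."""
        if self.entry is not None:
            return False
        self.entry = (addr, data)
        self.accepts += 1
        return True

    def forward(self, addr: int) -> Optional[int]:
        """Return buffered data for a load to the same address, if any."""
        if self.entry is not None and self.entry[0] == addr:
            return self.entry[1]
        return None

    def cycle(self, grant: bool, mem_accept: bool,
              memory: Dict[int, int]) -> bool:
        """Drain only when the request won arbitration AND memory took it."""
        if self.entry is None or not (grant and mem_accept):
            return False  # keep the entry; the store has not happened yet
        addr, data = self.entry
        memory[addr] = data
        self.entry = None
        self.drains += 1
        return True


mem: Dict[int, int] = {}
sb = StoreBuffer()
sb.accept(0x1000, 0xABCD)
sb.cycle(grant=False, mem_accept=True, memory=mem)   # lost arbitration: entry stays
sb.cycle(grant=True, mem_accept=False, memory=mem)   # memory not ready: entry stays
sb.cycle(grant=True, mem_accept=True, memory=mem)    # store lands, entry clears
```

The point of the sketch is the `grant and mem_accept` condition: if the entry were cleared on either signal alone, a buffered store could vanish without ever reaching memory.
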
P104 then measured the lower shared-memory problem directly: 71.40% of simultaneous instruction/data lower-memory wants map to different word-interleaved banks, but the current one-port fabric still serializes them. P105 adds a conservative banked service model and finds 8.29M shell-window extra grants that could be serviced on different lower banks if the memory contract were widened. P106 widens that contract: the RTL emits an auxiliary read lane and the Verilator memory model services 20.63M auxiliary reads with 0 errors while Linux runs. P107 feeds that response back into one narrow core client: D-cache background fill. It consumes 10.07M auxiliary fills and trims the shell window by 288,864 cycles versus P106, which is a mechanism PASS but not the end of the memory problem.

P108 adds the first instruction-side consumer: blocked writeback prefetch fills the I-cache through the auxiliary response. It consumes 488K instruction-side aux prefetch fills and improves the shell window by another 445,555 cycles. P109 makes that same auxiliary writeback-prefetch response demand-visible when S_WB drains the store buffer on the main port. It bypasses 488K prefetches into S_EXECUTE and cuts S_FETCH by 481,840 cycles, but the shell-window result is mixed. P110 routes those responses through one owner/address/data/error/cancel record. It counts 488K writeback prefetch responses and 9.98M D-cache background responses with 0 errors and 0 cancels.

P111 makes the load owner real: 3.545M aligned integer load misses complete through the auxiliary response while the main port fetches a safe next-PC word. That proves the tagged slot can carry architectural load data, but the first policy is too eager and regresses the shell window by 1,005,481 cycles versus P110. P112 puts that load response behind a one-entry queue. It records 3.677M enqueues and matching dequeues with 0 drops, 0 errors, and 0 cancels, but the registered completion cycle pushes the shell window to 71.95M cycles.
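
The P104/P105 measurement boils down to a bank-index comparison per cycle. The Python sketch below shows one way to count it, assuming 4-byte words and two word-interleaved banks; the bank count, the function names, and the sample addresses are our assumptions for illustration, since this page does not state them.

```python
from typing import Dict, Iterable, Tuple

WORD_BYTES = 4   # RV32 word size
NUM_BANKS = 2    # assumed bank count; the page does not state it


def bank_of(addr: int, num_banks: int = NUM_BANKS) -> int:
    """Word-interleaved mapping: consecutive words land in consecutive banks."""
    return (addr // WORD_BYTES) % num_banks


def count_overlap(samples: Iterable[Tuple[int, int, bool]],
                  num_banks: int = NUM_BANKS) -> Dict[str, int]:
    """Each sample is (instr_addr, data_addr, blocked_is_read) for one cycle
    in which both sides want lower memory at once."""
    total = different = extra_grants = 0
    for instr_addr, data_addr, blocked_is_read in samples:
        total += 1
        if bank_of(instr_addr, num_banks) != bank_of(data_addr, num_banks):
            different += 1
            # Conservative rule: only read-like blocked requests get the
            # modeled second grant.
            if blocked_is_read:
                extra_grants += 1
    return {"total": total, "different_bank": different,
            "extra_grants": extra_grants}


demo = count_overlap([
    (0x80000000, 0x80001004, True),    # banks 0 and 1, read: extra grant
    (0x80000004, 0x80001004, True),    # both bank 1: still serialized
    (0x80000008, 0x8000100C, False),   # banks differ, but blocked side is a write
])
```

The gap between `different_bank` and `extra_grants` is the reason a conservative service model can report fewer usable grants than the raw different-bank count suggests.
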
P113 then tightens the issue policy, blocks all eager aux-load candidates, and recovers most of P112’s regression. P114 measures PTW aux ownership and finds no safe read-like walker candidates in the shell workload.

P115 adds the frontend target queue metadata needed before active predictor steering: 54.24M fills, 54.24M consumes, and 0 flushes. P116 tries to use that metadata for active steering, catches a real early-boot failure in the live predicted-target prefetch path, and lands as a guarded counter rung: 4.46M steering candidates, 0 issued fills, Linux and BusyBox still PASS. P117 adds the one-entry speculative target-buffer record and proves the next hazard more precisely: with live issue enabled only in userspace, Linux reaches /init, then BusyBox faults at badaddr=0x00000000. The checked-in guarded version records 215K candidates, 0 issued fills, and keeps the shell profile passing.

P118 switches back to the data side and names the current monolithic execute plus S_MEM path as an LSU-shaped measurement: 27.90M address-generation events, 27.22M DTLB hits, 671K DTLB misses, 4.60M D-cache-hit cycles, and 1.16M store-buffer accepts. P119 adds the first shadow request record and scoreboard-style busy accounting: 27.89M request allocs, 27.75M classified completes, 443 flushes, and 30.69M busy cycles while the BusyBox shell still passes.

P120 starts the backend-renaming arc without changing architectural state: a 64-entry shadow integer PRF map records 147.10M source reads, 59.31M integer destination allocations, matching frees/commits, and a passing BusyBox shell profile. Because the current in-order writeback allocates and frees in the same cycle, live physical-register pressure remains 32; P121 is where ROB lifetime should make that pressure real. P121 adds that first lifetime model: one shadow ROB record allocates a physical tag at S_EXECUTE, commits or flushes it at S_WB/trap time, and keeps architectural regs[] unchanged. The shell workload passes with 59.18M ROB allocs, 59.18M commits, 179 flushes, 0 missing commits, and max live PRF pressure of 33. P122 grows that record into a four-entry ring and gets the useful negative result: Linux and BusyBox still pass, alloc/commit/flush accounting balances, but max occupancy is still 1. The current FSM has no separate dispatch/issue path, so a bigger ROB is just a bigger counter container until P123 splits backend progress from writeback.

P123 measures that split point directly: 40.40M cycles where the frontend has queued work while the backend is still busy. The shadow dispatch queue allocs and drains 40.40M records with no full blocks and max occupancy 1, which says the next backend rung should be a real shadow issue slot with explicit block reasons. P124 adds that one-entry shadow integer issue slot. It accepts 9.80M simple integer queued ops, blocks 11.30M on modeled dependencies, 6.66M on memory-class work, 9.33M on control flow, and 76.6K on system/fence instructions. The next backend rung is source-ready bookkeeping and class-specific holding records. P125 adds the source-ready half: an architectural busy-bit scoreboard over queued simple integer ops. It finds 13.10M all-sources-ready candidates and 11.31M source-busy candidates, while max busy architectural register count stays 1. That keeps the next step pointed at memory/control class records, not a bigger integer slot. P126 adds those records: a one-entry memory holding model and a one-entry control holding model. It counts 6.66M memory candidates, 9.33M control candidates, 3.41M memory source-busy blocks, 3.02M control source-busy blocks, and 29.9K control full-record blocks. The next useful backend step is wakeup/issue eligibility across the held classes.

P127 adds that ready-mask model. It samples 43.66M cycles with at least one held record and finds 37.18M cycles where some record is ready, but every ready cycle is single-class: zero integer+memory, integer+control, memory+control, or triple-ready cycles. The next useful backend step is queue/lifetime depth so held work can coexist before any real multi-issue selector. P128 adds that queue model: capacity-4 integer, memory, and control shadow queues with a one-lane fixed-priority drain. It passes the shell workload, but each queue still maxes at occupancy 1 and the model records zero dual-ready cycles. The next useful backend step is to decouple scheduler arrival from backend service more honestly before trying a two-issue picker. P129 moves that arrival accounting earlier and still gets max occupancy 1, with 13.10M integer arrivals, 3.25M memory arrivals, 6.31M control arrivals, and zero dual-ready cycles. That answers the abstraction question: the current split is useful instrumentation, but not yet a proper frontend/backend ready/valid contract.

P130 extracts that contract into a plain-RTL helper with explicit valid, ready, and fire signals. It passes the shell workload and matches the older scheduler counters, but still reports zero backpressure, zero dual-ready cycles, and no queue depth beyond 1. The next useful step is a small state-owning dispatch queue module, not a picker. P131 takes that step: the queue state is owned by p131_dispatch_queue_module3, and the module’s fires, backpressure, ready count, max occupancy, and flush clears match the older shadow accounting under the BusyBox shell profile. It still maxes every class at occupancy 1, so the next refactor is payload ownership, not issue width. P132 adds that ownership: the module captures PC, opcode, rd, rs1, rs2, and source-use bits on arrival fire. Payload accepts match arrival fires, payload services match service fires, payload flush clears account for the remaining delta, and invariant errors stay at zero.
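
The P125 busy-bit scoreboard and the P127 ready mask fit together naturally, and a short Python sketch makes the counting rule explicit. It assumes each held record is just a list of source register numbers and that a record is ready when none of its sources is busy; the names `ready_mask` and `tally` and the sample data are hypothetical, not the shadow model's actual interface.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Optional, Set, Tuple

CLASSES = ("integer", "memory", "control")
Held = Dict[str, Optional[List[int]]]  # class -> source registers, or None


def ready_mask(held: Held, busy_regs: Set[int]) -> FrozenSet[str]:
    """Classes whose held record has no busy source register."""
    return frozenset(
        cls for cls in CLASSES
        if held.get(cls) is not None and not (set(held[cls]) & busy_regs)
    )


def tally(samples: Iterable[Tuple[Held, Set[int]]]) -> Counter:
    """Count held cycles, single-class ready cycles, and multi-class ready cycles."""
    counts: Counter = Counter()
    for held, busy_regs in samples:
        if not any(held.get(cls) is not None for cls in CLASSES):
            continue  # nothing held this cycle, so it is not sampled
        counts["held_cycles"] += 1
        mask = ready_mask(held, busy_regs)
        if len(mask) == 1:
            counts["single_ready"] += 1
        elif len(mask) >= 2:
            counts["multi_ready"] += 1
    return counts


counts = tally([
    ({"integer": [1, 2]}, set()),                 # one record, ready
    ({"integer": [1], "memory": [3]}, {3}),       # memory waits on x3
    ({"memory": [3]}, {3}),                       # held but not ready
    ({}, set()),                                  # nothing held
    ({"integer": [1], "control": [2]}, set()),    # two classes ready together
])
```

In the shell workload the measured equivalent of `multi_ready` stayed at zero, which is what kept the later rungs focused on queue depth and the module boundary rather than a multi-issue picker.
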
P133 closes the conservative module-boundary proof: it compares every accepted module-owned payload against the older integer, memory, and control classifiers. The shell workload records 22.66M class audits, including 13.10M integer, 3.25M memory, and 6.31M control audits, with 0 mismatches. This still does not create speed or queue depth; it creates the cleanest point so far to decide whether an active dispatch queue is worth trying.

P134 chooses the pivot instead: the aux-load path is re-enabled only when the main port can perform a useful next-PC instruction prefetch and both cache background fill paths are quiet. It issues 139,881 aux loads, records 0 queue drops, and improves shell-window time by 0.56% versus P133. The next question is no longer “more backend scaffolding?” It is “which background cache policy is blocking useful overlap?” P135 answers that with mutually exclusive buckets. The D-cache-only bucket is small at 196K. The I-cache-only bucket is 1.62M, and both-backgrounds is 1.86M. P135 is not a speed rung; its shell window is worse than P134. Its value was the narrow target it gave P136: allow useful next-PC prefetch plus aux-load issue to preempt I-cache background fill while keeping D-cache background fill quiet.

P136 runs that exact test. The hardware path stays correct and issues 1.77M aux loads with 0 queue drops, 0 errors, and 0 cancels, but shell-window time regresses by 1.98M cycles versus P135. That makes P136 a useful negative result: I-cache background fill is not just decorative; interrupting it without an age/debt limit steals too much instruction-line repair.
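
The P135 audit depends on the buckets being mutually exclusive, so each blocked candidate is counted exactly once. A minimal Python sketch of that classification, assuming three boolean inputs per candidate, could look like this; the function name, bucket labels, and input signals are hypothetical stand-ins for whatever the harness actually samples.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_block(prefetch_guard_ok: bool, icache_bg_busy: bool,
                   dcache_bg_busy: bool) -> str:
    """Assign one aux-load candidate to exactly one bucket."""
    if not prefetch_guard_ok:
        return "guard_blocked"      # never reaches the background-fill checks
    if icache_bg_busy and dcache_bg_busy:
        return "both_background"
    if icache_bg_busy:
        return "icache_only"
    if dcache_bg_busy:
        return "dcache_only"
    return "issuable"


def audit(candidates: Iterable[Tuple[bool, bool, bool]]) -> Counter:
    """Tally candidates by bucket."""
    return Counter(classify_block(*c) for c in candidates)


buckets = audit([
    (True, True, False),    # I-cache background fill alone blocks it
    (True, False, True),    # D-cache background fill alone blocks it
    (True, True, True),     # both background paths block it
    (True, False, False),   # nothing blocks it
    (False, True, True),    # fails the frontend prefetch guard first
])
```

Because the checks are ordered and every path returns, the bucket counts always sum to the number of candidates, which is what lets a single bucket be singled out as the target for the next experiment.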

#ProjectWhat it addsWhy it comes here
91Critical-word-first / nonblocking I-cache fill bufferOn an I-cache miss, deliver the requested word as soon as it arrives, then fill the rest of the line opportunistically.Done. P91 beats P90 and slightly beats P89 on shell-window cycles, but does not yet beat P89 on fetch-stall cycles.
92One-entry fetch queue between frontend and executeSafe S_EXECUTE next-PC prefetch with S_WB queue consume before normal writeback prefetch.Done. Queue fills/consumes 53.98M instructions and cuts fetch stalls, but shell-window speedup is FAIL versus P91.
93Branch predictor v0Shadow 32-entry BTB, 2-bit direction counters, and 8-entry return-address stack.Done. RAS target accuracy is 96.57%; BTB target accuracy is weaker, so steering waits for a better frontend path.
94Memory arbiter v0Separate request classification and arbitration for fetch, prefetch, background I-cache fill, load, store, FP, AMO, and page-walk traffic behind the shared external RAM model.Done. Only background I-cache fill is denied service; foreground data traffic is paying shared-memory latency.
95Store buffer v0One-entry external-RAM store buffer at the SoC boundary, with accept/drain/block counters.Done. Store stalls collapsed, but fetch stalls rose 51.02%; shell-window speedup is FAIL.
96D-cache v0Direct-mapped, word-only, write-through data cache for aligned external-RAM LW/SW, with hit/miss/fill/update/invalidation counters.Done. Shell-window speedup is PASS versus P94, load stalls fall 24.99%, but fetch stalls rise 13.28%.
97Four-word D-cache line fillCritical-word-first data-cache line fill, then background fill through the P94 arbiter.Done. Load stalls fall again, but fetch stalls rise 10.94% versus P96; shell-window speedup is FAIL.
98Throttled D-cache background fill policyKeep P97’s line geometry, but only fill remaining words when the frontend already has useful work queued and I-cache fill is quiet.Done. Shell-window speedup is PASS versus P96, but this is still a shared-port scheduling patch.
99Harvard I/D service mapDraw the actual instruction/data architecture boundary, list what this core lacks, and decide which measurements define success.Done. P99 is functional PASS, but not a speed rung; it sets P100’s split-port acceptance criteria.
100Split instruction/data memory service modelGroup fetch/I-cache/instruction-PTW and LSU/D-cache/data-PTW traffic into separate service intentions, then count lower shared conflicts.Done. Instruction demand is always granted by the current policy; data wants 59.35M cycles and is not granted for 32.40M cycles.
101Split ITLB/DTLB lookup pathReplace the unified 8-entry TLB with separate 8-entry ITLB and DTLB banks while keeping the walker shared.Done. Shell window improves 4.12% versus P100; fetch walks fall 39.92% and data walks fall 45.28%.
102Data-side write buffer with forwardingAdd a core-local translated one-entry store buffer and instrument accept/drain/forward behavior.Partial. Verilator builds and Linux reaches /init, but BusyBox faults before the shell prompt after 79 buffered stores. Next rung should trace/fix this before adding more nonblocking machinery.
103Store-buffer trace and repairAdd grant-qualified store-buffer tracing and fix the request/grant/clear contract.Done. BusyBox shell reaches P103-FILE-OK; 1.16M stores accept and drain correctly. Shell-window speedup is FAIL versus P101 because the policy still drains before fetch.
104Banked lower memory conflict countersMeasure how instruction-side and data-side clients would map onto banked lower memory, before pretending the near-core Harvard split has solved the shared-port problem.Done. 20.56M simultaneous I/D lower-memory wants land on different banks; that is 71.40% of the conflict window and the target for P105.
105Banked lower service modelModel same-cycle instruction/data lower-memory grants when selected banks differ and the blocked request is read-like.Done. The conservative model finds 8.29M shell-window extra grants and projects a 56.15M-cycle shell window if each grant saves one cycle.
106Banked lower-memory contractAdd an auxiliary read lane at the simulator/RTL boundary and service safe different-bank reads from the Verilator memory model.Done. The aux lane services 20.63M reads total and 8.46M during the shell window, matching the model exactly with 0 errors.
107Banked auxiliary D-cache fillFeed one narrow auxiliary response class back into the core, starting with D-cache background fill.Done. The core consumes 10.07M auxiliary D-cache fills with 0 aux errors and improves the shell window by 0.44% versus P106.
108Banked auxiliary I-cache fillConsume blocked instruction writeback-prefetch responses as I-cache fills while retaining P107’s D-cache background-fill consumer.Done. The core consumes 488K aux I-cache prefetch fills and improves the shell window by 0.68% versus P107.
109Banked auxiliary demand prefetchLet one demand-visible path consume an auxiliary response, starting with the S_WB store-buffer drain plus writeback-prefetch overlap.Done. The core bypasses 488K auxiliary prefetches into execute and cuts S_FETCH cycles, but shell-window speedup is FAIL versus P108.
110Tagged auxiliary response slotReplace one-off aux consumers with owner/address/data/error/cancel metadata for fetch, prefetch, background fill, and later load-miss service.Done. The slot records 10.47M tagged aux responses with 0 errors and 0 cancels while the shell workload reaches P110-FILE-OK.
111Nonblocking aligned load-miss aux consumerLet one data-side load miss own a tagged aux response without violating store, trap, or D-cache invalidation ordering.Done. 3.545M load aux responses, 0 errors, 0 cancels; speedup FAIL versus P110.
112Aux response queue / one-entry MSHRRegister one outstanding aux response so useful overlap can survive a cycle of consumer backpressure and policy can distinguish load demand from background fill.Done. 3.677M queue enqueues/dequeues, 0 drops, 0 errors, 0 cancels; speedup FAIL versus P111.
113Load-miss issue policy v2Gate aux-load issue using measured frontend pressure, D-cache line state, and background-fill debt.Done. Blocks all 5.39M aux-load candidates, recovers most of P112, but remains slower than P110.
114PTW aux owner measurementCount safe PTW aux opportunities and A/D-write blocks before trying a walker consumer.Done. No safe PTW aux candidates in shell workload; 73 A/D write blocks.
115Frontend target queueReplace the one-entry fall-through fetch queue with an FTQ-like target queue that can hold predicted PCs and fetch metadata.Done. 54.24M FTQ fills and matching consumes with 0 flushes; this is scaffolding, not a speedup rung.
116Active predictor steering guardrailLet the P93 predictor steer fetch, then repair mispredicts with explicit flush/accounting.Done as a guardrail. The live target-prefetch attempt wedged before the kernel banner, so P116 gates issue off and records 4.46M candidates for the next speculative-buffer rung.
117Speculative target buffer guardrailHold predicted-target fetch data outside the architectural fetch queue, then promote or discard it when the queued branch resolves.Done as a guardrail. Live issue reaches /init but faults BusyBox at badaddr=0x00000000; guarded issue records 215K candidates and keeps BusyBox passing.
118LSU shape countersSplit the existing execute plus S_MEM behavior into address-generation, DTLB, D-cache/store-buffer, and lower-memory counters.Done. 27.90M address-generation events, 27.22M DTLB hits, 671K DTLB misses, 4.60M D-cache-hit cycles, and 1.16M store-buffer accepts.
119LSU request record / in-order scoreboardAdd explicit in-flight LSU request metadata and scoreboard-style busy bits while preserving in-order commit.Done as a shadow record. 27.89M allocs, 27.75M classified completes, 443 flushes, and 30.69M busy cycles.
120Physical register-file sketchAdd a documentation/simulation rung for rename maps, free list, and PRF sizing before changing architectural commit.Done as a shadow map. 147.10M source reads, 59.31M integer PRF allocations, matching frees/commits, and live pressure stays 32 because commit is still in-order writeback.
121ROB commit model sketchModel in-order commit, exception replay, and flush policy in the harness before trying to execute out of order.Done as a one-entry shadow ROB. 59.18M allocs, 59.18M commits, 179 flushes, 0 missing commits, and max live PRF pressure of 33.
122Multi-entry ROB/free-list sketchTry a tiny multi-entry ROB/free-list sketch before any scheduler or real out-of-order execution.Done as a four-entry shadow ring. Max occupancy stays 1, proving the next missing boundary is dispatch/issue separation.
123Dispatch/issue split sketchLet a shadow dispatch record get ahead of writeback before trying a scheduler.Done as an opportunity model. It finds 40.40M frontend-ready/backend-busy cycles, but still no issue depth beyond 1.
124Shadow integer issue slotAdd one modeled issue slot and classify blocks before trying a scheduler.Done. 9.80M queued simple integer ops accepted; dependency, memory, and control classes dominate the remaining blocks.
125Source-ready scoreboard modelTrack which queued source operands are ready instead of using one crude older-destination dependency rule.Done. 13.10M queued simple-integer candidates have all sources ready; max busy architectural register count is still 1.
126Memory/control holding recordsSplit memory-class and control-flow queued work out of the integer slot model.Done. 6.66M memory candidates, 9.33M control candidates, and explicit source-busy/full-record block reasons.
127Scheduler wakeup/issue eligibilityModel which held integer, memory, and control records could issue together once sources wake up.Done. 37.18M single-ready cycles, but 0 dual-ready cycles; records still do not coexist.
128Scheduler queue/lifetime depthKeep multiple class records alive in the shadow model before trying a multi-issue picker.Done. Capacity-4 queues accept 9.82M integer, 3.25M memory, and 6.29M control records, but max occupancy is still 1 and dual-ready cycles remain 0.
129Scheduler arrival/service decouplingLet modeled scheduler arrivals and backend service diverge enough to test whether class coexistence is possible.Done. Arrival/service counts still track one another and all class queues max at occupancy 1.
130Ready/valid contract extractionTurn the measured frontend/backend boundary into explicit plain-RTL valid/ready/fire wires before adding more scheduler policy.Done. The helper reports 22.65M arrival-fire cycles, matching service fires closely, with zero backpressure and zero dual-ready cycles.
131Dispatch queue module extractionMove from a combinational contract helper to a small module that owns queue state and exposes valid/ready/fire wires.Done. Module-owned counters exactly match the P130-style contract helper; max occupancy remains 1.
| 132 | Dispatch payload record | Add payload fields to the dispatch queue module and compare accept, service, and flush behavior before making it active. | Done. Payload accepts/services match queue fires, append-without-storage is 0, and invariant errors are 0. |
| 133 | Dispatch payload class audit | Compare the module-owned payload record against the older issue-slot and memory/control classifiers. | Done. 22.66M class audits, 0 mismatches, max occupancy still 1. |
| 134 | Aux load prefetch policy | Pivot back to frontend/memory: issue guarded aux loads only when the main port can prefetch next-PC safely. | Done. 139,881 aux loads, 0 queue drops, shell window improves by 360,275 cycles versus P133. |
| 135 | Cache background policy audit | Explain remaining aux-load blocks from I-cache and D-cache background fill before relaxing either policy. | Done. I-cache-only blocks dominate D-cache-only blocks, 1.62M versus 196K. |
| 136 | I-cache background preempt | Let useful next-PC prefetch plus aux-load issue preempt I-cache background fill while keeping D-cache background quiet. | Done. Mechanism PASS, speedup FAIL: 1.77M aux loads issue, but the shell window worsens by 1.98M cycles versus P135. |
| 137 | Bounded memory arbitration | Keep P136’s measured opportunity, but limit I-cache-background preemption with a one-preempt/one-defer burst counter. | Done. 1.67M aux loads issue, 72.9K candidates defer, and the shell window barely beats P135. |
| 138 | Debt memory arbitration | Replace P137’s crude burst cap with a debt counter that preemption increments and I-cache background service pays down. | Done. 1.67M aux loads issue, 1.50M debt paydowns balance 1.50M preemptions, and the shell window beats P137/P135 but not P134. |
| 139 | I-cache repair usefulness audit | Count whether background-repaired I-cache words are consumed by fetch soon afterward. | Done. 52.85M repair word fills produce only 1.93M first later fetch hits and 0.83M repeat hits, so P140 should make repair policy value-aware. |
| 140 | Repair-aware I-cache arbitration | Give each foreground I-cache fill a one-word adjacent background repair budget. | Done. Repair fills fall 39.4% and shell window improves by 673K cycles versus P139, but the policy is too stingy and loses line locality. |
| 141 | Adaptive second-word I-cache repair | Give demand-fetch lines a second repair word while prefetch-only lines stay at one. | Done. It beats P140 by 746K cycles, beats P138 by 348K, and trails P134 by only 67.6K cycles. |
| 142 | Selective prefetch second-word repair | Give a second repair word to prefetch lines when they are immediately consumed by the frontend. | Done. It beats P141 by 362.6K cycles and P134 by 295.0K cycles, but the prefetch grant bucket is too broad. |
| 143 | Prefetch consumer repair classifier | Split P142’s broad frontend-consuming prefetch bucket into profitable and wasteful consumers. | Done as an audit rung. Execute-prefetch repair is the bad bucket: 39.10M fills at only 2.83% first+repeat hit ratio. |
| 144 | Execute-prefetch repair throttle | Stop giving execute-prefetch fills a second repair word by default while keeping higher-payback repair classes. | Done. It cuts 10.01M repair fills and beats P143 by 564K cycles, but trails P142 by 266K cycles. |
| 145 | Conditional execute-prefetch repair | Re-enable execute-prefetch second-word repair only when a local sequential-adjacent condition says it is worth the traffic. | Done. RTL PASS, speedup FAIL: it adds 165,881 prefetch second-word grants but worsens the shell window by 908K cycles versus P144. |
| 146 | Execute-prefetch predicate audit | Roll back to P144 behavior and shadow-count multiple execute-prefetch usefulness predicates before changing the active repair policy again. | Done. The P145 seq_adjacent predicate produces zero later fetch hits; predicted_not_taken is the best single candidate, but only reaches 2.91% first+repeat hits per fill. |
| 147 | Strict execute-prefetch composite guard | Test one guarded combination of the P146 predicates as an active second-word repair policy. | Done. It beats P146 by 850K cycles, but loses to P144 by 620K and P142 by 886K, so this repair bucket is closed. |
| 148 | Frontend/memory pivot after execute-prefetch repair | Pick the next bottleneck from the shell profile now that execute-prefetch second-word repair is no longer the target. | Next. The likely direction is a fresh bottleneck audit rather than another local execute-prefetch predicate. |
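
The debt idea from P138 (preemption increments a counter, I-cache background service pays it down) can be illustrated with a toy Python model. This is a sketch under stated assumptions, not the repo's RTL: the names `arbitrate`, `simulate`, and `debt_limit` are invented here, and the model assumes one grant per cycle between an aux-load candidate and I-cache background fill.

```python
def arbitrate(debt, aux_wants, bg_wants, debt_limit):
    """Pick one grant for this cycle and return (grant, new_debt).

    Hypothetical policy: an aux load may preempt background fill only
    while the accumulated debt is below debt_limit.
    """
    if aux_wants and bg_wants:
        if debt < debt_limit:
            return "aux", debt + 1          # preempt, and owe background one service
        return "bg", max(0, debt - 1)       # debt is full: background gets the port
    if aux_wants:
        return "aux", debt                  # no contention, nothing is owed
    if bg_wants:
        return "bg", max(0, debt - 1)       # background service pays debt down
    return "idle", debt


def simulate(trace, debt_limit):
    """Run a list of (aux_wants, bg_wants) cycles and count events."""
    debt = 0
    stats = {"aux": 0, "bg": 0, "idle": 0, "preempt": 0, "paydown": 0}
    for aux_wants, bg_wants in trace:
        grant, new_debt = arbitrate(debt, aux_wants, bg_wants, debt_limit)
        stats[grant] += 1
        if new_debt > debt:
            stats["preempt"] += 1
        elif new_debt < debt:
            stats["paydown"] += 1
        debt = new_debt
    return stats, debt
```

With `debt_limit=1` the model degenerates into a one-preempt/one-defer alternation, similar in spirit to the P137 bound; larger limits let preemptions run in short bursts. In every run, preemptions minus paydowns equals the debt still outstanding, which is the kind of balance the P138 counters report.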

XiangShan Gap Check

XiangShan/Kunminghu is the north star for architectural shape, not a literal near-term implementation target. The Kunminghu V2R2 microarchitecture guide describes a decoupled frontend with ICache, FDIP, and a branch prediction unit; the memory-subsystem guide calls out MSHR-managed fetch/prefetch misses, uFTB, FTB, TAGE-SC, ITTAGE, and RAS. The current repo has tiny versions of some concepts: I-cache, D-cache, ITLB/DTLB, fetch queue, BTB/counter/RAS measurement, store buffer, banked lower-memory model, and an aux response slot. It does not yet have an FTQ, real predictor steering, MSHRs, a load queue/store queue, rename, schedulers, ROB, vector unit, large L2, or XiangShan-scale verification.

That puts the next 10 projects in a sane order: finish the nonblocking memory contract first, then improve frontend steering, and only then start backend speculation scaffolding.

Harvard arc

For this repo, “Harvard” means the core has separate instruction and data service close to execute: fetch, I-cache, and instruction translation can keep feeding the frontend while loads, stores, AMOs, and data translation use a different path. It does not require two totally separate DRAM systems forever. Real designs usually rejoin at a lower cache or memory fabric; the important part is that an L1 data event does not automatically steal the one cycle the frontend needed.

A useful Harvard-shaped memory system for this core would have:

What we lack today is exactly the interesting part. The core has I-cache and D-cache experiments, a fetch queue, and a named memory arbiter, but they still negotiate behind one shared external RAM service. Translation storage is now split by P101, but the page-table walker is still shared. D-cache misses are blocking demand events plus optional background fill, stores now have a conservative one-entry buffer but no useful forwarding yet, and there is no MSHR-like miss tracking.

The next projects should make that gap visible:

| # | Project | Question |
|---|---|---|
| 99 | Harvard I/D service map | Where exactly do instruction fetch and data access split in this RTL, and what counters prove the split matters? |
| 100 | Split I/D memory service model | How much instruction/data contention is still hidden below the near-core split? |
| 101 | Split ITLB/DTLB lookup path | How much translation interference remains after the memory-service split? |
| 102 | Data-side write buffer with forwarding | Partial: the first core-local buffer corrupts BusyBox before the shell prompt. |
| 103 | Store-buffer trace and repair | Fixed: request, grant, and store-buffer clear now describe the same memory transaction. |
| 104 | Banked lower memory conflict counters | Measured: 71.40% of simultaneous I/D lower-memory wants are split-bank opportunities. |
| 105 | Banked lower service model | Modeled: 8.29M shell-window extra read-like grants could be serviced on different lower banks. |
| 106 | Banked lower-memory contract | Proven: the widened boundary services the modeled auxiliary reads with 0 errors. |
| 107 | Banked auxiliary D-cache fill | Proven: one non-architectural core client can consume the second response while Linux and BusyBox keep running. |
| 108 | Banked auxiliary I-cache fill | Proven: instruction-side prefetch can consume the second response and fill I-cache state safely. |
| 109 | Banked auxiliary demand prefetch | Proven: the second response can advance frontend state for S_WB prefetch while a store-buffer drain uses the main port. |
| 110 | Tagged auxiliary response slot | Proven: fetch, prefetch, and background-fill classes can share one explicit response ownership record. |
| 111 | Nonblocking aligned load-miss aux consumer | Proven functionally: 3.545M aligned load misses complete from the aux response while instruction service uses the main port, but the first policy is slower. |
| 112 | Aux response queue / one-entry MSHR | Proven functionally: the core preserves and drains the load response, but queueing every eager aux-load opportunity is too expensive. |
| 113 | Load-miss issue policy v2 | Proven: the conservative gate blocks all aux-load issues and recovers most of P112. |
| 114 | PTW aux owner measurement | Proven: no safe PTW aux read candidates appear in the shell workload; A/D writes remain ordered. |
| 115 | Frontend target queue | Proven: the frontend can hold predicted-target metadata exactly alongside the fetch queue, with 54.24M fills and consumes and 0 flushes. |
| 116 | Active predictor steering guardrail | Answer: not with the existing architectural fetch queue. The live attempt wedged before the kernel banner; the guarded version counts 4.46M candidates and keeps BusyBox passing. |
| 117 | Speculative target buffer guardrail | Answer: the record alone is not enough. The live issue path still perturbs userspace, so the passing rung gates issue off and records 215K candidates for a stricter promotion/repair contract. |
| 118 | LSU shape counters | Proven: the current data path can be reported as address generation, DTLB service, cache/store-buffer service, and S_MEM completion without changing behavior. |
| 119 | LSU request scoreboard | Proven: an in-order shadow request record can track alloc/complete/flush/busy counts without breaking Linux or BusyBox. |
| 120 | PRF rename sketch | Proven: a shadow integer PRF map can measure source reads and destination allocation pressure without changing architectural commit. |
| 121 | ROB commit model | Proven: a one-entry shadow ROB can balance alloc/commit/flush/free lifetime while Linux and BusyBox still pass. |
| 122 | Multi-entry ROB sketch | Proven negative: a four-entry ROB ring remains occupancy 1 under the current serialized backend. |
| 123 | Dispatch/issue split sketch | Proven measurement: 40.40M frontend-ready/backend-busy cycles exist, so the next useful backend boundary is an issue slot rather than a larger ROB. |
| 124 | Shadow integer issue slot | Proven measurement: a one-entry shadow slot accepts 9.80M simple integer ops, while dependency, memory, and control classes define the next blockers. |
| 125 | Source-ready scoreboard model | Proven measurement: 13.10M queued integer candidates have ready modeled sources, but the shadow backend still only has one busy architectural destination at a time. |
| 126 | Memory/control holding records | Proven measurement: memory has 3.25M ready accepts and no full-record pressure; control has 6.28M accepts plus 29.9K full-record blocks. |
| 127 | Scheduler wakeup/issue eligibility | Proven negative: the scheduler sees 37.18M ready cycles, but no two held classes are ready in the same cycle under the current lifetime model. |
| 128 | Scheduler queue/lifetime depth | Proven negative: capacity-4 class queues still max at occupancy 1, so arrival/service coupling remains the blocker. |
| 129 | Scheduler arrival/service decoupling | Proven negative: earlier arrival accounting still maxes every class at occupancy 1, so the next missing abstraction is a real ready/valid boundary. |
| 130 | Ready/valid contract extraction | Proven: the contract can be named as plain RTL and measured under Linux. Proven negative: without a state-owning queue, the backend still has no class coexistence. |
| 131 | Dispatch queue module extraction | Proven: one state cluster can move into a module and match the old counters under Linux. Proven negative: the module still sees no class coexistence without payload ownership and real dispatch isolation. |
| 132 | Dispatch payload record | Proven: the module can own decoded payload metadata with zero invariant errors. Proven negative: the queue remains one-deep per class, so class audit comes before active issue. |
| 133 | Dispatch payload class audit | Proven: module-owned payload class agrees with the old classifiers across 22.66M audits and 0 mismatches. |
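
The repeated "max occupancy 1" result in P128 through P131 is easier to see in a toy model. The Python sketch below is an illustration, not the repo's dispatch queue: `ReadyValidQueue`, `run`, and the every-third-cycle consumer are assumptions made up for this example. It shows that a capacity-4 queue never holds more than one entry when arrival is coupled to service, and fills up only once the producer is allowed to run ahead.

```python
from collections import deque


class ReadyValidQueue:
    """Toy FIFO with valid/ready/fire handshakes on both sides."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.fires_in = 0
        self.fires_out = 0
        self.max_occupancy = 0

    def step(self, in_valid, payload, out_ready):
        # Both handshakes are evaluated against the state at the start of
        # the cycle, so space freed by a dequeue is not reused until later.
        in_ready = len(self.items) < self.capacity
        out_valid = len(self.items) > 0
        in_fire = in_valid and in_ready
        out_fire = out_valid and out_ready
        popped = self.items.popleft() if out_fire else None
        if in_fire:
            self.items.append(payload)
        self.fires_in += int(in_fire)
        self.fires_out += int(out_fire)
        self.max_occupancy = max(self.max_occupancy, len(self.items))
        return in_fire, popped


def run(capacity, cycles, coupled):
    """Drive the queue with a consumer that is ready every third cycle."""
    q = ReadyValidQueue(capacity)
    for cycle in range(cycles):
        out_ready = cycle % 3 == 2
        # Coupled: the producer offers work only once the queue has drained,
        # which mimics a serialized backend. Decoupled: it offers every cycle.
        in_valid = (len(q.items) == 0) if coupled else True
        q.step(in_valid, cycle, out_ready)
    return q
```

In this model `run(4, 30, coupled=True).max_occupancy` stays at 1 and `run(4, 30, coupled=False).max_occupancy` reaches 4, which is the difference between adding queue depth and adding a real ready/valid boundary.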

The rule for this phase: every rung must run the BusyBox shell profile and compare against the previous rung before claiming a speedup.

§ Linux bring-up: gap closed enough to use

The old question was “what does Linux on RV32 need?” The current answer is better: we have already built enough of it to boot a real kernel and run userspace. The checklist below is now historical context plus a map to the rungs that closed each part.

| requirement | where it landed | status |
|---|---|---|
| A extension for spinlocks and lr.w/sc.w | P45 A extension | PASS |
| Supervisor-mode CSR/trap/delegation machinery | P47 through P51 | PASS |
| Sv32 page-table walking and translation | P52 page-table walker and P53 walker completion | PASS |
| Platform shape, memory size, SBI, device tree | P54 platform shape, P55 S-mode kernel, P56 RTL for Linux | PASS |
| Instruction-fetch translation during Linux boot | P59 Linux boot | PASS |
| Real userspace process | P60 userspace hello | PASS |
| TLB and shell profiling infrastructure | P61 TLB, P84 shell profile, P85 symbol profile | PASS |
| BusyBox initramfs and interactive shell path | P80 BusyBox initramfs, P81 PTY console | PASS |
| Frontend stall attribution and first I-cache/predictor/data-buffer/backend-sketch experiments | P88 memory attribution, P89 I-cache, P90 line fill, P91 fill buffer, P92 fetch queue, P93 predictor, P94 arbiter, P95 store buffer, P96 D-cache, P97 D-cache line fill, P98 D-cache throttle, P99 Harvard map, P100 split I/D service, P101 split TLB, P102 write buffer, P103 store-buffer repair, P104 banked lower memory, P105 banked lower service, P106 banked lower contract, P107 banked aux D-cache fill, P108 banked aux I-cache fill, P109 banked aux demand prefetch, P110 tagged aux response, P111 nonblocking load aux, P112 aux response queue, P113 load-miss policy, P114 PTW aux owner, P115 frontend target queue, P116 active steering guardrail, P117 speculative target buffer, P118 LSU shape, P119 LSU request scoreboard, P120 PRF rename sketch, P121 ROB commit model, P122 multi-entry ROB sketch, P123 dispatch/issue split, P124 shadow issue slot, P125 source-ready scoreboard, P126 memory/control holding records, P127 scheduler wakeup/issue, P128 scheduler queue/lifetime, P129 scheduler arrival/service, P130 ready/valid contract, P131 dispatch queue module, P132 dispatch payload record, P133 dispatch payload class audit, P134 aux load prefetch policy, P135 cache background policy audit, P136 I-cache background preempt, P137 bounded memory arbitration, P138 debt memory arbitration, P139 I-cache repair usefulness, P140 repair-aware I-cache arbitration, P141 adaptive second-word I-cache repair, P142 selective prefetch second-word repair, P143 prefetch consumer repair classifier | PASS, with P90, P92, P94, P95, P97, P99, P103, P109, P111, P112, P115, P116, P117, P118, P119, P120, P122, P123, P124, P125, P126, P127, P128, P129, P130, P131, P132, P133, P135, P136, and P139 speedup FAIL or shadow/audit-only; P102 is partial; P116/P117 are guarded frontend correctness rungs, P118/P119 are data-side measurement rungs, P120-P133 begin backend rename/ROB/dispatch/issue/scoreboard/holding-record/scheduler/contract/module/payload/class-audit scaffolding, P134 pivots back to memory with a small shell-window speedup, P135 identifies I-cache background fill as the next policy target, P136 proves unbounded I-cache-background preemption is too aggressive, P137 shows a crude bound can recover most of that regression, P138 improves the bound with explicit repair debt, P139 proves most background repair words are not promptly fetched, P140 cuts repair bandwidth but proves a fixed one-word budget is too stingy, P141 restores most line locality with demand-side second-word repair, P142 turns frontend-consuming prefetch repair into the first post-P134 shell-window win, and P143 identifies execute-prefetch second-word repair as the low-payback bucket to throttle next |
| Execute-prefetch repair throttle | P144 execute-prefetch repair throttle | PASS. The class-wide throttle cuts 10.01M repair fills and recovers 564K cycles versus P143, but it remains 266K cycles slower than P142, which set up P145’s conditional repair test. |
| Conditional execute-prefetch repair | P145 conditional execute-prefetch repair | PASS mechanically, speedup FAIL. The local sequential-adjacent predicate adds 165,881 prefetch second-word grants but makes the shell 908K cycles slower than P144, which set up P146’s predicate audit. |
| Execute-prefetch predicate audit | P146 execute-prefetch predicate audit | PASS mechanically, speedup FAIL as an audit rung. It keeps the active policy at P144 shape and shows the P145 predicate has zero measured later fetch payoff in this shell run. |
| Strict execute-prefetch guard | P147 strict execute-prefetch guard | PASS mechanically, speedup FAIL. The strict composite guard beats P146 but still loses to P144/P142, so execute-prefetch second-word repair is no longer the next target. |

So this page should no longer talk as if Linux is hypothetical. Linux is running; the current problem is making the machine less painfully blocking while it runs Linux.

§ Side rung: framebuffer demo (P44)

P44 took a small detour to give the FreeRTOS milestone a face. The chip’s render task computes 96×96 RGB565 plasma frames into a memory-mapped framebuffer; the testbench dumps each frame on MMIO_FRAME_READY; a pygame window plays them back. No SPI peripheral, no display hardware - just memory + simulator + Python. That same software pattern later came back in the AtomVM framebuffer work.
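
For readers who want to play dumped frames back themselves, here is a minimal Python sketch of the RGB565 unpacking step. The bit layout of RGB565 is standard; everything else is an assumption for illustration: the function names are invented, and the frame is assumed to be a row-major array of little-endian 16-bit pixels, which may not match the actual dump format.

```python
import struct


def rgb565_to_rgb888(pixel):
    """Expand one 16-bit RGB565 pixel to an (r, g, b) tuple of 8-bit values."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    # Replicate the high bits into the low bits so full scale maps to 255.
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


def decode_frame(raw, width=96, height=96):
    """Decode a raw frame dump into rows of (r, g, b) tuples.

    Assumes row-major, little-endian pixels (an assumption, not a fact
    taken from the testbench).
    """
    if len(raw) != width * height * 2:
        raise ValueError("unexpected frame size")
    pixels = struct.unpack("<%dH" % (width * height), raw)
    return [
        [rgb565_to_rgb888(pixels[y * width + x]) for x in range(width)]
        for y in range(height)
    ]
```

A 96×96 frame in this layout is 18,432 bytes, so a playback script can read fixed-size chunks from the dump and hand each decoded frame to whatever display library it uses.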

§ Historical Phase 1 - ISA breadth (P45-P46)

This closed the non-privileged gap to RV32IMA + bitmanip. Each rung was a small RTL change that added a real ISA extension we still use.

| # | Project | What it adds | Linux relevance |
|---|---|---|---|
| 45 | A extension (atomics) | lr.w/sc.w + amo*.w. Single-hart reservation register. | Linux required. Also lets FreeRTOS use proper atomic critical sections instead of MIE-disable. |
| 46 | Zba + Zbb-essentials | 13 single-cycle bitmanip ops (sh*add, andn/orn/xnor, min/max, sext/zext). RTL good; gcc-zbb auto-emit has a known wart. | Modest Linux build perf; cheap rung. |
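
A single-hart reservation register is small enough to model directly. The Python sketch below follows the architectural rules for lr.w/sc.w (sc.w returns 0 on success and nonzero on failure, and any sc.w clears the reservation); the `Hart` class, its dictionary memory, and the omission of alignment checks are simplifications invented for this example, not a description of the P45 RTL.

```python
MASK32 = 0xFFFFFFFF


class Hart:
    """Toy single-hart model with one reservation register."""

    def __init__(self):
        self.mem = {}            # word address -> 32-bit value
        self.reservation = None  # address reserved by the last lr.w, if any

    def lr_w(self, addr):
        """Load-reserved: read the word and remember the address."""
        self.reservation = addr
        return self.mem.get(addr, 0)

    def sc_w(self, addr, value):
        """Store-conditional: returns 0 on success, 1 on failure."""
        ok = self.reservation == addr
        self.reservation = None  # every sc.w invalidates the reservation
        if ok:
            self.mem[addr] = value & MASK32
        return 0 if ok else 1

    def amoadd_w(self, addr, value):
        """Atomic add: returns the old value and stores old + value."""
        old = self.mem.get(addr, 0)
        self.mem[addr] = (old + value) & MASK32
        return old
```

A spinlock acquire then looks like the usual retry loop: `lr_w` the lock word, and if it reads 0, attempt `sc_w` with 1 and retry whenever the result is nonzero.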

The C (compressed) extension was originally planned as P46, but it needs a multi-hour fetch-frontend rewrite, so it was bumped to a future “supervised” rung and eventually became part of the AtomVM port work.

After Phase 1 we had **RV32IMA + Zicsr + Zifencei + Zicntr + Zba

§ Historical Phase 2 - Privileged depth (P47-P51)

Linux runs in S-mode; user code runs in U-mode; M-mode hosts the SBI firmware. Phase 2 added the missing privilege machinery.

The address-map work is split out from the behavior work so each rung is small and composable.

| # | Project | What it adds |
|---|---|---|
| 47 | S-mode CSR scaffolding | sstatus, sie, stvec, sscratch, sepc, scause, stval, sip, satp decoded as M-readable storage. No priv transition yet. |
| 48 | Trap delegation CSR scaffolding | medeleg / mideleg decoded as M-readable storage. No actual delegation behavior yet. |
| 49 | M↔S priv tracking | mstatus.MPP real, mret returns to the right priv level; cause-bit-driven trap routing using the now-storage medeleg/mideleg. |
| 50 | S-mode trap entry | When a delegated trap fires, write sepc/scause/stval, jump to stvec, and switch to S; sret returns to the privilege level saved in sstatus.SPP. |
| 51 | CSR priv check + sie/sip subset | Real privilege checks and S-mode interrupt-pending/enable views. |

After Phase 2 the chip could run a hypervisor-free, page-table-free kernel that just uses S/M splits. Useful in its own right; required for what follows.
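
The routing and entry rules that Phase 2 added can be summarized in a short Python sketch. It follows the RISC-V privileged spec as I understand it (traps taken in M-mode never delegate; delegated traps save state in sepc/scause/stval and sstatus before jumping to stvec). The function names and the dictionary-based CSR file are invented for illustration, and vectored stvec mode is deliberately ignored.

```python
U, S, M = 0, 1, 3  # RISC-V privilege level encodings

SSTATUS_SIE = 1 << 1
SSTATUS_SPIE = 1 << 5
SSTATUS_SPP = 1 << 8


def trap_target(cur_priv, cause, is_interrupt, medeleg, mideleg):
    """Return the privilege level that handles this trap."""
    deleg = mideleg if is_interrupt else medeleg
    if cur_priv != M and (deleg >> cause) & 1:
        return S
    return M


def enter_s_trap(csr, pc, cur_priv, cause, is_interrupt, tval):
    """Apply S-mode trap entry to a CSR dict; return (new_priv, new_pc)."""
    csr["sepc"] = pc
    csr["scause"] = (int(is_interrupt) << 31) | cause
    csr["stval"] = tval
    status = csr["sstatus"]
    spie = SSTATUS_SPIE if status & SSTATUS_SIE else 0   # SPIE <- old SIE
    spp = SSTATUS_SPP if cur_priv == S else 0            # SPP <- previous priv
    status &= ~(SSTATUS_SIE | SSTATUS_SPIE | SSTATUS_SPP)  # SIE <- 0
    csr["sstatus"] = status | spie | spp
    return S, csr["stvec"] & ~0b11  # direct mode only (assumption)
```

For example, with bit 8 of medeleg set, an ecall from U-mode (cause 8) routes to S, while the same cause raised in M-mode stays in M.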

§ Historical Phase 3 - The MMU (P52-P54)

This was the biggest single rung in the Linux climb. Sv32 page-table walks turn every load and store into a potential TLB lookup + multi-cycle walk + permission check. It added significant RTL and significant area.

| # | Project | What it adds |
|---|---|---|
| 52 | Sv32 page-table walker | Page-table walker using the satp storage from P47, small TLB, and permission/access checks. The big rung. |
| 53 | Walker completion | Megapages, sfence.vma, fault encoding, and AMO translation. |
| 54 | Platform shape | 16 MiB memory model plus the first SBI/platform proof of concept. |

The MMU is where the chip stopped being a microcontroller-shaped core and started being a small application platform.
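
To make the cost of a walk concrete, here is a minimal Python sketch of Sv32 translation: a two-level walk with 10-bit VPN fields, 4-byte PTEs, and megapage leaves at the first level. It follows the Sv32 layout from the RISC-V privileged spec but is only a sketch: `sv32_translate`, `PageFault`, and the `read_pte` callback are invented names, and the R/W/X/U permission checks and A/D bit updates that the real walker performs are left out.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 0x1, 0x2, 0x4, 0x8


class PageFault(Exception):
    """Raised when the walk cannot produce a valid translation."""


def sv32_translate(read_pte, satp_ppn, va):
    """Translate a 32-bit virtual address; returns a physical address.

    read_pte(addr) must return the 32-bit PTE stored at physical addr.
    """
    vpn = [(va >> 12) & 0x3FF, (va >> 22) & 0x3FF]
    table = satp_ppn << 12
    for level in (1, 0):
        pte = read_pte(table + vpn[level] * 4)
        # Invalid entry, or the reserved write-only encoding.
        if not pte & PTE_V or (pte & PTE_W and not pte & PTE_R):
            raise PageFault(hex(va))
        ppn = pte >> 10
        if pte & (PTE_R | PTE_X):          # leaf PTE
            if level == 1:
                if ppn & 0x3FF:            # misaligned 4 MiB megapage
                    raise PageFault(hex(va))
                return (ppn << 12) | (va & 0x3FFFFF)
            return (ppn << 12) | (va & 0xFFF)
        table = ppn << 12                  # pointer to the next level
    raise PageFault(hex(va))               # level-0 entry was not a leaf
```

Every TLB miss pays for up to two dependent memory reads in this loop, which is why a small TLB and the later ITLB/DTLB split matter so much on a one-port memory system.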

§ Historical Phase 4 - Platform glue (P55-P58)

Linux assumes specific platform shapes. These rungs made the platform look enough like a real RV32 machine to run a kernel.

| # | Project | What it adds |
|---|---|---|
| 55 | Hello, S-mode kernel | Real C S-mode kernel using SBI console calls. |
| 56 | RTL completion for Linux boot | A/D bit updates, CLINT-shaped MMIO, and the missing Linux-facing RTL pieces. |
| 57 | SBI runtime + kernel trap handler | Minimal SBI v0.1 runtime and S-mode kernel with its own stvec handler. |
| 58 | Linux kernel boot attempt | Stage-0 handoff worked; Linux parked during MMU enable. This failure set up P59. |

After Phase 4 the platform looked enough like a generic RV32-Linux target that an off-the-shelf kernel build could start booting.

§ Historical Phase 5 - Linux bring-up and first speed work (P59-P90)

This is where Linux became real in the repo, then became slow enough that profiling and frontend work became the obvious next target.

| # | Project | What it does |
|---|---|---|
| 59 | Linux 6.12.85 boots | Sv32 instruction-fetch translation made the kernel boot and print through SBI. |
| 60 | First userspace process | RV32 ELF runs as PID 1 and prints through SYS_write. |
| 61 | 4-entry TLB + profiling harness | First structured benchmark harness around the Linux-capable core. |
| 63-66 | Fetch/writeback speed rungs | Single-cycle fetch on TLB hit, prefetch-under-writeback, fused decode/execute, and D+X collapse. |
| 80-81 | BusyBox initramfs and PTY console | Real shell path with host-side interaction. |
| 84-88 | Shell profiling through memory attribution | Flamegraph-style data, symbolized BusyBox samples, TLB comparison, UART comparison, and memory request attribution. |
| 89-90 | Tiny I-cache and line-fill experiment | First frontend cache experiments. P89 reduced fetch stalls; P90 proved blocking line fill is the wrong policy. |

§ Speed rungs already taken

The speed page named the divider OR-tree as an early critical path, and the later speed rungs attacked CPI instead of just clock frequency. See P63, P64, P65, and P66 for the first round. P91-P147 are the second round: less blocking frontend and memory behavior. P91 was a small win; P92 was a useful negative result, P93 measured enough return predictability to justify active prediction once the frontend can steer, P94 made the shared memory clients visible enough to choose a data-side experiment, P95 proved the first conservative store buffer policy is not good enough, P96 showed that even a one-word D-cache can reduce load stalls enough to win a little, and P97 showed that line fill needs throttling on the one-port memory system. P98 recovered the shell timing with a stricter throttle, P99 mapped the Harvard split, P100 made instruction/data service demand visible below that map, P101 split translation storage into ITLB/DTLB banks, P102 found the first core-local store-buffer correctness hole, P103 repaired it, P104 measured that most simultaneous I/D lower-memory conflicts are different-bank opportunities, P105 showed that 8.29M shell-window cycles are conservative read-like extra-grant candidates, P106 turned that model into a serviced auxiliary read lane, P107 proved the core can consume that response for D-cache background fill while Linux keeps running, P108 added instruction-side prefetch fill, and P109 let the writeback-prefetch response bypass directly into execute while a store-buffer drain uses the main port. P110 replaced those one-off consumers with a tagged auxiliary response slot. P111 let a data-side load miss use that slot and proved the mechanism, but the shell window regressed. P112 added the one-entry response queue and proved that queueing works, but also made the cost of bad issue policy obvious. 
P113 tightened that policy, P114 measured that the page-table walker is not the next useful aux consumer, and P115 added the frontend target record needed before active predictor steering. P116 then proved active steering needs a speculative target buffer instead of reusing the architectural fetch queue directly. P117 added that record, but proved the issue/fill path still needs stricter isolation before speculative target data can influence userspace execution. P118 then made the current data side legible as an LSU-shaped path before any request-record or scoreboard refactor. P119 added that shadow request record and proved the accounting can run under Linux. P120-P133 then started the backend arc with shadow PRF rename pressure, one-entry ROB lifetime accounting, and the explicit negative result that a four-entry ROB still stays at occupancy 1 without dispatch/issue decoupling. P123 then measured the dispatch/issue opportunity window directly: 40.40M cycles where queued frontend work exists while the backend is busy. P124 classified that window with a one-entry shadow issue slot and found 9.80M simple integer ops that can be held; P125 added source-ready accounting and found 13.10M queued integer candidates with all modeled sources ready; P126 split memory and control work into holding records and found the next missing contract is wakeup/issue eligibility across held classes. P127 then added that ready-mask model and found zero same-cycle multi-class ready opportunities. P128 added queue depth and still found max occupancy 1. P129 moved arrival accounting earlier and still found max occupancy 1. P130 extracted explicit valid/ready/fire wires and confirmed the same result. P131 moved that queue state into a module and matched the old counters. P132 added one-deep payload ownership with zero invariant errors. P133 audited that payload class against the older classifiers with 22.66M checks and 0 mismatches. 
P134 then made the pivot: it re-enabled guarded aux-load issue, completed 139,881 load misses through the auxiliary lane, and cut the shell window by 360,275 cycles versus P133. P135 kept the policy and added mutually exclusive background-block buckets, showing I-cache-only blocks dominate D-cache-only blocks after frontend prefetch is safe. P136 tested that target and proved the unbounded form is too aggressive: 1.77M aux loads issue cleanly, but shell-window time gets worse. P137 adds a one-preempt/one-defer bound, keeps 1.67M aux-load issues, and recovers the shell window to a tiny win over P135. P138 replaces that sketch with I-cache repair debt and improves again, but still cannot beat P134. P139 then audits the repaired words directly: only 3.64% become first later fetch hits, so the next policy should be value-aware rather than just fair. P140 gives each foreground fill one adjacent repair word. It improves versus P139 by cutting 20.80M repair fills, but the fixed budget is too small to recover P138/P134. P141 gives demand-fetch lines a second repair word, keeps repair fills near P140, and recovers to within 67.6K cycles of P134. P142 extends the second-word budget to frontend-consuming prefetch fills. It spends more repair traffic, but wins the shell window by 362.6K cycles versus P141 and 295.0K cycles versus P134. P143 classifies that broad prefetch repair bucket and finds execute-prefetch second-word repair is the low-payback source. P144 throttles that class, cuts 10.01M repair fills, and recovers 564K cycles versus P143, but it remains 266K cycles slower than P142. P145 tries a local sequential-adjacent condition for restoring that second word. It passes the shell workload but worsens by 908K cycles versus P144. P146 then performs the audit instead of guessing again: seq_adjacent gets zero measured later fetch hits, and the best single broader predicate is still weak at 2.91% first+repeat hits per fill. 
P147 tests the strict composite guard and still loses to P144/P142, so the next speed rung should pivot away from execute-prefetch repair.
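
The pairwise deltas quoted for P138 through P147 can be chained into one scale, which makes it easier to see where each rung sits. The Python sketch below does only that arithmetic: offsets are in cycles relative to P134's shell window (negative means faster), every input is a rounded delta quoted on this page, and the derived values are therefore approximate. The function name `build_offsets` is made up for this example.

```python
def build_offsets():
    """Chain the quoted shell-window deltas into offsets relative to P134."""
    rel = {"P134": 0}
    rel["P141"] = rel["P134"] + 67_600    # P141 trails P134 by 67.6K
    rel["P142"] = rel["P141"] - 362_600   # P142 beats P141 by 362.6K
    rel["P140"] = rel["P141"] + 746_000   # P141 beats P140 by 746K
    rel["P138"] = rel["P141"] + 348_000   # P141 beats P138 by 348K
    rel["P139"] = rel["P140"] + 673_000   # P140 improves on P139 by 673K
    rel["P144"] = rel["P142"] + 266_000   # P144 trails P142 by 266K
    rel["P143"] = rel["P144"] + 564_000   # P144 beats P143 by 564K
    rel["P145"] = rel["P144"] + 908_000   # P145 is 908K slower than P144
    rel["P147"] = rel["P144"] + 620_000   # P147 loses to P144 by 620K
    rel["P146"] = rel["P147"] + 850_000   # P147 beats P146 by 850K
    return rel


if __name__ == "__main__":
    for rung, offset in sorted(build_offsets().items(), key=lambda kv: kv[1]):
        print(f"{rung}: {offset:+,} cycles vs P134")
```

The chain reproduces two figures stated independently above: P142 lands 295.0K cycles ahead of P134, and P147 lands 886K cycles behind P142. On this scale P142 remains the best measured rung in the repair series.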

§ Demo rungs along the way (no specific position)

These don’t gate anything but they’re the rungs that make the chip feel real.

§ What we’re explicitly not doing on this path

§ Why this order

The completed Linux phases front-loaded correctness: ISA, privilege, MMU, platform, userspace. That was the right order because Linux is not useful until it boots and runs real programs.

The P91-P147 order front-loads decoupling: I-cache fill policy, fetch queue, branch prediction, memory-path attribution, D-cache policy, then the Harvard instruction/data split and the first repaired data-side buffer, followed by lower-memory bank conflict accounting, a banked service model, a serviced auxiliary read contract, narrow data-side and instruction-side auxiliary response consumers, and the first demand-visible auxiliary prefetch bypass, followed by a shared tagged response slot, and the first live aux-load owner. That is deliberate too. P88-P116 showed instruction delivery and memory stalls dominating the shell workload, and P111/P112 showed that a nonblocking mechanism still needs a policy layer. P115 adds the frontend target record needed before prediction can steer fetch; P116 shows that steering needs a non-architectural promote/discard record. P120-P133 deliberately keep the backend shadow-only while measuring rename, ROB lifetime, dispatch/issue opportunity, issue-slot block reasons, source-ready pressure, class-specific memory/control holding records, scheduler ready masks, queue lifetime, arrival/service accounting, the first explicit ready/valid contract helper, the first state-owning dispatch queue module, one-deep payload ownership, and payload-class audit; P133 shows real dispatch isolation would be the next backend step, but P134 chooses the measured frontend/memory path instead. P135 then narrows the next memory policy experiment to I-cache-background preemption, and P136 shows the unbounded version regresses, so the next useful step is bounded arbitration instead of wider speculation. P137 confirms the direction with a crude one-preempt/one-defer bound; the next useful step is a real debt/age arbiter. P138 implements that debt model and improves the shell window, but still trails P134, so the next useful step is auditing the actual value of protected/interrupted I-cache repair. 
P139 does that and finds most repaired words are not promptly consumed, so P140 should spend repair bandwidth only where fetch is likely to use it. P140 confirms that direction but shows a fixed one-word budget is too stingy. P141 gives demand-fetch lines a second repair word and nearly closes the P134 gap. P142 gives frontend-consuming prefetch fills a second word too and beats P134, but the prefetch grant bucket is broad. P143 classifies it and shows execute-prefetch repair dominates traffic with weak payback. P144 throttles that class and proves the bandwidth diagnosis was right, but the performance result says the throttle is too blunt. P145 tries one stronger local-use signal and fails on shell speed. P146 shadow-counts candidate predicates and shows the simple local signal has zero measured payoff, while the best single broader signal is still weak. P147 runs that strict composite-guard check and cannot beat P142/P144, so this repair bucket should stop consuming the architecture arc. A big out-of-order backend would be the expensive answer to the wrong first question if the frontend and LSU are still blocking on every miss.