§ Milestone reached: FreeRTOS on hardware
P43 closes the FreeRTOS arc: an
unmodified FreeRTOS V11.1.0 kernel, three application tasks plus idle,
queue-based message passing, timer-driven preemption, and a clean halt
convention - all running on a hardened sky130A GDS at 0 errors
across DRC, LVS, antenna, routing, setup, hold, and fanout.
P43 - FreeRTOS multi-task demo, hardened
UART output : S a b c d e f g h D
Cycles : 5,100,751
Halt : MMIO_HALT(0x10001ff8) <- 1, halted=1
GDS : projects/43_freertos_hardened/librelane/runs/RUN_2026-05-03_03-20-20/final/gds/top.gds
Std cells : 28,079 Setup slack : 6.568 ns
DRC/LVS/ANT : PASS Hold slack : 0.107 ns
That makes this the first hardened ladder rung where a real third-party RTOS runs on the chip we built. Twelve rungs from P32 through P43 brought us here: RV32M, MMIO platform, trap frame, Zicntr counters, FreeRTOS port, scheduler bring-up, and finally an MMIO halt port that fixed a real RTL bug found because real software ran on the chip.
That’s the FreeRTOS milestone. From here the climb gets steeper.
§ Where this page changes
Earlier versions of this page treated FreeRTOS and then Linux as future milestones. Those arcs have now been walked far enough to boot Linux 6.12.85, run a userspace PID 1, boot BusyBox, profile shell workloads, and measure frontend experiments. The next arc is architectural: make the core less blocking, starting with the frontend and memory system.
The older Linux plan is preserved below as history because it explains why the core has S-mode, Sv32, SBI, BusyBox, AtomVM, and profiling infrastructure. The live path starts here.
§ What the last backend rungs bought us
The recent backend rungs have mostly been proof work, not speed work. That is why they can feel abstract.
P123 found 40.40M cycles where the frontend had queued work while the backend was still busy. That was the reason to look at dispatch and issue at all. P124 through P129 then asked whether independent ready work already existed behind the old FSM. The answer was no: integer, memory, and control records stayed one-deep, and dual-ready cycles stayed at zero.
So P130 through P133 changed the goal from “make it faster right now” to “make the boundary real enough to modify safely.” We now have an explicit valid/ready/fire contract, module-owned queue state, module-owned payload records, and 22.66M payload-class audit checks with zero mismatches under Linux and BusyBox.
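A minimal sketch of that valid/ready/fire contract, written in Python rather than the repo's RTL. The class, its method names, and the always-ready service pattern are hypothetical; only the rule that a transfer fires when valid and ready are both high, and the observation that occupancy never exceeds 1, come from the text.

```python
class OneEntryQueue:
    """Hypothetical one-entry dispatch queue with a valid/ready/fire contract."""

    def __init__(self):
        self.entry = None        # held payload, or None when empty
        self.arrival_fires = 0
        self.service_fires = 0
        self.backpressure = 0    # cycles where an arrival was valid but not ready
        self.max_occupancy = 0

    def step(self, arrival_valid, payload, service_ready):
        # Service side: the queue offers its entry and fires if the backend is ready.
        service_valid = self.entry is not None
        service_fire = service_valid and service_ready
        # Arrival side: ready when empty, or when the held entry drains this cycle.
        arrival_ready = (self.entry is None) or service_fire
        arrival_fire = arrival_valid and arrival_ready

        if arrival_valid and not arrival_ready:
            self.backpressure += 1
        if service_fire:
            self.entry = None
            self.service_fires += 1
        if arrival_fire:
            self.entry = payload
            self.arrival_fires += 1
        occupancy = 0 if self.entry is None else 1
        self.max_occupancy = max(self.max_occupancy, occupancy)
        return arrival_fire, service_fire


# A backend that is always ready drains each record one cycle after it arrives.
q = OneEntryQueue()
for cycle in range(6):
    q.step(arrival_valid=True, payload={"pc": 4 * cycle}, service_ready=True)
print(q.arrival_fires, q.service_fires, q.backpressure, q.max_occupancy)
```

With a serialized backend the model shows the same shape the audits report: fires balance, backpressure stays at zero, and occupancy never passes 1.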
The breakthrough has not happened yet. The core is still a serialized
backend with max queue occupancy 1 in every class. The useful decision
was to pivot back to frontend and memory work, where previous rungs
produced measured shell-window wins. P134
does that: it reopens guarded aux-load issue, completes 139,881 load
misses through the auxiliary lane, and trims the shell window by 360,275
cycles versus P133.
P135 keeps that policy
and audits the remaining blocks. After the frontend prefetch guard
passes, only 196K candidates are blocked by D-cache background fill
alone, while 1.62M are blocked by I-cache background fill alone and
1.86M are blocked by both background paths.
P136 tests that target by
preempting I-cache background fill. It issues 1.77M aux loads with 0
drops, errors, or cancels, but the shell window worsens by 1.98M cycles
versus P135. The current lesson is concrete: unbounded preemption is too
aggressive; the next rung needs bounded arbitration or a rollback to the
quiet-I-cache guard.
P137 adds the first bound:
one I-cache-background preemption may issue, then the next otherwise-safe
I-cache-only candidate defers. It issues 1.67M aux loads with 0 drops,
errors, or cancels and recovers the shell window to 65.36M cycles, a
tiny 55K-cycle win over P135.
P138 replaces that burst cap
with explicit repair debt. It keeps 1.67M aux-load issues, balances
1.50M preemptions with 1.50M debt paydowns, and improves the shell
window to 64.64M cycles. That beats P137 and P135, but still misses
P134 by 416K cycles.
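The debt idea can be sketched as a small arbiter model. This is an illustration under assumptions, not the P138 RTL: the class name, the `debt_limit` parameter, and the rule that a preemption is refused once debt reaches the limit are hypothetical. The text only says that preemption raises the debt and I-cache background service pays it down.

```python
class DebtArbiter:
    """Hypothetical repair-debt arbiter between aux loads and I-cache background fill."""

    def __init__(self, debt_limit=1):
        self.debt_limit = debt_limit   # assumed bound; not stated in the text
        self.debt = 0
        self.preemptions = 0
        self.paydowns = 0
        self.deferrals = 0

    def arbitrate(self, aux_candidate, icache_bg_wants):
        """Return the client that wins the slot: 'aux', 'icache_bg', or None."""
        if aux_candidate and icache_bg_wants:
            if self.debt < self.debt_limit:
                # Preempt the background fill and remember that we owe it a slot.
                self.debt += 1
                self.preemptions += 1
                return "aux"
            # Too much debt: the aux load defers and the fill pays the debt down.
            self.deferrals += 1
            self.debt -= 1
            self.paydowns += 1
            return "icache_bg"
        if aux_candidate:
            return "aux"
        if icache_bg_wants:
            if self.debt > 0:
                self.debt -= 1
                self.paydowns += 1
            return "icache_bg"
        return None


arb = DebtArbiter(debt_limit=1)
grants = [arb.arbitrate(True, True) for _ in range(4)]
print(grants, arb.preemptions, arb.paydowns)
```

With the assumed limit of 1 the model alternates preempt and repay, which resembles the P137 burst cap; a larger limit would allow short preemption bursts as long as the repayments follow.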
P139 tags background-repaired
I-cache words and counts later fetch use. It records 52.85M repair word
fills but only 1.93M first later fetch hits and 0.83M repeat fetch hits,
so the repair stream is fair but mostly not promptly useful.
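As a worked check of "mostly not promptly useful", the hit ratios follow directly from the three counters quoted above. The inputs are the rounded figures from the text, so the ratios are approximate, and the variable names are ours.

```python
# Rounded P139 counters as quoted in the text.
repair_fills = 52.85e6   # background-repaired I-cache word fills
first_hits = 1.93e6      # first later fetch hit on a repaired word
repeat_hits = 0.83e6     # repeat fetch hits on a repaired word

first_ratio = first_hits / repair_fills
combined_ratio = (first_hits + repeat_hits) / repair_fills

print(f"first-hit ratio:    {first_ratio:.2%}")
print(f"first+repeat ratio: {combined_ratio:.2%}")
```

Roughly one repaired word in twenty is fetched again, which is the gap the value-aware policies try to close.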
P140 makes that policy
value-aware with a one-word adjacent repair budget. It cuts repair fills
to 32.05M and improves 673K cycles versus P139, but still trails P138
and P134 because it loses too much line locality.
P141 grants a
second repair word to demand-fetch lines while keeping prefetch-only
lines at one word. It keeps repair fills near P140, beats P140 by 746K
cycles, beats P138 by 348K cycles, and lands only 67.6K cycles behind
P134.
P142 gives that
second repair word to prefetch fills that are immediately consumed by
the frontend. It spends more repair traffic, but converts enough of it
into I-cache hits to beat P141 by 362.6K cycles and P134 by 295.0K
cycles.
P143 classifies
those repaired words by source. It finds the broad P142 bucket is mostly
execute-prefetch traffic: 39.10M repair fills with only a 2.83%
first+repeat hit ratio. Demand, writeback-prefetch, and aux-prefetch
repair have much stronger payback.
P144 removes the
second repair word from execute-prefetch fills by default. It cuts 10.01M
repair fills and beats P143 by 564K cycles, but remains 266K cycles
slower than P142, so the class-wide throttle is too blunt.
P145 restores that
second word only when the prefetched instruction’s sequential next PC is
the adjacent word in the same I-cache line. It passes RTL and the BusyBox
shell workload, but it is a speed FAIL: 908K cycles slower than P144.
P146 rolls the active
policy back to P144 and shadow-counts several candidate predicates. The
P145 seq_adjacent predicate produces 20,552 fills and zero later fetch
hits; the best single broader predicate, predicted_not_taken, is still
only 2.91% first+repeat hits per fill. That makes one more strict
composite-guard experiment the limit before we pivot away from this
execute-prefetch repair bucket.
P147 runs that final
strict guard: predicted_not_taken && word_not_last && quiet_backend.
It passes Linux and BusyBox and beats P146 by 850K cycles, but it still
loses to P144 by 620K cycles and P142 by 886K cycles. That closes the
execute-prefetch second-word repair thread.
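The relative results above can be lined up in one ledger. This is a reconstruction under assumptions: only the P137 and P138 shell-window totals are quoted directly, and every other value is derived from the stated deltas, so each entry carries the rounding of its inputs. P146 is left out because the text gives it only relative to P147.

```python
# Shell-window totals in millions of cycles. Only P137 and P138 are quoted
# directly in the text; the rest are derived from the stated deltas.
ledger = {"P137": 65.36, "P138": 64.64}

ledger["P135"] = ledger["P137"] + 0.055    # P137 is a 55K-cycle win over P135
ledger["P136"] = ledger["P135"] + 1.98     # P136 worsens by 1.98M versus P135
ledger["P134"] = ledger["P138"] - 0.416    # P138 misses P134 by 416K
ledger["P141"] = ledger["P134"] + 0.0676   # P141 lands 67.6K behind P134
ledger["P140"] = ledger["P141"] + 0.746    # P141 beats P140 by 746K
ledger["P139"] = ledger["P140"] + 0.673    # P140 improves 673K versus P139
ledger["P142"] = ledger["P134"] - 0.295    # P142 beats P134 by 295.0K
ledger["P144"] = ledger["P142"] + 0.266    # P144 is 266K slower than P142
ledger["P143"] = ledger["P144"] + 0.564    # P144 beats P143 by 564K
ledger["P145"] = ledger["P144"] + 0.908    # P145 is 908K slower than P144
ledger["P147"] = ledger["P144"] + 0.620    # P147 loses to P144 by 620K

# Cross-check the deltas that the text states along a second path.
assert abs((ledger["P138"] - ledger["P141"]) - 0.348) < 0.005    # P141 beats P138 by 348K
assert abs((ledger["P141"] - ledger["P142"]) - 0.3626) < 0.005   # P142 beats P141 by 362.6K
assert abs((ledger["P147"] - ledger["P142"]) - 0.886) < 0.005    # P147 loses to P142 by 886K

best = min(ledger, key=ledger.get)
for rung in sorted(ledger, key=ledger.get):
    print(f"{rung}: {ledger[rung]:.3f}M")
```

The stated deltas agree with each other along both paths, and P142 remains the fastest rung in this window.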
§ Phase 6 - XiangShan-inspired frontend and memory arc (P91-P147)
This does not mean cloning XiangShan. XiangShan/Kunminghu is a team-scale RV64 out-of-order application core with wide fetch/decode, branch prediction, nonblocking caches and TLBs, rename, schedulers, a ROB, physical register files, vector/FPU, and major verification infrastructure. This repo is still a paced RV32 teaching core.
The useful lesson from XiangShan is the shape: decouple fetch from execute, make instruction and data memory service nonblocking where possible, measure the bottlenecks, and only then consider out-of-order machinery. XiangShan’s public Kunminghu V2R2 guide frames the big machine as IFU, IDU, rename, out-of-order dispatch/issue, integer/FP/vector execution, LSU, ROB, MMU, PMP/PMA, and L2 cache subsystems; our teaching path is deliberately extracting the smallest checkable ideas from that shape.
P90 proved the immediate mistake. A
4-word I-cache line gets more hits, but a blocking S_IC_FILL state
made the BusyBox shell workload slower. P91 fixed that policy. P92 then
added a one-entry fetch queue; it cut fetch-class stalls but did not yet
improve shell-window time. P93 added a shadow predictor and measured
whether the workload has useful control-flow regularity before letting
prediction steer fetch. P94 split the
shared memory request path into named clients so the next data-side
experiments can stop guessing. P95 tried
the conservative one-entry store buffer and proved that merely moving
store wait into a blocking drain policy makes the shell workload worse.
P96 added the first word-only D-cache, cut load
stalls by 24.99%, and produced a modest 1.44% shell-window speedup over
P94 while also exposing the need for better line-fill policy.
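The P90 versus P91 difference can be sketched as a fill-order question. The model below assumes one word arrives per cycle and that the critical-word-first fill wraps around the line; both are illustrative assumptions, and the function names are hypothetical. The text only says that P91 delivers the requested word first and fills the rest opportunistically.

```python
LINE_WORDS = 4  # four-word I-cache line, as in P90


def blocking_fill_order():
    """A blocking fill walks the line from word 0 and releases fetch at the end."""
    return list(range(LINE_WORDS))


def critical_word_first_order(miss_word):
    """Requested word first, then the rest of the line (wrap order is an assumption)."""
    return [(miss_word + i) % LINE_WORDS for i in range(LINE_WORDS)]


def cycles_until_fetch_resumes(order, miss_word, blocking):
    """Cycles before the missing word reaches fetch, assuming one word per cycle."""
    if blocking:
        return len(order)                  # fetch waits for the whole line
    return order.index(miss_word) + 1      # fetch resumes when its word arrives


miss = 2
blocking_wait = cycles_until_fetch_resumes(blocking_fill_order(), miss, blocking=True)
cwf_wait = cycles_until_fetch_resumes(critical_word_first_order(miss), miss, blocking=False)
print(blocking_wait, cwf_wait, critical_word_first_order(miss))
```

The line geometry is the same in both cases; only the point at which fetch is released changes, which is why a wider line can win on hits and still lose on time.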
P97 tried four-word D-cache lines with
critical-word-first demand loads. It improved local D-cache hit behavior
but lost the shell window because background data fills stole shared RAM
service from fetch. P98 throttled that
background fill so it only runs in frontend-safe slots. It recovered the
P96 shell timing, but the result is still one-port policy work. The
P99 map defines the Harvard
instruction/data split directly. P100
turns that map into measured instruction/data service counters while
leaving the lower shared memory path in place. P101
splits the unified TLB into ITLB and DTLB banks. Fetch walks fall 39.92%
and data walks fall 45.28%, improving the shell window 4.12% versus P100. P102
then tried a core-local store buffer and produced a useful correctness
failure: Linux reaches /init, but BusyBox faults before the shell
prompt after only 79 buffered user stores. P103
traced that failure and repaired the transaction boundary: a store-buffer
entry is now cleared only when the store-buffer request actually wins the
memory grant and memory accepts it. P104
then measured the lower shared-memory problem directly: 71.40% of
simultaneous instruction/data lower-memory wants map to different
word-interleaved banks, but the current one-port fabric still serializes
them. P105 adds a conservative
banked service model and finds 8.29M shell-window extra grants that
could be serviced on different lower banks if the memory contract were
widened. P106 widens that
contract: the RTL emits an auxiliary read lane and the Verilator memory
model services 20.63M auxiliary reads with 0 errors while Linux runs.
P107 feeds that response back
into one narrow core client: D-cache background fill. It consumes 10.07M
auxiliary fills and trims the shell window by 288,864 cycles versus
P106, which is a mechanism PASS but not the end of the memory problem.
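A minimal sketch of the word-interleaved bank test behind P104 through P106. The bank count is an assumption (the text does not state it), and the function names are hypothetical. The grant rule follows the text: instruction demand keeps the main port, and a blocked data request may use the auxiliary lane only when it is read-like and lands on a different bank.

```python
WORD_BYTES = 4
NUM_BANKS = 2  # assumption: the text does not state the number of lower banks


def bank_of(addr):
    """Word-interleaved bank select: consecutive words land on consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def service(i_addr, d_addr, d_is_read):
    """One cycle where instruction and data both want lower memory.

    Returns (main_grant, aux_grant). The instruction side always wins the
    main port; the data side gets the aux read lane only if it is a read on
    a different bank.
    """
    main_grant = ("I", i_addr)
    if d_is_read and bank_of(d_addr) != bank_of(i_addr):
        return main_grant, ("D", d_addr)
    return main_grant, None


print(service(0x1000, 0x2004, d_is_read=True))   # different banks: aux lane used
print(service(0x1000, 0x2008, d_is_read=True))   # same bank: data waits
print(service(0x1000, 0x2004, d_is_read=False))  # store: not read-like, data waits
```

Counting the first case over a run is the kind of "extra grant" measurement that P105 models before P106 makes the lane real.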
P108 adds the first
instruction-side consumer: blocked writeback prefetch fills the I-cache
through the auxiliary response. It consumes 488K instruction-side aux
prefetch fills and improves the shell window by another 445,555 cycles.
P109 makes that same auxiliary
writeback-prefetch response demand-visible when S_WB drains the store
buffer on the main port. It bypasses 488K prefetches into S_EXECUTE and
cuts S_FETCH by 481,840 cycles, but shell-window speedup is FAIL versus P108.
P110 routes those responses through
one owner/address/data/error/cancel record. It counts 488K writeback
prefetch responses and 9.98M D-cache background responses with 0 errors
and 0 cancels.
P111 makes the load owner real:
3.545M aligned integer load misses complete through the auxiliary response
while the main port fetches a safe next-PC word. That proves the tagged
slot can carry architectural load data, but the first policy is too
eager and regresses the shell window by 1,005,481 cycles versus P110.
P112 puts that load response behind
a one-entry queue. It records 3.677M enqueues and matching dequeues with
0 drops, 0 errors, and 0 cancels, but the registered completion cycle
pushes the shell window to 71.95M cycles.
P113 then tightens the issue policy,
blocks all eager aux-load candidates, and recovers most of P112’s
regression. P114 measures PTW aux
ownership and finds no safe read-like walker candidates in the shell
workload. P115 adds the
frontend target queue metadata needed before active predictor steering:
54.24M fills, 54.24M consumes, and 0 flushes.
P116 tries to use that
metadata for active steering, catches a real early-boot failure in the
live predicted-target prefetch path, and lands as a guarded counter rung:
4.46M steering candidates, 0 issued fills, Linux and BusyBox still PASS.
P117 adds the one-entry
speculative target-buffer record and proves the next hazard more
precisely: with live issue enabled only in userspace, Linux reaches
/init, then BusyBox faults at badaddr=0x00000000. The checked-in
guarded version records 215K candidates, 0 issued fills, and keeps the
shell profile passing.
P118 switches back to the data side and
names the current monolithic execute plus S_MEM path as an LSU-shaped
measurement: 27.90M address-generation events, 27.22M DTLB hits, 671K
DTLB misses, 4.60M D-cache-hit cycles, and 1.16M store-buffer accepts.
P119 adds the first shadow
request record and scoreboard-style busy accounting: 27.89M request
allocs, 27.75M classified completes, 443 flushes, and 30.69M busy
cycles while the BusyBox shell still passes.
P120 starts the backend-renaming
arc without changing architectural state: a 64-entry shadow integer PRF
map records 147.10M source reads, 59.31M integer destination
allocations, matching frees/commits, and a passing BusyBox shell
profile. Because the current in-order writeback allocates and frees in
the same cycle, live physical-register pressure remains 32; P121 is
where ROB lifetime should make that pressure real.
P121 adds that first lifetime model:
one shadow ROB record allocates a physical tag at S_EXECUTE, commits
or flushes it at S_WB/trap time, and keeps architectural regs[]
unchanged. The shell workload passes with 59.18M ROB allocs, 59.18M
commits, 179 flushes, 0 missing commits, and max live PRF pressure of 33.
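Why the pressure moves from 32 to 33 can be sketched with a toy rename model. This is not the repo's shadow logic: the class, the identity reset mapping, and the free-list order are assumptions. It keeps only the parts the text describes: a 64-entry physical file, one shadow ROB record, allocation at execute, and commit or flush later.

```python
from collections import deque

ARCH_REGS = 32
PHYS_REGS = 64  # 64-entry shadow integer PRF, as in P120


class ShadowRename:
    """Hypothetical one-record shadow ROB over a shadow rename map."""

    def __init__(self):
        self.map = list(range(ARCH_REGS))               # arch -> phys, identity at reset (assumed)
        self.free = deque(range(ARCH_REGS, PHYS_REGS))  # remaining physical tags
        self.rob = None                                 # (rd, new_tag) while in flight
        self.max_live = ARCH_REGS

    def live(self):
        return PHYS_REGS - len(self.free)

    def execute(self, rd):
        """Allocate a new tag for rd; it stays live alongside the old mapping."""
        tag = self.free.popleft()
        self.rob = (rd, tag)
        self.max_live = max(self.max_live, self.live())

    def commit(self):
        """Make the new tag architectural and free the tag it replaces."""
        rd, tag = self.rob
        self.free.append(self.map[rd])
        self.map[rd] = tag
        self.rob = None

    def flush(self):
        """Discard the in-flight tag; the architectural map is unchanged."""
        _, tag = self.rob
        self.free.append(tag)
        self.rob = None


r = ShadowRename()
for rd in (5, 6, 5):
    r.execute(rd)
    r.commit()
r.execute(7)
r.flush()
print(r.max_live, r.live())
```

With one record in flight the model can never hold more than 33 live tags, and it returns to 32 after every commit or flush, which matches the P120 and P121 numbers.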
P122 grows that record into a
four-entry ring and gets the useful negative result: Linux and BusyBox
still pass, alloc/commit/flush accounting balances, but max occupancy is
still 1. The current FSM has no separate dispatch/issue path, so a
bigger ROB is just a bigger counter container until P123 splits backend
progress from writeback.
P123 measures that split point
directly: 40.40M cycles where the frontend has queued work while the
backend is still busy. The shadow dispatch queue allocs and drains
40.40M records with no full blocks and max occupancy 1, which says the
next backend rung should be a real shadow issue slot with explicit
block reasons.
P124 adds that one-entry shadow
integer issue slot. It accepts 9.80M simple integer queued ops, blocks
11.30M on modeled dependencies, 6.66M on memory-class work, 9.33M on
control flow, and 76.6K on system/fence instructions. The next backend
rung is source-ready bookkeeping and class-specific holding records.
P125 adds the source-ready
half: an architectural busy-bit scoreboard over queued simple integer
ops. It finds 13.10M all-sources-ready candidates and 11.31M
source-busy candidates, while max busy architectural register count
stays 1. That keeps the next step pointed at memory/control class
records, not a bigger integer slot.
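A minimal sketch of the busy-bit check, assuming a queued op carries its source register numbers plus source-use bits. The dictionary layout and function name are hypothetical; treating x0 as never busy follows from it being hardwired to zero in RISC-V.

```python
def classify(op, busy):
    """Classify a queued simple integer op against a busy-bit scoreboard.

    op:   dict with rs1, rs2 and the uses_rs1 / uses_rs2 source-use bits
    busy: set of architectural registers with a pending write
    """
    sources = ((op["rs1"], op["uses_rs1"]), (op["rs2"], op["uses_rs2"]))
    # Only sources the instruction actually reads can block it; x0 is never busy.
    needed = [reg for reg, used in sources if used and reg != 0]
    if any(reg in busy for reg in needed):
        return "source_busy"
    return "all_sources_ready"


busy = {5}  # one older op is still writing x5
add_op = {"rs1": 5, "rs2": 6, "uses_rs1": True, "uses_rs2": True}
lui_op = {"rs1": 5, "rs2": 0, "uses_rs1": False, "uses_rs2": False}
print(classify(add_op, busy), classify(lui_op, busy))
```

With at most one busy register at a time, an op can only ever wait on the instruction directly ahead of it, which is why a wider integer slot would not help here.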
P126 adds those
records: a one-entry memory holding model and a one-entry control
holding model. It counts 6.66M memory candidates, 9.33M control
candidates, 3.41M memory source-busy blocks, 3.02M control source-busy
blocks, and 29.9K control full-record blocks. The next useful backend
step is wakeup/issue eligibility across the held classes.
P127 adds that ready-mask
model. It samples 43.66M cycles with at least one held record and finds
37.18M cycles where some record is ready, but every ready cycle is
single-class: zero integer+memory, integer+control, memory+control, or
triple-ready cycles. The next useful backend step is queue/lifetime
depth so held work can coexist before any real multi-issue selector.
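The ready-mask accounting can be sketched as a per-cycle classifier. The bucket names and the sample trace below are made up for illustration; only the idea of sorting each sampled cycle into single-class, pairwise, or triple-ready buckets comes from the text.

```python
from collections import Counter


def ready_bucket(int_ready, mem_ready, ctl_ready):
    """Name the set of held classes that are ready in one sampled cycle."""
    names = [name for name, ready in
             (("integer", int_ready), ("memory", mem_ready), ("control", ctl_ready))
             if ready]
    if not names:
        return "none"
    if len(names) == 3:
        return "triple"
    return "+".join(names)


# Hypothetical trace: every ready cycle is single-class, as P127 observed.
trace = [
    (True, False, False),
    (False, True, False),
    (False, False, False),
    (False, False, True),
    (True, False, False),
]
counts = Counter(ready_bucket(*sample) for sample in trace)
dual_or_more = sum(n for bucket, n in counts.items() if "+" in bucket or bucket == "triple")
print(dict(counts), dual_or_more)
```

A multi-issue picker only has something to choose between when the pairwise or triple buckets are non-zero, so a zero there is the signal to work on queue depth first.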
P128 adds that queue model:
capacity-4 integer, memory, and control shadow queues with a one-lane
fixed-priority drain. It passes the shell workload, but each queue still
maxes at occupancy 1 and the model records zero dual-ready cycles. The
next useful backend step is to decouple scheduler arrival from backend
service more honestly before trying a two-issue picker.
P129 moves that arrival
accounting earlier and still gets max occupancy 1, with 13.10M integer
arrivals, 3.25M memory arrivals, 6.31M control arrivals, and zero
dual-ready cycles. That answers the abstraction question: the current
split is useful instrumentation, but not yet a proper frontend/backend
ready/valid contract. P130
extracts that contract into a plain-RTL helper with explicit
valid, ready, and fire signals. It passes the shell workload and
matches the older scheduler counters, but still reports zero
backpressure, zero dual-ready cycles, and no queue depth beyond 1. The
next useful step is a small state-owning dispatch queue module, not a
picker. P131 takes that step:
the queue state is owned by p131_dispatch_queue_module3, and the
module’s fires, backpressure, ready count, max occupancy, and flush
clears match the older shadow accounting under the BusyBox shell
profile. It still maxes every class at occupancy 1, so the next
refactor is payload ownership, not issue width.
P132 adds that ownership:
the module captures PC, opcode, rd, rs1, rs2, and source-use bits on
arrival fire. Payload accepts match arrival fires, payload services
match service fires, payload flush clears account for the remaining
delta, and invariant errors stay at zero.
P133 closes the
conservative module-boundary proof: it compares every accepted
module-owned payload against the older integer, memory, and control
classifiers. The shell workload records 22.66M class audits, including
13.10M integer, 3.25M memory, and 6.31M control audits, with 0
mismatches. This still does not create speed or queue depth; it creates
the cleanest point so far to decide whether an active dispatch queue is
worth trying.
P134 chooses the pivot
instead: the aux-load path is re-enabled only when the main port can
perform a useful next-PC instruction prefetch and both cache background
fill paths are quiet. It issues 139,881 aux loads, records 0 queue
drops, and improves shell-window time by 0.56% versus P133. The next
question is no longer “more backend scaffolding?” It is “which
background cache policy is blocking useful overlap?”
P135 answers that with
mutually exclusive buckets. The D-cache-only bucket is small at 196K.
The I-cache-only bucket is 1.62M, and both-backgrounds is 1.86M. P135 is
not a speed rung; its shell window is worse than P134. Its value was the
narrow target it gave P136: allow useful next-PC prefetch plus aux-load
issue to preempt I-cache background fill while keeping D-cache background
fill quiet.
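The bucket split can be sketched as a small classifier plus the share arithmetic. The function and bucket names are hypothetical, and the counts are the rounded figures from the text, so the shares are approximate.

```python
def block_bucket(prefetch_guard_ok, dcache_bg_busy, icache_bg_busy):
    """Assign one aux-load candidate to exactly one bucket."""
    if not prefetch_guard_ok:
        return "prefetch_guard"      # never reaches the cache-background checks
    if dcache_bg_busy and icache_bg_busy:
        return "both"
    if icache_bg_busy:
        return "icache_only"
    if dcache_bg_busy:
        return "dcache_only"
    return "issue"                   # nothing blocks: the aux load may issue


# Rounded P135 bucket counts, in candidates.
blocked = {"dcache_only": 196e3, "icache_only": 1.62e6, "both": 1.86e6}
total = sum(blocked.values())
shares = {name: count / total for name, count in blocked.items()}
icache_involved = shares["icache_only"] + shares["both"]

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"I-cache background involved: {icache_involved:.1%}")
```

Because the buckets are mutually exclusive, the shares add to one, and the I-cache-only bucket is the largest one that a single policy change can unblock without touching D-cache background fill.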
P136 runs that exact test.
The hardware path stays correct and issues 1.77M aux loads with 0
queue drops, 0 errors, and 0 cancels, but shell-window time regresses by
1.98M cycles versus P135. That makes P136 a useful negative result:
I-cache background fill is not just decorative; interrupting it without
an age/debt limit steals too much instruction-line repair.
| # | Project | What it adds | Status and result |
|---|---|---|---|
| 91 | Critical-word-first / nonblocking I-cache fill buffer | On an I-cache miss, deliver the requested word as soon as it arrives, then fill the rest of the line opportunistically. | Done. P91 beats P90 and slightly beats P89 on shell-window cycles, but does not yet beat P89 on fetch-stall cycles. |
| 92 | One-entry fetch queue between frontend and execute | Safe S_EXECUTE next-PC prefetch with S_WB queue consume before normal writeback prefetch. | Done. Queue fills/consumes 53.98M instructions and cuts fetch stalls, but shell-window speedup is FAIL versus P91. |
| 93 | Branch predictor v0 | Shadow 32-entry BTB, 2-bit direction counters, and 8-entry return-address stack. | Done. RAS target accuracy is 96.57%; BTB target accuracy is weaker, so steering waits for a better frontend path. |
| 94 | Memory arbiter v0 | Separate request classification and arbitration for fetch, prefetch, background I-cache fill, load, store, FP, AMO, and page-walk traffic behind the shared external RAM model. | Done. Only background I-cache fill is denied service; foreground data traffic is paying shared-memory latency. |
| 95 | Store buffer v0 | One-entry external-RAM store buffer at the SoC boundary, with accept/drain/block counters. | Done. Store stalls collapsed, but fetch stalls rose 51.02%; shell-window speedup is FAIL. |
| 96 | D-cache v0 | Direct-mapped, word-only, write-through data cache for aligned external-RAM LW/SW, with hit/miss/fill/update/invalidation counters. | Done. Shell-window speedup is PASS versus P94, load stalls fall 24.99%, but fetch stalls rise 13.28%. |
| 97 | Four-word D-cache line fill | Critical-word-first data-cache line fill, then background fill through the P94 arbiter. | Done. Load stalls fall again, but fetch stalls rise 10.94% versus P96; shell-window speedup is FAIL. |
| 98 | Throttled D-cache background fill policy | Keep P97’s line geometry, but only fill remaining words when the frontend already has useful work queued and I-cache fill is quiet. | Done. Shell-window speedup is PASS versus P96, but this is still a shared-port scheduling patch. |
| 99 | Harvard I/D service map | Draw the actual instruction/data architecture boundary, list what this core lacks, and decide which measurements define success. | Done. P99 is functional PASS, but not a speed rung; it sets P100’s split-port acceptance criteria. |
| 100 | Split instruction/data memory service model | Group fetch/I-cache/instruction-PTW and LSU/D-cache/data-PTW traffic into separate service intentions, then count lower shared conflicts. | Done. Instruction demand is always granted by the current policy; data wants 59.35M cycles and is not granted for 32.40M cycles. |
| 101 | Split ITLB/DTLB lookup path | Replace the unified 8-entry TLB with separate 8-entry ITLB and DTLB banks while keeping the walker shared. | Done. Shell window improves 4.12% versus P100; fetch walks fall 39.92% and data walks fall 45.28%. |
| 102 | Data-side write buffer with forwarding | Add a core-local translated one-entry store buffer and instrument accept/drain/forward behavior. | Partial. Verilator builds and Linux reaches /init, but BusyBox faults before the shell prompt after 79 buffered stores. Next rung should trace/fix this before adding more nonblocking machinery. |
| 103 | Store-buffer trace and repair | Add grant-qualified store-buffer tracing and fix the request/grant/clear contract. | Done. BusyBox shell reaches P103-FILE-OK; 1.16M stores accept and drain correctly. Shell-window speedup is FAIL versus P101 because the policy still drains before fetch. |
| 104 | Banked lower memory conflict counters | Measure how instruction-side and data-side clients would map onto banked lower memory, before pretending the near-core Harvard split has solved the shared-port problem. | Done. 20.56M simultaneous I/D lower-memory wants land on different banks; that is 71.40% of the conflict window and the target for P105. |
| 105 | Banked lower service model | Model same-cycle instruction/data lower-memory grants when selected banks differ and the blocked request is read-like. | Done. The conservative model finds 8.29M shell-window extra grants and projects a 56.15M-cycle shell window if each grant saves one cycle. |
| 106 | Banked lower-memory contract | Add an auxiliary read lane at the simulator/RTL boundary and service safe different-bank reads from the Verilator memory model. | Done. The aux lane services 20.63M reads total and 8.46M during the shell window, matching the model exactly with 0 errors. |
| 107 | Banked auxiliary D-cache fill | Feed one narrow auxiliary response class back into the core, starting with D-cache background fill. | Done. The core consumes 10.07M auxiliary D-cache fills with 0 aux errors and improves the shell window by 0.44% versus P106. |
| 108 | Banked auxiliary I-cache fill | Consume blocked instruction writeback-prefetch responses as I-cache fills while retaining P107’s D-cache background-fill consumer. | Done. The core consumes 488K aux I-cache prefetch fills and improves the shell window by 0.68% versus P107. |
| 109 | Banked auxiliary demand prefetch | Let one demand-visible path consume an auxiliary response, starting with the S_WB store-buffer drain plus writeback-prefetch overlap. | Done. The core bypasses 488K auxiliary prefetches into execute and cuts S_FETCH cycles, but shell-window speedup is FAIL versus P108. |
| 110 | Tagged auxiliary response slot | Replace one-off aux consumers with owner/address/data/error/cancel metadata for fetch, prefetch, background fill, and later load-miss service. | Done. The slot records 10.47M tagged aux responses with 0 errors and 0 cancels while the shell workload reaches P110-FILE-OK. |
| 111 | Nonblocking aligned load-miss aux consumer | Let one data-side load miss own a tagged aux response without violating store, trap, or D-cache invalidation ordering. | Done. 3.545M load aux responses, 0 errors, 0 cancels; speedup FAIL versus P110. |
| 112 | Aux response queue / one-entry MSHR | Register one outstanding aux response so useful overlap can survive a cycle of consumer backpressure and policy can distinguish load demand from background fill. | Done. 3.677M queue enqueues/dequeues, 0 drops, 0 errors, 0 cancels; speedup FAIL versus P111. |
| 113 | Load-miss issue policy v2 | Gate aux-load issue using measured frontend pressure, D-cache line state, and background-fill debt. | Done. Blocks all 5.39M aux-load candidates, recovers most of P112, but remains slower than P110. |
| 114 | PTW aux owner measurement | Count safe PTW aux opportunities and A/D-write blocks before trying a walker consumer. | Done. No safe PTW aux candidates in shell workload; 73 A/D write blocks. |
| 115 | Frontend target queue | Replace the one-entry fall-through fetch queue with an FTQ-like target queue that can hold predicted PCs and fetch metadata. | Done. 54.24M FTQ fills and matching consumes with 0 flushes; this is scaffolding, not a speedup rung. |
| 116 | Active predictor steering guardrail | Let the P93 predictor steer fetch, then repair mispredicts with explicit flush/accounting. | Done as a guardrail. The live target-prefetch attempt wedged before the kernel banner, so P116 gates issue off and records 4.46M candidates for the next speculative-buffer rung. |
| 117 | Speculative target buffer guardrail | Hold predicted-target fetch data outside the architectural fetch queue, then promote or discard it when the queued branch resolves. | Done as a guardrail. Live issue reaches /init but faults BusyBox at badaddr=0x00000000; guarded issue records 215K candidates and keeps BusyBox passing. |
| 118 | LSU shape counters | Split the existing execute plus S_MEM behavior into address-generation, DTLB, D-cache/store-buffer, and lower-memory counters. | Done. 27.90M address-generation events, 27.22M DTLB hits, 671K DTLB misses, 4.60M D-cache-hit cycles, and 1.16M store-buffer accepts. |
| 119 | LSU request record / in-order scoreboard | Add explicit in-flight LSU request metadata and scoreboard-style busy bits while preserving in-order commit. | Done as a shadow record. 27.89M allocs, 27.75M classified completes, 443 flushes, and 30.69M busy cycles. |
| 120 | Physical register-file sketch | Add a documentation/simulation rung for rename maps, free list, and PRF sizing before changing architectural commit. | Done as a shadow map. 147.10M source reads, 59.31M integer PRF allocations, matching frees/commits, and live pressure stays 32 because commit is still in-order writeback. |
| 121 | ROB commit model sketch | Model in-order commit, exception replay, and flush policy in the harness before trying to execute out of order. | Done as a one-entry shadow ROB. 59.18M allocs, 59.18M commits, 179 flushes, 0 missing commits, and max live PRF pressure of 33. |
| 122 | Multi-entry ROB/free-list sketch | Try a tiny multi-entry ROB/free-list sketch before any scheduler or real out-of-order execution. | Done as a four-entry shadow ring. Max occupancy stays 1, proving the next missing boundary is dispatch/issue separation. |
| 123 | Dispatch/issue split sketch | Let a shadow dispatch record get ahead of writeback before trying a scheduler. | Done as an opportunity model. It finds 40.40M frontend-ready/backend-busy cycles, but still no issue depth beyond 1. |
| 124 | Shadow integer issue slot | Add one modeled issue slot and classify blocks before trying a scheduler. | Done. 9.80M queued simple integer ops accepted; dependency, memory, and control classes dominate the remaining blocks. |
| 125 | Source-ready scoreboard model | Track which queued source operands are ready instead of using one crude older-destination dependency rule. | Done. 13.10M queued simple-integer candidates have all sources ready; max busy architectural register count is still 1. |
| 126 | Memory/control holding records | Split memory-class and control-flow queued work out of the integer slot model. | Done. 6.66M memory candidates, 9.33M control candidates, and explicit source-busy/full-record block reasons. |
| 127 | Scheduler wakeup/issue eligibility | Model which held integer, memory, and control records could issue together once sources wake up. | Done. 37.18M single-ready cycles, but 0 dual-ready cycles; records still do not coexist. |
| 128 | Scheduler queue/lifetime depth | Keep multiple class records alive in the shadow model before trying a multi-issue picker. | Done. Capacity-4 queues accept 9.82M integer, 3.25M memory, and 6.29M control records, but max occupancy is still 1 and dual-ready cycles remain 0. |
| 129 | Scheduler arrival/service decoupling | Let modeled scheduler arrivals and backend service diverge enough to test whether class coexistence is possible. | Done. Arrival/service counts still track one another and all class queues max at occupancy 1. |
| 130 | Ready/valid contract extraction | Turn the measured frontend/backend boundary into explicit plain-RTL valid/ready/fire wires before adding more scheduler policy. | Done. The helper reports 22.65M arrival-fire cycles, matching service fires closely, with zero backpressure and zero dual-ready cycles. |
| 131 | Dispatch queue module extraction | Move from a combinational contract helper to a small module that owns queue state and exposes valid/ready/fire wires. | Done. Module-owned counters exactly match the P130-style contract helper; max occupancy remains 1. |
| 132 | Dispatch payload record | Add payload fields to the dispatch queue module and compare accept, service, and flush behavior before making it active. | Done. Payload accepts/services match queue fires, append-without-storage is 0, and invariant errors are 0. |
| 133 | Dispatch payload class audit | Compare the module-owned payload record against the older issue-slot and memory/control classifiers. | Done. 22.66M class audits, 0 mismatches, max occupancy still 1. |
| 134 | Aux load prefetch policy | Pivot back to frontend/memory: issue guarded aux loads only when the main port can prefetch next-PC safely. | Done. 139,881 aux loads, 0 queue drops, shell window improves by 360,275 cycles versus P133. |
| 135 | Cache background policy audit | Explain remaining aux-load blocks from I-cache and D-cache background fill before relaxing either policy. | Done. I-cache-only blocks dominate D-cache-only blocks, 1.62M versus 196K. |
| 136 | I-cache background preempt | Let useful next-PC prefetch plus aux-load issue preempt I-cache background fill while keeping D-cache background quiet. | Done. Mechanism PASS, speedup FAIL: 1.77M aux loads issue, but the shell window worsens by 1.98M cycles versus P135. |
| 137 | Bounded memory arbitration | Keep P136’s measured opportunity, but limit I-cache-background preemption with a one-preempt/one-defer burst counter. | Done. 1.67M aux loads issue, 72.9K candidates defer, and the shell window barely beats P135. |
| 138 | Debt memory arbitration | Replace P137’s crude burst cap with a debt counter that preemption increments and I-cache background service pays down. | Done. 1.67M aux loads issue, 1.50M debt paydowns balance 1.50M preemptions, and the shell window beats P137/P135 but not P134. |
| 139 | I-cache repair usefulness audit | Count whether background-repaired I-cache words are consumed by fetch soon afterward. | Done. 52.85M repair word fills produce only 1.93M first later fetch hits and 0.83M repeat hits, so P140 should make repair policy value-aware. |
| 140 | Repair-aware I-cache arbitration | Give each foreground I-cache fill a one-word adjacent background repair budget. | Done. Repair fills fall 39.4% and shell window improves by 673K cycles versus P139, but the policy is too stingy and loses line locality. |
| 141 | Adaptive second-word I-cache repair | Give demand-fetch lines a second repair word while prefetch-only lines stay at one. | Done. It beats P140 by 746K cycles, beats P138 by 348K, and trails P134 by only 67.6K cycles. |
| 142 | Selective prefetch second-word repair | Give a second repair word to prefetch lines when they are immediately consumed by the frontend. | Done. It beats P141 by 362.6K cycles and P134 by 295.0K cycles, but the prefetch grant bucket is too broad. |
| 143 | Prefetch consumer repair classifier | Split P142’s broad frontend-consuming prefetch bucket into profitable and wasteful consumers. | Done as an audit rung. Execute-prefetch repair is the bad bucket: 39.10M fills at only 2.83% first+repeat hit ratio. |
| 144 | Execute-prefetch repair throttle | Stop giving execute-prefetch fills a second repair word by default while keeping higher-payback repair classes. | Done. It cuts 10.01M repair fills and beats P143 by 564K cycles, but trails P142 by 266K cycles. |
| 145 | Conditional execute-prefetch repair | Re-enable execute-prefetch second-word repair only when a local sequential-adjacent condition says it is worth the traffic. | Done. RTL PASS, speedup FAIL: it adds 165,881 prefetch second-word grants but worsens the shell window by 908K cycles versus P144. |
| 146 | Execute-prefetch predicate audit | Roll back to P144 behavior and shadow-count multiple execute-prefetch usefulness predicates before changing the active repair policy again. | Done. The P145 seq_adjacent predicate produces zero later fetch hits; predicted_not_taken is the best single candidate, but only reaches 2.91% first+repeat hits per fill. |
| 147 | Strict execute-prefetch composite guard | Test one guarded combination of the P146 predicates as an active second-word repair policy. | Done. It beats P146 by 850K cycles, but loses to P144 by 620K and P142 by 886K, so this repair bucket is closed. |
| 148 | Frontend/memory pivot after execute-prefetch repair | Pick the next bottleneck from the shell profile now that execute-prefetch second-word repair is no longer the target. | Next. The likely direction is a fresh bottleneck audit rather than another local execute-prefetch predicate. |
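
The P138 idea, a debt counter that preemption increments and I-cache background service pays down, can be sketched as a small behavioral model. This is a Python illustration, not the repo's RTL: the class name, the `max_debt` cap, and the tie-break rule (aux wins while debt is under the cap) are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class DebtArbiter:
    """Toy arbiter between aux-load issue and I-cache background fill.

    All names and the max_debt value are hypothetical, chosen only to
    illustrate the debt idea; they are not taken from the RTL.
    """
    max_debt: int = 4
    debt: int = 0
    preemptions: int = 0
    paydowns: int = 0
    deferrals: int = 0

    def grant(self, aux_wants: bool, bg_wants: bool) -> str:
        """Return the winner for this cycle: 'aux', 'bg', or 'idle'."""
        if aux_wants and bg_wants:
            if self.debt < self.max_debt:
                # Aux load preempts background fill and takes on debt.
                self.debt += 1
                self.preemptions += 1
                return "aux"
            # Debt is at the cap: the aux candidate defers, background
            # fill is served, and that service pays one unit back.
            self.deferrals += 1
            self.debt -= 1
            self.paydowns += 1
            return "bg"
        if bg_wants:
            # Uncontested background service also pays debt down.
            if self.debt > 0:
                self.debt -= 1
                self.paydowns += 1
            return "bg"
        if aux_wants:
            return "aux"
        return "idle"
```

In this model `preemptions - paydowns` always equals the outstanding `debt`, so once background fill has drained the debt the two counters balance, which is the same shape as the P138 row's paydowns balancing preemptions.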
XiangShan Gap Check
XiangShan/Kunminghu is the north star for architectural shape, not a literal near-term implementation target. The Kunminghu V2R2 microarchitecture guide describes a decoupled frontend with ICache, FDIP, and a branch prediction unit; the memory-subsystem guide calls out MSHR-managed fetch/prefetch misses, uFTB, FTB, TAGE-SC, ITTAGE, and RAS. The current repo has tiny versions of some concepts: I-cache, D-cache, ITLB/DTLB, fetch queue, BTB/counter/RAS measurement, store buffer, banked lower-memory model, and an aux response slot. It does not yet have an FTQ, real predictor steering, MSHRs, a load queue/store queue, rename, schedulers, ROB, vector unit, large L2, or XiangShan-scale verification.
That puts the next 10 projects in a sane order: finish the nonblocking memory contract first, then improve frontend steering, and only then start backend speculation scaffolding.
Harvard arc
For this repo, “Harvard” means the core has separate instruction and data service close to execute: fetch, I-cache, and instruction translation can keep feeding the frontend while loads, stores, AMOs, and data translation use a different path. It does not require two totally separate DRAM systems forever. Real designs usually rejoin at a lower cache or memory fabric; the important part is that an L1 data event does not automatically steal the one cycle the frontend needed.
A useful Harvard-shaped memory system for this core would have:
- an instruction path: next-PC generation, I-cache, instruction fill buffer, and ITLB lookup
- a data path: LSU, D-cache, store buffer, AMO handling, and DTLB lookup
- independent near-core service ports or banks so hits do not arbitrate against each other
- miss tracking so one outstanding refill does not stall unrelated work
- a lower shared memory level with explicit conflict counters, not a hidden single-port bottleneck
- ordering and invalidation rules for fence, sfence.vma, satp, stores, AMOs, and page-table A/D writes
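
As a rough illustration of the "explicit conflict counters" item, here is a minimal Python sketch that classifies each cycle's instruction-side and data-side wants by lower-memory bank. The two-bank, word-interleaved mapping and all function names are assumptions for the example; the real bank selection lives in the RTL.

```python
def bank_of(addr: int, num_banks: int = 2, word_bytes: int = 4) -> int:
    """Hypothetical word-interleaved bank mapping (not the RTL's)."""
    return (addr // word_bytes) % num_banks


def classify_wants(cycles):
    """Count how I-side and D-side wants interact per cycle.

    `cycles` is an iterable of (i_addr, d_addr) pairs, where None means
    that side did not want the lower memory in that cycle.
    """
    counts = {"i_only": 0, "d_only": 0, "split_bank": 0, "same_bank": 0}
    for i_addr, d_addr in cycles:
        if i_addr is None and d_addr is None:
            continue  # nobody wants memory this cycle
        if d_addr is None:
            counts["i_only"] += 1
        elif i_addr is None:
            counts["d_only"] += 1
        elif bank_of(i_addr) != bank_of(d_addr):
            counts["split_bank"] += 1  # both could be served at once
        else:
            counts["same_bank"] += 1   # a true conflict
    return counts


def split_bank_share(counts) -> float:
    """Fraction of simultaneous wants that land on different banks."""
    both = counts["split_bank"] + counts["same_bank"]
    return counts["split_bank"] / both if both else 0.0
```

A counter of this shape is what lets a rung report a split-bank share of simultaneous I/D wants instead of hiding the conflict behind a single port.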
What we lack today is exactly the interesting part. The core has I-cache and D-cache experiments, a fetch queue, and a named memory arbiter, but they still negotiate behind one shared external RAM service. Translation storage is now split by P101, but the page-table walker is still shared. D-cache misses are blocking demand events plus optional background fill, stores now have a conservative one-entry buffer but no useful forwarding yet, and there is no MSHR-like miss tracking.
The next projects should make that gap visible:
| # | Project | Question |
|---|---|---|
| 99 | Harvard I/D service map | Where exactly do instruction fetch and data access split in this RTL, and what counters prove the split matters? |
| 100 | Split I/D memory service model | How much instruction/data contention is still hidden below the near-core split? |
| 101 | Split ITLB/DTLB lookup path | How much translation interference remains after the memory-service split? |
| 102 | Data-side write buffer with forwarding | Partial: the first core-local buffer corrupts BusyBox before the shell prompt. |
| 103 | Store-buffer trace and repair | Fixed: request, grant, and store-buffer clear now describe the same memory transaction. |
| 104 | Banked lower memory conflict counters | Measured: 71.40% of simultaneous I/D lower-memory wants are split-bank opportunities. |
| 105 | Banked lower service model | Modeled: 8.29M shell-window extra read-like grants could be serviced on different lower banks. |
| 106 | Banked lower-memory contract | Proven: the widened boundary services the modeled auxiliary reads with 0 errors. |
| 107 | Banked auxiliary D-cache fill | Proven: one non-architectural core client can consume the second response while Linux and BusyBox keep running. |
| 108 | Banked auxiliary I-cache fill | Proven: instruction-side prefetch can consume the second response and fill I-cache state safely. |
| 109 | Banked auxiliary demand prefetch | Proven: the second response can advance frontend state for S_WB prefetch while a store-buffer drain uses the main port. |
| 110 | Tagged auxiliary response slot | Proven: fetch, prefetch, and background-fill classes can share one explicit response ownership record. |
| 111 | Nonblocking aligned load-miss aux consumer | Proven functionally: 3.545M aligned load misses complete from the aux response while instruction service uses the main port, but the first policy is slower. |
| 112 | Aux response queue / one-entry MSHR | Proven functionally: the core preserves and drains the load response, but queueing every eager aux-load opportunity is too expensive. |
| 113 | Load-miss issue policy v2 | Proven: the conservative gate blocks all aux-load issues and recovers most of P112. |
| 114 | PTW aux owner measurement | Proven: no safe PTW aux read candidates appear in the shell workload; A/D writes remain ordered. |
| 115 | Frontend target queue | Proven: the frontend can hold predicted-target metadata exactly alongside the fetch queue, with 54.24M fills and consumes and 0 flushes. |
| 116 | Active predictor steering guardrail | Answer: not with the existing architectural fetch queue. The live attempt wedged before the kernel banner; the guarded version counts 4.46M candidates and keeps BusyBox passing. |
| 117 | Speculative target buffer guardrail | Answer: the record alone is not enough. The live issue path still perturbs userspace, so the passing rung gates issue off and records 215K candidates for a stricter promotion/repair contract. |
| 118 | LSU shape counters | Proven: the current data path can be reported as address generation, DTLB service, cache/store-buffer service, and S_MEM completion without changing behavior. |
| 119 | LSU request scoreboard | Proven: an in-order shadow request record can track alloc/complete/flush/busy counts without breaking Linux or BusyBox. |
| 120 | PRF rename sketch | Proven: a shadow integer PRF map can measure source reads and destination allocation pressure without changing architectural commit. |
| 121 | ROB commit model | Proven: a one-entry shadow ROB can balance alloc/commit/flush/free lifetime while Linux and BusyBox still pass. |
| 122 | Multi-entry ROB sketch | Proven negative: a four-entry ROB ring remains occupancy 1 under the current serialized backend. |
| 123 | Dispatch/issue split sketch | Proven measurement: 40.40M frontend-ready/backend-busy cycles exist, so the next useful backend boundary is an issue slot rather than a larger ROB. |
| 124 | Shadow integer issue slot | Proven measurement: a one-entry shadow slot accepts 9.80M simple integer ops, while dependency, memory, and control classes define the next blockers. |
| 125 | Source-ready scoreboard model | Proven measurement: 13.10M queued integer candidates have ready modeled sources, but the shadow backend still only has one busy architectural destination at a time. |
| 126 | Memory/control holding records | Proven measurement: memory has 3.25M ready accepts and no full-record pressure; control has 6.28M accepts plus 29.9K full-record blocks. |
| 127 | Scheduler wakeup/issue eligibility | Proven negative: the scheduler sees 37.18M ready cycles, but no two held classes are ready in the same cycle under the current lifetime model. |
| 128 | Scheduler queue/lifetime depth | Proven negative: capacity-4 class queues still max at occupancy 1, so arrival/service coupling remains the blocker. |
| 129 | Scheduler arrival/service decoupling | Proven negative: earlier arrival accounting still maxes every class at occupancy 1, so the next missing abstraction is a real ready/valid boundary. |
| 130 | Ready/valid contract extraction | Proven: the contract can be named as plain RTL and measured under Linux. Proven negative: without a state-owning queue, the backend still has no class coexistence. |
| 131 | Dispatch queue module extraction | Proven: one state cluster can move into a module and match the old counters under Linux. Proven negative: the module still sees no class coexistence without payload ownership and real dispatch isolation. |
| 132 | Dispatch payload record | Proven: the module can own decoded payload metadata with zero invariant errors. Proven negative: the queue remains one-deep per class, so class audit comes before active issue. |
| 133 | Dispatch payload class audit | Proven: module-owned payload class agrees with the old classifiers across 22.66M audits and 0 mismatches. |
The rule for this phase: every rung must run the BusyBox shell profile and compare against the previous rung before claiming a speedup.
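
For readers who have not met the ready/valid contract that P130 through P132 extract, a minimal behavioral sketch of a one-deep state-owning queue follows. It is written in Python rather than RTL, and the class name, method signature, and the choice to allow a push in the same cycle as a pop are assumptions for illustration only.

```python
class OneDeepQueue:
    """One-entry queue with valid/ready/fire handshakes on both sides.

    A transfer "fires" on a side only when that side's valid and ready
    are both true in the same cycle.
    """

    def __init__(self):
        self.full = False
        self.payload = None
        self.max_occupancy = 0

    def step(self, in_valid, in_payload, out_ready):
        """Advance one cycle; return (in_fire, out_fire, popped_payload)."""
        out_valid = self.full
        out_fire = out_valid and out_ready
        # Assumption: the entry can be refilled in the cycle it drains.
        in_ready = (not self.full) or out_fire
        in_fire = in_valid and in_ready

        popped = self.payload if out_fire else None
        if out_fire:
            self.full = False
            self.payload = None
        if in_fire:
            self.full = True
            self.payload = in_payload

        self.max_occupancy = max(self.max_occupancy, int(self.full))
        return in_fire, out_fire, popped
```

With a single entry the occupancy can never exceed 1 no matter how the producer and consumer behave, which is why the table keeps reporting "max occupancy 1" until a rung adds real depth and dispatch isolation.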
§ Linux bring-up: gap closed enough to use
The old question was “what does Linux on RV32 need?” The current answer is better: we have already built enough of it to boot a real kernel and run userspace. The checklist below is now historical context plus a map to the rungs that closed each part.
| requirement | where it landed | status |
|---|---|---|
| A extension for spinlocks and lr.w/sc.w | P45 A extension | PASS |
| Supervisor-mode CSR/trap/delegation machinery | P47 through P51 | PASS |
| Sv32 page-table walking and translation | P52 page-table walker and P53 walker completion | PASS |
| Platform shape, memory size, SBI, device tree | P54 platform shape, P55 S-mode kernel, P56 RTL for Linux | PASS |
| Instruction-fetch translation during Linux boot | P59 Linux boot | PASS |
| Real userspace process | P60 userspace hello | PASS |
| TLB and shell profiling infrastructure | P61 TLB, P84 shell profile, P85 symbol profile | PASS |
| BusyBox initramfs and interactive shell path | P80 BusyBox initramfs, P81 PTY console | PASS |
| Frontend stall attribution and first I-cache/predictor/data-buffer/backend-sketch experiments | P88 memory attribution, P89 I-cache, P90 line fill, P91 fill buffer, P92 fetch queue, P93 predictor, P94 arbiter, P95 store buffer, P96 D-cache, P97 D-cache line fill, P98 D-cache throttle, P99 Harvard map, P100 split I/D service, P101 split TLB, P102 write buffer, P103 store-buffer repair, P104 banked lower memory, P105 banked lower service, P106 banked lower contract, P107 banked aux D-cache fill, P108 banked aux I-cache fill, P109 banked aux demand prefetch, P110 tagged aux response, P111 nonblocking load aux, P112 aux response queue, P113 load-miss policy, P114 PTW aux owner, P115 frontend target queue, P116 active steering guardrail, P117 speculative target buffer, P118 LSU shape, P119 LSU request scoreboard, P120 PRF rename sketch, P121 ROB commit model, P122 multi-entry ROB sketch, P123 dispatch/issue split, P124 shadow issue slot, P125 source-ready scoreboard, P126 memory/control holding records, P127 scheduler wakeup/issue, P128 scheduler queue/lifetime, P129 scheduler arrival/service, P130 ready/valid contract, P131 dispatch queue module, P132 dispatch payload record, P133 dispatch payload class audit, P134 aux load prefetch policy, P135 cache background policy audit, P136 I-cache background preempt, P137 bounded memory arbitration, P138 debt memory arbitration, P139 I-cache repair usefulness, P140 repair-aware I-cache arbitration, P141 adaptive second-word I-cache repair, P142 selective prefetch second-word repair, P143 prefetch consumer repair classifier | PASS, with P90, P92, P94, P95, P97, P99, P103, P109, P111, P112, P115, P116, P117, P118, P119, P120, P122, P123, P124, P125, P126, P127, P128, P129, P130, P131, P132, P133, P135, P136, and P139 speedup FAIL or shadow/audit-only; P102 is partial; P116/P117 are guarded frontend correctness rungs, P118/P119 are data-side measurement rungs, P120-P133 begin backend rename/ROB/dispatch/issue/scoreboard/holding-record/scheduler/contract/module/payload/class-audit scaffolding, P134 pivots back to memory with a small shell-window speedup, P135 identifies I-cache background fill as the next policy target, P136 proves unbounded I-cache-background preemption is too aggressive, P137 shows a crude bound can recover most of that regression, P138 improves the bound with explicit repair debt, P139 proves most background repair words are not promptly fetched, P140 cuts repair bandwidth but proves a fixed one-word budget is too stingy, P141 restores most line locality with demand-side second-word repair, P142 turns frontend-consuming prefetch repair into the first post-P134 shell-window win, and P143 identifies execute-prefetch second-word repair as the low-payback bucket to throttle next |
| Execute-prefetch repair throttle | P144 execute-prefetch repair throttle | PASS. The class-wide throttle cuts 10.01M repair fills and recovers 564K cycles versus P143, but it remains 266K cycles slower than P142, which set up P145’s conditional repair test. |
| Conditional execute-prefetch repair | P145 conditional execute-prefetch repair | PASS mechanically, speedup FAIL. The local sequential-adjacent predicate adds 165,881 prefetch second-word grants but makes the shell 908K cycles slower than P144, which set up P146’s predicate audit. |
| Execute-prefetch predicate audit | P146 execute-prefetch predicate audit | PASS mechanically, speedup FAIL as an audit rung. It keeps the active policy at P144 shape and shows the P145 predicate has zero measured later fetch payoff in this shell run. |
| Strict execute-prefetch guard | P147 strict execute-prefetch guard | PASS mechanically, speedup FAIL. The strict composite guard beats P146 but still loses to P144/P142, so execute-prefetch second-word repair is no longer the next target. |
So this page should no longer talk as if Linux is hypothetical. Linux is running; the current problem is making the machine less painfully blocking while it runs Linux.
§ Side rung: framebuffer demo (P44)
P44 took a small detour to give the
FreeRTOS milestone a face. The chip’s render task computes 96×96 RGB565
plasma frames into a memory-mapped framebuffer; the testbench dumps each
frame on MMIO_FRAME_READY; a pygame window plays them back. No SPI
peripheral, no display hardware - just memory + simulator + Python.
That same software pattern later came back in the
AtomVM framebuffer work.
§ Historical Phase 1 - ISA breadth (P45-P46)
This closed the non-privileged gap to RV32IMA + bitmanip. Each rung was a small RTL change that added a real ISA extension we still use.
| # | Project | What it adds | Linux relevance |
|---|---|---|---|
| 45 | A extension (atomics) | lr.w/sc.w + amo*.w. Single-hart reservation register. | Linux required. Also lets FreeRTOS use proper atomic critical sections instead of MIE-disable. |
| 46 | Zba + Zbb-essentials | 13 single-cycle bitmanip ops (sh*add, andn/orn/xnor, min/max, sext/zext). RTL good; gcc-zbb auto-emit has a known wart. | Modest Linux build perf; cheap rung. |
C extension (compressed) was originally P46 but it’s a multi-hour fetch-front-end rewrite; it got bumped to a future “supervised” rung and eventually became part of the AtomVM port work.
After Phase 1 we had **RV32IMA + Zicsr + Zifencei + Zicntr + Zba + Zbb-essentials** - close to a “small embedded Linux” target ISA (modulo C, which is a build-size issue more than a correctness issue).
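
The "single-hart reservation register" from the P45 row can be pictured with a small behavioral model. This is a Python sketch of the architectural lr.w/sc.w contract, not the repo's RTL; the class name and the dict-backed memory are assumptions for illustration.

```python
class ReservationModel:
    """Behavioral model of lr.w/sc.w with one reservation register."""

    def __init__(self, mem):
        self.mem = mem              # hypothetical memory: addr -> 32-bit word
        self.reserved_addr = None   # the single-hart reservation

    def lr_w(self, addr: int) -> int:
        """Load the word and register a reservation on its address."""
        assert addr % 4 == 0, "lr.w requires a naturally aligned address"
        self.reserved_addr = addr
        return self.mem.get(addr, 0)

    def sc_w(self, addr: int, value: int) -> int:
        """Store only if the reservation matches; return 0 on success."""
        assert addr % 4 == 0, "sc.w requires a naturally aligned address"
        success = self.reserved_addr == addr
        if success:
            self.mem[addr] = value & 0xFFFFFFFF
        # Any sc.w, successful or not, invalidates the reservation.
        self.reserved_addr = None
        return 0 if success else 1
```

A spinlock retry loop is then just "lr.w, compute, sc.w, branch back if the result is nonzero". A real core has more ways to lose a reservation than this model shows.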
§ Historical Phase 2 - Privileged depth (P47-P51)
Linux runs in S-mode; user code runs in U-mode; M-mode hosts the SBI firmware. Phase 2 added the missing privilege machinery.
The address-map work is split out from the behavior work so each rung is small and composable.
| # | Project | What it adds |
|---|---|---|
| 47 | S-mode CSR scaffolding | sstatus, sie, stvec, sscratch, sepc, scause, stval, sip, satp decoded as M-readable storage. No priv transition yet. |
| 48 | Trap delegation CSR scaffolding | medeleg / mideleg decoded as M-readable storage. No actual delegation behavior yet. |
| 49 | M↔S priv tracking | mstatus.MPP real, mret returns to the right priv level; cause-bit-driven trap routing using the now-storage medeleg/mideleg. |
| 50 | S-mode trap entry | When a delegated trap fires, write stvec/sepc/scause/stval and switch to S; sret returns to U. |
| 51 | CSR priv check + sie/sip subset | Real privilege checks and S-mode interrupt-pending/enable views. |
After Phase 2 the chip could run a hypervisor-free, page-table-free kernel that just uses S/M splits. Useful in its own right; required for what follows.
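
The routing decision that P49 and P50 wire up follows the standard RISC-V delegation rule, and can be sketched in a few lines of Python. The function and constant names are made up for this example, and the sketch covers only the choice of target privilege level, not the CSR writes that follow.

```python
PRIV_U, PRIV_S, PRIV_M = 0, 1, 3  # RISC-V privilege level encodings


def trap_target(cur_priv: int, cause: int, is_interrupt: bool,
                medeleg: int, mideleg: int) -> int:
    """Return the privilege level that should handle a trap.

    A trap is delegated to S-mode only when it is raised below M-mode
    and the matching medeleg/mideleg bit is set. Traps raised in M-mode
    always stay in M-mode.
    """
    deleg = mideleg if is_interrupt else medeleg
    if cur_priv != PRIV_M and (deleg >> cause) & 1:
        return PRIV_S
    return PRIV_M
```

For example, with bit 8 of medeleg set, an ecall from U-mode (cause 8) goes to the S-mode handler through stvec, while an exception raised in M-mode stays in M-mode whatever the delegation bits say.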
§ Historical Phase 3 - The MMU (P52-P54)
This was the biggest single rung in the Linux climb. Sv32 page-table walks turn every load and store into a potential TLB lookup + multi-cycle walk + permission check. It added significant RTL and significant area.
| # | Project | What it adds |
|---|---|---|
| 52 | Sv32 page-table walker | Page-table walker using the satp storage from P47, small TLB, and permission/access checks. The big rung. |
| 53 | Walker completion | Megapages, sfence.vma, fault encoding, and AMO translation. |
| 54 | Platform shape | 16 MiB memory model plus the first SBI/platform proof of concept. |
The MMU is where the chip stopped being a microcontroller-shaped core and started being a small application platform.
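
To make the "multi-cycle walk" concrete, here is a minimal Python sketch of an Sv32 two-level translation, including the megapage case from P53. It follows the Sv32 layout from the RISC-V privileged spec, but it is an illustration only: the function name and the `read_word` callback are assumptions, and it leaves out per-access permission checks, U/S handling, A/D updates, and the TLB.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 0x1, 0x2, 0x4, 0x8


def sv32_translate(read_word, satp_ppn: int, va: int):
    """Walk Sv32 page tables; return a physical address or None on fault.

    `read_word(pa)` is a hypothetical callback that returns the 32-bit
    word stored at physical address `pa`.
    """
    vpn = [(va >> 12) & 0x3FF, (va >> 22) & 0x3FF]  # vpn[0], vpn[1]
    offset = va & 0xFFF
    table = satp_ppn << 12
    for level in (1, 0):
        pte = read_word(table + vpn[level] * 4)     # PTEs are 4 bytes
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            return None                             # invalid encoding
        ppn = pte >> 10                             # 22-bit PPN
        if pte & (PTE_R | PTE_X):                   # leaf PTE
            if level == 1:
                if ppn & 0x3FF:
                    return None                     # misaligned megapage
                # 4 MiB megapage: vpn[0] passes through untranslated.
                return (ppn << 12) | (vpn[0] << 12) | offset
            return (ppn << 12) | offset
        table = ppn << 12                           # descend one level
    return None                                     # no leaf at level 0
```

Every load, store, and fetch that misses the TLB pays for up to two `read_word` calls before the real access can start, which is why this phase cost so much RTL and area.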
§ Historical Phase 4 - Platform glue (P55-P58)
Linux assumes specific platform shapes. These rungs made the platform look enough like a real RV32 machine to run a kernel.
| # | Project | What it adds |
|---|---|---|
| 55 | Hello, S-mode kernel | Real C S-mode kernel using SBI console calls. |
| 56 | RTL completion for Linux boot | A/D bit updates, CLINT-shaped MMIO, and the missing Linux-facing RTL pieces. |
| 57 | SBI runtime + kernel trap handler | Minimal SBI v0.1 runtime and S-mode kernel with its own stvec handler. |
| 58 | Linux kernel boot attempt | Stage-0 handoff worked; Linux parked during MMU enable. This failure set up P59. |
After Phase 4 the platform looked enough like a generic RV32-Linux target that an off-the-shelf kernel build could start booting.
§ Historical Phase 5 - Linux bring-up and first speed work (P59-P90)
This is where Linux became real in the repo, then became slow enough that profiling and frontend work became the obvious next target.
| # | Project | What it does |
|---|---|---|
| 59 | Linux 6.12.85 boots | Sv32 instruction-fetch translation made the kernel boot and print through SBI. |
| 60 | First userspace process | RV32 ELF runs as PID 1 and prints through SYS_write. |
| 61 | 4-entry TLB + profiling harness | First structured benchmark harness around the Linux-capable core. |
| 63-66 | Fetch/writeback speed rungs | Single-cycle fetch on TLB hit, prefetch-under-writeback, fused decode/execute, and D+X collapse. |
| 80-81 | BusyBox initramfs and PTY console | Real shell path with host-side interaction. |
| 84-88 | Shell profiling through memory attribution | Flamegraph-style data, symbolized BusyBox samples, TLB comparison, UART comparison, and memory request attribution. |
| 89-90 | Tiny I-cache and line-fill experiment | First frontend cache experiments. P89 reduced fetch stalls; P90 proved blocking line fill is the wrong policy. |
§ Speed rungs already taken
The speed page named the divider OR-tree as an early
critical path, and the later speed rungs attacked CPI instead of just
clock frequency. See P63,
P64,
P65, and
P66 for the first round. P91-P147 are
the second round: less blocking frontend and memory behavior. P91 was a
small win; P92 was a useful negative result, P93 measured enough return
predictability to justify active prediction once the frontend can steer,
P94 made the shared memory clients visible enough to choose a data-side
experiment, P95 proved the first conservative store buffer policy is not
good enough, P96 showed that even a one-word D-cache can reduce load
stalls enough to win a little, and P97 showed that line fill needs
throttling on the one-port memory system. P98 recovered the shell timing
with a stricter throttle, P99 mapped the Harvard split, P100 made
instruction/data service demand visible below that map, P101 split
translation storage into ITLB/DTLB banks, P102 found the first
core-local store-buffer correctness hole, P103 repaired it, P104
measured that most simultaneous I/D lower-memory conflicts are
different-bank opportunities, P105 showed that 8.29M shell-window cycles
are conservative read-like extra-grant candidates, P106 turned that
model into a serviced auxiliary read lane, P107 proved the core can
consume that response for D-cache background fill while Linux keeps
running, P108 added instruction-side prefetch fill, and P109 let the
writeback-prefetch response bypass directly into execute while a
store-buffer drain uses the main port. P110 replaced those one-off
consumers with a tagged auxiliary response slot. P111 let a data-side
load miss use that slot and proved the mechanism, but the shell window
regressed. P112 added the one-entry response queue and proved that
queueing works, but also made the cost of bad issue policy obvious. P113
tightened that policy, P114 measured that the page-table walker is not
the next useful aux consumer, and P115 added the frontend target record
needed before active predictor steering. P116 then proved active
steering needs a speculative target buffer instead of reusing the
architectural fetch queue directly. P117 added that record, but proved
the issue/fill path still needs stricter isolation before speculative
target data can influence userspace execution. P118 then made the
current data side legible as an LSU-shaped path before any request-record
or scoreboard refactor. P119 added that shadow request record and proved
the accounting can run under Linux. P120-P133 then started the backend
arc with shadow PRF rename pressure, one-entry ROB lifetime accounting,
and the explicit negative result that a four-entry ROB still stays at
occupancy 1 without dispatch/issue decoupling. P123 then measured the
dispatch/issue opportunity window directly: 40.40M cycles where queued
frontend work exists while the backend is busy. P124 classified that
window with a one-entry shadow issue slot and found 9.80M simple integer
ops that can be held; P125 added source-ready accounting and found
13.10M queued integer candidates with all modeled sources ready; P126
split memory and control work into holding records and found the next
missing contract is wakeup/issue eligibility across held classes. P127
then added that ready-mask model and found zero same-cycle multi-class
ready opportunities. P128 added queue depth and still found max
occupancy 1. P129 moved arrival accounting earlier and still found
max occupancy 1. P130 extracted explicit valid/ready/fire wires and
confirmed the same result. P131 moved that queue state into a module and
matched the old counters. P132 added one-deep payload ownership with
zero invariant errors. P133 audited that payload class against the older
classifiers with 22.66M checks and 0 mismatches. P134 then made the
pivot: it re-enabled guarded aux-load issue, completed 139,881 load
misses through the auxiliary lane, and cut the shell window by 360,275
cycles versus P133. P135 kept the policy and added mutually exclusive
background-block buckets, showing I-cache-only blocks dominate
D-cache-only blocks after frontend prefetch is safe. P136 tested that
target and proved the unbounded form is too aggressive: 1.77M aux loads
issue cleanly, but shell-window time gets worse. P137 adds a
one-preempt/one-defer bound, keeps 1.67M aux-load issues, and recovers
the shell window to a tiny win over P135. P138 replaces that sketch with
I-cache repair debt and improves again, but still cannot beat P134.
P139 then audits the repaired words directly: only 3.64% become first
later fetch hits, so the next policy should be value-aware rather than
just fair. P140 gives each foreground fill one adjacent repair word. It
improves versus P139 by cutting 20.80M repair fills, but the fixed
budget is too small to recover P138/P134. P141 gives demand-fetch lines
a second repair word, keeps repair fills near P140, and recovers to
within 67.6K cycles of P134. P142 extends the second-word budget to
frontend-consuming prefetch fills. It spends more repair traffic, but
wins the shell window by 362.6K cycles versus P141 and 295.0K cycles
versus P134. P143 classifies that broad prefetch repair bucket and finds
execute-prefetch second-word repair is the low-payback source. P144
throttles that class, cuts 10.01M repair fills, and recovers 564K cycles
versus P143, but it remains 266K cycles slower than P142. P145 tries a
local sequential-adjacent condition for restoring that second word. It
passes the shell workload but worsens by 908K cycles versus P144. P146
then performs the audit instead of guessing again: seq_adjacent gets
zero measured later fetch hits, and the best single broader predicate is
still weak at 2.91% first+repeat hits per fill. P147 tests the strict
composite guard and still loses to P144/P142, so the next speed rung
should pivot away from execute-prefetch repair.
§ Demo rungs along the way (no specific position)
These don’t gate anything but they’re the rungs that make the chip feel real.
- SPI peripheral + SPI-LCD demo. P14’s spi_bootshifter promoted into a real SPI master peripheral, then a FreeRTOS task that draws to a $10 ST7789 240×240 IPS display - chip drives the screen directly, no PC, real animation. The most visually impressive thing this chip can do.
- SPI-flash boot for P43+. P14’s autonomous flash boot brought forward to the FreeRTOS chip, so the chip wakes up, pulls a FreeRTOS image from flash, runs it without a host.
- Real upstream riscv-arch-test packaging. Drop our p17_act4_batch.py runner from P17; use the framework’s make directly. Honesty upgrade for results we already have.
- mruby on the chip. Later runtime side quest: cross-compile mruby against the P70-ish bare-metal libc/FPU platform, start with puts "hello" and a tiny arithmetic script, then decide whether fibers or garbage-collector behavior deserve their own rung. Not on the Linux critical path; very much on the “can this chip host another real language VM?” dream list after AtomVM.
- Browser CPU simulator / annotated trace lab. Pie-in-the-sky but plausible: start with precomputed Verilator traces rendered in the site, then a small P09/P17-class RTL simulator compiled to WASM with step(n), UART, registers, memory, and per-cycle annotations. Full Linux/BusyBox in-browser is the hard version because recent workloads boot a multi-megabyte image and run hundreds of millions of cycles; the better first public demo is Godbolt-style source/assembly/trace annotation with short live runs and replayed long traces.
§ What we’re explicitly not doing on this path
- RV64. The Linux kernel runs fine on RV32 (some distros haven’t built userspace for it lately, but that’s a port issue). Going to RV64 is a separate, much larger rebuild - and only worth doing if we want the full RV64GC software ecosystem, which we don’t yet need.
- Full F / D compliance. Linux can build without floating-point hardware. P70 added a pragmatic D-FPU subset because newlib’s hard-float printf/dtoa paths made it useful for runtime bring-up; full architectural F/D compliance is still parked unless a later demo actually needs it.
- Vector (V). Useful in some HPC niches; irrelevant to a generic Linux target.
- Full out-of-order execution. The core stays in-order through the Harvard split. Later ROB/rename ideas may return, but that is not the same as committing to a XiangShan-class backend.
- Tape-out. The chip past P50 will be too big for any free TT slot. ChipFoundry’s chipIgnite (~$15K USD for 100 packaged parts) would fab it, but that’s a separate decision.
§ Why this order
The completed Linux phases front-loaded correctness: ISA, privilege, MMU, platform, userspace. That was the right order because Linux is not useful until it boots and runs real programs.
The P91-P147 order front-loads decoupling: I-cache fill policy, fetch queue, branch prediction, memory-path attribution, D-cache policy, then the Harvard instruction/data split and the first repaired data-side buffer, followed by lower-memory bank conflict accounting, a banked service model, a serviced auxiliary read contract, narrow data-side and instruction-side auxiliary response consumers, and the first demand-visible auxiliary prefetch bypass, followed by a shared tagged response slot, and the first live aux-load owner. That is deliberate too. P88-P116 showed instruction delivery and memory stalls dominating the shell workload, and P111/P112 showed that a nonblocking mechanism still needs a policy layer. P115 adds the frontend target record needed before prediction can steer fetch; P116 shows that steering needs a non-architectural promote/discard record. P120-P133 deliberately keep the backend shadow-only while measuring rename, ROB lifetime, dispatch/issue opportunity, issue-slot block reasons, source-ready pressure, class-specific memory/control holding records, scheduler ready masks, queue lifetime, arrival/service accounting, the first explicit ready/valid contract helper, the first state-owning dispatch queue module, one-deep payload ownership, and payload-class audit; P133 shows real dispatch isolation would be the next backend step, but P134 chooses the measured frontend/memory path instead. P135 then narrows the next memory policy experiment to I-cache-background preemption, and P136 shows the unbounded version regresses, so the next useful step is bounded arbitration instead of wider speculation. P137 confirms the direction with a crude one-preempt/one-defer bound; the next useful step is a real debt/age arbiter. P138 implements that debt model and improves the shell window, but still trails P134, so the next useful step is auditing the actual value of protected/interrupted I-cache repair. 
P139 does that and finds most repaired words are not promptly consumed, so P140 should spend repair bandwidth only where fetch is likely to use it. P140 confirms that direction but shows a fixed one-word budget is too stingy. P141 gives demand-fetch lines a second repair word and nearly closes the P134 gap. P142 gives frontend-consuming prefetch fills a second word too and beats P134, but the prefetch grant bucket is broad. P143 classifies it and shows execute-prefetch repair dominates traffic with weak payback. P144 throttles that class and proves the bandwidth diagnosis was right, but the performance result says the throttle is too blunt. P145 tries one stronger local-use signal and fails on shell speed. P146 shadow-counts candidate predicates and shows the simple local signal has zero measured payoff, while the best single broader signal is still weak. P147 runs that strict composite-guard check and cannot beat P142/P144, so this repair bucket should stop consuming the architecture arc. A big out-of-order backend would be the expensive answer to the wrong first question if the frontend and LSU are still blocking on every miss.