Looking at OpenHW core-et (erbium), and the gap from here

Sat down and actually read the OpenHW Group’s core-et repo on the erbium branch — README, Minion Description, Frontend-ICache Interface, and the first half of FE/Intpipe Description. (I’m flagging exactly what I read because my first pass at this writeup was speculation from filenames, which the human correctly pointed out was bullshit.)

What this actually is

It’s the Esperanto ET-SoC-1 RTL, donated to the OpenHW Group. The authorship and dates make it obvious — FE/Intpipe Description is dated 11 December 2019, the Frontend-ICache Interface doc says it describes “the A0 revision of the ET-SoC-1 device”, and the author names (Sebastia Tortella, Xavier Reves, Ildefonso Gomariz) are Esperanto people. ET-SoC-1 is the 1088-core RISC-V AI-inference chip Esperanto announced around 2020. This repo is the CPU-subsystem slice of that design, plus the verification environment.

So: this is not a from-scratch OpenHW core. It’s a real piece of shipped silicon IP, with all the sharp edges that implies (including documents explicitly saying “we know this part has a livelock bug, here is how A0 patched it, here is what we’d change for A1”). That changes the read: this is a primary source for how a working AI-inference RISC-V got designed, warts included.

The shape

Three nested units:

ET-Minion: one core. Dual-threaded, in-order, single-issue, RV64IMFC. Plus a custom 8-lane VPU (SIMD + tensor-FMA + transcendentals). No D, no V — F is single-precision and the spec calls it “not fully compliant”, same caveat on the Machine and Supervisor 1.11 ISA modules. Zifencei is “trap-and-emulate”.
ET-Neighborhood: eight Minions sharing an ICache (32 KB L1, split into two L0 microcaches of 8 fully-associative entries each, four cores per L0) and a shared page-table walker. Per-Neighborhood PMU.
CPU Subsystem: one Neighborhood + PLIC + CLINT + APB mux + an ET-Link → AXI4 bridge.

So you instantiate cores in groups of eight. The ICache and the PTW are a shared service across the cluster.

The microarchitecture details that surprised me

The Minion is a deep, fine-grain multithreaded in-order pipe:

Frontend: 7 stages (F0–F7). F0 issues an ICache request, F1–F5 are pure latency-hiding wait stages, F6 holds a double-buffer + RVC expander, F7 is decode + a thread-scheduler arbiter feeding both the intpipe and the VPU.
Intpipe: 5 stages — ID, EX, TAG, MEM, WB — with an optional GSC (gather/scatter) stage between EX and TAG that holds an instruction for 8 cycles to issue one memory op per VPU lane.
DCache: a 6-stage pipeline (S0–S5) running in lockstep with the intpipe, with its own miss handler, replay queue, atomic ALU, store merge, and a tensor-load fast path.
VPU: 9 stages (F0–F8), in lockstep with the intpipe; F8 can write the integer RF (FP-to-INT moves).

End-to-end that’s ~12 stages from PC issue to retire on the integer side. The way they make a 12-stage in-order pipe go fast is two hardware threads per core, round-robin scheduled at the F7/ID seam. Branches and loads stall the thread that issued them, but the other thread keeps issuing. With 16 threads per Neighborhood (8 cores × 2 threads), the design is essentially a barrel processor wearing scalar-RISC-V clothes.
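To make the latency-hiding argument concrete, here is a toy issue-slot model in Python. It is a sketch, not anything from the erbium docs: the stall pattern (every 4th instruction stalls its thread for 6 cycles) and all names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    stall_every: int    # hypothetical: every Nth instruction is a branch/load that stalls
    stall_cycles: int   # hypothetical stall length in cycles
    issued: int = 0
    ready_at: int = 0   # first cycle at which this thread may issue again

def simulate(threads, cycles):
    """Issue at most one instruction per cycle, round-robin over ready threads."""
    last = -1
    total = 0
    n = len(threads)
    for cycle in range(cycles):
        for offset in range(1, n + 1):
            idx = (last + offset) % n
            t = threads[idx]
            if t.ready_at <= cycle:
                t.issued += 1
                total += 1
                last = idx
                if t.issued % t.stall_every == 0:
                    # The stalling instruction blocks only its own thread.
                    t.ready_at = cycle + 1 + t.stall_cycles
                break
    return total

def make(n):
    return [Thread(stall_every=4, stall_cycles=6) for _ in range(n)]

CYCLES = 1000
one = simulate(make(1), CYCLES)
two = simulate(make(2), CYCLES)
print(f"1 thread: IPC {one / CYCLES:.2f}   2 threads: IPC {two / CYCLES:.2f}")
```

With these made-up numbers the second thread fills part of the stall window but not all of it, which is the same reason two threads alone do not fully hide a long miss.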

Knock-on consequences I didn’t expect:

There is no branch predictor. The frontend speculates PC+4 and the intpipe kills earlier stages on a taken branch (resolved at TAG). On a misprediction the FE flushes; on a long miss the thread sleeps and the other thread runs. They don’t need a BTB or RAS because multithreading hides the bubble. (My earlier guess that there was a shared bpam2minions predictor was wrong; that file is something else.)
Loads and stores retire out of order. The intpipe is single-issue and issues in order, but completion of memory ops is tracked in a scoreboard: a load can leave the pipe before its data arrives, and the scoreboard stalls any later instruction that consumes it. There are three scoreboards: integer, FP, and VPU-mask. This is the CVA6-style middle ground I’d been guessing at, and they actually built it.
TLB miss = pipeline flush + replay. When the DCache discovers a TLB miss in MEM, it requests a flush; the instruction is replayed from the FE buffer. This is much heavier than our PTW state machine, but it cleanly avoids stalling the thread.
“M-code” instructions. ID stage exception list includes “M-code instruction: instruction that is implemented in SW”. Esperanto carved out a subset of the ISA that traps to a software handler instead of being implemented in hardware. That’s a clean answer to the eternal question of where to draw the HW/SW boundary on rare ops.
The FE↔ICache interface is inherited from Rocket — fixed-latency response, miss notification + sleep + fill_done wake. The doc spends five pages documenting the livelock this caused on A0 (one thread could starve forever, and the fix was per-thread saturating consecutive-miss counters in the Neighborhood that mask other requesters when one thread stalls). The “Future Versions” section basically says: next time, switch to a variable-latency ET-Link interface, add a tiny per-frontend local cache, support multiple outstanding misses.
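The A0 livelock fix is easier to picture as code. The sketch below is my reading of "per-thread saturating consecutive-miss counters that mask other requesters", not the documented logic: the saturation limit, the class name, and the exact masking rule are all assumptions.

```python
class StarvationGuard:
    """Per-thread saturating consecutive-miss counters (hypothetical model)."""

    def __init__(self, n_threads, limit=3):
        self.limit = limit                 # assumed saturation value
        self.misses = [0] * n_threads

    def record(self, thread, hit):
        # A hit clears the streak; a miss increments it, saturating at `limit`.
        if hit:
            self.misses[thread] = 0
        else:
            self.misses[thread] = min(self.limit, self.misses[thread] + 1)

    def allowed(self, requesters):
        # If any requester has saturated, only saturated requesters may proceed,
        # so the starving thread gets the cache to itself until it makes progress.
        starving = [t for t in requesters if self.misses[t] >= self.limit]
        return starving if starving else list(requesters)

guard = StarvationGuard(n_threads=4, limit=3)
for _ in range(5):
    guard.record(2, hit=False)             # thread 2 keeps missing
masked = guard.allowed([0, 1, 2, 3])
guard.record(2, hit=True)                  # thread 2 finally gets its line
unmasked = guard.allowed([0, 1, 2, 3])
```

The point of saturating rather than free-running counters is that the state stays tiny and the mask drops as soon as the starving thread gets one hit.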

Things I’d missed about the CSR/system surface:

MATP — Machine Address Translation and Protection. M-mode gets its own translation/protection register, separate from satp. Not standard RISC-V.
FLB — Fast Local Barriers. User-mode atomic barrier counters exposed through a CSR.
FCC / fccnb / CREDINC0–3 — Fast Credit Counters for producer/consumer coordination, also user-mode.
ESRs — ET System Registers. A whole separate config space accessed via APB (esr_bypass_dcache, esr_shire_coop_mode, esr_minion_mem_override, etc.). This is where you toggle DCache scratchpad mode, or enable Cooperative TensorLoad across the Neighborhood.
vmspagesize is a top-level input — virtual page size is configurable, not baked into Sv32/Sv39.
chicken_bits — explicit “disable some automatic functions” signal at the boundary. Standard testchip pattern, named honestly.
UltraSoC trace encoder (te_thread_sel, traceEncoder, te_enable) and a full APB debug slave. Real industrial debug surface.
A0 didn’t even use virtual memory — the FE-ICache doc says the vm_status field on the request bus is “unused in A0”.

The cosim reference is extern/et-platform/sw-sysemu (the ETSOC-1 sysemu functional emulator), not Spike and not the made-up “BEMU” I had in the previous draft.

How this compares to where we are

We’re at P114. The current core (projects/114_ptw_aux_owner/src/top.sv) is single-issue, in-order, single-thread, RV32IMA + C + F/D subset + Zba/Zbb + Zicsr/Zifencei + Zicntr, with a 5-state FSM (S_FETCH / S_DECODE-fused / S_EXECUTE / S_MEM / S_WB), 16-line direct-mapped I-cache, 16-line direct-mapped write-through D-cache, 8-entry split ITLB/DTLB, Sv32 PTW, one-entry fetch queue, one-entry store buffer with forwarding, banked lower memory + tagged aux response queue, and a P93 BTB+counters+RAS that does not yet steer fetch (P115 is the FTQ that lets it).

Measured CPI on the BusyBox shell at P114 is 2.53.

The top-level differences are unsurprising:

64-bit registers and addressing, deep pipeline, two threads per core, eight cores per cluster, full 8-lane SIMD/tensor unit, AXI4 out, full debug + trace, Cooperative TensorLoad, UltraSoC harness.
A real dv/ tree with cosim, arch monitors, DPI, and a custom test runner (et-dvrun).
This is all “shipped silicon” scope. Chasing any of it as a goal would teach nothing.

What’s actually worth stealing as ideas

These are the parts where erbium does the same job we’re doing, and does it more interestingly.

1. Multithreading is the alternative to prediction. I had assumed serious cores were getting their fetch-side performance from BTBs and RAS. Erbium says: a barrel processor with two threads and a 7-stage frontend doesn’t need prediction at all, because there’s always another thread to issue. That’s a real fork in the road. We’re going the prediction route (P93→P115→P116) because we’re committed to single-thread. But “add a second hart instead of a predictor” is a legitimate alternative roadmap, especially for our PTW-miss-heavy shell workload. Worth at least naming.

2. Out-of-order completion via scoreboard. Single-issue in-order issue, but loads, stores, mul/div, and FP can leave the pipe and write back later. The scoreboard catches the dependency. This is the piece I keep saying we’ll need when a real FPU lands, and erbium has the exact 3-scoreboard layout (int / FP / mask) that fits the way our F/D subset would be wired in.
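A minimal sketch of what one such scoreboard does, in Python, assuming one pending bit per register and an integer-style hardwired x0. The class and method names are mine, not erbium's, and the real design has three of these (int / FP / mask).

```python
class Scoreboard:
    """One pending bit per register; a set bit means a result is still in flight."""

    def __init__(self, n_regs=32):
        self.pending = [False] * n_regs

    def can_issue(self, srcs, dst=None):
        # Stall on read-after-write (a source is pending) and on
        # write-after-write (the destination is pending). x0 never blocks.
        regs = list(srcs) + ([dst] if dst is not None else [])
        return not any(self.pending[r] for r in regs if r != 0)

    def issue_long_op(self, dst):
        # A load, mul/div, or FP op leaves the pipe before its result exists.
        if dst != 0:
            self.pending[dst] = True

    def writeback(self, dst):
        self.pending[dst] = False

sb = Scoreboard()
sb.issue_long_op(5)                           # lw x5, ... still waiting on data
dependent_ok = sb.can_issue(srcs=[5, 1], dst=6)    # add x6, x5, x1
independent_ok = sb.can_issue(srcs=[1, 2], dst=7)  # add x7, x1, x2
sb.writeback(5)                               # load data arrives
after_wb_ok = sb.can_issue(srcs=[5, 1], dst=6)
```

The independent add issues while the load is outstanding; only the consumer waits, which is the whole benefit over blocking the pipe on every long op.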

3. Pipeline flush + replay as the TLB-miss strategy. Right now we have S_PTW1 / S_PTW0 states inside our FSM that block everything during a walk. Erbium’s DCache requests a flush at MEM and the FE replays the instruction from its buffer once the walk finishes. We already have a fetch queue (P92), and the FTQ shadow is next (P115). Replay-on-TLB-miss is a small extension and would let us stop blocking the rest of the (admittedly single-threaded) machine during a walk.

4. “M-code” as an HW/SW boundary. Our F/D extension is a partial HW implementation that returns NaN on the rare ops. A cleaner story: trap them to a software handler explicitly tagged as “this is M-code”, and grow the HW subset over time. The exception is just another ID-stage check.
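Sketched in Python, the boundary looks like this. The op names and the split between the hardware set and the M-code set are hypothetical; the erbium docs I read do not list which instructions are M-code.

```python
import math

# Hypothetical split: ops the hardware implements vs. ops handled in software.
HW_OPS = {
    "fadd": lambda a, b: a + b,
    "fmul": lambda a, b: a * b,
}
MCODE_HANDLERS = {
    "fsqrt": lambda a, _b: math.sqrt(a),
}

class MCodeTrap(Exception):
    """Raised at decode when an instruction has no hardware implementation."""

def execute_hw(op, a, b):
    # The ID-stage check: anything outside the HW subset traps, nothing returns NaN.
    if op not in HW_OPS:
        raise MCodeTrap(op)
    return HW_OPS[op](a, b)

def run(op, a, b=0.0):
    try:
        return execute_hw(op, a, b), "hw"
    except MCodeTrap:
        # The trap handler emulates the op and returns as if hardware had done it.
        return MCODE_HANDLERS[op](a, b), "mcode"

fast = run("fadd", 1.0, 2.0)
slow = run("fsqrt", 9.0)
```

Growing the hardware subset is then just moving an entry from one table to the other; software never sees a behavioural difference, only a speed difference.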

5. ESRs as a separate config plane. Right now our knobs (cache sizes, MMIO map, banked aux behaviour) are scattered across top.sv parameters. Erbium puts every “is this DCache a scratchpad” / “is cooperative mode on” / “force DCache bypass” knob behind an APB register space called ESR. We don’t need APB, but a single named config-register block — even if it’s just an MMIO range — would let us expose runtime knobs for the BusyBox profiling harness instead of recompiling.
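As a rough shape for that block, here is a Python model of a small MMIO-mapped config range. The base address, offsets, and knob names are placeholders I made up; none of them come from erbium or from our top.sv.

```python
class ConfigRegs:
    """Tiny MMIO-mapped config block; every name and address here is hypothetical."""

    BASE = 0x1000_0000
    OFFSETS = {
        "dcache_bypass": 0x0,
        "dcache_scratchpad": 0x4,
        "profile_enable": 0x8,
    }

    def __init__(self):
        self.values = {offset: 0 for offset in self.OFFSETS.values()}

    def contains(self, addr):
        # True only for addresses that map to a defined register.
        return (addr - self.BASE) in self.values

    def store(self, addr, value):
        self.values[addr - self.BASE] = value & 0xFFFF_FFFF

    def load(self, addr):
        return self.values[addr - self.BASE]

    def knob(self, name):
        # What the rest of the core would read instead of a compile-time parameter.
        return self.values[self.OFFSETS[name]]

cfg = ConfigRegs()
addr = ConfigRegs.BASE + ConfigRegs.OFFSETS["dcache_bypass"]
if cfg.contains(addr):
    cfg.store(addr, 1)   # e.g. the profiling harness flips the knob at runtime
```

The useful property is the single named table: one place to document every knob, and one address decoder instead of parameters threaded through the hierarchy.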

6. Configurable virtual page size at the boundary. vmspagesize as a top-level input is a small thing, but we currently bake “4 KiB and 4 MiB” into the walker. There’s no reason to.

7. PMA boxes and a real PMU at the cluster level. The Minion has PMA blocks per cache; the Neighborhood has an event-driven PMU that each Minion feeds via pmu_count_up + pmu_neigh_event_sel. We have counters scattered through top.sv (BTB, storebuf, icache, dcache, PTW, aux). Pulling them into a PMA-and-PMU pair, with a documented MMIO interface, would clean up our profile pipeline without adding a feature.
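A sketch of the event-select idea in Python, assuming each counter picks one event and counts how many cores raise it per cycle. The event names and the counter count are invented, and I am only guessing at how pmu_count_up and pmu_neigh_event_sel actually divide the work.

```python
class ClusterPMU:
    """Cluster-level counters, each bound to one selectable event (hypothetical model)."""

    def __init__(self, n_counters=4):
        self.event_sel = [None] * n_counters   # which event each counter watches
        self.counts = [0] * n_counters

    def select(self, counter, event):
        self.event_sel[counter] = event
        self.counts[counter] = 0

    def tick(self, core_events):
        """core_events: one set of event names per core for this cycle."""
        for i, event in enumerate(self.event_sel):
            if event is not None:
                self.counts[i] += sum(1 for events in core_events if event in events)

pmu = ClusterPMU()
pmu.select(0, "icache_miss")
pmu.select(1, "ptw_walk")
pmu.tick([{"icache_miss"}, {"icache_miss", "ptw_walk"}])   # cycle 0, two cores
pmu.tick([set(), {"ptw_walk"}])                            # cycle 1
```

For a single-core design the per-core list collapses to one set, but the select-then-count split still replaces a pile of always-on counters with a few configurable ones.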

8. Honest “we know this is broken” engineering documents. The Frontend-ICache Interface doc has a five-page section labelled Issues that explains the livelock, then a “Future Versions” section that basically apologises for the design. That’s the tone we want in our project READMEs when something is partial or rtl-pass. The Honesty rules in CLAUDE.md say roughly the same thing — this is what it looks like when it’s done well.

What this doesn’t change

The roadmap stands. P115 (FTQ shadow), P116 (active fetch steering), and the rungs after that are the right next steps for a single-thread core trying to get its fetch and memory paths to be less blocking. Erbium isn’t actually doing that arc; they skipped past it to fine-grain multithreading. The lessons we can take are about the pieces: scoreboarded out-of-order completion, replay on TLB miss, an ESR-style config plane, and a real PMU.

If I had a sticky note while writing P116 onward, those’d be on it. Plus the broader correction: when a doc is sitting in a repo, read the doc, don’t infer from filenames. Apologies for the previous draft.