P65 still serialised decode and execute across two cycles —
S_DECODE latched op_a/op_b from the regfile, then S_EXECUTE
read those registered values to drive the ALU. P66 collapses
that into a single cycle: op_a/op_b are now combinational
reads of the regfile keyed off the registered ir, so the ALU
fires in the cycle that decode used to occupy. The
S_DECODE state itself goes away.
Headline: CPI 3.63 → 2.62 (-27.8%) on Linux boot. Userspace
hello at 55.4M cycles, down from 76.4M. Stage-0 banner to clean panic in ~13 seconds of Verilator wall time.
What changed
Two edits, both inside p37_rv32i_arch_core:
- op_a and op_b were registered (logic [31:0]): written in S_DECODE, read everywhere else. Now they’re wires:
      wire [31:0] op_a = reg_read(rs1);
      wire [31:0] op_b = reg_read(rs2);
  rs1/rs2 are slices of ir, which is registered, so the regfile read is stable across the cycle.
- S_DECODE state is gone. The legality check it did (legal_decode) moves to the top of S_EXECUTE: same semantics, no extra cycle.
S_FETCH still latches ir at end of cycle. The next cycle is
S_EXECUTE, which now has op_a/op_b ready combinationally.
For multi-cycle ops (MUL, DIV, AMO, walker), op_a/op_b
remain stable because ir doesn’t change and the regfile isn’t
written during the multi-cycle wait.
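The effect of dropping S_DECODE on cycle count can be illustrated with a toy model. This is a sketch, not the RTL: the `run` function, the ADD-only `(rd, rs1, rs2)` program format, and the folding of writeback into the execute step are all assumptions made for illustration. Only the state names come from the text.

```python
def run(program, regs, has_decode_state):
    """Count cycles for a list of (rd, rs1, rs2) ADD instructions.

    Hypothetical toy model: one cycle per FSM state visited, with
    writeback folded into the execute step for brevity.
    """
    regs = list(regs)
    cycles = 0
    for rd, rs1, rs2 in program:
        cycles += 1  # S_FETCH: ir is latched at the end of this cycle
        if has_decode_state:
            cycles += 1  # S_DECODE (P65 style): op_a/op_b latched from the regfile
        # S_EXECUTE: in the P66 style, op_a/op_b are read combinationally here
        op_a, op_b = regs[rs1], regs[rs2]
        cycles += 1
        regs[rd] = (op_a + op_b) & 0xFFFFFFFF
    return cycles, regs


program = [(3, 1, 2), (4, 3, 3), (5, 4, 1)]  # hypothetical instruction stream
initial = [0, 5, 7, 0, 0, 0]

p65_cycles, p65_regs = run(program, initial, has_decode_state=True)
p66_cycles, p66_regs = run(program, initial, has_decode_state=False)
print(p65_cycles, p66_cycles)   # one cycle saved per instruction
print(p65_regs == p66_regs)     # architectural state is identical
```

Both variants compute the same register values; the only difference in this model is the extra cycle per instruction spent latching operands.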
Boot milestones (P62 → P66)
Chart: cycles to each boot milestone for P62 (Zbb rotates + U-mode), P63 (fetch fast-path), P64 (prefetch-under-WB), P65 (fused D+X) and P66 (D+X = one cycle). Δ of the last column versus P65: “Linux version” −30.0%, clocksource switched −28.4%, “Run /init” −27.5%, userspace hello −27.5%, init exit (panic) −27.5%.
| milestone | P65 cycles | P66 cycles | Δ |
|---|---|---|---|
| Linux version | 895,310 | 626,590 | -30.0% |
| Switched to clocksource | 10,579,811 | 7,580,431 | -28.4% |
| Run /init as init | 75,734,906 | 54,889,203 | -27.5% |
| userspace hello | 76,410,353 | 55,379,980 | -27.5% |
| Attempted to kill init (clean panic) | 76,629,292 | 55,537,930 | -27.5% |
A roughly uniform reduction of 27.5% to 30.0% across every milestone: a per-instruction speedup, exactly as expected for “every instruction skips a cycle.”
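As a sanity check, the deltas can be recomputed from the cycle counts in the table, and compared with what removing one cycle per instruction would predict from the P65 CPI of 3.63. The sketch below uses only numbers quoted in this post; the variable and function names are illustrative.

```python
# (P65 cycles, P66 cycles) per boot milestone, copied from the table above
milestones = {
    "Linux version": (895_310, 626_590),
    "Switched to clocksource": (10_579_811, 7_580_431),
    "Run /init as init": (75_734_906, 54_889_203),
    "userspace hello": (76_410_353, 55_379_980),
    "Attempted to kill init (clean panic)": (76_629_292, 55_537_930),
}


def percent_delta(before, after):
    """Relative change in percent; negative means fewer cycles."""
    return (after - before) / before * 100


deltas = {name: percent_delta(p65, p66) for name, (p65, p66) in milestones.items()}
for name, delta in deltas.items():
    print(f"{name:38s} {delta:+.1f}%")

# If every instruction loses exactly one cycle, cycles shrink by 1/CPI_old.
P65_CPI = 3.63
predicted = -1 / P65_CPI * 100
print(f"{'predicted from CPI':38s} {predicted:+.1f}%")
```

The prediction from the CPI lands on the same 27.5% seen in the later milestones, which is consistent with the one-cycle-per-instruction explanation.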
CPI stack
CPI: cycles per retired instruction · lower is better. Each bar's length is its CPI as a fraction of the worst run, so a bar half as long means half the cycles on the same workload.
| chip rev | CPI | cumulative speedup vs P62 |
|---|---|---|
| P62 | 6.24 | baseline |
| P63 | 5.27 | 1.18× |
| P64 | 4.29 | 1.45× |
| P65 | 3.63 | 1.72× |
| P66 | 2.62 | 2.38× |
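The cumulative column is just the P62 CPI divided by each revision's CPI. A short sketch, using the CPI values from the table and an illustrative `speedups` helper, also gives the step-over-step gain that the table does not list:

```python
cpi = {"P62": 6.24, "P63": 5.27, "P64": 4.29, "P65": 3.63, "P66": 2.62}


def speedups(cpi_by_rev):
    """Return (rev, cumulative speedup vs first rev, speedup vs previous rev)."""
    revs = list(cpi_by_rev)
    base = cpi_by_rev[revs[0]]
    rows = []
    prev = None
    for rev in revs:
        current = cpi_by_rev[rev]
        cumulative = base / current
        step = prev / current if prev is not None else 1.0
        rows.append((rev, cumulative, step))
        prev = current
    return rows


rows = speedups(cpi)
for rev, cumulative, step in rows:
    print(f"{rev}: {cumulative:.2f}x vs P62, {step:.2f}x vs previous")
```

On these numbers the P65 to P66 step is the largest single-revision gain in the series.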
Where the cycles went
The S_DECODE column in the comparison is empty for P66 — the
state literally doesn’t exist anymore in this revision’s
benchmark.json. The work it used to do happens combinationally
inside S_EXECUTE.
- fetch: 1.9% (1,083,458 cycles)
- execute: 38.2% (21,777,678 cycles)
- mem: 12.4% (7,056,578 cycles)
- walker: 4.2% (2,373,440 cycles)
- writeback: 38.2% (21,770,462 cycles)
- mul/div: 5.2% (2,938,384 cycles)
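The per-state counts can be turned into a per-instruction CPI stack. The sketch below assumes that every retired instruction spends exactly one cycle in writeback, so the writeback count stands in for the retired-instruction count; that is an assumption, not something the benchmark output states.

```python
# Per-state cycle counts from the breakdown above
state_cycles = {
    "fetch": 1_083_458,
    "execute": 21_777_678,
    "mem": 7_056_578,
    "walker": 2_373_440,
    "writeback": 21_770_462,
    "mul/div": 2_938_384,
}

total_cycles = sum(state_cycles.values())

# ASSUMPTION: one writeback cycle per retired instruction.
retired_estimate = state_cycles["writeback"]

cpi_stack = {state: cycles / retired_estimate for state, cycles in state_cycles.items()}
for state, contribution in cpi_stack.items():
    print(f"{state:10s} {contribution:.2f} cycles/instr")
print(f"{'total':10s} {total_cycles / retired_estimate:.2f} cycles/instr")
```

Under that assumption the total comes out at the headline CPI of 2.62, with execute and writeback contributing about one cycle each.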
Where time goes inside the kernel
- 13.7% of samples (7,600 samples)
- 8.1% of samples (4,483 samples)
- 6.8% of samples (3,793 samples)
- 5% of samples (2,804 samples)
- 3.5% of samples (1,938 samples)
- 3.3% of samples (1,850 samples)
- 2.5% of samples (1,397 samples)
- 2.4% of samples (1,312 samples)
- 2.2% of samples (1,230 samples)
- 1.9% of samples (1,031 samples)
- 1.8% of samples (1,020 samples)
- 1.4% of samples (778 samples)
- 1.2% of samples (644 samples)
- 1% of samples (577 samples)
- 1% of samples (564 samples)
- 34.9% of samples (19,406 samples)
Hot-function shape unchanged from P64/P65: BLAKE2s CRNG,
post-panic machine_restart, memset/memcpy. Whole-program
speedup, no structural shift.
What this is not (and what’s queued)
P66 is still not a real pipeline — F and X don’t overlap; the core handles one instruction at a time, just with fewer cycles per instruction. The structural pipelining work (independent F stage, hazard logic, forwarding, load-use stall counter) is still queued.
The next rung in the optimization arc is what most people would call the “real” pipelined core. It’s a substantial chunk of RTL, and at this point the right move is probably to take a detour through synthesis first, so we know what the achievable clock period actually is on sky130 before pushing the combinational path deeper.
So the planned next two rungs are:
- Synthesis baseline (next): run P66 through the LibreLane flow on sky130, get fmax/area/cells. This will be the first time the chart shows a wall-clock improvement instead of a cycle count.
- Real F/X pipeline (after): F runs concurrently with X, with hazard logic, forwarding, and a load-use stall counter chartable on the same per-state breakdown.
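Why synthesis matters before deepening the combinational path: wall-clock time is cycles divided by fmax, so a cycle-count win only holds if the clock does not slow down by more than the cycles saved. The sketch below uses the userspace-hello cycle counts from this post; the fmax values are placeholders chosen for illustration, not synthesis results.

```python
USERSPACE_HELLO_CYCLES = {"P65": 76_410_353, "P66": 55_379_980}


def wall_time_seconds(cycles, fmax_hz):
    """Wall-clock time for a cycle count at a given clock frequency."""
    return cycles / fmax_hz


# HYPOTHETICAL clock frequencies, for illustration only
for fmax_hz in (25e6, 50e6, 100e6):
    p66_time = wall_time_seconds(USERSPACE_HELLO_CYCLES["P66"], fmax_hz)
    print(f"P66 at {fmax_hz / 1e6:.0f} MHz: {p66_time:.2f} s to userspace hello")

# P66 folds the regfile read into the ALU cycle, so its fmax may be lower than
# P65's. It still wins in wall-clock terms while fmax_P66 / fmax_P65 exceeds:
break_even_ratio = USERSPACE_HELLO_CYCLES["P66"] / USERSPACE_HELLO_CYCLES["P65"]
print(f"break-even fmax ratio: {break_even_ratio:.3f}")
```

In other words, on this workload P66 stays ahead of P65 as long as it closes timing at roughly 72.5% or more of P65's clock frequency.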
Then P68 — AtomVM: already scaffolded in parallel,
project page here. Path is smooth:
update its src/top.sv to whichever core is current at the
time, then nix develop && make all && make verilator-run.