No. 66 / project of 147 on the ladder

D+X fused into a single execute cycle (CPI 3.63 → 2.62)

introduces — combinational op_a/op_b read; S_DECODE eliminated

harden state: last run 2026-05-04
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P65 still serialised decode and execute across two cycles: S_DECODE latched op_a/op_b from the regfile, then S_EXECUTE read those registered values to drive the ALU. P66 collapses that into a single cycle: op_a/op_b are now combinational reads of the regfile keyed off the registered ir, so the ALU fires in the cycle that S_DECODE used to occupy. The S_DECODE state itself goes away.

Headline: CPI 3.63 → 2.62 (-27.8%) on Linux boot. Userspace hello at 55.4M cycles, down from 76.4M. Stage-0 banner to clean panic in ~13 seconds of Verilator wall time.

What changed

Two edits, both inside p37_rv32i_arch_core:

  1. op_a and op_b were registered (logic [31:0]) — written in S_DECODE, read everywhere else. Now they’re wires:

    wire [31:0] op_a = reg_read(rs1);
    wire [31:0] op_b = reg_read(rs2);

    rs1/rs2 are slices of ir, which is registered, so the regfile read is stable across the cycle.

  2. S_DECODE state is gone. The legality check it did (legal_decode) moves to the top of S_EXECUTE — same semantics, no extra cycle.

S_FETCH still latches ir at end of cycle. The next cycle is S_EXECUTE, which now has op_a/op_b ready combinationally. For multi-cycle ops (MUL, DIV, AMO, walker), op_a/op_b remain stable because ir doesn’t change and the regfile isn’t written during the multi-cycle wait.
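The two edits can be sketched as a toy core. This is a minimal illustration, not the RTL of p37_rv32i_arch_core: the module name dx_fused_sketch, the ports imem_rdata and illegal, the state name S_WRITEBACK, and the ADD-only decoder are all assumptions made for the example. Only op_a, op_b, reg_read, rs1, rs2, ir, legal_decode, S_FETCH and S_EXECUTE come from the text above.

```systemverilog
// Hypothetical sketch, not the project's RTL. Only R-type ADD is decoded.
module dx_fused_sketch (
    input  logic        clk,
    input  logic        rst,
    input  logic [31:0] imem_rdata,  // assumed: instruction at pc, valid in S_FETCH
    output logic [31:0] pc,
    output logic        illegal      // one-cycle pulse when legal_decode fails
);
    // S_WRITEBACK is an assumed name. Note that there is no S_DECODE.
    typedef enum logic [1:0] {S_FETCH, S_EXECUTE, S_WRITEBACK} state_t;
    state_t state;

    logic [31:0] ir;            // latched at the end of S_FETCH
    logic [31:0] regs [0:31];   // register file storage
    logic [31:0] alu_q;         // ALU result held for writeback

    // Register specifiers are slices of the registered ir, so they stay
    // stable for as long as ir does.
    wire [4:0] rs1 = ir[19:15];
    wire [4:0] rs2 = ir[24:20];
    wire [4:0] rd  = ir[11:7];

    // Combinational regfile read; x0 always reads as zero.
    function automatic logic [31:0] reg_read(input logic [4:0] idx);
        return (idx == 5'd0) ? 32'd0 : regs[idx];
    endfunction

    // always_comb is used here instead of a wire initialiser so the block
    // is also sensitive to regs, which the function reads internally.
    logic [31:0] op_a, op_b;
    always_comb begin
        op_a = reg_read(rs1);
        op_b = reg_read(rs2);
    end

    // Toy legality check: ADD is opcode 0110011, funct3 000, funct7 0000000.
    wire legal_decode = (ir[6:0]   == 7'b0110011) &&
                        (ir[14:12] == 3'b000)     &&
                        (ir[31:25] == 7'b0000000);

    always_ff @(posedge clk) begin
        illegal <= 1'b0;
        if (rst) begin
            state <= S_FETCH;
            pc    <= 32'd0;
        end else begin
            case (state)
                S_FETCH: begin
                    ir    <= imem_rdata;
                    state <= S_EXECUTE;      // straight to execute
                end
                S_EXECUTE: begin
                    if (!legal_decode) begin
                        // The legality check sits at the top of S_EXECUTE.
                        // A real core would trap; the sketch just skips.
                        illegal <= 1'b1;
                        pc      <= pc + 32'd4;
                        state   <= S_FETCH;
                    end else begin
                        alu_q <= op_a + op_b; // operands ready combinationally
                        state <= S_WRITEBACK;
                    end
                end
                S_WRITEBACK: begin
                    if (rd != 5'd0) regs[rd] <= alu_q;
                    pc    <= pc + 32'd4;
                    state <= S_FETCH;
                end
                default: state <= S_FETCH;
            endcase
        end
    end
endmodule
```

In this sketch the regfile is written only in S_WRITEBACK, so op_a and op_b cannot change while S_EXECUTE is active. That is the same stability argument the paragraph above makes for multi-cycle ops.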

Boot milestones (P62 → P66)

Milestone progression across 5 runs (P62 to P66): cycles to reach each boot milestone. Δ compares P66 with P65.

| milestone | P62 Zbb rotates + U-mode | P63 fetch fast-path | P64 prefetch-under-WB | P65 fused D+X | P66 D+X = one cycle | Δ (P66 vs P65) |
|---|---|---|---|---|---|---|
| “Linux version” | 1,611,173 | 1,346,691 | 1,078,147 | 895,310 | 626,590 | −30.0% |
| Switched to clocksource | 18,426,917 | 15,530,502 | 12,603,459 | 10,579,811 | 7,580,431 | −28.4% |
| “Run /init as init” | 130,054,364 | 109,755,383 | 89,432,784 | 75,734,906 | 54,889,203 | −27.5% |
| userspace hello | 131,241,874 | 110,746,765 | 90,246,176 | 76,410,353 | 55,379,980 | −27.5% |
| Attempted to kill init (clean panic) | 131,610,501 | 111,063,721 | 90,505,484 | 76,629,292 | 55,537,930 | −27.5% |

A roughly uniform reduction across every milestone, from 30.0% at the first banner to 27.5% late in boot: a per-instruction speedup, as expected when every instruction skips a cycle.
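As a rough cross-check, assume the instruction count is identical in P65 and P66 (the workload is the same boot, but the page does not state this explicitly). Removing exactly one cycle per instruction from the P65 CPI then predicts:

```latex
\[
\frac{\Delta\,\mathrm{cycles}}{\mathrm{cycles}_{\mathrm{P65}}}
  = \frac{1}{\mathrm{CPI}_{\mathrm{P65}}}
  = \frac{1}{3.63} \approx 27.5\%
\]
\[
\text{measured from the CPI figures: }
\frac{3.63 - 2.62}{3.63} = \frac{1.01}{3.63} \approx 27.8\%
\]
```

The prediction lines up with the late-boot milestones (−27.5%), and the measured CPI drop of 1.01 is close to the one cycle per instruction that the change removes.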

CPI stack

CPI comparison across 5 runs (best 2.62, worst 6.24). CPI is cycles per retired instruction; lower is better.

| chip rev | change | CPI | Δ vs previous | cumulative speedup vs P62 |
|---|---|---|---|---|
| P62 | Zbb rotates + U-mode | 6.24 | baseline | baseline |
| P63 | fetch fast-path | 5.27 | −0.97 | 1.18× |
| P64 | prefetch-under-WB | 4.29 | −0.98 | 1.45× |
| P65 | fused D+X | 3.63 | −0.67 | 1.72× |
| P66 | D+X = one cycle | 2.62 | −1.01 | 2.38× |
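The cumulative speedup figures are consistent with a plain ratio of CPIs. This reading is inferred from the numbers rather than stated by the page, and it counts cycles only, so it says nothing about clock period:

```latex
\[
\mathrm{speedup}_{n} = \frac{\mathrm{CPI}_{\mathrm{P62}}}{\mathrm{CPI}_{n}},
\qquad
\mathrm{speedup}_{\mathrm{P65}} = \frac{6.24}{3.63} \approx 1.72,
\qquad
\mathrm{speedup}_{\mathrm{P66}} = \frac{6.24}{2.62} \approx 2.38
\]
```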

Where the cycles went

State distribution across 5 runs, scaled to % of post-load cycles. Total cycles per run:
  1. P62 Zbb rotates + U-mode 132,610,502 cycles
  2. P63 fetch fast-path 112,063,722 cycles
  3. P64 prefetch-under-WB 92,000,000 cycles
  4. P65 fused D+X 78,000,000 cycles
  5. P66 D+X = one cycle 57,000,000 cycles

The S_DECODE column in the comparison is empty for P66 — the state literally doesn’t exist anymore in this revision’s benchmark.json. The work it used to do happens combinationally inside S_EXECUTE.

State breakdown for P66 (57,000,000 cycles, CPI 2.62):

| state | share of cycles | cycles |
|---|---|---|
| fetch | 1.9% | 1,083,458 |
| execute | 38.2% | 21,777,678 |
| mem | 12.4% | 7,056,578 |
| walker | 4.2% | 2,373,440 |
| writeback | 38.2% | 21,770,462 |
| mul/div | 5.2% | 2,938,384 |
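One way to read the breakdown is to convert each state's share of cycles into cycles per retired instruction. This is a back-of-envelope sketch that assumes the 2.62 CPI and the 57,000,000-cycle total cover the same window; the symbols f_s and CPI_s are introduced here for illustration only:

```latex
\[
\mathrm{CPI}_s = f_s \cdot \mathrm{CPI}, \qquad \mathrm{CPI} = 2.62
\]
\[
\begin{aligned}
\text{execute}   &: 0.382 \times 2.62 \approx 1.00 \\
\text{writeback} &: 0.382 \times 2.62 \approx 1.00 \\
\text{mem}       &: 0.124 \times 2.62 \approx 0.32 \\
\text{mul/div}   &: 0.052 \times 2.62 \approx 0.14 \\
\text{walker}    &: 0.042 \times 2.62 \approx 0.11 \\
\text{fetch}     &: 0.019 \times 2.62 \approx 0.05 \\
\text{sum}       &\approx 2.62
\end{aligned}
\]
```

Under that assumption, execute and writeback each cost about one cycle per instruction, and the remaining 0.62 cycles are spread across memory, mul/div, the walker and fetch.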

Where time goes inside the kernel

Hot functions (label: test; 55,664 samples, one every 1,024 cycles):

| # | function | where | share | samples |
|---|---|---|---|---|
| 1 | blake2s_compress_generic | kernel | 13.7% | 7,600 |
| 2 | memset | kernel | 8.1% | 4,483 |
| 3 | memcpy | kernel | 6.8% | 3,793 |
| 4 | format_decode | kernel | 5% | 2,804 |
| 5 | vsnprintf | kernel | 3.5% | 1,938 |
| 6 | vruntime_eligible | kernel | 3.3% | 1,850 |
| 7 | machine_restart | kernel | 2.5% | 1,397 |
| 8 | number | kernel | 2.4% | 1,312 |
| 9 | memcmp | kernel | 2.2% | 1,230 |
| 10 | avg_vruntime | kernel | 1.9% | 1,031 |
| 11 | __slab_alloc_node.isra.0 | kernel | 1.8% | 1,020 |
| 12 | strlen | kernel | 1.4% | 778 |
| 13 | string_nocheck | kernel | 1.2% | 644 |
| 14 | chacha_permute | kernel | 1% | 577 |
| 15 | add_uevent_var | kernel | 1% | 564 |
| 16 | (remaining) | remaining | 34.9% | 19,406 |

Hot-function shape unchanged from P64/P65: BLAKE2s CRNG, post-panic machine_restart, memset/memcpy. Whole-program speedup, no structural shift.
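The sample counts can be turned back into approximate cycle counts. This assumes each sample stands for one full sampling period of 1,024 cycles, which is the usual reading of a periodic profiler but is not spelled out on the page:

```latex
\[
55{,}664 \times 1{,}024 = 56{,}999{,}936 \approx 57.0\,\mathrm{M\ cycles}
\]
\[
\text{blake2s\_compress\_generic: } 7{,}600 \times 1{,}024 = 7{,}782{,}400 \approx 7.8\,\mathrm{M\ cycles}
\]
```

The first figure agrees with the 57,000,000-cycle total in the P66 state breakdown, so the profile and the state counters appear to cover the same window.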

What this is not (and what’s queued)

P66 is still not a real pipeline — F and X don’t overlap; the core handles one instruction at a time, just with fewer cycles per instruction. The structural pipelining work (independent F stage, hazard logic, forwarding, load-use stall counter) is still queued.

The next rung in the optimization arc is what most people would call the “real” pipelined core. It’s a substantial chunk of RTL — and at this point the right move is probably to take a detour through synthesis first, so we know what the per-cycle clock period actually is on sky130 before pushing the combinational path deeper.

So the planned next two rungs are:

  • Synthesis baseline (next): run P66 through the LibreLane flow on sky130 to get fmax, area, and cell count. This will be the first time the chart shows an improvement in real time rather than in cycle count.
  • Real F/X pipeline (after): F runs concurrently with X, with hazard, forwarding, and load-use stall counters chartable on the same per-state breakdown.

Then P68 — AtomVM: already scaffolded in parallel, project page here. Path is smooth: update its src/top.sv to whichever core is current at the time, then nix develop && make all && make verilator-run.