D+X fused into a single execute cycle (CPI 3.63 → 2.62)

P65 still serialised decode and execute across two cycles: S_DECODE latched op_a/op_b from the regfile, then S_EXECUTE read those registered values to drive the ALU. P66 collapses that into a single cycle. op_a/op_b are now combinational reads of the regfile, keyed off the registered ir, so the ALU fires in the cycle that decode used to occupy. The S_DECODE state itself goes away.

Headline: CPI 3.63 → 2.62 (-27.8%) on Linux boot. Userspace hello at 55.4M cycles, down from 76.4M. Stage-0 banner to clean panic in ~13 seconds of Verilator wall time.

What changed

Two edits, both inside p37_rv32i_arch_core:

op_a and op_b were registered (logic [31:0]) — written in S_DECODE, read everywhere else. Now they’re wires:
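```
// Operands are combinational regfile reads, no longer latched in S_DECODE.
// rs1/rs2 are fields of the registered ir.
wire [31:0] op_a = reg_read(rs1);
wire [31:0] op_b = reg_read(rs2);
```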
rs1/rs2 are slices of ir, which is registered, so the regfile read is stable across the cycle.
S_DECODE state is gone. The legality check it did (legal_decode) moves to the top of S_EXECUTE — same semantics, no extra cycle.

S_FETCH still latches ir at end of cycle. The next cycle is S_EXECUTE, which now has op_a/op_b ready combinationally. For multi-cycle ops (MUL, DIV, AMO, walker), op_a/op_b remain stable because ir doesn’t change and the regfile isn’t written during the multi-cycle wait.
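
The cycle saving can be illustrated with a toy behavioural model. This is a minimal Python sketch, not the RTL: the `Add` instruction type, the `run` function, and the flat one-cycle cost per state are assumptions made for illustration, and only register-register adds are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Add:
    """Hypothetical register-register add: regs[rd] = regs[rs1] + regs[rs2]."""
    rd: int
    rs1: int
    rs2: int


def run(program, regs, fused):
    """Run a toy one-instruction-at-a-time core; return (final regs, cycles).

    fused=False models the older flow: FETCH, DECODE, EXECUTE, WRITEBACK.
    fused=True models the newer flow: FETCH, EXECUTE, WRITEBACK.
    """
    regs = list(regs)
    cycles = 0
    for ir in program:
        cycles += 1  # S_FETCH: ir is latched at the end of this cycle
        if not fused:
            cycles += 1  # S_DECODE: operands latched into registers
        # In the fused flow the operands are read combinationally from the
        # regfile during S_EXECUTE, so no extra cycle is spent here.
        op_a, op_b = regs[ir.rs1], regs[ir.rs2]
        cycles += 1  # S_EXECUTE: ALU result computed
        result = (op_a + op_b) & 0xFFFFFFFF
        cycles += 1  # S_WRITEBACK: result written to the regfile
        if ir.rd != 0:  # x0 stays hardwired to zero
            regs[ir.rd] = result
    return regs, cycles


program = [Add(1, 2, 3), Add(4, 1, 1)]
initial = [0, 0, 5, 7] + [0] * 28

regs_two_cycle, cycles_two_cycle = run(program, initial, fused=False)
regs_one_cycle, cycles_one_cycle = run(program, initial, fused=True)
print(cycles_two_cycle, cycles_one_cycle)
```

Both flows produce the same register state; the fused flow just spends one cycle less per instruction, which is the whole effect of the change.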

Boot milestones (P62 → P66)

Milestone progression across 5 runs, in cycles (scale 0 → 131.6M):

milestone	P62 Zbb rotates + U-mode	P63 fetch fast-path	P64 prefetch-under-WB	P65 fused D+X	P66 D+X = one cycle	Δ P66 vs P65
“Linux version”	1,611,173	1,346,691	1,078,147	895,310	626,590	−30.0%
clocksource switched	18,426,917	15,530,502	12,603,459	10,579,811	7,580,431	−28.4%
“Run /init”	130,054,364	109,755,383	89,432,784	75,734,906	54,889,203	−27.5%
userspace hello	131,241,874	110,746,765	90,246,176	76,410,353	55,379,980	−27.5%
init exit (panic)	131,610,501	111,063,721	90,505,484	76,629,292	55,537,930	−27.5%

milestone	P65 cycles	P66 cycles	Δ
`Linux version`	895,310	626,590	-30.0%
`Switched to clocksource`	10,579,811	7,580,431	-28.4%
`Run /init as init`	75,734,906	54,889,203	-27.5%
userspace `hello`	76,410,353	55,379,980	-27.5%
`Attempted to kill init` (clean panic)	76,629,292	55,537,930	-27.5%

The reduction is close to uniform, from 30.0% at the first milestone to 27.5% at the last three. That is the signature of a per-instruction speedup, as expected when every instruction skips a cycle.
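
The percentages can be rechecked from the raw cycle counts. The sketch below recomputes each delta and compares it with the share of P65 cycles spent in decode, taken from the state distribution later in this post. The helper name `pct_change` is made up for illustration.

```python
# (P65 cycles, P66 cycles) per boot milestone, copied from the table above.
MILESTONES = {
    "Linux version": (895_310, 626_590),
    "Switched to clocksource": (10_579_811, 7_580_431),
    "Run /init as init": (75_734_906, 54_889_203),
    "userspace hello": (76_410_353, 55_379_980),
    "Attempted to kill init": (76_629_292, 55_537_930),
}


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


deltas = {name: pct_change(p65, p66) for name, (p65, p66) in MILESTONES.items()}
for name, delta in deltas.items():
    print(f"{name:28s} {delta:+.1f}%")

# Cross-check: share of P65 cycles spent in the decode state
# (21,516,418 of 78,000,000, from the state distribution).
decode_share = 21_516_418 / 78_000_000 * 100.0
print(f"P65 decode share: {decode_share:.1f}%")
```

The decode share of P65 comes out at about 27.6%, which lines up with the reduction seen at the late milestones: removing the decode cycle removes roughly that fraction of the run.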

CPI stack

CPI comparison across 5 runs (best 2.62 CPI, worst 6.24 CPI). CPI is cycles per retired instruction; lower is better.

chip rev	change	CPI	Δ vs previous
P62	Zbb rotates + U-mode	6.24	baseline
P63	fetch fast-path	5.27	−0.97
P64	prefetch-under-WB	4.29	−0.98
P65	fused D+X	3.63	−0.67
P66	D+X = one cycle	2.62	−1.01

chip rev	CPI	cumulative speedup vs P62
P62	6.24	baseline
P63	5.27	1.18×
P64	4.29	1.45×
P65	3.63	1.72×
P66	2.62	2.38×
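
The speedup column follows directly from the CPI values. A minimal sketch, assuming the same clock and roughly the same retired-instruction count across revisions (the instruction count does drift slightly between runs), so that speedup is just the ratio of CPIs:

```python
# CPI per chip revision, copied from the table above.
CPI = {"P62": 6.24, "P63": 5.27, "P64": 4.29, "P65": 3.63, "P66": 2.62}


def speedup(baseline_cpi, cpi):
    """Speedup over the baseline at equal clock and instruction count."""
    return baseline_cpi / cpi


baseline = CPI["P62"]
speedups = {rev: speedup(baseline, cpi) for rev, cpi in CPI.items()}
for rev, value in speedups.items():
    print(f"{rev}: {value:.2f}x")

# Relative CPI change for the last step, P65 to P66, in percent.
p65_to_p66 = (CPI["P66"] - CPI["P65"]) / CPI["P65"] * 100.0
print(f"P65 -> P66: {p65_to_p66:+.1f}%")
```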

Where the cycles went

State distribution across 5 runs, as cycles and % of post-load cycles:

chip rev	total cycles	fetch	decode	execute	mem	walker	writeback	mul/div
P62 Zbb rotates + U-mode	132,610,502	42,437,741 (32%)	21,249,591 (16%)	21,249,591 (16%)	21,106,222 (15.9%)	2,377,514 (1.8%)	21,242,373 (16%)	2,947,470 (2.2%)
P63 fetch fast-path	112,063,722	21,809,761 (19.5%)	21,271,018 (19%)	21,271,018 (19%)	21,127,815 (18.9%)	2,376,234 (2.1%)	21,263,802 (19%)	2,944,074 (2.6%)
P64 prefetch-under-WB	92,000,000	1,084,166 (1.2%)	21,437,063 (23.3%)	21,437,063 (23.3%)	21,293,966 (23.1%)	2,375,026 (2.6%)	21,429,846 (23.3%)	2,942,870 (3.2%)
P65 fused D+X	78,000,000	1,083,868 (1.4%)	21,516,418 (27.6%)	21,516,417 (27.6%)	7,058,586 (9%)	2,374,410 (3%)	21,509,201 (27.6%)	2,941,100 (3.8%)
P66 D+X = one cycle	57,000,000	1,083,458 (1.9%)	-	21,777,678 (38.2%)	7,056,578 (12.4%)	2,373,440 (4.2%)	21,770,462 (38.2%)	2,938,384 (5.2%)

The S_DECODE column in the comparison is empty for P66 — the state literally doesn’t exist anymore in this revision’s benchmark.json. The work it used to do happens combinationally inside S_EXECUTE.

P66 state breakdown (57,000,000 cycles, CPI 2.62):

state	share	cycles
fetch	1.9%	1,083,458
execute	38.2%	21,777,678
mem	12.4%	7,056,578
walker	4.2%	2,373,440
writeback	38.2%	21,770,462
mul/div	5.2%	2,938,384
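
The per-state counts also reproduce the headline CPI figures. The sketch below assumes each retired instruction spends exactly one cycle in writeback, so the writeback count stands in for the retired-instruction count. That is an inference from the numbers, not something the benchmark output is known to guarantee. The function names are made up for illustration.

```python
# Per-state cycle counts, copied from the state distribution above.
STATES = {
    "P62": {"fetch": 42_437_741, "decode": 21_249_591, "execute": 21_249_591,
            "mem": 21_106_222, "walker": 2_377_514, "writeback": 21_242_373,
            "mul/div": 2_947_470},
    "P63": {"fetch": 21_809_761, "decode": 21_271_018, "execute": 21_271_018,
            "mem": 21_127_815, "walker": 2_376_234, "writeback": 21_263_802,
            "mul/div": 2_944_074},
    "P64": {"fetch": 1_084_166, "decode": 21_437_063, "execute": 21_437_063,
            "mem": 21_293_966, "walker": 2_375_026, "writeback": 21_429_846,
            "mul/div": 2_942_870},
    "P65": {"fetch": 1_083_868, "decode": 21_516_418, "execute": 21_516_417,
            "mem": 7_058_586, "walker": 2_374_410, "writeback": 21_509_201,
            "mul/div": 2_941_100},
    # P66 has no decode entry: the state no longer exists.
    "P66": {"fetch": 1_083_458, "execute": 21_777_678, "mem": 7_056_578,
            "walker": 2_373_440, "writeback": 21_770_462,
            "mul/div": 2_938_384},
}


def total_cycles(counts):
    """Sum of cycles over all states of one run."""
    return sum(counts.values())


def cpi_estimate(counts):
    """Cycles per instruction, using writeback cycles as the instruction count (assumption)."""
    return total_cycles(counts) / counts["writeback"]


for rev, counts in STATES.items():
    print(f"{rev}: {total_cycles(counts):,} cycles, CPI ~ {cpi_estimate(counts):.2f}")
```

Under that assumption the estimates round to 6.24, 5.27, 4.29, 3.63 and 2.62, matching the CPI table.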

Where time goes inside the kernel

Hot functions (label: test; 55,664 samples, one every 1,024 cycles):

function	where	share of samples	samples
blake2s_compress_generic	kernel	13.7%	7,600
memset	kernel	8.1%	4,483
memcpy	kernel	6.8%	3,793
format_decode	kernel	5%	2,804
vsnprintf	kernel	3.5%	1,938
vruntime_eligible	kernel	3.3%	1,850
machine_restart	kernel	2.5%	1,397
number	kernel	2.4%	1,312
memcmp	kernel	2.2%	1,230
avg_vruntime	kernel	1.9%	1,031
__slab_alloc_node.isra.0	kernel	1.8%	1,020
strlen	kernel	1.4%	778
string_nocheck	kernel	1.2%	644
chacha_permute	kernel	1%	577
add_uevent_var	kernel	1%	564
(remaining)	remaining	34.9%	19,406

Hot-function shape unchanged from P64/P65: BLAKE2s CRNG, post-panic machine_restart, memset/memcpy. Whole-program speedup, no structural shift.
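
For reference, the sample shares relate to cycles as follows. This sketch assumes the profiler takes one sample every 1,024 cycles over the whole 57,000,000-cycle run, which matches the reported 55,664 samples. The helper names are made up, and only the top five functions are included.

```python
TOTAL_SAMPLES = 55_664
PERIOD_CYCLES = 1_024
RUN_CYCLES = 57_000_000

# Sample counts for the top five functions, copied from the profile above.
SAMPLES = {
    "blake2s_compress_generic": 7_600,
    "memset": 4_483,
    "memcpy": 3_793,
    "format_decode": 2_804,
    "vsnprintf": 1_938,
}


def share(samples):
    """Share of all samples, in percent."""
    return samples / TOTAL_SAMPLES * 100.0


def approx_cycles(samples):
    """Rough cycle cost implied by a sample count at the given period."""
    return samples * PERIOD_CYCLES


# One sample per PERIOD_CYCLES over the whole run gives the sample total.
expected_samples = RUN_CYCLES // PERIOD_CYCLES

for name, count in SAMPLES.items():
    print(f"{name:26s} {share(count):5.1f}%  ~{approx_cycles(count):,} cycles")
```

By this estimate blake2s_compress_generic alone accounts for roughly 7.8M of the 57M cycles.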

What this is not (and what’s queued)

P66 is still not a real pipeline — F and X don’t overlap; the core handles one instruction at a time, just with fewer cycles per instruction. The structural pipelining work (independent F stage, hazard logic, forwarding, load-use stall counter) is still queued.

The next rung in the optimization arc is what most people would call the “real” pipelined core. It’s a substantial chunk of RTL, and at this point the right move is probably to take a detour through synthesis first, so we know what clock period is actually achievable on sky130 before pushing the combinational path deeper.

So the planned next two rungs are:

Synthesis baseline (next): run P66 through the LibreLane flow on sky130, get fmax/area/cells. This will be the first time the chart shows a wall-clock time improvement instead of a cycle count.
Real F/X pipeline (after): F runs concurrently with X, with hazard logic, forwarding, and a load-use stall counter chartable on the same per-state breakdown.

Then P68 — AtomVM: already scaffolded in parallel, project page here. Path is smooth: update its src/top.sv to whichever core is current at the time, then nix develop && make all && make verilator-run.
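
Why the synthesis detour matters can be put in one formula: wall-clock time is cycle count divided by clock frequency, so a change that saves cycles but lengthens the critical path can still lose. The sketch below is illustrative only. The two fmax values are placeholders, not synthesis results, and the function name is made up.

```python
def wall_time_seconds(cycles, fmax_hz):
    """Wall-clock time for a run of `cycles` at clock frequency `fmax_hz`."""
    return cycles / fmax_hz


# Cycle counts to userspace hello, from the milestone table.
P65_CYCLES = 76_410_353
P66_CYCLES = 55_379_980

# PLACEHOLDER frequencies, purely hypothetical. Real values would come
# from the sky130 synthesis run; P66 is given a lower fmax here only to
# show how a longer combinational path would eat into the cycle savings.
P65_FMAX_HZ = 50e6
P66_FMAX_HZ = 40e6

t_p65 = wall_time_seconds(P65_CYCLES, P65_FMAX_HZ)
t_p66 = wall_time_seconds(P66_CYCLES, P66_FMAX_HZ)
print(f"P65: {t_p65:.3f} s, P66: {t_p66:.3f} s, ratio {t_p65 / t_p66:.2f}x")
```

With these made-up frequencies the 27.5% cycle reduction shrinks to about a 10% time reduction, which is why fmax has to be measured before the combinational path grows further.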