Fused decode/execute + skip S_MEM on register-only ops (CPI 4.29 → 3.63)

P64’s profile pinned down the next inefficiency clearly: every register-only instruction was paying a 1-cycle pass through S_MEM whether or not it touched memory. P65 takes that cycle back and also fuses S_DECODE into S_EXECUTE on the common path, so plain ALU ops, branches, jumps, and LUI/AUIPC all retire in 2 cycles instead of 4.

Headline: CPI 4.29 → 3.63 (-15.4%) on Linux boot. Same kernel image, same boot blob. P63 was 5.27.

What changed

The S_EXECUTE state’s “default” branch used to be:

end else begin
  state <= S_MEM;   // every non-trapping op went through S_MEM
end

S_MEM then had if (!mem_op) state <= S_WB; so non-mem ops sat in S_MEM for one cycle doing nothing. S_MEM cost ~21M cycles per Linux boot, and only ~7M of those belonged to real memory ops, so roughly 14M cycles were wasted. P65 splits the default into two:

end else if (mem_op) begin
  state <= S_MEM;        // loads / stores still need S_MEM
end else begin
  state <= S_WB;         // ALU ops, LUI/AUIPC, branches, JAL,
                         // JALR, mret/sret/ecall handled
                         // earlier — all skip S_MEM now
end

That’s the entire RTL diff. Multi-cycle ops (MUL, DIV, AMO, walker, A/D updates, sfence-vma+flush) keep their own FSM exactly as in P64.
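To make the effect of the new branch concrete, here is a minimal Python sketch of the path an instruction takes after S_EXECUTE. It is a toy model, not the RTL: it assumes one cycle per state, ignores fetch, decode, traps and the multi-cycle FSMs, and the function names and instruction mix are illustrative rather than taken from the core.

```python
# Toy model of the S_EXECUTE exit path before and after P65.
# Assumption: each state costs exactly one cycle; fetch/decode, traps and
# the multi-cycle ops (MUL, DIV, AMO, walker) are deliberately left out.

def states_after_execute_p64(mem_op: bool) -> list[str]:
    """P64: every non-trapping op passes through S_MEM, then S_WB."""
    return ["S_MEM", "S_WB"]


def states_after_execute_p65(mem_op: bool) -> list[str]:
    """P65: only loads/stores visit S_MEM; the rest go straight to S_WB."""
    return ["S_MEM", "S_WB"] if mem_op else ["S_WB"]


def cycles_after_execute(path_fn, mem_ops: list[bool]) -> int:
    """Total cycles spent after S_EXECUTE for a sequence of instructions."""
    return sum(len(path_fn(m)) for m in mem_ops)


# Hypothetical mix: one load/store for every two register-only instructions.
mix = [True, False, False] * 4
old = cycles_after_execute(states_after_execute_p64, mix)
new = cycles_after_execute(states_after_execute_p65, mix)
print(old, new, old - new)
```

In this model the saving is exactly one cycle per register-only instruction, which is the shape of the S_MEM reduction reported in the state breakdown.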

Boot milestones (P62 → P63 → P64 → P65)

Milestone progression across 4 runs, in cycles (scale 0 → 131.6M)

milestone	P62 Zbb rotates + U-mode	P63 fetch fast-path	P64 prefetch-under-WB	P65 fused D+X, skip S_MEM	Δ P64 → P65
“Linux version”	1,611,173	1,346,691	1,078,147	895,310	−17.0%
clocksource switched	18,426,917	15,530,502	12,603,459	10,579,811	−16.1%
“Run /init”	130,054,364	109,755,383	89,432,784	75,734,906	−15.3%
userspace hello	131,241,874	110,746,765	90,246,176	76,410,353	−15.3%
init exit (panic)	131,610,501	111,063,721	90,505,484	76,629,292	−15.3%

milestone	P64 cycles	P65 cycles	Δ
`Linux version`	1,078,147	895,310	-17.0%
`Switched to clocksource`	12,603,459	10,579,811	-16.1%
`Run /init as init`	89,432,784	75,734,906	-15.3%
userspace `hello`	90,246,176	76,410,353	-15.3%
`Attempted to kill init` (clean panic)	90,505,484	76,629,292	-15.3%

Every milestone moves earlier by roughly the same percentage — this is a uniform per-instruction speedup, not a phase-specific one. (Compare to P64’s headline numbers, which were also a uniform shift over P63.)
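The uniformity claim can be checked directly from the P64 and P65 cycle counts in the table above. The short Python sketch below recomputes each percentage; the dictionary layout and helper name are just for illustration.

```python
# P64 and P65 cycle counts at each boot milestone, copied from the table above.
milestones = {
    "Linux version": (1_078_147, 895_310),
    "Switched to clocksource": (12_603_459, 10_579_811),
    "Run /init as init": (89_432_784, 75_734_906),
    "userspace hello": (90_246_176, 76_410_353),
    "Attempted to kill init": (90_505_484, 76_629_292),
}


def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


deltas = {
    name: round(pct_change(p64, p65), 1)
    for name, (p64, p65) in milestones.items()
}
for name, delta in deltas.items():
    print(f"{name:26s} {delta:+.1f}%")

# Gap between the largest and smallest improvement, in percentage points.
spread = max(deltas.values()) - min(deltas.values())
print(f"spread: {spread:.1f} points")
```

All five deltas land within about 1.7 percentage points of each other, with the earliest milestone improving slightly more than the later ones.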

CPI comparison

CPI comparison across 4 runs (best 3.63 CPI, worst 6.24 CPI)

chip rev	CPI	Δ vs previous rev
P62 Zbb rotates + U-mode	6.24	baseline
P63 fetch fast-path	5.27	−0.97
P64 prefetch-under-WB	4.29	−0.98
P65 fused D+X, skip S_MEM	3.63	−0.67

CPI is cycles per retired instruction; lower is better. At a fixed instruction count, halving CPI halves the cycles spent on the same workload.

P65 retires the same ~21.5M instructions in 78M cycles vs P64’s 92M — CPI drops 4.29 → 3.63 (~15.4%). Stacked across the optimization arc:

chip rev	CPI	cumulative speedup vs P62
P62	6.24	baseline
P63	5.27	1.18×
P64	4.29	1.45×
P65	3.63	1.72×
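The speedup column is just the P62 CPI divided by each revision's CPI, since the instruction count is essentially the same in every run. A small Python sketch that reproduces it from the rounded CPI figures (the variable names are illustrative):

```python
# CPI per chip revision, as reported in the table above.
cpi = {"P62": 6.24, "P63": 5.27, "P64": 4.29, "P65": 3.63}

# Cumulative speedup vs P62: with the same instruction count, the ratio of
# CPIs equals the ratio of cycle counts.
baseline = cpi["P62"]
speedup = {rev: round(baseline / value, 2) for rev, value in cpi.items()}

# Relative CPI reduction of the latest step (P64 -> P65), in percent.
p65_step_pct = round((cpi["P64"] - cpi["P65"]) / cpi["P64"] * 100.0, 1)

for rev, s in speedup.items():
    print(f"{rev}: {s:.2f}x")
print(f"P64 -> P65: -{p65_step_pct}%")
```

Because the inputs are already rounded to two decimals, the last digit of a derived value can differ slightly from one computed on raw cycle counts.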

Where the cycles went: state-cycle breakdown

State distribution across 4 runs: cycles per state, with share of post-load cycles in parentheses

state	P62 Zbb rotates + U-mode	P63 fetch fast-path	P64 prefetch-under-WB	P65 fused D+X, skip S_MEM
fetch	42,437,741 (32%)	21,809,761 (19.5%)	1,084,166 (1.2%)	1,083,868 (1.4%)
decode	21,249,591 (16%)	21,271,018 (19%)	21,437,063 (23.3%)	21,516,418 (27.6%)
execute	21,249,591 (16%)	21,271,018 (19%)	21,437,063 (23.3%)	21,516,417 (27.6%)
mem	21,106,222 (15.9%)	21,127,815 (18.9%)	21,293,966 (23.1%)	7,058,586 (9%)
walker	2,377,514 (1.8%)	2,376,234 (2.1%)	2,375,026 (2.6%)	2,374,410 (3%)
writeback	21,242,373 (16%)	21,263,802 (19%)	21,429,846 (23.3%)	21,509,201 (27.6%)
mul/div	2,947,470 (2.2%)	2,944,074 (2.6%)	2,942,870 (3.2%)	2,941,100 (3.8%)
total	132,610,502	112,063,722	92,000,000	78,000,000

The S_MEM numbers tell the story. P64 spent 21.3M cycles there, basically one cycle per retired instruction. P65 spends 7.1M: only loads, stores, AMO completions, and the walker’s mem-issue cycle land there now. The remaining ~14M cycles were reclaimed, which accounts for the drop in total cycles from 92M to 78M.
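As a sanity check on that accounting, the sketch below subtracts the P65 S_MEM count from the P64 one and compares the result with the change in total cycles. The numbers are copied from the breakdown above; the variable names are illustrative.

```python
# S_MEM cycles and total post-load cycles, from the state breakdown above.
s_mem = {"P64": 21_293_966, "P65": 7_058_586}
total = {"P64": 92_000_000, "P65": 78_000_000}

# Cycles no longer spent in S_MEM after P65.
reclaimed = s_mem["P64"] - s_mem["P65"]

# Overall reduction in cycles between the two runs.
total_drop = total["P64"] - total["P65"]

# Whatever is left over is the net change in all the other states combined.
other_states_growth = reclaimed - total_drop

print(f"S_MEM reclaimed:     {reclaimed:,}")
print(f"total drop:          {total_drop:,}")
print(f"other states, net:  +{other_states_growth:,}")
```

The S_MEM saving comes out slightly larger than the total drop because the other states together grew by a small amount between the two runs.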

P65 state breakdown (78,000,000 cycles, CPI 3.63)

state	share	cycles
fetch	1.4%	1,083,868
decode	27.6%	21,516,418
execute	27.6%	21,516,417
mem	9%	7,058,586
walker	3%	2,374,410
writeback	27.6%	21,509,201
mul/div	3.8%	2,941,100

Where time goes inside the kernel

Hot functions: 76,171 samples, one every 1,024 cycles

function	scope	share of samples	samples
blake2s_compress_generic	kernel	14.7%	11,230
memset	kernel	8.1%	6,153
memcpy	kernel	7.1%	5,387
format_decode	kernel	5.3%	3,998
vsnprintf	kernel	3.7%	2,844
vruntime_eligible	kernel	2.8%	2,126
number	kernel	2.5%	1,881
memcmp	kernel	2.3%	1,751
__slab_alloc_node.isra.0	kernel	1.8%	1,405
machine_restart	kernel	1.7%	1,295
avg_vruntime	kernel	1.6%	1,229
strlen	kernel	1.6%	1,192
string_nocheck	kernel	1.2%	888
chacha_permute	kernel	1.1%	835
add_uevent_var	kernel	1.1%	797
(remaining)		34.3%	26,147

Hot-function shape is essentially unchanged from P64: BLAKE2s CRNG init, post-panic machine_restart, memset/memcpy. As expected — we made every instruction faster but didn’t change which instructions the kernel runs.
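Since the profiler takes one sample every 1,024 cycles, a sample count converts to an approximate cycle cost by multiplication. The Python sketch below does that for the top three functions. It is an estimate under the assumption that samples are evenly spaced, and the names are illustrative.

```python
# Sampling setup and sample counts, from the hot-function profile above.
SAMPLE_PERIOD_CYCLES = 1_024
TOTAL_SAMPLES = 76_171

samples = {
    "blake2s_compress_generic": 11_230,
    "memset": 6_153,
    "memcpy": 5_387,
}

# Estimated cycles per function: one sample stands for one sampling period.
est_cycles = {fn: n * SAMPLE_PERIOD_CYCLES for fn, n in samples.items()}

# Share of all samples, in percent.
share_pct = {fn: round(n / TOTAL_SAMPLES * 100.0, 1) for fn, n in samples.items()}

# Cycles covered by the whole profile.
profiled_cycles = TOTAL_SAMPLES * SAMPLE_PERIOD_CYCLES

for fn in samples:
    print(f"{fn:26s} ~{est_cycles[fn]:>11,} cycles ({share_pct[fn]}%)")
print(f"profile covers ~{profiled_cycles:,} cycles")
```

The profile's total lands within one sampling period of the 78,000,000-cycle run, which suggests the samples span the whole post-load window.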

What this is not

P65 is not yet a real pipelined core. Stages still serialize: fetch waits for the previous instruction’s writeback, only overlapping during S_WB via P64’s prefetch path. Each instruction still passes through 2 cycles minimum — one for fetch (or 0 with prefetch hit) plus one fused decode-execute.

The real-pipeline rung lands next as P66: F and X overlap properly, with hazard logic, forwarding, and a load-use stall counter chartable on this same per-state breakdown.

What’s queued

P66 — real 3-stage pipeline. F runs concurrently with X. Forwarding from end-of-X to next-cycle’s reg-read + a load-use stall path.
P67 — synthesis baseline. Run all four optimization rungs through the LibreLane flow on sky130; chart fmax/area/cells against each other. The first time we get nanosecond numbers out of these cycle wins.
P68 — AtomVM port (Erlang BEAM on the chip). Already scaffolded in parallel; waiting on packbeam.