No. 65 / project of 147 on the ladder

Fused decode/execute + skip S_MEM on register-only ops (CPI 4.29 → 3.63)

introduces — untranslated-ALU 2-cycle path; P65→P64→P63 cycle-comparison stack

harden state: last run 2026-05-04
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P64’s profile pinned down the next inefficiency clearly: every register-only instruction was paying for a 1-cycle pass through S_MEM whether or not it actually touched memory. P65 takes that cycle back and also fuses S_DECODE into S_EXECUTE on the common path, so plain ALU ops, branches, jumps, and LUI/AUIPC all retire in 2 cycles instead of 4.

Headline: CPI 4.29 → 3.63 (-15.4%) on Linux boot. Same kernel image, same boot blob. P63 was 5.27.

What changed

The S_EXECUTE state’s “default” branch used to be:

end else begin
  state <= S_MEM;   // every non-trapping op went through S_MEM
end

S_MEM then had if (!mem_op) state <= S_WB; so non-mem ops sat in S_MEM for one cycle doing nothing. That’s ~14M wasted cycles per Linux boot (the difference between P64’s 21.3M S_MEM cycles and P65’s 7.0M). P65 splits the default into two:

end else if (mem_op) begin
  state <= S_MEM;        // loads / stores still need S_MEM
end else begin
  state <= S_WB;         // ALU ops, LUI/AUIPC, branches, JAL,
                         // JALR, mret/sret/ecall handled
                         // earlier — all skip S_MEM now
end

That’s the entire RTL diff. Multi-cycle ops (MUL, DIV, AMO, walker, A/D updates, sfence-vma+flush) keep their own FSM paths exactly as in P64.
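The effect of the split can be sketched with a small behavioral model. This is a Python illustration, not the RTL: the state names come from the post, but `after_execute`, `path_from_execute`, `cycles_saved`, and the `skip_mem` flag are hypothetical, and the model assumes one cycle per state visit while ignoring traps and multi-cycle ops.

```python
# Behavioral sketch of the S_EXECUTE default branch only (not the RTL).
# Assumptions: one cycle per state visit; traps and multi-cycle ops ignored.
S_EXECUTE, S_MEM, S_WB = "S_EXECUTE", "S_MEM", "S_WB"

def after_execute(mem_op: bool, skip_mem: bool) -> str:
    """State that follows S_EXECUTE for a non-trapping op."""
    if mem_op or not skip_mem:
        return S_MEM   # P64: always; P65: loads/stores only
    return S_WB        # P65: register-only ops go straight to writeback

def path_from_execute(mem_op: bool, skip_mem: bool) -> list:
    """States visited from S_EXECUTE through S_WB, inclusive."""
    path = [S_EXECUTE, after_execute(mem_op, skip_mem)]
    if path[-1] == S_MEM:
        path.append(S_WB)  # S_MEM hands off to S_WB in this simplified model
    return path

def cycles_saved(mem_op: bool) -> int:
    """Cycles P65-style routing saves over P64-style for one instruction."""
    return len(path_from_execute(mem_op, False)) - len(path_from_execute(mem_op, True))

print("register-only op saves", cycles_saved(False), "cycle(s)")
print("load/store saves", cycles_saved(True), "cycle(s)")
```

Under these assumptions a register-only op saves exactly one state visit and a load or store saves none, which matches the shape of the diff above.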

Boot milestones (P62 → P63 → P64 → P65)

Milestone progression, 4 runs (cycles; chart scale 0 → 131.6M)
Columns: P62 = Zbb rotates + U-mode, P63 = fetch fast-path, P64 = prefetch-under-WB, P65 = fused D+X, skip S_MEM; Δ = P65 vs P64.

milestone                               P62          P63          P64          P65         Δ
Linux version                           1,611,173    1,346,691    1,078,147    895,310     −17.0%
Switched to clocksource                 18,426,917   15,530,502   12,603,459   10,579,811  −16.1%
Run /init as init                       130,054,364  109,755,383  89,432,784   75,734,906  −15.3%
userspace hello                         131,241,874  110,746,765  90,246,176   76,410,353  −15.3%
Attempted to kill init (clean panic)    131,610,501  111,063,721  90,505,484   76,629,292  −15.3%

Every milestone moves earlier by roughly the same percentage — this is a uniform per-instruction speedup, not a phase-specific one. (Compare to P64’s headline numbers, which were also a uniform shift over P63.)
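The "roughly the same percentage" claim can be checked directly from the P64 and P65 milestone cycle counts. A minimal Python sketch follows; the dictionary layout and the `pct_change` helper are illustrative names, not part of the project's tooling.

```python
# P64 and P65 cycle counts at each boot milestone, copied from the table above.
milestones = {
    "Linux version": (1_078_147, 895_310),
    "Switched to clocksource": (12_603_459, 10_579_811),
    "Run /init as init": (89_432_784, 75_734_906),
    "userspace hello": (90_246_176, 76_410_353),
    "Attempted to kill init (clean panic)": (90_505_484, 76_629_292),
}

def pct_change(before: int, after: int) -> float:
    """Relative change in percent; negative means fewer cycles."""
    return (after - before) / before * 100.0

deltas = {name: round(pct_change(p64, p65), 1) for name, (p64, p65) in milestones.items()}
spread = max(deltas.values()) - min(deltas.values())

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
print(f"spread between milestones: {spread:.1f} percentage points")
```

The spread stays under two percentage points, with the earliest milestone gaining slightly more than the later ones.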

CPI comparison

CPI comparison, 4 runs (best 3.63 CPI, worst 6.24 CPI)

chip rev                        CPI    change
P62 Zbb rotates + U-mode        6.24   baseline
P63 fetch fast-path             5.27   −0.97 vs P62
P64 prefetch-under-WB           4.29   −0.98 vs P63
P65 fused D+X, skip S_MEM       3.63   −0.67 vs P64

CPI: cycles per retired instruction; lower is better. Each bar's length is its CPI as a fraction of the worst run, so a bar half as long means half the cycles per instruction on the same workload.

P65 retires the same ~21.5M instructions in 78M cycles vs P64’s 92M — CPI drops 4.29 → 3.63 (~15.4%). Stacked across the optimization arc:

chip rev   CPI    cumulative speedup vs P62
P62        6.24   baseline
P63        5.27   1.18×
P64        4.29   1.45×
P65        3.63   1.72×
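The cumulative speedup column is just the ratio of CPIs. The sketch below reproduces it in Python, assuming the instruction count and clock are identical across revisions so that speedup equals the CPI ratio; the variable names are illustrative.

```python
# CPI per chip revision, as reported in the table above.
cpi = {"P62": 6.24, "P63": 5.27, "P64": 4.29, "P65": 3.63}

# Speedup vs P62 = baseline CPI / this revision's CPI
# (assumes the same instruction count and clock for every revision).
speedup = {rev: round(cpi["P62"] / value, 2) for rev, value in cpi.items()}

# Relative CPI drop from P64 to P65, in percent.
p65_drop_pct = round((cpi["P64"] - cpi["P65"]) / cpi["P64"] * 100, 1)

for rev, factor in speedup.items():
    print(f"{rev}: {factor:.2f}x vs P62")
print(f"P64 -> P65 CPI drop: {p65_drop_pct}%")
```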

Where the cycles went: state-cycle breakdown

State distribution, 4 runs (scale: % of post-load cycles). Total cycles per run:
  1. P62 Zbb rotates + U-mode: 132,610,502 cycles
  2. P63 fetch fast-path: 112,063,722 cycles
  3. P64 prefetch-under-WB: 92,000,000 cycles
  4. P65 fused D+X, skip S_MEM: 78,000,000 cycles

The S_MEM column tells the story. P64’s column was 21.3M cycles — basically one cycle per retired instruction. P65’s is 7.0M — only loads, stores, AMO completions, and the walker’s mem-issue cycle land there now. The remaining ~14M cycles got reclaimed and folded into a proportionally shorter total.
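As a quick cross-check, the S_MEM reduction can be compared with the drop in total cycles. This Python sketch uses only the rounded figures quoted in this section; the variable names are illustrative.

```python
# Rounded figures quoted in the text (cycles).
p64_mem, p65_mem = 21_300_000, 7_000_000
p64_total, p65_total = 92_000_000, 78_000_000

reclaimed = p64_mem - p65_mem        # cycles no longer spent idling in S_MEM
total_drop = p64_total - p65_total   # overall cycle reduction for the run
reclaimed_pct_of_p64 = round(reclaimed / p64_total * 100, 1)

print(f"S_MEM cycles reclaimed: {reclaimed:,}")
print(f"total cycle drop:       {total_drop:,}")
print(f"reclaimed as share of P64 total: {reclaimed_pct_of_p64}%")
```

The two figures agree to within rounding, which is consistent with the S_MEM skip accounting for essentially all of the gain.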

P65 state breakdown (78,000,000 cycles, CPI 3.63)

state       share    cycles
fetch        1.4%     1,083,868
decode      27.6%    21,516,418
execute     27.6%    21,516,417
mem          9%       7,058,586
walker       3%       2,374,410
writeback   27.6%    21,509,201
mul/div      3.8%     2,941,100
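The breakdown can be sanity-checked by summing the states and recomputing shares. In the Python sketch below, the CPI estimate assumes that each retired instruction spends exactly one cycle in writeback, so the writeback count stands in for the retired-instruction count; that is an assumption of this sketch, not something the post states.

```python
# P65 per-state cycle counts from the breakdown above.
states = {
    "fetch": 1_083_868,
    "decode": 21_516_418,
    "execute": 21_516_417,
    "mem": 7_058_586,
    "walker": 2_374_410,
    "writeback": 21_509_201,
    "mul/div": 2_941_100,
}

total = sum(states.values())
share_pct = {name: round(cycles / total * 100, 1) for name, cycles in states.items()}

# Assumption: one writeback cycle per retired instruction.
cpi_estimate = round(total / states["writeback"], 2)

print(f"total cycles: {total:,}")
for name, pct in share_pct.items():
    print(f"{name:10s} {pct:5.1f}%")
print(f"estimated CPI: {cpi_estimate}")
```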

Where time goes inside the kernel

Hot functions (76,171 samples, one sample every 1,024 cycles)

rank  function                   where      share   samples
1     blake2s_compress_generic   kernel     14.7%   11,230
2     memset                     kernel      8.1%    6,153
3     memcpy                     kernel      7.1%    5,387
4     format_decode              kernel      5.3%    3,998
5     vsnprintf                  kernel      3.7%    2,844
6     vruntime_eligible          kernel      2.8%    2,126
7     number                     kernel      2.5%    1,881
8     memcmp                     kernel      2.3%    1,751
9     __slab_alloc_node.isra.0   kernel      1.8%    1,405
10    machine_restart            kernel      1.7%    1,295
11    avg_vruntime               kernel      1.6%    1,229
12    strlen                     kernel      1.6%    1,192
13    string_nocheck             kernel      1.2%      888
14    chacha_permute             kernel      1.1%      835
15    add_uevent_var             kernel      1.1%      797
16    (remaining)                remaining  34.3%   26,147
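Sample counts convert to approximate cycle counts by multiplying by the sampling period. The Python sketch below does this for the top entry; treating each sample as standing for one full period is a simplifying assumption, and the variable names are illustrative.

```python
# Profile parameters from the chart header above.
total_samples = 76_171
period_cycles = 1_024          # one sample every 1,024 cycles

# Cycles covered by the whole profile (each sample stands for one period).
covered_cycles = total_samples * period_cycles

# Top entry: blake2s_compress_generic.
blake2s_samples = 11_230
blake2s_share_pct = round(blake2s_samples / total_samples * 100, 1)
blake2s_cycles_est = blake2s_samples * period_cycles

print(f"profile covers about {covered_cycles:,} cycles")
print(f"blake2s_compress_generic: {blake2s_share_pct}% of samples, "
      f"about {blake2s_cycles_est:,} cycles")
```

The covered-cycle figure lands just under the 78,000,000-cycle run length, so the profile spans essentially the whole run.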

Hot-function shape is essentially unchanged from P64: BLAKE2s CRNG init, post-panic machine_restart, memset/memcpy. As expected — we made every instruction faster but didn’t change which instructions the kernel runs.

What this is not

P65 is not yet a real pipelined core. Stages still serialize: fetch waits for the previous instruction’s writeback, only overlapping during S_WB via P64’s prefetch path. Each instruction still passes through 2 cycles minimum — one for fetch (or 0 with prefetch hit) plus one fused decode-execute.

The real-pipeline rung lands next as P66: F and X overlap properly, with hazard logic, forwarding, and a load-use stall counter chartable on this same per-state breakdown.

What’s queued

  • P66 — real 3-stage pipeline. F runs concurrently with X. Forwarding from end-of-X to next-cycle’s reg-read + a load-use stall path.
  • P67 — synthesis baseline. Run all four optimization rungs through the LibreLane flow on sky130; chart fmax/area/cells against each other. The first time we get nanosecond numbers out of these cycle wins.
  • P68 — AtomVM port (Erlang BEAM on the chip). Already scaffolded in parallel; waiting on packbeam.