P64’s profile pinned down the next inefficiency clearly: every register-only instruction was paying for a 1-cycle pass through S_MEM whether or not it actually touched memory. P65 takes that cycle back and also fuses S_DECODE into S_EXECUTE on the common path, so plain ALU ops, branches, jumps, and LUI/AUIPC all retire in 2 cycles instead of 4.
Headline: CPI 4.29 → 3.63 (-15.4%) on Linux boot. Same kernel image, same boot blob. P63 was 5.27.
What changed
The S_EXECUTE state’s “default” branch used to be:
```verilog
end else begin
    state <= S_MEM;  // every non-trapping op went through S_MEM
end
```
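S_MEM then had `if (!mem_op) state <= S_WB;`, so non-mem ops sat in S_MEM for one cycle doing nothing. That’s roughly 14M wasted cycles per Linux boot: P64 spent 21.3M cycles in S_MEM, and only about 7M of those belonged to real memory traffic. P65 splits the default into two: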
```verilog
end else if (mem_op) begin
    state <= S_MEM;  // loads / stores still need S_MEM
end else begin
    state <= S_WB;   // ALU ops, LUI/AUIPC, branches, JAL, JALR
                     // (mret/sret/ecall are handled earlier);
                     // all skip S_MEM now
end
```
That’s the entire RTL diff. Multi-cycle ops (MUL, DIV, AMO, walker, A/D updates, sfence-vma+flush) keep their own FSM exactly as P64.
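The effect of that split on the state sequence can be sketched with a toy model. This is a minimal Python sketch, not the RTL: the function names are made up for illustration, only the S_EXECUTE and S_MEM hand-off is modelled, and it assumes S_MEM completes in a single cycle (no miss, no walker activity).

```python
# Toy model of the S_EXECUTE -> (S_MEM) -> S_WB hand-off.
# Hypothetical helper names; traps and multi-cycle ops are not modelled.

def next_state_p64(state: str, mem_op: bool) -> str:
    """P64 behaviour: every non-trapping op leaves S_EXECUTE for S_MEM."""
    if state == "S_EXECUTE":
        return "S_MEM"
    if state == "S_MEM":
        # Assumed single-cycle S_MEM; non-memory ops simply idle here.
        return "S_WB"
    raise ValueError(f"state not modelled: {state}")


def next_state_p65(state: str, mem_op: bool) -> str:
    """P65 behaviour: only memory ops visit S_MEM."""
    if state == "S_EXECUTE":
        return "S_MEM" if mem_op else "S_WB"
    if state == "S_MEM":
        return "S_WB"
    raise ValueError(f"state not modelled: {state}")


def cycles_to_writeback(next_state, mem_op: bool) -> int:
    """Count cycles spent from S_EXECUTE until S_WB is entered."""
    state, cycles = "S_EXECUTE", 0
    while state != "S_WB":
        state = next_state(state, mem_op)
        cycles += 1
    return cycles


for mem_op in (False, True):
    p64 = cycles_to_writeback(next_state_p64, mem_op)
    p65 = cycles_to_writeback(next_state_p65, mem_op)
    print(f"mem_op={mem_op}: P64 {p64} cycles, P65 {p65} cycles")
```

In this model a non-memory op saves exactly one cycle and a memory op is unchanged, which is the per-instruction saving the profile numbers below are measuring.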
Boot milestones (P62 → P63 → P64 → P65)
Chip revs in this arc: P62 (Zbb rotates + U-mode), P63 (fetch fast-path), P64 (prefetch-under-WB), P65 (fused D+X, skip S_MEM). Δ is P65 relative to P64.

| milestone | P64 cycles | P65 cycles | Δ |
|---|---|---|---|
| Linux version | 1,078,147 | 895,310 | -17.0% |
| Switched to clocksource | 12,603,459 | 10,579,811 | -16.1% |
| Run /init as init | 89,432,784 | 75,734,906 | -15.3% |
| userspace hello | 90,246,176 | 76,410,353 | -15.3% |
| Attempted to kill init (clean panic) | 90,505,484 | 76,629,292 | -15.3% |
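The Δ values can be re-derived from the raw cycle counts. A minimal Python sketch, using only the P64 and P65 numbers listed for each milestone; the helper name `pct_change` is just for illustration.

```python
# Recompute the P64 -> P65 change at each boot milestone.
# Cycle counts are copied from the milestone table.

milestones = {
    "Linux version": (1_078_147, 895_310),
    "Switched to clocksource": (12_603_459, 10_579_811),
    "Run /init as init": (89_432_784, 75_734_906),
    "userspace hello": (90_246_176, 76_410_353),
    "Attempted to kill init (clean panic)": (90_505_484, 76_629_292),
}


def pct_change(before: int, after: int) -> float:
    """Relative change in percent; negative means fewer cycles."""
    return (after - before) / before * 100.0


for name, (p64_cycles, p65_cycles) in milestones.items():
    print(f"{name:38s} {pct_change(p64_cycles, p65_cycles):+.1f}%")
```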
Every milestone moves earlier by roughly the same percentage — this is a uniform per-instruction speedup, not a phase-specific one. (Compare to P64’s headline numbers, which were also a uniform shift over P63.)
CPI comparison
[Chart: CPI (cycles per retired instruction) per chip rev; lower is better. Each bar's length is its CPI as a fraction of the worst run, so a bar half as long means half the cycles on the same workload.]
P65 retires the same ~21.5M instructions in 78M cycles vs P64’s 92M — CPI drops 4.29 → 3.63 (~15.4%). Stacked across the optimization arc:
| chip rev | CPI | cumulative speedup vs P62 |
|---|---|---|
| P62 | 6.24 | baseline |
| P63 | 5.27 | 1.18× |
| P64 | 4.29 | 1.45× |
| P65 | 3.63 | 1.72× |
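The speedup column is just the P62 CPI divided by each rev's CPI, since the instruction stream is the same. A short Python sketch of that arithmetic, using the CPI values from the table (variable names are illustrative):

```python
# Cumulative speedup vs P62, derived from the per-rev CPI figures.

cpi = {"P62": 6.24, "P63": 5.27, "P64": 4.29, "P65": 3.63}

baseline = cpi["P62"]
speedup = {rev: baseline / value for rev, value in cpi.items()}

for rev, value in cpi.items():
    print(f"{rev}: CPI {value:.2f}, {speedup[rev]:.2f}x vs P62")

# Step change for the latest rung only (P64 -> P65), in percent.
p65_step = (cpi["P65"] - cpi["P64"]) / cpi["P64"] * 100.0
print(f"P64 -> P65: {p65_step:+.1f}% CPI")
```

This holds only while the retired-instruction count stays fixed across revs, which is the case here because the kernel image and boot blob are unchanged.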
Where the cycles went: state-cycle breakdown
The S_MEM column tells the story. P64’s column was 21.3M cycles, basically one cycle per retired instruction. P65’s is 7.0M: only loads, stores, AMO completions, and the walker’s mem-issue cycle land there now. The other ~14M cycles were reclaimed outright, which matches the drop in total cycles from 92M to 78M.

| state | share of cycles | cycles |
|---|---|---|
| fetch | 1.4% | 1,083,868 |
| decode | 27.6% | 21,516,418 |
| execute | 27.6% | 21,516,417 |
| mem | 9% | 7,058,586 |
| walker | 3% | 2,374,410 |
| writeback | 27.6% | 21,509,201 |
| mul/div | 3.8% | 2,941,100 |
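The per-state counts above can be turned into cycles per retired instruction, which makes it easier to see where the 3.63 CPI comes from. A minimal Python sketch; the instruction count is the approximate ~21.5M figure quoted earlier, so the per-instruction values are estimates rather than exact measurements.

```python
# Per-state cycle counts for the P65 boot, copied from the breakdown.
state_cycles = {
    "fetch": 1_083_868,
    "decode": 21_516_418,
    "execute": 21_516_417,
    "mem": 7_058_586,
    "walker": 2_374_410,
    "writeback": 21_509_201,
    "mul/div": 2_941_100,
}

# Assumption: roughly 21.5M retired instructions (the text says "~21.5M").
instructions = 21_500_000

total_cycles = sum(state_cycles.values())  # 78,000,000

for state, cycles in state_cycles.items():
    share = cycles / total_cycles * 100.0
    per_instr = cycles / instructions
    print(f"{state:10s} {share:5.1f}%  {per_instr:.2f} cycles/instr")

print(f"total {total_cycles:,} cycles, CPI ~ {total_cycles / instructions:.2f}")
```

Under that assumption, decode, execute, and writeback each come out at about one cycle per retired instruction, with mem at roughly a third of a cycle and fetch, walker, and mul/div making up the rest.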
Where time goes inside the kernel
- 14.7% of samples (11,230 samples)
- 8.1% of samples (6,153 samples)
- 7.1% of samples (5,387 samples)
- 5.3% of samples (3,998 samples)
- 3.7% of samples (2,844 samples)
- 2.8% of samples (2,126 samples)
- 2.5% of samples (1,881 samples)
- 2.3% of samples (1,751 samples)
- 1.8% of samples (1,405 samples)
- 1.7% of samples (1,295 samples)
- 1.6% of samples (1,229 samples)
- 1.6% of samples (1,192 samples)
- 1.2% of samples (888 samples)
- 1.1% of samples (835 samples)
- 1.1% of samples (797 samples)
- 34.3% of samples (26,147 samples)
Hot-function shape is essentially unchanged from P64: BLAKE2s
CRNG init, post-panic machine_restart, memset/memcpy. As
expected — we made every instruction faster but didn’t change
which instructions the kernel runs.
What this is not
P65 is not yet a real pipelined core. Stages still serialize: fetch waits for the previous instruction’s writeback, only overlapping during S_WB via P64’s prefetch path. Each instruction still passes through 2 cycles minimum — one for fetch (or 0 with prefetch hit) plus one fused decode-execute.
The real-pipeline rung lands next as P66: F and X overlap properly, with hazard logic, forwarding, and a load-use stall counter chartable on this same per-state breakdown.
What’s queued
- P66 — real 3-stage pipeline. F runs concurrently with X. Forwarding from end-of-X to next-cycle’s reg-read + a load-use stall path.
- P67 — synthesis baseline. Run all four optimization rungs through the LibreLane flow on sky130; chart fmax/area/cells against each other. The first time we get nanosecond numbers out of these cycle wins.
- P68 — AtomVM port (Erlang BEAM on the chip). Already scaffolded in parallel; waiting on packbeam.