Speed ceiling · librelane-playground

This page is a ceiling estimate, not a roadmap. The ladder is going somewhere else: toward enough RISC-V to host real software. But the question “if we just wanted to push speed, how far could we go on sky130A?” comes up often enough to write down.

§ What the technology decides

The PDK fixes two things that anchor every speed number on this site:

Standard-cell library. We use sky130_fd_sc_hd — the open high-density library. It has a moderate cell variety, no hand-optimized fast flops, and a published FO4 delay around 50 ps for an inverter at the nominal corner. That’s the fundamental gate-delay yardstick.
Routing parasitics. sky130 is a 130 nm node. Wire delay is a real fraction of the cycle once nets get long. Tools have to spend buffers to keep slew under control on big fanout trees, and that buffering eats into the cycle budget.

A back-of-envelope ceiling for a single fast path on sky130 hd is roughly 15–25 FO4 inverter delays per cycle once you account for setup time, clock skew, and a realistic mix of multi-input gates. That’s 750 ps – 1.25 ns, or 800 MHz – 1.3 GHz as a raw cell-delay number.
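The arithmetic behind that range can be sketched in a few lines of Python. This is only a unit conversion using the ~50 ps FO4 figure and the 15–25 FO4 depth quoted above; the function and constant names are made up for illustration.

```python
FO4_PS = 50.0  # approximate FO4 inverter delay quoted above (nominal corner)

def cycle_from_fo4(fo4_per_cycle, fo4_ps=FO4_PS):
    """Return (period in ns, frequency in MHz) for a path of N FO4 delays."""
    period_ns = fo4_per_cycle * fo4_ps / 1000.0  # ps -> ns
    freq_mhz = 1000.0 / period_ns                # 1 / ns = GHz, so scale to MHz
    return period_ns, freq_mhz

for depth in (15, 25):
    period_ns, freq_mhz = cycle_from_fo4(depth)
    print(f"{depth} FO4 -> {period_ns:.2f} ns -> {freq_mhz:.0f} MHz")
# 15 FO4 -> 0.75 ns -> 1333 MHz
# 25 FO4 -> 1.25 ns -> 800 MHz
```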

That number is misleading. Real designs don’t get there because the critical path isn’t a chain of inverters — it’s a register-to-register path through ALU logic, mux trees, and routed wires. Every realistic RISC-V critical path on sky130 hd lands much further down.

§ What the architecture decides

The biggest single multiplier between “FO4 ceiling” and “what your chip actually clocks at” is whether you pipelined.

core style	comfortable	with effort	hard ceiling set by
multi-cycle FSM (today)	`50–80 MHz`	`~100 MHz`	front-end FSM transitions
3-stage pipelined (Ibex-ish)	`100–150 MHz`	`~180 MHz`	reg-file + ALU forwarding
5-stage pipelined (Rocket-mini)	`100–180 MHz`	`~220 MHz`	branch + memory paths
aggressively tuned	—	`~250 MHz`	clock tree, std-cell skew

A multi-cycle CPU like the one we ship today has long combinational paths between flop-stages because each instruction does several operations in series across one giant FSM. Adding a real pipeline breaks those paths into smaller pieces, and the cycle time drops roughly proportionally — at the cost of pipeline registers, forwarding logic, hazard detection, and a much larger test surface.
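A minimal sketch of why the drop is only roughly proportional, assuming an idealized model in which the combinational logic splits evenly across stages and each stage pays a fixed register overhead (clock-to-Q, setup, skew). The 30 ns and 1.5 ns values are hypothetical placeholders, not measurements from this project.

```python
def pipelined_period_ns(logic_ns, stages, overhead_ns):
    """Idealized cycle time: logic divided evenly across stages, plus a
    fixed per-stage register overhead that pipelining cannot remove."""
    if stages < 1:
        raise ValueError("stages must be >= 1")
    return logic_ns / stages + overhead_ns

LOGIC_NS = 30.0     # hypothetical total combinational delay
OVERHEAD_NS = 1.5   # hypothetical per-stage register overhead

for stages in (1, 3, 5):
    period = pipelined_period_ns(LOGIC_NS, stages, OVERHEAD_NS)
    print(f"{stages} stage(s): {period:.2f} ns, {1000.0 / period:.1f} MHz")
```

Real stages never balance perfectly, and forwarding and hazard logic add delay of their own, so this model is best read as an optimistic bound.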

5 stages is the classic RISC textbook split (fetch / decode / execute / memory / writeback). Open-silicon RV32 cores do not all ship at that depth: Ibex is a 2-stage design by default (3 with its optional writeback stage), and VexRiscv’s pipeline depth is configurable. Either way, VexRiscv, Ibex, and SCR1-class designs all live in roughly the 100–180 MHz zone on sky130 hd.

§ What careful PnR decides

Past the architecture, you can squeeze another factor by being careful about the flow itself:

Floorplan. Hand-placing critical macros (reg-file, instruction buffer) so they sit close to the path that uses them shortens routes and removes buffering.
Clock tree. The default CTS targets a generous skew budget. Tightening it costs more buffers but recovers cycle time.
Flops. The default flops in sky130_fd_sc_hd are general-purpose. Some designs swap in *_2 or *_4 drive-strength variants on critical endpoints to reduce setup time.
Synthesis. Pushing Yosys harder (abc -fast off, retiming on, different mapping passes) trades runtime for QoR.

These are diminishing returns. Each one buys roughly 5–15% of cycle time. None of them turns a 100 MHz core into a 200 MHz core; that takes an architecture change.
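To put numbers on that, here is a small sketch that converts a fractional cycle-time reduction into a clock frequency, and computes the reduction needed to reach a target. It uses only the 5–15% and 100 MHz / 200 MHz figures from the text; the helper names are illustrative.

```python
def freq_after_gain(freq_mhz, cycle_reduction):
    """Clock frequency after shrinking the cycle time by a given fraction."""
    if not 0.0 <= cycle_reduction < 1.0:
        raise ValueError("cycle_reduction must be in [0, 1)")
    return freq_mhz / (1.0 - cycle_reduction)

def reduction_needed(freq_from_mhz, freq_to_mhz):
    """Fraction of the cycle time that must be removed to reach a target clock."""
    return 1.0 - freq_from_mhz / freq_to_mhz

BASE_MHZ = 100.0
for gain in (0.05, 0.15):
    print(f"{gain:.0%} shorter cycle: {freq_after_gain(BASE_MHZ, gain):.1f} MHz")

print(f"100 -> 200 MHz needs {reduction_needed(100.0, 200.0):.0%} shorter cycle")
```

A single knob lands at about 105–118 MHz from a 100 MHz start, while doubling the clock requires halving the cycle time.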

§ Open-silicon reference points

design	core type	PDK / library	reported clock
Caravel mgmt SoC	VexRiscv-derived RV32IMC	sky130A	`~10–40 MHz` (system-bound)
Ibex (open-silicon hardenings)	RV32IMC, 2-stage pipe	sky130 hd	`~80–120 MHz`
VexRiscv, mid-tune	RV32IMC, configurable pipe	sky130 hd	`~100–150 MHz`
VexiiRiscv, aggressive	RV32IMC, deeper pipe	sky130 hd	`~180–220 MHz`
TinyQV	RV32EC, multi-cycle	sky130A · TT	`~64 MHz` (TT clock)

These are the public points worth pinning the chart on. Anything claiming >250 MHz on sky130A for an in-order RISC-V is either running at the nominal corner only, ignoring SRAM access timing, or using cells the open community can’t reach.

§ Where today’s chip sits

P37 is FSM-bound, not technology-bound. We chose CLOCK_PERIOD: 40.0 because that’s the constraint the rest of the ladder used; the recorded slack at signoff is 8.74 ns against a 40 ns budget, which means the critical path is around 31 ns, not 40 ns.

A speed-push experiment confirmed this directly. P37 was re-hardened at CLOCK_PERIOD: 30.0 (33.3 MHz); the flow ran end-to-end and produced a clean GDS, but signoff reported 152 setup violations at the slow corner with worst slack -1.258 ns. Implied critical path: 30 + 1.258 = 31.258 ns, which matches the 40 - 8.74 ≈ 31.26 ns implied by P37’s positive-slack number. The resizer didn’t gain anything from the tighter budget: same 27157 cells, same 191927 µm² of stdcell.

That puts today’s RTL at an empirical ~32 MHz Fmax at the slow signoff corner, with the critical path landing inside a wide OR-reduction tree starting at the ALU operand-B register and walking through the divider/multiplier block. The journal entry has the endpoint detail.
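The slack arithmetic used in this section can be reproduced with a short sketch. It assumes the simple relation "critical path = clock period minus worst setup slack", which holds only as long as the same path stays critical in both runs. The function names are illustrative; the numbers are the two signoff results quoted above.

```python
def critical_path_ns(clock_period_ns, worst_slack_ns):
    """Implied critical-path delay from a clock constraint and worst setup slack."""
    return clock_period_ns - worst_slack_ns

def fmax_mhz(path_ns):
    """Highest clock frequency at which a path of this delay still meets timing."""
    return 1000.0 / path_ns

# (label, CLOCK_PERIOD in ns, worst setup slack in ns) as reported in the text
runs = [
    ("P37 at 40 ns", 40.0, 8.74),
    ("P37 at 30 ns", 30.0, -1.258),
]

for label, period_ns, slack_ns in runs:
    path = critical_path_ns(period_ns, slack_ns)
    print(f"{label}: path {path:.3f} ns, Fmax {fmax_mhz(path):.2f} MHz")
```

Both runs land within a few picoseconds of each other, which is what you would expect if the tighter constraint did not change the critical path.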

A meaningful speed jump from here means a different core, not a tighter budget: pipelined fetch/decode/execute, register-file forwarding, branch-target resolution moved out of the same cycle as ALU. That’s a P-something-large rung, and only worth doing if “fast” becomes a real goal. For now, “boring and correct enough to host FreeRTOS” is the cheaper milestone.

§ Honest framing

The fastest RISC-V we could realistically produce on sky130A is ~200–250 MHz, with a well-pipelined RV32 core and careful but not exotic PnR. Above that requires custom flops, custom clocking, and research-level effort that doesn’t fit the educational shape of this project.

The ladder is not currently aimed there. The roadmap explains what it is aimed at: enough of the RISC-V architecture to plausibly host real software, starting with FreeRTOS.