Speed-push experiment on P37

The speed page put today’s chip at roughly 32 MHz implied Fmax at the slow signoff corner, against a 25 MHz (40 ns) target. P37’s recorded worst setup slack at signoff was 8.74 ns, which puts the critical path around 31 ns. The natural diagnostic is to drop CLOCK_PERIOD to 30 ns (33.3 MHz) and re-harden. If it converges, we learn how much margin we actually had; if it breaks, we learn which path inside the FSM CPU pinches first.
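The arithmetic behind that estimate can be sketched in a few lines of Python. The function names are made up for this illustration, and the calculation is a first-order estimate: it assumes the worst path's delay stays the same when the constraint changes.

```python
def critical_path_ns(clock_period_ns: float, worst_setup_slack_ns: float) -> float:
    """Delay actually used by the worst path: period minus remaining slack."""
    return clock_period_ns - worst_setup_slack_ns


def implied_fmax_mhz(clock_period_ns: float, worst_setup_slack_ns: float) -> float:
    """Frequency at which the worst path would have exactly zero slack."""
    return 1000.0 / critical_path_ns(clock_period_ns, worst_setup_slack_ns)


if __name__ == "__main__":
    # Period and slack as quoted for the recorded P37 harden.
    path = critical_path_ns(40.0, 8.74)
    fmax = implied_fmax_mhz(40.0, 8.74)
    print(f"critical path ~{path:.2f} ns, implied Fmax ~{fmax:.1f} MHz")
```

A 30 ns period sits just under that estimated path length, which is why it makes a useful probe: tight enough to expose the limiting path, close enough that the flow should still finish.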

The new config is checked in at:

projects/37_rv32im_zicsr_zifencei/librelane/config_speed30.yaml

Same RTL, same SDC, same memory map, same DEFAULT_CORNER. The only diff vs the recorded P37 harden is CLOCK_PERIOD: 40.0 -> 30.0.

Running it

scripts/run_librelane.sh hard-codes librelane/config.yaml, so this one goes through librelane directly:

cd projects/37_rv32im_zicsr_zifencei
librelane librelane/config_speed30.yaml

The fresh RUN_<timestamp>/ lands under projects/37_rv32im_zicsr_zifencei/librelane/runs/, alongside the existing RUN_2026-05-02_22-49-46/ from the original P37 harden. The original hardened result is not disturbed.
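To pick up the newest run directory without copying the timestamp by hand, something like the following sketch could be used. The helper `latest_run` is hypothetical and not part of librelane; it relies only on the RUN_YYYY-MM-DD_HH-MM-SS naming seen above, which sorts chronologically as plain text.

```python
from pathlib import Path
from typing import Optional


def latest_run(runs_dir: Path) -> Optional[Path]:
    """Return the newest RUN_* directory, or None if there is none.

    The timestamped names sort chronologically when compared as strings,
    so the last name in sorted order is the most recent run.
    """
    runs = sorted(
        (p for p in runs_dir.glob("RUN_*") if p.is_dir()),
        key=lambda p: p.name,
    )
    return runs[-1] if runs else None


if __name__ == "__main__":
    runs_dir = Path("projects/37_rv32im_zicsr_zifencei/librelane/runs")
    print(latest_run(runs_dir))
```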

What we got

Outcome 2: converges through GDS with negative setup slack. The flow ran end-to-end and produced a clean GDS (Magic DRC, KLayout DRC, LVS, antenna, routing DRC, XOR all 0 errors), but signoff timing reported 152 setup violations at max_ss_100C_1v60 with worst slack -1.258 ns.

Metric	P37 (40 ns)	P37-speed (30 ns)
Worst setup slack	`8.742 ns`	`-1.258 ns`
Worst hold slack	`0.105 ns`	`0.105 ns`
Setup violation count	`0`	`152`
Hold violation count	`0`	`0`
`max_ss` slew violations	`83`	`83`
`max_ss` cap violations	`8`	`8`
Standard-cell area	`191927 um²`	`191927 um²`
Standard-cell count	`27157`	`27157`
Magic DRC / KLayout DRC / LVS / antenna / routing DRC	`0`	`0`

The first interesting observation: the implied critical-path length is identical. P37 had 40 - 8.742 = 31.258 ns of used path; P37-speed has 30 + 1.258 = 31.258 ns. Same number, to three decimal places. The resizer didn’t gain anything from the tighter budget - same standard-cell count, same area, same DRV tail. This is the cell-strength ceiling for that path; making the constraint tighter just moved the slack from positive to negative without changing the underlying logic.
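The comparison above can be checked mechanically. The sketch below copies the table values into two dictionaries (the key names are invented for this example) and works in integer picoseconds so the equality test is exact rather than subject to floating-point rounding.

```python
# Values copied from the comparison table; times converted to picoseconds.
P37_40NS = {
    "clock_period_ps": 40000,
    "worst_setup_slack_ps": 8742,
    "worst_hold_slack_ps": 105,
    "setup_violations": 0,
    "hold_violations": 0,
    "max_ss_slew_violations": 83,
    "max_ss_cap_violations": 8,
    "stdcell_area_um2": 191927,
    "stdcell_count": 27157,
}
P37_30NS = dict(
    P37_40NS,
    clock_period_ps=30000,
    worst_setup_slack_ps=-1258,
    setup_violations=152,
)


def used_path_ps(metrics: dict) -> int:
    """Path delay implied by the constraint and the reported slack."""
    return metrics["clock_period_ps"] - metrics["worst_setup_slack_ps"]


def changed_metrics(a: dict, b: dict) -> dict:
    """Metrics whose values differ between two runs, as (a, b) pairs."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


if __name__ == "__main__":
    print("used path (ps):", used_path_ps(P37_40NS), used_path_ps(P37_30NS))
    print("changed:", sorted(changed_metrics(P37_40NS, P37_30NS)))
```

Only the period, the worst setup slack, and the setup violation count differ; everything describing the physical result is unchanged.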

The critical endpoint

The worst violator and the rest of the top 16 violators all start at the same register:

Startpoint: _19399_/Q   (u_core.op_b[2], the ALU operand-B register)
Endpoint:   _17817_/D   (a downstream flop)

The path runs through a ~16-deep chain of or4_2/or3_2/or2_2/a22o_2 gates with buffer repeaters between segments. That shape is a classic wide OR-reduction tree, almost certainly the divider’s bit-by-bit quotient/remainder reduction or the multiplier’s add-reduction in the same block. The fact that the same source register (op_b[2]) drives the top 16 endpoints tells us those 16 destinations are all part of the same wide reduction.
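Spotting a shared startpoint across many violators is easy to automate. The following is a minimal sketch that assumes the timing report contains `Startpoint:` lines shaped like the excerpt above; real report layouts may differ, and the function name is made up for this example.

```python
import re
from collections import Counter

# Matches "Startpoint: <pin>" lines as they appear in the excerpt above.
STARTPOINT_RE = re.compile(r"^\s*Startpoint:\s+(\S+)", re.MULTILINE)


def startpoint_histogram(report_text: str) -> Counter:
    """Count how many reported paths begin at each startpoint pin."""
    return Counter(STARTPOINT_RE.findall(report_text))


if __name__ == "__main__":
    # Only the first path is from the real report; the other pin names
    # are placeholders made up for the demo.
    sample = """\
Startpoint: _19399_/Q
Endpoint:   _17817_/D

Startpoint: _19399_/Q
Endpoint:   _PLACEHOLDER_A_/D

Startpoint: _PLACEHOLDER_B_/Q
Endpoint:   _PLACEHOLDER_C_/D
"""
    for pin, count in startpoint_histogram(sample).most_common():
        print(f"{pin}: {count} path(s)")
```

A histogram dominated by one pin, as here with op_b[2], points at a single shared cone of logic rather than many unrelated slow paths.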

This is consistent with what the speed page said: today’s chip is FSM-bound, not technology-bound. The path is not about wire delay or routing parasitics - it is the depth of a combinational arithmetic reduction inside the FSM’s S_DIV/S_MUL cycle. Adding a single pipeline register in the middle of that reduction would split the path roughly in half, so each half would need about 16 ns before register overhead. Doing that means designing a real pipelined core, which is a P-something-large rung, not a config flip.
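The halving estimate above is the ideal case. A rough sketch of how it shifts once the split is imperfect is below; the register overhead and imbalance figures are illustrative assumptions, not measurements from this design.

```python
def stage_delay_ns(
    path_ns: float,
    stages: int,
    register_overhead_ns: float = 0.0,
    imbalance: float = 0.0,
) -> float:
    """Estimate the slowest stage after cutting a path into equal-ish stages.

    imbalance is the fraction by which the slowest stage exceeds the ideal
    even split; register_overhead_ns lumps clock-to-Q and setup time.
    """
    ideal = path_ns / stages
    return ideal * (1.0 + imbalance) + register_overhead_ns


if __name__ == "__main__":
    path = 31.258  # measured critical path from the signoff report
    print(f"ideal split:       {stage_delay_ns(path, 2):.2f} ns")
    # Assumed numbers, purely for illustration: 0.5 ns overhead, 10% imbalance.
    print(f"with assumed costs: {stage_delay_ns(path, 2, 0.5, 0.10):.2f} ns")
```

Even with pessimistic assumptions the per-stage delay stays well under the 30 ns budget, which is why the fix is architectural rather than a matter of resizing cells.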

What this confirms

The headroom-from-slack estimate on the speed page is correct. The signoff Fmax of this RTL is about 1 / 31.258 ns ≈ 32 MHz, almost exactly the ~32 MHz we estimated.
The resizer is already doing what it can. Same cells, same area at both budgets - it can’t trade more cells for more speed on this path.
The path lives in the divider/multiplier reduction. That is a deliberate target for a future architectural rung if speed becomes a goal.

Status

Configured: PASS

Hardened (fast): PARTIAL - GDS/DRC/LVS clean, 152 setup violations in signoff timing.

The fresh RUN_2026-05-03_01-29-50/ is checked into the run directory next to the original P37 harden; the original 40 ns harden is unchanged.

The roadmap-side conclusion is unchanged: today’s chip is FSM-bound, not technology-bound. A real speed jump means a different core, not a tighter budget. See /speed/ for the framing. This run graduates from “experiment we should try” to “data point we have,” and the divider/multiplier OR-reduction is the named target if we ever care about Fmax.