The first project where two clocks meet on the same chip and disagree
about what time it is. An SPI host out there in the world drives cs_n,
sck, and mosi on its own clock; we sit on our 100 MHz internal
clock and have to read those signals reliably without going insane on
metastability. The standard solution — two-flop synchronizers on
every async input — is one of the unglamorous core skills of digital
design. We use it five times in this design.
This is also the first project that looks like a memory-mapped peripheral: a small register file, an SPI command frame that selects a register and either reads or writes it, and an 8-bit GPIO port exposed through three registers. It’s the same shape as a real SoC peripheral, just shrunk down to three registers and 8 pins.
Wire length (estimated): 7,767 μm, 1.7× P03. Power: 724 μW. Four max-slew warnings in the slow process corner, the same kind of flaky check that P02 hit; not a hard failure.
What’s new vs. project 03
- Asynchronous external interface. UART (P03) was driven entirely by our clk: a transmitter on a single clock, with no external clock to sample. SPI is the opposite: the master brings its own clock, and we sample it. That sampling is where the interesting part lives.
- Two-flop synchronizers. Five of them in this design, the standard fix for CDC. Every async input (cs_n, sck, mosi, plus all 8 bits of gpio_in) goes through two back-to-back flops in the chip-clock domain.
- Edge detection. With sck synchronized, we detect its rising and falling edges as 1-cycle pulses (sck_rise, sck_fall) and use those as clock enables, the same trick as the baud counter in P03. We never use sck as a real clock anywhere.
- Memory-mapped register file. A 7-bit address selects one of three 8-bit registers; the R/W bit selects direction. Same shape as a real SoC peripheral, just very tiny.
- Multi-edge pin placement. Three edges in use: control on the west, SPI cluster on the north, GPIO bus on the south. Easier to spot in the layout viewer than in P03 where we only used east+west.
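The register-file item in the list above can be sketched as a small decoder. This is a hedged illustration rather than the project's RTL: the module name, the wr_en, wr_data and rd_data ports, the reset value of gpio_oe_q, and the choice to ignore writes to GPIO_IN are all assumptions. The address map (0x00 GPIO_OE, 0x01 GPIO_OUT, 0x02 GPIO_IN) is the one the demo log prints.

```systemverilog
// Sketch only: port names other than addr, gpio_oe_q, gpio_out_q and
// gpio_in_sync are hypothetical.
module regfile_sketch (
    input  logic       clk,
    input  logic       rst_n,
    input  logic       wr_en,         // assumed: 1-cycle pulse when a write frame completes
    input  logic [6:0] addr,          // 7-bit register address from the command byte
    input  logic [7:0] wr_data,       // data byte of the frame
    input  logic [7:0] gpio_in_sync,  // pins after the synchronizers
    output logic [7:0] gpio_oe_q,
    output logic [7:0] gpio_out_q,
    output logic [7:0] rd_data
);
    // Write side: only GPIO_OE and GPIO_OUT are backed by flops.
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            gpio_oe_q  <= 8'h00;  // assumed reset value
            gpio_out_q <= 8'h00;
        end else if (wr_en) begin
            case (addr)
                7'h00:   gpio_oe_q  <= wr_data;
                7'h01:   gpio_out_q <= wr_data;
                default: ;  // GPIO_IN (0x02) and unmapped addresses: write ignored
            endcase
        end
    end

    // Read side: a plain mux; GPIO_IN reads the synchronized pins directly.
    always_comb begin
        case (addr)
            7'h00:   rd_data = gpio_oe_q;
            7'h01:   rd_data = gpio_out_q;
            7'h02:   rd_data = gpio_in_sync;
            default: rd_data = 8'h00;
        endcase
    end
endmodule
```

Keeping the read side combinational means the SPI logic decides when to sample it, which is where the timing care described later comes in.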
How clock-domain crossing actually works
Take a flop with its data input wired to a signal from a different clock domain. If the data changes too close to the rising clock edge, the flop violates setup or hold time — and the output goes metastable: a voltage that’s neither valid 0 nor valid 1, hovering in the forbidden middle while the flop’s internal feedback loop tries to make up its mind.
Metastability resolves probabilistically. After one clock period, the chance the flop is still ambiguous is some small number; call it p (typically 10⁻⁶ to 10⁻³). After two periods, it’s p². The trick of the two-flop synchronizer is that the second flop only ever samples the first flop’s output, which has had a full clock period to settle. The second flop sees an unresolved value with probability p, and if it does, it gets another full period to resolve before anything downstream looks at it. So the probability that the rest of the design ever sees an invalid value is about p², and 10⁻⁶ squared is “your chip will be fine for the heat death of the universe.”
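The p² figure follows from the usual exponential-decay model of metastability. As a sketch, assume the probability that a flop is still unresolved after a settling time t falls off as e^{-t/τ}, where τ is a device-specific resolution time constant and T is the clock period (τ and T are symbols introduced here for illustration, not values from this design):

```latex
\begin{align*}
P(\text{unresolved after } t) &\approx e^{-t/\tau} \\
p &= e^{-T/\tau} \\
P(\text{unresolved after } 2T) &\approx e^{-2T/\tau}
    = \left(e^{-T/\tau}\right)^{2} = p^{2} \\
p = 10^{-6} &\;\Rightarrow\; p^{2} = 10^{-12}
\end{align*}
```

Each extra flop buys one more period of settling time, so the failure probability is multiplied by p again per stage.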
We do this for every wire crossing in: cs_n, sck, mosi, and
each bit of gpio_in. Five chains × 2 flops = 10 synchronizer flops,
plus history flops for edge detection.
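The same two-flop pattern applies to a bus, one chain per bit. A minimal sketch for gpio_in, assuming an active-low asynchronous reset like the rest of the design; the module wrapper and the first-stage name gpio_in_meta are made up for illustration, while gpio_in and gpio_in_sync are the names the design uses.

```systemverilog
// Sketch only: gpio_in_meta and the module wrapper are hypothetical names.
module gpio_in_sync_sketch (
    input  logic       clk,
    input  logic       rst_n,
    input  logic [7:0] gpio_in,      // asynchronous pad inputs
    output logic [7:0] gpio_in_sync  // safe to use in the clk domain
);
    logic [7:0] gpio_in_meta;        // first stage, may go metastable

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            gpio_in_meta <= 8'h00;
            gpio_in_sync <= 8'h00;
        end else begin
            gpio_in_meta <= gpio_in;       // stage 1: catch the async value
            gpio_in_sync <= gpio_in_meta;  // stage 2: sample the settled copy
        end
    end
endmodule
```

Each bit is synchronized on its own, so when several pins change at once the synchronized byte can briefly show a mix of old and new bits. That is acceptable for slowly changing GPIO levels, but it is the reason this pattern is not used on its own for buses that must be read atomically.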
The RTL
About 200 lines. The interesting structures:
// ---- two-FF synchronizers ----
// sck, cs_n, mosi all enter the clk domain through two flops in
// series. The first flop catches the asynchronous transition (and
// may go metastable); the second flop samples it one full clock period
// later, by which time it has almost certainly resolved to a valid level.
logic [1:0] sck_sync;
logic [1:0] cs_sync;
logic [1:0] mosi_sync;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
sck_sync <= 2'b00;
cs_sync <= 2'b11; // de-asserted on reset (cs_n = 1)
mosi_sync <= 2'b00;
end else begin
sck_sync <= {sck_sync[0], sck};
cs_sync <= {cs_sync[0], cs_n};
mosi_sync <= {mosi_sync[0], mosi};
end
end
logic sck_q = 1'b0; // one more cycle of history for edge-detect
logic cs_n_q = 1'b1;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
sck_q <= 1'b0;
cs_n_q <= 1'b1;
end else begin
sck_q <= sck_sync[1];
cs_n_q <= cs_sync[1];
end
end
wire sck_rise = ~sck_q & sck_sync[1];
wire sck_fall = sck_q & ~sck_sync[1];
wire cs_falling = cs_n_q & ~cs_sync[1]; // cs_n: 1 → 0
wire cs_active = ~cs_sync[1];
wire mosi_in = mosi_sync[1];
The bit walker is just an arithmetic countdown:
// bit_idx counts down from 15 to 0, one step per sck rise,
// while cs is active.
logic [3:0] bit_idx; // 15..0 ; reset to 4'd15 at frame start
logic [15:0] shift_in; // accumulated bits
logic [7:0] cmd_byte; // the upper byte once it's been clocked in
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
bit_idx <= 4'd15;
shift_in <= 16'h0000;
end else if (cs_falling) begin
bit_idx <= 4'd15;
shift_in <= 16'h0000;
end else if (cs_active && sck_rise) begin
shift_in <= {shift_in[14:0], mosi_in};
bit_idx <= bit_idx - 4'd1;
end
end
// Latch the command byte the moment bit 8 finishes clocking in.
// After that, ADDR = cmd_byte[6:0], R/W = cmd_byte[7].
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) cmd_byte <= 8'h00;
end
And the read path drives MISO from a small shift register loaded the moment the command byte is fully clocked in:
// Load shift_out on the first sck-fall after cmd_byte is captured.
// Separating the cmd_byte capture and the shift_out load by half an sck
// period means cmd_byte (and thus is_read / addr) is stable when we load.
// After the load, every subsequent sck-fall shifts shift_out left so
// the MSB pumps out onto miso.
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) shift_out <= 8'h00;
else if (cs_active && sck_fall && bit_idx == 4'd7 && is_read) begin
unique case (addr)
7'h00: shift_out <= gpio_oe_q;
7'h01: shift_out <= gpio_out_q;
7'h02: shift_out <= gpio_in_sync;
default: shift_out <= 8'h00;
endcase
end else if (cs_active && sck_fall && bit_idx <= 4'd6 && is_read) begin
shift_out <= {shift_out[6:0], 1'b0};
end
end
// MISO drives the MSB of shift_out only during the read-data half.
// When idle, drive 0 (a real chip would tristate; sky130 stdcell
// libs do not have routable tristate buffers from the synth flow,
// so we drive low).
assign miso = (cs_active && is_read && bit_idx <= 4'd7) ? shift_out[7] : 1'b0;
The trickiest piece in this whole module is the timing relationship
between cmd_byte (captured on the 8th sck-rise) and shift_out
(loaded on the next sck-fall). The first version of the code tried
to load shift_out on the same edge that captured cmd_byte and read all
zeroes, because is_read (= cmd_byte[7]) was still pointing at the
previous frame’s value. Pushing the load to the next half-clock fixes
this — cmd_byte is fully settled by the time we read it.
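The capture-then-load ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the capture fires on the sck_rise where bit_idx is still 8 and takes the incoming bit straight from mosi_in, and the module wrapper and the load_shift_out strobe are hypothetical names rather than the project's RTL.

```systemverilog
// Sketch only: cmd_capture_sketch and load_shift_out are hypothetical.
module cmd_capture_sketch (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        cs_active,
    input  logic        sck_rise,
    input  logic        sck_fall,
    input  logic        mosi_in,
    input  logic [3:0]  bit_idx,    // 15..0, decremented on each sck_rise
    input  logic [15:0] shift_in,   // bits received so far
    output logic [7:0]  cmd_byte,
    output logic        is_read,
    output logic [6:0]  addr,
    output logic        load_shift_out
);
    // 8th sck rise: bit_idx is still 8 and shift_in holds only 7 bits,
    // so the 8th bit is taken from mosi_in directly.
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            cmd_byte <= 8'h00;
        else if (cs_active && sck_rise && bit_idx == 4'd8)
            cmd_byte <= {shift_in[6:0], mosi_in};
    end

    assign is_read = cmd_byte[7];    // 1 = read, 0 = write
    assign addr    = cmd_byte[6:0];

    // Half an sck period later bit_idx has moved to 7 and cmd_byte is
    // stable, so is_read and addr can be trusted when shift_out loads.
    assign load_shift_out = cs_active && sck_fall && bit_idx == 4'd7 && is_read;
endmodule
```

Loading on the same clk edge as the capture would evaluate is_read from the old cmd_byte, because the nonblocking assignment has not taken effect yet; that is the all-zeroes symptom of the first version.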
The testbench
Seven SPI transactions: write OE, write OUT (twice), then read all three registers, including a re-read of GPIO_IN after its value changes.
int errors = 0;
// ---- spi_xfer: clock 16 bits, mosi=tx[15:0], rx=miso latched on sck rise ----
task automatic spi_xfer(input [15:0] tx, output [15:0] rx);
integer i;
begin
sck = 0;
mosi = tx[15];
cs_n = 0;
#(CS_SETUP);
rx = 16'h0000;
for (i = 15; i >= 0; i = i - 1) begin
// present mosi on falling-clock half (sck currently low)
mosi = tx[i];
#(SCK_HALF);
sck = 1;
// sample miso just after sck rise
rx[i] = miso;
#(SCK_HALF);
sck = 0;
end
#(CS_SETUP);
cs_n = 1;
mosi = 0;
// give the DUT a few cycles to commit the write
#(CS_SETUP * 2);
end
endtask
// ---- spi_write / spi_read helpers ----
task automatic spi_write(input [6:0] addr, input [7:0] data);
logic [15:0] rx;
begin
spi_xfer({1'b0, addr, data}, rx);
end
endtask
task automatic spi_read(input [6:0] addr, output [7:0] data);
logic [15:0] rx;
$ make test PROJECT=04_spi_gpio_peripheral
[310000] write GPIO_OE = 0xFF gpio_oe after w: 0xff OK
[4310000] write GPIO_OUT = 0x55 gpio_out after w: 0x55 OK
[8310000] write GPIO_OUT = 0xA5 gpio_out after w: 0xa5 OK
[12310000] read GPIO_OE read GPIO_OE: 0xff OK
[16310000] read GPIO_OUT read GPIO_OUT: 0xa5 OK
[20370000] read GPIO_IN (drive=0x33) read GPIO_IN: 0x33 OK
[24430000] read GPIO_IN (drive=0xCC) read GPIO_IN: 0xcc OK
PASS: all SPI transactions verified.
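Every transaction in the log is one 16-bit frame: R/W bit, 7-bit address, 8-bit data. A minimal sketch of the packing, assuming the data byte sent during a read is a don't-care (0x00 here); spi_frame and the wrapper module are hypothetical names used only for illustration.

```systemverilog
// Sketch only: spi_frame and spi_frame_sketch are hypothetical names.
module spi_frame_sketch;
    // Pack one frame: [15] R/W (1 = read), [14:8] address, [7:0] data.
    function automatic logic [15:0] spi_frame(
        input logic       is_read,
        input logic [6:0] addr,
        input logic [7:0] data
    );
        return {is_read, addr, data};
    endfunction

    initial begin
        // Write 0xA5 to GPIO_OUT (0x01), then read GPIO_IN (0x02).
        $display("write frame = 0x%04h", spi_frame(1'b0, 7'h01, 8'hA5));
        $display("read  frame = 0x%04h", spi_frame(1'b1, 7'h02, 8'h00));
    end
endmodule
```

The two frames work out to 0x01a5 and 0x8200. On a read, the reply lands in rx[7:0] of spi_xfer, since the DUT only drives miso during the last eight bits of the frame.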
Watching it do something
The verifying testbench checks the writes and reads and prints an OK line for each.
A second testbench, tb_demo.sv, treats the chip the way a Linux
user would: it pretends to be a microcontroller driving the chip from its SPI master,
runs a short script of register accesses, and prints what the GPIO
pins look like after each one. Output of make demo PROJECT=04_spi_gpio_peripheral:
[chip ] -- librelane-playground / project 04 / SPI GPIO peripheral --
[chip ] reset released. cs_n=1 (idle).
[host ] WR GPIO_OE (0x00) <- 0xff
[chip ] gpio_oe=0xff gpio_out=0x00 gpio_in=0x00
[host ] WR GPIO_OUT (0x01) <- 0x55
[chip ] gpio_oe=0xff gpio_out=0x55 gpio_in=0x00
[host ] WR GPIO_OUT (0x01) <- 0xa5
[chip ] gpio_oe=0xff gpio_out=0xa5 gpio_in=0x00
[host ] RD GPIO_OE (0x00) -> 0xff
[host ] RD GPIO_OUT (0x01) -> 0xa5
[host ] RD GPIO_IN (0x02) -> 0x33
[host ] RD GPIO_IN (0x02) -> 0xcc
[host ] WR GPIO_OE (0x00) <- 0x00
[chip ] gpio_oe=0x00 gpio_out=0xa5 gpio_in=0xcc
Each [host ] line is a 16-bit SPI frame. Each [chip ] line is
the GPIO pad ring as a logic analyser would draw it on the next
clock edge. Watch how the writes mutate the bottom row of the
diagram, the reads don’t, and the very last write turns all output
drivers off — at that point the pins float (in a real chip with
real tristate; here they hold last-driven value because sky130’s
high-density library doesn’t expose tristate buffers from the synth
flow). Same shape every microcontroller-driven SPI peripheral has
been showing for thirty years.
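With real tristate pads, that last write would release the pins. A minimal behavioural sketch of what such a pad wrapper could look like, assuming a per-bit output enable; gpio_pad_sketch and the pad port are hypothetical and not part of this project, which has no tristate cells available from the synth flow.

```systemverilog
// Sketch only: behavioural tristate model, not part of the project's RTL.
module gpio_pad_sketch (
    input  logic [7:0] gpio_oe,   // 1 = drive the pin, 0 = release it
    input  logic [7:0] gpio_out,  // value to drive when enabled
    output logic [7:0] gpio_in,   // level seen on the pin
    inout  wire  [7:0] pad
);
    for (genvar i = 0; i < 8; i++) begin : g_pad
        // Drive the pad when enabled, otherwise leave it high-impedance.
        assign pad[i] = gpio_oe[i] ? gpio_out[i] : 1'bz;
    end

    assign gpio_in = pad;  // read back whatever is on the pin
endmodule
```

In simulation, writing 0x00 to GPIO_OE with this wrapper makes every pad read as z unless something else drives it.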
What LibreLane did differently
Compared to P03:
- Setup slack got more comfortable, not less. P03 was +1.08 ns; P04 is +3.78 ns. Surprising at first, since a bigger design should be harder, right? But P04’s longest combinational path is just a 7-bit comparator on the address, while P03’s was the next-state decoder for the FSM. Wide, shallow logic beats narrow, deep logic on timing almost every time.
- Max-slew is back. 4 violations, all in the slow PVT corner. These are likely on the long wires from the synchronizer flops to wherever the synchronized signals get used (esp. cs_active, which fans out to the whole frame state). LibreLane’s resizer didn’t add a buffer here. We could fix it with MAX_TRANSITION set lower or with explicit dont_touch on the synchronizer outputs to force a cleaner topology.
- Pin layout is a story. Open the chip viewer above and rotate it. The west edge is two pins (clk, rst_n). The north edge is the SPI cluster (cs_n, sck, mosi, miso), clearly grouped, the way you’d run them off a microcontroller’s SPI block. The south edge is the wide GPIO bus. The east edge is empty. This is what a pin_order.cfg gets you over a free-floating placer.
What just happened?
Two clocks and one chip. Five synchronizer chains. ~10 KGE worth of silicon. Three internal registers and a shift register that moves data into and out of them under the host’s clock. This is the register-mapped-peripheral pattern that every SoC ever built reuses hundreds of times — STM32 has dozens of these things, the RP2040 has ~25, ARM SoCs ship with hundreds. Most of them are wider, faster, have a fancier interrupt model, but the bones are the same: an external interface, a synchronizer wall, a register decoder, and a backing data store.
By project 06 there will be a small CPU on this ladder. This is the thing it’ll talk to.
See also
- Project 03 — UART transmitter, our first protocol on a single clock.
- Project 05 → builds the datapath side: ALU, register file, sequencing — the pieces a CPU is made of.
- Project README — full lesson plan.