No. 04 / project of 147 on the ladder

SPI GPIO peripheral

introduces — clock-domain crossing, two-FF synchronizers, memory-mapped registers, multi-edge pin placement

harden state
  • last run: 2026-04-28
  • cells: 3,217 (non-filler)
  • slack: 3.78 ns (setup)
  • area: 16,900 μm² (die) / 13,988 μm² (core)
  • signoff: DRC PASS, LVS PASS, antenna PASS

The first project where two clocks meet on the same chip and disagree about what time it is. An SPI host out there in the world drives cs_n, sck, and mosi on its own clock; we sit on our 100 MHz internal clock and have to read those signals reliably without going insane on metastability. The standard solution — two-flop synchronizers on every async input — is one of the unglamorous core skills of digital design. We use it five times in this design.

This is also the first project that looks like a memory-mapped peripheral: a small register file, an SPI command frame that selects a register and either reads or writes it, and eight GPIO pins exposed through those registers. It’s the same shape as a real SoC peripheral, just shrunk down to three registers and 8 pins.

[layout viewer · sky130A · 130 × 130 μm die · pins on three edges]
[3D viewer · sky130A · full sky130 stack · z exaggerated 10×]

Wire length (estimated): 7,767 μm, 1.7× P03. Power: 724 μW. Four max-slew warnings in the slow process corner, the same kind of flaky check that P02 hit; not a hard failure.

What’s new vs. project 03

  • Asynchronous external interface. UART (P03) was driven by our clk, with host and peripheral sharing a reference somewhere upstream. SPI is the opposite: the master brings its own clock, and we sample it. That sampling is where the interesting part lives.
  • Two-flop synchronizers. Five of them in this design — the standard fix for CDC. Every async input (cs_n, sck, mosi, plus all 8 bits of gpio_in) goes through two back-to-back flops in the chip-clock domain.
  • Edge detection. With sck synchronized, we detect its rising and falling edges as 1-cycle pulses (sck_rise, sck_fall) and use those as clock enables, the same trick as the baud counter in P03. We never use sck as a real clock anywhere.
  • Memory-mapped register file. A 7-bit address selects one of three 8-bit registers; the R/W bit selects direction. Same shape as a real SoC peripheral, just very tiny.
  • Multi-edge pin placement. Three edges in use: control on the west, SPI cluster on the north, GPIO bus on the south. Easier to spot in the layout viewer than in P03 where we only used east+west.

How clock-domain crossing actually works

Take a flop with its data input wired to a signal from a different clock domain. If the data changes too close to the rising clock edge, the flop violates setup or hold time — and the output goes metastable: a voltage that’s neither valid 0 nor valid 1, hovering in the forbidden middle while the flop’s internal feedback loop tries to make up its mind.

Metastability resolves probabilistically. After one clock period, the chance the flop is still ambiguous is some small number, call it p (typically 10⁻⁶ to 10⁻³). After two periods, it’s p². The trick of the two-flop synchronizer is that the second flop only ever sees the first flop’s output, which has had a full clock period to settle, so the probability the second flop sees an invalid value is p², and 10⁻⁶ squared is “your chip will be fine for the heat death of the universe.”
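Why squaring is the right operation: a common first-order model (an assumption here, not something this project measures) has the chance of a flop still being unresolved decay exponentially with settling time. The symbols T_clk (one clock period) and τ (a device-dependent resolution time constant) are introduced only for this sketch.

```latex
p = e^{-T_{\mathrm{clk}}/\tau}
\quad\Longrightarrow\quad
P(\text{unresolved after } 2T_{\mathrm{clk}})
  = e^{-2T_{\mathrm{clk}}/\tau} = p^{2},
\qquad
p = 10^{-6} \;\Rightarrow\; p^{2} = 10^{-12}.
```

Under that model, every extra period of settling time multiplies the failure probability by another factor of p.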

[Figure: async input → FF (clk) → FF (clk) → stable signal in the clk domain]
Two-flop synchronizer. The first flop catches the asynchronous edge (and may go metastable); the second sees only stable values.

We do this for every wire crossing in: cs_n, sck, mosi, and each bit of gpio_in. Five chains × 2 flops = 10 synchronizer flops, plus history flops for edge detection.
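The same pattern can be packaged as a small reusable module, which is one way the gpio_in bus could be handled. This is a minimal sketch, not the code in src/top.sv: the module name sync_2ff, the parameters WIDTH and RESET_VALUE, and the port names are all made up for illustration.

```systemverilog
// Hypothetical helper, not taken from src/top.sv: a reusable two-flop
// synchronizer. WIDTH, RESET_VALUE and all port names are illustrative.
module sync_2ff #(
  parameter int               WIDTH       = 1,
  parameter logic [WIDTH-1:0] RESET_VALUE = '0
) (
  input  logic             clk,
  input  logic             rst_n,
  input  logic [WIDTH-1:0] async_in,   // driven from another clock domain
  output logic [WIDTH-1:0] sync_out    // safe to use in the clk domain
);
  logic [WIDTH-1:0] meta;  // first stage: the only flop that may go metastable

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      meta     <= RESET_VALUE;
      sync_out <= RESET_VALUE;
    end else begin
      meta     <= async_in;  // catches the asynchronous transition
      sync_out <= meta;      // sees a value that had one period to settle
    end
  end
endmodule

// Possible instantiation for an 8-bit input bus (signal names assumed):
//   sync_2ff #(.WIDTH(8)) u_gpio_in_sync (
//     .clk(clk), .rst_n(rst_n),
//     .async_in(gpio_in), .sync_out(gpio_in_sync)
//   );
```

Each bit is synchronized independently, so the bits of a bus that changes all at once may land on different clk cycles. That is fine for a GPIO snapshot, but not for multi-bit values that must stay coherent, such as a counter.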

The RTL

About 200 lines. The interesting structures:

projects/04_spi_gpio_peripheral/src/top.sv system-verilog · L64-101

  // ---- two-FF synchronizers ----
  // sck, cs_n, mosi all enter the clk domain through two flops in
  // series. The first flop catches the asynchronous transition (and
  // may go metastable); the second flop sees a stable value because
  // the odds of staying metastable shrink exponentially with settling time.
  logic [1:0] sck_sync;
  logic [1:0] cs_sync;
  logic [1:0] mosi_sync;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      sck_sync  <= 2'b00;
      cs_sync   <= 2'b11;     // de-asserted on reset (cs_n = 1)
      mosi_sync <= 2'b00;
    end else begin
      sck_sync  <= {sck_sync[0],  sck};
      cs_sync   <= {cs_sync[0],   cs_n};
      mosi_sync <= {mosi_sync[0], mosi};
    end
  end

  logic sck_q   = 1'b0;       // one more cycle of history for edge-detect
  logic cs_n_q  = 1'b1;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      sck_q  <= 1'b0;
      cs_n_q <= 1'b1;
    end else begin
      sck_q  <= sck_sync[1];
      cs_n_q <= cs_sync[1];
    end
  end

  wire sck_rise   = ~sck_q & sck_sync[1];
  wire sck_fall   =  sck_q & ~sck_sync[1];
  wire cs_falling =  cs_n_q & ~cs_sync[1];   // cs_n: 1 → 0
  wire cs_active  = ~cs_sync[1];
  wire mosi_in    =  mosi_sync[1];
src/top.sv — synchronizer chains and edge detection.

The bit walker is just an arithmetic countdown:

projects/04_spi_gpio_peripheral/src/top.sv system-verilog · L106-127
  // while cs is active.
  logic [3:0] bit_idx;        // 15..0 ; reset to 4'd15 at frame start
  logic [15:0] shift_in;      // accumulated bits
  logic [7:0]  cmd_byte;      // the upper byte once it's been clocked in

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      bit_idx  <= 4'd15;
      shift_in <= 16'h0000;
    end else if (cs_falling) begin
      bit_idx  <= 4'd15;
      shift_in <= 16'h0000;
    end else if (cs_active && sck_rise) begin
      shift_in <= {shift_in[14:0], mosi_in};
      bit_idx  <= bit_idx - 4'd1;
    end
  end

  // Latch the command byte the moment bit 8 finishes clocking in.
  // After that, ADDR = cmd_byte[6:0], R/W = cmd_byte[7].
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                                              cmd_byte <= 8'h00;
src/top.sv — bit_idx walks 15 down to 0 across each frame, shifting MOSI in on every detected sck rise.
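To make the frame layout concrete, here is a small simulation-only sketch that packs a write frame and replays it through the same shift-and-count scheme. It assumes the layout the comments describe (R/W in bit 15 with 1 meaning read, 7-bit address, 8-bit data); spi_frame_pkg, pack_frame, and spi_frame_demo are hypothetical names, and the capture point is a behavioural stand-in for the cmd_byte latch rather than a copy of it.

```systemverilog
// Illustrative only: spi_frame_pkg, pack_frame and spi_frame_demo are
// hypothetical names, not part of the project sources.
package spi_frame_pkg;
  // Frame layout, MSB first: [15] R/W (1 = read), [14:8] addr, [7:0] data.
  function automatic logic [15:0] pack_frame(
    input logic       is_read,
    input logic [6:0] addr,
    input logic [7:0] data
  );
    return {is_read, addr, data};
  endfunction
endpackage

module spi_frame_demo;
  import spi_frame_pkg::*;

  logic [15:0] frame;
  logic [15:0] shift_in;
  logic [3:0]  bit_idx;
  logic [7:0]  cmd_byte;

  initial begin
    frame    = pack_frame(1'b0, 7'h01, 8'hA5);  // write 0xA5 to address 0x01
    shift_in = 16'h0000;
    bit_idx  = 4'd15;
    cmd_byte = 8'h00;

    // One loop iteration stands in for one detected sck rise.
    for (int i = 15; i >= 0; i--) begin
      shift_in = {shift_in[14:0], frame[i]};
      bit_idx  = bit_idx - 4'd1;
      // After eight bits, the low byte of shift_in is the command byte.
      if (bit_idx == 4'd7) cmd_byte = shift_in[7:0];
    end

    $display("frame=0x%04h rw=%0b addr=0x%02h data=0x%02h",
             frame, cmd_byte[7], cmd_byte[6:0], shift_in[7:0]);
    if (cmd_byte !== 8'h01 || shift_in[7:0] !== 8'hA5)
      $error("frame decode mismatch");
    $finish;
  end
endmodule
```

After all sixteen bits, the 4-bit bit_idx has wrapped from 0 back to 15, which is also the value it is reset to at the start of the next frame.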

And the read path drives MISO from a small shift register loaded the moment the command byte is fully clocked in:

projects/04_spi_gpio_peripheral/src/top.sv system-verilog · L180-206
  end

  // Load shift_out on the first sck-fall after cmd_byte is captured.
  // Shifting cmd_byte capture and shift_out load apart by one half-sck
  // means cmd_byte (and thus is_read / addr) are stable when we load.
  // After the load, every subsequent sck-fall shifts shift_out left so
  // the MSB pumps out onto miso.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                                                shift_out <= 8'h00;
    else if (cs_active && sck_fall && bit_idx == 4'd7 && is_read) begin
      unique case (addr)
        7'h00:   shift_out <= gpio_oe_q;
        7'h01:   shift_out <= gpio_out_q;
        7'h02:   shift_out <= gpio_in_sync;
        default: shift_out <= 8'h00;
      endcase
    end else if (cs_active && sck_fall && bit_idx <= 4'd6 && is_read) begin
      shift_out <= {shift_out[6:0], 1'b0};
    end
  end

  // MISO drives the MSB of shift_out only during the read-data half.
  // When idle, drive 0 (a real chip would tristate; sky130 stdcell
  // libs do not have routable tristate buffers from the synth flow,
  // so we drive low).
  assign miso = (cs_active && is_read && bit_idx <= 4'd7) ? shift_out[7] : 1'b0;
src/top.sv — MISO shift-out, loaded on the first sck-fall after the cmd byte.

The trickiest piece in this whole module is the timing relationship between cmd_byte (captured on the 8th sck-rise) and shift_out (loaded on the next sck-fall). The first version of the code tried to load shift_out on the same edge that captured cmd_byte and read all zeroes, because is_read (= cmd_byte[7]) was still pointing at the previous frame’s value. Pushing the load to the next half-clock fixes this — cmd_byte is fully settled by the time we read it.
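The underlying hazard is the usual nonblocking-assignment one: a register written on a clock edge still shows its old value to everything else sampled on that same edge. A stripped-down simulation sketch reproduces both behaviours. The signals capture and load are stand-ins invented for this example (they are not in src/top.sv), and the load here happens one clk edge later, whereas the real design waits for the next sck_fall.

```systemverilog
// Simulation-only sketch; module and signal names are hypothetical.
module stale_read_demo;
  logic       clk       = 1'b0;
  logic       capture   = 1'b0;   // stands in for "8th sck_rise"
  logic       load      = 1'b0;   // stands in for the shift_out load enable
  logic [7:0] cmd_byte  = 8'h00;  // previous frame was a write (bit 7 = 0)
  logic [7:0] shift_out = 8'h00;
  wire        is_read   = cmd_byte[7];

  always #5 clk = ~clk;

  // New frame is a read: bit 7 of the command byte is 1.
  always @(posedge clk) if (capture) cmd_byte <= 8'h80;

  // Register 0 is pretended to hold 0xFF; a write frame loads nothing useful.
  always @(posedge clk) if (load) shift_out <= is_read ? 8'hFF : 8'h00;

  initial begin
    // Case 1: load on the same edge as the capture. is_read is still the
    // previous frame's value when it is sampled.
    @(negedge clk); capture = 1'b1; load = 1'b1;
    @(negedge clk); capture = 1'b0; load = 1'b0;
    $display("same edge: shift_out = 0x%02h", shift_out);
    if (shift_out !== 8'h00) $error("expected the stale (all-zero) load");

    // Case 2: rewind, then load one edge after the capture.
    cmd_byte = 8'h00;
    @(negedge clk); capture = 1'b1;
    @(negedge clk); capture = 1'b0; load = 1'b1;
    @(negedge clk); load = 1'b0;
    $display("next edge: shift_out = 0x%02h", shift_out);
    if (shift_out !== 8'hFF) $error("expected the register value");
    $finish;
  end
endmodule
```

The stimulus changes on falling clk edges so that the testbench itself does not race with the flops it is exercising.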

The testbench

Seven SPI transactions: write OE, write OUT (twice), then read OE, OUT, and IN, with IN read a second time after the driven GPIO_IN value changes.

projects/04_spi_gpio_peripheral/test/tb.sv system-verilog · L56-95

  int errors = 0;

  // ---- spi_xfer: clock 16 bits, mosi=tx[15:0], rx=miso latched on sck rise ----
  task automatic spi_xfer(input [15:0] tx, output [15:0] rx);
    integer i;
    begin
      sck  = 0;
      mosi = tx[15];
      cs_n = 0;
      #(CS_SETUP);
      rx = 16'h0000;
      for (i = 15; i >= 0; i = i - 1) begin
        // present mosi on falling-clock half (sck currently low)
        mosi = tx[i];
        #(SCK_HALF);
        sck = 1;
        // sample miso just after sck rise
        rx[i] = miso;
        #(SCK_HALF);
        sck = 0;
      end
      #(CS_SETUP);
      cs_n = 1;
      mosi = 0;
      // give the DUT a few cycles to commit the write
      #(CS_SETUP * 2);
    end
  endtask

  // ---- spi_write / spi_read helpers ----
  task automatic spi_write(input [6:0] addr, input [7:0] data);
    logic [15:0] rx;
    begin
      spi_xfer({1'b0, addr, data}, rx);
    end
  endtask

  task automatic spi_read(input [6:0] addr, output [7:0] data);
    logic [15:0] rx;
tb.sv — the SPI master tasks.
$ make test PROJECT=04_spi_gpio_peripheral
[310000]  write GPIO_OE = 0xFF      gpio_oe  after w: 0xff OK
[4310000] write GPIO_OUT = 0x55     gpio_out after w: 0x55 OK
[8310000] write GPIO_OUT = 0xA5     gpio_out after w: 0xa5 OK
[12310000] read GPIO_OE             read GPIO_OE: 0xff OK
[16310000] read GPIO_OUT            read GPIO_OUT: 0xa5 OK
[20370000] read GPIO_IN  (drive=0x33)   read GPIO_IN: 0x33 OK
[24430000] read GPIO_IN  (drive=0xCC)   read GPIO_IN: 0xcc OK
PASS: all SPI transactions verified.

Watching it do something

The verifying testbench checks the writes/reads with OK lines. A second testbench, tb_demo.sv, treats the chip the way a Linux user would: pretend to be a microcontroller wired to its SPI master, run a short script of register accesses, and print what the GPIO pins look like after each one. make demo PROJECT=04_spi_gpio_peripheral:

[chip ] -- librelane-playground / project 04 / SPI GPIO peripheral --
[chip ] reset released. cs_n=1 (idle).

[host ] WR  GPIO_OE  (0x00)  <-  0xff
[chip ] gpio_oe=0xff  gpio_out=0x00  gpio_in=0x00
[host ] WR  GPIO_OUT (0x01)  <-  0x55
[chip ] gpio_oe=0xff  gpio_out=0x55  gpio_in=0x00

[host ] WR  GPIO_OUT (0x01)  <-  0xa5
[chip ] gpio_oe=0xff  gpio_out=0xa5  gpio_in=0x00

[host ] RD  GPIO_OE  (0x00)  ->  0xff
[host ] RD  GPIO_OUT (0x01)  ->  0xa5

[host ] RD  GPIO_IN  (0x02)  ->  0x33
[host ] RD  GPIO_IN  (0x02)  ->  0xcc

[host ] WR  GPIO_OE  (0x00)  <-  0x00
[chip ] gpio_oe=0x00  gpio_out=0xa5  gpio_in=0xcc

Each [host ] line is a 16-bit SPI frame. Each [chip ] line is the GPIO pad ring as a logic analyser would draw it on the next clock edge. Watch how the writes change the [chip ] state, the reads don’t, and the very last write turns all output drivers off. At that point the pins would float in a real chip with real tristate; here they hold the last-driven value because sky130’s high-density library doesn’t expose tristate buffers from the synth flow. It is the same shape every microcontroller-driven SPI peripheral has been showing for thirty years.
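For comparison, this is roughly what the pad side would look like on a target that does have tristate drivers. It is a hypothetical sketch, not part of this project: the module name, the gpio_pad port, and the idea of wrapping the registers this way are all assumptions.

```systemverilog
// Hypothetical pad-side wrapper for a target with real tristate drivers.
// gpio_tristate_sketch and gpio_pad are illustrative names only.
module gpio_tristate_sketch (
  input  logic [7:0] gpio_oe,    // 1 = drive the pad, 0 = release it
  input  logic [7:0] gpio_out,   // value driven when the enable bit is set
  output logic [7:0] gpio_in,    // raw pad value, still asynchronous
  inout  wire  [7:0] gpio_pad
);
  // One tristate driver per pin: high-impedance when its OE bit is 0.
  for (genvar i = 0; i < 8; i++) begin : g_pad
    assign gpio_pad[i] = gpio_oe[i] ? gpio_out[i] : 1'bz;
  end

  // The read-back path always sees whatever is on the pad.
  assign gpio_in = gpio_pad;
endmodule
```

With gpio_oe written to 0x00, every pad is released and whatever else is on the wire decides its level. The read-back value would still need to go through a synchronizer before it is used in the clk domain.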

What LibreLane did differently

Compared to P03:

  • Setup slack got more comfortable, not less. P03 was +1.08 ns; P04 is +3.78 ns. Surprising at first: a bigger design should be harder, right? But P04’s longest combinational path is just a 7-bit comparator on the address, while P03’s was the next-state decoder for the FSM. Wide, shallow logic beats narrow, deep logic almost every time.
  • Max-slew is back. 4 violations, all in the slow PVT corner. These are likely on the long wires from the synchronizer flops to wherever the synchronized signals get used (esp. cs_active, which fans out to the whole frame state). LibreLane’s resizer didn’t add a buffer here. We could fix it with MAX_TRANSITION set lower or with explicit dont_touch on the synchronizer outputs to force a cleaner topology.
  • Pin layout is a story. Open the chip viewer above and rotate it. The west edge is two pins (clk, rst_n). The north edge is the SPI cluster (cs_n, sck, mosi, miso) — clearly grouped, the way you’d run them off a microcontroller’s SPI block. The south edge is the wide GPIO bus. The east edge is empty. This is what a pin_order.cfg gets you over a free-floating placer.

What just happened?

Two clocks and one chip. Five synchronizer chains. ~10 KGE worth of silicon. Three internal registers and a shift register that moves data into and out of them under the host’s clock. This is the register-mapped-peripheral pattern that every SoC ever built reuses hundreds of times — STM32 has dozens of these things, the RP2040 has ~25, ARM SoCs ship with hundreds. Most of them are wider, faster, have a fancier interrupt model, but the bones are the same: an external interface, a synchronizer wall, a register decoder, and a backing data store.

By project 06 there will be a small CPU on this ladder. This is the thing it’ll talk to.

See also

  • Project 03: UART transmitter, our first protocol on a single clock.
  • Project 05: builds the datapath side (ALU, register file, sequencing), the pieces a CPU is made of.
  • Project README — full lesson plan.