No. 03 / project 3 of 147 on the ladder

UART transmitter

introduces — explicit FSM, baud generation, the first protocol, pin ordering

harden state: last run 2026-04-28
cells: 2,374 (non-filler)
slack: 1.08 ns (setup)
area: 12100 μm² (die) / 9774 μm² (core)
signoff
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

The first project on the ladder that talks to the outside world. UART (Universal Asynchronous Receiver/Transmitter) is the simplest serial protocol in common use: one wire, no clock alongside it, both ends agree on a bit rate and the receiver samples in the middle of each bit. We only build the TX half here.

This is also the first project with an explicit finite state machine: eleven states walking the line through one frame, IDLE → START → D0 → D1 → … → D7 → STOP → IDLE. The earlier projects were either purely combinational (P01) or independent counters and shift registers (P02). Here the output depends on which step of the protocol you are in, and that’s the defining feature of an FSM.

[Interactive layout viewer: sky130A, 110 × 110 μm die, 11-state FSM, 16-bit baud counter]
[Interactive 3D viewer: full sky130 stack, z exaggerated 10×, 88k shapes]

Wire length (estimated): 4,501 μm, 1.7× P02. Power: 851 μW — the UART’s continuous toggling on the baud counter and shifter dominates. All max-slew / max-fanout / max-cap checks pass in every process corner.

What’s new vs. project 02

  • Typed FSM. SystemVerilog typedef enum logic [3:0] { ... } with unique case for next-state and output decoding. Synthesis encodes it however it wants (binary, one-hot, gray) — the source declares intent, not encoding.
  • Clock-enable counter. The baud generator is a 16-bit countdown that reloads from baud_div. Its zero-detect output (bit_tick) acts as a clock enable for the FSM. We do not generate a slower clock; ASIC flows really do not want you generating clocks. One real clock, gated by enables, full stop.
  • Pin ordering. This is the first project where pin placement matters in the layout. UART has obvious sides — inputs feed in one side, the serial line leaves the other. We add pin_order.cfg to the LibreLane config so this is reproducible. Inputs (clk, rst_n, start, data, baud_div) cluster on the west edge; outputs (tx, busy, state_o) cluster on the east edge. You can see this directly in the layout viewer above — west pins are the strip on the left, east pins on the right.
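The clock-enable idea from the list above can be isolated from the UART entirely. What follows is a minimal sketch, not code from src/top.sv: the module name, the DIV parameter, and the slow_count output are all made up for illustration. It shows a countdown that produces a one-cycle tick, and a register that is still clocked by clk but only loads when tick is high.

```systemverilog
// Minimal clock-enable pattern (illustrative names, not from src/top.sv).
// One real clock; `tick` is high for one cycle every (DIV + 1) cycles.
module clock_enable_sketch #(
    parameter logic [15:0] DIV = 16'd4   // cycles per tick, minus 1
) (
    input  logic       clk,
    input  logic       rst_n,
    output logic       tick,             // 1-cycle enable pulse
    output logic [3:0] slow_count        // advances once per tick
);
  logic [15:0] cnt;

  assign tick = (cnt == 16'd0);

  // Free-running countdown: DIV, DIV-1, ..., 0, then reload.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)    cnt <= DIV;
    else if (tick) cnt <= DIV;
    else           cnt <= cnt - 16'd1;
  end

  // Clocked by clk, never by tick. The tick only decides whether the
  // flop takes a new value on this edge or holds its old one.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)    slow_count <= 4'd0;
    else if (tick) slow_count <= slow_count + 4'd1;
  end
endmodule
```

The alternative, writing always_ff @(posedge tick), would turn a data signal into a second clock that the flow then has to constrain and balance; the enable form keeps every flop on the single clock tree.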

How UART works (60-second primer)

A UART line is idle high. To send a byte:

  1. Drop the line low for one bit time. This is the start bit: the receiver times the rest of the frame from its falling edge.
  2. Send 8 data bits, LSB first, one bit time each. (LSB first is a convention; some protocols are MSB-first, UART is not.)
  3. Drive the line high for one bit time. This is the stop bit. It’s also “line idle again” so back-to-back frames work.
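The three steps above amount to wrapping the byte in a start bit and a stop bit and sending the result lowest bit first. As a rough sketch (the helper make_frame and the module around it are hypothetical, written only to show the bit order; the RTL in this project uses an FSM rather than a packed frame):

```systemverilog
// Illustrative only: build the 10-bit 8N1 frame for one byte and print
// it in the order the bits appear on the wire.
module frame_sketch;
  logic [9:0] frame;

  // Bit 0 of the result goes out first: {stop, d7..d0, start}.
  function automatic logic [9:0] make_frame(input logic [7:0] b);
    return {1'b1, b, 1'b0};
  endfunction

  initial begin
    frame = make_frame(8'h55);
    $write("wire order:");
    for (int i = 0; i < 10; i++) $write(" %b", frame[i]);
    $write("\n");
  end
endmodule
```

For 0x55 this prints the start bit, the eight data bits beginning with the LSB, then the stop bit: 0 1 0 1 0 1 0 1 0 1.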

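Both ends agree on a bit rate (the baud rate) and a frame format ahead of time, and nothing else travels with the data: no clock wire, no flow control, no error correction. Common rates are 9600, 115200, and 921600. They divide cleanly from traditional UART crystal frequencies (1.8432 MHz / 16 = 115200), though not necessarily from a round-number system clock.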

For a 50 MHz reference clock targeting 9600 baud, that’s 50_000_000 / 9600 = 5208.3 clocks per bit, which rounds to 5208. baud_div holds that count minus one, so baud_div = 5207. The testbench uses a much faster baud (5 cycles/bit) so the simulation finishes in microseconds, not milliseconds.
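The divisor arithmetic can be checked in a few lines of simulation-only SystemVerilog. This is a sketch under the assumption that the divisor is rounded to the nearest whole clock count; the module and parameter names are illustrative and do not come from the project sources.

```systemverilog
// Illustrative only: derive baud_div from a clock and a target baud rate.
module baud_div_sketch;
  localparam int CLK_HZ      = 50_000_000;
  localparam int BAUD        = 9600;
  // Round to the nearest whole number of clocks per bit.
  localparam int CYC_PER_BIT = (CLK_HZ + BAUD / 2) / BAUD;
  // The counter runs baud_div, baud_div-1, ..., 0, which is
  // baud_div + 1 clocks, so the input is the count minus one.
  localparam int BAUD_DIV    = CYC_PER_BIT - 1;

  initial begin
    $display("cycles per bit = %0d", CYC_PER_BIT);
    $display("baud_div       = %0d", BAUD_DIV);
    $display("actual baud    = %.2f", 1.0 * CLK_HZ / CYC_PER_BIT);
  end
endmodule
```

With these numbers the achieved rate is 50_000_000 / 5208, about 9600.61 baud, which is roughly 0.006% fast. UART links generally tolerate mismatches on the order of a few percent, so the rounding is harmless here.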

The RTL

The whole module is ~110 lines: enum, baud counter, data latch, next-state logic, output mux, busy/state outputs.

projects/03_uart_tx/src/top.sv system-verilog
// Project 03: UART transmitter.
//
// First protocol on the ladder, and the first explicit FSM. UART is the
// minimum-viable serial protocol: one wire, no clock, no flow control.
// Both ends agree on a baud rate and the receiver samples the line at
// roughly the middle of each bit. We only build the TX side here.
//
// Frame format (8N1 — 8 data bits, no parity, 1 stop bit, LSB first):
//
//          ___       _   _   _   _   _   _   _   _   ___________
//   idle      \_____/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/
//          ^   ^   ^   ^   ^   ^   ^   ^   ^   ^   ^
//          |   |   d0  d1  d2  d3  d4  d5  d6  d7  stop
//          |   start (always 0)
//          line idle = 1
//
// Baud generation
// ---------------
// `baud_div` is the number of clock cycles per UART bit, minus one. So
// for a 50 MHz clock and 9600 baud, baud_div = 50_000_000/9600 - 1 = 5207.
// We expose it as an input so the testbench can run at a fast baud
// without burning simulator time.
//
// FSM
// ---
//   IDLE  → wait for `start` to pulse, latch `data`, drop tx low
//   START → drive tx=0 for one bit time
//   D0..D7→ shift data out LSB first, one bit time each
//   STOP  → drive tx=1 for one bit time, then back to IDLE
//
// `busy` is high any time we are not in IDLE — host should not raise
// `start` again until `busy` falls.

`default_nettype none

module top (
    input  logic         clk,
    input  logic         rst_n,
    input  logic         start,        // pulse high for 1 cycle to begin a frame
    input  logic [7:0]   data,         // captured at the cycle `start` is sampled
    input  logic [15:0]  baud_div,     // clocks per bit, minus 1

    output logic         tx,           // serial output, idle = 1
    output logic         busy,         // 1 while a frame is in flight
    output logic [3:0]   state_o       // debug: which FSM state we are in
);

  // ---- FSM states ----
  // 11 states: IDLE, START, D0..D7, STOP. Encoded with 4-bit enum so the
  // synth tool can pick the encoding (binary, one-hot, etc.).
  typedef enum logic [3:0] {
    S_IDLE  = 4'd0,
    S_START = 4'd1,
    S_D0    = 4'd2,
    S_D1    = 4'd3,
    S_D2    = 4'd4,
    S_D3    = 4'd5,
    S_D4    = 4'd6,
    S_D5    = 4'd7,
    S_D6    = 4'd8,
    S_D7    = 4'd9,
    S_STOP  = 4'd10
  } state_t;

  state_t state, state_next;

  // ---- baud counter ----
  // Counts down from `baud_div` to 0. When it hits 0 and we're not idle,
  // it's a "bit tick" and we advance to the next FSM state. Re-loads
  // every tick.
  logic [15:0] baud_cnt;
  logic        bit_tick;
  assign bit_tick = (baud_cnt == 16'd0);

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)            baud_cnt <= 16'd0;
    else if (state == S_IDLE) baud_cnt <= baud_div;  // pre-load while idle
    else if (bit_tick)     baud_cnt <= baud_div;
    else                   baud_cnt <= baud_cnt - 16'd1;
  end

  // ---- data latch ----
  // Capture `data` the cycle we leave IDLE so the host doesn't have to
  // hold it for the whole frame.
  logic [7:0] data_q;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                                 data_q <= 8'h00;
    else if (state == S_IDLE && start)          data_q <= data;
  end

  // ---- next-state logic ----
  // Linear walk through the states. The only branch is IDLE → START on
  // `start`; everything else is "advance on bit_tick".
  always_comb begin
    state_next = state;
    unique case (state)
      S_IDLE:  if (start)    state_next = S_START;
      S_START: if (bit_tick) state_next = S_D0;
      S_D0:    if (bit_tick) state_next = S_D1;
      S_D1:    if (bit_tick) state_next = S_D2;
      S_D2:    if (bit_tick) state_next = S_D3;
      S_D3:    if (bit_tick) state_next = S_D4;
      S_D4:    if (bit_tick) state_next = S_D5;
      S_D5:    if (bit_tick) state_next = S_D6;
      S_D6:    if (bit_tick) state_next = S_D7;
      S_D7:    if (bit_tick) state_next = S_STOP;
      S_STOP:  if (bit_tick) state_next = S_IDLE;
      default:               state_next = S_IDLE;
    endcase
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) state <= S_IDLE;
    else        state <= state_next;
  end

  // ---- tx output ----
  // tx is decoded combinationally from `state` and `data_q`, both of
  // which are registered; there is no separate output flop. Idle = 1,
  // start bit = 0, data bits = data_q[i], stop bit = 1.
  always_comb begin
    unique case (state)
      S_IDLE:  tx = 1'b1;
      S_START: tx = 1'b0;
      S_D0:    tx = data_q[0];
      S_D1:    tx = data_q[1];
      S_D2:    tx = data_q[2];
      S_D3:    tx = data_q[3];
      S_D4:    tx = data_q[4];
      S_D5:    tx = data_q[5];
      S_D6:    tx = data_q[6];
      S_D7:    tx = data_q[7];
      S_STOP:  tx = 1'b1;
      default: tx = 1'b1;
    endcase
  end

  assign busy    = (state != S_IDLE);
  assign state_o = state;

endmodule

`default_nettype wire
src/top.sv: eleven-state UART TX FSM, stepped by the baud tick.

The bit_tick signal is the gear that meshes the baud counter with the FSM. Every state except IDLE advances exactly when bit_tick fires. When you read sequential RTL, find the clock enables first — they tell you the rhythm of the design before any of the state-by-state logic does.

The output mux is a flat case over 11 states. We could shift data_q right one position per bit-tick instead, but the explicit per-bit indexing makes waveforms easier to read in GTKWave: state D3 always drives data_q[3] onto the line, never some shifted-around state of an internal register.
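For comparison, the shift-register alternative mentioned above might look roughly like this. It is a sketch of the approach the project chose not to use; the module and its load and shift_en ports are hypothetical. In the real design, shift_en would presumably be bit_tick qualified by being in one of the data states.

```systemverilog
// Illustrative alternative, not the project's RTL: shift the latched
// byte right once per data bit so the bit due on the wire is always [0].
module tx_shift_sketch (
    input  logic       clk,
    input  logic       rst_n,
    input  logic       load,      // 1-cycle pulse: capture a new byte
    input  logic       shift_en,  // 1-cycle pulse at the end of each data bit
    input  logic [7:0] data,
    output logic       data_bit   // the data bit currently due on the wire
);
  logic [7:0] shreg;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)        shreg <= 8'hFF;
    else if (load)     shreg <= data;
    // Shift toward bit 0 and fill with 1s (the idle level).
    else if (shift_en) shreg <= {1'b1, shreg[7:1]};
  end

  assign data_bit = shreg[0];
endmodule
```

This replaces the 8-way data mux with a single tap, at the cost that shreg no longer holds the original byte mid-frame, which is exactly the waveform-readability trade-off described above.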

The testbench

Four canonical patterns: 0x55 (alternating), 0xA5 (uneven), 0x00 (all zeros — checks we never accidentally treat data zeros as start bits), 0xFF (all ones — checks we never accidentally drop the line when sending all-high data).

The decoder is a behavioural process that watches for the tx falling edge (start bit), then samples the line every bit time, mid-bit. It’s the same algorithm a real UART receiver runs in hardware.

projects/03_uart_tx/test/tb.sv system-verilog · L50-90
  localparam int CYC_PER_BIT = 5;
  localparam int CLK_NS      = 20;
  localparam int BIT_NS      = CYC_PER_BIT * CLK_NS;   // 100 ns

  // ---- decoder coroutine ----
  // Watches the TX line and prints / checks each frame as it arrives.
  // Triggered by the falling edge of `tx` (start bit). After the frame
  // is captured, fires `decoded_byte` / `decoded_valid` so the main
  // sequencer can compare.
  logic [7:0] decoded_byte;
  logic       decoded_valid;
  int         decode_errors = 0;

  initial begin
    decoded_valid = 0;
    forever begin
      // wait for start bit (TX falling edge while line was idle)
      @(negedge tx);
      // Sample mid-bit. The center of the start bit is 0.5 bit times after
      // the falling edge, and the center of D0 is one full bit time later,
      // so wait 1.5 bit times (BIT_NS + BIT_NS/2) to reach it. After that,
      // advance by BIT_NS to sample each subsequent bit.
      #(BIT_NS + BIT_NS/2);   // now in middle of D0
      for (int i = 0; i < 8; i++) begin
        decoded_byte[i] = tx;
        if (i < 7) #(BIT_NS);
      end
      // Advance to middle of stop bit, check it is high.
      #(BIT_NS);
      if (tx !== 1'b1) begin
        $display("FAIL: stop bit was %b, expected 1", tx);
        decode_errors++;
      end
      decoded_valid = 1;
      @(posedge clk); decoded_valid = 0;
    end
  end

  // ---- helper task: send + check one byte ----
  int sent_count = 0;
  task send_and_check(input [7:0] b);
tb.sv — the decoder coroutine. 1.5 bit times after the start-bit edge lands you in the middle of D0.
$ make test PROJECT=03_uart_tx
== 03_uart_tx ==
iverilog -g2012 -Wall -o tb.vvp -s tb ../src/top.sv tb.sv
[150000] sending 0x55 (01010101)
       decoded 0x55 OK
[1250000] sending 0xa5 (10100101)
       decoded 0xa5 OK
[2350000] sending 0x00 (00000000)
       decoded 0x00 OK
[3450000] sending 0xff (11111111)
       decoded 0xff OK
PASS: 4 frames sent, all decoded correctly.
03_uart_tx                       PASS

How the harden actually went

Better than P02. P02 produced 11 max-slew violations in the slow PVT corner; P03 produced zero violations of any kind. Same flow, same clock target, same standard-cell library. What changed?

  • Lower fanout. The bit_tick signal is the highest-fanout net in the design (it gates ~30 flops), but it’s only one wire and the resizer balanced it cleanly with three buffer levels.
  • Shorter paths than P02. P02’s PWM comparator was a chain of 8 gates in the fast corner; the longest path here is the next-state decoder for S_D7 → S_STOP, which compiles down to maybe 5 levels of logic. Easier to time.
  • CTS less stressed. Only ~30 flops to hit, vs. P02’s 16 + sloppier topology. Clock skew came in well under 100 ps.

Cell-count prediction was within 3% (predicted 2300, got 2374). Setup slack prediction was “comfortable, well under 100 MHz tight” — checked: +1.08 ns of slack on a 10 ns period.

The pin ordering worked. Look at the layout viewer above: the west edge has a clean vertical strip of pins (clk, rst_n, start, data[0..7], baud_div[0..15]), the east edge has the output cluster (tx, busy, state_o[0..3]). North and south sides are empty, by design. Left to the placer’s whim, the same design would have its pins scattered around all four edges; the cfg file fixes that.

What just happened?

This is where the chip starts looking like a thing instead of a math experiment. The earlier projects spit out numbers; this one talks. Plug a UART-USB adapter into the TX pin and (if the chip existed) you would see actual bytes arrive in screen /dev/ttyUSB0 9600 — the same protocol my modem used in 1996, the same protocol every embedded debug header still uses today.

We can do the simulator-side equivalent of that screen session right now. A second testbench, tb_console.sv, drives the DUT with ASCII bytes and runs a behavioural UART receiver against the TX line that prints each decoded character to stdout as it arrives. The simulation log itself reads like a serial console:

$ make console PROJECT=03_uart_tx
iverilog -g2012 -Wall -o tb_console.vvp -s tb_console ../src/top.sv tb_console.sv
VCD info: dumpfile tb_console.vcd opened for output.
[uart-rx]
[uart-rx] librelane-playground / project 03
[uart-rx] UART TX online @ 9600 baud (8N1).
[uart-rx] hello from sky130A.
[uart-rx]
[uart-rx] > count = 0xA5
[uart-rx] > tick
[uart-rx] > tick
[uart-rx] > tick
[sim] 324 bytes pushed, line idle.

Every [uart-rx] line is the testbench’s behavioural receiver recovering one frame at a time from raw transitions on the tx wire — the same algorithm a real UART chip runs in hardware:

projects/03_uart_tx/test/tb_console.sv system-verilog · L56-86
  initial begin : rx
    forever begin
      @(negedge tx);
      #(BIT_NS + BIT_NS/2);
      for (int i = 0; i < 8; i++) begin
        rx_byte[i] = tx;
        if (i < 7) #(BIT_NS);
      end
      // Mid stop-bit
      #(BIT_NS);
      // print a [uart-rx] prefix at the start of each output line
      if (!rx_line_open) begin
        $write("[uart-rx] ");
        rx_line_open = 1;
      end
      if (rx_byte == 8'h0A) begin
        $write("\n");
        rx_line_open = 0;
      end else if (rx_byte == 8'h0D) begin
        // CR alone: ignore (we expect CR/LF pairs, LF triggers newline)
      end else if (rx_byte >= 8'h20 && rx_byte < 8'h7F) begin
        $write("%c", rx_byte);
      end else begin
        $write("<%02h>", rx_byte);
      end
      $fflush;
    end
  end

  // ---- transmit a string byte-by-byte ----
  task uart_send(input byte b);
tb_console.sv — the inline UART receiver. Watches for the falling edge on tx, samples 8 bits at mid-bit, and emits printable ASCII to stdout.

If we replaced the simulator’s tx wire with a copper trace through a USB-UART bridge into a Linux box, screen /dev/ttyUSB0 9600 would show the exact same lines. The protocol is the oldest part of computing that has not been replaced, and it is also the simplest piece of hardware to build that produces something externally observable.

The console testbench was also a small lesson in event-scheduling hazards. The first version raced on wait (busy == 0) immediately after asserting start: at that posedge, the DUT’s state <= S_START is still in the NBA region and busy = (state != S_IDLE) still reads 0 from the prior state = S_IDLE. The wait returned instantly, the next byte was loaded with the FSM still mid-frame, and every other character on the wire was the previous byte. The fix is wait (busy == 1) ; wait (busy == 0) — wait for the FSM to actually leave IDLE before checking that it returned.
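The hazard and its fix can be reproduced without the UART at all. The sketch below uses a hypothetical stand-in for the DUT (a countdown that holds busy high for a fixed number of clocks after start is sampled) and a hypothetical send_one task; neither is taken from tb_console.sv. It assumes, as in the description above, that busy only rises after the clock edge that samples start.

```systemverilog
// Illustrative only: start/busy handshake with a stand-in DUT.
module handshake_sketch;
  localparam int FRAME_CYCLES = 50;   // e.g. 10 bits x 5 cycles per bit

  logic clk       = 1'b0;
  logic start     = 1'b0;
  int   remaining = 0;
  logic busy;

  always #10 clk = ~clk;

  // Stand-in DUT: busy goes high after the edge that samples start
  // and stays high for FRAME_CYCLES clocks.
  always @(posedge clk) begin
    if (remaining != 0) remaining <= remaining - 1;
    else if (start)     remaining <= FRAME_CYCLES;
  end
  assign busy = (remaining != 0);

  task automatic send_one();
    @(posedge clk);
    start <= 1'b1;
    @(posedge clk);
    start <= 1'b0;
    // A bare wait (busy == 0) here could return immediately: busy has
    // not risen yet at this point in the time step.
    wait (busy == 1);   // the frame has actually started
    wait (busy == 0);   // the frame has actually finished
  endtask

  initial begin
    repeat (3) begin
      send_one();
      $display("[%0t] frame done", $time);
    end
    $finish;
  end
endmodule
```

Deleting the wait (busy == 1) line should make all three "frame done" messages arrive within a few clocks of each other instead of one frame apart, which is the same symptom as the doubled characters on the console.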

Sub-1-mW power. 12100 μm² of silicon. 2374 standard cells. A 60-year-old protocol on a brand-new die. Build a better receiver and you’ve got yourself a serial console.

See also

  • Project 02 — counter + LFSR + PWM, the sequential ancestor.
  • Project 04 → first asynchronous external interface, first clock-domain crossing.
  • Project README — full lesson plan.