The first project on the ladder that talks to the outside world. UART (Universal Asynchronous Receiver/Transmitter) is the simplest serial protocol in common use: one wire, no clock alongside it, both ends agree on a bit rate and the receiver samples in the middle of each bit. We only build the TX half here.
This is also the first project with an explicit finite state machine
— eleven states walking the line through one frame: IDLE → START → D0 → D1 → … → D7 → STOP → IDLE. The earlier projects were either
pure combinational (P01) or independent counters and shift registers
(P02). Here the output depends on which step of the protocol you are
in, and that’s the defining feature of an FSM.
Wire length (estimated): 4,501 μm, 1.7× P02. Power: 851 μW — the UART’s continuous toggling on the baud counter and shifter dominates. All max-slew / max-fanout / max-cap checks pass in every process corner.
What’s new vs. project 02
- Typed FSM. SystemVerilog typedef enum logic [3:0] { ... } with unique case for next-state and output decoding. Synthesis encodes it however it wants (binary, one-hot, gray); the source declares intent, not encoding.
- Clock-enable counter. The baud generator is a 16-bit countdown that reloads from baud_div. Its zero-detect output (bit_tick) acts as a clock enable for the FSM. We do not generate a slower clock; ASIC flows really do not want you generating clocks. One real clock, gated by enables, full stop.
- Pin ordering. This is the first project where pin placement matters in the layout. UART has obvious sides: inputs feed in one side, the serial line leaves the other. We add pin_order.cfg to the LibreLane config so this is reproducible. Inputs (clk, rst_n, start, data, baud_div) cluster on the west edge; outputs (tx, busy, state_o) cluster on the east edge. You can see this directly in the layout viewer above: west pins are the strip on the left, east pins on the right.
How UART works (60-second primer)
A UART line is idle high. To send a byte:
- Drop the line low for one bit time. This is the start bit — the receiver’s clock recovery uses the falling edge here.
- Send 8 data bits, LSB first, one bit time each. (LSB first is a convention; some protocols are MSB-first, UART is not.)
- Drive the line high for one bit time. This is the stop bit. It’s also “line idle again” so back-to-back frames work.
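Both ends agree on a bit rate (the baud) and a frame format, nothing else. No clock wire, no flow control, no error correction. Common rates are 9600, 115200 and 921600. They divide cleanly from traditional UART crystals such as 1.8432 MHz; from a round reference clock like 50 MHz the division is only approximate, which is fine as long as the error stays small.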
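The framing steps above fit in a few lines of Python. This is only a sketch for checking the bit order by hand; the function name uart_frame is ours and does not appear in the project.

```python
def uart_frame(byte):
    """Return the 10 line levels of one 8N1 frame: start, D0..D7, stop."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must fit in 8 bits")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # LSB first
    return [0] + data_bits + [1]                     # start = 0, stop = 1


if __name__ == "__main__":
    for value in (0x55, 0xA5, 0x00, 0xFF):
        print(f"0x{value:02X}: {uart_frame(value)}")
```

For 0x55 the data bits alternate starting with a 1, because bit 0 goes on the wire first.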
For a 50 MHz reference clock targeting 9600 baud, that’s
50_000_000 / 9600 = 5208.3 clocks per bit. Truncate to 5208 and subtract
one (baud_div holds clocks per bit minus one), so baud_div = 5207.
The testbench uses a much faster baud (5 cycles/bit) so the simulation
finishes in microseconds, not milliseconds.
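The divider arithmetic is easy to get off by one, so here is a small Python sketch of it. It assumes the same convention as the text (truncate, then subtract one); the helper name baud_divisor and the returned error figure are our additions.

```python
def baud_divisor(clk_hz, baud):
    """Return (baud_div, actual_baud, error_percent) for a clocks-per-bit-minus-one divider."""
    cycles_per_bit = clk_hz // baud      # truncate: 5208.3 -> 5208
    if cycles_per_bit < 1:
        raise ValueError("baud rate is faster than the clock")
    baud_div = cycles_per_bit - 1        # the counter counts baud_div .. 0
    actual_baud = clk_hz / cycles_per_bit
    error_percent = 100.0 * (actual_baud - baud) / baud
    return baud_div, actual_baud, error_percent


if __name__ == "__main__":
    for rate in (9600, 115200, 921600):
        div, actual, err = baud_divisor(50_000_000, rate)
        print(f"{rate:>7} baud: baud_div = {div:>5}, actual = {actual:.1f}, error = {err:+.4f}%")
```

The error column shows how far the truncated divider lands from the nominal rate, which is the number to watch when the clock does not divide evenly.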
The RTL
The whole module is ~110 lines: enum, baud counter, data latch, next-state logic, output mux, busy/state outputs.
// Project 03: UART transmitter.
//
// First protocol on the ladder, and the first explicit FSM. UART is the
// minimum-viable serial protocol: one wire, no clock, no flow control.
// Both ends agree on a baud rate and the receiver samples the line at
// roughly the middle of each bit. We only build the TX side here.
//
// Frame format (8N1 — 8 data bits, no parity, 1 stop bit, LSB first):
//
//       ___       _   _   _   _   _   _   _   _   ___________
// idle     \_____/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/
//        ^    ^     ^   ^   ^   ^   ^   ^   ^   ^   ^
//        |    |     d0  d1  d2  d3  d4  d5  d6  d7  stop
//        |    start (always 0)
//        line idle = 1
//
// Baud generation
// ---------------
// `baud_div` is the number of clock cycles per UART bit, minus one. So
// for a 50 MHz clock and 9600 baud, baud_div = 50_000_000/9600 - 1 = 5207.
// We expose it as an input so the testbench can run at a fast baud
// without burning simulator time.
//
// FSM
// ---
// IDLE → wait for `start` to pulse, latch `data`, drop tx low
// START → drive tx=0 for one bit time
// D0..D7→ shift data out LSB first, one bit time each
// STOP → drive tx=1 for one bit time, then back to IDLE
//
// `busy` is high any time we are not in IDLE — host should not raise
// `start` again until `busy` falls.
`default_nettype none
module top (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        start,     // pulse high for 1 cycle to begin a frame
    input  logic [7:0]  data,      // captured at the cycle `start` is sampled
    input  logic [15:0] baud_div,  // clocks per bit, minus 1
    output logic        tx,        // serial output, idle = 1
    output logic        busy,      // 1 while a frame is in flight
    output logic [3:0]  state_o    // debug: which FSM state we are in
);

    // ---- FSM states ----
    // 11 states: IDLE, START, D0..D7, STOP. Encoded with 4-bit enum so the
    // synth tool can pick the encoding (binary, one-hot, etc.).
    typedef enum logic [3:0] {
        S_IDLE  = 4'd0,
        S_START = 4'd1,
        S_D0    = 4'd2,
        S_D1    = 4'd3,
        S_D2    = 4'd4,
        S_D3    = 4'd5,
        S_D4    = 4'd6,
        S_D5    = 4'd7,
        S_D6    = 4'd8,
        S_D7    = 4'd9,
        S_STOP  = 4'd10
    } state_t;
    state_t state, state_next;

    // ---- baud counter ----
    // Counts down from `baud_div` to 0. When it hits 0 and we're not idle,
    // it's a "bit tick" and we advance to the next FSM state. Re-loads
    // every tick.
    logic [15:0] baud_cnt;
    logic        bit_tick;
    assign bit_tick = (baud_cnt == 16'd0);

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)               baud_cnt <= 16'd0;
        else if (state == S_IDLE) baud_cnt <= baud_div; // pre-load while idle
        else if (bit_tick)        baud_cnt <= baud_div;
        else                      baud_cnt <= baud_cnt - 16'd1;
    end

    // ---- data latch ----
    // Capture `data` the cycle we leave IDLE so the host doesn't have to
    // hold it for the whole frame.
    logic [7:0] data_q;
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n)                        data_q <= 8'h00;
        else if (state == S_IDLE && start) data_q <= data;
    end

    // ---- next-state logic ----
    // Linear walk through the states. The only branch is IDLE → START on
    // `start`; everything else is "advance on bit_tick".
    always_comb begin
        state_next = state;
        unique case (state)
            S_IDLE:  if (start)    state_next = S_START;
            S_START: if (bit_tick) state_next = S_D0;
            S_D0:    if (bit_tick) state_next = S_D1;
            S_D1:    if (bit_tick) state_next = S_D2;
            S_D2:    if (bit_tick) state_next = S_D3;
            S_D3:    if (bit_tick) state_next = S_D4;
            S_D4:    if (bit_tick) state_next = S_D5;
            S_D5:    if (bit_tick) state_next = S_D6;
            S_D6:    if (bit_tick) state_next = S_D7;
            S_D7:    if (bit_tick) state_next = S_STOP;
            S_STOP:  if (bit_tick) state_next = S_IDLE;
            default: state_next = S_IDLE;
        endcase
    end

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) state <= S_IDLE;
        else        state <= state_next;
    end

    // ---- tx output ----
    // tx is a combinational decode of the registered `state` and `data_q`.
    // It is not itself a flop, so a guaranteed glitch-free line would need
    // one more output register. Idle = 1, start bit = 0,
    // data bits = data_q[i], stop bit = 1.
    always_comb begin
        unique case (state)
            S_IDLE:  tx = 1'b1;
            S_START: tx = 1'b0;
            S_D0:    tx = data_q[0];
            S_D1:    tx = data_q[1];
            S_D2:    tx = data_q[2];
            S_D3:    tx = data_q[3];
            S_D4:    tx = data_q[4];
            S_D5:    tx = data_q[5];
            S_D6:    tx = data_q[6];
            S_D7:    tx = data_q[7];
            S_STOP:  tx = 1'b1;
            default: tx = 1'b1;
        endcase
    end

    assign busy    = (state != S_IDLE);
    assign state_o = state;
endmodule
`default_nettype wire
The bit_tick signal is the gear that meshes the baud counter with
the FSM. Every state except IDLE advances exactly when bit_tick
fires. When you read sequential RTL, find the clock enables first:
they tell you the rhythm of the design before any of the
state-by-state logic does.
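To see that rhythm without a simulator, here is a cycle-by-cycle Python model of the counter and FSM. It is a sketch that mirrors the RTL above under one assumption: start is pulsed on the first clock edge after reset. The names simulate_frame and tx_level are ours.

```python
IDLE, START, D0, STOP = 0, 1, 2, 10   # D0..D7 are states 2..9, as in the enum


def tx_level(state, data_q):
    """Output mux: idle and stop drive 1, start drives 0, Dn drives data_q[n]."""
    if state == START:
        return 0
    if D0 <= state < STOP:
        return (data_q >> (state - D0)) & 1
    return 1


def simulate_frame(byte, baud_div):
    """Clock the model through one frame; return tx for every non-idle cycle."""
    state, baud_cnt, data_q = IDLE, 0, 0   # reset values
    start = 1                              # assumed: one-cycle start pulse
    trace = []
    while True:
        bit_tick = (baud_cnt == 0)
        if state != IDLE:
            trace.append(tx_level(state, data_q))
        # Next values, all computed from the pre-edge registers.
        if state == IDLE:
            next_state = START if start else IDLE
            next_cnt = baud_div            # pre-load while idle
            next_data_q = byte if start else data_q
        else:
            next_data_q = data_q
            next_cnt = baud_div if bit_tick else baud_cnt - 1
            if not bit_tick:
                next_state = state
            elif state == STOP:
                return trace               # frame done, FSM goes back to IDLE
            else:
                next_state = state + 1
        state, baud_cnt, data_q = next_state, next_cnt, next_data_q
        start = 0


if __name__ == "__main__":
    trace = simulate_frame(0xA5, baud_div=4)
    print(len(trace), "busy cycles; one sample per bit:", trace[::5])
```

Each state holds for baud_div + 1 cycles, which is why the divider is specified as clocks per bit minus one.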
The output mux is a flat case over 11 states. We could shift data_q
right one position per bit-tick instead, but the explicit per-bit
indexing makes waveforms easier to read in gtkwave: bit D3 always
drives data_q[3], never some shifted-around state of an internal
register.
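The trade-off between the two output styles can be sketched in Python. Both helpers below are hypothetical models, not project code: one shifts a register and always emits bit 0, the other indexes a register that never changes, as the RTL does.

```python
def shift_out(byte):
    """Shift-register style: emit bit 0, then shift right."""
    shreg = byte & 0xFF
    steps = []
    for _ in range(8):
        steps.append((shreg & 1, shreg))   # (bit on the wire, register contents)
        shreg >>= 1
    return steps


def index_out(byte):
    """Mux style: bit n comes from data_q[n]; data_q holds the original byte."""
    data_q = byte & 0xFF
    return [((data_q >> n) & 1, data_q) for n in range(8)]


if __name__ == "__main__":
    for n, (shifted, indexed) in enumerate(zip(shift_out(0xA5), index_out(0xA5))):
        print(f"D{n}: wire={shifted[0]}  shreg=0x{shifted[1]:02X}  data_q=0x{indexed[1]:02X}")
```

The wire sees the same bits either way. The difference is what you read in the waveform viewer: the indexed register still shows the byte being sent, while the shift register shows a partly consumed value.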
The testbench
Four canonical patterns: 0x55 (alternating), 0xA5 (uneven), 0x00
(all zeros — checks we never accidentally treat data zeros as start
bits), 0xFF (all ones — checks we never accidentally drop the line
when sending all-high data).
The decoder is a behavioural process that watches for the tx falling
edge (start bit), then samples the line every bit time, mid-bit. It’s
the same idea a real UART receiver implements in hardware, where an
oversampling clock (16× the baud rate is common) takes the place of
the # delays.
localparam int CYC_PER_BIT = 5;
localparam int CLK_NS = 20;
localparam int BIT_NS = CYC_PER_BIT * CLK_NS; // 100 ns
// ---- decoder coroutine ----
// Watches the TX line and prints / checks each frame as it arrives.
// Triggered by the falling edge of `tx` (start bit). After the frame
// is captured, fires `decoded_byte` / `decoded_valid` so the main
// sequencer can compare.
logic [7:0] decoded_byte;
logic decoded_valid;
int decode_errors = 0;
initial begin
    decoded_valid = 0;
    forever begin
        // wait for start bit (TX falling edge while line was idle)
        @(negedge tx);
        // Sample mid-bit: the centre of D0 is 1.5 bit times after the
        // falling edge (one full start bit plus half a bit time). From
        // there, advance by BIT_NS to sample each subsequent bit.
        #(BIT_NS + BIT_NS/2); // now in middle of D0
        for (int i = 0; i < 8; i++) begin
            decoded_byte[i] = tx;
            if (i < 7) #(BIT_NS);
        end
        // Advance to middle of stop bit, check it is high.
        #(BIT_NS);
        if (tx !== 1'b1) begin
            $display("FAIL: stop bit was %b, expected 1", tx);
            decode_errors++;
        end
        decoded_valid = 1;
        @(posedge clk); decoded_valid = 0;
    end
end
// ---- helper task: send + check one byte ----
int sent_count = 0;
task send_and_check(input [7:0] b);
$ make test PROJECT=03_uart_tx
== 03_uart_tx ==
iverilog -g2012 -Wall -o tb.vvp -s tb ../src/top.sv tb.sv
[150000] sending 0x55 (01010101)
decoded 0x55 OK
[1250000] sending 0xa5 (10100101)
decoded 0xa5 OK
[2350000] sending 0x00 (00000000)
decoded 0x00 OK
[3450000] sending 0xff (11111111)
decoded 0xff OK
PASS: 4 frames sent, all decoded correctly.
03_uart_tx PASS
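The decoder's sampling schedule can be checked outside the simulator too. The Python sketch below assumes an ideal waveform at 1 ns resolution with BIT_NS = 100, matching the testbench constants; frame_bits, waveform and decode are hypothetical helper names.

```python
BIT_NS = 100   # CYC_PER_BIT * CLK_NS in the testbench


def frame_bits(byte):
    """8N1 frame: start bit, D0..D7 LSB first, stop bit."""
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]


def waveform(bits, idle_ns=150):
    """Expand bit levels into one sample per nanosecond, idle-high on both sides."""
    wave = [1] * idle_ns
    for bit in bits:
        wave += [bit] * BIT_NS
    return wave + [1] * idle_ns


def decode(wave):
    """Find the first falling edge, then sample 1.5, 2.5, ... bit times after it."""
    edge = next(t for t in range(1, len(wave)) if wave[t - 1] == 1 and wave[t] == 0)
    t = edge + BIT_NS + BIT_NS // 2    # centre of D0
    byte = 0
    for i in range(8):
        byte |= wave[t] << i           # LSB first
        t += BIT_NS
    stop_ok = (wave[t] == 1)           # t is now the centre of the stop bit
    return byte, stop_ok


if __name__ == "__main__":
    for value in (0x55, 0xA5, 0x00, 0xFF):
        decoded, stop_ok = decode(waveform(frame_bits(value)))
        print(f"sent 0x{value:02X}, decoded 0x{decoded:02X}, stop bit ok: {stop_ok}")
```

Sampling at the bit centre leaves half a bit time of margin on either side, which is what lets the two ends run from clocks that only roughly agree.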
How the harden actually went
Better than P02. P02 produced 11 max-slew violations in the slow PVT corner; P03 produced zero violations of any kind. Same flow, same clock target, same standard-cell library. What changed?
- Lower fanout. The bit_tick signal is the highest-fanout net in the design (it gates ~30 flops), but it’s only one wire and the resizer balanced it cleanly with three buffer levels.
- Slower wins than P02. P02’s PWM comparator was a chain of 8 LUTs in the fast corner; the longest path here is the next-state decoder for S_D7 → S_STOP, which compiles down to maybe 5 levels of logic. Easier to time.
- CTS less stressed. Only ~30 flops to hit, vs. P02’s 16 + sloppier topology. Clock skew came in well under 100 ps.
Cell-count prediction was off by about 3% (predicted 2300, got 2374). Setup slack prediction was “comfortable, well under 100 MHz tight”. Checked: +1.08 ns of slack on a 10 ns period.
The pin ordering worked. Look at the layout viewer above: the west edge has a clean vertical strip of pins (clk, rst_n, start, data[0..7], baud_div[0..15]), the east edge has the output cluster (tx, busy, state_o[0..3]). North and south sides are empty, by design. Left to the placer’s whim, the same design would come out chaotic; the cfg file fixes it.
What just happened?
This is where the chip starts looking like a thing instead of a
math experiment. The earlier projects spit out numbers; this one
talks. Plug a UART-USB adapter into the TX pin and (if the chip
existed) you would see actual bytes arrive in screen /dev/ttyUSB0 9600 — the same protocol my modem used in 1996, the same protocol
every embedded debug header still uses today.
We can do the simulator-side equivalent of that screen session right
now. A second testbench, tb_console.sv, drives the DUT with ASCII
bytes and runs a behavioural UART receiver against the TX line that
prints each decoded character to stdout as it arrives. The
simulation log itself reads like a serial console:
$ make console PROJECT=03_uart_tx
iverilog -g2012 -Wall -o tb_console.vvp -s tb_console ../src/top.sv tb_console.sv
VCD info: dumpfile tb_console.vcd opened for output.
[uart-rx]
[uart-rx] librelane-playground / project 03
[uart-rx] UART TX online @ 9600 baud (8N1).
[uart-rx] hello from sky130A.
[uart-rx]
[uart-rx] > count = 0xA5
[uart-rx] > tick
[uart-rx] > tick
[uart-rx] > tick
[sim] 324 bytes pushed, line idle.
Every [uart-rx] line is the testbench’s behavioural receiver
recovering one frame at a time from raw transitions on the tx wire
— the same algorithm a real UART chip runs in hardware:
initial begin : rx
    forever begin
        @(negedge tx);
        #(BIT_NS + BIT_NS/2);
        for (int i = 0; i < 8; i++) begin
            rx_byte[i] = tx;
            if (i < 7) #(BIT_NS);
        end
        // Mid stop-bit
        #(BIT_NS);
        // print a [uart-rx] prefix at the start of each output line
        if (!rx_line_open) begin
            $write("[uart-rx] ");
            rx_line_open = 1;
        end
        if (rx_byte == 8'h0A) begin
            $write("\n");
            rx_line_open = 0;
        end else if (rx_byte == 8'h0D) begin
            // CR alone: ignore (we expect CR/LF pairs, LF triggers newline)
        end else if (rx_byte >= 8'h20 && rx_byte < 8'h7F) begin
            $write("%c", rx_byte);
        end else begin
            $write("<%02h>", rx_byte);
        end
        $fflush;
    end
end
// ---- transmit a string byte-by-byte ----
task uart_send(input byte b);
If we replaced the simulator’s tx wire with a copper trace through
a USB-UART bridge into a Linux box, screen /dev/ttyUSB0 9600 would
show the exact same lines. The protocol is the oldest part of
computing that has not been replaced, and it is also the simplest
piece of hardware to build that produces something externally
observable.
The console testbench was also a small lesson in event-scheduling
hazards. The first version raced on wait (busy == 0) immediately
after asserting start: at that posedge, the DUT’s state <= S_START is still in the NBA region and busy = (state != S_IDLE)
still reads 0 from the prior state = S_IDLE. The wait returned
instantly, the next byte was loaded with the FSM still mid-frame,
and every other character on the wire was the previous byte. The
fix is wait (busy == 1) ; wait (busy == 0) — wait for the FSM to
actually leave IDLE before checking that it returned.
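The race is easier to see in a toy two-phase model. The Python sketch below is an assumption-heavy simplification, not the simulator's real scheduler: each clock edge has an active phase where everything reads pre-edge values, then an NBA phase where the register update lands. FRAME_CYCLES and edges_waited are hypothetical names.

```python
FRAME_CYCLES = 50   # assumed frame length: 10 bits x 5 cycles per bit


def edges_waited(use_fix):
    """Return how many clock edges after the start edge the testbench's wait returns."""
    remaining = 0        # DUT register: cycles left in the frame, 0 means IDLE
    start = 1            # testbench raised start before edge 0
    seen_busy = False
    for edge in range(10 * FRAME_CYCLES):
        # Active phase: testbench and DUT both read pre-edge values.
        busy = remaining > 0
        if not use_fix and not busy:
            return edge                  # naive wait (busy == 0) falls straight through
        if remaining == 0:
            next_remaining = FRAME_CYCLES if start else 0
        else:
            next_remaining = remaining - 1
        # NBA phase: the register update becomes visible.
        remaining = next_remaining
        start = 0
        busy = remaining > 0
        if use_fix:
            if busy:
                seen_busy = True         # wait (busy == 1) satisfied
            elif seen_busy:
                return edge              # wait (busy == 0) satisfied
    raise RuntimeError("wait never returned")


if __name__ == "__main__":
    print("naive wait returns after", edges_waited(use_fix=False), "edges")
    print("fixed wait returns after", edges_waited(use_fix=True), "edges")
```

The naive wait returns on the very edge that launched the frame, while the two-step wait returns only once the frame has actually finished.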
Sub-1-mW power. 12100 μm² of silicon. 2374 standard cells. A 60-year-old protocol on a brand-new die. Build a better receiver and you’ve got yourself a serial console.
See also
- Project 02 — counter + LFSR + PWM, the sequential ancestor.
- Project 04 → first asynchronous external interface, first clock-domain crossing.
- Project README — full lesson plan.