Project No. 06 of 147 on the ladder

Tiny FSM CPU

introduces — program counter, instruction memory, instruction decode, multi-cycle FSM

harden state: last run 2026-04-28
cells: 2,333 (non-filler)
slack: 0.32 ns (setup)
area: 16900 (die) / 13988 (core) μm²
signoff
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

The smallest thing on this ladder that’s defensibly a CPU. Project 05’s datapath gets a control unit bolted on top: a 5-bit program counter walks a 32-entry ROM, an instruction register holds the current instruction word, and an FSM steps through one instruction at a time in four stages (a fifth state, HALT, parks the machine after HLT):

FETCH → DECODE → EXECUTE → WB
[Interactive layout view: 130 × 130 μm die · sky130A · 100 MHz target · regfile + ALU + FSM]
[Interactive 3D view: full sky130 stack · z exaggerated 10× · 127k shapes]

Hold slack +0.91 ns, target 100 MHz (10 ns period). Composition: P05’s datapath + a 5-state FSM control + 32-instruction PROG ROM + P03’s UART transmitter, all in 2,333 cells.

The 4-stage FSM split paid off: P05 needed 25 ns at slow corner to fit its all-in-one-cycle datapath. P06 ships at 10 ns with +0.32 ns of slack at the slow corner — 2.5× the clock for 4× the cycles per instruction. Adding the UART tightened the slack significantly (the first attempt without UART had +2.65 ns), but it still passes at 100 MHz across all PVT corners. The register file is 48 dfrtp_2 cells; the UART adds another 28 flops (a 4-bit FSM + 8-bit data latch + 16-bit baud counter). DRC, LVS, antenna all clean.

What’s new vs. P05

  • Program counter + ROM. 5-bit PC, 32-entry × 16-bit instruction ROM declared as a parameterized constant. Synthesis materializes it as combinational lookup logic — no SRAM macro needed.
  • Instruction decode. A 16-op encoding (4-bit opcode, three 3-bit register fields, 8-bit immediate for LDI, 5-bit absolute address for branches). Decode is combinational from the IR.
  • Multi-cycle FSM. Each instruction takes 4 cycles (FETCH → DECODE → EXECUTE → WB). Splitting the work across stages shortens the per-cycle critical path — that’s the headroom for cranking the clock back to 100 MHz.
  • Branches and HALT. JMP / BZ / BNZ consume flags from the most recently completed flag-writing instruction. HLT parks the FSM in S_HALT permanently.
  • A real UART. The OUT instruction (opcode 0x7, replacing P05’s rarely-useful SAR) pushes regs[ra] out a hardware UART on the chip’s uart_tx pin. The CPU’s FSM stalls during transmission, so consecutive OUTs naturally serialize. The UART module is the same 8N1 transmitter from project 03, inlined inside this design — first time the ladder reuses a previous project’s RTL.

The FSM

[FSM diagram: FETCH → DECODE → EXECUTE → WB. WB → FETCH when not HLT and the UART is idle; WB loops on itself while OUT and uart_busy; WB → HALT on HLT; HALT loops forever.]
The control unit. Each instruction takes four cycles, then loops back to FETCH unless it was a HLT. The OUT instruction adds a self-loop on WB that holds while the UART is busy transmitting.

How one instruction executes

The 4-stage walk for a single instruction — ADD R3, R1, R2 — gives the cleanest read on what the FSM actually buys us:

cycle | state   | what happens
1     | FETCH   | ir <= PROG[pc]. PC increments to point at the next instruction. The decoder’s combinational fan-out is now valid for the rest of the instruction.
2     | DECODE  | op_a <= regs[ra] (= regs[1]), op_b <= regs[rb] (= regs[2]). Each operand becomes a registered byte that the ALU reads next cycle.
3     | EXECUTE | result_q <= alu(op_a, op_b). flags_d <= {Z, N, C, V} is also captured. Pure ALU work: no regfile lookup, no decode, just one combinational ALU pass.
4     | WB      | regs[rd] <= result_q and flags_q <= flags_d if the instruction calls for either. Branches override the PC here from dec_addr.

Each cycle’s combinational chain is now a small fraction of P05’s all-in-one path. That’s why P05 had to clock at 40 MHz to fit the slow corner, while P06 fits the same datapath plus the control unit at 100 MHz: +2.65 ns of slack before the UART went in, and +0.32 ns in the shipped design with the UART.
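
The arithmetic behind that trade is easy to check. The sketch below uses only the figures quoted in this write-up; the function names are illustrative, and it assumes the reported setup slack is measured against the full 10 ns clock period.

```python
# Back-of-the-envelope timing check using the figures quoted in the text.
# Assumption: setup slack is measured against the full clock period.

def freq_mhz(period_ns: float) -> float:
    """Clock frequency in MHz for a given period in ns."""
    return 1000.0 / period_ns

def instr_latency_ns(period_ns: float, cycles_per_instr: int) -> float:
    """Wall-clock time for one instruction."""
    return period_ns * cycles_per_instr

def critical_path_ns(period_ns: float, setup_slack_ns: float) -> float:
    """Longest register-to-register delay implied by the slack."""
    return period_ns - setup_slack_ns

p05 = instr_latency_ns(25.0, 1)   # single-cycle datapath at 40 MHz
p06 = instr_latency_ns(10.0, 4)   # four-cycle FSM at 100 MHz

print(f"P05: {freq_mhz(25.0):.0f} MHz, {p05:.0f} ns per instruction")
print(f"P06: {freq_mhz(10.0):.0f} MHz, {p06:.0f} ns per instruction")
print(f"P06 path without UART: {critical_path_ns(10.0, 2.65):.2f} ns")
print(f"P06 path with UART:    {critical_path_ns(10.0, 0.32):.2f} ns")
```

Under these assumptions the multi-cycle split buys clock rate, not instruction throughput: a non-OUT instruction takes 40 ns on P06 against 25 ns on P05.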

Instruction set

op  | mnemonic | semantics
0x0 | ADD | rd = ra + rb; updates flags
0x1 | SUB | rd = ra - rb; updates flags
0x2 | AND | rd = ra & rb; updates Z, N
0x3 | OR  | rd = ra | rb; updates Z, N
0x4 | XOR | rd = ra ^ rb; updates Z, N
0x5 | SHL | rd = ra << 1; C ← old MSB
0x6 | SHR | rd = ra >> 1; C ← old LSB
0x7 | OUT | push regs[ra] out the UART tx pin; FSM stalls until the byte finishes
0x8 | MOV | rd = ra; no flag update
0x9 | NOT | rd = ~ra; updates Z, N
0xA | LDI | rd = imm8; no flag update
0xB | CMP | flags = ra - rb; no register write
0xC | JMP | pc = addr5
0xD | BZ  | pc = addr5 if Z = 1
0xE | BNZ | pc = addr5 if Z = 0
0xF | HLT | halt forever

The 16-bit instruction word starts with a 4-bit opcode in bits [15:12]. Register-register instructions (ADD, SUB, AND, …) put rd / ra / rb into the three 3-bit fields that follow ([11:9], [8:6], [5:3]), leaving 3 unused bits at the bottom. LDI keeps rd in [11:9] and uses bits [8:1] as an 8-bit immediate (with bit 0 reserved). Branches and JMP use bits [4:0] as a 5-bit absolute ROM address.
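
To make the field layout concrete, here is a small Python sketch of an encoder and decoder for this word format. It is a reading aid rather than part of the project: the function names are made up for illustration, and the field positions are taken from the description above and from the dec_* slices in the RTL.

```python
# Encoder/decoder sketch for the 16-bit instruction word.
# Field positions: opcode [15:12], rd [11:9], ra [8:6], rb [5:3],
# LDI immediate [8:1], branch address [4:0].

def enc_rr(opcode: int, rd: int, ra: int, rb: int = 0) -> int:
    """Register-register form; the low 3 bits are unused."""
    return (opcode & 0xF) << 12 | (rd & 7) << 9 | (ra & 7) << 6 | (rb & 7) << 3

def enc_ldi(rd: int, imm8: int) -> int:
    """LDI: opcode 0xA, immediate in bits [8:1], bit 0 reserved."""
    return 0xA << 12 | (rd & 7) << 9 | (imm8 & 0xFF) << 1

def enc_branch(opcode: int, addr5: int) -> int:
    """JMP/BZ/BNZ: 5-bit absolute ROM address in bits [4:0]."""
    return (opcode & 0xF) << 12 | (addr5 & 0x1F)

def decode(word: int) -> dict:
    """Split a word into every field; which ones matter depends on the opcode."""
    return {
        "op":   (word >> 12) & 0xF,
        "rd":   (word >> 9) & 7,
        "ra":   (word >> 6) & 7,
        "rb":   (word >> 3) & 7,
        "imm":  (word >> 1) & 0xFF,
        "addr": word & 0x1F,
    }

print(f"{enc_ldi(1, 0x01):04x}")      # LDI R1, #0x01   -> a202
print(f"{enc_rr(0x0, 3, 1, 2):04x}")  # ADD R3, R1, R2  -> 0650
print(f"{enc_rr(0x7, 0, 1):04x}")     # OUT R1          -> 7040
```

The three printed words match the ir= values that appear for the same instructions in the demo trace at the end of this page.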

RTL

The whole CPU is a ~390-line top module plus an inlined uart_tx submodule, both in top.sv. The walkthrough below breaks it into the pieces a reader actually wants to look at, in roughly the order they fire when an instruction executes.

The header and ports

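Standard P05-shaped wrapper, plus a baud_div input and a uart_tx output. baud_div sets the UART pacing: the transmitter holds each bit for baud_div + 1 clocks. The TB drives it with a small value (3) for fast sim; real silicon gets 868, which at 100 MHz works out to 869 clocks per bit, about 115,075 baud and within 0.11% of 115200.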

projects/06_fsm_cpu/src/top.sv system-verilog · L79-95
`default_nettype none

module top (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        start,        // currently unused (1 = run)
    // UART pacing: clocks-per-bit minus 1. At 100 MHz, 868 → 115200 baud.
    // Driven externally so the testbench can speed it up for sim.
    input  logic [15:0] baud_div,
    output logic [7:0]  out,
    output logic [4:0]  pc_out,
    output logic        halted,
    // UART tx pin (idle-high). Pulled out of the chip; off-chip a
    // standard 3.3V serial monitor at the matching baud rate decodes
    // the bytes the OUT instruction sent.
    output logic        uart_tx
);
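
The relationship between baud_div and the bit rate can be sketched in Python. This assumes the pacing stated in the uart_tx header comment (each bit is held for baud_div + 1 clocks) and a 100 MHz clock; the helper names are illustrative only.

```python
# Relates baud_div to the achieved bit rate.
# Assumption (from the uart_tx comment): each bit lasts baud_div + 1 clocks.

CLK_HZ = 100_000_000

def baud_from_div(baud_div: int, clk_hz: int = CLK_HZ) -> float:
    """Bit rate produced by a given divider value."""
    return clk_hz / (baud_div + 1)

def div_for_baud(target_baud: int, clk_hz: int = CLK_HZ) -> int:
    """Nearest divider value for a target bit rate."""
    return round(clk_hz / target_baud) - 1

for div in (867, 868):
    rate = baud_from_div(div)
    err_pct = 100.0 * (rate - 115200) / 115200
    print(f"baud_div={div}: {rate:.0f} baud ({err_pct:+.3f}%)")
```

Both 867 and 868 land within about 0.11% of 115200, far inside the mismatch that 8N1 receivers commonly tolerate, so the value quoted in the port comment works even though it is one count above the nearest divider.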

The instruction ROM

The ROM is built up from four little encoder helpers and dropped into a 512-bit packed parameter. Yosys’s read_verilog won’t accept unpacked array parameters, so the ROM is stored as one giant bit-vector and sliced with +: at fetch time. The encoder functions are also written in Verilog-2001 style (assign through the function name, no return) because Yosys rejects return {...} inside automatic functions.

projects/06_fsm_cpu/src/top.sv system-verilog · L110-160
  localparam int ROM_DEPTH = 32;

  // Encoder helpers — used only inside the localparam ROM init below,
  // so they must be valid in Yosys's SV frontend. We use the Verilog
  // function-name-assignment form (no `return` statement) and avoid
  // `automatic`, since Yosys's read_verilog rejects `return {...}`
  // inside automatic functions.
  function [15:0] op_alu(input [3:0] opcode,
                          input [2:0] rd,
                          input [2:0] ra,
                          input [2:0] rb);
    op_alu = {opcode, rd, ra, rb, 3'b000};
  endfunction
  function [15:0] op_unary(input [3:0] opcode,
                            input [2:0] rd,
                            input [2:0] ra);
    op_unary = {opcode, rd, ra, 3'b000, 3'b000};
  endfunction
  function [15:0] op_ldi(input [2:0] rd,
                          input [7:0] imm);
    op_ldi = {4'hA, rd, imm, 1'b0};
  endfunction
  function [15:0] op_jmp(input [3:0] opcode,
                          input [4:0] addr);
    op_jmp = {opcode, 7'b0000000, addr};       // 4 + 7 + 5 = 16 bits
  endfunction

  // ROM is held as a packed bit-vector — 32 instructions × 16 bits =
  // 512 bits. Iverilog doesn't accept unpacked array parameters, but
  // is fine with packed bit-vectors. We slice the current word with a
  // `+:` index. Concatenation order is MSB-first, so PC=31 sits at
  // the top and PC=0 at the bottom.
  localparam int PROG_BITS = ROM_DEPTH * 16;

  // Default boot program: Fibonacci(6) into R1..R6, then HLT. R7 ends
  // up holding the last value (matches the `out` port). Testbench
  // overrides this via parameter when it wants to exercise different
  // instructions. Listed PC=31 first → PC=0 last.
  localparam logic [PROG_BITS-1:0] DEFAULT_PROG = {
    {24{16'h0000}},                                // 31..08: zero-fill
    op_jmp (4'hF, 5'd0),                           // 07: HLT (jump field unused)
    op_unary(4'h8, 3'd7, 3'd6),                    // 06: R7 = R6 (= out)
    op_alu (4'h0, 3'd6, 3'd4, 3'd5),               // 05: R6 = R4 + R5 = 8
    op_alu (4'h0, 3'd5, 3'd3, 3'd4),               // 04: R5 = R3 + R4 = 5
    op_alu (4'h0, 3'd4, 3'd2, 3'd3),               // 03: R4 = R2 + R3 = 3
    op_alu (4'h0, 3'd3, 3'd1, 3'd2),               // 02: R3 = R1 + R2 = 2
    op_ldi (3'd2, 8'h01),                          // 01: R2 = 1
    op_ldi (3'd1, 8'h01)                           // 00: R1 = 1
  };

  parameter logic [PROG_BITS-1:0] PROG = DEFAULT_PROG;
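
The packed-vector trick is easy to model in Python, where an arbitrary-precision integer stands in for the 512-bit parameter. This is a sketch of the idea rather than project code; pack_rom and fetch are hypothetical names.

```python
# Models the packed ROM: 32 words of 16 bits live in one 512-bit integer,
# and fetch slices out word `pc` the way PROG[16*pc +: 16] does.

ROM_DEPTH = 32
WORD_BITS = 16

def pack_rom(words_pc0_first: list) -> int:
    """Pack words so that PC=0 occupies the least significant 16 bits."""
    if len(words_pc0_first) > ROM_DEPTH:
        raise ValueError("program does not fit in the ROM")
    prog = 0
    for pc, word in enumerate(words_pc0_first):
        prog |= (word & 0xFFFF) << (WORD_BITS * pc)
    return prog  # entries past the end of the program stay zero

def fetch(prog: int, pc: int) -> int:
    """Equivalent of PROG[16*pc +: 16]."""
    return (prog >> (WORD_BITS * pc)) & 0xFFFF

# First two words of the default program: LDI R1,#1 and LDI R2,#1.
prog = pack_rom([0xA202, 0xA402])
print(f"{fetch(prog, 0):04x} {fetch(prog, 1):04x} {fetch(prog, 31):04x}")
```

Note the ordering difference: the SystemVerilog concatenation lists PC=31 first because {} is MSB-first, while this model takes the program in PC=0-first order and shifts each word into place.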

PC, IR, decode

The IR is dissected into named slices as soon as FETCH latches it. Everything from here down is combinational off ir. The alu_op mux is mostly identity (CPU op = ALU op), with a few overrides: CMP routes through SUB, LDI/MOV/OUT all use the ALU’s a-passthrough, and branches and HLT fall to the default a-passthrough, where the result is unused.

projects/06_fsm_cpu/src/top.sv system-verilog · L165-195
  logic [4:0]  pc;
  logic [15:0] ir;

  wire [3:0] dec_op  = ir[15:12];
  wire [2:0] dec_rd  = ir[11: 9];
  wire [2:0] dec_ra  = ir[ 8: 6];
  wire [2:0] dec_rb  = ir[ 5: 3];
  wire [7:0] dec_imm = ir[ 8: 1];      // LDI imm payload (bit 0 reserved)
  wire [4:0] dec_addr= ir[ 4: 0];      // branch/jump target

  // Map CPU opcode to ALU op (a 4-bit code matching the P05 ALU).
  // Most ALU opcodes are 1:1 with their CPU encoding. CMP routes
  // through SUB (subtractor compute, no register write).
  logic [3:0] alu_op;
  always_comb begin
    unique case (dec_op)
      4'h0: alu_op = 4'b0000;            // ADD
      4'h1: alu_op = 4'b0001;            // SUB
      4'h2: alu_op = 4'b0010;            // AND
      4'h3: alu_op = 4'b0011;            // OR
      4'h4: alu_op = 4'b0100;            // XOR
      4'h5: alu_op = 4'b0101;            // SHL
      4'h6: alu_op = 4'b0110;            // SHR
      4'h7: alu_op = 4'b1000;            // OUT — passthrough A; UART side-effect handled below
      4'h8: alu_op = 4'b1000;            // MOV
      4'h9: alu_op = 4'b1001;            // NOT
      4'hA: alu_op = 4'b1000;            // LDI uses MOV-with-imm (handled by use_imm)
      4'hB: alu_op = 4'b0001;            // CMP routes through SUB
      default: alu_op = 4'b1000;         // JMP/BZ/BNZ/HLT — datapath idle
    endcase
  end

Per-instruction control signals

Decode is purely combinational off dec_op. Each instruction class gets a one-bit is_* predicate, and from those we build flag_update and reg_write — the two write-enables that decide what WB actually commits.

R0 is conventionally hardwired to zero. Rather than special-casing reads (which reg_read() does anyway), we drop writes to it at the control-signal level so synthesis can prove the regs[0] flop is unreachable and optimize it away.

projects/06_fsm_cpu/src/top.sv system-verilog · L197-213
  // Per-instruction control signals (combinational from dec_op).
  wire is_out     = (dec_op == 4'h7);              // UART send (replaces SAR)
  wire is_alu_rr  = (dec_op <= 4'h9) && !is_out;   // ALU writeback ops
  wire is_ldi     = (dec_op == 4'hA);
  wire is_cmp     = (dec_op == 4'hB);
  wire is_jmp     = (dec_op == 4'hC);
  wire is_bz      = (dec_op == 4'hD);
  wire is_bnz     = (dec_op == 4'hE);
  wire is_hlt     = (dec_op == 4'hF);
  wire is_branch  = is_jmp | is_bz | is_bnz;
  // Some ops don't update flags: MOV, OUT, LDI, branches, and HLT.
  // CMP does update flags, but its result isn't written back.
  wire flag_update = ~(dec_op == 4'h8 || is_out || is_ldi || is_branch || is_hlt);
  // Register-write enable: any ALU op (excluding CMP and OUT) writes rd.
  // LDI also writes. R0 is hardwired to zero in the regfile so writes
  // there are dropped.
  wire reg_write   = (is_alu_rr || is_ldi) && (dec_rd != 3'd0);
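
As a cross-check on the table of semantics, the predicates above can be mirrored in a few lines of Python and evaluated for every opcode. This is a sketch that follows the wire expressions in the listing; the control function and the names list are illustrative.

```python
# Mirrors the is_* predicates and the two write-enables for one opcode.

def control(dec_op: int, dec_rd: int) -> dict:
    is_out = dec_op == 0x7
    is_alu_rr = dec_op <= 0x9 and not is_out
    is_ldi = dec_op == 0xA
    is_branch = dec_op in (0xC, 0xD, 0xE)
    is_hlt = dec_op == 0xF
    is_mov = dec_op == 0x8
    flag_update = not (is_mov or is_out or is_ldi or is_branch or is_hlt)
    # Writes to R0 are dropped, so rd == 0 never enables a write.
    reg_write = (is_alu_rr or is_ldi) and dec_rd != 0
    return {"flag_update": flag_update, "reg_write": reg_write}

names = ["ADD", "SUB", "AND", "OR", "XOR", "SHL", "SHR", "OUT",
         "MOV", "NOT", "LDI", "CMP", "JMP", "BZ", "BNZ", "HLT"]
for op, name in enumerate(names):
    c = control(op, dec_rd=1)
    print(f"{name:4s} reg_write={int(c['reg_write'])} "
          f"flag_update={int(c['flag_update'])}")
```

CMP comes out as the only instruction that updates flags without writing a register, which is what makes it useful ahead of BZ / BNZ.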

The register file and ALU

This block is verbatim P05’s datapath, inlined so each project on the ladder stands alone. regs[0:7] is an unpacked array of 8-bit flops, the ALU is a single combinational case driving alu_y, flags_d captures {Z, N, C, V} at the end of EXECUTE, and flags_q holds the copy committed at WB.

projects/06_fsm_cpu/src/top.sv system-verilog · L219-267
  logic [7:0] regs [0:7];
  logic [3:0] flags_q;        // {Z, N, C, V}

  // Decode-stage operand registers (so the ALU compute is a separate
  // pipeline-ish stage from regfile read).
  logic [7:0] op_a;
  logic [7:0] op_b;
  // Execute-stage result register, fed to writeback.
  logic [7:0] result_q;
  logic [3:0] flags_d;        // computed in EXECUTE, captured at WB

  // Async regfile read selectors. (R0 reads as 0, writes ignored.)
  // Verilog-2001 style — Yosys's read_verilog rejects `return` inside
  // an automatic function.
  function [7:0] reg_read(input [2:0] sel);
    if (sel == 3'd0) reg_read = 8'h00;
    else             reg_read = regs[sel];
  endfunction

  // ---- ALU: same shape as P05 ----
  wire [7:0] a_data = op_a;
  wire [7:0] b_data = op_b;
  wire [8:0] add_w  = {1'b0, a_data} + {1'b0, b_data};
  wire [8:0] sub_w  = {1'b0, a_data} - {1'b0, b_data};

  logic [7:0] alu_y;
  logic       c_out, v_out;
  always_comb begin
    alu_y = 8'h00;
    c_out = 1'b0;
    v_out = 1'b0;
    unique case (alu_op)
      4'b0000: begin alu_y = add_w[7:0]; c_out = add_w[8];
                     v_out = (a_data[7] == b_data[7]) && (alu_y[7] != a_data[7]); end
      4'b0001: begin alu_y = sub_w[7:0]; c_out = sub_w[8];
                     v_out = (a_data[7] != b_data[7]) && (alu_y[7] != a_data[7]); end
      4'b0010: alu_y = a_data & b_data;
      4'b0011: alu_y = a_data | b_data;
      4'b0100: alu_y = a_data ^ b_data;
      4'b0101: begin alu_y = {a_data[6:0], 1'b0}; c_out = a_data[7]; end
      4'b0110: begin alu_y = {1'b0, a_data[7:1]}; c_out = a_data[0]; end
      4'b0111: begin alu_y = {a_data[7], a_data[7:1]}; c_out = a_data[0]; end
      4'b1000: alu_y = a_data;
      4'b1001: alu_y = ~a_data;
      default: alu_y = a_data;
    endcase
  end
  wire z_out = (alu_y == 8'h00);
  wire n_out =  alu_y[7];
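
A behavioural model helps when reasoning about the flag bits, especially carry and overflow. The Python sketch below follows the case statement above arm by arm; it is an illustrative reference model, not something shipped with the project.

```python
# Behavioural model of the 8-bit ALU case statement and its four flags.

def alu(alu_op: int, a: int, b: int):
    """Return (y, z, n, c, v) for 8-bit operands a and b."""
    a &= 0xFF
    b &= 0xFF
    c = v = 0
    if alu_op == 0b0000:                      # ADD
        wide = a + b
        y, c = wide & 0xFF, wide >> 8
        v = int((a >> 7) == (b >> 7) and (y >> 7) != (a >> 7))
    elif alu_op == 0b0001:                    # SUB; c is bit 8 of the 9-bit result
        wide = (a - b) & 0x1FF
        y, c = wide & 0xFF, wide >> 8
        v = int((a >> 7) != (b >> 7) and (y >> 7) != (a >> 7))
    elif alu_op == 0b0010:                    # AND
        y = a & b
    elif alu_op == 0b0011:                    # OR
        y = a | b
    elif alu_op == 0b0100:                    # XOR
        y = a ^ b
    elif alu_op == 0b0101:                    # SHL, C gets the old MSB
        y, c = (a << 1) & 0xFF, a >> 7
    elif alu_op == 0b0110:                    # SHR, C gets the old LSB
        y, c = a >> 1, a & 1
    elif alu_op == 0b0111:                    # SAR, not selected by any CPU opcode
        y, c = (a & 0x80) | (a >> 1), a & 1
    elif alu_op == 0b1001:                    # NOT
        y = ~a & 0xFF
    else:                                     # 0b1000 and default: pass a through
        y = a
    return y, int(y == 0), y >> 7, c, v

print(alu(0b0000, 0x7F, 0x01))   # signed overflow: 127 + 1
print(alu(0b0001, 0x00, 0x01))   # borrow: 0 - 1
```

For SUB, the C bit is bit 8 of the zero-extended 9-bit difference, so it is set when a borrow occurs (a < b as unsigned values).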

The FSM (next-state logic)

Five states, three flops. FETCH→DECODE→EXECUTE→WB is rigid; WB is the only state with a real branch in the next-state logic. HALT is a self-loop with no escape.

projects/06_fsm_cpu/src/top.sv system-verilog · L272-307
  typedef enum logic [2:0] {
    S_FETCH   = 3'd0,
    S_DECODE  = 3'd1,
    S_EXECUTE = 3'd2,
    S_WB      = 3'd3,
    S_HALT    = 3'd4
  } state_t;

  state_t state, next_state;

  // UART transmitter signals (the submodule is instantiated below).
  // We pulse uart_start_pulse for one cycle as the FSM moves from
  // EXECUTE to WB on an OUT instruction. The UART then asserts
  // uart_busy until the byte (start + 8 data + stop) finishes
  // transmitting; the FSM stalls in S_WB while busy is high.
  logic       uart_start_pulse;
  logic [7:0] uart_data;
  logic       uart_busy;

  always_comb begin
    next_state = state;
    unique case (state)
      S_FETCH:   next_state = S_DECODE;
      S_DECODE:  next_state = S_EXECUTE;
      S_EXECUTE: next_state = S_WB;
      S_WB: begin
        // OUT stalls in WB until the UART finishes transmitting. Once
        // uart_busy drops we proceed to the next instruction.
        if (is_out && uart_busy)      next_state = S_WB;
        else if (is_hlt)              next_state = S_HALT;
        else                          next_state = S_FETCH;
      end
      S_HALT:    next_state = S_HALT;
      default:   next_state = S_FETCH;
    endcase
  end
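
The next-state function is small enough to model directly. The Python sketch below mirrors the case statement and traces one instruction through the machine; busy_cycles is a hypothetical knob standing in for however many WB cycles the UART keeps uart_busy high, and trace_instruction is an illustrative helper.

```python
from enum import IntEnum

class State(IntEnum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2
    WB = 3
    HALT = 4

def next_state(state: State, is_out: bool, is_hlt: bool, uart_busy: bool) -> State:
    """Mirror of the always_comb next-state logic."""
    if state == State.FETCH:
        return State.DECODE
    if state == State.DECODE:
        return State.EXECUTE
    if state == State.EXECUTE:
        return State.WB
    if state == State.WB:
        if is_out and uart_busy:
            return State.WB          # stall while the byte is on the wire
        if is_hlt:
            return State.HALT
        return State.FETCH
    return State.HALT                # HALT is a self-loop

def trace_instruction(is_out=False, is_hlt=False, busy_cycles=0) -> list:
    """State names visited by one instruction, ending on the state after WB."""
    state, trace, wb_seen = State.FETCH, [], 0
    while True:
        trace.append(state.name)
        uart_busy = state == State.WB and is_out and wb_seen < busy_cycles
        if state == State.WB:
            wb_seen += 1
        state = next_state(state, is_out, is_hlt, uart_busy)
        if state in (State.FETCH, State.HALT):
            trace.append(state.name)
            return trace

print(trace_instruction())
print(trace_instruction(is_out=True, busy_cycles=2))
print(trace_instruction(is_hlt=True))
```

An ordinary instruction visits WB once; an OUT repeats WB once for every cycle the UART reports busy and then one more time to leave.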

UART and OUT instruction signals

Three wires bridge the CPU’s main FSM and the UART submodule: uart_start_pulse is high for exactly one clock as we move through EXECUTE on an OUT, uart_data is the byte to send (already sitting in op_a from DECODE), and uart_busy is the back-pressure that holds the main FSM in WB until transmission completes.

projects/06_fsm_cpu/src/top.sv system-verilog · L282-325
  // UART transmitter signals (the submodule is instantiated below).
  // We pulse uart_start_pulse for one cycle as the FSM moves from
  // EXECUTE to WB on an OUT instruction. The UART then asserts
  // uart_busy until the byte (start + 8 data + stop) finishes
  // transmitting; the FSM stalls in S_WB while busy is high.
  logic       uart_start_pulse;
  logic [7:0] uart_data;
  logic       uart_busy;

  // Pulse uart_start for exactly one cycle: when we're moving out of
  // EXECUTE on an OUT and have just latched op_a as the byte to send.
  assign uart_start_pulse = (state == S_EXECUTE) && is_out;
  // The byte to send is sitting in op_a (we routed regs[ra] there in
  // DECODE; the ALU's MOV passthrough also computed result_q = op_a in
  // EXECUTE, but op_a is one cycle earlier so latency is identical).
  assign uart_data        = op_a;

  // Branch resolution — happens in WB stage.
  // (Computed combinationally from flags_q, since flags from the *just-
  // executed* instruction haven't been latched yet — they're in flags_d
  // mid-EXECUTE/WB. For simplicity, branches consume flags from the
  // most recently captured flag register, so a CMP must complete one
  // full instruction before the BZ/BNZ that depends on it.)
  wire take_branch = is_jmp
                   || (is_bz  &&  flags_q[3])
                   || (is_bnz && ~flags_q[3]);

The sequential always_ff

One big always_ff with a per-state case. Reset clears state, PC, IR, operands, results, flags, and zeroes the regfile. Otherwise each state’s behavior is one or two non-blocking assignments.

projects/06_fsm_cpu/src/top.sv system-verilog · L330-369
  integer i;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      state    <= S_FETCH;
      pc       <= 5'd0;
      ir       <= 16'h0000;
      op_a     <= 8'h00;
      op_b     <= 8'h00;
      result_q <= 8'h00;
      flags_d  <= 4'h0;
      flags_q  <= 4'h0;
      for (i = 0; i < 8; i = i + 1) regs[i] <= 8'h00;
    end else begin
      state <= next_state;
      unique case (state)
        S_FETCH: begin
          ir <= PROG[16*pc +: 16];
          // PC advances on FETCH; branches override in WB.
          pc <= pc + 5'd1;
        end
        S_DECODE: begin
          // LDI: route imm through op_a so the ALU's MOV (a-passthrough)
          // writes it back to rd. ra/rb are unused for LDI.
          op_a <= is_ldi ? dec_imm : reg_read(dec_ra);
          op_b <= reg_read(dec_rb);
        end
        S_EXECUTE: begin
          result_q <= alu_y;
          flags_d  <= {z_out, n_out, c_out, v_out};
        end
        S_WB: begin
          if (reg_write) regs[dec_rd] <= result_q;
          if (flag_update || is_cmp) flags_q <= flags_d;
          if (is_branch && take_branch) pc <= dec_addr;
        end
        S_HALT: ;
        default: ;
      endcase
    end
  end
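
For checking expected register values by hand, an instruction-level model is often enough. The Python sketch below is not cycle-accurate: each loop iteration does what FETCH through WB do together, and it tracks only the Z flag, since that is all the branches consume. The run function and the hard-coded program words are illustrative; the words encode the countdown loop used as Test 2 in the testbench.

```python
# Instruction-level reference model (not cycle-accurate, Z flag only).

FLAG_OPS = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x9, 0xB}
WRITE_OPS = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA}

def run(words, max_steps=1000):
    """Execute until HLT; return (regs, bytes sent by OUT)."""
    rom = list(words) + [0] * (32 - len(words))
    regs, pc, z, sent = [0] * 8, 0, 0, []
    for _ in range(max_steps):
        ir = rom[pc]
        pc = (pc + 1) & 0x1F                       # PC advances at FETCH
        op, rd = ir >> 12, (ir >> 9) & 7
        a, b = regs[(ir >> 6) & 7], regs[(ir >> 3) & 7]
        if op == 0xF:                              # HLT
            return regs, sent
        if op == 0x7:                              # OUT
            sent.append(a)
        elif op == 0xC or (op == 0xD and z) or (op == 0xE and not z):
            pc = ir & 0x1F                         # taken branch overrides PC
        y = {0x0: a + b, 0x1: a - b, 0x2: a & b, 0x3: a | b, 0x4: a ^ b,
             0x5: a << 1, 0x6: a >> 1, 0x8: a, 0x9: ~a,
             0xA: (ir >> 1) & 0xFF, 0xB: a - b}.get(op, 0) & 0xFF
        if op in FLAG_OPS:
            z = int(y == 0)
        if op in WRITE_OPS and rd != 0:            # writes to R0 are dropped
            regs[rd] = y
    raise RuntimeError("program did not halt")

LOOP = [0xAFFE,  # 00 LDI R7, 0xFF
        0xA20A,  # 01 LDI R1, 5
        0xA400,  # 02 LDI R2, 0
        0xA602,  # 03 LDI R3, 1
        0x0278,  # 04 ADD R1, R1, R7
        0x0498,  # 05 ADD R2, R2, R3
        0xB040,  # 06 CMP R1, R0
        0xE004,  # 07 BNZ 04
        0x8E80,  # 08 MOV R7, R2
        0xF000]  # 09 HLT

regs, sent = run(LOOP)
print(regs)
```

The final register contents agree with the values the testbench checks for this program: R1 = 0, R2 = 5, and R7 = 5.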

Outputs

Three port assigns and one explicit lint tieoff for the unused start input.

projects/06_fsm_cpu/src/top.sv system-verilog · L384-393
  // ---- outputs ----
  assign out    = regs[7];
  assign pc_out = pc;
  assign halted = (state == S_HALT);

  // start input reserved for future use (single-step / run gate).
  // Tie off the lint warning explicitly.
  wire _unused = &{1'b0, start};

endmodule

The UART submodule

The same 8N1 transmitter project 03 hardened standalone, copied verbatim into this file. Each project on the ladder stays self-contained — no include files, no shared modules — so reading top.sv is enough to understand the whole chip.

projects/06_fsm_cpu/src/top.sv system-verilog · L395-484
// ---------------------------------------------------------------------
// uart_tx — 8N1 UART transmitter, lifted from project 03.
//
// Pulse `start` for one cycle with `data` valid; tx will then drive the
// start bit, 8 data bits LSB-first, and a stop bit, with each bit held
// for `baud_div + 1` clock cycles. busy stays high for the duration so
// the host can poll-and-wait. Idle line level is high.
// ---------------------------------------------------------------------
module uart_tx (
    input  logic         clk,
    input  logic         rst_n,
    input  logic         start,
    input  logic [7:0]   data,
    input  logic [15:0]  baud_div,
    output logic         tx,
    output logic         busy
);

  typedef enum logic [3:0] {
    U_IDLE  = 4'd0,
    U_START = 4'd1,
    U_D0    = 4'd2,  U_D1 = 4'd3,  U_D2 = 4'd4,  U_D3 = 4'd5,
    U_D4    = 4'd6,  U_D5 = 4'd7,  U_D6 = 4'd8,  U_D7 = 4'd9,
    U_STOP  = 4'd10
  } ustate_t;

  ustate_t ustate, ustate_next;

  logic [15:0] baud_cnt;
  logic        bit_tick;
  assign bit_tick = (baud_cnt == 16'd0);

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                     baud_cnt <= 16'd0;
    else if (ustate == U_IDLE)      baud_cnt <= baud_div;
    else if (bit_tick)              baud_cnt <= baud_div;
    else                            baud_cnt <= baud_cnt - 16'd1;
  end

  // Latch data on entry to U_START.
  logic [7:0] data_q;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                              data_q <= 8'h00;
    else if (ustate == U_IDLE && start)      data_q <= data;
  end

  always_comb begin
    ustate_next = ustate;
    unique case (ustate)
      U_IDLE:  if (start)    ustate_next = U_START;
      U_START: if (bit_tick) ustate_next = U_D0;
      U_D0:    if (bit_tick) ustate_next = U_D1;
      U_D1:    if (bit_tick) ustate_next = U_D2;
      U_D2:    if (bit_tick) ustate_next = U_D3;
      U_D3:    if (bit_tick) ustate_next = U_D4;
      U_D4:    if (bit_tick) ustate_next = U_D5;
      U_D5:    if (bit_tick) ustate_next = U_D6;
      U_D6:    if (bit_tick) ustate_next = U_D7;
      U_D7:    if (bit_tick) ustate_next = U_STOP;
      U_STOP:  if (bit_tick) ustate_next = U_IDLE;
      default:               ustate_next = U_IDLE;
    endcase
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) ustate <= U_IDLE;
    else        ustate <= ustate_next;
  end

  always_comb begin
    unique case (ustate)
      U_IDLE:  tx = 1'b1;
      U_START: tx = 1'b0;
      U_D0:    tx = data_q[0];
      U_D1:    tx = data_q[1];
      U_D2:    tx = data_q[2];
      U_D3:    tx = data_q[3];
      U_D4:    tx = data_q[4];
      U_D5:    tx = data_q[5];
      U_D6:    tx = data_q[6];
      U_D7:    tx = data_q[7];
      U_STOP:  tx = 1'b1;
      default: tx = 1'b1;
    endcase
  end

  assign busy = (ustate != U_IDLE);

endmodule
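
The frame the transmitter puts on the wire can be modelled per clock in Python. The sketch below generates the tx level for every clock of one byte and then decodes it by sampling the middle of each bit time. The function names are illustrative, and the receiver is only a minimal stand-in for the idea of mid-bit sampling, not a copy of any testbench code.

```python
# Per-clock model of one 8N1 frame, plus a mid-bit-sampling decoder.

def uart_tx_wave(byte: int, baud_div: int) -> list:
    """tx level on each busy clock: start bit, 8 data bits LSB-first, stop bit."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    wave = []
    for bit in bits:
        wave.extend([bit] * (baud_div + 1))    # each bit held baud_div + 1 clocks
    return wave

def uart_rx_decode(wave: list, baud_div: int) -> int:
    """Sample the middle of each bit time and rebuild the byte."""
    clocks_per_bit = baud_div + 1
    mid = clocks_per_bit // 2
    samples = [wave[i * clocks_per_bit + mid] for i in range(10)]
    if samples[0] != 0 or samples[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(samples[1:9]))

wave = uart_tx_wave(0x41, baud_div=3)
print(len(wave), hex(uart_rx_decode(wave, baud_div=3)))
```

One frame keeps busy high for 10 × (baud_div + 1) clocks: 40 clocks with the testbench’s baud_div of 3, and 8,690 clocks (about 86.9 µs at 100 MHz) with 868.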

Verifying testbench

Three programs covering the major instruction classes.

projects/06_fsm_cpu/test/tb.sv system-verilog
// Project 06 testbench — verifying TB for the tiny FSM CPU.
//
// Strategy: run the DUT's default boot program (Fibonacci(6) into
// R1..R6, copy R6 into R7, then HLT). Wait for `halted` to assert,
// then peek at R1..R7 by reading the testbench's view of the regfile
// via hierarchical access.
//
// We also exercise:
//   - LDI to a few different registers and values.
//   - ADD/OR/MOV with register operands (an AND is present but is
//     deliberately jumped over).
//   - CMP + BNZ branch flow.
//   - JMP forward (skip-over).
//   - HLT terminates the FSM in the S_HALT state.
//
// Each test uses a different program loaded via the PROG parameter.
// `top` exposes PROG as an overridable parameter; we instantiate one
// DUT per test program, all sharing clk, rst_n, and baud_div.

`timescale 1ns/1ps
`default_nettype none

module tb;

  // 100 MHz chip clock.
  logic clk = 0;
  always #5 clk = ~clk;

  logic rst_n;
  logic start = 1'b1;

  int errors = 0;
  int test_num = 0;

  // ---- helpers ---------------------------------------------------------

  // Apply reset and release on the next negedge.
  task automatic do_reset;
    begin
      rst_n = 1'b0;
      repeat (4) @(posedge clk);
      @(negedge clk); rst_n = 1'b1;
    end
  endtask

  // ---- instruction encoders, kept in sync with top.sv ------------------
  // Replicated here because functions inside the DUT aren't visible
  // through hierarchical reference for parameter computation.
  function automatic logic [15:0] enc_alu(input logic [3:0] op,
                                            input logic [2:0] rd,
                                            input logic [2:0] ra,
                                            input logic [2:0] rb);
    return {op, rd, ra, rb, 3'b000};
  endfunction
  function automatic logic [15:0] enc_unary(input logic [3:0] op,
                                              input logic [2:0] rd,
                                              input logic [2:0] ra);
    return {op, rd, ra, 3'b000, 3'b000};
  endfunction
  function automatic logic [15:0] enc_ldi(input logic [2:0] rd,
                                            input logic [7:0] imm);
    return {4'hA, rd, imm, 1'b0};
  endfunction
  function automatic logic [15:0] enc_jmp(input logic [3:0] op,
                                            input logic [4:0] addr);
    return {op, 7'b0000000, addr};
  endfunction
  function automatic logic [15:0] enc_hlt;
    return {4'hF, 12'h000};
  endfunction

  // ---- Test 1: default Fibonacci program (uses DUT's built-in PROG) ---

  // Baud divider — shared across all DUTs. Set to a small value so
  // OUT-driven UART transmissions complete quickly in sim. (Real
  // hardware uses something like 868 for 115200 @ 100 MHz; sim doesn't
  // care about the wire-level timing as long as it's not zero.)
  logic [15:0] baud_div = 16'd3;

  logic [7:0] dut1_out;
  logic [4:0] dut1_pc;
  logic       dut1_halted;
  logic       dut1_tx;
  top dut1 (.clk(clk), .rst_n(rst_n), .start(start), .baud_div(baud_div),
            .out(dut1_out), .pc_out(dut1_pc), .halted(dut1_halted),
            .uart_tx(dut1_tx));

  // ---- Test 2: branching / loop program ------------------------------
  // Counts down from 5 in R1 to 0 using ADD with -1, then CMP + BNZ.
  // Final R1 = 0, R7 (copy of the tally in R2) = number of iterations = 5.
  // There is no immediate decrement and R0 reads as 0, so R7 is first
  // loaded with 0xFF (-1) and ADD R1, R1, R7 does the subtraction.
  //
  //   00 LDI R7, 0xFF        ; R7 = -1
  //   01 LDI R1, 5           ; R1 = 5 (loop count)
  //   02 LDI R2, 0           ; R2 = 0 (iteration tally)
  //   03 LDI R3, 1           ; R3 = 1 (increment for tally)
  // loop:
  //   04 ADD R1, R1, R7      ; R1 = R1 - 1 (since R7 = 0xFF wraps as -1)
  //   05 ADD R2, R2, R3      ; R2 += 1
  //   06 CMP R1, R0          ; compare R1 with 0; sets Z if R1 == 0
  //   07 BNZ 04              ; loop if R1 != 0
  //   08 MOV R7, R2          ; final tally → R7 (= out)
  //   09 HLT
  // Packed bit-vector programs — concatenation order is MSB-first, so
  // PC=31 sits at the top of the {} list and PC=0 at the bottom.
  localparam int PROG_BITS = 32 * 16;
  localparam logic [PROG_BITS-1:0] PROG_LOOP = {
    {22{16'h0000}},                                // 31..10
    enc_hlt(),                                      // 09 HLT
    enc_unary(4'h8, 3'd7, 3'd2),                    // 08 MOV R7,R2
    enc_jmp (4'hE, 5'd4),                           // 07 BNZ 04
    enc_alu (4'hB, 3'd0, 3'd1, 3'd0),               // 06 CMP R1,R0
    enc_alu (4'h0, 3'd2, 3'd2, 3'd3),               // 05 ADD R2,R2,R3
    enc_alu (4'h0, 3'd1, 3'd1, 3'd7),               // 04 ADD R1,R1,R7
    enc_ldi (3'd3, 8'd1),                           // 03 R3 = 1
    enc_ldi (3'd2, 8'd0),                           // 02 R2 = 0
    enc_ldi (3'd1, 8'd5),                           // 01 R1 = 5
    enc_ldi (3'd7, 8'hFF)                           // 00 R7 = -1
  };

  logic [7:0] dut2_out;
  logic [4:0] dut2_pc;
  logic       dut2_halted;
  logic       dut2_tx;
  top #(.PROG(PROG_LOOP)) dut2 (.clk(clk), .rst_n(rst_n), .start(start), .baud_div(baud_div),
                                  .out(dut2_out), .pc_out(dut2_pc), .halted(dut2_halted),
                                  .uart_tx(dut2_tx));

  // ---- Test 3: bitwise + JMP-forward (skip-over) -----------------------
  //   00 LDI R1, 0xF0
  //   01 LDI R2, 0x0F
  //   02 OR  R3, R1, R2      ; R3 = 0xFF
  //   03 JMP 06              ; skip the AND
  //   04 AND R3, R1, R2      ; would set R3 = 0 if reached
  //   05 HLT
  //   06 MOV R7, R3          ; R7 = 0xFF (proves we jumped past the AND)
  //   07 HLT
  localparam logic [PROG_BITS-1:0] PROG_JMP = {
    {24{16'h0000}},                                // 31..08
    enc_hlt(),                                      // 07 HLT
    enc_unary(4'h8, 3'd7, 3'd3),                    // 06 MOV R7,R3
    enc_hlt(),                                      // 05 HLT
    enc_alu (4'h2, 3'd3, 3'd1, 3'd2),               // 04 AND R3,R1,R2 (skipped)
    enc_jmp (4'hC, 5'd6),                           // 03 JMP 06
    enc_alu (4'h3, 3'd3, 3'd1, 3'd2),               // 02 OR R3,R1,R2
    enc_ldi (3'd2, 8'h0F),                          // 01 R2 = 0x0F
    enc_ldi (3'd1, 8'hF0)                           // 00 R1 = 0xF0
  };

  logic [7:0] dut3_out;
  logic [4:0] dut3_pc;
  logic       dut3_halted;
  logic       dut3_tx;
  top #(.PROG(PROG_JMP)) dut3 (.clk(clk), .rst_n(rst_n), .start(start), .baud_div(baud_div),
                                  .out(dut3_out), .pc_out(dut3_pc), .halted(dut3_halted),
                                  .uart_tx(dut3_tx));

  task automatic check8(input logic [7:0] got, input logic [7:0] exp,
                          input string label);
    begin
      if (got !== exp) begin
        $display("FAIL [%s] got 0x%02h, expected 0x%02h", label, got, exp);
        errors = errors + 1;
      end
    end
  endtask

  // ---- main ------------------------------------------------------------

  initial begin
    $dumpfile("tb.vcd");
    $dumpvars(0, tb);

    // ---- Test 1: Fibonacci ----
    do_reset();
    begin
      int cycles; cycles = 0;
      while (!dut1_halted && cycles < 1000) begin
        @(posedge clk); cycles = cycles + 1;
      end
      if (!dut1_halted) begin
        $display("FAIL [fib] did not halt within 1000 cycles");
        errors = errors + 1;
      end
    end
    // Expected regfile after Fibonacci program completes:
    //   R1=1, R2=1, R3=2, R4=3, R5=5, R6=8, R7=8 (= R6 → out)
    check8(dut1.regs[1], 8'd1, "fib R1");
    check8(dut1.regs[2], 8'd1, "fib R2");
    check8(dut1.regs[3], 8'd2, "fib R3");
    check8(dut1.regs[4], 8'd3, "fib R4");
    check8(dut1.regs[5], 8'd5, "fib R5");
    check8(dut1.regs[6], 8'd8, "fib R6");
    check8(dut1_out,     8'd8, "fib out (R7)");

    // ---- Test 2: countdown loop with CMP + BNZ ----
    do_reset();
    begin
      int cycles; cycles = 0;
      while (!dut2_halted && cycles < 1000) begin
        @(posedge clk); cycles = cycles + 1;
      end
      if (!dut2_halted) begin
        $display("FAIL [loop] did not halt within 1000 cycles");
        errors = errors + 1;
      end
    end
    // R1 should have decremented to 0; R2 should have tallied 5 iterations.
    check8(dut2.regs[1], 8'd0, "loop R1 final");
    check8(dut2.regs[2], 8'd5, "loop R2 tally");
    check8(dut2_out,     8'd5, "loop out (R7=R2)");

    // ---- Test 3: JMP forward skips the AND ----
    do_reset();
    begin
      int cycles; cycles = 0;
      while (!dut3_halted && cycles < 1000) begin
        @(posedge clk); cycles = cycles + 1;
      end
      if (!dut3_halted) begin
        $display("FAIL [jmp] did not halt within 1000 cycles");
        errors = errors + 1;
      end
    end
    check8(dut3.regs[3], 8'hFF, "jmp R3 (= OR result, AND skipped)");
    check8(dut3_out,     8'hFF, "jmp out (R7=R3)");

    if (errors == 0) $display("PASS: tiny FSM CPU, all programs executed correctly.");
    else             $display("FAIL: %0d errors", errors);

    $finish;
  end

  initial begin
    #5_000_000;
    $display("FAIL: testbench timed out");
    $finish;
  end

endmodule

`default_nettype wire

Demo

The demo program computes the first seven Fibonacci numbers (1, 1, 2, 3, 5, 8, 13) into R1..R7 and pushes each new value out the UART with the OUT instruction. The demo TB includes a behavioral 8N1 receiver that watches the uart_tx pin, samples each bit in the middle of its bit time, reconstructs the byte, and prints it. The log therefore interleaves the CPU trace, printed in each instruction’s FETCH cycle, with a [uart] line for every byte received on the wire:

[cpu]  pc=00  ir=a202  LDI  R1,#0x01   | R1=00 R2=00 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu]  pc=01  ir=a402  LDI  R2,#0x01   | R1=01 R2=00 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu]  pc=02  ir=7040  OUT  R1         | R1=01 R2=01 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[uart]  rx byte 1: 0x01 (1)
[cpu]  pc=03  ir=7080  OUT  R2         | R1=01 R2=01 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[uart]  rx byte 2: 0x01 (1)
[cpu]  pc=04  ir=0650  ADD  R3,R1,R2   | R1=01 R2=01 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu]  pc=05  ir=70c0  OUT  R3         | R1=01 R2=01 R3=02 R4=00 R5=00 R6=00 R7=00 | flags=0000
[uart]  rx byte 3: 0x02 (2)
[cpu]  pc=06  ir=0898  ADD  R4,R2,R3   | R1=01 R2=01 R3=02 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu]  pc=07  ir=7100  OUT  R4         | R1=01 R2=01 R3=02 R4=03 R5=00 R6=00 R7=00 | flags=0000
[uart]  rx byte 4: 0x03 (3)
...
[cpu]  pc=13  ir=71c0  OUT  R7         | R1=01 R2=01 R3=02 R4=03 R5=05 R6=08 R7=0d | flags=0000
[uart]  rx byte 7: 0x0d (13)
[cpu]  pc=14  ir=f000  HLT             | R1=01 R2=01 R3=02 R4=03 R5=05 R6=08 R7=0d | flags=0000
[cpu]  halted at pc=15
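
The behavioral receiver described above can be sketched roughly as follows. This is a minimal illustration, not the project’s actual demo testbench: the module name uart_rx_sketch, the task send_byte, and the parameter CLKS_PER_BIT are all hypothetical, the stand-in transmitter replaces the CPU so the sketch runs on its own, and LSB-first bit order is assumed (the usual UART convention).

```systemverilog
// Hypothetical sketch of a mid-bit-sampling 8N1 receiver; names are illustrative.
module uart_rx_sketch;
  localparam int CLKS_PER_BIT = 4;   // assumed to match the demo's baud divider

  logic       clk     = 1'b0;
  logic       uart_tx = 1'b1;        // the line idles high
  logic [7:0] rx_byte;
  int         rx_count = 0;

  always #5 clk = ~clk;              // 10 ns period

  // Stand-in transmitter: one start bit, eight data bits (LSB first), one stop bit.
  task automatic send_byte(input logic [7:0] data);
    logic [9:0] frame;
    frame = {1'b1, data, 1'b0};      // bit 0 is the start bit, bit 9 the stop bit
    for (int i = 0; i < 10; i++) begin
      uart_tx = frame[i];
      repeat (CLKS_PER_BIT) @(posedge clk);
    end
  endtask

  // Receiver: wait for the start edge, then sample in the middle of every bit.
  initial forever begin
    @(negedge uart_tx);                          // falling edge marks the start bit
    repeat (CLKS_PER_BIT / 2) @(posedge clk);    // advance to the middle of the start bit
    for (int i = 0; i < 8; i++) begin
      repeat (CLKS_PER_BIT) @(posedge clk);      // middle of data bit i
      rx_byte[i] = uart_tx;
    end
    repeat (CLKS_PER_BIT) @(posedge clk);        // middle of the stop bit
    rx_count++;
    if (uart_tx !== 1'b1)
      $display("[uart]  framing error on byte %0d", rx_count);
    else
      $display("[uart]  rx byte %0d: 0x%02h (%0d)", rx_count, rx_byte, rx_byte);
  end

  // Drive two sample bytes, then stop.
  initial begin
    repeat (3) @(posedge clk);                   // let the receiver arm first
    send_byte(8'h01);
    send_byte(8'h0d);
    repeat (2 * CLKS_PER_BIT) @(posedge clk);
    $finish;
  end
endmodule
```

Sampling mid-bit keeps the receiver away from the transitions at the bit boundaries, so it tolerates a small mismatch between its own bit timing and the transmitter’s. In the real testbench the uart_tx signal would come from the DUT rather than from a stand-in task.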

Each OUT instruction pulses the UART’s start for one cycle, and the CPU’s FSM stalls in WB until the UART’s busy line drops: about 40 clock cycles per byte at the demo’s 4-clocks-per-bit baud divider, since an 8N1 frame is 10 bits (start, eight data, stop). Real hardware running at 115200 baud from the 100 MHz target clock would need roughly 868 clocks per bit and so stall about 8,700 cycles per byte, which is fine because the program is doing nothing useful while the byte is on the wire anyway.
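
The stall estimate above is just frame length times clocks per bit. A small sketch of the arithmetic, assuming the divider counts whole clocks per bit, a 100 MHz clock (the harden target), and ignoring the cycle or so of start/busy handshake overhead; the module and function names are hypothetical:

```systemverilog
// Hypothetical helper: estimates CPU stall cycles per UART byte.
module uart_stall_estimate;
  // 8N1 frame: 1 start bit + 8 data bits + 1 stop bit.
  localparam int BITS_PER_FRAME = 10;

  // Clock cycles the line is busy for one byte at a given clocks-per-bit divider.
  function automatic int stall_cycles(input int clocks_per_bit);
    return BITS_PER_FRAME * clocks_per_bit;
  endfunction

  initial begin
    // Demo testbench setting: 4 clocks per bit -> 40 cycles.
    $display("demo:   %0d cycles per byte", stall_cycles(4));
    // 100 MHz / 115200 baud = 868 clocks per bit (integer division) -> 8680 cycles.
    $display("115200: %0d cycles per byte", stall_cycles(100_000_000 / 115_200));
    $finish;
  end
endmodule
```

A divider of 868 fits comfortably in the UART’s 16-bit baud counter.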

Reading the layout

Open the viewer at the top and click each annotation in turn:

  • A highlights one cell row near the bottom edge — the same building block P01 uses, just one of dozens here.
  • R outlines the bulk of the regfile (48 dfrtp_2 cells). Notice how spread out it is: the placer doesn’t keep regs[1][0..7] together, because what it cares about is shortening the wires to the ALU and the writeback mux that read/write each bit. Source-code locality has nothing to do with placement.
  • S outlines the 3-flop FSM state register in a narrow column mid-chip. Three bits is everything that distinguishes a CPU from a static datapath. Click into it and you can see how small the “control” really is: a tap-row column with a few flops tucked in.
  • D outlines the result/flag writeback registers in the upper half — the result_q and flags_q flops latched at the end of EXECUTE. These exist because of the FSM split; in P05 the same logic ran combinationally without an intermediate register, which is exactly why P05’s slow-corner setup missed.
  • U outlines the UART transmitter along the top edge of the chip. The placer kept the whole peripheral together because its 28 flops all talk to each other and to nothing else. The same module that’s spread across 110 µm in P03’s standalone harden is squeezed into a thin band here — different placement context, different shape.

What just happened?

We made a CPU. ~290 lines of SV (130 of them the inlined P05 datapath, ~100 the inlined P03 UART), one parameterized ROM, one 4-state FSM, three 3-bit register-address fields, a real serial output. It runs Fibonacci, can branch, can compare, can jump, can print bytes out a wire, can halt. It’s tiny (0.014 mm² of core area, three flops of FSM state), but the shape is exactly that of a real microcontroller: fetch, decode, execute, writeback, repeat, with a peripheral hanging off the side. P07 puts a real bus on the back so this CPU can talk to several peripherals through the same address space instead of pinning each one to a dedicated instruction.

See also