No. 07 / project of 147 on the ladder

Tiny memory-mapped SoC

introduces — bus, memory-mapped I/O, address decoder, UART RX

harden state · last run 2026-04-28
cells: 8,330 non-filler
slack: 0.63 ns setup
area: 48,400 μm² (die) / 43,932 μm² (core)
signoff
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

P06’s CPU bolted onto a real bus, with peripherals at fixed memory addresses. Two opcodes were swapped to make room: P06’s OUT (special-case UART) and NOT (rarely used) became ST [ra], rb and LD rd, [ra]. The same instructions reach RAM, UART, and GPIO — all through one address space.

[Layout view · sky130A · 220 × 220 μm die · 71 MHz target · CPU + RAM + UART(TX+RX) + GPIO]
[3D view · sky130A · full sky130 stack · z exaggerated 10× · 398k shapes]

Clock target: 71 MHz (14 ns period). 8,330 cells — 3.6× P06’s count once you fold in the bus, the GPIO peripheral, and the UART RX controller.

Three failed attempts came before the clean run. 160 × 160 µm at 100 MHz failed placement at 82% utilisation. 220 × 220 µm at 100 MHz built but missed slow-corner setup by 0.89 ns (slack −0.89 ns). 220 × 220 µm at 83 MHz still missed by 0.22 ns (slack −0.22 ns). 71 MHz (14 ns) lands with +0.63 ns of slack. The bus + UART RX combinational paths are deeper than P06’s hardwired UART, and there’s no pipelining between the regfile read and the bus rdata mux; that’s a P09 problem.
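The arithmetic behind those runs can be sketched in a few lines of Python. The helper names are made up for illustration; the frequencies and slack figures are the ones quoted above. The "implied path delay" is simply period minus slack, and it is only a rough reading, because each run re-synthesises and re-places the design for its own clock target.

```python
def period_ns(f_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / f_mhz

def implied_path_ns(period: float, slack: float) -> float:
    """Worst setup path delay implied by a period and its reported slack."""
    return period - slack

# (label, period in ns, setup slack in ns), as quoted in the text.
runs = [
    ("100 MHz", period_ns(100.0), -0.89),
    ("83 MHz",  period_ns(83.0),  -0.22),
    ("71 MHz",  14.0,             +0.63),  # the text rounds this period to 14 ns
]

for label, period, slack in runs:
    status = "meets" if slack >= 0 else "misses"
    print(f"{label:>8}: period {period:6.2f} ns, slack {slack:+.2f} ns, "
          f"implied path {implied_path_ns(period, slack):6.2f} ns ({status} setup)")
```

Read this way, the worst path sits somewhere around 11 to 13 ns in every run, which is why nothing faster than a 14 ns period closes without pipelining.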

What’s new vs. P06

  • Bus. One master (the CPU), three slaves (RAM / UART / GPIO), combinational address decoder, mux on the read-data return path.
  • Memory-mapped I/O. Peripherals live at fixed addresses; the same LD / ST instructions reach all of them.
  • UART RX. First time we sample a serial line into the chip instead of just emitting from it. Two-flop sync on the rx pin (P04’s lesson reused), then an FSM that detects the start bit’s falling edge, samples each data bit at the middle of its bit-time, latches the assembled byte, and sets an rx_valid flag the program can poll.
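The mid-bit sampling idea is easier to see in a small clock-by-clock model. What follows is a Python sketch, not the RTL: the function names are hypothetical, the two-flop synchroniser is left out, and the timer behaviour (half a bit-time after the start bit is seen, then one full bit-time per state, with baud_div meaning clocks per bit minus 1) is taken from the uart_rx module in the listing further down.

```python
def uart_frame(byte: int, baud_div: int) -> list:
    """8N1 frame as per-clock line levels: start (0), 8 data bits LSB-first, stop (1)."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    return [b for b in bits for _ in range(baud_div + 1)]

def rx_decode(line, baud_div: int) -> list:
    """Step a model of the RX FSM over per-clock samples; return the bytes it latched."""
    IDLE, START, D0, D7, STOP = 0, 1, 2, 9, 10
    half, full = baud_div >> 1, baud_div
    state, tmr, sr, received = IDLE, 0, 0, []
    for rx in line:                       # one iteration per clock edge
        tick = (tmr == 0)
        if state == IDLE:
            if rx == 0:                   # line low while idle: start bit
                state, tmr = START, half  # aim for the middle of the start bit
            else:
                tmr = 0
        elif tick:
            if D0 <= state <= D7:         # mid-bit sample of data bit k
                k = state - D0
                sr = (sr & ~(1 << k)) | (rx << k)
            elif state == STOP:           # mid-stop-bit: latch the byte
                received.append(sr)
            state = IDLE if state == STOP else state + 1
            tmr = full                    # next sample is one bit-time later
        else:
            tmr -= 1
    return received

line = [1] * 4 + uart_frame(ord("h"), 7) + uart_frame(0x0A, 7) + [1] * 4
print([hex(b) for b in rx_decode(line, 7)])
```

With 8 clocks per bit (baud_div = 7) the model samples data bit 0 twelve clocks after it first sees the line low, which is the middle of that bit.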

Architecture

[Block diagram: CPU → address decoder → RAM (0x00..0x0F) / UART (0x40..0x42) / GPIO (0x80..0x81) → bus_rdata mux → CPU. Pins: uart_tx, uart_rx, gpio_out, gpio_in.]
One bus master (the CPU) talks to three slaves through a combinational address decoder. The decoder picks which slave's read-data the CPU sees and which slave's write-enable goes high.

Address map

addr         name          access  notes
0x00..0x0F   RAM           R/W     16 bytes, sync-write/async-read
0x40         UART_TX       W       writes byte and pulses TX start
0x41         UART_STATUS   R       bit 0 = tx_busy, bit 1 = rx_valid
0x42         UART_RX       R       reads byte, clears rx_valid
0x80         GPIO_OUT      R/W     8-bit output latch
0x81         GPIO_IN       R       synchronized 8-bit input

Reads to unmapped addresses return 0x00.
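A behavioural sketch of the address map in Python may make the decode rules concrete. `TinyBus` and its attribute names are invented for illustration; only the addresses, the status bits, the read-clears-rx_valid side effect and the 0x00 default come from the table above. It models no timing at all.

```python
class TinyBus:
    """Behavioural model of the P07 address map (illustrative, not the RTL)."""

    def __init__(self):
        self.ram = [0] * 16
        self.tx_busy = False
        self.rx_valid = False
        self.rx_byte = 0
        self.gpio_out = 0
        self.gpio_in = 0
        self.tx_log = []               # bytes written to UART_TX

    def read(self, addr: int) -> int:
        addr &= 0xFF
        if addr <= 0x0F:
            return self.ram[addr]
        if addr == 0x41:               # UART_STATUS
            return (int(self.rx_valid) << 1) | int(self.tx_busy)
        if addr == 0x42:               # UART_RX: reading clears rx_valid
            self.rx_valid = False
            return self.rx_byte
        if addr == 0x80:
            return self.gpio_out
        if addr == 0x81:
            return self.gpio_in
        return 0x00                    # unmapped, or the write-only UART_TX

    def write(self, addr: int, data: int) -> None:
        addr &= 0xFF
        data &= 0xFF
        if addr <= 0x0F:
            self.ram[addr] = data
        elif addr == 0x40:             # UART_TX
            self.tx_log.append(data)
        elif addr == 0x80:             # GPIO_OUT
            self.gpio_out = data
        # writes to any other address are ignored

bus = TinyBus()
bus.write(0x05, 0xAB)
bus.rx_byte, bus.rx_valid = ord("y"), True
print(hex(bus.read(0x05)), bin(bus.read(0x41)), hex(bus.read(0x42)),
      bin(bus.read(0x41)), hex(bus.read(0x33)))
```

The point of the model is that firmware never needs to know which slave it is talking to: every access is a plain read or write of an 8-bit address.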

Instruction set (changes from P06)

op    mnemonic  semantics
0x7   ST        mem[regs[ra]] = regs[rb]
0x9   LD        regs[rd] = mem[regs[ra]]

NOT (P06’s 0x9) is gone — reproducible as XOR rd, ra, R7 after loading 0xFF into R7. OUT (P06’s 0x7) is replaced by writing to address 0x40 via ST. Same chip, more general I/O.
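Both replacements can be checked with a short Python sketch. The encoder helpers are hypothetical names; the field layout they use ({opcode, rd, ra, rb, 3'b000}, with LDI carrying its immediate in bits 8:1, and XOR as opcode 0x4) is read off the RTL listing on this page.

```python
MASK8 = 0xFF

def not8(a: int) -> int:
    return ~a & MASK8

def xor8(a: int, b: int) -> int:
    return (a ^ b) & MASK8

# NOT rd, ra  ==  LDI R7, 0xFF ; XOR rd, ra, R7   for every 8-bit value.
assert all(xor8(a, 0xFF) == not8(a) for a in range(256))

def enc_rrr(op: int, rd: int, ra: int, rb: int) -> int:
    """16-bit word: op[15:12] rd[11:9] ra[8:6] rb[5:3] 000."""
    return (op << 12) | (rd << 9) | (ra << 6) | (rb << 3)

def enc_ldi(rd: int, imm: int) -> int:
    """LDI rd, imm: op 0xA, rd[11:9], imm[8:1], bit 0 unused."""
    return (0xA << 12) | (rd << 9) | ((imm & MASK8) << 1)

def enc_st(ra: int, rb: int) -> int:
    return enc_rrr(0x7, 0, ra, rb)     # rd field unused

def enc_ld(rd: int, ra: int) -> int:
    return enc_rrr(0x9, rd, ra, 0)     # rb field unused

program = [
    enc_ldi(7, 0xFF),                  # LDI R7, 0xFF
    enc_rrr(0x4, 1, 2, 7),             # XOR R1, R2, R7   (R1 = NOT R2)
    enc_ldi(7, 0x40),                  # LDI R7, 0x40     (UART_TX address)
    enc_st(7, 1),                      # ST  [R7], R1     (what OUT used to do)
]
print([f"0x{w:04X}" for w in program])
```

The cost of the swap shows up in the register pressure: R7 has to hold 0xFF for the XOR and then be reloaded with the UART address for the store.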

Reading the layout

The 220 × 220 µm die has enough room that the placer cleanly separates the CPU on the left from the peripherals on the right:

  • R outlines the 56-flop register file along the left edge — the placer keeps it close to the ALU comb logic that reads from and writes to it.
  • S outlines the 4-flop FSM state register tucked above the regfile. Same brain as P06; the encoding picked up one more bit because LD/ST are new opcodes.
  • M outlines the 128-flop RAM sprawled along the bottom-center. 16 bytes × 8 bits — by far the biggest single cluster of flops on the chip. P09’s RV32I core will replace this with an SRAM macro (project 08’s whole point).
  • T outlines the UART transmitter in a column on the right — same module from P03, wrapped here as u_tx inside u_uart.
  • X outlines the UART receiver — new for P07. 37 flops including a two-flop sync on the rx pin, an 11-state FSM, an 8-bit shift register, and a 16-bit baud counter.
  • G outlines the GPIO peripheral in the bottom-right.

RTL

projects/07_tiny_soc/src/top.sv system-verilog
// Project 07: tiny memory-mapped SoC.
//
// Project 06 was a CPU with a hard-wired UART hanging off the side; one
// instruction (OUT) drove it. P07 is the same CPU with that special-case
// instruction removed, replaced by generic LD / ST to a real memory
// bus, and several peripherals dangling off that bus at fixed addresses.
//
// Architecture:
//
//                       ┌──────── bus_addr / wdata / we / re ────────┐
//                       ▼                                             │
//   ┌─────┐   ┌─────────────────┐                                     │
//   │ CPU │──▶│ address decoder │──▶ slave selects                    │
//   └─────┘   └─────────────────┘                                     │
//      ▲              │                                               │
//      │              ▼                                               │
//      │      ┌──────────────────────────────────────────────┐        │
//      │      │  RAM    UART_TX/RX/STATUS    GPIO_OUT/IN     │ ◀──────┘
//      │      └──────────────────────────────────────────────┘
//      │              │
//      └──────────────┘ bus_rdata (mux on selected slave)
//
// Address map (8-bit address space):
//
//   0x00 .. 0x0F   16-byte RAM            (R/W)
//   0x40           UART TX data           (W: writes byte + pulses start)
//   0x41           UART status            (R: bit0=tx_busy, bit1=rx_valid)
//   0x42           UART RX data           (R: read byte, clears rx_valid)
//   0x80           GPIO output register   (R/W)
//   0x81           GPIO input snapshot    (R)
//
// Reads to unmapped addresses return 0x00.
//
// Instruction set is P06's, with two ops swapped:
//   - 0x7  OUT   →  ST   [ra], rb     (mem[ra] = rb)
//   - 0x9  NOT   →  LD   rd, [ra]     (rd = mem[ra])
//
// NOT was the least-used ALU op in P06's programs anyway. It's
// reproducible as `XOR rd, ra, R7` after `LDI R7, 0xFF`.
//
// What this project teaches that P06 didn't:
//   - A real **bus** with one master (the CPU) and several slaves.
//   - **Memory-mapped I/O**: peripherals look like memory addresses;
//     the same LD/ST instructions reach them all.
//   - **Address decoding** as a combinational function of the address.
//   - **UART RX**: first time we sample a serial line into the chip
//     instead of just emitting from it. Sets up the interactive
//     `screen`-against-the-simulator demo described in /stack.

`default_nettype none

// Top-level parameter PROG is the 64×16-bit packed boot program. The
// testbench overrides it per-DUT; production hardens use the cpu
// module's DEFAULT_PROG. We forward it through here so users only ever
// see one parameter name.
module top #(
  parameter logic [64*16-1:0] PROG = {
    {44{16'h0000}},                                // 63..20: zero-fill (matches cpu.DEFAULT_PROG)
    {4'hF, 12'h000},                               // 19: HLT
    {4'hE, 6'b000000, 6'd16},                      // 18: BNZ wait3
    {4'h2, 3'd0, 3'd3, 3'd5, 3'b000},              // 17: AND R0,R3,R5
    {4'h9, 3'd3, 3'd6, 3'b000, 3'b000},            // 16: LD R3,[R6]
    {4'h7, 3'b000, 3'd7, 3'd4, 3'b000},            // 15: ST [R7],R4
    {4'hA, 3'd4, 8'h0A, 1'b0},                     // 14: LDI R4,'\n'
    {4'hE, 6'b000000, 6'd11},                      // 13: BNZ wait2
    {4'h2, 3'd0, 3'd3, 3'd5, 3'b000},              // 12
    {4'h9, 3'd3, 3'd6, 3'b000, 3'b000},            // 11
    {4'h7, 3'b000, 3'd7, 3'd4, 3'b000},            // 10
    {4'hA, 3'd4, 8'h69, 1'b0},                     // 09: LDI R4,'i'
    {4'hE, 6'b000000, 6'd6},                       // 08
    {4'h2, 3'd0, 3'd3, 3'd5, 3'b000},              // 07
    {4'h9, 3'd3, 3'd6, 3'b000, 3'b000},            // 06
    {4'h7, 3'b000, 3'd7, 3'd4, 3'b000},            // 05
    {4'hA, 3'd4, 8'h68, 1'b0},                     // 04: LDI R4,'h'
    {4'hA, 3'd5, 8'h01, 1'b0},                     // 03
    {4'hA, 3'd6, 8'h41, 1'b0},                     // 02
    {4'hA, 3'd7, 8'h40, 1'b0},                     // 01
    16'h0000                                       // 00: NOP (ADD R0,R0,R0); pads the list to 64 words so each label equals its PC
  }
) (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        start,           // currently unused (1 = run)

    // UART pins
    input  logic [15:0] baud_div,        // clocks per bit, minus 1
    output logic        uart_tx,         // serial out (idle-high)
    input  logic        uart_rx,         // serial in  (idle-high)

    // GPIO
    input  logic  [7:0] gpio_in,
    output logic  [7:0] gpio_out,

    // Debug
    output logic  [5:0] pc_out,          // 6-bit PC → 64-entry ROM
    output logic        halted
);

  // ====================================================================
  // Bus
  // ====================================================================
  logic [7:0] bus_addr;
  logic [7:0] bus_wdata;
  logic [7:0] bus_rdata;
  logic       bus_we;
  logic       bus_re;

  // Slave selects — one-hot from the address.
  wire ram_sel    = (bus_addr <= 8'h0F);
  wire uart_sel   = (bus_addr >= 8'h40) && (bus_addr <= 8'h42);
  wire gpio_sel   = (bus_addr >= 8'h80) && (bus_addr <= 8'h81);

  // Read-data mux — combinational, picks the active slave's rdata.
  logic [7:0] ram_rdata;
  logic [7:0] uart_rdata;
  logic [7:0] gpio_rdata;
  always_comb begin
    if      (ram_sel)  bus_rdata = ram_rdata;
    else if (uart_sel) bus_rdata = uart_rdata;
    else if (gpio_sel) bus_rdata = gpio_rdata;
    else               bus_rdata = 8'h00;
  end

  // ====================================================================
  // CPU — same shape as P06 but with LD/ST replacing NOT/OUT.
  // ====================================================================
  cpu #(.PROG(PROG)) u_cpu (
    .clk        (clk),
    .rst_n      (rst_n),
    .start      (start),
    .bus_addr   (bus_addr),
    .bus_wdata  (bus_wdata),
    .bus_we     (bus_we),
    .bus_re     (bus_re),
    .bus_rdata  (bus_rdata),
    .pc_out     (pc_out),
    .halted     (halted)
  );

  // ====================================================================
  // RAM — 16 bytes of synchronous-write, asynchronous-read storage.
  // ====================================================================
  ram u_ram (
    .clk    (clk),
    .rst_n  (rst_n),
    .addr   (bus_addr[3:0]),
    .wdata  (bus_wdata),
    .we     (bus_we & ram_sel),
    .rdata  (ram_rdata)
  );

  // ====================================================================
  // UART — TX + RX + status register, all on three bus addresses.
  // ====================================================================
  uart u_uart (
    .clk        (clk),
    .rst_n      (rst_n),
    .baud_div   (baud_div),
    .reg_addr   (bus_addr[1:0]),
    .reg_wdata  (bus_wdata),
    .reg_we     (bus_we & uart_sel),
    .reg_re     (bus_re & uart_sel),
    .reg_rdata  (uart_rdata),
    .tx         (uart_tx),
    .rx         (uart_rx)
  );

  // ====================================================================
  // GPIO — 8-bit output latch + 8-bit input snapshot.
  // ====================================================================
  gpio u_gpio (
    .clk        (clk),
    .rst_n      (rst_n),
    .reg_addr   (bus_addr[0]),
    .reg_wdata  (bus_wdata),
    .reg_we     (bus_we & gpio_sel),
    .reg_rdata  (gpio_rdata),
    .gpio_in    (gpio_in),
    .gpio_out   (gpio_out)
  );

endmodule

// =====================================================================
// CPU — multi-cycle FSM CPU with LD/ST bus interface.
// Same shape as project 06, with two opcodes swapped:
//   0x7 OUT (P06) → 0x7 ST  [ra], rb
//   0x9 NOT (P06) → 0x9 LD  rd, [ra]
// LD reads from the bus during EXECUTE and writes regs[rd] in WB.
// ST writes the bus during EXECUTE; WB just advances.
// =====================================================================
module cpu (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        start,

    // Bus master interface
    output logic [7:0]  bus_addr,
    output logic [7:0]  bus_wdata,
    output logic        bus_we,
    output logic        bus_re,
    input  logic [7:0]  bus_rdata,

    // Debug
    output logic [5:0]  pc_out,
    output logic        halted
);

  // ----- ROM (64 entries × 16 bits, packed bit-vector parameter) -----
  localparam int ROM_DEPTH = 64;
  localparam int PROG_BITS = ROM_DEPTH * 16;

  // Encoder helpers — Verilog-2001 style (no `return`, no `automatic`)
  // so Yosys's read_verilog frontend accepts them.
  function [15:0] op_alu(input [3:0] opcode,
                          input [2:0] rd,
                          input [2:0] ra,
                          input [2:0] rb);
    op_alu = {opcode, rd, ra, rb, 3'b000};
  endfunction
  function [15:0] op_unary(input [3:0] opcode,
                            input [2:0] rd,
                            input [2:0] ra);
    op_unary = {opcode, rd, ra, 3'b000, 3'b000};
  endfunction
  function [15:0] op_ldi(input [2:0] rd, input [7:0] imm);
    op_ldi = {4'hA, rd, imm, 1'b0};
  endfunction
  function [15:0] op_st(input [2:0] ra, input [2:0] rb);
    // ST [ra], rb — rd field unused (encoded as 0)
    op_st = {4'h7, 3'b000, ra, rb, 3'b000};
  endfunction
  function [15:0] op_ld(input [2:0] rd, input [2:0] ra);
    // LD rd, [ra] — rb field unused
    op_ld = {4'h9, rd, ra, 3'b000, 3'b000};
  endfunction
  function [15:0] op_jmp(input [3:0] opcode, input [5:0] addr);
    // 4 + 6 + 6 = 16 (using 6-bit addr now that ROM = 64)
    op_jmp = {opcode, 6'b000000, addr};
  endfunction

  // Default boot program: print "hi\n" out the UART using LD/ST. The
  // outer testbench can override PROG with its own program.
  //
  // Memory map mirror (constants used by the program):
  //   R7 = 0x40  UART TX data
  //   R6 = 0x41  UART status
  //   R5 = 0x01  TX_BUSY mask
  //
  // Pseudo-asm:
  //   LDI R7, 0x40
  //   LDI R6, 0x41
  //   LDI R5, 0x01
  //   LDI R4, 'h'
  //   ST  [R7], R4
  // wait1:
  //   LD  R3, [R6]
  //   AND R0, R3, R5    ; sets Z flag based on busy bit
  //   BNZ wait1
  //   LDI R4, 'i'
  //   ST  [R7], R4
  // wait2:
  //   LD  R3, [R6]
  //   AND R0, R3, R5
  //   BNZ wait2
  //   LDI R4, 0x0A      ; '\n'
  //   ST  [R7], R4
  // wait3:
  //   LD  R3, [R6]
  //   AND R0, R3, R5
  //   BNZ wait3
  //   HLT
  localparam logic [PROG_BITS-1:0] DEFAULT_PROG = {
    {44{16'h0000}},                                // 63..20: zero-fill
    {4'hF, 12'h000},                               // 19: HLT
    op_jmp (4'hE, 6'd16),                          // 18: BNZ wait3 (PC=16)
    op_alu (4'h2, 3'd0, 3'd3, 3'd5),               // 17: AND R0,R3,R5
    op_ld  (3'd3, 3'd6),                           // 16: LD R3,[R6]   (wait3)
    op_st  (3'd7, 3'd4),                           // 15: ST [R7],R4
    op_ldi (3'd4, 8'h0A),                          // 14: LDI R4,'\n'
    op_jmp (4'hE, 6'd11),                          // 13: BNZ wait2 (PC=11)
    op_alu (4'h2, 3'd0, 3'd3, 3'd5),               // 12: AND
    op_ld  (3'd3, 3'd6),                           // 11: LD R3,[R6]   (wait2)
    op_st  (3'd7, 3'd4),                           // 10: ST [R7],R4
    op_ldi (3'd4, 8'h69),                          // 09: LDI R4,'i'
    op_jmp (4'hE, 6'd6),                           // 08: BNZ wait1 (PC=6)
    op_alu (4'h2, 3'd0, 3'd3, 3'd5),               // 07: AND
    op_ld  (3'd3, 3'd6),                           // 06: LD R3,[R6]   (wait1)
    op_st  (3'd7, 3'd4),                           // 05: ST [R7],R4
    op_ldi (3'd4, 8'h68),                          // 04: LDI R4,'h'
    op_ldi (3'd5, 8'h01),                          // 03: LDI R5,0x01 (busy mask)
    op_ldi (3'd6, 8'h41),                          // 02: LDI R6,0x41 (status)
    op_ldi (3'd7, 8'h40),                          // 01: LDI R7,0x40 (tx data)
    16'h0000                                       // 00: NOP (ADD R0,R0,R0)
  };
  // PC=0 reads the bottom-most entry of the concatenation and each
  // subsequent PC reads the next line up, so the comment numbering
  // (00, 01, ...) maps directly to PC. The NOP at PC=0 pads the list
  // to 64 words, which means the pseudo-asm above starts at PC=1 and
  // the BNZ targets (6, 11, 16) land on the LD that opens each wait
  // loop.

  parameter logic [PROG_BITS-1:0] PROG = DEFAULT_PROG;

  // ----- PC, IR, instruction decode -----
  logic [5:0]  pc;
  logic [15:0] ir;

  wire [3:0] dec_op  = ir[15:12];
  wire [2:0] dec_rd  = ir[11: 9];
  wire [2:0] dec_ra  = ir[ 8: 6];
  wire [2:0] dec_rb  = ir[ 5: 3];
  wire [7:0] dec_imm = ir[ 8: 1];      // LDI imm payload
  wire [5:0] dec_addr= ir[ 5: 0];      // 6-bit branch/jump target

  // Opcode → ALU op routing. ST and LD don't go through the ALU
  // arithmetic itself; LD takes its result from the bus, ST doesn't
  // produce a result.
  logic [3:0] alu_op;
  always_comb begin
    unique case (dec_op)
      4'h0: alu_op = 4'b0000;            // ADD
      4'h1: alu_op = 4'b0001;            // SUB
      4'h2: alu_op = 4'b0010;            // AND
      4'h3: alu_op = 4'b0011;            // OR
      4'h4: alu_op = 4'b0100;            // XOR
      4'h5: alu_op = 4'b0101;            // SHL
      4'h6: alu_op = 4'b0110;            // SHR
      4'h7: alu_op = 4'b1000;            // ST   — uses MOV/passthrough; bus side-effect handled below
      4'h8: alu_op = 4'b1000;            // MOV
      4'h9: alu_op = 4'b1000;            // LD   — passthrough; result_q comes from bus_rdata in EXECUTE
      4'hA: alu_op = 4'b1000;            // LDI
      4'hB: alu_op = 4'b0001;            // CMP routes through SUB
      default: alu_op = 4'b1000;
    endcase
  end

  // Per-instruction control signals
  wire is_st      = (dec_op == 4'h7);
  wire is_ld      = (dec_op == 4'h9);
  wire is_alu_rr  = (dec_op <= 4'h6) || (dec_op == 4'h8);  // pure ALU
  wire is_ldi     = (dec_op == 4'hA);
  wire is_cmp     = (dec_op == 4'hB);
  wire is_jmp     = (dec_op == 4'hC);
  wire is_bz      = (dec_op == 4'hD);
  wire is_bnz     = (dec_op == 4'hE);
  wire is_hlt     = (dec_op == 4'hF);
  wire is_branch  = is_jmp | is_bz | is_bnz;

  // ALU ops + AND/OR/XOR update flags (without writing rd if rd == R0).
  // LDI, MOV (0x8), branches, ST, HLT, LD don't update flags.
  // Note: AND with rd=R0 is the standard "set flags only" idiom — its
  // flag update still happens because we gate flag_update on opcode,
  // not on rd.
  wire flag_update = ~(dec_op == 4'h8 || is_ldi || is_branch || is_hlt
                        || is_st || is_ld);
  // Register-write enable: ALU rr ops, LDI, and LD all write rd.
  // ST and CMP do not write. R0 writes are silently dropped.
  wire reg_write   = (is_alu_rr || is_ldi || is_ld) && (dec_rd != 3'd0);

  // ----- Datapath: regfile + ALU + flag register -----
  logic [7:0] regs [0:7];
  logic [3:0] flags_q;

  logic [7:0] op_a;
  logic [7:0] op_b;
  logic [7:0] result_q;
  logic [3:0] flags_d;

  function [7:0] reg_read(input [2:0] sel);
    if (sel == 3'd0) reg_read = 8'h00;
    else             reg_read = regs[sel];
  endfunction

  // ALU
  wire [7:0] a_data = op_a;
  wire [7:0] b_data = op_b;
  wire [8:0] add_w  = {1'b0, a_data} + {1'b0, b_data};
  wire [8:0] sub_w  = {1'b0, a_data} - {1'b0, b_data};

  logic [7:0] alu_y;
  logic       c_out, v_out;
  always_comb begin
    alu_y = 8'h00;
    c_out = 1'b0;
    v_out = 1'b0;
    unique case (alu_op)
      4'b0000: begin alu_y = add_w[7:0]; c_out = add_w[8];
                     v_out = (a_data[7] == b_data[7]) && (alu_y[7] != a_data[7]); end
      4'b0001: begin alu_y = sub_w[7:0]; c_out = sub_w[8];
                     v_out = (a_data[7] != b_data[7]) && (alu_y[7] != a_data[7]); end
      4'b0010: alu_y = a_data & b_data;
      4'b0011: alu_y = a_data | b_data;
      4'b0100: alu_y = a_data ^ b_data;
      4'b0101: begin alu_y = {a_data[6:0], 1'b0}; c_out = a_data[7]; end
      4'b0110: begin alu_y = {1'b0, a_data[7:1]}; c_out = a_data[0]; end
      4'b1000: alu_y = a_data;
      default: alu_y = a_data;
    endcase
  end
  wire z_out = (alu_y == 8'h00);
  wire n_out =  alu_y[7];

  // ----- FSM -----
  typedef enum logic [2:0] {
    S_FETCH   = 3'd0,
    S_DECODE  = 3'd1,
    S_EXECUTE = 3'd2,
    S_WB      = 3'd3,
    S_HALT    = 3'd4
  } state_t;

  state_t state, next_state;

  always_comb begin
    next_state = state;
    unique case (state)
      S_FETCH:   next_state = S_DECODE;
      S_DECODE:  next_state = S_EXECUTE;
      S_EXECUTE: next_state = S_WB;
      S_WB:      if (is_hlt) next_state = S_HALT; else next_state = S_FETCH;
      S_HALT:    next_state = S_HALT;
      default:   next_state = S_FETCH;
    endcase
  end

  // Branch resolution (uses the most recently captured flag register).
  wire take_branch = is_jmp
                   || (is_bz  &&  flags_q[3])
                   || (is_bnz && ~flags_q[3]);

  // ----- Bus master signals -----
  // ST drives the bus during EXECUTE (1-cycle write).
  // LD drives bus_re during EXECUTE; the slave's rdata is captured on
  // the EXECUTE→WB clock edge into result_q.
  always_comb begin
    bus_addr  = 8'h00;
    bus_wdata = 8'h00;
    bus_we    = 1'b0;
    bus_re    = 1'b0;
    if (state == S_EXECUTE) begin
      if (is_st) begin
        bus_addr  = op_a;       // address in regs[ra]
        bus_wdata = op_b;       // data    in regs[rb]
        bus_we    = 1'b1;
      end else if (is_ld) begin
        bus_addr  = op_a;       // address in regs[ra]
        bus_re    = 1'b1;
      end
    end
  end

  // ----- Sequential state -----
  integer i;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      state    <= S_FETCH;
      pc       <= 6'd0;
      ir       <= 16'h0000;
      op_a     <= 8'h00;
      op_b     <= 8'h00;
      result_q <= 8'h00;
      flags_d  <= 4'h0;
      flags_q  <= 4'h0;
      for (i = 0; i < 8; i = i + 1) regs[i] <= 8'h00;
    end else begin
      state <= next_state;
      unique case (state)
        S_FETCH: begin
          ir <= PROG[16*pc +: 16];
          pc <= pc + 6'd1;
        end
        S_DECODE: begin
          op_a <= is_ldi ? dec_imm : reg_read(dec_ra);
          op_b <= reg_read(dec_rb);
        end
        S_EXECUTE: begin
          // For LD, capture the bus read into result_q on the next edge.
          // Otherwise capture the ALU output.
          result_q <= is_ld ? bus_rdata : alu_y;
          flags_d  <= {z_out, n_out, c_out, v_out};
        end
        S_WB: begin
          if (reg_write) regs[dec_rd] <= result_q;
          if (flag_update || is_cmp) flags_q <= flags_d;
          if (is_branch && take_branch) pc <= dec_addr;
        end
        S_HALT: ;
        default: ;
      endcase
    end
  end

  assign pc_out = pc;
  assign halted = (state == S_HALT);

  // start input reserved for future use.
  wire _unused = &{1'b0, start};

endmodule

// =====================================================================
// 16-byte RAM. Synchronous write, asynchronous read.
// =====================================================================
module ram (
    input  logic        clk,
    input  logic        rst_n,
    input  logic [3:0]  addr,
    input  logic [7:0]  wdata,
    input  logic        we,
    output logic [7:0]  rdata
);
  logic [7:0] mem [0:15];
  integer i;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) for (i = 0; i < 16; i = i + 1) mem[i] <= 8'h00;
    else if (we) mem[addr] <= wdata;
  end
  assign rdata = mem[addr];
endmodule

// =====================================================================
// UART — TX + RX + status, exposed as 3 bus-mapped registers.
//
//   reg_addr 2'b00  (0x40)  W: TX data; latches and pulses tx_start
//   reg_addr 2'b01  (0x41)  R: status   {6'b0, rx_valid, tx_busy}
//   reg_addr 2'b10  (0x42)  R: RX data; clears rx_valid as a side effect
//
// TX is the same 8N1 transmitter from project 03/06. RX is new — it
// detects the falling edge of `rx` (start bit), waits 1.5 bit-times to
// land mid-bit-0, samples 8 bits at one bit-time each, captures the
// stop bit, then sets rx_valid + latches the byte.
// =====================================================================
module uart (
    input  logic         clk,
    input  logic         rst_n,
    input  logic [15:0]  baud_div,

    input  logic [1:0]   reg_addr,
    input  logic [7:0]   reg_wdata,
    input  logic         reg_we,
    input  logic         reg_re,
    output logic [7:0]   reg_rdata,

    output logic         tx,
    input  logic         rx
);
  // ---- TX ----
  logic       tx_start_pulse;
  logic [7:0] tx_data;
  logic       tx_busy;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      tx_start_pulse <= 1'b0;
      tx_data        <= 8'h00;
    end else begin
      tx_start_pulse <= (reg_we && reg_addr == 2'b00);
      if (reg_we && reg_addr == 2'b00) tx_data <= reg_wdata;
    end
  end

  uart_tx u_tx (
    .clk     (clk),
    .rst_n   (rst_n),
    .start   (tx_start_pulse),
    .data    (tx_data),
    .baud_div(baud_div),
    .tx      (tx),
    .busy    (tx_busy)
  );

  // ---- RX ----
  logic [7:0] rx_byte;
  logic       rx_valid;
  logic       rx_clear;

  // Pulse rx_clear on a read of the RX data register (0x42).
  assign rx_clear = (reg_re && reg_addr == 2'b10);

  uart_rx u_rx (
    .clk      (clk),
    .rst_n    (rst_n),
    .baud_div (baud_div),
    .rx       (rx),
    .byte_o   (rx_byte),
    .valid    (rx_valid),
    .clear    (rx_clear)
  );

  // ---- register read mux ----
  always_comb begin
    reg_rdata = 8'h00;
    unique case (reg_addr)
      2'b00:   reg_rdata = 8'h00;       // TX data is write-only
      2'b01:   reg_rdata = {6'b0, rx_valid, tx_busy};
      2'b10:   reg_rdata = rx_byte;
      default: reg_rdata = 8'h00;
    endcase
  end
endmodule

// =====================================================================
// uart_tx — 8N1 transmitter. Same module as project 03; copied here
// so each project stays self-contained.
// =====================================================================
module uart_tx (
    input  logic         clk,
    input  logic         rst_n,
    input  logic         start,
    input  logic [7:0]   data,
    input  logic [15:0]  baud_div,
    output logic         tx,
    output logic         busy
);
  typedef enum logic [3:0] {
    U_IDLE = 4'd0, U_START = 4'd1,
    U_D0   = 4'd2, U_D1 = 4'd3,  U_D2 = 4'd4, U_D3 = 4'd5,
    U_D4   = 4'd6, U_D5 = 4'd7,  U_D6 = 4'd8, U_D7 = 4'd9,
    U_STOP = 4'd10
  } ustate_t;

  ustate_t ustate, ustate_next;

  logic [15:0] baud_cnt;
  logic        bit_tick;
  assign bit_tick = (baud_cnt == 16'd0);

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                     baud_cnt <= 16'd0;
    else if (ustate == U_IDLE)      baud_cnt <= baud_div;
    else if (bit_tick)              baud_cnt <= baud_div;
    else                            baud_cnt <= baud_cnt - 16'd1;
  end

  logic [7:0] data_q;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                              data_q <= 8'h00;
    else if (ustate == U_IDLE && start)      data_q <= data;
  end

  always_comb begin
    ustate_next = ustate;
    unique case (ustate)
      U_IDLE:  if (start)    ustate_next = U_START;
      U_START: if (bit_tick) ustate_next = U_D0;
      U_D0:    if (bit_tick) ustate_next = U_D1;
      U_D1:    if (bit_tick) ustate_next = U_D2;
      U_D2:    if (bit_tick) ustate_next = U_D3;
      U_D3:    if (bit_tick) ustate_next = U_D4;
      U_D4:    if (bit_tick) ustate_next = U_D5;
      U_D5:    if (bit_tick) ustate_next = U_D6;
      U_D6:    if (bit_tick) ustate_next = U_D7;
      U_D7:    if (bit_tick) ustate_next = U_STOP;
      U_STOP:  if (bit_tick) ustate_next = U_IDLE;
      default:               ustate_next = U_IDLE;
    endcase
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) ustate <= U_IDLE;
    else        ustate <= ustate_next;
  end

  always_comb begin
    unique case (ustate)
      U_IDLE:  tx = 1'b1;
      U_START: tx = 1'b0;
      U_D0:    tx = data_q[0];
      U_D1:    tx = data_q[1];
      U_D2:    tx = data_q[2];
      U_D3:    tx = data_q[3];
      U_D4:    tx = data_q[4];
      U_D5:    tx = data_q[5];
      U_D6:    tx = data_q[6];
      U_D7:    tx = data_q[7];
      U_STOP:  tx = 1'b1;
      default: tx = 1'b1;
    endcase
  end

  assign busy = (ustate != U_IDLE);
endmodule

// =====================================================================
// uart_rx — 8N1 receiver.
//
// Two-flop synchronizer on the rx pin (P04's lesson reused), then an
// FSM that detects a falling edge for the start bit, waits half a
// bit-time to land in the middle of the start bit, then samples one
// bit per baud_div+1 clock cycles. After capturing the stop bit, it
// latches the assembled byte into byte_o and asserts valid. The host
// reads the byte via the bus (which pulses `clear`) to acknowledge.
//
// There's no FIFO: a byte that completes while valid is still high
// overwrites byte_o, so the unread byte is lost. For the demo this is
// fine; the program polls the status register and reads RX promptly.
// =====================================================================
module uart_rx (
    input  logic         clk,
    input  logic         rst_n,
    input  logic [15:0]  baud_div,
    input  logic         rx,           // serial in
    output logic [7:0]   byte_o,
    output logic         valid,
    input  logic         clear         // pulse high to clear `valid`
);
  // ---- two-flop synchronizer on the async rx pin ----
  logic rx_s1, rx_s2;
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      rx_s1 <= 1'b1;
      rx_s2 <= 1'b1;
    end else begin
      rx_s1 <= rx;
      rx_s2 <= rx_s1;
    end
  end
  wire rx_sync = rx_s2;

  // ---- FSM ----
  typedef enum logic [3:0] {
    R_IDLE  = 4'd0,
    R_START = 4'd1,        // half-bit wait into mid of start bit
    R_D0    = 4'd2, R_D1 = 4'd3, R_D2 = 4'd4, R_D3 = 4'd5,
    R_D4    = 4'd6, R_D5 = 4'd7, R_D6 = 4'd8, R_D7 = 4'd9,
    R_STOP  = 4'd10
  } rstate_t;

  rstate_t rstate, rstate_next;

  // Bit-timer. In R_IDLE we don't count. On entering R_START we load
  // a half bit-time; on every other state transition we load a full
  // bit-time so we sample at the middle of each subsequent bit.
  logic [15:0] tmr;
  logic        tick;
  assign tick = (tmr == 16'd0);

  // Half / full bit-time loads (baud_div is clocks-per-bit minus 1, so
  // the half-tick reload is baud_div >> 1).
  wire [15:0] full_period = baud_div;
  wire [15:0] half_period = {1'b0, baud_div[15:1]};

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)                                          tmr <= 16'd0;
    // On the IDLE → START transition (rx falls), load half a bit-time
    // so the next tick lands in the middle of the start bit.
    else if (rstate == R_IDLE && rx_sync == 1'b0)        tmr <= half_period;
    else if (rstate == R_IDLE)                           tmr <= 16'd0;
    else if (tick)                                       tmr <= full_period;
    else                                                 tmr <= tmr - 16'd1;
  end

  // Shift register that captures the byte LSB-first.
  logic [7:0] sr;

  always_comb begin
    rstate_next = rstate;
    unique case (rstate)
      R_IDLE:  if (rx_sync == 1'b0) rstate_next = R_START; // rx low while idle = start bit (level-detected)
      R_START: if (tick)            rstate_next = R_D0;
      R_D0:    if (tick)            rstate_next = R_D1;
      R_D1:    if (tick)            rstate_next = R_D2;
      R_D2:    if (tick)            rstate_next = R_D3;
      R_D3:    if (tick)            rstate_next = R_D4;
      R_D4:    if (tick)            rstate_next = R_D5;
      R_D5:    if (tick)            rstate_next = R_D6;
      R_D6:    if (tick)            rstate_next = R_D7;
      R_D7:    if (tick)            rstate_next = R_STOP;
      R_STOP:  if (tick)            rstate_next = R_IDLE;
      default:                      rstate_next = R_IDLE;
    endcase
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      rstate <= R_IDLE;
      sr     <= 8'h00;
      byte_o <= 8'h00;
      valid  <= 1'b0;
    end else begin
      rstate <= rstate_next;

      // (Half-period reload on the IDLE→START transition is handled
      // by the timer always_ff above.)

      // Sample data bits at tick of R_D0..R_D7.
      if (rstate >= R_D0 && rstate <= R_D7 && tick) begin
        unique case (rstate)
          R_D0: sr[0] <= rx_sync;
          R_D1: sr[1] <= rx_sync;
          R_D2: sr[2] <= rx_sync;
          R_D3: sr[3] <= rx_sync;
          R_D4: sr[4] <= rx_sync;
          R_D5: sr[5] <= rx_sync;
          R_D6: sr[6] <= rx_sync;
          R_D7: sr[7] <= rx_sync;
        endcase
      end

      // Latch the full byte at the mid-stop-bit tick (the stop bit's level is not checked).
      if (rstate == R_STOP && tick) begin
        byte_o <= sr;
        valid  <= 1'b1;
      end

      // Bus read clears valid.
      if (clear) valid <= 1'b0;
    end
  end

endmodule

// =====================================================================
// GPIO — output register at reg_addr=0 (0x80), input snapshot at
// reg_addr=1 (0x81).
// =====================================================================
module gpio (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        reg_addr,
    input  logic [7:0]  reg_wdata,
    input  logic        reg_we,
    output logic [7:0]  reg_rdata,

    input  logic [7:0]  gpio_in,
    output logic [7:0]  gpio_out
);
  logic [7:0] out_q;
  // Synchronizer on gpio_in for safe read.
  logic [7:0] in_s1, in_s2;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      out_q <= 8'h00;
      in_s1 <= 8'h00;
      in_s2 <= 8'h00;
    end else begin
      if (reg_we && reg_addr == 1'b0) out_q <= reg_wdata;
      in_s1 <= gpio_in;
      in_s2 <= in_s1;
    end
  end

  assign gpio_out = out_q;
  always_comb begin
    if (reg_addr == 1'b0) reg_rdata = out_q;
    else                  reg_rdata = in_s2;
  end
endmodule

`default_nettype wire

Demo

The demo program is 13 instructions of firmware: a poll-and-echo loop that watches UART_STATUS, reads any received byte, and writes it back to UART_TX. The testbench drives h, e, y, \n into the chip’s uart_rx pin and decodes whatever comes back out uart_tx:

[host  tx 5040000] 0x68 'h'
[host  rx 6095000] 0x68 'h'
[host  tx 6440000] 0x65 'e'
[host  rx 7415000] 0x65 'e'
[host  tx 7840000] 0x79 'y'
[host  rx 8855000] 0x79 'y'
[host  tx 9240000] 0x0a '\n'
[host  rx 10295000] 0x0a '\n'
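The control flow behind that log can be sketched in Python. This is not the actual 13-instruction program, which is not listed here; `StubUart` and `echo` are hypothetical names, the stub never reports the transmitter as busy, and waiting on tx_busy before each write is an assumption about how the firmware behaves. Only the register addresses and status bits come from the address map.

```python
UART_TX, UART_STATUS, UART_RX = 0x40, 0x41, 0x42
TX_BUSY, RX_VALID = 0x01, 0x02

class StubUart:
    """Hypothetical stand-in for the bus: queued bytes appear on RX, TX is logged."""

    def __init__(self, incoming):
        self.incoming = list(incoming)
        self.sent = []

    def read(self, addr: int) -> int:
        if addr == UART_STATUS:
            return RX_VALID if self.incoming else 0   # TX is never busy in this stub
        if addr == UART_RX:
            return self.incoming.pop(0) if self.incoming else 0
        return 0

    def write(self, addr: int, data: int) -> None:
        if addr == UART_TX:
            self.sent.append(data & 0xFF)

def echo(bus, max_polls: int) -> None:
    """Poll UART_STATUS; when a byte is waiting, read it and write it back."""
    for _ in range(max_polls):
        if not (bus.read(UART_STATUS) & RX_VALID):
            continue                          # nothing received yet
        byte = bus.read(UART_RX)              # on the chip this read clears rx_valid
        while bus.read(UART_STATUS) & TX_BUSY:
            pass                              # wait for the transmitter to go idle
        bus.write(UART_TX, byte)

bus = StubUart(b"hey\n")
echo(bus, max_polls=16)
print(bytes(bus.sent))
```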

About 1 µs of round-trip per character at the demo’s sim baud rate (40 ns/bit): two 10-bit 8N1 frames at 400 ns each, plus the firmware’s polling latency. At a real 115200 baud the same loop runs in ~170 µs of chip time. This is the foundation for the planned Verilator + pty harness; once that is wired up, screen /dev/pts/N against a running sim talks to this exact echo program.
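Those figures can be reproduced with a little arithmetic, sketched here in Python. Two things are assumptions rather than facts from the page: the log timestamps are taken to be picoseconds (which is what makes the gaps come out near 1 µs), and the baud_div example uses the 71 MHz target clock. The function names are illustrative.

```python
BITS_PER_FRAME = 10          # 8N1: start bit + 8 data bits + stop bit

def round_trip_s(bit_time_s: float) -> float:
    """One frame in plus one frame out, ignoring firmware latency."""
    return 2 * BITS_PER_FRAME * bit_time_s

def baud_div_for(f_clk_hz: float, baud: float) -> int:
    """baud_div as the RTL defines it: clocks per bit, minus 1."""
    return round(f_clk_hz / baud) - 1

sim_rt = round_trip_s(40e-9)          # demo sim baud rate, 40 ns per bit
real_rt = round_trip_s(1 / 115200)    # 115200 baud

# (host tx, host rx) timestamps from the log; ASSUMED to be picoseconds.
log = [(5040000, 6095000), (6440000, 7415000),
       (7840000, 8855000), (9240000, 10295000)]
gaps_us = [(rx - tx) / 1e6 for tx, rx in log]

print(f"sim  round trip, wire time only: {sim_rt * 1e9:.0f} ns")
print(f"real round trip, wire time only: {real_rt * 1e6:.1f} us")
print("logged tx-to-rx gaps (us):", gaps_us)
print("baud_div at an assumed 71 MHz clock:", baud_div_for(71e6, 115200))
```

Wire time alone accounts for about 0.8 µs of each logged gap; the remainder is the firmware noticing rx_valid and issuing the write.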

See also