P06’s CPU bolted onto a real bus, with peripherals at fixed memory addresses. Two opcodes were swapped to make room: P06’s `OUT` (special-case UART) and `NOT` (rarely used) became `ST [ra], rb` and `LD rd, [ra]`. The same instructions reach RAM, UART, and GPIO, all through one address space.
Clock target: 71 MHz (14 ns period). 8,330 cells — 3.6× P06’s count once you fold in the bus, the GPIO peripheral, and the UART RX controller.
Three failed attempts came before a clean run. 160 × 160 µm at 100 MHz failed placement at 82% utilisation. 220 × 220 µm at 100 MHz built but missed slow-corner setup by 0.89 ns (slack −0.89 ns). 220 × 220 µm at 83 MHz still missed by 0.22 ns (slack −0.22 ns). 71 MHz (14 ns) lands with +0.63 ns of slack. The bus + UART RX combinational paths are deeper than P06’s hardwired UART, and there’s no pipelining between the regfile read and the bus rdata mux; that’s a P09 problem.
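The slack figures above can be turned into an implied minimum clock period per run (constraint period minus setup slack). The following is a minimal Python sketch of that arithmetic, not output from the flow: the function name is illustrative, and it assumes "83 MHz" means a constraint of 1000/83 ns. The implied period differs from run to run, which is expected because the tools optimise against whatever constraint they are given.

```python
def implied_min_period_ns(period_ns: float, slack_ns: float) -> float:
    """Shortest period this run could have met: constraint minus setup slack."""
    return period_ns - slack_ns


# (constraint period in ns, reported slow-corner setup slack in ns)
runs = {
    "220 um @ 100 MHz": (10.0, -0.89),
    "220 um @ 83 MHz": (1000.0 / 83.0, -0.22),  # assumed exact 83 MHz constraint
    "220 um @ 71 MHz": (14.0, 0.63),
}

for name, (period, slack) in runs.items():
    t_min = implied_min_period_ns(period, slack)
    print(f"{name}: min period {t_min:.2f} ns, about {1000.0 / t_min:.1f} MHz")
```

Read this way, the 14 ns run has a little headroom left, but not enough to reach the 83 MHz constraint that failed.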
What’s new vs. P06
- Bus. One master (the CPU), three slaves (RAM / UART / GPIO), combinational address decoder, mux on the read-data return path.
- Memory-mapped I/O. Peripherals live at fixed addresses; the same `LD`/`ST` instructions reach all of them.
- UART RX. First time we sample a serial line into the chip instead of just emitting from it. Two-flop sync on the rx pin (P04’s lesson reused), then an FSM that detects the start bit’s falling edge, samples each data bit at the middle of its bit-time, latches the assembled byte, and sets an `rx_valid` flag the program can poll.
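The mid-bit sampling idea behind the UART RX item can be sketched as a small behavioural model. This is a Python illustration only, not the RTL: it ignores the two-flop synchronizer delay and the `rx_valid` handshake, and the names `encode_8n1` and `decode_8n1` are made up for this example.

```python
from typing import List, Optional


def encode_8n1(byte: int, clocks_per_bit: int) -> List[int]:
    """One 8N1 frame as per-clock line levels: start bit, 8 data bits LSB-first, stop bit."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    return [level for level in bits for _ in range(clocks_per_bit)]


def decode_8n1(line: List[int], clocks_per_bit: int) -> Optional[int]:
    """Find the start bit's falling edge, then sample each data bit mid-bit."""
    for t in range(1, len(line)):
        if line[t - 1] == 1 and line[t] == 0:  # falling edge marks the start bit
            # Skip the start bit, then land half a bit into data bit 0.
            first = t + clocks_per_bit + clocks_per_bit // 2
            last = first + 7 * clocks_per_bit
            if last >= len(line):
                return None  # frame is truncated
            return sum(line[first + i * clocks_per_bit] << i for i in range(8))
    return None  # line never left idle


idle = [1] * 20
line = idle + encode_8n1(0x68, clocks_per_bit=16) + idle
print(hex(decode_8n1(line, clocks_per_bit=16)))
```

Sampling in the middle of each bit gives the receiver roughly half a bit-time of tolerance in either direction before it would read a neighbouring bit.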
Architecture
Address map
| addr | name | access | notes |
|---|---|---|---|
| 0x00..0x0F | RAM | R/W | 16 bytes, sync-write/async-read |
| 0x40 | UART_TX | W | writes byte and pulses TX start |
| 0x41 | UART_STATUS | R | bit 0 = tx_busy, bit 1 = rx_valid |
| 0x42 | UART_RX | R | reads byte, clears rx_valid |
| 0x80 | GPIO_OUT | R/W | 8-bit output latch |
| 0x81 | GPIO_IN | R | synchronized 8-bit input |
Reads to unmapped addresses return 0x00.
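The address map can be mirrored in a few lines of Python, which is handy for a testbench scoreboard or an assembler. This is a sketch under the map above; `decode`, `bus_read`, and the handler dictionary are hypothetical names, not part of the project.

```python
from typing import Callable, Dict, Optional

# Single-address peripheral registers from the address map.
REGISTERS = {
    0x40: "UART_TX",
    0x41: "UART_STATUS",
    0x42: "UART_RX",
    0x80: "GPIO_OUT",
    0x81: "GPIO_IN",
}


def decode(addr: int) -> Optional[str]:
    """Return the name of the region an 8-bit address falls in, or None if unmapped."""
    addr &= 0xFF
    if addr <= 0x0F:
        return "RAM"
    return REGISTERS.get(addr)


def bus_read(addr: int, read_fns: Dict[str, Callable[[int], int]]) -> int:
    """Model the read path: unmapped or write-only locations read back as 0x00."""
    handler = read_fns.get(decode(addr) or "")
    return handler(addr) & 0xFF if handler else 0x00


ram = bytearray(16)
ram[3] = 0xAB
read_fns = {
    "RAM": lambda a: ram[a & 0x0F],
    "UART_STATUS": lambda a: 0b10,  # pretend rx_valid is set and tx_busy is clear
}
print(decode(0x41), hex(bus_read(0x03, read_fns)), hex(bus_read(0x55, read_fns)))
```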
Instruction set (changes from P06)
| op | mnemonic | semantics |
|---|---|---|
| 0x7 | ST | mem[regs[ra]] = regs[rb] |
| 0x9 | LD | regs[rd] = mem[regs[ra]] |
NOT (P06’s 0x9) is gone — reproducible as XOR rd, ra, R7 after
loading 0xFF into R7. OUT (P06’s 0x7) is replaced by writing to
address 0x40 via ST. Same chip, more general I/O.
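To make the two replacements concrete, here is a small Python encoder sketch. The field layout (opcode in bits 15:12, rd in 11:9, ra in 8:6, rb in 5:3, LDI immediate in 8:1) is taken from the encoder helpers in the RTL listing; the Python function names are hypothetical and exist only for this example.

```python
def enc_rrr(opcode: int, rd: int, ra: int, rb: int) -> int:
    """Pack a three-register instruction: {opcode, rd, ra, rb, 3'b000}."""
    return (opcode << 12) | (rd << 9) | (ra << 6) | (rb << 3)


def st(ra: int, rb: int) -> int:
    """ST [ra], rb (rd field unused, encoded as 0)."""
    return enc_rrr(0x7, 0, ra, rb)


def ld(rd: int, ra: int) -> int:
    """LD rd, [ra] (rb field unused, encoded as 0)."""
    return enc_rrr(0x9, rd, ra, 0)


def ldi(rd: int, imm: int) -> int:
    """LDI rd, imm8: {4'hA, rd, imm, 1'b0}."""
    return (0xA << 12) | (rd << 9) | ((imm & 0xFF) << 1)


def xor(rd: int, ra: int, rb: int) -> int:
    return enc_rrr(0x4, rd, ra, rb)


# P06's OUT becomes: point a register at UART_TX (0x40), then store to it.
out_h = [ldi(7, 0x40), ldi(4, ord("h")), st(7, 4)]

# P06's NOT R1, R2 becomes: load 0xFF into R7, then XOR with it.
not_r1_r2 = [ldi(7, 0xFF), xor(1, 2, 7)]

print([f"{w:04X}" for w in out_h + not_r1_r2])
```

Both idioms cost a register (R7 here) to hold either the peripheral address or the all-ones mask, which is the price of dropping the dedicated opcodes.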
Reading the layout
The 220 × 220 µm die has enough room that the placer cleanly separates the CPU on the left from the peripherals on the right:
- R outlines the 56-flop register file along the left edge — the placer keeps it close to the ALU comb logic that reads from and writes to it.
- S outlines the 4-flop FSM state register tucked above the regfile. Same brain as P06; the encoding picked up one more bit because LD/ST are new opcodes.
- M outlines the 128-flop RAM sprawled along the bottom-center. 16 bytes × 8 bits — by far the biggest single cluster of flops on the chip. P09’s RV32I core will replace this with an SRAM macro (project 08’s whole point).
- T outlines the UART transmitter in a column on the right: the same module from P03, wrapped here as `u_tx` inside `u_uart`.
- X outlines the UART receiver, new for P07. 37 flops including a two-flop sync on the rx pin, an 11-state FSM, an 8-bit shift register, and a 16-bit baud counter.
- G outlines the GPIO peripheral in the bottom-right.
RTL
// Project 07: tiny memory-mapped SoC.
//
// Project 06 was a CPU with a hard-wired UART hanging off the side; one
// instruction (OUT) drove it. P07 is the same CPU with that special-case
// instruction removed, replaced by generic LD / ST to a real memory
// bus, and several peripherals dangling off that bus at fixed addresses.
//
// Architecture:
//
// ┌──────── bus_addr / wdata / we / re ────────┐
// ▼ │
// ┌─────┐ ┌─────────────────┐ │
// │ CPU │──▶│ address decoder │──▶ slave selects │
// └─────┘ └─────────────────┘ │
// ▲ │ │
// │ ▼ │
// │ ┌──────────────────────────────────────────────┐ │
// │ │ RAM UART_TX/RX/STATUS GPIO_OUT/IN │ ◀──────┘
// │ └──────────────────────────────────────────────┘
// │ │
// └──────────────┘ bus_rdata (mux on selected slave)
//
// Address map (8-bit address space):
//
// 0x00 .. 0x0F 16-byte RAM (R/W)
// 0x40 UART TX data (W: writes byte + pulses start)
// 0x41 UART status (R: bit0=tx_busy, bit1=rx_valid)
// 0x42 UART RX data (R: read byte, clears rx_valid)
// 0x80 GPIO output register (R/W: reads return the output latch)
// 0x81 GPIO input snapshot (R)
//
// Reads to unmapped addresses return 0x00.
//
// Instruction set is P06's, with two ops swapped:
// - 0x7 OUT → ST [ra], rb (mem[ra] = rb)
// - 0x9 NOT → LD rd, [ra] (rd = mem[ra])
//
// NOT was the least-used ALU op in P06's programs anyway. It's
// reproducible as `XOR rd, ra, R7` after `LDI R7, 0xFF`.
//
// What this project teaches that P06 didn't:
// - A real **bus** with one master (the CPU) and several slaves.
// - **Memory-mapped I/O**: peripherals look like memory addresses;
// the same LD/ST instructions reach them all.
// - **Address decoding** as a combinational function of the address.
// - **UART RX**: first time we sample a serial line into the chip
// instead of just emitting from it. Sets up the interactive
// `screen`-against-the-simulator demo described in /stack.
`default_nettype none
// Top-level parameter PROG is the 64×16-bit packed boot program. The
// testbench overrides it per-DUT; production hardens use the cpu
// module's DEFAULT_PROG. We forward it through here so users only ever
// see one parameter name.
module top #(
parameter logic [64*16-1:0] PROG = {
{44{16'h0000}}, // 63..20: zero-fill (matches cpu.DEFAULT_PROG)
{4'hF, 12'h000}, // 19: HLT
{4'hE, 6'b000000, 6'd16}, // 18: BNZ wait3
{4'h2, 3'd0, 3'd3, 3'd5, 3'b000}, // 17: AND R0,R3,R5
{4'h9, 3'd3, 3'd6, 3'b000, 3'b000}, // 16: LD R3,[R6]
{4'h7, 3'b000, 3'd7, 3'd4, 3'b000}, // 15: ST [R7],R4
{4'hA, 3'd4, 8'h0A, 1'b0}, // 14: LDI R4,'\n'
{4'hE, 6'b000000, 6'd11}, // 13: BNZ wait2
{4'h2, 3'd0, 3'd3, 3'd5, 3'b000}, // 12
{4'h9, 3'd3, 3'd6, 3'b000, 3'b000}, // 11
{4'h7, 3'b000, 3'd7, 3'd4, 3'b000}, // 10
{4'hA, 3'd4, 8'h69, 1'b0}, // 09: LDI R4,'i'
{4'hE, 6'b000000, 6'd6}, // 08
{4'h2, 3'd0, 3'd3, 3'd5, 3'b000}, // 07
{4'h9, 3'd3, 3'd6, 3'b000, 3'b000}, // 06
{4'h7, 3'b000, 3'd7, 3'd4, 3'b000}, // 05
{4'hA, 3'd4, 8'h68, 1'b0}, // 04: LDI R4,'h'
{4'hA, 3'd5, 8'h01, 1'b0}, // 03
{4'hA, 3'd6, 8'h41, 1'b0}, // 02
{4'hA, 3'd7, 8'h40, 1'b0}, // 01: LDI R7,0x40
16'h0000 // 00: NOP (ADD R0,R0,R0), fetched at PC=0 so the labels above match the PC
}
) (
input logic clk,
input logic rst_n,
input logic start, // currently unused (1 = run)
// UART pins
input logic [15:0] baud_div, // clocks per bit, minus 1
output logic uart_tx, // serial out (idle-high)
input logic uart_rx, // serial in (idle-high)
// GPIO
input logic [7:0] gpio_in,
output logic [7:0] gpio_out,
// Debug
output logic [5:0] pc_out, // 6-bit PC → 64-entry ROM
output logic halted
);
// ====================================================================
// Bus
// ====================================================================
logic [7:0] bus_addr;
logic [7:0] bus_wdata;
logic [7:0] bus_rdata;
logic bus_we;
logic bus_re;
// Slave selects — one-hot from the address.
wire ram_sel = (bus_addr <= 8'h0F);
wire uart_sel = (bus_addr >= 8'h40) && (bus_addr <= 8'h42);
wire gpio_sel = (bus_addr >= 8'h80) && (bus_addr <= 8'h81);
// Read-data mux — combinational, picks the active slave's rdata.
logic [7:0] ram_rdata;
logic [7:0] uart_rdata;
logic [7:0] gpio_rdata;
always_comb begin
if (ram_sel) bus_rdata = ram_rdata;
else if (uart_sel) bus_rdata = uart_rdata;
else if (gpio_sel) bus_rdata = gpio_rdata;
else bus_rdata = 8'h00;
end
// ====================================================================
// CPU — same shape as P06 but with LD/ST replacing NOT/OUT.
// ====================================================================
cpu #(.PROG(PROG)) u_cpu (
.clk (clk),
.rst_n (rst_n),
.start (start),
.bus_addr (bus_addr),
.bus_wdata (bus_wdata),
.bus_we (bus_we),
.bus_re (bus_re),
.bus_rdata (bus_rdata),
.pc_out (pc_out),
.halted (halted)
);
// ====================================================================
// RAM — 16 bytes of synchronous-write, asynchronous-read storage.
// ====================================================================
ram u_ram (
.clk (clk),
.rst_n (rst_n),
.addr (bus_addr[3:0]),
.wdata (bus_wdata),
.we (bus_we & ram_sel),
.rdata (ram_rdata)
);
// ====================================================================
// UART — TX + RX + status register, all on three bus addresses.
// ====================================================================
uart u_uart (
.clk (clk),
.rst_n (rst_n),
.baud_div (baud_div),
.reg_addr (bus_addr[1:0]),
.reg_wdata (bus_wdata),
.reg_we (bus_we & uart_sel),
.reg_re (bus_re & uart_sel),
.reg_rdata (uart_rdata),
.tx (uart_tx),
.rx (uart_rx)
);
// ====================================================================
// GPIO — 8-bit output latch + 8-bit input snapshot.
// ====================================================================
gpio u_gpio (
.clk (clk),
.rst_n (rst_n),
.reg_addr (bus_addr[0]),
.reg_wdata (bus_wdata),
.reg_we (bus_we & gpio_sel),
.reg_rdata (gpio_rdata),
.gpio_in (gpio_in),
.gpio_out (gpio_out)
);
endmodule
// =====================================================================
// CPU — multi-cycle FSM CPU with LD/ST bus interface.
// Same shape as project 06, with two opcodes swapped:
// 0x7 OUT (P06) → 0x7 ST [ra], rb
// 0x9 NOT (P06) → 0x9 LD rd, [ra]
// LD reads from the bus during EXECUTE and writes regs[rd] in WB.
// ST writes the bus during EXECUTE; WB just advances.
// =====================================================================
module cpu (
input logic clk,
input logic rst_n,
input logic start,
// Bus master interface
output logic [7:0] bus_addr,
output logic [7:0] bus_wdata,
output logic bus_we,
output logic bus_re,
input logic [7:0] bus_rdata,
// Debug
output logic [5:0] pc_out,
output logic halted
);
// ----- ROM (64 entries × 16 bits, packed bit-vector parameter) -----
localparam int ROM_DEPTH = 64;
localparam int PROG_BITS = ROM_DEPTH * 16;
// Encoder helpers — Verilog-2001 style (no `return`, no `automatic`)
// so Yosys's read_verilog frontend accepts them.
function [15:0] op_alu(input [3:0] opcode,
input [2:0] rd,
input [2:0] ra,
input [2:0] rb);
op_alu = {opcode, rd, ra, rb, 3'b000};
endfunction
function [15:0] op_unary(input [3:0] opcode,
input [2:0] rd,
input [2:0] ra);
op_unary = {opcode, rd, ra, 3'b000, 3'b000};
endfunction
function [15:0] op_ldi(input [2:0] rd, input [7:0] imm);
op_ldi = {4'hA, rd, imm, 1'b0};
endfunction
function [15:0] op_st(input [2:0] ra, input [2:0] rb);
// ST [ra], rb — rd field unused (encoded as 0)
op_st = {4'h7, 3'b000, ra, rb, 3'b000};
endfunction
function [15:0] op_ld(input [2:0] rd, input [2:0] ra);
// LD rd, [ra] — rb field unused
op_ld = {4'h9, rd, ra, 3'b000, 3'b000};
endfunction
function [15:0] op_jmp(input [3:0] opcode, input [5:0] addr);
// 4 + 6 + 6 = 16 (using 6-bit addr now that ROM = 64)
op_jmp = {opcode, 6'b000000, addr};
endfunction
// Default boot program: print "hi\n" out the UART using LD/ST. The
// outer testbench can override PROG with its own program.
//
// Memory map mirror (constants used by the program):
// R7 = 0x40 UART TX data
// R6 = 0x41 UART status
// R5 = 0x01 TX_BUSY mask
//
// Pseudo-asm:
// LDI R7, 0x40
// LDI R6, 0x41
// LDI R5, 0x01
// LDI R4, 'h'
// ST [R7], R4
// wait1:
// LD R3, [R6]
// AND R0, R3, R5 ; sets Z flag based on busy bit
// BNZ wait1
// LDI R4, 'i'
// ST [R7], R4
// wait2:
// LD R3, [R6]
// AND R0, R3, R5
// BNZ wait2
// LDI R4, 0x0A ; '\n'
// ST [R7], R4
// wait3:
// LD R3, [R6]
// AND R0, R3, R5
// BNZ wait3
// HLT
localparam logic [PROG_BITS-1:0] DEFAULT_PROG = {
{44{16'h0000}}, // 63..20: zero-fill
{4'hF, 12'h000}, // 19: HLT
op_jmp (4'hE, 6'd16), // 18: BNZ wait3 (PC=16)
op_alu (4'h2, 3'd0, 3'd3, 3'd5), // 17: AND R0,R3,R5
op_ld (3'd3, 3'd6), // 16: LD R3,[R6] (wait3)
op_st (3'd7, 3'd4), // 15: ST [R7],R4
op_ldi (3'd4, 8'h0A), // 14: LDI R4,'\n'
op_jmp (4'hE, 6'd11), // 13: BNZ wait2 (PC=11)
op_alu (4'h2, 3'd0, 3'd3, 3'd5), // 12: AND
op_ld (3'd3, 3'd6), // 11: LD R3,[R6] (wait2)
op_st (3'd7, 3'd4), // 10: ST [R7],R4
op_ldi (3'd4, 8'h69), // 09: LDI R4,'i'
op_jmp (4'hE, 6'd6), // 08: BNZ wait1 (PC=6)
op_alu (4'h2, 3'd0, 3'd3, 3'd5), // 07: AND
op_ld (3'd3, 3'd6), // 06: LD R3,[R6] (wait1)
op_st (3'd7, 3'd4), // 05: ST [R7],R4
op_ldi (3'd4, 8'h68), // 04: LDI R4,'h'
op_ldi (3'd5, 8'h01), // 03: LDI R5,0x01 (busy mask)
op_ldi (3'd6, 8'h41), // 02: LDI R6,0x41 (status)
op_ldi (3'd7, 8'h40), // 01: LDI R7,0x40 (tx data)
16'h0000 // 00: NOP (ADD R0,R0,R0)
};
// The concatenation is packed MSB-first, so the last entry listed sits
// in bits [15:0] and is fetched at PC=0; each line above it is the
// next PC. The explicit NOP in slot 00 brings the list to 64 words and
// makes the comment numbering (00, 01, ...) equal to the PC, which the
// BNZ targets (6, 11, 16) rely on. Without it every instruction sits
// one slot lower and each BNZ lands on the AND instead of the LD.
parameter logic [PROG_BITS-1:0] PROG = DEFAULT_PROG;
// ----- PC, IR, instruction decode -----
logic [5:0] pc;
logic [15:0] ir;
wire [3:0] dec_op = ir[15:12];
wire [2:0] dec_rd = ir[11: 9];
wire [2:0] dec_ra = ir[ 8: 6];
wire [2:0] dec_rb = ir[ 5: 3];
wire [7:0] dec_imm = ir[ 8: 1]; // LDI imm payload
wire [5:0] dec_addr= ir[ 5: 0]; // 6-bit branch/jump target
// Opcode → ALU op routing. ST and LD don't go through the ALU
// arithmetic itself; LD takes its result from the bus, ST doesn't
// produce a result.
logic [3:0] alu_op;
always_comb begin
unique case (dec_op)
4'h0: alu_op = 4'b0000; // ADD
4'h1: alu_op = 4'b0001; // SUB
4'h2: alu_op = 4'b0010; // AND
4'h3: alu_op = 4'b0011; // OR
4'h4: alu_op = 4'b0100; // XOR
4'h5: alu_op = 4'b0101; // SHL
4'h6: alu_op = 4'b0110; // SHR
4'h7: alu_op = 4'b1000; // ST — uses MOV/passthrough; bus side-effect handled below
4'h8: alu_op = 4'b1000; // MOV
4'h9: alu_op = 4'b1000; // LD — passthrough; result_q comes from bus_rdata in EXECUTE
4'hA: alu_op = 4'b1000; // LDI
4'hB: alu_op = 4'b0001; // CMP routes through SUB
default: alu_op = 4'b1000;
endcase
end
// Per-instruction control signals
wire is_st = (dec_op == 4'h7);
wire is_ld = (dec_op == 4'h9);
wire is_alu_rr = (dec_op <= 4'h6) || (dec_op == 4'h8); // pure ALU
wire is_ldi = (dec_op == 4'hA);
wire is_cmp = (dec_op == 4'hB);
wire is_jmp = (dec_op == 4'hC);
wire is_bz = (dec_op == 4'hD);
wire is_bnz = (dec_op == 4'hE);
wire is_hlt = (dec_op == 4'hF);
wire is_branch = is_jmp | is_bz | is_bnz;
// ALU ops + AND/OR/XOR update flags (without writing rd if rd == R0).
// LDI, MOV (0x8), branches, ST, HLT, LD don't update flags.
// Note: AND with rd=R0 is the standard "set flags only" idiom — its
// flag update still happens because we gate flag_update on opcode,
// not on rd.
wire flag_update = ~(dec_op == 4'h8 || is_ldi || is_branch || is_hlt
|| is_st || is_ld);
// Register-write enable: ALU rr ops, LDI, and LD all write rd.
// ST and CMP do not write. R0 writes are silently dropped.
wire reg_write = (is_alu_rr || is_ldi || is_ld) && (dec_rd != 3'd0);
// ----- Datapath: regfile + ALU + flag register -----
logic [7:0] regs [0:7];
logic [3:0] flags_q;
logic [7:0] op_a;
logic [7:0] op_b;
logic [7:0] result_q;
logic [3:0] flags_d;
function [7:0] reg_read(input [2:0] sel);
if (sel == 3'd0) reg_read = 8'h00;
else reg_read = regs[sel];
endfunction
// ALU
wire [7:0] a_data = op_a;
wire [7:0] b_data = op_b;
wire [8:0] add_w = {1'b0, a_data} + {1'b0, b_data};
wire [8:0] sub_w = {1'b0, a_data} - {1'b0, b_data};
logic [7:0] alu_y;
logic c_out, v_out;
always_comb begin
alu_y = 8'h00;
c_out = 1'b0;
v_out = 1'b0;
unique case (alu_op)
4'b0000: begin alu_y = add_w[7:0]; c_out = add_w[8];
v_out = (a_data[7] == b_data[7]) && (alu_y[7] != a_data[7]); end
4'b0001: begin alu_y = sub_w[7:0]; c_out = sub_w[8];
v_out = (a_data[7] != b_data[7]) && (alu_y[7] != a_data[7]); end
4'b0010: alu_y = a_data & b_data;
4'b0011: alu_y = a_data | b_data;
4'b0100: alu_y = a_data ^ b_data;
4'b0101: begin alu_y = {a_data[6:0], 1'b0}; c_out = a_data[7]; end
4'b0110: begin alu_y = {1'b0, a_data[7:1]}; c_out = a_data[0]; end
4'b1000: alu_y = a_data;
default: alu_y = a_data;
endcase
end
wire z_out = (alu_y == 8'h00);
wire n_out = alu_y[7];
// ----- FSM -----
typedef enum logic [2:0] {
S_FETCH = 3'd0,
S_DECODE = 3'd1,
S_EXECUTE = 3'd2,
S_WB = 3'd3,
S_HALT = 3'd4
} state_t;
state_t state, next_state;
always_comb begin
next_state = state;
unique case (state)
S_FETCH: next_state = S_DECODE;
S_DECODE: next_state = S_EXECUTE;
S_EXECUTE: next_state = S_WB;
S_WB: if (is_hlt) next_state = S_HALT; else next_state = S_FETCH;
S_HALT: next_state = S_HALT;
default: next_state = S_FETCH;
endcase
end
// Branch resolution (uses the most recently captured flag register).
wire take_branch = is_jmp
|| (is_bz && flags_q[3])
|| (is_bnz && ~flags_q[3]);
// ----- Bus master signals -----
// ST drives the bus during EXECUTE (1-cycle write).
// LD drives bus_re during EXECUTE; the slave's rdata is captured on
// the EXECUTE→WB clock edge into result_q.
always_comb begin
bus_addr = 8'h00;
bus_wdata = 8'h00;
bus_we = 1'b0;
bus_re = 1'b0;
if (state == S_EXECUTE) begin
if (is_st) begin
bus_addr = op_a; // address in regs[ra]
bus_wdata = op_b; // data in regs[rb]
bus_we = 1'b1;
end else if (is_ld) begin
bus_addr = op_a; // address in regs[ra]
bus_re = 1'b1;
end
end
end
// ----- Sequential state -----
integer i;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
state <= S_FETCH;
pc <= 6'd0;
ir <= 16'h0000;
op_a <= 8'h00;
op_b <= 8'h00;
result_q <= 8'h00;
flags_d <= 4'h0;
flags_q <= 4'h0;
for (i = 0; i < 8; i = i + 1) regs[i] <= 8'h00;
end else begin
state <= next_state;
unique case (state)
S_FETCH: begin
ir <= PROG[16*pc +: 16];
pc <= pc + 6'd1;
end
S_DECODE: begin
op_a <= is_ldi ? dec_imm : reg_read(dec_ra);
op_b <= reg_read(dec_rb);
end
S_EXECUTE: begin
// For LD, capture the bus read into result_q on the next edge.
// Otherwise capture the ALU output.
result_q <= is_ld ? bus_rdata : alu_y;
flags_d <= {z_out, n_out, c_out, v_out};
end
S_WB: begin
if (reg_write) regs[dec_rd] <= result_q;
if (flag_update || is_cmp) flags_q <= flags_d;
if (is_branch && take_branch) pc <= dec_addr;
end
S_HALT: ;
default: ;
endcase
end
end
assign pc_out = pc;
assign halted = (state == S_HALT);
// start input reserved for future use.
wire _unused = &{1'b0, start};
endmodule
// =====================================================================
// 16-byte RAM. Synchronous write, asynchronous read.
// =====================================================================
module ram (
input logic clk,
input logic rst_n,
input logic [3:0] addr,
input logic [7:0] wdata,
input logic we,
output logic [7:0] rdata
);
logic [7:0] mem [0:15];
integer i;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) for (i = 0; i < 16; i = i + 1) mem[i] <= 8'h00;
else if (we) mem[addr] <= wdata;
end
assign rdata = mem[addr];
endmodule
// =====================================================================
// UART — TX + RX + status, exposed as 3 bus-mapped registers.
//
// reg_addr 2'b00 (0x40) W: TX data; latches and pulses tx_start
// reg_addr 2'b01 (0x41) R: status {6'b0, rx_valid, tx_busy}
// reg_addr 2'b10 (0x42) R: RX data; clears rx_valid as a side effect
//
// TX is the same 8N1 transmitter from project 03/06. RX is new — it
// detects the falling edge of `rx` (start bit), waits 1.5 bit-times to
// land mid-bit-0, samples 8 bits at one bit-time each, captures the
// stop bit, then sets rx_valid + latches the byte.
// =====================================================================
module uart (
input logic clk,
input logic rst_n,
input logic [15:0] baud_div,
input logic [1:0] reg_addr,
input logic [7:0] reg_wdata,
input logic reg_we,
input logic reg_re,
output logic [7:0] reg_rdata,
output logic tx,
input logic rx
);
// ---- TX ----
logic tx_start_pulse;
logic [7:0] tx_data;
logic tx_busy;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
tx_start_pulse <= 1'b0;
tx_data <= 8'h00;
end else begin
tx_start_pulse <= (reg_we && reg_addr == 2'b00);
if (reg_we && reg_addr == 2'b00) tx_data <= reg_wdata;
end
end
uart_tx u_tx (
.clk (clk),
.rst_n (rst_n),
.start (tx_start_pulse),
.data (tx_data),
.baud_div(baud_div),
.tx (tx),
.busy (tx_busy)
);
// ---- RX ----
logic [7:0] rx_byte;
logic rx_valid;
logic rx_clear;
// Pulse rx_clear on a read of the RX data register (0x42).
assign rx_clear = (reg_re && reg_addr == 2'b10);
uart_rx u_rx (
.clk (clk),
.rst_n (rst_n),
.baud_div (baud_div),
.rx (rx),
.byte_o (rx_byte),
.valid (rx_valid),
.clear (rx_clear)
);
// ---- register read mux ----
always_comb begin
reg_rdata = 8'h00;
unique case (reg_addr)
2'b00: reg_rdata = 8'h00; // TX data is write-only
2'b01: reg_rdata = {6'b0, rx_valid, tx_busy};
2'b10: reg_rdata = rx_byte;
default: reg_rdata = 8'h00;
endcase
end
endmodule
// =====================================================================
// uart_tx — 8N1 transmitter. Same module as project 03; copied here
// so each project stays self-contained.
// =====================================================================
module uart_tx (
input logic clk,
input logic rst_n,
input logic start,
input logic [7:0] data,
input logic [15:0] baud_div,
output logic tx,
output logic busy
);
typedef enum logic [3:0] {
U_IDLE = 4'd0, U_START = 4'd1,
U_D0 = 4'd2, U_D1 = 4'd3, U_D2 = 4'd4, U_D3 = 4'd5,
U_D4 = 4'd6, U_D5 = 4'd7, U_D6 = 4'd8, U_D7 = 4'd9,
U_STOP = 4'd10
} ustate_t;
ustate_t ustate, ustate_next;
logic [15:0] baud_cnt;
logic bit_tick;
assign bit_tick = (baud_cnt == 16'd0);
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) baud_cnt <= 16'd0;
else if (ustate == U_IDLE) baud_cnt <= baud_div;
else if (bit_tick) baud_cnt <= baud_div;
else baud_cnt <= baud_cnt - 16'd1;
end
logic [7:0] data_q;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) data_q <= 8'h00;
else if (ustate == U_IDLE && start) data_q <= data;
end
always_comb begin
ustate_next = ustate;
unique case (ustate)
U_IDLE: if (start) ustate_next = U_START;
U_START: if (bit_tick) ustate_next = U_D0;
U_D0: if (bit_tick) ustate_next = U_D1;
U_D1: if (bit_tick) ustate_next = U_D2;
U_D2: if (bit_tick) ustate_next = U_D3;
U_D3: if (bit_tick) ustate_next = U_D4;
U_D4: if (bit_tick) ustate_next = U_D5;
U_D5: if (bit_tick) ustate_next = U_D6;
U_D6: if (bit_tick) ustate_next = U_D7;
U_D7: if (bit_tick) ustate_next = U_STOP;
U_STOP: if (bit_tick) ustate_next = U_IDLE;
default: ustate_next = U_IDLE;
endcase
end
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) ustate <= U_IDLE;
else ustate <= ustate_next;
end
always_comb begin
unique case (ustate)
U_IDLE: tx = 1'b1;
U_START: tx = 1'b0;
U_D0: tx = data_q[0];
U_D1: tx = data_q[1];
U_D2: tx = data_q[2];
U_D3: tx = data_q[3];
U_D4: tx = data_q[4];
U_D5: tx = data_q[5];
U_D6: tx = data_q[6];
U_D7: tx = data_q[7];
U_STOP: tx = 1'b1;
default: tx = 1'b1;
endcase
end
assign busy = (ustate != U_IDLE);
endmodule
// =====================================================================
// uart_rx — 8N1 receiver.
//
// Two-flop synchronizer on the rx pin (P04's lesson reused), then an
// FSM that detects a falling edge for the start bit, waits half a
// bit-time to land in the middle of the start bit, then samples one
// bit per baud_div+1 clock cycles. After capturing the stop bit, it
// latches the assembled byte into byte_o and asserts valid. The host
// reads the byte via the bus (which pulses `clear`) to acknowledge.
//
// Bytes that arrive while valid is still high are dropped — there's
// no FIFO. For the demo this is fine; the program polls the status
// register and reads RX promptly.
// =====================================================================
module uart_rx (
input logic clk,
input logic rst_n,
input logic [15:0] baud_div,
input logic rx, // serial in
output logic [7:0] byte_o,
output logic valid,
input logic clear // pulse high to clear `valid`
);
// ---- two-flop synchronizer on the async rx pin ----
logic rx_s1, rx_s2;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
rx_s1 <= 1'b1;
rx_s2 <= 1'b1;
end else begin
rx_s1 <= rx;
rx_s2 <= rx_s1;
end
end
wire rx_sync = rx_s2;
// ---- FSM ----
typedef enum logic [3:0] {
R_IDLE = 4'd0,
R_START = 4'd1, // half-bit wait into mid of start bit
R_D0 = 4'd2, R_D1 = 4'd3, R_D2 = 4'd4, R_D3 = 4'd5,
R_D4 = 4'd6, R_D5 = 4'd7, R_D6 = 4'd8, R_D7 = 4'd9,
R_STOP = 4'd10
} rstate_t;
rstate_t rstate, rstate_next;
// Bit-timer. In R_IDLE we don't count. On entering R_START we load
// a half bit-time; on every other state transition we load a full
// bit-time so we sample at the middle of each subsequent bit.
logic [15:0] tmr;
logic tick;
assign tick = (tmr == 16'd0);
// Half / full bit-time loads (baud_div is clocks-per-bit minus 1, so
// the half-tick reload is baud_div >> 1).
wire [15:0] full_period = baud_div;
wire [15:0] half_period = {1'b0, baud_div[15:1]};
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) tmr <= 16'd0;
// On the IDLE → START transition (rx falls), load half a bit-time
// so the next tick lands in the middle of the start bit.
else if (rstate == R_IDLE && rx_sync == 1'b0) tmr <= half_period;
else if (rstate == R_IDLE) tmr <= 16'd0;
else if (tick) tmr <= full_period;
else tmr <= tmr - 16'd1;
end
// Shift register that captures the byte LSB-first.
logic [7:0] sr;
always_comb begin
rstate_next = rstate;
unique case (rstate)
R_IDLE: if (rx_sync == 1'b0) rstate_next = R_START; // falling edge = start bit
R_START: if (tick) rstate_next = R_D0;
R_D0: if (tick) rstate_next = R_D1;
R_D1: if (tick) rstate_next = R_D2;
R_D2: if (tick) rstate_next = R_D3;
R_D3: if (tick) rstate_next = R_D4;
R_D4: if (tick) rstate_next = R_D5;
R_D5: if (tick) rstate_next = R_D6;
R_D6: if (tick) rstate_next = R_D7;
R_D7: if (tick) rstate_next = R_STOP;
R_STOP: if (tick) rstate_next = R_IDLE;
default: rstate_next = R_IDLE;
endcase
end
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
rstate <= R_IDLE;
sr <= 8'h00;
byte_o <= 8'h00;
valid <= 1'b0;
end else begin
rstate <= rstate_next;
// (Half-period reload on the IDLE→START transition is handled
// by the timer always_ff above.)
// Sample data bits at tick of R_D0..R_D7.
if (rstate >= R_D0 && rstate <= R_D7 && tick) begin
unique case (rstate)
R_D0: sr[0] <= rx_sync;
R_D1: sr[1] <= rx_sync;
R_D2: sr[2] <= rx_sync;
R_D3: sr[3] <= rx_sync;
R_D4: sr[4] <= rx_sync;
R_D5: sr[5] <= rx_sync;
R_D6: sr[6] <= rx_sync;
R_D7: sr[7] <= rx_sync;
endcase
end
// Latch the full byte on the R_STOP tick (middle of the stop bit).
if (rstate == R_STOP && tick) begin
byte_o <= sr;
valid <= 1'b1;
end
// Bus read clears valid.
if (clear) valid <= 1'b0;
end
end
endmodule
// =====================================================================
// GPIO — output register at reg_addr=0 (0x80), input snapshot at
// reg_addr=1 (0x81).
// =====================================================================
module gpio (
input logic clk,
input logic rst_n,
input logic reg_addr,
input logic [7:0] reg_wdata,
input logic reg_we,
output logic [7:0] reg_rdata,
input logic [7:0] gpio_in,
output logic [7:0] gpio_out
);
logic [7:0] out_q;
// Synchronizer on gpio_in for safe read.
logic [7:0] in_s1, in_s2;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
out_q <= 8'h00;
in_s1 <= 8'h00;
in_s2 <= 8'h00;
end else begin
if (reg_we && reg_addr == 1'b0) out_q <= reg_wdata;
in_s1 <= gpio_in;
in_s2 <= in_s1;
end
end
assign gpio_out = out_q;
always_comb begin
if (reg_addr == 1'b0) reg_rdata = out_q;
else reg_rdata = in_s2;
end
endmodule
`default_nettype wire
Demo
The demo program is 13 instructions of firmware: a poll-and-echo loop that watches UART_STATUS, reads any received byte, and writes it back to UART_TX. The testbench drives `h`, `e`, `y`, `\n` into the chip’s `uart_rx` pin and decodes whatever comes back out of `uart_tx`:
[host tx 5040000] 0x68 'h'
[host rx 6095000] 0x68 'h'
[host tx 6440000] 0x65 'e'
[host rx 7415000] 0x65 'e'
[host tx 7840000] 0x79 'y'
[host rx 8855000] 0x79 'y'
[host tx 9240000] 0x0a '\n'
[host rx 10295000] 0x0a '\n'
About 1 µs of round-trip per character at the demo’s sim baud rate (40 ns/bit). At a real 115200 baud the same loop runs in ~170 µs of chip time. This is the foundation for the planned Verilator + pty harness: once we wire that up, `screen /dev/pts/N` against a running sim talks to this exact echo program.
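The round-trip figures can be sanity-checked with a little arithmetic. The Python sketch below assumes the round trip is dominated by one received 8N1 frame plus one transmitted frame (10 bits each), and it pairs 115200 baud with the 71 MHz clock target purely as an illustration; the function names are hypothetical.

```python
def frame_time_us(baud: float, bits_per_frame: int = 10) -> float:
    """Duration of one 8N1 frame (start + 8 data + stop) in microseconds."""
    return bits_per_frame / baud * 1e6


def baud_div_for(clk_hz: float, baud: float) -> int:
    """baud_div is clocks per bit minus 1, matching the top-level port comment."""
    return round(clk_hz / baud) - 1


sim_baud = 1 / 40e-9  # 40 ns per bit in the demo testbench
print(f"sim:    2 frames = {2 * frame_time_us(sim_baud):.2f} us")
print(f"115200: 2 frames = {2 * frame_time_us(115200):.1f} us")
print(f"baud_div at 71 MHz, 115200 baud: {baud_div_for(71e6, 115200)}")
```

Two frames account for 0.8 µs of the roughly 1 µs seen in simulation; the remainder is the polling loop and synchronizer latency, which becomes negligible at 115200 baud.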
See also
- Project 06 → the CPU this SoC builds on.
- Project 04 → the synchronizer pattern reused on `uart_rx`.
- Project README
- /stack — the toolchain that hardens this design.