The smallest thing on this ladder that’s defensibly a CPU. Project 05’s datapath gets a control unit bolted on top: a 5-bit program counter walks a 32-entry ROM, an instruction register holds the current instruction, and a 4-stage FSM (plus a HALT state) cycles through one instruction at a time:
FETCH → DECODE → EXECUTE → WB
Hold slack +0.91 ns, target 100 MHz (10 ns period). Composition: P05’s datapath + a 5-state FSM control + 32-instruction PROG ROM + P03’s UART transmitter, all in 2,333 cells.
The 4-stage FSM split paid off: P05 needed 25 ns at the slow corner to fit its all-in-one-cycle datapath. P06 ships at 10 ns with +0.32 ns of slack at the slow corner, which is 2.5× the clock for 4× the cycles per instruction. Adding the UART tightened the slack significantly (the first attempt without UART had +2.65 ns), but it still passes at 100 MHz across all PVT corners. The register file is 48 dfrtp_2 cells; the UART adds another 28 flops (a 4-bit FSM + 8-bit data latch + 16-bit baud counter). DRC, LVS, antenna all clean.
What’s new vs. P05
- Program counter + ROM. 5-bit PC, 32-entry × 16-bit instruction ROM declared as a parameterized constant. Synthesis materializes it as combinational lookup logic — no SRAM macro needed.
- Instruction decode. A 16-op encoding (4-bit opcode, three 3-bit register fields, 8-bit immediate for LDI, 5-bit absolute address for branches). Decode is combinational from the IR.
- Multi-cycle FSM. Each instruction takes 4 cycles (FETCH → DECODE → EXECUTE → WB). Splitting the work across stages shortens the per-cycle critical path, which is the headroom for cranking the clock back to 100 MHz.
- Branches and HALT. JMP is unconditional; BZ and BNZ test the Z flag left by the most recently completed flag-writing instruction. HLT parks the FSM in S_HALT permanently.
- A real UART. The OUT instruction (opcode 0x7, replacing P05’s rarely-useful SAR) pushes regs[ra] out a hardware UART on the chip’s uart_tx pin. The CPU’s FSM stalls during transmission, so consecutive OUTs naturally serialize. The UART module is the same 8N1 transmitter from project 03, inlined inside this design: the first time the ladder reuses a previous project’s RTL.
The FSM
How one instruction executes
The 4-stage walk for a single instruction — ADD R3, R1, R2 — gives
the cleanest read on what the FSM actually buys us:
| cycle | state | what happens |
|---|---|---|
| 1 | FETCH | ir <= PROG[pc]. PC increments to point at the next instruction. The decoder’s combinational fan-out is now valid for the rest of the instruction. |
| 2 | DECODE | op_a <= regs[ra] (= regs[1]), op_b <= regs[rb] (= regs[2]). Each operand becomes a registered byte that the ALU reads next cycle. |
| 3 | EXECUTE | result_q <= alu(op_a, op_b). flags_d <= {Z, N, C, V} is also captured. Pure ALU work — no regfile lookup, no decode, just one combinational ALU pass. |
| 4 | WB | regs[rd] <= result_q and flags_q <= flags_d if the instruction calls for either. Branches override the PC here from dec_addr. |
Each cycle’s combinational chain is now ~1/4 of P05’s all-in-one path. That’s why P05 had to clock at 40 MHz to fit the slow corner, while P06 fits the same datapath plus the control unit on top at 100 MHz: +2.65 ns of slack before the UART was added, and +0.32 ns at the slow corner with it.
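The trade is easy to check with a few lines of arithmetic. The Python sketch below uses only the figures quoted above (P05 at one 25 ns cycle per instruction, P06 at four 10 ns cycles). It assumes every P06 instruction takes exactly four cycles, so OUT stalls are ignored, and the variable names are ours.

```python
# Per-instruction latency from the figures quoted in the text.
# Assumption: P05 retires one instruction per 25 ns cycle; P06 takes
# four 10 ns cycles per instruction (OUT stalls ignored).
P05_PERIOD_NS = 25.0
P06_PERIOD_NS = 10.0
P06_CYCLES_PER_INSTR = 4

p05_latency_ns = P05_PERIOD_NS * 1
p06_latency_ns = P06_PERIOD_NS * P06_CYCLES_PER_INSTR

clock_ratio = P05_PERIOD_NS / P06_PERIOD_NS        # how much faster the clock is
latency_ratio = p06_latency_ns / p05_latency_ns    # how much longer one instruction takes

print(f"P05: {p05_latency_ns:.0f} ns/instr, P06: {p06_latency_ns:.0f} ns/instr")
print(f"clock x{clock_ratio:.1f}, per-instruction latency x{latency_ratio:.1f}")
```

Under these assumptions the split buys clock frequency rather than instruction throughput: the clock is 2.5× faster, but one instruction takes 40 ns instead of 25 ns.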
Instruction set
| op | mnemonic | semantics |
|---|---|---|
| 0x0 | ADD | rd = ra + rb — updates flags |
| 0x1 | SUB | rd = ra - rb — updates flags |
| 0x2 | AND | rd = ra & rb — updates Z, N |
| 0x3 | OR | rd = ra \| rb — updates Z, N |
| 0x4 | XOR | rd = ra ^ rb — updates Z, N |
| 0x5 | SHL | rd = ra << 1 — C ← old MSB |
| 0x6 | SHR | rd = ra >> 1 — C ← old LSB |
| 0x7 | OUT | push regs[ra] out the UART tx pin; FSM stalls until the byte finishes |
| 0x8 | MOV | rd = ra — no flag update |
| 0x9 | NOT | rd = ~ra — updates Z, N |
| 0xA | LDI | rd = imm8 (no flag update) |
| 0xB | CMP | flags = ra - rb, no register write |
| 0xC | JMP | pc = addr5 |
| 0xD | BZ | pc = addr5 if Z = 1 |
| 0xE | BNZ | pc = addr5 if Z = 0 |
| 0xF | HLT | halt forever |
The 16-bit instruction word starts with a 4-bit opcode in bits [15:12].
Register-register instructions (ADD, SUB, AND, …) put rd / ra / rb into the
three 3-bit fields that follow (bits [11:9], [8:6], [5:3]), leaving 3 unused
bits at the bottom. LDI keeps rd in [11:9] and uses bits [8:1] as an 8-bit
immediate (with bit 0 reserved). Branches and JMP use bits [4:0] as a 5-bit
absolute ROM address.
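To make the layout concrete, here is a minimal Python sketch of the three encodings. The function names (`enc_alu`, `enc_ldi`, `enc_jmp`) are ours and simply mirror the field positions described above; the check values are instruction words that appear in the demo log further down.

```python
def enc_alu(op, rd, ra, rb=0):
    """{op[3:0], rd[2:0], ra[2:0], rb[2:0], 3 unused bits}."""
    return ((op & 0xF) << 12) | ((rd & 7) << 9) | ((ra & 7) << 6) | ((rb & 7) << 3)

def enc_ldi(rd, imm):
    """{4'hA, rd[2:0], imm[7:0], 1 reserved bit}."""
    return (0xA << 12) | ((rd & 7) << 9) | ((imm & 0xFF) << 1)

def enc_jmp(op, addr):
    """{op[3:0], 7 unused bits, addr[4:0]}."""
    return ((op & 0xF) << 12) | (addr & 0x1F)

print(f"{enc_ldi(1, 0x01):04x}")        # LDI R1,#0x01  -> a202
print(f"{enc_alu(0x0, 3, 1, 2):04x}")   # ADD R3,R1,R2  -> 0650
print(f"{enc_alu(0x7, 0, 1):04x}")      # OUT R1        -> 7040
print(f"{enc_jmp(0xF, 0):04x}")         # HLT           -> f000
```

Note that the LDI immediate overlaps the ra and rb fields, and the branch address overlaps rb, so which fields mean anything depends entirely on the opcode.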
RTL
The whole CPU is one ~390-line top.sv plus an inlined uart_tx submodule.
The walkthrough below breaks it into the pieces a reader actually wants to
look at, in roughly the order they fire when an instruction executes.
The header and ports
Standard P05-shaped wrapper, plus a baud_div input and a uart_tx
output. The TB drives baud_div low for fast sim; real silicon gets
868 for 115200 baud at 100 MHz.
`default_nettype none
module top (
input logic clk,
input logic rst_n,
input logic start, // currently unused (1 = run)
// UART pacing: clocks-per-bit minus 1. At 100 MHz, 868 → 115200 baud.
// Driven externally so the testbench can speed it up for sim.
input logic [15:0] baud_div,
output logic [7:0] out,
output logic [4:0] pc_out,
output logic halted,
// UART tx pin (idle-high). Pulled out of the chip; off-chip a
// standard 3.3V serial monitor at the matching baud rate decodes
// the bytes the OUT instruction sent.
output logic uart_tx
);
The instruction ROM
The ROM is built up from four little encoder helpers and dropped into a
512-bit packed parameter. Yosys’s read_verilog won’t accept unpacked
array parameters, so the ROM is stored as one giant bit-vector and
sliced with +: at fetch time. The encoder functions are also written
in Verilog-2001 style (assign through the function name, no return)
because Yosys rejects return {...} inside automatic functions.
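As an aside, the packed-vector trick is easy to picture in software. The Python sketch below is only an illustration of the idea (the helper names `pack_rom` and `fetch` are ours): the whole ROM is one big integer with PC=0 in the lowest 16 bits, and a fetch is a shift and mask, which is what the `+:` slice amounts to.

```python
def pack_rom(words, depth=32):
    """Pack 16-bit words into one integer; words[0] is PC=0 (lowest bits)."""
    if len(words) > depth:
        raise ValueError("program does not fit in the ROM")
    rom = 0
    for pc, word in enumerate(words):
        rom |= (word & 0xFFFF) << (16 * pc)
    return rom  # unused entries stay zero-filled

def fetch(rom, pc):
    """Software equivalent of PROG[16*pc +: 16]."""
    return (rom >> (16 * pc)) & 0xFFFF

rom = pack_rom([0xA202, 0xA402, 0x0650])
print(f"{fetch(rom, 0):04x} {fetch(rom, 2):04x} {fetch(rom, 31):04x}")
```

The listing below writes the same thing the other way around: the concatenation is MSB-first, so the last element in the braces is PC=0.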
localparam int ROM_DEPTH = 32;
// Encoder helpers — used only inside the localparam ROM init below,
// so they must be valid in Yosys's SV frontend. We use the Verilog
// function-name-assignment form (no `return` statement) and avoid
// `automatic`, since Yosys's read_verilog rejects `return {...}`
// inside automatic functions.
function [15:0] op_alu(input [3:0] opcode,
input [2:0] rd,
input [2:0] ra,
input [2:0] rb);
op_alu = {opcode, rd, ra, rb, 3'b000};
endfunction
function [15:0] op_unary(input [3:0] opcode,
input [2:0] rd,
input [2:0] ra);
op_unary = {opcode, rd, ra, 3'b000, 3'b000};
endfunction
function [15:0] op_ldi(input [2:0] rd,
input [7:0] imm);
op_ldi = {4'hA, rd, imm, 1'b0};
endfunction
function [15:0] op_jmp(input [3:0] opcode,
input [4:0] addr);
op_jmp = {opcode, 7'b0000000, addr}; // 4 + 7 + 5 = 16 bits
endfunction
// ROM is held as a packed bit-vector — 32 instructions × 16 bits =
// 512 bits. Iverilog doesn't accept unpacked array parameters, but
// is fine with packed bit-vectors. We slice the current word with a
// `+:` index. Concatenation order is MSB-first, so PC=31 sits at
// the top and PC=0 at the bottom.
localparam int PROG_BITS = ROM_DEPTH * 16;
// Default boot program: Fibonacci(6) into R1..R6, then HLT. R7 ends
// up holding the last value (matches the `out` port). Testbench
// overrides this via parameter when it wants to exercise different
// instructions. Listed PC=31 first → PC=0 last.
localparam logic [PROG_BITS-1:0] DEFAULT_PROG = {
{24{16'h0000}}, // 31..08: zero-fill
op_jmp (4'hF, 5'd0), // 07: HLT (jump field unused)
op_unary(4'h8, 3'd7, 3'd6), // 06: R7 = R6 (= out)
op_alu (4'h0, 3'd6, 3'd4, 3'd5), // 05: R6 = R4 + R5 = 8
op_alu (4'h0, 3'd5, 3'd3, 3'd4), // 04: R5 = R3 + R4 = 5
op_alu (4'h0, 3'd4, 3'd2, 3'd3), // 03: R4 = R2 + R3 = 3
op_alu (4'h0, 3'd3, 3'd1, 3'd2), // 02: R3 = R1 + R2 = 2
op_ldi (3'd2, 8'h01), // 01: R2 = 1
op_ldi (3'd1, 8'h01) // 00: R1 = 1
};
parameter logic [PROG_BITS-1:0] PROG = DEFAULT_PROG;
PC, IR, decode
The IR is dissected into named slices the same cycle FETCH latches it.
Everything from here down is combinational off ir. The alu_op mux
is mostly identity (CPU op = ALU op), with a few overrides: CMP routes
through SUB, LDI/MOV/OUT all use the ALU’s a-passthrough, and branches
park the ALU on whatever (the result is unused).
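For readers who want to poke at the decode without a simulator, here is a hedged Python mirror of the slices and the opcode-to-ALU mapping. The function names are ours; the bit positions and the mapping follow the listing below.

```python
def decode(ir):
    """Split a 16-bit instruction word into the named IR slices."""
    return {
        "op":   (ir >> 12) & 0xF,
        "rd":   (ir >> 9) & 0x7,
        "ra":   (ir >> 6) & 0x7,
        "rb":   (ir >> 3) & 0x7,
        "imm":  (ir >> 1) & 0xFF,   # only meaningful for LDI
        "addr": ir & 0x1F,          # only meaningful for JMP/BZ/BNZ
    }

def alu_op(op):
    """CPU opcode -> ALU opcode: identity for most ops, with a few overrides."""
    if op <= 0x6 or op == 0x9:
        return op       # ADD..SHR and NOT map 1:1
    if op == 0xB:
        return 0x1      # CMP runs the subtractor
    return 0x8          # OUT, MOV, LDI and branches/HLT use a-passthrough

fields = decode(0x0650)  # ADD R3,R1,R2
print(fields["op"], fields["rd"], fields["ra"], fields["rb"])
```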
logic [4:0] pc;
logic [15:0] ir;
wire [3:0] dec_op = ir[15:12];
wire [2:0] dec_rd = ir[11: 9];
wire [2:0] dec_ra = ir[ 8: 6];
wire [2:0] dec_rb = ir[ 5: 3];
wire [7:0] dec_imm = ir[ 8: 1]; // LDI imm payload (bit 0 reserved)
wire [4:0] dec_addr= ir[ 4: 0]; // branch/jump target
// Map CPU opcode to ALU op (a 4-bit code matching the P05 ALU).
// Most ALU opcodes are 1:1 with their CPU encoding. CMP routes
// through SUB (subtractor compute, no register write).
logic [3:0] alu_op;
always_comb begin
unique case (dec_op)
4'h0: alu_op = 4'b0000; // ADD
4'h1: alu_op = 4'b0001; // SUB
4'h2: alu_op = 4'b0010; // AND
4'h3: alu_op = 4'b0011; // OR
4'h4: alu_op = 4'b0100; // XOR
4'h5: alu_op = 4'b0101; // SHL
4'h6: alu_op = 4'b0110; // SHR
4'h7: alu_op = 4'b1000; // OUT — passthrough A; UART side-effect handled below
4'h8: alu_op = 4'b1000; // MOV
4'h9: alu_op = 4'b1001; // NOT
4'hA: alu_op = 4'b1000; // LDI uses MOV-with-imm (handled by use_imm)
4'hB: alu_op = 4'b0001; // CMP routes through SUB
default: alu_op = 4'b1000; // JMP/BZ/BNZ/HLT — datapath idle
endcase
end
Per-instruction control signals
Decode is purely combinational off dec_op. Each instruction class
gets a one-bit is_* predicate, and from those we build flag_update
and reg_write — the two write-enables that decide what WB actually
commits.
R0 is conventionally hardwired to zero. Rather than special-casing
reads (which reg_read() does anyway), we drop writes to it at the
control-signal level so synthesis can prove the regs[0] flop is
unreachable and optimize it away.
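A quick way to sanity-check the two write-enables is to tabulate them per opcode. The Python sketch below mirrors the predicates in the listing that follows; `control` is a name we made up, and it returns what WB would commit: whether rd is written and whether flags_q is updated.

```python
def control(op, rd):
    """Return (reg_write, flags_commit) for a CPU opcode and destination register."""
    is_out = op == 0x7
    is_alu_rr = op <= 0x9 and not is_out
    is_ldi = op == 0xA
    is_cmp = op == 0xB
    is_branch = op in (0xC, 0xD, 0xE)
    is_hlt = op == 0xF
    flag_update = not (op == 0x8 or is_out or is_ldi or is_branch or is_hlt)
    reg_write = (is_alu_rr or is_ldi) and rd != 0   # writes to R0 are dropped
    return reg_write, (flag_update or is_cmp)

for name, op, rd in [("ADD", 0x0, 3), ("ADD to R0", 0x0, 0), ("OUT", 0x7, 0),
                     ("MOV", 0x8, 7), ("LDI", 0xA, 1), ("CMP", 0xB, 0)]:
    print(f"{name:10s} reg_write={control(op, rd)[0]!s:5s} flags={control(op, rd)[1]}")
```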
// Per-instruction control signals (combinational from dec_op).
wire is_out = (dec_op == 4'h7); // UART send (replaces SAR)
wire is_alu_rr = (dec_op <= 4'h9) && !is_out; // ALU writeback ops
wire is_ldi = (dec_op == 4'hA);
wire is_cmp = (dec_op == 4'hB);
wire is_jmp = (dec_op == 4'hC);
wire is_bz = (dec_op == 4'hD);
wire is_bnz = (dec_op == 4'hE);
wire is_hlt = (dec_op == 4'hF);
wire is_branch = is_jmp | is_bz | is_bnz;
// Ops that don't update flags: MOV, OUT, LDI, branches and HLT. CMP does
// update flags; only its result is never written back.
wire flag_update = ~(dec_op == 4'h8 || is_out || is_ldi || is_branch || is_hlt);
// Register-write enable: any ALU op (excluding CMP and OUT) writes rd.
// LDI also writes. R0 is hardwired to zero in the regfile so writes
// there are dropped.
wire reg_write = (is_alu_rr || is_ldi) && (dec_rd != 3'd0);
The register file and ALU
This block is verbatim P05’s datapath, inlined so each project on the
ladder stands alone. regs[0:7] is an unpacked array of 8-bit flops,
the ALU is a single combinational case driving alu_y, and flags_q
holds the committed {Z, N, C, V}, copied from flags_d during WB.
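The flag rules are the part most worth checking by hand, so here is a small Python model of the ALU. It is a sketch that follows the case statement in the listing below (9-bit add/subtract for carry, sign-comparison for overflow), not generated from the RTL; the function name `alu` and the tuple layout are our choices.

```python
def alu(op, a, b=0):
    """8-bit ALU model. Returns (y, z, n, c, v) for a 4-bit ALU opcode."""
    a &= 0xFF
    b &= 0xFF
    c = v = 0
    if op == 0x0:                                   # ADD
        w = a + b
        y, c = w & 0xFF, w >> 8
        v = int((a >> 7) == (b >> 7) and (y >> 7) != (a >> 7))
    elif op == 0x1:                                 # SUB (bit 8 is the borrow)
        w = (a - b) & 0x1FF
        y, c = w & 0xFF, w >> 8
        v = int((a >> 7) != (b >> 7) and (y >> 7) != (a >> 7))
    elif op == 0x2:
        y = a & b
    elif op == 0x3:
        y = a | b
    elif op == 0x4:
        y = a ^ b
    elif op == 0x5:                                 # SHL, C <- old MSB
        y, c = (a << 1) & 0xFF, a >> 7
    elif op == 0x6:                                 # SHR, C <- old LSB
        y, c = a >> 1, a & 1
    elif op == 0x7:                                 # arithmetic shift right
        y, c = (a >> 1) | (a & 0x80), a & 1
    elif op == 0x9:
        y = ~a & 0xFF
    else:                                           # 0x8 and default: pass a through
        y = a
    return y, int(y == 0), y >> 7, c, v

print(alu(0x0, 0x7F, 0x01))   # signed overflow: 127 + 1
print(alu(0x0, 0x01, 0xFF))   # 1 + (-1): zero result with carry out
```

The second call is the same operation the countdown test program uses for its decrement: adding 0xFF is subtracting 1, and the result hits zero with the carry set.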
logic [7:0] regs [0:7];
logic [3:0] flags_q; // {Z, N, C, V}
// Decode-stage operand registers (so the ALU compute is a separate
// pipeline-ish stage from regfile read).
logic [7:0] op_a;
logic [7:0] op_b;
// Execute-stage result register, fed to writeback.
logic [7:0] result_q;
logic [3:0] flags_d; // computed in EXECUTE, captured at WB
// Async regfile read selectors. (R0 reads as 0, writes ignored.)
// Verilog-2001 style — Yosys's read_verilog rejects `return` inside
// an automatic function.
function [7:0] reg_read(input [2:0] sel);
if (sel == 3'd0) reg_read = 8'h00;
else reg_read = regs[sel];
endfunction
// ---- ALU: same shape as P05 ----
wire [7:0] a_data = op_a;
wire [7:0] b_data = op_b;
wire [8:0] add_w = {1'b0, a_data} + {1'b0, b_data};
wire [8:0] sub_w = {1'b0, a_data} - {1'b0, b_data};
logic [7:0] alu_y;
logic c_out, v_out;
always_comb begin
alu_y = 8'h00;
c_out = 1'b0;
v_out = 1'b0;
unique case (alu_op)
4'b0000: begin alu_y = add_w[7:0]; c_out = add_w[8];
v_out = (a_data[7] == b_data[7]) && (alu_y[7] != a_data[7]); end
4'b0001: begin alu_y = sub_w[7:0]; c_out = sub_w[8];
v_out = (a_data[7] != b_data[7]) && (alu_y[7] != a_data[7]); end
4'b0010: alu_y = a_data & b_data;
4'b0011: alu_y = a_data | b_data;
4'b0100: alu_y = a_data ^ b_data;
4'b0101: begin alu_y = {a_data[6:0], 1'b0}; c_out = a_data[7]; end
4'b0110: begin alu_y = {1'b0, a_data[7:1]}; c_out = a_data[0]; end
4'b0111: begin alu_y = {a_data[7], a_data[7:1]}; c_out = a_data[0]; end
4'b1000: alu_y = a_data;
4'b1001: alu_y = ~a_data;
default: alu_y = a_data;
endcase
end
wire z_out = (alu_y == 8'h00);
wire n_out = alu_y[7];
The FSM (next-state logic)
Five states, three flops. FETCH→DECODE→EXECUTE→WB is rigid; WB is the only state with a real branch in the next-state logic. HALT is a self-loop with no escape.
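The same next-state table, written as a Python function for quick experiments. This is a sketch that mirrors the always_comb block below; the `State` enum and `next_state` signature are our own framing.

```python
from enum import IntEnum

class State(IntEnum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2
    WB = 3
    HALT = 4

def next_state(state, is_out=False, is_hlt=False, uart_busy=False):
    """Next-state function: only WB looks at anything besides the current state."""
    if state == State.FETCH:
        return State.DECODE
    if state == State.DECODE:
        return State.EXECUTE
    if state == State.EXECUTE:
        return State.WB
    if state == State.WB:
        if is_out and uart_busy:
            return State.WB          # OUT stalls until the UART is idle again
        if is_hlt:
            return State.HALT
        return State.FETCH
    return State.HALT                # HALT is a self-loop

# Walk one ordinary instruction and land back in FETCH.
s = State.FETCH
trace = [s]
for _ in range(4):
    s = next_state(s)
    trace.append(s)
print([t.name for t in trace])   # ['FETCH', 'DECODE', 'EXECUTE', 'WB', 'FETCH']
```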
typedef enum logic [2:0] {
S_FETCH = 3'd0,
S_DECODE = 3'd1,
S_EXECUTE = 3'd2,
S_WB = 3'd3,
S_HALT = 3'd4
} state_t;
state_t state, next_state;
// UART transmitter signals (the submodule is instantiated below).
// We pulse uart_start_pulse for one cycle as the FSM moves from
// EXECUTE to WB on an OUT instruction. The UART then asserts
// uart_busy until the byte (start + 8 data + stop) finishes
// transmitting; the FSM stalls in S_WB while busy is high.
logic uart_start_pulse;
logic [7:0] uart_data;
logic uart_busy;
always_comb begin
next_state = state;
unique case (state)
S_FETCH: next_state = S_DECODE;
S_DECODE: next_state = S_EXECUTE;
S_EXECUTE: next_state = S_WB;
S_WB: begin
// OUT stalls in WB until the UART finishes transmitting. Once
// uart_busy drops we proceed to the next instruction.
if (is_out && uart_busy) next_state = S_WB;
else if (is_hlt) next_state = S_HALT;
else next_state = S_FETCH;
end
S_HALT: next_state = S_HALT;
default: next_state = S_FETCH;
endcase
end
UART and OUT instruction signals
Three wires bridge the CPU’s main FSM and the UART submodule:
uart_start_pulse is high for exactly one clock as we move through
EXECUTE on an OUT, uart_data is the byte to send (already sitting in
op_a from DECODE), and uart_busy is the back-pressure that holds the
main FSM in WB until transmission completes.
// Pulse uart_start for exactly one cycle: when we're moving out of
// EXECUTE on an OUT and have just latched op_a as the byte to send.
assign uart_start_pulse = (state == S_EXECUTE) && is_out;
// The byte to send is sitting in op_a (we routed regs[ra] there in
// DECODE; the ALU's MOV passthrough also computed result_q = op_a in
// EXECUTE, but op_a is one cycle earlier so latency is identical).
assign uart_data = op_a;
// Branch resolution happens in the WB stage.
// It is computed combinationally from flags_q, the most recently
// committed flag register. The branch's own EXECUTE pass overwrites
// flags_d, but that value is never committed (flag_update is 0 for
// branches), so BZ/BNZ see the flags of the last flag-writing
// instruction that completed its WB, e.g. the CMP just before them.
wire take_branch = is_jmp
|| (is_bz && flags_q[3])
|| (is_bnz && ~flags_q[3]);
The sequential always_ff
One big always_ff with a per-state case. Reset clears state, PC, IR, operands, results, flags, and zeroes the regfile. Otherwise each state’s behavior is one or two non-blocking assignments.
integer i;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
state <= S_FETCH;
pc <= 5'd0;
ir <= 16'h0000;
op_a <= 8'h00;
op_b <= 8'h00;
result_q <= 8'h00;
flags_d <= 4'h0;
flags_q <= 4'h0;
for (i = 0; i < 8; i = i + 1) regs[i] <= 8'h00;
end else begin
state <= next_state;
unique case (state)
S_FETCH: begin
ir <= PROG[16*pc +: 16];
// PC advances on FETCH; branches override in WB.
pc <= pc + 5'd1;
end
S_DECODE: begin
// LDI: route imm through op_a so the ALU's MOV (a-passthrough)
// writes it back to rd. ra/rb are unused for LDI.
op_a <= is_ldi ? dec_imm : reg_read(dec_ra);
op_b <= reg_read(dec_rb);
end
S_EXECUTE: begin
result_q <= alu_y;
flags_d <= {z_out, n_out, c_out, v_out};
end
S_WB: begin
if (reg_write) regs[dec_rd] <= result_q;
if (flag_update || is_cmp) flags_q <= flags_d;
if (is_branch && take_branch) pc <= dec_addr;
end
S_HALT: ;
default: ;
endcase
end
end
Outputs
Three port assigns and one explicit lint tieoff for the unused start
input.
// ---- outputs ----
assign out = regs[7];
assign pc_out = pc;
assign halted = (state == S_HALT);
// start input reserved for future use (single-step / run gate).
// Tie off the lint warning explicitly.
wire _unused = &{1'b0, start};
endmodule
The UART submodule
The same 8N1 transmitter project 03 hardened standalone, copied
verbatim into this file. Each project on the ladder stays
self-contained — no include files, no shared modules — so reading
top.sv is enough to understand the whole chip.
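Before reading the RTL, it can help to see what one 8N1 frame looks like on the wire. The Python sketch below is our own illustration, with made-up helper names: it builds the bit sequence, stretches it to per-clock samples at `baud_div + 1` clocks per bit, and decodes it again by sampling mid-bit, in the spirit of the behavioral receiver the demo testbench uses. It assumes the receiver is handed samples starting at the first clock of the start bit.

```python
def uart_frame(byte):
    """Line levels for one 8N1 frame: start (0), 8 data bits LSB-first, stop (1)."""
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]

def uart_waveform(byte, baud_div):
    """Per-clock tx samples; each bit is held for baud_div + 1 clocks."""
    return [bit for bit in uart_frame(byte) for _ in range(baud_div + 1)]

def uart_receive(samples, baud_div):
    """Sample each bit at mid-bit and rebuild the byte."""
    clocks = baud_div + 1
    bits = [samples[i * clocks + clocks // 2] for i in range(10)]
    if bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))

print(uart_frame(0x0D))                            # [0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
print(len(uart_waveform(0x0D, 3)))                 # 40 clocks at baud_div = 3
print(hex(uart_receive(uart_waveform(0xA5, 3), 3)))
```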
// ---------------------------------------------------------------------
// uart_tx — 8N1 UART transmitter, lifted from project 03.
//
// Pulse `start` for one cycle with `data` valid; tx will then drive the
// start bit, 8 data bits LSB-first, and a stop bit, with each bit held
// for `baud_div + 1` clock cycles. busy stays high for the duration so
// the host can poll-and-wait. Idle line level is high.
// ---------------------------------------------------------------------
module uart_tx (
input logic clk,
input logic rst_n,
input logic start,
input logic [7:0] data,
input logic [15:0] baud_div,
output logic tx,
output logic busy
);
typedef enum logic [3:0] {
U_IDLE = 4'd0,
U_START = 4'd1,
U_D0 = 4'd2, U_D1 = 4'd3, U_D2 = 4'd4, U_D3 = 4'd5,
U_D4 = 4'd6, U_D5 = 4'd7, U_D6 = 4'd8, U_D7 = 4'd9,
U_STOP = 4'd10
} ustate_t;
ustate_t ustate, ustate_next;
logic [15:0] baud_cnt;
logic bit_tick;
assign bit_tick = (baud_cnt == 16'd0);
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) baud_cnt <= 16'd0;
else if (ustate == U_IDLE) baud_cnt <= baud_div;
else if (bit_tick) baud_cnt <= baud_div;
else baud_cnt <= baud_cnt - 16'd1;
end
// Latch data on entry to U_START.
logic [7:0] data_q;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) data_q <= 8'h00;
else if (ustate == U_IDLE && start) data_q <= data;
end
always_comb begin
ustate_next = ustate;
unique case (ustate)
U_IDLE: if (start) ustate_next = U_START;
U_START: if (bit_tick) ustate_next = U_D0;
U_D0: if (bit_tick) ustate_next = U_D1;
U_D1: if (bit_tick) ustate_next = U_D2;
U_D2: if (bit_tick) ustate_next = U_D3;
U_D3: if (bit_tick) ustate_next = U_D4;
U_D4: if (bit_tick) ustate_next = U_D5;
U_D5: if (bit_tick) ustate_next = U_D6;
U_D6: if (bit_tick) ustate_next = U_D7;
U_D7: if (bit_tick) ustate_next = U_STOP;
U_STOP: if (bit_tick) ustate_next = U_IDLE;
default: ustate_next = U_IDLE;
endcase
end
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) ustate <= U_IDLE;
else ustate <= ustate_next;
end
always_comb begin
unique case (ustate)
U_IDLE: tx = 1'b1;
U_START: tx = 1'b0;
U_D0: tx = data_q[0];
U_D1: tx = data_q[1];
U_D2: tx = data_q[2];
U_D3: tx = data_q[3];
U_D4: tx = data_q[4];
U_D5: tx = data_q[5];
U_D6: tx = data_q[6];
U_D7: tx = data_q[7];
U_STOP: tx = 1'b1;
default: tx = 1'b1;
endcase
end
assign busy = (ustate != U_IDLE);
endmodule
Verifying testbench
Three programs covering the major instruction classes.
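The expected values in the checks can be cross-checked against a tiny golden model. The Python sketch below is an instruction-level interpreter written for this purpose (it is not part of the project): it ignores cycle timing, models only the Z flag since that is all the branches read, and uses helper names of our own.

```python
def enc_alu(op, rd, ra=0, rb=0):
    return (op << 12) | (rd << 9) | (ra << 6) | (rb << 3)

def enc_ldi(rd, imm):
    return (0xA << 12) | (rd << 9) | ((imm & 0xFF) << 1)

def enc_jmp(op, addr):
    return (op << 12) | (addr & 0x1F)

HLT = 0xF000

def run(program, max_instr=1000):
    """Instruction-level model (no cycle timing). Returns (regs, uart_bytes)."""
    rom = list(program) + [0x0000] * (32 - len(program))
    regs, z, pc, sent = [0] * 8, 0, 0, []
    for _ in range(max_instr):
        ir = rom[pc]
        pc = (pc + 1) & 0x1F                       # PC advances on fetch
        op = ir >> 12
        rd, ra, rb = (ir >> 9) & 7, (ir >> 6) & 7, (ir >> 3) & 7
        a, b = regs[ra], regs[rb]                  # regs[0] is never written, so it reads 0
        if op == 0xF:                              # HLT
            return regs, sent
        if op in (0xC, 0xD, 0xE):                  # JMP / BZ / BNZ
            if op == 0xC or (op == 0xD and z) or (op == 0xE and not z):
                pc = ir & 0x1F
            continue
        if op == 0x7:                              # OUT
            sent.append(a)
            continue
        if op == 0xA:                              # LDI
            y = (ir >> 1) & 0xFF
        else:
            y = {0x0: a + b, 0x1: a - b, 0x2: a & b, 0x3: a | b,
                 0x4: a ^ b, 0x5: a << 1, 0x6: a >> 1, 0x8: a,
                 0x9: ~a, 0xB: a - b}[op] & 0xFF
        if op not in (0x8, 0xA):                   # MOV and LDI leave flags alone
            z = int(y == 0)
        if op != 0xB and rd != 0:                  # CMP writes nothing; R0 writes are dropped
            regs[rd] = y
    raise RuntimeError("program did not halt")

# Test 2 from the testbench: countdown loop with CMP + BNZ.
loop = [enc_ldi(7, 0xFF), enc_ldi(1, 5), enc_ldi(2, 0), enc_ldi(3, 1),
        enc_alu(0x0, 1, 1, 7), enc_alu(0x0, 2, 2, 3), enc_alu(0xB, 0, 1, 0),
        enc_jmp(0xE, 4), enc_alu(0x8, 7, 2), HLT]
regs, _ = run(loop)
print(regs[1], regs[2], regs[7])   # 0 5 5
```

Feeding it the default Fibonacci program or the JMP program gives the other expected values the same way.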
// Project 06 testbench — verifying TB for the tiny FSM CPU.
//
// Strategy: run the DUT's default boot program (Fibonacci(6) into
// R1..R6, copy R6 into R7, then HLT). Wait for `halted` to assert, then
// peek at R1..R7 by reading the DUT's regfile via hierarchical
// access.
//
// We also exercise:
// - LDI to a few different registers and values.
// - ADD/SUB/AND/OR/XOR with reg-reg operands.
// - CMP + BZ/BNZ branch flow.
// - JMP forward / loop.
// - HLT terminates the FSM in the S_HALT state.
//
// Each test uses its own DUT with a different program loaded via the
// PROG parameter. `top` exposes PROG as an overridable parameter; we
// instantiate one `top` per test program (dut1, dut2, dut3), all
// sharing the same clock and reset.
`timescale 1ns/1ps
`default_nettype none
module tb;
// 100 MHz chip clock.
logic clk = 0;
always #5 clk = ~clk;
logic rst_n;
logic start = 1'b1;
int errors = 0;
int test_num = 0;
// ---- helpers ---------------------------------------------------------
// Apply reset and release on the next negedge.
task automatic do_reset;
begin
rst_n = 1'b0;
repeat (4) @(posedge clk);
@(negedge clk); rst_n = 1'b1;
end
endtask
// ---- instruction encoders, kept in sync with top.sv ------------------
// Replicated here because functions inside the DUT aren't visible
// through hierarchical reference for parameter computation.
function automatic logic [15:0] enc_alu(input logic [3:0] op,
input logic [2:0] rd,
input logic [2:0] ra,
input logic [2:0] rb);
return {op, rd, ra, rb, 3'b000};
endfunction
function automatic logic [15:0] enc_unary(input logic [3:0] op,
input logic [2:0] rd,
input logic [2:0] ra);
return {op, rd, ra, 3'b000, 3'b000};
endfunction
function automatic logic [15:0] enc_ldi(input logic [2:0] rd,
input logic [7:0] imm);
return {4'hA, rd, imm, 1'b0};
endfunction
function automatic logic [15:0] enc_jmp(input logic [3:0] op,
input logic [4:0] addr);
return {op, 7'b0000000, addr};
endfunction
function automatic logic [15:0] enc_hlt;
return {4'hF, 12'h000};
endfunction
// ---- Test 1: default Fibonacci program (uses DUT's built-in PROG) ---
// Baud divider — shared across all DUTs. Set to a small value so
// OUT-driven UART transmissions complete quickly in sim. (Real
// hardware uses something like 868 for 115200 @ 100 MHz; sim doesn't
// care about the wire-level timing as long as it's not zero.)
logic [15:0] baud_div = 16'd3;
logic [7:0] dut1_out;
logic [4:0] dut1_pc;
logic dut1_halted;
logic dut1_tx;
top dut1 (.clk(clk), .rst_n(rst_n), .start(start), .baud_div(baud_div),
.out(dut1_out), .pc_out(dut1_pc), .halted(dut1_halted),
.uart_tx(dut1_tx));
// ---- Test 2: branching / loop program ------------------------------
// Counts R1 down from 5 to 0 using ADD (with R7 = -1) + CMP + BNZ.
// Final R1 = 0; R2 tallies the iterations (5) and is copied to R7 (= out).
// There is no decrement or subtract-immediate, and SUB R1,R1,R0 would
// subtract 0, so R7 is preloaded with 0xFF (-1) and added each pass.
//
// 00 LDI R7, 0xFF ; R7 = -1
// 01 LDI R1, 5 ; R1 = 5 (loop count)
// 02 LDI R2, 0 ; R2 = 0 (iteration tally)
// 03 LDI R3, 1 ; R3 = 1 (increment for tally)
// loop:
// 04 ADD R1, R1, R7 ; R1 = R1 - 1 (since R7 = 0xFF wraps as -1)
// 05 ADD R2, R2, R3 ; R2 += 1
// 06 CMP R1, R0 ; compare R1 with 0; sets Z if R1 == 0
// 07 BNZ 04 ; loop if R1 != 0
// 08 MOV R7, R2 ; final tally → R7 (= out)
// 09 HLT
// Packed bit-vector programs — concatenation order is MSB-first, so
// PC=31 sits at the top of the {} list and PC=0 at the bottom.
localparam int PROG_BITS = 32 * 16;
localparam logic [PROG_BITS-1:0] PROG_LOOP = {
{22{16'h0000}}, // 31..10
enc_hlt(), // 09 HLT
enc_unary(4'h8, 3'd7, 3'd2), // 08 MOV R7,R2
enc_jmp (4'hE, 5'd4), // 07 BNZ 04
enc_alu (4'hB, 3'd0, 3'd1, 3'd0), // 06 CMP R1,R0
enc_alu (4'h0, 3'd2, 3'd2, 3'd3), // 05 ADD R2,R2,R3
enc_alu (4'h0, 3'd1, 3'd1, 3'd7), // 04 ADD R1,R1,R7
enc_ldi (3'd3, 8'd1), // 03 R3 = 1
enc_ldi (3'd2, 8'd0), // 02 R2 = 0
enc_ldi (3'd1, 8'd5), // 01 R1 = 5
enc_ldi (3'd7, 8'hFF) // 00 R7 = -1
};
logic [7:0] dut2_out;
logic [4:0] dut2_pc;
logic dut2_halted;
logic dut2_tx;
top #(.PROG(PROG_LOOP)) dut2 (.clk(clk), .rst_n(rst_n), .start(start), .baud_div(baud_div),
.out(dut2_out), .pc_out(dut2_pc), .halted(dut2_halted),
.uart_tx(dut2_tx));
// ---- Test 3: bitwise + JMP-forward (skip-over) -----------------------
// 00 LDI R1, 0xF0
// 01 LDI R2, 0x0F
// 02 OR R3, R1, R2 ; R3 = 0xFF
// 03 JMP 06 ; skip the AND
// 04 AND R3, R1, R2 ; would set R3 = 0 if reached
// 05 HLT
// 06 MOV R7, R3 ; R7 = 0xFF (proves we jumped past the AND)
// 07 HLT
localparam logic [PROG_BITS-1:0] PROG_JMP = {
{24{16'h0000}}, // 31..08
enc_hlt(), // 07 HLT
enc_unary(4'h8, 3'd7, 3'd3), // 06 MOV R7,R3
enc_hlt(), // 05 HLT
enc_alu (4'h2, 3'd3, 3'd1, 3'd2), // 04 AND R3,R1,R2 (skipped)
enc_jmp (4'hC, 5'd6), // 03 JMP 06
enc_alu (4'h3, 3'd3, 3'd1, 3'd2), // 02 OR R3,R1,R2
enc_ldi (3'd2, 8'h0F), // 01 R2 = 0x0F
enc_ldi (3'd1, 8'hF0) // 00 R1 = 0xF0
};
logic [7:0] dut3_out;
logic [4:0] dut3_pc;
logic dut3_halted;
logic dut3_tx;
top #(.PROG(PROG_JMP)) dut3 (.clk(clk), .rst_n(rst_n), .start(start), .baud_div(baud_div),
.out(dut3_out), .pc_out(dut3_pc), .halted(dut3_halted),
.uart_tx(dut3_tx));
task automatic check8(input logic [7:0] got, input logic [7:0] exp,
input string label);
begin
if (got !== exp) begin
$display("FAIL [%s] got 0x%02h, expected 0x%02h", label, got, exp);
errors = errors + 1;
end
end
endtask
// ---- main ------------------------------------------------------------
initial begin
$dumpfile("tb.vcd");
$dumpvars(0, tb);
// ---- Test 1: Fibonacci ----
do_reset();
begin
int cycles; cycles = 0;
while (!dut1_halted && cycles < 1000) begin
@(posedge clk); cycles = cycles + 1;
end
if (!dut1_halted) begin
$display("FAIL [fib] did not halt within 1000 cycles");
errors = errors + 1;
end
end
// Expected regfile after Fibonacci program completes:
// R1=1, R2=1, R3=2, R4=3, R5=5, R6=8, R7=8 (= R6 → out)
check8(dut1.regs[1], 8'd1, "fib R1");
check8(dut1.regs[2], 8'd1, "fib R2");
check8(dut1.regs[3], 8'd2, "fib R3");
check8(dut1.regs[4], 8'd3, "fib R4");
check8(dut1.regs[5], 8'd5, "fib R5");
check8(dut1.regs[6], 8'd8, "fib R6");
check8(dut1_out, 8'd8, "fib out (R7)");
// ---- Test 2: countdown loop with CMP + BNZ ----
do_reset();
begin
int cycles; cycles = 0;
while (!dut2_halted && cycles < 1000) begin
@(posedge clk); cycles = cycles + 1;
end
if (!dut2_halted) begin
$display("FAIL [loop] did not halt within 1000 cycles");
errors = errors + 1;
end
end
// R1 should have decremented to 0; R2 should have tallied 5 iterations.
check8(dut2.regs[1], 8'd0, "loop R1 final");
check8(dut2.regs[2], 8'd5, "loop R2 tally");
check8(dut2_out, 8'd5, "loop out (R7=R2)");
// ---- Test 3: JMP forward skips the AND ----
do_reset();
begin
int cycles; cycles = 0;
while (!dut3_halted && cycles < 1000) begin
@(posedge clk); cycles = cycles + 1;
end
if (!dut3_halted) begin
$display("FAIL [jmp] did not halt within 1000 cycles");
errors = errors + 1;
end
end
check8(dut3.regs[3], 8'hFF, "jmp R3 (= OR result, AND skipped)");
check8(dut3_out, 8'hFF, "jmp out (R7=R3)");
if (errors == 0) $display("PASS: tiny FSM CPU, all programs executed correctly.");
else $display("FAIL: %0d errors", errors);
$finish;
end
initial begin
#5_000_000;
$display("FAIL: testbench timed out");
$finish;
end
endmodule
`default_nettype wire
Demo
The demo program computes Fibonacci(7) into R1..R7 and pushes each
new value out the UART using the OUT instruction. The demo TB
includes a behavioral 8N1 receiver that watches the uart_tx pin,
samples each bit at the middle of its bit time, reconstructs the
byte, and prints it. So the log interleaves the FETCH-cycle CPU
trace with [uart] lines for every byte received on the wire:
[cpu] pc=00 ir=a202 LDI R1,#0x01 | R1=00 R2=00 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu] pc=01 ir=a402 LDI R2,#0x01 | R1=01 R2=00 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu] pc=02 ir=7040 OUT R1 | R1=01 R2=01 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[uart] rx byte 1: 0x01 (1)
[cpu] pc=03 ir=7080 OUT R2 | R1=01 R2=01 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[uart] rx byte 2: 0x01 (1)
[cpu] pc=04 ir=0650 ADD R3,R1,R2 | R1=01 R2=01 R3=00 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu] pc=05 ir=70c0 OUT R3 | R1=01 R2=01 R3=02 R4=00 R5=00 R6=00 R7=00 | flags=0000
[uart] rx byte 3: 0x02 (2)
[cpu] pc=06 ir=0898 ADD R4,R2,R3 | R1=01 R2=01 R3=02 R4=00 R5=00 R6=00 R7=00 | flags=0000
[cpu] pc=07 ir=7100 OUT R4 | R1=01 R2=01 R3=02 R4=03 R5=00 R6=00 R7=00 | flags=0000
[uart] rx byte 4: 0x03 (3)
...
[cpu] pc=13 ir=71c0 OUT R7 | R1=01 R2=01 R3=02 R4=03 R5=05 R6=08 R7=0d | flags=0000
[uart] rx byte 7: 0x0d (13)
[cpu] pc=14 ir=f000 HLT | R1=01 R2=01 R3=02 R4=03 R5=05 R6=08 R7=0d | flags=0000
[cpu] halted at pc=15
Each OUT instruction pulses the UART’s start for one cycle, and
the CPU’s FSM stalls in WB until the UART’s busy line drops:
about 40 clock cycles per byte at the demo’s 4-clocks-per-bit baud
divider. Real hardware running at 115200 baud would stall ~8700
cycles per byte, which is fine because the program is doing nothing
useful while the byte is on the wire anyway.
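Both stall figures fall out of the transmitter's timing: ten bit slots per frame (start, eight data, stop), each held for `baud_div + 1` clocks. A short Python sketch of that arithmetic, with helper names of our own and the 100 MHz clock assumed from the target above:

```python
CLK_HZ = 100_000_000          # assumed chip clock, from the 100 MHz target
BITS_PER_FRAME = 10           # start + 8 data + stop

def clocks_per_byte(baud_div):
    """Clocks the UART stays busy for one byte."""
    return BITS_PER_FRAME * (baud_div + 1)

def actual_baud(baud_div, clk_hz=CLK_HZ):
    """Bit rate on the wire for a given divider."""
    return clk_hz / (baud_div + 1)

print(clocks_per_byte(3))      # demo divider: 40 clocks
print(clocks_per_byte(868))    # 8690 clocks, the "~8700" above
print(f"{actual_baud(868):.1f} baud")
```

Because each bit lasts `baud_div + 1` clocks, a divider of 868 gives 869 clocks per bit, roughly 0.1% slower than a nominal 115200. That is comfortably inside what a typical serial receiver tolerates, though 867 would land slightly closer.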
Reading the layout
Open the viewer at the top and click each annotation in turn:
- A highlights one cell row near the bottom edge — the same building block P01 uses, just one of dozens here.
- R outlines the bulk of the regfile (48 dfrtp_2 cells). Notice how spread out it is: the placer doesn’t keep regs[1][0..7] together, because what it cares about is shortening the wires to the ALU and the writeback mux that read/write each bit. Source-code locality has nothing to do with placement.
- S outlines the 3-flop FSM state register in a narrow column mid-chip. Three bits is everything that distinguishes a CPU from a static datapath. Click into it and you can see how small the “control” really is: a tap-row column with a few flops tucked in.
- D outlines the result/flag writeback registers in the upper half: the result_q and flags_d flops latched at the end of EXECUTE. These exist because of the FSM split; in P05 the same logic ran combinationally without an intermediate register, which is exactly why P05’s slow-corner setup missed.
- U outlines the UART transmitter along the top edge of the chip. The placer kept the whole peripheral together because its 28 flops all talk to each other and to nothing else. The same module that’s spread across 110 µm in P03’s standalone harden is squeezed into a thin band here: different placement context, different shape.
What just happened?
We made a CPU. ~290 lines of SV (130 of them the inlined P05 datapath, ~100 the inlined P03 UART), one parameterized ROM, one 4-stage FSM (five states counting HALT), three 3-bit register-address fields, a real serial output. It runs Fibonacci, can branch, can compare, can jump, can print bytes out a wire, can halt. It’s tiny (0.014 mm² of core area, three flops of state), but the shape is exactly that of a real microcontroller: fetch, decode, execute, writeback, repeat, with a peripheral hanging off the side. P07 puts a real bus on the back so this CPU can talk to several peripherals through the same address space instead of pinning each one to a dedicated instruction.
See also
- Project 05 → the datapath this CPU is built on.
- Project README