RV32I-min educational CPU · librelane-playground

A minimal RV32I implementation. Multi-cycle 5-stage FSM (IF → ID → EX → MEM → WB), 32-bit datapath, 32 × 32-bit register file (x0 hardwired to zero), 256-word (1 KB) instruction ROM, 256-byte data RAM. The first project on the ladder where the ISA isn’t ours-by-convenience: this is real RISC-V assembly.

Status: Hardened on sky130A, on a 600 × 600 µm die at 40 MHz. 17,277 non-filler cells (regfile + dmem + ALU + decoder + 5-stage FSM), 3,262 flops, 2.94 ns of setup slack, zero DRC/LVS/antenna violations. Three RV32I programs pass under iverilog: arithmetic ops (LUI/ADDI/ADD/SUB/AND/OR/XOR/SLLI/SRLI/SRAI/SLT), a countdown loop with BNE, and Fibonacci(10) using ADD/BLT/JAL.

annotations

M
Data RAM (dmem). 64 × 32-bit words = 2,048 flops in a tight cluster on the left half of the chip. P09's dmem is flop-based (no SRAM macro), so every bit is a real DFF. The wide rectangular footprint is the placer fitting 2k flops into one big columnar block; routing channels above carry the load/store data path back to the ALU. met1 met2
R
32 × 32-bit register file. 1,024 flops. RV32I keeps all 32 registers (vs P12/P13's 16); that's twice the regfile area of an RV32E core. The placer parked it on the right half, mirroring dmem on the left, with the ALU and operand staging registers in the column between them. met1 met2
D
CPU datapath: op_a / op_b operand registers (32 + 32 flops) and the alu_result_q ALU staging register (32 flops). This narrow vertical column is where one cycle of FETCH→DECODE→EXECUTE→MEM→WB collapses into actual silicon. ~100 flops handle every RV32I instruction the chip executes. met1 met2 met3
F
The FSM state register. The RTL declares it as a 3-bit enum with seven states: the five stages {FETCH, DECODE, EXECUTE, MEM, WB} plus ILLEGAL and HALT. The smallest functional block on the chip and the one that turns the regfile/dmem/ALU into a CPU rather than a static datapath. met1 met2

Honesty rules

Per the project conventions in CLAUDE.md, here’s exactly what this core supports and what it doesn’t:

Supported:

class	instructions
R-type	ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU
I-type	ADDI, ANDI, ORI, XORI, SLLI, SRLI, SRAI, SLTI, SLTIU
Upper imm	LUI, AUIPC
Branch	BEQ, BNE, BLT, BGE, BLTU, BGEU
Jump	JAL, JALR
Load	LW only (no LB/LH/LBU/LHU)
Store	SW only (no SB/SH)
FENCE	decoded as NOP — legal for in-order single-issue cores

Not supported:

Sub-word memory access
ECALL / EBREAK / FENCE.I
CSR ops (no CSR file)
MRET / SRET / WFI (no privileged modes)
Any M / A / F / D / C / Zicsr / Zifencei extensions
Misaligned memory access

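Compliance tests have not been run. This core implements the RV32I instruction shape: programs that stay within the supported subset above run correctly in simulation. An unsupported opcode (ECALL, EBREAK, CSR ops, and so on) or a sub-word load/store sends the FSM to S_ILLEGAL and then HALT. The legality check looks only at the opcode, plus funct3 for loads and stores, so a few unsupported encodings are not trapped: FENCE.I shares FENCE’s opcode and is treated as a NOP, M-extension instructions share the R-type opcode and are decoded as base ALU ops, and the low two address bits of LW/SW are dropped rather than flagged as misaligned.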
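
As a rough cross-check of which encodings actually trap, here is a small Python sketch that mirrors the `is_legal` and `mem_unsupported` predicates from the RTL listing further down. The function name `traps` and the sample encodings are ours, chosen for illustration; this is a model of the decoder's legality check, not part of the project's test suite.

```python
# Opcodes accepted by the RTL's is_legal predicate.
OP_LUI, OP_AUIPC, OP_JAL, OP_JALR = 0b0110111, 0b0010111, 0b1101111, 0b1100111
OP_BRANCH, OP_LOAD, OP_STORE = 0b1100011, 0b0000011, 0b0100011
OP_OPIMM, OP_OP, OP_FENCE = 0b0010011, 0b0110011, 0b0001111
LEGAL_OPCODES = {OP_LUI, OP_AUIPC, OP_JAL, OP_JALR, OP_BRANCH,
                 OP_LOAD, OP_STORE, OP_OPIMM, OP_OP, OP_FENCE}


def traps(ir):
    """Return True if the decoder would send this word to S_ILLEGAL."""
    opcode = ir & 0x7F
    funct3 = (ir >> 12) & 0x7
    if opcode not in LEGAL_OPCODES:
        return True                      # !is_legal
    if opcode in (OP_LOAD, OP_STORE) and funct3 != 0b010:
        return True                      # mem_unsupported (anything but LW/SW)
    return False


samples = {
    "lw x1, 0(x0)":    0x00002083,
    "lb x1, 0(x0)":    0x00000083,
    "ecall":           0x00000073,
    "all-zero word":   0x00000000,
    "fence.i":         0x0000100F,
    "mul x1, x2, x3":  0x023100B3,
}
for name, word in samples.items():
    print(f"{name:16s} {'S_ILLEGAL' if traps(word) else 'executes'}")
```

The all-zero padding words in PROG are themselves illegal (opcode 0), so running off the end of a program halts the core. The last two samples are the ones that slip through the opcode-only check.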

What’s new vs. P08

32-bit datapath. All previous projects used 8-bit registers and operands; now everything is 32 bits. Sign-extension of immediates and halfword offsets becomes load-bearing.
A real ISA. The instruction encoding is RISC-V’s. Decoding has to fish opcode / funct3 / funct7 / imm fields out of fixed bit positions per type.
Five FSM stages. MEM is its own stage so LW data has time to settle. Pure-ALU and branch instructions still walk through MEM but it’s a no-op for them.
Branch decode. Six branch ops with two operand-comparison classes (signed for BLT/BGE, unsigned for BLTU/BGEU).
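
To make the "fixed bit positions per type" point concrete, here is a Python sketch of the same field extraction and immediate assembly the RTL does with part-selects and replication. The helper names (`bits`, `sign_extend`, `decode_fields`, `imm_b`, ...) are ours for illustration; the encodings checked at the bottom come from the default Fibonacci program in top.sv.

```python
def bits(word, hi, lo):
    """Return word[hi:lo], like a Verilog part-select."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << (width - 1)) else value


def decode_fields(ir):
    return {
        "opcode": bits(ir, 6, 0),
        "rd":     bits(ir, 11, 7),
        "funct3": bits(ir, 14, 12),
        "rs1":    bits(ir, 19, 15),
        "rs2":    bits(ir, 24, 20),
        "funct7": bits(ir, 31, 25),
    }


def imm_i(ir):
    return sign_extend(bits(ir, 31, 20), 12)


def imm_s(ir):
    return sign_extend((bits(ir, 31, 25) << 5) | bits(ir, 11, 7), 12)


def imm_b(ir):
    raw = ((bits(ir, 31, 31) << 12) | (bits(ir, 7, 7) << 11)
           | (bits(ir, 30, 25) << 5) | (bits(ir, 11, 8) << 1))
    return sign_extend(raw, 13)          # bit 0 is always zero


def imm_u(ir):
    return sign_extend(ir & 0xFFFFF000, 32)


def imm_j(ir):
    raw = ((bits(ir, 31, 31) << 20) | (bits(ir, 19, 12) << 12)
           | (bits(ir, 20, 20) << 11) | (bits(ir, 30, 21) << 1))
    return sign_extend(raw, 21)


blt = decode_fields(0xFE3248E3)          # blt x4, x3, -16
print(blt["rs1"], blt["rs2"], imm_b(0xFE3248E3))   # 4 3 -16
print(imm_i(0x00A00193))                 # addi x3, x0, 10 -> 10
```

Note that the sign bit is always `ir[31]` regardless of format, which is what lets the hardware share one sign-extension source across all five immediate types.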

The FSM

Five stages; every instruction walks the same path. ILLEGAL is the catch-all for unsupported encodings and falls through to HALT; HALT is also reached by the conventional jump-to-self idiom (jal x0, 0).
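
The transition table is small enough to model in a few lines. The Python sketch below follows the next-state `case` in the RTL listing; the names `S`, `next_state`, and `trace` and the keyword flags are ours, standing in for the RTL's `is_legal`, `mem_unsupported`, and the `is_jal && imm == 0` test.

```python
from enum import Enum


class S(Enum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2
    MEM = 3
    WB = 4
    ILLEGAL = 5
    HALT = 6


def next_state(state, *, legal=True, mem_unsupported=False, jal_to_self=False):
    if state is S.FETCH:
        return S.DECODE
    if state is S.DECODE:
        return S.ILLEGAL if (not legal or mem_unsupported) else S.EXECUTE
    if state is S.EXECUTE:
        return S.MEM
    if state is S.MEM:
        return S.WB
    if state is S.WB:
        return S.HALT if jal_to_self else S.FETCH
    return S.HALT                        # ILLEGAL and HALT both end in HALT


def trace(**flags):
    """States visited from FETCH until the FSM is back at FETCH or halted."""
    state, seen = S.FETCH, [S.FETCH]
    while True:
        state = next_state(state, **flags)
        seen.append(state)
        if state in (S.FETCH, S.HALT):
            return seen


print([s.name for s in trace()])                  # one normal instruction
print([s.name for s in trace(legal=False)])       # unsupported opcode
print([s.name for s in trace(jal_to_self=True)])  # jal x0, 0
```

A normal instruction takes five transitions to get back to FETCH, which is where the flat five-cycles-per-instruction cost of this core comes from.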

RTL

projects/09_rv32i_min/src/top.sv system-verilog

// Project 09: RV32I-min educational core.
//
// A minimal RV32I implementation. Multi-cycle FSM (same shape as
// projects 06/07/08), 5 stages: IF, ID, EX, MEM, WB. 32-bit datapath,
// 32 × 32-bit register file (x0 hardwired to zero), 256-word (1 KB)
// instruction ROM (parameterized at instantiation), 256-byte data
// RAM (flop-based; P10 may revisit with a macro).
//
// HONESTY RULES (per CLAUDE.md). What this core *does* support:
//
//   R-type (10):  ADD  SUB  AND  OR   XOR  SLL  SRL  SRA  SLT  SLTU
//   I-type (9):   ADDI ANDI ORI  XORI SLLI SRLI SRAI SLTI SLTIU
//   Upper imm:    LUI  AUIPC
//   Branch (6):   BEQ  BNE  BLT  BGE  BLTU BGEU
//   Jump:         JAL  JALR
//   Load:         LW                  (only — no LB/LH/LBU/LHU)
//   Store:        SW                  (only — no SB/SH)
//   FENCE:        decoded as NOP      (legal: FENCE is allowed to be a NOP
//                                       on a single-issue in-order core)
//
// What this core does NOT support:
//
//   LB / LH / LBU / LHU / SB / SH    (sub-word memory access)
//   ECALL / EBREAK                    (no traps; treated as illegal)
//   CSRRW / CSRRS / CSRRC / immediate forms  (no CSR file)
//   MRET / SRET / WFI                 (no privileged modes)
//   Any of the M/A/F/D/C extensions   (we are int-base-only)
//
// Compliance tests have NOT been run. This core targets the
// RV32I instruction shape, not the full RISC-V specification.
// Programs that stay within the supported subset above run correctly
// in simulation; anything that hits an unsupported encoding lands in
// the FSM's `S_ILLEGAL` state which halts the chip.
//
// What's new vs. P06/07/08:
//
//   - **32-bit datapath.** All previous projects used 8-bit registers
//     and operands; now everything is 32 bits. Sign-extension of
//     immediates and halfword address offsets becomes load-bearing.
//   - **A real ISA.** The instruction encoding is RISC-V's, not
//     ours-by-convenience. Decoding has to fish opcode / funct3 /
//     funct7 / imm fields out of fixed bit positions per type.
//   - **Five FSM stages instead of four.** MEM is its own stage so
//     LW data has time to settle from the bus the same cycle that
//     ST gets its write-enable pulse.
//   - **Branch decode.** Six branch ops with two operand-comparison
//     classes (signed for BLT/BGE, unsigned for BLTU/BGEU). A
//     combinational comparator selects the condition by funct3, and
//     the WB stage uses it to pick the next PC.
//
// Programs are parameterized into the design via PROG (256 × 32-bit
// instructions). The default boot program computes Fibonacci(10) and
// stores the result at data-memory address 0; the testbench reads it
// out of the regfile after the chip halts via a self-loop on the
// `halt_addr`.

`default_nettype none

module top #(
    // Boot program: 256 instructions × 32 bits = 1024 bytes of ROM.
    //
    // The default is a real RV32I Fibonacci(10) program — earlier we
    // had `jal x0, 0` here, which is a one-instruction infinite loop
    // that never writes a register. Yosys correctly proves that the
    // entire regfile and dmem stay at zero forever and synthesises
    // them away as dead code (the silicon shrinks from 5500 logic
    // cells to 210). Baking a non-trivial default program forces the
    // synth-time view of the design to actually exercise the ALU,
    // regfile, branch comparator, and store path, so the hardened
    // chip retains the structures we wanted to study. Testbenches
    // still override PROG by named-parameter override.
    parameter logic [256*32-1:0] PROG = {
        {243{32'h0000_0000}},
        32'h0000006f,  // 12: JAL  x0, 0          ; halt (jump-to-self)
        32'h00a02023,  // 11: SW   x10, 0(x0)     ; dmem[0] = x10
        32'h00008513,  // 10: ADDI x10, x1, 0     ; x10 = a (the Fib result)
        32'hfe3248e3,  //  9: BLT  x4, x3, -16    ; if i < n: loop back
        32'h00520233,  //  8: ADD  x4, x4, x5     ; i++
        32'h00030113,  //  7: ADDI x2, x6, 0      ; b = t
        32'h00010093,  //  6: ADDI x1, x2, 0      ; a = b
        32'h00208333,  //  5: ADD  x6, x1, x2     ; t = a + b
        32'h00100293,  //  4: ADDI x5, x0, 1      ; x5 = 1 (loop step)
        32'h00000213,  //  3: ADDI x4, x0, 0      ; i = 0
        32'h00a00193,  //  2: ADDI x3, x0, 10     ; n = 10
        32'h00100113,  //  1: ADDI x2, x0, 1      ; b = 1
        32'h00000093   //  0: ADDI x1, x0, 0      ; a = 0
    }
) (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        start,           // currently unused (1 = run)
    input  logic [4:0]  dbg_reg_sel,     // chip-pin: which regfile entry to expose

    // Debug
    output logic [31:0] pc_out,
    output logic [31:0] dbg_reg_out,     // selected regfile entry (held combinationally)
    output logic [31:0] dmem_out,        // dmem[0] — observable view of memory
    output logic        halted
);

  // ====================================================================
  // Stage 0: Program counter + instruction ROM read.
  // ====================================================================
  // PC is byte-addressed (RISC-V convention) but instructions are
  // word-aligned, so we shift by 2 to index PROG (which is packed by
  // word: ROM[0] is bits 31:0, ROM[1] is 63:32, etc. — selected by the
  // `+:` indexed-part-select).
  logic [31:0] pc;
  logic [31:0] ir;          // current instruction (loaded from PROG in IF)

  // ====================================================================
  // Decode (combinational from IR).
  // ====================================================================
  wire [6:0]  opcode = ir[ 6: 0];
  wire [4:0]  rd     = ir[11: 7];
  wire [2:0]  funct3 = ir[14:12];
  wire [4:0]  rs1    = ir[19:15];
  wire [4:0]  rs2    = ir[24:20];
  wire [6:0]  funct7 = ir[31:25];

  // RISC-V opcode literal categories.
  localparam logic [6:0] OP_LUI    = 7'b0110111;
  localparam logic [6:0] OP_AUIPC  = 7'b0010111;
  localparam logic [6:0] OP_JAL    = 7'b1101111;
  localparam logic [6:0] OP_JALR   = 7'b1100111;
  localparam logic [6:0] OP_BRANCH = 7'b1100011;
  localparam logic [6:0] OP_LOAD   = 7'b0000011;
  localparam logic [6:0] OP_STORE  = 7'b0100011;
  localparam logic [6:0] OP_OPIMM  = 7'b0010011;
  localparam logic [6:0] OP_OP     = 7'b0110011;
  localparam logic [6:0] OP_FENCE  = 7'b0001111;

  // Per-instruction predicates.
  wire is_lui    = (opcode == OP_LUI);
  wire is_auipc  = (opcode == OP_AUIPC);
  wire is_jal    = (opcode == OP_JAL);
  wire is_jalr   = (opcode == OP_JALR);
  wire is_branch = (opcode == OP_BRANCH);
  wire is_load   = (opcode == OP_LOAD);
  wire is_store  = (opcode == OP_STORE);
  wire is_opimm  = (opcode == OP_OPIMM);
  wire is_op     = (opcode == OP_OP);
  wire is_fence  = (opcode == OP_FENCE);
  wire is_legal  = is_lui | is_auipc | is_jal | is_jalr | is_branch
                  | is_load | is_store | is_opimm | is_op | is_fence;

  // Sub-word load/store (LB/LH/LBU/LHU/SB/SH) we don't support — flag
  // those as illegal so we halt rather than silently misbehave. funct3:
  //   LW: 010    LB: 000  LH: 001  LBU: 100  LHU: 101
  //   SW: 010    SB: 000  SH: 001
  wire is_lw_only = is_load  && (funct3 == 3'b010);
  wire is_sw_only = is_store && (funct3 == 3'b010);
  wire mem_unsupported = (is_load && !is_lw_only) || (is_store && !is_sw_only);

  // ----- Immediate decode -----
  // Each instruction format puts the immediate in different bits and
  // sign-extends it differently. RISC-V is meticulous about sign-bit
  // placement to share decoder hardware.
  wire [31:0] imm_i = {{20{ir[31]}}, ir[31:20]};
  wire [31:0] imm_s = {{20{ir[31]}}, ir[31:25], ir[11:7]};
  wire [31:0] imm_b = {{19{ir[31]}}, ir[31], ir[7], ir[30:25], ir[11:8], 1'b0};
  wire [31:0] imm_u = {ir[31:12], 12'h000};
  wire [31:0] imm_j = {{11{ir[31]}}, ir[31], ir[19:12], ir[20], ir[30:21], 1'b0};

  // Pick the right immediate per opcode.
  logic [31:0] imm;
  always_comb begin
    unique case (opcode)
      OP_OPIMM, OP_LOAD, OP_JALR: imm = imm_i;
      OP_STORE:                    imm = imm_s;
      OP_BRANCH:                   imm = imm_b;
      OP_LUI, OP_AUIPC:            imm = imm_u;
      OP_JAL:                      imm = imm_j;
      default:                     imm = 32'h0000_0000;
    endcase
  end

  // ====================================================================
  // Register file. 32 × 32-bit. x0 reads as zero, writes are dropped.
  // Async read, sync write — same shape as project 05's regfile.
  // ====================================================================
  logic [31:0] regs [0:31];

  // Yosys SV frontend doesn't accept `return X;` inside a function — assign
  // to the function name instead. Same semantics, plain Verilog form.
  function automatic logic [31:0] reg_read(input [4:0] sel);
    if (sel == 5'd0) reg_read = 32'h0000_0000;
    else             reg_read = regs[sel];
  endfunction

  // ----- Operand registers (latched in DECODE) -----
  logic [31:0] op_a;        // regs[rs1]
  logic [31:0] op_b;        // regs[rs2] or imm

  // For ALU ops the second operand is reg or imm depending on opcode;
  // for STORE the data-to-write is regs[rs2] and the immediate is the
  // address offset; for BRANCH both come from regs[rs1] and regs[rs2]
  // and the immediate is the branch target.
  // We resolve "what op_a / op_b mean" structurally in the always_ff
  // for clarity.

  // ====================================================================
  // ALU. Combinational.
  // ====================================================================
  // Picks between op_a + imm style and op_a (op) op_b style.
  // The selector logic is in the EXECUTE stage of the always_ff.
  logic [31:0] alu_a, alu_b;
  logic [3:0]  alu_op;

  // ALU op encoding (internal, doesn't match RISC-V funct3/funct7
  // because the compression there is non-orthogonal; we re-encode for
  // simplicity).
  localparam logic [3:0] ALU_ADD  = 4'b0000;
  localparam logic [3:0] ALU_SUB  = 4'b0001;
  localparam logic [3:0] ALU_AND  = 4'b0010;
  localparam logic [3:0] ALU_OR   = 4'b0011;
  localparam logic [3:0] ALU_XOR  = 4'b0100;
  localparam logic [3:0] ALU_SLL  = 4'b0101;
  localparam logic [3:0] ALU_SRL  = 4'b0110;
  localparam logic [3:0] ALU_SRA  = 4'b0111;
  localparam logic [3:0] ALU_SLT  = 4'b1000;        // signed
  localparam logic [3:0] ALU_SLTU = 4'b1001;        // unsigned
  // 1010..1111 reserved.

  logic [31:0] alu_y;
  always_comb begin
    unique case (alu_op)
      ALU_ADD:  alu_y = alu_a + alu_b;
      ALU_SUB:  alu_y = alu_a - alu_b;
      ALU_AND:  alu_y = alu_a & alu_b;
      ALU_OR:   alu_y = alu_a | alu_b;
      ALU_XOR:  alu_y = alu_a ^ alu_b;
      ALU_SLL:  alu_y = alu_a << alu_b[4:0];
      ALU_SRL:  alu_y = alu_a >> alu_b[4:0];
      ALU_SRA:  alu_y = $signed(alu_a) >>> alu_b[4:0];
      ALU_SLT:  alu_y = ($signed(alu_a) < $signed(alu_b)) ? 32'h1 : 32'h0;
      ALU_SLTU: alu_y = (alu_a < alu_b)                   ? 32'h1 : 32'h0;
      default:  alu_y = 32'h0;
    endcase
  end

  // Decode opcode/funct3/funct7 → alu_op for OP and OPIMM. For LOAD,
  // STORE, JAL, JALR, AUIPC the ALU adds an immediate; LUI passes the
  // immediate through. BRANCH does not use the ALU result: the branch
  // comparator below evaluates the condition and WB applies it via
  // next_pc.
  logic [3:0] decoded_alu_op;
  always_comb begin
    decoded_alu_op = ALU_ADD;
    if (is_op || is_opimm) begin
      unique case (funct3)
        3'b000: decoded_alu_op = (is_op && funct7[5]) ? ALU_SUB : ALU_ADD;
        3'b001: decoded_alu_op = ALU_SLL;
        3'b010: decoded_alu_op = ALU_SLT;
        3'b011: decoded_alu_op = ALU_SLTU;
        3'b100: decoded_alu_op = ALU_XOR;
        3'b101: decoded_alu_op = funct7[5] ? ALU_SRA : ALU_SRL;
        3'b110: decoded_alu_op = ALU_OR;
        3'b111: decoded_alu_op = ALU_AND;
        default: decoded_alu_op = ALU_ADD;
      endcase
    end
  end

  // ====================================================================
  // Branch comparator. Combinational.
  // ====================================================================
  // Computes the branch condition based on funct3 from the latched
  // op_a (= rs1), op_b (= rs2). Used in WB to decide whether to take
  // the branch.
  logic branch_taken_comb;
  always_comb begin
    branch_taken_comb = 1'b0;
    if (is_branch) begin
      unique case (funct3)
        3'b000: branch_taken_comb = (op_a == op_b);                       // BEQ
        3'b001: branch_taken_comb = (op_a != op_b);                       // BNE
        3'b100: branch_taken_comb = ($signed(op_a) <  $signed(op_b));     // BLT
        3'b101: branch_taken_comb = ($signed(op_a) >= $signed(op_b));     // BGE
        3'b110: branch_taken_comb = (op_a <  op_b);                       // BLTU
        3'b111: branch_taken_comb = (op_a >= op_b);                       // BGEU
        default: branch_taken_comb = 1'b0;
      endcase
    end
  end

  // ====================================================================
  // FSM state typedef — declared early so the dmem block below can
  // reference S_EXECUTE.
  // ====================================================================
  typedef enum logic [2:0] {
    S_FETCH    = 3'd0,
    S_DECODE   = 3'd1,
    S_EXECUTE  = 3'd2,
    S_MEM      = 3'd3,
    S_WB       = 3'd4,
    S_ILLEGAL  = 3'd5,
    S_HALT     = 3'd6
  } state_t;
  state_t state, next_state;

  // ====================================================================
  // Data memory — 256 bytes of flop RAM, word-addressed (4-byte stride).
  // ====================================================================
  logic [31:0] dmem [0:63];                  // 64 words = 256 bytes
  logic [31:0] dmem_rdata;
  logic [5:0]  dmem_addr;       // word index = addr[7:2]

  always_ff @(posedge clk or negedge rst_n) begin
    integer di;
    if (!rst_n) begin
      for (di = 0; di < 64; di = di + 1) dmem[di] <= 32'h0;
      dmem_rdata <= 32'h0;
    end else begin
      // Capture rdata on the same edge that we'd commit a write so LW
      // gets fresh data the cycle after EXECUTE.
      dmem_rdata <= dmem[dmem_addr];
      if (state == S_EXECUTE && is_store && !mem_unsupported) begin
        dmem[dmem_addr] <= op_b;        // op_b holds rs2
      end
    end
  end

  // ====================================================================
  // FSM next-state logic.
  // ====================================================================
  // IF  — fetch IR from PROG[pc].
  // ID  — decode; latch op_a, op_b, dmem_addr.
  // EX  — ALU compute; for LOAD/STORE the ALU computes the effective
  //       address (op_a + imm). STORE issues its write here.
  // MEM — wait one cycle so dmem_rdata can settle for LOAD.
  // WB  — write rd, advance PC.

  // Halt detection: we declare "halted" on a JAL x0, 0 (encoded as
  // 0x0000006F), i.e. an unconditional jump-to-self. This is the
  // conventional RISC-V "stuck loop = halted" marker. We also halt
  // on illegal instructions.
  // A JAL targets itself exactly when pc + imm_j == pc, i.e. imm_j == 0,
  // so matching the encoding is enough; no PC comparison is needed.
  // (The S_WB transition below tests `is_jal && imm == 0` directly.)
  wire is_halt_loop = (ir == 32'h0000_006F);

  // PC update logic — combinational from current PC, latched in WB.
  logic [31:0] next_pc;
  always_comb begin
    next_pc = pc + 32'd4;             // default: sequential
    if (is_jal)                       next_pc = pc + imm;
    else if (is_jalr)                 next_pc = (op_a + imm) & ~32'h1;
    else if (is_branch && branch_taken_comb)
                                      next_pc = pc + imm;
  end

  always_comb begin
    next_state = state;
    unique case (state)
      S_FETCH:   next_state = S_DECODE;
      S_DECODE:  if (!is_legal || mem_unsupported) next_state = S_ILLEGAL;
                 else                              next_state = S_EXECUTE;
      S_EXECUTE: next_state = S_MEM;
      S_MEM:     next_state = S_WB;
      S_WB:      if (is_jal && (imm == 32'd0)) next_state = S_HALT;
                 else                          next_state = S_FETCH;
      S_ILLEGAL: next_state = S_HALT;
      S_HALT:    next_state = S_HALT;
      default:   next_state = S_FETCH;
    endcase
  end

  // ====================================================================
  // ALU operand muxing (combinational from latched op_a / op_b / imm).
  // ====================================================================
  always_comb begin
    alu_a = op_a;
    alu_b = op_b;
    alu_op = decoded_alu_op;

    if (is_lui) begin
      alu_a = 32'h0;
      alu_b = imm;
      alu_op = ALU_ADD;             // result = imm
    end else if (is_auipc) begin
      alu_a = pc;
      alu_b = imm;
      alu_op = ALU_ADD;
    end else if (is_jal || is_jalr) begin
      alu_a = pc;
      alu_b = 32'd4;
      alu_op = ALU_ADD;             // rd = pc + 4
    end else if (is_load || is_store) begin
      alu_a = op_a;
      alu_b = imm;
      alu_op = ALU_ADD;             // address = rs1 + imm
    end else if (is_opimm) begin
      alu_b = imm;
    end
    // OP / BRANCH leave alu_a/alu_b as op_a/op_b.
  end

  // ====================================================================
  // Sequential state.
  // ====================================================================
  logic [31:0] alu_result_q;        // latched ALU result for WB

  always_ff @(posedge clk or negedge rst_n) begin
    integer ri;
    if (!rst_n) begin
      state        <= S_FETCH;
      pc           <= 32'h0;
      ir           <= 32'h0;
      op_a         <= 32'h0;
      op_b         <= 32'h0;
      alu_result_q <= 32'h0;
      dmem_addr    <= 6'd0;
      for (ri = 0; ri < 32; ri = ri + 1) regs[ri] <= 32'h0;
    end else begin
      state <= next_state;

      unique case (state)
        S_FETCH: begin
          ir <= PROG[32 * pc[9:2] +: 32];
        end

        S_DECODE: begin
          op_a <= reg_read(rs1);
          op_b <= reg_read(rs2);
          // For LW/SW the address is rs1 + imm; capture the word index
          // here using the just-read rs1 so dmem_rdata settles by MEM.
          dmem_addr <= (reg_read(rs1) + imm) >> 2;
        end

        S_EXECUTE: begin
          // STORE write happens in the dmem always_ff above (see
          // is_store branch). For everything else, latch the ALU
          // output into alu_result_q.
          alu_result_q <= alu_y;
        end

        S_MEM: begin
          // LW data is now in dmem_rdata (latched at this edge from
          // the previous EXECUTE cycle's dmem_addr).
          if (is_load) alu_result_q <= dmem_rdata;
        end

        S_WB: begin
          // Write rd if the instruction has one (everything except
          // STORE / BRANCH / FENCE).
          if (rd != 5'd0
              && !is_store && !is_branch && !is_fence) begin
            regs[rd] <= alu_result_q;
          end
          pc <= next_pc;
        end

        S_ILLEGAL: ;
        S_HALT:    ;
        default:   ;
      endcase
    end
  end

  // ====================================================================
  // Outputs.
  // ====================================================================
  assign pc_out = pc;
  assign halted = (state == S_HALT);

  // The regfile and dmem need observable taps to keep yosys from
  // optimising them away during synthesis. dbg_reg_out exposes any one
  // regfile entry chosen by the chip-pin selector; dmem_out exposes the
  // first word of data memory. Without these the synth result has no
  // path from the CPU's working state to a primary output and yosys
  // strips out the regfile (1024 flops) and the dmem (2048 flops) as
  // unreachable. Tying them through to chip pins forces the placer to
  // realise them and the dataflow to stay in the netlist.
  assign dbg_reg_out = regs[dbg_reg_sel];
  assign dmem_out    = dmem[0];

  // start input reserved for future use (single-step / run gate).
  wire _unused = &{1'b0, start};

endmodule

`default_nettype wire

Test programs

The verifying TB ships three programs as functions that build the PROG bit-vector. Each uses RV32I-format encoders so the assembly intent reads through clearly:

projects/09_rv32i_min/test/tb.sv system-verilog · L180-260

    int i;
    for (i = 0; i < 256; i++) prog[i] = 32'h0;
    prog[ 0] = ADDI(5'd1, 5'd0, 12'd7);          // x1 = 7
    prog[ 1] = ADDI(5'd2, 5'd0, 12'd5);          // x2 = 5
    prog[ 2] = ADD (5'd3, 5'd1, 5'd2);
    prog[ 3] = SUB (5'd4, 5'd1, 5'd2);
    prog[ 4] = AND_(5'd5, 5'd1, 5'd2);
    prog[ 5] = OR_ (5'd6, 5'd1, 5'd2);
    prog[ 6] = XOR_(5'd7, 5'd1, 5'd2);
    prog[ 7] = SLLI(5'd8, 5'd1, 5'd1);           // x8 = x1 << 1 = 14
    prog[ 8] = SRLI(5'd9, 5'd1, 5'd1);           // x9 = x1 >> 1 = 3
    prog[ 9] = SLT_(5'd10, 5'd1, 5'd2);          // x10 = (x1 < x2) = 0
    prog[10] = ADDI(5'd11, 5'd0, -12'sd3);       // x11 = -3 (sign-extended)
    prog[11] = SRAI(5'd12, 5'd11, 5'd1);         // x12 = -3 >>> 1 = -2 = 0xFFFF_FFFE
    prog[12] = HLT();
    // Pack into the bit-vector. PROG[32*i +: 32] = prog[i].
    build_prog_arith = '0;
    for (i = 0; i < 256; i++) begin
      build_prog_arith[32*i +: 32] = prog[i];
    end
  endfunction

  logic [31:0] dut1_pc; logic dut1_halted;
  logic [31:0] dut1_reg, dut1_dmem;
  top #(.PROG(PROG_ARITH)) dut1 (
    .clk(clk), .rst_n(rst1_n), .start(start),
    .dbg_reg_sel(5'd0),
    .pc_out(dut1_pc), .halted(dut1_halted),
    .dbg_reg_out(dut1_reg), .dmem_out(dut1_dmem)
  );

  // ========== Program 2: branch ==========
  // Decrement x1 from 5 down to 0, counting iterations in x2.
  //   x1 = 5; x2 = 0; x3 = 1; x4 = 0;
  // loop:
  //   x1 = x1 - x3;           # decrement
  //   x2 = x2 + x3;           # tally
  //   bne x1, x4, loop        # if x1 != 0, branch back
  //   halt
  // After: x1=0, x2=5.
  localparam logic [256*32-1:0] PROG_BRANCH = build_prog_branch();

  function automatic logic [256*32-1:0] build_prog_branch;
    logic [31:0] prog [0:255];
    int i;
    for (i = 0; i < 256; i++) prog[i] = 32'h0;
    prog[ 0] = ADDI(5'd1, 5'd0, 12'd5);          // x1 = 5
    prog[ 1] = ADDI(5'd2, 5'd0, 12'd0);          // x2 = 0
    prog[ 2] = ADDI(5'd3, 5'd0, 12'd1);          // x3 = 1
    prog[ 3] = ADDI(5'd4, 5'd0, 12'd0);          // x4 = 0
    // loop label = address 16 (4 instructions × 4 bytes)
    prog[ 4] = SUB (5'd1, 5'd1, 5'd3);           // x1 -= 1
    prog[ 5] = ADD (5'd2, 5'd2, 5'd3);           // x2 += 1
    prog[ 6] = BNE (5'd1, 5'd4, -13'sd8);        // pc -= 8 = back to addr 16
    prog[ 7] = HLT();
    build_prog_branch = '0;
    for (i = 0; i < 256; i++) build_prog_branch[32*i +: 32] = prog[i];
  endfunction

  logic [31:0] dut2_pc; logic dut2_halted;
  logic [31:0] dut2_reg, dut2_dmem;
  top #(.PROG(PROG_BRANCH)) dut2 (
    .clk(clk), .rst_n(rst2_n), .start(start),
    .dbg_reg_sel(5'd0),
    .pc_out(dut2_pc), .halted(dut2_halted),
    .dbg_reg_out(dut2_reg), .dmem_out(dut2_dmem)
  );

  // ========== Program 3: Fibonacci(10) ==========
  // Compute fib(10) = 55 in x10.
  //   x1 = 0;       # a
  //   x2 = 1;       # b
  //   x3 = 10;      # n
  //   x4 = 0;       # i
  //   x5 = 1;       # const 1
  // loop:
  //   t = x1 + x2;
  //   x1 = x2;
  //   x2 = t;
  //   x4 = x4 + 1;
  //   blt x4, x3, loop
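
The encoder helpers the testbench calls (ADDI, ADD, BNE, HLT, ...) fall outside the excerpt above. As a sketch of what such encoders compute, here is a Python version that packs the R-, I-, and B-type fields and reproduces words from the default PROG in top.sv. The argument order follows the calls in the excerpt; the bodies and the `enc_*` names are our assumption, not the testbench's actual code.

```python
def enc_r(funct7, rs2, rs1, funct3, rd, opcode=0b0110011):
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def enc_i(imm, rs1, funct3, rd, opcode=0b0010011):
    # The 12-bit immediate is stored as-is; masking handles negative values.
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def enc_b(imm, rs2, rs1, funct3):
    # B-type scatters a 13-bit offset: imm[12|10:5] up top, imm[4:1|11] below.
    imm &= 0x1FFF
    return ((((imm >> 12) & 1) << 31) | (((imm >> 5) & 0x3F) << 25)
            | (rs2 << 20) | (rs1 << 15) | (funct3 << 12)
            | (((imm >> 1) & 0xF) << 8) | (((imm >> 11) & 1) << 7) | 0b1100011)


def ADDI(rd, rs1, imm): return enc_i(imm, rs1, 0b000, rd)
def ADD(rd, rs1, rs2):  return enc_r(0b0000000, rs2, rs1, 0b000, rd)
def SUB(rd, rs1, rs2):  return enc_r(0b0100000, rs2, rs1, 0b000, rd)
def BNE(rs1, rs2, off): return enc_b(off, rs2, rs1, 0b001)
def BLT(rs1, rs2, off): return enc_b(off, rs2, rs1, 0b100)
def HLT():              return 0x0000006F          # jal x0, 0


print(f"{ADDI(3, 0, 10):08x}")    # 00a00193
print(f"{ADD(6, 1, 2):08x}")      # 00208333
print(f"{BLT(4, 3, -16):08x}")    # fe3248e3
```

Checking the encoders against known-good words like these is a cheap way to catch a swapped rs1/rs2 or a misplaced branch-offset bit before it shows up as a mysterious simulation failure.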

Demo

The demo runs Fibonacci(10) and prints each FETCH cycle with the program counter, the hex instruction, a short disassembly, and the key registers:

[cpu]  -- librelane-playground / project 09 / RV32I-min --
[cpu]  multi-cycle 5-stage FSM (IF/ID/EX/MEM/WB), 32 x 32-bit regfile
[cpu]  program: Fibonacci(10) -> x10 = 55

[cpu]  pc=0x00000000  ir=00000093  addi  x1, x0, 0     | x1=0  x2=0  x4=0  x10=0
[cpu]  pc=0x00000004  ir=00100113  addi  x2, x0, 1     | x1=0  x2=0  x4=0  x10=0
[cpu]  pc=0x00000008  ir=00a00193  addi  x3, x0, 10    | x1=0  x2=1  x4=0  x10=0
[cpu]  pc=0x00000014  ir=00208333  add   x6, x1, x2    | x1=0  x2=1  x4=0  x10=0
[cpu]  pc=0x00000018  ir=00010093  addi  x1, x2, 0     | x1=0  x2=1  x4=0  x10=0
[cpu]  pc=0x0000001c  ir=00030113  addi  x2, x6, 0     | x1=1  x2=1  x4=0  x10=0
[cpu]  pc=0x00000024  ir=fe3248e3  blt   x4, x3, -16   | x1=1  x2=1  x4=1  x10=0
... (loops back to pc=0x14, ten iterations)
[cpu]  pc=0x00000028  ir=00008513  addi  x10, x1, 0    | x1=55 x2=89 x4=10 x10=0
[cpu]  pc=0x0000002c  ir=0000006f  halt  (jal x0,0)    | x1=55 x2=89 x4=10 x10=55
[cpu]  halted at pc=0x0000002c, x10 = 55
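
For a software cross-check, the sketch below is a tiny Python interpreter that handles only the four instruction kinds the default PROG in top.sv uses (ADDI, ADD, BLT, SW) plus the `jal x0, 0` halt. It is an illustration written for this page, not the project's reference model, and it runs the top.sv listing, which includes the final SW, so the halt instruction there sits at word 12.

```python
# Default PROG from top.sv, word 0 first.
PROG = [
    0x00000093, 0x00100113, 0x00A00193, 0x00000213, 0x00100293,
    0x00208333, 0x00010093, 0x00030113, 0x00520233, 0xFE3248E3,
    0x00008513, 0x00A02023, 0x0000006F,
]
MASK = 0xFFFFFFFF


def sx(value, width):
    """Sign-extend the low `width` bits of value to a Python int."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def run(prog, max_steps=10_000):
    """Execute until `jal x0, 0`; return (regs, dmem, halt_pc, instructions)."""
    regs, dmem, pc = [0] * 32, [0] * 64, 0
    for step in range(1, max_steps + 1):
        ir = prog[pc >> 2]
        opcode, rd, funct3 = ir & 0x7F, (ir >> 7) & 31, (ir >> 12) & 7
        rs1, rs2, funct7 = (ir >> 15) & 31, (ir >> 20) & 31, ir >> 25
        next_pc, result = pc + 4, None
        if ir == 0x0000006F:                                  # jal x0, 0: halt
            return regs, dmem, pc, step
        if opcode == 0x13 and funct3 == 0:                    # ADDI
            result = regs[rs1] + sx(ir >> 20, 12)
        elif opcode == 0x33 and funct3 == 0 and funct7 == 0:  # ADD
            result = regs[rs1] + regs[rs2]
        elif opcode == 0x63 and funct3 == 4:                  # BLT (signed)
            imm = sx(((ir >> 31) << 12) | (((ir >> 7) & 1) << 11)
                     | (((ir >> 25) & 0x3F) << 5) | (((ir >> 8) & 0xF) << 1), 13)
            if sx(regs[rs1], 32) < sx(regs[rs2], 32):
                next_pc = pc + imm
        elif opcode == 0x23 and funct3 == 2:                  # SW
            imm = sx((funct7 << 5) | rd, 12)                  # rd field holds imm[4:0]
            dmem[((regs[rs1] + imm) >> 2) & 63] = regs[rs2]
        else:
            raise ValueError(f"unsupported instruction {ir:#010x} at pc={pc:#x}")
        if result is not None and rd != 0:
            regs[rd] = result & MASK
        pc = next_pc
    raise RuntimeError("program did not halt")


regs, dmem, halt_pc, count = run(PROG)
print(f"x1={regs[1]} x2={regs[2]} x4={regs[4]} x10={regs[10]} dmem[0]={dmem[0]}")
print(f"halted at pc={halt_pc:#x} after {count} instructions")
```

The run fetches 58 instructions including the final `jal`. At five FSM cycles each, that puts the hardware run at roughly 290 cycles for this program.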

Compiling C — gcc-riscv-elf into PROG[]

Hand-written assembly is fine for an educational chip but it stops being interesting after the first addi. The whole point of implementing an existing ISA (rather than ours-by-convenience) is that there’s a real toolchain that targets it. P09 plus tools/riscv-asm/ is the smallest end-to-end RISC-V toolchain flow that fits on this site.

The pipeline:

example.c                                    -- the program
  + start.S (boot stub: sp=0x100, jal main, halt)
  + p09.ld  (linker script: .text at 0x0, 1 KB hard cap)
       │
       ▼   riscv64-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib …
  build/example.elf
       │
       ▼   riscv64-elf-objcopy -O binary
  build/example.bin
       │
       ▼   uv run bin_to_prog.py
  build/example.svh   ── localparam logic [256*32-1:0] PROG_FROM_C = { … };
       │
       ▼   testbench `include-s the .svh and overrides PROG
  iverilog tb_c.sv ../src/top.sv
       │
       ▼
  the chip runs the program.
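
The real bin_to_prog.py lives in tools/riscv-asm/ and is not reproduced here. As a sketch of what that conversion step has to do, assuming a flat little-endian binary from objcopy and the packed layout PROG uses (word 0 in bits 31:0, so it comes last in the concatenation), something like the following would work. The function names and the exact output formatting are hypothetical.

```python
import struct

ROM_WORDS = 256  # PROG is 256 x 32-bit words (1 KB)


def bin_to_words(blob, rom_words=ROM_WORDS):
    """Split a flat little-endian binary into 32-bit words, zero-padded to ROM size."""
    if len(blob) > rom_words * 4:
        raise ValueError(f"program is {len(blob)} bytes; ROM holds {rom_words * 4}")
    blob += b"\x00" * (-len(blob) % 4)              # pad to a whole word
    words = list(struct.unpack(f"<{len(blob) // 4}I", blob))
    return words + [0] * (rom_words - len(words))


def words_to_svh(words, name="PROG_FROM_C"):
    """Emit a localparam whose concatenation lists the highest word first."""
    body = ",\n".join(f"    32'h{w:08x}" for w in reversed(words))
    return f"localparam logic [{len(words)}*32-1:0] {name} = {{\n{body}\n}};\n"


# Two instructions as objcopy would emit them: addi x1,x0,0 ; addi x2,x0,1
demo = bytes.fromhex("93000000" "13011000")
svh = words_to_svh(bin_to_words(demo))
print(svh.splitlines()[0])
print(svh.splitlines()[-2])       # word 0 is the last entry of the concatenation
```

Zero padding is a safe filler on this core: an all-zero word has an opcode the decoder rejects, so a program that runs past its own end halts through S_ILLEGAL instead of executing garbage.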

For examples/fib.c (a real C unsigned int fib(unsigned int n)), gcc -Os produces 11 instructions of inner loop plus a 4-instruction boot stub. The chip executes them in 301 cycles and stores 55 (= fib(10)) at dmem[0]. The strict make c-test testbench reads dmem[0] after halt and asserts it’s 55.

A few sharp edges that surfaced on the way:

gcc treats writes to address 0 as undefined behaviour and replaces them with __builtin_trap() (= ebreak), which P09 doesn’t implement. -fno-delete-null-pointer-checks disables the optimization. P09’s address 0 is dmem[0] — a real, valid memory location — and the C code legitimately stores there.
256 instructions of code is a hard ceiling. The linker script has an ASSERT(_text_size <= 0x400, …) that aborts the link with a clear message if a program overflows the ROM. There’s no “spill to RAM”; the chip’s fetch unit reads from PROG[] exclusively.
The stack shares the 256-byte dmem with everything the C code writes through pointers. The boot stub sets sp=0x100 (the top of dmem), and gcc keeps it 16-byte aligned and grows it downward, so anything you store via *(int*)X for X < 0x100 can collide with the stack. For the fib example the stored result lives at address 0 (dmem[0]), well below sp, and the function uses no stack memory of its own.

tools/riscv-asm/README.md has the full reference. Adding a new example is one C file, one make EXAMPLE=name, and one tweak to tb_c.sv’s expected value.

What’s next

Tape-out. Project 10 (Tiny Tapeout) fits one of the designs on this ladder into TT’s pin-frame harness for an actual fab submission. P09 itself is too big for a TT tile, but a stripped-down variant (or P11’s wrapped-P06 version) is the kind of thing that would go.
Bigger programs. 1 KB of ROM caps how interesting the programs can get. A future project could add an external SPI ROM and a tiny instruction-fetch state machine that loads the program on boot — same shape as how a real microcontroller boots from flash.
