No. 09 / project of 147 on the ladder

RV32I-min educational CPU

introduces — real ISA, 32-bit datapath, RISC-V instruction decode

harden state: last run 2026-04-28
cells: 17,277 non-filler
slack: 2.94 ns setup
area: 360000 (die) / 346472 (core) μm²
signoff
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

A minimal RV32I implementation. Multi-cycle 5-stage FSM (IF → ID → EX → MEM → WB), 32-bit datapath, 32 × 32-bit register file (x0 hardwired to zero), 256-word (1 KB) instruction ROM, 256-byte data RAM. The first project on the ladder where the ISA isn’t ours-by-convenience: this is real RISC-V assembly.

Status: Hardened at sky130A on a 600 × 600 µm die at 40 MHz. 17,277 non-filler cells (regfile + dmem + ALU + decoder + 5-state FSM), 3,262 flops, 2.94 ns of setup slack, zero DRC/LVS/antenna violations. Three RV32I programs pass under iverilog — arithmetic ops (LUI/ADDI/ADD/SUB/AND/OR/XOR/SLLI/SRLI/SRAI/SLT), a countdown loop with BNE, and Fibonacci(10) using ADD/BLT/JAL.
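A quick sanity check on those numbers, as a back-of-envelope Python sketch rather than anything the hardening flow reports. It assumes the slack is measured against the full 25 ns period of a 40 MHz clock, and that the regfile and dmem are counted at their nominal sizes (32 × 32 and 64 × 32 bits). The variable names are ours.

```python
# Back-of-envelope check of the status numbers quoted above.
CLOCK_MHZ = 40.0
SETUP_SLACK_NS = 2.94
TOTAL_FLOPS = 3262

period_ns = 1000.0 / CLOCK_MHZ                  # 25 ns at 40 MHz
critical_path_ns = period_ns - SETUP_SLACK_NS   # longest register-to-register path
fmax_mhz = 1000.0 / critical_path_ns            # ignores clock uncertainty and derates

regfile_flops = 32 * 32   # 32 registers x 32 bits (nominal; x0 may be optimised away)
dmem_flops = 64 * 32      # 64 words x 32 bits = 256 bytes
other_flops = TOTAL_FLOPS - regfile_flops - dmem_flops

print(f"period        = {period_ns:.2f} ns")
print(f"critical path = {critical_path_ns:.2f} ns")
print(f"rough fmax    = {fmax_mhz:.1f} MHz")
print(f"flops outside regfile + dmem = {other_flops}")
```

On those assumptions the critical path is about 22 ns, and roughly 190 flops are left over for the PC, IR, operand latches, the latched ALU result and the FSM state.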

[Interactive layout viewer: 600 × 600 µm die · sky130A · 40 MHz · 17,277 cells · 3,262 flops · met1+met2+met3 only]
[Interactive 3D viewer: sky130A metal stack only · z exaggerated 10× · 417k shapes]

Honesty rules

Per the project conventions in CLAUDE.md, here’s exactly what this core supports and what it doesn’t:

Supported:

class       instructions
R-type      ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU
I-type      ADDI, ANDI, ORI, XORI, SLLI, SRLI, SRAI, SLTI, SLTIU
Upper imm   LUI, AUIPC
Branch      BEQ, BNE, BLT, BGE, BLTU, BGEU
Jump        JAL, JALR
Load        LW only (no LB/LH/LBU/LHU)
Store       SW only (no SB/SH)
FENCE       decoded as NOP (legal for in-order single-issue cores)

Not supported:

  • Sub-word memory access
  • ECALL / EBREAK / FENCE.I
  • CSR ops (no CSR file)
  • MRET / SRET / WFI (no privileged modes)
  • Any M / A / F / D / C / Zicsr / Zifencei extensions
  • Misaligned memory access

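Compliance tests have not been run. This core implements the RV32I instruction shape: programs that stay within the supported subset above run correctly in simulation. An unsupported opcode (ECALL, EBREAK, CSR ops) or a sub-word load/store lands in the FSM’s S_ILLEGAL state and halts. The legality check in the RTL looks only at the opcode, plus funct3 for loads and stores, so encodings that reuse a supported opcode (M-extension ops under OP, FENCE.I under the FENCE opcode) are not trapped.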
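To make the boundary concrete, here is a small Python model of the legality check as it appears in the RTL's `is_legal` and `mem_unsupported` wires. It is a sketch for reasoning about encodings, not part of the project. The function name `goes_to_illegal` is ours.

```python
# Opcodes (instruction bits [6:0]) that the decoder accepts.
LEGAL_OPCODES = {
    0b0110111: "LUI",    0b0010111: "AUIPC", 0b1101111: "JAL",
    0b1100111: "JALR",   0b1100011: "BRANCH", 0b0000011: "LOAD",
    0b0100011: "STORE",  0b0010011: "OP-IMM", 0b0110011: "OP",
    0b0001111: "FENCE",
}
OP_LOAD, OP_STORE = 0b0000011, 0b0100011
FUNCT3_WORD = 0b010   # LW / SW


def goes_to_illegal(word: int) -> bool:
    """True if DECODE would send this instruction word to S_ILLEGAL."""
    opcode = word & 0x7F
    funct3 = (word >> 12) & 0x7
    if opcode not in LEGAL_OPCODES:
        return True                      # e.g. SYSTEM: ECALL, EBREAK, CSR ops
    if opcode in (OP_LOAD, OP_STORE) and funct3 != FUNCT3_WORD:
        return True                      # sub-word access: LB/LH/LBU/LHU/SB/SH
    return False


for name, word in [("ecall", 0x00000073), ("lb x1,0(x0)", 0x00000083),
                   ("lw x1,0(x0)", 0x00002083), ("mul x3,x1,x2", 0x022081B3),
                   ("zero padding", 0x00000000)]:
    print(f"{name:14s} -> {'ILLEGAL' if goes_to_illegal(word) else 'executes'}")
```

Two things fall out of the model. The all-zero padding words in PROG are illegal, so running off the end of a program halts the chip. An M-extension `mul` is not caught, because it shares the OP opcode and only funct7 distinguishes it.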

What’s new vs. P08

  • 32-bit datapath. All previous projects used 8-bit registers and operands; now everything is 32 bits. Sign-extension of immediates and halfword offsets becomes load-bearing.
  • A real ISA. The instruction encoding is RISC-V’s. Decoding has to fish opcode / funct3 / funct7 / imm fields out of fixed bit positions per type.
  • Five FSM stages. MEM is its own stage so LW data has time to settle. Pure-ALU and branch instructions still walk through MEM but it’s a no-op for them.
  • Branch decode. Six branch ops with two operand-comparison classes (signed for BLT/BGE, unsigned for BLTU/BGEU).

The FSM

[State diagram]
FETCH → DECODE
DECODE → EXECUTE (is_legal && supported), otherwise → ILLEGAL
EXECUTE → MEM → WB
WB → HALT (jal x0, 0), otherwise loop → FETCH
ILLEGAL → HALT
Five stages, every instruction walks the same shape. ILLEGAL is the catch-all for unsupported encodings; HALT is reached by the conventional jump-to-self idiom (jal x0, 0).
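The same transitions can be written as a next-state function. This is a Python restatement of the RTL's `next_state` block for experimenting with the state walk. The names `S`, `next_state` and `walk` are ours.

```python
from enum import Enum


class S(Enum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2
    MEM = 3
    WB = 4
    ILLEGAL = 5
    HALT = 6


def next_state(state: S, *, legal: bool = True, halt_loop: bool = False) -> S:
    """One FSM step. `legal` means is_legal && supported; `halt_loop` means jal with offset 0."""
    if state is S.FETCH:
        return S.DECODE
    if state is S.DECODE:
        return S.EXECUTE if legal else S.ILLEGAL
    if state is S.EXECUTE:
        return S.MEM
    if state is S.MEM:
        return S.WB
    if state is S.WB:
        return S.HALT if halt_loop else S.FETCH
    return S.HALT          # ILLEGAL falls into HALT, and HALT is sticky


def walk(**flags) -> list:
    """States visited by one instruction, from FETCH until FETCH or HALT is reached again."""
    path = [S.FETCH]
    while True:
        path.append(next_state(path[-1], **flags))
        if path[-1] in (S.FETCH, S.HALT):
            return path


print([s.name for s in walk()])
print([s.name for s in walk(legal=False)])
print([s.name for s in walk(halt_loop=True)])
```

A legal instruction always costs five clock cycles, whatever it does. An illegal one reaches HALT three cycles after its FETCH.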

RTL

projects/09_rv32i_min/src/top.sv system-verilog
// Project 09: RV32I-min educational core.
//
// A minimal RV32I implementation. Multi-cycle FSM (same shape as
// projects 06/07/08), 5 stages: IF, ID, EX, MEM, WB. 32-bit datapath,
// 32 × 32-bit register file (x0 hardwired to zero), 256-word (1 KB)
// instruction ROM (parameterized at instantiation), 256-byte data
// RAM (flop-based; P10 may revisit with a macro).
//
// HONESTY RULES (per CLAUDE.md). What this core *does* support:
//
//   R-type (10):  ADD  SUB  AND  OR   XOR  SLL  SRL  SRA  SLT  SLTU
//   I-type (9):   ADDI ANDI ORI  XORI SLLI SRLI SRAI SLTI SLTIU
//   Upper imm:    LUI  AUIPC
//   Branch (6):   BEQ  BNE  BLT  BGE  BLTU BGEU
//   Jump:         JAL  JALR
//   Load:         LW                  (only — no LB/LH/LBU/LHU)
//   Store:        SW                  (only — no SB/SH)
//   FENCE:        decoded as NOP      (legal: FENCE is allowed to be a NOP
//                                       on a single-issue in-order core)
//
// What this core does NOT support:
//
//   LB / LH / LBU / LHU / SB / SH    (sub-word memory access)
//   ECALL / EBREAK                    (no traps; treated as illegal)
//   CSRRW / CSRRS / CSRRC / immediate forms  (no CSR file)
//   MRET / SRET / WFI                 (no privileged modes)
//   Any of the M/A/F/D/C extensions   (we are int-base-only)
//
// Compliance tests have NOT been run. This core targets the
// RV32I instruction shape, not the full RISC-V specification.
// Programs that stay within the supported subset above run correctly
// in simulation; anything that hits an unsupported encoding lands in
// the FSM's `S_ILLEGAL` state which halts the chip.
//
// What's new vs. P06/07/08:
//
//   - **32-bit datapath.** All previous projects used 8-bit registers
//     and operands; now everything is 32 bits. Sign-extension of
//     immediates and halfword address offsets becomes load-bearing.
//   - **A real ISA.** The instruction encoding is RISC-V's, not
//     ours-by-convenience. Decoding has to fish opcode / funct3 /
//     funct7 / imm fields out of fixed bit positions per type.
//   - **Five FSM stages instead of four.** MEM is its own stage so
//     LW data read from dmem has a full cycle to settle. SW issues
//     its write-enable one stage earlier, in EXECUTE.
//   - **Branch decode.** Six branch ops with two operand-comparison
//     classes (signed for BLT/BGE, unsigned for BLTU/BGEU). The
//     EXECUTE stage computes both, and the WB stage picks the right
//     one.
//
// Programs are parameterized into the design via PROG (256 × 32-bit
// instructions). The default boot program computes Fibonacci(10) and
// stores the result at data-memory address 0; the testbench reads it
// out of the regfile after the chip halts via a self-loop on the
// `halt_addr`.

`default_nettype none

module top #(
    // Boot program: 256 instructions × 32 bits = 1024 bytes of ROM.
    //
    // The default is a real RV32I Fibonacci(10) program — earlier we
    // had `jal x0, 0` here, which is a one-instruction infinite loop
    // that never writes a register. Yosys correctly proves that the
    // entire regfile and dmem stay at zero forever and synthesises
    // them away as dead code (the silicon shrinks from 5500 logic
    // cells to 210). Baking a non-trivial default program forces the
    // synth-time view of the design to actually exercise the ALU,
    // regfile, branch comparator, and store path, so the hardened
    // chip retains the structures we wanted to study. Testbenches
    // still override PROG by named-parameter override.
    parameter logic [256*32-1:0] PROG = {
        {243{32'h0000_0000}},
        32'h0000006f,  // 12: JAL  x0, 0          ; halt (jump-to-self)
        32'h00a02023,  // 11: SW   x10, 0(x0)     ; dmem[0] = x10
        32'h00008513,  // 10: ADDI x10, x1, 0     ; x10 = a (the Fib result)
        32'hfe3248e3,  //  9: BLT  x4, x3, -16    ; if i < n: loop back
        32'h00520233,  //  8: ADD  x4, x4, x5     ; i++
        32'h00030113,  //  7: ADDI x2, x6, 0      ; b = t
        32'h00010093,  //  6: ADDI x1, x2, 0      ; a = b
        32'h00208333,  //  5: ADD  x6, x1, x2     ; t = a + b
        32'h00100293,  //  4: ADDI x5, x0, 1      ; x5 = 1 (loop step)
        32'h00000213,  //  3: ADDI x4, x0, 0      ; i = 0
        32'h00a00193,  //  2: ADDI x3, x0, 10     ; n = 10
        32'h00100113,  //  1: ADDI x2, x0, 1      ; b = 1
        32'h00000093   //  0: ADDI x1, x0, 0      ; a = 0
    }
) (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        start,           // currently unused (1 = run)
    input  logic [4:0]  dbg_reg_sel,     // chip-pin: which regfile entry to expose

    // Debug
    output logic [31:0] pc_out,
    output logic [31:0] dbg_reg_out,     // selected regfile entry (held combinationally)
    output logic [31:0] dmem_out,        // dmem[0] — observable view of memory
    output logic        halted
);

  // ====================================================================
  // Stage 0: Program counter + instruction ROM read.
  // ====================================================================
  // PC is byte-addressed (RISC-V convention) but instructions are
  // word-aligned, so we shift by 2 to index PROG (which is packed by
  // word: ROM[0] is bits 31:0, ROM[1] is 63:32, etc. — selected by the
  // `+:` indexed-part-select).
  logic [31:0] pc;
  logic [31:0] ir;          // current instruction (loaded from PROG in IF)

  // ====================================================================
  // Decode (combinational from IR).
  // ====================================================================
  wire [6:0]  opcode = ir[ 6: 0];
  wire [4:0]  rd     = ir[11: 7];
  wire [2:0]  funct3 = ir[14:12];
  wire [4:0]  rs1    = ir[19:15];
  wire [4:0]  rs2    = ir[24:20];
  wire [6:0]  funct7 = ir[31:25];

  // RISC-V opcode literal categories.
  localparam logic [6:0] OP_LUI    = 7'b0110111;
  localparam logic [6:0] OP_AUIPC  = 7'b0010111;
  localparam logic [6:0] OP_JAL    = 7'b1101111;
  localparam logic [6:0] OP_JALR   = 7'b1100111;
  localparam logic [6:0] OP_BRANCH = 7'b1100011;
  localparam logic [6:0] OP_LOAD   = 7'b0000011;
  localparam logic [6:0] OP_STORE  = 7'b0100011;
  localparam logic [6:0] OP_OPIMM  = 7'b0010011;
  localparam logic [6:0] OP_OP     = 7'b0110011;
  localparam logic [6:0] OP_FENCE  = 7'b0001111;

  // Per-instruction predicates.
  wire is_lui    = (opcode == OP_LUI);
  wire is_auipc  = (opcode == OP_AUIPC);
  wire is_jal    = (opcode == OP_JAL);
  wire is_jalr   = (opcode == OP_JALR);
  wire is_branch = (opcode == OP_BRANCH);
  wire is_load   = (opcode == OP_LOAD);
  wire is_store  = (opcode == OP_STORE);
  wire is_opimm  = (opcode == OP_OPIMM);
  wire is_op     = (opcode == OP_OP);
  wire is_fence  = (opcode == OP_FENCE);
  wire is_legal  = is_lui | is_auipc | is_jal | is_jalr | is_branch
                  | is_load | is_store | is_opimm | is_op | is_fence;

  // Sub-word load/store (LB/LH/LBU/LHU/SB/SH) we don't support — flag
  // those as illegal so we halt rather than silently misbehave. funct3:
  //   LW: 010    LB: 000  LH: 001  LBU: 100  LHU: 101
  //   SW: 010    SB: 000  SH: 001
  wire is_lw_only = is_load  && (funct3 == 3'b010);
  wire is_sw_only = is_store && (funct3 == 3'b010);
  wire mem_unsupported = (is_load && !is_lw_only) || (is_store && !is_sw_only);

  // ----- Immediate decode -----
  // Each instruction format puts the immediate in different bits and
  // sign-extends it differently. RISC-V is meticulous about sign-bit
  // placement to share decoder hardware.
  wire [31:0] imm_i = {{20{ir[31]}}, ir[31:20]};
  wire [31:0] imm_s = {{20{ir[31]}}, ir[31:25], ir[11:7]};
  wire [31:0] imm_b = {{19{ir[31]}}, ir[31], ir[7], ir[30:25], ir[11:8], 1'b0};
  wire [31:0] imm_u = {ir[31:12], 12'h000};
  wire [31:0] imm_j = {{11{ir[31]}}, ir[31], ir[19:12], ir[20], ir[30:21], 1'b0};

  // Pick the right immediate per opcode.
  logic [31:0] imm;
  always_comb begin
    unique case (opcode)
      OP_OPIMM, OP_LOAD, OP_JALR: imm = imm_i;
      OP_STORE:                    imm = imm_s;
      OP_BRANCH:                   imm = imm_b;
      OP_LUI, OP_AUIPC:            imm = imm_u;
      OP_JAL:                      imm = imm_j;
      default:                     imm = 32'h0000_0000;
    endcase
  end

  // ====================================================================
  // Register file. 32 × 32-bit. x0 reads as zero, writes are dropped.
  // Async read, sync write — same shape as project 05's regfile.
  // ====================================================================
  logic [31:0] regs [0:31];

  // Yosys SV frontend doesn't accept `return X;` inside a function — assign
  // to the function name instead. Same semantics, plain Verilog form.
  function automatic logic [31:0] reg_read(input [4:0] sel);
    if (sel == 5'd0) reg_read = 32'h0000_0000;
    else             reg_read = regs[sel];
  endfunction

  // ----- Operand registers (latched in DECODE) -----
  logic [31:0] op_a;        // regs[rs1]
  logic [31:0] op_b;        // regs[rs2] or imm

  // For ALU ops the second operand is reg or imm depending on opcode;
  // for STORE the data-to-write is regs[rs2] and the immediate is the
  // address offset; for BRANCH both come from regs[rs1] and regs[rs2]
  // and the immediate is the branch target.
  // We resolve "what op_a / op_b mean" structurally in the always_ff
  // for clarity.

  // ====================================================================
  // ALU. Combinational.
  // ====================================================================
  // Picks between op_a + imm style and op_a (op) op_b style.
  // The selector logic is in the EXECUTE stage of the always_ff.
  logic [31:0] alu_a, alu_b;
  logic [3:0]  alu_op;

  // ALU op encoding (internal, doesn't match RISC-V funct3/funct7
  // because the compression there is non-orthogonal; we re-encode for
  // simplicity).
  localparam logic [3:0] ALU_ADD  = 4'b0000;
  localparam logic [3:0] ALU_SUB  = 4'b0001;
  localparam logic [3:0] ALU_AND  = 4'b0010;
  localparam logic [3:0] ALU_OR   = 4'b0011;
  localparam logic [3:0] ALU_XOR  = 4'b0100;
  localparam logic [3:0] ALU_SLL  = 4'b0101;
  localparam logic [3:0] ALU_SRL  = 4'b0110;
  localparam logic [3:0] ALU_SRA  = 4'b0111;
  localparam logic [3:0] ALU_SLT  = 4'b1000;        // signed
  localparam logic [3:0] ALU_SLTU = 4'b1001;        // unsigned
  // 1010..1111 reserved.

  logic [31:0] alu_y;
  always_comb begin
    unique case (alu_op)
      ALU_ADD:  alu_y = alu_a + alu_b;
      ALU_SUB:  alu_y = alu_a - alu_b;
      ALU_AND:  alu_y = alu_a & alu_b;
      ALU_OR:   alu_y = alu_a | alu_b;
      ALU_XOR:  alu_y = alu_a ^ alu_b;
      ALU_SLL:  alu_y = alu_a << alu_b[4:0];
      ALU_SRL:  alu_y = alu_a >> alu_b[4:0];
      ALU_SRA:  alu_y = $signed(alu_a) >>> alu_b[4:0];
      ALU_SLT:  alu_y = ($signed(alu_a) < $signed(alu_b)) ? 32'h1 : 32'h0;
      ALU_SLTU: alu_y = (alu_a < alu_b)                   ? 32'h1 : 32'h0;
      default:  alu_y = 32'h0;
    endcase
  end

  // Decode opcode/funct3/funct7 → alu_op for OP and OPIMM. For LOAD,
  // STORE, JAL, JALR, AUIPC the ALU adds an immediate; LUI passes the
  // immediate through. BRANCH does not use the ALU result: the
  // dedicated comparator below decides taken / not-taken, and next_pc
  // consumes that decision when the PC is updated in WB.
  logic [3:0] decoded_alu_op;
  always_comb begin
    decoded_alu_op = ALU_ADD;
    if (is_op || is_opimm) begin
      unique case (funct3)
        3'b000: decoded_alu_op = (is_op && funct7[5]) ? ALU_SUB : ALU_ADD;
        3'b001: decoded_alu_op = ALU_SLL;
        3'b010: decoded_alu_op = ALU_SLT;
        3'b011: decoded_alu_op = ALU_SLTU;
        3'b100: decoded_alu_op = ALU_XOR;
        3'b101: decoded_alu_op = funct7[5] ? ALU_SRA : ALU_SRL;
        3'b110: decoded_alu_op = ALU_OR;
        3'b111: decoded_alu_op = ALU_AND;
        default: decoded_alu_op = ALU_ADD;
      endcase
    end
  end

  // ====================================================================
  // Branch comparator. Combinational.
  // ====================================================================
  // Computes the branch condition based on funct3 from the latched
  // op_a (= rs1), op_b (= rs2). Used in WB to decide whether to take
  // the branch.
  logic branch_taken_comb;
  always_comb begin
    branch_taken_comb = 1'b0;
    if (is_branch) begin
      unique case (funct3)
        3'b000: branch_taken_comb = (op_a == op_b);                       // BEQ
        3'b001: branch_taken_comb = (op_a != op_b);                       // BNE
        3'b100: branch_taken_comb = ($signed(op_a) <  $signed(op_b));     // BLT
        3'b101: branch_taken_comb = ($signed(op_a) >= $signed(op_b));     // BGE
        3'b110: branch_taken_comb = (op_a <  op_b);                       // BLTU
        3'b111: branch_taken_comb = (op_a >= op_b);                       // BGEU
        default: branch_taken_comb = 1'b0;
      endcase
    end
  end

  // ====================================================================
  // FSM state typedef — declared early so the dmem block below can
  // reference S_EXECUTE.
  // ====================================================================
  typedef enum logic [2:0] {
    S_FETCH    = 3'd0,
    S_DECODE   = 3'd1,
    S_EXECUTE  = 3'd2,
    S_MEM      = 3'd3,
    S_WB       = 3'd4,
    S_ILLEGAL  = 3'd5,
    S_HALT     = 3'd6
  } state_t;
  state_t state, next_state;

  // ====================================================================
  // Data memory — 256 bytes of flop RAM, word-addressed (4-byte stride).
  // ====================================================================
  logic [31:0] dmem [0:63];                  // 64 words = 256 bytes
  logic [31:0] dmem_rdata;
  logic [5:0]  dmem_addr;       // word index = addr[7:2]

  always_ff @(posedge clk or negedge rst_n) begin
    integer di;
    if (!rst_n) begin
      for (di = 0; di < 64; di = di + 1) dmem[di] <= 32'h0;
      dmem_rdata <= 32'h0;
    end else begin
      // Capture rdata on the same edge that we'd commit a write so LW
      // gets fresh data the cycle after EXECUTE.
      dmem_rdata <= dmem[dmem_addr];
      if (state == S_EXECUTE && is_store && !mem_unsupported) begin
        dmem[dmem_addr] <= op_b;        // op_b holds rs2
      end
    end
  end

  // ====================================================================
  // FSM next-state logic.
  // ====================================================================
  // IF  — fetch IR from PROG[pc].
  // ID  — decode; latch op_a, op_b, dmem_addr.
  // EX  — ALU compute; for LOAD/STORE the ALU computes the effective
  //       address (op_a + imm). STORE issues its write here.
  // MEM — wait one cycle so dmem_rdata can settle for LOAD.
  // WB  — write rd, advance PC.

  // Halt detection: we declare "halted" on a JAL x0, 0 (encoded as
  // 0x0000006F), i.e. an unconditional jump-to-self. This is the
  // conventional RISC-V "stuck loop = halted" marker. We also halt on
  // illegal instructions.
  // pc + imm_j == pc ⇒ imm_j == 0 ⇒ encoding 0x0000006F, so matching
  // the instruction word is enough; no PC comparison is needed.
  // NOTE: the FSM below tests `is_jal && (imm == 0)` directly, which
  // also matches a zero-offset JAL with rd != x0. This wire is not
  // referenced by the FSM.
  wire is_halt_loop = (ir == 32'h0000_006F);

  // PC update logic — combinational from current PC, latched in WB.
  logic [31:0] next_pc;
  always_comb begin
    next_pc = pc + 32'd4;             // default: sequential
    if (is_jal)                       next_pc = pc + imm;
    else if (is_jalr)                 next_pc = (op_a + imm) & ~32'h1;
    else if (is_branch && branch_taken_comb)
                                      next_pc = pc + imm;
  end

  always_comb begin
    next_state = state;
    unique case (state)
      S_FETCH:   next_state = S_DECODE;
      S_DECODE:  if (!is_legal || mem_unsupported) next_state = S_ILLEGAL;
                 else                              next_state = S_EXECUTE;
      S_EXECUTE: next_state = S_MEM;
      S_MEM:     next_state = S_WB;
      S_WB:      if (is_jal && (imm == 32'd0)) next_state = S_HALT;
                 else                          next_state = S_FETCH;
      S_ILLEGAL: next_state = S_HALT;
      S_HALT:    next_state = S_HALT;
      default:   next_state = S_FETCH;
    endcase
  end

  // ====================================================================
  // ALU operand muxing (combinational from latched op_a / op_b / imm).
  // ====================================================================
  always_comb begin
    alu_a = op_a;
    alu_b = op_b;
    alu_op = decoded_alu_op;

    if (is_lui) begin
      alu_a = 32'h0;
      alu_b = imm;
      alu_op = ALU_ADD;             // result = imm
    end else if (is_auipc) begin
      alu_a = pc;
      alu_b = imm;
      alu_op = ALU_ADD;
    end else if (is_jal || is_jalr) begin
      alu_a = pc;
      alu_b = 32'd4;
      alu_op = ALU_ADD;             // rd = pc + 4
    end else if (is_load || is_store) begin
      alu_a = op_a;
      alu_b = imm;
      alu_op = ALU_ADD;             // address = rs1 + imm
    end else if (is_opimm) begin
      alu_b = imm;
    end
    // OP / BRANCH leave alu_a/alu_b as op_a/op_b.
  end

  // ====================================================================
  // Sequential state.
  // ====================================================================
  logic [31:0] alu_result_q;        // latched ALU result for WB

  always_ff @(posedge clk or negedge rst_n) begin
    integer ri;
    if (!rst_n) begin
      state        <= S_FETCH;
      pc           <= 32'h0;
      ir           <= 32'h0;
      op_a         <= 32'h0;
      op_b         <= 32'h0;
      alu_result_q <= 32'h0;
      dmem_addr    <= 6'd0;
      for (ri = 0; ri < 32; ri = ri + 1) regs[ri] <= 32'h0;
    end else begin
      state <= next_state;

      unique case (state)
        S_FETCH: begin
          ir <= PROG[32 * pc[9:2] +: 32];
        end

        S_DECODE: begin
          op_a <= reg_read(rs1);
          op_b <= reg_read(rs2);
          // For LW/SW the address is rs1 + imm; capture the word index
          // here using the just-read rs1 so dmem_rdata settles by MEM.
          dmem_addr <= (reg_read(rs1) + imm) >> 2;
        end

        S_EXECUTE: begin
          // STORE write happens in the dmem always_ff above (see
          // is_store branch). For everything else, latch the ALU
          // output into alu_result_q.
          alu_result_q <= alu_y;
        end

        S_MEM: begin
          // LW data is now in dmem_rdata (latched at this edge from
          // the previous EXECUTE cycle's dmem_addr).
          if (is_load) alu_result_q <= dmem_rdata;
        end

        S_WB: begin
          // Write rd if the instruction has one (everything except
          // STORE / BRANCH / FENCE).
          if (rd != 5'd0
              && !is_store && !is_branch && !is_fence) begin
            regs[rd] <= alu_result_q;
          end
          pc <= next_pc;
        end

        S_ILLEGAL: ;
        S_HALT:    ;
        default:   ;
      endcase
    end
  end

  // ====================================================================
  // Outputs.
  // ====================================================================
  assign pc_out = pc;
  assign halted = (state == S_HALT);

  // The regfile and dmem need observable taps to keep yosys from
  // optimising them away during synthesis. dbg_reg_out exposes any one
  // regfile entry chosen by the chip-pin selector; dmem_out exposes the
  // first word of data memory. Without these the synth result has no
  // path from the CPU's working state to a primary output and yosys
  // strips out the regfile (1024 flops) and the dmem (2048 flops) as
  // unreachable. Tying them through to chip pins forces the placer to
  // realise them and the dataflow to stay in the netlist.
  assign dbg_reg_out = regs[dbg_reg_sel];
  assign dmem_out    = dmem[0];

  // start input reserved for future use (single-step / run gate).
  wire _unused = &{1'b0, start};

endmodule

`default_nettype wire
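As a cross-check on the default boot program, here is a tiny instruction-level model in Python. It is a sketch that implements only the five instructions the program uses (ADDI, ADD, BLT, SW and the `jal x0, 0` halt), and it is not cycle-accurate. The program words are copied from PROG above; everything else is our own naming.

```python
# Default boot program from top.sv, word 0 first.
PROGRAM = [
    0x00000093, 0x00100113, 0x00a00193, 0x00000213, 0x00100293,
    0x00208333, 0x00010093, 0x00030113, 0x00520233, 0xfe3248e3,
    0x00008513, 0x00a02023, 0x0000006f,
]
MASK32 = 0xFFFFFFFF


def sext(value: int, width: int) -> int:
    """Sign-extend the low `width` bits of `value`."""
    value &= (1 << width) - 1
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def run(program, max_instructions=1000):
    """Execute until `jal x0, 0`; return (regs, dmem, instructions_executed)."""
    regs, dmem, pc, executed = [0] * 32, [0] * 64, 0, 0
    while executed < max_instructions:
        word = program[pc >> 2]
        opcode, rd = word & 0x7F, (word >> 7) & 0x1F
        funct3, funct7 = (word >> 12) & 0x7, word >> 25
        rs1, rs2 = (word >> 15) & 0x1F, (word >> 20) & 0x1F
        executed += 1
        next_pc, result = pc + 4, None
        if word == 0x0000006F:                                  # jal x0, 0: halt
            return regs, dmem, executed
        if opcode == 0x13 and funct3 == 0:                      # ADDI
            result = regs[rs1] + sext(word >> 20, 12)
        elif opcode == 0x33 and funct3 == 0 and funct7 == 0:    # ADD
            result = regs[rs1] + regs[rs2]
        elif opcode == 0x63 and funct3 == 0b100:                # BLT (signed)
            raw = ((((word >> 31) & 1) << 12) | (((word >> 7) & 1) << 11)
                   | (((word >> 25) & 0x3F) << 5) | (((word >> 8) & 0xF) << 1))
            if sext(regs[rs1], 32) < sext(regs[rs2], 32):
                next_pc = pc + sext(raw, 13)
        elif opcode == 0x23 and funct3 == 0b010:                # SW
            addr = regs[rs1] + sext((funct7 << 5) | rd, 12)
            dmem[(addr >> 2) & 0x3F] = regs[rs2]                # 6-bit word index
        else:
            raise ValueError(f"instruction {word:#010x} is outside this sketch")
        if result is not None and rd != 0:                      # x0 writes are dropped
            regs[rd] = result & MASK32
        pc = next_pc
    raise RuntimeError("program did not halt")


regs, dmem, executed = run(PROGRAM)
print(regs[10], dmem[0], executed)   # prints: 55 55 58
```

The model retires 58 instructions. At five FSM states per instruction that suggests about 290 clock cycles on the real core before `halted` goes high, assuming nothing stalls.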

Test programs

The verifying TB ships three programs as functions that build the PROG bit-vector. Each uses RV32I-format encoders so the assembly intent reads through clearly:

projects/09_rv32i_min/test/tb.sv system-verilog · L180-260
    int i;
    for (i = 0; i < 256; i++) prog[i] = 32'h0;
    prog[ 0] = ADDI(5'd1, 5'd0, 12'd7);          // x1 = 7
    prog[ 1] = ADDI(5'd2, 5'd0, 12'd5);          // x2 = 5
    prog[ 2] = ADD (5'd3, 5'd1, 5'd2);
    prog[ 3] = SUB (5'd4, 5'd1, 5'd2);
    prog[ 4] = AND_(5'd5, 5'd1, 5'd2);
    prog[ 5] = OR_ (5'd6, 5'd1, 5'd2);
    prog[ 6] = XOR_(5'd7, 5'd1, 5'd2);
    prog[ 7] = SLLI(5'd8, 5'd1, 5'd1);           // x8 = x1 << 1 = 14
    prog[ 8] = SRLI(5'd9, 5'd1, 5'd1);           // x9 = x1 >> 1 = 3
    prog[ 9] = SLT_(5'd10, 5'd1, 5'd2);          // x10 = (x1 < x2) = 0
    prog[10] = ADDI(5'd11, 5'd0, -12'sd3);       // x11 = -3 (sign-extended)
    prog[11] = SRAI(5'd12, 5'd11, 5'd1);         // x12 = -3 >>> 1 = -2 = 0xFFFF_FFFE
    prog[12] = HLT();
    // Pack into the bit-vector. PROG[32*i +: 32] = prog[i].
    build_prog_arith = '0;
    for (i = 0; i < 256; i++) begin
      build_prog_arith[32*i +: 32] = prog[i];
    end
  endfunction

  logic [31:0] dut1_pc; logic dut1_halted;
  logic [31:0] dut1_reg, dut1_dmem;
  top #(.PROG(PROG_ARITH)) dut1 (
    .clk(clk), .rst_n(rst1_n), .start(start),
    .dbg_reg_sel(5'd0),
    .pc_out(dut1_pc), .halted(dut1_halted),
    .dbg_reg_out(dut1_reg), .dmem_out(dut1_dmem)
  );

  // ========== Program 2: branch ==========
  // Decrement x1 from 5 down to 0, counting iterations in x2.
  //   x1 = 5; x2 = 0; x3 = 1; x4 = 0;
  // loop:
  //   x1 = x1 - x3;           # decrement
  //   x2 = x2 + x3;           # tally
  //   bne x1, x4, loop        # if x1 != 0, branch back
  //   halt
  // After: x1=0, x2=5.
  localparam logic [256*32-1:0] PROG_BRANCH = build_prog_branch();

  function automatic logic [256*32-1:0] build_prog_branch;
    logic [31:0] prog [0:255];
    int i;
    for (i = 0; i < 256; i++) prog[i] = 32'h0;
    prog[ 0] = ADDI(5'd1, 5'd0, 12'd5);          // x1 = 5
    prog[ 1] = ADDI(5'd2, 5'd0, 12'd0);          // x2 = 0
    prog[ 2] = ADDI(5'd3, 5'd0, 12'd1);          // x3 = 1
    prog[ 3] = ADDI(5'd4, 5'd0, 12'd0);          // x4 = 0
    // loop label = address 16 (4 instructions × 4 bytes)
    prog[ 4] = SUB (5'd1, 5'd1, 5'd3);           // x1 -= 1
    prog[ 5] = ADD (5'd2, 5'd2, 5'd3);           // x2 += 1
    prog[ 6] = BNE (5'd1, 5'd4, -13'sd8);        // pc -= 8 = back to addr 16
    prog[ 7] = HLT();
    build_prog_branch = '0;
    for (i = 0; i < 256; i++) build_prog_branch[32*i +: 32] = prog[i];
  endfunction

  logic [31:0] dut2_pc; logic dut2_halted;
  logic [31:0] dut2_reg, dut2_dmem;
  top #(.PROG(PROG_BRANCH)) dut2 (
    .clk(clk), .rst_n(rst2_n), .start(start),
    .dbg_reg_sel(5'd0),
    .pc_out(dut2_pc), .halted(dut2_halted),
    .dbg_reg_out(dut2_reg), .dmem_out(dut2_dmem)
  );

  // ========== Program 3: Fibonacci(10) ==========
  // Compute fib(10) = 55 in x10.
  //   x1 = 0;       # a
  //   x2 = 1;       # b
  //   x3 = 10;      # n
  //   x4 = 0;       # i
  //   x5 = 1;       # const 1
  // loop:
  //   t = x1 + x2;
  //   x1 = x2;
  //   x2 = t;
  //   x4 = x4 + 1;
  //   blt x4, x3, loop
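The encoder helpers the testbench calls (ADDI, ADD, SUB, BNE, HLT) are not part of this excerpt. As a rough idea of what such RV32I-format encoders do, here is a Python sketch. The function names and argument order follow the calls above, but the bodies are our assumption, checked only against the instruction words in the default PROG.

```python
def r_type(funct7, rs2, rs1, funct3, rd, opcode):
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def i_type(imm, rs1, funct3, rd, opcode):
    # Masking to 12 bits turns a negative Python int into its two's-complement field.
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def b_type(offset, rs2, rs1, funct3, opcode):
    imm = offset & 0x1FFF                      # 13-bit signed byte offset, bit 0 unused
    return ((((imm >> 12) & 1) << 31) | (((imm >> 5) & 0x3F) << 25)
            | (rs2 << 20) | (rs1 << 15) | (funct3 << 12)
            | (((imm >> 1) & 0xF) << 8) | (((imm >> 11) & 1) << 7) | opcode)


def ADDI(rd, rs1, imm):    return i_type(imm, rs1, 0b000, rd, 0b0010011)
def ADD(rd, rs1, rs2):     return r_type(0b0000000, rs2, rs1, 0b000, rd, 0b0110011)
def SUB(rd, rs1, rs2):     return r_type(0b0100000, rs2, rs1, 0b000, rd, 0b0110011)
def BNE(rs1, rs2, offset): return b_type(offset, rs2, rs1, 0b001, 0b1100011)
def BLT(rs1, rs2, offset): return b_type(offset, rs2, rs1, 0b100, 0b1100011)
def HLT():                 return 0x0000006F   # jal x0, 0


print(f"{ADDI(3, 0, 10):08x}")    # 00a00193, word 2 of the default PROG
print(f"{ADD(6, 1, 2):08x}")      # 00208333, word 5
print(f"{BLT(4, 3, -16):08x}")    # fe3248e3, word 9
```

Writing the branch offset as a signed byte count, as the testbench does with `-13'sd8`, keeps the scrambled B-type bit order out of the test programs.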

Demo

The demo runs Fibonacci(10) and prints each FETCH cycle with the program counter, the hex instruction, a short disassembly, and the key registers:

[cpu]  -- librelane-playground / project 09 / RV32I-min --
[cpu]  multi-cycle 5-stage FSM (IF/ID/EX/MEM/WB), 32 x 32-bit regfile
[cpu]  program: Fibonacci(10) -> x10 = 55

[cpu]  pc=0x00000000  ir=00000093  addi  x1, x0, 0     | x1=0  x2=0  x4=0  x10=0
[cpu]  pc=0x00000004  ir=00100113  addi  x2, x0, 1     | x1=0  x2=0  x4=0  x10=0
[cpu]  pc=0x00000008  ir=00a00193  addi  x3, x0, 10    | x1=0  x2=1  x4=0  x10=0
[cpu]  pc=0x00000014  ir=00208333  add   x6, x1, x2    | x1=0  x2=1  x4=0  x10=0
[cpu]  pc=0x00000018  ir=00010093  addi  x1, x2, 0     | x1=0  x2=1  x4=0  x10=0
[cpu]  pc=0x0000001c  ir=00030113  addi  x2, x6, 0     | x1=1  x2=1  x4=0  x10=0
[cpu]  pc=0x00000024  ir=fe3248e3  blt   x4, x3, -16   | x1=1  x2=1  x4=1  x10=0
... (loops back to pc=0x14, ten iterations)
[cpu]  pc=0x00000028  ir=00008513  addi  x10, x1, 0    | x1=55 x2=89 x4=10 x10=0
[cpu]  pc=0x0000002c  ir=0000006f  halt  (jal x0,0)    | x1=55 x2=89 x4=10 x10=55
[cpu]  halted at pc=0x0000002c, x10 = 55

Compiling C — gcc-riscv-elf into PROG[]

Hand-written assembly is fine for an educational chip but it stops being interesting after the first addi. The whole point of implementing an existing ISA (rather than ours-by-convenience) is that there’s a real toolchain that targets it. P09 plus tools/riscv-asm/ is the smallest end-to-end RISC-V toolchain flow that fits on this site.

The pipeline:

example.c                                    -- the program
  + start.S (boot stub: sp=0x100, jal main, halt)
  + p09.ld  (linker script: .text at 0x0, 1 KB hard cap)

       ▼   riscv64-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib …
  build/example.elf

       ▼   riscv64-elf-objcopy -O binary
  build/example.bin

       ▼   uv run bin_to_prog.py
  build/example.svh   ── localparam logic [256*32-1:0] PROG_FROM_C = { … };

       ▼   testbench `include-s the .svh and overrides PROG
  iverilog tb_c.sv ../src/top.sv

       ▼
  the chip runs the program.
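The repository's bin_to_prog.py is not reproduced on this page. The Python below is a hypothetical stand-in showing the shape of that step: read the flat binary as little-endian 32-bit words, pad to 256 words, and emit a concatenation with word 0 last, since SystemVerilog concatenations list the most-significant element first. The function name, signature and output formatting are assumptions.

```python
import struct

ROM_WORDS = 256   # PROG holds 256 x 32-bit instructions (1 KB)


def bin_to_prog(image: bytes, name: str = "PROG_FROM_C") -> str:
    """Render a flat little-endian binary as a SystemVerilog localparam."""
    if len(image) > ROM_WORDS * 4:
        raise ValueError(f"{len(image)} bytes does not fit the {ROM_WORDS * 4}-byte ROM")
    padded = image + b"\x00" * (-len(image) % 4)          # round up to whole words
    words = [w for (w,) in struct.iter_unpack("<I", padded)]
    words += [0] * (ROM_WORDS - len(words))               # zero words decode as illegal
    # Word 255 is emitted first and word 0 last, so PROG[31:0] is word 0.
    body = ",\n".join(f"    32'h{w:08x}" for w in reversed(words))
    return f"localparam logic [{ROM_WORDS}*32-1:0] {name} = {{\n{body}\n}};\n"


# Two-instruction image: addi x1, x0, 0 followed by jal x0, 0.
text = bin_to_prog(bytes.fromhex("930000006f000000"))
print("\n".join(text.splitlines()[-3:]))
```

The size check duplicates what the linker script already enforces. Keeping it in the converter as well means a hand-built .bin cannot silently overflow the ROM.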

For examples/fib.c (a real C unsigned int fib(unsigned int n)), gcc -Os produces 11 instructions of inner loop plus a 4-instruction boot stub. The chip executes them in 301 cycles and stores 55 (= fib(10)) at dmem[0]. The strict make c-test testbench reads dmem[0] after halt and asserts it’s 55.

A few sharp edges that surfaced on the way:

  • gcc treats writes to address 0 as undefined behaviour and replaces them with __builtin_trap() (= ebreak), which P09 doesn’t implement. -fno-delete-null-pointer-checks disables the optimization. P09’s address 0 is dmem[0] — a real, valid memory location — and the C code legitimately stores there.
  • 256 instructions of code is a hard ceiling. The linker script has an ASSERT(_text_size <= 0x400, …) that aborts the link with a clear message if a program overflows the ROM. There’s no “spill to RAM”; the chip’s fetch unit reads from PROG[] exclusively.
  • gcc’s stack-frame conventions (sp=0x100, alignment to 16 bytes) share an address space with the C int * writes. Anything you store via *(int*)X for X < 0x100 is fighting the stack. For the fib example the stored result lives at address 0 (dmem[0]), well below sp; the function uses no stack memory of its own.
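The stack overlap is easier to reason about with the address mapping written out. This Python sketch mirrors how the RTL derives its 6-bit `dmem_addr` from a byte address (`(rs1 + imm) >> 2`, truncated to six bits). The function name is ours, and the frame layout assumes gcc drops sp by 16 bytes from the initial 0x100.

```python
DMEM_WORDS = 64        # 64 x 32-bit words = 256 bytes
INITIAL_SP = 0x100     # set by the boot stub


def dmem_word_index(byte_addr: int) -> int:
    """Word of dmem that a LW/SW to `byte_addr` would touch."""
    return ((byte_addr & 0xFFFFFFFF) >> 2) % DMEM_WORDS


print(dmem_word_index(0x000))            # where fib's result is stored
print(dmem_word_index(INITIAL_SP - 4))   # first word below the initial sp
print(dmem_word_index(INITIAL_SP))       # wraps: only address bits [7:2] are used
print([dmem_word_index(a) for a in range(INITIAL_SP - 16, INITIAL_SP, 4)])
```

Two consequences of the truncation are worth keeping in mind. Addresses at or above 0x100 alias back into the same 64 words rather than faulting. A misaligned address is silently rounded down to its containing word, because the low two bits are dropped.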

tools/riscv-asm/README.md has the full reference. Adding a new example is one C file, one make EXAMPLE=name, and one tweak to tb_c.sv’s expected value.

What’s next

  • Tape-out. Project 10 (Tiny Tapeout) fits one of the designs on this ladder into TT’s pin-frame harness for an actual fab submission. P09 itself is too big for a TT tile, but a stripped-down variant (or P11’s wrapped-P06 version) is the kind of thing that would go.
  • Bigger programs. 1 KB of ROM caps how interesting the programs can get. A future project could add an external SPI ROM and a tiny instruction-fetch state machine that loads the program on boot — same shape as how a real microcontroller boots from flash.

See also

  • Project 06 → the original 8-bit CPU this scales up from.
  • Project 08 → macro-aware harden pattern; P09 currently uses flop-based memories but a future iteration with an SRAM macro for instruction memory would reuse the same flow config.
  • Project README