A minimal RV32I implementation. Multi-cycle 5-stage FSM
(IF → ID → EX → MEM → WB), 32-bit datapath, 32 × 32-bit
register file (x0 hardwired to zero), 256-word (1 KB) instruction ROM,
256-byte data RAM. The first project on the ladder where the
ISA isn’t ours-by-convenience — this is real
RISC-V assembly.
Status: Hardened at sky130A on a 600 × 600 µm die at 40 MHz. 17,277 non-filler cells (regfile + dmem + ALU + decoder + 5-state FSM), 3,262 flops, 2.94 ns of setup slack, zero DRC/LVS/antenna violations. Three RV32I programs pass under iverilog: arithmetic ops (LUI/ADDI/ADD/SUB/AND/OR/XOR/SLLI/SRLI/SRAI/SLT), a countdown loop with BNE, and Fibonacci(10) using ADD/BLT/JAL.
Honesty rules
Per the project conventions in CLAUDE.md, here’s exactly what this
core supports and what it doesn’t:
Supported:
| class | instructions |
|---|---|
| R-type | ADD, SUB, AND, OR, XOR, SLL, SRL, SRA, SLT, SLTU |
| I-type | ADDI, ANDI, ORI, XORI, SLLI, SRLI, SRAI, SLTI, SLTIU |
| Upper imm | LUI, AUIPC |
| Branch | BEQ, BNE, BLT, BGE, BLTU, BGEU |
| Jump | JAL, JALR |
| Load | LW only (no LB/LH/LBU/LHU) |
| Store | SW only (no SB/SH) |
| FENCE | decoded as NOP — legal for in-order single-issue cores |
Not supported:
- Sub-word memory access
- ECALL / EBREAK / FENCE.I
- CSR ops (no CSR file)
- MRET / SRET / WFI (no privileged modes)
- Any M / A / F / D / C / Zicsr / Zifencei extensions
- Misaligned memory access
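Compliance tests have not been run. This core implements the
RV32I instruction shape: programs that stay within the supported
subset above run correctly in simulation. An instruction whose major
opcode is outside that subset (ECALL, EBREAK and CSR ops included), or
a sub-word load or store, lands in the FSM’s S_ILLEGAL state and halts.
The decoder checks only the major opcode, plus funct3 for loads and
stores, so encodings that reuse a supported opcode (M-extension ops,
FENCE.I) are not trapped.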
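As a rough model of that check, the sketch below mirrors the `is_legal` and `mem_unsupported` wires from the RTL listing in Python. The function name `is_accepted` is hypothetical; the opcode values are the `OP_*` localparams from the RTL. It is a reading aid under those assumptions, not a substitute for the compliance suite.

```python
# Major opcodes the decoder accepts (the OP_* localparams in the RTL).
LEGAL_OPCODES = {
    0b0110111,  # LUI
    0b0010111,  # AUIPC
    0b1101111,  # JAL
    0b1100111,  # JALR
    0b1100011,  # BRANCH
    0b0000011,  # LOAD
    0b0100011,  # STORE
    0b0010011,  # OP-IMM
    0b0110011,  # OP
    0b0001111,  # FENCE
}
OP_LOAD = 0b0000011
OP_STORE = 0b0100011


def is_accepted(instr: int) -> bool:
    """Hypothetical mirror of `is_legal && !mem_unsupported` for one word."""
    opcode = instr & 0x7F
    funct3 = (instr >> 12) & 0x7
    if opcode not in LEGAL_OPCODES:
        return False  # the FSM would go to S_ILLEGAL
    if opcode in (OP_LOAD, OP_STORE) and funct3 != 0b010:
        return False  # sub-word load/store: also S_ILLEGAL
    return True


print(is_accepted(0x00A02023))  # SW x10, 0(x0)  -> True
print(is_accepted(0x00100073))  # EBREAK         -> False
print(is_accepted(0x00000003))  # LB x0, 0(x0)   -> False
print(is_accepted(0x023100B3))  # MUL x1, x2, x3 -> True (only the opcode is checked)
```

The last line is the case to watch: an M-extension encoding shares the OP major opcode, so a check of this shape lets it through.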
What’s new vs. P08
- 32-bit datapath. All previous projects used 8-bit registers and operands; now everything is 32 bits. Sign-extension of immediates and halfword offsets becomes load-bearing.
- A real ISA. The instruction encoding is RISC-V’s. Decoding has to fish opcode / funct3 / funct7 / imm fields out of fixed bit positions per type.
- Five FSM stages. MEM is its own stage so LW data has time to settle. Pure-ALU and branch instructions still walk through MEM but it’s a no-op for them.
- Branch decode. Six branch ops with two operand-comparison classes (signed for BLT/BGE, unsigned for BLTU/BGEU).
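To make the "fixed bit positions per type" point concrete, here is a minimal Python sketch of the field and immediate extraction, following the same bit slices as the `imm_i` / `imm_s` / `imm_b` / `imm_u` / `imm_j` wires in the RTL. The helper names `bits` and `sext` are illustrative and not part of the project; immediates are returned as signed Python ints rather than 32-bit patterns.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sext(value: int, width: int) -> int:
    """Sign-extend a `width`-bit value to a signed Python int."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_fields(ir: int) -> dict:
    """Fixed-position fields shared by all RV32I formats."""
    return {
        "opcode": bits(ir, 6, 0),
        "rd": bits(ir, 11, 7),
        "funct3": bits(ir, 14, 12),
        "rs1": bits(ir, 19, 15),
        "rs2": bits(ir, 24, 20),
        "funct7": bits(ir, 31, 25),
    }


def imm_i(ir: int) -> int:
    return sext(bits(ir, 31, 20), 12)


def imm_s(ir: int) -> int:
    return sext((bits(ir, 31, 25) << 5) | bits(ir, 11, 7), 12)


def imm_b(ir: int) -> int:
    raw = ((bits(ir, 31, 31) << 12) | (bits(ir, 7, 7) << 11)
           | (bits(ir, 30, 25) << 5) | (bits(ir, 11, 8) << 1))
    return sext(raw, 13)


def imm_u(ir: int) -> int:
    return sext(bits(ir, 31, 12) << 12, 32)


def imm_j(ir: int) -> int:
    raw = ((bits(ir, 31, 31) << 20) | (bits(ir, 19, 12) << 12)
           | (bits(ir, 20, 20) << 11) | (bits(ir, 30, 21) << 1))
    return sext(raw, 21)


# The loop branch from the default boot program: BLT x4, x3, -16.
fields = decode_fields(0xFE3248E3)
print(fields["rs1"], fields["rs2"], imm_b(0xFE3248E3))  # 4 3 -16
```

The branch and jump formats are where the scrambling shows: the sign bit always sits in `ir[31]`, while the remaining immediate bits are spread around the register fields.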
The FSM
RTL
// Project 09: RV32I-min educational core.
//
// A minimal RV32I implementation. Multi-cycle FSM (same shape as
// projects 06/07/08), 5 stages: IF, ID, EX, MEM, WB. 32-bit datapath,
// 32 × 32-bit register file (x0 hardwired to zero), 256-word (1 KB)
// instruction ROM (parameterized at instantiation), 256-byte data
// RAM (flop-based; P10 may revisit with a macro).
//
// HONESTY RULES (per CLAUDE.md). What this core *does* support:
//
// R-type (10): ADD SUB AND OR XOR SLL SRL SRA SLT SLTU
// I-type (9): ADDI ANDI ORI XORI SLLI SRLI SRAI SLTI SLTIU
// Upper imm: LUI AUIPC
// Branch (6): BEQ BNE BLT BGE BLTU BGEU
// Jump: JAL JALR
// Load: LW (only — no LB/LH/LBU/LHU)
// Store: SW (only — no SB/SH)
// FENCE: decoded as NOP (legal: FENCE is allowed to be a NOP
// on a single-issue in-order core)
//
// What this core does NOT support:
//
// LB / LH / LBU / LHU / SB / SH (sub-word memory access)
// ECALL / EBREAK (no traps; treated as illegal)
// CSRRW / CSRRS / CSRRC / immediate forms (no CSR file)
// MRET / SRET / WFI (no privileged modes)
// Any of the M/A/F/D/C extensions (we are int-base-only)
//
// Compliance tests have NOT been run. This core targets the
// RV32I instruction shape, not the full RISC-V specification.
// Programs that stay within the supported subset above run correctly
// in simulation; anything that hits an unsupported encoding lands in
// the FSM's `S_ILLEGAL` state which halts the chip.
//
// What's new vs. P06/07/08:
//
// - **32-bit datapath.** All previous projects used 8-bit registers
// and operands; now everything is 32 bits. Sign-extension of
// immediates and halfword address offsets becomes load-bearing.
// - **A real ISA.** The instruction encoding is RISC-V's, not
// ours-by-convenience. Decoding has to fish opcode / funct3 /
// funct7 / imm fields out of fixed bit positions per type.
// - **Five FSM stages instead of four.** MEM is its own stage so
// LW data has time to settle from the bus the same cycle that
// ST gets its write-enable pulse.
// - **Branch decode.** Six branch ops with two operand-comparison
// classes (signed for BLT/BGE, unsigned for BLTU/BGEU). The
// EXECUTE stage computes both, and the WB stage picks the right
// one.
//
// Programs are parameterized into the design via PROG (256 × 32-bit
// instructions). The default boot program computes Fibonacci(10) and
// stores the result at data-memory address 0; the testbench reads it
// out of the regfile after the chip halts via a self-loop on the
// `halt_addr`.
`default_nettype none
module top #(
// Boot program: 256 instructions × 32 bits = 1024 bytes of ROM.
//
// The default is a real RV32I Fibonacci(10) program — earlier we
// had `jal x0, 0` here, which is a one-instruction infinite loop
// that never writes a register. Yosys correctly proves that the
// entire regfile and dmem stay at zero forever and synthesises
// them away as dead code (the silicon shrinks from 5500 logic
// cells to 210). Baking a non-trivial default program forces the
// synth-time view of the design to actually exercise the ALU,
// regfile, branch comparator, and store path, so the hardened
// chip retains the structures we wanted to study. Testbenches
// still override PROG by named-parameter override.
parameter logic [256*32-1:0] PROG = {
{243{32'h0000_0000}},
32'h0000006f, // 12: JAL x0, 0 ; halt (jump-to-self)
32'h00a02023, // 11: SW x10, 0(x0) ; dmem[0] = x10
32'h00008513, // 10: ADDI x10, x1, 0 ; x10 = a (the Fib result)
32'hfe3248e3, // 9: BLT x4, x3, -16 ; if i < n: loop back
32'h00520233, // 8: ADD x4, x4, x5 ; i++
32'h00030113, // 7: ADDI x2, x6, 0 ; b = t
32'h00010093, // 6: ADDI x1, x2, 0 ; a = b
32'h00208333, // 5: ADD x6, x1, x2 ; t = a + b
32'h00100293, // 4: ADDI x5, x0, 1 ; x5 = 1 (loop step)
32'h00000213, // 3: ADDI x4, x0, 0 ; i = 0
32'h00a00193, // 2: ADDI x3, x0, 10 ; n = 10
32'h00100113, // 1: ADDI x2, x0, 1 ; b = 1
32'h00000093 // 0: ADDI x1, x0, 0 ; a = 0
}
) (
input logic clk,
input logic rst_n,
input logic start, // currently unused (1 = run)
input logic [4:0] dbg_reg_sel, // chip-pin: which regfile entry to expose
// Debug
output logic [31:0] pc_out,
output logic [31:0] dbg_reg_out, // selected regfile entry (held combinationally)
output logic [31:0] dmem_out, // dmem[0] — observable view of memory
output logic halted
);
// ====================================================================
// Stage 0: Program counter + instruction ROM read.
// ====================================================================
// PC is byte-addressed (RISC-V convention) but instructions are
// word-aligned, so we shift by 2 to index PROG (which is packed by
// word: ROM[0] is bits 31:0, ROM[1] is 63:32, etc. — selected by the
// `+:` indexed-part-select).
logic [31:0] pc;
logic [31:0] ir; // current instruction (loaded from PROG in IF)
// ====================================================================
// Decode (combinational from IR).
// ====================================================================
wire [6:0] opcode = ir[ 6: 0];
wire [4:0] rd = ir[11: 7];
wire [2:0] funct3 = ir[14:12];
wire [4:0] rs1 = ir[19:15];
wire [4:0] rs2 = ir[24:20];
wire [6:0] funct7 = ir[31:25];
// RISC-V opcode literal categories.
localparam logic [6:0] OP_LUI = 7'b0110111;
localparam logic [6:0] OP_AUIPC = 7'b0010111;
localparam logic [6:0] OP_JAL = 7'b1101111;
localparam logic [6:0] OP_JALR = 7'b1100111;
localparam logic [6:0] OP_BRANCH = 7'b1100011;
localparam logic [6:0] OP_LOAD = 7'b0000011;
localparam logic [6:0] OP_STORE = 7'b0100011;
localparam logic [6:0] OP_OPIMM = 7'b0010011;
localparam logic [6:0] OP_OP = 7'b0110011;
localparam logic [6:0] OP_FENCE = 7'b0001111;
// Per-instruction predicates.
wire is_lui = (opcode == OP_LUI);
wire is_auipc = (opcode == OP_AUIPC);
wire is_jal = (opcode == OP_JAL);
wire is_jalr = (opcode == OP_JALR);
wire is_branch = (opcode == OP_BRANCH);
wire is_load = (opcode == OP_LOAD);
wire is_store = (opcode == OP_STORE);
wire is_opimm = (opcode == OP_OPIMM);
wire is_op = (opcode == OP_OP);
wire is_fence = (opcode == OP_FENCE);
wire is_legal = is_lui | is_auipc | is_jal | is_jalr | is_branch
| is_load | is_store | is_opimm | is_op | is_fence;
// Sub-word load/store (LB/LH/LBU/LHU/SB/SH) we don't support — flag
// those as illegal so we halt rather than silently misbehave. funct3:
// LW: 010 LB: 000 LH: 001 LBU: 100 LHU: 101
// SW: 010 SB: 000 SH: 001
wire is_lw_only = is_load && (funct3 == 3'b010);
wire is_sw_only = is_store && (funct3 == 3'b010);
wire mem_unsupported = (is_load && !is_lw_only) || (is_store && !is_sw_only);
// ----- Immediate decode -----
// Each instruction format puts the immediate in different bits and
// sign-extends it differently. RISC-V is meticulous about sign-bit
// placement to share decoder hardware.
wire [31:0] imm_i = {{20{ir[31]}}, ir[31:20]};
wire [31:0] imm_s = {{20{ir[31]}}, ir[31:25], ir[11:7]};
wire [31:0] imm_b = {{19{ir[31]}}, ir[31], ir[7], ir[30:25], ir[11:8], 1'b0};
wire [31:0] imm_u = {ir[31:12], 12'h000};
wire [31:0] imm_j = {{11{ir[31]}}, ir[31], ir[19:12], ir[20], ir[30:21], 1'b0};
// Pick the right immediate per opcode.
logic [31:0] imm;
always_comb begin
unique case (opcode)
OP_OPIMM, OP_LOAD, OP_JALR: imm = imm_i;
OP_STORE: imm = imm_s;
OP_BRANCH: imm = imm_b;
OP_LUI, OP_AUIPC: imm = imm_u;
OP_JAL: imm = imm_j;
default: imm = 32'h0000_0000;
endcase
end
// ====================================================================
// Register file. 32 × 32-bit. x0 reads as zero, writes are dropped.
// Async read, sync write — same shape as project 05's regfile.
// ====================================================================
logic [31:0] regs [0:31];
// Yosys SV frontend doesn't accept `return X;` inside a function — assign
// to the function name instead. Same semantics, plain Verilog form.
function automatic logic [31:0] reg_read(input [4:0] sel);
if (sel == 5'd0) reg_read = 32'h0000_0000;
else reg_read = regs[sel];
endfunction
// ----- Operand registers (latched in DECODE) -----
logic [31:0] op_a; // regs[rs1]
logic [31:0] op_b; // regs[rs2] or imm
// For ALU ops the second operand is reg or imm depending on opcode;
// for STORE the data-to-write is regs[rs2] and the immediate is the
// address offset; for BRANCH both come from regs[rs1] and regs[rs2]
// and the immediate is the branch target.
// We resolve "what op_a / op_b mean" structurally in the always_ff
// for clarity.
// ====================================================================
// ALU. Combinational.
// ====================================================================
// Picks between op_a + imm style and op_a (op) op_b style.
// The selector logic is in the EXECUTE stage of the always_ff.
logic [31:0] alu_a, alu_b;
logic [3:0] alu_op;
// ALU op encoding (internal, doesn't match RISC-V funct3/funct7
// because the compression there is non-orthogonal; we re-encode for
// simplicity).
localparam logic [3:0] ALU_ADD = 4'b0000;
localparam logic [3:0] ALU_SUB = 4'b0001;
localparam logic [3:0] ALU_AND = 4'b0010;
localparam logic [3:0] ALU_OR = 4'b0011;
localparam logic [3:0] ALU_XOR = 4'b0100;
localparam logic [3:0] ALU_SLL = 4'b0101;
localparam logic [3:0] ALU_SRL = 4'b0110;
localparam logic [3:0] ALU_SRA = 4'b0111;
localparam logic [3:0] ALU_SLT = 4'b1000; // signed
localparam logic [3:0] ALU_SLTU = 4'b1001; // unsigned
// 1010..1111 reserved.
logic [31:0] alu_y;
always_comb begin
unique case (alu_op)
ALU_ADD: alu_y = alu_a + alu_b;
ALU_SUB: alu_y = alu_a - alu_b;
ALU_AND: alu_y = alu_a & alu_b;
ALU_OR: alu_y = alu_a | alu_b;
ALU_XOR: alu_y = alu_a ^ alu_b;
ALU_SLL: alu_y = alu_a << alu_b[4:0];
ALU_SRL: alu_y = alu_a >> alu_b[4:0];
ALU_SRA: alu_y = $signed(alu_a) >>> alu_b[4:0];
ALU_SLT: alu_y = ($signed(alu_a) < $signed(alu_b)) ? 32'h1 : 32'h0;
ALU_SLTU: alu_y = (alu_a < alu_b) ? 32'h1 : 32'h0;
default: alu_y = 32'h0;
endcase
end
// Decode opcode/funct3/funct7 → alu_op for OP and OPIMM. For LOAD,
// STORE, JAL, JALR, AUIPC the ALU adds an immediate; LUI passes the
// immediate through. BRANCH leaves alu_op at ALU_ADD and ignores the
// ALU result; the branch comparator below decides taken/not-taken.
logic [3:0] decoded_alu_op;
always_comb begin
decoded_alu_op = ALU_ADD;
if (is_op || is_opimm) begin
unique case (funct3)
3'b000: decoded_alu_op = (is_op && funct7[5]) ? ALU_SUB : ALU_ADD;
3'b001: decoded_alu_op = ALU_SLL;
3'b010: decoded_alu_op = ALU_SLT;
3'b011: decoded_alu_op = ALU_SLTU;
3'b100: decoded_alu_op = ALU_XOR;
3'b101: decoded_alu_op = funct7[5] ? ALU_SRA : ALU_SRL;
3'b110: decoded_alu_op = ALU_OR;
3'b111: decoded_alu_op = ALU_AND;
default: decoded_alu_op = ALU_ADD;
endcase
end
end
// ====================================================================
// Branch comparator. Combinational.
// ====================================================================
// Computes the branch condition based on funct3 from the latched
// op_a (= rs1), op_b (= rs2). Used in WB to decide whether to take
// the branch.
logic branch_taken_comb;
always_comb begin
branch_taken_comb = 1'b0;
if (is_branch) begin
unique case (funct3)
3'b000: branch_taken_comb = (op_a == op_b); // BEQ
3'b001: branch_taken_comb = (op_a != op_b); // BNE
3'b100: branch_taken_comb = ($signed(op_a) < $signed(op_b)); // BLT
3'b101: branch_taken_comb = ($signed(op_a) >= $signed(op_b)); // BGE
3'b110: branch_taken_comb = (op_a < op_b); // BLTU
3'b111: branch_taken_comb = (op_a >= op_b); // BGEU
default: branch_taken_comb = 1'b0;
endcase
end
end
// ====================================================================
// FSM state typedef — declared early so the dmem block below can
// reference S_EXECUTE.
// ====================================================================
typedef enum logic [2:0] {
S_FETCH = 3'd0,
S_DECODE = 3'd1,
S_EXECUTE = 3'd2,
S_MEM = 3'd3,
S_WB = 3'd4,
S_ILLEGAL = 3'd5,
S_HALT = 3'd6
} state_t;
state_t state, next_state;
// ====================================================================
// Data memory — 256 bytes of flop RAM, word-addressed (4-byte stride).
// ====================================================================
logic [31:0] dmem [0:63]; // 64 words = 256 bytes
logic [31:0] dmem_rdata;
logic [5:0] dmem_addr; // word index = addr[7:2]
always_ff @(posedge clk or negedge rst_n) begin
integer di;
if (!rst_n) begin
for (di = 0; di < 64; di = di + 1) dmem[di] <= 32'h0;
dmem_rdata <= 32'h0;
end else begin
// Capture rdata on the same edge that we'd commit a write so LW
// gets fresh data the cycle after EXECUTE.
dmem_rdata <= dmem[dmem_addr];
if (state == S_EXECUTE && is_store && !mem_unsupported) begin
dmem[dmem_addr] <= op_b; // op_b holds rs2
end
end
end
// ====================================================================
// FSM next-state logic.
// ====================================================================
// IF — fetch IR from PROG[pc].
// ID — decode; latch op_a, op_b, dmem_addr.
// EX — ALU compute; for LOAD/STORE the ALU computes the effective
// address (op_a + imm). STORE issues its write here.
// MEM — wait one cycle so dmem_rdata can settle for LOAD.
// WB — write rd, advance PC.
// Halt detection: we declare "halted" on a JAL whose offset is zero,
// i.e. an unconditional jump-to-self (pc + imm_j == pc ⇒ imm_j == 0).
// With rd = x0 that is the encoding 0x0000006F, the conventional
// RISC-V "stuck loop = halted" marker. The S_WB arm of the next-state
// logic below performs the actual check (is_jal && imm == 0). We also
// halt on illegal instructions.
wire is_halt_loop = (ir == 32'h0000_006F); // JAL x0, 0; not referenced below
// PC update logic — combinational from current PC, latched in WB.
logic [31:0] next_pc;
always_comb begin
next_pc = pc + 32'd4; // default: sequential
if (is_jal) next_pc = pc + imm;
else if (is_jalr) next_pc = (op_a + imm) & ~32'h1;
else if (is_branch && branch_taken_comb)
next_pc = pc + imm;
end
always_comb begin
next_state = state;
unique case (state)
S_FETCH: next_state = S_DECODE;
S_DECODE: if (!is_legal || mem_unsupported) next_state = S_ILLEGAL;
else next_state = S_EXECUTE;
S_EXECUTE: next_state = S_MEM;
S_MEM: next_state = S_WB;
S_WB: if (is_jal && (imm == 32'd0)) next_state = S_HALT;
else next_state = S_FETCH;
S_ILLEGAL: next_state = S_HALT;
S_HALT: next_state = S_HALT;
default: next_state = S_FETCH;
endcase
end
// ====================================================================
// ALU operand muxing (combinational from latched op_a / op_b / imm).
// ====================================================================
always_comb begin
alu_a = op_a;
alu_b = op_b;
alu_op = decoded_alu_op;
if (is_lui) begin
alu_a = 32'h0;
alu_b = imm;
alu_op = ALU_ADD; // result = imm
end else if (is_auipc) begin
alu_a = pc;
alu_b = imm;
alu_op = ALU_ADD;
end else if (is_jal || is_jalr) begin
alu_a = pc;
alu_b = 32'd4;
alu_op = ALU_ADD; // rd = pc + 4
end else if (is_load || is_store) begin
alu_a = op_a;
alu_b = imm;
alu_op = ALU_ADD; // address = rs1 + imm
end else if (is_opimm) begin
alu_b = imm;
end
// OP / BRANCH leave alu_a/alu_b as op_a/op_b.
end
// ====================================================================
// Sequential state.
// ====================================================================
logic [31:0] alu_result_q; // latched ALU result for WB
always_ff @(posedge clk or negedge rst_n) begin
integer ri;
if (!rst_n) begin
state <= S_FETCH;
pc <= 32'h0;
ir <= 32'h0;
op_a <= 32'h0;
op_b <= 32'h0;
alu_result_q <= 32'h0;
dmem_addr <= 6'd0;
for (ri = 0; ri < 32; ri = ri + 1) regs[ri] <= 32'h0;
end else begin
state <= next_state;
unique case (state)
S_FETCH: begin
ir <= PROG[32 * pc[9:2] +: 32];
end
S_DECODE: begin
op_a <= reg_read(rs1);
op_b <= reg_read(rs2);
// For LW/SW the address is rs1 + imm; capture the word index
// here using the just-read rs1 so dmem_rdata settles by MEM.
dmem_addr <= (reg_read(rs1) + imm) >> 2;
end
S_EXECUTE: begin
// STORE write happens in the dmem always_ff above (see
// is_store branch). For everything else, latch the ALU
// output into alu_result_q.
alu_result_q <= alu_y;
end
S_MEM: begin
// LW data is now in dmem_rdata (latched at this edge from
// the previous EXECUTE cycle's dmem_addr).
if (is_load) alu_result_q <= dmem_rdata;
end
S_WB: begin
// Write rd if the instruction has one (everything except
// STORE / BRANCH / FENCE).
if (rd != 5'd0
&& !is_store && !is_branch && !is_fence) begin
regs[rd] <= alu_result_q;
end
pc <= next_pc;
end
S_ILLEGAL: ;
S_HALT: ;
default: ;
endcase
end
end
// ====================================================================
// Outputs.
// ====================================================================
assign pc_out = pc;
assign halted = (state == S_HALT);
// The regfile and dmem need observable taps to keep yosys from
// optimising them away during synthesis. dbg_reg_out exposes any one
// regfile entry chosen by the chip-pin selector; dmem_out exposes the
// first word of data memory. Without these the synth result has no
// path from the CPU's working state to a primary output and yosys
// strips out the regfile (1024 flops) and the dmem (2048 flops) as
// unreachable. Tying them through to chip pins forces the placer to
// realise them and the dataflow to stay in the netlist.
assign dbg_reg_out = regs[dbg_reg_sel];
assign dmem_out = dmem[0];
// start input reserved for future use (single-step / run gate).
wire _unused = &{1'b0, start};
endmodule
`default_nettype wire
Test programs
The verifying TB ships three programs as functions that build the PROG bit-vector. Each uses RV32I-format encoders so the assembly intent reads through clearly:
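localparam logic [256*32-1:0] PROG_ARITH = build_prog_arith();
function automatic logic [256*32-1:0] build_prog_arith;
logic [31:0] prog [0:255];
int i;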
for (i = 0; i < 256; i++) prog[i] = 32'h0;
prog[ 0] = ADDI(5'd1, 5'd0, 12'd7); // x1 = 7
prog[ 1] = ADDI(5'd2, 5'd0, 12'd5); // x2 = 5
prog[ 2] = ADD (5'd3, 5'd1, 5'd2);
prog[ 3] = SUB (5'd4, 5'd1, 5'd2);
prog[ 4] = AND_(5'd5, 5'd1, 5'd2);
prog[ 5] = OR_ (5'd6, 5'd1, 5'd2);
prog[ 6] = XOR_(5'd7, 5'd1, 5'd2);
prog[ 7] = SLLI(5'd8, 5'd1, 5'd1); // x8 = x1 << 1 = 14
prog[ 8] = SRLI(5'd9, 5'd1, 5'd1); // x9 = x1 >> 1 = 3
prog[ 9] = SLT_(5'd10, 5'd1, 5'd2); // x10 = (x1 < x2) = 0
prog[10] = ADDI(5'd11, 5'd0, -12'sd3); // x11 = -3 (sign-extended)
prog[11] = SRAI(5'd12, 5'd11, 5'd1); // x12 = -3 >>> 1 = -2 = 0xFFFF_FFFE
prog[12] = HLT();
// Pack into the bit-vector. PROG[32*i +: 32] = prog[i].
build_prog_arith = '0;
for (i = 0; i < 256; i++) begin
build_prog_arith[32*i +: 32] = prog[i];
end
endfunction
logic [31:0] dut1_pc; logic dut1_halted;
logic [31:0] dut1_reg, dut1_dmem;
top #(.PROG(PROG_ARITH)) dut1 (
.clk(clk), .rst_n(rst1_n), .start(start),
.dbg_reg_sel(5'd0),
.pc_out(dut1_pc), .halted(dut1_halted),
.dbg_reg_out(dut1_reg), .dmem_out(dut1_dmem)
);
// ========== Program 2: branch ==========
// Decrement x1 from 5 down to 0, counting iterations in x2.
// x1 = 5; x2 = 0; x3 = 1; x4 = 0;
// loop:
// x1 = x1 - x3; # decrement
// x2 = x2 + x3; # tally
// bne x1, x4, loop # if x1 != 0, branch back
// halt
// After: x1=0, x2=5.
localparam logic [256*32-1:0] PROG_BRANCH = build_prog_branch();
function automatic logic [256*32-1:0] build_prog_branch;
logic [31:0] prog [0:255];
int i;
for (i = 0; i < 256; i++) prog[i] = 32'h0;
prog[ 0] = ADDI(5'd1, 5'd0, 12'd5); // x1 = 5
prog[ 1] = ADDI(5'd2, 5'd0, 12'd0); // x2 = 0
prog[ 2] = ADDI(5'd3, 5'd0, 12'd1); // x3 = 1
prog[ 3] = ADDI(5'd4, 5'd0, 12'd0); // x4 = 0
// loop label = address 16 (4 instructions × 4 bytes)
prog[ 4] = SUB (5'd1, 5'd1, 5'd3); // x1 -= 1
prog[ 5] = ADD (5'd2, 5'd2, 5'd3); // x2 += 1
prog[ 6] = BNE (5'd1, 5'd4, -13'sd8); // pc -= 8 = back to addr 16
prog[ 7] = HLT();
build_prog_branch = '0;
for (i = 0; i < 256; i++) build_prog_branch[32*i +: 32] = prog[i];
endfunction
logic [31:0] dut2_pc; logic dut2_halted;
logic [31:0] dut2_reg, dut2_dmem;
top #(.PROG(PROG_BRANCH)) dut2 (
.clk(clk), .rst_n(rst2_n), .start(start),
.dbg_reg_sel(5'd0),
.pc_out(dut2_pc), .halted(dut2_halted),
.dbg_reg_out(dut2_reg), .dmem_out(dut2_dmem)
);
// ========== Program 3: Fibonacci(10) ==========
// Compute fib(10) = 55 in x10.
// x1 = 0; # a
// x2 = 1; # b
// x3 = 10; # n
// x4 = 0; # i
// x5 = 1; # const 1
// loop:
// t = x1 + x2;
// x1 = x2;
// x2 = t;
// x4 = x4 + 1;
// blt x4, x3, loop
Demo
The demo runs Fibonacci(10) and prints each FETCH cycle with the program counter, the hex instruction, a short disassembly, and the key registers:
[cpu] -- librelane-playground / project 09 / RV32I-min --
[cpu] multi-cycle 5-stage FSM (IF/ID/EX/MEM/WB), 32 x 32-bit regfile
[cpu] program: Fibonacci(10) -> x10 = 55
[cpu] pc=0x00000000 ir=00000093 addi x1, x0, 0 | x1=0 x2=0 x4=0 x10=0
[cpu] pc=0x00000004 ir=00100113 addi x2, x0, 1 | x1=0 x2=0 x4=0 x10=0
[cpu] pc=0x00000008 ir=00a00193 addi x3, x0, 10 | x1=0 x2=1 x4=0 x10=0
[cpu] pc=0x00000014 ir=00208333 add x6, x1, x2 | x1=0 x2=1 x4=0 x10=0
[cpu] pc=0x00000018 ir=00010093 addi x1, x2, 0 | x1=0 x2=1 x4=0 x10=0
[cpu] pc=0x0000001c ir=00030113 addi x2, x6, 0 | x1=1 x2=1 x4=0 x10=0
[cpu] pc=0x00000024 ir=fe3248e3 blt x4, x3, -16 | x1=1 x2=1 x4=1 x10=0
... (loops back to pc=0x14, ten iterations)
[cpu] pc=0x00000028 ir=00008513 addi x10, x1, 0 | x1=55 x2=89 x4=10 x10=0
[cpu] pc=0x0000002c ir=0000006f halt (jal x0,0) | x1=55 x2=89 x4=10 x10=55
[cpu] halted at pc=0x0000002c, x10 = 55
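For cross-checking a trace like this without a simulator, a very small instruction-level model is enough. The sketch below is an assumption-laden aid and not part of the project: it models only the five instructions the default boot program in the RTL listing uses (ADDI, ADD, BLT, SW, JAL) and runs the 13 words of that `PROG`. The trace above comes from the testbench's Fibonacci program, which halts at 0x2c. The RTL default additionally stores x10 to dmem[0] before halting, so this model stops one instruction later, at 0x30.

```python
MASK = 0xFFFFFFFF

# The 13 words of the default PROG from the RTL listing, lowest address first.
FIB_PROG = [
    0x00000093, 0x00100113, 0x00A00193, 0x00000213, 0x00100293,
    0x00208333, 0x00010093, 0x00030113, 0x00520233, 0xFE3248E3,
    0x00008513, 0x00A02023, 0x0000006F,
]


def sext(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def run(prog, max_instrs=10_000):
    """Run until a JAL with zero offset; return (regs, dmem, pc, executed)."""
    regs, dmem, pc, executed = [0] * 32, [0] * 64, 0, 0
    while executed < max_instrs:
        ir = prog[pc >> 2]
        executed += 1
        opcode, rd = ir & 0x7F, (ir >> 7) & 0x1F
        funct3 = (ir >> 12) & 0x7
        a, b = regs[(ir >> 15) & 0x1F], regs[(ir >> 20) & 0x1F]
        next_pc, result = (pc + 4) & MASK, None
        if opcode == 0x13 and funct3 == 0:                      # ADDI
            result = (a + sext(ir >> 20, 12)) & MASK
        elif opcode == 0x33 and funct3 == 0 and ir >> 25 == 0:  # ADD
            result = (a + b) & MASK
        elif opcode == 0x63 and funct3 == 4:                    # BLT (signed)
            imm = sext(((ir >> 31) << 12) | (((ir >> 7) & 1) << 11)
                       | (((ir >> 25) & 0x3F) << 5) | (((ir >> 8) & 0xF) << 1), 13)
            if sext(a, 32) < sext(b, 32):
                next_pc = (pc + imm) & MASK
        elif opcode == 0x23 and funct3 == 2:                    # SW
            imm = sext(((ir >> 25) << 5) | rd, 12)
            dmem[((a + imm) >> 2) & 0x3F] = b
        elif opcode == 0x6F:                                    # JAL
            imm = sext(((ir >> 31) << 20) | (((ir >> 12) & 0xFF) << 12)
                       | (((ir >> 20) & 1) << 11) | (((ir >> 21) & 0x3FF) << 1), 21)
            if imm == 0:
                return regs, dmem, pc, executed                 # jump-to-self: halted
            next_pc, result = (pc + imm) & MASK, next_pc        # link value is pc + 4
        else:
            raise NotImplementedError(f"unsupported instruction {ir:08x} at pc={pc:#x}")
        if result is not None and rd != 0:                      # x0 stays zero
            regs[rd] = result
        pc = next_pc
    raise RuntimeError("instruction budget exhausted without reaching the halt loop")


regs, dmem, pc, executed = run(FIB_PROG)
print(f"x10={regs[10]} dmem[0]={dmem[0]} pc={pc:#x} instructions={executed}")
# x10=55 dmem[0]=55 pc=0x30 instructions=58
```

Because every instruction walks the five FSM states, multiplying the instruction count by five gives a rough cycle estimate (290 for this program). That figure is derived from the FSM description, not measured on the design.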
Compiling C — gcc-riscv-elf into PROG[]
Hand-written assembly is fine for an educational chip but it stops
being interesting after the first addi. The whole point of
implementing an existing ISA (rather than ours-by-convenience) is
that there’s a real toolchain that targets it. P09 plus
tools/riscv-asm/ is the smallest end-to-end RISC-V toolchain
flow that fits on this site.
The pipeline:
example.c -- the program
+ start.S (boot stub: sp=0x100, jal main, halt)
+ p09.ld (linker script: .text at 0x0, 1 KB hard cap)
│
▼ riscv64-elf-gcc -march=rv32i -mabi=ilp32 -nostdlib …
build/example.elf
│
▼ riscv64-elf-objcopy -O binary
build/example.bin
│
▼ uv run bin_to_prog.py
build/example.svh ── localparam logic [256*32-1:0] PROG_FROM_C = { … };
│
▼ testbench `include-s the .svh and overrides PROG
iverilog tb_c.sv ../src/top.sv
│
▼
the chip runs the program.
For examples/fib.c (a real C unsigned int fib(unsigned int n)),
gcc -Os produces 11 instructions of inner loop plus a 4-instruction
boot stub. The chip executes them in 301 cycles and stores 55 (=
fib(10)) at dmem[0]. The strict make c-test testbench reads
dmem[0] after halt and asserts it’s 55.
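The binary-to-`PROG` step is the only non-standard link in that chain. The real `bin_to_prog.py` is not reproduced here, so the following is a hypothetical sketch of what such a converter could look like, assuming little-endian 32-bit words and the packing the RTL uses (`PROG[32*i +: 32]` is word `i`, so word 0 must be the rightmost entry of the concatenation). The function name and its `name` parameter are illustrative.

```python
import struct

ROM_WORDS = 256  # 256 x 32-bit instructions = 1 KB, as in the RTL's PROG parameter


def bin_to_prog_literal(data: bytes, name: str = "PROG_FROM_C") -> str:
    """Render a flat binary as a SystemVerilog localparam concatenation."""
    if len(data) > ROM_WORDS * 4:
        raise ValueError(f"program is {len(data)} bytes; ROM holds {ROM_WORDS * 4}")
    data += b"\x00" * (-len(data) % 4)  # pad to a whole word
    words = [w for (w,) in struct.iter_unpack("<I", data)]

    entries = []
    pad = ROM_WORDS - len(words)
    if pad:
        # Unused high words are zero-filled with a replication, like the RTL default.
        entries.append(f"  {{{pad}{{32'h0000_0000}}}}")
    # Highest word first: the leftmost concatenation entry is the MSB end.
    for index in reversed(range(len(words))):
        entries.append(f"  32'h{words[index]:08x}")

    header = f"localparam logic [{ROM_WORDS}*32-1:0] {name} = {{"
    return "\n".join([header, ",\n".join(entries), "};"])


# Two instructions: ADDI x1, x0, 0 and ADDI x2, x0, 1, stored little-endian.
print(bin_to_prog_literal(bytes.fromhex("9300000013011000")))
```

Getting the order wrong is the likely failure mode: emitting word 0 first would place it at bits 8191:8160, and the core would fetch zeros from address 0.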
A few sharp edges that surfaced on the way:
- gcc treats writes to address 0 as undefined behaviour and replaces them with __builtin_trap() (= ebreak), which P09 doesn’t implement. -fno-delete-null-pointer-checks disables the optimization. P09’s address 0 is dmem[0], a real, valid memory location, and the C code legitimately stores there.
- 256 instructions of code is a hard ceiling. The linker script has an ASSERT(_text_size <= 0x400, …) that aborts the link with a clear message if a program overflows the ROM. There’s no “spill to RAM”; the chip’s fetch unit reads from PROG[] exclusively.
- gcc’s stack-frame conventions (sp=0x100, alignment to 16 bytes) share an address space with the C int * writes. Anything you store via *(int*)X for X < 0x100 is fighting the stack. For the fib example the stored result lives at address 0 (dmem[0]), well below sp; the function uses no stack memory of its own.
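The stack-versus-pointer overlap is easier to reason about with the address mapping written out. A minimal sketch, assuming the `dmem_addr <= (reg_read(rs1) + imm) >> 2` assignment and the 6-bit `dmem_addr` width from the RTL listing; the function name is hypothetical.

```python
DMEM_WORDS = 64  # 64 x 32-bit words = 256 bytes of data RAM


def dmem_word_index(byte_addr: int) -> int:
    """Word slot the core selects: (rs1 + imm) >> 2, truncated to 6 bits."""
    return (byte_addr >> 2) % DMEM_WORDS


print(dmem_word_index(0x00))   # 0  -> where fib stores its result
print(dmem_word_index(0xF0))   # 60 -> bottom of a 16-byte frame below sp=0x100
print(dmem_word_index(0xFC))   # 63 -> last word of the RAM
print(dmem_word_index(0x100))  # 0  -> wraps: aliases dmem[0]
print(dmem_word_index(0x02))   # 0  -> low two address bits are dropped
```

Under these assumptions a store through the initial stack pointer itself (byte address 0x100) would wrap onto dmem[0], and a misaligned address selects the enclosing word instead of faulting, so both cases fail silently in this design.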
tools/riscv-asm/README.md has the full reference. Adding a new
example is one C file, one make EXAMPLE=name, and one tweak to
tb_c.sv’s expected value.
What’s next
- Tape-out. Project 10 (Tiny Tapeout) fits one of the designs on this ladder into TT’s pin-frame harness for an actual fab submission. P09 itself is too big for a TT tile, but a stripped-down variant (or P11’s wrapped-P06 version) is the kind of thing that would go.
- Bigger programs. 1 KB of ROM caps how interesting the programs can get. A future project could add an external SPI ROM and a tiny instruction-fetch state machine that loads the program on boot — same shape as how a real microcontroller boots from flash.
See also
- Project 06 → the original 8-bit CPU this scales up from.
- Project 08 → macro-aware harden pattern; P09 currently uses flop-based memories but a future iteration with an SRAM macro for instruction memory would reuse the same flow config.
- Project README