RV32E with SPI-flash boot · librelane-playground

P13 made the chip’s instruction memory writable: the host could stream programs over UART while load_mode was high. But the loaded program lived in flops, so it vanished as soon as power dropped. P14 keeps the writable imem and adds autonomous boot from external SPI NOR flash. On power-on, an internal boot controller copies the program out of the flash chip on the dev board into imem, then releases the CPU to run it.

Status: hardened. The strict testbench passes: spi_boot issues a JEDEC READ, the behavioural flash model streams 128 bytes back, and the chip packs them into imem as 32 little-endian RV32 words before releasing the CPU. The loaded program prints P14\n over the UART and halts. LibreLane run RUN_2026-04-29_12-48-45 completed with final GDS, +0.98 ns worst setup slack, clean DRC, clean LVS, and clean antenna.

What changed from P13

P13’s load_mode pin (was ui_in[1]) and uart_rx pin (was ui_in[0]) are gone — replaced by an SPI master that talks to an external flash chip. The pin frame stays exactly the same shape; only what each bit means changes:

pin	P13	P14
`ui_in[0]`	uart_rx	unused (tied off)
`ui_in[1]`	load_mode	unused
`ui_in[7:2]`	baud_div[5:0]	unused
`uio_in[3]`	(input, unused)	SPI MISO
`uio_out[0]`	(tied 0)	SPI SCK
`uio_out[1]`	(tied 0)	SPI CS_n
`uio_out[2]`	(tied 0)	SPI MOSI
`uio_oe`	`0x00` (input-only)	`0x07` (drive SCK/CS/MOSI)
`uo_out[2]`	`imem_loaded`	`boot_done`

Baud rate moves from configurable to hardcoded — 115200 at 50 MHz — which frees every input pin. The boot_done signal is the analog of P13’s imem_loaded: it goes high once the chip’s internal boot controller has filled imem and is releasing the CPU.
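As a quick cross-check of the table, the `0x07` output-enable value follows directly from the three pins the chip drives. This is a minimal sketch; the constant and function names are hypothetical and only the bit positions come from the table above.

```python
# Bit positions taken from the pin table; the names are illustrative only.
UIO_SCK_BIT = 0    # uio_out[0], SPI SCK
UIO_CS_N_BIT = 1   # uio_out[1], SPI CS_n
UIO_MOSI_BIT = 2   # uio_out[2], SPI MOSI
UIO_MISO_BIT = 3   # uio_in[3], SPI MISO (stays an input)


def output_enable_mask(output_bits):
    """Return an 8-bit mask with a 1 for every uio pin the chip drives."""
    mask = 0
    for bit in output_bits:
        mask |= 1 << bit
    return mask & 0xFF


P14_UIO_OE = output_enable_mask([UIO_SCK_BIT, UIO_CS_N_BIT, UIO_MOSI_BIT])
```

MISO is deliberately left out of the list, so its enable bit stays 0 and the pin remains an input.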

The boot sequence

              chip reset
                  │
                  ▼
        ┌─────────────────────┐
        │ spi_boot.B_IDLE     │  cs_n = 0, load shift_out = 0x03000000
        │  → B_CMD            │
        └─────────────────────┘
                  │ 32 SCK cycles
                  ▼
        ┌─────────────────────┐
        │ shift out READ      │  send 0x03 + 24-bit address 0
        │ command + addr      │
        └─────────────────────┘
                  │ flash starts streaming
                  ▼
        ┌─────────────────────┐
        │ B_DATA              │  for byte=0..127:
        │   sample MISO MSB-first      pack 4 bytes into 32-bit
        │   on each SCK rising         little-endian word
        │                              write imem[byte/4] = word
        └─────────────────────┘
                  │ 1024 SCK cycles later
                  ▼
        ┌─────────────────────┐
        │ B_DONE              │  cs_n = 1, sck = 0
        │  boot_done = 1      │  CPU released, fetches imem[0]
        └─────────────────────┘
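
The sequence in the diagram can be sketched as a small behavioural model. This is an illustration written for this page, not the project's testbench: the function names are hypothetical, and it models only bit order and word packing, not SCK timing.

```python
READ_OPCODE = 0x03  # JEDEC READ, the opcode spi_boot sends


def read_frame_bits(address=0):
    """Return the 32 MOSI bits, MSB first: 8-bit opcode + 24-bit address."""
    frame = (READ_OPCODE << 24) | (address & 0xFFFFFF)
    return [(frame >> i) & 1 for i in range(31, -1, -1)]


def miso_bits(flash_image, n_bytes):
    """Yield the bits the flash streams back, MSB of each byte first."""
    for byte in flash_image[:n_bytes]:
        for i in range(7, -1, -1):
            yield (byte >> i) & 1


def boot_load(flash_image, n_bytes=128):
    """Model B_DATA: pack every 4 received bytes into one little-endian word."""
    imem = []
    shift_in = 0   # 8-bit receive shifter
    word_buf = 0   # bytes collected so far for the current word
    byte_idx = 0   # 0..3 within the current word
    for n, bit in enumerate(miso_bits(flash_image, n_bytes)):
        shift_in = ((shift_in << 1) | bit) & 0xFF
        if n % 8 != 7:
            continue                      # byte not complete yet
        word_buf |= shift_in << (8 * byte_idx)
        if byte_idx == 3:
            imem.append(word_buf)         # corresponds to one imem write
            word_buf = 0
            byte_idx = 0
        else:
            byte_idx += 1
    return imem
```

Feeding it a 128-byte image returns 32 words, with the first byte off the wire landing in bits [7:0] of word 0.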

The SPI clock divider is hardcoded at chip-clk / 4, which gives a 12.5 MHz SCK at the 50 MHz user clock. That is comfortably below the ~50 MHz limit that common SPI flash parts (W25Q-series, MX25-series, etc.) place on the plain 0x03 READ. The whole boot is 32 command/address bits plus 1024 data bits: 1056 SPI cycles × 4 = 4,224 chip clocks, or ~84 µs at 50 MHz.
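
The boot-time arithmetic can be reproduced in a few lines. The constants come from the text; the function name and the returned dictionary keys are made up for this sketch.

```python
CHIP_CLK_HZ = 50_000_000   # user clock assumed in the text
SCK_DIVIDER = 4            # chip clocks per SCK cycle
CMD_ADDR_BITS = 8 + 24     # READ opcode + 24-bit address
DATA_BITS = 128 * 8        # 128 program bytes


def boot_timing(clk_hz=CHIP_CLK_HZ, divider=SCK_DIVIDER):
    """Return SCK frequency, cycle counts and boot time for one SPI boot."""
    sck_cycles = CMD_ADDR_BITS + DATA_BITS
    chip_clocks = sck_cycles * divider
    return {
        "sck_hz": clk_hz / divider,
        "sck_cycles": sck_cycles,
        "chip_clocks": chip_clocks,
        "boot_us": chip_clocks / clk_hz * 1e6,
    }


timing = boot_timing()
```

Changing `clk_hz` or `divider` shows how the boot time scales if the user clock or the divider were different.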

Harden result

Run directory: projects/14_rv32e_flash_boot/librelane/runs/RUN_2026-04-29_12-48-45

Final GDS: projects/14_rv32e_flash_boot/librelane/runs/RUN_2026-04-29_12-48-45/final/gds/tt_um_librelane_p14_rv32e_flash.gds

Metrics file: projects/14_rv32e_flash_boot/librelane/runs/RUN_2026-04-29_12-48-45/final/metrics.csv

metric	value
Die area	290250 µm²
Core area	271505 µm²
Standard cells	15289
Sequential cells	2062
Worst setup slack	+0.98 ns
Worst hold slack	+0.11 ns
Magic / KLayout DRC	PASS / PASS
LVS	PASS
Antenna	PASS

The run still reports max-slew, max-fanout, and max-cap warnings in the metrics, similar to P13. The harden flow completed and emitted final views; the warnings are part of the result, not a pretend-clean signoff.

Why JEDEC READ (`0x03`) and not a fancier opcode

SPI flash chips support a zoo of read commands — 0x0B (FAST_READ with a dummy byte for higher-clock operation), 0x6B (quad-output read using 4 data lines), 0xEB (quad I/O with quad addressing). We use 0x03 (the basic READ) because:

It’s universal — every SPI flash from every vendor implements it.
We have one MISO line, not four. Quad-mode needs 4 of the uio pins as data, which would mean giving up MOSI’s pin function and redefining the bus protocol on every cycle. Educational chip, not worth the complexity.
0x03 maxes out at ~50 MHz on most parts. Our SCK is 12.5 MHz — far below the limit.

The cost: at 12.5 MHz SCK, clocking in the 1024 data bits of the 128-byte image takes ~82 µs. A quad-output read at 80 MHz would move the same data in ~3.2 µs. For a microcontroller booting once at power-on, the extra ~80 µs is invisible.
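
Both figures cover the data phase only. A rough sketch of that comparison, ignoring the command, address and dummy cycles (the function name is hypothetical):

```python
def data_phase_us(n_bytes, sck_hz, data_lines):
    """Microseconds needed to clock n_bytes in over data_lines parallel lines."""
    bits = n_bytes * 8
    sck_cycles = -(-bits // data_lines)   # ceiling division
    return sck_cycles / sck_hz * 1e6


single_line_us = data_phase_us(128, 12_500_000, data_lines=1)
quad_line_us = data_phase_us(128, 80_000_000, data_lines=4)
```

The gap comes from two independent factors: a 6.4× faster clock and 4× as many data lines.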

ISA scope

This is an RV32E-shaped educational core, not a compliance-proven RISC-V implementation. Official RISC-V architectural compliance tests: NOT RUN. A local, compliance-shaped smoke subset (the ISA smoke addendum) exists and passes.

Supported instructions: LUI, AUIPC, JAL, JALR, BEQ, BNE, BLT, BGE, BLTU, BGEU, LW, SW, ADDI, SLTI, SLTIU, XORI, ORI, ANDI, SLLI, SRLI, SRAI, ADD, SUB, SLL, SLT, SLTU, XOR, SRL, SRA, OR, AND, and FENCE as a no-op. Registers are RV32E style: x0..x15; reads of x16..x31 return zero and writes to them are ignored.
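
The register rule can be modelled in a few lines. This is a behavioural sketch only; the class name is hypothetical and it says nothing about how the RTL implements the register file.

```python
class Rv32eRegFile:
    """16 architectural registers; x0 reads zero, x16..x31 behave as absent."""

    def __init__(self):
        self._regs = [0] * 16

    def read(self, idx):
        if not 0 <= idx < 32:
            raise ValueError("register index must be 0..31")
        if idx == 0 or idx >= 16:
            return 0                      # x0 and x16..x31 read as zero
        return self._regs[idx]

    def write(self, idx, value):
        if not 0 <= idx < 32:
            raise ValueError("register index must be 0..31")
        if idx == 0 or idx >= 16:
            return                        # writes are silently ignored
        self._regs[idx] = value & 0xFFFFFFFF
```

A program that writes x16 and reads it back sees zero, which is the behaviour the `rv32e/fence` smoke test checks.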

Unsupported: byte/halfword loads and stores, misalignment traps, exceptions, interrupts, CSRs, ECALL, EBREAK, multiply/divide, atomics, compressed instructions, privilege modes, and any compliance claim beyond the project testbench.

ISA smoke addendum

test/tb_isa.sv is the first “how close are we?” pass. It does not use the official RISC-V architectural-test signature protocol. Instead it stays honest to P14’s actual chip interface: every case builds a 32-word SPI flash image, lets the real spi_boot loader fill imem, runs the CPU until it halts, and checks the exposed R5[4:0] pass code on uo_out[7:3].
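
A sketch of the two host-side steps in that flow: building the 128-byte image and decoding the pass code. The helper names are hypothetical, and padding short programs with NOPs is an assumption of this sketch, not something the testbench is known to do.

```python
import struct

NOP = 0x00000013  # addi x0, x0, 0; used here only as an assumed filler


def build_flash_image(words, n_words=32):
    """Pack up to n_words 32-bit instructions as little-endian flash bytes."""
    if len(words) > n_words:
        raise ValueError("program does not fit in imem")
    padded = list(words) + [NOP] * (n_words - len(words))
    return struct.pack("<%dI" % n_words, *padded)


def pass_code(uo_out):
    """Extract the 5-bit pass code exposed on uo_out[7:3]."""
    return (uo_out >> 3) & 0x1F


image = build_flash_image([0x00100293])   # addi x5, x0, 1
```

The resulting bytes are in the order the flash would stream them from address 0.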

Run:

make -C projects/14_rv32e_flash_boot/test isa

Result: PASS — 9 programs, 0 errors.

test	instructions exercised
`add/sub/logic`	`ADDI`, `ADD`, `SUB`, `AND`, `OR`, `XOR`, `BNE`
`imm-logic`	`ANDI`, `ORI`, `XORI`
`shifts`	`SLLI`, `SRLI`, `SRAI`, `SLL`, `SRA`
`slt/sltu`	`SLT`, `SLTU`, `SLTI`, `SLTIU`
`branches`	`BEQ`, `BNE`, `BLT`, `BGE`, `BLTU`, `BGEU`
`lui/auipc`	`LUI`, `AUIPC`
`jal/jalr`	`JAL`, `JALR`, including low-bit target clearing
`lw/sw`	`LW`, `SW` against P14’s 8-word dmem
`rv32e/fence`	`x0`, ignored `x16`, read-zero `x16`, `FENCE` as NOP

What still blocks a real compliance claim: signature-memory export, sub-word loads/stores, trap/exception behavior, CSRs and system instructions, and enough program/data memory to run official tests without cutting them into tiny fragments.

Official arch-test probe

Then we tried the less flattering thing: build the official riscv-arch-test RV32I/I unprivileged integer files and classify them against P14’s actual limits.

Probe command:

scripts/p14_arch_test_probe.py

Result using upstream riscv-arch-test revision a7c9930:

result	count
Official RV32I/I tests built	39 / 39
Runnable on P14 unmodified	0 / 39
Official tests passed on P14	0
Official tests failed on P14	0
Official tests marked `NOT RUN`	39

That is not a disguised FAIL; it is a real NOT RUN. The smallest official image, I-nop-00.S, still builds to 632 instruction words plus 1384 bytes of data/signature sections. P14 has 32 instruction words, 8 zeroed data words, no data preload path, and no signature export. The official RV32I framework also initializes and uses x16..x31, which P14 intentionally treats as absent RV32E registers.
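
The classification logic implied by those limits can be sketched as below. This is not the contents of scripts/p14_arch_test_probe.py; the names and the exact set of checks are assumptions made for illustration, using only the limits stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class P14Limits:
    imem_words: int = 32
    dmem_words: int = 8
    has_data_preload: bool = False
    has_signature_export: bool = False


def blockers(text_words, data_bytes, uses_upper_regs, limits=P14Limits()):
    """List the reasons a test image cannot run unmodified; empty means runnable."""
    reasons = []
    if text_words > limits.imem_words:
        reasons.append("text exceeds imem")
    if data_bytes > limits.dmem_words * 4:
        reasons.append("data exceeds dmem")
    if data_bytes > 0 and not limits.has_data_preload:
        reasons.append("no data preload path")
    if not limits.has_signature_export:
        reasons.append("no signature export")
    if uses_upper_regs:
        reasons.append("uses x16..x31")
    return reasons


# Figures quoted above for the smallest official image, I-nop-00.S.
nop_test_blockers = blockers(text_words=632, data_bytes=1384, uses_upper_regs=True)
```

Any non-empty result maps to NOT RUN rather than FAIL, since the test never gets a chance to execute.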

The probe CSV is tracked at projects/14_rv32e_flash_boot/compliance_probe/rv32i_I_probe.csv. That file is the useful artifact for future rungs: it tells us what has to change before official tests can become executable rather than just buildable.

Source

The whole spi_boot module is ~110 lines:

projects/14_rv32e_flash_boot/src/top.sv system-verilog · L392-528

    input  logic        miso,
    output logic        sck,
    output logic        cs_n,
    output logic        mosi,
    // imem write port.
    output logic        imem_we,
    output logic [4:0]  imem_waddr,
    output logic [31:0] imem_wdata,
    output logic        boot_done
);

  // SPI clock divider. 2-bit counter rolls every 4 chip cycles, so
  // SCK toggles every 2 chip cycles → SPI clock = clk / 4.
  logic [1:0] tick;

  // We drive SCK from `tick`. The high half of the cycle is the rising
  // edge (sample MISO); the falling edge (change MOSI) happens at the
  // tick=0 boundary.
  wire sck_rising  = (tick == 2'd1);     // about to make SCK go high
  wire sck_falling = (tick == 2'd3);     // about to make SCK go low

  typedef enum logic [2:0] {
    B_IDLE   = 3'd0,
    B_CMD    = 3'd1,    // shift out the 0x03 READ command
    B_ADDR   = 3'd2,    // shift out the 24-bit start address
    B_DATA   = 3'd3,    // shift in the 128 program bytes
    B_DONE   = 3'd4
  } bstate_t;
  bstate_t bstate;

  // Bit counter used in CMD/ADDR/DATA states. Sized to count up to 1024
  // (128 bytes × 8 bits + the 32 cmd+addr bits before that).
  logic [10:0] bit_idx;

  // 32-bit shift register holding the cmd+addr to clock out, plus a
  // separate 8-bit register that assembles each incoming data byte.
  logic [31:0] shift_out;
  logic [7:0]  shift_in;

  // Word-assembly state: 4 bytes per word, little-endian on the wire
  // (bytes[0] arrives first and goes into bits [7:0] of the imem word).
  logic [4:0]  word_idx;        // imem index 0..31
  logic [1:0]  byte_idx;        // 0..3 within the current word
  logic [23:0] word_buf;        // accumulated bytes 0..2

  // SCK output. Idle low; while a transaction is active SCK toggles
  // following `tick` such that the high phase is the second half of
  // the SPI cycle.
  logic sck_q;

  // Flop the SPI-master IOs.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      bstate     <= B_IDLE;
      tick       <= 2'd0;
      bit_idx    <= 11'd0;
      shift_out  <= 32'h0;
      shift_in   <= 8'h0;
      word_idx   <= 5'd0;
      byte_idx   <= 2'd0;
      word_buf   <= 24'h0;
      sck_q      <= 1'b0;
      cs_n       <= 1'b1;
      mosi       <= 1'b0;
      imem_we    <= 1'b0;
      imem_waddr <= 5'd0;
      imem_wdata <= 32'h0;
      boot_done  <= 1'b0;
    end else begin
      // Default writes — overridden inside state branches.
      imem_we <= 1'b0;
      tick    <= tick + 2'd1;

      unique case (bstate)
        B_IDLE: begin
          // First cycle out of reset: assert CS_n low, load the
          // command + address shifter, advance to B_CMD.
          cs_n      <= 1'b0;
          shift_out <= {8'h03, 24'h000000};      // READ + addr 0
          bit_idx   <= 11'd31;                    // 32 bits to send
          mosi      <= 1'b0;
          tick      <= 2'd0;                      // restart cycle
          bstate    <= B_CMD;
        end

        B_CMD, B_ADDR: begin
          // CMD and ADDR are the same shift mechanic: 32 bits total,
          // MSB first. MOSI updates and the shifter advances on SCK
          // falling; the bit counter decrements on SCK rising.
          if (sck_falling) begin
            mosi      <= shift_out[31];
            shift_out <= {shift_out[30:0], 1'b0};
          end
          if (sck_rising) begin
            sck_q <= 1'b1;
            if (bit_idx == 11'd0) begin
              // Done with 32-bit cmd+addr; switch to data.
              bit_idx  <= 11'd1023;               // 128 bytes × 8 = 1024 bits
              byte_idx <= 2'd0;
              bstate   <= B_DATA;
            end else begin
              bit_idx <= bit_idx - 11'd1;
            end
          end
          if (sck_falling) begin
            sck_q <= 1'b0;
          end
        end

        B_DATA: begin
          // Shift in MISO MSB-first. After every 8 sampled bits, we
          // have one byte; pack 4 bytes into one imem word.
          if (sck_rising) begin
            sck_q    <= 1'b1;
            shift_in <= {shift_in[6:0], miso};
            if (bit_idx[2:0] == 3'b000) begin
              // We've just shifted the 8th bit of a byte — assemble.
              unique case (byte_idx)
                2'd0: word_buf[7:0]   <= {shift_in[6:0], miso};
                2'd1: word_buf[15:8]  <= {shift_in[6:0], miso};
                2'd2: word_buf[23:16] <= {shift_in[6:0], miso};
                2'd3: begin
                  imem_we    <= 1'b1;
                  imem_waddr <= word_idx;
                  imem_wdata <= {{shift_in[6:0], miso}, word_buf};
                end
              endcase
              if (byte_idx == 2'd3) begin
                byte_idx <= 2'd0;
                word_idx <= word_idx + 5'd1;
              end else begin
                byte_idx <= byte_idx + 2'd1;
              end
            end
            if (bit_idx == 11'd0) begin
              bstate <= B_DONE;
            end else begin
              bit_idx <= bit_idx - 11'd1;

Comparing with P12 and P13

	P12	P13	P14 (this)
imem	combinational ROM	flop array	flop array
Reprogrammable post-fab?	no	yes (UART, every boot)	yes (flash, persistent)
Host required to run?	no	yes — UART loader	no
First-byte-out time after rst_n	~tens of cycles (boot prog)	host-dependent	~84 µs (after spi_boot)
Pin frame	TT 8×2	TT 8×2	TT 8×2

P13 was the smallest microcontroller you could plug into a host computer. P14 is the smallest microcontroller you can plug into only a power supply plus a SPI flash chip on the breadboard. Same chip shape, different relationship with the world around it.

What just happened?

We built the persistence layer. P13 had writable imem but the program went away on power loss. P14 keeps writable imem and adds an internal boot controller that pulls the program out of an external flash chip on every reset. The fabricated chip is now a real microcontroller in the embedded sense: power it up and it runs.

This is the last TT-shippable rung on the ladder. After P14 we leave the Tiny Tapeout shuttle (caps out at ~16k gates / ~2k flops in 8×2) and start building toward something that fits in a custom-die sky130 submission with multiple SRAM macros, real interrupts, and eventually a full RV64GC core capable of booting an OS. See the roadmap for the rest of the climb.
