P13 made the chip’s instruction memory
writable — the host could stream programs over UART while load_mode
was high. But the loaded program lived in flops: drop power and it’s
gone. P14 keeps the writable imem and adds autonomous boot from
external SPI NOR flash. Power-on, magic happens, the CPU is running
whatever’s been programmed into the flash chip on the dev board.
Status: hardened. Strict tb passes:
spi_boot issues a JEDEC READ, the behavioural flash model streams 128 bytes back, the chip packs them into imem as 32 little-endian rv32 words, then releases the CPU. The loaded program prints P14\n over the UART and halts. LibreLane run RUN_2026-04-29_12-48-45 completed with final GDS, +0.98 ns worst setup slack, clean DRC, clean LVS, and clean antenna.
What changed from P13
P13’s load_mode pin (was ui_in[1]) and uart_rx pin (was
ui_in[0]) are gone — replaced by an SPI master that talks to an
external flash chip. The pin frame stays exactly the same shape;
only what each bit means changes:
| pin | P13 | P14 |
|---|---|---|
| ui_in[0] | uart_rx | unused (tied off) |
| ui_in[1] | load_mode | unused |
| ui_in[7:2] | baud_div[5:0] | unused |
| uio_in[3] | (input, unused) | SPI MISO |
| uio_out[0] | (tied 0) | SPI SCK |
| uio_out[1] | (tied 0) | SPI CS_n |
| uio_out[2] | (tied 0) | SPI MOSI |
| uio_oe | 0x00 (input-only) | 0x07 (drive SCK/CS/MOSI) |
| uo_out[2] | imem_loaded | boot_done |
Baud rate moves from configurable to hardcoded — 115200 at 50 MHz —
which frees every input pin. The boot_done signal is the analog of
P13’s imem_loaded: it goes high once the chip’s internal boot
controller has filled imem and is releasing the CPU.
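As a quick sanity check on the hardcoded rate, the arithmetic below shows what an integer divider would have to be for 115200 baud at 50 MHz. It assumes a plain clocks-per-bit counter; the source does not show how P14's UART actually derives its bit timing, so the names here are illustrative only.

```python
# Assumption: the UART counts whole chip clocks per bit (not shown in the source).
CLK_HZ = 50_000_000
BAUD = 115_200

clocks_per_bit = round(CLK_HZ / BAUD)        # nearest integer divider
actual_baud = CLK_HZ / clocks_per_bit        # rate the divider really produces
error_pct = (actual_baud - BAUD) / BAUD * 100

print(f"clocks per bit: {clocks_per_bit}")
print(f"actual baud: {actual_baud:.1f} ({error_pct:+.4f}% error)")
```

With these numbers the rounding error is well under 0.01%, far inside the few-percent tolerance a typical UART receiver has.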
The boot sequence
      chip reset
          │
          ▼
┌─────────────────────┐
│ spi_boot.B_IDLE     │  cs_n = 0, load shift_out = 0x03000000
│   → B_CMD           │
└─────────────────────┘
          │  32 SCK cycles
          ▼
┌─────────────────────┐
│ shift out READ      │  send 0x03 + 24-bit address 0
│ command + addr      │
└─────────────────────┘
          │  flash starts streaming
          ▼
┌─────────────────────┐
│ B_DATA              │  for byte=0..127:
│ sample MISO         │    MSB-first; pack 4 bytes into 32-bit
│ on each SCK rising  │    little-endian word
│                     │    write imem[byte/4] = word
└─────────────────────┘
          │  1024 SCK cycles later
          ▼
┌─────────────────────┐
│ B_DONE              │  cs_n = 1, sck = 0
│ boot_done = 1       │  CPU released, fetches imem[0]
└─────────────────────┘
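The B_DATA step in the diagram can be modeled in a few lines of software. This is a sketch of the packing rule only (bits arrive MSB-first, four bytes form one little-endian word), not a cycle-accurate model of spi_boot; the function names are hypothetical.

```python
def bytes_to_bits(data):
    """Serialize bytes the way the flash drives MISO: MSB of each byte first."""
    return [(b >> k) & 1 for b in data for k in range(7, -1, -1)]

def pack_miso_bits(bits):
    """Collect MSB-first bits into bytes, then 4 bytes into a little-endian word."""
    words = []
    shift_in = 0      # 8-bit receive shift register
    word_buf = 0      # word under assembly
    for i, bit in enumerate(bits):
        shift_in = ((shift_in << 1) | (bit & 1)) & 0xFF
        if i % 8 == 7:                          # 8th bit of the current byte
            byte_idx = (i // 8) % 4             # position within the word
            word_buf |= shift_in << (8 * byte_idx)
            if byte_idx == 3:                   # last byte: word complete
                words.append(word_buf)
                word_buf = 0
    return words

# 128-byte image whose first word is ADDI x5, x0, 1 (0x00100293), rest zero.
image = bytes([0x93, 0x02, 0x10, 0x00]) + bytes(124)
words = pack_miso_bits(bytes_to_bits(image))
print(len(words), hex(words[0]))
```

The first byte on the wire lands in bits [7:0] of the word, which is why a flash image can be written as ordinary little-endian RV32 words.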
The SPI clock divider is hardcoded at chip-clk / 4, which gives a 12.5 MHz SCK at the 50 MHz user clock. That is comfortably below both the >100 MHz headline rating of common SPI flash parts (W25Q-series, MX25-series, etc.) and the ~50 MHz that most of them allow for the plain 0x03 READ. The whole boot takes 32 + 1024 = 1056 SPI cycles × 4 = 4,224 chip clocks, ~84 µs at 50 MHz.
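The boot-time figure follows directly from the bit counts in the sequence above. A small worked version of that arithmetic, using only numbers stated in this write-up (the constant names are made up for the example):

```python
CLK_HZ = 50_000_000        # user clock
SCK_DIV = 4                # chip clocks per SCK cycle
CMD_ADDR_BITS = 8 + 24     # 0x03 opcode + 24-bit address
DATA_BITS = 128 * 8        # 128 program bytes

sck_hz = CLK_HZ / SCK_DIV
sck_cycles = CMD_ADDR_BITS + DATA_BITS
chip_clocks = sck_cycles * SCK_DIV
boot_us = chip_clocks / CLK_HZ * 1e6

print(f"SCK = {sck_hz / 1e6:.1f} MHz")
print(f"{sck_cycles} SCK cycles = {chip_clocks} chip clocks = {boot_us:.2f} us")
```

This counts only the shifting itself; the single B_IDLE cycle and any cycles between the last bit and boot_done are not included.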
Harden result
Run directory:
projects/14_rv32e_flash_boot/librelane/runs/RUN_2026-04-29_12-48-45
Final GDS:
projects/14_rv32e_flash_boot/librelane/runs/RUN_2026-04-29_12-48-45/final/gds/tt_um_librelane_p14_rv32e_flash.gds
Metrics file:
projects/14_rv32e_flash_boot/librelane/runs/RUN_2026-04-29_12-48-45/final/metrics.csv
| metric | value |
|---|---|
| Die area | 290250 µm² |
| Core area | 271505 µm² |
| Standard cells | 15289 |
| Sequential cells | 2062 |
| Worst setup slack | +0.98 ns |
| Worst hold slack | +0.11 ns |
| Magic / KLayout DRC | PASS / PASS |
| LVS | PASS |
| Antenna | PASS |
The run still reports max-slew, max-fanout, and max-cap warnings in the metrics, similar to P13. The harden flow completed and emitted final views; the warnings are part of the result, not a pretend-clean signoff.
Why JEDEC READ (0x03) and not a fancier opcode
SPI flash chips support a zoo of read commands — 0x0B (FAST_READ
with a dummy byte for higher-clock operation), 0x6B (quad-output
read using 4 data lines), 0xEB (quad I/O with quad addressing).
We use 0x03 (the basic READ) because:
- It’s universal: every SPI flash from every vendor implements it.
- We have one MISO line, not four. Quad-mode needs 4 of the uio pins as data, which would mean giving up MOSI’s pin function and redefining the bus protocol on every cycle. Educational chip, not worth the complexity.
- 0x03 maxes out at ~50 MHz on most parts. Our SCK is 12.5 MHz, far below the limit.
The cost: at 12.5 MHz SCK, reading 128 bytes takes ~80 µs. A quad read at 80 MHz (4 data lines, so 2 clocks per byte) would move the same data in ~3.2 µs. For a microcontroller booting once at power-on, the 80 µs is invisible.
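The two read times compared above come from the same formula, differing only in clock rate and number of data lines. The sketch below covers the data phase only and ignores opcode, address, and dummy cycles, which is an assumption made to match the rounded figures in the text; the function name is hypothetical.

```python
def data_phase_us(num_bytes, sck_hz, data_lines):
    """Time to clock num_bytes out of the flash, data phase only."""
    clocks = num_bytes * 8 / data_lines   # bits divided across the data lines
    return clocks / sck_hz * 1e6

single = data_phase_us(128, 12.5e6, 1)    # basic 0x03 READ on one MISO line
quad = data_phase_us(128, 80e6, 4)        # quad read at 80 MHz

print(f"single-line @ 12.5 MHz: {single:.2f} us")
print(f"quad @ 80 MHz: {quad:.2f} us")
```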
ISA scope
This is an RV32E-shaped educational core, not a compliance-proven RISC-V implementation. Official RISC-V architectural compliance tests: NOT RUN. Local compliance-shaped smoke subset (the ISA smoke addendum): PASS.
Supported instructions: LUI, AUIPC, JAL, JALR, BEQ, BNE,
BLT, BGE, BLTU, BGEU, LW, SW, ADDI, SLTI, SLTIU,
XORI, ORI, ANDI, SLLI, SRLI, SRAI, ADD, SUB, SLL,
SLT, SLTU, XOR, SRL, SRA, OR, AND, and FENCE as a
no-op. Registers are RV32E style: x0..x15; reads of x16..x31
return zero and writes to them are ignored.
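The register rule stated above (x0..x15 present, x16..x31 read as zero and ignore writes) is easy to pin down with a tiny software model. This is an illustrative sketch, not the RTL; the class name is hypothetical, and treating x0 as hardwired zero is the standard RISC-V rule rather than something the source spells out.

```python
class Rv32eRegFile:
    """Software model of an RV32E-style register file with absent x16..x31."""

    def __init__(self):
        self._regs = [0] * 16

    def read(self, idx):
        idx &= 0x1F                      # 5-bit register field from the instruction
        if idx == 0 or idx >= 16:        # x0 is zero; x16..x31 read as zero
            return 0
        return self._regs[idx]

    def write(self, idx, value):
        idx &= 0x1F
        if idx == 0 or idx >= 16:        # writes to x0 and x16..x31 are dropped
            return
        self._regs[idx] = value & 0xFFFF_FFFF

rf = Rv32eRegFile()
rf.write(5, 0x1234)
rf.write(16, 0xDEAD)                     # ignored
rf.write(0, 0xBEEF)                      # ignored
print(hex(rf.read(5)), rf.read(16), rf.read(0))
```

The point of the `idx >= 16` check is that an upper-register write must not alias onto x0..x15, which is what a naive 4-bit index would do.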
Unsupported: byte/halfword loads and stores, misalignment traps,
exceptions, interrupts, CSRs, ECALL, EBREAK, multiply/divide,
atomics, compressed instructions, privilege modes, and any compliance
claim beyond the project testbench.
ISA smoke addendum
test/tb_isa.sv is the first “how close are we?” pass. It does not
use the official RISC-V architectural-test signature protocol. Instead
it stays honest to P14’s actual chip interface: every case builds a
32-word SPI flash image, lets the real spi_boot loader fill imem,
runs the CPU, halts, and checks the exposed R5[4:0] pass code on
uo_out[7:3].
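For readers who want to prepare an image by hand, here is a minimal sketch of the two host-side pieces this flow implies: turning up to 32 instruction words into the 128-byte flash image, and splitting uo_out into boot_done and the pass code. The helper names are hypothetical, and padding with NOPs is an assumption; the testbench's own image builder is not shown in this write-up and may differ.

```python
import struct

IMEM_WORDS = 32
NOP = 0x00000013                      # ADDI x0, x0, 0 (assumed padding value)

def build_flash_image(program):
    """Pack up to 32 RV32 words into a 128-byte little-endian flash image."""
    if len(program) > IMEM_WORDS:
        raise ValueError(f"program has {len(program)} words, imem holds {IMEM_WORDS}")
    padded = list(program) + [NOP] * (IMEM_WORDS - len(program))
    return struct.pack("<32I", *padded)

def decode_uo_out(uo_out):
    """Split the output byte: uo_out[2] = boot_done, uo_out[7:3] = R5[4:0]."""
    return {"boot_done": (uo_out >> 2) & 1, "pass_code": (uo_out >> 3) & 0x1F}

image = build_flash_image([0x00100293])          # ADDI x5, x0, 1
status = decode_uo_out(0b00001_100)              # pass code 1, boot_done set
print(len(image), image[:4].hex(), status)
```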
Run:
make -C projects/14_rv32e_flash_boot/test isa
Result: PASS — 9 programs, 0 errors.
| test | instructions exercised |
|---|---|
| add/sub/logic | ADDI, ADD, SUB, AND, OR, XOR, BNE |
| imm-logic | ANDI, ORI, XORI |
| shifts | SLLI, SRLI, SRAI, SLL, SRA |
| slt/sltu | SLT, SLTU, SLTI, SLTIU |
| branches | BEQ, BNE, BLT, BGE, BLTU, BGEU |
| lui/auipc | LUI, AUIPC |
| jal/jalr | JAL, JALR, including low-bit target clearing |
| lw/sw | LW, SW against P14’s 8-word dmem |
| rv32e/fence | x0, ignored x16, read-zero x16, FENCE as NOP |
What still blocks a real compliance claim: signature-memory export, sub-word loads/stores, trap/exception behavior, CSRs and system instructions, and enough program/data memory to run official tests without cutting them into tiny fragments.
Official arch-test probe
Then we tried the less flattering thing: build the official
riscv-arch-test RV32I/I unprivileged integer files and classify them
against P14’s actual limits.
Probe command:
scripts/p14_arch_test_probe.py
Result using upstream riscv-arch-test revision a7c9930:
| result | count |
|---|---|
| Official RV32I/I tests built | 39 / 39 |
| Runnable on P14 unmodified | 0 / 39 |
| Official tests passed on P14 | 0 |
| Official tests failed on P14 | 0 |
| Official tests marked NOT RUN | 39 |
That is not a disguised FAIL; it is a real NOT RUN. The smallest
official image, I-nop-00.S, still builds to 632 instruction words
plus 1384 bytes of data/signature sections. P14 has 32 instruction
words, 8 zeroed data words, no data preload path, and no signature
export. The official RV32I framework also initializes and uses
x16..x31, which P14 intentionally treats as absent RV32E registers.
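The size and register limits named above can be written as a simple fit check. This is a hypothetical sketch of the classification idea, not the logic of scripts/p14_arch_test_probe.py, and it deliberately leaves out the missing data preload path and signature export, which block every official test regardless of size.

```python
P14_IMEM_WORDS = 32
P14_DMEM_WORDS = 8

def blockers(imem_words, data_bytes, uses_upper_regs):
    """Return the size/register reasons a test image cannot run on P14 unmodified."""
    reasons = []
    if imem_words > P14_IMEM_WORDS:
        reasons.append(f"needs {imem_words} instruction words, imem has {P14_IMEM_WORDS}")
    if data_bytes > P14_DMEM_WORDS * 4:
        reasons.append(f"needs {data_bytes} data bytes, dmem has {P14_DMEM_WORDS * 4}")
    if uses_upper_regs:
        reasons.append("uses x16..x31, which are absent on RV32E")
    return reasons

# Figures quoted in the text for the smallest official image, I-nop-00.S.
for reason in blockers(632, 1384, True):
    print("NOT RUN:", reason)
```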
The probe CSV is tracked at
projects/14_rv32e_flash_boot/compliance_probe/rv32i_I_probe.csv.
That file is the useful artifact for future rungs: it tells us what
has to change before official tests can become executable rather than
just buildable.
Source
The whole spi_boot module is ~110 lines:
input logic miso,
output logic sck,
output logic cs_n,
output logic mosi,
// imem write port.
output logic imem_we,
output logic [4:0] imem_waddr,
output logic [31:0] imem_wdata,
output logic boot_done
);
// SPI clock divider. 2-bit counter rolls every 4 chip cycles, so
// SCK toggles every 2 chip cycles → SPI clock = clk / 4.
logic [1:0] tick;
// We drive SCK from `tick`. The high half of the cycle is the rising
// edge (sample MISO); the falling edge (change MOSI) happens at the
// tick=0 boundary.
wire sck_rising = (tick == 2'd1); // about to make SCK go high
wire sck_falling = (tick == 2'd3); // about to make SCK go low
typedef enum logic [2:0] {
B_IDLE = 3'd0,
B_CMD = 3'd1, // shift out the 0x03 READ command
B_ADDR = 3'd2, // shift out the 24-bit start address
B_DATA = 3'd3, // shift in the 128 program bytes
B_DONE = 3'd4
} bstate_t;
bstate_t bstate;
// Bit counter used in CMD/ADDR/DATA states. Counts down from 31 for the
// 32 cmd+addr bits, then from 1023 for the 128 bytes × 8 data bits.
logic [10:0] bit_idx;
// 32-bit shift register holding the cmd+addr to clock out. Incoming
// data bits are collected separately in the 8-bit shift_in below.
logic [31:0] shift_out;
logic [7:0] shift_in;
// Word-assembly state: 4 bytes per word, little-endian on the wire
// (bytes[0] arrives first and goes into bits [7:0] of the imem word).
logic [4:0] word_idx; // imem index 0..31
logic [1:0] byte_idx; // 0..3 within the current word
logic [23:0] word_buf; // accumulated bytes 0..2
// SCK output. Idle low; while a transaction is active SCK toggles
// following `tick` such that the high phase is the second half of
// the SPI cycle.
logic sck_q;
// Flop the SPI-master IOs.
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
bstate <= B_IDLE;
tick <= 2'd0;
bit_idx <= 11'd0;
shift_out <= 32'h0;
shift_in <= 8'h0;
word_idx <= 5'd0;
byte_idx <= 2'd0;
word_buf <= 24'h0;
sck_q <= 1'b0;
cs_n <= 1'b1;
mosi <= 1'b0;
imem_we <= 1'b0;
imem_waddr <= 5'd0;
imem_wdata <= 32'h0;
boot_done <= 1'b0;
end else begin
// Default writes — overridden inside state branches.
imem_we <= 1'b0;
tick <= tick + 2'd1;
unique case (bstate)
B_IDLE: begin
// First cycle out of reset: assert CS_n low, load the
// command + address shifter, advance to B_CMD.
cs_n <= 1'b0;
shift_out <= {8'h03, 24'h000000}; // READ + addr 0
bit_idx <= 11'd31; // 32 bits to send
mosi <= 1'b0;
tick <= 2'd0; // restart cycle
bstate <= B_CMD;
end
B_CMD, B_ADDR: begin
// CMD and ADDR are the same shift mechanic: 32 bits total,
// MSB first. Output bit on SCK falling, shift on SCK rising.
if (sck_falling) begin
mosi <= shift_out[31];
shift_out <= {shift_out[30:0], 1'b0};
end
if (sck_rising) begin
sck_q <= 1'b1;
if (bit_idx == 11'd0) begin
// Done with 32-bit cmd+addr; switch to data.
bit_idx <= 11'd1023; // 128 bytes × 8 = 1024 bits
byte_idx <= 2'd0;
bstate <= B_DATA;
end else begin
bit_idx <= bit_idx - 11'd1;
end
end
if (sck_falling) begin
sck_q <= 1'b0;
end
end
B_DATA: begin
// Shift in MISO MSB-first. After every 8 sampled bits, we
// have one byte; pack 4 bytes into one imem word.
if (sck_rising) begin
sck_q <= 1'b1;
shift_in <= {shift_in[6:0], miso};
if (bit_idx[2:0] == 3'b000) begin
// We've just shifted the 8th bit of a byte — assemble.
unique case (byte_idx)
2'd0: word_buf[7:0] <= {shift_in[6:0], miso};
2'd1: word_buf[15:8] <= {shift_in[6:0], miso};
2'd2: word_buf[23:16] <= {shift_in[6:0], miso};
2'd3: begin
imem_we <= 1'b1;
imem_waddr <= word_idx;
imem_wdata <= {{shift_in[6:0], miso}, word_buf};
end
endcase
if (byte_idx == 2'd3) begin
byte_idx <= 2'd0;
word_idx <= word_idx + 5'd1;
end else begin
byte_idx <= byte_idx + 2'd1;
end
end
if (bit_idx == 11'd0) begin
bstate <= B_DONE;
end else begin
bit_idx <= bit_idx - 11'd1;

Comparing with P12 and P13
| | P12 | P13 | P14 (this) |
|---|---|---|---|
| imem | combinational ROM | flop array | flop array |
| Reprogrammable post-fab? | no | yes (UART, every boot) | yes (flash, persistent) |
| Host required to run? | no | yes — UART loader | no |
| First-byte-out time after rst_n | ~tens of cycles (boot prog) | host-dependent | ~84 µs (after spi_boot) |
| Pin frame | TT 8×2 | TT 8×2 | TT 8×2 |
P13 was the smallest microcontroller you could plug into a host computer. P14 is the smallest microcontroller you can plug into only a power supply plus an SPI flash chip on the breadboard. Same chip shape, different relationship with the world around it.
What just happened?
We built the persistence layer. P13 had writable imem but the program went away on power loss. P14 keeps writable imem and adds an internal boot controller that pulls the program out of an external flash chip on every reset. The fabricated chip is now a real microcontroller in the embedded sense: power it up and it runs.
This is the last TT-shippable rung on the ladder. After P14 we leave the Tiny Tapeout shuttle (caps out at ~16k gates / ~2k flops in 8×2) and start building toward something that fits in a custom-die sky130 submission with multiple SRAM macros, real interrupts, and eventually a full RV64GC core capable of booting an OS. See the roadmap for the rest of the climb.
See also
- Project 13 — UART loader version for comparison.
- Roadmap — what comes after P14.
- JEDEC SFDP spec — the standard for SPI-flash discovery / commands.