P12 was a real RISC-V chip sized for a Tiny Tapeout 8×2 tile — but the fabbed silicon ran exactly one program forever, baked into the synthesis netlist as combinational ROM. P13 keeps the same chip, the same pin frame, the same FSM core, and adds the smallest possible reprogrammability: a flop-based instruction memory plus a UART-driven loader that writes new programs into it on demand.
The fabricated chip stops being a fixed-function curiosity and starts being a real (if extremely small) microcontroller.
Status: Hardened. Fits in a TT 8×2 tile with everything: 11,082 non-filler cells, 2,038 flops (1,024 of which are the new writable imem), 0.93 ns of setup slack at 50 MHz, zero DRC/LVS/antenna violations. P13 is 2.24× the size of P12 by cell count (11,082 vs. 4,943); almost all of that growth comes from moving instruction memory from combinational ROM to flops, plus the small loader FSM and uart_rx receiver. Both testbenches pass: the default boot prints P13\n, and the loader test streams a 14-instruction OK\n program over UART RX and runs it on the chip.
Compliance tests: NOT RUN for P13. The UART loader makes the tiny RV32E core reprogrammable, but it does not change the ISA limits inherited from P12. This is not a compliance-proven RISC-V implementation.
What changed from P12
The pin frame is identical to P12 except for two repurposed bits:
| pin | P12 | P13 |
|---|---|---|
| ui_in[0] | baud_div[0] | uart_rx (host → chip) |
| ui_in[1] | baud_div[1] | load_mode (1 = listen, 0 = run) |
| uo_out[2] | R5[0] | imem_loaded (loader status) |
| uo_out[7:3] | R5[5:1] | R5[4:0] |
The baud divider is now 14 bits instead of 16 — still enough to hit any sensible UART rate from a 50 MHz clock. The R5 mirror loses one bit of width (5 bits visible instead of 6); programs can still expose their key result there.
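As a rough check of the 14-bit claim, the sketch below computes a divider for a few common baud rates. It assumes the divider value is simply the number of 50 MHz clock cycles per UART bit; the text does not spell out the exact encoding, and the helper names are illustrative.

```python
CLOCK_HZ = 50_000_000   # chip clock, from the text
DIVIDER_BITS = 14       # divider width in P13, from the text


def baud_divider(baud, clock_hz=CLOCK_HZ):
    """Clock cycles per UART bit, rounded to nearest (assumed meaning)."""
    return round(clock_hz / baud)


def fits(divider, bits=DIVIDER_BITS):
    """True if the divider is non-zero and representable in `bits` bits."""
    return 0 < divider < (1 << bits)


for baud in (2_400, 9_600, 115_200, 1_000_000):
    div = baud_divider(baud)
    actual = CLOCK_HZ / div
    error_pct = 100.0 * abs(actual - baud) / baud
    print(f"{baud:>9} baud: divider {div:>6}, fits={fits(div)}, error {error_pct:.3f}%")
```

Under that assumption, 9,600 baud and up fit comfortably, while very slow rates such as 2,400 baud need a divider larger than 14 bits can hold.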
Inside the chip, the big change is that PROG[] is no longer a
synth-time SystemVerilog parameter lowered into combinational
ROM. It’s now a 32-entry × 32-bit flop array — a register file
full of instruction words. Two consequences:
- +1,024 flops of imem storage (P12 had 0; the ROM was pure gates).
- The flops are writable at runtime, by the loader FSM.
A BOOT_PROG parameter still exists, but it’s the array’s reset
init, not the runtime ROM. On chip-level rst_n deassert, every
imem flop loads its corresponding word from BOOT_PROG. After
that, the loader can overwrite individual entries.
Loader protocol
The protocol is deliberately tiny — three things to get right:
1. byte: 0xA5 magic byte (anything else is ignored)
2. byte: N (1..32) number of 32-bit instruction words
3. bytes: N × 4 bytes little-endian rv32 words, one byte at a time
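The three-part framing above can be sketched on the host side as a small packet builder. The function name and the two example instruction words (a NOP and a jump-to-self) are illustrative choices, not part of the project.

```python
import struct

MAGIC = 0xA5      # magic byte, from the protocol description
MAX_WORDS = 32    # imem depth, from the text


def build_load_packet(words):
    """Return magic + count + N little-endian 32-bit words as bytes."""
    n = len(words)
    if not 1 <= n <= MAX_WORDS:
        raise ValueError(f"program must be 1..{MAX_WORDS} words, got {n}")
    # "<I" packs an unsigned 32-bit integer little-endian (LSB first on the wire).
    payload = b"".join(struct.pack("<I", w) for w in words)
    return bytes([MAGIC, n]) + payload


# Example: addi x0, x0, 0 (NOP) followed by jal x0, 0 (loop forever).
packet = build_load_packet([0x00000013, 0x0000006F])
print(packet.hex())
```

A 14-word program produces 2 + 14 × 4 = 58 bytes on the wire.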
Line noise on UART RX during normal operation can’t accidentally
trigger a load — the chip only listens when the host explicitly
asserts load_mode=1. Even then, the magic byte gates everything.
A spurious 0xA5 followed by random data would still load garbage,
but the host has explicitly said “I’m loading”; that’s on them.
Boot vs load — telling them apart
The default boot program prints P13\n over UART. Any host-loaded
program prints whatever it’s been told to print. If the dev board
shows P13 on its USB-UART, the loaded program didn’t run (or
wasn’t loaded). If the dev board shows the expected output (e.g.
OK\n, or the result of some computation), the loaded program ran
exactly as compiled.
This is the kind of distinguishability that makes debugging real hardware tractable — every output mode has a unique signature, so “did the chip do what I think it did” reduces to “what bytes arrived on the wire.”
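On the host, that check can be reduced to a byte comparison. The sketch below is a hypothetical helper, assuming the host has already captured the chip's UART output as bytes.

```python
BOOT_GREETING = b"P13\n"   # default boot output, from the text


def classify_output(received, expected):
    """Say which program most likely produced the captured UART bytes."""
    if received == expected:
        return "loaded program ran"
    if received == BOOT_GREETING:
        return "boot program ran"
    return "unexpected output"


print(classify_output(b"OK\n", expected=b"OK\n"))
print(classify_output(b"P13\n", expected=b"OK\n"))
```

A loaded program whose expected output is itself P13\n would defeat the check, so test programs are better off printing something else.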
Two testbenches
tb.sv runs the chip from reset with load_mode=0 and verifies
the silicon-default P13\n greeting:
[uart-rx] byte 0: 0x50 ('P')
[uart-rx] byte 1: 0x31 ('1')
[uart-rx] byte 2: 0x33 ('3')
[uart-rx] byte 3: 0x0a (newline)
[ok] halted after 189 clocks
PASS: P13 default boot prints "P13\n" on UART, halts.
tb_load.sv exercises the loader. It asserts load_mode=1,
sends 0xA5 magic + count + a 14-instruction program over UART
RX, verifies imem_loaded goes high, drops load_mode to 0, then
watches the chip run the loaded program:
[host] load_mode=1; CPU held in reset, loader listening
[host] sent magic 0xA5
[host] sent count 14
[host] sent 14 instruction words
[host] load_mode=0; CPU released
[uart-rx] byte 0: 0x4f ('O')
[uart-rx] byte 1: 0x4b ('K')
[uart-rx] byte 2: 0x0a (newline)
[ok] loaded program halted after 171 clocks
PASS: loaded program ran -> UART "OK\n".
Same chip, two different programs. The hardware contract — the TT pin frame — is identical in both cases; only the host’s behaviour differs.
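The host side of tb_load.sv follows a fixed order of steps, sketched below in Python. The board interface (set_load_mode, uart_write, imem_loaded, uart_read) is hypothetical, and FakeBoard is only a stand-in that counts bytes and returns a canned reply; it does not simulate the chip.

```python
import struct

MAGIC = 0xA5


def load_and_run(board, words, reply_len):
    """Mirror the testbench order: listen, send, check status, release, read."""
    board.set_load_mode(1)                        # CPU held in reset, loader listening
    board.uart_write(bytes([MAGIC, len(words)]))  # magic byte, then word count
    for word in words:
        board.uart_write(struct.pack("<I", word))  # little-endian, LSB first
    if not board.imem_loaded():
        raise RuntimeError("imem_loaded never went high; load failed")
    board.set_load_mode(0)                        # CPU released
    return board.uart_read(reply_len)


class FakeBoard:
    """Stand-in for real hardware (hypothetical interface, canned reply)."""

    def __init__(self, reply):
        self.reply = reply
        self.load_mode = 0
        self.sent = bytearray()

    def set_load_mode(self, value):
        self.load_mode = value

    def uart_write(self, data):
        if self.load_mode:
            self.sent += data

    def imem_loaded(self):
        # Loaded once magic, count, and 4 bytes per word have all arrived.
        return len(self.sent) >= 2 and len(self.sent) == 2 + 4 * self.sent[1]

    def uart_read(self, n):
        return self.reply[:n] if self.load_mode == 0 else b""


program = [0x00000013] * 14   # 14 NOPs as placeholders, not the real OK\n program
board = FakeBoard(reply=b"OK\n")
output = load_and_run(board, program, reply_len=3)
print(output)
```

On real hardware the status check would need to wait for the last UART byte to finish shifting in before sampling imem_loaded.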
RTL — the loader FSM
The smallest interesting piece is loader_fsm, a 5-state machine
that watches the UART RX byte stream and writes incoming program
data into imem:
module loader_fsm (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        load_mode,
    input  logic [7:0]  rx_data,
    input  logic        rx_valid,
    output logic        imem_we,
    output logic [4:0]  imem_waddr,
    output logic [31:0] imem_wdata,
    output logic        imem_loaded
);
    typedef enum logic [2:0] {
        L_IDLE       = 3'd0,
        L_WAIT_MAGIC = 3'd1,
        L_READ_COUNT = 3'd2,
        L_READ_BYTE  = 3'd3,
        L_DONE       = 3'd4
    } lstate_t;

    lstate_t     lstate;
    logic [4:0]  count;     // index of the last word to load (N-1, 0..31)
    logic [4:0]  word_idx;  // current word index
    logic [1:0]  byte_idx;  // 0..3 within the current word
    logic [23:0] word_buf;  // bottom 3 bytes of the in-progress word

    // Combinational outputs. imem_we asserts only when the loader is
    // actively reading the 4th byte of a word and a new RX byte just
    // arrived. imem_wdata is the assembled word: the just-arrived
    // byte sits in the high byte; the lower 3 bytes are the buffer
    // accumulated from the previous 3 received bytes (little-endian
    // on the wire = LSB arrives first).
    always_comb begin
        imem_we    = (lstate == L_READ_BYTE) && (byte_idx == 2'd3) && rx_valid;
        imem_waddr = word_idx;
        imem_wdata = {rx_data, word_buf};
    end

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            lstate      <= L_IDLE;
            count       <= 5'd0;
            word_idx    <= 5'd0;
            byte_idx    <= 2'd0;
            word_buf    <= 24'h0;
            imem_loaded <= 1'b1;  // boot default is "loaded": chip uses BOOT_PROG
        end else begin
            unique case (lstate)
                L_IDLE: begin
                    if (load_mode) begin
                        lstate      <= L_WAIT_MAGIC;
                        imem_loaded <= 1'b0;
                        word_idx    <= 5'd0;
                        byte_idx    <= 2'd0;
                    end
                end
                L_WAIT_MAGIC: begin
                    if (!load_mode) lstate <= L_IDLE;
                    else if (rx_valid && rx_data == 8'hA5) lstate <= L_READ_COUNT;
                    // else: ignore non-magic byte, stay listening
                end
                L_READ_COUNT: begin
                    if (!load_mode) lstate <= L_IDLE;
                    else if (rx_valid) begin
                        // Cap N at 32; anything bigger overflows imem.
                        // count stores the last valid index, N-1.
                        count <= (rx_data == 8'd0)  ? 5'd1
                               : (rx_data > 8'd32) ? 5'd31  // clamp to the top index
                               : 5'(rx_data - 8'd1);        // last valid index
                        word_idx <= 5'd0;
                        byte_idx <= 2'd0;
                        lstate   <= L_READ_BYTE;
                    end
                end
                L_READ_BYTE: begin
                    if (!load_mode) lstate <= L_IDLE;
                    else if (rx_valid) begin
                        unique case (byte_idx)
                            2'd0: begin word_buf[7:0]   <= rx_data; byte_idx <= 2'd1; end
                            2'd1: begin word_buf[15:8]  <= rx_data; byte_idx <= 2'd2; end
                            2'd2: begin word_buf[23:16] <= rx_data; byte_idx <= 2'd3; end
                            2'd3: begin
                                // 4th byte = MSB. Combinational `imem_wdata` above
                                // is already {rx_data, word_buf}. Strobe the write.
                                byte_idx <= 2'd0;
                                if (word_idx == count) begin
                                    lstate <= L_DONE;
                                end else begin
                                    word_idx <= word_idx + 5'd1;
                                end
                            end
                        endcase
                    end
                end
                L_DONE: begin
                    imem_loaded <= 1'b1;
                    if (!load_mode) lstate <= L_IDLE;
                    // else: stay; host can keep load_mode=1 until ready
                end
                default: lstate <= L_IDLE;
            endcase
        end
    end
endmodule
L_IDLE waits for load_mode to assert. L_WAIT_MAGIC filters
for the 0xA5 byte. L_READ_COUNT captures the instruction
count. L_READ_BYTE accumulates the first 3 bytes of each word
into word_buf, then strobes imem_we on the 4th byte to write the
assembled word. L_DONE holds imem_loaded=1 until the host releases
load_mode.
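To make the state walk concrete, here is a small behavioural model of the same five states in Python. It is a sketch for experimenting on a host, not a replacement for simulation. One call to clock() stands for one rising clock edge, the constructor plays the role of reset, and the class and helper names are made up for illustration.

```python
IDLE, WAIT_MAGIC, READ_COUNT, READ_BYTE, DONE = range(5)


class LoaderModel:
    def __init__(self, boot_prog=None):
        self.imem = list(boot_prog) if boot_prog else [0] * 32
        self.state = IDLE
        self.count = 0        # last word index, as in the RTL
        self.word_idx = 0
        self.byte_idx = 0
        self.word_buf = 0     # low 3 bytes of the in-progress word
        self.imem_loaded = 1  # reset default: BOOT_PROG counts as loaded

    def clock(self, load_mode, rx_valid=0, rx_data=0):
        s = self.state
        # imem_we is combinational in the RTL: write on the 4th byte.
        if s == READ_BYTE and self.byte_idx == 3 and rx_valid:
            self.imem[self.word_idx] = (rx_data << 24) | self.word_buf
        if s == IDLE:
            if load_mode:
                self.state, self.imem_loaded = WAIT_MAGIC, 0
                self.word_idx = self.byte_idx = 0
        elif s == DONE:
            self.imem_loaded = 1
            if not load_mode:
                self.state = IDLE
        elif not load_mode:
            self.state = IDLE            # host aborted the load
        elif not rx_valid:
            pass                         # nothing arrived this cycle
        elif s == WAIT_MAGIC:
            if rx_data == 0xA5:
                self.state = READ_COUNT
        elif s == READ_COUNT:
            # Mirrors the RTL mapping, including count byte 0 -> index 1.
            self.count = 1 if rx_data == 0 else min(rx_data, 32) - 1
            self.word_idx = self.byte_idx = 0
            self.state = READ_BYTE
        elif s == READ_BYTE:
            if self.byte_idx < 3:
                shift = 8 * self.byte_idx
                self.word_buf = (self.word_buf & ~(0xFF << shift) & 0xFFFFFF) | (rx_data << shift)
                self.byte_idx += 1
            else:
                self.byte_idx = 0
                if self.word_idx == self.count:
                    self.state = DONE
                else:
                    self.word_idx += 1


def feed(model, packet):
    """Raise load_mode, deliver one byte per clock, then release."""
    model.clock(load_mode=1)             # IDLE -> WAIT_MAGIC
    for byte in packet:
        model.clock(load_mode=1, rx_valid=1, rx_data=byte)
    model.clock(load_mode=1)             # DONE sets imem_loaded
    model.clock(load_mode=0)             # back to IDLE


model = LoaderModel(boot_prog=[0xDEADBEEF] * 32)
feed(model, bytes.fromhex("a502130000006f000000"))
print([hex(w) for w in model.imem[:3]], model.imem_loaded)
```

Entries beyond the loaded count keep their boot values, which matches the description of the loader overwriting individual imem words.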
Comparing the four CPUs on the ladder
| | P06 | P09 | P12 | P13 |
|---|---|---|---|---|
| Width | 8 | 32 | 32 | 32 |
| ISA | ours | RV32I | RV32E | RV32E |
| imem | combinational ROM | combinational ROM | combinational ROM | flop array (writable) |
| Reprogrammable post-fab? | no | no | no | yes (over UART) |
| Targets | educational | educational | TT 8×2 | TT 8×2 |
| Cells (hardened) | 2,333 | 17,277 | 4,943 | 11,082 |
P13 is the first chip on this ladder where the silicon you receive in the mail can be told to do something different from what it was fabricated with. Every level below ships exactly one program forever; P13 ships a bootloader, and the program comes later.
What just happened?
We took the smallest TT-shippable RISC-V (P12) and added a
1,024-flop instruction memory plus a UART-driven loader. The pin
frame stays identical — same tt_um_* shape, just two pins
repurposed for the loader’s UART-RX line and the load-mode select.
The fabricated chip now behaves like a real microcontroller: power
it up and it greets you with P13\n; tell it to listen and it
accepts a new program over the same UART pins; release the loader
and it runs whatever you sent.
This is the smallest interesting system-level difference between “a chip that runs the program in its mask” and “a chip that runs the program you just sent it.” Every microcontroller from a 4-bit PIC up has some version of this loader. P13 is what the very simplest one looks like in 1,024 flops of imem and a five-state FSM.
See also
- Project 12 → the fixed-function RV32E this scales up from.
- TinyQV → the alternative reprogrammability model: external SPI flash boot rather than UART loader.
- Project README