Project No. 13 of 147 on the ladder

RV32E with a UART program loader

Introduces: writable instruction memory, UART RX, in-the-field reprogramming

harden state: last run 2026-04-29
cells: 11,082 (non-filler)
slack: 0.93 ns (setup)
area: 290,250 μm² (die) / 271,505 μm² (core)
signoff:
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

P12 was a real RISC-V chip sized for a Tiny Tapeout 8×2 tile — but the fabbed silicon ran exactly one program forever, baked into the synthesis netlist as combinational ROM. P13 keeps the same chip, the same pin frame, the same FSM core, and adds the smallest possible reprogrammability: a flop-based instruction memory plus a UART-driven loader that writes new programs into it on demand.

The fabricated chip stops being a fixed-function curiosity and starts being a real (if extremely small) microcontroller.

Status: Hardened. Fits in a TT 8×2 tile with everything: 11,082 non-filler cells, 2,038 flops (1,024 of them the new writable imem), 0.93 ns of setup slack at 50 MHz, and zero DRC/LVS/antenna violations. P13 is 2.24× larger than P12 by cell count; almost all of that growth comes from moving instruction memory from combinational ROM to flops, plus the small loader FSM and the uart_rx receiver. Both testbenches pass: default boot prints P13\n, and the loader test streams a 14-instruction OK\n program over UART RX and runs it on the chip.

Compliance tests: NOT RUN for P13. The UART loader makes the tiny RV32E core reprogrammable, but it does not change the ISA limits inherited from P12. This is not a compliance-proven RISC-V implementation.

Layout (sky130A): 1290 × 225 µm die · 50 MHz · TT 8×2 tile · met1+met2+met3 only

What changed from P12

The pin frame is identical to P12 except for two repurposed input bits and, on the output side, one status bit that shifts the R5 mirror:

pin          P12          P13
ui_in[0]     baud_div[0]  uart_rx (host → chip)
ui_in[1]     baud_div[1]  load_mode (1 = listen, 0 = run)
uo_out[2]    R5[0]        imem_loaded (loader status)
uo_out[7:3]  R5[5:1]      R5[4:0]

The baud divider is now 14 bits instead of 16 — still enough to hit any sensible UART rate from a 50 MHz clock. The R5 mirror loses one bit of width (5 bits visible instead of 6); programs can still expose their key result there.
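As a rough check on the 14-bit claim, the arithmetic can be sketched in Python. This assumes the divider value is simply the number of 50 MHz clocks per UART bit; the real encoding in the RTL (an off-by-one, or how the two removed bits are handled internally) is not shown on this page and may differ. The helper names `clocks_per_bit`, `fits_14_bits` and `baud_error_pct` are hypothetical.

```python
CLK_HZ = 50_000_000             # clock rate stated for the hardened design
DIV_BITS = 14                   # divider width stated in the text
DIV_MAX = (1 << DIV_BITS) - 1   # largest value a 14-bit divider can hold

def clocks_per_bit(baud: int, clk_hz: int = CLK_HZ) -> int:
    """Nearest whole number of clocks per UART bit (assumed encoding)."""
    return round(clk_hz / baud)

def fits_14_bits(baud: int) -> bool:
    """True if the assumed divider value is representable in 14 bits."""
    return 1 <= clocks_per_bit(baud) <= DIV_MAX

def baud_error_pct(baud: int) -> float:
    """Relative error between the requested and the achievable baud rate."""
    actual = CLK_HZ / clocks_per_bit(baud)
    return abs(actual - baud) / baud * 100.0

for baud in (9600, 115200, 1_000_000):
    print(baud, clocks_per_bit(baud), fits_14_bits(baud),
          f"{baud_error_pct(baud):.3f}%")
```

Under that assumption the slowest reachable rate is roughly 3,052 baud (50 MHz / 16,383), so 9600 baud and faster fit, while something like 2400 baud would not.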

Inside the chip, the big change is that PROG[] is no longer a synth-time SystemVerilog parameter lowered into combinational ROM. It’s now a 32-entry × 32-bit flop array — a register file full of instruction words. Two consequences:

  1. +1,024 flops of imem storage (P12 had 0; the ROM was pure gates).
  2. The flops are writable at runtime, by the loader FSM.

A BOOT_PROG parameter still exists, but it’s the array’s reset init, not the runtime ROM. On chip-level rst_n deassert, every imem flop loads its corresponding word from BOOT_PROG. After that, the loader can overwrite individual entries.
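A small Python model may help picture the reset-init-then-overwrite behaviour. It is a sketch only: the `Imem` class is hypothetical, and the boot contents used here are placeholder NOPs, not the real BOOT_PROG.

```python
IMEM_WORDS = 32            # 32 entries x 32 bits, as described above
NOP = 0x00000013           # RISC-V "addi x0, x0, 0"
EBREAK = 0x00100073        # RISC-V "ebreak", used here only as a marker word

class Imem:
    """Illustrative model of a flop array with a reset-time init image."""

    def __init__(self, boot_prog):
        if len(boot_prog) != IMEM_WORDS:
            raise ValueError("boot image must be exactly 32 words")
        self.boot_prog = list(boot_prog)
        self.words = list(boot_prog)

    def reset(self):
        # Reset restores every entry from the boot image.
        self.words = list(self.boot_prog)

    def write(self, addr, data):
        # The loader overwrites one entry at a time (5-bit address).
        self.words[addr & 0x1F] = data & 0xFFFFFFFF

# Placeholder boot image: NOT the real BOOT_PROG.
imem = Imem([NOP] * IMEM_WORDS)
imem.write(0, EBREAK)                    # loader overwrites entry 0 only
print(hex(imem.words[0]), hex(imem.words[1]))
imem.reset()                             # reset brings the boot image back
print(hex(imem.words[0]))
```

One consequence worth noticing in the model: a loaded program shorter than 32 words leaves the remaining boot-image entries in place, since only the written entries change.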

Loader protocol

The protocol is deliberately tiny — three things to get right:

1. byte:   0xA5             magic byte (anything else is ignored)
2. byte:   N (1..32)        number of 32-bit instruction words
3. bytes:  N × 4 bytes      little-endian rv32 words, one byte at a time
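
The three fields above can be assembled on the host with a few lines of Python. This is a minimal sketch of the framing only; `frame_program` is a hypothetical helper name, and actually sending the bytes (serial port, baud rate, driving load_mode) is outside its scope.

```python
import struct

MAGIC = 0xA5       # magic byte from the protocol description
MAX_WORDS = 32     # imem holds 32 instruction words

def frame_program(words):
    """Return the byte stream: magic, count N, then N little-endian words."""
    n = len(words)
    if not 1 <= n <= MAX_WORDS:
        raise ValueError("program must contain 1..32 instruction words")
    payload = b"".join(struct.pack("<I", w & 0xFFFFFFFF) for w in words)
    return bytes([MAGIC, n]) + payload

# Two example words: a NOP (addi x0, x0, 0) followed by ebreak.
frame = frame_program([0x00000013, 0x00100073])
print(frame.hex(" "))
```

Each word contributes four bytes with the least significant byte first, so a full 32-word image is 2 + 128 = 130 bytes on the wire.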

Line noise on UART RX during normal operation can’t accidentally trigger a load — the chip only listens when the host explicitly asserts load_mode=1. Even then, the magic byte gates everything. A spurious 0xA5 followed by random data would still load garbage, but the host has explicitly said “I’m loading”; that’s on them.

Boot vs load — telling them apart

The default boot program prints P13\n over UART. Any host-loaded program prints whatever it’s been told to print. If the dev board shows P13 on its USB-UART, the loaded program didn’t run (or wasn’t loaded). If the dev board shows the expected output (e.g. OK\n, or the result of some computation), the loaded program ran exactly as compiled.

This is the kind of distinguishability that makes debugging real hardware tractable — every output mode has a unique signature, so “did the chip do what I think it did” reduces to “what bytes arrived on the wire.”
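
On the host, that check reduces to comparing byte strings. A minimal sketch, assuming the received bytes have already been captured from the USB-UART by some other means; `classify_output` is a hypothetical helper.

```python
BOOT_GREETING = b"P13\n"   # output of the default boot program

def classify_output(received: bytes, expected: bytes) -> str:
    """Decide which program produced the bytes seen on the wire."""
    if received == expected:
        return "loaded program ran"
    if received == BOOT_GREETING:
        return "boot program ran (load did not take effect)"
    return "unexpected output"

print(classify_output(b"OK\n", expected=b"OK\n"))
print(classify_output(b"P13\n", expected=b"OK\n"))
```

The scheme only works if the loaded program's expected output differs from the boot greeting, which is worth keeping in mind when choosing test programs.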

Two testbenches

tb.sv runs the chip from reset with load_mode=0 and verifies the silicon-default P13\n greeting:

[uart-rx] byte 0: 0x50 ('P')
[uart-rx] byte 1: 0x31 ('1')
[uart-rx] byte 2: 0x33 ('3')
[uart-rx] byte 3: 0x0a (newline)
[ok] halted after 189 clocks
PASS: P13 default boot prints "P13\n" on UART, halts.

tb_load.sv exercises the loader. It asserts load_mode=1, sends 0xA5 magic + count + a 14-instruction program over UART RX, verifies imem_loaded goes high, drops load_mode to 0, then watches the chip run the loaded program:

[host]    load_mode=1; CPU held in reset, loader listening
[host]    sent magic 0xA5
[host]    sent count 14
[host]    sent 14 instruction words
[host]    load_mode=0; CPU released
[uart-rx] byte 0: 0x4f ('O')
[uart-rx] byte 1: 0x4b ('K')
[uart-rx] byte 2: 0x0a (newline)
[ok]      loaded program halted after 171 clocks
PASS: loaded program ran -> UART "OK\n".

Same chip, two different programs. The hardware contract — the TT pin frame — is identical in both cases; only the host’s behaviour differs.
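
The host-side sequence that tb_load.sv walks through can be sketched in Python. The three callables are hypothetical stand-ins for whatever the dev board provides to drive ui_in[1], send bytes to ui_in[0], and sample uo_out[2]; no real board API is implied.

```python
import struct
from typing import Callable, Sequence

MAGIC = 0xA5  # magic byte from the loader protocol

def load_and_run(
    words: Sequence[int],
    set_load_mode: Callable[[int], None],  # hypothetical: drives ui_in[1]
    uart_write: Callable[[bytes], None],   # hypothetical: sends bytes to ui_in[0]
    imem_loaded: Callable[[], bool],       # hypothetical: samples uo_out[2]
    max_polls: int = 1000,
) -> None:
    """Load a program through the UART loader, then release the CPU."""
    if not 1 <= len(words) <= 32:
        raise ValueError("program must be 1..32 words")
    set_load_mode(1)                        # loader listens, CPU is held
    uart_write(bytes([MAGIC, len(words)]))  # magic byte + word count
    for w in words:
        uart_write(struct.pack("<I", w & 0xFFFFFFFF))  # little-endian word
    for _ in range(max_polls):              # wait for the loader status bit
        if imem_loaded():
            break
    else:
        raise TimeoutError("imem_loaded never went high")
    set_load_mode(0)                        # CPU released, program runs

# Dry run against recording fakes instead of real hardware.
log = []
load_and_run(
    [0x00000013],
    set_load_mode=lambda v: log.append(("load_mode", v)),
    uart_write=lambda b: log.append(("uart", b.hex())),
    imem_loaded=lambda: True,
)
print(log)
```

Passing the I/O in as callables keeps the ordering (assert load_mode, send, wait for imem_loaded, release) testable without a board attached.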

RTL — the loader FSM

The smallest interesting piece is loader_fsm, a 5-state machine that watches the UART RX byte stream and writes incoming program data into imem:

projects/13_rv32e_loader/src/top.sv system-verilog · L441-550
module loader_fsm (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        load_mode,
    input  logic [7:0]  rx_data,
    input  logic        rx_valid,
    output logic        imem_we,
    output logic [4:0]  imem_waddr,
    output logic [31:0] imem_wdata,
    output logic        imem_loaded
);

  typedef enum logic [2:0] {
    L_IDLE       = 3'd0,
    L_WAIT_MAGIC = 3'd1,
    L_READ_COUNT = 3'd2,
    L_READ_BYTE  = 3'd3,
    L_DONE       = 3'd4
  } lstate_t;
  lstate_t lstate;

  logic [4:0]  count;       // index of the last word to load (N-1, 0..31)
  logic [4:0]  word_idx;    // current word index
  logic [1:0]  byte_idx;    // 0..3 within the current word
  logic [23:0] word_buf;    // bottom 3 bytes of the in-progress word

  // Combinational outputs. imem_we asserts only when the loader is
  // actively reading the 4th byte of a word and a new RX byte just
  // arrived. imem_wdata is the assembled word: the just-arrived
  // byte sits in the high byte; the lower 3 bytes are the buffer
  // accumulated over the previous 3 cycles (little-endian on the
  // wire = LSB arrives first).
  always_comb begin
    imem_we    = (lstate == L_READ_BYTE) && (byte_idx == 2'd3) && rx_valid;
    imem_waddr = word_idx;
    imem_wdata = {rx_data, word_buf};
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      lstate      <= L_IDLE;
      count       <= 5'd0;
      word_idx    <= 5'd0;
      byte_idx    <= 2'd0;
      word_buf    <= 24'h0;
      imem_loaded <= 1'b1;     // boot default is "loaded" — chip uses BOOT_PROG
    end else begin
      unique case (lstate)
        L_IDLE: begin
          if (load_mode) begin
            lstate      <= L_WAIT_MAGIC;
            imem_loaded <= 1'b0;
            word_idx    <= 5'd0;
            byte_idx    <= 2'd0;
          end
        end
        L_WAIT_MAGIC: begin
          if (!load_mode) lstate <= L_IDLE;
          else if (rx_valid && rx_data == 8'hA5) lstate <= L_READ_COUNT;
          // else: ignore non-magic byte, stay listening
        end
        L_READ_COUNT: begin
          if (!load_mode) lstate <= L_IDLE;
          else if (rx_valid) begin
            // count stores the last word index (N-1), because N = 32
            // itself would wrap to 0 in 5 bits. N > 32 is capped at
            // index 31. As written, N = 0 yields index 1 (two words).
            count    <= (rx_data == 8'd0)             ? 5'd1
                      : (rx_data > 8'd32)             ? 5'd31
                      : 5'(rx_data - 8'd1);
            word_idx <= 5'd0;
            byte_idx <= 2'd0;
            lstate   <= L_READ_BYTE;
          end
        end
        L_READ_BYTE: begin
          if (!load_mode) lstate <= L_IDLE;
          else if (rx_valid) begin
            unique case (byte_idx)
              2'd0: begin word_buf[7:0]   <= rx_data; byte_idx <= 2'd1; end
              2'd1: begin word_buf[15:8]  <= rx_data; byte_idx <= 2'd2; end
              2'd2: begin word_buf[23:16] <= rx_data; byte_idx <= 2'd3; end
              2'd3: begin
                // 4th byte = MSB. Combinational `imem_wdata` above
                // is already {rx_data, word_buf}. Strobe the write.
                byte_idx <= 2'd0;
                if (word_idx == count) begin
                  lstate   <= L_DONE;
                end else begin
                  word_idx <= word_idx + 5'd1;
                end
              end
            endcase
          end
        end
        L_DONE: begin
          imem_loaded <= 1'b1;
          if (!load_mode) lstate <= L_IDLE;
          // else: stay; host can keep load_mode=1 until ready
        end
        default: lstate <= L_IDLE;
      endcase
    end
  end

endmodule



L_IDLE waits for load_mode to assert. L_WAIT_MAGIC filters for the 0xA5 byte. L_READ_COUNT captures the instruction count. L_READ_BYTE accumulates 4 bytes per word into word_buf, then strobes imem_we on the 4th byte to write the assembled word. L_DONE sets imem_loaded=1 and waits until the host releases load_mode; imem_loaded then stays high until the next load begins. Dropping load_mode in any intermediate state aborts back to L_IDLE.
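
For experimenting with the state sequence without a simulator, the FSM can be mirrored clock by clock in Python. This is an illustrative behavioural model transcribed from the listing above, not a replacement for the RTL; `LoaderModel` and `feed` are hypothetical names, and each received byte is presented for exactly one clock.

```python
IDLE, WAIT_MAGIC, READ_COUNT, READ_BYTE, DONE = range(5)

class LoaderModel:
    """Clock-by-clock Python mirror of loader_fsm (illustrative only)."""

    def __init__(self):
        self.state = IDLE
        self.count = 0           # last word index (N-1), as in the RTL
        self.word_idx = 0
        self.byte_idx = 0
        self.word_buf = 0        # low three bytes of the word in progress
        self.imem_loaded = True  # reset value: BOOT_PROG counts as loaded

    def step(self, load_mode, rx_valid=False, rx_data=0):
        """Advance one clock. Returns (addr, word) when imem_we fires."""
        s = self.state
        write = None
        # Combinational write strobe, evaluated on the pre-edge state.
        if s == READ_BYTE and self.byte_idx == 3 and rx_valid:
            write = (self.word_idx, (rx_data << 24) | self.word_buf)

        if s == IDLE:
            if load_mode:
                self.state = WAIT_MAGIC
                self.imem_loaded = False
                self.word_idx = self.byte_idx = 0
        elif s == DONE:
            self.imem_loaded = True
            if not load_mode:
                self.state = IDLE
        elif not load_mode:
            self.state = IDLE                 # abort: host dropped load_mode
        elif rx_valid:
            if s == WAIT_MAGIC:
                if rx_data == 0xA5:
                    self.state = READ_COUNT
            elif s == READ_COUNT:
                # Same clamping as the RTL listing.
                self.count = (1 if rx_data == 0
                              else 31 if rx_data > 32
                              else rx_data - 1)
                self.word_idx = self.byte_idx = 0
                self.state = READ_BYTE
            elif self.byte_idx < 3:           # READ_BYTE, bytes 0..2
                shift = 8 * self.byte_idx
                self.word_buf = ((self.word_buf & ~(0xFF << shift))
                                 | (rx_data << shift))
                self.byte_idx += 1
            else:                             # READ_BYTE, 4th byte
                self.byte_idx = 0
                if self.word_idx == self.count:
                    self.state = DONE
                else:
                    self.word_idx += 1
        return write

def feed(model, frame):
    """Present each byte for one clock with load_mode=1; collect writes."""
    model.step(load_mode=True)                # IDLE -> WAIT_MAGIC
    writes = []
    for b in frame:
        w = model.step(load_mode=True, rx_valid=True, rx_data=b)
        if w is not None:
            writes.append(w)
    model.step(load_mode=True)                # DONE registers imem_loaded
    return writes

model = LoaderModel()
frame = bytes.fromhex("a502" "13000000" "73001000")  # magic, N=2, two words
writes = feed(model, frame)
print([(a, hex(w)) for a, w in writes], model.imem_loaded)
```

The write strobe is computed from the state before the clock edge, matching the combinational imem_we in the listing, so the word lands in imem on the same clock that delivers its 4th byte.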

Comparing the four CPUs on the ladder

                          P06                P09                P12                P13
Width                     8                  32                 32                 32
ISA                       ours               RV32I              RV32E              RV32E
imem                      combinational ROM  combinational ROM  combinational ROM  flop array (writable)
Reprogrammable post-fab?  no                 no                 no                 yes (over UART)
Targets                   educational        educational        TT 8×2             TT 8×2
Cells (hardened)          2,333              17,277             4,943              11,082

P13 is the first chip on this ladder where the silicon you receive in the mail can be told to do something different from what it was fabricated with. Every level below ships exactly one program forever; P13 ships a bootloader, and the program comes later.

What just happened?

We took the smallest TT-shippable RISC-V (P12) and added a 1,024-flop instruction memory plus a UART-driven loader. The pin frame stays identical — same tt_um_* shape, just two pins repurposed for the loader’s UART-RX line and the load-mode select. The fabricated chip now behaves like a real microcontroller: power it up and it greets you with P13\n; tell it to listen and it accepts a new program over the same UART pins; release the loader and it runs whatever you sent.

This is the smallest interesting system-level difference between “a chip that runs the program in its mask” and “a chip that runs the program you just sent it.” Every microcontroller from a 4-bit PIC up has some version of this loader. P13 is what the very simplest one looks like in 1,024 flops of imem and a five-state FSM.

See also

  • Project 12 → the fixed-function RV32E this scales up from.
  • TinyQV → the alternative reprogrammability model: external SPI flash boot rather than UART loader.
  • Project README