No. 12 / project of 147 on the ladder

RV32E for Tiny Tapeout

introduces — RV32E, MMIO peripherals, TT 8×2 tile, observable-by-UART chip

harden state   last run 2026-04-29
cells          4,985 non-filler
slack          4.03 ns setup
area           290250 (die) / 271505 (core) μm²
signoff
  • DRC      PASS
  • LVS      PASS
  • antenna  PASS

A real RISC-V core sized for a Tiny Tapeout 8×2 tile. Where P09 is the educational “what does RV32I look like inside” (17,277 cells, 600 × 600 µm — won’t fit anywhere on TT) and P11 is the TT-shaped P06 (a small 8-bit FSM CPU, not RISC-V), P12 is a real RISC-V core that targets a real TT shuttle.

Status: Hardened. P12 fits in a TT 8×2 tile (1290 × 225 µm) at 50 MHz — 4,985 non-filler cells, 920 flops, 4.03 ns of setup slack, zero DRC/LVS/antenna violations. Strict tb passes (chip emits P12\n over UART, halts with R5 = 12 on uo_out[7:2], 189 sim clocks total). Compared to P09’s 17,277 cells / 600 × 600 µm die, P12 is a 3.5× cell shrink for a real RV32 core that targets a real fab path.

Compliance tests: NOT RUN for P12. This is a small RV32E-shaped teaching core with a fixed boot program, not a compliance-proven RISC-V implementation. The exact ISA cuts are listed below.

[Interactive layout viewer, 2D and 3D: 1290 × 225 µm die · sky130A · 50 MHz · TT 8×2 tile · met1+met2+met3 only. The 3D view shows the metal stack only, z exaggerated 10×.]

What it does at boot

Power on, TT mux selects the project, the boot program runs. It prints P12\n over UART — the chip identifying itself by name — then halts. R5 holds the project number (12) and stays visible on uo_out[7:2] after halt.

[host]     baud_div = 0x0003  (= 4 sysclks/bit)
[host]     ena = 0   ; project deselected
[host]     uo_out = 0x00  (muted by ena=0)
[host]     rst_n released, ena still 0  ->  uo_out = 0x00
[host]     ena = 1   ; CPU running boot program...
[uart-rx]  byte 0x50  = 'P'
[uart-rx]  byte 0x31  = '1'
[uart-rx]  byte 0x32  = '2'
[host]     halt detected after 189 clocks
[uart-rx]  byte 0x0a  (newline)
[host]     uo_out = 0x33
[host]       uo_out[0]   = uart_tx   = 1  (line idle)
[host]       uo_out[1]   = halted    = 1
[host]       uo_out[7:2] = R5[5:0]   = 12  (project #)
[host]       uio_oe      = 0x00     (uio is input-only)
[host]     done.

What got cut from RV32I

P09’s RV32I-min was 17,277 cells / 3,262 flops on a 360k µm² die. A TT 8×2 tile is ~290k µm² with practical capacity around 12-16k gates and 1-2k flops. P09 doesn’t fit, full stop. The way you make an RV32 core fit on TT is the same way TinyQV (the existence proof) and other RV32-on-TT projects do it: cut features that trade real area for marginal teaching value.

            P09 (RV32I-min)         P12 (this project)         savings
Regfile     32 × 32 = 1024 flops    16 × 32 = 512 flops        -512 flops
Data RAM    64 × 32 = 2048 flops    8 × 32 = 256 flops         -1792 flops
PROG ROM    256-word mux            32-word mux                8× smaller ROM mux
ISA         RV32I                   RV32E (= I, x16+ unused)   gcc has ilp32e
Tile fit    no                      TT 8×2 (1290 × 225 µm)

RV32E is RV32I with only the bottom 16 registers. gcc has first-class support for it (-march=rv32e -mabi=ilp32e); the calling convention (ABI) follows RV32I but uses fewer registers for arguments and saved values. P12’s decode silently treats x16..x31 as x0 (reads return 0, writes are ignored), so ilp32e binaries work directly and ilp32 binaries that happen to stay within x0..x15 also work.
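The register rule above can be sketched as a small software model. This is only an illustration of the stated behaviour (x16..x31 read as 0, writes to them dropped, x0 hardwired to zero as in any RISC-V core); the names `rv32e_regs_t`, `reg_read` and `reg_write` are hypothetical and do not come from P12's RTL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of the register-file rule described above.
 * None of these names come from P12's RTL. */
typedef struct {
    uint32_t x[16];   /* storage for x0..x15; x[0] is never written */
} rv32e_regs_t;

/* Reads of x0 and of x16..x31 return 0. */
static uint32_t reg_read(const rv32e_regs_t *r, unsigned idx) {
    assert(idx < 32);                 /* register fields are 5 bits wide */
    return (idx == 0 || idx >= 16) ? 0u : r->x[idx];
}

/* Writes to x0 and to x16..x31 are silently dropped. */
static void reg_write(rv32e_regs_t *r, unsigned idx, uint32_t value) {
    assert(idx < 32);
    if (idx != 0 && idx < 16) {
        r->x[idx] = value;
    }
}
```

In this model an ilp32 binary that touches, say, x21 would not trap; it would simply compute with zeros, which is why only binaries that stay within x0..x15 behave as intended.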

Pin map

Standard TT pin frame, plus a 16-bit baud divider split across both input ports:

[Block diagram: tt_um_librelane_p12_rv32e wraps rv32e_core + uart_tx, a 16-reg regfile, an 8-word dmem and a 32-instr ROM behind the TT pins (clk · rst_n · ena, ui_in[7:0], …)]
P12 wraps a real RISC-V core into the fixed TT pin frame. The wrapper picks what to expose and how to pack it.
TT pin        role             direction
ui_in[7:0]    baud_div[7:0]    in
uio_in[7:0]   baud_div[15:8]   in
uo_out[0]     uart_tx          out
uo_out[1]     halted           out
uo_out[7:2]   R5[5:0]          out
uio_out       always 0         out
uio_oe        always 0         out
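
On the host side, the pin map above amounts to a little bit-slicing. The helper below is a hypothetical sketch (the type and function names are made up for illustration); it only restates the field positions from the table and makes no claim about how the divider value maps to clocks per bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side view of uo_out; field layout from the pin map. */
typedef struct {
    unsigned uart_tx;   /* uo_out[0]              */
    unsigned halted;    /* uo_out[1]              */
    unsigned r5_low6;   /* uo_out[7:2] = R5[5:0]  */
} p12_uo_out_t;

/* Split the 8-bit output port into its three fields. */
static p12_uo_out_t p12_decode_uo_out(uint8_t uo_out) {
    p12_uo_out_t f;
    f.uart_tx = uo_out & 1u;
    f.halted  = (uo_out >> 1) & 1u;
    f.r5_low6 = (uo_out >> 2) & 0x3fu;
    return f;
}

/* Reassemble the 16-bit baud divider from the two input ports:
 * ui_in carries bits 7:0, uio_in carries bits 15:8. */
static uint16_t p12_baud_div(uint8_t ui_in, uint8_t uio_in) {
    return (uint16_t)(((unsigned)uio_in << 8) | ui_in);
}
```

Decoding the 0x33 seen in the boot log gives uart_tx = 1, halted = 1 and R5[5:0] = 12, matching the annotated lines there.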

Memory map

P12 is Harvard-ish: instruction fetches go to PROG[32], data accesses go to dmem[8] or to the MMIO UART register at byte address 0x80.

0x000..0x07f   data RAM (8 words, 32 bytes total)
0x080          UART register
                  SW: write byte to TX
                  LW: read {31'b0, busy}
0x081..0xfff   undefined

The MMIO UART is the entire I/O surface. Programs that want to print the result of a computation poll the busy register, then issue a SW. There is no LB, no SB; only word-aligned SW/LW. There are no interrupts; the program drives the protocol synchronously.
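
The poll-then-store protocol can be exercised on a host with a stand-in for the UART register. Everything below is a hypothetical model, not P12 code: `uart_model_t` and its functions are invented names, the busy duration is an arbitrary number of polls rather than anything derived from baud_div, and keeping only the low 8 bits of the stored word is an assumption based on "write byte to TX". The model treats a store while busy as a program bug, since the text does not say what the hardware does in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the UART register at byte address 0x80. */
typedef struct {
    unsigned busy_polls;      /* polls left until the modelled TX is idle */
    unsigned polls_per_byte;  /* made-up busy duration, NOT from baud_div */
    uint8_t  sent[16];        /* bytes accepted for transmission          */
    size_t   n_sent;
} uart_model_t;

/* LW from the UART register: returns {31'b0, busy}. Each poll also
 * advances the model's notion of time by one step. */
static uint32_t uart_lw(uart_model_t *u) {
    uint32_t busy = (u->busy_polls > 0) ? 1u : 0u;
    if (u->busy_polls > 0) {
        u->busy_polls--;
    }
    return busy;
}

/* SW to the UART register: queue one byte (low 8 bits, by assumption). */
static void uart_sw(uart_model_t *u, uint32_t word) {
    assert(u->busy_polls == 0);            /* caller must poll first */
    assert(u->n_sent < sizeof u->sent);
    u->sent[u->n_sent++] = (uint8_t)word;
    u->busy_polls = u->polls_per_byte;
}

/* The synchronous protocol from the text: spin while busy, then store. */
static void uart_putc(uart_model_t *u, uint8_t c) {
    while (uart_lw(u) != 0) {
        /* busy-wait; there are no interrupts to wait on */
    }
    uart_sw(u, c);
}

static void uart_puts(uart_model_t *u, const char *s) {
    while (*s != '\0') {
        uart_putc(u, (uint8_t)*s++);
    }
}
```

Calling `uart_puts(&u, "P12\n")` on a zero-initialised model leaves the four bytes in `u.sent`, whatever value `polls_per_byte` is set to, because every store is preceded by a poll loop.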

This is what real-world peripheral interaction looks like at the bottom of the stack. Every microcontroller from a $0.10 Arm Cortex-M0 up has the same shape: registers at fixed addresses, polling loops, no syscalls. The difference between this and a “real” chip is just how many peripherals there are.

The boot program

The default PROG is hand-encoded RISC-V machine code, 18 instructions. It loads the project number into R5, sets up four ASCII bytes (‘P’, ‘1’, ‘2’, ‘\n’), then sends them over UART through the standard poll-busy / write-byte sequence:

PC=0   addi x5, x0, 12        ; x5 = 12 (project #)
PC=1   addi x4, x0, 0x0a      ; x4 = '\n'
PC=2   addi x3, x0, 0x32      ; x3 = '2'
PC=3   addi x2, x0, 0x31      ; x2 = '1'
PC=4   addi x1, x0, 0x50      ; x1 = 'P'
PC=5   lw   x6, 0x80(x0)      ; x6 = uart_busy
PC=6   bne  x6, x0, -4        ; loop until !busy
PC=7   sw   x1, 0x80(x0)      ; UART <- 'P'
PC=8   lw   x6, 0x80(x0)
PC=9   bne  x6, x0, -4
PC=10  sw   x2, 0x80(x0)      ; UART <- '1'
PC=11  lw   x6, 0x80(x0)
PC=12  bne  x6, x0, -4
PC=13  sw   x3, 0x80(x0)      ; UART <- '2'
PC=14  lw   x6, 0x80(x0)
PC=15  bne  x6, x0, -4
PC=16  sw   x4, 0x80(x0)      ; UART <- '\n'
PC=17  jal  x0, 0             ; halt: jump-to-self

This is the silicon default, a self-identification greeting: connect a USB-UART bridge to the dev board of a real TT shuttle, and the chip prints “P12” once rst_n is released and ena is high. The gcc-compiled programs in tools/riscv-asm/ print different output, so they are clearly distinguishable from the boot greeting.
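
The hand-encoding can be cross-checked mechanically. The sketch below is a hypothetical helper, not the project's tooling: it packs the four instruction formats used by the listing (I-type for addi/lw, S-type for sw, B-type for bne, J-type for jal) following the standard RISC-V base encodings, and rebuilds the 18-word program. The register-range asserts reflect the RV32E restriction to x0..x15.

```c
#include <assert.h>
#include <stdint.h>

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode  (addi = 0x13/0, lw = 0x03/2) */
static uint32_t enc_i(uint32_t opcode, uint32_t funct3,
                      uint32_t rd, uint32_t rs1, int32_t imm) {
    assert(rd < 16 && rs1 < 16);              /* RV32E: x0..x15 only */
    assert(imm >= -2048 && imm <= 2047);
    return (((uint32_t)imm & 0xfffu) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* S-type sw: imm[11:5] | rs2 | rs1 | 010 | imm[4:0] | 0100011 */
static uint32_t enc_sw(uint32_t rs2, uint32_t rs1, int32_t imm) {
    assert(rs2 < 16 && rs1 < 16);
    assert(imm >= -2048 && imm <= 2047);
    uint32_t u = (uint32_t)imm & 0xfffu;
    return ((u >> 5) << 25) | (rs2 << 20) | (rs1 << 15) |
           (0x2u << 12) | ((u & 0x1fu) << 7) | 0x23u;
}

/* B-type bne: imm[12|10:5] | rs2 | rs1 | 001 | imm[4:1|11] | 1100011 */
static uint32_t enc_bne(uint32_t rs1, uint32_t rs2, int32_t off) {
    assert(rs1 < 16 && rs2 < 16);
    assert((off & 1) == 0 && off >= -4096 && off <= 4094);
    uint32_t u = (uint32_t)off & 0x1fffu;
    return (((u >> 12) & 1u) << 31) | (((u >> 5) & 0x3fu) << 25) |
           (rs2 << 20) | (rs1 << 15) | (0x1u << 12) |
           (((u >> 1) & 0xfu) << 8) | (((u >> 11) & 1u) << 7) | 0x63u;
}

/* J-type jal: imm[20|10:1|11|19:12] | rd | 1101111 */
static uint32_t enc_jal(uint32_t rd, int32_t off) {
    assert(rd < 16 && (off & 1) == 0);
    uint32_t u = (uint32_t)off & 0x1fffffu;
    return (((u >> 20) & 1u) << 31) | (((u >> 1) & 0x3ffu) << 21) |
           (((u >> 11) & 1u) << 20) | (((u >> 12) & 0xffu) << 12) |
           (rd << 7) | 0x6fu;
}

/* Rebuild the 18-instruction boot listing; w[pc] is the word at PC=pc. */
static void p12_boot_program(uint32_t w[18]) {
    static const uint8_t chars[4] = { 0x50, 0x31, 0x32, 0x0a }; /* x1..x4 */
    unsigned pc = 0;
    w[pc++] = enc_i(0x13u, 0u, 5u, 0u, 12);            /* addi x5, x0, 12 */
    for (uint32_t r = 4; r >= 1; r--) {                /* addi x4..x1     */
        w[pc++] = enc_i(0x13u, 0u, r, 0u, chars[r - 1]);
    }
    for (uint32_t r = 1; r <= 4; r++) {                /* poll, loop, send */
        w[pc++] = enc_i(0x03u, 2u, 6u, 0u, 0x80);      /* lw  x6, 0x80(x0) */
        w[pc++] = enc_bne(6u, 0u, -4);                 /* bne x6, x0, -4   */
        w[pc++] = enc_sw(r, 0u, 0x80);                 /* sw  xr, 0x80(x0) */
    }
    w[pc++] = enc_jal(0u, 0);                          /* jal x0, 0 (halt) */
    assert(pc == 18);
}
```

The words this produces for the lw, bne, sw and jal instructions agree with the hex constants in the wrapper's PROG literal (0x08002303, 0xfe031ee3, 0x08402023 and 0x0000006f among them). Note that the listing counts PC in instructions while the branch offset of -4 is in bytes, so each bne jumps back exactly one instruction to its lw.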

Compiling C — gcc onto a 128-byte ROM

P09 introduced a riscv64-elf-gcc flow that turns C programs into a SystemVerilog PROG[] literal. P12 reuses the same harness with two changes:

  • BOARD=p12 selects -march=rv32e -mabi=ilp32e. gcc emits code that only touches x0..x15, matching what the chip implements.
  • p12.ld linker script caps the program at 128 bytes (32 instructions × 4). The link fails with a clear error if a program overflows.

examples/p12_hello.c is the C counterpart of P12’s boot program: it prints '1', '3', '\n' over the UART, then halts. It compiles to 17 instructions / 68 bytes, comfortably under the 128-byte cap, which leaves room for a slightly bigger main:

static volatile unsigned int *const UART = (unsigned int *)0x80;

static void uart_send(unsigned char c) {
    while (*UART) ;          /* spin while busy */
    *UART = c;
}

int main(void) {
    uart_send('1');
    uart_send('3');
    uart_send('\n');
    return 0;
}

make c-test PROJECT=12_rv32e_tt chains the toolchain build, drops the resulting .svh into a P12 testbench, and asserts the chip emits '1','3','\n'. The output differs from the silicon default on purpose — when you see 13\n on the UART, the gcc program is running; when you see P12\n, the hand-encoded silicon default is running. Same chip, different ROM image.

make test    PROJECT=12_rv32e_tt    # hand-encoded silicon default
[uart-rx]  byte 0x50  = 'P'
[uart-rx]  byte 0x31  = '1'
[uart-rx]  byte 0x32  = '2'
[uart-rx]  byte 0x0a  (newline)
[host]     halt detected after 189 clocks
PASS: TT-wrapped RV32E boots, prints "P12\n" on UART, halts.

make c-test  PROJECT=12_rv32e_tt    # gcc-compiled p12_hello.c
[uart-rx]  byte 0x31 ('1')
[uart-rx]  byte 0x33 ('3')
[uart-rx]  byte 0x0a (newline)
[host]     halt detected after 147 clocks
PASS: gcc-compiled p12_hello.c runs through tt_um_* wrapper -> UART "13\n".

Both go through the actual tt_um_librelane_p12_rv32e wrapper — the same module a TT shuttle would instantiate — driven only from the chip-pin side. The C version is 17 instructions / 68 bytes including its boot stub, vs the silicon default’s 18 instructions hand-encoded for P12\n. Almost identical footprint; very different programs.

The testbench drives the TT wrapper exactly the way the actual TT shuttle integration would — clock, reset, ena, ui_in/uio_in for inputs, observing uo_out for outputs, no peeking inside. The one trick: SystemVerilog parameter PROG = … is declared on the wrapper module itself, so the testbench can do

tt_um_librelane_p12_rv32e #(.PROG(PROG_FROM_C)) dut (.clk(clk), …);

and load the gcc image. The TT chip-level RTL doesn’t pass any parameter override when it instantiates tt_um_*, so the silicon ships with the parameter default (the hand-encoded P12\n boot). The wrapper’s port signature is the contract; the parameter is private to the module and can only be overridden in simulation.

RTL — the wrapper

The wrapper module + the core; everything packed into one file because TT shuttles host hundreds of modules and flat names are mandatory:

projects/12_rv32e_tt/src/top.sv system-verilog · L72-124
// The TT submission convention requires this exact port signature
// (ui_in/uo_out/uio_in/uio_out/uio_oe/ena/clk/rst_n) — TT's chip-
// level shuttle RTL instantiates `tt_um_*` modules with no parameter
// overrides, so the silicon ships with the parameter defaults.
//
// Adding a `parameter PROG` is still legal: it's invisible to the
// shuttle integration but lets simulation testbenches load a
// different program by named-parameter override, the same way you'd
// instantiate any parameterized SV module:
//
//     tt_um_librelane_p12_rv32e #(.PROG(MY_PROG)) dut (...);
//
// On a real TT submission, the parameter default (the hand-encoded
// "P12\n" boot below) is what gets fabbed.
module tt_um_librelane_p12_rv32e #(
  // 32 × 32-bit instruction ROM. Default = the hand-encoded "P12\n"
  // boot below: load R5 = 12 (project number, mirrored to
  // uo_out[7:2]), then print 'P', '1', '2', '\n' over UART, halt.
  // 18 instructions used, 14 zero-fill.
  //
  // The output string is *deliberately different* from what the
  // gcc-compiled `examples/p12_hello.c` program emits ("13\n"). On
  // the actual silicon, with no parameter override, you see "P12"
  // arrive on the UART — the chip identifying itself. In simulation,
  // the testbench can override PROG to load a gcc image and see "13"
  // (or whatever the compiled program prints) instead. Two clearly
  // distinguishable outputs let you tell at a glance which program is
  // running.
  //
  // PC=0   addi x5, x0, 12        ; R5 = 12 (project number)
  // PC=1   addi x4, x0, 0x0a      ; '\n'
  // PC=2   addi x3, x0, 0x32      ; '2'
  // PC=3   addi x2, x0, 0x31      ; '1'
  // PC=4   addi x1, x0, 0x50      ; 'P'
  // PC=5   lw   x6, 0x80(x0)      ; poll uart_busy
  // PC=6   bne  x6, x0, -4        ;   loop while busy
  // PC=7   sw   x1, 0x80(x0)      ; UART <- 'P'
  // PC=8..10 same poll-and-send for x2 ('1')
  // PC=11..13 same for x3 ('2')
  // PC=14..16 same for x4 ('\n')
  // PC=17  jal  x0, 0             ; halt
  parameter logic [32*32-1:0] PROG = {
    {14{32'h00000000}},                          // PC=18..31: zero-fill
    32'h0000006f,                                // PC=17: jal x0, 0
    32'h08402023,                                // PC=16: sw   x4, 0x80(x0)  '\n'
    32'hfe031ee3,                                // PC=15: bne  x6, x0, -4
    32'h08002303,                                // PC=14: lw   x6, 0x80(x0)
    32'h08302023,                                // PC=13: sw   x3, 0x80(x0)  '2'
    32'hfe031ee3,                                // PC=12: bne  x6, x0, -4
    32'h08002303,                                // PC=11: lw   x6, 0x80(x0)
    32'h08202023,                                // PC=10: sw   x2, 0x80(x0)  '1'
    32'hfe031ee3,                                // PC=9 : bne  x6, x0, -4
    32'h08002303,                                // PC=8 : lw   x6, 0x80(x0)

Comparing the three CPUs on the ladder

                   P06                      P09                      P12 (this)
Width              8-bit                    32-bit                   32-bit
ISA                ours-by-convenience      RV32I                    RV32E
Regfile            8 × 8                    32 × 32                  16 × 32
Data RAM           none                     64 × 32 (flops)          8 × 32 (flops)
PROG               32 × 16 ROM              256 × 32 ROM             32 × 32 ROM
Cells (hardened)   2,333                    17,277                   4,985
Targets            educational standalone   educational standalone   TT 8×2 shuttle
Has UART           yes                      no                       yes

Every level is “PN + one capability” or “PN − one constraint” — the ladder shape held all the way up.

What just happened?

We took the lessons from P11 (TT pin frame, ena gating, UART for observability) and the lessons from P09 (observable storage, real ISA, careful default PROG) and combined them into a real RISC-V core that targets a real fab path. P09 was the educational “what does an RV32I core look like inside?”; P12 is “what does the same idea look like when it has to fit on something you can actually order?”.

See also

  • Project 09 → the educational RV32I this scales down from.
  • Project 11 → simpler TT wrapper around P06 for comparison.
  • TinyQV by Michael Bell: the RV32 existence proof on TT, achieved via a 4-bit-serial datapath that’s much more aggressive than P12’s narrow-but-parallel approach.
  • Project README