A real RISC-V core sized for a Tiny Tapeout 8×2 tile. Where P09 is the educational “what does RV32I look like inside” (17,277 cells, 600 × 600 µm — won’t fit anywhere on TT) and P11 is the TT-shaped P06 (a small 8-bit FSM CPU, not RISC-V), P12 is a real RISC-V core that targets a real TT shuttle.
Status: Hardened. P12 fits in a TT 8×2 tile (1290 × 225 µm) at 50 MHz: 4,985 non-filler cells, 920 flops, 4.03 ns of setup slack, zero DRC/LVS/antenna violations. Strict tb passes (chip emits P12\n over UART, halts with R5 = 12 on uo_out[7:2], 189 sim clocks total). Compared to P09’s 17,277 cells / 600 × 600 µm die, P12 is a 3.5× cell shrink for a real RV32 core that targets a real fab path.
Compliance tests: NOT RUN for P12. This is a small RV32E-shaped teaching core with a fixed boot program, not a compliance-proven RISC-V implementation. The exact ISA cuts are listed below.
What it does at boot
Power on, TT mux selects the project, the boot program runs. It
prints P12\n over UART — the chip identifying itself by name —
then halts. R5 holds the project number (12) and stays visible on
uo_out[7:2] after halt.
[host] baud_div = 0x0003 (= 4 sysclks/bit)
[host] ena = 0 ; project deselected
[host] uo_out = 0x00 (muted by ena=0)
[host] rst_n released, ena still 0 -> uo_out = 0x00
[host] ena = 1 ; CPU running boot program...
[uart-rx] byte 0x50 = 'P'
[uart-rx] byte 0x31 = '1'
[uart-rx] byte 0x32 = '2'
[host] halt detected after 189 clocks
[uart-rx] byte 0x0a (newline)
[host] uo_out = 0x33
[host] uo_out[0] = uart_tx = 1 (line idle)
[host] uo_out[1] = halted = 1
[host] uo_out[7:2] = R5[5:0] = 12 (project #)
[host] uio_oe = 0x00 (uio is input-only)
[host] done.
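The final uo_out value in the log can be unpacked with a few shifts and masks. The sketch below assumes the bit layout the log itself reports (bit 0 = uart_tx, bit 1 = halted, bits 7:2 = R5[5:0]); the struct and function names are hypothetical and not part of the repository.

```c
#include <stdint.h>

/* Fields of the 8-bit uo_out port, as reported in the boot log. */
typedef struct {
    unsigned uart_tx; /* uo_out[0]: UART TX line (1 = idle) */
    unsigned halted;  /* uo_out[1]: 1 once the core has halted */
    unsigned r5_low;  /* uo_out[7:2]: low six bits of R5 */
} p12_uo_out;

/* Hypothetical helper: split a sampled uo_out byte into its fields. */
static p12_uo_out p12_decode_uo_out(uint8_t uo_out) {
    p12_uo_out f;
    f.uart_tx = uo_out & 1u;
    f.halted  = (uo_out >> 1) & 1u;
    f.r5_low  = (uo_out >> 2) & 0x3fu;
    return f;
}
```

For the logged value 0x33 this gives uart_tx = 1, halted = 1 and r5_low = 12, matching the three decoded lines above.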
What got cut from RV32I
P09’s RV32I-min was 17,277 cells / 3,262 flops on a 360k µm² die. A TT 8×2 tile is ~290k µm² with practical capacity around 12-16k gates and 1-2k flops. P09 doesn’t fit, full stop. The way you make an RV32 core fit on TT is the same way TinyQV (the existence proof) and other RV32-on-TT projects do it: cut features that trade real area for marginal teaching value.
| | P09 (RV32I-min) | P12 (this project) | savings |
|---|---|---|---|
| Regfile | 32 × 32 = 1024 flops | 16 × 32 = 512 flops | -512 flops |
| Data RAM | 64 × 32 = 2048 flops | 8 × 32 = 256 flops | -1792 flops |
| PROG ROM | 256-word mux | 32-word mux | -8× ROM mux |
| ISA | RV32I | RV32E (= I, x16+ unused) | gcc has ilp32e |
| Tile fit | no | TT 8×2 (1290 × 225 µm) | — |
RV32E is RV32I with only the bottom 16 registers. gcc has
first-class support for it (-march=rv32e -mabi=ilp32e); the
calling convention (ABI) follows RV32I but uses fewer registers
for arguments and saved values. P12’s decode silently treats
x16..x31 as x0 (reads return 0, writes are ignored), so ilp32e
binaries work directly and ilp32 binaries that happen to stay
within x0..x15 also work.
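The register rule described above can be written as a small behavioural model. This is only a sketch of the stated behaviour (x0 and x16..x31 read as 0 and ignore writes), not the project's RTL; the type and function names are made up for illustration.

```c
#include <stdint.h>

/* Behavioural model of a 16-entry register file.
 * Assumption taken from the text: x0 and x16..x31 read as 0 and
 * silently drop writes; only x1..x15 hold state. */
typedef struct {
    uint32_t x[16];
} rv32e_regs;

static uint32_t rv32e_read(const rv32e_regs *r, unsigned idx) {
    idx &= 31u;                        /* register fields are 5 bits wide */
    if (idx == 0 || idx >= 16) return 0;
    return r->x[idx];
}

static void rv32e_write(rv32e_regs *r, unsigned idx, uint32_t value) {
    idx &= 31u;
    if (idx == 0 || idx >= 16) return; /* x0 and x16..x31: write ignored */
    r->x[idx] = value;
}
```

Note that in this model a write to x21 does not land in x5: the upper registers are treated as x0, not aliased onto the lower sixteen.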
Pin map
Standard TT pin frame, plus a 16-bit baud divider split across both input ports:
| TT pin | role | direction |
|---|---|---|
| ui_in[7:0] | baud_div[7:0] | in |
| uio_in[7:0] | baud_div[15:8] | in |
| uo_out[0] | uart_tx | out |
| uo_out[1] | halted | out |
| uo_out[7:2] | R5[5:0] | out |
| uio_out | always 0 | out |
| uio_oe | always 0 | out |
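A host that drives the input pins has to pick a divider and split it across the two ports. The sketch below assumes that clocks-per-bit equals baud_div + 1, which is extrapolated from the single log line "baud_div = 0x0003 (= 4 sysclks/bit)" and is not confirmed for other values. The helper names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint8_t ui_in;  /* baud_div[7:0]  */
    uint8_t uio_in; /* baud_div[15:8] */
} p12_baud_pins;

/* ASSUMPTION: clocks-per-bit = baud_div + 1 (inferred from the boot log).
 * baud must be non-zero. The result is clamped to the 16-bit range. */
static uint16_t p12_baud_div(uint32_t clk_hz, uint32_t baud) {
    uint32_t clocks_per_bit = (clk_hz + baud / 2) / baud; /* round to nearest */
    if (clocks_per_bit < 1) clocks_per_bit = 1;
    if (clocks_per_bit > 65536u) clocks_per_bit = 65536u;
    return (uint16_t)(clocks_per_bit - 1);
}

/* Split the 16-bit divider across the two input ports per the pin map. */
static p12_baud_pins p12_split_baud_div(uint16_t div) {
    p12_baud_pins p;
    p.ui_in  = (uint8_t)(div & 0xffu);
    p.uio_in = (uint8_t)(div >> 8);
    return p;
}
```

Under that assumption, 115200 baud at the 50 MHz clock gives a divider of 433 (0x01b1), so ui_in = 0xb1 and uio_in = 0x01.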
Memory map
P12 is Harvard-ish: instruction fetches go to PROG[32], while data accesses go to dmem[8] or to the MMIO UART register at byte address 0x80.
0x000..0x07f data RAM (8 words, 32 bytes total)
0x080 UART register
SW: write byte to TX
LW: read {31'b0, busy}
0x081..0xfff undefined
The MMIO UART is the entire I/O surface. Programs that want to print the result of a computation poll the busy register, then issue a SW. There is no LB, no SB; only word-aligned SW/LW. There are no interrupts; the program drives the protocol synchronously.
This is what real-world peripheral interaction looks like at the bottom of the stack — every microcontroller from a $0.10 ARM Cortex M0 up has the same shape: registers at fixed addresses, polling loops, no syscalls. The difference between this and a “real” chip is just how many peripherals there are.
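The memory map can be modelled on a host in a few lines, which is handy for checking a program's bus traffic before it goes near the RTL. This is a sketch under assumptions. How addresses above the first 32 bytes of the RAM window map onto the 8 words is not stated here, so the model wraps them modulo 8 words as a placeholder. The effect of a store to the UART while it is busy is also a guess. All names are hypothetical.

```c
#include <stdint.h>

enum { P12_UART_ADDR = 0x80, P12_DMEM_WORDS = 8 };

typedef struct {
    uint32_t dmem[P12_DMEM_WORDS];
    unsigned uart_busy;    /* 1 while the transmitter is shifting */
    uint8_t  uart_last_tx; /* last byte handed to the transmitter */
} p12_bus;

/* LW: 0x80 reads {31'b0, busy}; lower addresses hit data RAM. */
static uint32_t p12_lw(const p12_bus *b, uint32_t addr) {
    if (addr == P12_UART_ADDR) return b->uart_busy & 1u;
    if (addr < P12_UART_ADDR)
        return b->dmem[(addr >> 2) % P12_DMEM_WORDS]; /* ASSUMPTION: wraps */
    return 0; /* undefined region: modelled as reading 0 */
}

/* SW: 0x80 sends the low byte; lower addresses hit data RAM. */
static void p12_sw(p12_bus *b, uint32_t addr, uint32_t value) {
    if (addr == P12_UART_ADDR) {
        b->uart_last_tx = (uint8_t)value; /* ASSUMPTION: no busy check */
        b->uart_busy = 1;
    } else if (addr < P12_UART_ADDR) {
        b->dmem[(addr >> 2) % P12_DMEM_WORDS] = value;
    }
    /* stores to the undefined region are dropped in this model */
}
```

A test harness would clear uart_busy after the modelled transmit time, which lets the poll-then-store loop be exercised without a simulator.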
The boot program
The default PROG is hand-encoded RISC-V machine code, 18
instructions. It loads the project number into R5, sets up four
ASCII bytes (‘P’, ‘1’, ‘2’, ‘\n’), then sends them over UART
through the standard poll-busy / write-byte sequence:
PC=0 addi x5, x0, 12 ; x5 = 12 (project #)
PC=1 addi x4, x0, 0x0a ; x4 = '\n'
PC=2 addi x3, x0, 0x32 ; x3 = '2'
PC=3 addi x2, x0, 0x31 ; x2 = '1'
PC=4 addi x1, x0, 0x50 ; x1 = 'P'
PC=5 lw x6, 0x80(x0) ; x6 = uart_busy
PC=6 bne x6, x0, -4 ; loop until !busy
PC=7 sw x1, 0x80(x0) ; UART <- 'P'
PC=8 lw x6, 0x80(x0)
PC=9 bne x6, x0, -4
PC=10 sw x2, 0x80(x0) ; UART <- '1'
PC=11 lw x6, 0x80(x0)
PC=12 bne x6, x0, -4
PC=13 sw x3, 0x80(x0) ; UART <- '2'
PC=14 lw x6, 0x80(x0)
PC=15 bne x6, x0, -4
PC=16 sw x4, 0x80(x0) ; UART <- '\n'
PC=17 jal x0, 0 ; halt: jump-to-self
This is the silicon default. It’s a self-identification
greeting — connect a USB-UART bridge to a real TT shuttle, plug in
the dev board, and the chip prints “P12” the moment rst_n
deasserts. Then the gcc-compiled programs in tools/riscv-asm/ —
which print different output — show up clearly distinct from
the boot greeting.
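Hand-encoding is easier to audit with a small encoder to check the hex against. The functions below follow the standard RV32 I/S/B/J instruction formats; the function names are illustrative and are not taken from tools/riscv-asm/.

```c
#include <stdint.h>

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode */
static uint32_t rv_i(int32_t imm, unsigned rs1, unsigned f3, unsigned rd, unsigned op) {
    return (((uint32_t)imm & 0xfffu) << 20) | ((rs1 & 31u) << 15) |
           ((f3 & 7u) << 12) | ((rd & 31u) << 7) | (op & 0x7fu);
}

static uint32_t rv_addi(unsigned rd, unsigned rs1, int32_t imm) { return rv_i(imm, rs1, 0, rd, 0x13); }
static uint32_t rv_lw(unsigned rd, int32_t imm, unsigned rs1)   { return rv_i(imm, rs1, 2, rd, 0x03); }

/* S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode */
static uint32_t rv_sw(unsigned rs2, int32_t imm, unsigned rs1) {
    uint32_t u = (uint32_t)imm;
    return (((u >> 5) & 0x7fu) << 25) | ((rs2 & 31u) << 20) | ((rs1 & 31u) << 15) |
           (2u << 12) | ((u & 0x1fu) << 7) | 0x23u;
}

/* B-type: imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode; imm in bytes */
static uint32_t rv_bne(unsigned rs1, unsigned rs2, int32_t imm) {
    uint32_t u = (uint32_t)imm;
    return (((u >> 12) & 1u) << 31) | (((u >> 5) & 0x3fu) << 25) |
           ((rs2 & 31u) << 20) | ((rs1 & 31u) << 15) | (1u << 12) |
           (((u >> 1) & 0xfu) << 8) | (((u >> 11) & 1u) << 7) | 0x63u;
}

/* J-type: imm[20|10:1|11|19:12] | rd | opcode; imm in bytes */
static uint32_t rv_jal(unsigned rd, int32_t imm) {
    uint32_t u = (uint32_t)imm;
    return (((u >> 20) & 1u) << 31) | (((u >> 1) & 0x3ffu) << 21) |
           (((u >> 11) & 1u) << 20) | (((u >> 12) & 0xffu) << 12) |
           ((rd & 31u) << 7) | 0x6fu;
}
```

For example, rv_lw(6, 0x80, 0) yields 0x08002303 and rv_bne(6, 0, -4) yields 0xfe031ee3, the two words that make up each poll loop in the listing. The branch offset is in bytes, so -4 steps back one instruction even though the listing numbers PC in words.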
Compiling C — gcc onto a 128-byte ROM
P09 introduced a riscv64-elf-gcc flow that turns C programs into a
SystemVerilog PROG[] literal. P12 reuses the same harness with
two changes:
- BOARD=p12 → -march=rv32e -mabi=ilp32e. gcc emits code that only touches x0..x15, matching what the chip implements.
- p12.ld linker script caps the program at 128 bytes (32 instructions × 4). The link fails with a clear error if a program overflows.
examples/p12_hello.c is the C counterpart of P12’s boot program:
print '1', '3', '\n' over the UART, then halt. It compiles to 17
instructions / 68 bytes, roughly half of the 128-byte cap, leaving
room for a slightly bigger main:
static volatile unsigned int *const UART = (unsigned int *)0x80;

static void uart_send(unsigned char c) {
    while (*UART) ;  /* spin while busy */
    *UART = c;
}

int main(void) {
    uart_send('1');
    uart_send('3');
    uart_send('\n');
    return 0;
}
make c-test PROJECT=12_rv32e_tt chains the toolchain build, drops
the resulting .svh into a P12 testbench, and asserts the chip
emits '1','3','\n'. The output differs from the silicon default
on purpose — when you see 13\n on the UART, the gcc program is
running; when you see P12\n, the hand-encoded silicon default is
running. Same chip, different ROM image.
make test PROJECT=12_rv32e_tt # hand-encoded silicon default
[uart-rx] byte 0x50 = 'P'
[uart-rx] byte 0x31 = '1'
[uart-rx] byte 0x32 = '2'
[uart-rx] byte 0x0a (newline)
[host] halt detected after 189 clocks
PASS: TT-wrapped RV32E boots, prints "P12\n" on UART, halts.
make c-test PROJECT=12_rv32e_tt # gcc-compiled p12_hello.c
[uart-rx] byte 0x31 ('1')
[uart-rx] byte 0x33 ('3')
[uart-rx] byte 0x0a (newline)
[host] halt detected after 147 clocks
PASS: gcc-compiled p12_hello.c runs through tt_um_* wrapper -> UART "13\n".
Both go through the actual tt_um_librelane_p12_rv32e wrapper —
the same module a TT shuttle would instantiate — driven only from
the chip-pin side. The C version is 17 instructions / 68 bytes
including its boot stub, vs the silicon default’s 18 instructions
hand-encoded for P12\n. Almost identical footprint; very
different programs.
The testbench drives the TT wrapper exactly the way the actual TT
shuttle integration would — clock, reset, ena, ui_in/uio_in for
inputs, observing uo_out for outputs, no peeking inside. The one
trick: SystemVerilog parameter PROG = … is declared on the
wrapper module itself, so the testbench can do
tt_um_librelane_p12_rv32e #(.PROG(PROG_FROM_C)) dut (.clk(clk), …);
and load the gcc image. The TT chip-level RTL doesn’t pass any
parameter override when it instantiates tt_um_*, so the silicon
ships with the parameter default (the hand-encoded P12\n boot).
The wrapper’s port signature is the contract; the parameter is
private to the module and visible only at simulation time.
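The step from a compiled image to a PROG override is mostly string formatting. The sketch below shows one way to do it, assuming the layout used by the wrapper's default (a 32-word concatenation with the highest PC listed first and unused slots zero-filled). It is not the repository's actual harness, and the function name and output format are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { P12_PROG_WORDS = 32 };

/* Format `n` instruction words as a SystemVerilog concatenation body.
 * Returns the number of characters written, or -1 if the program has
 * more than 32 words or the buffer is too small. */
static int p12_emit_prog(const uint32_t *words, size_t n, char *out, size_t out_size) {
    size_t used;
    int w;

    if (n > P12_PROG_WORDS || out_size == 0) return -1;

    w = snprintf(out, out_size, "{\n");
    if (w < 0 || (size_t)w >= out_size) return -1;
    used = (size_t)w;

    /* Highest PC first, so PC=0 ends up in the least significant word. */
    for (int pc = P12_PROG_WORDS - 1; pc >= 0; pc--) {
        uint32_t word = ((size_t)pc < n) ? words[pc] : 0u; /* zero-fill */
        w = snprintf(out + used, out_size - used, "  32'h%08lx%s // PC=%d\n",
                     (unsigned long)word, pc > 0 ? "," : "", pc);
        if (w < 0 || (size_t)w >= out_size - used) return -1;
        used += (size_t)w;
    }

    w = snprintf(out + used, out_size - used, "}");
    if (w < 0 || (size_t)w >= out_size - used) return -1;
    return (int)(used + (size_t)w);
}
```

Rejecting anything over 32 words mirrors the 128-byte cap enforced at link time, so an oversized image fails here too and is not silently truncated.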
RTL — the wrapper
The wrapper module + the core; everything packed into one file because TT shuttles host hundreds of modules and flat names are mandatory:
// The TT submission convention requires this exact port signature
// (ui_in/uo_out/uio_in/uio_out/uio_oe/ena/clk/rst_n) — TT's chip-
// level shuttle RTL instantiates `tt_um_*` modules with no parameter
// overrides, so the silicon ships with the parameter defaults.
//
// Adding a `parameter PROG` is still legal: it's invisible to the
// shuttle integration but lets simulation testbenches load a
// different program by named-parameter override, the same way you'd
// instantiate any parameterized SV module:
//
// tt_um_librelane_p12_rv32e #(.PROG(MY_PROG)) dut (...);
//
// On a real TT submission, the parameter default (the hand-encoded
// "P12\n" boot below) is what gets fabbed.
module tt_um_librelane_p12_rv32e #(
// 32 × 32-bit instruction ROM. Default = the hand-encoded "P12\n"
// boot below: load R5 = 12 (project number, mirrored to
// uo_out[7:2]), then print 'P', '1', '2', '\n' over UART, halt.
// 18 instructions used, 14 zero-fill.
//
// The output string is *deliberately different* from what the
// gcc-compiled `examples/p12_hello.c` program emits ("13\n"). On
// the actual silicon, with no parameter override, you see "P12"
// arrive on the UART — the chip identifying itself. In simulation,
// the testbench can override PROG to load a gcc image and see "13"
// (or whatever the compiled program prints) instead. Two clearly
// distinguishable outputs let you tell at a glance which program is
// running.
//
// PC=0 addi x5, x0, 12 ; R5 = 12 (project number)
// PC=1 addi x4, x0, 0x0a ; '\n'
// PC=2 addi x3, x0, 0x32 ; '2'
// PC=3 addi x2, x0, 0x31 ; '1'
// PC=4 addi x1, x0, 0x50 ; 'P'
// PC=5 lw x6, 0x80(x0) ; poll uart_busy
// PC=6 bne x6, x0, -4 ; loop while busy
// PC=7 sw x1, 0x80(x0) ; UART <- 'P'
// PC=8..10 same poll-and-send for x2 ('1')
// PC=11..13 same for x3 ('2')
// PC=14..16 same for x4 ('\n')
// PC=17 jal x0, 0 ; halt
parameter logic [32*32-1:0] PROG = {
{14{32'h00000000}}, // PC=18..31: zero-fill
32'h0000006f, // PC=17: jal x0, 0
32'h08402023, // PC=16: sw x4, 0x80(x0) '\n'
32'hfe031ee3, // PC=15: bne x6, x0, -4
32'h08002303, // PC=14: lw x6, 0x80(x0)
32'h08302023, // PC=13: sw x3, 0x80(x0) '2'
32'hfe031ee3, // PC=12: bne x6, x0, -4
32'h08002303, // PC=11: lw x6, 0x80(x0)
32'h08202023, // PC=10: sw x2, 0x80(x0) '1'
32'hfe031ee3, // PC=9 : bne x6, x0, -4
32'h08002303, // PC=8 : lw x6, 0x80(x0)
Comparing the three CPUs on the ladder
| | P06 | P09 | P12 (this) |
|---|---|---|---|
| Width | 8-bit | 32-bit | 32-bit |
| ISA | ours-by-convenience | RV32I | RV32E |
| Regfile | 8 × 8 | 32 × 32 | 16 × 32 |
| Data RAM | none | 64 × 32 (flops) | 8 × 32 (flops) |
| PROG | 32 × 16 ROM | 256 × 32 ROM | 32 × 32 ROM |
| Cells (hardened) | 2,333 | 17,277 | 4,943 |
| Targets | educational standalone | educational standalone | TT 8×2 shuttle |
| Has UART | yes | no | yes |
Every level is “PN + one capability” or “PN − one constraint” — the ladder shape held all the way up.
What just happened?
We took the lessons from P11 (TT pin frame, ena gating, UART for
observability) and the lessons from P09 (observable storage, real
ISA, careful default PROG) and combined them into a real RISC-V
core that targets a real fab path. P09 was the educational “what
does an RV32I core look like inside?”; P12 is “what does the same
idea look like when it has to fit on something you can actually
order?”.
See also
- Project 09 → the educational RV32I this scales down from.
- Project 11 → simpler TT wrapper around P06 for comparison.
- TinyQV by Michael Bell: the RV32 existence proof on TT, achieved via a 4-bit-serial datapath that’s much more aggressive than P12’s narrow-but-parallel approach.
- Project README