P06’s FSM CPU on a Tiny Tapeout shuttle. Where project 10 was a thin pin-frame adapter around the simplest design on the ladder (counter / PWM / LFSR), P11 puts an actual instruction-fetching, register-writing, UART-emitting microcontroller onto a real fab path. The wrapper is maybe 60 lines; everything else on this page is what surrounds it on the actual silicon.
Status: Hardened. Standalone harden at 50 MHz on a 333 × 225 µm die (TT 2×2 tile): 758 non-filler cells, 7.32 ns of setup slack, zero DRC/LVS/antenna violations. The wrapper preserves P06’s behaviour under iverilog: boots, runs Fibonacci(6) into R7, prints '8' on the UART, halts.

This is not the actual TT submission. TT runs its own CI-driven flow against fixed tile sizes and the shuttle’s chip-level pin-mux RTL. What lives here is the wrapper plus a sanity-check harden, to confirm the pin shape synthesizes cleanly before pushing to a TT shuttle.
What it does at boot
The shortest possible end-to-end demo: power up the chip, the TT
mux selects this project, the CPU runs its 11-instruction boot
program, one byte (0x38 = ASCII '8') appears on the UART, then
the CPU halts. A host-side UART receiver in tb_demo.sv watches
uo_out[0] and decodes the byte. With the demo’s baud_div = 3 the
whole thing takes 85 clocks, about 1.7 µs at the 50 MHz harden clock.
[host] baud_div = 0x0003 (= 4 sysclks/bit)
[host] ena = 0 ; project deselected
[host] uo_out = 0x00 (muted by ena=0)
[host] rst_n released, ena still 0 -> uo_out = 0x00
[host] ena = 1 ; CPU running boot program...
[uart-rx] byte 0x38 = '8'
[host] halt detected after 85 clocks
[host] uo_out = 0x23
[host] uo_out[0] = uart_tx = 1 (line idle)
[host] uo_out[1] = halted = 1
[host] uo_out[7:2] = R7[5:0] = 8 (= fib(6))
[host] uio_oe = 0x00 (uio is input-only)
[host] done.
That’s three independent pieces of evidence, from one boot, that this works:
- The CPU executed the boot program correctly: the ALU, register file, and FSM are all alive (R7 = 8 on uo_out[7:2]).
- The UART transmitter framed an 8N1 byte that a host-side decoder recovered intact ('8' on uo_out[0]).
- The TT pin frame is packed exactly the way the page describes: ena=0 mutes the chip outputs, and uio_oe stays low because we declared uio as input-only.
This is the narrating testbench (make demo PROJECT=11_tt_cpu).
There’s also a strict pass/fail variant — tb.sv, run by
make test PROJECT=11_tt_cpu — that asserts each of the above
checks and exits non-zero if anything is off.
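The final uo_out value in the transcript can be unpacked by hand, but a few lines of host-side code make the bit layout explicit. This is a minimal sketch, not part of the project: decode_uo_out is a hypothetical helper name, and the field positions are taken from the transcript and the pin map on this page.

```python
def decode_uo_out(uo_out):
    """Unpack the wrapper's 8-bit output port (hypothetical helper).

    Bit layout assumed from the page: uo_out[0] = uart_tx,
    uo_out[1] = halted, uo_out[7:2] = low 6 bits of R7.
    """
    return {
        "uart_tx": uo_out & 0x01,         # uo_out[0]
        "halted": (uo_out >> 1) & 0x01,   # uo_out[1]
        "r7_low6": (uo_out >> 2) & 0x3F,  # uo_out[7:2]
    }


# The value the demo transcript reports after halt.
fields = decode_uo_out(0x23)
print(fields)
```

Applied to 0x23, this gives uart_tx = 1 (line idle), halted = 1, and 8 in the R7 field, matching the three lines the transcript prints.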
What Tiny Tapeout actually is
Tiny Tapeout is a service that aggregates hundreds of small user projects onto a single shuttle die that gets fabbed at SkyWater. Cost-per-project drops from “tens of thousands of dollars per tape-out” (a real foundry submission) to “$300 per shuttle slot” because the cost of the masks is amortized across everyone on the shuttle.
Each shuttle has a fixed chip-level harness:
- A mux that selects which user project’s logic is connected to the shared external pins at any given moment. The user picks via selection inputs; the mux’s ena output goes high for the active project.
- A fixed pin frame: every user project must expose 8 dedicated inputs, 8 dedicated outputs, 8 bidirectional pins, plus clk/rst_n/ena. No more, no fewer.
- A grid of tile slots of varying sizes (1×1, 1×2, 2×2, …), each ~160 µm wide. The bigger your design, the more tiles it occupies (and the more shuttle slots it costs).
- The TT dev board, a PCB that physically holds the shuttle die, routes its pins to easily-accessible headers, drives the user mux from a tiny on-board MCU, and exposes UART / I²C / SPI to a host computer.
For this project, P06’s CPU fits in a 2×2 tile (333 × 225 µm).
The pin map
P06 has more outputs than the TT pin frame can carry, so the wrapper picks what to expose. The packing this design picks:
| TT pin | role | direction | notes |
|---|---|---|---|
| ui_in[7:0] | baud_div[7:0] | in | low byte of UART baud divider |
| uio_in[7:0] | baud_div[15:8] | in | high byte of UART baud divider |
| uo_out[0] | uart_tx | out | serial output (one byte 0x38 = ASCII '8' after the boot program runs; see below) |
| uo_out[1] | halted | out | high after CPU hits HLT |
| uo_out[7:2] | out[5:0] | out | low 6 bits of register R7 |
| uio_out | always 0 | out | unused |
| uio_oe | always 0 | out | uio is input-only here |
| ena | enable | in | TT mux holds this high while we’re selected |
The 16-bit baud_div doesn’t fit in any single 8-bit TT port, so
the wrapper splits it: ui_in carries the low byte, uio_in the
high byte. A host program on the TT dev board can drive both at
runtime to match whatever clock the shuttle ends up using.
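A host program has to do the inverse of the wrapper’s concatenation: pick a divider, then split it across the two ports. The sketch below is illustrative only. The function names are hypothetical, and the divider formula (clock frequency divided by baud rate, rounded) is an assumption inferred from the wrapper’s comment that 0x0364 gives 115200 baud at 100 MHz; whether P06’s UART counts baud_div or baud_div + 1 clocks per bit should be checked against its source.

```python
def baud_div_for(clk_hz, baud):
    """Assumed relation: divider ~= clock / baud, rounded to nearest."""
    return round(clk_hz / baud)


def split_baud_div(baud_div):
    """Split a 16-bit divider into (ui_in, uio_in) port values.

    ui_in carries the low byte and uio_in the high byte, as in the
    pin map above.
    """
    if not 0 <= baud_div <= 0xFFFF:
        raise ValueError("baud_div must fit in 16 bits")
    return baud_div & 0xFF, (baud_div >> 8) & 0xFF


div = baud_div_for(100_000_000, 115_200)
ui_in, uio_in = split_baud_div(div)
print(hex(div), hex(ui_in), hex(uio_in))
```

For the 100 MHz / 115200 baud case this yields a divider of 868 (0x0364), so ui_in = 0x64 and uio_in = 0x03.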
The boot program
rst_n deasserts; the CPU starts at PC=0 and runs the program
baked into PROG. The default boot computes Fibonacci(6) into R7
and then emits one ASCII byte over the UART so a host can prove
the UART path is alive:
00 LDI R1, 1
01 LDI R2, 1
02 ADD R3, R1, R2 ; R3 = 2
03 ADD R4, R2, R3 ; R4 = 3
04 ADD R5, R3, R4 ; R5 = 5
05 ADD R6, R4, R5 ; R6 = 8
06 MOV R7, R6 ; R7 = 8 (= the 6th Fibonacci number)
07 LDI R5, 0x30 ; R5 = ASCII '0' offset
08 ADD R4, R7, R5 ; R4 = 0x38 (= ASCII '8')
09 OUT R4 ; UART <- '8'
10 HLT
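To sanity-check the listing without a simulator, the program can be traced with a tiny software model. This is a sketch under stated assumptions, not P06’s actual semantics: it assumes eight 8-bit registers R0 to R7, and that LDI, ADD, MOV, OUT, and HLT behave the way their mnemonics and the listing’s comments suggest. The tuple encoding is invented for the example and has nothing to do with P06’s instruction format.

```python
# Boot program from the listing above, as (mnemonic, operands...) tuples.
BOOT_PROGRAM = [
    ("LDI", 1, 1),
    ("LDI", 2, 1),
    ("ADD", 3, 1, 2),
    ("ADD", 4, 2, 3),
    ("ADD", 5, 3, 4),
    ("ADD", 6, 4, 5),
    ("MOV", 7, 6),
    ("LDI", 5, 0x30),
    ("ADD", 4, 7, 5),
    ("OUT", 4),
    ("HLT",),
]


def run(program):
    """Trace the program; return (registers, bytes sent via OUT)."""
    regs = [0] * 8   # assumed: R0..R7, 8 bits each
    uart = []
    pc = 0
    while True:
        op, *args = program[pc]
        if op == "LDI":
            regs[args[0]] = args[1] & 0xFF
        elif op == "ADD":
            regs[args[0]] = (regs[args[1]] + regs[args[2]]) & 0xFF
        elif op == "MOV":
            regs[args[0]] = regs[args[1]]
        elif op == "OUT":
            uart.append(regs[args[0]])
        elif op == "HLT":
            return regs, uart
        pc += 1


regs, uart = run(BOOT_PROGRAM)
print(regs[7], bytes(uart))
```

Under those assumptions the trace ends with R7 = 8 and a single byte 0x38 (ASCII '8') sent through OUT, which is what the listing’s comments claim.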
P06’s FSM is 4 cycles per instruction (FETCH → DECODE → EXECUTE →
WB). Eleven instructions × 4 = ~44 cycles of CPU work, plus the
UART transmit at the chosen baud rate (UART start + 8 data + stop
= 10 bit-times × baud_div system clocks). After all that:
- uo_out[1] (halted) goes high.
- uo_out[7:2] reflects R7’s low 6 bits = 8.
- uo_out[0] (uart_tx) idles back to high after delivering one start bit, the eight bits of 0x38 LSB-first, and a stop bit.
A host plugged into the dev board sees '8' arrive on the UART,
sees halted assert, and sees R7’s value on the parallel pins.
Three independent pieces of evidence that the chip booted
correctly.
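The cycle budget described above can be written out as arithmetic. This is a rough estimate under two assumptions that this page does not confirm: that one bit time is baud_div + 1 system clocks (suggested by the transcript’s "baud_div = 0x0003 (= 4 sysclks/bit)"), and that CPU work and the UART frame do not overlap. boot_clocks is a hypothetical helper.

```python
def boot_clocks(n_instructions=11, cycles_per_instruction=4,
                baud_div=3, bits_per_frame=10):
    """Estimate (cpu, uart, total) clocks for the boot sequence.

    Assumes bit time = baud_div + 1 clocks and no overlap between
    instruction execution and the UART frame.
    """
    cpu = n_instructions * cycles_per_instruction
    uart = bits_per_frame * (baud_div + 1)
    return cpu, uart, cpu + uart


cpu, uart, total = boot_clocks()
print(cpu, uart, total)
print(f"{total / 50e6 * 1e6:.2f} us at 50 MHz")
```

With the demo’s settings the estimate is 44 + 40 = 84 clocks, one clock short of the 85 the transcript reports; where the testbench starts and stops counting is not stated here, so the off-by-one is not surprising.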
The boot program is the parameter default of p06_top —
testbenches and any future P12 can override PROG to run anything
that fits in 32 instructions (the ROM size).
The submission flow
Going from this RTL to silicon-in-hand involves five parties and roughly six months of waiting.
Concretely, the steps you take:
- Create a TT user-project repo from the template at tt-template. This gives you a src/ with a placeholder Verilog file, an info.yaml with project metadata, and a .github/workflows/ directory with the official harden CI.
- Drop your RTL in. For this project we’d copy projects/11_tt_cpu/src/top.sv into the template’s src/ and make sure the module name matches what info.yaml declares.
- Edit info.yaml with the project name, your handle, the tile-size you want (2x2 for this CPU), the labels for each pin (how ui_in[0] should appear in the TT explorer UI), and a short description.
- Push to GitHub. TT’s CI workflow runs OpenLane / LibreLane against the TT shuttle’s standardized config and uploads the resulting GDS as a release artifact. You can iterate on the workflow output as much as you want before submitting.
- Submit via the TT website. Pick the next-open shuttle (“Tiny Tapeout 11”, say), point it at your repo and the workflow run, pay $300, done.
- TT aggregates submissions into a single shuttle die. They run the integration flow that drops every user project into a tile slot and wires up the chip-level mux and pad ring. This is when your wrapper’s pin frame matters: TT’s chip-level RTL relies on the exact tt_um_* signature.
- Tape-out to SkyWater. ~3 months for the wafers to come back.
- TT packages and ships dev boards.
- You plug it in. screen /dev/ttyUSB0 shows an '8'. Halt light comes on. Your CPU runs Fibonacci(6) on actual silicon.
What we actually have on this site
The wrapper module + a standalone harden config + a verifying testbench. The dev-board side, the chip-level pin mux, the actual TT submission process, the dev board that lights up after you plug it in — all of that lives in TT’s infrastructure, not this repo. What this project demonstrates is the upstream half: the RTL shape, the wrapper pattern, the timing budget, and the hardened GDS for one tile.
RTL — the wrapper
The whole thing is tt_um_librelane_p06_cpu plus an inlined copy
of P06 (renamed to p06_top to avoid collisions on a shuttle
hosting hundreds of top modules):
module tt_um_librelane_p06_cpu (
input wire [7:0] ui_in,
output wire [7:0] uo_out,
input wire [7:0] uio_in,
output wire [7:0] uio_out,
output wire [7:0] uio_oe,
input wire ena,
input wire clk,
input wire rst_n
);
// Compose the 16-bit baud_div from the two TT input ports: high byte
// from uio_in, low byte from ui_in. For example, uio_in = 0x03 and
// ui_in = 0x64 give 0x0364 (decimal 868), i.e. 115200 baud at a
// 100 MHz user clock.
wire [15:0] baud_div = {uio_in, ui_in};
// CPU outputs.
wire [7:0] out;
wire [4:0] pc_out;
wire halted;
wire uart_tx;
// Effective reset — held low when the TT mux deselects us via ena.
wire effective_rst_n = rst_n & ena;
p06_top u_cpu (
.clk (clk),
.rst_n (effective_rst_n),
.start (1'b1), // CPU is always running when not in reset
.baud_div (baud_div),
.out (out),
.pc_out (pc_out),
.halted (halted),
.uart_tx (uart_tx)
);
// Pack the visible outputs.
assign uo_out = ena ? {out[5:0], halted, uart_tx} : 8'h00;
assign uio_out = 8'h00;
assign uio_oe = 8'h00;
// pc_out and out[7:6] are not exposed on the TT pin frame. Sink them
// here so verilator doesn't flag unused signals (ena is already used
// above; listing it again is harmless).
wire _unused = &{1'b0, ena, pc_out, out[7:6]};
endmodule

The testbench
Two testbenches sit on top of the wrapper. Both compile with
iverilog -g2012; both drive the wrapper from the chip-pin side
exactly the way the TT shuttle dev board would.
tb.sv is the strict verifier — make test PROJECT=11_tt_cpu
exits non-zero unless every check passes. tb_demo.sv is the
narrator — it produces the transcript at the top of this page so
the page output and the testbench output stay in lock-step.
The interesting bit is the host-side UART receiver. Both
testbenches share the same shape: wait for a falling edge on
uo_out[0] (the start bit), wait 1.5 bit-times to land in the
middle of bit-0, then sample 8 data bits a bit-time apart. It’s
the same algorithm a real UART chip runs in hardware, just written
in SystemVerilog instead of synthesizable RTL:
// ---------------------------------------------------------------
logic [7:0] rx_byte;
initial begin
@(posedge ena);
@(posedge clk);
forever begin
wait (uo_out[0] == 1'b0);
// Center on bit-0: 1.5 bit-times after the start-bit edge.
repeat (BAUD_CLOCKS + BAUD_CLOCKS/2) @(posedge clk);
for (int b = 0; b < 8; b++) begin
rx_byte[b] = uo_out[0];
repeat (BAUD_CLOCKS) @(posedge clk);
end
if (rx_byte >= 8'h20 && rx_byte < 8'h7f)
$display("[uart-rx] byte 0x%02h = '%c'", rx_byte, rx_byte);
else
$display("[uart-rx] byte 0x%02h", rx_byte);
wait (uo_out[0] == 1'b1); // wait for stop bit
end
end

The strict tb’s UART decoder is the same code, plus a
rx_byte_captured flag the result-checking initial polls before
asserting check8(rx_byte, 8'h38, ...).
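The receive algorithm described above (find the start-bit edge, wait 1.5 bit-times, then sample eight bits one bit-time apart) can also be modelled outside the simulator. The sketch below is a software analogue, not the project’s testbench: it treats the line as a list of per-clock samples, and uart_frame and uart_decode are hypothetical names.

```python
def uart_frame(byte, clocks_per_bit):
    """Return per-clock line levels for one 8N1 frame, LSB first."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]  # start, data, stop
    line = []
    for bit in bits:
        line.extend([bit] * clocks_per_bit)
    return line


def uart_decode(line, clocks_per_bit):
    """Decode one byte using mid-bit sampling.

    Finds the first low sample (start-bit edge), skips 1.5 bit-times to
    land mid-way through data bit 0, then samples once per bit-time.
    """
    start = line.index(0)
    sample = start + clocks_per_bit + clocks_per_bit // 2
    value = 0
    for b in range(8):
        value |= line[sample] << b
        sample += clocks_per_bit
    return value


idle = [1] * 5                      # line idles high before and after
line = idle + uart_frame(0x38, 4) + idle
print(hex(uart_decode(line, 4)))
```

With 4 clocks per bit, as in the demo, the round trip recovers 0x38. Sampling at the middle of each bit rather than at its edge is what gives a real receiver tolerance to small clock mismatch between the two ends.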
Comparing P10 and P11
| | P10 | P11 (this project) |
|---|---|---|
| Wrapped design | P02: counter/PWM/LFSR | P06: FSM CPU |
| Tile size | 2 × 1 | 2 × 2 |
| Internal logic | ≈140 cells (P02 hardened) | ≈730 cells (P06 hardened, plus a tiny wrapper) |
| Has UART | no | yes (P03 module reused inside P06) |
| Boot behaviour | static (counter increments) | runs a program, halts |
| Multi-byte signals | no (everything is 1 or 2 bits) | yes (baud_div is 16 bits, split across ui_in/uio_in) |
P10 is the “hello world” of TT submission. P11 is the “hello world of putting a programmable processor on TT.” They’re the same shape of project — both are thin wrappers — just at different points on the design-complexity ladder.
What just happened?
We took the smallest CPU on this ladder (P06, hardened at 100 MHz in its own project) and adapted it for fabrication via Tiny Tapeout. The wrapper is small; the lesson is big. Real-world tape-outs always come with a harness — TT, Caravel, a custom socket — and the user logic is just the interesting part sitting inside that harness. P11 is what it looks like to drop a small but real piece of compute hardware into the smallest practical fab path.
See also
- Project 06 → the underlying CPU.
- Project 10 → simpler TT wrapper for comparison.
- Tiny Tapeout — the actual programme.
- tt-template — the GitHub template you start from for a real submission.
- Project README