Project No. 11 of 147 on the ladder

A real CPU on Tiny Tapeout

introduces — TT shuttle architecture, user-mux, dev board, GHA submission flow

harden state (last run 2026-04-29)
  cells: 758 non-filler
  slack: 7.32 ns setup
  area: 74925 µm² (die) / 68511 µm² (core)
  signoff
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

P06’s FSM CPU on a Tiny Tapeout shuttle. Where project 10 was a thin pin-frame adapter around the simplest design on the ladder (counter / PWM / LFSR), P11 puts an actual instruction-fetching, register-writing, UART-emitting microcontroller onto a real fab path. The wrapper is maybe 60 lines; everything else on this page is what surrounds it on the actual silicon.

Status: Hardened. Standalone harden at 50 MHz on a 333 × 225 µm die (TT 2×2 tile) — 758 non-filler cells, 7.32 ns of setup slack, zero DRC/LVS/antenna violations. The wrapper preserves P06’s behaviour under iverilog: boots, runs Fibonacci(6) into R7, prints '8' on the UART, halts.

This is not the actual TT submission. TT runs its own CI-driven flow against fixed tile sizes and the shuttle’s chip-level pin-mux RTL. What lives here is the wrapper plus a sanity-check harden — confirm the pin shape synthesizes cleanly before pushing to a TT shuttle.

[Interactive layout view: 333 × 225 µm die · sky130A · 50 MHz · TT 2×2 tile · met1+met2+met3 only]
[Interactive 3D view: sky130A · metal stack only · z exaggerated 10×]

What it does at boot

The shortest possible end-to-end demo: power up the chip, the TT mux selects this project, the CPU runs its 11-instruction boot program, one byte (0x38 = ASCII '8') appears on the UART, then the CPU halts. A host-side UART receiver in tb_demo.sv watches uo_out[0] and decodes the byte. With the testbench's deliberately fast baud_div = 3, the whole thing takes 85 clocks, which is 1.7 µs at the 50 MHz harden clock. At a realistic baud rate the UART frame dominates: ten bit-times is about 87 µs at 115200 baud.

[host]     baud_div = 0x0003  (= 4 sysclks/bit)
[host]     ena = 0   ; project deselected
[host]     uo_out = 0x00  (muted by ena=0)
[host]     rst_n released, ena still 0  ->  uo_out = 0x00
[host]     ena = 1   ; CPU running boot program...
[uart-rx]  byte 0x38  = '8'
[host]     halt detected after 85 clocks
[host]     uo_out = 0x23
[host]       uo_out[0]   = uart_tx   = 1  (line idle)
[host]       uo_out[1]   = halted    = 1
[host]       uo_out[7:2] = R7[5:0]   = 8  (= fib(6))
[host]       uio_oe      = 0x00     (uio is input-only)
[host]     done.

That’s three independent pieces of evidence, from one boot, that this works:

  • The CPU executed the boot program correctly — the ALU + register file + FSM are all alive (R7 = 8 on uo_out[7:2]).
  • The UART transmitter framed an 8N1 byte that a host-side decoder recovered intact ('8' on uo_out[0]).
  • The TT pin frame is packed exactly the way the page describes — ena=0 mutes the chip outputs, uio_oe stays low because we declared uio as input-only.
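The final uo_out = 0x23 in the transcript can be unpacked by hand or with a few lines of host-side code. The following is a minimal Python sketch, assuming the bit layout printed in the transcript; the function name decode_uo_out is hypothetical and not part of the repo.

```python
def decode_uo_out(value: int) -> dict:
    """Split the 8-bit uo_out bus into the fields listed in the transcript."""
    if not 0 <= value <= 0xFF:
        raise ValueError("uo_out is an 8-bit bus")
    return {
        "uart_tx": value & 0x01,         # uo_out[0]: serial line (1 = idle)
        "halted": (value >> 1) & 0x01,   # uo_out[1]: high after HLT
        "r7_low6": (value >> 2) & 0x3F,  # uo_out[7:2]: low 6 bits of R7
    }


print(decode_uo_out(0x23))
# {'uart_tx': 1, 'halted': 1, 'r7_low6': 8}
```

Only six bits of R7 reach the pins, so a host can distinguish values 0 to 63 this way; anything larger would need the UART path.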

This is the narrating testbench (make demo PROJECT=11_tt_cpu). There’s also a strict pass/fail variant — tb.sv, run by make test PROJECT=11_tt_cpu — that asserts each of the above checks and exits non-zero if anything is off.

What Tiny Tapeout actually is

Tiny Tapeout is a service that aggregates hundreds of small user projects onto a single shuttle die that gets fabbed at SkyWater. Cost-per-project drops from “tens of thousands of dollars per tape-out” (a real foundry submission) to “$300 per shuttle slot” because the cost of the masks is amortized across everyone on the shuttle.

Each shuttle has a fixed chip-level harness:

  • A mux that selects which user project’s logic is connected to the shared external pins at any given moment. The user picks via selection inputs; the mux’s ena output goes high for the active project.
  • A fixed pin frame — every user project must expose 8 dedicated inputs, 8 dedicated outputs, 8 bidirectional pins, plus clk/rst_n/ena. No more, no fewer.
  • A grid of tile slots of varying sizes (1×1, 1×2, 2×2, …), built from a base tile roughly 160 µm wide. The bigger your design, the more tiles it occupies (and the more shuttle slots it costs).
  • The TT dev board, a PCB that physically holds the shuttle die, routes its pins to easily-accessible headers, drives the user mux from a tiny on-board MCU, and exposes UART / I²C / SPI to a host computer.

For this project, P06’s CPU fits in a 2×2 tile (333 × 225 µm).

The pin map

P06 has more outputs than the TT pin frame can carry. We pick what to expose:

[Diagram: P06 CPU (clk · rst_n · start · baud_div[15:0] …) → wrapper → TT pins (clk · rst_n · ena · ui_in[7:0] …)]
P06's full interface on the left; the TT-shuttle pin frame on the right. The wrapper picks what to expose and how to pack it.

The packing this design picks:

TT pin | role | direction | notes
ui_in[7:0] | baud_div[7:0] | in | low byte of UART baud divider
uio_in[7:0] | baud_div[15:8] | in | high byte of UART baud divider
uo_out[0] | uart_tx | out | serial output (one byte, 0x38 = ASCII '8', after the boot program runs; see below)
uo_out[1] | halted | out | high after CPU hits HLT
uo_out[7:2] | out[5:0] | out | low 6 bits of register R7
uio_out | always 0 | out | unused
uio_oe | always 0 | out | uio is input-only here
ena | enable | in | TT mux holds this high while we’re selected

The 16-bit baud_div doesn’t fit in any single 8-bit TT port, so the wrapper splits it: ui_in carries the low byte, uio_in the high byte. A host program on the TT dev board can drive both at runtime to match whatever clock the shuttle ends up using.
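As a rough illustration of what such a host program has to compute, here is a minimal Python sketch that picks a divider and splits it across the two ports. It assumes the divider is simply the rounded number of system clocks per bit, which is what the 868-at-100-MHz comment in the wrapper source suggests. The demo transcript pairs baud_div = 3 with 4 clocks per bit, so the core may count baud_div + 1; check this against P06's UART before relying on it. The function names are hypothetical.

```python
def baud_div_for(clk_hz: int, baud: int) -> int:
    """Nearest integer number of system clocks per UART bit (assumed convention)."""
    div = round(clk_hz / baud)
    if not 1 <= div <= 0xFFFF:
        raise ValueError("divider does not fit in 16 bits")
    return div


def split_baud_div(div: int) -> tuple[int, int]:
    """Return (ui_in, uio_in): the low byte, then the high byte."""
    return div & 0xFF, (div >> 8) & 0xFF


div = baud_div_for(100_000_000, 115_200)
ui_in, uio_in = split_baud_div(div)
print(f"baud_div=0x{div:04x} ui_in=0x{ui_in:02x} uio_in=0x{uio_in:02x}")
# baud_div=0x0364 ui_in=0x64 uio_in=0x03
```

The same helper applies to other clocks: at the 50 MHz harden clock, baud_div_for(50_000_000, 115_200) returns 434 under the same assumption.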

The boot program

rst_n deasserts; the CPU starts at PC=0 and runs the program baked into PROG. The default boot computes Fibonacci(6) into R7 and then emits one ASCII byte over the UART so a host can prove the UART path is alive:

00 LDI R1, 1
01 LDI R2, 1
02 ADD R3, R1, R2     ; R3 = 2
03 ADD R4, R2, R3     ; R4 = 3
04 ADD R5, R3, R4     ; R5 = 5
05 ADD R6, R4, R5     ; R6 = 8
06 MOV R7, R6         ; R7 = 8 (= the 6th Fibonacci number)
07 LDI R5, 0x30       ; R5 = ASCII '0' offset
08 ADD R4, R7, R5     ; R4 = 0x38 (= ASCII '8')
09 OUT R4             ; UART <- '8'
10 HLT

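P06’s FSM takes 4 cycles per instruction (FETCH → DECODE → EXECUTE → WB). Eleven instructions × 4 = 44 cycles of CPU work, plus the UART transmit at the chosen baud rate: start + 8 data + stop = 10 bit-times, each baud_div + 1 system clocks long going by the demo transcript (baud_div = 3 gives 4 clocks per bit). For the demo that is 44 + 40 = 84 clocks, in line with the 85 the testbench reports. After all that: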

  • uo_out[1] (halted) goes high.
  • uo_out[7:2] reflects R7’s low 6 bits = 8.
  • uo_out[0] (uart_tx) idles back to high after delivering one start bit, the eight bits of 0x38 LSB-first, and a stop bit.
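The cycle budget can be written down as a small estimate. This is a sketch under two assumptions that the page does not state outright: that a bit-time is baud_div + 1 clocks (read off the demo transcript, where baud_div = 3 is labelled 4 sysclks/bit), and that the CPU work and the UART frame add up sequentially. The function name boot_cycles is hypothetical.

```python
CYCLES_PER_INSTRUCTION = 4   # FETCH, DECODE, EXECUTE, WB
BOOT_INSTRUCTIONS = 11       # length of the default boot program
UART_BITS_PER_BYTE = 10      # start + 8 data + stop


def boot_cycles(baud_div: int) -> int:
    """Estimate clocks from reset release to the end of the UART frame."""
    clocks_per_bit = baud_div + 1  # assumption taken from the demo transcript
    cpu = BOOT_INSTRUCTIONS * CYCLES_PER_INSTRUCTION
    uart = UART_BITS_PER_BYTE * clocks_per_bit
    return cpu + uart


cycles = boot_cycles(3)
print(f"{cycles} clocks = {cycles / 50e6 * 1e6:.2f} us at 50 MHz")
```

For baud_div = 3 the estimate is 84 clocks, one short of the 85 the testbench reports, which is plausibly reset-release or sampling overhead. With a divider near 868 the UART term grows to several thousand clocks and dwarfs the 44 clocks of CPU work.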

A host plugged into the dev board sees '8' arrive on the UART, sees halted assert, and sees R7’s value on the parallel pins. Three independent pieces of evidence that the chip booted correctly.

The boot program is the parameter default of p06_top — testbenches and any future P12 can override PROG to run anything that fits in 32 instructions (the ROM size).
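Before overriding PROG it can help to check a program's expected result in a software model. The sketch below is a hypothetical Python interpreter for just the five mnemonics used in the listing above. It assumes eight 8-bit registers and the semantics implied by the listing's comments, and it does not model P06's instruction encoding or timing.

```python
ROM_SIZE = 32  # instructions, per the ROM size quoted above


def run(program):
    """Execute (op, *args) tuples; return (registers, bytes sent to UART)."""
    if len(program) > ROM_SIZE:
        raise ValueError("program does not fit in the ROM")
    regs = [0] * 8   # assumption: R0..R7, 8 bits each
    uart = []
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LDI":
            regs[args[0]] = args[1] & 0xFF
        elif op == "ADD":
            regs[args[0]] = (regs[args[1]] + regs[args[2]]) & 0xFF
        elif op == "MOV":
            regs[args[0]] = regs[args[1]]
        elif op == "OUT":
            uart.append(regs[args[0]])
        elif op == "HLT":
            break
        else:
            raise ValueError(f"unknown op {op}")
    return regs, uart


BOOT = [
    ("LDI", 1, 1), ("LDI", 2, 1),
    ("ADD", 3, 1, 2), ("ADD", 4, 2, 3), ("ADD", 5, 3, 4), ("ADD", 6, 4, 5),
    ("MOV", 7, 6),
    ("LDI", 5, 0x30),
    ("ADD", 4, 7, 5),
    ("OUT", 4),
    ("HLT",),
]

regs, uart = run(BOOT)
print(regs[7], bytes(uart))
# 8 b'8'
```

The model reproduces the two values a host observes: R7 = 8 on the parallel pins and the single byte 0x38 on the UART.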

The submission flow

Going from this RTL to silicon-in-hand involves five parties and roughly six months of waiting:

You (this RTL) → Your TT user repo → GitHub Actions (OpenLane harden) → TT submission aggregator → Shuttle chip-level integration → SkyWater fab (~3 months) → TT receives wafers + packages → Your dev board arrives
The TT submission timeline. Each arrow is a person- or process-driven handoff. Most of the latency lives between Submission and Tape-out (TT batches submissions to hit the foundry's quarterly window).

Concretely, the steps you take:

  1. Create a TT user-project repo from the template at tt-template. This gives you a src/ with a placeholder Verilog file, an info.yaml with project metadata, and a .github/workflows/ directory with the official harden CI.
  2. Drop your RTL in. For this project we’d copy projects/11_tt_cpu/src/top.sv into the template’s src/ and make sure the module name matches what info.yaml declares.
  3. Edit info.yaml with the project name, your handle, the tile-size you want (2x2 for this CPU), the labels for each pin (how ui_in[0] should appear in the TT explorer UI), and a short description.
  4. Push to GitHub. TT’s CI workflow runs OpenLane / LibreLane against the TT shuttle’s standardized config and uploads the resulting GDS as a release artifact. You can iterate on the workflow output as much as you want before submitting.
  5. Submit via the TT website. Pick the next-open shuttle (“Tiny Tapeout 11”, say), point it at your repo and the workflow run, pay $300, done.
  6. TT aggregates submissions into a single shuttle die. They run the integration flow that drops every user project into a tile slot and wires up the chip-level mux and pad ring. This is when your wrapper’s pin frame matters — TT’s chip-level RTL relies on the exact tt_um_* signature.
  7. Tape-out to SkyWater. ~3 months for the wafers to come back.
  8. TT packages and ships dev boards.
  9. You plug it in. screen /dev/ttyUSB0 shows an '8'. Halt light comes on. Your CPU runs Fibonacci(6) on actual silicon.

What we actually have on this site

The wrapper module + a standalone harden config + a verifying testbench. The chip-level pin mux, the actual TT submission process, and the dev board that lights up after you plug it in all live in TT’s infrastructure, not this repo. What this project demonstrates is the upstream half: the RTL shape, the wrapper pattern, the timing budget, and the hardened GDS for one tile.

RTL — the wrapper

The whole thing is tt_um_librelane_p06_cpu plus an inlined copy of P06 (renamed to p06_top to avoid collisions on a shuttle hosting hundreds of top modules):

projects/11_tt_cpu/src/top.sv system-verilog · L48-93
module tt_um_librelane_p06_cpu (
    input  wire [7:0] ui_in,
    output wire [7:0] uo_out,
    input  wire [7:0] uio_in,
    output wire [7:0] uio_out,
    output wire [7:0] uio_oe,
    input  wire       ena,
    input  wire       clk,
    input  wire       rst_n
);

  // Compose the 16-bit baud_div from the two TT input ports: high byte
  // from uio_in, low byte from ui_in. Example: uio_in = 0x03 and
  // ui_in = 0x64 give 0x0364 (decimal 868), i.e. 115200 baud at a
  // 100 MHz user clock.
  wire [15:0] baud_div = {uio_in, ui_in};

  // CPU outputs.
  wire [7:0] out;
  wire [4:0] pc_out;
  wire       halted;
  wire       uart_tx;

  // Effective reset — held low when the TT mux deselects us via ena.
  wire effective_rst_n = rst_n & ena;

  p06_top u_cpu (
    .clk      (clk),
    .rst_n    (effective_rst_n),
    .start    (1'b1),                  // CPU is always running when not in reset
    .baud_div (baud_div),
    .out      (out),
    .pc_out   (pc_out),
    .halted   (halted),
    .uart_tx  (uart_tx)
  );

  // Pack the visible outputs.
  assign uo_out  = ena ? {out[5:0], halted, uart_tx} : 8'h00;
  assign uio_out = 8'h00;
  assign uio_oe  = 8'h00;

  // start input on the inner CPU is unused — but we use ena. Keep
  // ena referenced explicitly so verilator doesn't flag it.
  wire _unused = &{1'b0, ena, pc_out, out[7:6]};

endmodule

The testbench

Two testbenches sit on top of the wrapper. Both compile with iverilog -g2012; both drive the wrapper from the chip-pin side exactly the way the TT shuttle dev board would.

tb.sv is the strict verifier — make test PROJECT=11_tt_cpu exits non-zero unless every check passes. tb_demo.sv is the narrator — it produces the transcript at the top of this page so the page output and the testbench output stay in lock-step.

The interesting bit is the host-side UART receiver. Both testbenches share the same shape: wait for a falling edge on uo_out[0] (the start bit), wait 1.5 bit-times to land in the middle of bit-0, then sample 8 data bits a bit-time apart. It’s the same algorithm a real UART chip runs in hardware, just written in SystemVerilog instead of synthesizable RTL:

projects/11_tt_cpu/test/tb_demo.sv system-verilog · L51-71
  // ---------------------------------------------------------------
  logic [7:0] rx_byte;
  initial begin
    @(posedge ena);
    @(posedge clk);
    forever begin
      wait (uo_out[0] == 1'b0);
      // Center on bit-0: 1.5 bit-times after the start-bit edge.
      repeat (BAUD_CLOCKS + BAUD_CLOCKS/2) @(posedge clk);
      for (int b = 0; b < 8; b++) begin
        rx_byte[b] = uo_out[0];
        repeat (BAUD_CLOCKS) @(posedge clk);
      end
      if (rx_byte >= 8'h20 && rx_byte < 8'h7f)
        $display("[uart-rx]  byte 0x%02h  = '%c'", rx_byte, rx_byte);
      else
        $display("[uart-rx]  byte 0x%02h", rx_byte);
      wait (uo_out[0] == 1'b1);          // wait for stop bit
    end
  end

The strict tb’s UART decoder is the same code, plus an rx_byte_captured flag that the result-checking initial block polls before asserting check8(rx_byte, 8'h38, ...).
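The mid-bit sampling idea is easy to check outside a simulator. The sketch below is a hypothetical Python model, not code from the repo. It builds a per-clock 8N1 waveform and decodes it the way the testbench receiver does: find the start-bit edge, skip 1.5 bit-times, then take 8 samples one bit-time apart.

```python
def encode_8n1(byte: int, clocks_per_bit: int) -> list[int]:
    """One line sample per clock: idle, start bit, 8 data bits LSB-first, stop bit."""
    bits = [1, 0] + [(byte >> i) & 1 for i in range(8)] + [1]
    return [b for b in bits for _ in range(clocks_per_bit)]


def decode_8n1(samples: list[int], clocks_per_bit: int) -> int:
    """Recover one byte by sampling near the middle of each data bit."""
    start = samples.index(0)  # first low sample marks the start bit
    pos = start + clocks_per_bit + clocks_per_bit // 2
    byte = 0
    for b in range(8):
        byte |= samples[pos] << b  # LSB arrives first
        pos += clocks_per_bit
    return byte


wave = encode_8n1(0x38, clocks_per_bit=4)
print(hex(decode_8n1(wave, clocks_per_bit=4)))
# 0x38
```

Sampling mid-bit rather than at the bit edge is what gives the receiver tolerance to small clock mismatches between transmitter and host.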

Comparing P10 and P11

 | P10 | P11 (this project)
Wrapped design | P02: counter/PWM/LFSR | P06: FSM CPU
Tile size | 2 × 1 | 2 × 2
Internal logic | ≈140 cells (P02 hardened) | ≈730 cells (P06 hardened, plus a tiny wrapper)
Has UART | no | yes (P03 module reused inside P06)
Boot behaviour | static (counter increments) | runs a program, halts
Multi-byte signals | no (everything is 1 or 2 bits) | yes (baud_div is 16 bits, split across ui_in/uio_in)

P10 is the “hello world” of TT submission. P11 is the “hello world of putting a programmable processor on TT.” They’re the same shape of project — both are thin wrappers — just at different points on the design-complexity ladder.

What just happened?

We took the smallest CPU on this ladder (P06, hardened at 100 MHz in its own project) and adapted it for fabrication via Tiny Tapeout. The wrapper is small; the lesson is big. Real-world tape-outs always come with a harness — TT, Caravel, a custom socket — and the user logic is just the interesting part sitting inside that harness. P11 is what it looks like to drop a small but real piece of compute hardware into the smallest practical fab path.
