Project No. 16 of 147 on the ladder

RV32I SRAM UART loader

Introduces: a UART-loaded SRAM boot path for the first official RV32I arch-test smoke

harden state: last run 2026-04-30
cells: 55,282 non-filler
slack: 13.48 ns setup
area: 4420000 (die) / 4329320 (core) μm²
signoff:
  • DRC: PASS
  • LVS: PASS
  • antenna: PASS

P16 keeps the P15 RV32I core and 8 KiB SRAM macro target, but stops relying on a testbench-only SRAM preload. The program image now enters through uart_rx, the loader writes SRAM byte by byte, and only then does the CPU come out of reset.

Status: hardened, with electrical cleanup remaining. The official rv32i/I/I-nop-00.S test from upstream riscv-arch-test revision a7c9930 builds in RVTEST_SELFCHECK mode, loads into SRAM over the top-level UART pin, and passes under Icarus Verilog. LibreLane run RUN_2026-04-30_13-21-47 produced the final GDS, routed with 0 detailed-route DRC errors, and passed Magic DRC, KLayout XOR, KLayout DRC, LVS, antenna, setup timing, and hold timing. The max-slew/max-cap checker is still PARTIAL, with 4719 slew and 153 cap warnings. A wider official-source rv32i/I batch now builds all 39 tests: I-nop and I-fence pass, none fail, and 37 are NOT RUN because their generated self-check images exceed the 8 KiB SRAM.

The result

Run it:

make -C projects/16_rv32i_sram_loader/test

The expected final lines are:

PASS: P16 loaded 8192 bytes into SRAM over UART
PASS: P16 UART-loaded official rv32i/I/I-nop-00.S self-check accepted after 1067 clocks
PASS: P16 UART SRAM-loader acceptance smoke complete.

| field | value |
| --- | --- |
| Official source | rv32i/I/I-nop-00.S |
| Upstream revision | a7c9930 |
| Build mode | RVTEST_SELFCHECK |
| Simulator | Icarus Verilog |
| Loader path | UART byte stream into 8 KiB SRAM |
| RTL result | PASS |
| Wider rv32i/I batch | PARTIAL: 2 PASS, 0 FAIL, 37 NOT RUN |
| Latest harden run | PASS with outstanding slew/cap warnings |

The backend run used the same four-macro SRAM shape as P15:

| check | result |
| --- | --- |
| Run directory | projects/16_rv32i_sram_loader/librelane/runs/RUN_2026-04-30_13-21-47 |
| Final GDS | projects/16_rv32i_sram_loader/librelane/runs/RUN_2026-04-30_13-21-47/final/gds/top.gds |
| Metrics | projects/16_rv32i_sram_loader/librelane/runs/RUN_2026-04-30_13-21-47/final/metrics.json |
| Route DRC | PASS (0) |
| Magic DRC | PASS (0) |
| KLayout XOR | PASS (0) |
| KLayout DRC | PASS (0) |
| LVS | PASS (0) |
| Antenna | PASS (0) |
| Setup / hold | PASS / PASS |
| Worst setup / hold slack | 13.475 ns / 0.0779 ns |
| Max slew / cap checker | PARTIAL: 4719 slew and 153 cap warnings |

The first harden command failed at the optional KLayout render step because the SRAM streamout produced a multiple-top-cell render issue. Resuming from Magic.WriteLEF with KLayout.Render skipped completed the signoff path and wrote the final artifacts.

KLayout DRC needed one more detail: the first DRC retry reused Magic streamout and failed before rule checking for the same multiple-top-cell reason. With PRIMARY_GDSII_STREAMOUT_TOOL: klayout, the final GDS came from KLayout streamout, and 81-klayout-drc reported 0 violations.

Arch-Test Batch

The repeatable batch command is:

make -C projects/16_rv32i_sram_loader/test batch

It builds every upstream rv32i/I/*.S file in RVTEST_SELFCHECK mode, turns each ELF into a Verilog memory image, and runs any image that fits in the 8 KiB SRAM through the same top-level UART loader path.

| result | count |
| --- | --- |
| PASS | 2 |
| FAIL | 0 |
| NOT RUN | 37 |

The passing tests are I-nop-00.S and I-fence-00.S. The NOT RUN cases all built successfully, but their generated self-check/signature images are larger than the current SRAM window. The closest near-misses are I-auipc and I-lui at 8968 bytes, just over 8 KiB; the arithmetic and branch tests are much larger because the official framework allocates signature space. The full table lives in projects/16_rv32i_sram_loader/compliance/RESULTS.md.
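The fit check behind the NOT RUN bucket can be sketched in a few lines. This is only an illustration: the helper names `batch_status` and `overshoot` are hypothetical, and the assumption that the batch compares image size against exactly 8192 bytes is inferred from the loader's valid count range, not taken from the project's Makefile.

```python
SRAM_BYTES = 8 * 1024  # 8 KiB SRAM window described in the text


def batch_status(image_bytes: int) -> str:
    """Return 'RUN' if an image fits the SRAM window, else 'NOT RUN'.

    Assumption: an image of exactly 8192 bytes still fits, matching the
    loader's valid count range of 1..8192.
    """
    return "RUN" if 0 < image_bytes <= SRAM_BYTES else "NOT RUN"


def overshoot(image_bytes: int) -> int:
    """Bytes by which an image exceeds the SRAM window (0 if it fits)."""
    return max(0, image_bytes - SRAM_BYTES)


# The near-miss size quoted above for I-auipc and I-lui.
print(batch_status(8968), overshoot(8968))
```

Under these assumptions the two near-miss tests overshoot the window by a few hundred bytes, while the arithmetic and branch tests would need a substantially larger memory.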

The top-level handoff

The loader and CPU share one SRAM bus. The CPU is held in reset until the loader has written a full packet (loader_done is high) and load_mode has been deasserted.

projects/16_rv32i_sram_loader/src/top.sv system-verilog · L19-121
    output logic        loader_done,
    output logic        loader_error
);

  wire        cpu_mem_valid;
  wire        cpu_mem_we;
  wire [1:0]  cpu_mem_size;
  wire [31:0] cpu_mem_addr;
  wire [31:0] cpu_mem_wdata;
  wire [3:0]  cpu_mem_wstrb;

  wire        loader_mem_valid;
  wire        loader_mem_we;
  wire [1:0]  loader_mem_size;
  wire [31:0] loader_mem_addr;
  wire [31:0] loader_mem_wdata;
  wire [3:0]  loader_mem_wstrb;

  wire        mem_valid;
  wire        mem_we;
  wire [1:0]  mem_size;
  wire [31:0] mem_addr;
  wire [31:0] mem_wdata;
  wire [3:0]  mem_wstrb;
  wire [31:0] mem_rdata;
  wire        mem_ready;
  wire        mem_error;

  wire [7:0] rx_data;
  wire       rx_valid;
  wire       loader_owns_bus = load_mode || !loader_done;
  wire       cpu_rst_n = rst_n && loader_done && !load_mode;

  p16_uart_rx u_uart_rx (
    .clk       (clk),
    .rst_n     (rst_n),
    .baud_div  (baud_div),
    .rx        (uart_rx),
    .data_out  (rx_data),
    .valid_out (rx_valid)
  );

  p16_sram_loader u_loader (
    .clk        (clk),
    .rst_n      (rst_n),
    .load_mode  (load_mode),
    .rx_data    (rx_data),
    .rx_valid   (rx_valid),
    .mem_valid  (loader_mem_valid),
    .mem_we     (loader_mem_we),
    .mem_size   (loader_mem_size),
    .mem_addr   (loader_mem_addr),
    .mem_wdata  (loader_mem_wdata),
    .mem_wstrb  (loader_mem_wstrb),
    .mem_ready  (mem_ready),
    .mem_error  (mem_error),
    .done       (loader_done),
    .error      (loader_error)
  );

  p16_rv32i_arch_core u_core (
    .clk        (clk),
    .rst_n      (cpu_rst_n),
    .mem_valid  (cpu_mem_valid),
    .mem_we     (cpu_mem_we),
    .mem_size   (cpu_mem_size),
    .mem_addr   (cpu_mem_addr),
    .mem_wdata  (cpu_mem_wdata),
    .mem_wstrb  (cpu_mem_wstrb),
    .mem_rdata  (mem_rdata),
    .mem_ready  (mem_ready),
    .mem_error  (mem_error),
    .pc_out     (pc_out),
    .x5_out     (x5_out),
    .halted     (halted),
    .illegal    (illegal)
  );

  assign mem_valid = loader_owns_bus ? loader_mem_valid : cpu_mem_valid;
  assign mem_we    = loader_owns_bus ? loader_mem_we    : cpu_mem_we;
  assign mem_size  = loader_owns_bus ? loader_mem_size  : cpu_mem_size;
  assign mem_addr  = loader_owns_bus ? loader_mem_addr  : cpu_mem_addr;
  assign mem_wdata = loader_owns_bus ? loader_mem_wdata : cpu_mem_wdata;
  assign mem_wstrb = loader_owns_bus ? loader_mem_wstrb : cpu_mem_wstrb;

  p16_sram8k_bus_memory u_mem (
    .clk       (clk),
    .rst_n     (rst_n),
    .valid     (mem_valid),
    .we        (mem_we),
    .size      (mem_size),
    .addr      (mem_addr),
    .wdata     (mem_wdata),
    .wstrb     (mem_wstrb),
    .rdata     (mem_rdata),
    .ready     (mem_ready),
    .error     (mem_error)
  );

  wire _unused = &{1'b0, mem_rdata};

endmodule
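The two handoff expressions in this listing, `loader_owns_bus` and `cpu_rst_n`, are small enough to enumerate exhaustively. The following is a host-side sketch that mirrors them in Python (the function name `handoff` is hypothetical) and checks one property: the CPU is never out of reset while the loader still owns the bus.

```python
from itertools import product


def handoff(rst_n: bool, load_mode: bool, loader_done: bool):
    """Mirror the two combinational expressions from the top level."""
    loader_owns_bus = load_mode or not loader_done
    cpu_rst_n = rst_n and loader_done and not load_mode
    return loader_owns_bus, cpu_rst_n


# Walk all eight input combinations and print the resulting truth table.
for rst_n, load_mode, loader_done in product([False, True], repeat=3):
    owns, cpu_rst_n = handoff(rst_n, load_mode, loader_done)
    # The CPU must stay in reset whenever the loader drives the bus.
    assert not (owns and cpu_rst_n)
    print(int(rst_n), int(load_mode), int(loader_done), "->", int(owns), int(cpu_rst_n))
```

Because `cpu_rst_n` reduces to `rst_n && !loader_owns_bus`, the mux never has to arbitrate between two active masters.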

The loader packet is deliberately small:

| byte(s) | meaning |
| --- | --- |
| 0xa5 | start of load |
| count low | low 8 bits of byte count |
| count high | high 6 bits of byte count, in bits [5:0] |
| payload | count bytes, written from SRAM address zero |

Valid counts are 1..8192; a count of 0 or greater than 8192 puts the loader into its error state. For the current acceptance smoke, the testbench sends all 8192 bytes of the ELF-derived memory image.
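A host-side packet builder for this format might look like the sketch below. The function names `build_packet` and `pad_image` are hypothetical and not part of the project; only the magic byte, the count encoding, and the 1..8192 range come from the packet description.

```python
MAGIC = 0xA5       # start-of-load byte
MEM_BYTES = 8192   # largest valid payload count


def build_packet(payload: bytes) -> bytes:
    """Prefix a payload with the 3-byte loader header."""
    count = len(payload)
    if not 1 <= count <= MEM_BYTES:
        raise ValueError(f"payload must be 1..{MEM_BYTES} bytes, got {count}")
    # Low 8 bits first, then the high 6 bits in bits [5:0] of the next byte.
    header = bytes([MAGIC, count & 0xFF, (count >> 8) & 0x3F])
    return header + payload


def pad_image(image: bytes, size: int = MEM_BYTES, fill: int = 0x00) -> bytes:
    """Pad an image to the full SRAM size, as the acceptance smoke sends it."""
    if len(image) > size:
        raise ValueError("image does not fit in SRAM")
    return image + bytes([fill]) * (size - len(image))


packet = build_packet(pad_image(b"\x13\x00\x00\x00"))  # one NOP, zero padded
print(len(packet), packet[:3].hex())
```

For a full 8192-byte image the count is 0x2000, so the header bytes on the wire are a5, 00, 20.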

projects/16_rv32i_sram_loader/src/top.sv system-verilog · L123-277
module p16_sram_loader (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        load_mode,
    input  logic [7:0]  rx_data,
    input  logic        rx_valid,
    output logic        mem_valid,
    output logic        mem_we,
    output logic [1:0]  mem_size,
    output logic [31:0] mem_addr,
    output logic [31:0] mem_wdata,
    output logic [3:0]  mem_wstrb,
    input  logic        mem_ready,
    input  logic        mem_error,
    output logic        done,
    output logic        error
);

  localparam logic [13:0] MEM_BYTES = 14'd8192;

  typedef enum logic [3:0] {
    L_WAIT_MODE = 4'd0,
    L_WAIT_MAGIC = 4'd1,
    L_COUNT_LO = 4'd2,
    L_COUNT_HI = 4'd3,
    L_WAIT_BYTE = 4'd4,
    L_WRITE = 4'd5,
    L_DONE = 4'd6,
    L_ERROR = 4'd7
  } loader_state_t;

  loader_state_t state;
  logic [13:0] byte_count;
  logic [13:0] bytes_left;
  logic [12:0] wr_addr;
  logic [7:0]  byte_q;
  logic [7:0]  count_lo_q;

  wire [13:0] count_next = {rx_data[5:0], count_lo_q};
  wire count_ok = (count_next != 14'd0) && (count_next <= MEM_BYTES);

  always_comb begin
    mem_valid = (state == L_WRITE);
    mem_we    = 1'b1;
    mem_size  = 2'd0;
    mem_addr  = {19'h0, wr_addr};
    mem_wdata = {24'h0, byte_q};
    mem_wstrb = 4'b0001;
  end

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      state      <= L_WAIT_MODE;
      byte_count <= 14'd0;
      bytes_left <= 14'd0;
      wr_addr    <= 13'd0;
      byte_q     <= 8'h00;
      count_lo_q <= 8'h00;
      done       <= 1'b0;
      error      <= 1'b0;
    end else begin
      case (state)
        L_WAIT_MODE: begin
          if (load_mode) begin
            done       <= 1'b0;
            error      <= 1'b0;
            byte_count <= 14'd0;
            bytes_left <= 14'd0;
            wr_addr    <= 13'd0;
            state      <= L_WAIT_MAGIC;
          end
        end

        L_WAIT_MAGIC: begin
          if (!load_mode) begin
            state <= L_WAIT_MODE;
          end else if (rx_valid && rx_data == 8'ha5) begin
            state <= L_COUNT_LO;
          end
        end

        L_COUNT_LO: begin
          if (!load_mode) begin
            state <= L_WAIT_MODE;
          end else if (rx_valid) begin
            count_lo_q <= rx_data;
            state      <= L_COUNT_HI;
          end
        end

        L_COUNT_HI: begin
          if (!load_mode) begin
            state <= L_WAIT_MODE;
          end else if (rx_valid) begin
            if (count_ok) begin
              byte_count <= count_next;
              bytes_left <= count_next;
              wr_addr    <= 13'd0;
              state      <= L_WAIT_BYTE;
            end else begin
              error <= 1'b1;
              state <= L_ERROR;
            end
          end
        end

        L_WAIT_BYTE: begin
          if (!load_mode) begin
            state <= L_WAIT_MODE;
          end else if (rx_valid) begin
            byte_q <= rx_data;
            state  <= L_WRITE;
          end
        end

        L_WRITE: begin
          if (mem_ready) begin
            if (mem_error) begin
              error <= 1'b1;
              state <= L_ERROR;
            end else if (bytes_left == 14'd1) begin
              bytes_left <= 14'd0;
              done       <= 1'b1;
              state      <= L_DONE;
            end else begin
              bytes_left <= bytes_left - 14'd1;
              wr_addr    <= wr_addr + 13'd1;
              state      <= L_WAIT_BYTE;
            end
          end
        end

        L_DONE: begin
          if (load_mode) begin
            state <= L_DONE;
          end else begin
            state <= L_WAIT_MODE;
          end
        end

        L_ERROR: begin
          if (!load_mode) state <= L_WAIT_MODE;
        end

        default: begin
          error <= 1'b1;
          state <= L_ERROR;
        end
      endcase
    end
  end

  wire _unused = &{1'b0, byte_count};

endmodule
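To experiment with packet streams without a simulator, the state machine above can be approximated by a byte-level Python model. This is a simplified sketch, not the RTL: it assumes load_mode stays high for the whole load, that every SRAM write completes without mem_error before the next byte arrives (so L_WAIT_BYTE and L_WRITE are folded into one step), and the names `LoaderModel` and `load` are hypothetical.

```python
MAGIC = 0xA5
MEM_BYTES = 8192


class LoaderModel:
    """Byte-level approximation of the loader state machine."""

    def __init__(self) -> None:
        self.state = "WAIT_MAGIC"  # assumes load_mode is already high
        self.mem = bytearray(MEM_BYTES)
        self.count_lo = 0
        self.bytes_left = 0
        self.wr_addr = 0
        self.done = False
        self.error = False

    def rx(self, byte: int) -> None:
        """Consume one received UART byte (one rx_valid pulse)."""
        if self.state == "WAIT_MAGIC":
            if byte == MAGIC:  # any other byte is ignored
                self.state = "COUNT_LO"
        elif self.state == "COUNT_LO":
            self.count_lo = byte
            self.state = "COUNT_HI"
        elif self.state == "COUNT_HI":
            # Only bits [5:0] of the high byte contribute to the count.
            count = ((byte & 0x3F) << 8) | self.count_lo
            if 1 <= count <= MEM_BYTES:
                self.bytes_left = count
                self.wr_addr = 0
                self.state = "WAIT_BYTE"
            else:
                self.error = True
                self.state = "ERROR"
        elif self.state == "WAIT_BYTE":
            # Folded L_WAIT_BYTE + L_WRITE: the write is assumed to succeed.
            self.mem[self.wr_addr] = byte
            if self.bytes_left == 1:
                self.bytes_left = 0
                self.done = True
                self.state = "DONE"
            else:
                self.bytes_left -= 1
                self.wr_addr += 1
        # DONE and ERROR ignore further bytes until load_mode drops.


def load(stream: bytes) -> LoaderModel:
    """Feed a whole byte stream into a fresh model."""
    model = LoaderModel()
    for b in stream:
        model.rx(b)
    return model


m = load(bytes([0xA5, 0x03, 0x00, 0x11, 0x22, 0x33]))
print(m.state, m.done, bytes(m.mem[:3]).hex())
```

The model is useful for checking header encodings and count edge cases; clock-level behaviour such as bytes arriving while a write is still pending is deliberately out of scope.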

The testbench no longer reaches inside the SRAM macro model:

projects/16_rv32i_sram_loader/test/tb_loader.sv system-verilog · L65-138
    end
  endtask

  initial begin
    if ($test$plusargs("wave")) begin
      $dumpfile("tb_loader.vcd");
      $dumpvars(0, tb_loader);
    end

    for (int i = 0; i < MEM_BYTES; i++) image[i] = 8'h00;
    mem_hex = DEFAULT_MEM_HEX;
    test_name = DEFAULT_TEST_NAME;
    max_run_cycles = 200_000;
    void'($value$plusargs("memh=%s", mem_hex));
    void'($value$plusargs("test=%s", test_name));
    void'($value$plusargs("max_cycles=%d", max_run_cycles));
    $readmemh(mem_hex, image);

    rst_n = 1'b0;
    load_mode = 1'b1;
    uart_rx = 1'b1;
    repeat (8) @(posedge clk);
    @(negedge clk); rst_n = 1'b1;
    repeat (8) @(posedge clk);

    uart_byte(8'ha5);
    uart_byte(MEM_BYTES[7:0]);
    uart_byte({2'b00, MEM_BYTES[13:8]});

    for (int i = 0; i < MEM_BYTES; i++) begin
      uart_byte(image[i]);
    end

    begin
      int n;
      n = 0;
      while (!loader_done && !loader_error && n < 100_000) begin
        @(posedge clk);
        n = n + 1;
      end

      if (loader_error) begin
        $display("FAIL: P16 loader reported an error");
        errors = errors + 1;
      end else if (!loader_done) begin
        $display("FAIL: P16 loader timed out before done");
        errors = errors + 1;
      end else begin
        $display("PASS: P16 loaded %0d bytes into SRAM over UART", MEM_BYTES);
      end
    end

    repeat (8) @(posedge clk);
    @(negedge clk); load_mode = 1'b0;

    begin
      int n;
      n = 0;
      while (!halted && n < 200_000) begin
        @(posedge clk);
        n = n + 1;
      end

      if (!halted) begin
        $display("FAIL: P16 UART-loaded %s timed out at pc=0x%08h x5=0x%08h", test_name, pc, x5);
        errors = errors + 1;
      end else if (illegal) begin
        $display("FAIL: P16 UART-loaded %s hit unsupported instruction or memory access at pc=0x%08h x5=0x%08h", test_name, pc, x5);
        errors = errors + 1;
      end else if (x5 !== 32'd1) begin
        $display("FAIL: P16 UART-loaded %s halted with x5=0x%08h, expected PASS code 1", test_name, x5);
        errors = errors + 1;
      end else begin
        $display("PASS: P16 UART-loaded official rv32i/I/%s self-check accepted after %0d clocks", test_name, n);

ISA scope

Supported in this RTL step: LUI, AUIPC, JAL, JALR, BEQ, BNE, BLT, BGE, BLTU, BGEU, byte/halfword/word loads and stores, ADDI, SLTI, SLTIU, XORI, ORI, ANDI, SLLI, SRLI, SRAI, ADD, SUB, SLL, SLT, SLTU, XOR, SRL, SRA, OR, AND, and FENCE as a no-op.

Unsupported: traps, exceptions, interrupts, CSRs, ECALL, EBREAK, misalignment trap handling, FENCE.I, multiply/divide, atomics, compressed instructions, privilege modes, and any official tests beyond the two listed as actually run.

What this proves

This does not prove RV32I compliance. It proves the smallest official integer arch-test smokes can be loaded through a real top-level pin, stored in SRAM, and run to the framework PASS tail. It also proves that the current 8 KiB memory is the next obvious blocker for running the rest of the official integer suite in this self-check configuration.

The backend question is now answered: the UART loader still fits and routes in the P15 SRAM floorplan. What remains is a quality cleanup pass on slew and capacitance, not a functional boot-path question.