P16 keeps the P15 RV32I core and 8 KiB SRAM macro target, but stops relying
on a testbench-only SRAM preload. The program image now enters through
uart_rx, the loader writes SRAM byte by byte, and only then does the CPU
come out of reset.
Status: hardened, with electrical cleanup left. The official
rv32i/I/I-nop-00.S from upstream riscv-arch-test revision a7c9930 builds in RVTEST_SELFCHECK mode, loads into SRAM over the top-level UART pin, and passes under Icarus Verilog. LibreLane run RUN_2026-04-30_13-21-47 produced final GDS, routed with 0 detailed-route DRC errors, and passed Magic DRC, KLayout XOR, KLayout DRC, LVS, antenna, setup timing, and hold timing. The max-slew/max-cap checker is still PARTIAL with 4719 slew and 153 cap warnings. A wider official-source rv32i/I batch now builds all 39 tests: I-nop and I-fence pass, 0 fail, and 37 are NOT RUN because their generated self-check images exceed the 8 KiB SRAM.
The result
Run it:
```sh
make -C projects/16_rv32i_sram_loader/test
```
The expected final lines are:
```text
PASS: P16 loaded 8192 bytes into SRAM over UART
PASS: P16 UART-loaded official rv32i/I/I-nop-00.S self-check accepted after 1067 clocks
PASS: P16 UART SRAM-loader acceptance smoke complete.
```
| field | value |
|---|---|
| Official source | rv32i/I/I-nop-00.S |
| Upstream revision | a7c9930 |
| Build mode | RVTEST_SELFCHECK |
| Simulator | Icarus Verilog |
| Loader path | UART byte stream into 8 KiB SRAM |
| RTL result | PASS |
| Wider rv32i/I batch | PARTIAL: 2 PASS, 0 FAIL, 37 NOT RUN |
| Latest harden run | PASS with outstanding slew/cap warnings |
The backend run used the same four-macro SRAM shape as P15:
| check | result |
|---|---|
| Run directory | projects/16_rv32i_sram_loader/librelane/runs/RUN_2026-04-30_13-21-47 |
| Final GDS | projects/16_rv32i_sram_loader/librelane/runs/RUN_2026-04-30_13-21-47/final/gds/top.gds |
| Metrics | projects/16_rv32i_sram_loader/librelane/runs/RUN_2026-04-30_13-21-47/final/metrics.json |
| Route DRC | PASS (0) |
| Magic DRC | PASS (0) |
| KLayout XOR | PASS (0) |
| KLayout DRC | PASS (0) |
| LVS | PASS (0) |
| Antenna | PASS (0) |
| Setup / hold | PASS / PASS |
| Worst setup / hold slack | 13.475 ns / 0.0779 ns |
| Max slew / cap checker | PARTIAL: 4719 slew and 153 cap warnings |
The first harden command failed at the optional KLayout render step because
the SRAM streamout produced a multiple-top-cell render issue. Resuming from
Magic.WriteLEF with KLayout.Render skipped completed the signoff path
and wrote the final artifacts.
KLayout DRC needed one more detail: the first DRC retry reused Magic streamout
and failed before rule checking for the same multiple-top-cell reason. With
PRIMARY_GDSII_STREAMOUT_TOOL: klayout, the final GDS came from KLayout
streamout, and 81-klayout-drc reported 0 violations.
Arch-Test Batch
The repeatable batch command is:
```sh
make -C projects/16_rv32i_sram_loader/test batch
```
It builds every upstream rv32i/I/*.S file in RVTEST_SELFCHECK mode,
turns each ELF into a Verilog memory image, and runs any image that fits in
the 8 KiB SRAM through the same top-level UART loader path.
| result | count |
|---|---|
| PASS | 2 |
| FAIL | 0 |
| NOT RUN | 37 |
The passing tests are I-nop-00.S and I-fence-00.S. The NOT RUN cases
all built successfully, but their generated self-check/signature images are
larger than the current SRAM window. The closest near-misses are I-auipc
and I-lui at 8968 bytes, 776 bytes over the 8192-byte limit; the arithmetic
and branch tests are much larger because the official framework allocates
signature space.
The full table lives in projects/16_rv32i_sram_loader/compliance/RESULTS.md.
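The fit rule behind the NOT RUN column can be sketched in a few lines of Python. This is only an illustration of the size check described above, not the project's batch script: the function names are hypothetical, and the only sizes used are the 8968-byte figure quoted for I-auipc and I-lui plus a made-up exact-fit case.

```python
SRAM_BYTES = 8 * 1024  # 8 KiB SRAM window, as stated in the text


def classify(image_sizes: dict[str, int], limit: int = SRAM_BYTES) -> dict[str, str]:
    """Mark each built image as RUN or NOT RUN based on its byte size."""
    return {
        name: "RUN" if size <= limit else "NOT RUN"
        for name, size in image_sizes.items()
    }


def overflow(size: int, limit: int = SRAM_BYTES) -> int:
    """Bytes by which an image exceeds the SRAM window (0 if it fits)."""
    return max(0, size - limit)


if __name__ == "__main__":
    # 8968 is the size quoted for I-auipc and I-lui; the 8192-byte entry is a
    # hypothetical image that exactly fills the SRAM.
    sizes = {"I-auipc": 8968, "I-lui": 8968, "exact-fit (hypothetical)": 8192}
    for name, verdict in classify(sizes).items():
        print(f"{name}: {verdict}, over by {overflow(sizes[name])} bytes")
```

An image of exactly 8192 bytes still fits, which matches the loader accepting counts up to and including 8192.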
The top-level handoff
The loader and CPU share one SRAM bus. The loader owns the bus while
load_mode is high or no load has completed, and the CPU is held in reset
until the loader has written a full packet and load_mode has been dropped.
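As a quick sanity check on the handoff, the two assignments from the top-level RTL (loader_owns_bus and cpu_rst_n) can be enumerated as a truth table. This is a minimal Python sketch of that combinational logic only; the function name `handoff` is hypothetical and nothing here models timing.

```python
from itertools import product


def handoff(rst_n: bool, load_mode: bool, loader_done: bool) -> tuple[bool, bool]:
    """Return (loader_owns_bus, cpu_rst_n) for one combination of inputs."""
    loader_owns_bus = load_mode or not loader_done
    cpu_rst_n = rst_n and loader_done and not load_mode
    return loader_owns_bus, cpu_rst_n


if __name__ == "__main__":
    print("rst_n load_mode loader_done | loader_owns_bus cpu_rst_n")
    for rst_n, load_mode, loader_done in product([False, True], repeat=3):
        owns, cpu_rst_n = handoff(rst_n, load_mode, loader_done)
        print(f"{rst_n!s:5} {load_mode!s:9} {loader_done!s:11} | {owns!s:15} {cpu_rst_n!s}")
```

The useful property is that the CPU is never out of reset while the loader owns the bus: cpu_rst_n requires loader_done high and load_mode low, which is exactly the case where loader_owns_bus is false.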
```systemverilog
    // ... (module header and earlier ports not shown in this excerpt)
    output logic        loader_done,
    output logic        loader_error
);

    // CPU-side bus request.
    wire        cpu_mem_valid;
    wire        cpu_mem_we;
    wire [1:0]  cpu_mem_size;
    wire [31:0] cpu_mem_addr;
    wire [31:0] cpu_mem_wdata;
    wire [3:0]  cpu_mem_wstrb;

    // Loader-side bus request.
    wire        loader_mem_valid;
    wire        loader_mem_we;
    wire [1:0]  loader_mem_size;
    wire [31:0] loader_mem_addr;
    wire [31:0] loader_mem_wdata;
    wire [3:0]  loader_mem_wstrb;

    // Shared SRAM bus after the ownership mux.
    wire        mem_valid;
    wire        mem_we;
    wire [1:0]  mem_size;
    wire [31:0] mem_addr;
    wire [31:0] mem_wdata;
    wire [3:0]  mem_wstrb;
    wire [31:0] mem_rdata;
    wire        mem_ready;
    wire        mem_error;

    wire [7:0]  rx_data;
    wire        rx_valid;

    // The loader owns the bus while load_mode is high or no load has finished.
    wire loader_owns_bus = load_mode || !loader_done;
    // The CPU leaves reset only after a completed load with load_mode low.
    wire cpu_rst_n = rst_n && loader_done && !load_mode;

    p16_uart_rx u_uart_rx (
        .clk       (clk),
        .rst_n     (rst_n),
        .baud_div  (baud_div),
        .rx        (uart_rx),
        .data_out  (rx_data),
        .valid_out (rx_valid)
    );

    p16_sram_loader u_loader (
        .clk       (clk),
        .rst_n     (rst_n),
        .load_mode (load_mode),
        .rx_data   (rx_data),
        .rx_valid  (rx_valid),
        .mem_valid (loader_mem_valid),
        .mem_we    (loader_mem_we),
        .mem_size  (loader_mem_size),
        .mem_addr  (loader_mem_addr),
        .mem_wdata (loader_mem_wdata),
        .mem_wstrb (loader_mem_wstrb),
        .mem_ready (mem_ready),
        .mem_error (mem_error),
        .done      (loader_done),
        .error     (loader_error)
    );

    p16_rv32i_arch_core u_core (
        .clk       (clk),
        .rst_n     (cpu_rst_n),
        .mem_valid (cpu_mem_valid),
        .mem_we    (cpu_mem_we),
        .mem_size  (cpu_mem_size),
        .mem_addr  (cpu_mem_addr),
        .mem_wdata (cpu_mem_wdata),
        .mem_wstrb (cpu_mem_wstrb),
        .mem_rdata (mem_rdata),
        .mem_ready (mem_ready),
        .mem_error (mem_error),
        .pc_out    (pc_out),
        .x5_out    (x5_out),
        .halted    (halted),
        .illegal   (illegal)
    );

    // Select the bus master by ownership.
    assign mem_valid = loader_owns_bus ? loader_mem_valid : cpu_mem_valid;
    assign mem_we    = loader_owns_bus ? loader_mem_we    : cpu_mem_we;
    assign mem_size  = loader_owns_bus ? loader_mem_size  : cpu_mem_size;
    assign mem_addr  = loader_owns_bus ? loader_mem_addr  : cpu_mem_addr;
    assign mem_wdata = loader_owns_bus ? loader_mem_wdata : cpu_mem_wdata;
    assign mem_wstrb = loader_owns_bus ? loader_mem_wstrb : cpu_mem_wstrb;

    p16_sram8k_bus_memory u_mem (
        .clk   (clk),
        .rst_n (rst_n),
        .valid (mem_valid),
        .we    (mem_we),
        .size  (mem_size),
        .addr  (mem_addr),
        .wdata (mem_wdata),
        .wstrb (mem_wstrb),
        .rdata (mem_rdata),
        .ready (mem_ready),
        .error (mem_error)
    );

    wire _unused = &{1'b0, mem_rdata};
endmodule
```
The loader packet is deliberately small:
| byte(s) | meaning |
|---|---|
| 0xa5 | start of load |
| count low | low 8 bits of byte count |
| count high | high 6 bits of byte count in bits [5:0] |
| payload | count bytes, written from SRAM address zero |
Valid counts are 1..8192. For the current acceptance smoke, the testbench
sends all 8192 bytes of the ELF-derived memory image.
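The packet format above is simple enough to model on the host side. The following Python sketch frames a payload and then replays a byte stream through a reference model of the loader's acceptance rules (wait for 0xa5, take the high count from bits [5:0], accept counts 1..8192, write from address zero). The names `build_packet` and `load_packet` are hypothetical, and one behavior is an assumption of this model only: it raises an error on a truncated stream, whereas the RTL simply keeps waiting for more bytes.

```python
MAGIC = 0xA5
MEM_BYTES = 8192


def build_packet(payload: bytes) -> bytes:
    """Frame a payload as: magic, count low, count high (6 bits), payload."""
    count = len(payload)
    if not 1 <= count <= MEM_BYTES:
        raise ValueError(f"payload must be 1..{MEM_BYTES} bytes, got {count}")
    return bytes([MAGIC, count & 0xFF, (count >> 8) & 0x3F]) + payload


def load_packet(stream: bytes) -> bytearray:
    """Reference model: SRAM contents after the loader consumes the stream."""
    mem = bytearray(MEM_BYTES)
    it = iter(stream)
    # Discard bytes until the start-of-load byte arrives.
    for byte in it:
        if byte == MAGIC:
            break
    else:
        raise ValueError("no start-of-load byte found")
    try:
        count_lo = next(it)
        count_hi = next(it) & 0x3F  # only bits [5:0] of the high byte are used
    except StopIteration:
        # Model-only behavior: the hardware would keep waiting here.
        raise ValueError("stream ended inside the header") from None
    count = (count_hi << 8) | count_lo
    if not 1 <= count <= MEM_BYTES:
        raise ValueError(f"count {count} outside 1..{MEM_BYTES}")
    payload = bytes(it)[:count]  # bytes after the payload are ignored
    if len(payload) < count:
        raise ValueError("stream ended before the payload was complete")
    mem[:count] = payload  # payload is written starting at SRAM address zero
    return mem


if __name__ == "__main__":
    packet = build_packet(bytes(MEM_BYTES))
    print(packet[:3].hex(), len(packet))
```

For a full 8192-byte image the count is 0x2000, so the header bytes are a5, 00, 20 and the whole packet is 8195 bytes long.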
```systemverilog
module p16_sram_loader (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        load_mode,
    input  logic [7:0]  rx_data,
    input  logic        rx_valid,
    output logic        mem_valid,
    output logic        mem_we,
    output logic [1:0]  mem_size,
    output logic [31:0] mem_addr,
    output logic [31:0] mem_wdata,
    output logic [3:0]  mem_wstrb,
    input  logic        mem_ready,
    input  logic        mem_error,
    output logic        done,
    output logic        error
);

    localparam logic [13:0] MEM_BYTES = 14'd8192;

    typedef enum logic [3:0] {
        L_WAIT_MODE  = 4'd0,
        L_WAIT_MAGIC = 4'd1,
        L_COUNT_LO   = 4'd2,
        L_COUNT_HI   = 4'd3,
        L_WAIT_BYTE  = 4'd4,
        L_WRITE      = 4'd5,
        L_DONE       = 4'd6,
        L_ERROR      = 4'd7
    } loader_state_t;

    loader_state_t state;
    logic [13:0] byte_count;
    logic [13:0] bytes_left;
    logic [12:0] wr_addr;
    logic [7:0]  byte_q;
    logic [7:0]  count_lo_q;

    // 14-bit count: high 6 bits from the current byte, low 8 bits latched earlier.
    wire [13:0] count_next = {rx_data[5:0], count_lo_q};
    wire        count_ok   = (count_next != 14'd0) && (count_next <= MEM_BYTES);

    // Every loader access is a single-byte write at wr_addr.
    always_comb begin
        mem_valid = (state == L_WRITE);
        mem_we    = 1'b1;
        mem_size  = 2'd0;
        mem_addr  = {19'h0, wr_addr};
        mem_wdata = {24'h0, byte_q};
        mem_wstrb = 4'b0001;
    end

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state      <= L_WAIT_MODE;
            byte_count <= 14'd0;
            bytes_left <= 14'd0;
            wr_addr    <= 13'd0;
            byte_q     <= 8'h00;
            count_lo_q <= 8'h00;
            done       <= 1'b0;
            error      <= 1'b0;
        end else begin
            case (state)
                L_WAIT_MODE: begin
                    if (load_mode) begin
                        done       <= 1'b0;
                        error      <= 1'b0;
                        byte_count <= 14'd0;
                        bytes_left <= 14'd0;
                        wr_addr    <= 13'd0;
                        state      <= L_WAIT_MAGIC;
                    end
                end

                L_WAIT_MAGIC: begin
                    if (!load_mode) begin
                        state <= L_WAIT_MODE;
                    end else if (rx_valid && rx_data == 8'ha5) begin
                        state <= L_COUNT_LO;
                    end
                end

                L_COUNT_LO: begin
                    if (!load_mode) begin
                        state <= L_WAIT_MODE;
                    end else if (rx_valid) begin
                        count_lo_q <= rx_data;
                        state      <= L_COUNT_HI;
                    end
                end

                L_COUNT_HI: begin
                    if (!load_mode) begin
                        state <= L_WAIT_MODE;
                    end else if (rx_valid) begin
                        if (count_ok) begin
                            byte_count <= count_next;
                            bytes_left <= count_next;
                            wr_addr    <= 13'd0;
                            state      <= L_WAIT_BYTE;
                        end else begin
                            error <= 1'b1;
                            state <= L_ERROR;
                        end
                    end
                end

                L_WAIT_BYTE: begin
                    if (!load_mode) begin
                        state <= L_WAIT_MODE;
                    end else if (rx_valid) begin
                        byte_q <= rx_data;
                        state  <= L_WRITE;
                    end
                end

                L_WRITE: begin
                    if (mem_ready) begin
                        if (mem_error) begin
                            error <= 1'b1;
                            state <= L_ERROR;
                        end else if (bytes_left == 14'd1) begin
                            // Last byte written: signal completion.
                            bytes_left <= 14'd0;
                            done       <= 1'b1;
                            state      <= L_DONE;
                        end else begin
                            bytes_left <= bytes_left - 14'd1;
                            wr_addr    <= wr_addr + 13'd1;
                            state      <= L_WAIT_BYTE;
                        end
                    end
                end

                L_DONE: begin
                    if (load_mode) begin
                        state <= L_DONE;
                    end else begin
                        state <= L_WAIT_MODE;
                    end
                end

                L_ERROR: begin
                    if (!load_mode) state <= L_WAIT_MODE;
                end

                default: begin
                    error <= 1'b1;
                    state <= L_ERROR;
                end
            endcase
        end
    end

    wire _unused = &{1'b0, byte_count};
endmodule
```
The testbench no longer reaches inside the SRAM macro model:
```systemverilog
        // ... (start of the testbench not shown in this excerpt)
        end
    endtask

    initial begin
        if ($test$plusargs("wave")) begin
            $dumpfile("tb_loader.vcd");
            $dumpvars(0, tb_loader);
        end

        // Zero the image, then apply defaults and any plusarg overrides.
        for (int i = 0; i < MEM_BYTES; i++) image[i] = 8'h00;
        mem_hex        = DEFAULT_MEM_HEX;
        test_name      = DEFAULT_TEST_NAME;
        max_run_cycles = 200_000;
        void'($value$plusargs("memh=%s", mem_hex));
        void'($value$plusargs("test=%s", test_name));
        void'($value$plusargs("max_cycles=%d", max_run_cycles));
        $readmemh(mem_hex, image);

        // Reset with load_mode high and the UART line idle.
        rst_n     = 1'b0;
        load_mode = 1'b1;
        uart_rx   = 1'b1;
        repeat (8) @(posedge clk);
        @(negedge clk); rst_n = 1'b1;
        repeat (8) @(posedge clk);

        // Send the packet: magic, count low, count high, then the payload.
        uart_byte(8'ha5);
        uart_byte(MEM_BYTES[7:0]);
        uart_byte({2'b00, MEM_BYTES[13:8]});
        for (int i = 0; i < MEM_BYTES; i++) begin
            uart_byte(image[i]);
        end

        // Wait for the loader to report done or error.
        begin
            int n;
            n = 0;
            while (!loader_done && !loader_error && n < 100_000) begin
                @(posedge clk);
                n = n + 1;
            end
            if (loader_error) begin
                $display("FAIL: P16 loader reported an error");
                errors = errors + 1;
            end else if (!loader_done) begin
                $display("FAIL: P16 loader timed out before done");
                errors = errors + 1;
            end else begin
                $display("PASS: P16 loaded %0d bytes into SRAM over UART", MEM_BYTES);
            end
        end

        // Drop load_mode to release the CPU, then wait for it to halt.
        repeat (8) @(posedge clk);
        @(negedge clk); load_mode = 1'b0;
        begin
            int n;
            n = 0;
            while (!halted && n < 200_000) begin
                @(posedge clk);
                n = n + 1;
            end
            if (!halted) begin
                $display("FAIL: P16 UART-loaded %s timed out at pc=0x%08h x5=0x%08h", test_name, pc, x5);
                errors = errors + 1;
            end else if (illegal) begin
                $display("FAIL: P16 UART-loaded %s hit unsupported instruction or memory access at pc=0x%08h x5=0x%08h", test_name, pc, x5);
                errors = errors + 1;
            end else if (x5 !== 32'd1) begin
                $display("FAIL: P16 UART-loaded %s halted with x5=0x%08h, expected PASS code 1", test_name, x5);
                errors = errors + 1;
            end else begin
                $display("PASS: P16 UART-loaded official rv32i/I/%s self-check accepted after %0d clocks", test_name, n);
        // ... (remainder of the testbench not shown in this excerpt)
```
ISA scope
Supported in this RTL step: LUI, AUIPC, JAL, JALR, BEQ,
BNE, BLT, BGE, BLTU, BGEU, byte/halfword/word loads and
stores, ADDI, SLTI, SLTIU, XORI, ORI, ANDI, SLLI,
SRLI, SRAI, ADD, SUB, SLL, SLT, SLTU, XOR, SRL,
SRA, OR, AND, and FENCE as a no-op.
Unsupported: traps, exceptions, interrupts, CSRs, ECALL, EBREAK,
misalignment trap handling, FENCE.I, multiply/divide, atomics,
compressed instructions, privilege modes, and any official tests beyond
the two listed as actually run.
What this proves
This does not prove RV32I compliance. It proves the smallest official integer arch-test smokes can be loaded through a real top-level pin, stored in SRAM, and run to the framework PASS tail. It also proves that the current 8 KiB memory is the next obvious blocker for running the rest of the official integer suite in this self-check configuration.
The backend question is now answered: the UART loader still fits and routes in the P15 SRAM floorplan. What remains is a quality cleanup pass on slew and capacitance, not a functional boot-path question.