Linux boots on the chip

Today the chip booted Linux. Real Linux 6.12.85 LTS, RV32, an unmodified Image from arch/riscv/boot/, with kernel printk visible all the way through architecture init, memory zone setup, SLUB allocator coming up, ASID allocator wiring, and the clocksource switching to riscv_clocksource.

This took two projects: P58 to get all the surrounding pieces working (stage-0 firmware, DTB, kernel image, boot blob, Nix toolchain) and to learn that the chip’s load/store-only translation wasn’t enough; P59 to add Sv32 instruction-fetch translation, after which the kernel just… ran.

P58 — the dead end

Started with the assumption that the existing P58 RTL (which did Sv32 translation for loads and stores in S-mode) was already enough for Linux. Built a tiny stage-0 firmware that: sets up identity-mapped page tables, delegates everything but ECALL_FROM_S/M, sets mstatus.MPP=S, and mrets into the kernel at 0x00200000 with a1 = DTB_BASE.
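
For reference, here is a minimal sketch of the two CSR values that hand-off implies. The bit positions come from the RISC-V privileged spec (mstatus.MPP is bits 12:11, S-mode encodes as 01, ECALL from S/M are causes 9 and 11). The helper names and the 16-bit delegation width are my own assumptions for illustration, not the actual stage-0 source.

```python
# Hypothetical helpers; only the bit positions are taken from the privileged spec.
ECALL_FROM_S = 9           # exception cause number
ECALL_FROM_M = 11          # exception cause number
MSTATUS_MPP_SHIFT = 11     # mstatus.MPP occupies bits 12:11
MSTATUS_MPP_MASK = 0b11 << MSTATUS_MPP_SHIFT
PRIV_S = 0b01              # supervisor-mode encoding


def medeleg_all_but(*causes, width=16):
    """Delegate every cause in the low `width` bits except the listed ones."""
    mask = (1 << width) - 1
    for cause in causes:
        mask &= ~(1 << cause)
    return mask


def set_mpp(mstatus, priv):
    """Return mstatus with the MPP field replaced by `priv`."""
    return (mstatus & ~MSTATUS_MPP_MASK) | (priv << MSTATUS_MPP_SHIFT)


medeleg = medeleg_all_but(ECALL_FROM_S, ECALL_FROM_M)
mstatus = set_mpp(0x1800, PRIV_S)   # 0x1800 = MPP currently set to M
print(hex(medeleg))  # 0xf5ff
print(hex(mstatus))  # 0x800
```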

The kernel reset entry ran. It mret’d into S-mode. PC went to _start_kernel. Then, a few hundred instructions in, PC parked at 0x002000b8 and never moved.

Stared at this for a while. The address 0x002000b8 is the physical-mode equivalent of 0xc00000b8, which the kernel links as a wfi; j . loop intended to be stvec’s target during relocate_enable_mmu. Something about that function was failing.
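
The address arithmetic is easy to check by hand. A small sketch, assuming the kernel is linked at 0xc0000000 (inferred from the 0xc00000b8 address above) and loaded at 0x00200000; the function names are mine:

```python
# Assumed layout: link base inferred from the addresses in the text.
KERNEL_VIRT_BASE = 0xC0000000
KERNEL_PHYS_BASE = 0x00200000

va_pa_offset = KERNEL_VIRT_BASE - KERNEL_PHYS_BASE


def virt_to_phys(va):
    """Physical address the kernel bytes at `va` actually live at."""
    return va - va_pa_offset


def phys_to_virt(pa):
    """Virtual address the kernel expects for bytes loaded at `pa`."""
    return pa + va_pa_offset


print(hex(va_pa_offset))               # 0xbfe00000
print(hex(virt_to_phys(0xC00000B8)))   # 0x2000b8
```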

Pulled apart the disassembly. The kernel’s relocate_enable_mmu does this dance:

1. Compute va_pa_offset = kernel_map.virt_addr - &_start.
2. Set ra += va_pa_offset so that when we eventually ret, we land at the virtual return address.
3. Set stvec to the virtual address right after the csrw satp (so a translation fault redirects us to virtual).
4. csrw satp, trampoline_pg_dir.
5. (After the trampoline takes effect, do a few more setup things at virtual PC.)
6. csrw satp, real_kernel_pg_dir.
7. ret to the virtual return address.

The trick is in step 4: the trampoline maps the kernel’s virtual address to its physical address (one 4 MiB superpage), but does not identity-map the physical address itself. So when the chip is fetching at physical PC 0x002000xx and satp becomes the trampoline, the very next fetch faults — because the trampoline doesn’t have a mapping for VPN1=0. The fault is the intended mechanism: it forces stvec to take over PC, which is set to the virtual address. From there, the kernel runs at virtual PC and the trampoline translates back to the same physical bytes.
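
To make the "mapped virtually, not identity-mapped" point concrete, here is a toy model of a one-entry trampoline table. It is a sketch, not kernel code: the function names are hypothetical, and it uses a 4 MiB-aligned physical base of 0x00400000 because an Sv32 superpage has to be aligned to its own size.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def vpn1(addr):
    """Top-level Sv32 index: bits 31:22 of the address."""
    return (addr >> 22) & 0x3FF


def make_trampoline(kernel_va, kernel_pa):
    """Level-1 table with a single 4 MiB superpage leaf for the kernel."""
    table = [0] * 1024
    table[vpn1(kernel_va)] = ((kernel_pa >> 12) << 10) | PTE_V | PTE_R | PTE_W | PTE_X
    return table


def translate(table, va):
    """Return the physical address, or None if the lookup would fault."""
    pte = table[vpn1(va)]
    if not pte & PTE_V:
        return None
    ppn1 = (pte >> 20) & 0xFFF          # superpage: only PPN[1] is used
    return (ppn1 << 22) | (va & 0x3FFFFF)


tramp = make_trampoline(0xC0000000, 0x00400000)
print(vpn1(0xC00000B8), vpn1(0x002000B8))   # 768 0
print(translate(tramp, 0x002000B8))          # None: VPN1=0 has no entry
print(hex(translate(tramp, 0xC00000B8)))     # 0x4000b8
```

Any fetch at a low physical PC misses the table, which is exactly the fault the kernel is counting on.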

But our chip didn’t translate fetches. mem_addr = pc with no satp lookup. So:

csrw satp succeeds; bytes still readable at PC because PC is physical and the chip ignores satp for fetch.
A few cycles later, the kernel hits ret and PC becomes 0xc0000154 (virtual).
The chip fetches at PA = 0xc0000154, which is past our 16 MiB of RAM, so the bus returns nothing meaningful and the CPU illegal-instructions, traps to stvec = 0x002000b8, parks forever.

P59 — the fix

The chip already had an Sv32 page-table walker for load/store. It had two walk states, S_PTW1 to fetch the L1 PTE and S_PTW0 to fetch the L0 PTE, plus S_PTW{0,1}_AD for hardware A/D bit updates.

To do fetch translation, all I needed was to plumb three extra registers through that walker:

ptw_for_fetch_q: when set, the walk is on behalf of an instruction fetch.
fetch_pa_q: the post-walk physical address for the current PC.
fetch_xlated_q: marks fetch_pa_q as valid; cleared on every PC change (S_WB, mret, sret, trap entry).

Then, in S_FETCH: if translation_active && !fetch_xlated_q, route into S_PTW1 with ptw_va_q = pc and the fetch flag set. The walker checks X=1 instead of R/W for permission, raises MCAUSE_INSTR_PAGE_FAULT instead of load/store on errors, and updates only A (not D) in the PTE. On success, fetch_pa_q gets set and we return to S_FETCH, which now issues the fetch using the translated PA.
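
A behavioral sketch of that fetch walk, in Python rather than RTL. This is my own simplified model under stated assumptions: page-table memory is a dict of physical address to 32-bit PTE, the U bit and privilege checks are left out, and none of the names below come from top.sv.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PTE_A, PTE_D = 1 << 6, 1 << 7


class InstrPageFault(Exception):
    """Stands in for raising the instruction page fault cause."""


def walk_for_fetch(mem, satp_ppn, va):
    """Two-level Sv32 walk for a fetch; returns the physical address."""
    table = satp_ppn << 12
    for level in (1, 0):
        pte_addr = table + 4 * ((va >> (12 + 10 * level)) & 0x3FF)
        pte = mem.get(pte_addr, 0)
        # Invalid, or the reserved W-without-R encoding.
        if not pte & PTE_V or (pte & PTE_W and not pte & PTE_R):
            raise InstrPageFault(hex(va))
        if pte & (PTE_R | PTE_X):                 # leaf PTE
            if not pte & PTE_X:                   # fetch needs X, not R/W
                raise InstrPageFault(hex(va))
            ppn = pte >> 10
            if level == 1 and ppn & 0x3FF:        # misaligned superpage
                raise InstrPageFault(hex(va))
            mem[pte_addr] = pte | PTE_A           # set A only, never D
            offset_mask = (1 << (12 + 10 * level)) - 1
            return ((ppn << 12) & ~offset_mask) | (va & offset_mask)
        table = (pte >> 10) << 12                 # pointer to next level
    raise InstrPageFault(hex(va))                 # no leaf found


# One executable 4 MiB superpage: VA 0xc0000000 -> PA 0x00400000.
mem = {0x10000 + 4 * 0x300: (0x400 << 10) | PTE_V | PTE_R | PTE_X}
print(hex(walk_for_fetch(mem, 0x10, 0xC00000B8)))  # 0x4000b8
```

In the real design the result would land in fetch_pa_q and set fetch_xlated_q; the model just returns it.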

Total diff: ~50 lines, all in top.sv.

Before testing, hit a second issue: setup_vm() enforces BUG_ON((kernel_map.phys_addr % PMD_SIZE) != 0). PMD_SIZE on Sv32 is 4 MiB. Our kernel was at 0x00200000 (2 MiB-aligned). Fixed by moving the kernel image to 0x00400000 in the boot blob and rebuilding with CONFIG_PHYS_RAM_BASE=0x00400000.
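
The alignment check is simple enough to replay outside the kernel. A quick sketch, with helper names of my own choosing:

```python
PMD_SIZE = 4 << 20   # 4 MiB on Sv32


def is_pmd_aligned(phys_addr):
    """Mirror of the condition the BUG_ON guards against."""
    return phys_addr % PMD_SIZE == 0


def align_up_to_pmd(phys_addr):
    """Next PMD_SIZE-aligned address at or above phys_addr."""
    return (phys_addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1)


print(is_pmd_aligned(0x00200000))          # False
print(is_pmd_aligned(0x00400000))          # True
print(hex(align_up_to_pmd(0x00200000)))    # 0x400000
```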

What we got

The chip prints this:

Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #7 Sun May  3 17:40:25 CDT 2026
OF: fdt: Ignoring memory range 0x0 - 0x400000
Machine model: P58 Linux test platform
SBI specification v0.1 detected
earlycon: sbi0 at I/O port 0x0 (options '')
printk: legacy bootconsole [sbi0] enabled
...
SLUB: HWalign=64, Order=0-1, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
riscv-intc: 32 local interrupts mapped
clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x5c40939b5, max_idle_ns: 440795202646 ns
sched_clock: 64 bits at 25MHz, resolution 40ns, wraps every 4398046511100ns
Calibrating delay loop (skipped), value calculated using timer frequency.. 50.00 BogoMIPS (lpj=100000)
ASID allocator using 9 bits (512 entries)
Memory: 9676K/12288K available (1309K kernel code, 470K rwdata, 145K rodata, 118K init, 176K bss, 2392K reserved, 0K cma-reserved)
clocksource: Switched to clocksource riscv_clocksource

The numbers I want to remember

1309K kernel code. The whole RV32 Linux kernel in 1.3 MB of text. Not as small as it could be (no compressed insns), but small enough that it runs in 16 MiB total system RAM.
9676K/12288K available. The kernel reserves 2.4 MB (kernel_data + tables + initial heap) and leaves 9.7 MB free for userspace.
Sim runs at ~30 kHz effective wall-clock. The “fast and loose” iverilog testbench is currently the bottleneck; Verilator would be much faster.
The kernel takes about 2M sim clocks to print Linux version, then another ~22M to reach clocksource: Switched. Real silicon at 25 MHz would hit the same point in under a second.
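
The wall-clock arithmetic behind those last two numbers, using the rounded figures above (so treat the results as approximate):

```python
SILICON_HZ = 25_000_000     # from the sched_clock line in the boot log
SIM_HZ = 30_000             # approximate effective simulation rate

clocks_to_banner = 2_000_000
clocks_to_clocksource = clocks_to_banner + 22_000_000

silicon_seconds = clocks_to_clocksource / SILICON_HZ
sim_seconds = clocks_to_clocksource / SIM_HZ

print(silicon_seconds)      # 0.96
print(sim_seconds / 60)     # roughly 13 minutes of simulation
```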

Things left

Initramfs to actually reach userspace.
A bigger UART buffer in the testbench (currently 8192 bytes).
A timeline document for what the kernel would print past this point if we waited for the sim to finish (and what faults instead — there will absolutely be more chip-side bugs hiding in late init).
TLB. The current “1-entry TLB that lives one instruction” is fine for correctness but it means every single fetch walks the entire page table from scratch. Even a 4-entry fully-associative TLB would probably 5x sim throughput.
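
For the TLB item, a rough behavioral sketch of what a 4-entry fully-associative TLB buys. Everything here is an assumption for illustration: FIFO replacement, 4 KiB granularity only, and a stand-in walk function in place of the real page-table walker. It says nothing about the actual speedup on this chip.

```python
from collections import OrderedDict


class TinyTlb:
    """Fully-associative VPN -> PPN cache with FIFO replacement."""

    def __init__(self, entries=4):
        self.entries = entries
        self.map = OrderedDict()
        self.walks = 0              # how many times we fell back to a walk

    def translate(self, va, walk):
        vpn = va >> 12
        if vpn not in self.map:
            self.walks += 1
            if len(self.map) >= self.entries:
                self.map.popitem(last=False)    # evict the oldest entry
            self.map[vpn] = walk(va) >> 12
        return (self.map[vpn] << 12) | (va & 0xFFF)

    def flush(self):
        """What a satp write or sfence.vma would have to trigger."""
        self.map.clear()


def fake_walk(va):
    # Stand-in for the walker: kernel VA 0xc0000000 -> PA 0x00400000.
    return va - 0xBFC00000


tlb = TinyTlb()
for pc in range(0xC0000000, 0xC0000000 + 400, 4):   # 100 sequential fetches
    tlb.translate(pc, fake_walk)
print(tlb.walks)   # 1
```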

Why it matters

Six days ago this repo was a git init. Today the chip ran an unmodified Linux kernel binary. None of it was magic. The chip isn’t fast or efficient, but it implements enough of the RISC-V privileged architecture that mainline Linux can boot on it without modification.

That’s what the ladder is for.