Today the chip booted Linux. Real Linux 6.12.85 LTS, RV32, an
unmodified Image from arch/riscv/boot/, with kernel printk
visible all the way through architecture init, memory zone
setup, SLUB allocator coming up, ASID allocator wiring, and the
clocksource switching to riscv_clocksource.
This took two projects: P58 to get all the surrounding pieces working (stage-0 firmware, DTB, kernel image, boot blob, Nix toolchain) and to learn that the chip’s load/store-only translation wasn’t enough; P59 to add Sv32 instruction-fetch translation, after which the kernel just… ran.
P58 — the dead end
Started with the assumption that the existing P58 RTL (which
did Sv32 translation for loads and stores in S-mode) was
already enough for Linux. Built a tiny stage-0 firmware that:
sets up identity-mapped page tables, delegates everything but
ECALL_FROM_S/M, sets mstatus.MPP=S, and mrets into the
kernel at 0x00200000 with a1 = DTB_BASE.
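The two CSR values in that list can be worked out by hand. The sketch below is Python rather than firmware assembly, purely to show the bit arithmetic. The cause numbers and the MPP field position come from the RISC-V privileged spec; the 16-bit mask width and the function names are assumptions for illustration, not taken from the real stage-0 source.

```python
# Sketch of the CSR values stage-0 needs (illustrative only; names are
# hypothetical and not taken from the real firmware source).

ECALL_FROM_S = 9    # exception cause: environment call from S-mode
ECALL_FROM_M = 11   # exception cause: environment call from M-mode

MPP_SHIFT = 11                  # mstatus.MPP occupies bits 12:11
MPP_MASK = 0b11 << MPP_SHIFT
PRIV_S = 0b01                   # encoding of supervisor mode


def medeleg_all_but_ecalls(num_causes: int = 16) -> int:
    """Delegate every synchronous exception except ECALL from S and M."""
    mask = (1 << num_causes) - 1
    mask &= ~(1 << ECALL_FROM_S)
    mask &= ~(1 << ECALL_FROM_M)
    return mask


def mstatus_with_mpp_s(mstatus: int) -> int:
    """Return mstatus with MPP set to S, so that mret drops to S-mode."""
    return (mstatus & ~MPP_MASK) | (PRIV_S << MPP_SHIFT)


print(hex(medeleg_all_but_ecalls()))     # 0xf5ff
print(hex(mstatus_with_mpp_s(0x1800)))   # 0x800 (MPP was M, now S)
```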
The kernel reset entry ran. It mret’d into S-mode. PC went to
_start_kernel. Then a few hundred instructions in, PC parked
at 0x002000b8 and never moved.
Stared at this for a while. The address 0x002000b8 is the
physical-mode equivalent of 0xc00000b8, which the kernel
links as a wfi; j . loop intended to be stvec’s target
during relocate_enable_mmu. Something about that function
was failing.
Pulled apart the disassembly. The kernel’s
relocate_enable_mmu does this dance:
1. Compute va_pa_offset = kernel_map.virt_addr - &_start.
2. Set ra += va_pa_offset so that when we eventually ret, we land at the virtual return address.
3. Set stvec to the virtual address right after the csrw satp (so a translation fault redirects us to virtual).
4. csrw satp, trampoline_pg_dir.
5. (After the trampoline takes effect, do a few more setup things at virtual PC.)
6. csrw satp, real_kernel_pg_dir.
7. ret to the virtual return address.
The trick is in step 4: the trampoline maps the kernel’s
virtual address to its physical address (one 4 MiB superpage),
but does not identity-map the physical address itself. So
when the chip is fetching at physical PC 0x002000xx and satp
becomes the trampoline, the very next fetch faults — because
the trampoline doesn’t have a mapping for VPN1=0. The fault is
the intended mechanism: it forces stvec to take over PC,
which is set to the virtual address. From there, the kernel
runs at virtual PC and the trampoline translates back to the
same physical bytes.
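A toy model makes the fault-on-purpose trick easier to see. This is a sketch, not kernel code: it sidesteps PTE encoding entirely and only tracks which VPN[1] slots the trampoline populates. The link and load addresses are the ones from this post; the PC value just after the csrw is hypothetical.

```python
# Toy model of "fault into virtual addressing" (illustrative only).

KERNEL_VA = 0xC0000000                 # kernel_map.virt_addr
KERNEL_PA = 0x00200000                 # &_start during P58
VA_PA_OFFSET = KERNEL_VA - KERNEL_PA


def vpn1(addr: int) -> int:
    """Index into the Sv32 root table: address bits 31:22."""
    return (addr >> 22) & 0x3FF


# The trampoline populates exactly one root-table slot.
TRAMPOLINE_SLOTS = {vpn1(KERNEL_VA)}


def fetch_faults(pc: int) -> bool:
    return vpn1(pc) not in TRAMPOLINE_SLOTS


def next_pc_after_satp_write(pc: int, stvec: int) -> int:
    """PC the hart continues at once satp points at the trampoline."""
    return stvec if fetch_faults(pc) else pc


phys_pc = 0x00200080                   # hypothetical PC right after csrw satp
stvec = phys_pc + VA_PA_OFFSET         # same instruction, virtual alias

print(vpn1(phys_pc), vpn1(stvec))                      # 0 768
print(hex(next_pc_after_satp_write(phys_pc, stvec)))   # 0xc0000080
```

A chip that translates fetches takes the first branch and lands at the virtual alias; a chip that does not never sees the fault at all.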
But our chip didn't translate fetches. The fetch path was simply
mem_addr = pc, with no satp lookup. So:
- csrw satp succeeds; bytes still readable at PC because PC is physical and the chip ignores satp for fetch.
- A few cycles later, the kernel hits ret and PC becomes 0xc0000154 (virtual).
- The chip fetches at PA = 0xc0000154, which is past our 16 MiB of RAM, so the bus returns nothing meaningful and the CPU illegal-instructions, traps to stvec=0x002000b8, parks forever.
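The failure reduces to a range check. A minimal sketch, assuming RAM starts at physical address 0 (inferred from the load address and the "Ignoring memory range 0x0" line in the boot log, not confirmed by the RTL):

```python
# Pre-P59 fetch path: the PC goes straight onto the bus (illustrative model).

RAM_BASE = 0x00000000                  # assumption, see lead-in
RAM_SIZE = 16 * 1024 * 1024            # 16 MiB


def fetch_addr_pre_p59(pc: int, satp: int) -> int:
    """satp is ignored for fetch: mem_addr = pc."""
    return pc


def in_ram(pa: int) -> bool:
    return RAM_BASE <= pa < RAM_BASE + RAM_SIZE


print(in_ram(fetch_addr_pre_p59(0x002000B8, satp=0)))   # True: the park loop
print(in_ram(fetch_addr_pre_p59(0xC0000154, satp=0)))   # False: the ret target
```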
P59 — the fix
The chip already had an Sv32 page-table walker for load/store.
It had two states: S_PTW1 to fetch the L1 PTE and S_PTW0
to fetch the L0 PTE, plus S_PTW{0,1}_AD for hardware A/D bit
updates.
To do fetch translation, all I needed was to plumb a little extra state through that walker:
- ptw_for_fetch_q: when set, the walk is on behalf of an instruction fetch.
- fetch_pa_q: the post-walk physical address for the current PC.
- fetch_xlated_q: marks fetch_pa_q as valid; cleared on every PC change (S_WB, mret, sret, trap entry).
Then, in S_FETCH: if translation_active && !fetch_xlated_q,
route into S_PTW1 with ptw_va_q = pc and the fetch flag set.
The walker checks X=1 instead of R/W for permission,
raises MCAUSE_INSTR_PAGE_FAULT instead of load/store on
errors, and updates only A (not D) in the PTE. On success,
fetch_pa_q gets set and we return to S_FETCH, which now
issues the fetch using the translated PA.
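Here is a behavioral sketch of a two-level walk on behalf of a fetch, in Python rather than SystemVerilog. It follows the Sv32 PTE layout from the privileged spec and mirrors the rules above (check X, raise an instruction page fault, set A but never D). It is not a transcription of top.sv: the U-bit and privilege checks are left out, the misaligned-superpage check comes from the spec rather than from the RTL, and the root table address in the usage lines is hypothetical.

```python
# Behavioral model of an Sv32 walk for instruction fetch (illustrative only).

PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PTE_A, PTE_D = 1 << 6, 1 << 7


class InstrPageFault(Exception):
    cause = 12                          # instruction page fault


def walk_for_fetch(mem: dict, root_pa: int, va: int) -> int:
    """mem maps a physical address to the 32-bit PTE stored there."""
    table = root_pa
    for level in (1, 0):                # S_PTW1 first, then S_PTW0
        vpn = (va >> (12 + 10 * level)) & 0x3FF
        pte_addr = table + 4 * vpn
        pte = mem.get(pte_addr, 0)
        if not (pte & PTE_V) or ((pte & PTE_W) and not (pte & PTE_R)):
            raise InstrPageFault(hex(va))
        if pte & (PTE_R | PTE_X):       # leaf PTE
            if not (pte & PTE_X):       # fetch needs X, not R/W
                raise InstrPageFault(hex(va))
            ppn = pte >> 10
            if level == 1 and (ppn & 0x3FF):
                raise InstrPageFault(hex(va))   # misaligned superpage
            mem[pte_addr] = pte | PTE_A         # fetch sets A only
            page_mask = (1 << (12 + 10 * level)) - 1
            return ((ppn << 12) & ~page_mask) | (va & page_mask)
        table = (pte >> 10) << 12       # pointer PTE: descend one level
    raise InstrPageFault(hex(va))       # pointer PTE at level 0


ROOT = 0x1000                           # hypothetical root table address
# One 4 MiB superpage: VA 0xC0000000 -> PA 0x00400000, flags V|R|W|X.
mem = {ROOT + 4 * 768: ((0x00400000 >> 12) << 10) | 0xF}

print(hex(walk_for_fetch(mem, ROOT, 0xC0000154)))   # 0x400154
print(hex(mem[ROOT + 4 * 768]))                     # 0x10004f (A now set)
```

Fetching from the physical alias 0x00400154 with the same table raises InstrPageFault, because root slot 1 is empty.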
Total diff: ~50 lines, all in top.sv.
Before testing, hit a second issue: setup_vm() enforces
BUG_ON((kernel_map.phys_addr % PMD_SIZE) != 0). PMD_SIZE on
Sv32 is 4 MiB. Our kernel was at 0x00200000 (2 MiB-aligned).
Fixed by moving the kernel image to 0x00400000 in the boot
blob and rebuilding with CONFIG_PHYS_RAM_BASE=0x00400000.
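The alignment rule is easy to check by hand. A small sketch, with PMD_SIZE taken as the 4 MiB stated above and a hypothetical helper name:

```python
# Why 0x00200000 trips the BUG_ON and 0x00400000 does not (illustrative).

PMD_SIZE = 4 * 1024 * 1024             # 4 MiB on Sv32, per the text above


def is_pmd_aligned(phys_addr: int) -> bool:
    return phys_addr % PMD_SIZE == 0


for base in (0x00200000, 0x00400000):
    print(hex(base), is_pmd_aligned(base))
# 0x200000 False
# 0x400000 True
```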
What we got
The chip prints this:
Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #7 Sun May 3 17:40:25 CDT 2026
OF: fdt: Ignoring memory range 0x0 - 0x400000
Machine model: P58 Linux test platform
SBI specification v0.1 detected
earlycon: sbi0 at I/O port 0x0 (options '')
printk: legacy bootconsole [sbi0] enabled
...
SLUB: HWalign=64, Order=0-1, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
riscv-intc: 32 local interrupts mapped
clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x5c40939b5, max_idle_ns: 440795202646 ns
sched_clock: 64 bits at 25MHz, resolution 40ns, wraps every 4398046511100ns
Calibrating delay loop (skipped), value calculated using timer frequency.. 50.00 BogoMIPS (lpj=100000)
ASID allocator using 9 bits (512 entries)
Memory: 9676K/12288K available (1309K kernel code, 470K rwdata, 145K rodata, 118K init, 176K bss, 2392K reserved, 0K cma-reserved)
clocksource: Switched to clocksource riscv_clocksource
The numbers I want to remember
- 1309K kernel code. The whole RV32 Linux kernel in about 1.3 MiB of text. Not as small as it could be (no compressed insns), but small enough that it runs in 16 MiB total system RAM.
- 9676K/12288K available. The kernel reserves about 2.3 MiB (kernel_data + tables + initial heap) and leaves about 9.4 MiB free for userspace.
- Sim runs at ~30 kHz effective wall-clock. The “fast and loose” iverilog testbench is currently the bottleneck; Verilator would be much faster.
- The kernel takes about 2M sim clocks to print Linux version, then another ~22M to reach clocksource: Switched. Real silicon at 25 MHz would hit the same point in under a second.
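The arithmetic behind that last comparison, using only the approximate figures quoted above (both clock counts and the ~30 kHz sim rate are rough):

```python
# Rough wall-clock estimates from the approximate numbers in this post.

CLOCKS_TO_BANNER = 2_000_000            # reset to "Linux version"
CLOCKS_BANNER_TO_SWITCH = 22_000_000    # banner to "clocksource: Switched"
SILICON_HZ = 25_000_000                 # 25 MHz
SIM_HZ = 30_000                         # ~30 kHz effective sim rate

total = CLOCKS_TO_BANNER + CLOCKS_BANNER_TO_SWITCH

print(f"{total / SILICON_HZ:.2f} s on silicon")      # 0.96 s on silicon
print(f"{total / SIM_HZ / 60:.1f} min in the sim")   # 13.3 min in the sim
```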
Things left
- Initramfs to actually reach userspace.
- A bigger UART buffer in the testbench (currently 8192 bytes).
- A timeline document for what the kernel would print past this point if we waited for the sim to finish (and what faults instead — there will absolutely be more chip-side bugs hiding in late init).
- TLB. The current “1-entry TLB that lives one instruction” is fine for correctness but it means every single fetch walks the entire page table from scratch. Even a 4-entry fully-associative TLB would probably 5x sim throughput.
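To put the TLB item in concrete terms, here is a behavioral sketch of a small fully-associative TLB that counts how many walks it avoids. Everything in it is an assumption for illustration: FIFO replacement, 4 KiB granularity even for superpages, and a stand-in walk function that just applies the kernel's VA-to-PA offset. It says nothing about the actual speedup, and a real one would also need flushing on satp writes and sfence.vma.

```python
# Toy fully-associative TLB with FIFO replacement (illustrative only).
from collections import OrderedDict

PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TinyTlb:
    def __init__(self, entries: int = 4):
        self.entries = entries
        self.map = OrderedDict()        # VPN -> PPN, oldest entry first
        self.walks = 0

    def lookup(self, va: int, walk) -> int:
        vpn = va >> PAGE_SHIFT
        if vpn not in self.map:
            self.walks += 1             # miss: pay for a page-table walk
            if len(self.map) >= self.entries:
                self.map.popitem(last=False)    # evict the oldest entry
            self.map[vpn] = walk(va) >> PAGE_SHIFT
        return (self.map[vpn] << PAGE_SHIFT) | (va & PAGE_MASK)


def hypothetical_walk(va: int) -> int:
    """Stand-in for the real walker: kernel VA -> PA at the 4 MiB load base."""
    return va - (0xC0000000 - 0x00400000)


tlb = TinyTlb(entries=4)
for i in range(1000):                   # 1000 sequential fetches in one page
    tlb.lookup(0xC0000000 + 4 * i, hypothetical_walk)

print(tlb.walks)                        # 1, versus 1000 walks with no TLB
```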
Why it matters
Six days ago this repo was a git init. Today the chip ran an unmodified Linux kernel binary. None of it was magic. The chip isn’t fast or efficient, but it implements enough of the RISC-V privileged architecture that mainline Linux can boot on it without modification.
That’s what the ladder is for.