P62 was scoped as a small win: ship the three Zbb rotate ops to chase BLAKE2s. The Zbb-rebuilt kernel exposed three more chip bugs in sequence — none of which would have triggered on P61’s silent-CRNG-and-die boot. Fixing them got us all the way to the headline result: a userspace process prints text to UART.
Headline: the chip executed M-mode firmware → S-mode kernel → U-mode init →
SYS_write syscall → SBI putchar → UART. Every privilege transition exercised, every translation bit honored.
What we got on UART
P58 stage-0 firmware
PT_BASE = 0x00010000
DTB_BASE = 0x00100000
KERNEL_BASE = 0x00400000
page tables built
satp = 0x80000010
mret to kernel...
Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #10 ...
... (full kernel boot, see /projects/59_ifetch_translation/) ...
Run /init as init process
with arguments:
/init
earlyprintk
with environment:
HOME=/
TERM=linux
===========================================
hello from userspace on a homemade chip!
===========================================
this binary:
- cross-compiled for rv32ima/ilp32
- statically linked, no libc
- reaches the linux SYS_write syscall
- is running as PID 1 from initramfs
the chip below us:
- 16 MiB DRAM, 25 MHz
- rv32ima_zba, sv32 paging, m+s priv
- SBI v0.1 firmware in stage-0
exiting cleanly...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
The “Attempted to kill init” panic is the expected end of the run:
the kernel does not allow PID 1 to exit, so when init calls exit(0)
it dutifully panics. exitcode=0x00000000 shows the user
binary returned cleanly.
What was supposed to be small
P62 originally landed three RISC-V Zbb ops — rol, ror,
rori — to make BLAKE2s 3× faster (hardware rotate vs the
3-instruction srli; slli; or sequence gcc emits without
Zbb). About 25 lines of decode + ALU.
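To make the rotate cost concrete, here is a minimal C sketch of a 32-bit rotate written as shifts and an OR, plus the BLAKE2s G mixing function that leans on it (rotation amounts 16, 12, 8, 7). The function names are illustrative and not taken from the kernel source; without Zbb each rotate typically lowers to a shift, a shift, and an OR, while with Zbb a compiler can map it to a single ror/rori.

```c
#include <stdint.h>

/* Rotate right by n bits. The masking keeps the shift counts in 0..31 so
   n == 0 is well defined. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* Rotate left, expressed through the same shift/or pattern. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* BLAKE2s G function on four words of the working state v[], mixing in two
   message words x and y. Four rotates per call. */
static void blake2s_g(uint32_t v[16], int a, int b, int c, int d,
                      uint32_t x, uint32_t y) {
    v[a] = v[a] + v[b] + x;
    v[d] = rotr32(v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];
    v[b] = rotr32(v[b] ^ v[c], 12);
    v[a] = v[a] + v[b] + y;
    v[d] = rotr32(v[d] ^ v[a], 8);
    v[c] = v[c] + v[d];
    v[b] = rotr32(v[b] ^ v[c], 7);
}
```

Each G call contains four rotates, and BLAKE2s runs eight G calls per round for ten rounds, so the rotate lowering shows up directly in the compression function's instruction count.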
What it actually took
The Zbb-rebuilt kernel (a single KCFLAGS=-march=rv32ima_zba_zbb_zicsr_zifencei change)
brought out a long tail of issues on the first run. We bisected:
- Full Zbb, not just rotates. gcc emits a wider Zbb subset than just the rotates: 152 zext.h, 46 rev8, 25 sext.b, 20 sext.h, 8 clz, 3 cpop in the rebuilt kernel. Each of those was an illegal-instruction trap on the chip; the trap handler’s own printk path used Zbb ops too, causing recursive trap-within-trap loops. Rounded out the chip with ~100 lines covering ZEXT.H, SEXT.B/H, CLZ/CTZ/CPOP, REV8, ORC.B.
- U-mode Sv32 translation. Latent bug since P52: the chip’s translation_active only fired in S-mode. The moment Linux srets into U-mode the chip stopped translating; PC=0x10000 (a virtual user-mode address) became a fetch from physical 0x10000 (in our MMIO range) → bus error → Linux reports Failed to execute /init (error -14) and falls through every init candidate before kernel-panic. Fixed by making translation fire for cur_priv != PRIV_M and generalising the U-bit permission check via priv_u_violates(u) rather than the old “fault if PTE.U=1, regardless of priv”.
- mstatus.SUM. Even with U-mode translation working, /init still failed with -EFAULT. Linux’s copy_to_user path runs in S-mode but writes to U-pages it just allocated for the user binary; that requires mstatus.SUM=1. Our chip never honored SUM. Added the bit at MSTATUS_SUM_BIT=18, included in SSTATUS_MASK so the kernel can toggle it via sstatus too, and split the permission check into priv_u_violates_fetch (no SUM: S-mode never executes U-pages even with SUM=1) and priv_u_violates_lsu (honors SUM).
- /dev/console in the initramfs. With the chip now actually executing the user binary, the binary still printed nothing visible. Warning: unable to open an initial console. revealed that init_eaccess(/dev/console) failed, so PID 1 inherited no fd 0/1/2 at all; every write(1, ...) returned -EBADF and the binary’s prints went to a non-existent fd. Fixed by switching the userspace cpio build from plain cpio -H newc to the kernel’s own gen_init_cpio, which can declare a /dev/console char-device node (5,1) without needing root.
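The translation and SUM fixes can be sketched as a small C reference model. This is not the RTL: the names translation_active, priv_u_violates_fetch, priv_u_violates_lsu and MSTATUS_SUM_BIT come from the text above, but the signatures, the enum, and the PTE_U constant are assumptions for illustration, and MPRV and MXR are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Privilege encodings as used in mstatus.MPP (assumed enum, not from the RTL). */
enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

#define MSTATUS_SUM_BIT 18u        /* SUM bit position in mstatus/sstatus */
#define PTE_U           (1u << 4)  /* U bit of an Sv32 page-table entry */

/* Translation applies whenever we are below M-mode and satp.MODE (bit 31)
   selects Sv32. The pre-fix behaviour amounted to cur_priv == PRIV_S. */
static bool translation_active(enum priv cur_priv, uint32_t satp) {
    return cur_priv != PRIV_M && (satp >> 31) != 0;
}

/* Instruction fetch: U-mode needs U=1; S-mode may never execute a U=1 page,
   whatever SUM says. */
static bool priv_u_violates_fetch(enum priv cur_priv, uint32_t pte) {
    bool u = (pte & PTE_U) != 0;
    return (cur_priv == PRIV_U) ? !u : u;
}

/* Load/store: U-mode needs U=1; S-mode may touch a U=1 page only if SUM=1. */
static bool priv_u_violates_lsu(enum priv cur_priv, uint32_t pte,
                                uint32_t mstatus) {
    bool u   = (pte & PTE_U) != 0;
    bool sum = ((mstatus >> MSTATUS_SUM_BIT) & 1u) != 0;
    if (cur_priv == PRIV_U)
        return !u;
    return u && !sum;
}
```

With this model, an S-mode store to a U=1 page with SUM clear is a violation (the -EFAULT case in copy_to_user), and the same store with SUM set is allowed, while an S-mode fetch from that page is rejected either way.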
Each fix was small (10–100 lines). They had to land together.
Boot milestone progression
The benchmark harness P61 introduced now has a real column-to-column comparison:
| milestone | P61 (broken U-mode) | P62 (full chip) | Δ |
|---|---|---|---|
| Linux version | 1,616,273 | 1,611,173 | −5,100 |
| clocksource switched | 18,685,472 | 18,426,917 | −258,555 |
| Run /init | 141,182,181 | 130,054,364 | −11,127,817 (−7.9%) |
| userspace hello | — never reached — | 131,241,874 | first |
| kernel panic (Attempted to kill init) | unreachable | 131,610,501 | first |
Run /init lands 11.1M cycles earlier with full Zbb (the
BLAKE2s/CRNG init phase is the headline win: Zbb rotates cut
the BLAKE2s instruction count to roughly a third). The userspace
binary itself runs from Run /init to its exit() panic in
1,556,137 cycles — including ELF load, exec, all the
writes, formatting, and clean teardown.
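The deltas quoted here are plain subtraction over the milestone cycle counts. A small C sketch of the arithmetic follows; the helper names are hypothetical, and the 25 MHz clock is taken from the UART banner, so the wall-time conversion is only as good as that nominal figure.

```c
#include <stdint.h>

/* Nominal clock from the UART banner ("16 MiB DRAM, 25 MHz"). */
#define CLOCK_HZ 25000000ULL

/* Cycles saved between two runs of the same milestone (before >= after). */
static uint64_t cycles_saved(uint64_t before, uint64_t after) {
    return before - after;
}

/* The same saving as a percentage of the earlier run. */
static double percent_saved(uint64_t before, uint64_t after) {
    return 100.0 * (double)(before - after) / (double)before;
}

/* Convert a cycle count to milliseconds at the nominal clock. */
static double cycles_to_ms(uint64_t cycles) {
    return 1000.0 * (double)cycles / (double)CLOCK_HZ;
}
```

For example, cycles_saved(141182181, 130054364) gives the 11,127,817-cycle improvement at Run /init, and cycles_saved(131610501, 130054364) gives the 1,556,137 cycles between Run /init and the exit panic, which is roughly 62 ms at 25 MHz.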
Files
- src/top.sv: chip RTL with full Zbb, U-mode translation, SUM, plus the unchanged P61 TLB.
- userspace/hello.c: freestanding RV32 program, no libc.
- userspace/Makefile: builds hello, packs initramfs cpio with /dev/console via gen_init_cpio.
- test/tb_freertos_demo.sv: testbench with milestone tracker, benchmark.json emitter, graceful $finish after panic so dump_profile runs.
- test/Makefile: profile, profile-decode, charts targets.
Where the chip is spending its cycles now
The benchmark from the full boot run (post-load through
userspace exit → kernel panic + 1M settle cycles):
| metric | value |
|---|---|
| post-load cycles | 132,610,502 |
| retired instructions | 21,242,373 |
| CPI | 6.24 |
| TLB fetch hit rate | 97.46% |
| TLB LSU hit rate | 94.16% |
| memory bus stall cycles | 0 |
State-cycle breakdown — S_FETCH is 32% of all cycles
(every instruction takes two cycles in fetch: issue, then
data-back), S_DECODE / S_EXECUTE / S_MEM / S_WB are
about 16% each (one cycle per retired instruction), and the
walker collectively eats ~1.8% — the TLB is doing exactly
what it should.
| state | share | cycles |
|---|---|---|
| fetch | 32% | 42,437,741 |
| decode | 16% | 21,249,591 |
| execute | 16% | 21,249,591 |
| mem | 15.9% | 21,106,222 |
| walker | 1.8% | 2,377,514 |
| writeback | 16% | 21,242,373 |
| mul/div | 2.2% | 2,947,470 |
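As a cross-check on the per-state counts above, the following C sketch sums them and recomputes CPI and shares. The struct and helper names are made up for this example; only the numbers come from the benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the state-cycle breakdown. */
struct state_count {
    const char *name;
    uint64_t    cycles;
};

static const struct state_count STATES[] = {
    {"fetch",     42437741ULL},
    {"decode",    21249591ULL},
    {"execute",   21249591ULL},
    {"mem",       21106222ULL},
    {"walker",     2377514ULL},
    {"writeback", 21242373ULL},
    {"mul/div",    2947470ULL},
};
#define NUM_STATES (sizeof STATES / sizeof STATES[0])

/* Sum of all per-state cycle counts. */
static uint64_t total_state_cycles(void) {
    uint64_t total = 0;
    for (size_t i = 0; i < NUM_STATES; i++)
        total += STATES[i].cycles;
    return total;
}

/* Share of the total spent in one state, in percent. */
static double share_percent(uint64_t cycles, uint64_t total) {
    return 100.0 * (double)cycles / (double)total;
}

/* Cycles per retired instruction. */
static double cpi(uint64_t cycles, uint64_t retired) {
    return (double)cycles / (double)retired;
}
```

The seven counts add up to exactly 132,610,502, the post-load cycle total, so every cycle is attributed to one state; dividing by the 21,242,373 retired instructions reproduces the CPI of about 6.24.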
The walker breakdown shows where the misses are. Fetch TLB ~97% hit rate; LSU ~94% hit rate. The remaining misses land on the walker and resolve mostly at L0 (4 KiB pages) not L1 (megapages) — kernel pages are 4 KiB by the time init runs.
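For readers less familiar with Sv32, this C sketch shows how a walker splits satp and a virtual address. The field layout follows the RISC-V privileged specification; the helper names are hypothetical and this is not the chip's walker.

```c
#include <stdint.h>

/* satp on RV32: bit 31 = MODE (1 = Sv32), bits 21:0 = root page-table PPN. */
static unsigned satp_mode(uint32_t satp) {
    return satp >> 31;
}

/* Physical address of the root page table (PPN << 12). Sv32 physical
   addresses are 34 bits wide, hence the 64-bit return type. */
static uint64_t satp_root_pa(uint32_t satp) {
    return (uint64_t)(satp & 0x003FFFFFu) << 12;
}

/* Virtual address fields: VPN[1] = va[31:22], VPN[0] = va[21:12],
   page offset = va[11:0]. */
static uint32_t va_vpn1(uint32_t va)   { return va >> 22; }
static uint32_t va_vpn0(uint32_t va)   { return (va >> 12) & 0x3FFu; }
static uint32_t va_offset(uint32_t va) { return va & 0xFFFu; }

/* Address of the level-1 PTE the walker reads first (4-byte entries).
   A leaf here maps a 4 MiB megapage; otherwise the walk continues to
   level 0, where a leaf maps a 4 KiB page. */
static uint64_t l1_pte_addr(uint32_t satp, uint32_t va) {
    return satp_root_pa(satp) + (uint64_t)va_vpn1(va) * 4u;
}
```

Applied to the boot log, satp = 0x80000010 decodes to MODE = 1 and a root table at 0x00010000, which matches PT_BASE. A miss that resolves at L0 costs the walker two PTE reads, while one that resolves at L1 costs a single read.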
PC samples classified against the kernel symbol table show where the instruction stream actually spent its time. With the Zbb rotates in place, the BLAKE2s share is meaningfully smaller than P61’s profile (where it dominated).
[Chart: PC samples by kernel symbol, largest first; symbol labels not recoverable]
- 16.5% (21,374 samples)
- 7.8% (10,133 samples)
- 7.6% (9,835 samples)
- 5.6% (7,280 samples)
- 3.6% (4,696 samples)
- 2.6% (3,379 samples)
- 2.5% (3,267 samples)
- 2.1% (2,713 samples)
- 1.8% (2,331 samples)
- 1.6% (2,081 samples)
- 1.3% (1,706 samples)
- 1.3% (1,649 samples)
- 1.2% (1,536 samples)
- 1% (1,236 samples)
- 0.9% (1,147 samples)
- 42.6% (55,140 samples)
Harden
NOT RUN. CLZ/CTZ/CPOP are linear-time SystemVerilog functions
— functionally correct but probably fmax-unfriendly. The post-
pipelining synthesis pass will measure the real cost.
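The fmax concern can be illustrated in C with two equivalent count-leading-zeros models: a bit-by-bit scan, which resembles a linear-time function, and a five-step binary search, which corresponds to a shallower logic tree. Both are sketches with made-up names; how a synthesis tool actually maps the SystemVerilog is not something this example can predict.

```c
#include <stdint.h>

/* Linear scan from the MSB down: up to 32 dependent steps.
   Returns 32 for x == 0, matching Zbb clz on RV32. */
static unsigned clz32_linear(uint32_t x) {
    unsigned n = 0;
    for (int i = 31; i >= 0; i--) {
        if ((x >> i) & 1u)
            break;
        n++;
    }
    return n;
}

/* Binary search: five halving steps (16, 8, 4, 2, 1) regardless of input. */
static unsigned clz32_tree(uint32_t x) {
    if (x == 0)
        return 32;
    unsigned n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

The two functions return the same value for every input, so the tree form is a candidate drop-in if the linear version turns out to sit on the critical path.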
What just happened?
A userspace process running on a chip we wrote in seven days
printed text to UART. The bytes traveled M-mode firmware →
S-mode kernel → U-mode init → ecall → S-mode SBI handler →
M-mode SBI putchar → physical UART. Every privilege
transition the RV32-Linux model defines — and the U-bit, A-bit,
D-bit, SUM, satp swaps, sfence.vma — was exercised at least
once on this chip by a real, unmodified Linux binary.