No. 62 / project of 147 on the ladder

First userspace process prints — Zbb completion + U-mode + SUM

introduces — full Zbb extension, U-mode Sv32 translation, mstatus.SUM, /dev/console in initramfs — first userspace text rendered to UART

harden state
  last run: 2026-05-04
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P62 was scoped as a small win: ship the three Zbb rotate ops to chase BLAKE2s. The Zbb-rebuilt kernel exposed three more chip bugs in sequence — none of which would have triggered on P61’s silent-CRNG-and-die boot. Fixing them got us all the way to the headline result: a userspace process prints text to UART.

Headline: the chip executed M-mode firmware → S-mode kernel → U-mode init → SYS_write syscall → SBI putchar → UART. Every privilege transition exercised, every translation bit honored.

What we got on UART

P58 stage-0 firmware
  PT_BASE     = 0x00010000
  DTB_BASE    = 0x00100000
  KERNEL_BASE = 0x00400000
  page tables built
  satp        = 0x80000010
  mret to kernel...

Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #10 ...

  ... (full kernel boot, see /projects/59_ifetch_translation/) ...

Run /init as init process
  with arguments:
    /init
    earlyprintk
  with environment:
    HOME=/
    TERM=linux

===========================================
  hello from userspace on a homemade chip!
===========================================

  this binary:
    - cross-compiled for rv32ima/ilp32
    - statically linked, no libc
    - reaches the linux SYS_write syscall
    - is running as PID 1 from initramfs

  the chip below us:
    - 16 MiB DRAM, 25 MHz
    - rv32ima_zba, sv32 paging, m+s priv
    - SBI v0.1 firmware in stage-0

  exiting cleanly...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000

The “Attempted to kill init” panic is the expected end of the run: the kernel does not allow PID 1 to exit, so when init calls exit(0) it dutifully panics. exitcode=0x00000000 shows the user binary exited with status 0, i.e. cleanly.

What was supposed to be small

P62 originally landed three RISC-V Zbb ops — rol, ror, rori — to make BLAKE2s 3× faster (hardware rotate vs the 3-instruction srli; slli; or sequence gcc emits without Zbb). About 25 lines of decode + ALU.
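To make the instruction-count argument concrete, here is a minimal C sketch of what a 32-bit rotate costs without Zbb. The function names are illustrative and not taken from the kernel or the RTL. The shift/shift/or body corresponds to the srli; slli; or sequence mentioned above, which a Zbb-aware compiler can emit as a single ror or rori. blake2s_g follows the BLAKE2s mixing function as specified in RFC 7693, with its four right rotations by 16, 12, 8 and 7.

```c
#include <stdint.h>

/* Rotate right using the shift/shift/or idiom.
 * Masking keeps both shift amounts in 0..31, so n == 0 is well defined. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* Rotate left expressed as a rotate right by the complementary amount. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return rotr32(x, (32u - (n & 31u)) & 31u);
}

/* BLAKE2s G mixing step on four state words (indices a, b, c, d)
 * with two message words x and y. Four rotates per call. */
static inline void blake2s_g(uint32_t v[16], int a, int b, int c, int d,
                             uint32_t x, uint32_t y)
{
    v[a] = v[a] + v[b] + x;  v[d] = rotr32(v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];      v[b] = rotr32(v[b] ^ v[c], 12);
    v[a] = v[a] + v[b] + y;  v[d] = rotr32(v[d] ^ v[a], 8);
    v[c] = v[c] + v[d];      v[b] = rotr32(v[b] ^ v[c], 7);
}
```

Each compression round calls G eight times, so the rotate sits squarely on the hot path; how much of the total that saves depends on the surrounding adds and xors.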

What it actually took

The Zbb-rebuilt kernel (a single KCFLAGS=-march=rv32ima_zba_zbb_zicsr_zifencei setting) brought out a long tail of issues on the first run. We bisected them:

  1. Full Zbb, not just rotates. gcc emits a wider Zbb subset than just the rotates: 152 zext.h, 46 rev8, 25 sext.b, 20 sext.h, 8 clz, 3 cpop in the rebuilt kernel. Each of those was an illegal-instruction trap on the chip; the trap handler’s own printk path used Zbb ops too, causing recursive trap-within-trap loops. Rounded out the chip with ~100 lines covering ZEXT.H, SEXT.B/H, CLZ/CTZ/CPOP, REV8, ORC.B.

  2. U-mode Sv32 translation. Latent bug since P52: the chip’s translation_active only fired in S-mode. The moment Linux srets into U-mode the chip stopped translating; PC=0x10000 (a virtual user-mode address) became a fetch from physical 0x10000 (in our MMIO range) → bus error → Linux reports Failed to execute /init (error -14) and falls through every init candidate before kernel-panic. Fixed by making translation fire for cur_priv != PRIV_M and generalising the U-bit permission check via priv_u_violates(u) rather than the old “fault if PTE.U=1, regardless of priv”.

  3. mstatus.SUM. Even with U-mode translation working, /init still failed with -EFAULT. Linux’s copy_to_user path runs in S-mode but writes to U-pages it just allocated for the user binary; that requires mstatus.SUM=1. Our chip never honored SUM. Added the bit at MSTATUS_SUM_BIT=18, included in SSTATUS_MASK so the kernel can toggle it via sstatus too, and split the permission check into priv_u_violates_fetch (no SUM — S-mode never executes U-pages even with SUM=1) and priv_u_violates_lsu (honors SUM).

  4. /dev/console in the initramfs. With the chip now actually executing the user binary, the binary still printed nothing visible. The kernel message “Warning: unable to open an initial console.” revealed that init_eaccess(/dev/console) failed, so PID 1 inherited no fd 0/1/2 at all: every write(1, ...) returned -EBADF and the binary’s prints went to a non-existent fd. Fixed by switching the userspace cpio build from plain cpio -H newc to the kernel’s own gen_init_cpio, which can declare a /dev/console char-device node (major 5, minor 1) without needing root.
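
For item 1, a software reference model of the added Zbb ops is handy when checking the decode and ALU changes against known-good values. The sketch below is such a model in C, following the instruction semantics in the RISC-V bit-manipulation spec for RV32. It is not the chip's SystemVerilog, and the zbb_* names are made up for illustration.

```c
#include <stdint.h>

/* zext.h: keep the low halfword, clear the rest. */
static inline uint32_t zbb_zext_h(uint32_t x) { return x & 0xFFFFu; }

/* sext.b / sext.h: sign-extend the low 8 / 16 bits (xor-subtract trick,
 * done in unsigned arithmetic so it is fully defined). */
static inline uint32_t zbb_sext_b(uint32_t x) { return ((x & 0xFFu) ^ 0x80u) - 0x80u; }
static inline uint32_t zbb_sext_h(uint32_t x) { return ((x & 0xFFFFu) ^ 0x8000u) - 0x8000u; }

/* clz: leading zeros, 32 for x == 0. */
static inline uint32_t zbb_clz(uint32_t x)
{
    uint32_t n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}

/* ctz: trailing zeros, 32 for x == 0. */
static inline uint32_t zbb_ctz(uint32_t x)
{
    uint32_t n = 0;
    for (uint32_t bit = 1u; bit != 0 && (x & bit) == 0; bit <<= 1)
        n++;
    return n;
}

/* cpop: number of set bits. */
static inline uint32_t zbb_cpop(uint32_t x)
{
    uint32_t n = 0;
    for (; x != 0; x &= x - 1u)   /* clears the lowest set bit each pass */
        n++;
    return n;
}

/* rev8: reverse the byte order of the word. */
static inline uint32_t zbb_rev8(uint32_t x)
{
    return (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24);
}

/* orc.b: each byte becomes 0xFF if any of its bits is set, else 0x00. */
static inline uint32_t zbb_orc_b(uint32_t x)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 4; i++)
        if ((x >> (8u * i)) & 0xFFu)
            r |= 0xFFu << (8u * i);
    return r;
}
```

The zero-input cases (clz and ctz returning 32) are the ones most worth comparing against the RTL, since a naive priority encoder tends to get them wrong.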

Each fix was small (10–100 lines). They had to land together.
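
The U-mode and SUM fixes (items 2 and 3) boil down to a small truth table, sketched here in C rather than in the chip's SystemVerilog. The names translation_active, priv_u_violates_fetch, priv_u_violates_lsu, PRIV_M and MSTATUS_SUM_BIT come from the description above; the function signatures, the satp mode check and the pte_user helper are assumptions made for this illustration. The sketch ignores mstatus.MPRV and MXR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard RISC-V privilege encodings. */
enum { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

#define MSTATUS_SUM_BIT 18u          /* as stated in item 3 */
#define PTE_U_BIT       4u           /* Sv32 PTE low bits: V R W X U G A D */
#define SATP_MODE_SV32  0x80000000u  /* satp bit 31 selects Sv32 */

/* Translate in every mode except M, provided satp enables Sv32
 * (the satp check is an assumption of this sketch). */
static inline bool translation_active(unsigned cur_priv, uint32_t satp)
{
    return cur_priv != PRIV_M && (satp & SATP_MODE_SV32) != 0;
}

/* Hypothetical helper: is the page marked user-accessible? */
static inline bool pte_user(uint32_t pte)
{
    return ((pte >> PTE_U_BIT) & 1u) != 0;
}

/* Instruction fetch: U-mode needs U=1; S-mode may never execute a
 * U=1 page, whatever SUM says. */
static inline bool priv_u_violates_fetch(unsigned cur_priv, uint32_t pte)
{
    if (cur_priv == PRIV_U) return !pte_user(pte);
    if (cur_priv == PRIV_S) return pte_user(pte);
    return false;
}

/* Load/store/AMO: U-mode needs U=1; S-mode may touch a U=1 page
 * only while mstatus.SUM is set. */
static inline bool priv_u_violates_lsu(unsigned cur_priv, uint32_t pte,
                                       uint32_t mstatus)
{
    bool sum = ((mstatus >> MSTATUS_SUM_BIT) & 1u) != 0;
    if (cur_priv == PRIV_U) return !pte_user(pte);
    if (cur_priv == PRIV_S) return pte_user(pte) && !sum;
    return false;
}
```

With the satp value from the boot log (0x80000010), translation_active is true in both S-mode and U-mode, which is exactly the case the pre-P62 chip missed.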

Boot milestone progression

The benchmark harness P61 introduced now has a real column-to-column comparison:

milestone (cycle reached)                P61 (broken U-mode)   P62 (full chip)   Δ
Linux version                                      1,616,273         1,611,173   −5,100
clocksource switched                              18,685,472        18,426,917   −258,555
Run /init                                        141,182,181       130,054,364   −11,127,817 (−7.9%)
userspace hello                                never reached       131,241,874   first
kernel panic (Attempted to kill init)            unreachable       131,610,501   first

Run /init lands 11.1M cycles earlier with full Zbb. The BLAKE2s/CRNG init phase is the headline win: hardware rotates cut the BLAKE2s instruction count to roughly a third. The userspace binary itself runs from Run /init to its exit() panic in 1,556,137 cycles, including ELF load, exec, all the writes, formatting, and clean teardown.
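
The deltas quoted here are plain subtraction on the milestone cycle counts. A small C sketch of that arithmetic, with illustrative helper names that are not part of the benchmark harness:

```c
#include <stdint.h>

/* Signed cycle difference between two runs (negative means earlier). */
static inline int64_t cycle_delta(uint64_t before, uint64_t after)
{
    return (int64_t)after - (int64_t)before;
}

/* Relative change in percent, measured against the earlier run. */
static inline double percent_change(uint64_t before, uint64_t after)
{
    return 100.0 * (double)cycle_delta(before, after) / (double)before;
}
```

Feeding in the Run /init counts (141,182,181 for P61 and 130,054,364 for P62) gives −11,127,817 cycles, about −7.9%. The same subtraction between Run /init and the kernel panic on P62 gives the 1,556,137-cycle userspace lifetime.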

Files

  • src/top.sv — chip RTL with full Zbb, U-mode translation, SUM, plus the unchanged P61 TLB.
  • userspace/hello.c — freestanding RV32 program, no libc.
  • userspace/Makefile — builds hello, packs initramfs cpio with /dev/console via gen_init_cpio.
  • test/tb_freertos_demo.sv — testbench with milestone tracker, benchmark.json emitter, graceful $finish after panic so dump_profile runs.
  • test/Makefile: profile, profile-decode, charts targets.

Where the chip is spending its cycles now

The benchmark from the full boot run (post-load through userspace exit → kernel panic + 1M settle cycles):

post-load cycles           132,610,502
retired instructions        21,242,373
CPI                               6.24
TLB fetch hit rate              97.46%
TLB LSU hit rate                94.16%
memory bus stall cycles              0

State-cycle breakdown — S_FETCH is 32% of all cycles (every instruction takes two cycles in fetch: issue, then data-back), S_DECODE / S_EXECUTE / S_MEM / S_WB are about 16% each (one cycle per retired instruction), and the walker collectively eats ~1.8% — the TLB is doing exactly what it should.

State breakdown (P62 + Zbb rotates; 132,610,502 cycles, CPI 6.24)

state       share    cycles
fetch       32%      42,437,741
decode      16%      21,249,591
execute     16%      21,249,591
mem         15.9%    21,106,222
walker      1.8%      2,377,514
writeback   16%      21,242,373
mul/div     2.2%      2,947,470
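
As a sanity check on the numbers above, the per-state counts should add up to the post-load cycle total, and CPI is that total divided by retired instructions. A short C sketch of the check, with the counts copied from this page and helper names invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Per-state cycle counts as listed in the state breakdown. */
static const uint64_t STATE_CYCLES[] = {
    42437741ULL, /* fetch     */
    21249591ULL, /* decode    */
    21249591ULL, /* execute   */
    21106222ULL, /* mem       */
     2377514ULL, /* walker    */
    21242373ULL, /* writeback */
     2947470ULL, /* mul/div   */
};

/* Sum of all state counters. */
static inline uint64_t total_state_cycles(void)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < sizeof STATE_CYCLES / sizeof STATE_CYCLES[0]; i++)
        sum += STATE_CYCLES[i];
    return sum;
}

/* Cycles per retired instruction. */
static inline double cpi(uint64_t cycles, uint64_t retired)
{
    return (double)cycles / (double)retired;
}

/* One state's share of the total, in percent. */
static inline double share_percent(uint64_t part, uint64_t total)
{
    return 100.0 * (double)part / (double)total;
}
```

The seven counters sum to exactly 132,610,502, so no cycle is unaccounted for, and 132,610,502 / 21,242,373 rounds to the reported CPI of 6.24.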

The walker breakdown shows where the misses are. Fetch TLB ~97% hit rate; LSU ~94% hit rate. The remaining misses land on the walker and resolve mostly at L0 (4 KiB pages) not L1 (megapages) — kernel pages are 4 KiB by the time init runs.

Walker activity (P62 + Zbb rotates): 6,156 flushes, 9,392 L1 hits, 1,184,061 L0 hits

translations       total         TLB hit   walks
fetch              21,188,148    97.5%     539,054
load/store/amo     11,199,982    94.2%     654,399
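
The hit rates follow from the walk counts if we assume every TLB miss triggers exactly one page-table walk, so that hits are translations minus walks. That assumption is ours, not something the harness documents, but it reproduces the reported figures. A minimal C sketch with an illustrative function name:

```c
#include <stdint.h>

/* TLB hit rate in percent, assuming one walk per miss. */
static inline double tlb_hit_percent(uint64_t translations, uint64_t walks)
{
    if (translations == 0)
        return 0.0;   /* nothing translated, report 0 rather than divide by zero */
    return 100.0 * (double)(translations - walks) / (double)translations;
}
```

With 21,188,148 fetch translations and 539,054 walks this gives roughly 97.46%, and with 11,199,982 load/store/amo translations and 654,399 walks roughly 94.16%. The two walk counts also sum to 1,193,453, which equals the L1 hits plus L0 hits reported for the walker.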

PC samples classified against the kernel symbol table show where the instruction stream actually spent its time. With the Zbb rotates in place, the BLAKE2s share is meaningfully smaller than P61’s profile (where it dominated).

Hot functions (P62 + Zbb rotates; 129,503 samples, one every 1,024 cycles)

 #   function                    where    share   samples
 1   blake2s_compress_generic    kernel   16.5%   21,374
 2   memset                      kernel    7.8%   10,133
 3   memcpy                      kernel    7.6%    9,835
 4   format_decode               kernel    5.6%    7,280
 5   vsnprintf                   kernel    3.6%    4,696
 6   number                      kernel    2.6%    3,379
 7   memcmp                      kernel    2.5%    3,267
 8   vruntime_eligible           kernel    2.1%    2,713
 9   __slab_alloc_node.isra.0    kernel    1.8%    2,331
10   strlen                      kernel    1.6%    2,081
11   string_nocheck              kernel    1.3%    1,706
12   avg_vruntime                kernel    1.3%    1,649
13   chacha_permute              kernel    1.2%    1,536
14   add_uevent_var              kernel    1%      1,236
15   fdt32_ld                    kernel    0.9%    1,147
16   (remaining)                          42.6%   55,140

Harden

NOT RUN. CLZ/CTZ/CPOP are written as linear-time SystemVerilog functions: functionally correct but probably fmax-unfriendly. The post-pipelining synthesis pass will measure the real cost.
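
To show what "linear-time" means here and what one alternative looks like, the C sketch below puts a bit-by-bit CLZ next to a five-step binary-search version. Both are software models with made-up names, not the chip's RTL, and whether the log-depth form actually helps fmax after synthesis is exactly what has not been measured yet.

```c
#include <stdint.h>

/* Linear scan from the top bit: up to 32 dependent steps. */
static inline unsigned clz32_linear(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}

/* Binary search: five compare-and-shift steps (16, 8, 4, 2, 1). */
static inline unsigned clz32_log(uint32_t x)
{
    if (x == 0)
        return 32;
    unsigned n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

The two functions agree on every input, so the linear one can stay as the reference while a restructured version is evaluated.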

What just happened?

A userspace process running on a chip we wrote in seven days printed text to UART. The bytes traveled M-mode firmware → S-mode kernel → U-mode init → ecall → S-mode SBI handler → M-mode SBI putchar → physical UART. Every privilege transition the RV32-Linux model defines — and the U-bit, A-bit, D-bit, SUM, satp swaps, sfence.vma — was exercised at least once on this chip by a real, unmodified Linux binary.