P62’s profile pinned S_FETCH at 32% of all post-load
cycles — two cycles per retired instruction. The first
cycle evaluated the TLB lookup and set fetch_xlated_q; the
second cycle issued the mem-fetch and latched ir. P63
collapses that into a single cycle on the TLB-hit path, which
is 96.5% of fetches.
Headline: ~15.6% fewer cycles to userspace exec versus P62. The full F/X parallel pipeline is queued up as P64; P63 ships the easy half so the chart progression has an honest first column.
The change
The mem-issue combinational now consults the TLB result
directly instead of waiting on fetch_xlated_q:
    // Before (P62): two cycles in S_FETCH on TLB hit
    //   cycle 1: evaluate TLB, set fetch_pa_q + fetch_xlated_q
    //   cycle 2: issue mem (mem_addr = fetch_pa_q), latch ir
    mem_valid = (state == S_FETCH &&
                 (!translation_active || fetch_xlated_q));
    mem_addr  = fetch_xlated_q ? fetch_pa_q : pc;

    // After (P63): one cycle in S_FETCH on TLB hit
    //   cycle 1: TLB hit drives mem_addr, latch ir on mem_ready
    mem_valid = (state == S_FETCH &&
                 (!translation_active || fetch_xlated_q ||
                  (tlb_fetch_hit && tlb_fetch_x &&
                   !priv_u_violates_fetch(tlb_fetch_u))));
    mem_addr  = fetch_xlated_q     ? fetch_pa_q
              : translation_active ? tlb_fetch_pa
              :                      pc;
The S_FETCH next-state logic was extended in parallel: on
TLB hit + perm OK + mem_ready, latch ir and advance
to S_DECODE in the same cycle. About 30 lines of diff.
The walker / post-walker / M-mode paths are unchanged — this is a pure addition for the common case.
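To make the before/after difference concrete, here is a minimal behavioral sketch of the two mem-issue expressions in Python. It is a model under stated assumptions, not the RTL: `FetchSignals`, `priv_violation` (standing in for `priv_u_violates_fetch(tlb_fetch_u)`) and `s_fetch_cycles` are hypothetical names, the memory is assumed to answer in the same cycle it is asked, and the registered path is assumed to latch the TLB result after one cycle.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FetchSignals:
    """Snapshot of the signals the mem-issue logic looks at (hypothetical model)."""
    translation_active: bool
    fetch_xlated_q: bool      # registered "translation already done" flag
    tlb_fetch_hit: bool
    tlb_fetch_x: bool         # execute permission on the hit entry
    priv_violation: bool      # stands in for priv_u_violates_fetch(tlb_fetch_u)
    pc: int
    fetch_pa_q: int
    tlb_fetch_pa: int


def mem_issue_p62(s):
    """P62: memory is only issued once fetch_xlated_q has been registered."""
    valid = (not s.translation_active) or s.fetch_xlated_q
    addr = s.fetch_pa_q if s.fetch_xlated_q else s.pc
    return valid, addr


def mem_issue_p63(s):
    """P63: a permitted TLB hit drives the bus in the same cycle."""
    fast_hit = s.tlb_fetch_hit and s.tlb_fetch_x and not s.priv_violation
    valid = (not s.translation_active) or s.fetch_xlated_q or fast_hit
    if s.fetch_xlated_q:
        addr = s.fetch_pa_q
    elif s.translation_active:
        addr = s.tlb_fetch_pa
    else:
        addr = s.pc
    return valid, addr


def s_fetch_cycles(mem_issue, s, limit=4):
    """Count S_FETCH cycles until memory is issued (TLB-hit case only)."""
    for cycle in range(1, limit + 1):
        valid, addr = mem_issue(s)
        if valid:
            return cycle, addr
        # Assumed registered path: latch the TLB result for the next cycle.
        s = replace(s, fetch_xlated_q=True, fetch_pa_q=s.tlb_fetch_pa)
    raise RuntimeError("fetch never issued")


hit = FetchSignals(True, False, True, True, False,
                   pc=0xC0001000, fetch_pa_q=0, tlb_fetch_pa=0x00401000)
print("P62:", s_fetch_cycles(mem_issue_p62, hit))
print("P63:", s_fetch_cycles(mem_issue_p63, hit))
```

Under these assumptions the model needs two S_FETCH cycles for P62 and one for P63 on a TLB hit, and both issue the same physical address. With translation off, both versions issue `pc` in the first cycle, which matches the claim that the other paths are unchanged.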
Boot, annotated
The full sim run from chip reset through userspace exit
to Kernel panic — Attempted to kill init. Every captured
UART byte plotted onto the cycle axis where it landed. The
“silent phase” you read about everywhere, the ~111M-cycle
gap between clocksource: Switched and Run /init, is
right there, with the BLAKE2s entropy seeding sitting where
it actually sits.
- 01 M
M-mode setup
Stage-0 firmware
Cycles 0 → 150,000 · Δ 150,000 cycles · 0.1% of run
M-mode firmware builds an Sv32 page table (identity-mapping the chip's 16 MiB of RAM plus the CLINT and UART MMIO regions), sets medeleg = 0xb1ff to delegate the standard exceptions to S, configures mstatus.MPP = S, and mret-s to the kernel entry at PA 0x00400000. The trap handler installed in mtvec stays resident as the SBI v0.1 dispatcher for the rest of the run.
Hot functions: stage0_main, build_page_tables, firmware_trap_handler
UART output:
    P58 stage-0 firmware
    PT_BASE = 0x00010000
    DTB_BASE = 0x00100000
    KERNEL_BASE = 0x00400000
    page tables built
    satp = 0x80000010
    mret to kernel...
- 02 S
M → S, MMU enable
Kernel relocation
Cycles 150,000 → 1,000,000 · Δ 850,000 cycles · 0.6% of run
BSS zero-fill, then the satp dance. relocate_enable_mmu writes satp = trampoline_pg_dir (which only maps the kernel's virt range), takes the deliberate page fault on the next physical-PC fetch, lets stvec redirect PC into the virtual mapping, then writes satp = early_pg_dir for the full kernel pages. P59's instruction-fetch translation is what makes this work.
Hot functions: _start_kernel, relocate_enable_mmu, setup_vm
UART output: none (the kernel is computing, not printing).
- 03 S
first printk
Architecture init
Cycles 1,000,000 → 1,700,000 · Δ 700,000 cycles · 0.5% of run
First Linux printk: every byte of the banner travels through ecall → our M-mode SBI handler → MMIO_UART_DATA → testbench. The DT parser sees a 1-hart RV32IMA platform with a SiFive-pattern CLINT and reserves the lower 4 MiB (where stage-0 + DTB live).
Hot functions: start_kernel, setup_arch, parse_dtb, sbi_init
UART output:
    Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #10
    OF: fdt: Ignoring memory range 0x0 - 0x400000
    Machine model: P58 Linux test platform
    SBI specification v0.1 detected
    earlycon: sbi0 at I/O port 0x0 (options '')
    printk: legacy bootconsole [sbi0] enabled
- 04 S
mem zones · slub · irq · clocksource
Subsystem init
Cycles 1,700,000 → 18,426,917 · Δ 16,726,917 cycles · 12.6% of run
Page allocator, slab, IRQ controller, RISC-V clocksource. Most of the real chip-aware kernel init happens here. About 16.7 million cycles for ~16 KB of printk output; the per-instruction ratio is in line with the rest of the run.
Hot functions: mem_init, kmem_cache_init, init_IRQ, time_init
UART output:
    Zone ranges:
      Normal [mem 0x0000000000400000-0x0000000000ffffff]
    SLUB: HWalign=64, Order=0-1, MinObjects=0, CPUs=1, Nodes=1
    NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
    riscv-intc: 32 local interrupts mapped
    clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x5c40939b5
    sched_clock: 64 bits at 25MHz, resolution 40ns
    Calibrating delay loop (skipped) .. 50.00 BogoMIPS
    Memory: 9544K/12288K available (1406K kernel code, 471K rwdata, ...)
    devtmpfs: initialized
    clocksource: Switched to clocksource riscv_clocksource
- 05 S
BLAKE2s entropy seed · the long quiet
CRNG silent phase
Cycles 18,426,917 → 129,500,000 · Δ 111,073,083 cycles · 83.8% of run
~111 million cycles, no UART. Linux is hashing whatever entropy it can scrape from boot timer jitter into the CRNG state by repeatedly running BLAKE2s. Without a hardware RNG (we don't have one yet) this is unavoidable. P62 measured this loop at 25.8% of post-load PC samples; with Zbb hardware rotates, each rotate in a compression round dropped from ~3 instructions to 1.
Hot functions: blake2s_compress_generic, memset, memcpy, create_pgd_mapping
UART output: none (the kernel is computing, not printing).
- 06 S
kernel_init
init thread + /init exec
Cycles 129,500,000 → 131,241,874 · Δ 1,741,874 cycles · 1.3% of run
Switch to PID 1. The ELF loader maps the user binary into U-pages, sets up the AUXV / argv / envp on a fresh user stack, sret-s into U-mode at the binary's entry point. mstatus.SUM (P62) is what makes copy_to_user from kernel mode actually reach those U-pages.
Hot functions: kernel_init, ramdisk_execute_command, load_elf_binary, create_elf_tables
UART output:
    printk: legacy console [hvc0] enabled
    clk: Disabling unused clocks
    Freeing unused kernel image (initmem) memory: 124K
    Kernel memory protection not selected by kernel config.
    Run /init as init process
      with arguments:
        /init
        earlyprintk
      with environment:
        HOME=/
        TERM=linux
- 07 U
hello on a homemade chip
userspace executes
Cycles 131,241,874 → 131,610,501 · Δ 368,627 cycles · 0.3% of run
PID 1 in U-mode making real ecall syscalls. Each character of every line traverses U → S → M → UART. Six days of chip work, all cashed in on these 369,000 cycles.
Hot functions: _start (hello), syscall3, sys_write, sbi_console_putchar
UART output:
    ===========================================
    hello from userspace on a homemade chip!
    ===========================================
    this binary:
      - cross-compiled for rv32ima/ilp32
      - statically linked, no libc
      - reaches the linux SYS_write syscall
      - is running as PID 1 from initramfs
    the chip below us:
      - 16 MiB DRAM, 25 MHz
      - rv32ima_zba, sv32 paging, m+s priv
      - SBI v0.1 firmware in stage-0
    exiting cleanly...
- 08 S
init exited cleanly
Kernel panic
Cycles 131,610,501 → 132,610,502 · Δ 1,000,001 cycles · 0.8% of run
PID 1 calling exit(0) is a kernel-level "not allowed", so Linux duly panics. exitcode=0x00000000 confirms the user binary returned cleanly.
Hot functions: do_exit, panic
UART output:
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
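The two CSR values that stage-0 reports in phase 01 can be unpacked by hand. The sketch below decodes satp = 0x80000010 using the RV32 satp layout (MODE in bit 31, ASID in bits 30:22, PPN in bits 21:0) and lists which exception causes medeleg = 0xb1ff hands to S-mode. The function names and the dictionary are illustrative, not part of the firmware; the cause numbers follow the RISC-V privileged specification.

```python
def decode_satp_sv32(satp):
    """Split an RV32 satp value into mode, ASID and root page-table address."""
    mode = (satp >> 31) & 0x1
    asid = (satp >> 22) & 0x1FF
    ppn = satp & 0x3FFFFF
    return {
        "mode": "Sv32" if mode else "Bare",
        "asid": asid,
        "root_pa": ppn << 12,   # page tables are 4 KiB aligned
    }


# Synchronous exception cause numbers from the privileged spec.
EXCEPTION_CAUSES = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "ecall from U-mode",
    9: "ecall from S-mode",
    11: "ecall from M-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def delegated_causes(medeleg):
    """Names of the exception causes whose medeleg bit is set."""
    return [name for bit, name in sorted(EXCEPTION_CAUSES.items())
            if (medeleg >> bit) & 1]


satp = decode_satp_sv32(0x80000010)
print(satp["mode"], hex(satp["root_pa"]))
for name in delegated_causes(0xB1FF):
    print("delegated to S:", name)
```

The decoded root address comes out as 0x00010000, which agrees with the PT_BASE line in the stage-0 UART output. The medeleg value leaves "ecall from S-mode" undelegated, which is consistent with the M-mode trap handler staying resident as the SBI dispatcher.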
The 25.8% slice in phase 5 is what motivated P62’s
Zbb rotates; the 32% slice in S_FETCH (spread evenly across
phases 3–6) is what P63 attacks. The full F/X parallel
pipeline coming in P64 will go after the same 32% from the
opposite direction — by overlapping fetches with execute
cycles instead of compressing each fetch.
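The per-phase deltas and run shares quoted in the boot timeline follow directly from the phase boundaries. A small sketch that recomputes them, using only the cycle boundaries listed for the eight phases (the variable and function names are illustrative):

```python
# Cycle boundaries of the eight boot phases, from chip reset to end of run (P62).
PHASE_BOUNDARIES = [
    0, 150_000, 1_000_000, 1_700_000, 18_426_917,
    129_500_000, 131_241_874, 131_610_501, 132_610_502,
]


def phase_shares(boundaries):
    """Return (delta_cycles, percent_of_run) for each consecutive phase."""
    total = boundaries[-1] - boundaries[0]
    return [
        (end - start, round(100 * (end - start) / total, 1))
        for start, end in zip(boundaries, boundaries[1:])
    ]


for number, (delta, share) in enumerate(phase_shares(PHASE_BOUNDARIES), start=1):
    print(f"phase {number:02d}: {delta:>11,} cycles  {share:5.1f}% of run")
```

Run against the P62 boundaries, this reproduces the timeline's figures, including the 83.8% share of the CRNG silent phase.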
Measured
The full sim ran end-to-end. The fast path saved exactly one cycle per TLB-hit fetch, applied across every retired instruction:
| metric | P62 | P63 | Δ |
|---|---|---|---|
| post-load cycles | 132,610,502 | 112,063,722 | −20,546,780 (−15.5%) |
| S_FETCH cycles | 42,437,741 | 21,809,761 | −20,627,980 (−48.6%) |
| S_FETCH % of run | 32.0% | 19.5% | — |
| CPI | 6.24 | 5.27 | −0.97 |
| TLB LSU hit rate | 94.2% | 94.2% | unchanged |
| memory bus stall cycles | 0 | 0 | unchanged |
S_FETCH was nearly cut in half. What remains is the single
cycle each fetch still needs, plus the walker / post-walker /
non-translation paths, which P63 didn’t touch. CPI dropped
0.97, almost exactly one cycle per instruction, since fetches
happen once per retire and 96.5% of them take the TLB-hit path.
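The table's derived columns can be recomputed from its raw cycle counts. The sketch below does that, and also estimates the cycles saved per retired instruction; the retired-instruction count is not given directly, so it is approximated here as total cycles divided by the (rounded) CPI, which is an assumption of this example.

```python
P62 = {"cycles": 132_610_502, "s_fetch": 42_437_741, "cpi": 6.24}
P63 = {"cycles": 112_063_722, "s_fetch": 21_809_761, "cpi": 5.27}


def pct_drop(before, after):
    """Percentage reduction from before to after, one decimal place."""
    return round(100 * (before - after) / before, 1)


def pct_of_run(run):
    """S_FETCH cycles as a share of the whole run."""
    return round(100 * run["s_fetch"] / run["cycles"], 1)


# Approximate retired instructions from cycles / CPI (CPI is rounded in the table).
retired_p63 = P63["cycles"] / P63["cpi"]
saved_per_instruction = (P62["s_fetch"] - P63["s_fetch"]) / retired_p63

print("post-load cycles drop:", pct_drop(P62["cycles"], P63["cycles"]), "%")
print("S_FETCH cycles drop:  ", pct_drop(P62["s_fetch"], P63["s_fetch"]), "%")
print("S_FETCH share:        ", pct_of_run(P62), "% ->", pct_of_run(P63), "%")
print("S_FETCH cycles saved per retired instruction:",
      round(saved_per_instruction, 2))
```

The saved-cycles-per-instruction figure lands at about 0.97, the same as the CPI delta, which is what one would expect if the only change is one fewer S_FETCH cycle on each TLB-hit fetch.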
Boot milestones, each arriving roughly 15.6% to 16.4% earlier:
| milestone | P62 cycle | P63 cycle | Δ |
|---|---|---|---|
| Linux version | 1,611,173 | 1,346,691 | −16.4% |
| Switched to clocksource | 18,426,917 | 15,530,502 | −15.7% |
| Run /init | 130,054,364 | 109,755,383 | −15.6% |
| userspace hello | 131,241,874 | 110,746,765 | −15.6% |
| kernel panic | 131,610,501 | 111,063,721 | −15.6% |
P62 → P63 progression
The benchmark harness’s whole point: every chip rev gets a new column of charts so the work is visible.
| milestone | P62 Zbb rotates + U-mode | P63 fetch fast-path | Δ last col |
|---|---|---|---|
| “Linux version” | 1,611,173 | 1,346,691 | −16.4% |
| clocksource switched | 18,426,917 | 15,530,502 | −15.7% |
| “Run /init” | 130,054,364 | 109,755,383 | −15.6% |
| userspace hello | 131,241,874 | 110,746,765 | −15.6% |
| init exit (panic) | 131,610,501 | 111,063,721 | −15.6% |
CPI (cycles per retired instruction) · lower is better. Each bar's length is its CPI as a fraction of the worst run, so a bar half as long means half the cycles per instruction on the same workload.
The state-comparison chart shows it cleanly. P63’s S_FETCH
stripe is about half the size of P62’s; every other stripe
stays the same size because nothing else changed.
Where the chip is spending its cycles now
- fetch: 19.5% (21,809,761 cycles)
- decode: 19% (21,271,018 cycles)
- execute: 19% (21,271,018 cycles)
- mem: 18.9% (21,127,815 cycles)
- walker: 2.1% (2,376,234 cycles)
- writeback: 19% (21,263,802 cycles)
- mul/div: 2.6% (2,944,074 cycles)
[Chart: share of PC samples per function. The top 15 entries take 16.3%, 7.7%, 7.4%, 5.6%, 3.7%, 2.6%, 2.5%, 2.3%, 1.8%, 1.6%, 1.4%, 1.2%, 1.2%, 0.9% and 0.9% of samples (17,878 down to 988 samples); everything else accounts for the remaining 43.1% (47,193 samples).]
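The per-state cycle counts listed above can be cross-checked against the P63 total. This sketch sums them and recomputes each share. It also derives a CPI figure by treating the decode count as the number of retired instructions, which is an assumption of this example (one decode cycle per instruction) rather than something the post states.

```python
STATE_CYCLES = {
    "fetch": 21_809_761,
    "decode": 21_271_018,
    "execute": 21_271_018,
    "mem": 21_127_815,
    "walker": 2_376_234,
    "writeback": 21_263_802,
    "mul/div": 2_944_074,
}

total_cycles = sum(STATE_CYCLES.values())
state_shares = {
    state: round(100 * cycles / total_cycles, 1)
    for state, cycles in STATE_CYCLES.items()
}

# Assumption: every retired instruction spends exactly one cycle in decode.
cpi_estimate = round(total_cycles / STATE_CYCLES["decode"], 2)

print(f"total: {total_cycles:,} cycles")
for state, share in state_shares.items():
    print(f"{state:>9}: {share:4.1f}%")
print("CPI estimate:", cpi_estimate)
```

The states sum to the 112,063,722 post-load cycles reported for P63, and the CPI estimate agrees with the 5.27 in the results table.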
Instrumentation note: the testbench’s per-cycle TLB fetch-hit counter looked for
fetch_xlated_q rising, a P61 hold-over that P63’s fast path bypasses (we go straight from S_FETCH to S_DECODE without setting that bit). The walker stats and total cycles are correct; the per-event TLB-hit counter reads 0 in this run. Will fix when P64 lands.
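To illustrate why the old counter reads 0, here is a toy model of the two counting strategies over a per-cycle trace. It is a sketch of one possible fix, not the testbench code: the trace format and the `fast_hit` field (meaning "TLB hit with fetch permission this cycle") are hypothetical.

```python
def count_xlated_rising(trace):
    """Old-style counter: one hit per rising edge of fetch_xlated_q."""
    hits, previous = 0, False
    for cycle in trace:
        if cycle["fetch_xlated_q"] and not previous:
            hits += 1
        previous = cycle["fetch_xlated_q"]
    return hits


def count_fast_path_hits(trace):
    """Alternative counter: one hit per S_FETCH cycle that completes on the fast path."""
    return sum(
        1 for cycle in trace
        if cycle["state"] == "S_FETCH" and cycle["fast_hit"] and cycle["mem_ready"]
    )


def fast_path_instruction():
    """Two-cycle toy instruction: a single-cycle fetch followed by decode."""
    return [
        {"state": "S_FETCH", "fetch_xlated_q": False, "fast_hit": True, "mem_ready": True},
        {"state": "S_DECODE", "fetch_xlated_q": False, "fast_hit": False, "mem_ready": False},
    ]


trace = fast_path_instruction() * 3
print("rising-edge counter:", count_xlated_rising(trace))
print("fast-path counter:  ", count_fast_path_hits(trace))
```

In this toy trace the rising-edge counter never fires because fetch_xlated_q is never set on the fast path, while the alternative counter sees one hit per fetch.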
What’s next
- P64 — full 2-stage pipeline with a separate F-FSM running concurrently with the X side. Walker arbitration, fault staging, branch flush propagation. The expected win is another ~25% on top of P63 — fetch fully hides under execute for compute-bound code.
- P65 — 3-stage (decode separated).
- P66 — classic 5-stage MIPS with forwarding.
- P67 — synthesis baseline; finally measure fmax / area on sky130 so the speed planning has hardware truth in it, not just CPI math.
Harden
NOT RUN. Driving mem_valid from a TLB lookup adds the
TLB priority encoder into the address-generation path — that’s
the most likely place for an fmax cost. Will measure post-P66.
What just happened?
We noticed that the chip’s per-fetch overhead was an artifact
of how fetch_xlated_q was wired sequentially, not a
fundamental constraint. The TLB lookup is already
combinational (P61). Letting the bus see that result directly
saves a cycle for free, on every fetch, for the rest of the
boot.