Single-cycle fetch on TLB hit (the easy half of a 2-stage pipeline)

P62’s profile pinned S_FETCH at 32% of all post-load cycles: two cycles per retired instruction. The first cycle evaluated the TLB lookup and set fetch_xlated_q; the second issued the mem-fetch and latched ir. P63 collapses that into a single cycle on the TLB-hit path, which covers about 97.5% of fetches (the boot timeline reports a 97.46% fetch TLB hit rate).

Headline: ~15.6% fewer cycles to userspace exec versus P62 (131,241,874 → 110,746,765). The full F/X parallel pipeline is queued up as P64; P63 ships the easy half so the chart progression has an honest first column.

The change

The combinational mem-issue logic now consults the TLB result directly instead of waiting on fetch_xlated_q:

```
// Before (P62): two cycles in S_FETCH on TLB hit
//   cycle 1: evaluate TLB, set fetch_pa_q + fetch_xlated_q
//   cycle 2: issue mem (mem_addr = fetch_pa_q), latch ir
mem_valid = (state == S_FETCH && (!translation_active || fetch_xlated_q));
mem_addr  = fetch_xlated_q ? fetch_pa_q : pc;
```

```
// After (P63): one cycle in S_FETCH on TLB hit
//   cycle 1: TLB hit drives mem_addr, latch ir on mem_ready
mem_valid = (state == S_FETCH &&
             (!translation_active || fetch_xlated_q ||
              (tlb_fetch_hit && tlb_fetch_x &&
               !priv_u_violates_fetch(tlb_fetch_u))));
mem_addr  = fetch_xlated_q     ? fetch_pa_q
          : translation_active ? tlb_fetch_pa
                               : pc;
```

The S_FETCH next-state logic was extended in parallel: on TLB hit + perm OK + mem_ready, latch ir and advance to S_DECODE in the same cycle. About 30 lines of diff.

The walker / post-walker / M-mode paths are unchanged — this is a pure addition for the common case.
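The widened mem_valid condition can be sanity-checked outside the RTL. The sketch below is a Python truth-table model, not the chip's source: each signal is treated as a boolean, the result of priv_u_violates_fetch(tlb_fetch_u) is folded into a single hypothetical `u_violation` input, and the function names are made up for illustration. It enumerates every input combination to confirm that P63 only ever adds bus requests, and that the one added case is the TLB-hit, permission-clean fast path.

```python
from itertools import product

def mem_valid_p62(in_fetch, translation_active, fetch_xlated_q):
    """P62 condition: issue only when untranslated or already translated."""
    return in_fetch and (not translation_active or fetch_xlated_q)

def mem_valid_p63(in_fetch, translation_active, fetch_xlated_q,
                  tlb_hit, tlb_x, u_violation):
    """P63 condition: also issue on a TLB hit with clean permissions."""
    fast_path = tlb_hit and tlb_x and not u_violation
    return in_fetch and (not translation_active or fetch_xlated_q or fast_path)

def new_issue_cases():
    """Input combinations where P63 asserts mem_valid but P62 does not."""
    cases = []
    for bits in product([False, True], repeat=6):
        in_fetch, ta, xq, hit, x, uv = bits
        old = mem_valid_p62(in_fetch, ta, xq)
        new = mem_valid_p63(in_fetch, ta, xq, hit, x, uv)
        assert not (old and not new)  # P63 never drops a request P62 made
        if new and not old:
            cases.append(bits)
    return cases

if __name__ == "__main__":
    # Tuple order: in_fetch, translation_active, fetch_xlated_q,
    #              tlb_hit, tlb_x, u_violation
    for case in new_issue_cases():
        print(case)
```

The only new case is: in S_FETCH, translation active, fetch_xlated_q still clear, TLB hit, executable, no U-bit violation. Every other combination behaves as it did in P62, which matches the claim that the walker and M-mode paths are untouched.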

Boot, annotated

The full sim run, from chip reset through userspace exit to “Kernel panic - not syncing: Attempted to kill init!”, with every captured UART byte plotted onto the cycle axis where it landed. The “silent phase” you read about everywhere, that ~111M-cycle gap between “clocksource: Switched” and “Run /init”, is right there, with the BLAKE2s entropy seeding sitting where it actually sits.

Boot timeline: window 0 → 132,610,502 cycles · sim clock 25 MHz simulated · cpi 6.24 · tlb hit 97.46% / 94.16% (F / LSU)

Zoom panel: last 3% of run, 128,632,187 → 132,610,502 cycles, covering phases 05 (CRNG silent phase), 06 (init thread + /init exec), 07 (userspace executes) and 08 (Kernel panic).

Phase 01 (M): M-mode setup

Stage-0 firmware

Cycles 0 → 150,000 · Δ 150,000 cycles · 0.1% of run

M-mode firmware builds an Sv32 page table (identity-mapping the chip's 16 MiB of RAM plus the CLINT and UART MMIO regions), sets medeleg = 0xb1ff to delegate the standard exceptions to S, configures mstatus.MPP = S, and mret-s to the kernel entry at PA 0x00400000. The trap handler installed in mtvec stays resident as the SBI v0.1 dispatcher for the rest of the run.

hot: stage0_main · build_page_tables · firmware_trap_handler
```
P58 stage-0 firmware
  PT_BASE     = 0x00010000
  DTB_BASE    = 0x00100000
  KERNEL_BASE = 0x00400000
  page tables built
  satp        = 0x80000010
  mret to kernel...
```
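The two CSR values in this phase can be decoded by hand. The sketch below is a Python illustration using the RV32 satp layout (MODE in bit 31, ASID in bits 30:22, PPN in bits 21:0) and the standard exception-cause numbering from the RISC-V privileged spec; the function and table names are hypothetical, not taken from the firmware.

```python
# Standard RISC-V synchronous exception causes (bit index in medeleg).
EXCEPTION_NAMES = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "ecall from U-mode",
    9: "ecall from S-mode",
    11: "ecall from M-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}

def decode_satp_sv32(satp):
    """Split an RV32 satp value into mode, ASID and root page-table address."""
    mode = (satp >> 31) & 0x1
    asid = (satp >> 22) & 0x1FF
    ppn = satp & 0x3FFFFF
    return {"mode": "Sv32" if mode else "Bare", "asid": asid, "root_pa": ppn << 12}

def delegated_exceptions(medeleg):
    """List the exception causes whose medeleg bit is set."""
    return [name for bit, name in sorted(EXCEPTION_NAMES.items())
            if (medeleg >> bit) & 1]

if __name__ == "__main__":
    print(decode_satp_sv32(0x80000010))
    for name in delegated_exceptions(0xB1FF):
        print("delegated:", name)
```

Decoding satp = 0x80000010 gives Sv32 mode with the root table at 0x00010000, which agrees with PT_BASE in the log. In medeleg = 0xb1ff the "ecall from S-mode" bit is clear, so kernel ecalls still trap to M-mode, consistent with the firmware staying resident as the SBI dispatcher.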
Phase 02 (S): M → S, MMU enable

Kernel relocation

Cycles 150,000 → 1,000,000 · Δ 850,000 cycles · 0.6% of run

BSS zero-fill, then the satp dance. relocate_enable_mmu writes satp = trampoline_pg_dir (which only maps the kernel's virt range), takes the deliberate page fault on the next physical-PC fetch, lets stvec redirect PC into the virtual mapping, then writes satp = early_pg_dir for the full kernel pages. P59's instruction-fetch translation is what makes this work.

hot: _start_kernel · relocate_enable_mmu · setup_vm

silent on UART — kernel is computing, not printing.
Phase 03 (S): first printk

Architecture init

Cycles 1,000,000 → 1,700,000 · Δ 700,000 cycles · 0.5% of run

First Linux printk — every byte of the banner travels through ecall → our M-mode SBI handler → MMIO_UART_DATA → testbench. The DT parser sees a 1-hart RV32IMA platform with a SiFive-pattern CLINT and reserves the lower 4 MiB (where stage-0 + DTB live).

hot: start_kernel · setup_arch · parse_dtb · sbi_init
```
Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #10
OF: fdt: Ignoring memory range 0x0 - 0x400000
Machine model: P58 Linux test platform
SBI specification v0.1 detected
earlycon: sbi0 at I/O port 0x0 (options '')
printk: legacy bootconsole [sbi0] enabled
```
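The byte path described for this phase (ecall → M-mode SBI handler → MMIO_UART_DATA) can be modeled in a few lines. This is a behavioral Python sketch, not the firmware: `FakeUart`, `sbi_trap_handler` and `console_write` are hypothetical names, and the return value for unsupported calls is an assumption, since the post does not say what the real handler returns. The one fact relied on is that the legacy SBI v0.1 console_putchar call passes extension ID 1 in a7 and the character in a0.

```python
SBI_CONSOLE_PUTCHAR = 1   # legacy SBI v0.1 extension ID, passed in a7
NOT_SUPPORTED = -2        # assumed error value for calls this model ignores

class FakeUart:
    """Stand-in for the MMIO_UART_DATA register watched by the testbench."""
    def __init__(self):
        self.data = bytearray()

    def write_data(self, byte):
        self.data.append(byte & 0xFF)

def sbi_trap_handler(regs, uart):
    """Handle one S-mode ecall; the result is what would be returned in a0."""
    if regs["a7"] == SBI_CONSOLE_PUTCHAR:
        uart.write_data(regs["a0"])
        return 0
    return NOT_SUPPORTED

def console_write(text, uart):
    """Model the kernel's boot console: one ecall per output byte."""
    for byte in text.encode("ascii"):
        sbi_trap_handler({"a7": SBI_CONSOLE_PUTCHAR, "a0": byte}, uart)

if __name__ == "__main__":
    uart = FakeUart()
    console_write("SBI specification v0.1 detected\n", uart)
    print(uart.data.decode("ascii"), end="")
```

One trap per character is why console output is comparatively expensive on this platform: every printed byte pays for a full S → M → S round trip.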

Phase 04 (S): mem zones · slub · irq · clocksource

Subsystem init

Cycles 1,700,000 → 18,426,917 · Δ 16,726,917 cycles · 12.6% of run

Page allocator, slab, IRQ controller, RISC-V clocksource. Most of the real chip-aware kernel init happens here. About 16.7M cycles for ~16 KB of printk output; the per-instruction ratio is in line with the rest of the run.

hot: mem_init · kmem_cache_init · init_IRQ · time_init

```
Zone ranges:
  Normal   [mem 0x0000000000400000-0x0000000000ffffff]
SLUB: HWalign=64, Order=0-1, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
riscv-intc: 32 local interrupts mapped
clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x5c40939b5
sched_clock: 64 bits at 25MHz, resolution 40ns
Calibrating delay loop (skipped) .. 50.00 BogoMIPS
Memory: 9544K/12288K available (1406K kernel code, 471K rwdata, ...)
devtmpfs: initialized
clocksource: Switched to clocksource riscv_clocksource
```
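A few numbers in this log follow directly from the platform parameters and can be cross-checked. The Python sketch below uses hypothetical helper names; the inputs are the zone range, the 25 MHz clock and the phase length quoted above.

```python
def zone_kib(start, end_inclusive):
    """Size of a physical memory range in KiB (end address is inclusive)."""
    return (end_inclusive - start + 1) // 1024

def tick_ns(freq_hz):
    """Duration of one clock tick in nanoseconds."""
    return 1_000_000_000 / freq_hz

def cycles_to_seconds(cycles, freq_hz):
    """Convert a cycle count to simulated wall-clock seconds."""
    return cycles / freq_hz

if __name__ == "__main__":
    print(zone_kib(0x00400000, 0x00FFFFFF), "KiB in the Normal zone")
    print(tick_ns(25_000_000), "ns per tick")
    print(round(cycles_to_seconds(16_726_917, 25_000_000), 3), "s for phase 04")
```

The Normal zone comes out at 12288 KiB, matching the "12288K" total in the Memory line, and a 25 MHz clock gives the 40 ns sched_clock resolution the kernel reports.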

Phase 05 (S): BLAKE2s entropy seed · the long quiet

CRNG silent phase

Cycles 18,426,917 → 129,500,000 · Δ 111,073,083 cycles · 83.8% of run

~111 million cycles, no UART. Linux is hashing whatever entropy it can scrape from boot timer jitter into the CRNG state by repeatedly running BLAKE2s. Without a hardware RNG (we don't have one yet) this is unavoidable. P62 measured this loop at 25.8% of post-load PC samples; with Zbb hardware rotates, each rotate in the compression rounds dropped from ~3 instructions to 1.

hot: blake2s_compress_generic · memset · memcpy · create_pgd_mapping

silent on UART — kernel is computing, not printing.
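The rotate saving mentioned for the BLAKE2s loop is easy to see in a model. Without Zbb, a 32-bit rotate right is built from a right shift, a left shift and an OR; with Zbb it is a single rotate instruction. The Python sketch below mirrors the three-operation form and counts rotates per compression round, using the BLAKE2s structure from RFC 7693 (8 G mixes per round, 4 rotates per mix, by 16, 12, 8 and 7 bits). The helper names are hypothetical, and real compiler output may differ from this idealized count.

```python
MASK32 = 0xFFFFFFFF
BLAKE2S_ROTATIONS = (16, 12, 8, 7)   # rotate amounts used by the G mix
G_CALLS_PER_ROUND = 8

def rotr32_shift_or(x, n):
    """Rotate right as base RV32I must: shift right, shift left, OR."""
    right = (x & MASK32) >> n            # srli
    left = (x << (32 - n)) & MASK32      # slli
    return right | left                  # or

def rotate_instrs_per_round(instrs_per_rotate):
    """Instructions spent on rotates in one BLAKE2s compression round."""
    return G_CALLS_PER_ROUND * len(BLAKE2S_ROTATIONS) * instrs_per_rotate

if __name__ == "__main__":
    print(hex(rotr32_shift_or(0x12345678, 16)))
    print("without Zbb:", rotate_instrs_per_round(3), "instructions per round")
    print("with Zbb:   ", rotate_instrs_per_round(1), "instructions per round")
```

Under these assumptions the rotate work drops from 96 to 32 instructions per round, which is the "~3 to 1" ratio quoted above applied across the whole mix.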
Phase 06 (S): kernel_init

init thread + /init exec

Cycles 129,500,000 → 131,241,874 · Δ 1,741,874 cycles · 1.3% of run

Switch to PID 1. The ELF loader maps the user binary into U-pages, sets up the AUXV / argv / envp on a fresh user stack, and sret-s into U-mode at the binary's entry point. mstatus.SUM (P62) is what makes copy_to_user from kernel mode actually reach those U-pages.

hot: kernel_init · ramdisk_execute_command · load_elf_binary · create_elf_tables
```
printk: legacy console [hvc0] enabled
clk: Disabling unused clocks
Freeing unused kernel image (initmem) memory: 124K
Kernel memory protection not selected by kernel config.
Run /init as init process
  with arguments:
    /init
    earlyprintk
  with environment:
    HOME=/
    TERM=linux
```

Phase 07 (U): hello on a homemade chip

userspace executes

Cycles 131,241,874 → 131,610,501 · Δ 368,627 cycles · 0.3% of run

PID 1 in U-mode making real ecall syscalls. Each character of every line traverses U → S → M → UART. Six days of chip work, all cashed in on these 369,000 cycles.

hot: _start (hello) · syscall3 · sys_write · sbi_console_putchar


```
===========================================
  hello from userspace on a homemade chip!
===========================================

  this binary:
    - cross-compiled for rv32ima/ilp32
    - statically linked, no libc
    - reaches the linux SYS_write syscall
    - is running as PID 1 from initramfs

  the chip below us:
    - 16 MiB DRAM, 25 MHz
    - rv32ima_zba, sv32 paging, m+s priv
    - SBI v0.1 firmware in stage-0

  exiting cleanly...
```

Phase 08 (S): init exited cleanly

Kernel panic

Cycles 131,610,501 → 132,610,502 · Δ 1,000,001 cycles · 0.8% of run

PID 1 calling exit(0) is a kernel-level "not allowed" — Linux duly panics. exitcode=0x00000000 confirms the user binary returned cleanly.

hot: do_exit · panic
```
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
```

The 25.8% slice in phase 5 is what motivated P62’s Zbb rotates; the 32% slice in S_FETCH (spread evenly across phases 3–6) is what P63 attacks. The full F/X parallel pipeline coming in P64 will go after the same 32% from the opposite direction — by overlapping fetches with execute cycles instead of compressing each fetch.
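The per-phase shares quoted in the timeline can be recomputed from the phase boundaries alone. The Python sketch below copies the boundary cycle counts from the annotated boot above; the function name is hypothetical.

```python
# Phase boundaries in cycles, copied from the annotated timeline.
BOUNDARIES = [0, 150_000, 1_000_000, 1_700_000, 18_426_917,
              129_500_000, 131_241_874, 131_610_501, 132_610_502]

def phase_shares(boundaries):
    """Return (delta_cycles, percent_of_run) for each consecutive phase."""
    total = boundaries[-1] - boundaries[0]
    shares = []
    for start, end in zip(boundaries, boundaries[1:]):
        delta = end - start
        shares.append((delta, round(100 * delta / total, 1)))
    return shares

if __name__ == "__main__":
    for number, (delta, pct) in enumerate(phase_shares(BOUNDARIES), start=1):
        print(f"phase {number:02d}: {delta:>11,} cycles  {pct:4.1f}%")
```

Phase 05 alone accounts for 111,073,083 of the 132,610,502 cycles, so any per-instruction saving such as the fetch fast path shows up almost entirely as a shorter silent phase.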

Measured

The full sim ran end-to-end. The fast path saved exactly one cycle per TLB-hit fetch, applied across every retired instruction:

| metric | P62 | P63 | Δ |
| --- | --- | --- | --- |
| post-load cycles | 132,610,502 | 112,063,722 | −20,546,780 (−15.5%) |
| `S_FETCH` cycles | 42,437,741 | 21,809,761 | −20,627,980 (−48.6%) |
| `S_FETCH` % of run | 32.0% | 19.5% | — |
| CPI | 6.24 | 5.27 | −0.97 |
| TLB LSU hit rate | 94.2% | 94.2% | unchanged |
| memory bus stall cycles | 0 | 0 | unchanged |

S_FETCH was nearly cut in half (the half it kept is the walker / post-walker / non-translation paths, which P63 didn’t touch). CPI dropped 0.97 — almost exactly one cycle per instruction, since fetches happen once per retire.
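The CPI figures and the "one cycle per instruction" claim can be reproduced from the cycle counts. The Python sketch below makes one assumption that the post does not state outright: it uses the decode-state cycle counts from the state distribution (21,249,591 for P62 and 21,271,018 for P63) as the number of retired instructions, on the reasoning that each instruction spends exactly one cycle in decode.

```python
# Cycle counts from the tables in this post. "retired" is the decode-state
# cycle count, used here as an assumed proxy for retired instructions.
P62 = {"cycles": 132_610_502, "fetch": 42_437_741, "retired": 21_249_591}
P63 = {"cycles": 112_063_722, "fetch": 21_809_761, "retired": 21_271_018}

def summarize(run):
    """Return CPI and S_FETCH cycles per instruction, rounded to 2 places."""
    return {
        "cpi": round(run["cycles"] / run["retired"], 2),
        "fetch_per_instr": round(run["fetch"] / run["retired"], 2),
    }

if __name__ == "__main__":
    print("P62", summarize(P62))
    print("P63", summarize(P63))
```

Under that assumption, S_FETCH cost about 2.00 cycles per instruction in P62 and about 1.03 in P63. The residual above 1.0 is the slow path: fetches that miss the TLB still spend an extra cycle in S_FETCH around the walk.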

Boot milestones, each arriving 15.6% to 16.4% earlier:

| milestone | P62 cycle | P63 cycle | Δ |
| --- | --- | --- | --- |
| Linux version | 1,611,173 | 1,346,691 | −16.4% |
| Switched to clocksource | 18,426,917 | 15,530,502 | −15.7% |
| Run /init | 130,054,364 | 109,755,383 | −15.6% |
| userspace hello | 131,241,874 | 110,746,765 | −15.6% |
| kernel panic | 131,610,501 | 111,063,721 | −15.6% |
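The Δ column is a plain relative change against the P62 cycle count. A minimal Python sketch, with the milestone cycle counts copied from the table and a hypothetical helper name:

```python
# (P62 cycle, P63 cycle) per milestone, copied from the table above.
MILESTONES = {
    "Linux version": (1_611_173, 1_346_691),
    "Switched to clocksource": (18_426_917, 15_530_502),
    "Run /init": (130_054_364, 109_755_383),
    "userspace hello": (131_241_874, 110_746_765),
    "kernel panic": (131_610_501, 111_063_721),
}

def percent_change(before, after):
    """Relative change from before to after, in percent, one decimal place."""
    return round(100 * (after - before) / before, 1)

if __name__ == "__main__":
    for name, (p62, p63) in MILESTONES.items():
        print(f"{name:<24} {percent_change(p62, p63):+.1f}%")
```

The earliest milestone moves slightly more than the rest, which would be expected if the early boot code has a somewhat higher share of fast-path fetches than the run as a whole; the post does not break this down further.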

P62 → P63 progression

The benchmark harness’s whole point: every chip rev gets a new column of charts so the work is visible.

milestone progression · 2 runs · scale 0 → 131.6M cycles

| milestone | P62 Zbb rotates + U-mode | P63 fetch fast-path | Δ last col |
| --- | --- | --- | --- |
| “Linux version” | 1,611,173 cycles (1.6M) | 1,346,691 cycles (1.3M) | −16.4% |
| clocksource switched | 18,426,917 cycles (18.4M) | 15,530,502 cycles (15.5M) | −15.7% |
| “Run /init” | 130,054,364 cycles (130.1M) | 109,755,383 cycles (109.8M) | −15.6% |
| userspace hello | 131,241,874 cycles (131.2M) | 110,746,765 cycles (110.7M) | −15.6% |
| init exit (panic) | 131,610,501 cycles (131.6M) | 111,063,721 cycles (111.1M) | −15.6% |

cpi compare · 2 runs · best 5.27 CPI · worst 6.24 CPI

| run | CPI | vs P62 |
| --- | --- | --- |
| P62 Zbb rotates + U-mode | 6.24 | baseline |
| P63 fetch fast-path | 5.27 | −0.97 |

cpi: cycles per retired instruction · lower is better. Each bar's length is its CPI as a fraction of the worst run, so a 50%-shorter bar means the chip needs half as many cycles for the same workload.

state distribution · 2 runs · scale: % of post-load cycles

| state | P62 Zbb rotates + U-mode (132,610,502 cycles) | P63 fetch fast-path (112,063,722 cycles) |
| --- | --- | --- |
| fetch | 32% (42,437,741 cycles) | 19.5% (21,809,761 cycles) |
| decode | 16% (21,249,591 cycles) | 19% (21,271,018 cycles) |
| execute | 16% (21,249,591 cycles) | 19% (21,271,018 cycles) |
| mem | 15.9% (21,106,222 cycles) | 18.9% (21,127,815 cycles) |
| walker | 1.8% (2,377,514 cycles) | 2.1% (2,376,234 cycles) |
| writeback | 16% (21,242,373 cycles) | 19% (21,263,802 cycles) |
| mul/div | 2.2% (2,947,470 cycles) | 2.6% (2,944,074 cycles) |

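The state-comparison chart shows it cleanly. P63’s S_FETCH stripe is about half the size of P62’s in absolute cycles; every other state keeps nearly the same cycle count because nothing else changed, and its percentage share grows only because the total shrank.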

Where the chip is spending its cycles now

state breakdown · P63 fetch fast-path · 112,063,722 cycles · cpi 5.27

| state | share | cycles |
| --- | --- | --- |
| fetch | 19.5% | 21,809,761 |
| decode | 19% | 21,271,018 |
| execute | 19% | 21,271,018 |
| mem | 18.9% | 21,127,815 |
| walker | 2.1% | 2,376,234 |
| writeback | 19% | 21,263,802 |
| mul/div | 2.6% | 2,944,074 |

walker activity · P63 fetch fast-path · flushes 6,155 · L1 hits 9,392 · L0 hits 1,183,421

| translation type | total | TLB hit | walks |
| --- | --- | --- | --- |
| fetch | 538,742 | 0% | 538,742 |
| load/store/amo | 11,227,912 | 94.2% | 654,071 |

hot functions · P63 fetch fast-path · 109,438 samples · one sample every 1,024 cycles

| function | where | share of samples | samples |
| --- | --- | --- | --- |
| blake2s_compress_generic | kernel | 16.3% | 17,878 |
| memset | kernel | 7.7% | 8,410 |
| memcpy | kernel | 7.4% | 8,096 |
| format_decode | kernel | 5.6% | 6,095 |
| vsnprintf | kernel | 3.7% | 4,019 |
| number | kernel | 2.6% | 2,801 |
| memcmp | kernel | 2.5% | 2,689 |
| vruntime_eligible | kernel | 2.3% | 2,480 |
| __slab_alloc_node.isra.0 | kernel | 1.8% | 1,931 |
| strlen | kernel | 1.6% | 1,722 |
| avg_vruntime | kernel | 1.4% | 1,480 |
| string_nocheck | kernel | 1.2% | 1,367 |
| chacha_permute | kernel | 1.2% | 1,282 |
| add_uevent_var | kernel | 0.9% | 1,007 |
| fdt32_ld | kernel | 0.9% | 988 |
| (remaining) | remaining | 43.1% | 47,193 |

Instrumentation note: the testbench’s per-cycle TLB fetch-hit counter looked for fetch_xlated_q rising — a P61 hold-over that P63’s fast path bypasses (we go straight from S_FETCH to S_DECODE without setting that bit). The walker stats and total cycles are correct; the per-event TLB-hit counter reads 0 in this run. Will fix when P64 lands.
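Until the counter is fixed, the fetch hit rate can be estimated from the walker stats, which the note says are still correct. The Python sketch below is an estimate under a stated assumption, not a measurement: it treats the P63 decode-cycle count (21,271,018) as the number of translated fetches, which slightly overcounts because M-mode fetches are not translated. The LSU figure uses the reported totals directly.

```python
def hit_rate(translations, walks):
    """Fraction of translations that did not need a page-table walk."""
    return 1 - walks / translations

# Walker stats from the P63 run.
LSU_TRANSLATIONS, LSU_WALKS = 11_227_912, 654_071
FETCH_WALKS = 538_742
# Assumption: one translated fetch per retired instruction, with the
# decode-state cycle count standing in for retired instructions.
FETCHES_ESTIMATE = 21_271_018

if __name__ == "__main__":
    print(f"LSU hit rate:   {100 * hit_rate(LSU_TRANSLATIONS, LSU_WALKS):.2f}%")
    print(f"fetch hit rate: {100 * hit_rate(FETCHES_ESTIMATE, FETCH_WALKS):.2f}% (estimate)")
```

The estimate lands within a hundredth of a percentage point of the 97.46% fetch hit rate shown on the boot timeline, which suggests the fast path left TLB behavior unchanged and only the counter is wrong.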

What’s next

P64 — full 2-stage pipeline with a separate F-FSM running concurrently with the X side. Walker arbitration, fault staging, branch flush propagation. The expected win is another ~25% on top of P63 — fetch fully hides under execute for compute-bound code.
P65 — 3-stage (decode separated).
P66 — classic 5-stage MIPS with forwarding.
P67 — synthesis baseline; finally measure fmax / area on sky130 so the speed planning has hardware truth in it, not just CPI math.

Harden

NOT RUN. Driving mem_valid from a TLB lookup adds the TLB priority encoder into the address-generation path — that’s the most likely place for an fmax cost. Will measure post-P66.

What just happened?

We noticed that the chip’s per-fetch overhead was an artifact of how fetch_xlated_q was wired sequentially, not a fundamental constraint. The TLB lookup is already combinational (P61). Letting the bus see that result directly saves a cycle for free, on every fetch, for the rest of the boot.
