No. 63 of 147 projects on the ladder

Single-cycle fetch on TLB hit (the easy half of a 2-stage pipeline)

introduces — combinational TLB-driven mem-issue path; an annotated timeline of the full Linux boot

harden state · last run 2026-05-04
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P62’s profile pinned S_FETCH at 32% of all post-load cycles — two cycles per retired instruction. The first cycle evaluated the TLB lookup and set fetch_xlated_q; the second cycle issued the mem-fetch and latched ir. P63 collapses that into a single cycle on the TLB-hit path, which is 96.5% of fetches.

Headline: ~15.6% fewer cycles to userspace exec versus P62 (131,241,874 → 110,746,765). The full F/X parallel pipeline is queued up as P64; P63 ships the easy half so the chart progression has an honest first column.

The change

The combinational mem-issue logic now consults the TLB result directly instead of waiting on fetch_xlated_q:

// Before (P62): two cycles in S_FETCH on TLB hit
//   cycle 1 — evaluate TLB, set fetch_pa_q + fetch_xlated_q
//   cycle 2 — issue mem (mem_addr = fetch_pa_q), latch ir
mem_valid = (state == S_FETCH && (!translation_active || fetch_xlated_q));
mem_addr  = fetch_xlated_q ? fetch_pa_q : pc;

// After (P63): one cycle in S_FETCH on TLB hit
//   cycle 1 — TLB hit drives mem_addr, latch ir on mem_ready
mem_valid = (state == S_FETCH &&
             (!translation_active || fetch_xlated_q ||
              (tlb_fetch_hit && tlb_fetch_x &&
               !priv_u_violates_fetch(tlb_fetch_u))));
mem_addr  = fetch_xlated_q   ? fetch_pa_q
          : translation_active ? tlb_fetch_pa
                               : pc;

The S_FETCH next-state logic was extended in parallel: on TLB hit + perm OK + mem_ready, latch ir and advance to S_DECODE in the same cycle. About 30 lines of diff.

The walker / post-walker / M-mode paths are unchanged — this is a pure addition for the common case.
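
The two mem-issue expressions above can be compared with a small behavioral model. This is a sketch in Python, not the RTL: the `FetchInputs` container, the `u_violation` flag (standing in for `priv_u_violates_fetch(tlb_fetch_u)`), and the `cycles_in_s_fetch` helper are illustrative names, and the model assumes `state == S_FETCH` and that `mem_ready` arrives in the issuing cycle.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FetchInputs:
    translation_active: bool  # translation is on for this fetch
    fetch_xlated_q: bool      # registered "translation done" flag
    fetch_pa_q: int           # registered physical address
    tlb_fetch_hit: bool       # combinational TLB lookup result
    tlb_fetch_x: bool         # execute permission of the matching entry
    tlb_fetch_pa: int         # physical address from the TLB
    u_violation: bool         # stand-in for priv_u_violates_fetch(tlb_fetch_u)
    pc: int


def mem_issue_p62(i):
    """P62: the bus only sees a translated address after fetch_xlated_q is set."""
    valid = (not i.translation_active) or i.fetch_xlated_q
    addr = i.fetch_pa_q if i.fetch_xlated_q else i.pc
    return valid, addr


def mem_issue_p63(i):
    """P63: a permitted TLB hit drives the bus in the same cycle."""
    fast_hit = i.tlb_fetch_hit and i.tlb_fetch_x and not i.u_violation
    valid = (not i.translation_active) or i.fetch_xlated_q or fast_hit
    if i.fetch_xlated_q:
        addr = i.fetch_pa_q
    elif i.translation_active:
        addr = i.tlb_fetch_pa
    else:
        addr = i.pc
    return valid, addr


def cycles_in_s_fetch(issue, i):
    """Count S_FETCH cycles until mem_valid, for a TLB hit with OK permissions."""
    cycles = 0
    while True:
        cycles += 1
        valid, _ = issue(i)
        if valid:
            return cycles
        # P62 cycle 1: register the translation so the next cycle can issue.
        i = replace(i, fetch_xlated_q=True, fetch_pa_q=i.tlb_fetch_pa)


hit = FetchInputs(True, False, 0, True, True, 0x00401000, False, 0xC0001000)
print(cycles_in_s_fetch(mem_issue_p62, hit), cycles_in_s_fetch(mem_issue_p63, hit))
```

With translation off, both versions issue at `pc` immediately, which matches the claim that the M-mode path is unchanged.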

Boot, annotated

The full sim run from chip reset through userspace exit to Kernel panic — Attempted to kill init. Every captured UART byte plotted onto the cycle axis where it landed. The “silent phase” you read about everywhere — that 100M-cycle gap between clocksource: Switched and Run /init — is right there, with the BLAKE2s entropy seeding sitting where it actually sits.

boot timeline · window 0 → 132,610,502 cycles · sim clock 25 MHz (simulated) · CPI 6.24 · TLB hit 97.46% / 94.16% (F / LSU)
  1. 01 M

    M-mode setup

    Stage-0 firmware

    cycles 0 → 150,000 · Δ 150,000 cycles · 0.1% of run

    M-mode firmware builds an Sv32 page table (identity-mapping the chip's 16 MiB of RAM plus the CLINT and UART MMIO regions), sets medeleg = 0xb1ff to delegate the standard exceptions to S, configures mstatus.MPP = S, and mret-s to the kernel entry at PA 0x00400000. The trap handler installed in mtvec stays resident as the SBI v0.1 dispatcher for the rest of the run.

    hot: stage0_main · build_page_tables · firmware_trap_handler
    P58 stage-0 firmware
      PT_BASE     = 0x00010000
      DTB_BASE    = 0x00100000
      KERNEL_BASE = 0x00400000
      page tables built
      satp        = 0x80000010
      mret to kernel...
  2. 02 S

    M → S, MMU enable

    Kernel relocation

    cycles 150,000 → 1,000,000 · Δ 850,000 cycles · 0.6% of run

    BSS zero-fill, then the satp dance. relocate_enable_mmu writes satp = trampoline_pg_dir (which only maps the kernel's virt range), takes the deliberate page fault on the next physical-PC fetch, lets stvec redirect PC into the virtual mapping, then writes satp = early_pg_dir for the full kernel pages. P59's instruction-fetch translation is what makes this work.

    hot: _start_kernel · relocate_enable_mmu · setup_vm

    silent on UART — kernel is computing, not printing.

  3. 03 S

    first printk

    Architecture init

    cycles 1,000,000 → 1,700,000 · Δ 700,000 cycles · 0.5% of run

    First Linux printk — every byte of the banner travels through ecall → our M-mode SBI handler → MMIO_UART_DATA → testbench. The DT parser sees a 1-hart RV32IMA platform with a SiFive-pattern CLINT and reserves the lower 4 MiB (where stage-0 + DTB live).

    hot: start_kernel · setup_arch · parse_dtb · sbi_init
    Linux version 6.12.85 (jadams@solomon) (riscv64-unknown-linux-gnu-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.46) #10
    OF: fdt: Ignoring memory range 0x0 - 0x400000
    Machine model: P58 Linux test platform
    SBI specification v0.1 detected
    earlycon: sbi0 at I/O port 0x0 (options '')
    printk: legacy bootconsole [sbi0] enabled
  4. 04 S

    mem zones · slub · irq · clocksource

    Subsystem init

    cycles 1,700,000 → 18,426,917 · Δ 16,726,917 cycles · 12.6% of run

    Page allocator, slab, IRQ controller, RISC-V clocksource. Most of the real chip-aware kernel init happens here. ~16.7M cycles for ~16 KB of printk output; the per-instruction ratio is in line with the rest of the run.

    hot: mem_init · kmem_cache_init · init_IRQ · time_init
    Zone ranges:
      Normal   [mem 0x0000000000400000-0x0000000000ffffff]
    SLUB: HWalign=64, Order=0-1, MinObjects=0, CPUs=1, Nodes=1
    NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
    riscv-intc: 32 local interrupts mapped
    clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x5c40939b5
    sched_clock: 64 bits at 25MHz, resolution 40ns
    Calibrating delay loop (skipped) .. 50.00 BogoMIPS
    Memory: 9544K/12288K available (1406K kernel code, 471K rwdata, ...)
    devtmpfs: initialized
    clocksource: Switched to clocksource riscv_clocksource
  5. 05 S

    BLAKE2s entropy seed · the long quiet

    CRNG silent phase

    cycles 18,426,917 → 129,500,000 · Δ 111,073,083 cycles · 83.8% of run

    ~111 million cycles, no UART. Linux is hashing whatever entropy it can scrape from boot timer jitter into the CRNG state by repeatedly running BLAKE2s. Without a hardware RNG (we don't have one yet) this is unavoidable. P62 measured this loop at 25.8% of post-load PC samples; with Zbb hardware rotates, each rotate in the compression rounds dropped from ~3 instructions to 1.

    hot: blake2s_compress_generic · memset · memcpy · create_pgd_mapping

    silent on UART — kernel is computing, not printing.

  6. 06 S

    kernel_init

    init thread + /init exec

    cycles 129,500,000 → 131,241,874 · Δ 1,741,874 cycles · 1.3% of run

    Switch to PID 1. The ELF loader maps the user binary into U-pages, sets up the AUXV / argv / envp on a fresh user stack, sret-s into U-mode at the binary's entry point. mstatus.SUM (P62) is what makes copy_to_user from kernel mode actually reach those U-pages.

    hot: kernel_init · ramdisk_execute_command · load_elf_binary · create_elf_tables
    printk: legacy console [hvc0] enabled
    clk: Disabling unused clocks
    Freeing unused kernel image (initmem) memory: 124K
    Kernel memory protection not selected by kernel config.
    Run /init as init process
      with arguments:
        /init
        earlyprintk
      with environment:
        HOME=/
        TERM=linux
  7. 07 U

    hello on a homemade chip

    userspace executes

    cycles 131,241,874 → 131,610,501 · Δ 368,627 cycles · 0.3% of run

    PID 1 in U-mode making real ecall syscalls. Each character of every line traverses U → S → M → UART. Six days of chip work, all cashed in on these 369,000 cycles.

    hot: _start (hello) · syscall3 · sys_write · sbi_console_putchar
    
    ===========================================
      hello from userspace on a homemade chip!
    ===========================================
    
      this binary:
        - cross-compiled for rv32ima/ilp32
        - statically linked, no libc
        - reaches the linux SYS_write syscall
        - is running as PID 1 from initramfs
    
      the chip below us:
        - 16 MiB DRAM, 25 MHz
        - rv32ima_zba, sv32 paging, m+s priv
        - SBI v0.1 firmware in stage-0
    
      exiting cleanly...
  8. 08 S

    init exited cleanly

    Kernel panic

    cycles 131,610,501 → 132,610,502 · Δ 1,000,001 cycles · 0.8% of run

    PID 1 calling exit(0) is a kernel-level "not allowed" — Linux duly panics. exitcode=0x00000000 confirms the user binary returned cleanly.

    hot: do_exit · panic
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
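
The two CSR values from phase 1 can be unpacked to check that they say what the prose says. The sketch below assumes the standard RV32 `satp` layout (MODE in bit 31, ASID in bits 30:22, PPN in bits 21:0) and the standard exception cause numbering from the RISC-V privileged spec; the helper names are illustrative and not taken from the stage-0 firmware.

```python
SATP = 0x80000010      # from the stage-0 log
MEDELEG = 0xB1FF       # from the phase 1 description
PT_BASE = 0x00010000   # from the stage-0 log

# Standard RISC-V synchronous exception cause numbers.
CAUSES = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "ecall from U-mode",
    9: "ecall from S-mode",
    11: "ecall from M-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def decode_satp_rv32(satp):
    """Split an RV32 satp value into (mode, asid, root page table address)."""
    mode = (satp >> 31) & 0x1       # 1 selects Sv32
    asid = (satp >> 22) & 0x1FF
    ppn = satp & 0x3FFFFF
    return mode, asid, ppn << 12    # PPN counts 4 KiB pages


def delegated_causes(medeleg):
    """Names of the exceptions whose medeleg bit is set."""
    return [name for bit, name in sorted(CAUSES.items()) if (medeleg >> bit) & 1]


mode, asid, root = decode_satp_rv32(SATP)
print(f"mode={mode} asid={asid} root={root:#010x}")
for name in delegated_causes(MEDELEG):
    print("delegated to S:", name)
```

The root address decodes to PT_BASE, and "ecall from S-mode" is the one standard cause left undelegated, which is what lets the M-mode handler keep serving SBI calls from the kernel.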

The 25.8% slice in phase 5 is what motivated P62’s Zbb rotates; the 32% slice in S_FETCH (spread evenly across phases 3–6) is what P63 attacks. The full F/X parallel pipeline coming in P64 will go after the same 32% from the opposite direction — by overlapping fetches with execute cycles instead of compressing each fetch.
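
For context on why rotates mattered in phase 5: without a rotate instruction, a 32-bit rotate right is a shift right, a shift left, and an OR, which matches the "~3 instructions" figure. The sketch below models that three-op form in Python and plugs it into the BLAKE2s G mixing function as defined in RFC 7693 (rotations by 16, 12, 8, 7). The function names are illustrative; this is not the kernel's `blake2s_compress_generic`.

```python
MASK32 = 0xFFFFFFFF


def rotr32_three_ops(x, n):
    """Rotate right by n bits using the shift/shift/or sequence."""
    right = (x & MASK32) >> n           # srli
    left = (x << (32 - n)) & MASK32     # slli
    return right | left                 # or


def blake2s_g(a, b, c, d, x, y, rotr=rotr32_three_ops):
    """One BLAKE2s G mix (RFC 7693): four rotates per call."""
    a = (a + b + x) & MASK32
    d = rotr(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr(b ^ c, 12)
    a = (a + b + y) & MASK32
    d = rotr(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr(b ^ c, 7)
    return a, b, c, d


# BLAKE2s runs 10 rounds of 8 G calls, each with 4 rotates.
ROTATES_PER_COMPRESSION = 10 * 8 * 4
# Assumption from the text: ~3 instructions per rotate before Zbb, 1 after.
INSTRUCTIONS_SAVED = ROTATES_PER_COMPRESSION * (3 - 1)
print(ROTATES_PER_COMPRESSION, INSTRUCTIONS_SAVED)
```

Under that assumption a single-instruction rotate removes on the order of 640 instructions per compression call, which is the kind of saving a 25.8% hot spot makes worthwhile.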

Measured

The full sim ran end-to-end. The fast path saved exactly one cycle per TLB-hit fetch, applied across every retired instruction:

metric                    P62            P63            Δ
post-load cycles          132,610,502    112,063,722    −20,546,780 (−15.5%)
S_FETCH cycles            42,437,741     21,809,761     −20,627,980 (−48.6%)
S_FETCH % of run          32.0%          19.5%
CPI                       6.24           5.27           −0.97
TLB LSU hit rate          94.2%          94.2%          unchanged
memory bus stall cycles   0              0              unchanged

S_FETCH was nearly cut in half (the half it kept is the walker / post-walker / non-translation paths, which P63 didn’t touch). CPI dropped 0.97 — almost exactly one cycle per instruction, since fetches happen once per retire.
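
The deltas in the table and the "one cycle per instruction" reading can be re-derived from the raw counts. One assumption is labeled in the code: the retired-instruction count is taken to equal P63's decode-cycle count (21,271,018, from the state breakdown further down), on the basis that each instruction spends one cycle in decode.

```python
P62_CYCLES = 132_610_502
P63_CYCLES = 112_063_722
P62_FETCH = 42_437_741
P63_FETCH = 21_809_761
# Assumption: one decode cycle per retired instruction, so the P63
# decode-cycle count stands in for the retired-instruction count.
P63_INSTRUCTIONS = 21_271_018


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


cycles_saved = P62_CYCLES - P63_CYCLES
fetch_saved = P62_FETCH - P63_FETCH

print("post-load cycles saved:", cycles_saved, pct(cycles_saved, P62_CYCLES))
print("S_FETCH cycles saved:  ", fetch_saved, pct(fetch_saved, P62_FETCH))
print("S_FETCH share P62/P63: ", pct(P62_FETCH, P62_CYCLES), pct(P63_FETCH, P63_CYCLES))
print("P63 CPI:               ", round(P63_CYCLES / P63_INSTRUCTIONS, 2))
print("fetch cycles saved per instruction:",
      round(fetch_saved / P63_INSTRUCTIONS, 2))
```

The last figure is the one that matters: the S_FETCH saving divided by the instruction count lands just under one cycle, which is what a one-cycle saving on most (not all) fetches should look like.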

Boot milestones, every one moving by roughly 16% (15.6% to 16.4%):

milestone                 P62 cycle      P63 cycle      Δ
Linux version             1,611,173      1,346,691      −16.4%
Switched to clocksource   18,426,917     15,530,502     −15.7%
Run /init                 130,054,364    109,755,383    −15.6%
userspace hello           131,241,874    110,746,765    −15.6%
kernel panic              131,610,501    111,063,721    −15.6%
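
The Δ column is just the relative change in the cycle at which each milestone's UART output lands. A minimal sketch of that calculation, using the cycle counts from the table (the dictionary layout and function name are illustrative):

```python
# (P62 cycle, P63 cycle) at which each milestone appears on the UART.
MILESTONES = {
    "Linux version": (1_611_173, 1_346_691),
    "Switched to clocksource": (18_426_917, 15_530_502),
    "Run /init": (130_054_364, 109_755_383),
    "userspace hello": (131_241_874, 110_746_765),
    "kernel panic": (131_610_501, 111_063_721),
}


def percent_change(before, after):
    """Relative change from before to after, in percent, one decimal place."""
    return round(100 * (after - before) / before, 1)


deltas = {name: percent_change(p62, p63) for name, (p62, p63) in MILESTONES.items()}
for name, delta in deltas.items():
    print(f"{name:<24} {delta:+.1f}%")
```

The earliest milestone moves slightly more than the rest.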

P62 → P63 progression

The benchmark harness’s whole point: every chip rev gets a new column of charts so the work is visible.

[chart: milestone progression · 2 runs (P62 Zbb rotates + U-mode, P63 fetch fast-path) · scale 0 → 131.6M cycles · per-milestone values and Δ as in the boot milestones table above]
[chart: CPI compare · 2 runs · P62 Zbb rotates + U-mode 6.24 (baseline, worst) · P63 fetch fast-path 5.27 (best, −0.97 vs P62)]

CPI: cycles per retired instruction; lower is better. Each bar's length is its CPI as a fraction of the worst run, so a bar half as long means half the cycles on the same workload.

[chart: state distribution · 2 runs · scale % of post-load cycles · P62 Zbb rotates + U-mode (132,610,502 cycles) · P63 fetch fast-path (112,063,722 cycles)]

The state-comparison chart shows it cleanly. P63’s S_FETCH stripe is half the height of P62’s; everything else stays the same width because nothing else changed.

Where the chip is spending its cycles now

state breakdown · P63 fetch fast-path · 112,063,722 cycles · CPI 5.27

state        share    cycles
fetch        19.5%    21,809,761
decode       19%      21,271,018
execute      19%      21,271,018
mem          18.9%    21,127,815
walker       2.1%     2,376,234
writeback    19%      21,263,802
mul/div      2.6%     2,944,074

walker activity · P63 fetch fast-path · flushes 6,155 · L1 hits 9,392 · L0 hits 1,183,421

translations      total         walks      TLB hit
fetch             538,742       538,742    0%
load/store/amo    11,227,912    654,071    94.2%

hot functions · P63 fetch fast-path · 109,438 samples · one sample every 1,024 cycles

 #   function                    where     share    samples
 1   blake2s_compress_generic    kernel    16.3%    17,878
 2   memset                      kernel    7.7%     8,410
 3   memcpy                      kernel    7.4%     8,096
 4   format_decode               kernel    5.6%     6,095
 5   vsnprintf                   kernel    3.7%     4,019
 6   number                      kernel    2.6%     2,801
 7   memcmp                      kernel    2.5%     2,689
 8   vruntime_eligible           kernel    2.3%     2,480
 9   __slab_alloc_node.isra.0    kernel    1.8%     1,931
10   strlen                      kernel    1.6%     1,722
11   avg_vruntime                kernel    1.4%     1,480
12   string_nocheck              kernel    1.2%     1,367
13   chacha_permute              kernel    1.2%     1,282
14   add_uevent_var              kernel    0.9%     1,007
15   fdt32_ld                    kernel    0.9%     988
16   (remaining)                           43.1%    47,193

Instrumentation note: the testbench’s per-cycle TLB fetch-hit counter looked for fetch_xlated_q rising — a P61 hold-over that P63’s fast path bypasses (we go straight from S_FETCH to S_DECODE without setting that bit). The walker stats and total cycles are correct; the per-event TLB-hit counter reads 0 in this run. Will fix when P64 lands.
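
Until the counter is fixed, the fetch hit rate can be roughly estimated from the numbers that are still trustworthy. The sketch below rests on two assumptions that are not from the run itself: that the 538,742 fetch walks are counted correctly (the note above says walker stats are), and that there is about one translated fetch per retired instruction, with the decode-cycle count (21,271,018) standing in for retired instructions. M-mode fetches are untranslated, so this slightly overstates the denominator.

```python
FETCH_WALKS = 538_742
LSU_WALKS = 654_071
LSU_TRANSLATIONS = 11_227_912
# Assumption: about one translated fetch per retired instruction, with the
# P63 decode-cycle count standing in for the retired-instruction count.
APPROX_FETCHES = 21_271_018


def hit_rate(walks, translations):
    """TLB hit rate in percent, treating every walk as one miss."""
    return round(100 * (1 - walks / translations), 1)


# Sanity check: the same formula should reproduce the reported LSU figure.
lsu_hit = hit_rate(LSU_WALKS, LSU_TRANSLATIONS)
fetch_hit_estimate = hit_rate(FETCH_WALKS, APPROX_FETCHES)
print("LSU hit rate:            ", lsu_hit)
print("estimated fetch hit rate:", fetch_hit_estimate)
```

The LSU check matches the reported 94.2%, and the fetch estimate comes out close to the 97.46% shown on the P62 boot timeline, which is at least consistent with the fast path not changing TLB behavior.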

What’s next

  • P64 — full 2-stage pipeline with a separate F-FSM running concurrently with the X side. Walker arbitration, fault staging, branch flush propagation. The expected win is another ~25% on top of P63 — fetch fully hides under execute for compute-bound code.
  • P65 — 3-stage (decode separated).
  • P66 — classic MIPS-style 5-stage pipeline with forwarding.
  • P67 — synthesis baseline; finally measure fmax / area on sky130 so the speed planning has hardware truth in it, not just CPI math.

Harden

NOT RUN. Driving mem_valid from a TLB lookup adds the TLB priority encoder into the address-generation path — that’s the most likely place for an fmax cost. Will measure post-P66.

What just happened?

We noticed that the chip’s per-fetch overhead was an artifact of how fetch_xlated_q was wired sequentially, not a fundamental constraint. The TLB lookup is already combinational (P61). Letting the bus see that result directly saves a cycle for free, on every fetch, for the rest of the boot.