No. 61 / project of 147 on the ladder

4-entry TLB and the chip-profiling harness

introduces — a unified Sv32 TLB sitting in front of the existing P59 walker; chip-side profiling, benchmark.json emission, and the per-rung chart pack

harden state · last run 2026-05-03
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P59 made Linux boot by adding instruction-fetch translation. P60 proved userspace works. P61 closes the obvious profiling hole P60 exposed and adds the measurement substrate every later rung will build on.

What’s in the chip

A unified TLB sitting in front of the existing P59 walker:

  • 4 entries, fully-associative, round-robin replacement.
  • Each entry caches one leaf PTE — either a 4 KiB page (L0 leaf) or a 4 MiB megapage (L1 leaf). Per entry: valid, megapage, vpn[19:0], ppn[19:0], and the R/W/X/U/A/D bits.
  • Lookup is purely combinational and runs in parallel for the LSU (querying alu_y) and fetch (querying pc).
  • On a hit: skip the walker entirely. Permission mismatches fault directly without a wasted walk.
  • On a miss: the existing P59 walker fires. The leaf-success paths fill the next round-robin slot.
  • sfence.vma flushes everything.
  • Any write to satp also flushes. Without that, the very last fetch before a csrw satp re-walks under the old satp and refills the TLB with stale identity translations, which then mask the page fault Linux’s relocate_enable_mmu uses to redirect PC into virtual addressing. The privileged spec says invalidate-on-satp-write is implementation-defined; Linux works either way only because of an sfence.vma it emits between the two satp swaps. We pick the invalidate-on-write semantics so we don’t depend on that fence happening.
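The behaviour in the list above can be sketched as a small software model. This is an illustrative Python model, not the RTL in src/top.sv: the class and method names, the string encoding of the permission bits, and the simplified permission check (no U/SUM/MXR or A/D handling) are all assumptions made for the sketch. Whether a flush also resets the round-robin pointer is not stated on this page, so the model leaves it alone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PAGE_SHIFT = 12    # 4 KiB page offset bits
MEGA_SHIFT = 22    # 4 MiB megapage offset bits
VPN_MASK = 0xFFFFF  # vpn[19:0]


@dataclass
class Entry:
    valid: bool = False
    megapage: bool = False
    vpn: int = 0      # vpn[19:0]
    ppn: int = 0      # ppn[19:0]
    perms: str = ""   # hypothetical encoding: subset of "RWXUAD"


class Tlb:
    """Fully-associative TLB with round-robin replacement (model only)."""

    def __init__(self, n_entries: int = 4) -> None:
        self.entries: List[Entry] = [Entry() for _ in range(n_entries)]
        self.next_slot = 0  # round-robin fill pointer

    def lookup(self, vaddr: int) -> Optional[Tuple[int, str]]:
        """Return (paddr, perms) on a hit, None on a miss."""
        vpn = (vaddr >> PAGE_SHIFT) & VPN_MASK
        for e in self.entries:
            if not e.valid:
                continue
            if e.megapage and (e.vpn >> 10) == (vpn >> 10):
                # Megapage: only the top 10 VPN bits take part in the match;
                # the low 22 address bits pass through untranslated.
                base = (e.ppn >> 10) << MEGA_SHIFT
                return base | (vaddr & ((1 << MEGA_SHIFT) - 1)), e.perms
            if not e.megapage and e.vpn == vpn:
                return (e.ppn << PAGE_SHIFT) | (vaddr & 0xFFF), e.perms
        return None

    def access(self, vaddr: int, need: str) -> Tuple[str, Optional[int]]:
        """need is one of "R", "W", "X". Returns ("hit"|"fault"|"miss", paddr)."""
        hit = self.lookup(vaddr)
        if hit is None:
            return "miss", None    # the walker would fire here
        paddr, perms = hit
        if need not in perms:
            return "fault", None   # permission mismatch faults without a walk
        return "hit", paddr

    def fill(self, vaddr: int, ppn: int, megapage: bool, perms: str) -> None:
        """Called on a successful leaf walk; overwrites the next slot."""
        vpn = (vaddr >> PAGE_SHIFT) & VPN_MASK
        self.entries[self.next_slot] = Entry(True, megapage, vpn, ppn, perms)
        self.next_slot = (self.next_slot + 1) % len(self.entries)

    def flush(self) -> None:
        """sfence.vma or any satp write: invalidate every entry."""
        for e in self.entries:
            e.valid = False
```

In this model a fifth fill overwrites the oldest entry, and a flush turns every later access into a miss until the walker refills the entries.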

What’s in the testbench

P61 also extends the testbench with comparable profiling output, so we can chart progression as we go through P62/P63/P64 pipelining and beyond. With +profile:

  • Per-state cycle histogram (S_FETCH, S_DECODE, S_EXECUTE, S_MEM, S_WB, the four walker states + the two A/D writeback states, and S_HALT).
  • Walker-activity counters (walks split by op kind, megapage vs 4 KiB-page leaf hits, A/D writebacks).
  • TLB hit/miss counters, separately for instruction fetch and load/store, plus an sfence.vma flush counter.
  • Memory-bus handshake / stall accounting.
  • Boot milestones: the cycle at which each of "Linux version", "Switched to clocksource", "Run /init", the userspace homemade chip banner, and the "Attempted to kill init" panic first appears.
  • A PC sample every 1024 post-load cycles, written to pc_samples.txt and post-processed against the kernel's nm output into a hot-function histogram.
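The PC-sample post-processing in the last bullet can be sketched as follows. This is not the project's actual profile-decode script. It assumes the samples are already parsed into integers, since the on-disk format of pc_samples.txt is not described here, and it relies only on the standard three-column nm layout (address, type, name). The function names are hypothetical.

```python
import bisect
from collections import Counter
from typing import Iterable, List, Tuple


def parse_nm(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Keep text (code) symbols from `nm` output, sorted by address."""
    syms = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols, which carry no address
        addr, kind, name = parts
        if kind in ("t", "T"):
            syms.append((int(addr, 16), name))
    syms.sort()
    return syms


def classify(pcs: Iterable[int], syms: List[Tuple[int, str]]) -> Counter:
    """Attribute each sampled PC to the nearest symbol at or below it."""
    addrs = [a for a, _ in syms]
    hist: Counter = Counter()
    for pc in pcs:
        i = bisect.bisect_right(addrs, pc) - 1
        hist[syms[i][1] if i >= 0 else "(unknown)"] += 1
    return hist
```

Because plain nm output carries no symbol sizes, a PC past the end of the last function is still credited to that function. A stricter version would use `nm -S` sizes to bound each range.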

At end of sim the testbench writes a stable benchmark.json. scripts/render_charts.py stages that plus a top-N hot_functions.json (built from pc_samples.txt against the kernel’s nm output) into site/src/data/charts/PROJECT/. The site’s Astro chart components — <StateBreakdown>, <WalkerBreakdown>, <HotFunctions>, plus the cross-project <MilestoneCompare> / <CpiCompare> / <StateCompare> — import that JSON at build time and render inline SVG in the page’s editorial design language. No raw chart SVGs ever ship.

Where the chip is spending its cycles

PC samples from a P61 run, classified against the kernel’s symbol table. Each bar is a function; the percentage is the fraction of post-load PC samples that landed inside it.

hot functions: P61 + 4-entry TLB (partial); 43,141 samples, one every 1,024 cycles

   #  function                   where    share  samples
   1  blake2s_compress_generic   kernel   24.7%   10,661
   2  memset                     kernel    6.7%    2,898
   3  memcpy                     kernel    5.3%    2,283
   4  format_decode              kernel    3.0%    1,310
   5  fdt32_ld                   kernel    2.5%    1,097
   6  strlen                     kernel    2.5%    1,074
   7  fdt_offset_ptr             kernel    2.1%      902
   8  create_pgd_mapping         kernel    2.1%      889
   9  vsnprintf                  kernel    1.9%      826
  10  number                     kernel    1.5%      651
  11  __set_fixmap               kernel    1.5%      648
  12  __slab_alloc_node.isra.0   kernel    1.4%      597
  13  memmap_init_range          kernel    1.3%      553
  14  fdt_next_tag               kernel    1.1%      496
  15  __div64_32                 kernel    1.0%      425
      (remaining)                         41.3%   17,831
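The figures in the chart can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers printed above; the helper names are made up for illustration. Turning samples into cycles is approximate, because each sample stands in for one 1,024-cycle window.

```python
SAMPLE_PERIOD = 1024     # cycles between PC samples, from the chart header
TOTAL_SAMPLES = 43_141   # total post-load samples, from the chart header


def share_pct(samples: int, total: int = TOTAL_SAMPLES) -> float:
    """Percentage of all samples that landed in one function."""
    return 100.0 * samples / total


def approx_cycles(samples: int, period: int = SAMPLE_PERIOD) -> int:
    """Rough cycle count represented by a number of samples."""
    return samples * period


for name, n in [
    ("blake2s_compress_generic", 10_661),
    ("memset", 2_898),
    ("memcpy", 2_283),
]:
    print(f"{name:26s} {share_pct(n):5.1f}%  ~{approx_cycles(n):,} cycles")
```

By the same arithmetic the whole sampled window is about 44 million cycles, of which BLAKE2s accounts for roughly 11 million.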

BLAKE2s dominates because the kernel's CRNG initialisation is hashing its entropy pool, which explains the long silent phase between the last printk and the userspace handoff. memset / memcpy / fdt32_ld / fdt_offset_ptr / strlen / vsnprintf / number / format_decode together account for roughly another quarter of the samples (25.6%): all small leaf functions called millions of times during DT parse, slab init, and printk formatting. Every one of these is fetch-bound or load-bound, and both costs become near-free with a pipelined fetch plus a TLB hit.

What’s next

P62 → P63 → P64 take the chip from a multi-cycle FSM to a classic 5-stage pipeline:

#   shape                    cuts
62  F + EX/MEM/WB            branch flush, load-use stall
63  F + DEC/EX + MEM/WB      adds RAW + load-use, JAL/JALR redirect
64  F + D + EX + MEM + WB    classic 5-stage MIPS-style with full forwarding

Every rung goes through the same make profile && make charts path P61 introduces, so the cross-project comparison gets a new column each time and the bar charts grow honestly.

Files

  • src/top.sv — chip RTL with TLB
  • test/tb_freertos_demo.sv — testbench with TLB / milestone / benchmark.json instrumentation
  • test/Makefile — profile, profile-decode, charts targets
  • app/, runtime/, boot/, userspace/ — unchanged from P60

How to reproduce

# from the repo root, ensure kernel + initramfs are built per
# the P60 page.

cd projects/61_tlb/test
make profile KERNEL_IMAGE=/path/to/Image
# emits benchmark.json + pc_samples.txt

make profile-decode  # top-N hot kernel functions to stdout
make charts          # stages site/src/data/charts/61_tlb/*.json

Harden

NOT RUN. Adds 4 entries × ~40 bits of state and a small priority encoder. Expected fmax impact is small but unmeasured. Will quantify in the synthesis pass after P64.

What just happened?

A simple, real chip optimisation. The walker already existed; it just got asked the same question 50,000 times in a row. So cache the answer. Now we can also see and chart that, and every later rung will plug into the same harness so we can show the progression honestly instead of with vibes.