P59 made Linux boot by adding instruction-fetch translation. P60 proved userspace works. P61 closes the obvious profiling hole P60 exposed and adds the measurement substrate every later rung will build on.
What’s in the chip
A unified TLB sitting in front of the existing P59 walker:
- 4 entries, fully-associative, round-robin replacement.
- Each entry caches one leaf PTE: either a 4 KiB page (L0 leaf) or a 4 MiB megapage (L1 leaf). Per entry: valid, megapage, vpn[19:0], ppn[19:0], and the R/W/X/U/A/D bits.
- Lookup is purely combinational and runs in parallel for the LSU (querying alu_y) and fetch (querying pc).
- On a hit: skip the walker entirely. Permission mismatches fault directly without a wasted walk.
- On a miss: the existing P59 walker fires. The leaf-success paths fill the next round-robin slot.
- sfence.vma flushes everything.
- Any write to satp also flushes. Without that, the very last fetch before a csrw satp re-walks under the old satp and refills the TLB with stale identity translations, which then mask the page fault Linux's relocate_enable_mmu uses to redirect PC into virtual addressing. The privileged spec says invalidate-on-satp-write is implementation-defined; Linux works either way only because of an sfence.vma it emits between the two satp swaps. We pick the invalidate-on-write semantics so we don't depend on that fence happening.
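The behaviour in the list above can be sketched as a small software model. This is a minimal sketch, not the RTL in src/top.sv: the class and method names (Tlb, TlbEntry, access, fill, flush) are hypothetical, the permission check ignores U-bit, privilege-mode, and A/D handling, and it assumes the round-robin pointer is left alone on a flush, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PAGE_BITS = 12      # 4 KiB page offset
MEGA_BITS = 22      # 4 MiB megapage offset
VPN_MASK = 0xFFFFF  # vpn[19:0]


@dataclass
class TlbEntry:
    valid: bool = False
    megapage: bool = False
    vpn: int = 0     # vpn[19:0]
    ppn: int = 0     # ppn[19:0]
    perms: str = ""  # any of "RWXUAD"


class Tlb:
    """Hypothetical model: fully associative, round-robin replacement."""

    def __init__(self, n_entries: int = 4) -> None:
        self.entries: List[TlbEntry] = [TlbEntry() for _ in range(n_entries)]
        self.next_slot = 0  # round-robin fill pointer

    def lookup(self, vaddr: int) -> Optional[TlbEntry]:
        vpn = (vaddr >> PAGE_BITS) & VPN_MASK
        for e in self.entries:
            # A megapage entry compares only the upper 10 VPN bits.
            mask = 0xFFC00 if e.megapage else VPN_MASK
            if e.valid and (e.vpn & mask) == (vpn & mask):
                return e
        return None

    def access(self, vaddr: int, kind: str) -> Tuple[str, Optional[int]]:
        """kind is "R", "W" or "X". Returns (status, physical address)."""
        e = self.lookup(vaddr)
        if e is None:
            return ("miss", None)   # caller would start the page-table walk
        if kind not in e.perms:
            return ("fault", None)  # fault straight from the TLB, no walk
        off_bits = MEGA_BITS if e.megapage else PAGE_BITS
        base = (e.ppn << PAGE_BITS) >> off_bits << off_bits
        return ("hit", base | (vaddr & ((1 << off_bits) - 1)))

    def fill(self, vaddr: int, ppn: int, megapage: bool, perms: str) -> None:
        """Called on a successful leaf walk; overwrites the next slot."""
        vpn = (vaddr >> PAGE_BITS) & VPN_MASK
        self.entries[self.next_slot] = TlbEntry(True, megapage, vpn, ppn, perms)
        self.next_slot = (self.next_slot + 1) % len(self.entries)

    def flush(self) -> None:
        """Models both sfence.vma and any write to satp."""
        for e in self.entries:
            e.valid = False


# Usage: an identity-mapped megapage hits, then disappears after a flush.
tlb = Tlb()
tlb.fill(0x80000000, ppn=0x80000, megapage=True, perms="RWXAD")
assert tlb.access(0x80123456, "X") == ("hit", 0x80123456)
tlb.flush()
assert tlb.access(0x80123456, "X") == ("miss", None)
```

Routing satp writes through the same flush as sfence.vma is what keeps a stale identity entry from answering the first fetch after the switch.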
What’s in the testbench
P61 also extends the testbench with comparable profiling output,
so we can chart progression as we go through P62/P63/P64
pipelining and beyond. With +profile:
- Per-state cycle histogram (S_FETCH, S_DECODE, S_EXECUTE, S_MEM, S_WB, the four walker states + the two A/D writeback states, and S_HALT).
- Walker-activity counters (walks split by op kind, megapage vs 4 KiB-page leaf hits, A/D writebacks).
- TLB hit/miss counters, separately for instruction fetch and load/store, plus an sfence.vma flush counter.
- Memory-bus handshake / stall accounting.
- Boot milestones: cycle of first occurrence of "Linux version", "Switched to clocksource", "Run /init", the userspace "homemade chip" banner, and the "Attempted to kill init" panic.
- A PC sample every 1024 post-load cycles, written to pc_samples.txt. Post-processed against the kernel's nm output to a hot-function histogram.
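The PC-sample post-processing can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: it assumes nm's default three-column text output (address, type, name), treats each sample as belonging to the nearest text symbol at or below the PC, and uses hypothetical function names (parse_nm, hot_functions) and made-up addresses. The on-disk format of pc_samples.txt is not described here, so the sketch takes PCs as plain integers.

```python
import bisect
from collections import Counter
from typing import Iterable, List, Tuple


def parse_nm(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Parse nm lines of the form "<hex addr> <type> <name>".

    Returns text symbols only, sorted by address.
    """
    syms = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # undefined symbols have no address column
        addr, kind, name = parts
        if kind in ("t", "T"):  # code symbols only
            syms.append((int(addr, 16), name))
    syms.sort()
    return syms


def hot_functions(pcs: Iterable[int], syms: List[Tuple[int, str]],
                  top_n: int = 15) -> List[Tuple[str, int, float]]:
    """Return (name, samples, percent of all samples) for the top_n symbols."""
    addrs = [addr for addr, _ in syms]
    counts: Counter = Counter()
    for pc in pcs:
        # Index of the last symbol whose address is <= pc.
        i = bisect.bisect_right(addrs, pc) - 1
        counts[syms[i][1] if i >= 0 else "<unknown>"] += 1
    total = sum(counts.values())
    return [(name, n, 100.0 * n / total)
            for name, n in counts.most_common(top_n)]


# Usage with made-up addresses; memset and strlen are only example names.
nm_lines = [
    "c0001000 T memset",
    "c0001080 T strlen",
    "c0002000 D some_data",
]
samples = [0xC0001004, 0xC0001010, 0xC0001084]
for name, n, pct in hot_functions(samples, parse_nm(nm_lines)):
    print(f"{name}: {n} samples ({pct:.1f}%)")
```

A sorted address list plus binary search keeps the classification at O(log n) per sample, which matters little at tens of thousands of samples but scales if the sampling interval shrinks.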
At end of sim the testbench writes a stable benchmark.json.
scripts/render_charts.py stages that plus a top-N
hot_functions.json (built from pc_samples.txt against the
kernel’s nm output) into site/src/data/charts/PROJECT/. The
site’s Astro chart components — <StateBreakdown>,
<WalkerBreakdown>, <HotFunctions>, plus the cross-project
<MilestoneCompare> / <CpiCompare> / <StateCompare> —
import that JSON at build time and render inline SVG in the
page’s editorial design language. No raw chart SVGs ever ship.
Where the chip is spending its cycles
PC samples from a P61 run, classified against the kernel’s symbol table. Each bar is a function; the percentage is the fraction of post-load PC samples that landed inside it.
- 24.7% of samples (10,661 samples)
- 6.7% of samples (2,898 samples)
- 5.3% of samples (2,283 samples)
- 3% of samples (1,310 samples)
- 2.5% of samples (1,097 samples)
- 2.5% of samples (1,074 samples)
- 2.1% of samples (902 samples)
- 2.1% of samples (889 samples)
- 1.9% of samples (826 samples)
- 1.5% of samples (651 samples)
- 1.5% of samples (648 samples)
- 1.4% of samples (597 samples)
- 1.3% of samples (553 samples)
- 1.1% of samples (496 samples)
- 1% of samples (425 samples)
- 41.3% of samples (17,831 samples)
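As a sanity check on the bars above, the percentages can be recomputed from the raw sample counts. This sketch assumes the 16 bars together cover every post-load sample, which the numbers are consistent with; the variable names are illustrative.

```python
# Sample counts copied from the 16 bars above, in order.
counts = [10661, 2898, 2283, 1310, 1097, 1074, 902, 889,
          826, 651, 648, 597, 553, 496, 425, 17831]

total = sum(counts)  # total post-load PC samples
shares = [round(100 * c / total, 1) for c in counts]

print(total)       # 43141
print(shares[0])   # 24.7
print(shares[-1])  # 41.3
```

At one sample per 1024 cycles, 43,141 samples correspond to roughly 44.2 million post-load cycles.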
BLAKE2s dominates because the kernel's CRNG initialisation is
hashing its entropy pool, which explains the long silent phase
between the last printk and the userspace handoff. memset /
memcpy / fdt32_ld / fdt_offset_ptr / strlen / vsnprintf /
number / format_decode together account for most of the rest:
all small leaf functions called millions of times during DT
parse, slab init, and printk formatting. Every one of these is
fetch-bound or load-bound, and a pipelined fetch plus a TLB hit
make both near-free.
What’s next
P62 → P63 → P64 take the chip from a multi-cycle FSM to a classic 5-stage pipeline:
| # | shape | cuts |
|---|---|---|
| 62 | F + EX/MEM/WB | branch flush, load-use stall |
| 63 | F + DEC/EX + MEM/WB | adds RAW + load-use, JAL/JALR redirect |
| 64 | F + D + EX + MEM + WB | classic 5-stage MIPS-style with full forwarding |
Every rung goes through the same make profile && make charts
path P61 introduces, so the cross-project comparison gets a
new column each time and the bar charts grow honestly.
Files
- src/top.sv: chip RTL with TLB
- test/tb_freertos_demo.sv: testbench with TLB / milestone / benchmark.json instrumentation
- test/Makefile: profile, profile-decode, charts targets
- app/, runtime/, boot/, userspace/: unchanged from P60
How to reproduce
# from the repo root, ensure kernel + initramfs are built per
# the P60 page.
cd projects/61_tlb/test
make profile KERNEL_IMAGE=/path/to/Image
# emits benchmark.json + pc_samples.txt
make profile-decode # top-N hot kernel functions to stdout
make charts # stages site/src/data/charts/61_tlb/*.json
Harden
NOT RUN. Adds 4 entries × ~48 bits of state (valid, megapage,
20-bit vpn, 20-bit ppn, and the six R/W/X/U/A/D bits) and a small
priority encoder. Expected fmax impact is small but unmeasured.
Will quantify in the synthesis pass after P64.
What just happened?
A simple, real chip optimisation. The walker existed; it just got asked the same question 50,000 times in a row. Cache the answer. Now we can also see and chart that, and every later rung will plug into the same harness so we can show the progression honestly instead of with vibes.