4-entry TLB and the chip-profiling harness

P59 made Linux boot by adding instruction-fetch translation. P60 proved userspace works. P61 closes the obvious profiling hole P60 exposed and adds the measurement substrate every later rung will build on.

What’s in the chip

A unified TLB sitting in front of the existing P59 walker:

4 entries, fully associative, round-robin replacement.
Each entry caches one leaf PTE — either a 4 KiB page (L0 leaf) or a 4 MiB megapage (L1 leaf). Per entry: valid, megapage, vpn[19:0], ppn[19:0], and the R/W/X/U/A/D bits.
Lookup is purely combinational and runs in parallel for the LSU (querying alu_y) and fetch (querying pc).
On a hit: skip the walker entirely. Permission mismatches fault directly without a wasted walk.
On a miss: the existing P59 walker fires. The leaf-success paths fill the next round-robin slot.
sfence.vma flushes everything.
Any write to satp also flushes. Without that, the very last fetch before a csrw satp re-walks under the old satp and refills the TLB with stale identity translations, which then mask the page fault Linux’s relocate_enable_mmu uses to redirect PC into virtual addressing. The privileged spec says invalidate-on-satp-write is implementation-defined; Linux works either way only because of an sfence.vma it emits between the two satp swaps. We pick the invalidate-on-write semantics so we don’t depend on that fence happening.
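The rules in the list above can be sketched as a small behavioural model. This is a Python illustration, not the RTL in src/top.sv: the names TlbEntry, Tlb, lookup, translate, fill and flush are hypothetical, permission checks are left out, and it assumes a megapage entry matches on vpn[19:10] and supplies ppn[19:10] as the upper physical bits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TlbEntry:
    valid: bool = False
    megapage: bool = False   # True: 4 MiB L1 leaf, False: 4 KiB L0 leaf
    vpn: int = 0             # vpn[19:0]
    ppn: int = 0             # ppn[19:0]
    perms: str = ""          # subset of "RWXUAD" (not checked in this sketch)


class Tlb:
    """Hypothetical model: fully associative, round-robin replacement."""

    def __init__(self, n_entries: int = 4) -> None:
        self.entries = [TlbEntry() for _ in range(n_entries)]
        self.next_slot = 0   # round-robin fill pointer

    def lookup(self, vaddr: int) -> Optional[TlbEntry]:
        vpn = (vaddr >> 12) & 0xFFFFF
        for e in self.entries:
            if not e.valid:
                continue
            # Assumption: a megapage matches on vpn[19:10] only,
            # a 4 KiB page needs all 20 VPN bits to match.
            if e.megapage and (e.vpn >> 10) == (vpn >> 10):
                return e
            if not e.megapage and e.vpn == vpn:
                return e
        return None

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address on a hit, None on a miss."""
        e = self.lookup(vaddr)
        if e is None:
            return None          # miss: the walker would fire here
        if e.megapage:
            return ((e.ppn >> 10) << 22) | (vaddr & 0x3FFFFF)
        return (e.ppn << 12) | (vaddr & 0xFFF)

    def fill(self, vaddr: int, ppn: int, megapage: bool, perms: str) -> None:
        """Called on a successful leaf walk: overwrite the next slot."""
        vpn = (vaddr >> 12) & 0xFFFFF
        self.entries[self.next_slot] = TlbEntry(True, megapage, vpn, ppn, perms)
        self.next_slot = (self.next_slot + 1) % len(self.entries)

    def flush(self) -> None:
        """sfence.vma, or any write to satp: invalidate every entry."""
        for e in self.entries:
            e.valid = False
```

In this model a fifth fill overwrites the entry written by the first, and a single megapage entry covers every address in its 4 MiB region, which is why a handful of entries can go a long way while the kernel runs out of megapage mappings.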

What’s in the testbench

P61 also extends the testbench with comparable profiling output, so we can chart progression as we go through P62/P63/P64 pipelining and beyond. With +profile:

Per-state cycle histogram (S_FETCH, S_DECODE, S_EXECUTE, S_MEM, S_WB, the four walker states + the two A/D writeback states, and S_HALT).
Walker-activity counters (walks split by op kind, megapage vs 4 KiB-page leaf hits, A/D writebacks).
TLB hit/miss counters, separately for instruction fetch and load/store, plus an sfence.vma flush counter.
Memory-bus handshake / stall accounting.
Boot milestones — cycle of first occurrence of Linux version, Switched to clocksource, Run /init, the userspace homemade chip banner, and the Attempted to kill init panic.
A PC sample every 1024 post-load cycles, written to pc_samples.txt. The samples are post-processed against the kernel’s nm output into a hot-function histogram.

At end of sim the testbench writes a stable benchmark.json. scripts/render_charts.py stages that plus a top-N hot_functions.json (built from pc_samples.txt against the kernel’s nm output) into site/src/data/charts/PROJECT/. The site’s Astro chart components — <StateBreakdown>, <WalkerBreakdown>, <HotFunctions>, plus the cross-project <MilestoneCompare> / <CpiCompare> / <StateCompare> — import that JSON at build time and render inline SVG in the page’s editorial design language. No raw chart SVGs ever ship.
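The PC-sample post-processing can be sketched roughly as follows. This is not the code behind make profile-decode or scripts/render_charts.py. The function names parse_nm and hot_functions are hypothetical, and it assumes nm lines have the usual "address type name" shape and that the PC samples have already been read into a list of integers.

```python
import bisect
from collections import Counter


def parse_nm(nm_text):
    """Turn nm lines like 'c0000100 T memset' into sorted (addr, name) pairs.

    Only text symbols (type 't' or 'T') are kept, since PCs land in code.
    """
    syms = []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("t", "T"):
            syms.append((int(parts[0], 16), parts[2]))
    syms.sort()
    return syms


def hot_functions(pcs, syms, top_n=15):
    """Attribute each sampled PC to the nearest symbol at or below it.

    Returns (name, sample_count, percent_of_samples) rows, hottest first.
    """
    addrs = [addr for addr, _ in syms]
    counts = Counter()
    for pc in pcs:
        i = bisect.bisect_right(addrs, pc) - 1
        name = syms[i][1] if i >= 0 else "(unknown)"
        counts[name] += 1
    total = sum(counts.values())
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common(top_n)]
```

The binary search keeps the lookup cheap even with tens of thousands of samples against a full kernel symbol table. A sample that lands below the first text symbol is counted as "(unknown)" rather than dropped, so the percentages still sum over every sample.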

Where the chip is spending its cycles

PC samples from a P61 run, classified against the kernel’s symbol table. Each bar is a function; the percentage is the fraction of post-load PC samples that landed inside it.

Hot functions: P61 + 4-entry TLB (partial), 43,141 samples, one sample every 1,024 cycles. All named functions are kernel symbols.

function	share of samples	samples
blake2s_compress_generic	24.7%	10,661
memset	6.7%	2,898
memcpy	5.3%	2,283
format_decode	3.0%	1,310
fdt32_ld	2.5%	1,097
strlen	2.5%	1,074
fdt_offset_ptr	2.1%	902
create_pgd_mapping	2.1%	889
vsnprintf	1.9%	826
number	1.5%	651
__set_fixmap	1.5%	648
__slab_alloc_node.isra.0	1.4%	597
memmap_init_range	1.3%	553
fdt_next_tag	1.1%	496
__div64_32	1.0%	425
(remaining)	41.3%	17,831

BLAKE2s dominates because the kernel’s CRNG initialisation is hashing its entropy pool, which explains the long silent phase between the last printk and the userspace handoff. memset / memcpy / fdt32_ld / fdt_offset_ptr / strlen / vsnprintf / number / format_decode together account for another 25.6% of samples (11,041); all are small functions called millions of times during DT parse, slab init, and printk formatting. Every one of these is fetch-bound or load-bound, and both costs become near-free with a pipelined fetch plus a TLB hit.

What’s next

P62 → P63 → P64 take the chip from a multi-cycle FSM to a classic 5-stage pipeline:

#	shape	cuts
62	F + EX/MEM/WB	branch flush, load-use stall
63	F + DEC/EX + MEM/WB	adds RAW + load-use, JAL/JALR redirect
64	F + D + EX + MEM + WB	classic 5-stage MIPS-style with full forwarding

Every rung goes through the same make profile && make charts path P61 introduces, so the cross-project comparison gets a new column each time and the bar charts grow honestly.

Files

src/top.sv — chip RTL with TLB
test/tb_freertos_demo.sv — testbench with TLB / milestone / benchmark.json instrumentation
test/Makefile — profile, profile-decode, charts targets
app/, runtime/, boot/, userspace/ — unchanged from P60

How to reproduce

# from the repo root, ensure kernel + initramfs are built per
# the P60 page.

cd projects/61_tlb/test
make profile KERNEL_IMAGE=/path/to/Image
# emits benchmark.json + pc_samples.txt

make profile-decode  # top-N hot kernel functions to stdout
make charts          # stages site/src/data/charts/61_tlb/*.json

Harden

NOT RUN. Adds 4 entries × ~40 bits of state and a small priority encoder. Expected fmax impact is small but unmeasured. Will quantify in the synthesis pass after P64.
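As a back-of-envelope check on that state estimate, the per-entry fields listed under "What’s in the chip" can be tallied. The sketch assumes every field is stored at full width, which the RTL may not do, so treat the result as an upper bound rather than a measured figure. The names ENTRY_FIELDS, bits_per_entry and total_bits are hypothetical.

```python
# Field widths taken from the per-entry list: valid, megapage,
# vpn[19:0], ppn[19:0], and the six R/W/X/U/A/D bits.
ENTRY_FIELDS = {
    "valid": 1,
    "megapage": 1,
    "vpn": 20,
    "ppn": 20,
    "rwxuad": 6,
}

N_ENTRIES = 4

bits_per_entry = sum(ENTRY_FIELDS.values())
total_bits = N_ENTRIES * bits_per_entry

print(f"{bits_per_entry} bits per entry, {total_bits} bits of TLB state")
```

Stored in full, that is 48 bits per entry and 192 bits overall, slightly above the rough per-entry figure but still a small amount of flop state.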

What just happened?

A simple, real chip optimisation. The walker already existed; it just got asked the same question 50,000 times in a row. So cache the answer. Now we can also see and chart that, and every later rung will plug into the same harness so we can show the progression honestly instead of with vibes.