Branch predictor v0 · librelane-playground

P93 adds branch predictor hardware without letting it redirect fetch yet. The predictor watches retired control-flow instructions, updates a 32-entry BTB, trains 2-bit conditional counters, and maintains an 8-entry return-address stack.
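The shadow arrangement described above can be sketched as a small behavioural model. This is not the project's RTL. The table sizes come from the text, but the index hash, the per-slot counter layout, the reset value, the RAS overflow policy, and the `kind` labels are all assumptions made for illustration.

```python
from collections import Counter

BTB_ENTRIES = 32  # size stated in the text
RAS_ENTRIES = 8   # size stated in the text


class ShadowPredictor:
    """Behavioural model: predict, score, and train at retire; never steers fetch."""

    def __init__(self):
        # Assumed layout: one tag, one target, and one 2-bit counter per BTB slot.
        self.tag = [None] * BTB_ENTRIES
        self.target = [0] * BTB_ENTRIES
        self.counter = [1] * BTB_ENTRIES  # assumed reset value: weakly not-taken
        self.ras = []
        self.stats = Counter()

    @staticmethod
    def index(pc):
        # Assumed hash: drop the two alignment bits, keep the low five bits.
        return (pc >> 2) % BTB_ENTRIES

    def retire(self, pc, kind, taken, target, next_pc=None):
        """Score and train on one retired control-flow instruction.

        kind is 'cond', 'jump', 'call', or 'ret' (hypothetical labels).
        """
        i = self.index(pc)
        hit = self.tag[i] == pc
        self.stats[kind] += 1
        if kind == "ret":
            predicted = self.ras.pop() if self.ras else None
            self.stats["ras_hit"] += predicted is not None
            self.stats["ret_correct"] += predicted == target
            return
        if kind == "cond":
            predict_taken = self.counter[i] >= 2
            self.stats["dir_correct"] += predict_taken == taken
            # Saturating 2-bit counter update.
            step = 1 if taken else -1
            self.counter[i] = max(0, min(3, self.counter[i] + step))
        else:
            self.stats["btb_hit"] += hit
            self.stats["jump_correct"] += hit and self.target[i] == target
        if taken:
            self.tag[i], self.target[i] = pc, target
        if kind == "call":
            if len(self.ras) == RAS_ENTRIES:
                self.ras.pop(0)  # assumed overflow policy: drop the oldest entry
            self.ras.append(next_pc)
```

Because `retire` only reads and updates predictor state and counters, nothing in this model can change which instruction is fetched next. That is the property that makes the rung a pure measurement.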

This is a measurement rung. P92 showed that fall-through prefetch is not enough. P93 asks whether the shell workload has enough predictable control flow to justify a real predicted frontend PC.

Result

metric	P91 fill buffer	P92 fetch queue	P93 predictor
post-load cycles	221,327,811	222,624,131	221,863,586
shell window cycles	65,985,297	67,206,635	66,342,842
retired instructions	86,295,205	86,687,669	86,469,444
CPI	2.5648	2.5681	2.5658
memory stall cycles	84,488,165	60,050,776	59,886,452
fetch stall cycles	55,533,555	23,555,005	23,503,650
I-cache hits	8,855,599	42,665,352	42,594,442
fetch queue fills	0	53,982,463	53,869,769

P93 recovers part of the P92 regression but is still slower than P91:

comparison	result
shell window vs P92	-1.29%
post-load cycles vs P92	-0.34%
shell window vs P91	+0.54%
post-load cycles vs P91	+0.24%
fetch stalls vs P92	-0.22%
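The comparison percentages and the CPI row follow directly from the raw counts in the Result table. A minimal sketch of that arithmetic, with the counts copied from the table and the helper names chosen here for illustration:

```python
# Raw counts copied from the Result table.
post_load = {"P91": 221_327_811, "P92": 222_624_131, "P93": 221_863_586}
shell_window = {"P91": 65_985_297, "P92": 67_206_635, "P93": 66_342_842}
retired = {"P91": 86_295_205, "P92": 86_687_669, "P93": 86_469_444}
fetch_stalls = {"P92": 23_555_005, "P93": 23_503_650}


def pct_change(new, old):
    """Relative change of new versus old, in percent, rounded to 2 places."""
    return round(100.0 * (new - old) / old, 2)


def cpi(rung):
    """Post-load cycles per retired instruction, rounded to 4 places."""
    return round(post_load[rung] / retired[rung], 4)


print("shell window vs P92:", pct_change(shell_window["P93"], shell_window["P92"]))
print("post-load vs P92:   ", pct_change(post_load["P93"], post_load["P92"]))
print("shell window vs P91:", pct_change(shell_window["P93"], shell_window["P91"]))
print("post-load vs P91:   ", pct_change(post_load["P93"], post_load["P91"]))
print("fetch stalls vs P92:", pct_change(fetch_stalls["P93"], fetch_stalls["P92"]))
print("P93 CPI:            ", cpi("P93"))
```

Negative values mean fewer cycles than the baseline rung, so the two P91 comparisons are the ones that show a slowdown.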

Because the predictor is shadow-only, this is not a claimed branch prediction speedup. It is a functional result plus accuracy data.

Predictor Accuracy

counter	value
conditional branches	11,232,139
conditional taken	5,520,468
conditional predicted taken	3,548,651
conditional correct direction	8,079,678
conditional correct target	2,948,527
jumps/calls	3,443,350
jump BTB hits	1,584,431
jump correct target	1,549,542
returns	1,416,840
return RAS hits	1,370,828
return correct target	1,368,269

rate	value
conditional direction accuracy	71.93%
conditional taken rate	49.15%
taken-branch target accuracy	53.41%
jump BTB hit rate	46.01%
jump target accuracy	45.00%
return RAS hit rate	96.75%
return target accuracy	96.57%
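Each rate is a ratio of two counters from the first table. The sketch below reproduces them. The dictionary keys are names chosen here, and the choice of denominator for each rate is inferred from which ratio matches the published figure (for example, taken-branch target accuracy appears to use taken conditionals, not all conditionals).

```python
# Counter values copied from the accuracy table.
c = {
    "cond": 11_232_139,
    "cond_taken": 5_520_468,
    "cond_correct_dir": 8_079_678,
    "cond_correct_target": 2_948_527,
    "jumps": 3_443_350,
    "jump_btb_hits": 1_584_431,
    "jump_correct_target": 1_549_542,
    "returns": 1_416_840,
    "ras_hits": 1_370_828,
    "ret_correct_target": 1_368_269,
}


def pct(num, den):
    """Ratio as a percentage, rounded to 2 places."""
    return round(100.0 * num / den, 2)


rates = {
    "conditional direction accuracy": pct(c["cond_correct_dir"], c["cond"]),
    "conditional taken rate": pct(c["cond_taken"], c["cond"]),
    "taken-branch target accuracy": pct(c["cond_correct_target"], c["cond_taken"]),
    "jump BTB hit rate": pct(c["jump_btb_hits"], c["jumps"]),
    "jump target accuracy": pct(c["jump_correct_target"], c["jumps"]),
    "return RAS hit rate": pct(c["ras_hits"], c["returns"]),
    "return target accuracy": pct(c["ret_correct_target"], c["returns"]),
}

for name, value in rates.items():
    print(f"{name}: {value:.2f}%")

# Hits that still produced the wrong target.
stale_btb_targets = c["jump_btb_hits"] - c["jump_correct_target"]
wrong_ras_targets = c["ras_hits"] - c["ret_correct_target"]
```

The last two differences separate "no entry found" from "entry found but wrong", which is useful when deciding whether capacity or staleness is the bigger problem.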

The return-address stack is the clear win: 96.57% of the 1,416,840 returns get the correct target. The direct-mapped BTB is useful but hits on only 46.01% of jumps and calls, so it is either too small or too aliased for general jumps.
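To illustrate what "too aliased" means for a direct-mapped table, the sketch below runs two synthetic jump streams through a 32-entry tag array. The index function and the PC values are assumptions for illustration and are not taken from the design.

```python
BTB_ENTRIES = 32  # size stated in the text


def btb_hit_rate(jump_pcs, entries=BTB_ENTRIES):
    """Fraction of lookups that find their own PC in a direct-mapped tag array."""
    tags = [None] * entries
    hits = 0
    for pc in jump_pcs:
        slot = (pc >> 2) % entries  # assumed index: low bits of the word address
        hits += tags[slot] == pc
        tags[slot] = pc             # every jump overwrites its slot
    return hits / len(jump_pcs)


# Two jump sites 0x80 bytes apart share a slot and keep evicting each other.
aliased = [0x1000, 0x1080] * 50
# Two adjacent jump sites land in different slots and coexist.
disjoint = [0x1000, 0x1004] * 50

print(btb_hit_rate(aliased))
print(btb_hit_rate(disjoint))
```

Under these assumptions the aliased stream never hits while the disjoint stream misses only on first use. A low hit rate can therefore come from conflicts between a few hot sites, not only from having more jump sites than entries.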

Memory Stalls

P93 branch-predictor workload: 59,886,452 stall cycles, 69,959,357 handshakes.

source	stall cycles	share	requests
instruction fetch	23,503,650	39.2%	49,525,077
data load	14,586,094	24.4%	973,034
data store	11,963,738	20%	217,413
atomic memory op	157,964	0.3%	184,202
page walk for fetch	1,117,646	1.9%	1,111,492
page walk for load/store	1,217,213	2%	1,216,703
other	7,340,147	12.3%	16,731,436
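One way to read the breakdown is stall cycles per request for each source. The sketch below computes that ratio and checks that the rows add up to the stated totals. It assumes the request column counts handshakes attributed to the same source as the stall column, which the text does not spell out.

```python
# (source, stall cycles, requests) copied from the breakdown above.
rows = [
    ("instruction fetch", 23_503_650, 49_525_077),
    ("data load", 14_586_094, 973_034),
    ("data store", 11_963_738, 217_413),
    ("atomic memory op", 157_964, 184_202),
    ("page walk for fetch", 1_117_646, 1_111_492),
    ("page walk for load/store", 1_217_213, 1_216_703),
    ("other", 7_340_147, 16_731_436),
]

total_stalls = sum(stalls for _, stalls, _ in rows)
total_requests = sum(reqs for _, _, reqs in rows)

# Stall cycles per request, under the attribution assumption stated above.
per_request = {name: stalls / reqs for name, stalls, reqs in rows}

for name, stalls, reqs in rows:
    share = 100.0 * stalls / total_stalls
    print(f"{name:26s} {share:5.1f}%  {per_request[name]:6.2f} stall cycles/request")
```

Under that assumption, instruction fetch has the largest share of stall cycles but stalls for less than one cycle per request on average. Data stores and loads issue far fewer requests and stall much longer on each one.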

P93 keeps the P92 fetch-stall shape. The branch predictor is not yet allowed to issue predicted fetches, so memory behavior mostly reflects the existing fetch queue and I-cache path.

Shell Phases

P93 shell workload: 221,863,586 cycles, CPI 2.57.

phase	cycles	share
kernel banner to /init	117,615,240	53.2%
/init to shell banner	1,083,479	0.5%
shell banner to first command	36,193,960	16.4%
echo command	1,598	0%
uname -a	2,618,215	1.2%
ls /bin /usr/share	32,299,289	14.6%
cat sample file	3,145,376	1.4%
touch/write/cat/rm /tmp file	11,350,298	5.1%
8x ash loop with file I/O	16,927,403	7.7%
final marker	663	0%

The shell workload reaches the same final file marker as the earlier rungs. The phase view is still the best way to compare user-visible progress across rungs.
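The phase figures can be cross-checked against the Result table. In the sketch below, the phases from the echo command through the final marker sum exactly to the 66,342,842 shell-window cycles reported for P93. Treating that span as the definition of the shell window is an inference from the numbers, not something the text states.

```python
# (phase, cycles) copied from the phase breakdown above.
phases = [
    ("kernel banner to /init", 117_615_240),
    ("/init to shell banner", 1_083_479),
    ("shell banner to first command", 36_193_960),
    ("echo command", 1_598),
    ("uname -a", 2_618_215),
    ("ls /bin /usr/share", 32_299_289),
    ("cat sample file", 3_145_376),
    ("touch/write/cat/rm /tmp file", 11_350_298),
    ("8x ash loop with file I/O", 16_927_403),
    ("final marker", 663),
]

names = [name for name, _ in phases]
first_command = names.index("echo command")  # assumed start of the shell window

shell_window = sum(cycles for _, cycles in phases[first_command:])
phase_total = sum(cycles for _, cycles in phases)

print("shell window cycles:", shell_window)
print("sum of all phases:  ", phase_total)
for name, cycles in phases:
    print(f"{name:30s} {100.0 * cycles / phase_total:5.1f}%")
```

The printed shares match the published ones when the denominator is the sum of the listed phases, which is slightly below the 221,863,586 post-load cycle total.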

Cycle Shape

P93 branch-predictor workload: 221,863,586 cycles, CPI 2.57.

state	cycles	share
fetch	8,314,840	3.7%
execute	86,494,366	39%
mem	28,082,445	12.7%
walker	4,663,054	2.1%
writeback	86,469,444	39%
mul/div	7,837,721	3.5%

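P93 does not add a new visible FSM state. Predictor update happens at retire, alongside the existing S_WB bookkeeping. The writeback count above (86,469,444) equals the retired-instruction count, which is consistent with one S_WB cycle per retired instruction.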

Hot Functions

P93 BusyBox shell symbols: 64,788 samples, one every 1,024 cycles.

symbol	image	share	samples
printf_core	busybox	5.6%	3,626
memset	kernel	5.1%	3,316
memcpy	busybox	3.7%	2,374
vruntime_eligible	kernel	3.4%	2,185
blake2s_compress_generic	kernel	2.8%	1,804
memcpy	kernel	2.6%	1,713
__fwritex	busybox	2.6%	1,663
handle_exception	kernel	1.7%	1,079
unmap_page_range	kernel	1.6%	1,053
avg_vruntime	kernel	1.4%	878
n_tty_write	kernel	1.3%	868
memset	busybox	1.3%	833
ret_from_exception	kernel	1.2%	787
next_uptodate_folio	kernel	1.1%	690
n_tty_read	kernel	1%	660
(remaining)	remaining	55.5%	35,962

The hot-symbol mix remains BusyBox plus kernel memory, scheduler, and exception paths. Prediction will only matter once it can steer a frontend that has somewhere useful to run ahead.

Honest Status

check	status
32-entry BTB storage and update	PASS
2-bit conditional direction counters	PASS
8-entry return-address stack	PASS
BusyBox shell workload runs	PASS
Predictor accuracy counters captured	PASS
Predictor steers fetch	NOT RUN
Shell-window speedup vs P91	FAIL
LibreLane hardening	NOT RUN

Next

P94 should split memory request service and arbitration. A predictor needs a frontend path that can issue predicted fetches without being jammed into the same control slot as loads, stores, AMOs, page walks, and background I-cache fills.
