Project No. 93 of 147 on the ladder

Branch predictor v0

Introduces: 32-entry BTB; 2-bit branch counters; 8-entry return-address stack; predictor accuracy counters

harden state (last run: 2026-05-05)
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

P93 adds branch predictor hardware without letting it redirect fetch yet. The predictor watches retired control-flow instructions, updates a 32-entry BTB, trains 2-bit conditional counters, and maintains an 8-entry return-address stack.
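As a software illustration of the shadow-only arrangement, here is a minimal Python sketch. The table sizes come from the text. The index function, the full-PC tags, the shared index for BTB and counters, the weakly-not-taken reset value, the overflow policy, and every name are assumptions for illustration, not the P93 RTL.

```python
from dataclasses import dataclass, field

BTB_ENTRIES = 32  # size from the text
RAS_ENTRIES = 8   # size from the text


@dataclass
class ShadowPredictor:
    """Hypothetical model: scores and trains at retire, never redirects fetch."""
    btb: list = field(default_factory=lambda: [None] * BTB_ENTRIES)    # (pc tag, target) or None
    counters: list = field(default_factory=lambda: [1] * BTB_ENTRIES)  # 2-bit; 1 = weakly not-taken (assumed reset)
    ras: list = field(default_factory=list)
    stats: dict = field(default_factory=lambda: {
        "cond": 0, "cond_dir_ok": 0, "ret": 0, "ret_ok": 0})

    @staticmethod
    def index(pc):
        # Assumed direct-mapped index: drop the two low PC bits, wrap at table size.
        return (pc >> 2) % BTB_ENTRIES

    def retire_conditional(self, pc, taken, target):
        i = self.index(pc)
        predicted_taken = self.counters[i] >= 2
        # Score the prediction first, then train.
        self.stats["cond"] += 1
        self.stats["cond_dir_ok"] += int(predicted_taken == taken)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
            self.btb[i] = (pc, target)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0

    def retire_call(self, return_addr):
        if len(self.ras) == RAS_ENTRIES:
            self.ras.pop(0)  # assumed policy: drop the oldest entry on overflow
        self.ras.append(return_addr)

    def retire_return(self, actual_target):
        predicted = self.ras.pop() if self.ras else None
        self.stats["ret"] += 1
        self.stats["ret_ok"] += int(predicted == actual_target)


# A loop branch taken four times, then one call/return pair.
p = ShadowPredictor()
for _ in range(4):
    p.retire_conditional(0x1000, True, 0x0F00)
p.retire_call(0x2004)
p.retire_return(0x2004)
print(p.stats)
```

Because the model only observes retired instructions, its counters can be compared against real outcomes without any risk of changing the instruction stream, which is the point of a measurement rung.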

This is a measurement rung. P92 showed that fall-through prefetch is not enough. P93 asks whether the shell workload has enough predictable control flow to justify a real predicted frontend PC.

Result

metric                P91 fill buffer  P92 fetch queue    P93 predictor
post-load cycles          221,327,811      222,624,131      221,863,586
shell window cycles        65,985,297       67,206,635       66,342,842
retired instructions       86,295,205       86,687,669       86,469,444
CPI                            2.5648           2.5681           2.5658
memory stall cycles        84,488,165       60,050,776       59,886,452
fetch stall cycles         55,533,555       23,555,005       23,503,650
I-cache hits                8,855,599       42,665,352       42,594,442
fetch queue fills                   0       53,982,463       53,869,769
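The CPI row is consistent with post-load cycles divided by retired instructions. The report does not state that definition; it is inferred from the fact that the ratio reproduces all three published values. A short check, with variable names chosen for illustration:

```python
post_load_cycles = {"P91": 221_327_811, "P92": 222_624_131, "P93": 221_863_586}
retired_instructions = {"P91": 86_295_205, "P92": 86_687_669, "P93": 86_469_444}

# Assumed definition: CPI = post-load cycles / retired instructions.
cpi = {
    rung: round(post_load_cycles[rung] / retired_instructions[rung], 4)
    for rung in post_load_cycles
}
for rung, value in cpi.items():
    print(f"{rung}: CPI {value:.4f}")
```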

P93 is a small recovery from P92, but not a speed win over P91:

comparison                 result
shell window vs P92        -1.29%
post-load cycles vs P92    -0.34%
shell window vs P91        +0.54%
post-load cycles vs P91    +0.24%
fetch stalls vs P92        -0.22%
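Each comparison is a plain relative change against the older rung, computed from the Result table. A minimal sketch; the helper name `pct_change` is illustrative:

```python
def pct_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old


shell_window = {"P91": 65_985_297, "P92": 67_206_635, "P93": 66_342_842}
post_load = {"P91": 221_327_811, "P92": 222_624_131, "P93": 221_863_586}
fetch_stalls = {"P92": 23_555_005, "P93": 23_503_650}

comparisons = {
    "shell window vs P92": pct_change(shell_window["P93"], shell_window["P92"]),
    "post-load cycles vs P92": pct_change(post_load["P93"], post_load["P92"]),
    "shell window vs P91": pct_change(shell_window["P93"], shell_window["P91"]),
    "post-load cycles vs P91": pct_change(post_load["P93"], post_load["P91"]),
    "fetch stalls vs P92": pct_change(fetch_stalls["P93"], fetch_stalls["P92"]),
}
for name, value in comparisons.items():
    print(f"{name}: {value:+.2f}%")
```

Negative values mean P93 used fewer cycles than the rung it is compared against.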

Because the predictor is shadow-only, this is not a claimed branch prediction speedup. It is a functional result plus accuracy data.

Predictor Accuracy

counter                              value
conditional branches                 11,232,139
conditional taken                     5,520,468
conditional predicted taken           3,548,651
conditional correct direction         8,079,678
conditional correct target            2,948,527
jumps/calls                           3,443,350
jump BTB hits                         1,584,431
jump correct target                   1,549,542
returns                               1,416,840
return RAS hits                       1,370,828
return correct target                 1,368,269

rate                                 value
conditional direction accuracy       71.93%
conditional taken rate               49.15%
taken-branch target accuracy         53.41%
jump BTB hit rate                    46.01%
jump target accuracy                 45.00%
return RAS hit rate                  96.75%
return target accuracy               96.57%
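The rates follow from the raw counters. The denominators below are inferred from which ratios reproduce the published percentages (for example, taken-branch target accuracy is measured against taken conditionals, not all conditionals); the report does not spell them out.

```python
counters = {
    "conditional branches": 11_232_139,
    "conditional taken": 5_520_468,
    "conditional predicted taken": 3_548_651,
    "conditional correct direction": 8_079_678,
    "conditional correct target": 2_948_527,
    "jumps/calls": 3_443_350,
    "jump BTB hits": 1_584_431,
    "jump correct target": 1_549_542,
    "returns": 1_416_840,
    "return RAS hits": 1_370_828,
    "return correct target": 1_368_269,
}


def pct(numerator, denominator):
    return round(100.0 * counters[numerator] / counters[denominator], 2)


rates = {
    "conditional direction accuracy": pct("conditional correct direction", "conditional branches"),
    "conditional taken rate": pct("conditional taken", "conditional branches"),
    "taken-branch target accuracy": pct("conditional correct target", "conditional taken"),
    "jump BTB hit rate": pct("jump BTB hits", "jumps/calls"),
    "jump target accuracy": pct("jump correct target", "jumps/calls"),
    "return RAS hit rate": pct("return RAS hits", "returns"),
    "return target accuracy": pct("return correct target", "returns"),
}
for name, value in rates.items():
    print(f"{name}: {value:.2f}%")
```

The same counters show the predictor calling "taken" less often than branches are actually taken (3,548,651 predicted against 5,520,468 actual).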

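The return-address stack is the clear win. Returns are common and highly predictable: 96.57% of them get the correct target from the 8-entry stack. The direct-mapped BTB is useful but too small or too aliased for general jumps, where it hits only 46.01% of the time.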
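To make the aliasing concern concrete, here is a small sketch of a direct-mapped 32-entry BTB in which two jumps that share an index keep evicting each other. The index function (word-aligned PC modulo table size), the full-PC tag, and the addresses are hypothetical; the real P93 indexing is not given in the text.

```python
BTB_ENTRIES = 32


def btb_index(pc):
    # Assumed index: drop the two low PC bits, wrap at the table size.
    return (pc >> 2) % BTB_ENTRIES


def count_hits(trace):
    """Replay (pc, target) pairs through a direct-mapped BTB and count tag hits."""
    btb = [None] * BTB_ENTRIES
    hits = 0
    for pc, target in trace:
        i = btb_index(pc)
        entry = btb[i]
        if entry is not None and entry[0] == pc:
            hits += 1
        btb[i] = (pc, target)  # allocate or refresh, evicting any other PC
    return hits


# Two jumps 128 bytes apart land on the same index and thrash.
aliased = [(0x1000, 0x2000), (0x1080, 0x3000)] * 4
# Two adjacent jumps use different indexes and coexist.
separate = [(0x1000, 0x2000), (0x1004, 0x3000)] * 4

print(count_hits(aliased), count_hits(separate))
```

In this toy trace the aliased pair never hits, while the non-aliased pair misses only on first sight. With 32 entries and this kind of indexing, any two hot jumps 128 bytes apart would collide, which is one plausible reading of the low jump hit rate.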

Memory Stalls

memory stalls: P93 branch-predictor workload
total stall cycles 59,886,452; total handshakes 69,959,357

     source                    stall cycles   share     requests
  1. instruction fetch           23,503,650   39.2%   49,525,077
  2. data load                   14,586,094   24.4%      973,034
  3. data store                  11,963,738     20%      217,413
  4. atomic memory op               157,964    0.3%      184,202
  5. page walk for fetch          1,117,646    1.9%    1,111,492
  6. page walk for load/store     1,217,213      2%    1,216,703
  7. other                        7,340,147   12.3%   16,731,436

P93 keeps the P92 fetch-stall shape. The branch predictor is not yet allowed to issue predicted fetches, so memory behavior mostly reflects the existing fetch queue and I-cache path.
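The per-source rows sum exactly to the two totals in the memory-stall breakdown. Dividing stall cycles by requests gives a rough stall cost per request; that ratio is derived here for illustration and is not a figure from the report.

```python
# (stall cycles, requests) per source, copied from the breakdown
rows = {
    "instruction fetch": (23_503_650, 49_525_077),
    "data load": (14_586_094, 973_034),
    "data store": (11_963_738, 217_413),
    "atomic memory op": (157_964, 184_202),
    "page walk for fetch": (1_117_646, 1_111_492),
    "page walk for load/store": (1_217_213, 1_216_703),
    "other": (7_340_147, 16_731_436),
}

total_stalls = sum(stalls for stalls, _ in rows.values())
total_requests = sum(requests for _, requests in rows.values())

# share of all stall cycles (percent) and stall cycles per request
summary = {
    name: (100.0 * stalls / total_stalls, stalls / requests)
    for name, (stalls, requests) in rows.items()
}
for name, (share, per_request) in summary.items():
    print(f"{name}: {share:.1f}% of stalls, {per_request:.2f} stall cycles/request")
```

On these numbers instruction fetch dominates by volume at well under one stall cycle per request, while data stores cost roughly 55 stall cycles per request and data loads roughly 15.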

Shell Phases

shell phases: P93 shell workload, 221,863,586 cycles, CPI 2.57

      phase                              cycles   share
   1. kernel banner to /init        117,615,240   53.2%
   2. /init to shell banner           1,083,479    0.5%
   3. shell banner to first command  36,193,960   16.4%
   4. echo command                        1,598      0%
   5. uname -a                        2,618,215    1.2%
   6. ls /bin /usr/share             32,299,289   14.6%
   7. cat sample file                 3,145,376    1.4%
   8. touch/write/cat/rm /tmp file   11,350,298    5.1%
   9. 8x ash loop with file I/O      16,927,403    7.7%
  10. final marker                          663      0%

The shell workload reaches the same final file marker. The phase view is still the best way to compare user-visible progress across rungs.
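One arithmetic observation ties the phase view to the Result table: phases 4 through 10 sum exactly to the P93 shell window cycle count. The report does not define the shell window that way, so treat the mapping as an inference from the numbers.

```python
phases = [
    ("kernel banner to /init", 117_615_240),
    ("/init to shell banner", 1_083_479),
    ("shell banner to first command", 36_193_960),
    ("echo command", 1_598),
    ("uname -a", 2_618_215),
    ("ls /bin /usr/share", 32_299_289),
    ("cat sample file", 3_145_376),
    ("touch/write/cat/rm /tmp file", 11_350_298),
    ("8x ash loop with file I/O", 16_927_403),
    ("final marker", 663),
]

phase_sum = sum(cycles for _, cycles in phases)
# Phases 4..10: everything from the first command to the final marker.
command_cycles = sum(cycles for _, cycles in phases[3:])

print(f"sum of all phases: {phase_sum:,}")
print(f"phases 4-10:       {command_cycles:,}")
print(f"boot share:        {100.0 * phases[0][1] / phase_sum:.1f}%")
```

The published phase shares match percentages taken against the sum of the ten phases rather than against the 221,863,586-cycle post-load total.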

Cycle Shape

state breakdown: P93 branch-predictor workload, 221,863,586 cycles, CPI 2.57

     state           cycles   share
  1. fetch        8,314,840    3.7%
  2. execute     86,494,366     39%
  3. mem         28,082,445   12.7%
  4. walker       4,663,054    2.1%
  5. writeback   86,469,444     39%
  6. mul/div      7,837,721    3.5%

P93 does not add a new visible FSM state. Predictor update happens at retire, alongside the existing S_WB bookkeeping.

Hot Functions

hot functions: P93 BusyBox shell symbols, 64,788 samples, one sample every 1,024 cycles

      symbol                     image     share   samples
   1. printf_core                busybox    5.6%     3,626
   2. memset                     kernel     5.1%     3,316
   3. memcpy                     busybox    3.7%     2,374
   4. vruntime_eligible          kernel     3.4%     2,185
   5. blake2s_compress_generic   kernel     2.8%     1,804
   6. memcpy                     kernel     2.6%     1,713
   7. __fwritex                  busybox    2.6%     1,663
   8. handle_exception           kernel     1.7%     1,079
   9. unmap_page_range           kernel     1.6%     1,053
  10. avg_vruntime               kernel     1.4%       878
  11. n_tty_write                kernel     1.3%       868
  12. memset                     busybox    1.3%       833
  13. ret_from_exception         kernel     1.2%       787
  14. next_uptodate_folio        kernel     1.1%       690
  15. n_tty_read                 kernel       1%       660
  16. (remaining)                          55.5%    35,962
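The sample count and period are consistent with the profile covering the shell window rather than the whole post-load run. That scope is an inference from the arithmetic below, not something the report states.

```python
samples = 64_788
period_cycles = 1_024
shell_window_cycles = 66_342_842  # P93 value from the Result table

covered_cycles = samples * period_cycles
gap = covered_cycles - shell_window_cycles

# Share of one symbol, for comparison with the published column.
printf_core_share = round(100.0 * 3_626 / samples, 1)

print(f"cycles covered by samples: {covered_cycles:,}")
print(f"difference from shell window: {gap} cycles")
print(f"printf_core share: {printf_core_share}%")
```

The difference is smaller than one sampling period, which is what a sampler that fires every 1,024 cycles across the shell window would produce.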

The hot-symbol mix remains BusyBox plus kernel memory, scheduler, and exception paths. Prediction will only matter once it can steer a frontend that has somewhere useful to run ahead.

Honest Status

check                                   status
32-entry BTB storage and update         PASS
2-bit conditional direction counters    PASS
8-entry return-address stack            PASS
BusyBox shell workload runs             PASS
Predictor accuracy counters captured    PASS
Predictor steers fetch                  NOT RUN
Shell-window speedup vs P91             FAIL
LibreLane hardening                     NOT RUN

Next

P94 should split memory request service and arbitration. A predictor needs a frontend path that can issue predicted fetches without being jammed into the same control slot as loads, stores, AMOs, page walks, and background I-cache fills.