P93 adds branch predictor hardware without letting it redirect fetch yet. The predictor watches retired control-flow instructions, updates a 32-entry BTB, trains 2-bit conditional counters, and maintains an 8-entry return-address stack.
This is a measurement rung. P92 showed that fall-through prefetch is not enough. P93 asks whether the shell workload has enough predictable control flow to justify a real predicted frontend PC.
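A minimal software sketch of what "shadow-only" means for the conditional-direction part: the predictor sees each retired branch, scores its own guess, trains, and never feeds a PC back to fetch. The class name, method name, table size, index function, and initial counter value are all illustrative assumptions, not the P93 RTL.

```python
class ShadowDirectionPredictor:
    """2-bit saturating counters trained at retire; never redirects fetch."""

    def __init__(self, entries=32):
        # Assumption: table size and initial state are illustrative only.
        self.entries = entries
        self.counters = [1] * entries  # 0..3, starting at weakly not-taken
        self.branches = 0
        self.correct = 0

    def _index(self, pc):
        # Assumption: drop the low 2 bits of the PC, then wrap into the table.
        return (pc >> 2) % self.entries

    def observe_retired_branch(self, pc, taken):
        """Score the shadow prediction, then train on the real outcome."""
        i = self._index(pc)
        predicted_taken = self.counters[i] >= 2
        self.branches += 1
        if predicted_taken == taken:
            self.correct += 1
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        return predicted_taken


predictor = ShadowDirectionPredictor()
# One loop branch: taken nine times, then falls through once.
for taken in [True] * 9 + [False]:
    predictor.observe_retired_branch(0x1000, taken)
print(predictor.correct, "of", predictor.branches)
```

In this toy trace the counter mispredicts the first iteration and the loop exit, so it scores 8 of 10. The return value of `observe_retired_branch` is only recorded, which is the sense in which the predictor is a measurement device rather than part of the fetch path.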
Result
| metric | P91 fill buffer | P92 fetch queue | P93 predictor |
|---|---|---|---|
| post-load cycles | 221,327,811 | 222,624,131 | 221,863,586 |
| shell window cycles | 65,985,297 | 67,206,635 | 66,342,842 |
| retired instructions | 86,295,205 | 86,687,669 | 86,469,444 |
| CPI | 2.5648 | 2.5681 | 2.5658 |
| memory stall cycles | 84,488,165 | 60,050,776 | 59,886,452 |
| fetch stall cycles | 55,533,555 | 23,555,005 | 23,503,650 |
| I-cache hits | 8,855,599 | 42,665,352 | 42,594,442 |
| fetch queue fills | 0 | 53,982,463 | 53,869,769 |
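The CPI row can be reproduced from the table, assuming CPI here means post-load cycles divided by retired instructions (the reported values match that definition to four decimals). Variable names are illustrative.

```python
# (post-load cycles, retired instructions) per rung, copied from the table.
runs = {
    "P91": (221_327_811, 86_295_205),
    "P92": (222_624_131, 86_687_669),
    "P93": (221_863_586, 86_469_444),
}

cpi = {name: cycles / instructions for name, (cycles, instructions) in runs.items()}

for name, value in cpi.items():
    print(f"{name}: CPI {value:.4f}")
```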
P93 is a small recovery from P92, but not a speed win over P91:
| comparison | result |
|---|---|
| shell window vs P92 | -1.29% |
| post-load cycles vs P92 | -0.34% |
| shell window vs P91 | +0.54% |
| post-load cycles vs P91 | +0.24% |
| fetch stalls vs P92 | -0.22% |
Because the predictor is shadow-only, this is not a claimed branch prediction speedup. It is a functional result plus accuracy data.
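The comparison percentages follow from the Result table as signed relative changes, where a negative value means P93 used fewer cycles than the baseline. A short sketch, with `pct_change` as an illustrative helper name:

```python
def pct_change(new, old):
    """Signed percent change of `new` relative to `old`."""
    return (new - old) / old * 100.0


shell_window = {"P91": 65_985_297, "P92": 67_206_635, "P93": 66_342_842}
post_load = {"P91": 221_327_811, "P92": 222_624_131, "P93": 221_863_586}
fetch_stalls = {"P92": 23_555_005, "P93": 23_503_650}

deltas = {
    "shell window vs P92": pct_change(shell_window["P93"], shell_window["P92"]),
    "post-load cycles vs P92": pct_change(post_load["P93"], post_load["P92"]),
    "shell window vs P91": pct_change(shell_window["P93"], shell_window["P91"]),
    "post-load cycles vs P91": pct_change(post_load["P93"], post_load["P91"]),
    "fetch stalls vs P92": pct_change(fetch_stalls["P93"], fetch_stalls["P92"]),
}

for name, value in deltas.items():
    print(f"{name}: {value:+.2f}%")
```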
Predictor Accuracy
| counter | value |
|---|---|
| conditional branches | 11,232,139 |
| conditional taken | 5,520,468 |
| conditional predicted taken | 3,548,651 |
| conditional correct direction | 8,079,678 |
| conditional correct target | 2,948,527 |
| jumps/calls | 3,443,350 |
| jump BTB hits | 1,584,431 |
| jump correct target | 1,549,542 |
| returns | 1,416,840 |
| return RAS hits | 1,370,828 |
| return correct target | 1,368,269 |
| rate | value |
|---|---|
| conditional direction accuracy | 71.93% |
| conditional taken rate | 49.15% |
| taken-branch target accuracy | 53.41% |
| jump BTB hit rate | 46.01% |
| jump target accuracy | 45.00% |
| return RAS hit rate | 96.75% |
| return target accuracy | 96.57% |
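The rates can be re-derived from the raw counters. The denominators below are inferred from which ratios reproduce the reported percentages (for example, taken-branch target accuracy is measured against taken conditionals, not all conditionals); the dictionary keys are illustrative names.

```python
counters = {
    "cond": 11_232_139,
    "cond_taken": 5_520_468,
    "cond_correct_direction": 8_079_678,
    "cond_correct_target": 2_948_527,
    "jumps": 3_443_350,
    "jump_btb_hits": 1_584_431,
    "jump_correct_target": 1_549_542,
    "returns": 1_416_840,
    "return_ras_hits": 1_370_828,
    "return_correct_target": 1_368_269,
}


def pct(numerator, denominator):
    """Percentage rounded to two decimals, as in the rate table."""
    return round(100.0 * numerator / denominator, 2)


c = counters
rates = {
    "conditional direction accuracy": pct(c["cond_correct_direction"], c["cond"]),
    "conditional taken rate": pct(c["cond_taken"], c["cond"]),
    "taken-branch target accuracy": pct(c["cond_correct_target"], c["cond_taken"]),
    "jump BTB hit rate": pct(c["jump_btb_hits"], c["jumps"]),
    "jump target accuracy": pct(c["jump_correct_target"], c["jumps"]),
    "return RAS hit rate": pct(c["return_ras_hits"], c["returns"]),
    "return target accuracy": pct(c["return_correct_target"], c["returns"]),
}

for name, value in rates.items():
    print(f"{name}: {value:.2f}%")
```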
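The return-address stack is the clear win. Returns are common and highly predictable: 96.75% of them hit the RAS and 96.57% resolve to the correct target. The direct-mapped BTB is useful but too small or too aliased for general jumps, which hit it only 46.01% of the time.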
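A toy model can illustrate both effects: two jumps that map to the same direct-mapped BTB slot evict each other, while a small stack tracks nested calls exactly until the call depth exceeds its capacity. The index function, full-PC tagging, and drop-oldest overflow policy are assumptions made for this sketch; the source states only the entry counts and that the BTB is direct-mapped.

```python
class DirectMappedBTB:
    def __init__(self, entries=32):
        self.entries = entries
        self.slots = [None] * entries  # each slot holds (pc, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumption: low PC bits pick the slot

    def lookup(self, pc):
        slot = self.slots[self._index(pc)]
        return slot[1] if slot is not None and slot[0] == pc else None

    def update(self, pc, target):
        self.slots[self._index(pc)] = (pc, target)  # newest entry always wins


class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def push(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumption: overflow drops the oldest entry
        self.stack.append(return_pc)

    def pop(self):
        return self.stack.pop() if self.stack else None


def count_btb_hits(btb, jumps):
    """Look up each jump before training on it, counting correct targets."""
    hits = 0
    for pc, target in jumps:
        if btb.lookup(pc) == target:
            hits += 1
        btb.update(pc, target)
    return hits


# 0x1000 and 0x1080 share slot 0 of a 32-entry table; 0x1004 uses slot 1.
aliased_hits = count_btb_hits(DirectMappedBTB(), [(0x1000, 0x2000), (0x1080, 0x3000)] * 3)
disjoint_hits = count_btb_hits(DirectMappedBTB(), [(0x1000, 0x2000), (0x1004, 0x3000)] * 3)

# Ten nested calls into an 8-entry stack: the two outermost returns are lost.
ras = ReturnAddressStack()
call_sites = [0x1000 + 4 * i for i in range(10)]
for return_pc in call_sites:
    ras.push(return_pc)
deep_correct = sum(ras.pop() == expected for expected in reversed(call_sites))

print(aliased_hits, disjoint_hits, deep_correct)
```

In this sketch the aliased pair never hits, the disjoint pair hits on every visit after the first, and the stack gets 8 of the 10 nested returns right. Whether the measured BTB misses come mostly from capacity or from aliasing is not something the counters above can separate.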
Memory Stalls
| source | stall cycles | share | requests |
|---|---|---|---|
| instruction fetch | 23,503,650 | 39.2% | 49,525,077 |
| data load | 14,586,094 | 24.4% | 973,034 |
| data store | 11,963,738 | 20% | 217,413 |
| atomic memory op | 157,964 | 0.3% | 184,202 |
| page walk for fetch | 1,117,646 | 1.9% | 1,111,492 |
| page walk for load/store | 1,217,213 | 2% | 1,216,703 |
| other | 7,340,147 | 12.3% | 16,731,436 |
P93 keeps the P92 fetch-stall shape. The branch predictor is not yet allowed to issue predicted fetches, so memory behavior mostly reflects the existing fetch queue and I-cache path.
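As a consistency check, the per-source stall cycles sum exactly to the 59,886,452 memory stall cycles reported for P93 in the Result table, and the shares are each source divided by that total. A short sketch with illustrative names:

```python
stall_cycles = {
    "instruction fetch": 23_503_650,
    "data load": 14_586_094,
    "data store": 11_963_738,
    "atomic memory op": 157_964,
    "page walk for fetch": 1_117_646,
    "page walk for load/store": 1_217_213,
    "other": 7_340_147,
}

total_stalls = sum(stall_cycles.values())
shares = {name: 100.0 * cycles / total_stalls for name, cycles in stall_cycles.items()}

print(f"total: {total_stalls:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```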
Shell Phases
| phase | cycles | share |
|---|---|---|
| kernel banner to /init | 117,615,240 | 53.2% |
| /init to shell banner | 1,083,479 | 0.5% |
| shell banner to first command | 36,193,960 | 16.4% |
| echo command | 1,598 | 0% |
| uname -a | 2,618,215 | 1.2% |
| ls /bin /usr/share | 32,299,289 | 14.6% |
| cat sample file | 3,145,376 | 1.4% |
| touch/write/cat/rm /tmp file | 11,350,298 | 5.1% |
| 8x ash loop with file I/O | 16,927,403 | 7.7% |
| final marker | 663 | 0% |
The shell workload reaches the same final file marker. The phase view is still the best way to compare user-visible progress across rungs.
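The phase counts tie back to the headline metric: summing the phases from the echo command through the final marker gives exactly the P93 shell-window figure of 66,342,842 cycles. That the shell window is defined as this span is an inference from the arithmetic, not something the text states. A short sketch:

```python
phases = [
    ("kernel banner to /init", 117_615_240),
    ("/init to shell banner", 1_083_479),
    ("shell banner to first command", 36_193_960),
    ("echo command", 1_598),
    ("uname -a", 2_618_215),
    ("ls /bin /usr/share", 32_299_289),
    ("cat sample file", 3_145_376),
    ("touch/write/cat/rm /tmp file", 11_350_298),
    ("8x ash loop with file I/O", 16_927_403),
    ("final marker", 663),
]

total_phase_cycles = sum(cycles for _, cycles in phases)

# Inferred window: first shell command through the final marker.
first_command = [name for name, _ in phases].index("echo command")
shell_window_cycles = sum(cycles for _, cycles in phases[first_command:])

print(f"all phases:   {total_phase_cycles:,}")
print(f"shell window: {shell_window_cycles:,}")
```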
Cycle Shape
| state | share | cycles |
|---|---|---|
| fetch | 3.7% | 8,314,840 |
| execute | 39% | 86,494,366 |
| mem | 12.7% | 28,082,445 |
| walker | 2.1% | 4,663,054 |
| writeback | 39% | 86,469,444 |
| mul/div | 3.5% | 7,837,721 |
P93 does not add a new visible FSM state. Predictor update happens at retire, alongside the existing S_WB bookkeeping.
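Two properties of the cycle-shape numbers can be checked directly. The writeback count equals the P93 retired-instruction count, which is consistent with one writeback cycle per retired instruction (an interpretation, not a statement from the source), and the listed states sum to within a couple of thousand cycles of the post-load total. A short sketch with illustrative names:

```python
cycle_shape = {
    "fetch": 8_314_840,
    "execute": 86_494_366,
    "mem": 28_082_445,
    "walker": 4_663_054,
    "writeback": 86_469_444,
    "mul/div": 7_837_721,
}
retired_instructions = 86_469_444  # P93, from the Result table
post_load_cycles = 221_863_586     # P93, from the Result table

state_cycles = sum(cycle_shape.values())
unlisted_cycles = post_load_cycles - state_cycles

print(f"writeback == retired: {cycle_shape['writeback'] == retired_instructions}")
print(f"cycles in listed states: {state_cycles:,}")
print(f"cycles not in listed states: {unlisted_cycles:,}")
```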
Hot Functions
- 5.6% of samples (3,626 samples)
- 5.1% of samples (3,316 samples)
- 3.7% of samples (2,374 samples)
- 3.4% of samples (2,185 samples)
- 2.8% of samples (1,804 samples)
- 2.6% of samples (1,713 samples)
- 2.6% of samples (1,663 samples)
- 1.7% of samples (1,079 samples)
- 1.6% of samples (1,053 samples)
- 1.4% of samples (878 samples)
- 1.3% of samples (868 samples)
- 1.3% of samples (833 samples)
- 1.2% of samples (787 samples)
- 1.1% of samples (690 samples)
- 1% of samples (660 samples)
- 55.5% of samples (35,962 samples)
The hot-symbol mix remains BusyBox plus kernel memory, scheduler, and exception paths. Prediction will only matter once it can steer a frontend that has somewhere useful to run ahead.
Honest Status
| check | status |
|---|---|
| 32-entry BTB storage and update | PASS |
| 2-bit conditional direction counters | PASS |
| 8-entry return-address stack | PASS |
| BusyBox shell workload runs | PASS |
| Predictor accuracy counters captured | PASS |
| Predictor steers fetch | NOT RUN |
| Shell-window speedup vs P91 | FAIL |
| LibreLane hardening | NOT RUN |
Next
P94 should split memory request service and arbitration. A predictor needs a frontend path that can issue predicted fetches without being jammed into the same control slot as loads, stores, AMOs, page walks, and background I-cache fills.