P94: memory arbiter v0

P94 split the shared memory-port mux into named request classes and added want/grant/class counters to the Verilator harness. The physical memory model is still one shared RAM port; the difference is that the traffic is now visible.
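As a rough illustration of what per-class want/grant/denied accounting looks like, here is a minimal Python model of a fixed-priority arbiter on a single port. It is a sketch under stated assumptions, not the P94 RTL or harness code: the class names `load_store` and `ifetch_demand`, the priority order, and the `ArbiterCounters` helper are all hypothetical. Only the background I-cache fill class is named in this note.

```python
from collections import Counter

# Hypothetical request classes, highest priority first. Only the background
# I-cache fill class is named in this note; the other names are placeholders.
PRIORITY = ["load_store", "ifetch_demand", "icache_bg_fill"]


class ArbiterCounters:
    """Per-class want/grant/denied counters for one shared memory port."""

    def __init__(self, priority):
        self.priority = list(priority)
        self.want = Counter()    # cycles the class asserted a request
        self.grant = Counter()   # cycles the class won the port
        self.denied = Counter()  # cycles the class wanted the port but lost it

    def tick(self, requesting):
        """Account one cycle; `requesting` is the set of classes asking for the port."""
        winner = None
        for cls in self.priority:
            if cls in requesting:
                self.want[cls] += 1
                if winner is None:
                    # Highest-priority requester takes the single port.
                    winner = cls
                    self.grant[cls] += 1
                else:
                    # A higher-priority class already holds the port this cycle.
                    self.denied[cls] += 1
        return winner


arb = ArbiterCounters(PRIORITY)
arb.tick({"load_store", "icache_bg_fill"})  # fill loses to foreground load/store
arb.tick({"icache_bg_fill"})                # fill runs unopposed
arb.tick(set())                             # idle cycle
print(dict(arb.denied))                     # {'icache_bg_fill': 1}
```

In this model a class counts as denied only when it wanted the port in a cycle that a higher-priority class won. The real counters may define denial differently, for example around multi-cycle transactions.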

The shell smoke test passed:

P94 direct UART console + memory attribution smoke PASS

The run is slightly slower than P93 (about 0.3% more post-load cycles and 1.1% more shell-window cycles), so this is not a speed claim:

metric	P93	P94
post-load cycles	221,863,586	222,459,202
shell window cycles	66,342,842	67,050,374
memory stall cycles	59,886,452	60,032,329
fetch stall cycles	23,503,650	23,549,359
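The size of the slowdown can be checked directly from the table above. The snippet below is a small sketch that recomputes the absolute and percentage deltas; the `deltas` helper is illustrative and not part of the harness.

```python
# Cycle counts copied from the P93/P94 table above.
P93 = {
    "post-load cycles": 221_863_586,
    "shell window cycles": 66_342_842,
    "memory stall cycles": 59_886_452,
    "fetch stall cycles": 23_503_650,
}
P94 = {
    "post-load cycles": 222_459_202,
    "shell window cycles": 67_050_374,
    "memory stall cycles": 60_032_329,
    "fetch stall cycles": 23_549_359,
}


def deltas(before, after):
    """Return {metric: (absolute delta, percent change)} for each metric in `before`."""
    return {
        metric: (after[metric] - before[metric],
                 100.0 * (after[metric] - before[metric]) / before[metric])
        for metric in before
    }


for metric, (diff, pct) in deltas(P93, P94).items():
    print(f"{metric:22s} {diff:+10,d} {pct:+.2f}%")
```

Every metric moves by roughly 1% or less: +595,616 post-load cycles, +707,532 shell-window cycles, +145,877 memory stall cycles, and +45,709 fetch stall cycles.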

The useful new counter is contention. Only background I-cache fill was denied service by higher-priority work:

class	denied cycles
I-cache background fill	4,542,004
all foreground classes	0

That points P95 away from another pure frontend guess. No foreground class was ever denied the port, so the foreground stalls are not arbitration losses: the load/store path is paying real shared-memory latency. A store buffer or tiny D-cache is the next sensible experiment.