No. 68 / project of 147 on the ladder

AtomVM port — and the C extension we needed to get there

introduces — RV32C support (decompressor, straddle-aware fetch, compressed-link addressing); AtomVM bare-metal build flow on the chip

harden state: last run 2026-05-04
signoff
  • DRC: NOT RUN
  • LVS: NOT RUN
  • antenna: NOT RUN

The plan for P68 was simple: take AtomVM upstream, vendor it, write a platform shim modeled on its RP2 port, and run hello-world Erlang on our chip. Reality had a different plan: link against newlib and the very first call into libc trapped at PC 0x95610 on instruction 0x1141 — a c.addi. A compressed instruction. The nixpkgs riscv32-none-elf toolchain ships libc/libgcc compiled with -march=rv32imafdc and there is no practical way to get a no-C build without forking and rebuilding the toolchain. So P68 became two things: AtomVM bring-up and the C extension our chip needed to link against the real world.

Headline: AtomVM hello-world runs on the chip. The BEAM VM loads hello.beam, executes hello:start/0, and emits "hello from atomvm on a homemade chip" over UART. Total run: ~4.7M cycles, 1.5 seconds Verilator wall time.

AtomVM on a homemade RV32 chip (P68 bare-metal port)
Starting AtomVM revision 0.8.0-dev+git.7a57441
Found startup beam: hello.beam
"hello from atomvm on a homemade chip"
AtomVM exited

Why the C extension at all

Without C, every link against newlib means stepping through compiled object files and rejecting any that contain compressed encodings. The math doesn’t work: all of newlib is built that way in the standard toolchains. We had three options:

  1. Maintain a parallel no-C toolchain build (months of work and a permanent maintenance cost).
  2. Replace every call into libc with a hand-rolled equivalent (also months — newlib has hundreds of internal helpers).
  3. Add C to the chip (~a day of careful RTL).

Option 3 is also the right answer for the broader chip story: real RV32 software in 2026 is built with compressed encodings on nearly every common code path, and synthesizing without C would mean designing the chip around an artificial constraint that no real toolchain honours.

What “C support” means in this chip

Three RTL changes, all in the existing CPU module (no new states):

  1. A 16-bit → 32-bit decompressor (rvc_decompress, combinational function in top.sv). Covers the integer subset of RV32C: ADDI/LI/LUI, LW/SW + LWSP/SWSP + ADDI4SPN, J/JAL/JR/JALR, BEQZ/BNEZ, MV/ADD/SUB/AND/OR/XOR/ANDI, SLLI/SRLI/SRAI, EBREAK. Unsupported, RV64-only, and FP-flavoured encodings return 32'h0, which the existing illegal-opcode decode catches as MCAUSE_ILLEGAL_INSTR. About 120 lines of mostly mechanical, table-driven code.

  2. Straddle-aware fetch path. Instructions can now sit at any 2-byte boundary. A 32-bit instruction at pc[1] == 1 straddles two memory words. The fetch state grows two regs: fetch_straddle_q (1 = we have the low half from the previous word) and fetch_straddle_lo_q[15:0] (the stashed low 16 bits). When the chip detects a straddle (pc[1]=1 and the low-half candidate looks like a 32-bit insn) it stashes the low half, doesn’t advance pc, and the next cycle’s S_FETCH issues a fetch at (pc & ~3) + 4 to grab the high half.

    if (fetch_straddle_q) begin
      // Second half of a straddled 32-bit insn: the high 16 bits arrive in
      // the low half of the word at (pc & ~3) + 4.
      ir              <= {mem_rdata[15:0], fetch_straddle_lo_q};
      is_compressed_q <= 1'b0;
      fetch_straddle_q <= 1'b0;
      state           <= S_EXECUTE;
    end else if (pc[1] && mem_rdata[17:16] == 2'b11) begin
      // 32-bit insn starting in the upper halfword: stash its low 16 bits.
      fetch_straddle_lo_q <= mem_rdata[31:16];
      fetch_straddle_q    <= 1'b1;
      // stay in S_FETCH; mem_addr drive picks up the +4 path
    end else if (pc[1] ? mem_rdata[17:16] == 2'b11
                       : mem_rdata[1:0]   == 2'b11) begin
      // Only reachable with pc[1]==0 (the pc[1]==1 case is taken above).
      ir              <= mem_rdata;          // 32-bit insn at pc[1]==0
      is_compressed_q <= 1'b0;
      state           <= S_EXECUTE;
    end else begin
      // Compressed insn in whichever halfword pc[1] selects.
      ir              <= rvc_decompress(pc[1] ? mem_rdata[31:16]
                                              : mem_rdata[15:0]);
      is_compressed_q <= 1'b1;
      state           <= S_EXECUTE;
    end

  3. Compressed-aware PC advance and link value. next_pc defaults to pc + (is_compressed_q ? 2 : 4), so a compressed instruction advances PC by two bytes and a 32-bit one by four. The same conditional drives the JAL/JALR link value (alu_b = is_compressed_q ? 32'd2 : 32'd4): a c.jalr has to put pc + 2 into ra, not pc + 4, or the callee returns two bytes past the start of the instruction after the call and traps. We learned that one the hard way on the first run with C enabled.
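
As a cross-check on item 1, the decompressor's behaviour for two of the encodings that appear on this page can be modelled in C. This is a minimal sketch, not the rvc_decompress from top.sv: the name rvc_decompress_sketch and its helpers are hypothetical, and only c.addi and c.andi are handled, with everything else returning 0 the way the text describes for unsupported encodings.

```c
#include <stdint.h>

/* Compressed register fields name x8..x15: index = {2'b01, 3-bit field}. */
static uint32_t creg(uint32_t field3) { return 8u + (field3 & 7u); }

/* Sign-extended 6-bit CI-format immediate: {ci[12], ci[6:2]}. */
static int32_t ci_imm6(uint16_t ci) {
    int32_t imm = (int32_t)((((ci >> 12) & 1u) << 5) | ((ci >> 2) & 0x1fu));
    return (imm ^ 0x20) - 0x20;
}

/* Assemble a 32-bit I-type instruction word. */
static uint32_t itype(int32_t imm, uint32_t rs1, uint32_t funct3,
                      uint32_t rd, uint32_t opcode) {
    return (((uint32_t)imm & 0xfffu) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* Hypothetical two-instruction subset of an RVC decompressor.
 * Anything it does not recognise returns 0 (treated as illegal). */
uint32_t rvc_decompress_sketch(uint16_t ci) {
    uint32_t quadrant = ci & 3u;
    uint32_t funct3   = (ci >> 13) & 7u;

    if (quadrant == 1 && funct3 == 0) {
        /* c.addi rd, imm  ->  addi rd, rd, imm (full 5-bit rd in ci[11:7]) */
        uint32_t rd = (ci >> 7) & 0x1fu;
        return itype(ci_imm6(ci), rd, 0, rd, 0x13);
    }
    if (quadrant == 1 && funct3 == 4 && ((ci >> 10) & 3u) == 2) {
        /* c.andi rd', imm  ->  andi rd', rd', imm.
         * rd'/rs1' lives in ci[9:7]; ci[4:2] holds immediate bits here. */
        uint32_t rd = creg(ci >> 7);
        return itype(ci_imm6(ci), rd, 7, rd, 0x13);
    }
    return 0;
}
```

With this sketch, 0x1141 (the c.addi that trapped on the first run) expands to 0xff010113, which is addi sp, sp, -16, and 0x9a61 (c.andi a2, -8) expands to 0xff867613.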

fetch_aligned relaxes from pc[1:0] == 2'b00 to pc[0] == 0 to allow halfword-aligned PCs. The misalignment trap fires only on pc[0] != 0 now.
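
The fetch rules from items 2 and 3, together with the relaxed alignment check, can be sketched as a small C reference model. The assumptions are loud ones: memory is a flat array of little-endian 32-bit words, the names fetch_result, fetch_model, and seq_pc are hypothetical, and the two-cycle straddle handshake of the RTL is collapsed into a single call that simply counts memory reads.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t raw;         /* 16-bit parcel (zero-extended) or 32-bit insn */
    bool     compressed;  /* mirrors is_compressed_q */
    bool     misaligned;  /* pc[0] != 0, the only fetch-alignment trap left */
    unsigned mem_reads;   /* 2 when the instruction straddled two words */
} fetch_result;

/* mem: array of 32-bit words; pc: byte address into that array. */
fetch_result fetch_model(const uint32_t *mem, uint32_t pc) {
    fetch_result r = {0, false, false, 0};
    if (pc & 1u) {                      /* halfword alignment is required */
        r.misaligned = true;
        return r;
    }

    uint32_t word   = mem[pc >> 2];
    uint32_t parcel = (pc & 2u) ? (word >> 16) : (word & 0xffffu);
    r.mem_reads = 1;

    if ((parcel & 3u) != 3u) {
        /* Low two bits != 11: a compressed instruction in this halfword. */
        r.raw = parcel;
        r.compressed = true;
    } else if (pc & 2u) {
        /* 32-bit insn starting in the upper halfword: straddle.
         * The high half comes from the word at (pc & ~3) + 4. */
        uint32_t next = mem[(pc >> 2) + 1];
        r.mem_reads = 2;
        r.raw = ((next & 0xffffu) << 16) | parcel;
    } else {
        r.raw = word;                   /* aligned 32-bit instruction */
    }
    return r;
}

/* pc + 2 or pc + 4: used both as the default next_pc and as the
 * JAL/JALR link value, as described in item 3. */
uint32_t seq_pc(uint32_t pc, bool compressed) {
    return pc + (compressed ? 2u : 4u);
}
```

Keeping the link value and the default next_pc on the same helper makes the second-run bug structurally impossible in the model: there is no separate hardcoded pc + 4 to forget.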

What broke during bring-up

  • First run (no C in the chip): trapped at the very first c.addi (0x1141) inside libc’s fputs, ~3000 cycles in. PC 0x95610, mtval = the compressed insn we couldn’t decode.
  • Second run (C in the chip, but JAL/JALR link value still hardcoded pc + 4): trapped at 0x745b6, exactly two bytes past the start of the instruction after a c.jalr. The callee had returned into the middle of that next instruction. The diagnostic PC dump told us instantly that the link value had been wrong.
  • Third run (link fixed): chip ran 17M+ cycles cleanly through libc, but newlib’s __sfvwrite_r chunking emitted bad chunk lengths in our environment (no __libc_init_array call, FILE buffers never properly initialised). Workaround: replace the fputs calls in our shim with direct UART writes (p68_uart_puts), bypassing newlib’s stdio entirely. Banner emits cleanly.
  • Fourth run (banner via direct UART): AtomVM globalcontext_new returned NULL because newlib’s malloc dragged in __malloc_lock and friends that require runtime init we don’t do. Workaround: provide our own bump-allocator malloc/free/calloc/realloc in port/p68_libc.c that walks _sbrk’s heap window directly. AtomVM’s hello-world malloc-heavy startup needs ~25 calls totalling under 1 MiB — trivial for an 8 MiB heap window.
  • Fifth run (custom malloc): trapped on STORE_ADDR_MISALIGNED at PC 0xb54. malloc had returned a pointer ending in 0x7 where it should have been 8-byte aligned. Instrumenting the allocator ([m:size=ret] markers) confirmed that the first call returned 0x95167 when it should have returned 0x95160. The cause was a decompressor bug: c.andi (and c.srli/c.srai/c.sub/c.xor/c.or/c.and) took the destination register from the wrong field, using rd_p = {2'b01, ci[4:2]} (the rs2’ position) instead of rs1_p = {2'b01, ci[9:7]} (the rd’ position). The AND on a2 = 0x95167 was applied to a different register, so a2 kept its misaligned value and the next allocation inherited it. Fixed by swapping rd_p for rs1_p in those four sub-cases (c.srli, c.srai, c.andi, and the c.sub/c.xor/c.or/c.and group). Boot fully succeeded on the next run.
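
The bump allocator from the fourth and fifth runs can be illustrated with a short sketch. This is not the code in port/p68_libc.c: the sketch_ names are hypothetical, a static array stands in for the _sbrk heap window, the 64 KiB size is arbitrary, and realloc is left out because it would need per-block size tracking.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HEAP_BYTES (64u * 1024u)  /* arbitrary; must be a multiple of 8 */

static _Alignas(8) unsigned char sketch_heap[SKETCH_HEAP_BYTES];
static size_t sketch_brk;                /* offset of the next free byte */

/* Round n up to the next multiple of 8: add 7, then clear the low 3 bits.
 * The masking step is the AND that the broken c.andi failed to apply. */
static size_t sketch_align_up(size_t n) {
    return (n + 7u) & ~(size_t)7u;
}

void *sketch_malloc(size_t size) {
    size_t start = sketch_align_up(sketch_brk);
    if (size > SKETCH_HEAP_BYTES - start)
        return NULL;                     /* request does not fit the window */
    sketch_brk = start + size;
    return &sketch_heap[start];
}

/* A bump allocator never reclaims memory, so free is a no-op. */
void sketch_free(void *p) { (void)p; }

void *sketch_calloc(size_t n, size_t size) {
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;                     /* n * size would overflow */
    void *p = sketch_malloc(n * size);
    if (p != NULL)
        memset(p, 0, n * size);
    return p;
}
```

With this rounding, an already aligned address such as 0x95160 becomes 0x95167 after the add and returns to 0x95160 after the mask. Skipping the mask leaves 0x95167, which is consistent with the value seen in the fifth run.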

The AtomVM build flow

The build is reproducible end to end from a clean checkout. It runs inside the nix devshell (pkgs.erlang, pkgs.cmake, pkgs.ninja, pkgs.gperf, pkgs.rebar3, pkgs.pkgsCross.riscv32-embedded.buildPackages.gcc), which previous-rung agent work had already set up.

nix develop                                                 # shell with all tools
cd projects/68_atomvm_port/vendor/AtomVM/tools/packbeam
rebar3 escriptize                                           # builds packbeam escript
cd ../../../..                                              # back to projects/68_atomvm_port
cd test && make all                                         # libAtomVM + hello.beam + main.avm + final ELF + boot blob
make verilator-run                                          # ~13 sec wall to first UART output

Pinned commit: AtomVM 7b282159 (2025-W16 main).

What’s next

  • A bigger Erlang demo — blocked on a soft-float toolchain. Tried a two-process ping-pong (projects/68_atomvm_port/erlang/pingpong.erl) to exercise the scheduler. Got partway: pingpong:start/0 runs, prints the starting atom, then traps on fsd fs0, 168(sp) — a hardware FP store — inside libAtomVM’s term_compare prologue. Our chip has no FPU, but the only RV32 bare-metal toolchain in nixpkgs (pkgsCross.riscv32-embedded) ships single-multilib rv32imafdc/ilp32d only. The compiler uses callee-saved FP registers as scratch spill space even when no float arithmetic appears in the source. Hello-world doesn’t trip it because the inlined functions never need an FP-register spill; deeper code paths do.

    Three exits from this trap:

    1. Soft-float multilib: package the toolchain in nix with -march=rv32imac -mabi=ilp32 libgcc/newlib variants (chunky packaging work).
    2. Trap-and-emulate FP: an illegal-instruction trap handler that decodes and emulates F/D ops in software (a real chunk of trap-handler RTL/firmware).
    3. Add F to the chip: biggest scope, but the cleanest answer long-term and a fun project rung in its own right.

    Pick one when there’s energy for it. AtomVM’s hello-world stays green in the meantime.

  • Fold C into a trunk core. The optimization-arc rungs P63-P66 don’t have C; this rung does. Future trunk work will either retrofit C onto the pipelined cores or branch a new rung that combines C with the latest pipeline work.

  • Bring back libc properly if we want printf/scanf/etc. The custom bump-malloc + direct-UART-write workarounds are fine for hello-world but won’t carry larger Erlang programs that hit io:format or anything routed through stdio. Either run __libc_init_array and friends from start.S, or vendor a smaller libc (picolibc, llvm-libc) that doesn’t drag the same dependencies.