I Got an AMD R9700 (32GB) for Local Inference

If a local model was going to be part of the assistant, I wanted to understand the machine under it. So I bought a AI-oriented GPU.

I wanted to learn the internals of LLMs and inference, and I can’t afford the multi-million-dollar racks of datacenter GPUs the hyperscalers run. A local model is a temporary stand-in for that learning: not smart enough to replace a frontier model, not some AGI/ASI waking up in my utility room, but decent enough to test chatting flows and scripts and watch how serving behaves.

The homelab ran on a midrange NVIDIA card: family media, the odd coding project, the occasional game. The new generation ships with real VRAM, and 32GB on a single card runs a useful model.

The hot topic card

An AMD Radeon AI PRO R9700 with 32GB: recent ROCM and quantization supports in hardware, 2 slot blower fan, close to MSRP, and supposedly llama.cpp’s Vulkan backend was stable and quick now even though ROCM also works. I also considered the Intel B70 Pro 32GB - cheaper, but Intel’s software stack and performance is reportedly even further behind and I wouldn’t have time to fight that uphill battle now. Maybe next time.

The build

The card lives in a dedicated VM with the GPU passed through, running the local-inference stack:

  • llama-swap fronting llama.cpp: one endpoint, multiple models, swapped on demand. It speaks both the OpenAI /v1/* and the Anthropic /v1/messages APIs.
  • The resident model is a Qwen 3.6 35B MoE running at the maximum supported 256K-token context window in about 28 GiB of VRAM, with multi-token prediction giving a decent decode speedup. A 27B dense model loads on demand for comparison work.
  • Numbers from this setup: a ~179k-token prompt prefills around ~2000 tokens/sec at the start, slowing to about ~500 tokens/sec near the end, and decode runs ~42 tokens/sec at depth with ~73% draft-acceptance.

Inside the AI host: the ASRock Radeon AI PRO R9700, a large air cooler, and a Corsair HX1000i power supply Inside the AI host. The R9700 is the ASRock card in the middle; the rest is an ordinary desktop build that happens to live in a rack chassis. The NVIDIA 1050Ti is the old 4GB card for non-VM display-out, and there’s a SATA expansion card and an Intel X710 dual 10G SFP+ card in there too.

Pointing a coding agent at it

llama-swap also speaks the Anthropic Messages API, so I can point a coding agent at the local model with nothing but a base-URL override. Shows how prompt caching and serving behave under a real client.

Those numbers come from a measurement harness: a SQLite run database, a single CLI runner, and chart and report generation that renders every claim as a single-page HTML report (decode and prefill rates against context depth, time-to-first-token cold versus prefix-cache-warm, speculative-decoding draft-length sweeps). Everything is reproducible from the run database.

What the card unlocked

The rest of this series: defining SLOs for a token stream, chasing a stranger’s tokens-per-second claim down to the truth, and the prompt-cache behaviour above.