I Got an AMD R9700 (32GB) for Local Inference
If a local model was going to be part of the assistant, I wanted to understand the machine under it. So I bought a AI-oriented GPU.
I wanted to learn the internals of LLMs and inference, and I can’t afford the multi-million-dollar racks of datacenter GPUs the hyperscalers run. A local model is a temporary stand-in for that learning: not smart enough to replace a frontier model, not some AGI/ASI waking up in my utility room, but decent enough to test chatting flows and scripts and watch how serving behaves.
The homelab ran on a midrange NVIDIA card: family media, the odd coding project, the occasional game. The new generation ships with real VRAM, and 32GB on a single card runs a useful model.
The hot topic card
An AMD Radeon AI PRO R9700 with 32GB: recent ROCM and quantization supports in hardware, 2 slot blower fan, close to MSRP, and supposedly llama.cpp’s Vulkan backend was stable and quick now even though ROCM also works. I also considered the Intel B70 Pro 32GB - cheaper, but Intel’s software stack and performance is reportedly even further behind and I wouldn’t have time to fight that uphill battle now. Maybe next time.
The build
The card lives in a dedicated VM with the GPU passed through, running the local-inference stack:
- llama-swap fronting llama.cpp: one endpoint, multiple models, swapped on demand. It speaks both the OpenAI
/v1/*and the Anthropic/v1/messagesAPIs. - The resident model is a Qwen 3.6 35B MoE running at the maximum supported 256K-token context window in about 28 GiB of VRAM, with multi-token prediction giving a decent decode speedup. A 27B dense model loads on demand for comparison work.
- Numbers from this setup: a ~179k-token prompt prefills around ~2000 tokens/sec at the start, slowing to about ~500 tokens/sec near the end, and decode runs ~42 tokens/sec at depth with ~73% draft-acceptance.
Inside the AI host. The R9700 is the ASRock card in the middle; the rest is an ordinary desktop build that happens to live in a rack chassis. The NVIDIA 1050Ti is the old 4GB card for non-VM display-out, and there’s a SATA expansion card and an Intel X710 dual 10G SFP+ card in there too.
Pointing a coding agent at it
llama-swap also speaks the Anthropic Messages API, so I can point a coding agent at the local model with nothing but a base-URL override. Shows how prompt caching and serving behave under a real client.
Those numbers come from a measurement harness: a SQLite run database, a single CLI runner, and chart and report generation that renders every claim as a single-page HTML report (decode and prefill rates against context depth, time-to-first-token cold versus prefix-cache-warm, speculative-decoding draft-length sweeps). Everything is reproducible from the run database.
What the card unlocked
The rest of this series: defining SLOs for a token stream, chasing a stranger’s tokens-per-second claim down to the truth, and the prompt-cache behaviour above.