VFIO on 2nd-gen Ryzen - passing a GPU, USB and audio into a VM (and the reset bug that broke it)

What I was actually trying to do

elysium is my Proxmox box - a Ryzen 7 2700X (Zen+) on an Asus Crosshair VII Hero, the same X470 board from Homelab Part 1. The goal was to hand a guest VM real hardware: the GTX 1050 Ti, a couple of the onboard AMD USB 3.0 controllers, and the onboard HD audio. Redirecting USB devices one at a time over QEMU’s emulated USB bus means managing individual USB device IDs and makes hotplug awkward, so I pass whole USB host controllers through instead. That needs VFIO and PCIe passthrough.

The VM config ended up as:

hostpci0: 2b:00,pcie=1
hostpci1: 2c:00.3,pcie=1
hostpci2: 2d:00.3,pcie=1
hostpci3: 03:00.0,pcie=1

2b:00 is the GPU and its audio function (2b:00.0 / 2b:00.1), 2c:00.3 and 03:00.0 are the two AMD USB 3.0 host controllers, and 2d:00.3 is the onboard HD audio. Here’s the relevant slice of lspci so the addresses make sense:

root@elysium:/etc/pve/qemu-server# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
...
03:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43d0 (rev 01)
03:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
03:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Bridge (rev 01)
...
2b:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
2b:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
2c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
2c:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
2c:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] USB 3.0 Host controller
2d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
2d:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
2d:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller

I’ve trimmed the full output down to the devices I care about; the rest is Family 17h Data Fabric functions and Crosshair bits (the ASPEED BMC graphics, the two Intel I210 NICs, ASMedia SATA).
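To make the mapping between the hostpci lines and the lspci addresses concrete, here is a small Python sketch. It assumes that a hostpci address without a function number (2b:00) selects every function on that slot, which is how the GPU plus its audio function are described above, and that an address with a function (2c:00.3) selects only that one. The function name passed_functions is my own, not a Proxmox API.

```python
# Addresses copied from the trimmed lspci output in this post.
LSPCI_ADDRESSES = [
    "03:00.0", "03:00.1", "03:00.2",
    "2b:00.0", "2b:00.1",
    "2c:00.0", "2c:00.2", "2c:00.3",
    "2d:00.0", "2d:00.2", "2d:00.3",
]

# Values of the hostpciN lines from the VM config in this post.
HOSTPCI_VALUES = [
    "2b:00,pcie=1",
    "2c:00.3,pcie=1",
    "2d:00.3,pcie=1",
    "03:00.0,pcie=1",
]


def passed_functions(hostpci_value, lspci_addresses):
    """Return the lspci addresses that one hostpci value selects.

    Assumption: "bus:slot" means all functions, "bus:slot.fn" means one.
    """
    address = hostpci_value.split(",")[0]
    if "." in address:
        return [a for a in lspci_addresses if a == address]
    return [a for a in lspci_addresses if a.rsplit(".", 1)[0] == address]


if __name__ == "__main__":
    for value in HOSTPCI_VALUES:
        print(value, "->", passed_functions(value, LSPCI_ADDRESSES))
```

The point of the check is the 03:00.0 line: only the USB controller is selected, and the SATA controller and bridge at 03:00.1 and 03:00.2 stay with the host.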

Problem 1: the IOMMU groups

VFIO isolates at the IOMMU group: you can only pass a device through if its whole group goes through, and the kernel won’t let you pass half a group. On Ryzen / X470 the chipset lumps unrelated devices into the same coarse group. Without intervention, my chipset USB controller (03:00.0) lands in a group with a SATA controller (03:00.1) and a PCIe bridge (03:00.2), and I can’t hand the VM the host’s SATA controller.
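If you want to see the grouping on your own box, the kernel exposes it under /sys/kernel/iommu_groups, one directory per group with a devices subdirectory inside. The sketch below walks that tree; the helper names are mine, and note that sysfs addresses carry the PCI domain prefix (0000:03:00.0 rather than 03:00.0).

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group id to the sorted PCI addresses inside it."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled, or not running on Linux: nothing to report.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            groups[group_dir.name] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def shares_group_with(address, groups):
    """Return the other devices in the same group as `address`.

    An empty list means the device is alone in its group (or was not found).
    """
    for devices in groups.values():
        if address in devices:
            return [d for d in devices if d != address]
    return []


if __name__ == "__main__":
    found = iommu_groups()
    for group_id in sorted(found, key=int):
        print(group_id, " ".join(found[group_id]))
```

A device is safe to pass through on its own when shares_group_with returns an empty list for it, or when everything else in the list is something you are also willing to give to the guest.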

Alex Williamson’s ACS override patch ships in the pve-kernel as 0003-pci-Enable-overrides-for-missing-ACS-capabilities. Turn it on at boot:

pcie_acs_override=downstream,multifunction

and the coarse groups get split so that each device sits in its own group, which lets me isolate one USB controller from its neighbours. You need the multifunction part - downstream alone didn’t split mine, which a lot of people hit on Ryzen.
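It is easy to edit the bootloader config and then boot a kernel that never saw the change, so I like to confirm what the running kernel was actually given. A minimal sketch, assuming the parameter shows up in /proc/cmdline exactly as written above; the function name and the example command line in the comment are made up for illustration.

```python
def acs_override_options(cmdline):
    """Return the pcie_acs_override options found on a kernel command line."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "pcie_acs_override":
            return [option for option in value.split(",") if option]
    return []


if __name__ == "__main__":
    # Hypothetical example of what cmdline might hold:
    #   "quiet iommu=pt pcie_acs_override=downstream,multifunction"
    with open("/proc/cmdline") as handle:
        options = acs_override_options(handle.read())
    missing = {"downstream", "multifunction"} - set(options)
    print("options:", options)
    print("missing:", sorted(missing))
```

If missing is not empty after a reboot, the groups will not have been split the way this post describes, and it is worth checking the bootloader before blaming the patch.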

The ACS override doesn’t change the hardware. The devices are still physically grouped and still capable of peer-to-peer DMA that never goes through the IOMMU. The patch just changes how the groups appear to Linux so it’ll let you pass them through (the Proxmox wiki calls it a last resort and “not without risks”). You’re trading away isolation. For a homelab where I trust everything running on the box, fine. For a hostile multi-tenant setup, absolutely not.

Problem 2: the device that won’t reset

This one cost me several afternoons and a few headaches, and made me think my hardware was haunted (there were other incidents too). “The AMD reset bug” gets blamed for everything, so I’ll be precise about which one this was.

When a VM using a passed-through device stops, the host has to reset that device before it can be handed back out (or handed to the next VM). On a sane device that’s a Function Level Reset (FLR): one config write and the device comes back clean. The problem is that AMD’s Zen / Zen+ devices don’t reliably implement FLR - it’s a long-standing gap the VFIO community has grumbled about for years. When FLR isn’t available, the kernel falls back to a secondary bus reset on the parent PCIe bridge, which is a much bigger hammer.
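One way to see what a device claims about FLR is the FLReset flag that lspci -vv prints in the DevCap line of the PCI Express capability. The sketch below just parses that flag out of captured text. The sample string in the code is a made-up line in the lspci -vv style, not output from elysium, and a device advertising FLReset+ is only claiming support: it says nothing about whether the reset actually works.

```python
import re


def advertises_flr(lspci_vv_text):
    """Return True/False for FLReset+/FLReset-, or None if the flag is absent."""
    match = re.search(r"FLReset([+-])", lspci_vv_text)
    if match is None:
        # No PCI Express DevCap line in this text (or lspci lacked privileges).
        return None
    return match.group(1) == "+"


# Illustrative sample only, shaped like an lspci -vv DevCap entry.
SAMPLE_DEVCAP = (
    "DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited\n"
    "        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W\n"
)

if __name__ == "__main__":
    print(advertises_flr(SAMPLE_DEVCAP))
```

To use it for real, feed it the output of lspci -vv -s for the address you care about, run as root so the capability list is readable.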

A buggy Crosshair VII Hero BIOS combined with a kernel change broke that reset path. After a VM using a passed-through device shut down, the device wouldn’t come back. vfio-pci couldn’t reinitialise it, and the only way to recover was a full host reboot. Start a VM, shut it down, and now your whole hypervisor needs a power cycle to get the USB controller back. Useless.

The fix was the 0008-VCiYJ-ryzen-pci.patch I’d dropped into the custom kernel. It does two things to drivers/pci/pci.c:

--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -52,6 +52,9 @@ unsigned int pci_pm_d3_delay;
 
 static void pci_pme_list_scan(struct work_struct *work);
 
+static void pci_dev_save_and_disable(struct pci_dev *dev);
+static void pci_dev_restore(struct pci_dev *dev);
+
 static LIST_HEAD(pci_pme_list);
@@ -1379,15 +1382,7 @@ static void pci_restore_config_space(struct pci_dev *pdev)
 		pci_restore_config_space_range(pdev, 4, 9, 10, false);
 		pci_restore_config_space_range(pdev, 0, 3, 0, false);
 	} else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		pci_restore_config_space_range(pdev, 12, 15, 0, false);
-
-		/*
-		 * Force rewriting of prefetch registers to avoid S3 resume
-		 * issues on Intel PCI bridges that occur when these
-		 * registers are not explicitly written.
-		 */
-		pci_restore_config_space_range(pdev, 9, 11, 0, true);
-		pci_restore_config_space_range(pdev, 0, 8, 0, false);
+		pci_restore_config_space_range(pdev, 0, 15, 0, true);
 	} else {
 		pci_restore_config_space_range(pdev, 0, 15, 0, false);
 	}
@@ -4636,6 +4631,8 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
 
+	pci_dev_save_and_disable(dev);
+
 	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl);
 	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
 	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
@@ -4649,6 +4646,8 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
 	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
 	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
 
+	pci_dev_restore(dev);
+
 	/*
 	 * Trhfa for conventional PCI is 2^25 clock cycles.

A secondary bus reset toggles the BUS_RESET bit in the bridge’s control register, which resets everything on that bus and blows away the device’s config-space state. The mainline code had a bridge-specific dance in pci_restore_config_space that restored dwords 12-15 first, then force-rewrote the prefetch registers (9-11), then restored 0-8, all to dodge an Intel bridge S3-resume quirk. On these AMD bridges that ordering left the config space in a state that didn’t come back. The patch rips out the Intel-specific handling and restores the whole bridge config range (dwords 0-15) in one go, with the force flag set so every dword is rewritten rather than only the ones that look changed.
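The difference between the two restore strategies is easier to see as a list of which dwords get written. This is a toy Python model, not kernel code: it assumes the range helper walks from the end index down to the start and skips a dword whose current value already matches the saved one unless force is set, which is how I read the helper. All the names here are mine.

```python
def restore_range(current, saved, start, end, force, written):
    """Restore dwords end..start (descending), logging the indices written."""
    for index in range(end, start - 1, -1):
        if force or current[index] != saved[index]:
            current[index] = saved[index]
            written.append(index)


def restore_bridge_mainline(current, saved):
    """Bridge branch before the patch: three ranges, only 9-11 forced."""
    written = []
    restore_range(current, saved, 12, 15, False, written)
    restore_range(current, saved, 9, 11, True, written)
    restore_range(current, saved, 0, 8, False, written)
    return written


def restore_bridge_patched(current, saved):
    """Bridge branch after the patch: one forced pass over dwords 0-15."""
    written = []
    restore_range(current, saved, 0, 15, True, written)
    return written


if __name__ == "__main__":
    saved = list(range(100, 116))   # arbitrary stand-in values
    lost_one = list(saved)
    lost_one[4] = 0                 # pretend one dword was lost in a reset
    print(restore_bridge_mainline(list(lost_one), saved))
    print(restore_bridge_patched(list(lost_one), saved))
```

In the model both versions end with the same values in the array; what changes is that the patched version writes every dword unconditionally, so it does not depend on a read-back comparison being trustworthy right after a reset.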

The second change wraps the reset inside pci_reset_secondary_bus with pci_dev_save_and_disable() before and pci_dev_restore() after. The dev passed to that function is the bridge whose BUS_RESET bit gets toggled, so it is the bridge’s config state that is snapshotted, the bus is reset, and the state is written back. Same idea Alex Williamson later described in the vfio tree: if you reset around a device without preserving its state, “these original settings will be permanently lost.” Save it, reset, restore it, and the device comes back instead of staying dead until I reboot the whole host.
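The save, reset, restore pattern itself is simple enough to show with a toy. This Python sketch is not a model of real PCI hardware: ToyFunction is a hypothetical stand-in with 16 config dwords, and its reset just zeroes them, which is a deliberately crude assumption to make the lost state visible.

```python
class ToyFunction:
    """Toy stand-in for a PCI function: a list of config dwords and a snapshot."""

    def __init__(self, dwords):
        self.config = list(dwords)
        self.saved = None

    def save(self):
        # Snapshot the current config so it can be written back later.
        self.saved = list(self.config)

    def reset(self):
        # Crude assumption: a reset wipes every dword back to zero.
        self.config = [0] * len(self.config)

    def restore(self):
        # Write the snapshot back, if one was taken.
        if self.saved is not None:
            self.config = list(self.saved)


def reset_without_save(function):
    """Reset only: whatever was configured is gone afterwards."""
    function.reset()
    return function.config


def reset_with_save(function):
    """Save, reset, restore: the configured state survives the reset."""
    function.save()
    function.reset()
    function.restore()
    return function.config


if __name__ == "__main__":
    programmed = list(range(1, 17))
    print(reset_without_save(ToyFunction(programmed)))
    print(reset_with_save(ToyFunction(programmed)))
```

The first call leaves nothing but zeros, the second hands back the programmed values, which is the whole difference between a device that stays dead and one that comes back.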

It’s a hack on a hack - the proper fix is FLR support AMD never reliably shipped on Zen+, and this is just making the fallback bus reset survivable. But it got the box to where I could start a VM, shut it down, start it again, and not have to reboot Proxmox every single time.

Putting it together

The recipe: ACS override to split the IOMMU groups (0003, already in the pve-kernel), the Ryzen reset patch to make the bus reset stick (0008, mine), both baked into a custom pve-kernel, then the hostpciN lines above to hand the GPU, both USB controllers and the audio to the guest. The VM got a real GPU, USB ports I could plug a keyboard and mouse straight into, and working sound, on a box that’s still also my hypervisor.

If you want the build half - the submodule that wouldn’t fetch and the version string that kept growing a + - that’s the custom kernel post.