
Enabling GPUDirect P2P in OpenStack VMs

It started with a simple question: “Can we test NVIDIA’s GPUDirect P2P to see how much it helps with LLM inference?”

What seemed like an afternoon task turned into a multi-day journey through QEMU internals, libvirt XML quirks, and eventually, a wrapper script that intercepts hypervisor calls. The kind of problem where every layer of abstraction fights you, until you find the one place where you can actually make a change.

The solution ended up being surprisingly simple. But getting there required understanding why the obvious approaches don’t work.

With this solution, we achieved up to 137% higher bidirectional bandwidth and 97% lower GPU-to-GPU latency in OpenStack VMs.

Test hardware: MSI CG480-S6053 with 8x RTX PRO 6000 Blackwell GPUs


Why GPU P2P Is Disabled in OpenStack VMs

Standard OpenStack PCIe passthrough provides VMs direct access to physical GPUs, but GPU-to-GPU communication is disabled by default. Running nvidia-smi topo -p2p r inside a VM shows every GPU pair as NS (Not Supported):

      GPU0  GPU1  GPU2  GPU3  ...
GPU0   X    NS    NS    NS
GPU1   NS   X     NS    NS
GPU2   NS   NS    X     NS

Without P2P, all inter-GPU data transfers route through system RAM and the CPU. This is a significant bottleneck for multi-GPU workloads like distributed training and inference.


Verifying GPUDirect P2P Support on Bare Metal

First, confirm P2P works on bare metal. Install NVIDIA drivers directly on the host and run the topology check:

nvidia-smi topo -p2p r

      GPU0  GPU1  GPU2  GPU3  ...
GPU0   X    OK    OK    OK
GPU1   OK   X     OK    OK

All OK. The hardware supports P2P. NVIDIA’s P2P bandwidth test quantifies the difference:

Metric                     P2P Disabled   P2P Enabled   Improvement
Unidirectional Bandwidth   ~40 GB/s       ~53 GB/s      +32%
Bidirectional Bandwidth    ~43 GB/s       ~102 GB/s     +137%
GPU-to-GPU Latency         ~14 μs         ~0.45 μs      97% lower
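
To reproduce these numbers, the test used here is NVIDIA's p2pBandwidthLatencyTest from the cuda-samples repository. A minimal sketch; the exact path and build system vary by samples release:

# Build and run NVIDIA's P2P bandwidth/latency test on the host
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make                        # newer releases build via cmake instead
./p2pBandwidthLatencyTest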

The goal is to preserve these gains inside VMs.


QEMU x-nv-gpudirect-clique Parameter

QEMU supports a parameter called x-nv-gpudirect-clique that groups passthrough GPUs into “cliques.” GPUs assigned to the same clique can perform P2P transfers. This feature was discussed in a 2017 QEMU mailing list thread with additional reports on Reddit’s /r/VFIO.

Syntax:

-device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=06:00.0,x-nv-gpudirect-clique=0
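
For a quick sanity check outside OpenStack, you can boot a throwaway VM by hand with two GPUs in the same clique. A minimal sketch; the machine type, memory size, and disk image are placeholders, and both GPUs must already be bound to vfio-pci:

# Manual test launch (illustrative values; adjust to your host)
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -smp 8 -m 32G \
  -device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
  -device vfio-pci,host=06:00.0,x-nv-gpudirect-clique=0 \
  -drive file=test.qcow2,if=virtio \
  -nographic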

GPUs with the same clique ID can communicate directly. The challenge is getting this parameter into OpenStack-managed VMs.


Why Libvirt XML Configuration Fails

The obvious approach is to modify libvirt’s domain XML to inject the parameter.

<hostdev mode='subsystem' type='pci'>
  <source>
    <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
  </source>
  <qemu:commandline>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0'/>
  </qemu:commandline>
</hostdev>

Two problems make this approach impractical:

  1. Libvirt sanitizes XML. Custom parameters get reformatted or removed.
  2. OpenStack regenerates XML. Nova regenerates the entire domain XML from its templates on every VM launch.
🚫 This approach works for one-off testing but not for production deployments.


Solution: QEMU Wrapper Script for GPU P2P

The call chain is: OpenStack → Nova → libvirt → QEMU

At the end of that chain, a process executes qemu-system-x86_64 with the full set of parameters, which you can verify on the compute host:

ps aux | grep qemu-system-x86_64

The solution is to replace the QEMU binary with a wrapper script that:

  1. Intercepts all original arguments
  2. Identifies GPU passthrough devices
  3. Injects x-nv-gpudirect-clique based on PCIe topology
  4. Executes the real QEMU with modified arguments
⚠️ Before deploying, run nvidia-smi topo -p2p r on your host to check P2P support. GPUs showing OK should share the same clique ID; GPUs showing NS need different cliques or shouldn't use P2P.

Here’s the complete wrapper script:

#!/bin/bash
#
# QEMU Wrapper Script for NVIDIA GPU P2P (GPUDirect) Support
#
# How it works:
# 1. Intercepts all QEMU calls from libvirt/Nova
# 2. Scans for GPU passthrough devices (-device vfio-pci)
# 3. Injects x-nv-gpudirect-clique parameter based on PCIe topology
# 4. Executes the real QEMU with modified arguments

REAL_QEMU="/usr/bin/qemu-system-x86_64.real"
LOG_FILE="/var/log/qemu-p2p-wrapper.log"
ENABLE_LOGGING="${QEMU_P2P_WRAPPER_LOG:-0}"

# ============================================================
# GPU CLIQUE MAPPING - CUSTOMIZE THIS FOR YOUR HARDWARE
# ============================================================
# Run `nvidia-smi topo -p2p r` to find your GPU PCIe addresses
# GPUs that show "OK" for P2P should share the same clique ID
# Our 8 GPUs all support full P2P mesh → all in clique 0
declare -A GPU_CLIQUE_MAP=(
    ["0000:05:00.0"]="0"
    ["0000:06:00.0"]="0"
    ["0000:76:00.0"]="0"
    ["0000:77:00.0"]="0"
    ["0000:85:00.0"]="0"
    ["0000:86:00.0"]="0"
    ["0000:f4:00.0"]="0"
    ["0000:f5:00.0"]="0"
)

# ============================================================
# HELPER FUNCTIONS
# ============================================================
log_message() {
    if [[ "$ENABLE_LOGGING" == "1" ]]; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
    fi
}

get_clique_id() {
    local pcie_addr="$1"
    # Normalize address format
    if [[ ! "$pcie_addr" =~ ^0000: ]]; then
        pcie_addr="0000:$pcie_addr"
    fi
    if [[ -v GPU_CLIQUE_MAP["$pcie_addr"] ]]; then
        echo "${GPU_CLIQUE_MAP[$pcie_addr]}"
    else
        echo ""
    fi
}

modify_vfio_device() {
    local device_json="$1"
    # Only process vfio-pci devices (GPU passthrough)
    if [[ "$device_json" =~ \"driver\":\"vfio-pci\" ]]; then
        if [[ "$device_json" =~ \"host\":\"([0-9a-fA-F:\.]+)\" ]]; then
            local pcie_addr="${BASH_REMATCH[1]}"
            local clique_id=$(get_clique_id "$pcie_addr")
            if [[ -n "$clique_id" ]]; then
                log_message "Found GPU at $pcie_addr -> clique $clique_id"
                # Inject clique parameter if not already present
                if [[ ! "$device_json" =~ x-nv-gpudirect-clique ]]; then
                    device_json="${device_json%\}},\"x-nv-gpudirect-clique\":$clique_id}"
                    log_message "Modified device JSON: $device_json"
                fi
            else
                log_message "GPU at $pcie_addr not in clique map, skipping"
            fi
        fi
    fi
    echo "$device_json"
}

# ============================================================
# MAIN: Parse arguments and inject clique parameters
# ============================================================
log_message "=== QEMU P2P Wrapper started ==="
log_message "Original args: $*"

new_args=()
while [[ $# -gt 0 ]]; do
    arg="$1"
    case "$arg" in
        -device)
            shift
            if [[ $# -gt 0 ]]; then
                device_spec="$1"
                # QEMU device specs come as JSON objects
                if [[ "$device_spec" =~ ^\{.*\}$ ]]; then
                    modified_spec=$(modify_vfio_device "$device_spec")
                    new_args+=("-device" "$modified_spec")
                else
                    new_args+=("-device" "$device_spec")
                fi
            else
                new_args+=("-device")
            fi
            ;;
        *)
            new_args+=("$arg")
            ;;
    esac
    shift
done

log_message "Modified args: ${new_args[*]}"
log_message "Executing: $REAL_QEMU ${new_args[*]}"

# Hand off to real QEMU
exec "$REAL_QEMU" "${new_args[@]}"

Installing the QEMU Wrapper

# Backup original binary
sudo mv /usr/bin/qemu-system-x86_64 /usr/bin/qemu-system-x86_64.real

# Deploy wrapper (save the script above as qemu-wrapper.sh first)
sudo cp qemu-wrapper.sh /usr/bin/qemu-system-x86_64
sudo chmod +x /usr/bin/qemu-system-x86_64

# Restart services
sudo systemctl restart libvirtd nova-compute
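
After the swap, it's worth confirming that the path libvirt executes is now the wrapper and that the original binary is still intact:

file /usr/bin/qemu-system-x86_64            # should report a shell script, not an ELF binary
/usr/bin/qemu-system-x86_64.real --version  # real QEMU still runs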

To debug, set the QEMU_P2P_WRAPPER_LOG=1 environment variable and check /var/log/qemu-p2p-wrapper.log to see which devices are being modified.
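
Whether that variable actually reaches the wrapper depends on how libvirt builds the QEMU process environment, so a low-friction alternative is to flip the default in the deployed copy and watch the log while a VM starts. A sketch, assuming the script is installed unmodified:

# Change ENABLE_LOGGING's default from 0 to 1 in the deployed wrapper
sudo sed -i 's/QEMU_P2P_WRAPPER_LOG:-0/QEMU_P2P_WRAPPER_LOG:-1/' /usr/bin/qemu-system-x86_64

# Follow the log while launching or hard-rebooting a VM
sudo tail -f /var/log/qemu-p2p-wrapper.log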


GPUDirect P2P Performance Results

After deploying the wrapper, topology check inside the VM:

      GPU0  GPU1  GPU2  GPU3  ...
GPU0   X    OK    OK    OK
GPU1   OK   X     OK    OK

All OK. P2P is enabled.

Bandwidth comparison:

Metric                    VM (No P2P)    VM (With P2P)   Bare Metal
Bidirectional Bandwidth   ~42-55 GB/s    ~54-102 GB/s    ~54-102 GB/s
P2P Connectivity          ❌ None        ✅ Full mesh    ✅ Full mesh

With P2P enabled, VMs achieve up to 81% higher bidirectional bandwidth than the same VMs without P2P for closely connected GPU pairs.


Key Takeaways for GPU Virtualization

Bypass OpenStack and Libvirt Abstractions

Libvirt and OpenStack aren’t designed for this use case. Intercepting at the QEMU level is simpler than modifying higher-level components.

PCIe Topology Affects P2P Performance

Not all GPU pairs support P2P equally. Use nvidia-smi topo -m to check connection types:

  • PIX (same PCIe switch) → best P2P, same clique
  • NODE (same NUMA node, different PCIe host bridges) → P2P possible, test first
  • SYS (different NUMA nodes) → poor P2P, consider separate cliques (see the sketch below)
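
For a split topology, the wrapper's GPU_CLIQUE_MAP can carry more than one clique. A hypothetical example with two quads on different NUMA nodes; the addresses and grouping are illustrative, not the test system's actual layout:

# Hypothetical split: GPUs behind the first host bridge in clique 0,
# GPUs behind the second in clique 1 (addresses are illustrative)
declare -A GPU_CLIQUE_MAP=(
    ["0000:05:00.0"]="0" ["0000:06:00.0"]="0"
    ["0000:76:00.0"]="0" ["0000:77:00.0"]="0"
    ["0000:85:00.0"]="1" ["0000:86:00.0"]="1"
    ["0000:f4:00.0"]="1" ["0000:f5:00.0"]="1"
)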

VM vs Bare Metal GPU Performance

Even with P2P enabled, VMs reach ~75-85% of bare metal bandwidth. For latency-critical workloads, bare metal remains preferable.

QEMU Wrapper Maintenance

⚠️ The wrapper script must be reapplied after every QEMU package update. Include it in your infrastructure-as-code and set up alerts for package changes.
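
One simple way to catch a silent overwrite is a periodic check, via cron or a systemd timer, that verifies the installed binary is still the wrapper. A sketch; the marker string matches the header comment of the script above:

#!/bin/bash
# Alert if a package update replaced the wrapper with a real QEMU binary
if ! grep -q "QEMU Wrapper Script for NVIDIA GPU P2P" /usr/bin/qemu-system-x86_64 2>/dev/null; then
    logger -t qemu-p2p-check -p daemon.warning \
        "qemu-system-x86_64 is no longer the P2P wrapper (package update?)"
fi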


What’s Next

There’s a broader lesson here about virtualization and AI infrastructure.

Cloud providers abstract away hardware details for good reasons. But AI workloads are different. They push hardware to its limits in ways that general-purpose abstractions weren’t designed for. Sometimes you need to reach through those abstractions and touch the metal.

We’re planning to run comprehensive LLM training benchmarks comparing single GPU, multi-GPU without P2P, and multi-GPU with P2P. Based on the bandwidth improvements we’re seeing, we expect 15-30% better training throughput for data-parallel workloads and 10-20% lower inference latency for model parallelism.

The wrapper script is a hack. It works, it’s maintainable, and it unlocks real performance. Sometimes that’s what matters.

