Enabling GPUDirect P2P in OpenStack VMs
It started with a simple question: “Can we test NVIDIA’s GPUDirect P2P to see how much it helps with LLM inference?”
What seemed like an afternoon task turned into a multi-day journey through QEMU internals, libvirt XML quirks, and eventually, a wrapper script that intercepts hypervisor calls. The kind of problem where every layer of abstraction fights you, until you find the one place where you can actually make a change.
The solution ended up being surprisingly simple. But getting there required understanding why the obvious approaches don’t work.
With this solution, we achieved up to 137% higher bidirectional bandwidth and 97% lower GPU-to-GPU latency in OpenStack VMs.
Test hardware: MSI CG480-S6053 with 8x RTX PRO 6000 Blackwell GPUs
Why GPU P2P Is Disabled in OpenStack VMs
Standard OpenStack PCIe passthrough provides VMs direct access to physical GPUs, but GPU-to-GPU communication is disabled by default. Running nvidia-smi topo -p2p r inside a VM shows every GPU pair as NS (Not Supported):
GPU0 GPU1 GPU2 GPU3 ...
GPU0 X NS NS NS
GPU1 NS X NS NS
GPU2 NS NS X NS
Without P2P, all inter-GPU data transfers route through system RAM and the CPU. This is a significant bottleneck for multi-GPU workloads like distributed training and inference.
Verifying GPUDirect P2P Support on Bare Metal
First, confirm P2P works on bare metal. Install NVIDIA drivers directly on the host and run the topology check:
nvidia-smi topo -p2p r
GPU0 GPU1 GPU2 GPU3 ...
GPU0 X OK OK OK
GPU1 OK X OK OK
All OK. The hardware supports P2P. NVIDIA’s P2P bandwidth test quantifies the difference:
| Metric | P2P Disabled | P2P Enabled | Improvement |
|---|---|---|---|
| Unidirectional Bandwidth | ~40 GB/s | ~53 GB/s | +32% |
| Bidirectional Bandwidth | ~43 GB/s | ~102 GB/s | +137% |
| GPU-to-GPU Latency | ~14μs | ~0.45μs | 97% lower |
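Numbers like these are typically measured with NVIDIA’s p2pBandwidthLatencyTest from the cuda-samples repository. A minimal sketch of building and running it, assuming a CUDA toolkit and CMake are installed (the sample’s path and build system vary between cuda-samples releases):
# Build and run NVIDIA's P2P bandwidth/latency test
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
cmake -B build && cmake --build build
# Prints per-GPU-pair bandwidth and latency matrices, with and without P2P
./build/p2pBandwidthLatencyTest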
The goal is to preserve these gains inside VMs.
QEMU x-nv-gpudirect-clique Parameter
QEMU supports a parameter called x-nv-gpudirect-clique that groups passthrough GPUs into “cliques.” GPUs assigned to the same clique can perform P2P transfers. This feature was discussed in a 2017 QEMU mailing list thread, with additional reports on Reddit’s /r/VFIO.
Syntax:
-device vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0 \
-device vfio-pci,host=06:00.0,x-nv-gpudirect-clique=0
GPUs with the same clique ID can communicate directly. The challenge is getting this parameter into OpenStack-managed VMs.
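Before wiring this into OpenStack, the parameter can be sanity-checked with a hand-rolled QEMU invocation. A minimal sketch, assuming both GPUs are already bound to vfio-pci and test.qcow2 is a throwaway guest image (addresses, memory, and disk are placeholders):
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -smp 8 -m 16G \
  -device vfio-pci,host=0000:05:00.0,x-nv-gpudirect-clique=0 \
  -device vfio-pci,host=0000:06:00.0,x-nv-gpudirect-clique=0 \
  -drive file=test.qcow2,if=virtio \
  -nographic
If the guest boots, nvidia-smi topo -p2p r inside it should report OK for the paired GPUs; if QEMU rejects the property outright, the build is too old to support it.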
Why Libvirt XML Configuration Fails
The obvious approach is to modify libvirt’s domain XML to inject the parameter.
<hostdev mode='subsystem' type='pci'>
  <source>
    <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
  </source>
</hostdev>
<!-- <qemu:commandline> must be a direct child of <domain>, which also needs
     xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0' declared -->
<qemu:commandline>
  <qemu:arg value='-device'/>
  <qemu:arg value='vfio-pci,host=05:00.0,x-nv-gpudirect-clique=0'/>
</qemu:commandline>
Two problems make this approach impractical:
- Libvirt sanitizes XML. Custom parameters get reformatted or removed.
- OpenStack regenerates XML. Nova regenerates the entire domain XML from its templates on every VM launch.
This approach works for one-off testing but not for production deployments.
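If you do go the XML route for a one-off test, a quick way to see whether your hand-added arguments survived libvirt’s round-trip is to dump the live definition (the domain name here is a placeholder):
# Empty output means libvirt or Nova dropped the custom parameter
virsh dumpxml instance-00000001 | grep -i gpudirect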
Solution: QEMU Wrapper Script for GPU P2P
The call chain is: OpenStack → Nova → libvirt → QEMU
At the end, something executes qemu-system-x86_64 with all parameters. Verifiable via:
ps aux | grep qemu-system-x86_64
The solution is to replace the QEMU binary with a wrapper script that:
- Intercepts all original arguments
- Identifies GPU passthrough devices
- Injects x-nv-gpudirect-clique based on PCIe topology
- Executes the real QEMU with modified arguments
Before deploying, run nvidia-smi topo -p2p r on your host to check P2P support. GPUs showing OK should share the same clique ID. GPUs showing NS need different cliques or shouldn’t use P2P.
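To collect the PCIe addresses for the clique map in the script below, nvidia-smi can list each GPU’s bus ID on the host. Note that recent drivers print an 8-hex-digit domain (e.g. 00000000:05:00.0), which needs shortening to the 0000:05:00.0 form the script expects:
# List GPU index and PCIe bus ID for every GPU on the host
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader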
Here’s the complete wrapper script:
#!/bin/bash
#
# QEMU Wrapper Script for NVIDIA GPU P2P (GPUDirect) Support
#
# How it works:
# 1. Intercepts all QEMU calls from libvirt/Nova
# 2. Scans for GPU passthrough devices (-device vfio-pci)
# 3. Injects x-nv-gpudirect-clique parameter based on PCIe topology
# 4. Executes the real QEMU with modified arguments
REAL_QEMU="/usr/bin/qemu-system-x86_64.real"
LOG_FILE="/var/log/qemu-p2p-wrapper.log"
ENABLE_LOGGING="${QEMU_P2P_WRAPPER_LOG:-0}"
# ============================================================
# GPU CLIQUE MAPPING - CUSTOMIZE THIS FOR YOUR HARDWARE
# ============================================================
# Run `nvidia-smi topo -p2p r` to find your GPU PCIe addresses
# GPUs that show "OK" for P2P should share the same clique ID
# Our 8 GPUs all support full P2P mesh → all in clique 0
declare -A GPU_CLIQUE_MAP=(
["0000:05:00.0"]="0"
["0000:06:00.0"]="0"
["0000:76:00.0"]="0"
["0000:77:00.0"]="0"
["0000:85:00.0"]="0"
["0000:86:00.0"]="0"
["0000:f4:00.0"]="0"
["0000:f5:00.0"]="0"
)
# ============================================================
# HELPER FUNCTIONS
# ============================================================
log_message() {
if [[ "$ENABLE_LOGGING" == "1" ]]; then
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
fi
}
get_clique_id() {
local pcie_addr="$1"
# Normalize address format
if [[ ! "$pcie_addr" =~ ^0000: ]]; then
pcie_addr="0000:$pcie_addr"
fi
if [[ -v GPU_CLIQUE_MAP["$pcie_addr"] ]]; then
echo "${GPU_CLIQUE_MAP[$pcie_addr]}"
else
echo ""
fi
}
modify_vfio_device() {
local device_json="$1"
# Only process vfio-pci devices (GPU passthrough)
if [[ "$device_json" =~ \"driver\":\"vfio-pci\" ]]; then
if [[ "$device_json" =~ \"host\":\"([0-9a-fA-F:\.]+)\" ]]; then
local pcie_addr="${BASH_REMATCH[1]}"
local clique_id=$(get_clique_id "$pcie_addr")
if [[ -n "$clique_id" ]]; then
log_message "Found GPU at $pcie_addr -> clique $clique_id"
# Inject clique parameter if not already present
if [[ ! "$device_json" =~ x-nv-gpudirect-clique ]]; then
device_json="${device_json%\}},\"x-nv-gpudirect-clique\":$clique_id}"
log_message "Modified device JSON: $device_json"
fi
else
log_message "GPU at $pcie_addr not in clique map, skipping"
fi
fi
fi
echo "$device_json"
}
# ============================================================
# MAIN: Parse arguments and inject clique parameters
# ============================================================
log_message "=== QEMU P2P Wrapper started ==="
log_message "Original args: $*"
new_args=()
while [[ $# -gt 0 ]]; do
arg="$1"
case "$arg" in
-device)
shift
if [[ $# -gt 0 ]]; then
device_spec="$1"
# QEMU device specs come as JSON objects
if [[ "$device_spec" =~ ^\{.*\}$ ]]; then
modified_spec=$(modify_vfio_device "$device_spec")
new_args+=("-device" "$modified_spec")
else
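# Non-JSON (traditional key=value) device specs are passed through unmodified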
new_args+=("-device" "$device_spec")
fi
else
new_args+=("-device")
fi
;;
*)
new_args+=("$arg")
;;
esac
shift
done
log_message "Modified args: ${new_args[*]}"
log_message "Executing: $REAL_QEMU ${new_args[*]}"
# Hand off to real QEMU
exec "$REAL_QEMU" "${new_args[@]}"
Installing the QEMU Wrapper
# Backup original binary
sudo mv /usr/bin/qemu-system-x86_64 /usr/bin/qemu-system-x86_64.real
# Deploy wrapper (save the script above as qemu-wrapper.sh first)
sudo cp qemu-wrapper.sh /usr/bin/qemu-system-x86_64
sudo chmod +x /usr/bin/qemu-system-x86_64
# Restart services
sudo systemctl restart libvirtd nova-compute
To debug, set the QEMU_P2P_WRAPPER_LOG=1 environment variable (or temporarily change the ENABLE_LOGGING default in the wrapper to 1) and check /var/log/qemu-p2p-wrapper.log to see which devices are being modified.
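Once a GPU instance is started (or hard-rebooted so Nova rebuilds its command line), a quick way to confirm the wrapper actually injected the parameter is to inspect the live QEMU processes:
# One match per GPU device that received the clique parameter
ps aux | grep '[q]emu-system-x86_64' | grep -o 'x-nv-gpudirect-clique' | wc -l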
GPUDirect P2P Performance Results
After deploying the wrapper, topology check inside the VM:
GPU0 GPU1 GPU2 GPU3 ...
GPU0 X OK OK OK
GPU1 OK X OK OK
All OK. P2P is enabled.
Bandwidth comparison:
| Metric | VM (No P2P) | VM (With P2P) | Bare Metal |
|---|---|---|---|
| Bidirectional Bandwidth | ~42-55 GB/s | ~54-102 GB/s | ~54-102 GB/s |
| P2P Connectivity | ❌ None | ✅ Full mesh | ✅ Full mesh |
VMs achieve up to 81% higher bidirectional bandwidth for closely connected GPU pairs.
Key Takeaways for GPU Virtualization
Bypass OpenStack and Libvirt Abstractions
Libvirt and OpenStack aren’t designed for this use case. Intercepting at the QEMU level is simpler than modifying higher-level components.
PCIe Topology Affects P2P Performance
Not all GPU pairs support P2P equally. Use nvidia-smi topo -m to check connection types:
- PIX (same switch) → best P2P, same clique
- NODE (different PCIe host bridges within the same NUMA node) → possible P2P, test first
- SYS (different NUMA nodes) → poor P2P, consider separate cliques (see the sketch below)
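As a hypothetical example of the last case, a host whose GPUs hang off two different NUMA nodes might keep P2P within each island by splitting the map into two cliques (addresses are placeholders, not our hardware):
# Hypothetical split-topology mapping: P2P stays within each NUMA island
declare -A GPU_CLIQUE_MAP=(
    ["0000:05:00.0"]="0"   # NUMA node 0
    ["0000:06:00.0"]="0"   # NUMA node 0
    ["0000:85:00.0"]="1"   # NUMA node 1
    ["0000:86:00.0"]="1"   # NUMA node 1
)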
VM vs Bare Metal GPU Performance
Even with P2P enabled, VMs reach ~75-85% of bare metal bandwidth. For latency-critical workloads, bare metal remains preferable.
QEMU Wrapper Maintenance
The wrapper script must be reapplied after QEMU package updates. Include it in infrastructure-as-code and set up alerts for package changes.
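On Debian/Ubuntu hosts, dpkg-divert is a cleaner way to keep the wrapper in place across package upgrades than re-running the manual mv/cp steps. A sketch, assuming the same paths as above and that the packaged binary is still at its original location:
# Divert the packaged binary to the .real path; future package upgrades
# will install qemu-system-x86_64 there instead of overwriting the wrapper
sudo dpkg-divert --add --rename \
  --divert /usr/bin/qemu-system-x86_64.real /usr/bin/qemu-system-x86_64
sudo cp qemu-wrapper.sh /usr/bin/qemu-system-x86_64
sudo chmod +x /usr/bin/qemu-system-x86_64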
What’s Next
There’s a broader lesson here about virtualization and AI infrastructure.
Cloud providers abstract away hardware details for good reasons. But AI workloads are different. They push hardware to its limits in ways that general-purpose abstractions weren’t designed for. Sometimes you need to reach through those abstractions and touch the metal.
We’re planning to run comprehensive LLM training benchmarks comparing single GPU, multi-GPU without P2P, and multi-GPU with P2P. Based on the bandwidth improvements we’re seeing, we expect 15-30% better training throughput for data-parallel workloads and 10-20% lower inference latency for model parallelism.
The wrapper script is a hack. It works, it’s maintainable, and it unlocks real performance. Sometimes that’s what matters.