Democratising the GPU (Part One): The Ansible role where the magic begins


Note: The content of this article relates to NVIDIA vGPU release 15.

When even your grandpa is asking you for some GPU resource to serve the latest language model, it's probably time to get serious about GPUs. But you don't want to give him that whole NVIDIA A100 80GB, right? That's where NVIDIA's virtual GPU technology comes in: one or more virtual GPUs can be created on a single physical card. Whilst this technology isn't new, it has always been complex to set up. With the new vgpu role in our stackhpc.linux Ansible collection, we've tried to make this easier.

Before you get started

vGPU functionality requires a commercial license. The guest instances need to be able to reach a license server, which can either be hosted by NVIDIA in the cloud (Cloud License Service, CLS) or on your own infrastructure (Delegated License Service, DLS). For more details, see this NVIDIA knowledge article. Contact NVIDIA for purchasing details.

To MIG or not to MIG

MIG stands for "Multi-Instance GPU". It is a way to split a supported card into multiple separate partitions, each with dedicated resources. This differs from legacy vGPUs, which use time slicing to schedule work onto the GPU. In MIG mode, NVIDIA promises deterministic latency and throughput, as each workload runs in parallel on its own partition. Sounds great, right? The cost is that you lose one of the GPU's eight compute slices to management overhead:

There are seven SM slices, not eight, because some SMs cover operational overhead when MIG mode is enabled.
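To make the slice arithmetic concrete, here is a small Python sketch that adds up the compute slices and memory requested by a set of MIG profiles, using the same "1g.10gb"-style names as the role's mig_devices variable. It assumes an A100 80GB in MIG mode (seven compute slices, 80 GB of memory) and deliberately ignores NVIDIA's placement rules, so passing this check is necessary but not sufficient for a layout to be valid. The function names are our own.

```python
import re
from typing import Dict, Tuple

# Assumed capacity of an A100 80GB in MIG mode: seven usable compute
# slices and 80 GB of memory (eight 10 GB memory slices).
COMPUTE_SLICES = 7
MEMORY_GB = 80

# Profile names such as "2g.20gb": <compute slices>g.<memory GB>gb
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def mig_usage(mig_devices: Dict[str, int]) -> Tuple[int, int]:
    """Return (compute slices, memory in GB) used by a mig_devices mapping."""
    compute = memory = 0
    for profile, count in mig_devices.items():
        match = PROFILE_RE.match(profile)
        if match is None:
            raise ValueError(f"unrecognised MIG profile: {profile!r}")
        compute += int(match.group(1)) * count
        memory += int(match.group(2)) * count
    return compute, memory


def fits_on_a100_80gb(mig_devices: Dict[str, int]) -> bool:
    """Necessary but not sufficient: NVIDIA's placement rules are ignored."""
    compute, memory = mig_usage(mig_devices)
    return compute <= COMPUTE_SLICES and memory <= MEMORY_GB


# The MIG layout used for card 0000:17:00.0 in the role example.
print(mig_usage({"1g.10gb": 1, "2g.20gb": 3}))   # (7, 70)
print(fits_on_a100_80gb({"4g.40gb": 2}))          # False: 8 compute slices
```

The layout in the role example uses all seven compute slices but only 70 GB of memory, so no further MIG device could be added to that card.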

Analysing the performance differences will be the focus of a future blog article, but for those who can't wait, here are some benchmarking results from VMware.

Ansible Role

Whether you choose MIG or time slicing for your vGPUs, the stackhpc.linux.vgpu role has you covered. It's published on Ansible Galaxy, so all you need to do is add the following snippet to your requirements.yml:

collections:
  - name: stackhpc.linux
    version: 1.0.1

Define some variables (e.g. in <ansible_inventory>/group_vars/vgpu):

# Path to GRID driver downloaded from the NVIDIA licensing portal
vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip"

#nvidia-692 GRID A100D-4C
#nvidia-693 GRID A100D-8C
#nvidia-694 GRID A100D-10C
#nvidia-695 GRID A100D-16C
#nvidia-696 GRID A100D-20C
#nvidia-697 GRID A100D-40C
#nvidia-698 GRID A100D-80C
#nvidia-699 GRID A100D-1-10C
#nvidia-700 GRID A100D-2-20C
#nvidia-701 GRID A100D-3-40C
#nvidia-702 GRID A100D-4-40C
#nvidia-703 GRID A100D-7-80C
#nvidia-707 GRID A100D-1-10CME
vgpu_definitions:
    # Configuring a MIG backed vGPU
    - pci_address: "0000:17:00.0"
      virtual_functions:
        - mdev_type: nvidia-700
          index: 0
        - mdev_type: nvidia-700
          index: 1
        - mdev_type: nvidia-700
          index: 2
        - mdev_type: nvidia-699
          index: 3
      mig_devices:
        "1g.10gb": 1
        "2g.20gb": 3
    # Configuring a card in a time-sliced configuration (non-MIG backed)
    - pci_address: "0000:65:00.0"
      virtual_functions:
        - mdev_type: nvidia-697
          index: 0
        - mdev_type: nvidia-697
          index: 1
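Getting virtual_functions and mig_devices out of step is an easy mistake to make. The following is a hypothetical pre-flight check, not part of the stackhpc.linux.vgpu role, that mirrors the YAML above as Python dictionaries (to avoid a PyYAML dependency). It derives each MIG-backed type's profile from its vGPU type name (A100D-<compute slices>-<memory GB>C) and assumes one virtual function per MIG device, as in the example. The MDEV_TYPE_NAMES table is copied from the comments above; check mdevctl types on your own hypervisor, as type IDs vary between cards and driver releases.

```python
import re
from collections import Counter

# Subset of the mdev type -> vGPU type table from the comments above
# (A100 80GB, vGPU release 15). IDs differ between cards and releases.
MDEV_TYPE_NAMES = {
    "nvidia-697": "GRID A100D-40C",
    "nvidia-699": "GRID A100D-1-10C",
    "nvidia-700": "GRID A100D-2-20C",
    "nvidia-701": "GRID A100D-3-40C",
    "nvidia-702": "GRID A100D-4-40C",
    "nvidia-703": "GRID A100D-7-80C",
}

# MIG-backed names look like A100D-<compute slices>-<memory GB>C;
# time-sliced names (e.g. A100D-40C) carry no slice count.
MIG_NAME_RE = re.compile(r"A100D-(\d+)-(\d+)C$")


def mig_profile(mdev_type):
    """Return the MIG profile (e.g. "2g.20gb") for a MIG-backed type, else None."""
    match = MIG_NAME_RE.search(MDEV_TYPE_NAMES[mdev_type])
    if match is None:
        return None
    return f"{match.group(1)}g.{match.group(2)}gb"


def check_definition(definition):
    """Return a list of problems found in one vgpu_definitions entry."""
    problems = []
    functions = definition["virtual_functions"]
    indices = [vf["index"] for vf in functions]
    if len(indices) != len(set(indices)):
        problems.append("duplicate virtual function index")
    profiles = [mig_profile(vf["mdev_type"]) for vf in functions]
    if "mig_devices" in definition:
        if None in profiles:
            problems.append("time-sliced mdev type on a MIG-backed card")
        # Assumes exactly one virtual function per MIG device.
        requested = Counter(p for p in profiles if p is not None)
        if requested != Counter(definition["mig_devices"]):
            problems.append("virtual functions do not match mig_devices")
    elif any(p is not None for p in profiles):
        problems.append("MIG-backed mdev type on a card without mig_devices")
    return problems


# The two cards from the vgpu_definitions example above.
vgpu_definitions = [
    {
        "pci_address": "0000:17:00.0",
        "virtual_functions": [
            {"mdev_type": "nvidia-700", "index": 0},
            {"mdev_type": "nvidia-700", "index": 1},
            {"mdev_type": "nvidia-700", "index": 2},
            {"mdev_type": "nvidia-699", "index": 3},
        ],
        "mig_devices": {"1g.10gb": 1, "2g.20gb": 3},
    },
    {
        "pci_address": "0000:65:00.0",
        "virtual_functions": [
            {"mdev_type": "nvidia-697", "index": 0},
            {"mdev_type": "nvidia-697", "index": 1},
        ],
    },
]

for definition in vgpu_definitions:
    print(definition["pci_address"], check_definition(definition) or "OK")
```

Both cards in the example pass; changing mig_devices to, say, {"2g.20gb": 4} on the first card would be reported as a mismatch.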

and run this simple playbook:

---

  - hosts: vgpu
    tags:
      - iommu
    tasks:
      - import_role:
          name: stackhpc.linux.iommu
    handlers:
      - name: reboot
        reboot:
          reboot_timeout: 3600
        become: true

  - hosts: vgpu
    tags:
      - vgpu
    tasks:
      - import_role:
          name: stackhpc.linux.vgpu
    handlers:
      - name: reboot
        reboot:
          reboot_timeout: 3600
        become: true

Could it get much easier? See the role documentation for more information on the configuration options, as well as a more detailed walkthrough of the steps needed to get everything fully configured.

It's Apache 2.0 licensed, so happy hacking, and don't forget to contribute any useful changes back.

OpenStack Config

Of course, creating the mediated devices on the host is not enough: you have to pass them through to a virtual machine to make use of them. Since we are an OpenStack shop, we will use OpenStack Nova as our example. Add the appropriate snippet below to the nova.conf of your nova-compute service; the first is for a hypervisor with MIG-backed vGPUs, the second for one with time-sliced vGPUs:

# nova.conf on a hypervisor with two MIG-backed cards
[devices]
enabled_mdev_types = nvidia-700, nvidia-699

[mdev_nvidia-700]
device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6
mdev_class = CUSTOM_NVIDIA_700

[mdev_nvidia-699]
device_addresses = 0000:21:00.7,0000:81:00.7
mdev_class = CUSTOM_NVIDIA_699

# nova.conf on a different hypervisor with two time-sliced cards
[devices]
enabled_mdev_types = nvidia-697

[mdev_nvidia-697]
device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5
# Custom resource classes don't work when only a single mdev type is enabled.
mdev_class = VGPU

You will need to adjust the PCI addresses to match those of your own vGPU virtual functions. These can be obtained by listing the mdevctl configuration after running the role:

# mdevctl list

73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined)
dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined)
a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined)
f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined)
330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-697 manual (defined)
1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-697 manual (defined)

The mdev_class maps to a resource class that you can request in your flavor definition. Note that if you only define a single mdev type on a given hypervisor, the mdev_class configuration option is silently ignored and the generic VGPU resource class is used instead (possibly a bug).
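Rather than copying addresses by hand, you could generate the nova.conf sections from the mdevctl output. The sketch below assumes the column layout shown above (UUID, parent PCI address, mdev type, ...), exposes every listed device to Nova, and follows the CUSTOM_NVIDIA_<id> naming from our nova.conf example; with a single mdev type it writes VGPU, matching the behaviour described above. The helper names are our own, not part of Nova or the role.

```python
def parse_mdevctl_list(output):
    """Map each mdev type to its parent PCI addresses, in listed order."""
    by_type = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _uuid, parent, mdev_type = fields[:3]
        addresses = by_type.setdefault(mdev_type, [])
        if parent not in addresses:
            addresses.append(parent)
    return by_type


def resource_class(mdev_type, single_type):
    """A custom class per type, or VGPU when only one type is enabled."""
    if single_type:
        return "VGPU"
    return "CUSTOM_" + mdev_type.upper().replace("-", "_")


def render_nova_devices(by_type):
    """Render the [devices] and [mdev_*] sections of nova.conf."""
    single_type = len(by_type) == 1
    lines = ["[devices]", "enabled_mdev_types = " + ", ".join(by_type)]
    for mdev_type, addresses in by_type.items():
        lines += [
            "",
            f"[mdev_{mdev_type}]",
            "device_addresses = " + ",".join(addresses),
            f"mdev_class = {resource_class(mdev_type, single_type)}",
        ]
    return "\n".join(lines) + "\n"


# Output of `mdevctl list` from the example above.
SAMPLE = """\
73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined)
dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined)
a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined)
f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined)
330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-697 manual (defined)
1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-697 manual (defined)
"""

print(render_nova_devices(parse_mdevctl_list(SAMPLE)), end="")
```

On a live hypervisor you could pass subprocess.run(["mdevctl", "list"], capture_output=True, text=True).stdout instead of the pasted sample, then review the result before copying it into nova.conf.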

OpenStack Flavors

Define some flavors that request the resource class configured in nova.conf. An example definition, which can be used with the openstack.cloud.compute_flavor Ansible module, is shown below:

vgpu_a100_2g_20gb:
  name: "vgpu.a100.2g.20gb"
  ram: 65536
  disk: 30
  vcpus: 8
  is_public: false
  extra_specs:
    hw:cpu_policy: "dedicated"
    hw:cpu_thread_policy: "prefer"
    hw:mem_page_size: "1GB"
    hw:cpu_sockets: 2
    hw:numa_nodes: 8
    hw_rng:allowed: "True"
    resources:CUSTOM_NVIDIA_700: "1"
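The flavor above asks for eight guest NUMA nodes backed by 1GB pages. As we understand Nova's behaviour, without explicit hw:numa_cpus.N / hw:numa_mem.N extra specs the vCPUs and RAM are split evenly across the nodes, and each node's memory must be a whole number of pages. This quick sketch checks that arithmetic before you create a flavor; the numa_split helper is hypothetical, not a Nova API.

```python
def numa_split(vcpus, ram_mb, numa_nodes, page_size_mb=1024):
    """Return (vCPUs, RAM in MB) per guest NUMA node for an even split.

    Sketch of our understanding of Nova: without hw:numa_cpus.N /
    hw:numa_mem.N, resources are divided evenly between nodes, and with an
    explicit page size each node needs a whole number of pages.
    """
    if vcpus % numa_nodes or ram_mb % numa_nodes:
        raise ValueError("vCPUs and RAM must divide evenly across NUMA nodes")
    per_node_mb = ram_mb // numa_nodes
    if per_node_mb % page_size_mb:
        raise ValueError(f"{per_node_mb} MB per node is not a whole number of pages")
    return vcpus // numa_nodes, per_node_mb


# Values from the vgpu.a100.2g.20gb flavor: 8 vCPUs, 64 GiB, 8 nodes, 1GB pages.
print(numa_split(vcpus=8, ram_mb=65536, numa_nodes=8))  # (1, 8192)
```

With these values each guest NUMA node gets a single vCPU and eight 1GB pages.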

You should now be able to launch a VM with this flavor.

Wait! A Spanner in the Works

Just when we thought we'd got this vGPU thing sussed, NVIDIA threw a spanner in the works. Starting with vGPU release 16 of the GRID drivers (July 2023), NVIDIA dropped support for the following cards:

Graphics cards that support only C-series vGPUs, namely:

NVIDIA H800 PCIe 80GB
NVIDIA H100 PCIe 80GB
NVIDIA A800 PCIe 80GB
NVIDIA A800 PCIe 80GB liquid cooled
NVIDIA A800 HGX 80GB
NVIDIA A100 PCIe 80GB
NVIDIA A100 PCIe 80GB liquid cooled
NVIDIA A100X
NVIDIA A100 HGX 80GB
NVIDIA A100 PCIe 40GB
NVIDIA A100 HGX 40GB
NVIDIA A30
NVIDIA A30X

Instead, these graphics cards are supported with NVIDIA AI Enterprise.

A full table can be found here. This means that you can no longer use your Virtual Compute Server license and will need to get it upgraded to the NVIDIA AI Enterprise equivalent (at least if you want driver updates)! You also need to install alternative drivers on the hypervisor. We will follow up with details of the technical implications of this change in a subsequent blog article.

Future Blogs

This is the first blog in a series. The following articles are in the pipeline:

vGPUs in OpenStack with Kayobe
Switching to NVIDIA AI Enterprise
Dynamic vGPU partitioning on OpenStack with Cyborg
vGPU Performance shootout: MIG vs time slicing

Stay tuned...

Get in touch

If you would like to get in touch, we would love to hear from you. Reach out to us via Bluesky, LinkedIn or directly via our contact page.

StackHPC
