A Brief Introduction to Single-Root I/O Virtualisation (SR-IOV)
Setup for SR-IOV
Aside from OpenStack, deployment of SR-IOV involves configuration at many levels.
The BIOS needs to be configured to enable both Virtualization Technology and SR-IOV.
Mellanox NIC firmware must be configured to enable the creation of SR-IOV VFs and define the maximum number of VFs to support. This requires the installation of the Mellanox Firmware Tools (MFT) package from Mellanox OFED.
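As an illustrative sketch using MFT's mst and mlxconfig tools (the MST device path and VF count here are examples; list devices with mst status):

# Start Mellanox Software Tools to expose the configuration device
mst start
# Enable SR-IOV in firmware and set the maximum number of VFs
mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=16

A reboot (or firmware reset) is required before the new firmware settings take effect.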
Kernel boot parameters are required to support direct access to SR-IOV hardware:
intel_iommu=on iommu=pt
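On a system booting with GRUB, one way to apply these (a sketch; file locations vary by distribution) is to append them to the kernel command line and regenerate the GRUB configuration:

# /etc/default/grub: append to the existing GRUB_CMDLINE_LINUX value
GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
# Regenerate the GRUB configuration (CentOS example), then reboot
grub2-mkconfig -o /boot/grub2/grub.cfg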
A number of VFs can be created by writing the required number to a file under /sys, for example: /sys/class/net/eno6/device/sriov_numvfs. This is typically done as a udev trigger script on insertion of the PF device. The upper limit on the number of VFs is given by another (read-only) file in the same directory, sriov_totalvfs.
NOTE: Certain NIC models (e.g. Mellanox ConnectX-3) do not support management via sysfs; these need to be configured using modprobe instead (see the modprobe.d man page).
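For example, to check the limit and then create four VFs on the example PF eno6 (run as root; the VF count is illustrative):

cat /sys/class/net/eno6/device/sriov_totalvfs
echo 4 > /sys/class/net/eno6/device/sriov_numvfs

For a NIC managed via modprobe, the equivalent is a module option, sketched here for the mlx4 driver:

# /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=4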
As a framework for managing infrastructure using infrastructure-as-code principles and Ansible at every level, Kayobe provides support for running custom Ansible playbooks against the inventory and groups of the infrastructure deployment. Over time StackHPC has developed a number of roles to perform additional configuration as custom site playbooks. A recent addition is a Galaxy role for SR-IOV setup, stackhpc.sriov.
A simple custom site playbook could look like this:
---
- name: Configure SR-IOV
  hosts: compute_sriov
  tasks:
    - include_role:
        name: stackhpc.sriov
  handlers:
    - name: reboot
      include_tasks: tasks/reboot.yml
      tags: reboot
...
This playbook would then be invoked from the Kayobe CLI:
(kayobe) $ kayobe playbook run sriov.yml
Once the system is prepared for supporting SR-IOV, OpenStack configuration is required to enable VF resource management, scheduling according to VF availability, and pass-through of the VF to VMs that request it.
SR-IOV and LAGs
An additional complication arises when hypervisors use bonded NICs to provide network access for VMs, giving greater fault tolerance. However, a VF is normally associated with only one PF, and with two PFs in a bond this would lead to inconsistent connectivity.
Mellanox NICs have a feature, VF-LAG, which claims to enable SR-IOV to work in configurations where the ports of a 2-port NIC are bonded together.
Setup for VF-LAG requires additional steps and complexities, and we'll be covering it in greater detail in another blog post soon.
Nova Configuration
Scheduling with Hardware Resource Awareness
SR-IOV VFs are managed in the same way as PCI-passthrough hardware (e.g. GPUs): each VF is treated as a hardware resource. The Nova scheduler must be configured so that instances requesting SR-IOV resources are not scheduled to hypervisors with none available. This is done using the PciPassthroughFilter scheduler filter.
In Kayobe config, the Nova scheduler filters are configured by defining non-default parameters in nova.conf. In the kayobe-config repo, add this to etc/kayobe/kolla/config/nova.conf:
[filter_scheduler]
available_filters = nova.scheduler.filters.all_filters
enabled_filters = other-filters,PciPassthroughFilter
(The other filters listed may vary according to other configuration applied to the system).
Hypervisor Hardware Resources for Passthrough
The nova-compute service on each hypervisor requires configuration to define which hardware/VF resources are to be made available for passthrough to VMs. In addition, for infrastructure with multiple physical networks, an association must be made between VFs and the physical network to which they connect. This is done by defining a whitelist (pci_passthrough_whitelist) of available hardware resources on the compute hypervisors.

This can be tricky to configure in an environment with multiple variants of hypervisor hardware specification, where the available resources differ between hosts. One solution using Kayobe's inventory is to define whitelist hardware mappings either globally, in group variables, or even in individual host variables, as follows:
# Physnet to device mappings for SR-IOV, used for the pci
# passthrough whitelist and sriov-agent configs
sriov_physnet_mappings:
  p4p1: physnet2
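For example (the paths and the per-host override below are illustrative), the mapping could be set for a group and overridden for a host with different hardware:

# etc/kayobe/inventory/group_vars/compute_sriov/sriov
sriov_physnet_mappings:
  p4p1: physnet2
# etc/kayobe/inventory/host_vars/compute123 (override for one host)
sriov_physnet_mappings:
  ens2f0: physnet2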
This state can then be applied by adding a macro-expanded term to etc/kayobe/kolla/config/nova.conf:
[pci]
passthrough_whitelist = [{% for dev, physnet in sriov_physnet_mappings.items() %}{{ (loop.index0 > 0)|ternary(',','') }}{ "devname": "{{ dev }}", "physical_network": "{{ physnet }}" }{% endfor %}]
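With the example mapping above, this renders in nova.conf to:

[pci]
passthrough_whitelist = [{ "devname": "p4p1", "physical_network": "physnet2" }]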
We have used the network device name for designation here, but other options are available:
- devname: network device name (as used above).
- address: PCI bus address, taking the form [[[[<domain>]:]<bus>]:][<slot>][.[<function>]]. This is a good way of unambiguously selecting a single device in the hardware device tree.
- address: MAC address. Can be wild-carded, which is useful if the vendor of the SR-IOV NIC is different from all other NICs in the configuration, so that selection can be made by OUI.
- vendor_id and product_id: PCI vendor and device IDs. A good option for selecting a single hardware device model, wherever the devices are located. These values are 4-digit hexadecimal (but the conventional 0x prefix is not required).
The vendor ID and device ID are available from lspci -nn (or lspci -x for the hard-core). The IDs supplied should be those of the virtual function (VF), not the physical function, which may be slightly different.
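As a sketch, alternative whitelist entries using these selectors might look like the following (the PCI address and the Mellanox ConnectX-5 VF IDs, 15b3:1018, are illustrative):

[pci]
# By PCI bus address (wildcards are permitted):
passthrough_whitelist = { "address": "0000:5e:*.*", "physical_network": "physnet2" }
# Or by vendor and product ID of the VF:
passthrough_whitelist = { "vendor_id": "15b3", "product_id": "1018", "physical_network": "physnet2" }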
Neutron Configuration
Support for the Neutron SR-IOV NIC agent is enabled in Kayobe config, in etc/kayobe/kolla.yml:
kolla_enable_neutron_sriov: true
Neutron Server
SR-IOV usually connects to VLANs; here we assume Neutron has already been configured to support this. The sriovnicswitch ML2 mechanism driver must be enabled. In Kayobe config, this is added to etc/kayobe/neutron.yml:
# List of Neutron ML2 mechanism drivers to use. If unset the kolla-ansible
# defaults will be used.
kolla_neutron_ml2_mechanism_drivers:
- openvswitch
- l2population
- sriovnicswitch
Neutron SR-IOV NIC Agent
Neutron requires an additional agent to run on compute hypervisors with SR-IOV resources. The SR-IOV agent must be configured with mappings between physical network names and the interface names of the SR-IOV PFs. In Kayobe config, this should be added in a file etc/kayobe/kolla/config/neutron/sriov_agent.ini. Again, we can use an expansion with variables drawn from Kayobe config's inventory and extra variables:
[sriov_nic]
physical_device_mappings = {% for dev, physnet in sriov_physnet_mappings.items() %}{{ (loop.index0 > 0)|ternary(',','') }}{{ physnet }}:{{ dev }}{% endfor %}
exclude_devices =
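With the example mapping above (p4p1 on physnet2), the rendered configuration would be:

[sriov_nic]
physical_device_mappings = physnet2:p4p1
exclude_devices =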