A Brief Introduction to Single-Root I/O Virtualisation (SR-IOV)
In a virtualised environment, SR-IOV enables more direct access to the underlying
hardware, trading some operational flexibility for greater performance.
This involves the creation of virtual functions (VFs), each presented as a
copy of the physical function (PF) of the hardware device. A VF is
passed through to a VM, so that network traffic bypasses the hypervisor
operating system entirely. The principles of SR-IOV are presented in slightly greater depth
in a short
Intel white paper, and the OpenStack fundamentals are described in the
Neutron online documentation.
A VF can be bound to a given VLAN, or (on some hardware, such as recent
Mellanox NICs) it can be bound to a given VXLAN VNI. The result is direct
access to a physical NIC attached to a tenant or provider network.
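As an illustration of what this binding means at the host level, the VLAN of a VF can be set directly with the ip tool (in an OpenStack deployment this is handled automatically when the VF is plugged into an instance; the interface name, VF index and VLAN ID below are just examples):
ip link set dev eno6 vf 0 vlan 100
ip link show eno6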
Note that there is no support for security groups or similar richer network
functionality, as the VM is directly connected to the physical network
infrastructure, which provides no interface for injecting firewall rules or
other externally managed packet handling.
Mellanox also offer a more advanced capability, known as
ASAP2, which builds
on SR-IOV to also offload Open vSwitch (OVS) functions from the hypervisor.
This is more complex and not in scope for this investigation.
Setup for SR-IOV
Aside from OpenStack, deployment of SR-IOV involves configuration at many levels.
BIOS needs to be configured to enable both Virtualization Technology and
SR-IOV.
Mellanox NIC firmware must be configured to enable the creation of SR-IOV
VFs and define the maximum number of VFs to support. This requires the
installation of the Mellanox Firmware Tools (MFT) package from Mellanox OFED.
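As a rough sketch, the firmware settings can be changed with the mlxconfig tool from MFT. The device name under /dev/mst and the VF count here are assumptions; list the devices present with mst status first:
mst start
# Enable SR-IOV and allow up to 16 VFs on this adapter (illustrative values)
mlxconfig -d /dev/mst/mt4119_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=16
# The new settings take effect after a reboot or firmware reset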
Kernel boot parameters are required to support direct access to SR-IOV
hardware:
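On an Intel platform this typically means enabling the IOMMU on the kernel command line, for example:
intel_iommu=on iommu=pt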
A number of VFs can be created by writing the required number to a file under
/sys, for example:
/sys/class/net/eno6/device/sriov_numvfs
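For example, to create 8 VFs on the PF eno6 (both the interface name and the count are illustrative):
echo 8 > /sys/class/net/eno6/device/sriov_numvfs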
NOTE: Certain NIC models (e.g. the Mellanox ConnectX-3) do not support
management via sysfs; these need to be configured using module options set
via modprobe (see the modprobe.d man page).
This is typically done as a udev trigger script on insertion of the PF
device. The upper limit on the number of VFs is given by another (read-only)
file in the same directory, sriov_totalvfs.
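A minimal sketch of such a rule is shown below; it is not the exact rule installed by the stackhpc.sriov role described later, and the interface name and VF count are assumptions:
# /etc/udev/rules.d/70-sriov.rules
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eno6", RUN+="/bin/sh -c 'echo 8 > /sys/class/net/eno6/device/sriov_numvfs'"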
As a framework for management using infrastructure-as-code principles
and Ansible
at every level, Kayobe provides
support for running custom Ansible playbooks
on the inventory and groups of the infrastructure deployment. Over
time StackHPC has developed a number of roles to perform additional
configuration as a custom site playbook. A recent addition is a
Galaxy role for SR-IOV setup.
A simple custom site playbook could look like this:
---
- name: Configure SR-IOV
  hosts: compute_sriov
  tasks:
    - include_role:
        name: stackhpc.sriov
  handlers:
    - name: reboot
      include_tasks: tasks/reboot.yml
      tags: reboot
...
This playbook would then be invoked from the Kayobe CLI:
(kayobe) $ kayobe playbook run sriov.yml
Once the system is prepared for supporting SR-IOV, OpenStack
configuration is required to enable VF resource management, scheduling
according to VF availability, and pass-through of the VF to VMs
that request it.
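For reference, an instance typically requests a VF by attaching a Neutron port created with vnic_type direct. The network, image, flavor and instance names below are hypothetical:
port_id=$(openstack port create --network tenant-vlan-net --vnic-type direct sriov-port1 -f value -c id)
openstack server create --image CentOS-8 --flavor m1.medium --nic port-id=$port_id sriov-vm1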
SR-IOV and LAGs
An additional complication might be that hypervisors use bonded
NICs to provide network access for VMs, for greater fault
tolerance. However, a VF is normally associated with only one PF,
and with two PFs in a bond this would lead to inconsistent connectivity.
Mellanox NICs have a feature, VF-LAG, which claims to enable SR-IOV
to work in configurations where the ports of a 2-port NIC are bonded
together.
Setup for VF-LAG involves additional steps and complexity; we'll be
covering it in greater detail in another blog post soon.
Nova Configuration
Scheduling with Hardware Resource Awareness
SR-IOV VFs are managed in the same way as PCI-passthrough hardware (e.g., GPUs).
Each VF is managed as a hardware resource. The Nova scheduler must be
configured not to schedule instances requesting SR-IOV resources to hypervisors
with none available. This is done using the PciPassthroughFilter scheduler
filter.
In Kayobe config, the Nova scheduler filters are configured by defining
non-default parameters in nova.conf. In the kayobe-config repo, add this to
etc/kayobe/kolla/config/nova.conf:
[filter_scheduler]
available_filters = nova.scheduler.filters.all_filters
enabled_filters = other-filters,PciPassthroughFilter
(The other filters listed may vary according to other configuration applied
to the system).
Hypervisor Hardware Resources for Passthrough
The nova-compute service on each hypervisor requires configuration to define
which hardware/VF resources are to be made available for passthrough to VMs.
In addition, for infrastructure with multiple physical networks, an association
must be made to define which VFs connect to which physical network.
This is done by defining a whitelist (pci_passthrough_whitelist) of available
hardware resources on the compute hypervisors. This can be tricky to configure
in an environment with multiple variants of hypervisor hardware specification,
where the available resources differ from host to host.
One solution using Kayobe's inventory is to define whitelist hardware mappings
either globally, in group variables, or even in individual host variables, as
follows:
# Physnet to device mappings for SR-IOV, used for the pci
# passthrough whitelist and sriov-agent configs
sriov_physnet_mappings:
  p4p1: physnet2
This state can then be applied by adding a macro-expanded term to
etc/kayobe/kolla/config/nova.conf:
{% raw %}
[pci]
passthrough_whitelist = [{% for dev, physnet in sriov_physnet_mappings.items() %}{{ (loop.index0 > 0)|ternary(',','') }}{ "devname": "{{ dev }}", "physical_network": "{{ physnet }}" }{% endfor %}]
{% endraw %}
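With the example mapping above (p4p1 on physnet2), this expands to:
[pci]
passthrough_whitelist = [{ "devname": "p4p1", "physical_network": "physnet2" }]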
We have used the network device name for designation here, but other
options are available:
devname: network-device-name
(as used above)
address: pci-bus-address
Takes the form [[[[<domain>]:]<bus>]:][<slot>][.[<function>]].
This is a good way of unambiguously selecting a single device in the
hardware device tree.
address: mac-address
Can be wild-carded.
Useful if the vendor of the SR-IOV NIC is different from all other NICs in
the configuration, so that selection can be made by OUI.
vendor_id: pci-vendor product_id: pci-device
A good option for selecting a single hardware device model, wherever they
are located.
These values are 4-digit hexadecimal (but the conventional 0x prefix is not
required).
The vendor ID and device ID are available from lspci -nn (or lspci -x for
the hard core). The IDs supplied should be those of the virtual function
(VF), not the physical function (PF); the two may differ slightly.
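For illustration, a whitelist entry selecting VFs by vendor and device ID might look like this (the IDs shown are those of a Mellanox ConnectX-5 VF and are given only as an example; confirm your own with lspci -nn):
[pci]
passthrough_whitelist = { "vendor_id": "15b3", "product_id": "1018", "physical_network": "physnet2" }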
Neutron Configuration
For Kayobe configuration, we set a global flag kolla_enable_neutron_sriov
in etc/kayobe/kolla.yml:
kolla_enable_neutron_sriov: true
Neutron Server
SR-IOV usually connects to VLANs; here we assume Neutron has already been
configured to support this.
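As a minimal sketch of that prerequisite, VLAN support might be configured via an override in etc/kayobe/kolla/config/neutron/ml2_conf.ini along these lines (the physical network name and VLAN range are assumptions):
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vlan,vxlan

[ml2_type_vlan]
network_vlan_ranges = physnet2:1000:1099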
The sriovnicswitch ML2 mechanism driver must be enabled. In Kayobe config,
this is added to etc/kayobe/neutron.yml:
# List of Neutron ML2 mechanism drivers to use. If unset the kolla-ansible
# defaults will be used.
kolla_neutron_ml2_mechanism_drivers:
- openvswitch
- l2population
- sriovnicswitch
Neutron SR-IOV NIC Agent
Neutron requires an additional agent to run on compute hypervisors with SR-IOV
resources. The SR-IOV agent must be configured with mappings between physical
network name and the interface name of the SR-IOV PF.
In Kayobe config, this should be added in a file
etc/kayobe/kolla/config/neutron/sriov_agent.ini.
Again we can do an expansion using the variables drawn from Kayobe config's
inventory and extra variables:
{% raw %}
[sriov_nic]
physical_device_mappings = {% for dev, physnet in sriov_physnet_mappings.items() %}{{ (loop.index0 > 0)|ternary(',','') }}{{ physnet }}:{{ dev }}{% endfor %}
exclude_devices =
{% endraw %}
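With the example mapping above, this renders to:
[sriov_nic]
physical_device_mappings = physnet2:p4p1
exclude_devices =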