The hardware powering modern cloud and High Performance Computing (HPC) systems is variable and complex. The assumption that access to memory and devices across a system is uniform is often incorrect. Without knowledge of the properties of the physical hardware of the host, the virtualised guests running atop can perform poorly.
This post covers how we can take advantage of knowledge of the system architecture in OpenStack Nova to improve guest VM performance, and how to configure OpenStack TripleO to support this.
Non-Uniform Memory Access (NUMA)
Server CPU clock speeds stopped increasing some time ago. To continue improving system performance, CPU and system vendors now scale outwards rather than upwards, offering servers with multiple CPU sockets and multiple cores per CPU. In multi-socket systems, access to memory and devices is no longer uniform: each CPU node reaches its local memory quickly, while accesses to memory attached to other nodes must traverse limited inter-node communication paths. This results in variable memory bandwidth and latency, and is known as Non-Uniform Memory Access (NUMA).
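As a quick way to see this layout on a given host, the numactl and lscpu utilities (assuming they are installed) report the NUMA nodes present and which CPUs and memory belong to each; the exact output varies by system:

# Show NUMA nodes, their CPUs, memory sizes and relative distances
$ numactl --hardware

# A more compact summary of the NUMA node CPU lists
$ lscpu | grep -i numa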
When virtualisation is used on NUMA systems, typically guest VMs will have no knowledge of the memory architecture of the physical host. Consequently they will make poor use of the system's resources, making many expensive memory accesses across the interconnect bus. The same is true when accessing I/O devices such as Network Interface Cards (NICs).
To avoid these issues it is possible to expose all or a subset of the memory architecture of the physical system to the guest VM, allowing it to make more intelligent decisions around the use of memory and how tasks are scheduled to CPU cores.
We call this process the "physicalisation" of virtualisation. Revealing the underlying hardware sacrifices some generality and flexibility, but delivers performance gains through informed placement and scheduling. A compromise is struck; we find we can get most of the benefits of software defined infrastructure without paying a price in performance.
vCPU Pinning
In KVM, the virtual CPUs of a guest VM are implemented as ordinary tasks in the hypervisor's userspace, so by default they may be scheduled onto any core in the system. This behaviour can lead to sub-optimal cache performance as virtual CPUs migrate between CPU cores within a NUMA node or, worse, between NUMA nodes. With virtual CPU pinning, we can improve this behaviour by restricting the physical CPU cores on which each virtual CPU can run.
It can in some scenarios be advantageous to also restrict host processes to a subset of the available CPU cores to avoid adverse interactions between hypervisor processes and the application workloads.
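Outside of Nova, libvirt exposes this capability directly. The sketch below (with a hypothetical domain name and example core numbers) shows how a guest's vCPUs could be pinned by hand with virsh, which is essentially the arrangement Nova sets up for us when CPU pinning is requested:

# Pin vCPU 0 of the guest "example-guest" to physical core 4,
# and vCPU 1 to physical core 5 (domain name and cores are examples)
$ virsh vcpupin example-guest 0 4
$ virsh vcpupin example-guest 1 5

# Confirm the current vCPU to physical CPU placement
$ virsh vcpupin example-guest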
NUMA and vCPU Pinning in Nova
Support for NUMA topology awareness and vCPU pinning in OpenStack Nova was first introduced in October 2014 with the Juno release. These features allow Nova to schedule VM instances onto the available hardware more intelligently. Essentially, we can request a NUMA topology via Nova flavor extra specs or Glance image properties when creating a Nova instance, and the same is true for vCPU pinning. The OpenStack admin guide provides some useful information on how to use these features, and there is a good blog article by Red Hat on this topic which is worth reading. We are going to build on that here by describing how to deliver these capabilities in TripleO.
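As a brief illustration (using a hypothetical flavor name), the extra specs below request a dedicated CPU policy and a two-node guest NUMA topology; an instance booted from such a flavor will then be placed and pinned accordingly:

# Request dedicated (pinned) CPUs and a guest topology spanning two NUMA nodes
# (the flavor name is illustrative)
$ openstack flavor set numa.pinned.large \
    --property hw:cpu_policy=dedicated \
    --property hw:numa_nodes=2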
NUMA and vCPU Pinning in TripleO
The OpenStack TripleO project provides tools to deploy an OpenStack cloud, and Red Hat's popular OpenStack Platform (OSP) is based on TripleO. The default configuration of TripleO is not optimal for NUMA placement and vCPU pinning, so we'll outline a few steps that can be taken to improve the situation.
Kernel Command Line Arguments
We can use the isolcpus kernel command line argument to restrict host processes to a subset of the total available CPU cores. The argument takes a list of CPU ID ranges to be isolated from the host's general-purpose scheduler; in other words, the CPUs we will reserve for guest VMs. For example, to reserve CPUs 4 through 23 exclusively for guest VMs we could specify:
isolcpus=4-23
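Once a node has booted with this argument, a couple of simple checks (a sketch, assuming the 4-23 range above) confirm that it took effect and that host processes are confined to the remaining cores:

# The argument should appear in the kernel command line
$ cat /proc/cmdline

# PID 1 (and the processes it spawns) should be restricted to the
# non-isolated CPUs, e.g. an affinity list of 0-3 with isolcpus=4-23
$ taskset -cp 1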
Ideally we want this argument to be applied on the first and subsequent boots rather than applying it dynamically during deployment, to avoid waiting for our compute nodes to reboot. Currently Ironic does not provide a mechanism to specify additional kernel arguments on a per-node or per-image basis, so we must bake them into the overcloud image instead.
If using the Grub bootloader, additional arguments can be provided to the kernel by modifying the GRUB_CMDLINE_LINUX variable in /etc/default/grub in the overcloud compute image, then rebuilding the Grub configuration. We use the virt-customize command to apply post-build configuration to the overcloud images:
$ export ISOLCPUS=4-23
$ function cpu_pinning_args {
    CPU_PINNING_ARGS="isolcpus=${ISOLCPUS}"
    echo --run-command \"echo GRUB_CMDLINE_LINUX=\"'\\\"'\"\\$\{GRUB_CMDLINE_LINUX\} ${CPU_PINNING_ARGS}\"'\\\"'\" \>\> /etc/default/grub\"
}
$ (cpu_pinning_args) | xargs virt-customize -v -m 4096 --smp 4 -a overcloud-compute.qcow2
(We structure the execution this way because typically we are composing a string of operations into a single invocation of virt-customize.) Alternatively, this change could be applied with a custom diskimage-builder element.
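Whichever approach is used, it is worth checking the result before deploying. The virt-cat tool from the same libguestfs tool set can read the file back out of the image; a quick sanity check rather than a required step:

# Confirm the isolcpus argument was appended inside the image
$ virt-cat -a overcloud-compute.qcow2 /etc/default/grub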
One Size Does Not Fit All: Multiple Overcloud Images
While the isolcpus argument may provide performance benefits for guest VMs on compute nodes, it would be seriously harmful to limit host processes in the same way on controller and storage nodes. With different sets of nodes requiring different arguments, we now need multiple overcloud images. Thankfully, TripleO provides an (undocumented) set of options to set the image for each of the overcloud roles. We'll use the name overcloud-compute for the compute image here.
When uploading overcloud images to Glance, use the OS_IMAGE environment variable to reference an image with a non-default name:
$ export OS_IMAGE=overcloud-compute.qcow2
$ openstack overcloud image upload
We can execute this command multiple times to register multiple images. To specify a different image for the overcloud compute roles, create a Heat environment file containing the following:
parameter_defaults:
  NovaImage: overcloud-compute
Ensure that the image name matches the one registered with Glance and that the environment file is referenced when deploying or updating the overcloud. Other node roles will continue to use the default image overcloud-full. Our specialised kernel configuration is now only applied where it is needed, and not where it is harmful.
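A quick listing of the images registered in Glance confirms that both the default and compute-specific images are available before deploying:

# Both overcloud-full and overcloud-compute should be listed
$ openstack image list | grep overcloud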
KVM and Libvirt
The Nova compute service will not advertise the NUMA topology of its host if it determines that the versions of libvirt and KVM are inappropriate. As of the Mitaka release, the following version restrictions are applied:
- libvirt: >= 1.2.8, != 1.2.9.7
- qemu-kvm: >= 2.1.0
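On a CentOS compute node, the installed versions can be checked directly (the qemu-kvm binary path assumes the stock CentOS packaging):

# Report the libvirt daemon and QEMU/KVM versions on a compute node
$ libvirtd --version
$ /usr/libexec/qemu-kvm --version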
On CentOS 7.3, the qemu-kvm package is at version 1.5.3. This can be updated to a more contemporary 2.4.1 by adding the kvm-common Yum repository and installing qemu-kvm-ev:
$ cat << EOF | sudo tee /etc/yum.repos.d/kvm-common.repo
[kvm-common]
name=KVM common
baseurl=http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/
gpgcheck=0
EOF
$ sudo yum -y install qemu-kvm-ev
This should be applied to the compute nodes either after deployment or during the overcloud compute image build procedure. If applied after deployment, the openstack-nova-compute service should be restarted on the compute nodes to ensure it checks the library versions again:
$ sudo systemctl restart openstack-nova-compute
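To confirm that the newer QEMU is actually in use by the virtualisation stack, rather than merely installed, virsh can report the running hypervisor version:

# The "Running hypervisor" line should now report QEMU 2.x
$ virsh version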
Kernel Same-page Merging (KSM)
Kernel Same-page Merging (KSM) makes more efficient use of memory by allowing identical memory pages used by different processes to be backed by a single page of Copy On Write (COW) physical memory. On a NUMA system this can have adverse effects if pages belonging to VMs on different NUMA nodes are merged. In Linux kernels prior to 3.14 KSM was not NUMA-aware, which could lead to VM performance and stability issues. It is possible to disable KSM merging memory across NUMA nodes:
$ echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
Some additional commands are required if any shared pages are already in use. If memory is not constrained, greater performance may be achieved by disabling KSM altogether:
$ echo 0 > /sys/kernel/mm/ksm/run
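Both of these sysfs writes are lost on reboot, and writing 2 to the run file is the documented way to unmerge any pages that have already been shared. A sketch, assuming the ksm and ksmtuned services shipped with the CentOS qemu-kvm packages are what manage KSM on the compute nodes:

# Unmerge any pages KSM has already shared, then stop merging
$ echo 2 > /sys/kernel/mm/ksm/run

# Prevent the KSM services from re-enabling merging after a reboot
$ sudo systemctl stop ksm ksmtuned
$ sudo systemctl disable ksm ksmtuned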
Nova Scheduler Configuration
The Nova scheduler provides the NUMATopologyFilter filter to incorporate NUMA topology information into the placement process. TripleO does not appear to provide a mechanism to append additional filters to the default list (although it may be possible with sufficient 'puppet-fu'). To override the default scheduler filter list, use a Heat environment file like the following:
parameter_defaults:
  controllerExtraConfig:
    nova::scheduler::filter::scheduler_default_filters:
      - RetryFilter
      - AvailabilityZoneFilter
      - RamFilter
      - DiskFilter
      - ComputeFilter
      - ComputeCapabilitiesFilter
      - ImagePropertiesFilter
      - ServerGroupAntiAffinityFilter
      - ServerGroupAffinityFilter
      - NUMATopologyFilter
The controllerExtraConfig parameter (recently renamed to ControllerExtraConfig) allows us to specialise the overcloud configuration. Here nova::scheduler::filter::scheduler_default_filters references a variable in the Nova scheduler puppet manifest. Be sure to include this environment file in your openstack overcloud deploy command as a -e argument.
Nova Compute Configuration
The Nova compute service can be configured to pin virtual CPUs to a subset of the physical CPUs. We can use the set of CPUs previously isolated via kernel arguments. It is also prudent to reserve an amount of memory for the host processes. In TripleO we can again use a Heat environment file to set these options:
parameter_defaults:
  NovaComputeExtraConfig:
    nova::compute::vcpu_pin_set: 4-23
    nova::compute::reserved_host_memory: 2048
Here we are using CPUs 4 through 23 for vCPU pinning and reserving 2GB of memory for host processes. As before, remember to include this environment file when managing the TripleO overcloud.
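Putting the pieces together, a deployment (or update) would reference each of the environment files described above; the file names here are purely illustrative:

# Include the compute image, scheduler filter and compute pinning environment files
$ openstack overcloud deploy --templates \
    -e compute-image.yaml \
    -e scheduler-filters.yaml \
    -e compute-pinning.yaml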
Performance For All
We hope this guide helps the community to improve the performance of their TripleO-based OpenStack deployments. Thanks to the University of Cambridge for the use of their development cloud while developing this configuration.