The hardware powering modern cloud and High Performance Computing (HPC) systems
is variable and complex. The assumption that access to memory and devices
across a system is uniform is often incorrect. Without knowledge
of the host's physical hardware, the virtualised guests running atop it can
perform poorly.
This post covers how we can take advantage of knowledge of the system
architecture in OpenStack Nova to improve guest VM performance, and how to
configure OpenStack TripleO to support this.
vCPU Pinning
In KVM, the virtual CPUs of a guest VM are emulated by host tasks in userspace
of the hypervisor. As such, they may be scheduled across any of the cores in
the system. This behaviour can lead to sub-optimal cache performance as
virtual CPUs migrate between CPU cores within a NUMA node or, worse,
between NUMA nodes. With virtual CPU pinning, we can improve this behaviour by
restricting the physical CPU cores on which each virtual CPU can run.
In some scenarios it can also be advantageous to restrict host processes to a
subset of the available CPU cores to avoid adverse interactions between
hypervisor processes and the application workloads.
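Once the configuration described in the rest of this post is in place, guests
typically request pinning through flavor extra specs. As a brief sketch (the
flavor name here is purely illustrative), a flavor can ask Nova for dedicated
pinned CPUs and a single guest NUMA node:
$ openstack flavor create --vcpus 4 --ram 4096 --disk 20 m1.pinned
$ openstack flavor set m1.pinned \
    --property hw:cpu_policy=dedicated \
    --property hw:numa_nodes=1
Instances booted with such a flavor will have each virtual CPU pinned to a
dedicated physical CPU, subject to the scheduler and compute configuration
covered below.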
NUMA and vCPU Pinning in TripleO
The OpenStack TripleO project provides tools to
deploy an OpenStack cloud, and Red Hat's popular OpenStack Platform (OSP) is
based on TripleO. The default configuration of TripleO is not optimal for
NUMA placement and vCPU pinning, so we'll outline a few steps that can be
taken to improve the situation.
Kernel Command Line Arguments
We can use the isolcpus kernel command line argument to restrict
host processes to a subset of the total available CPU cores. The argument
specifies a list of ranges of CPU IDs from which host processes should be
isolated. In other words, the CPUs we will use for guest VMs. For example, to
reserve CPUs 4 through 23 exclusively for guest VMs we could specify:
isolcpus=4-23
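Once a node has booted with this argument, a quick sanity check can be made.
For example (assuming the 24-core layout above), the kernel command line should
contain the isolcpus argument, and the default CPU affinity of PID 1 should be
restricted to the remaining host CPUs, 0-3 in this case:
$ cat /proc/cmdline
$ taskset -cp 1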
Ideally we want this argument to be applied on the first and subsequent boots
rather than applying it dynamically during deployment, to avoid waiting for our
compute nodes to reboot. Currently Ironic does not provide a mechanism to
specify additional kernel arguments on a per-node or per-image basis, so we
must bake them into the overcloud image instead.
If using the Grub bootloader, additional arguments can be provided to the
kernel by modifying the GRUB_CMDLINE_LINUX variable in
/etc/default/grub in the overcloud compute image, then rebuilding the Grub
configuration. We use the virt-customize command to apply post-build
configuration to the overcloud images:
$ export ISOLCPUS=4-23
$ function cpu_pinning_args {
    CPU_PINNING_ARGS="isolcpus=${ISOLCPUS}"
    echo --run-command \"echo GRUB_CMDLINE_LINUX=\"'\\\"'\"\\$\{GRUB_CMDLINE_LINUX\} ${CPU_PINNING_ARGS}\"'\\\"'\" \>\> /etc/default/grub\"
}
$ (cpu_pinning_args) | xargs virt-customize -v -m 4096 --smp 4 -a overcloud-compute.qcow2
(We structure the execution this way because typically we are composing a
string of operations into a single invocation of virt-customize).
Alternatively, this change could be applied with a custom
diskimage-builder element.
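For reference, here is a minimal sketch of what such an element could look
like. The element name, script name and CPU range are illustrative; the script
simply appends the same line to /etc/default/grub as the virt-customize
approach above:
$ mkdir -p elements/compute-isolcpus/post-install.d
$ cat << 'EOF' > elements/compute-isolcpus/post-install.d/80-isolcpus
#!/bin/bash
# Append the CPU isolation arguments to the kernel command line in the image.
echo 'GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} isolcpus=4-23"' >> /etc/default/grub
EOF
$ chmod +x elements/compute-isolcpus/post-install.d/80-isolcpus
The element directory would then be added to ELEMENTS_PATH and the element
included when building the compute image with diskimage-builder.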
One Size Does Not Fit All: Multiple Overcloud Images
While the isolcpus argument may provide performance benefits for guest VMs on
compute nodes, it would be seriously harmful to limit host processes in the same way on
controller and storage nodes. With different sets of nodes requiring different arguments,
we now need multiple overcloud images. Thankfully, TripleO provides an (undocumented) set of
options to set the image for each of the overcloud roles. We'll use the name
overcloud-compute for the compute image here.
When uploading overcloud images to Glance, use the OS_IMAGE environment
variable to reference an image with a non-default name:
$ export OS_IMAGE=overcloud-compute.qcow2
$ openstack overcloud image upload
We can execute this command multiple times to register multiple images.
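For example, assuming the default image file is named overcloud-full.qcow2,
both images can be registered with:
$ export OS_IMAGE=overcloud-full.qcow2
$ openstack overcloud image upload
$ export OS_IMAGE=overcloud-compute.qcow2
$ openstack overcloud image upload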
To specify a different image for the overcloud compute roles, create a Heat
environment file containing the following:
parameter_defaults:
  NovaImage: overcloud-compute
Ensure that the image name matches the one registered with Glance and that the
environment file is referenced when deploying or updating the overcloud.
Other node roles will continue to use the default image overcloud-full.
Our specialised kernel configuration is now only applied where it is needed,
and not where it is harmful.
KVM and Libvirt
The Nova compute service will not advertise the NUMA topology of its host if it
determines that the versions of libvirt and KVM are inappropriate. As of the
Mitaka release, the following version restrictions are applied:
- libvirt: >= 1.2.8, != 1.2.9.7
- qemu-kvm: >= 2.1.0
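A quick way to check the versions currently installed on a compute node
(package names as found on CentOS 7) is:
$ rpm -qa | grep -E 'libvirt-daemon|qemu-kvm'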
On CentOS 7.3, the qemu-kvm package is at version 1.5.3. This can be
updated to a more contemporary 2.4.1 by adding the kvm-common Yum repository
and installing qemu-kvm-ev:
$ cat << EOF | sudo tee /etc/yum.repos.d/kvm-common.repo
[kvm-common]
name=KVM common
baseurl=http://mirror.centos.org/centos/7/virt/x86_64/kvm-common/
gpgcheck=0
EOF
$ sudo yum -y install qemu-kvm-ev
This should be applied to the compute nodes either after deployment or during
the overcloud compute image build procedure. If applied after deployment, the
openstack-nova-compute service should be restarted on the compute nodes to
ensure it checks the library versions again:
$ sudo systemctl restart openstack-nova-compute
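As a rough check that libvirt itself now reports the host NUMA topology, the
capabilities XML can be inspected; on a multi-socket host it should contain a
<cells> element describing the NUMA cells:
$ sudo virsh capabilities | grep '<cells'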
Kernel Same-page Merging (KSM)
Kernel Same-page Merging (KSM) allows for more efficient use of memory by guest
VMs by allowing identical memory pages used by different processes to be backed
by a single page of Copy On Write (COW) physical memory. On a NUMA system this
can have adverse effects if pages belonging to VMs on different NUMA nodes are
merged. In Linux kernels prior to 3.14 KSM was not NUMA-aware, which
could lead to VM performance and stability issues. It is possible to disable
KSM merging memory across NUMA nodes:
$ echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
Some additional commands are required if any shared pages are already in use
(a sketch of these follows below). If memory is not constrained, greater
performance may be achieved by disabling KSM altogether:
$ echo 0 > /sys/kernel/mm/ksm/run
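The additional commands mentioned above amount to unmerging any existing
shared pages before changing the policy. A minimal sketch, assuming KSM is
currently enabled and should remain so:
$ echo 2 > /sys/kernel/mm/ksm/run
$ echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
$ echo 1 > /sys/kernel/mm/ksm/run
Writing 2 to the run file asks the kernel to unmerge all currently shared
pages, after which merge_across_nodes can be changed and KSM re-enabled.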
Nova Scheduler Configuration
The Nova scheduler provides the NUMATopologyFilter filter, which incorporates
NUMA topology information into the placement process. TripleO does not appear
to provide a mechanism to append additional filters to the default list
(although it may be possible with sufficient 'puppet-fu'). To override the
default scheduler filter list, use a Heat environment file like the following:
parameter_defaults:
  controllerExtraConfig:
    nova::scheduler::filter::scheduler_default_filters:
      - RetryFilter
      - AvailabilityZoneFilter
      - RamFilter
      - DiskFilter
      - ComputeFilter
      - ComputeCapabilitiesFilter
      - ImagePropertiesFilter
      - ServerGroupAntiAffinityFilter
      - ServerGroupAffinityFilter
      - NUMATopologyFilter
The controllerExtraConfig parameter (recently renamed to
ControllerExtraConfig) allows us to specialise the overcloud configuration.
Here nova::scheduler::filter::scheduler_default_filters references a
variable in the Nova scheduler puppet manifest.
Be sure to include this environment file in your openstack overcloud deploy
command as a -e argument.
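For example, assuming the file above is saved as scheduler-filters.yaml (the
name is arbitrary), and keeping whatever other arguments your deployment
already uses:
$ openstack overcloud deploy --templates \
    -e scheduler-filters.yaml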
Nova Compute Configuration
The Nova compute service can be configured to pin virtual CPUs to a subset of
the physical CPUs. We can use the set of CPUs previously isolated via kernel
arguments. It is also prudent to reserve an amount of memory for the host
processes. In TripleO we can again use a Heat environment file to set these
options:
parameter_defaults:
  NovaComputeExtraConfig:
    nova::compute::vcpu_pin_set: 4-23
    nova::compute::reserved_host_memory: 2048
Here we are using CPUs 4 through 23 for vCPU pinning and reserving 2GB of
memory for host processes. As before, remember to include this environment
file when managing the TripleO overcloud.
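Once the overcloud has been deployed or updated, a quick way to confirm that
the hieradata was applied is to inspect the Nova configuration on a compute
node, where the corresponding options should now be set:
$ sudo grep -E 'vcpu_pin_set|reserved_host_memory' /etc/nova/nova.conf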