Our project with the Square Kilometre Array includes a
requirement for high-performance containerised runtime environments. We
have been building a system with bare metal infrastructure, multiple
physical networks, high-performance data services and tight
integration between OpenStack and container orchestration engines (COEs)
such as Kubernetes and Docker Swarm.
We have previously documented the upgrade of our OpenStack deployment from
Ocata to Pike.
This upgrade impacted Docker Swarm and Kubernetes: provisioning of both
COEs in a bare metal environment failed after the upgrade. We resolved
the issues with Docker Swarm but deferred the Kubernetes fix to the next
major release upgrade.
A fix was announced with the Queens release, along with swarm-mode support
for Docker. This strengthened the case for upgrading Magnum to Queens on an
underlying OpenStack Pike deployment. The design ethos of Kolla-Ansible and Kayobe, using containerisation to
avoid the nightmares of dependency interlock, made the targeted upgrade
of Magnum a relatively smooth ride.
Fixing Magnum deployment by upgrading from Pike to Queens
We use Kayobe to manage the configuration of our Kolla deployment.
Changing the version of a single OpenStack service (in this case,
Magnum) is as simple as setting the tag of its Docker container image,
as follows:
Prepare the Kayobe environment (assuming it is already installed):
cd src/kayobe-config
git checkout BRANCH-NAME
git pull
source kayobe-env
cd ../kayobe
source ../../venv/kayobe/bin/activate
export KAYOBE_VAULT_PASSWORD=**secret**
Add magnum_tag: 6.0.0.0 to
kayobe-config/etc/kayobe/kolla/globals.yml.
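For reference, one way to add this line from the environment prepared above (KAYOBE_CONFIG_PATH is exported by the kayobe-env script sourced earlier):
# append the Magnum container tag override to the Kolla globals file
cat >> ${KAYOBE_CONFIG_PATH}/kolla/globals.yml <<EOF
magnum_tag: 6.0.0.0
EOF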
Finally, build and deploy the new version of Magnum to the control
plane. To ensure that other OpenStack services are not affected
during the deployment, we use --kolla-tags and
--kolla-skip-tags:
kayobe overcloud container image build magnum --push \
  -e kolla_source_version=stable/queens \
  -e kolla_openstack_release=6.0.0.0 \
  -e kolla_source_url=https://git.openstack.org/openstack/kolla
kayobe overcloud container image pull --kolla-tags magnum \
  --kolla-skip-tags common
kayobe overcloud service upgrade --kolla-tags magnum \
  --kolla-skip-tags common
That said, the upgrade came with a few unforeseen issues:
We discovered that Kolla Ansible, the tool Kayobe uses to deploy
Magnum containers, assumes that host machines running Kubernetes can
reach Keystone on an internal endpoint. This was not an option in our
case, since the internal endpoints are reachable only from the
control plane, which does not include tenant networks and instances
(whether bare metal nodes or VMs). Since this assumption does not hold
in general, a patch was pushed upstream and was quickly approved in
code review. After applying the patch, the default configuration
templates for Heat must be regenerated, which takes a single Kayobe
command:
kayobe overcloud service reconfigure --kolla-tags heat \
  --kolla-skip-tags common
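A local override of the same general shape is also possible through Kayobe's Kolla custom config directory. This is only a sketch, under the assumption that the option in question is Heat's [clients_keystone] auth_uri; public-api.example.com stands in for the real public Keystone endpoint:
# sketch only: override heat.conf with the public Keystone endpoint
# (assumes [clients_keystone]/auth_uri is the relevant option; the URL is a placeholder)
cat > ${KAYOBE_CONFIG_PATH}/kolla/config/heat.conf <<EOF
[clients_keystone]
auth_uri = https://public-api.example.com:5000
EOF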
Docker Community Edition (v17.03.0-ce onwards) uses cgroupfs
as the default native.cgroupdriver. However, Magnum assumes the
driver is systemd and does not explicitly enforce this. As a
result, deployment fails. This was addressed in this
pull request.
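A quick way to check which driver a Docker host is actually using, along with the daemon flag that forces the systemd driver Magnum expects, is shown below:
# report the cgroup driver in use by the Docker daemon on a cluster node
docker info --format '{{ .CgroupDriver }}'
# to force the systemd driver, dockerd can be started with:
#   dockerd --exec-opt native.cgroupdriver=systemd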
By default, Magnum assigns a floating IP to each server in a
container infrastructure cluster. This means that all traffic from
external locations flows through the control plane (internal traffic
is direct). Disabling floating IPs appeared to have no effect, which
we filed as a bug on Launchpad. A patch to make Magnum correctly
handle disabled floating IPs in swarm mode is currently under way.
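For context, this is the template option whose handling the patch fixes; an illustrative example of creating a swarm-mode cluster template with floating IPs disabled (the image, external network and flavor names are placeholders for our environment):
# illustrative only: swarm-mode template with floating IPs disabled
# (image, external network and flavor names are placeholders)
openstack coe cluster template create swarm-fa27 \
  --coe swarm-mode \
  --image fedora-atomic-27 \
  --external-network public \
  --flavor m1.large \
  --floating-ip-disabled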
In the meantime, we patched kayobe-config
to update magnum_tag to 6.0.0.0 and to point
magnum_conductor_footer and magnum_api_footer at a
patched Magnum Queens fork, stackhpc/queens, on our GitHub
account.
Fedora Atomic 27 image for containers
The recently released Fedora Atomic 27 image (download link)
comes packaged with bare metal and Mellanox drivers, so it is no
longer necessary to build a custom image using diskimage-builder
to incorporate these drivers. However, a few one-off manual changes
to the image were still required, which we made through the
virsh console:
First, boot into the image using cloud-init credentials
defined within init.iso (built from these instructions):
sudo virt-install --name fa27 --ram 2048 --vcpus 2 \
  --disk path=/var/lib/libvirt/images/Fedora-Atomic-27-20180326.1.x86_64.qcow2 \
  --os-type linux --os-variant fedora25 --network bridge=virbr0 \
  --cdrom /var/lib/libvirt/images/init.iso --noautoconsole
sudo virsh console fa27
The image ships with Docker v1.13.1, which is 12 releases
behind the current stable release, Docker v18.03.1-ce (note that
the versioning scheme changed after v1.13.1, jumping to v17.03.0-ce).
To obtain up-to-date features required by our customers, we upgraded
to the latest release:
sudo su
cd /etc/yum.repos.d/
curl -O https://download.docker.com/linux/fedora/docker-ce.repo
rpm-ostree override remove docker docker-common cockpit-docker
rpm-ostree install docker-ce -r
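Once the machine comes back from the reboot triggered by the -r flag, a quick sanity check confirms the package override and the new Docker version:
# confirm the layered docker-ce package and the new Docker version
rpm-ostree status
docker --version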
Fedora Atomic 27 comes with packages for GlusterFS, a scalable
network file system. However, one of our customer requirements was
RDMA support for GlusterFS, in order to maximise IOPS for
data-intensive tasks compared to IP-over-Infiniband. The package is
available in the rpm-ostree repository as glusterfs-rdma. It was
installed and enabled as follows:
# installing glusterfs
sudo su
rpm-ostree upgrade
rpm-ostree install glusterfs-rdma fio
systemctl enable rdma
rpm-ostree install perftest infiniband-diags
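As a quick illustration of how the RDMA transport is then consumed from a cluster node, a GlusterFS volume created with RDMA transport enabled can be mounted with the transport option; gluster01 and gv0 below are placeholders for a real server and volume:
# illustrative: mount a GlusterFS volume over RDMA from a cluster node
# (gluster01 and gv0 are placeholders for a server and an RDMA-enabled volume)
sudo mkdir -p /mnt/gv0
sudo mount -t glusterfs -o transport=rdma gluster01:/gv0 /mnt/gv0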
Next, write a cloud-init script that runs once to resize the root
volume partition. The root volume is mounted as LVM, and the
conventional cloud-init growpart mechanism fails to grow it, so
containers deployed inside the swarm cluster quickly fill it up.
The following script, placed under
/etc/cloud/cloud.cfg.d/99_growpart.cfg, did the trick in our
case and generalises to different types of root block device:
#cloud-config
# resize volume
runcmd:
- lsblk
# find the physical volume backing the "atomicos" volume group, e.g. /dev/vda2
- PART=$(pvs | awk '$2 == "atomicos" { print $1 }')
- echo $PART
# split the device path into disk and partition number, then grow the partition
- /usr/bin/growpart ${PART:: -1} ${PART: -1}
# grow the physical volume, then extend the root logical volume and filesystem into the freed space
- pvresize ${PART}
- lvresize -r -l +100%FREE /dev/atomicos/root
- lsblk
Cleaning up the image makes it lighter, at the cost of preventing
users from rolling back the installation, an action we do not
anticipate they will need. Removing the cloud-init state allows it to
run again on the next boot and removes authorisation details. Clearing
service logs gives the image a fresh start.
# undeploy old images
sudo atomic host status
# current deployment <0/1> should have a * next to it
sudo ostree admin undeploy <0/1>
sudo rpm-ostree cleanup -r
# image cleanup
sudo rm -rf /var/log/journal/*
sudo rm -rf /var/log/audit/*
sudo rm -rf /var/lib/cloud/*
sudo rm /var/log/cloud-init*.log
sudo rm -rf /etc/sysconfig/network-scripts/ifcfg-*
# auth cleanup
sudo rm ~/.ssh/authorized_keys
sudo passwd -d fedora
# virsh cleanup
# press Ctrl+Shift+] to exit virsh console
sudo virsh shutdown fa27
sudo virsh undefine fa27
Ansible roles for managing container infrastructure
There are official Ansible modules for various OpenStack projects like
Nova, Heat and Keystone. However, Magnum currently has none, in
particular none that cover creating, updating and managing a container
infrastructure inventory. Magnum also lacks certain useful features
that are possible indirectly through the Nova API, such as attaching
multiple network interfaces to each node in the clusters it creates.
The ability to generate and reuse an existing cluster inventory is
further necessitated by a specific requirement of this project:
mounting GlusterFS volumes on each node in the container
infrastructure cluster.
In order to lay the foundation for performing preliminary data
consumption tests for the Square Kilometre Array's (SKA) Performance
Prototype Platform (P3), we needed to attach each node in the container
infrastructure cluster to multiple high speed network interfaces:
- 10G Ethernet
- 25G High Throughput Ethernet
- 100G Infiniband
We have submitted a blueprint
to support multiple networks through the Magnum API, since Nova already
allows multiple network interfaces to be attached. In the meantime, we
wrote an Ansible role to drive Magnum
and generate an Ansible inventory from the cluster deployment. Using
this inventory, further playbooks apply our enhancements to the deployment.
The role allows us to declare a specification of the container
infrastructure required, including a variable listing the networks to
attach to cluster nodes. A bespoke Ansible module, os_container_infra,
creates, updates or deletes the cluster as specified using
python-magnumclient. Another module, os_stack_facts,
then gathers facts about the container infrastructure
using python-heatclient, allowing us to generate an inventory of
the cluster. Finally, a module called os_server_interface uses
python-novaclient to attach each node in the container
infrastructure cluster to the additional network interfaces declared
in the specification.
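A sketch of the intended workflow is given below; the playbook and inventory names are illustrative placeholders rather than the role's documented interface:
# illustrative workflow only; playbook and inventory names are placeholders
ansible-galaxy install <the linked role>        # install the role from Ansible Galaxy
ansible-playbook container-infra.yml            # declares the cluster spec, including the list of networks
ansible-playbook -i cluster-inventory gluster-mounts.yml   # follow-up playbook against the generated inventory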
We make use of the recently announced openstacksdk Python module
for talking to OpenStack, which was conceived to assimilate the
shade and os_client_config projects that had been
performing similar functions under separate umbrellas. We enjoyed
using the openstacksdk API, which is largely consistent with its
parent projects. Ansible plans to eventually transition to
openstacksdk, but there are currently no specific plans to support
plugin libraries like python-magnumclient,
python-heatclient and python-novaclient, which provide
wider API coverage than openstacksdk's common-denominator interface
across various OpenStack cloud platforms.
With Magnum playing an increasingly important role in the OpenStack
ecosystem by allowing users to create and manage container orchestration
engines like Kubernetes, we expect this role to make life easier for
those of us who regularly use Ansible to manage complex, large-scale
HPC infrastructure.