At StackHPC we have been working on an exciting new tool for our OpenTofu-defined,
Ansible-driven Slurm appliance,
which aims to reduce disruptive and costly downtime during upgrades of production
Slurm clusters. This tool allows us to stagger the rebuilding of compute nodes by using
the Slurm job queue to orchestrate the upgrade process itself.
This presented a number of challenges: we have to contend with the maturity of the
appliance and uphold the key IaC principles with which it was designed; the process must
be easily configurable for different sites; and appliance secrets must be handled
securely.
Before we dive into the gritty details of how we fared, what exactly is the issue we
are trying to solve, and why is Slurm-controlled rebuild the answer?
Let Down and Hanging Around
Picture this:
You are a student with a fast-approaching deadline for a research paper. You need to run
some final computations using the university's Slurm cluster. You ssh into the login
node and are faced with something like this:
*********************************************************************************
NOTICE: Scheduled Maintenance Window
The cluster will be UNAVAILABLE starting:
XX XX 2025 at 08:00 UTC
Expected completion:
XX XX 2025 at 17:00 UTC
All jobs that cannot complete before the start of the maintenance will be held.
Any jobs still running at the start of the maintenance window WILL BE TERMINATED.
Please plan your work accordingly.
*********************************************************************************
The cluster's down for upgrade maintenance in a few days. The job you hoped to submit
would take too long to complete before the cutoff and now you have to wait until after
the maintenance window to even start it. You were hoping to finalise the analysis over
the weekend but now you don't even get the data until the middle of next week. Disaster!
Time to beg your supervisor for an extension, you suppose.
If you're a sysadmin you might scoff at this scenario; it says more about the reckless
abandon of those who struggle to manage their time and plan ahead (...students!). You
would probably be right. After all, downtime is planned well in advance; users are given
plenty of notice, being bombarded with emails and warnings in the period leading up to
the maintenance windows, which are scheduled to minimise disruption. However, the fact
remains that downtime always has an impact on users. The bigger your system and the more
users you have, the greater the effect it has on research productivity.
While very common at universities, Slurm is also prevalent in industry, where the
impact of downtime can be even more severe. Lost productivity can be quantified and
feeds directly into lost revenue. And that says nothing of the hidden costs, which are
harder to quantify but potentially more damaging in the long run: inefficient maintenance
windows can put a damper on morale for agile development teams, for instance.
So what if there was another way?
Enter… Slurm-controlled rebuild!
By allowing Slurm to manage its own compute node upgrade process, we circumvent the need
for long maintenance reservations, during which all jobs are held, or terminated if they
overrun into the reservation window. If the Slurm scheduler can manage these upgrade
jobs as part of the regular queue, then we can stagger the upgrade of compute nodes and
allow users to continue submitting jobs as normal, with only a short period of enforced
downtime for the upgrade of the login and control nodes.
The key benefit of this model is that when the control node goes down, slurmd continues
to run on each compute node and manages job execution locally. When slurmctld comes back
online, it queries slurmd on each node to reconcile job states and updates slurmdbd with
any job completions that happened while it was offline.
The implementation requires the rebuilt compute node to rejoin the cluster by itself,
without the external configuration from a deployment controller that normal Ansible
operation relies on.
Easier said than done, right?
Re-imaging and reconfiguring a node can be fiddly, full of edge cases, and prone to
drift. The standard way the Ansible Slurm appliance works is to use OpenTofu and a
Packer-built image to provision instances (virtualised or bare metal) from a central
deploy host. A series of Ansible playbooks run to configure these instances into a Slurm
cluster, aiming to keep everything declarative. Standard operation dictates that after
reimaging a node with OpenTofu, it must then be reconfigured by Ansible.
What we've tried to do is decentralise this process as much as possible while remaining
fully in line with the IaC principles that the Slurm appliance is built on. To achieve
this we have developed a three-pronged approach:
1. ansible-init
The first of these, and a foundational tool for Slurm-controlled rebuild, is the
initialisation system called ansible-init, which is used to bootstrap instances based on
metadata supplied via cloud-init. ansible-init runs as a systemd service, e.g.:
[Unit]
Wants=network-online.target
After=network-online.target
ConditionPathExists=!/var/lib/ansible-init.done
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/bin/ansible-init
Restart=on-failure
RestartSec=10
This tool is baked into the image. It checks for the presence of Ansible playbooks in a
specified directory and executes them in order. Once complete, it creates a sentinel
file to ensure idempotence, i.e. the local Ansible playbooks don't run in any context
other than the initial boot. Config and data needed to rejoin the rebuilt nodes to the
cluster can be made available via cloud-init metadata.
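Conceptually, the service behaves something like the following sketch. This is a
simplification rather than the real implementation, and the playbook directory and
sentinel path here are illustrative:
#!/usr/bin/env python3
"""Illustrative sketch of the ansible-init boot flow (not the actual tool)."""
import subprocess
from pathlib import Path

PLAYBOOK_DIR = Path("/etc/ansible-init/playbooks")  # assumed staging directory
SENTINEL = Path("/var/lib/ansible-init.done")       # matches ConditionPathExists above

def main():
    if SENTINEL.exists():
        return  # already initialised on a previous boot
    # Run staged playbooks in lexical order against the local host
    for playbook in sorted(PLAYBOOK_DIR.glob("*.yml")):
        subprocess.run(
            ["ansible-playbook", "--connection", "local", str(playbook)],
            check=True,
        )
    # The sentinel makes the service a no-op on every subsequent boot
    SENTINEL.touch()

if __name__ == "__main__":
    main()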
2. compute-init
The first of the ansible-init playbooks should be compute-init. This is the actual
self-configuration playbook which brings the rebuilt node back into the cluster.
Limitations in cloud-init's instance metadata mean that we cannot pass every piece of
config to this playbook directly; there is a 255-byte limit per metadata key-value pair.
Besides, there are a number of secrets which need to be supplied to configure the
cluster, and we do not want to expose those via metadata.
With this in mind, we make use of an NFS-exported mount on the control node, which
contains all the necessary variables from the normal Slurm appliance. This reduces the
amount of metadata we need to pass to the compute-init playbook massively, as we only
need to pass the control node address to mount the NFS share.
Further metadata can be supplied to the compute-init playbook to control the specific
set of features enabled in the appliance. The playbook reads this metadata like so:
- name: Compute node initialisation
  hosts: localhost
  become: yes
  vars:
    os_metadata: "{{ lookup('url', 'http://169.254.169.254/openstack/latest/meta_data.json') | from_json }}"
    server_node_ip: "{{ os_metadata.meta.control_address }}"
    enable_compute: "{{ os_metadata.meta.compute | default(false) | bool }}"
    enable_resolv_conf: "{{ os_metadata.meta.resolv_conf | default(false) | bool }}"
    enable_etc_hosts: "{{ os_metadata.meta.etc_hosts | default(false) | bool }}"
    enable_cacerts: "{{ os_metadata.meta.cacerts | default(false) | bool }}"
    enable_sssd: "{{ os_metadata.meta.sssd | default(false) | bool }}"
    enable_sshd: "{{ os_metadata.meta.sshd | default(false) | bool }}"
    enable_tuned: "{{ os_metadata.meta.tuned | default(false) | bool }}"
    ...
The NFS share contains an /etc/hosts file for the cluster, which allows the other
cluster nodes to be resolved, and a set of hostvars for each compute node. The
compute-init playbook will then use these hostvars in a series of tasks to configure the
compute node. Finally, the playbook will issue an scontrol update state=resume ...
command to resume the node.
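Putting the pieces together, the flow on a rebuilt node looks roughly like the following
Python-flavoured sketch. The real logic lives in the compute-init Ansible playbook; the
NFS export path and the structure of the exported hostvars are illustrative here:
#!/usr/bin/env python3
"""Rough sketch of the compute-init flow; the real logic is an Ansible playbook."""
import json
import socket
import subprocess
import urllib.request

# 1. Read the control node address from instance metadata (as in the vars above)
with urllib.request.urlopen("http://169.254.169.254/openstack/latest/meta_data.json") as resp:
    metadata = json.load(resp)
control_address = metadata["meta"]["control_address"]

# 2. Mount the NFS export from the control node (export and mount paths are illustrative)
subprocess.run(
    ["mount", "-t", "nfs", f"{control_address}:/exports/cluster", "/mnt/cluster"],
    check=True,
)

# 3. Load this node's exported hostvars and apply whichever features were enabled
#    via metadata (hosts file, CA certs, sssd, sshd, tuned, Slurm config, ...)

# 4. Tell Slurm the node is ready for work again
nodename = socket.gethostname().split(".")[0]
subprocess.run(["scontrol", "update", f"nodename={nodename}", "state=resume"], check=True)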
3. Rebuild-via-Slurm
Now that the mechanism is in place for autonomous reconfiguration of the compute nodes, we need a way to
trigger this rebuild process as part of a Slurm job. First of all, OpenTofu is
configured not to rebuild compute nodes when their image is updated in the OpenTofu
configuration. We achieve this by using the ignore_changes argument in a resource's
lifecycle block:
lifecycle {
  ignore_changes = [
    image_id,
  ]
}
This means that when the OpenTofu configuration is manually updated and applied in the
normal way, only the control and login nodes are rebuilt. For the compute nodes (and any
other rebuildable partitions) the new image reference is ignored, but it is still written
to the hosts inventory file and is therefore available as an Ansible hostvar.
Ansible cluster configuration can be performed as normal, despite the drift in images
between the control/login nodes and the compute nodes, and the cluster works perfectly
fine. The compute node hostvars, which include the new image reference, are exported to
the control node.
Now, in order to actually trigger the rebuild, our custom reboot Python command line
tool is configured as Slurm's RebootProgram. When the --reboot flag is passed to a Slurm
job, the tool looks up the image reference in the hostvars and compares it to the
current image on the compute node. If they do not match, it uses OpenStack to rebuild
the compute node to the desired image. If the images match, the reboot tool simply
reboots the node. The RebootProgram itself is run only on the control node, so there is
no need to distribute OpenStack credentials to compute nodes: a key security and
operational advantage.
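In outline, the decision the tool makes looks something like the sketch below. It
assumes the openstacksdk library and credentials available on the control node; the
function name and argument handling are illustrative rather than the appliance's actual
code:
#!/usr/bin/env python3
"""Sketch of a RebootProgram that rebuilds a node when its image is out of date."""
import sys

import openstack  # openstacksdk, using e.g. clouds.yaml credentials on the control node


def rebuild_or_reboot(nodename: str, target_image: str) -> None:
    conn = openstack.connect()
    server = conn.compute.find_server(nodename, ignore_missing=False)
    current_image = server.image["id"] if server.image else None

    if current_image != target_image:
        # Image reference has changed in the OpenTofu config: rebuild to the new image
        conn.compute.rebuild_server(server, image=target_image)
    else:
        # Image already matches: a plain reboot is enough
        conn.compute.reboot_server(server, reboot_type="SOFT")


if __name__ == "__main__":
    # The node name and desired image come from Slurm and the exported hostvars
    rebuild_or_reboot(sys.argv[1], sys.argv[2])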
These rebuild jobs should be submitted at a higher priority so that they become the next
job in the queue on every node. As existing jobs finish, the rebuild job launches on
each node in turn. No interruption to running jobs or to job submissions!
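For example, an admin-side submission loop might look something like this hedged sketch;
the node names, priority value and trivial job body are illustrative, and --priority
requires operator privileges:
#!/usr/bin/env python3
"""Sketch: submit one high-priority rebuild job per compute node."""
import subprocess

compute_nodes = ["compute-0", "compute-1"]  # illustrative node names

for node in compute_nodes:
    subprocess.run(
        [
            "sbatch",
            "--reboot",                   # ask Slurm to run the RebootProgram first
            "--nodelist", node,           # pin the job to this node
            "--exclusive",                # take the whole node
            "--priority", "1000000",      # jump the queue; needs operator privileges
            "--job-name", f"rebuild-{node}",
            "--wrap", "true",             # trivial job body; the reboot is the point
        ],
        check=True,
    )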
Final thoughts
Cluster maintenance is never fun, for users or administrators. But by letting Slurm
manage its own rebuild cycle, we've shown that there's a path towards rolling,
low-impact upgrades that keep clusters productive while still keeping the software stack
fresh and secure.
The tooling described here is available in our open-source Ansible Slurm Appliance,
along with detailed documentation on how it was implemented and how to use it.
Get in touch
If you would like to get in touch, we would love to hear from you. Reach out to
us via BlueSky,
LinkedIn or directly via
our contact page.