Note: The content of this article relates to NVIDIA vGPU release 15.
When even your grandpa is asking you for some GPU resource to serve the latest
language model, it's probably time to get serious about GPUs. But you
don't want to give him that whole NVIDIA A100 80GB, right? That's where
NVIDIA's virtual GPU technology comes in; one or more virtual GPUs can be
created on a single physical card, and whilst this technology isn't new, it has
always been complex to set up. With the new vgpu role in our
stackhpc.linux Ansible collection,
we've tried to make this easier.
Before you get started
vGPU functionality requires a commercial license. The guest instances will need to
be able to reach a license server, which can either be cloud hosted on NVIDIA
infrastructure (CLS) or hosted on your own infrastructure (DLS). For more
details see this NVIDIA knowledge article.
Contact NVIDIA for purchasing details.
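To give a flavour of what the guest-side licensing looks like, here is a minimal sketch, assuming a client configuration token has already been downloaded from your CLS or DLS instance (the exact token file name will differ; consult the NVIDIA licensing documentation for the authoritative steps):
# Install the token where the nvidia-gridd licensing daemon looks for it
sudo cp client_configuration_token_*.tok /etc/nvidia/ClientConfigToken/
sudo systemctl restart nvidia-gridd
# Confirm that the guest has checked out a license
nvidia-smi -q | grep -A 2 "vGPU Software Licensed Product"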
To MIG or not to MIG
MIG stands for "Multi-Instance GPU" and it is a way to split a supported
card into multiple separate partitions with dedicated resources. It differs
from legacy vGPUs, which use time slicing to schedule work onto the GPU. In
MIG mode, NVIDIA promises deterministic latency and throughput, as each
workload runs in parallel on its own slice of the hardware. Sounds great,
right? But the cost is that roughly an eighth of the GPU's SM resources is
lost to management overhead:
There are seven SM slices, not eight, because some SMs cover operational overhead when MIG mode is enabled.
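The vgpu role described below can take care of MIG partitioning for you, but if you want to poke at MIG by hand first, enabling it and listing the available profiles looks roughly like this (GPU index 0 is an assumption here):
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles the card supports
sudo nvidia-smi mig -lgip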
Analysing the performance differences will be the focus of a future blog
article, but for those who can't wait, here are some benchmarking results from
VMware.
Ansible Role
Whether you choose MIG or time slicing for your vGPUs, the
stackhpc.linux.vgpu role has you covered. It's published on Ansible
Galaxy, so all you need to do is
add the following snippet to your requirements.yml:
collections:
  - name: stackhpc.linux
    version: 1.0.1
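and install the collection:
ansible-galaxy collection install -r requirements.yml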
Define some variables (e.g. in <ansible_inventory>/group_vars/vgpu):
# Path to GRID driver downloaded from the NVIDIA licensing portal
vgpu_driver_url: "file://{{ lookup('env', 'HOME') }}/NVIDIA-GRID-Linux-KVM-525.105.14-525.105.17-528.89.zip"
#nvidia-692 GRID A100D-4C
#nvidia-693 GRID A100D-8C
#nvidia-694 GRID A100D-10C
#nvidia-695 GRID A100D-16C
#nvidia-696 GRID A100D-20C
#nvidia-697 GRID A100D-40C
#nvidia-698 GRID A100D-80C
#nvidia-699 GRID A100D-1-10C
#nvidia-700 GRID A100D-2-20C
#nvidia-701 GRID A100D-3-40C
#nvidia-702 GRID A100D-4-40C
#nvidia-703 GRID A100D-7-80C
#nvidia-707 GRID A100D-1-10CME
vgpu_definitions:
  # Configuring a MIG backed vGPU
  - pci_address: "0000:17:00.0"
    virtual_functions:
      - mdev_type: nvidia-700
        index: 0
      - mdev_type: nvidia-700
        index: 1
      - mdev_type: nvidia-700
        index: 2
      - mdev_type: nvidia-699
        index: 3
    mig_devices:
      "1g.10gb": 1
      "2g.20gb": 3
  # Configuring a card in a time-sliced configuration (non-MIG backed)
  - pci_address: "0000:65:00.0"
    virtual_functions:
      - mdev_type: nvidia-697
        index: 0
      - mdev_type: nvidia-697
        index: 1
and run this simple playbook:
---
- hosts: vgpu
  tags:
    - iommu
  tasks:
    - import_role:
        name: stackhpc.linux.iommu
  handlers:
    - name: reboot
      reboot:
        reboot_timeout: 3600
      become: true

- hosts: vgpu
  tags:
    - vgpu
  tasks:
    - import_role:
        name: stackhpc.linux.vgpu
  handlers:
    - name: reboot
      reboot:
        reboot_timeout: 3600
      become: true
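Assuming the playbook above is saved as vgpu.yml (the file name is up to you), running it is a one-liner:
ansible-playbook -i <ansible_inventory> vgpu.yml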
Could it get much easier? See the role
documentation
for more information on the configuration options, as well as a more detailed
walkthrough of the steps needed to get everything fully configured.
It's Apache 2.0 licensed, so happy hacking, and don't forget to contribute any
useful changes back.
OpenStack Config
Of course, creating the mediated devices on the host is not enough. You have to pass
them through to a virtual machine to make use of them. Since we are an OpenStack
shop, we will use the example of configuring OpenStack Nova. Just add the following
snippet to your nova-compute service's nova.conf:
# Example for a hypervisor hosting the MIG-backed card from the example above
[devices]
enabled_mdev_types = nvidia-700, nvidia-699

[mdev_nvidia-700]
device_addresses = 0000:21:00.4,0000:21:00.5,0000:21:00.6,0000:81:00.4,0000:81:00.5,0000:81:00.6
mdev_class = CUSTOM_NVIDIA_700

[mdev_nvidia-699]
device_addresses = 0000:21:00.7,0000:81:00.7
mdev_class = CUSTOM_NVIDIA_699

# Example for a hypervisor hosting time-sliced (non-MIG) vGPUs
[devices]
enabled_mdev_types = nvidia-697

[mdev_nvidia-697]
device_addresses = 0000:21:00.4,0000:21:00.5,0000:81:00.4,0000:81:00.5
# Custom resource classes don't work when you only have a single mdev type.
mdev_class = vGPU
You will need to adjust the PCI addresses to match those of your vGPU virtual
functions. These can be obtained by checking the mdevctl configuration after
running the role:
# mdevctl list
73269d0f-b2c9-438d-8f28-f9e4bc6c6995 0000:17:00.4 nvidia-700 manual (defined)
dc352ef3-efeb-4a5d-a48e-912eb230bc76 0000:17:00.5 nvidia-700 manual (defined)
a464fbae-1f89-419a-a7bd-3a79c7b2eef4 0000:17:00.6 nvidia-700 manual (defined)
f3b823d3-97c8-4e0a-ae1b-1f102dcb3bce 0000:17:00.7 nvidia-699 manual (defined)
330be289-ba3f-4416-8c8a-b46ba7e51284 0000:65:00.4 nvidia-697 manual (defined)
1ba5392c-c61f-4f48-8fb1-4c6b2bbb0673 0000:65:00.5 nvidia-697 manual (defined)
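If you also want to map the nvidia-7xx identifiers back to the human-readable profile names listed in the comments earlier, mdevctl can report the types that each parent device supports:
# Show the mdev types (and their descriptions) offered by each parent device
mdevctl types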
The mdev_class maps to a resource class that you can set in your flavor definition.
Note that if you only define a single mdev type on a given hypervisor, then the
mdev_class configuration option is silently ignored and it will use the vGPU
resource class (bug?).
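To sanity-check that Nova and Placement have picked up the new devices, you can query the resource providers. This assumes the osc-placement CLI plugin is installed; the mediated devices should appear as nested providers under each compute node:
openstack resource provider list
# Inspect the inventory of one of the nested providers
openstack resource provider inventory list <provider_uuid>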
OpenStack Flavors
Define some flavors that request the resource class that was configured in nova.conf.
An example definition, which can be used with the openstack.cloud.compute_flavor Ansible module,
is shown below:
vgpu_a100_2g_20gb:
  name: "vgpu.a100.2g.20gb"
  ram: 65536
  disk: 30
  vcpus: 8
  is_public: false
  extra_specs:
    hw:cpu_policy: "dedicated"
    hw:cpu_thread_policy: "prefer"
    hw:mem_page_size: "1GB"
    hw:cpu_sockets: 2
    hw:numa_nodes: 8
    hw_rng:allowed: "True"
    resources:CUSTOM_NVIDIA_700: "1"
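If you would rather use the plain OpenStack CLI than the Ansible module, an equivalent flavor can be created along these lines (a sketch; trim the properties to suit your environment):
openstack flavor create vgpu.a100.2g.20gb \
  --ram 65536 --disk 30 --vcpus 8 --private \
  --property hw:cpu_policy=dedicated \
  --property hw:mem_page_size=1GB \
  --property resources:CUSTOM_NVIDIA_700=1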
You should now be able to launch a VM with this flavor.
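For example (the image, network and key names below are placeholders for whatever exists in your cloud):
openstack server create \
  --flavor vgpu.a100.2g.20gb \
  --image ubuntu-22.04 \
  --network demo-net \
  --key-name mykey \
  vgpu-test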
Wait! A Spanner in the Works
Just when we thought we'd got this vGPU thing sussed, NVIDIA threw a spanner in
the works. Starting with the V16 release
of the GRID drivers (July 2023), NVIDIA dropped support for the following cards:
- Graphics cards that support only C-series vGPUs, namely:
- NVIDIA H800 PCIe 80GB
- NVIDIA H100 PCIe 80GB
- NVIDIA A800 PCIe 80GB
- NVIDIA A800 PCIe 80GB liquid cooled
- NVIDIA A800 HGX 80GB
- NVIDIA A100 PCIe 80GB
- NVIDIA A100 PCIe 80GB liquid cooled
- NVIDIA A100X
- NVIDIA A100 HGX 80GB
- NVIDIA A100 PCIe 40GB
- NVIDIA A100 HGX 40GB
- NVIDIA A30
- NVIDIA A30X
Instead, these graphics cards are supported with NVIDIA AI Enterprise.
A full table can be found here. This means that
you can no longer use your Virtual Compute Server license and will need to
get it upgraded to the NVIDIA AI Enterprise equivalent (at least if you want
driver updates)! You also need to install alternative drivers on the
hypervisor. We will follow up with details of the technical implications of
this change in a subsequent blog article.
Future Blogs
This is the first blog in a series. The following articles are in the pipeline:
- vGPUs in OpenStack with Kayobe
- Switching to NVIDIA AI Enterprise
- Dynamic vGPU partitioning on OpenStack with Cyborg
- vGPU Performance shootout: MIG vs time slicing
Stay tuned...
Get in touch
If you would like to get in touch we would love to hear
from you. Reach out to us via Twitter,
LinkedIn
or directly via our contact page.