Welcome to the first 2026 edition of our (sort of) quarterly newsletter, Navigating Upstream.
StackHPC turned 10 this January, so in this edition we will be taking a look back over the last decade at how the company came to be what it is today.
After a brilliant 2025, we’re keen to carry that momentum into the new year. The team is set to grow significantly, and we have kicked off some of our biggest projects in the last month.
Also featured: a raft of updates to our SLURM appliance, infrastructure news, and the usual CVE notices and recaps of recent events.
– The StackHPC team
Celebrating 10 years of StackHPC
This January marked 10 years since the incorporation of StackHPC as a limited company. Read on for the story of the company’s growth from its inception in a pub in Bristol, to winning the OpenInfra Superuser award.
Built on strong foundations
StackHPC can trace its roots all the way back through Bristol’s near 50-year computing history. The founding of Inmos in 1978 sparked a technological revolution in the south-west of England, drawing over £200 million in government investment and leading to significant developments in parallel supercomputing (Wired). Meiko, founded by former Inmos employees, launched an Inmos transputer-based supercomputer named the Computing Surface in 1986, and would employ both StackHPC co-founders John Taylor and Stig Telfer in the early 1990s.
Meiko’s technical team transferred to Quadrics in the late 1990s to continue developing Meiko’s networking technology. A spin-off called Gnodal formed ten years later and folded in 2013; its assets and team, including Stig, transferred to the Cray Bristol Research Laboratory. It was here that OpenStack’s potential for HPC infrastructure was first explored, and the Slingshot network fabric was developed.
StackHPC’s HPC cloud consultancy began at Cambridge University. John and Stig attended the OpenStack summit in Tokyo, 2015, with Stig presenting: The Case for a Scientific OpenStack. In Tokyo, John and Stig met Tim Bell from CERN, and the idea for an OpenStack Scientific Special Interest Group (SIG), a forum to share ideas across the scientific open infrastructure community, was born.
Stig delivering a keynote talk at the OpenStack Summit Barcelona 2016 with Paul Calleja from Cambridge University.
Embracing the open community
The next few years saw steady growth in StackHPC’s prominence in the OpenStack community. We presented frequently at OpenStack Summits, published studies on Keycloak, Kubernetes and HPC, and spoke at CERN OpenStack Day in 2018 and 2019. For the 2017 OpenStack release, StackHPC engineers reviewed 97 contributions. By 2020’s OpenStack Ussuri, this figure had risen to 2,022, and our number of reviewers had grown from two to nine (Stackalytics).
At the OpenStack Summit Austin in 2016, StackHPC helped bring the OpenStack Scientific SIG to fruition. This has become a regular fixture at OpenInfra Summits every year since, providing a space for international collaboration to support the development of OpenStack in science and research use-cases. The ‘social infrastructure’ of SIG meetings makes an ideal environment for technical exchange.
Like much of the world, StackHPC was forced to a halt in 2020, but our cloud-first and distributed approach allowed us to keep pushing forwards. The drive for bioinformatics projects created new demand for HPC and cloud-based solutions in the quest to sequence and overcome COVID-19. In November 2020, two projects we supported joined the TOP500 list, becoming the first fully OpenStack-based, software-defined supercomputers to qualify.
The ‘social infrastructure’ of the Scientific SIG at the OpenStack Summit Barcelona 2016.
Where we are now
Since then, StackHPC has continued to grow, with projects across Europe and the UK, the United States, Asia, Oceania and Africa, and now employs almost 50 people. While a lot has changed, much remains the same. We are still headquartered in Bristol, though now in our own office, and we have colleagues in Poland and France. We continue to champion open source principles in all our work, exemplified by our position as the second-highest contributor to the latest major OpenStack releases, Epoxy and Gazpacho (Stackalytics: by reviews, resolved bugs, person-day hours).
A recent UK government publication noted StackHPC’s ‘highly innovative UK AI software stack’ contribution to the University of Cambridge’s DAWN supercomputer, as the government prepares to invest a further £36 million in the AI Research Resource (AIRR) at Cambridge for a new supercomputer project. DAWN has already supported over 350 projects, including accelerating research into personalised cancer vaccines. We are excited to continue our involvement in this and many other important projects throughout 2026.
The StackHPC team at OpenInfra Europe Summit 2025.
The State of the SLURM Appliance
Achieving peak performance from AI infrastructure is crucial for maximising returns on significant hardware and software investments, especially given the competitive nature of modern-day AI workloads.
We discussed our SLURM Appliance in the first issue of this newsletter in August 2025; since then, 130 PRs have been merged. Aside from a continual stream of package updates, most of these have focused on improving stability or usability.
The appliance can now be deployed on isolated networks with no outbound internet access, where previously outbound access had always been a requirement. This required various structural changes in the Ansible and did increase the size of the images we build (and make freely available). However, it has also improved the reliability of all deployments (both client systems and our CI) thanks to the reduced reliance on external resources. Similarly, we have continued to expand the use of our Release Train repository mirror for image builds, making them more reproducible and less dependent on other external infrastructure.
We've also made a number of improvements to the OpenTofu configurations which define the cluster infrastructure, most notably adding validation for input variables and making it simpler to add additional groups of nodes, whether for compute partitions or other services. Configuring remote state is needed for most deployments and is not easy to get right, so we've added templates and documentation to make best practice simple.
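As an illustration of the kind of input validation OpenTofu supports, here is a minimal sketch; the variable name and bounds are hypothetical, not the appliance's actual schema:

```hcl
# Hypothetical example: reject obviously invalid node counts at plan
# time, rather than failing partway through a deployment.
variable "compute_count" {
  type        = number
  description = "Number of nodes in the compute partition"

  validation {
    condition     = var.compute_count >= 1
    error_message = "compute_count must be at least 1."
  }
}
```

Validation like this surfaces configuration mistakes when the plan is generated, before any infrastructure is touched.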
As mentioned, package upgrades are a continual source of changes, the largest of which was moving our RockyLinux 9 image to 9.6, once a suitable version of NVIDIA DOCA was available. We still support RockyLinux 8 but are not planning to support RockyLinux 10 yet, as there is not yet significant demand, and to keep our CI matrix reasonable. We also upgraded Ansible (to v2.16), and the Ansible host preparation script can handle all the changes this entails.
Work on a baremetal appliance with NVIDIA DGX compute nodes has also led to support for software RAID root disks and NVIDIA's fabricmanager. We have also improved our GPU support, with GRES options added to the Open OnDemand job launchers and automatic configuration of EESSI, which now supports NVIDIA GPUs. Clusters with EESSI enabled now also automatically deploy and configure a proxy, reducing the manual configuration required.
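For readers unfamiliar with GRES (generic resources), this is the SLURM mechanism jobs use to request GPUs. A minimal batch script requesting one GPU looks roughly like this (the job name and payload are purely illustrative):

```
#!/bin/bash
#SBATCH --job-name=gpu-demo
#SBATCH --gres=gpu:1        # request one GPU via GRES
#SBATCH --time=00:10:00

# Illustrative payload: show which GPU SLURM allocated to the job
nvidia-smi
```

The Open OnDemand launcher options expose this kind of GRES request through the web form, instead of a hand-written script.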
Speaking of our Open OnDemand configuration, we’ve added support for Dex, which massively expands how our Open OnDemand deployments can authenticate users. As always, we wrap the underlying functionality to provide sensible, functional defaults for the appliance while allowing the required site customisations.
Lastly, the latest releases have brought fixes for two recent CVEs that affect SLURM. The first affects the authentication service for user access, MUNGE, and allows a potential attacker to impersonate any user, including root. The second concerns the potential to cause a Denial of Service or allow remote code execution via a vulnerability in OpenSSL. Find more details on both of these issues in CVE Watch below.
Future plans include taking advantage of client experience in operating busy clusters to update our default SLURM configuration and making it easier to build and configure site-specific images.
The View from the Release Train - Infrastructure updates
A brief summary of important upgrades, features and fixes in our OpenStack infrastructure products in the last quarter.
StackHPC Kayobe-config
Upgrades
- Add support for Rocky Linux 9.7, including host packages and a full container image refresh.
Features and Improvements
- Add networking-generic-switch support for bond interfaces in trunk mode on Arista switches.
- Add playbooks and configuration to enable the easy deployment of Pulp with TLS support in combination with certificates generated via OpenBao.
- Changed the IPA (Ironic Python Agent) image compression algorithm. This improves provisioning performance by reducing the size of the IPA boot ISO.
Fixes
- Update OVN chassis priority fix playbook to ensure the correct leader runs the priority alignment.
- Rebuild Nova and Neutron images to bring in upstream fixes to networking Mellanox agent.
Kayobe
- Various fixes, including missing NTP configuration for infrastructure VMs and an issue where the Bifrost host variable file failed to generate when the default IPv4 gateway was not defined.
- Read the full release notes
Kolla-Ansible
Upgrades
- Support for Rocky Linux 10 has been added in Epoxy to allow for operating system migrations ahead of upgrades to OpenStack Flamingo (2025.2) or Gazpacho (2026.1) where RL10 will be a requirement.
Fixes
- An issue where the OpenSearch log retention check would fail due to improperly loaded plugins was resolved by adding a check before tasks that require plugins.
- Addressed an issue causing HTTP 500 errors in Horizon when one of the memcached nodes was unavailable.
Kolla
Upgrades
- Kolla now supports building CentOS Stream 10 and Rocky Linux 10 container images.
- Prometheus services have been updated to the latest releases, including migration to a new TSDB format, compatible with Prometheus v3.
Fixes
- Pinned numpy to a safe version to prevent excessive memory consumption and a memory leak in Gnocchi.
CVE Watch
The beginning of 2026 has already delivered its fair share of security vulnerabilities potentially affecting our customers’ deployments. We won’t be making predictions about this slowing down; if anything, current AI tooling has shown itself quite adept at discovering security bugs, so we expect a steady flow of vulnerabilities this year.
CVE-2026-25506: MUNGE (SLURM) Buffer overflow in message unpacking allows key leakage and credential forgery
This vulnerability stems from an issue in MUNGE, the authentication service used to authenticate users – including system service users – within the cluster and to the SLURM daemons. MUNGE can be understood as the basis of trust in the StackHPC SLURM appliance.
This vulnerability could allow an attacker to impersonate any other user, including root, to the SLURM controller and SLURM daemons. This, in turn, would let the attacker gain access as that user on compute nodes via a job, and administer SLURM using scontrol commands and the like.
The attacker only needs login access to a node of the cluster, or the ability to submit a SLURM job, to attack the local munge daemon, which makes this a serious flaw to patch quickly, especially on busy clusters.
As packages for the fixes were not yet available from the vendors of the host operating systems we deploy to (Ubuntu and Rocky Linux), we packaged the fix ourselves for our customers and have been deploying it.
CVE-2025-15467: OpenSSL stack buffer overflow
This one is both quite specific in its requirements and potentially quite bad in terms of impact.
The attack exploits a bug in parsing AuthEnvelopedData structures that use Authenticated Encryption with Associated Data (AEAD) ciphers such as AES-GCM. An attacker supplying a crafted Cryptographic Message Syntax (CMS) message with an oversized Initialization Vector (IV) can trigger a buffer overflow, which may lead to a crash and a Denial of Service (DoS) or, more worryingly, potentially allow remote code execution.
Again, worth patching quickly. We’ve been deploying the patched versions to our customers’ systems.
CVE-2026-24708: Nova calls qemu-img without format restriction for resize (OSSA-2026-002)
A more concerning vulnerability, which resulted in an embargo on the release of any details prior to publication. As soon as we got advance notice of this bug, we immediately began preparing patched versions of the Nova service to make available to our customers as soon as the embargo was lifted, on Tuesday 18 February at 15:00 GMT.
The attack relies on writing a malicious QCOW header to a root or ephemeral disk and then triggering a resize. This may convince Nova's flat image backend to call qemu-img without a format restriction, resulting in an unsafe image resize operation that could destroy data on the host system.
Thankfully, only compute nodes using the flat image backend are affected, and when we analysed our customers’ systems we found that most were configured not to use it, meaning they already had the mitigation in place (i.e. use_cow_images=True).
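For reference, the mitigation corresponds to selecting the qcow2 image backend in nova.conf on compute nodes; a minimal fragment, with all surrounding configuration omitted:

```ini
[DEFAULT]
# With the qcow2 backend, instance disks avoid the vulnerable
# flat-backend resize path.
use_cow_images = True
```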
Furthermore, the attack requires access to the Nova Compute API, which puts public and user-reachable deployments at higher risk.
All in all, still a vulnerability to be taken seriously and patched quickly, as the situation can evolve quickly after an embargoed vulnerability is made public.
StackHPC in the Community
We’re barely two months into the year, but we’ve already made it to two conferences: FOSDEM in Brussels, Belgium, and HPC Asia in Osaka, Japan.
FOSDEM is an event with a focus on promoting the use of free and open source software, with plenty of beer in the mix. Ergo the perfect trip for StackHPC. Axel Simon and Eric Le Lay travelled to Brussels and did their best to take in as many of the 1000+ lectures available over two days.
Eric noted highlights in the world of software-defined storage, with Ceph Tentacle optimised to support up to 10,000 OSDs, double the capacity of Ceph Squid. GarageFS also featured, with a focus on its suitability for distributed sites with high latency and upcoming web UI implementations.
The Cloud Infrastructure Platform (CLIP) team presented their solution to reduce the barrier to entry for HPC users. Their ‘Zero-Touch’ HPC solution manifests as a self-configuring SLURM Cluster, with props given to the StackHPC team for the compute-init playbook, which automates compute nodes rejoining the cluster after reboots.
HPC Asia
Stig presenting at HPC Asia.
In late January, CTO Stig Telfer and computer tolerator Alex Welsh travelled to the beautiful city of Osaka to attend the joint Supercomputing Asia/HPC Asia conference.
There were clearly two big focal points at the event: AI, and quantum computing. AI taking centre stage will surprise no-one, but quantum computing is continuing to gain momentum. Recent breakthroughs mean enthusiasm for the technology appears to be at an all-time high.
Between the qubits and okonomiyaki, Stig was invited to present at the 10th iteration of the SuperCompCloud workshop. The group focuses on the interoperability of supercomputing and cloud technologies. He used this time to share StackHPC's recent successes at the University of Bristol, where the StackHPC SLURM Appliance is being used to great effect (slides publicly available here).
Upcoming Events
Our busy calendar continues into Q1, and we hope to be attending some of these events:
- Global Open OnDemand Conference in Salt Lake City, USA, 9-12th March.
- HPSF Conference in Chicago, USA, 16-20th March.
- KubeCon + CloudNativeCon in Amsterdam, Netherlands, 23-26th March.
- HPC-AI Swiss Conference in Locarno, Switzerland, 20-23rd April.
- StackConf in Munich, Germany, 28-29th April.
- KubeCon in Mumbai, India, 18-19th June.
- ISC-HPC in Hamburg, Germany, 22-26th June.
From the Blog
Slinking Time
Published 8 February 2026, by Raine Wales
Raine explores how FluxCD enables the packaging of Slinky as an Azimuth app, lowering the barrier to entry for SLURM by deploying it within a Kubernetes cluster.
Azimuth All Alone: Standalone Mode
Published 17 December 2025, by Raine Wales
Raine presents a solution to deploy Azimuth on a Kubernetes cluster, allowing organisations and users without an existing OpenStack cloud to make use of the platform, and bring in new ideas.
United by OpenStack: Empowering Women in STEM Beyond Borders
Published 12 November 2025, by Massimiliano Favaro-Bedford
Max summarises a great experience supporting students from the University of Jos, Nigeria, through a 7 week internship covering OpenStack and Magnum development.
Keep an eye on our LinkedIn, or the blog page of our website for the latest posts.
Parting words
Thank you for taking the time to read the second edition of Navigating Upstream!
We always welcome any feedback and suggestions.
If you’d prefer not to receive future editions, you can opt out at any time using the link below or with a simple reply. Otherwise, we look forward to keeping in touch.
– The StackHPC Team
Reach out to us via Bluesky, LinkedIn or directly via our contact page.