Innovation in the Magnum Project
Kubernetes is a fast-developing ecosystem, and Kubernetes clusters
are deployed far and wide. OpenStack hosts only a small proportion
of the Kubernetes installed base.
The Cluster API project focuses
on the interface between Kubernetes clusters and the various types of
cloud infrastructure (including OpenStack) on which those clusters are
deployed. The wide range of infrastructures that Cluster API
supports has given it a vibrant developer community and helps to
make it a de facto standard for Kubernetes deployment and
operations on cloud infrastructure.
Bringing Cluster API's advantages to OpenStack infrastructure (and
to StackHPC's clients) presented a compelling opportunity.
This is a story about the Four Opens.
- 20 October 2021: The concept of a Cluster API driver is introduced at
  the Yoga cycle Magnum Project Teams Gathering (PTG).
- 26 October 2021: An empty proof-of-concept Cluster API driver is
  submitted by John Garbutt to the Magnum codebase, generating a lot of
  community interest. Work proceeds (albeit slowly) through 2022.
- 12 January 2022: Following the open design process, a spec is
  drafted and uploaded for review. The community reviews it, but
  we all have busy lives and in some cases months pass between
  reviews and responses. It takes a year to get approved.
- 19 October 2022: Further discussion on the driver's (slow) progress at the
  Antelope PTG.
- 12 December 2022: The team at VEXXHOST announce an independently-developed
  Cluster API driver, released as a third-party out-of-tree Magnum driver.
- 6 January 2023: The team from StackHPC combines with the team from VEXXHOST
  to submit a joint presentation for the OpenInfra Summit in Vancouver - and
  it is accepted!
- 14 June 2023: The presentation is delivered at the Vancouver summit by Matt
  Pryor from StackHPC and Mohammed Naser from VEXXHOST. The
  presentation covers implementation differences and similarities
  between the two drivers. Still two drivers, developed
  largely as separate efforts - could they be reconciled?
- October 2023: Further discussion among the Magnum team at the Caracal PTG,
  this time about how to reconcile the two alternative Cluster API
  driver implementations and how to avoid conflict between them.
- January 2024: After much deliberation and effort at reducing driver conflict
  within the Magnum project, spanning several PTG sessions, it is
  decided to merge the in-tree driver whose development was led
  by StackHPC into the Magnum codebase. This will happen in the
  Dalmatian release cycle, landing in Q4 2024.
Anatomy of a Cluster API driver
Historically, Magnum has deployed Kubernetes clusters in a bespoke
manner, with cluster infrastructure provisioned using OpenStack
Heat and then configured using
custom scripts. Over time this home-brewed approach has become a maintenance
burden for the open source developers of Magnum, and as a
result Magnum can be slow to support new Kubernetes versions.
Cluster API, on the other hand, is an active Kubernetes project supported
by the Cluster Lifecycle SIG
with a significant and broad community.
Cluster API comprises a set of Kubernetes operators that
provide declarative APIs for provisioning, upgrading and operating Kubernetes
clusters on a variety of infrastructures, of which OpenStack is one.
This is accomplished by having a core API for defining clusters that calls
out to infrastructure providers to provision cloud infrastructure such as
networks, routers, load balancers and machines. Once provisioned by the
infrastructure provider, the machines themselves are turned into a Kubernetes
cluster using kubeadm, the standard
upstream tool for creating Kubernetes clusters. Auto-healing and auto-scaling
are also supported using infrastructure-agnostic code that is maintained as
part of the upstream project.
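To make that division of responsibilities concrete, here is a minimal sketch
of the resources involved; the exact API versions and fields depend on the
Cluster API and OpenStack provider (CAPO) releases in use:
# Sketch only: the core Cluster resource delegates infrastructure concerns to
# a provider-specific object (here, the OpenStack provider's OpenStackCluster)
# and control plane bootstrapping to kubeadm
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: example-control-plane
  infrastructureRef:
    # Reconciled by the OpenStack infrastructure provider (CAPO);
    # the apiVersion depends on the provider release
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: OpenStackCluster
    name: example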
StackHPC, with contributions from Catalyst Cloud,
have created a Magnum driver
that provisions Kubernetes clusters by creating and updating Cluster API resources
as an alternative to the standard Heat-based driver. Using Cluster API via Magnum
in this way has two major advantages:
- Magnum becomes a wrapper around deployment tools that are officially
supported and maintained upstream. This reduces the amount of Magnum-specific
code that needs to be maintained and allows us to easily support new
Kubernetes versions as they become available.
- Users can benefit from the features of Cluster API without having to
change their existing Magnum workflows - for instance, the OpenStack CLI
and Magnum Terraform modules will continue to work with the new driver.
Azimuth's Cluster API Engine
The new Cluster API driver reuses components developed to provide Kubernetes
support in Azimuth, an
open-source portal providing self-service HPC and AI platforms for OpenStack
clouds, where they have been battle-tested for a number of years.
Providing a managed Kubernetes service is much more than just provisioning
Kubernetes cluster nodes - clusters must have addons installed, such as a
CNI, OpenStack integrations and the
metrics server.
Whilst Cluster API is excellent at provisioning and operating Kubernetes
clusters, it has limited capability for managing cluster addons.
To address this, StackHPC developed an addon provider for Cluster API - another Kubernetes
operator that provides a declarative API for specifying which Helm charts and additional manifests should be installed on a Cluster
API cluster.
For example, this resource specifies that the NGINX ingress controller should be installed onto
the Cluster API cluster example using values from an inline template
(values from ConfigMaps and
Secrets are
also supported):
apiVersion: addons.stackhpc.com/v1alpha1
kind: HelmRelease
metadata:
  name: ingress-nginx
spec:
  # The name of the target Cluster API cluster
  clusterName: example
  # The namespace on the cluster to install the Helm release in
  targetNamespace: ingress-nginx
  # The name of the Helm release on the target cluster
  releaseName: ingress-nginx
  # Details of the Helm chart to use
  chart:
    repo: https://kubernetes.github.io/ingress-nginx
    name: ingress-nginx
    version: 4.9.1
  # Values for the Helm release
  # These can be merged from multiple sources
  valuesSources:
    - template: |
        controller:
          metrics:
            enabled: true
            serviceMonitor:
              enabled: true
As well as providing managed Kubernetes clusters for tenants using Cluster API,
Azimuth itself also runs on a Kubernetes cluster. Usually this cluster is deployed
in a project on the same OpenStack cloud that Azimuth is configured to provision
resources in.
In order to share as much code and knowledge between these two use cases as possible,
StackHPC have codified the deployment of Cluster API clusters on OpenStack,
with addons properly configured, as a set of open-source Helm charts. These charts are used in Azimuth
for provisioning both the Azimuth cluster itself and tenant clusters.
Helm's Flexibility Advantage
The biggest benefit of encapsulating good practice using Helm is that the charts
can be reused outside of Azimuth, e.g. to operate Kubernetes clusters using
GitOps with Flux
or Argo CD. For example, deploying
a basic cluster with a monitoring stack requires the following Helm values:
# The name of a secret containing an OpenStack appcred for the target project
cloudCredentialsSecretName: my-appcred
# The ID of the image to use for cluster nodes
machineImageId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
# The version of kubelet that is in that image
kubernetesVersion: 1.28.6
clusterNetworking:
  # The ID of the external network to use
  externalNetworkId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
controlPlane:
  # The name of the flavor to use for control plane nodes
  machineFlavor: small
# The node groups for the cluster
nodeGroups:
  - name: md-0
    machineFlavor: small
    machineCount: 2
# Enable the monitoring stack
addons:
  monitoring:
    enabled: true
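As a rough sketch of the GitOps route mentioned above, the same values can be
supplied through a Flux HelmRelease. The chart repository URL and chart name
below are assumptions based on StackHPC's published capi-helm-charts rather
than details taken from this post:
# Assumed chart source: StackHPC's capi-helm-charts repository
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: capi-helm-charts
  namespace: flux-system
spec:
  url: https://stackhpc.github.io/capi-helm-charts
  interval: 1h
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-cluster
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      # Assumed name of the cluster chart within capi-helm-charts
      chart: openstack-cluster
      sourceRef:
        kind: HelmRepository
        name: capi-helm-charts
  # The same values as shown above, stored in a ConfigMap
  valuesFrom:
    - kind: ConfigMap
      name: my-cluster-values
An equivalent Argo CD Application could point at the same chart and values.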
In particular, these charts are used by StackHPC's Magnum driver to produce the
Cluster API resources for Magnum clusters. The Magnum driver simply converts
Magnum's representation of the cluster template and cluster into Helm values that
are passed to the CAPI Helm charts.
This means that although StackHPC's Cluster API Magnum driver is relatively new,
it is based on components that have been used in production for several years.
StackHPC will be contributing the Helm charts to the Magnum project to serve
as a reference for how to provision Cluster API clusters on OpenStack.
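For illustration, the translation the driver performs looks roughly like the
following, with Magnum cluster and cluster template fields mapped onto the
chart values shown earlier (the mapping here is indicative only, not an exact
or exhaustive description of the driver's behaviour):
# Helm values produced by the driver, annotated with their Magnum sources
machineImageId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  # cluster template image
kubernetesVersion: 1.28.6               # derived from the image / labels
clusterNetworking:
  externalNetworkId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  # external_network_id
controlPlane:
  machineFlavor: small                  # master_flavor_id
nodeGroups:
  - name: default-worker                # Magnum node group name
    machineFlavor: small                # flavor_id
    machineCount: 2                     # node_count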
Magnum Cluster API Drivers are Like Buses
The Kolla-Ansible project already packages
the driver developed by VEXXHOST and will also package
the Cluster API Helm driver once it merges into the Magnum codebase.
In the Bobcat release of Magnum, it is not possible for both of these drivers
to coexist, as the current driver selection mechanism does not consider enough
information to differentiate them (see the Magnum Docs).
Jake Yip, the current Magnum PTL, has been doing some great work to make the
driver selection more explicit using image metadata, which will allow the
two Cluster API drivers to co-exist from the Caracal release onwards
(patch in Gerrit).
There are some technical differences between the VEXXHOST and StackHPC drivers.
In particular, as discussed above, StackHPC have chosen to use Helm to template
the Cluster API resources for Magnum clusters, whereas the VEXXHOST driver templates
the Cluster API resources in Python code.
We chose Helm for a number of reasons:
- The method to deploy Cluster API clusters on OpenStack is
reusable outside of Magnum, rather than being encapsulated
behind the Magnum APIs.
- The Helm charts, when combined with StackHPC's addon provider, provide a
powerful, declarative way to manage addons.
- More flexibility for operators - if an operator needs to modify the way that
clusters are deployed, the default charts can be replaced with custom charts
as long as the Helm values provided by the Magnum driver are respected.
- The Helm charts can have an independent release cycle, allowing Magnum to
support new Kubernetes versions without waiting for an OpenStack release
by just creating new cluster templates that reference the new Helm chart
version.
A significant difference between the two drivers is their use of
ClusterClass.
Whilst ClusterClass does promise to make certain lifecycle operations easier,
particularly upgrades, the Cluster API project still classifies the feature as
"experimental (alpha)" and requires a feature gate to be enabled to use it.
The CAPI Helm charts have intelligence baked in to deal with upgrades, backed by
extensive use in production and cluster upgrades tested in CI. We are being
cautious about switching to an alpha feature that may still be subject to substantial
change. It is expected that the Helm charts will move to ClusterClass when
it moves into beta or GA.
Despite the technical differences, there are still a number of places where
code could be shared between the two drivers. Both drivers derive the Magnum
cluster state by looking at the same Cluster API objects, both need to make
the certificates generated by Magnum available to the Cluster API clusters,
and both monitor cluster health in the same way.
At StackHPC, we are keen to collaborate to see if we can develop a
shared library that can be used by both drivers to perform these tasks,
reducing duplication and making the most of the available effort to push
the Magnum ecosystem forward together.