Have you ever had the nuisance of configuring a server BIOS? How
about a rack full of servers? Or an aisle, a hall, an entire
facility even? It becomes tedious toil even before the second
server, and applying a consistent configuration only grows less
reliable as the scale increases.
In this post we describe how we apply some modern tools from the
cloud toolbox (Ansible, Ironic and Python) to tackle this age-old
problem.
Server Management in the 21st Century
Baseboard management controllers (BMCs) are a valuable tool for
easing the inconvenience of hardware management. By using a BMC
we can configure our firmware using remote access, avoiding a trip
to the data centre and stepping from server to server with a crash
cart. This is already a big win.
However, BMCs are still pretty slow to apply changes, and must be
manipulated individually. Through automation we can address these
shortcomings.
I've seen some pretty hairy early efforts at automation, for example
playing out timed keystroke macros across a hundred terminals of BMC
sessions. This might work, but it's a desperate hack. Using the tools
created for configuration management we can do so much better.
A Quick Tour of OpenStack Server Hardware Management
OpenStack deployment usually draws upon some hardware inventory
management intelligence. In our recent project with the University
of Cambridge this was Red Hat OSP Director.
The heart of OSP Director is TripleO
and the heart of TripleO is OpenStack Ironic.
Ironic is OpenStack's bare metal manager. It masquerades as a
virtualisation driver for OpenStack Nova, and provisions bare metal
hardware when a user asks for a compute instance to be created.
TripleO uses this capability to good effect to create
OpenStack-on-OpenStack (OOO, hence the name), in which the servers
of the OpenStack control plane are instances created within another
OpenStack layer beneath.
Our new tools fit neatly into the TripleO process between registration
and introspection
of undercloud nodes, and are complementary to the existing functionality
offered by TripleO.
A Cloud-centric Approach to Firmware Configuration Management
OpenStack Ironic tracks hardware state for every server in an OpenStack deployment.
A simple overview can be seen with ironic node-list:
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| 415c254f-3e82-446d-a63b-232af5816e4e | control1 | 3d27b7d2-729c-467c-a21b-74649f1b1203 | power on | active | False |
| 2646ece4-a24e-4547-bbe8-786eca16da82 | control2 | 8a066c7e-36ec-4c45-9e1b-5d0c5635f256 | power on | active | False |
| 2412f0ef-dedb-49c8-a923-778db36a57d9 | control3 | 6a62936f-40ec-49e7-a820-6f3329e5bb0c | power on | active | False |
| 81676b2d-9c37-4111-a32a-456a9f933e57 | compute0 | aac2866c-7d16-4089-9d94-611bfc38467e | power on | active | False |
| c6a5fbe7-566a-447e-a806-9e33676be5ea | compute1 | 619476ae-fec4-42c6-b3f5-3a4f5296d3bc | power on | active | False |
| c7f27dd4-67a7-42b9-93ab-2e444802c5c2 | compute2 | a074c3f8-eb87-46d6-89c8-f360fbf2a3df | power on | active | False |
| 025d84dc-a590-46c5-a456-211d5c1e8f1a | compute3 | 11524318-2ecf-4880-a1cf-76cd62935b00 | power on | active | False |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
Ironic's node data includes how to access the BMC of every server
in the node inventory.
We extract the data from Ironic's inventory to generate a dynamic inventory for use
with Ansible. Instead of a file of hostnames, or a list of command line parameters,
a dynamic inventory is the output from an executed command. A dynamic inventory
executable accepts a few simple arguments and emits node inventory data in JSON format.
Using Python and the ironicclient module simplifies the implementation.
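To illustrate the pattern, a minimal Ironic-backed dynamic inventory
might look like the sketch below. This is not the script shipped in
the repository described later; the group layout is simplified and
the era-appropriate ironicclient interface is assumed:

#!/usr/bin/env python
# Minimal sketch of an Ansible dynamic inventory backed by Ironic.
import json
import os
import sys

from ironicclient import client


def main():
    # Authenticate using the usual OS_* environment variables.
    ironic = client.get_client(
        1,
        os_username=os.environ['OS_USERNAME'],
        os_password=os.environ['OS_PASSWORD'],
        os_tenant_name=os.environ['OS_TENANT_NAME'],
        os_auth_url=os.environ['OS_AUTH_URL'])
    inventory = {'all': {'hosts': []}, '_meta': {'hostvars': {}}}
    for node in ironic.node.list(detail=True):
        inventory['all']['hosts'].append(node.uuid)
        # Expose the node's BMC access details as host variables
        # for use by later plays.
        inventory['_meta']['hostvars'][node.uuid] = {
            'driver_info': node.driver_info,
        }
    if '--host' in sys.argv:
        # Per-host variables are already served via _meta above.
        json.dump({}, sys.stdout)
    else:
        json.dump(inventory, sys.stdout)


if __name__ == '__main__':
    main()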
To perform fact gathering and configuration, two new Ansible roles were
developed and published on Ansible Galaxy.
- DRAC configuration: provides the drac Ansible module for
  configuration of BIOS settings and RAID controllers. A single task
  is provided to execute the module. The role is available on Ansible
  Galaxy as stackhpc.drac and the source code is available on Github
  as stackhpc/drac.
- DRAC fact gathering: provides the drac_facts Ansible module for
  gathering facts from a DRAC card. The module is not executed by
  this role, but is available to subsequent tasks and roles. The role
  is available on Ansible Galaxy as stackhpc.drac-facts and the
  source code is available on Github as stackhpc/drac-facts.
We use the python-dracclient module as a high-level interface
for querying and configuring the DRAC via the WSMAN protocol. This
module was developed by the Ironic team to support the DRAC family
of controllers. The module provides a useful level of abstraction
for these Ansible modules, hiding the complexities of the WSMAN
protocol.
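As a taste of the library (a sketch only: the address and
credentials below are placeholders), querying the BIOS settings of a
DRAC from Python looks roughly like this:

from dracclient import client

# Placeholder DRAC address with the Dell factory-default credentials.
drac = client.DRACClient('192.168.0.1', 'root', 'calvin')

# list_bios_settings() returns a dict mapping each setting name to an
# object carrying its current and pending values.
for name, setting in drac.list_bios_settings().items():
    print(name, setting.current_value)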
Example Playbooks
The source code for all of the following examples is available on Github at
stackhpc/ansible-drac-examples. The playbooks are not
large, and we encourage you to read through them.
A Docker image providing all dependencies has also been created and
made available on Dockerhub at the stackhpc/ansible-drac-examples
repository.
To use this image, run:
$ docker run --name ansible-drac-examples -it --rm docker.io/stackhpc/ansible-drac-examples
This will start a Bash shell in the /ansible-drac-examples directory where
there is a checkout of the ansible-drac-examples repository. The
stackhpc.drac and stackhpc.drac-facts roles are installed under
/etc/ansible/roles/. Once the shell is exited the container will be
removed.
Ironic Inventory
In the example repository, the inventory script is
inventory/ironic_inventory.py. We need to provide this script with the
following environment variables to allow it to communicate with Ironic:
OS_USERNAME, OS_PASSWORD, OS_TENANT_NAME and OS_AUTH_URL.
For the remainder of this article we will assume that a file, cloudrc, is
available and exports these variables. To see the output of the inventory
script:
$ source cloudrc
$ ./inventory/ironic_inventory.py --list
To use this dynamic inventory with ansible-playbook, use the -i
argument:
$ source cloudrc
$ ansible-playbook -i inventory ...
The inventory will contain all Ironic nodes, named by their UUID. For
convenience, an Ansible group is created for each named node using its name
with a prefix of node_.
The inventory also contains groupings for servers in Ironic maintenance
mode, and for servers in different states in Ironic's hardware state
machine. Groups are also created for each server profile defined
by TripleO: controller, compute, block-storage, etc.
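To give a feel for the structure, an abbreviated --list output might
look like the following, using nodes from the listing above (the
exact group names here are illustrative):

{
  "node_compute0": ["81676b2d-9c37-4111-a32a-456a9f933e57"],
  "compute": [
    "81676b2d-9c37-4111-a32a-456a9f933e57",
    "c6a5fbe7-566a-447e-a806-9e33676be5ea"
  ],
  "_meta": {
    "hostvars": {
      "81676b2d-9c37-4111-a32a-456a9f933e57": {"...": "..."}
    }
  }
}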
In the following examples, the playbooks will execute against all Ironic nodes
discovered by the inventory script. To limit the hosts against which a play is
executed, use the --limit argument to ansible-playbook.
If you would rather not make any changes to the systems in the inventory, use
the --check argument to ansible-playbook. This will display the changes
that would have been made if the --check argument were not passed.
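For example, a dry run against a single named node (the playbook
name here is a placeholder):

$ ansible-playbook -i inventory --limit node_compute0 --check example.yml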
Example 1: Gather and Display Facts About Firmware Configuration
The drac-facts.yml playbook shows how the stackhpc.drac-facts role
can be used to query the DRAC of each node in the inventory and
display the results. Run the following command to execute the playbook:
$ source cloudrc
$ ansible-playbook -i inventory drac-facts.yml
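In outline, a playbook using the role can be as small as the
following sketch. The argument names passed to drac_facts are
hypothetical here, so check the role documentation for the real ones:

---
- name: Gather and display DRAC facts
  hosts: all
  gather_facts: no
  roles:
    # Makes the drac_facts module available; runs no tasks itself.
    - role: stackhpc.drac-facts
  tasks:
    - name: Gather facts from each node's DRAC
      drac_facts:
        address: "{{ drac_address }}"      # hypothetical parameters
        username: "{{ drac_username }}"
        password: "{{ drac_password }}"
      register: result

    - name: Display the gathered facts
      debug:
        var: result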
Under The Hood
The vast majority of the useful code provided by these roles takes
the form of Python Ansible modules. This takes advantage of the
capability of Ansible roles to contain modules under a library
directory, and means that no Python code needs to be installed on the
system or included with the core or extra Ansible modules.
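Schematically, the layout that makes this work looks like the
following (indicative, not an exact listing of the repository):

stackhpc.drac-facts/
├── library/
│   └── drac_facts.py   # picked up automatically when the role is applied
└── tasks/
    └── main.yml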
The drac_facts Module
The drac_facts module is relatively simple. It queries the state of
BIOS settings, RAID controllers and the DRAC job queues, translates
the results to a JSON-friendly format and returns them as facts.
The drac Module
The drac module is more complex than the drac_facts module.
The DRAC API provides a split-phase execution model, allowing changes
to be staged before either committing or aborting them. Committed
changes are applied by rebooting the system. To further complicate
matters, the BIOS settings and each of the RAID controllers represent
a separate configuration channel. Upon execution of the drac
module these channels may have uncommitted or committed pending
changes. We must therefore determine a minimal sequence of steps
to realise the requested configuration for an arbitrary initial
state, which may affect more than one of these channels.
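In python-dracclient terms, the split-phase model looks roughly like
this sketch (the exact keys in the returned dictionary vary between
library versions):

from dracclient import client

drac = client.DRACClient('192.168.0.1', 'root', 'calvin')

# Phase 1: stage a BIOS change. Nothing is applied to the server yet.
result = drac.set_bios_settings({'NumLock': 'On'})

# Phase 2: commit the staged change, creating a configuration job
# that the DRAC executes when the server next reboots.
if result.get('commit_required') or result.get('is_commit_required'):
    drac.commit_pending_bios_changes()

# Or abandon the staged change instead of committing it:
# drac.abandon_pending_bios_changes()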
The python-dracclient module provided almost all of the necessary
input data with one exception. When querying the virtual disks, the
returned objects did not contain the list of physical disks that each virtual
disk is composed of. We developed the required functionality and submitted it
to the python-dracclient project.
Thanks go to the python-dracclient community for their help in
implementing the feature.