To virtualise or not to virtualise?
If performance is what you need, then there's no debate: bare metal still
beats virtual machines, particularly for I/O-intensive applications. However,
unless you can guarantee to keep it fully utilised, iron comes at a price.
In this article we describe how Nova can be used to provide access to both
hypervisors and bare metal compute nodes in a unified manner.
Scheduling
When support for bare metal compute via Ironic was first introduced to
Nova, it could not easily coexist with traditional hypervisor-based workloads.
Reported workarounds typically involved the use of host aggregates and flavor
properties.
Scheduling of bare metal is covered in detail in our
bespoke bare metal blog
article (see Recap: Scheduling in Nova).
Since the Placement service
was introduced, scheduling for bare metal has changed significantly. The
standard vCPU, memory and disk resources were replaced with a single unit of a
custom resource class for each Ironic node. There are two key side-effects of
this:
- a bare metal node is either entirely allocated or not at all
- the resource classes used by virtual machines and bare metal are disjoint, so
a VM flavor can never end up scheduled to a bare metal node
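Concretely, the custom resource class comes from the resource_class field on
each Ironic node, which Nova normalises by upper-casing it and prefixing it
with CUSTOM_. A minimal sketch, using an illustrative node name and class:
# Tag an Ironic node with a resource class; Nova reports it to Placement as CUSTOM_GOLD
openstack baremetal node set gold-node-01 --resource-class gold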
A flavor for a 'tiny' VM might look like this:
$ openstack flavor show vm-tiny -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "vm-tiny",
  "vcpus": 1,
  "ram": 1024,
  "disk": 1,
  "properties": ""
}
A bare metal flavor for 'gold' nodes could look like this:
$ openstack flavor show bare-metal-gold -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "bare-metal-gold",
  "vcpus": 64,
  "ram": 131072,
  "disk": 371,
  "properties": "resources:CUSTOM_GOLD='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0'"
}
Note that the vCPU/RAM/disk resources are informational only, and are zeroed
out via properties for scheduling purposes. We will discuss this further later
on.
With these flavors in place, choosing between a VM and a bare metal instance is
simply a matter of picking the appropriate flavor.
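For completeness, here is a rough sketch of how such a bare metal flavor could
be created; the flavor name, sizing and GOLD resource class are the
illustrative values used above:
# Sizing values are informational only for bare metal
openstack flavor create bare-metal-gold --vcpus 64 --ram 131072 --disk 371
# Request one whole 'gold' node and zero out the standard resource classes
openstack flavor set bare-metal-gold \
  --property resources:CUSTOM_GOLD=1 \
  --property resources:VCPU=0 \
  --property resources:MEMORY_MB=0 \
  --property resources:DISK_GB=0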
What about networking?
In our mixed environment, we might want our VMs and bare metal instances to be
able to communicate with each other, or we might want them to be isolated from
each other. Both models are possible, and work in the same way as a typical
cloud - Neutron networks are isolated from each other until connected via a
Neutron router.
Bare metal compute nodes typically use VLAN or flat networking, although with
the right combination of network hardware and Neutron plugins other models may
be possible. With VLAN networking, assuming that the hypervisors are connected
to the same physical network as the bare metal compute nodes, attaching a VM to
the same VLAN as a bare metal instance will provide L2 connectivity
between them. Alternatively, it should be possible to use a Neutron router to
join up bare metal instances on a VLAN with VMs on another network, e.g. a
VXLAN network.
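As a sketch of the shared-VLAN model (the physical network name, VLAN ID and
subnet range below are illustrative assumptions), a network that both VMs and
bare metal instances can attach to might be created like this:
# Provider VLAN network reachable from both hypervisors and bare metal switch ports
openstack network create cluster-net \
  --provider-network-type vlan \
  --provider-physical-network physnet1 \
  --provider-segment 200
openstack subnet create cluster-subnet \
  --network cluster-net \
  --subnet-range 10.0.200.0/24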
What does this look like in practice? We need a combination of Neutron
plugins/drivers that support both VM and bare metal networking. To connect bare
metal servers to tenant networks, it is necessary for Neutron to configure
physical network devices. We typically use the networking-generic-switch ML2 mechanism
driver for this, although the networking-ansible driver is emerging as
a promising vendor-neutral alternative. These drivers support bare metal
ports, that is, Neutron ports with a VNIC_TYPE of baremetal.
Vendor-specific drivers are also available, and may support both VMs and bare
metal.
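To give a feel for what this involves, here is a minimal, hedged sketch of the
ML2 configuration for networking-generic-switch; the mechanism driver list,
switch name, device type and credentials are illustrative and will differ per
deployment:
[ml2]
# Hypervisor ports handled by Open vSwitch, bare metal ports by generic switch
mechanism_drivers = openvswitch,genericswitch
tenant_network_types = vlan

# One section per managed switch; name, device type and credentials are placeholders
[genericswitch:leaf-switch-1]
device_type = netmiko_cisco_ios
ip = 192.0.2.10
username = admin
password = secret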
Where's the catch?
One issue that more mature clouds may encounter is around the transition from
scheduling based on standard resource classes (vCPU, RAM, disk), to scheduling
based on custom resource classes. If old bare metal instances exist that were
created in the Rocky release or earlier, they may have standard resource class
inventory in Placement, in addition to their custom resource class. For
example, here is the inventory reported to Placement for such a node:
$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total  |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           | 1.0              | 64       | 0        | 1         | 1        | 64     |
| MEMORY_MB      | 1.0              | 131072   | 0        | 1         | 1        | 131072 |
| DISK_GB        | 1.0              | 371      | 0        | 1         | 1        | 371    |
| CUSTOM_GOLD    | 1.0              | 1        | 0        | 1         | 1        | 1      |
+----------------+------------------+----------+----------+-----------+----------+--------+
If this node is allocated to an instance whose flavor requested (or did not
explicitly zero out) standard resource classes, we will have a usage like this:
$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class | usage  |
+----------------+--------+
| VCPU           | 64     |
| MEMORY_MB      | 131072 |
| DISK_GB        | 371    |
| CUSTOM_GOLD    | 1      |
+----------------+--------+
If this instance is deleted, the standard resource class inventory will become
available, and may be selected by the scheduler for a VM. This is not likely to
end well. What we must do is ensure that these resources are not reported to
Placement. This is done by default in the Stein release of Nova, and Rocky may
be configured to do the same by setting the following in nova.conf:
[workarounds]
report_ironic_standard_resource_class_inventory = False
However, if we do that, then Nova will attempt to remove inventory from
Placement resource providers that is already consumed by our instance, and will
receive an HTTP 409 Conflict. This will quickly fill our logs with unhelpful
noise.
Flavor migration
Thankfully, there is a solution. We can modify the embedded flavor of our
existing instances to remove the standard resource classes, which will result
in the allocations of these resources being removed from Placement. This in
turn allows Nova to remove the inventory from the resource provider. There
is a Nova patch started by Matt
Riedemann which will remove our standard resource class inventory. The patch
needs pushing over the line, but works well enough to be cherry-picked
to Rocky.
The migration can be done offline or online. We chose to do it offline, to
avoid the need to deploy this patch. For each node to be migrated:
nova-manage db ironic_flavor_migration --resource_class <node resource class> --host <host> --node <node UUID>
Alternatively, if all nodes have the same resource class:
nova-manage db ironic_flavor_migration --resource_class <node resource class> --all
You can check via the database that the embedded flavors of existing instances
have been updated correctly:
sql> use nova
sql> select flavor from instance_extra;
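For example, a query along these lines (hedged, since the embedded flavor is
stored as a JSON blob whose layout may vary between releases) confirms that the
custom resource class now appears in each instance's embedded flavor:
sql> select instance_uuid from instance_extra where flavor like '%CUSTOM_GOLD%';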
Now (Rocky only), standard resource class inventory reporting can be disabled.
After the nova compute service has been running for a while, Placement will be
updated:
$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+-------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
+----------------+------------------+----------+----------+-----------+----------+-------+
| CUSTOM_GOLD    | 1.0              | 1        | 0        | 1         | 1        | 1     |
+----------------+------------------+----------+----------+-----------+----------+-------+
$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class | usage  |
+----------------+--------+
| CUSTOM_GOLD    | 1      |
+----------------+--------+
Summary
We hope this shows that OpenStack is now in a place where VMs and bare metal
can coexist peacefully, and that even for those pesky pets, there is a path
forward to this brave new world. Thanks to the Nova team for working hard to
make Ironic a first class citizen.