To virtualise or not to virtualise?
If performance is what you need, then there's no debate: bare metal still beats virtual machines, particularly for I/O-intensive applications. However, unless you can guarantee to keep it fully utilised, iron comes at a price. In this article we describe how Nova can be used to provide access to both hypervisors and bare metal compute nodes in a unified manner.
Scheduling
When support for bare metal compute via Ironic was first introduced to Nova, it could not easily coexist with traditional hypervisor-based workloads. Reported workarounds typically involved the use of host aggregates and flavor properties.
Scheduling of bare metal is covered in detail in our bespoke bare metal blog article (see Recap: Scheduling in Nova).
Since the Placement service was introduced, scheduling has significantly changed for bare metal. The standard vCPU, memory and disk resources were replaced with a single unit of a custom resource class for each Ironic node. There are two key side-effects of this:
- a bare metal node is either entirely allocated or not at all
- the resource classes used by virtual machines and bare metal are disjoint, so a VM flavor cannot end up being scheduled to a bare metal node
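The custom resource class for each node comes from the Ironic node's resource_class field, which Nova reports to Placement upper-cased and prefixed with CUSTOM_. As a sketch (the 'gold' class name is illustrative), it can be set with the baremetal CLI:
$ openstack baremetal node set <node UUID> --resource-class gold
The node is then exposed to the scheduler as a single unit of CUSTOM_GOLD.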
A flavor for a 'tiny' VM might look like this:
$ openstack flavor show vm-tiny -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "vm-tiny",
  "vcpus": 1,
  "ram": 1024,
  "disk": 1,
  "properties": ""
}
A bare metal flavor for 'gold' nodes could look like this:
$ openstack flavor show bare-metal-gold -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "bare-metal-gold",
  "vcpus": 64,
  "ram": 131072,
  "disk": 371,
  "properties": "resources:CUSTOM_GOLD='1',
                 resources:DISK_GB='0',
                 resources:MEMORY_MB='0',
                 resources:VCPU='0'"
}
Note that the vCPU/RAM/disk resources are informational only, and are zeroed out via properties for scheduling purposes. We will discuss this further later on.
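For reference, a flavor along these lines could be created with the standard CLI (the values mirror the 'gold' example above):
$ openstack flavor create bare-metal-gold --vcpus 64 --ram 131072 --disk 371
$ openstack flavor set bare-metal-gold \
    --property resources:CUSTOM_GOLD=1 \
    --property resources:VCPU=0 \
    --property resources:MEMORY_MB=0 \
    --property resources:DISK_GB=0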
With flavors in place, users choose between VMs and bare metal simply by picking the appropriate flavor.
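For example, the choice is made at boot time (the image, network and key names here are illustrative):
$ openstack server create --flavor vm-tiny --image CentOS-7 --network demo-net --key-name demo-key vm-instance-1
$ openstack server create --flavor bare-metal-gold --image CentOS-7 --network demo-net --key-name demo-key bm-instance-1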
What about networking?
In our mixed environment, we might want our VMs and bare metal instances to be able to communicate with each other, or we might want them to be isolated from each other. Both models are possible, and work in the same way as a typical cloud - Neutron networks are isolated from each other until connected via a Neutron router.
Bare metal compute nodes typically use VLAN or flat networking, although with the right combination of network hardware and Neutron plugins other models may be possible. With VLAN networking, assuming that hypervisors are connected to the same physical network as bare metal compute nodes, then attaching a VM to the same VLAN as a bare metal compute instance will provide L2 connectivity between them. Alternatively, it should be possible to use a Neutron router to join up bare metal instances on a VLAN with VMs on another network e.g. VXLAN.
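As a sketch of the router-based approach, a VLAN network for bare metal could be joined to an existing VXLAN-based tenant network like this (the physical network name, segment ID, subnet range and the vm-subnet name are illustrative, and creating provider networks typically requires admin credentials):
$ openstack network create --provider-network-type vlan \
    --provider-physical-network physnet1 --provider-segment 1234 bm-net
$ openstack subnet create --network bm-net --subnet-range 10.0.0.0/24 bm-subnet
$ openstack router create bm-router
$ openstack router add subnet bm-router bm-subnet
$ openstack router add subnet bm-router vm-subnet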
What does this look like in practice? We need a combination of Neutron plugins/drivers that support both VM and bare metal networking. To connect bare metal servers to tenant networks, it is necessary for Neutron to configure physical network devices. We typically use the networking-generic-switch ML2 mechanism driver for this, although the networking-ansible driver is emerging as a promising vendor-neutral alternative. These drivers support bare metal ports, that is, Neutron ports with a VNIC_TYPE of baremetal. Vendor-specific drivers are also available and may support both VMs and bare metal.
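As a rough sketch of the Neutron side, the ML2 configuration for networking-generic-switch might resemble the following (the switch name, device type, addresses and credentials are illustrative):
# /etc/neutron/plugins/ml2/ml2_conf.ini (sketch)
[ml2]
mechanism_drivers = openvswitch,genericswitch
tenant_network_types = vlan

[ml2_type_vlan]
network_vlan_ranges = physnet1:1000:2000

# One section per managed switch; networking-generic-switch drives the
# switch to plug bare metal ports into the correct VLAN.
[genericswitch:leaf-switch-1]
device_type = netmiko_ovs_linux
ip = 192.0.2.10
username = admin
password = secret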
Where's the catch?
One issue that more mature clouds may encounter is around the transition from scheduling based on standard resource classes (vCPU, RAM, disk), to scheduling based on custom resource classes. If old bare metal instances exist that were created in the Rocky release or earlier, they may have standard resource class inventory in Placement, in addition to their custom resource class. For example, here is the inventory reported to Placement for such a node:
$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU | 1.0 | 64 | 0 | 1 | 1 | 64 |
| MEMORY_MB | 1.0 | 131072 | 0 | 1 | 1 | 131072 |
| DISK_GB | 1.0 | 371 | 0 | 1 | 1 | 371 |
| CUSTOM_GOLD | 1.0 | 1 | 0 | 1 | 1 | 1 |
+----------------+------------------+----------+----------+-----------+----------+--------+
If this node is allocated to an instance whose flavor requested (or did not explicitly zero out) standard resource classes, we will have a usage like this:
$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class | usage |
+----------------+--------+
| VCPU | 64 |
| MEMORY_MB | 131072 |
| DISK_GB | 371 |
| CUSTOM_GOLD | 1 |
+----------------+--------+
If this instance is deleted, the standard resource class inventory will become available, and may be selected by the scheduler for a VM. This is not likely to end well. What we must do is ensure that these resources are not reported to Placement. This is done by default in the Stein release of Nova, and Rocky may be configured to do the same by setting the following in nova.conf:
[workarounds]
report_ironic_standard_resource_class_inventory = False
However, if we do that, Nova will attempt to remove inventory from Placement resource providers that is already consumed by our instance, and will receive an HTTP 409 Conflict. This will quickly fill our logs with unhelpful noise.
Flavor migration
Thankfully, there is a solution. We can modify the embedded flavor of our existing instances to remove the standard resource classes, which results in the removal of those allocations from Placement and, in turn, allows Nova to remove the inventory from the resource provider. There is a Nova patch started by Matt Riedemann which removes the standard resource classes from the embedded flavor. The patch still needs pushing over the line, but works well enough to be cherry-picked to Rocky.
The migration can be done offline or online. We chose to do it offline, to avoid the need to deploy this patch. For each node to be migrated:
nova-manage db ironic_flavor_migration --resource_class <node resource class> --host <host> --node <node UUID>
Alternatively, if all nodes have the same resource class:
nova-manage db ironic_flavor_migration --resource_class <node resource class> --all
You can check that the instances' embedded flavors have been updated correctly via the database:
sql> use nova
sql> select flavor from instance_extra;
Now (Rocky only), standard resource class inventory reporting can be disabled. After the nova compute service has been running for a while, Placement will be updated:
$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+-------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
+----------------+------------------+----------+----------+-----------+----------+-------+
| CUSTOM_GOLD | 1.0 | 1 | 0 | 1 | 1 | 1 |
+----------------+------------------+----------+----------+-----------+----------+-------+
$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class | usage |
+----------------+--------+
| CUSTOM_GOLD | 1 |
+----------------+--------+
Summary
We hope this shows that OpenStack is now in a place where VMs and bare metal can coexist peacefully, and that even for those pesky pets, there is a path forward to this brave new world. Thanks to the Nova team for working hard to make Ironic a first class citizen.