At StackHPC, we use Kolla-Ansible to deploy Monasca, a multi-tenant monitoring-as-a-service solution that integrates with OpenStack and allows users to deploy InfluxDB as a time-series database. As this database fills up over time under an unbounded retention period, it is not surprising that its response time degrades compared to when it was initially deployed. Long-term operation of Monasca by our clients in production has required a proactive approach to keep the monitoring and logging services running optimally. The problems we have seen relate to query performance, which has been directly affecting our customers and other Monasca users. In this article, we tell the story of how we overcame these issues and introduce an opt-in database-per-tenant capability we pushed upstream into Monasca for the benefit of our customers and the wider OpenStack community, who may be dealing with similar challenges of monitoring at scale.
The Challenges of Monitoring at Scale
Our journey starts at a point where the following issues (disparate, yet related in that they are all symptoms of a growing OpenStack deployment) were brought to our attention:
- When a user views collected metrics on a Monasca Grafana dashboard (which uses Monasca as the data source), it first aims to dynamically obtain a list of host names. This query was not respecting the time boundary that can be selected on the dashboard and instead was scanning results from the entire database. Naturally, this went unnoticed when the database was small, but as the cardinality of the collected metrics grew over time (345 million at its peak on one site, that is, 345 million unique time series), this query could take up to an hour before eventually timing out. In the meantime, it would block resources needed by other queries.
- A user from a new OpenStack project would experience the same delay in query time against the Monasca API as a user from another project with a much larger metrics repository. This is because Monasca currently implements a single InfluxDB database by default, and project-scoped metrics are filtered using a WHERE clause. This was a clear bottleneck.
- Last but not least, all metrics being gathered were subject to the same retention policy. InfluxDB supports multiple retention policies per database. To keep things further isolated, it is also possible to have a database per tenant, each with its own default retention policy. Not only does this increase the portability of projects, it also removes the overhead of filtering results by project each time a query is executed, naturally improving performance.
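For illustration, this per-tenant isolation can be expressed directly in InfluxQL along the following lines (a hypothetical sketch; the database and policy names here are assumptions, not taken from Monasca's implementation):

```sql
-- One database per tenant, each created with its own default retention
-- policy (names are hypothetical)
CREATE DATABASE "monasca_project_1" WITH DURATION 2w REPLICATION 1 NAME "2w"
CREATE DATABASE "monasca_project_2" WITH DURATION 12w REPLICATION 1 NAME "12w"
```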
To address these issues, we implemented the following quick fixes. While they alleviate the symptoms in the short term, we would not consider either of them a sustainable or scalable solution, as both will soon require further manual intervention:
- Disabling dynamic host name lookup by providing a static inventory of host names (which could be automated at deploy time for static projects). However, for dynamic inventories, this approach relies on manual update of the inventory.
- Deleting metrics with highly variable dimensions, which contribute disproportionately to increasing the database cardinality (larger cardinality leads to increased query time for InfluxDB, although other time-series databases, e.g. TimescaleDB, claim not to be affected in the same way). Many metric sources expose metrics with highly variable dimensions; avoiding this is an intrinsically hard problem, and not one confined to Monasca. For example, sources like cAdvisor expose many such metrics by default, and one has to be judicious about which metrics to scrape. In our Kolla-Ansible based deployment, the low-hanging fruit was mostly metrics matching the regex pattern log.* originating from the OpenStack control plane, which are useful for triggering alarms and, over a finite time horizon, for auditing. However, since all data is currently stored under the same database and retention policy (the Monasca API currently has no way of setting per-project retention policies), it is not possible to define project-specific data expiry dates. By deleting these log metrics we were able to reduce 345 million unique time series down to a mere 227 thousand, 0.07% of the original (deleting at a rate of 7 million series per hour for a total of 49 hours). Similarly, at another site, we were able to cut down from 2 million series to 186 thousand, 9% of the original (deleting at a rate of 29 thousand series per hour for 77 hours). In both cases, query time fell from a state where queries were timing out to just a few seconds. However, a database per tenancy with fine control over retention periods remained the holy grail for delivering sustained performance.
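As a quick sanity check, the percentages quoted above follow directly from the series counts:

```python
# Verify the cardinality reduction figures quoted above.
site_a_before, site_a_after = 345_000_000, 227_000
site_b_before, site_b_after = 2_000_000, 186_000

print(f"{site_a_after / site_a_before:.2%}")  # -> 0.07%
print(f"{site_b_after / site_b_before:.0%}")  # -> 9%
```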
Towards Greater Performance
Our multi-pronged approach to make the monitoring stack more performant and resilient can be summarised in the following ways:
- The first part of our effort to improve the situation was to introduce a database-per-tenancy feature to Monasca. The enabling patches affecting the monasca-{api,persister} projects have now merged upstream and are available from the OpenStack Train release. This paves the way for using an instance of InfluxDB per tenant to further decouple the database back-end between tenants. In summary, these changes enable end users to:
- Enable a database per tenant within a single InfluxDB instance on an opt-in basis by setting db_per_tenant to True in the monasca-{api,persister} configuration files.
- Set a default retention policy by defining default_retention_hours in the monasca-persister configuration file. Further development of this thread would involve giving project owners the ability to set the retention policy of their tenancy via the API.
- Migrate an existing monolithic database to a database per tenant model using an efficient migration tool we proudly upstreamed.
- We also introduced experimental changes to limit the search results to the query time window selected on the Grafana dashboard. The required changes spanning several projects (monasca-{api,grafana-datasource,tempest-plugin}) have all merged upstream and are also available from the OpenStack Train release. Previously, the only option was to search the entire database, so queries targeting large databases were timing out; this can now be avoided. The only caveat with this approach is that the results are approximate, i.e. the accuracy of the returned result is determined by the length of the shardGroupDuration, which resolves to 1 week by default when the retention policy is infinite, and to 1 day when the retention policy is 2 weeks. Considering that the earlier behaviour was to scan the entire database, this approach yields a considerable improvement, despite a minor loss in precision.
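As a sketch, the opt-in settings described above might look as follows in the persister configuration (the section and value placement are assumptions based on the option names; consult the rendered monasca-persister configuration reference for your release):

```ini
# Hypothetical excerpt from /etc/monasca/persister.conf
[influxdb]
database_name = monasca
# Opt in to one InfluxDB database per tenant (Train onwards)
db_per_tenant = True
# Default retention applied to newly created per-tenant databases (1 year)
default_retention_hours = 8760
```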
These additional features have allowed us to further reduce the query time to less than a second in a large, 100+ node deployment with a 1-year retention policy; a dramatic improvement compared to queries without any time boundary, where our users were frequently hitting query timeouts. Additionally, we have facilitated a more sustainable way to manage the life-cycle of data being generated and consumed by different tenants. For example, this allows the tenancy for the control plane logs to have a short retention duration.
A Well-Rehearsed Migration Strategy
Existing production environments hoping to reap the benefit of capabilities we have discussed so far may also wish to migrate their existing monolithic database to a database per tenant model. A good migration tool requires a great migration strategy. In order to ensure minimal disruptions for our customers, we rehearsed the following migration strategy in a pre-production environment before applying the changes in production.
First of all, carry out a migration of the current snapshot of the database up to a desired --migrate-end-time-offset, e.g. 52 weeks into the past. Much like a Virtual Machine migration, we start by syncing the majority of the data across, which requires a minimum of free disk space equivalent to the current size of the database. The following example is relevant to Kolla-Ansible based deployments:
docker exec -it -u root monasca_persister bash
source /var/lib/kolla/venv/bin/activate
pip install -U monasca-persister
exit
docker exec -it -u root monasca_persister python /var/lib/kolla/venv/lib/python2.7/site-packages/monasca_persister/tools/influxdb/db-per-tenant/migrate-to-db-per-tenant.py \
--config-file /etc/monasca/persister.conf \
--migrate-retention-policy project_1:2,project_2:12,project_3:52 \
--migrate-skip-regex ^log\\..+ \
--migrate-time-unit w \
--migrate-start-time-offset 0 \
--migrate-end-time-offset 52
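Since the follow-up re-sync window depends on how long this initial phase takes, it can help to record the wall-clock time around the command above, e.g. (a minimal sketch; the actual migration invocation is elided):

```shell
# Record how long the initial migration takes; the elapsed time determines
# how far back the follow-up re-sync needs to reach.
start=$(date +%s)
# ... run the migrate-to-db-per-tenant.py command shown above ...
end=$(date +%s)
echo "initial migration took $(( (end - start) / 3600 )) hours"
```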
The initial migration is likely to take some time, depending on the amount of data being migrated and the type of disk under the hood. While this is happening, the monasca_persister container is inserting new metrics into the original database, which will need re-syncing once the initial migration is complete. Take note of how long this phase of the migration takes, as this determines the portion of the database that will need to be re-migrated. You will be able to see that a new database with a project-specific retention policy of 2w has been created as follows for project_1:
docker exec -it influxdb influx -host 192.168.7.1 -database monasca_project_1 -execute "SHOW RETENTION POLICIES"
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
2w 336h0m0s 24h0m0s 1 true
Once the initial migration is complete, stop the monasca_persister container and confirm that it has stopped. For deployments with multiple controllers, you will need to ensure this is the case on all nodes.
docker stop monasca_persister
docker ps | grep monasca_persister
Once the persister has stopped, nothing new is being written to the original database, while any new entries are buffered on Kafka topics. It is a good idea to back up the database at this point, for which InfluxDB provides a handy command line interface:
docker exec -it influxdb influxd backup -portable /var/lib/influxdb/backup
Upgrade the Monasca containers to the OpenStack Train release with the database-per-tenancy features. For example, Kayobe/Kolla-Ansible users can run the following Kayobe CLI command, which also ensures that the new versions of the monasca_persister containers are back up and running on all the controllers, writing entries to a database per tenant:
kayobe overcloud service reconfigure -kt monasca
Populate the new databases with the missing database entries (the minimum is 1 unit of time). InfluxDB automatically prevents duplicate entries, so it is not a problem if there is an overlap in the migration window. In the following command, we assume that the original migration took less than a week to complete and therefore set --migrate-end-time-offset to 1:
docker exec -it -u root monasca_persister python /var/lib/kolla/venv/lib/python2.7/site-packages/monasca_persister/tools/influxdb/db-per-tenant/migrate-to-db-per-tenant.py \
--config-file /etc/monasca/persister.conf \
--migrate-retention-policy project_1:2,project_2:12,project_3:52 \
--migrate-skip-regex ^log\\..+ \
--migrate-time-unit w \
--migrate-start-time-offset 0 \
--migrate-end-time-offset 1
Acknowledgements
This development work was generously funded by Verne Global who are already using the optimised capabilities to provide enhanced services for hpcDIRECT users.
Contact Us
If you would like to get in touch we would love to hear from you. Reach out to us on Twitter or directly via our contact page.