Published: Mon 24 May 2021
Updated: Mon 24 May 2021
By John Taylor
Steve Brasier
John Garbutt
In Networking.
tags: ethernet hpc
This is the second [first one here] of a trilogy of blogs on the use of
modern Ethernet stacks (sometimes referred to by the generic term High
Performance Ethernet) as a viable alternative to other interconnects such
as InfiniBand in HPC and AI. The reasons and motivations for this are many:
- Modern Ethernet stacks support RDMA and as such come close to the base
  latencies (critical to the performance of a bona fide HPC network) of
  alternative technologies, viz. of order 1µs.
- Convergence of the NIC hardware used for both Ethernet and InfiniBand
  means that the lag before higher line rate Ethernet technologies become
  available is not as great as in previous generations.
- Ethernet being an IEEE standard means that suppliers of innovative
  storage and new AI acceleration technologies choose Ethernet by default.
- In virtualised contexts, Ethernet's support for VLAN extensions
  significantly enhances security and multi-tenancy without a performance
  penalty, albeit IB pkeys can to some extent also be used.
If Ethernet can address these aspects, then the complexity of medium-sized
systems can be reduced, saving cost, time and risk, as only one data
transport is required alongside the control networks, which are always
Ethernet. I say medium-sized systems as there is still a question relating
to congestion management at large scale (see the previous blog), together
with the scalability of global reduction operations without hardware
offload.
For StackHPC, however, the standard nature of Ethernet has the potential
for some interesting side-effects unless care is taken to ensure the
configuration is optimised. This blog focuses on these aspects and draws
upon the team's many years of experience in SDN and HPC. We also assess
these aspects in a novel manner, in particular through extensive use of
hardware monitoring of both the application and the network stack. We note
here that we are only considering the bare metal case; for virtualised
compute, we point you to the following article on advances in performance
and functionality using SR-IOV.
Previously we compared and contrasted the performance of 25GbE and 100Gbps
EDR for a range of benchmarks on up to 8 nodes. Our experience of "medium"
scale HPC operations with customers over the past 5 years suggests that,
in a typical operational profile, 8 nodes represents a good median job
size, and in fact one that remains fairly constant as new node types
(potentially doubling the number of cores) are purchased or new nodes with
acceleration are introduced.
Of course, as this is a median, there may well exist power users who say
they need (or, probably more correctly, want) the capability of the whole
machine for a single application, such as a time-critical scientific
deadline. While such black swan events do occur, they are, ipso facto, not
common. Hence the focus here on medium-sized (or high-end) systems, the
largest tranche of the HPC market.
Thus in this article we extend the results, firstly up to and including a
single top-of-rack switch for both Ethernet and IB, and then across
multiple racks. Typically, subject to data centre constraints, this ranges
from 56 to 128 nodes (two to three racks' worth), or around the 1-2 MUSD
price tag (disclaimer here), probably representing 80% of the HPC/AI
market.
In doing this, however, we of course need to be cognisant of the following
caveats regarding the test equipment to hand. So, here we go.
a) The system under test (SUT) comprises multiple racks of 56 Cascade Lake
nodes, each with a dual-port Mellanox ConnectX-6 NIC configured as one
50GbE interface and one 100Gbps HDR interface. These interfaces are
connected to a Mellanox SN3700 Ethernet switch (running Cumulus Linux) and
a Mellanox HDR-200 IB switch respectively.
b) Each top-of-rack switch is connected in a leaf-spine topology with the
following over-subscription ratios: 14:1 for Ethernet and 2.3:1 for IB (a
short sketch of what these ratios mean in bandwidth terms follows this
list).
c) Priority Flow Control was configured on the Ethernet network, with the
MTU set to 9000.
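To make the ratios in b) concrete: the over-subscription ratio is simply a
leaf switch's aggregate downlink bandwidth divided by its aggregate uplink
bandwidth. The port counts and speeds in the sketch below are hypothetical,
chosen only to reproduce ratios of the same order, and are not the SUT's
actual uplink configuration.

```python
# Hypothetical leaf-switch port counts and speeds, used only to illustrate
# what an over-subscription ratio means; NOT the SUT's actual uplink layout.
def oversubscription(downlinks, down_gbps, uplinks, up_gbps):
    """Aggregate downlink bandwidth divided by aggregate uplink bandwidth."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# e.g. 56 x 50GbE node-facing ports with 2 x 100GbE uplinks -> 14:1
print(oversubscription(56, 50, 2, 100))    # 14.0

# e.g. 56 x 100Gbps HDR100 ports with 12 x 200Gbps HDR uplinks -> ~2.3:1
print(oversubscription(56, 100, 12, 200))  # ~2.33
```

The lower the ratio, the more of each node's bandwidth survives a hop
across the spine, which is why the bi-sectional limits matter once jobs
span racks.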
As an aside, there is an excellent overview of these networking aspects on
Down Under Geophysical's blog (I've tagged blog #3 here). The reader is
also pointed at the RoCE vs IB comparison here.
The system is built from the ground up using OpenStack and the OpenStack
Ironic bare-metal service, with an OpenHPC v2 Slurm Ansible playbook
applied to the base infrastructure to create the necessary platform.
Performance of the individual nodes is captured by a sophisticated
life-cycle management process, ensuring that components only enter service
once they have passed through a set of health and performance checks. More
on that to come in subsequent blogs.
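As a flavour of what driving the system through OpenStack looks like, the
sketch below uses the openstacksdk client to list the Ironic bare-metal
nodes and their provision states. The cloud name "sut" is a placeholder,
and this is illustrative only rather than the actual life-cycle tooling.

```python
# Minimal sketch: list Ironic bare-metal nodes and their provision state,
# the kind of check that gates whether a node is ready to enter service.
# Assumes openstacksdk is installed and a cloud named "sut" (placeholder)
# exists in clouds.yaml.
import openstack

conn = openstack.connect(cloud="sut")

for node in conn.baremetal.nodes():
    # 'active' nodes are deployed; anything else is still in the pipeline
    print(f"{node.name}: provision_state={node.provision_state}, "
          f"maintenance={node.is_maintenance}")
```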
The resulting system provides a cloud-native HPC environment with a great
deal of software-defined flexibility for the customer. For this analysis
it provided a convenient programmatic interface for manipulating switch
configurations and for easy deployment of a Prometheus/Redfish monitoring
stack covering the nodes and Ethernet switches within the system. This
proved an invaluable service in debugging the environment.
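For example, a quick ad-hoc query against that monitoring stack could look
like the sketch below. The Prometheus address and the choice of
node_exporter's CPU metric are illustrative assumptions rather than a
description of our exact dashboards.

```python
# Query Prometheus' HTTP API for the per-node fraction of CPU time spent in
# "system" mode - the kind of signal that later exposed the unintended
# dual-rail behaviour. The server URL below is a placeholder.
import requests

PROMETHEUS = "http://monitoring.example:9090"   # placeholder address
QUERY = (
    'sum by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m]))'
    ' / sum by (instance) (rate(node_cpu_seconds_total[5m]))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    instance = result["metric"]["instance"]
    fraction = float(result["value"][1])
    print(f"{instance}: {fraction:.1%} system CPU")
```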
Given these hardware characteristics, and depending (that dreadful verb
again) on the application we use, we would expect the inter-switch
performance to be governed by the bi-sectional ratios of b) above,
together with the ratio of additional inter-switch links (ISLs). As yet we
have not been able to find an application/data-set combination for which
there is a significant difference between IB and RoCE as we scale up the
node count. So here we again focus on using Linpack as the base test,
since we know that network performance, and in particular bi-sectional
bandwidth, is a strong influence on its scalability, and we should
therefore be able to expose this with the application.
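For anyone reproducing this kind of run, the usual rule of thumb for
sizing the HPL problem (general Linpack folklore rather than the exact
parameters used on this system) is to let the matrix fill most of the
available memory, as sketched below with a hypothetical per-node memory
figure.

```python
# Common HPL sizing rule of thumb: pick N so that the 8-byte-per-element
# matrix uses roughly 80% of total memory, then round down to a multiple of
# the block size NB. The memory figure below is hypothetical.
import math

def hpl_problem_size(total_mem_bytes, mem_fraction=0.8, nb=192):
    n = int(math.sqrt(mem_fraction * total_mem_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb                                    # align N to NB

# e.g. 56 nodes with (hypothetically) 192 GiB of RAM each
print(hpl_problem_size(56 * 192 * 2**30))                    # ~1.07 million
```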
N.B. The system as a whole reached #98 in the most recent (November 2020)
Top500 list; more details are available here, but suffice to say this
result was achieved with IB.
However, let's first start with the base latency and bandwidth for an MPI
PingPong.
Network        PingPong Latency (microseconds)   PingPong Bandwidth (MB/sec)
IB 100Gbps     1.09                              12069.36
RoCE 50Gbps    1.49                              5738.79
[sources: IB, RoCE]. We note here that the switch latency is included in
this measure for both IB and RoCE. It would be higher in the case of the
latter, and again is a factor to consider at scale.
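The table's figures come from the standard benchmark suites linked in the
sources above. For a quick sanity check of your own fabric, a minimal
mpi4py ping-pong sketch (assuming mpi4py is installed on top of the
MPI/UCX stack; this is not the benchmark used for the table) could look
like this:

```python
# ping_pong.py - crude MPI ping-pong latency/bandwidth sketch using mpi4py.
# Run with exactly 2 ranks, ideally placed on two different nodes.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly 2 ranks"

def pingpong(nbytes, reps):
    """Return the average round-trip time in seconds for nbytes messages."""
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    return (MPI.Wtime() - t0) / reps

rtt_small = pingpong(1, 1000)            # 1-byte messages for latency
rtt_large = pingpong(4 * 2**20, 100)     # 4 MiB messages for bandwidth
if rank == 0:
    print(f"latency  : {rtt_small / 2 * 1e6:.2f} us (one-way)")
    print(f"bandwidth: {4 * 2**20 / (rtt_large / 2) / 1e6:.0f} MB/s")
```

Launched with two ranks on two different nodes (for example srun -N2 -n2
python ping_pong.py), this gives a crude equivalent of the latency and
bandwidth columns above.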
Secondly, we now compare the performance within a single rack. The
final headline numbers are shown in the Table below, but how we got
there is an interesting journey.
Single Rack (56 nodes)

Network        GFLOPS (Linpack)
IB 100Gbps     1.34976e+05
RoCE 50Gbps    1.33943e+05
N.B. the RoCE number at 50Gbps is over 99% of the IB number.
The first step in the analysis was to make sure we could easily toggle
between the two interconnects. In previous tests we had used OpenMPI 3 and
the PMI layer within Slurm. For this SUT we decided to use the UCX
transport. This provides a much easier way to select the interconnect;
however, we did find an interesting side-effect when using this transport:
without a specific setting to select one of the mlx5_[0,1] interfaces, the
runtime assumes the system to be a dual-rail configuration, with a
perverse unintended consequence.
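The relevant knob is UCX's UCX_NET_DEVICES variable. The sketch below pins
a run to a single HCA port from a Python/mpi4py entry point; the device
names mirror the mlx5_0/mlx5_1 naming above, but which device maps to
which fabric is system-specific (check ibv_devices), and in practice the
variable is more commonly just exported in the Slurm batch script.

```python
# Pin UCX to a single HCA port before MPI (and hence UCX) is initialised,
# avoiding the implicit dual-rail behaviour described above.
# "mlx5_1:1" uses UCX's "device:port" syntax; which device is the RoCE port
# and which is the IB port is system-specific.
import os
os.environ.setdefault("UCX_NET_DEVICES", "mlx5_1:1")   # e.g. the 50GbE RoCE port
os.environ.setdefault("UCX_TLS", "rc,sm,self")         # optionally restrict transports

from mpi4py import MPI  # MPI_Init happens on import, after the environment is set

if MPI.COMM_WORLD.Get_rank() == 0:
    print("UCX_NET_DEVICES =", os.environ["UCX_NET_DEVICES"])
```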
Evidence of the behaviour came from observing node metrics, in particular
system CPU, combined with metrics from the Ethernet switch. The attached
screenshot shows this behaviour and proved a compelling diagnostic tool,
as multiple engineers were involved in the configuration and set-up.
Here we observe a system CPU pattern not seen in previous experiments,
where system CPU is flat and near zero and user CPU is flat at 100%. On a
different dashboard we were also monitoring Ethernet traffic through the
switch, as well as packets through the different mlx devices. This quickly
resolved the issue of unintended dual-rail behaviour.
Once we had resolved the single-rack performance we moved on to multi-rack
measurements where, armed with the appropriate dashboards, we could begin
to monitor the effects due to bi-sectional bandwidth.
The results are shown in the graph below. Between 1 and 2 racks the RoCE
performance remains within 99% of the IB performance, even taking into
account the reduced bi-sectional bandwidth.
We are building up a body of evidence on RoCE vs IB performance and will
be adding further information as more application results are gathered.
These data are being documented in the following GitHub repository and
will be detailed in subsequent blogs. Further work will also look at I/O
performance.
Of course, many stalwarts of HPC interconnects will remain in the
IB camp (and for good reason) but we still see many organisations
moving from IB to Ethernet in the middle of the HPC pyramid. We
also expect the number of options in the market to increase in the
next two years.
Acknowledgements
We would like to take this opportunity to thank members of the University
of Cambridge's University Information Services for their help and support
with this article.
Get in touch
If you would like to get in touch we would love to hear
from you. Reach out to us via Twitter
or directly via our contact page .