Aug 02, 2023

Building networks for AI/ML workloads

As AI continues to take higher attention both in and out of the technology world, this trend doesn’t skip the networking industry.

A notable example for the high level of attention this field is getting is the establishment of the Ultra Ethernet Consortium by some of the worlds giants of both compute and networking fields (simple homework for our readers, which relevant giant is missing from the consortium ?)

As AI training clusters continue to grow in multiple dimensions, so does the requirements from the underlying networking infrastructure.

In recent months we were involved in troubleshooting a large AI processor network fabric.
This fabric is used to connect AI training  chips by 100G and 400G links using  Arista 7060 switches.
We have learned a lot during this process, and in this post, we will be sharing some of what we learned.

AI/ML uses many AI processors (sometimes GPUs) in a cluster.
The training of AI models involves very large data sets.
To allow the cluster of GPUs to communicate with one another we have to create a high scale network that satisfies the performance required for AI/ML.
The scale of network is high from both number of nodes perspective as well as bandwidth per node.

The requirements of AI/ML come from the fact it uses RDMA   (Remote Direct Memory Access) where one node has direct access to the memory of another node bypassing the OS.
Historically RDMA networks used InfiniBand to create a lossless network where the client or server sends a packet it knows it will reach  the destination, but InfiniBand which is dominated by Nvidia is expensive and is one generation behind Ethernet in its speeds.
The alternative is to use the more common Ethernet fabrics. But native ethernet is not lossless, to overcome this  the industry created the RoCE protocol (RDMA over Converged Ethernet).  RoCEv1 uses Ethernet as the link layer protocol (Ethertype 0x8915) so It requires a single broadcast domain which limits its scale to small fabrics.
The newer RoCEv2 uses UDP (Port 4791) over IP for communication and allows the creation of very large fabrics. To achieve these requirements for lossless fabric RoCEv2 uses flow control (PFC and ECN).


Network topology  – CLOS Fabrics

The most common fabric architecture these days is CLOS. Where each leaf is connected to all spines to create non-blocking fabric. In AI/ML networks, it is important to create the fabric as non-oversubscribed which means that each leaf has the same amount of uplink as downlinks to reduce the chances of congestion.


For RoCEv2 to work it must have a lossless fabric which means none of the RoCEv2 packets will be lost on the network as a result of congestion as in case of a loss the whole process has to start from the beginning.

There are two main protocols to avoid congestion in the network by signaling the source of the traffic to pause the transmitting of data or by reducing its rate.


The first one is PFC (Priority Flow Control 802.1Qbb)  it is similar to the regular flow control available in regular Ethernet but adds per Ethernet priority (802.1P)/DSCP flow control to allow per priority/traffic class lossless communication. PFC works by sending a pause frame when the buffer of a switch is full the neighbor device halts sending traffic in the specified traffic class for the time specified in the pause frame. If the neighbor device is a switch , it causes its buffer to get full as well  and  as a result it sends a pause to the source of the traffic which stops sending traffic to the destination which frees up the buffer and so on.
To allow the switch to be a really lossless it has to save some headroom in its buffer for packets in flight until the neighbor device pauses the traffic.

Flow control is a very aggressive mechanism  as it pauses the traffic from multiple sources, it is also spreading the issue as if a spine sends pause to a leaf the leaf buffer is filling and creates more congestion.


ECN (Explicit Congestion Notification) is the two least bits in the traffic class of IPv4 and IPv6 packets (the other 6 bits are the DSCP).



When end points support ECN they mark these packets as 01 or 10 to signal they are “ECN Capable Transport”. When the buffer of the switch fills, but  before it experiences congestion and the switch uses RED (Random Early Detection) instead of dropping the packet it marks the ECN bits as 11. As a result, the destination RoCEv2 node sends CNP (Congestion Notification Packet) to the source of the traffic, and if the source supports it, it reduces the rate of the specific flow, without pausing the rest of the  traffic. The issue with ECN is that it is slow as the protocol needs full RTT until the source reacts to the CNP packet.



Future directions

There is a work in the IEEE to enhance the flow control mechanism to be faster and effect only the source of the flows for example


Modern switches are divided to two classes’: shallow buffers and deep buffers. Deep buffering allows more time for the transmission sources to adjust rates based on ECN/CNP and minimizes the chance to trigger PFC that could lead to head-of-line blocking but the result of that is higher latencies and more expensive switches.

Deep buffer switches use VoQ buffering which means that on each ingress port, there is a queue for each egress port of the switch and the switch transmits data from this VoQ only if it grants permission from the egress port to send the packets. The VoQ architecture prevents Head of Line blocking that happens when the output queue of a port is full

The buffer of each switch is divided into 3 categories

The buffering scheme of each networking ASIC is very important information to achieve lossless fabric.
As an example, it is important  to follow specific port ordering/numbering for downlinks and uplinks to optimize ASIC resources.
But in a lot of cases this information is not public. For example in the following Broadcom document, we can see “To optimize switch performance, distribute the port usage and bandwidth evenly across the eight pipelines and the two ingress traffic managers (ITM0 and ITM1). For additional information, refer to the MMU and Port Numbering sections in the BCM56980 Theory of Operations (56980-PG1xx).”  However, referenced document is not publicly available from Broadcom.
In some cases we had to work closely with our vendors to get the right information.

Load balancing

In ECMP there is a hash algorithm that calculates the output port from a list of possible next hops based on the packet 5 tuples (Source/Destination IP, Source Destination Port and Protocol). It works excellent for most types of traffic as there are a lot of flows but RoCEv2 traffic creates challenges for it as it doesn’t create a lot of flows and some of them are elephant flows (A single long and large flow that consume large amount of bandwidth from the port) so the results is that some ports from the leaf to the spines are over-utilized and some ports underutilized.

DLB (Dynamic Load Balancing)

DLB is a mechanism to overcome the ECMP limitation by continuously monitoring the utilization of each port in the micro-second timeframe and in case the port is over-utilized it routes new flows to a less utilized port

GLB (Global load balancing)

GLB will be introduced with the Broadcom Tomahawk 5 extend DLB as it allows the switch to take the downstream switch port utilization into consideration when it selects the next hop. Imagine where each one of three source  leafs’ connected with 100G interfaces to spines sends 40 Gbps flow of traffic to a single destination leaf and all of them select the same spine in that case the spine port to the destination leaf will be congested. To overcome this if the spine will report its port utilization to the leafs via GLB they will be able to avoid it and use other leafs.

Packet Spraying

To achieve the best load balancing you have to spray the traffic between all available ports and avoid the per-flow load balancing. Doing so at the packet level this can lead to out-of-order transmission of packets as each packet has a different size. To overcome this cell-based switch fabrics where the packets are broken into fixed size cells allow the leaf or line card to spray the packets without creating out-of-order transmission.

The traditional place for switch fabrics is within a single chassis but this lead to physical space issue and a limited number of ports. The recently announced Broadcom Jericho3-AI use external Ramon switch fabrics (like spines) to build huge fabrics of up to 32,000 GPUs, each with an 800Gbs interface in a single fabric.


What it means for the enterprise

The normal enterprise will not have AI/ML very soon but the same principals are also true for storage networks. These days we can build a backend Ethernet Fabric that replace the traditional Fiber-Channel fabric with an Ethernet IP fabric that support much faster links with lower costs.


Nitzan Tzelniker
Solutions Architect at Oasis Communication Technologies


Under Attack?
Broken Network System?

Leave your details below and we’ll get back to you shortly