It’s hard to read any business article today without seeing a mention of the expected impact of machine learning (ML). Potential benefits include improved diagnostic accuracy and personalized treatment plans in medicine, better customer service and chatbots in retail, intelligent tutoring systems in education, and enhanced risk management and fraud detection in finance. But while ML will drive business growth, improve efficiency, and unlock new opportunities, it also presents significant challenges for data centers. The immense computational demands of ML workloads necessitate significant investments in infrastructure to manage scale, power, and bandwidth requirements.
As machine learning continues to revolutionize industries, the networking demands of ML training are diverging from those of traditional data centers. Understanding these differences is crucial for optimizing performance and unlocking the full potential of ML.
As an example of the demands placed on a typical data center network, consider a Google search. While there are billions of searches every day, each search is asynchronous and requires only a small data transfer between servers and storage devices. Latency matters, but oversubscription on a network link delays only that one search request, and a small delay does not affect overall performance. Because billions of individual requests arrive at random, the cumulative traffic pattern is relatively steady and predictable, as shown in Figure 1.
Figure 1. In traditional data center networks, many asynchronous small bandwidth flows average out to a consistent load due to the random pattern of data requests. [1]
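This averaging effect is straightforward to reproduce. The short Python sketch below, using assumed flow counts and a nominal 0.5 MB transfer per request, shows how the relative variability of the aggregate load shrinks as more independent small flows share the network.

```python
import random
import statistics

random.seed(42)

TIME_SLOTS = 1_000       # observation windows
FLOW_SIZE_MB = 0.5       # assumed transfer size per request

def aggregate_load(num_flows: int) -> list[float]:
    """Total traffic per slot when each flow arrives in a random slot."""
    load = [0.0] * TIME_SLOTS
    for _ in range(num_flows):
        load[random.randrange(TIME_SLOTS)] += FLOW_SIZE_MB
    return load

for n in (1_000, 100_000, 1_000_000):
    load = aggregate_load(n)
    mean = statistics.mean(load)
    cv = statistics.stdev(load) / mean   # relative variability of the load
    print(f"{n:>9} flows: mean {mean:7.1f} MB/slot, coeff. of variation {cv:.3f}")
```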
In contrast, the challenge of machine learning networking stems from the synchronous, high-bandwidth data flows required for ML training. Training data set sizes are growing by an order of magnitude each year, and recent large language models such as GPT-4 have exceeded a trillion parameters. During training, large volumes of data are exchanged between GPUs at each computation iteration, creating synchronous, high-bandwidth flows. Because of the collective nature of this communication, ML training can also suffer from long-tail latency: each GPU depends on results computed by all the others, so the next iteration cannot start until every GPU has exchanged the required data. All of the expensive GPUs can therefore sit idle waiting for the final data exchange between the last pair of GPUs to complete. With large GPU clusters training on the latest data sets, GPUs can be idle over 50% of the time while waiting for data exchanges to finish.
Figure 2. In machine learning training, few synchronous high bandwidth flows lead to problems with long-tail latency and highlight bad load balancing decisions, leaving GPUs idle for significant time periods. [1]
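A toy simulation makes the effect concrete. The sketch below uses entirely assumed timing numbers (100 ms of compute per iteration, a 2% chance that any given exchange hits congestion) to show how a synchronous collective lets a single slow exchange stall an entire cluster.

```python
import random

random.seed(7)

NUM_GPUS = 256
COMPUTE_MS = 100.0     # assumed compute time per iteration
ITERATIONS = 1_000

def exchange_time_ms() -> float:
    """Per-GPU exchange time with a heavy tail (assumed distribution)."""
    base = random.uniform(20.0, 30.0)
    # Assume 2% of exchanges hit a congested link and slow down 10x.
    return base * 10 if random.random() < 0.02 else base

total_compute = total_exchange = 0.0
for _ in range(ITERATIONS):
    # Synchronous collective: the iteration ends only when the LAST
    # GPU pair finishes exchanging data.
    slowest = max(exchange_time_ms() for _ in range(NUM_GPUS))
    total_compute += COMPUTE_MS
    total_exchange += slowest

fraction = total_exchange / (total_compute + total_exchange)
print(f"{fraction:.0%} of wall-clock time spent waiting on the slowest exchange")
```

With these assumed numbers, nearly every iteration of a 256-GPU cluster contains at least one congested exchange, so the communication phase dominates wall-clock time.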
If the network is not optimized for ML training, data center performance will suffer from poor resource utilization, prolonged training times, and increased costs.
Reconfigurable Optical Networks and Optical Circuit Switches (OCSs)
Bandwidth limitations pose a significant challenge for ML workloads. Reconfigurable optical networks, built on optical circuit switches (OCSs), offer a promising solution to optimize network performance and consequently enhance overall ML efficiency. These networks provide several advantages, including:
Dynamic bandwidth allocation: Reconfigurable networks adapt to changing network demands, ensuring optimal resource utilization
Low latency and high throughput: OCSs provide fast, dedicated connections for ML training workloads, avoiding the traffic congestion that can occur in typical Clos networks
Reduced power consumption: When optical circuit switches replace electrical switches in the network, the power savings can exceed 40%
Scalability and flexibility: Reconfigurable networks and OCSs support evolving ML model requirements and can future-proof the network
Optical circuit switches employ various technologies to achieve their switching capabilities. Micro-Electro-Mechanical Systems (MEMS) based OCSs use tiny movable mirrors to steer optical signals, providing fast switching speeds. However, since the signal must exit the optical fiber, travel through free space to the mirrors, and then be coupled back into the output fiber, MEMS systems suffer from high insertion loss. It is also difficult to scale MEMS systems up to very large port counts.
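A simple link-budget calculation illustrates why this insertion loss matters. The numbers below are assumptions for illustration, not datasheet values; the point is how quickly optical margin erodes as lossy switch hops are added.

```python
# All loss and power values below are illustrative assumptions,
# not datasheet figures.
TX_POWER_DBM = 0.0           # assumed transmitter launch power
RX_SENSITIVITY_DBM = -10.0   # assumed receiver sensitivity

FIBER_AND_CONNECTOR_LOSS_DB = 2.0   # assumed cabling loss, end to end
MEMS_OCS_LOSS_DB = 2.5              # assumed free-space + coupling loss per hop

def margin_db(num_ocs_hops: int) -> float:
    """Optical margin left after fiber loss plus each MEMS OCS hop."""
    total_loss = FIBER_AND_CONNECTOR_LOSS_DB + num_ocs_hops * MEMS_OCS_LOSS_DB
    return TX_POWER_DBM - total_loss - RX_SENSITIVITY_DBM

for hops in range(1, 4):
    print(f"{hops} OCS hop(s): {margin_db(hops):+.1f} dB margin remaining")
```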
Google has been a pioneer in the use of MEMS OCSs in its data centers as part of its "Mission Apollo" initiative [2]. These OCS devices replaced traditional electronic switches, allowing for efficient traffic management, increased capacity, and improved overall performance, with a 30% reduction in capital expense and a 40% reduction in power consumption. While MEMS devices offer fast switching speeds, their low port count and the relatively high optical loss through the switch necessitated bi-directional optics, circulators, and improved FEC chips.
In contrast, robotic optical switches employ a more mechanical approach, physically moving optical fibers to create new connections. While slower than their MEMS counterparts, these switches often excel in reliability due to their secure, latched connections, significantly lower optical loss, and capacity to manage exceptionally high fiber counts. Robotic systems can handle over 10,000 fibers within a single rack by using various connector types such as MT and MPO. At such a scale, the dominant cost becomes the fiber cabling itself, which the network requires in any case. Consequently, incorporating reconfigurability through a robotic system incurs minimal additional expense while offering significant benefits in network operation and efficiency.
An example of a robotic fiber system is the Telescent OCS. It comprises short fiber links between ports and a robotic mechanism that relocates these ports as needed. The system scales to high port counts thanks to its patented routing algorithm, which enables fibers to be woven around one another. Initially designed for 1,008 simplex LC ports, the Telescent OCS has evolved to accommodate multiple fibers per port through MT-style connectors, exceeding 10,000 fibers per rack. The system has passed NEBS Level 3 certification and has been used in production networks. Both single-mode and multimode fiber have been deployed in the Telescent system, allowing use with lower-cost, short-reach multimode transmitters.
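From a software perspective, a robotic OCS can be modeled as a slowly changing port-to-port mapping that the robot physically latches into place. The sketch below is a hypothetical abstraction for illustration only; it is not Telescent's actual management API.

```python
from dataclasses import dataclass, field

@dataclass
class RoboticOCS:
    """Hypothetical model: a robotic cross-connect as a port-to-port map."""
    num_ports: int
    cross_connects: dict[int, int] = field(default_factory=dict)

    def connect(self, a: int, b: int) -> None:
        """Latch a duplex connection between two free ports."""
        for p in (a, b):
            if not 0 <= p < self.num_ports:
                raise ValueError(f"port {p} out of range")
            if p in self.cross_connects:
                raise ValueError(f"port {p} already connected")
        self.cross_connects[a] = b
        self.cross_connects[b] = a

    def disconnect(self, a: int) -> None:
        """Release both ends of an existing connection."""
        b = self.cross_connects.pop(a)
        del self.cross_connects[b]

ocs = RoboticOCS(num_ports=1008)   # port count of the original LC design
ocs.connect(0, 500)
print(ocs.cross_connects)          # {0: 500, 500: 0}
```

The key design implication is that reconfiguration takes minutes rather than milliseconds, which suits the slowly changing, predictable traffic patterns of ML training rather than per-packet switching.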
Unlocking Performance
As mentioned earlier, ML networks can suffer from long-tail latency while GPUs wait for a single GPU pair to finish exchanging data. However, since the bandwidth demand between GPUs is predictable and stable, determined by the parallelization strategy used in training, the network can be optimized by concentrating bandwidth between the GPU pairs that exchange the most data. Adjusting bandwidth allocation between GPU nodes to match demand can yield substantial efficiency gains in ML training. A collaborative effort involving MIT, Meta, and Telescent, using a robotic patch panel to optimize connectivity, achieved a remarkable 3.4x increase in training efficiency without incurring additional costs [3].
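The idea can be sketched in a few lines: given a matrix of expected traffic between GPU nodes, assign the limited number of OCS circuits to the hottest pairs first. This toy greedy version, with an invented demand matrix, only illustrates the concept; it is not the optimization algorithm used in [3].

```python
# Assumed traffic demand (GB per iteration) between four GPU nodes;
# the values are invented for illustration.
demand = {
    (0, 1): 40.0, (0, 2): 2.0, (0, 3): 1.0,
    (1, 2): 1.5,  (1, 3): 35.0, (2, 3): 30.0,
}
CIRCUITS_AVAILABLE = 3   # reconfigurable OCS links we can place
PORTS_PER_NODE = 2       # circuits each node can terminate

used = {node: 0 for node in range(4)}
placed = []
# Greedy: walk pairs from hottest to coldest, placing a direct circuit
# wherever both endpoints still have a free port.
for a, b in sorted(demand, key=demand.get, reverse=True):
    if len(placed) == CIRCUITS_AVAILABLE:
        break
    if used[a] < PORTS_PER_NODE and used[b] < PORTS_PER_NODE:
        placed.append((a, b))
        used[a] += 1
        used[b] += 1

covered = sum(demand[p] for p in placed) / sum(demand.values())
print(f"Circuits: {placed} carry {covered:.0%} of total traffic directly")
```

With this invented demand matrix, three circuits placed greedily carry roughly 96% of the traffic on direct links, sparing the hottest GPU pairs from multi-hop congestion.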
By embracing reconfigurable optical networks using optical circuit switches, hyperscale data center operators can:
Accelerate ML training: Reduce training times, increasing productivity and innovation
Improve resource utilization: Optimize network resources, reducing costs and environmental impact
Future-proof infrastructure: Adapt to evolving ML demands, ensuring long-term performance and competitiveness
In conclusion, the networking demands of machine learning training are distinct from those of traditional data centers. Reconfigurable optical networks and OCSs offer a powerful solution to address these differences, unlocking faster, more efficient, and more scalable ML performance.
[1] "Evolved Networking – the AI Challenge," Cisco.
[2] "Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale," arXiv:2208.10041.
[3] "TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs," arXiv:2202.00433.