Revolutionize High-Performance Computing with RDMA

To address the efficiency challenges of rapidly growing data storage and retrieval within data centers, the use of Ethernet-converged distributed storage networks is becoming increasingly popular. However, in storage networks where data flows are mainly characterized by large flows, packet loss caused by congestion will reduce data transmission efficiency and aggravate congestion. In order to solve this series of problems, RDMA technology emerged as the times require.

What is RDMA?

RDMA (Remote Direct Memory Access) is an advanced technology designed to reduce the latency associated with server-side data processing during network transfers. Allowing user-level applications to directly read from and write to remote memory without involving the CPU in multiple memory copies, RDMA bypasses the kernel and writes data directly to the network card. This achieves high throughput, ultra-low latency, and minimal CPU overhead. Presently, RDMA’s transport protocol over Ethernet is RoCEv2 (RDMA over Converged Ethernet v2). RoCEv2, a connectionless protocol based on UDP (User Datagram Protocol), is faster and consumes fewer CPU resources compared to the connection-oriented TCP (Transmission Control Protocol).

Building Lossless Network with RDMA

The RDMA networks achieve lossless transmission through the deployment of PFC and ECN functionalities. PFC technology controls RDMA-specific queue traffic on the link, applying backpressure to upstream devices during congestion at the switch’s ingress port. With ECN technology, end-to-end congestion control is achieved by marking packets during congestion at the egress port, prompting the sending end to reduce its transmission rate.

Optimal network performance is achieved by adjusting buffer thresholds for ECN and PFC, ensuring faster triggering of ECN than PFC. This allows the network to maintain full-speed data forwarding while actively reducing the server’s transmission rate to address congestion.

Accelerating Cluster Performance with GPU Direct-RDMA

The traditional TCP network heavily relies on CPU processing for packet management, often struggling to fully utilize available bandwidth. Therefore, in AI environments, RDMA has become an indispensable network transfer technology, particularly during large-scale cluster training. It surpasses high-performance network transfers in user space data stored in CPU memory and contributes to GPU transfers within GPU clusters across multiple servers. And the Direct-RDMA technology is a key component in optimizing HPC/AI performance, and NVIDIA enhances the performance of GPU clusters by supporting the function of GPU Direct-RDMA.

Streamlining RDMA Product Selection

In building high-performance RDMA networks, essential elements like RDMA adapters and powerful servers are necessary, but success also hinges on critical components such as high-speed optical modules, switches, and optical cables. As a leading provider of high-speed data transmission solutions, FS offers a diverse range of top-quality products, including high-performance switches, 200/400/800G optical modules, smart network cards, and more. These are precisely designed to meet the stringent requirements of low-latency and high-speed data transmission.

This entry was posted in Network Switch and tagged , , . Bookmark the permalink.

Leave a Reply