Internet-Draft Routing Area Working Group July 2023
Xu & Yao Expires 11 January 2024
Workgroup:
Routing Area Working Group
Internet-Draft:
draft-xu-rtgwg-topo-aware-collective-with-inc-00
Published:
Intended Status:
Informational
Expires:
11 January 2024
Authors:
S. Xu
China Mobile
K. Yao
China Mobile

Topology-aware Collective Communication in In-Network Computing Enabled Network: Problem Statement and Requirements

Abstract

This document analyses the mapping between the logical and physical topologies of collective communication in In-Network Computing (INC) enabled networks, as well as the impact of topology-aware collective communication algorithms on INC enabled large-scale computing clusters. It also proposes requirements for designing an efficient mapping mechanism between the logical and physical topologies and for designing topology-aware collective communication algorithms.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 11 January 2024.

1. Introduction

Large-scale supercomputing systems have grown significantly in recent years. At the heart of these systems are compute nodes based on modern multi-core architectures and high-speed networks. These systems offer vast computing power and resources to application developers and allow scientific applications to scale out to tens of thousands of processes.

These processes rely on the Message Passing Interface (MPI) to exchange information and carry out parallel computation. The hardware network is a physical network, while the communication among processes, which is independent of hardware devices, is abstracted as a logical network. An important aspect of communication in parallel computing is a sound mapping between the logical network and the physical network. When INC is introduced, network hardware can also take part in collective communication, which in turn affects the overall communication model. Therefore, in INC enabled large-scale clusters, the mapping rules need to be adjusted accordingly.

In large-scale clusters, network contention can significantly degrade application performance when the allocated processors are scattered across different racks. It is therefore critical to discover the topology of such clusters and to design collective message exchange algorithms that are aware of that topology, in order to improve the overall performance of real-world applications. Once INC is introduced, topology discovery should not be limited to factors such as network structure and bandwidth, but should also consider factors such as INC capabilities and computational load.

2. Conventions Used in This Document

2.1. Terminology

INC In-Network Computing

MPI Message Passing Interface

2.2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Problem Statement

In traditional mode, computing tasks are completed by the computers and servers in the cluster. After INC is enabled, some of these computing tasks are offloaded to network devices. As a result, for the same MPI primitive, the communicating entities in the logical topology can be mapped not only to computers but also to network devices. In addition, an INC-based implementation of certain MPI primitives may have a different logical topology from the traditional implementation. Current topology mapping mechanisms do not take either of these changes into account.
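
The change in mapping described above can be illustrated with a minimal Python sketch. All names in it (PhysicalNode, build_mapping, logical_edges, and the logical entity "agg") are hypothetical and are not defined by this document or by MPI. The ring and star shapes are only two illustrative choices of logical topology for AllReduce.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    HOST = "host"
    SWITCH = "switch"


@dataclass(frozen=True)
class PhysicalNode:
    """A device in the physical network (hypothetical model)."""
    name: str
    kind: Kind
    inc_capable: bool = False


def build_mapping(ranks, hosts, aggregator=None):
    """Map logical entities to physical nodes.

    Ranks are mapped one-to-one onto hosts. If an INC-capable switch is
    supplied, an extra logical entity "agg" is mapped onto that switch.
    """
    if len(ranks) != len(hosts):
        raise ValueError("need exactly one host per rank")
    mapping = dict(zip(ranks, hosts))
    if aggregator is not None:
        if aggregator.kind is not Kind.SWITCH or not aggregator.inc_capable:
            raise ValueError("aggregator must be an INC-capable switch")
        mapping["agg"] = aggregator
    return mapping


def logical_edges(ranks, use_inc):
    """Logical AllReduce topology: a ring of ranks without INC,
    or a star centred on the "agg" entity with INC."""
    if use_inc:
        return [(r, "agg") for r in ranks]
    return [(ranks[i], ranks[(i + 1) % len(ranks)]) for i in range(len(ranks))]


ranks = [0, 1, 2, 3]
hosts = [PhysicalNode(f"h{i}", Kind.HOST) for i in ranks]
sw0 = PhysicalNode("sw0", Kind.SWITCH, inc_capable=True)

traditional = build_mapping(ranks, hosts)   # logical entities map to hosts only
with_inc = build_mapping(ranks, hosts, sw0)  # "agg" maps to a network device
```

In this sketch the same primitive yields two different logical topologies, and only the INC variant contains a logical entity whose physical counterpart is a switch.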

How to use topology-aware algorithms to improve the communication performance of MPI primitives and reduce communication cost in large-scale clusters is an active research topic. [TopoIB] presents efficient topology-aware algorithms for two collective communication primitives, Scatter and Gather, and proposes a communication model for analysing communication overhead in large-scale clusters. [Themis] proposes a new scheduling mechanism and topology-aware algorithm aimed at improving network bandwidth utilization, and reports that the network bandwidth utilization of a single AllReduce operation can be improved by a factor of 1.72. When INC is enabled, however, these topology detection algorithms should not be limited to network characteristics such as bandwidth and communication overhead; they should also consider the computing and processing capabilities of the network devices themselves.
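
To make the last point concrete, the following Python sketch compares two simplified cost estimates. The ring estimate uses the widely used alpha-beta model (per-message latency alpha, per-byte transfer time beta). The INC estimate is an assumed, non-pipelined single-switch model introduced here purely for illustration; the parameters reduce_rate and load are hypothetical and do not come from [TopoIB] or [Themis].

```python
def ring_allreduce_cost(p, n, alpha, beta):
    """Host-only ring AllReduce under the alpha-beta model.

    The ring takes 2 * (p - 1) steps, each moving n / p bytes.
    Reduction compute time on the hosts is ignored in this sketch.
    """
    if p < 2:
        return 0.0
    return 2 * (p - 1) * (alpha + (n / p) * beta)


def inc_allreduce_cost(n, alpha, beta, reduce_rate, load):
    """Single-switch INC AllReduce (assumed model, no pipelining).

    Hosts send n bytes up, the switch reduces at reduce_rate bytes/s
    scaled down by its current load in [0, 1), then sends n bytes back.
    """
    if not 0.0 <= load < 1.0:
        raise ValueError("load must be in [0, 1)")
    transfer = 2 * (alpha + n * beta)
    compute = n / (reduce_rate * (1.0 - load))
    return transfer + compute


# Illustrative parameters only: 8 hosts, 1 MB message,
# 5 microseconds latency, 1 ns per byte.
ring = ring_allreduce_cost(p=8, n=1e6, alpha=5e-6, beta=1e-9)
inc_idle = inc_allreduce_cost(n=1e6, alpha=5e-6, beta=1e-9,
                              reduce_rate=1e10, load=0.0)
inc_busy = inc_allreduce_cost(n=1e6, alpha=5e-6, beta=1e-9,
                              reduce_rate=1e10, load=0.5)
```

The ring estimate depends only on link parameters, whereas the INC estimate also depends on the switch's processing rate and current load, so a bandwidth-only view of the topology cannot rank the two options reliably.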

Hence, several problems are raised:

* How can the entities of the logical communication topology be properly mapped to the entities of the INC enabled physical network?

* How will enabling INC change the logical network topologies of MPI primitives and what challenges will it bring?

* How do we efficiently discover the topology of an INC enabled large scale cluster?

* What are the challenges involved in designing efficient collective algorithms that are aware of the INC enabled network topology?

4. Requirements

The topology mapping algorithm between the logical and physical networks in INC enabled large-scale clusters, as well as the topology-aware collective communication algorithms used to improve cluster communication, need to meet the following requirements:

* Logical communication entities in INC enabled large-scale clusters MUST support mapping both to computing nodes and to network devices in the physical network.

* After INC is introduced, the logical communication pattern may change. An MPI primitive such as AllReduce may correspond to one or more logical topologies that support INC. In terms of computation results, however, an implementation over a logical topology that supports INC MUST be equivalent to the traditional implementation.

* Topology detection algorithms in INC enabled large-scale clusters need to consider not only network factors such as communication overhead and path bandwidth, but also the INC capability (for example, SINC [I-D.lou-rtgwg-sinc]) and computational load of network devices.

* The topology-aware collective communication algorithm SHOULD consider the network path load as well as the impact of background traffic on cluster communication performance in INC enabled large-scale clusters.

* A reasonable evaluation model for INC enabled large-scale clusters is REQUIRED, taking into account factors such as the connectivity status and computing capabilities of network devices.

* The topology mapping algorithm and the topology detection algorithm SHOULD support a fallback mechanism that, after an INC failure, remaps the logical network to the traditional mode and performs path detection again.
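
Three of the requirements above (load-aware selection of an INC device, result equivalence, and fallback) can be sketched together in Python. This is a minimal illustration under stated assumptions, not a proposed mechanism: the switch record fields ("name", "up", "inc_capable", "load"), the max_load threshold, and all function names are hypothetical. Integer inputs are used so that the order of additions cannot affect the result; with floating-point data, different reduction orders may give slightly different sums.

```python
def select_aggregator(switches, max_load=0.8):
    """Return the name of the least-loaded usable INC switch, or None."""
    candidates = [s for s in switches
                  if s["up"] and s["inc_capable"] and s["load"] <= max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["load"])["name"]


def host_allreduce(vectors):
    """Traditional mode: a partial sum is accumulated host by host,
    then every host receives a copy. Assumes at least one host."""
    acc = list(vectors[0])
    for vec in vectors[1:]:
        acc = [a + b for a, b in zip(acc, vec)]
    return [list(acc) for _ in vectors]


def inc_allreduce(vectors):
    """INC mode: the switch sums each element across all hosts,
    then every host receives a copy."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def allreduce(vectors, switches):
    """Use INC when a suitable switch exists, else fall back."""
    agg = select_aggregator(switches)
    if agg is None:
        return "traditional", host_allreduce(vectors)
    return "inc:" + agg, inc_allreduce(vectors)


vectors = [[1, 2], [3, 4], [5, 6]]
switches = [
    {"name": "sw0", "up": True, "inc_capable": True, "load": 0.6},
    {"name": "sw1", "up": True, "inc_capable": True, "load": 0.2},
]

mode, result = allreduce(vectors, switches)

# Simulate an INC failure: no switch is usable, so the traditional
# mode is selected and must produce the same result.
failed = [dict(s, up=False) for s in switches]
fallback_mode, fallback_result = allreduce(vectors, failed)
```

Here the less loaded switch is chosen while INC is available, and after the simulated failure the traditional path returns the same per-host result, which is the equivalence the second requirement asks for.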

5. Security Considerations

TBD.

6. IANA Considerations

TBD.

7. Informative References

[I-D.lou-rtgwg-sinc]
Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao, "Signaling In-Network Computing operations (SINC)", Work in Progress, Internet-Draft, draft-lou-rtgwg-sinc-00, <https://datatracker.ietf.org/doc/html/draft-lou-rtgwg-sinc-00>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[Themis]
Rashidi, S., Won, W., Srinivasan, S., et al., "Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models", <https://doi.org/10.48550/arXiv.2110.04478>.
[TopoIB]
Kandalla, K. C., Subramoni, H., Vishnu, A., et al., "Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather", <https://doi.org/10.1109/IPDPSW.2010.5470853>.

Authors' Addresses

Shiping Xu
China Mobile
Beijing
100053
China
Kehan Yao
China Mobile
Beijing
100053
China