A Performance Study of UCX over InfiniBand

Nikela Papadopoulou, National Technical University of Athens
Lena Oden, Forschungszentrum Jülich
Pavan Balaji, Argonne National Laboratory
02/04/2020 CCGRID17

Motivation
  UCX is a new open-source communication middleware
    Used by MPICH (developed at ANL), OpenMPI, OpenSHMEM
    Has a two-level API design in an attempt to meet software demands for HPC systems
    Offers portability and programming ease
  Communication software demands on HPC systems
    Scalability: minimal memory requirements
    Performance: minimal instruction counts and cache activity
    Programming ease: high-level abstractions
    Portability: support for multiple communication devices

Outline
  UCX design
  UCX as middleware over InfiniBand
  Overheads and optimizations in UCP RDMA functions
  Overheads and optimizations in UCP progress functions
  Evaluation of optimizations in UCX
  Discussion and outlook

UCX as communication middleware
  High-level communication software: MPI, OpenSHMEM, UPC, ...
  Communication middleware: UCX, Portals, OFI, ...
  Low-level communication software: IB Verbs, TCP, Cray uGNI, ...
  Communication devices: InfiniBand, Ethernet, Cray Aries, ...

UCX design
  UCP, high-level API (protocols)
    Transport selection, wireup, fragmentation, and software protocols for operations that are not (necessarily) supported by hardware
    Initialization/wireup, tag matching, RMA, atomics
  UCT, low-level API (transports)
    Hardware-supported operations: Active Messages, RMA, Atomics
    Transports: IB Verbs, TCP, NVIDIA CUDA, Cray uGNI, shared memory (SYSV, POSIX, KNEM, CMA)
  UCS (services)
    Memory management, data structures

InfiniBand support in UCX
  [Figure: two software stacks, UCX over InfiniBand Verbs and UCX with Accelerated Verbs. Both show UCX (UCP, UCT) and the user-level Verbs API and HCA driver (MLX4/MLX5 for Verbs, MLX5 for Accelerated Verbs) above the kernel-level Verbs in the OS and the InfiniBand HCA in hardware.]
  UCT with Accelerated Verbs (AVerbs) is available for the Mellanox MLX5 driver

UCX as middleware over InfiniBand

  Functionality
    UCP: high-level abstractions
    UCT: generic API for native functions
    Verbs: native functions
  Portability
    UCP: any hardware, multiple devices
    UCT: any hardware with an API implementation, single device
    Verbs: InfiniBand only
  Performance
    UCP: overhead due to additional APIs and high-level abstractions
    UCT: overhead due to additional API
    Verbs: closest to hardware
  Scalability
    UCP: extra memory requirements due to additional API and high-level abstractions
    UCT: extra memory requirements due to additional API
    Verbs: memory requirements relative to transport

Measuring overheads of UCP over UCT over (A)Verbs
  Comparison of UCP, UCT and Verbs/AVerbs: latency and instructions
  Benchmark: RDMA write with remote completion
  Transport: InfiniBand RC/xRC

    status = ucp_put(ep, buffer, size, remote_address, remote_key); // Put (RDMA write operation)
    if (status != UCS_OK) exit(ERROR);
    ucp_ep_flush(ep); // Flush operation for remote completion

  UCP protocol selection over RC (xRC)
    size <= 98B (size <= 220B): short protocol
    98B < size <= 8KB (220B < size <= 8KB): buffered copy protocol
    size > 8KB: fragmented buffered copy protocol
  Identical protocol selection in our UCT and Verbs/AVerbs benchmarks

Experimental Setup

  Hardware (JLSE resources)
    Two Intel Xeon E5-2699 CPUs
    Mellanox InfiniBand EDR HCA ConnectX-4
    Mellanox FDR-10 switch
  Software
    UCX v1.0 (7/13/2016), configured to use InfiniBand only
    Intel C++ compiler 16.0.0: -O3 optimization, interprocedural optimization, inlining, AVX/SSE4.2
    Intel Software Development Emulator v7.48

RDMA Write latency over InfiniBand
  [Figure: RDMA write latency for InfiniBand RC and InfiniBand xRC (accelerated); lower is better]
  UCP is slower than UCT, and UCT is slower than Verbs/AVerbs
    Small UCT-to-Verbs overhead
    Larger UCP-to-UCT overhead, especially for short messages with RC

RDMA Write instructions over InfiniBand
  Instruction breakdown in UCP for Put and Flush
  UCP and UCT consume more instructions on Flush for short messages
  Verbs/AVerbs consume more instructions on Put for short messages
  From Verbs/AVerbs to UCT to UCP:
    Put instructions increase slightly
    Flush instructions increase significantly
  UCX adds overheads over the native network functions

Objectives
  Analyze UCP core operations
  Identify unnecessary overheads in UCP
  Close the performance gap between UCP, UCT, and native InfiniBand
  Without sacrificing any of the UCP functionality or portability!

RDMA operations in UCP
  Blocking RDMA functions: ucp_put, ucp_get
    Select UCT interface
    Select transfer protocol (short, bcopy, fragmented bcopy)
    Call UCT function

  Non-blocking RDMA functions: ucp_put_nbi, ucp_get_nbi
    Select UCT interface
    Select transfer protocol
    Call UCT function or push communication request
    Requests are created when no resources are available or for long messages
    Overhead is added by request creation and queue insertion

Analyzing UCP overheads: ucp_put over InfiniBand RC/xRC
  [Figure panels: 8B RC, 8B xRC, 1KB RC, 1KB xRC]
  ucp_put calls:
    UCT function for communication
    (A)Verbs function
    Request creation
    Bookkeeping
    Function calls (15-20): function pointers!
    Checks to select transfer (6)
    Memory copies
    RESOLVE_RKEY_RMA (35)
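The "checks to select transfer" step can be pictured as a size-threshold lookup. The block below is a minimal sketch, not UCX code: the enum, struct, and function names are hypothetical, the 98B and 220B limits are the RC and xRC values quoted for the benchmark, and 8KB is assumed to mean 8192 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; these are not UCX identifiers. */
typedef enum {
    PROTO_SHORT,            /* short protocol */
    PROTO_BCOPY,            /* buffered copy protocol */
    PROTO_FRAGMENTED_BCOPY  /* fragmented buffered copy protocol */
} put_proto_t;

typedef struct {
    size_t max_short;  /* largest size handled by the short protocol */
    size_t max_bcopy;  /* largest size handled by a single buffered copy */
} put_thresholds_t;

/* Limits quoted for the benchmark; 8KB is assumed to be 8192 bytes. */
static const put_thresholds_t RC_THRESHOLDS  = {  98, 8192 };
static const put_thresholds_t XRC_THRESHOLDS = { 220, 8192 };

/* Pick the transfer protocol for a message of `size` bytes. */
static put_proto_t select_put_proto(size_t size, const put_thresholds_t *t)
{
    assert(t != NULL && t->max_short <= t->max_bcopy);
    if (size <= t->max_short) {
        return PROTO_SHORT;
    }
    if (size <= t->max_bcopy) {
        return PROTO_BCOPY;
    }
    return PROTO_FRAGMENTED_BCOPY;
}
```

With these limits, a 99-byte message takes the buffered copy path over RC but still fits the short protocol over xRC, which is one reason the two transports show different instruction counts for small sizes.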

Background: UCX communication context
  [Figure: the ucp_context and its components at the UCP, UCT, and system levels]
  UCP: ucp_context (context), ucp_key
  UCT interfaces: uct_mm_posix_iface, uct_ib_ud_iface, uct_ib_rc_iface
  UCT memory domains: uct_mm_md with uct_mm_key, uct_ib_md with uct_ib_key (registered memory)
  Transports and devices: POSIX (shared memory), UD and RC (InfiniBand)

The RESOLVE_RKEY_RMA function
  [Figure: a ucp_worker with a ucp_ep (lanes: RMA, AMO, AM, Wireup) on top of a uct_worker with the endpoints uct_ep_mm_posix, uct_ep_ib_ud, and uct_ep_ib_rc over the interfaces MM/POSIX, IB/UD, and IB/RC]
  UCP selects the fastest available interface for RMA
    RC/xRC for InfiniBand
  RESOLVE_RKEY_RMA performs UCP-to-UCT translations for:
    endpoint
    remote memory key
    transport configuration
  UCP-to-UCT remote key resolution occurs with every RDMA operation

Optimizing RDMA functions in UCP
  Objective: reduce the overhead of RESOLVE_RKEY_RMA
  Optimization: perform UCP-to-UCT translations within the ucp_rkey_unpack function
    Called at connection establishment
    Out of the critical path for communication
    Necessary information stored within the UCP remote key data structure
  More details in the paper

Optimizing RDMA functions in UCP: results
  [Figure: lower is better]
  27 of 35 instructions of the RESOLVE_RKEY_RMA function eliminated
  Small impact on latency (<1%)
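The idea behind the RESOLVE_RKEY_RMA optimization is to resolve the UCP-to-UCT mapping once, when the remote key is unpacked, and reuse the cached result on every put. The following is a toy model under stated assumptions: every type and function name is hypothetical and only mimics the roles of the UCP endpoint, the UCT endpoints, and the remote key; it is not how UCX lays out its structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_LANES 4  /* hypothetical upper bound on UCT endpoints per UCP endpoint */

/* Hypothetical stand-ins for UCT and UCP objects. */
typedef struct { int id; } toy_uct_ep_t;

typedef struct {
    toy_uct_ep_t lanes[TOY_MAX_LANES];
    int          rma_lane;  /* index of the interface selected for RMA */
} toy_ucp_ep_t;

typedef struct {
    uint64_t      keys[TOY_MAX_LANES];  /* one transport-level key per lane */
    toy_uct_ep_t *cached_uct_ep;        /* filled in once at unpack time */
    uint64_t      cached_uct_rkey;
} toy_ucp_rkey_t;

static unsigned long resolve_calls;  /* counts how often the translation runs */

/* The UCP-to-UCT translation: pick the UCT endpoint and transport key for RMA. */
static void toy_resolve_rma(toy_ucp_ep_t *ep, const toy_ucp_rkey_t *rkey,
                            toy_uct_ep_t **uct_ep, uint64_t *uct_rkey)
{
    ++resolve_calls;
    *uct_ep   = &ep->lanes[ep->rma_lane];
    *uct_rkey = rkey->keys[ep->rma_lane];
}

/* Baseline: the translation runs inside every put. */
static uint64_t toy_put_baseline(toy_ucp_ep_t *ep, const toy_ucp_rkey_t *rkey)
{
    toy_uct_ep_t *uct_ep;
    uint64_t      uct_rkey;
    toy_resolve_rma(ep, rkey, &uct_ep, &uct_rkey);
    return uct_rkey;  /* a real put would pass both values to the transport */
}

/* Optimized: translate once when the remote key is unpacked. */
static void toy_rkey_unpack(toy_ucp_ep_t *ep, toy_ucp_rkey_t *rkey)
{
    toy_resolve_rma(ep, rkey, &rkey->cached_uct_ep, &rkey->cached_uct_rkey);
}

/* Optimized put: only reads the cached translation. */
static uint64_t toy_put_cached(const toy_ucp_rkey_t *rkey)
{
    assert(rkey->cached_uct_ep != NULL);  /* unpack must have run first */
    return rkey->cached_uct_rkey;
}
```

In this model, 100 baseline puts run the translation 100 times, while 100 cached puts run it once, at unpack time. The trade-off is a slightly larger remote key structure.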

Progress/Flush operations in UCP
  Progress function: ucp_worker_progress
    Progresses all communication operations of the UCP/UCT worker
  Flush functions: ucp_worker_flush, ucp_ep_flush
    Complete all outstanding operations on the worker or the endpoint on the local side (through progress)
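The relation between flush and progress can be sketched as a loop: flush keeps calling the progress engine until nothing is outstanding. This is a toy model, not the UCX implementation. The worker struct, its fields, and the fixed number of completions per poll are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical worker model: a counter of posted, not yet completed operations. */
typedef struct {
    unsigned      outstanding;           /* operations awaiting local completion */
    unsigned      completions_per_poll;  /* toy stand-in for what one poll retires */
    unsigned long progress_calls;        /* how often the progress engine ran */
} toy_worker_t;

/* One pass of the progress engine; returns the number of completed operations. */
static unsigned toy_worker_progress(toy_worker_t *w)
{
    unsigned done = w->outstanding < w->completions_per_poll
                        ? w->outstanding
                        : w->completions_per_poll;
    w->outstanding -= done;
    w->progress_calls++;
    return done;
}

/* Flush: drive progress until every outstanding operation has completed. */
static void toy_worker_flush(toy_worker_t *w)
{
    /* Without completions the loop below could never terminate. */
    assert(w->outstanding == 0 || w->completions_per_poll > 0);
    while (w->outstanding > 0) {
        toy_worker_progress(w);
    }
}
```

Because every flush is a sequence of progress calls, any instruction saved inside the progress engine is saved once per iteration of this loop, which is why flush dominates the instruction count for short messages.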

Analyzing UCP overheads: ucp_ep_flush over InfiniBand RC/xRC
  Flush after 8B Put, RC
    The UD interface is progressed (not in UCT)
    The RX queue is polled (not in Verbs)
  Flush after 8B Put, xRC
    The xUD interface is progressed (not in UCT)
    The RX queue is polled (not in Verbs)

Optimizing the UCP progress engine (1)
  InfiniBand interface progress polls both the InfiniBand send (TX) and receive (RX) queues for requests
    RDMA operations create requests on the TX queue only
    The RX queue must be polled by the progress engine for active messages
  Objective: eliminate unnecessary polling
    TX queue polling can be avoided if no messages are sent over an interface
    Works for any interface
  Optimization: avoid unnecessary TX polling
    Condition: no send requests are pending
  More details in the paper

Optimizing the UCP progress engine (1): results
  [Figure: lower is better]
  (x)UD TX polling avoided!
  Significant improvement for RC
    12% instruction decrease for 8B messages
  Small impact on latency for xRC
    Polling is faster with AVerbs
    4.8% instruction decrease for 8B messages

Background: Connection establishment/Wireup in UCP

  [Figure: a ucp_worker with a ucp_ep and a ucp_ep_stub (lanes: RMA, AMO, AM, Wireup) on top of a uct_worker with the endpoints uct_ep_ib_ud and uct_ep_ib_rc over the interfaces IB/UD and IB/RC]

Optimizing the UCP progress engine (2)
  Progress on the worker progresses both the (x)RC and the (x)UD interface
    RDMA operations take place over (x)RC!
    The UD interface remains open for wireup
      Always used for wireup over InfiniBand
  Objective: eliminate unnecessary interface progress
    Upon UCP endpoint creation, a stub endpoint initiates the endpoint connection
    All created endpoints are connected after a few calls to the progress engine
      Scalability bug in UCX!
    If all UCP endpoints are connected, the (x)UD interface is no longer useful
  Optimization: eliminate (x)UD progress
    Condition: no endpoints are connected for the interface
  More details in the paper
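Both progress-engine optimizations amount to guard conditions in front of work that is usually idle. The sketch below is a toy model with hypothetical types and names. It reads the second condition as "no endpoint still needs this interface for wireup", which is an interpretation of the slide, and it assumes one TX poll retires one pending send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interface model; not a UCX structure. */
typedef struct {
    unsigned      pending_sends;        /* send requests on the TX queue */
    unsigned      endpoints_in_wireup;  /* endpoints still connecting through this interface */
    bool          carries_data;         /* true for the interface used by RMA traffic */
    unsigned long tx_polls;
    unsigned long rx_polls;
} toy_iface_t;

/* Which of the two optimizations are switched on. */
typedef struct {
    bool skip_idle_tx;            /* optimization (1) */
    bool skip_idle_wireup_iface;  /* optimization (2) */
} toy_opts_t;

static void toy_iface_progress(toy_iface_t *iface, toy_opts_t opts)
{
    /* (2) Skip a wireup-only interface once no endpoint is connecting through it. */
    if (opts.skip_idle_wireup_iface && !iface->carries_data &&
        iface->endpoints_in_wireup == 0) {
        return;
    }

    /* The RX queue is always polled, since active messages may arrive. */
    iface->rx_polls++;

    /* (1) Poll the TX queue only while send requests are pending. */
    if (!opts.skip_idle_tx || iface->pending_sends > 0) {
        iface->tx_polls++;
        if (iface->pending_sends > 0) {
            iface->pending_sends--;  /* toy assumption: one completion per poll */
        }
    }
}

/* Worker progress: progress every interface in turn. */
static void toy_progress_all(toy_iface_t *ifaces, size_t n, toy_opts_t opts)
{
    for (size_t i = 0; i < n; ++i) {
        toy_iface_progress(&ifaces[i], opts);
    }
}
```

With an RC-like interface holding one pending send and an idle UD-like interface, three unoptimized progress calls poll four queues each time. Enabling the first guard drops the idle TX polls, and enabling the second removes the UD-like interface from the loop entirely.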

Optimizing the UCP progress engine (2): results
  [Figure: lower is better]
  (x)UD progress avoided!
  Significant decrease in latency for RC
    29.8% instruction decrease for 8B messages
  Small decrease in latency for xRC
    Polling is faster with AVerbs
    26.6% instruction decrease for 8B messages

Evaluating optimizations on UCP RDMA Write/Read
  Bandwidth/message rate measurement (RC and xRC):
    Issue 256 non-blocking RDMA operations (Write/Put or Read/Get)
      TX queue length of 256
    Flush the endpoint
  Evaluation:
    1. UCP
    2. UCP with optimized RESOLVE_RKEY_RMA + optimized TX polling
    3. UCP with optimized RESOLVE_RKEY_RMA + optimized UD progress
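For a window of operations followed by one flush, message rate and bandwidth follow directly from the window size, the message size, and the elapsed time. The helper below is a minimal sketch with hypothetical names. It assumes the timer brackets both the 256 posts and the flush, which the slides do not state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: both metrics for one measured window. */
typedef struct {
    double msg_per_sec;    /* message rate */
    double bytes_per_sec;  /* bandwidth */
} toy_rate_t;

/*
 * window      number of operations issued before the flush (256 in the deck)
 * msg_size    bytes per operation
 * elapsed_sec time from the first post to the end of the flush (assumed)
 */
static toy_rate_t toy_rate(size_t window, size_t msg_size, double elapsed_sec)
{
    toy_rate_t r;
    assert(window > 0 && elapsed_sec > 0.0);
    r.msg_per_sec   = (double)window / elapsed_sec;
    r.bytes_per_sec = r.msg_per_sec * (double)msg_size;
    return r;
}
```

Small messages are limited by the message rate term, so they expose per-operation instruction overhead, while large messages are limited by the bandwidth term.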

Evaluation with UCP RDMA Write over RC
  [Figure: higher is better; peak message rate at 8B, peak bandwidth at 8KB]
  The Put optimization is good for medium-sized messages
    More instructions spent on Put than on Flush
    On bcopy, RESOLVE_RKEY_RMA is called twice
  Avoiding UD progress impacts the message rate for short messages

Evaluation with UCP RDMA Write over xRC
  [Figure: higher is better; peak message rate at 16B, peak bandwidth at 8KB]
  Small impact of the Put optimization
    More instructions spent on Flush than on Put
  Small impact of avoiding TX polling
    For short messages
    Fewer instructions over xRC
  High impact of avoiding UD progress
    Progress is called more often on Flush over xRC

Evaluation with UCP RDMA Read
  All optimizations have a smaller impact on UCP RDMA Read
    More time spent over the network
    Memory copies and cache effects are more important than instruction count
  Avoiding progress on the (x)UD interface has a small impact on performance (up to 3.5% for xRC)
    More instructions spent on Flush than on Get operations

Discussion and Outlook
  We can close the performance gap between UCP and UCT without sacrificing the benefits of the high-level abstractions of UCP
    We helped reduce some overhead
  Optimizations for RDMA functions
    Eliminate overheads of UCP-to-UCT translations
    Avoid function pointers from UCP to UCT
      Requires major redesign
  Optimizations for progress functions
    Avoid progress over interfaces that are not in use
    Poll queues for multiple requests at a time
  A scalability bug: Connection Establishment / Wireup
    Current UCP design forces all endpoints to connect, even if never used

Thank you! Questions?

Acknowledgement
  US Department of Energy, Office of Science
  Joint Laboratory for System Evaluation (JLSE), Argonne National Laboratory
  IKY fellowships of excellence for postgraduate studies in Greece, SIEMENS program
  PMRS group, MCS Division, Argonne National Laboratory
  Gail Pieper

Backup

UCX design: Communication Entities (Worker)
  [Figure: a UCP worker (ucp_worker) and a UCT worker (uct_worker), each with a progress engine]
  Worker
    Core communication entity
    Same semantics in UCP/UCT
  Progress engine
    to progress communication (progress)
    to order communication operations (fence)
    to complete communication operations (flush)

UCX design: Communication Entities (Endpoint)
  [Figure: two workers, each with an endpoint, joined by connection establishment]

UCX design: Communication Entities
  [Figure: two peers, each with a ucp_worker holding ucp_ep endpoints (UCP) on top of a uct_worker holding the endpoints uct_ep_mm_posix, uct_ep_ib_ud, and uct_ep_ib_rc (UCT)]

UCX design: Communication Primitives
  UCP
    Tag matching
    Remote Memory Access
    Atomic Memory Operations
    Single function for any message size
    Blocking and non-blocking operations
  UCT
    Active Messages
    Remote Memory Access
    Atomic Memory Operations
    Multiple functions for various transfer methods (short, buffered copy, zero-copy)
    Functions for each transport depending on the hardware capabilities
    Non-blocking operations

UCX design: Connection Establishment
  1. A worker creates a UCP endpoint
  2. The worker selects appropriate UCT interfaces
       For remote memory access (RMA)
       For atomic memory operations (AMO)
       For active messages (AM)
       For wireup
  3. If an interface is connectionless, the corresponding UCT endpoint is created and connected immediately
  4. If an interface is connection-oriented (P2P), the corresponding UCT endpoint is created but not connected
  5. If connection-oriented interfaces exist, UCP creates a stub endpoint
       The stub endpoint uses the UCT endpoint over the wireup interface to send wireup requests for all interfaces
       The stub endpoint is destroyed when all UCT endpoints are connected
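Steps 3 to 5 of connection establishment can be sketched as a small state machine. This is a toy model under stated assumptions: all types and functions are hypothetical, a "lane" stands for one selected UCT interface, and the arrival of a wireup reply is reduced to a single function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LANES 4  /* hypothetical upper bound on interfaces per endpoint */

/* Hypothetical model of one UCT endpoint inside a UCP endpoint. */
typedef struct {
    bool p2p;        /* connection-oriented interface, e.g. RC */
    bool connected;
} toy_lane_t;

typedef struct {
    toy_lane_t lanes[TOY_MAX_LANES];
    size_t     num_lanes;
    bool       has_stub;  /* stub endpoint exists while wireup is pending */
} toy_ep_t;

static bool toy_all_connected(const toy_ep_t *ep)
{
    for (size_t i = 0; i < ep->num_lanes; ++i) {
        if (!ep->lanes[i].connected) {
            return false;
        }
    }
    return true;
}

/* Steps 3-5: create the UCT endpoints and, if needed, the stub endpoint. */
static void toy_ep_create(toy_ep_t *ep, const bool *is_p2p, size_t n)
{
    assert(n <= TOY_MAX_LANES);
    ep->num_lanes = n;
    ep->has_stub  = false;
    for (size_t i = 0; i < n; ++i) {
        ep->lanes[i].p2p       = is_p2p[i];
        ep->lanes[i].connected = !is_p2p[i];  /* connectionless: connected at once */
        if (is_p2p[i]) {
            ep->has_stub = true;              /* wireup requests will be needed */
        }
    }
}

/* Toy stand-in for the arrival of a wireup reply for lane i. */
static void toy_wireup_reply(toy_ep_t *ep, size_t i)
{
    assert(i < ep->num_lanes);
    ep->lanes[i].connected = true;
    if (toy_all_connected(ep)) {
        ep->has_stub = false;  /* the stub is destroyed once everything is connected */
    }
}
```

An endpoint with only connectionless interfaces never gets a stub in this model. An endpoint with RC-like lanes keeps its stub, and therefore keeps the wireup interface busy, until the last reply arrives.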
