A Performance Study of UCX over InfiniBand
Nikela Papadopoulou, National Technical University of Athens
Lena Oden, Forschungszentrum Jülich
Pavan Balaji, Argonne National Laboratory
02/04/2020 CCGRID17

Motivation
- UCX is a new open-source communication middleware
  - Used by MPICH (developed at ANL), OpenMPI, OpenSHMEM
  - Has a two-level API design in an attempt to meet software demands for HPC systems
  - Offers portability and programming ease
- Communication software demands on HPC systems
  - Scalability: minimal memory requirements
  - Performance: minimal instruction counts/cache activity
  - Programming ease: high-level abstractions
  - Portability: support for multiple communication devices

Outline
- UCX design
- UCX as middleware over InfiniBand
- Overheads and optimizations in UCP RDMA functions
- Overheads and optimizations in UCP progress functions
- Evaluation of optimizations in UCX
- Discussion and outlook

UCX as communication middleware
- High-level communication software: MPI, OpenSHMEM, UPC, ...
- Communication middleware: UCX, Portals, OFI, ...
- Low-level communication software: IB Verbs, TCP, Cray uGNI, ...
- Communication devices: InfiniBand, Ethernet, Cray Aries, ...

[Figure: UCX layer diagram. UCP: transport selection, wireup, fragmentation, and software protocols for operations that are not (necessarily) supported by hardware; API boxes: Initialization/Wireup, Tag matching, RMA, Atomics. UCT: hardware-supported operations (Active Messages, RMA, Atomics) over IB Verbs, TCP, NVIDIA CUDA. Remaining labels: "... management", "Data structures".]

InfiniBand support in UCX
[Figure: two software stacks. "UCX over InfiniBand Verbs": User APIs (UCX with UCP and UCT, Verbs API); OS (Driver API, HCA Driver (MLX4/MLX5), Kernel-level Verbs); Hardware (InfiniBand HCA). "UCX with Accelerated Verbs": UCP, UCT, Verbs API, HCA Driver (MLX5), Kernel-level Verbs, Hardware (InfiniBand HCA).]
- UCT with Accelerated Verbs (AVerbs) is available for the Mellanox MLX5 driver

UCX as middleware over InfiniBand
- Functionality
  - UCP: high-level abstractions
  - UCT: generic API for native functions
  - Verbs: native functions
- Portability
  - UCP: any hardware, multiple devices
  - UCT: any hardware with an API implementation, single device
  - Verbs: InfiniBand only
- Performance
  - UCP: overhead due to additional APIs and high-level abstractions
  - UCT: overhead due to additional API
  - Verbs: closest to hardware
- Scalability
  - UCP: extra memory requirements due to additional API and high-level abstractions
  - UCT: extra memory requirements due to additional API
  - Verbs: memory requirements relative to transport

Measuring overheads of UCP over UCT over (A)Verbs
- Comparison of UCP, UCT and Verbs/AVerbs: latency and instructions
- Benchmark: RDMA write with remote completion
- Transport: InfiniBand RC/xRC

    status = ucp_put(ep, buffer, size, remote_address, remote_key); // Put (RDMA Write operation)
    if (status != UCS_OK) exit(ERROR);
    ucp_ep_flush(ep); // Flush operation for remote completion

- UCP over RC (xRC)
  - size <= 98B (size <= 220B): short protocol
  - 98B < size <= 8KB (220B < size <= 8KB): buffered copy protocol
  - size > 8KB: fragmented buffered copy protocol
- Identical protocol selection in our UCT and Verbs/AVerbs benchmarks

Experimental Setup
- Hardware (JLSE resources)
  - Two Intel Xeon E5-2699 CPUs
  - Mellanox InfiniBand EDR HCA ConnectX-4
  - Mellanox FDR-10 switch
- Software
  - UCX v1.0 (7/13/2016), configured to use InfiniBand only
  - Intel C++ compiler 16.0.0: -O3 optimization, interprocedural optimization, inlining, AVX/SSE4.2
  - Intel Software Development Emulator v7.48

RDMA Write latency over InfiniBand
[Figure: latency plots, panels "InfiniBand RC" and "InfiniBand xRC (accelerated)". Lower is better.]
- UCP slower than UCT, UCT slower than Verbs/AVerbs
  - Small UCT-to-Verbs overhead
  - Larger UCP-to-UCT overhead, especially for short messages with RC
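
The size thresholds quoted for the RDMA write benchmark can be read as a dispatch on message size. The following is a minimal sketch in plain C, not UCX code: the enum names and `select_put_protocol` are hypothetical, and the middle (buffered copy) tier is our reading of the slide; in a real build the limits depend on the transport.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three put protocols mentioned in the talk. */
typedef enum { PROTO_SHORT, PROTO_BCOPY, PROTO_FRAG_BCOPY } put_proto_t;

/* The two transports used in the benchmark. */
typedef enum { TRANSPORT_RC, TRANSPORT_XRC } transport_t;

/* Thresholds in bytes as quoted on the slide (assumed, not read from UCX). */
#define MAX_SHORT_RC   98u
#define MAX_SHORT_XRC  220u
#define MAX_BCOPY      (8u * 1024u)

/* Pick the protocol for a put of `size` bytes over transport `t`. */
static put_proto_t select_put_protocol(transport_t t, size_t size)
{
    size_t max_short = (t == TRANSPORT_RC) ? MAX_SHORT_RC : MAX_SHORT_XRC;

    assert(max_short < MAX_BCOPY);          /* tiers must be ordered */
    if (size <= max_short) return PROTO_SHORT;
    if (size <= MAX_BCOPY) return PROTO_BCOPY;
    return PROTO_FRAG_BCOPY;                /* split into several bcopy sends */
}
```

With these assumed limits, an 8B put takes the short path on both transports, while a 128B put is short on xRC but already buffered copy on RC.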

RDMA Write instructions over InfiniBand
[Figure: instruction breakdown in UCP for Put and Flush.]
- UCP and UCT consume more instructions on Flush for short messages
- Verbs/AVerbs consume more instructions on Put for short messages
- From Verbs/AVerbs to UCT to UCP:
  - Put instructions increase slightly
  - Flush instructions increase significantly
- UCX adds overheads over the native network functions

Objectives
- Analyze UCP core operations
- Identify unnecessary overheads in UCP
- Close the performance gap between UCP and UCT and native InfiniBand performance
- Without sacrificing any of the UCP functionality / portability!

RDMA operations in UCP
- Blocking RDMA functions: ucp_put, ucp_get
  - Select UCT interface
  - Select transfer protocol (short, bcopy, fragmented bcopy)
  - Call UCT function
- Non-blocking RDMA functions: ucp_put_nbi, ucp_get_nbi
  - Select UCT interface
  - Select transfer protocol
  - Call UCT function or push communication request
  - Requests created when no resources are available or for long messages
  - Overhead added with request creation/queue insertion

Analyzing UCP overheads: ucp_put over InfiniBand RC/xRC
[Figure: instruction breakdown of ucp_put, panels "8B - RC", "8B - xRC", "1KB - RC", "1KB - xRC"; legend: System, UCT, UCP.]
- ucp_put calls:
  - UCT function for communication
    - (A)Verbs function
    - Request creation
    - Bookkeeping
  - Function calls (15-20): function pointers!
  - Checks to select transfer (6)
  - Memory copies
  - RESOLVE_RKEY_RMA (35)

Background: UCX communication context
[Figure: ucp_context. Context with Interfaces (uct_mm_posix_iface, uct_ib_ud_iface) and Memory Domains (uct_mm_md, uct_ib_md); Registered Memory with ucp_key (uct_mm_key, uct_ib_key).]

The RESOLVE_RKEY_RMA function
[Figure: ucp_ep with lanes RMA, AMO, AM, Wireup; UCT uct_worker with endpoints uct_ep_mm_posix, uct_ep_ib_ud, uct_ep_ib_rc; Domains/Interfaces MM/POSIX, IB/UD, IB/RC.]
- UCP selects the fastest available interface for RMA
  - RC/xRC for InfiniBand
- RESOLVE_RKEY_RMA performs UCP-to-UCT translations for:
  - endpoint
  - remote memory key
  - transport configuration
- UCP-to-UCT remote key resolution occurs every time with an RDMA operation

Optimizing RDMA functions in UCP
- Objective: reduce the overhead of RESOLVE_RKEY_RMA
- Optimization: perform UCP-to-UCT translations within the ucp_rkey_unpack function
  - Called at connection establishment
  - Out of critical path for communication
  - Necessary information stored within the UCP remote key data structure
- More details in the paper

Optimizing RDMA functions in UCP
[Figure: results of the RESOLVE_RKEY_RMA optimization. Lower is better.]
- 27 of 35 instructions of the RESOLVE_RKEY_RMA function eliminated
- Small impact on latency (<1%)
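
The idea behind the RESOLVE_RKEY_RMA optimization, translating once at unpack time and reusing the result on every put, can be illustrated with a small stand-alone model. This is a hedged sketch, not the UCX implementation: all type and function names (`sketch_rkey_t`, `sketch_ep_t`, `sketch_rkey_unpack`, `sketch_put_rkey`) and the bitmap layout of the remote key are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MDS 8

/* Hypothetical UCP-level remote key: one UCT key per memory domain present. */
typedef struct {
    uint8_t  md_map;              /* bit i set: a key for memory domain i exists */
    uint64_t uct_rkeys[MAX_MDS];  /* packed, one entry per set bit, in bit order */
    int      cached_valid;        /* set once the translation has been stored */
    uint64_t cached_rma_rkey;     /* UCT key for the RMA interface */
} sketch_rkey_t;

/* Hypothetical endpoint: only the memory domain of its RMA interface matters. */
typedef struct {
    int rma_md_index;
} sketch_ep_t;

static int resolve_count;         /* counts slow-path translations */

/* Slow path: walk the bitmap to find the UCT key for the RMA interface. */
static uint64_t resolve_rma_rkey(const sketch_ep_t *ep, const sketch_rkey_t *rkey)
{
    int slot = 0;

    assert(ep->rma_md_index < MAX_MDS);
    assert(rkey->md_map & (1u << ep->rma_md_index));
    for (int i = 0; i < ep->rma_md_index; ++i)
        if (rkey->md_map & (1u << i))
            ++slot;
    ++resolve_count;
    return rkey->uct_rkeys[slot];
}

/* Done once, outside the critical path: store the translation in the key. */
static void sketch_rkey_unpack(const sketch_ep_t *ep, sketch_rkey_t *rkey)
{
    rkey->cached_rma_rkey = resolve_rma_rkey(ep, rkey);
    rkey->cached_valid = 1;
}

/* Critical path: use the cached key if present, else fall back to resolving. */
static uint64_t sketch_put_rkey(const sketch_ep_t *ep, const sketch_rkey_t *rkey)
{
    return rkey->cached_valid ? rkey->cached_rma_rkey
                              : resolve_rma_rkey(ep, rkey);
}
```

After `sketch_rkey_unpack` has run, repeated calls to `sketch_put_rkey` no longer touch the bitmap, which mirrors moving the translation out of the per-operation path.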

Progress/Flush operations in UCP
- Progress function: ucp_worker_progress
  - Progress all communication operations of the UCP/UCT worker
- Flush functions: ucp_worker_flush, ucp_ep_flush
  - Complete all outstanding operations on the worker or the endpoint on the local side (through progress)

Analyzing UCP overheads: ucp_ep_flush over InfiniBand RC/xRC
[Figure: instruction breakdown of ucp_ep_flush, panels "Flush after 8B Put - RC" and "Flush after 8B Put - xRC".]
- RC
  - UD interface is progressed: not in UCT
  - RX queue is polled: not in Verbs
- xRC
  - xUD interface is progressed: not in UCT
  - RX queue is polled: not in Verbs

Optimizing the UCP progress engine (1)
- InfiniBand interface progress polls both the InfiniBand send (TX) and receive (RX) queues for requests
  - RDMA operations create requests on the TX queue only
  - The RX queue must be polled by the progress engine for active messages
- Objective: eliminate unnecessary polling
  - The TX queue polling can be avoided if no messages are sent over an interface
  - Works for any interface
- Optimization: avoid unnecessary TX polling
  - Condition: if no send requests are pending
- More details in the paper

Optimizing the UCP progress engine (1)
[Figure: results of avoiding TX polling. Lower is better.]
- (x)UD TX polling avoided!
- Significant improvement for RC
  - 12% instruction decrease for 8B-message
- Small impact on latency for xRC
  - Polling is faster with AVerbs
  - 4.8% instruction decrease for 8B-message

Background: Connection establishment/Wireup in UCP
[Figure: UCP ucp_worker with ucp_ep and ucp_ep_stub, lanes RMA, AMO, AM, Wireup; UCT uct_worker with endpoints uct_ep_ib_ud and uct_ep_ib_rc; Domains/Interfaces IB/UD, IB/RC.]

Optimizing the UCP progress engine (2)
- Progress on the worker progresses both the (x)RC and the (x)UD interface
  - RDMA operations take place over (x)RC!
- The UD interface remains open for wireup
  - Always used for wireup over InfiniBand
- Objective: eliminate unnecessary interface progress
  - Upon UCP endpoint creation, a stub endpoint initiates endpoint connection
  - All created endpoints are connected after a few calls to the progress engine
    - Scalability bug in UCX!
  - If all UCP endpoints are connected, the (x)UD interface is no longer useful
- Optimization: eliminate (x)UD progress
  - Condition: if no endpoints are connected for the interface
- More details in the paper
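
Both progress-engine optimizations amount to guarding work with a cheap condition. The sketch below models them in plain C under stated assumptions: the struct fields and function names are hypothetical, one progress call reaps exactly one send completion, and the condition for skipping the (x)UD interface is modelled as "no endpoint is still in wireup on it", which is our interpretation of the condition on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-interface state. */
typedef struct {
    const char *name;
    int pending_tx;       /* send requests posted but not yet completed */
    int connecting_eps;   /* endpoints still in wireup on this interface */
    int is_wireup_only;   /* 1 for the (x)UD interface in this sketch */
    int tx_polls;         /* counters, for inspection only */
    int rx_polls;
} sketch_iface_t;

static void sketch_iface_progress(sketch_iface_t *iface)
{
    /* Optimization (1): poll the TX queue only if sends are pending. */
    if (iface->pending_tx > 0) {
        iface->tx_polls++;
        iface->pending_tx--;      /* pretend one completion was reaped */
    }
    /* RX is always polled: active messages may arrive at any time. */
    iface->rx_polls++;
}

static void sketch_worker_progress(sketch_iface_t *ifaces, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Optimization (2): skip the wireup-only interface once idle. */
        if (ifaces[i].is_wireup_only && ifaces[i].connecting_eps == 0) {
            assert(ifaces[i].pending_tx == 0);  /* nothing may be lost */
            continue;
        }
        sketch_iface_progress(&ifaces[i]);
    }
}

/* Flush: call progress until no send is outstanding; returns the call count. */
static int sketch_worker_flush(sketch_iface_t *ifaces, size_t n)
{
    int calls = 0;

    for (;;) {
        int pending = 0;
        for (size_t i = 0; i < n; ++i)
            pending += ifaces[i].pending_tx;
        if (pending == 0)
            return calls;
        sketch_worker_progress(ifaces, n);
        ++calls;
    }
}
```

In this model a flush after three puts on the RC-like interface never touches the idle UD-like interface, and an interface with no pending sends pays only for the RX poll.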

Optimizing the UCP progress engine (2)
[Figure: results of avoiding (x)UD progress. Lower is better.]
- (x)UD progress avoided!
- Significant decrease in latency for RC
  - 29.8% instruction decrease for 8B-message
- Small decrease in latency for xRC
  - Polling is faster with AVerbs
  - 26.6% instruction decrease for 8B-message

Evaluating optimizations on UCP RDMA Write/Read
- Bandwidth/message rate measurement (RC and xRC):
  - Issue 256 non-blocking RDMA operations (Write/Put or Read/Get)
    - TX queue length of 256
  - Flush the endpoint
- Evaluation:
  1. UCP
  2. UCP with optimized RESOLVE_RKEY_RMA + optimized TX polling
  3. UCP with optimized RESOLVE_RKEY_RMA + optimized UD progress

Evaluation with UCP RDMA Write over RC
[Figure: bandwidth and message rate over RC; annotations "Peak Message Rate at 8B", "Peak BW at 8KB". Higher is better.]
- Put optimization is good for medium-sized messages
  - More instructions spent on Put than on Flush
  - On bcopy, RESOLVE_RKEY_RMA called twice
- Avoiding UD progress impacts the message rate for short messages

Evaluation with UCP RDMA Write over xRC
[Figure: bandwidth and message rate over xRC; annotations "Peak Message Rate at 16B", "Peak BW at 8KB". Higher is better.]
- Small impact of Put optimization
  - More instructions spent on Flush than on Put
- Small impact of avoiding TX polling
  - For short messages
  - Fewer instructions over xRC
- High impact of avoiding UD progress
  - Progress is called more often on Flush over xRC

Evaluation with UCP RDMA Read
- All optimizations have smaller impact on UCP RDMA Read
  - More time spent over the network
  - Memory copies/cache effects more important than instruction count
- Avoiding progress on the (x)UD interface has a small impact on performance (up to 3.5% for xRC)
  - More instructions spent on Flush than on Get operations
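
The bandwidth and message-rate figures in this evaluation come from windows of 256 non-blocking operations followed by a flush. How such a run turns into the two metrics can be sketched as below; the function and struct names are hypothetical, the timing source is left out, and the numbers in the usage note are made up for illustration, not measurements from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 256   /* non-blocking operations issued before each flush */

typedef struct {
    double msg_rate;    /* messages per second */
    double bandwidth;   /* bytes per second */
} sketch_rates_t;

/* Convert `iterations` windows of `msg_size`-byte operations, completed in
 * `elapsed_sec` seconds, into message rate and bandwidth. */
static sketch_rates_t sketch_rates(size_t iterations, size_t msg_size,
                                   double elapsed_sec)
{
    sketch_rates_t r;
    double msgs = (double)iterations * WINDOW;

    assert(elapsed_sec > 0.0);
    r.msg_rate  = msgs / elapsed_sec;
    r.bandwidth = r.msg_rate * (double)msg_size;
    return r;
}
```

For example, 1000 windows of 8B messages in 0.5 s would give 512000 messages/s and 4096000 bytes/s.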

Discussion and Outlook
- We can close the performance gap between UCP and UCT without sacrificing the benefits of the high-level abstractions of UCP
  - We helped reduce some overhead
- Optimizations for RDMA functions
  - Eliminate overheads of UCP-to-UCT translations
  - Avoid function pointers from UCP to UCT
    - Requires major redesign
- Optimizations for progress functions
  - Avoid progress over interfaces that are not in use
  - Poll queues for multiple requests at a time
- A scalability bug: Connection Establishment / Wireup
  - Current UCP design forces all endpoints to connect, even if never used

Thank you! Questions?

Acknowledgement
- US Department of Energy, Office of Science
- Joint Laboratory for System Evaluation (JLSE), Argonne National Laboratory
- IKY fellowships of excellence for postgraduate studies in Greece, SIEMENS program
- PMRS group, MCS Division, Argonne National Laboratory
- Gail Pieper

Backup

UCX design: Communication Entities
[Figure: UCP ucp_worker with its progress engine above UCT uct_worker with its progress engine.]
- Worker
  - Core communication entity
  - Same semantics in UCP/UCT
- Progress engine
  - to progress communication (progress)
  - to order communication operations (fence)
  - to complete communication operations (flush)

UCX design: Communication Entities
[Figure: UCP worker with ucp_ep endpoints; UCT uct_worker with endpoints uct_ep_mm_posix, uct_ep_ib_ud, uct_ep_ib_rc.]

UCX design: Communication Primitives
- UCP
  - Tag matching, Remote Memory Access, Atomic Memory Operations
  - Single function for any message size
  - Blocking and non-blocking operations
- UCT
  - Active Messages, Remote Memory Access, Atomic Memory Operations
  - Multiple functions for various transfer methods (short, buffered copy, zero-copy)
  - Functions for each transport depending on the hardware capabilities
  - Non-blocking operations

UCX design: Connection Establishment
1. A worker creates a UCP endpoint
2. The worker selects appropriate UCT interfaces
   - For remote memory access (RMA)
   - For atomic memory operations (AMO)
   - For active messages (AM)
   - For wireup
3. If an interface is connectionless, the corresponding UCT endpoint is created and connected immediately
4. If an interface is connection-oriented (P2P), the corresponding UCT endpoint is created but not connected
5. If connection-oriented interfaces exist, UCP creates a stub endpoint
   - The stub endpoint uses the UCT endpoint over the wireup interface to send wireup requests for all interfaces
   - The stub endpoint is destroyed when all UCT endpoints are connected
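
The connection establishment steps above can be sketched as a small state model. This is an illustrative sketch under assumptions, not UCX code: the lane enum, the struct layout, and the function names are hypothetical, and the wireup lane is assumed to be connectionless so that wireup requests can be sent right away.

```c
#include <assert.h>
#include <stddef.h>

/* One lane per purpose for which the worker selects a UCT interface (step 2). */
typedef enum { LANE_RMA, LANE_AMO, LANE_AM, LANE_WIREUP, LANE_COUNT } lane_t;

typedef struct {
    int p2p;         /* 1: connection-oriented interface, 0: connectionless */
    int connected;
} sketch_uct_ep_t;

typedef struct {
    sketch_uct_ep_t lanes[LANE_COUNT];
    int has_stub;    /* 1 while the stub endpoint exists */
} sketch_ucp_ep_t;

/* Steps 1 to 5: create the endpoint and its per-lane UCT endpoints. */
static void sketch_ep_create(sketch_ucp_ep_t *ep, const int p2p[LANE_COUNT])
{
    ep->has_stub = 0;
    for (int l = 0; l < LANE_COUNT; ++l) {
        ep->lanes[l].p2p = p2p[l];
        ep->lanes[l].connected = !p2p[l];  /* steps 3 and 4 */
        if (p2p[l])
            ep->has_stub = 1;              /* step 5: stub needed */
    }
    /* Assumption: the wireup lane itself must be usable immediately. */
    assert(!ep->has_stub || ep->lanes[LANE_WIREUP].connected);
}

/* A wireup reply for lane `l` arrived: mark it connected and destroy the
 * stub once every UCT endpoint is connected. */
static void sketch_wireup_reply(sketch_ucp_ep_t *ep, lane_t l)
{
    ep->lanes[l].connected = 1;
    for (int i = 0; i < LANE_COUNT; ++i)
        if (!ep->lanes[i].connected)
            return;
    ep->has_stub = 0;
}
```

With two connection-oriented lanes, the stub survives the first wireup reply and disappears after the second, matching the last bullet of step 5.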