||Monday, August 28th (Tutorials)
||Breakfast and Registration
Exploiting High-Performance Interconnects to Accelerate Big Data Processing with Hadoop, Spark, Memcached, and gRPC/TensorFlow
D. K. Panda & Xiaoyi Lu, Ohio State University
High Performance Distributed Deep Learning for Dummies
D. K. Panda, Ammar Ahmad Awan, and Hari Subramoni, Ohio State University
Designing and Developing Performance Portable Network Codes
Pavel Shamis, ARM; Yossi Itigin, Mellanox Technologies
Developing to Open Fabrics Interfaces libfabric
Sean Hefty, Intel
Data-Center Interconnection (DCI) Technology Innovations in Transport Network Architectures
Nikhil Jain, LLNL; Misbah Mubarak, ANL
||Tuesday, August 29th (Symposium)
||Breakfast and Registration
Ada Gavrilovska & Eitan Zahavi, General Chairs
Ryan Grant & Jitu Padhye, Technical Program Chairs
||Host Opening Remarks
Elisabetta Romano, Ericsson
RDMA deployments: from cloud computing to machine learning
Chuanxiong Guo, MSR
Improving Non-Minimal and Adaptive Routing Algorithms in Slim Fly Networks
P. Y. Segura, J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, and T. Hoefler
Interconnection networks must meet the communication demands of current High-Performance Computing systems.
To interconnect the end nodes of these systems efficiently and with a good performance-to-cost ratio, new
network topologies that leverage high-radix switches, such as Slim Fly, have been proposed in recent years.
Adversarial-like traffic patterns, however, may severely reduce the performance of Slim Fly networks when using only minimal-path routing.
To mitigate the performance degradation in these scenarios, Slim Fly networks should be configured with oblivious or adaptive non-minimal routing.
The non-minimal routing algorithms proposed for Slim Fly usually rely on Valiant's algorithm to select the paths, at the cost of doubling the average
path length, as well as the number of Virtual Channels (VCs) required to prevent deadlocks.
Moreover, Valiant may introduce additional inefficiencies when applied to Slim Fly networks, such as the "turn-around problem" that we analyze in this work.
With the aim of overcoming these drawbacks, we propose in this paper two variants of Valiant's algorithm that improve the non-minimal path selection in Slim Fly networks.
They are designed to be combined with adaptive routing algorithms that rely on Valiant to select non-minimal paths, such as UGAL or PAR, which we have adapted to the Slim Fly topology.
Through the results from simulation experiments, we show that our proposals improve the network performance and/or reduce the number of required VCs to prevent deadlocks, even
in scenarios with adversarial-like traffic.
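As a rough illustration of the baseline the abstract refers to, the following is a minimal sketch of plain Valiant routing: pick a random intermediate node, route minimally to it, then route minimally to the destination. It is not the authors' proposed variants, and it runs on a toy six-node ring rather than a Slim Fly topology. The function names `shortest_path` and `valiant_path` and the ring graph are assumptions made for this example only.

```python
import random
from collections import deque


def shortest_path(adj, src, dst):
    """Breadth-first search; returns one minimal path as a list of nodes.

    Assumes the graph is connected, so dst is always reachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def valiant_path(adj, src, dst, rng=random):
    """Plain Valiant routing: minimal path to a random intermediate node,
    followed by a minimal path from that node to the destination."""
    candidates = [n for n in adj if n not in (src, dst)]
    mid = rng.choice(candidates)
    first = shortest_path(adj, src, mid)
    second = shortest_path(adj, mid, dst)
    return first + second[1:]  # drop the duplicated intermediate node


if __name__ == "__main__":
    # Toy topology (an assumption, not Slim Fly): a ring of six switches.
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print("minimal:", shortest_path(ring, 0, 3))
    print("valiant:", valiant_path(ring, 0, 3, random.Random(1)))
```

Because the two segments are each minimal, the resulting path is never shorter than the direct minimal path and can be considerably longer, which is the path-length cost the abstract mentions.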
A. Samuel, E. Zahavi, and I. Keslassy
The network plays a key role in High-Performance Computing (HPC) system efficiency. Unfortunately, traditional oblivious and congestion-aware
routing solutions are not application-aware and cannot deal with typical HPC traffic bursts, while centralized routing solutions face a significant
To address this problem, we introduce Routing Keys, a new scalable routing paradigm for HPC networks that decouples intra- and inter-application
flow contention. Our Application Routing Key (ARK) algorithm proactively allows each self-aware application to route its flows according to a predetermined
routing key, i.e., its own intra-application contention-free routing. In addition, in our Network Routing Key (NRK) algorithm, a centralized scheduler chooses
between several routing keys for the communication phases of each application, and therefore reduces inter-application contention while maintaining
intra-application contention-free routing and avoiding scalability issues. Using extensive evaluation, we show that both ARK and NRK achieve a significant
improvement in the application performance.
Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches
T. Schneider, M. Flajslik, J. Dinan, and K. Underwood
The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks,
the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior
efforts have studied the impacts of multiple sources targeting one node (i.e. incast) and have studied multiple flows causing congestion in inter-switch links,
it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to
the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which
avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network
offload. Our protocol yields up to 4x higher throughput in a 5k-node Dragonfly topology for a permutation traffic pattern in which only 1% of all nodes have a
memory write-bandwidth limitation of 1/8th of the network bandwidth.
|Large Scale Networking
WaveLight: A Monolithic Low Latency Silicon-Photonics Communication Platform for the next generation Disaggregated Cloud Data Centers
M. Akhter, P. Somogyi, C. Sun, M. Wade, R. Meade, P. Bhargava, S. Lin, and N. Mehta
While transistor scaling has continued to improve the amount of raw computation that is possible under fixed area or power constraints, the need to move an increasing
number of bits across ever larger distances has gone largely unaddressed by recent transistor improvements. Optical technologies can overcome the fundamental tradeoffs
faced by electrical I/O. However, optical transceivers today are constrained to cumbersome module form factors. For the emerging cloud data centers, the power and cost
of current-generation optical modules prevent the use of optics as a suitable interconnect technology needed to improve utilization through compute and memory disaggregation.
The power and size of discrete optical devices today prevent the formation of high-density switching targeting real-time wireless networks.
To resolve these ongoing interconnect bottlenecks, a technology capable of deeply integrating a large number of optical and electronic devices is required. In this work, we present WaveLight, a zero-change photonics platform, whereby optical functions are designed directly into an existing high-volume CMOS process, to demonstrate an electro-optic flexible switching function with a lightweight, reliable, low-latency protocol. The integrated device leverages microring-based optical transceivers to deliver 80 Gb/s of full-duplex bandwidth at low power.
An FPGA Platform for Hyperscalers
F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss and S. Paredes
FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual
server nodes and to improve the overall power efficiency. Meanwhile, DC infrastructures are being redesigned to pack ever more compute capacity into the same
volume and power envelopes. This redesign leads to the disaggregation of the server and its resources into a collection of standalone computing, memory, and
To embrace this evolution, we developed a platform that decouples the FPGA from the CPU of the server by connecting the FPGA directly to the DC network.
This proposal turns the FPGA into a disaggregated standalone computing resource that can be deployed at large-scale into emerging hyperscale data centers.
We show our platform, which integrates 64 FPGAs (Kintex® UltraScale™ XCKU060) from Xilinx in a 19" × 2U chassis and provides a bisection bandwidth of
640 Gb/s. The platform is designed for cost-effectiveness and makes use of hot-water cooling for optimized energy efficiency. As a result, a DC rack can fit 16
platforms, for a total of 1024 FPGAs + 16TB of DDR4 memory.
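The rack-level figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The per-FPGA shares are derived values that assume resources are divided evenly and that 1 TB is counted as 1024 GB; neither assumption is stated in the abstract.

```python
# Figures quoted in the abstract.
FPGAS_PER_PLATFORM = 64
PLATFORMS_PER_RACK = 16
BISECTION_GBPS_PER_PLATFORM = 640
RACK_MEMORY_TB = 16

# Total FPGAs in one rack.
fpgas_per_rack = FPGAS_PER_PLATFORM * PLATFORMS_PER_RACK

# Derived values (assumption: an even split across FPGAs).
bisection_gbps_per_fpga = BISECTION_GBPS_PER_PLATFORM / FPGAS_PER_PLATFORM
memory_gb_per_fpga = RACK_MEMORY_TB * 1024 / fpgas_per_rack  # assumes 1 TB = 1024 GB

print(f"FPGAs per rack:           {fpgas_per_rack}")
print(f"Bisection share per FPGA: {bisection_gbps_per_fpga} Gb/s")
print(f"Memory per FPGA:          {memory_gb_per_fpga} GB")
```

The product of 64 FPGAs and 16 platforms matches the 1024 FPGAs stated for a full rack.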
Throughput Models of Interconnection Networks: the Good, the Bad, and the Ugly
P. Faizian, Md A. Mollah, Md S. Rahman, X. Yuan, S. Pakin, and M. Lang
Throughput performance, an important metric for
interconnection networks, is often quantified by the aggregate
throughput for a set of representative traffic patterns. A number
of models have been developed to estimate the aggregate
throughput for a given traffic pattern on an interconnection
network. Since all of the models predict the same property
of interconnection networks, ideally, they should give similar
performance results or at least follow the same performance
trend. In this work, we examine four commonly used interconnect
throughput models and identify the cases when all models show
similar trends, when different models yield different trends, and
when different models produce contradictory results. Our study
reveals important properties of the models and demonstrates
the subtle differences among them, which are important for
an interconnect designer/researcher to understand, in order to
properly select a throughput model in the process of interconnect
High Speed Networking at Google
Nandita Dukkipati, Google (Invited Talk)
Ethernet vs. HPC: Can the hyperscale Ethernet data center handle all workloads?
Moderator: Roy Chua, Partner, SDxCentral & Wiretap Ventures
Hyperscale Ethernet data centers (HEDCs) based on pure Ethernet switching have come to dominate both the market and the conversation, and many think they
are all that is necessary. But for many years specialized designs for HPC have survived, though these specialized designs seem increasingly relegated to
fewer and fewer special cases. Some think this will only continue, and the low cost, pervasiveness, and simplicity of a single fabric will result in HEDCs
being "so good enough" that special technology for HPC will no longer be needed. Others disagree and claim that HEDCs enjoy generally uniform and static
workloads that do not fit the profiles of machine learning and HPC applications. They cite technologies like PCI Express fabrics that are infiltrating
HEDCs to augment Ethernet with capabilities that Ethernet simply cannot provide and that are needed in both the massive HEDCs and smaller versions of
the same, including specialized campus DCs or clusters and those deployed at the mobile edge for mobile edge computing (MEC). In the MEC scenario, media, security,
machine learning, and IoT are among the new applications driving the convergence of HPC and HEDC. The panel will debate these opposing viewpoints and will
speculate on whether specialized HPC DCs and general HEDCs will converge, diverge, or continue in separate parallel worlds.
Yogesh Bhatt, Senior Director, Ecosystem Innovation & Strategy, Ericsson
Dave Cohen, Senior Principal Engineer & System Architect, Intel
Pete Fiacco, CTO, GigaIO Networks
Bithika Khargharia, Former Principal Architect, Extreme Networks
Dave Meyer, Chief Scientist, Brocade
Ying Zhang, Software Engineer, Facebook
||Wednesday, August 30th (Symposium)
||Breakfast and Registration
Information Transfer in the era of 5G
David Allan, Ericsson
|Optics & Networks for Science
A High Speed Hardware Scheduler for 1000-port Optical Packet Switches to Enable Scalable Data Centers
J. Benjamin, P. Watts, A. Funnell, and B. Thomsen
Meeting the exponential increase in the global demand for bandwidth has become a major concern for today's data centers. The scalability of any data center
is defined by the maximum capacity and port count of the switching devices it employs, limited by total pin bandwidth on current electronic switch ASICs. Optical
switches can provide higher capacity and port counts, and hence, can be used to transform data center scalability. We have recently demonstrated a 1000-port
star-coupler based wavelength division multiplexed (WDM) and time division multiplexed (TDM) optical switch architecture offering a bandwidth of 32 Tbit/s with
the use of fast wavelength-tunable transmitters and high-sensitivity coherent receivers. However, the major challenge in deploying such an optical switch to
replace current electronic switches lies in designing and implementing a scalable scheduler capable of operating on packet timescales.
In this paper, we present a pipelined and highly parallel electronic scheduler that configures the high-radix (1000-port) optical packet switch.
The scheduler can process requests from 1000 nodes and allocate timeslots across 320 wavelength channels and 4000 wavelength-tunable transceivers within a
time constraint of 1µs. Using the Opencell NanGate 45nm standard cell library, we show that the complete 1000-port parallel scheduler algorithm occupies a
circuit area of 52.7 mm², 4-8x smaller than that of a high-performance switch ASIC, with a clock period of less than 8ns, enabling 138 scheduling iterations
to be performed in 1µs. The performance of the scheduling algorithm is evaluated in comparison to maximal matching from graph theory and conventional
software-based wavelength allocation heuristics. The parallel hardware scheduler is shown to achieve similar matching performance and network throughput
while being orders of magnitude faster.
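The timing figures quoted above can be sanity-checked with simple arithmetic. The sketch below assumes, purely for illustration, that each scheduling iteration takes exactly one clock cycle and that transceivers are spread evenly across nodes. The abstract states neither, so the derived values are indicative only.

```python
# Figures quoted in the abstract.
TIME_BUDGET_NS = 1000          # 1 microsecond scheduling window
ITERATIONS_IN_BUDGET = 138     # scheduling iterations performed in that window
NODES = 1000
TRANSCEIVERS = 4000

# Assumption: one scheduling iteration per clock cycle.
ns_per_iteration = TIME_BUDGET_NS / ITERATIONS_IN_BUDGET

# For comparison: iterations that would fit if the clock period were exactly 8 ns.
iterations_at_8ns = TIME_BUDGET_NS // 8

# Assumption: transceivers are divided evenly among nodes.
transceivers_per_node = TRANSCEIVERS / NODES

print(f"Time per iteration:     {ns_per_iteration:.2f} ns")
print(f"Iterations at 8 ns:     {iterations_at_8ns}")
print(f"Transceivers per node:  {transceivers_per_node}")
```

Under the one-cycle-per-iteration assumption, 138 iterations in 1µs imply a period of roughly 7.25 ns, which is consistent with the stated clock period of less than 8ns.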
Subchannel Scheduling for Shared Optical On-chip Buses
S. Werner, J. Navaridas, and M. Luján
Maximizing bandwidth utilization of optical on-chip interconnects is essential to compensate for the impact of static laser power on power efficiency in
networks-on-chip. Shared optical buses offer a modular design solution with tremendous power savings by allowing optical bandwidth to be shared between all
connected nodes. Previous proposals resolve bus contention by scheduling senders sequentially on the entire optical bandwidth; however, logically splitting a
bus into subchannels to allow senders to be scheduled both sequentially and in parallel has been shown to be highly efficient in electrical interconnects,
and could also be applied to shared optical buses.
In this paper, we propose an efficient subchannel scheduling algorithm that aims to minimize the number of bus utilization cycles by assigning sender-receiver
pairs both to subchannels and time slots. We present both a distributed and a centralized bus arbitration scheme showing that light-weight implementations can be attained.
In fact, our results show that subchannel scheduling more than doubles throughput on shared optical buses compared to sequential scheduling without incurring any power
overheads in most cases. Arbitration latency overheads compared to state-of-the-art sequential arbitration schemes are at most 5-10%, and are only noticeable for very low
Utilizing HPC Network Technologies in High Energy Physics Experiments
Because of their performance characteristics, high-performance fabrics like InfiniBand or Omni-Path are interesting technologies for many local area network
applications, including data acquisition systems for high-energy physics experiments like the ATLAS experiment at CERN. This paper analyzes existing APIs for
high-performance fabrics and evaluates their suitability for data acquisition systems in terms of performance and domain applicability.
The study finds that existing software APIs for high-performance interconnects are focused on applications in high-performance computing with specific workloads
and are not compatible with the requirements of data acquisition systems. To evaluate the use of high-performance interconnects in data acquisition systems, a custom
library called NetIO has been developed and is compared against existing technologies.
NetIO has a message queue-like interface which matches the ATLAS use case better than traditional HPC APIs like MPI. The architecture of NetIO is based on an interchangeable
back-end system which supports different interconnects. A libfabric-based back-end supports a wide range of fabric technologies including InfiniBand. On the front-end side, NetIO
supports several high-level communication patterns that are found in typical data acquisition applications like client/server and publish/subscribe. Unlike other frameworks,
NetIO distinguishes between high-throughput and low-latency communication, which is essential for applications with heterogeneous traffic patterns. This feature of NetIO allows
experiments like ATLAS to use a single network for different traffic types like physics data or detector control.
Benchmarks of NetIO in comparison with the message queue implementation ZeroMQ are presented. NetIO reaches up to 2x higher throughput on Ethernet and up to 3x higher throughput
on FDR InfiniBand compared to ZeroMQ on Ethernet. The latencies measured with NetIO are comparable to ZeroMQ latencies.
|Topologies, Routing and Process Placement
On the Impact of Routing Algorithms in the Effectiveness of Queuing Schemes in High-Performance Interconnection Networks
J. Rocher-Gonzalez, J. Escudero-Sahuquillo, P. J. García, and F. J. Quiles
In High-Performance Computing (HPC) systems, the design of the interconnection network is crucial. Indeed, the network topology, the switch architecture and the routing
scheme determine the network performance and, ultimately, that of the system. As the number of end nodes in HPC systems grows, and the supported applications become increasingly
demanding in communication, the use of techniques to deal with network congestion and its negative effects gains importance. For that purpose, adaptive
or oblivious routing schemes try to balance the network traffic in order to prevent and/or eliminate congestion. On the other hand, there are deterministic routing schemes that balance the number
of paths per link with the aim of reducing the head-of-line blocking derived from congestion situations. Furthermore, other techniques to deal with congestion are based on queuing
schemes. This approach is based on storing different packet flows separately at the port buffers, so that head-of-line blocking and/or buffer hogging are reduced. Existing
queuing schemes use different policies to separate flows, and they can be implemented in different ways. However, most queuing schemes are often used and designed assuming that
the network is configured with deterministic routing, while actually they could be combined also with adaptive or oblivious routing.
This paper analyzes the behavior of different queuing schemes under different routing algorithms: deterministic, adaptive or oblivious. We focus on fat-tree networks, configured
with the most common routing algorithms of each type suitable for that topology. In order to evaluate these configurations, we have run simulation experiments modeling large fat-trees
built from switches with radices available in the market, and supporting several queuing schemes. The experimental results show how different the performance of the queuing schemes may be
when combined with either deterministic or oblivious/adaptive routing. Indeed, from these results we can conclude that some combinations of queuing schemes and routings are
Placement of Virtual Network Functions in Hybrid Data Center Networks
Z. Li and Y. Yang
Hybrid data center networks (HDCNs), where each ToR switch is installed with a directional antenna, emerge as a candidate helping alleviate the over-subscription problem
in traditional data centers. Meanwhile, as virtualization techniques develop rapidly, there is a trend that traditional network functions that are implemented in hardware will
also be virtualized into virtual machines. However, how to place virtual network functions (VNFs) into data centers to meet the customer requirements in a hybrid data center
network environment is a challenging problem. In this paper, we study the VNF placement in hybrid data center networks, and provide a joint VNF placement and antenna scheduling
model. We further simplify it to a mixed integer programming (MIP) problem. Due to the hardness of a MIP problem, we develop a heuristic algorithm to solve it, and also give an
on-line algorithm to meet the requirements from real scenarios. To the best of our knowledge, this is the first work concerning VNF placement in the context of HDCNs. Our extensive
simulations demonstrate the effectiveness of the proposed algorithms, which makes them a very suitable and promising solution for VNF placement in HDCN environments.
MPI Process and Network Device Affinitization for Optimal HPC Application Performance
R. B. Ganapathi, A. Gopalakrishnan, and R. W. Mcguire
High Performance Computing (HPC) applications are highly optimized to make the most of
the resources allocated to the job, such as compute resources, memory, and
storage. Optimal performance for MPI applications requires the best possible affinity
across all the allocated resources. Typically, setting process affinity to
compute resources is well defined, i.e., MPI processes on a compute node have
processor affinity set for a one-to-one mapping between MPI processes and
the physical processing cores. Several well defined methods exist to
efficiently map MPI processes to a compute node.
With the growing complexity of HPC systems, platforms are
designed with complex compute and I/O subsystems. The capacity of I/O devices
attached to a node is expanded with PCIe switches, resulting in large numbers
of PCIe endpoint devices. With a lot of heterogeneity in systems,
application programmers are forced to think harder about affinitizing processes
as it affects performance based on not only compute but also NUMA placement
of I/O devices. Mapping of processes to processor cores and the closest I/O
device(s) is not straightforward. While operating systems do a reasonable job
of trying to keep a process physically located near the processor core(s) and
memory, they lack the application developer's knowledge of process workflow
and optimal I/O resource allocation when more than one I/O
device is connected to the compute node.
In this paper we look at ways to alleviate the problems of affinity choices by
abstracting the device selection algorithm from the MPI application layer.
MPI continues to be the dominant programming model for HPC and hence
our focus in this paper is limited to providing a solution for MPI based
applications. Our solution can be extended to other HPC programming models
such as Partitioned Global Address Space (PGAS) or hybrid MPI and PGAS
based applications. We propose a solution to address NUMA effects at the
MPI runtime level independent of MPI applications. Our experiments are
conducted on a two-node system where each node is a two-socket
Intel Xeon server with up to four Intel Omni-Path
fabric devices connected over PCIe. The performance benefits seen by
MPI applications from affinitizing MPI processes with the best possible network
device are evident from the results, where we observe up to 40% improvement
in uni-directional bandwidth, 48% in bi-directional bandwidth, 32% improvement
in latency measurements, and finally up to 40% improvement in message rate.
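To make the idea of runtime-level device selection concrete, here is a minimal sketch of one possible policy: prefer a fabric device attached to the same NUMA node as the process, and spread local ranks across the devices on that node. This is not the authors' algorithm. The function `select_device`, the device names, and the dictionary layout are all hypothetical and chosen only for illustration.

```python
def select_device(process_numa_node, devices, local_rank):
    """Pick a network device for a process.

    Prefers devices on the process's own NUMA node and distributes
    local ranks round-robin across them. Falls back to all devices
    if none is attached to that NUMA node.
    """
    local = [d for d in devices if d["numa_node"] == process_numa_node]
    pool = local if local else devices
    return pool[local_rank % len(pool)]["name"]


if __name__ == "__main__":
    # Hypothetical two-socket node with two fabric devices per socket.
    devices = [
        {"name": "dev0", "numa_node": 0},
        {"name": "dev1", "numa_node": 0},
        {"name": "dev2", "numa_node": 1},
        {"name": "dev3", "numa_node": 1},
    ]
    for rank, numa in [(0, 0), (1, 0), (2, 1), (3, 1)]:
        print(f"local rank {rank} on NUMA {numa} ->",
              select_device(numa, devices, rank))
```

A real MPI runtime would obtain the NUMA topology from the operating system or a topology library rather than from a hard-coded list. The hard-coded list keeps the sketch self-contained.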
HP (Invited Talk)
|Efficient Network Design & Network Architecture
Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda
Deep Learning over Big Data (DLoBD) is becoming one of the most important research paradigms to mine value from the massive amount of gathered data. With this
emerging paradigm, more and more deep learning frameworks start running over Big Data stacks, such as Hadoop and Spark. With the convergence of HPC, Big Data, and Deep
Learning, many of these emerging frameworks have taken advantage of RDMA and multi-/many-core based CPUs/GPUs. Even though a lot of activities are happening in the field,
there is a lack of systematic studies on analyzing the impact of RDMA-capable networks and CPU/GPU on DLoBD stacks. To fill this gap, we propose a systematic
characterization methodology and conduct extensive performance evaluations on three representative DLoBD stacks (i.e., CaffeOnSpark, TensorFlowOnSpark, BigDL)
to expose the interesting trends in terms of performance, scalability, accuracy, and resource utilization. Our observations show that RDMA-based design for DLoBD stacks
can achieve up to 2.7X speedup compared to the IPoIB based scheme. More insights are shared in this paper to guide designing next-generation DLoBD stacks.
Low-Level Host Software Stack Optimizations to Improve Aggregate Fabric Throughput
V. T. Ravi, J. Erwin, P. Sivakumar, C. Tang, J. Xiong, M. Debbage, and R. B. Ganapathi
Scientific HPC applications along with the emerging class of Big Data and Machine Learning workloads are rapidly driving the fabric scale both on premises and
in the cloud. Achieving high aggregate fabric throughput is paramount to the overall performance of the application. However, achieving high fabric throughput at
scale can be challenging - that is, the application communication pattern will need to map well on to the target fabric architecture, and the multi-layered host
software stack in the middle will need to orchestrate that mapping optimally to unleash the full performance.
In this paper, we investigate low-level optimizations to the host software stack with the goal of improving the aggregate fabric throughput, and hence,
application performance. We develop and present a number of optimization and tuning techniques that are key drivers of fabric performance at scale, such
as fine-grained interleaving, improved pipelining, and careful resource utilization and management. We believe that these low-level optimizations can be commonly
leveraged by several programming models and their runtime implementations, making these optimizations broadly applicable. Using a set of well-known MPI-based scientific
applications, we demonstrate that these optimizations can significantly improve the overall fabric throughput and the application performance. Interestingly, we
also observe that some of these optimizations are inter-related and can additively contribute to the overall performance.
Userspace RDMA Verbs on Commodity Hardware using DPDK
RDMA (Remote Direct Memory Access) is a technology which enables user applications to perform direct data transfer between the virtual memory of processes
on remote endpoints, without operating system involvement or intermediate data copies. Achieving zero intermediate data copies using RDMA requires specialized
network interface hardware. Software RDMA drivers emulate RDMA semantics in software to allow the use of RDMA without investing in such hardware, although
they cannot perform zero-copy transfers. Nonetheless, software RDMA drivers are useful for research, application development, testing, debugging, or as a less
expensive desktop client for a centralized RDMA server application running on RDMA-capable hardware.
Existing software RDMA drivers perform data transfer in the kernel. Data Plane Development Kit (DPDK) provides a framework for mapping Ethernet interface cards
into userspace and performing bulk packet transfers. This in turn allows a software RDMA driver to perform data transfer in userspace. We present our software RDMA
driver, urdma, which performs data transfer in userspace, discuss its design and implementation, and demonstrate that it can achieve lower small message latency than
existing kernel-based implementations while maintaining high bandwidth utilization for large messages.
||Awards & Closing Remarks
Ada Gavrilovska & Eitan Zahavi, General Chairs