

Wednesday, August 24 (Symposium)
8:00-8:45 Breakfast and Registration
8:45-9:00 Introduction
Ryan Grant & Charlie Perkins General Chairs
James Dinan & Ricki Williams Technical Program Chairs
9:00-9:10 Host Opening Remarks
Mike McBride, Huawei
Session Chair: Ryan Grant

Cloudcasting - Perspectives on Virtual Routing for Cloud Centric Network Architectures
Kiran Makhijani, Huawei
10:15-10:30 Morning Break
Routing and Network Topology
Session Chair: Edgar A. Leon
  • Ensuring Deadlock-Freedom in Low-Diameter InfiniBand Networks

    Lossless networks such as InfiniBand use flow control to avoid packet loss due to congestion. This introduces dependencies between input and output channels; in the case of cyclic dependencies, the network can deadlock. Deadlocks can be resolved by splitting a physical channel into multiple virtual channels with independent buffers and credit systems. Currently available routing engines for InfiniBand assign entire paths from source to destination nodes to different virtual channels. However, InfiniBand allows changing the virtual channel at every switch. We developed fast routing engines which exploit this fact and map individual hops to virtual channels. Our algorithm imposes a total order on virtual channels and increments the virtual channel at every hop, so the diameter of the network is an upper bound on the required number of virtual channels. We integrated this algorithm into the InfiniBand software stack. Our algorithms provide deadlock-free routing on state-of-the-art low-diameter topologies, using fewer virtual channels than currently available practical approaches, while being faster by a factor of four on large networks. Since low-diameter topologies are common among the largest supercomputers in the world, providing deadlock-free routing for such systems is very important.

    Authors' affiliation: ETH Zurich, Switzerland

    T. Schneider, O. Bibartiu and T. Hoefler
  • Scalable, Global, Optimal-bandwidth, Application-Specific Routing
    High performance computing platforms can benefit from additional bandwidth from the interconnection network because there are many applications with significant communication demands. Further, many HPC applications expressed as MPI programs have stable communication patterns across runs. Ideally, one would like to exploit the stable communication patterns by using global routing of communication paths to minimize network contention. Unfortunately, existing optimal-bandwidth, global routing techniques use mixed integer linear programs which fundamentally do not scale to the sizes that HPC workloads and platforms demand. Consequently, HPC platforms use simple distributed routing techniques, possibly with local adaptive routing, at best. Our design – Scalable Global Routing (SGR) – addresses this gap. Simulations reveal that in a 4096-node, 4D-torus network, SGR achieves global route computation with a speedup of nearly two orders of magnitude over prior global routing techniques. SGR outperforms simpler (non-global) routing techniques such as minimal adaptive routing by a 3.1X margin and non-minimal adaptive routing by a 37% margin.

    Authors' affiliation: Cairo University, Egypt
    Purdue University*, USA

    A. Abdel-Gawad and M. Thottethodi*
  • Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks

    The Cray Cascade architecture uses Dragonfly as its interconnect topology and employs a globally adaptive routing scheme called UGAL. UGAL directs traffic based on link loads but may make inappropriate adaptive routing decisions in various situations, which degrades its performance. In this work, we propose to improve UGAL by incorporating a traffic pattern-based adaptation mechanism for intra-group communication in Dragonfly. The idea is to explicitly use the link usage statistics collected in performance counters to infer the traffic pattern, and to take the inferred traffic pattern plus link loads into consideration when making adaptive routing decisions. Our performance evaluation results on a diverse set of traffic conditions indicate that, by incorporating the traffic pattern-based adaptation mechanism, our scheme is more effective in making adaptive routing decisions and achieves lower latency under low load and higher throughput under high load than the existing UGAL in many situations.

    Authors' affiliation: Florida State University, USA
    Los Alamos National Lab*, USA

    P. Faizian, M. S. Rahman, M. A. Mollah, X. Yuan, S. Pakin* and M. Lang*
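The hop-by-hop virtual-channel scheme described in the Schneider, Bibartiu, and Hoefler abstract above can be illustrated with a short sketch. This is a minimal illustration under assumed names, not the authors' InfiniBand software stack implementation: the virtual-channel index increases monotonically along each path, so cyclic channel dependencies cannot form, and the network diameter bounds the number of virtual channels required.

```python
def assign_virtual_channels(path_hops, num_vcs):
    """Map each hop of a route to a virtual channel (VC).

    Illustrative sketch of hop-by-hop VC assignment: hop i uses VC i,
    imposing a total order on (VC, channel) pairs. Because the VC index
    never decreases along a path, no cyclic buffer dependency can arise,
    and a path of length d needs at most d VCs -- i.e., the topology
    diameter upper-bounds the VC count.
    """
    if len(path_hops) > num_vcs:
        raise ValueError("path longer than available VCs; "
                         "network diameter must not exceed num_vcs")
    # enumerate() yields (index, hop); the index is the VC for that hop.
    return [(hop, vc) for vc, hop in enumerate(path_hops)]

# Example: a 3-hop route in a diameter-3 topology needs only 3 VCs.
route = ["sw_a", "sw_b", "sw_c"]
print(assign_virtual_channels(route, num_vcs=3))
```

Path-based schemes, by contrast, must pick one VC for the entire source-to-destination route, which is why they tend to need more VCs on the same topology.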
12:00-13:30 Lunch
Switch Architecture and Traffic Management
Session Chair: Madeleine Glick
  • Improvements to the InfiniBand Congestion Control Mechanism
    The InfiniBand Congestion Control mechanism (IB CC) can reduce the negative consequences of congestion in many situations. However, its effectiveness depends on a set of parameters that must be set by administrators. If the parameters are not appropriately configured, IB CC can negatively impact network performance, and no universal parameter setting that fits all situations has been found. These difficulties prevent IB CC from being widely used. In this paper we propose several enhancements to the existing IB CC. First, our improved IB CC significantly reduces parameter configuration. Second, it removes congestion quickly. Third, a new utilization-driven approach and a new Link Bandwidth Availability Report (LBAR) approach guide sending interfaces on how and when to adjust their injection rates. These adjustments are aware of the actual network condition, rather than relying on preconfigured parameters as in the existing IB CC. Simulation results demonstrate that our improved IB CC reduces the consequences of congestion efficiently and adapts to various network topologies and traffic patterns.

    Authors' affiliation: University of New Hampshire, USA
    Simula Research Lab*, Norway

    Q. Liu, R. D. Russell and E. G. Gran*
  • Scalable High-Radix Modular Crossbar Switches
    Crossbars are a basic building block of networks on chip that can be used as fast, single-stage networks or in router cores for larger scale networks. However, scaling crossbars to high radices presents a number of efficiency, performance, and area challenges. Thus, we propose modular flow-through crossbar switch cores that perform better at high radices than conventional monolithic designs. The modular sub-blocks are arranged in a controlled flow-through, pipelined scheme to eliminate global connections and maintain linear performance scaling and high throughput. Modularity also enables energy savings via deactivation of unused I/O wires. Evaluation using an analytical crossbar switch modeling tool demonstrated improved energy-delay product (up to 5.3X) compared to conventional crossbar switches, but with approximately 30% area overhead. Further, we evaluated modular crossbar networks with the proposed switch cores using BookSim2, a cycle-accurate network-on-chip simulator. The proposed design achieves more than 90% saturation capacity with an internal speedup of 1.5, supports data line rates as high as 102.4 Gbps (in 40nm bulk CMOS), and offers lower average network latency compared to conventional crossbars.

    Authors' affiliation: CMU, USA
    Altera Corp*, USA
    Oracle Labs†, USA

    C. Cakir, R. Ho*, J. Lexau† and K. Mai
  • A Clos-Network Switch Architecture based on Partially-Buffered Crossbar Fabrics
    Modern Data Center Networks (DCNs) that scale to thousands of servers require high-performance switches/routers to handle high traffic loads with minimum delay. Today's switches need to be scalable, perform well and, more importantly, be cost-effective. This paper describes a novel three-stage Clos-network switching fabric with partially-buffered crossbar modules and different scheduling algorithms. Compared to conventional fully-buffered and bufferless switches, the proposed architecture sits between the two designs and takes the best of both: i) fewer hardware requirements, which considerably reduces both the cost and the implementation complexity, and ii) a small number of internal buffers, which allows for simple, high-performance scheduling. Two alternative scheduling algorithms are presented. The first is scalable: it disperses the control function over multiple switching elements in the Clos network. The second is simpler: it places some control on a central scheduler to ensure in-order packet delivery. Simulations for various switch settings and traffic profiles show that the proposed architecture is scalable, maintaining high throughput and low latency with less hardware.

    Authors' affiliation: University of Leeds, England

    F. Hassen and L. Mhamdi
15:00-15:15 Afternoon Break
Moderator: Ron Brightwell

Many-core Reality Check — How increasing core counts, on-node networks, and deep integration will impact system interconnects
Node architectures have entered an era of intense innovation, with trends toward increasing numbers of devices per processor; integration of memory and the introduction of new memory technologies; and rapid increases in the scale of on-chip, on-package, and on-node networks. These trends, in turn, are influencing the design of network architectures and redefining the interaction between nodes and the network. In this session, our panelists will critically evaluate these trends to identify key issues that must be solved by the coming generations of high-speed networking technologies.

Mark Cummings, Orchestral Networks
Scot Schultz, Mellanox
Pavel Shamis, ARM
Keith Underwood, Intel

Thursday, August 25 (Symposium)
8:15-9:00 Breakfast and Registration
Session Chair: Charlie Perkins

Building Large Scale Data Centers: Cloud Network Design Best Practices
In this talk, we examine the network design principles of large-scale data centers. Public cloud application scale tends to exceed the capacity of any multiprocessor machine, making distributed applications the norm. These distributed applications are decomposed and deployed across multiple physical (or virtual) servers, which introduces network demands for intra-application communication. Most large-scale data centers have a scalable network infrastructure that happens to be a good fit for the distributed applications model. This deployment model is evolving to include parallel software clusters, microservices, and machine learning clusters, with ramifications for the corresponding network attributes. We map network best practices down to salient switch and NIC ASIC devices at the architecture and feature level, and, time permitting, discuss how these practices are reflected in the public information major operators have published about their data center networks.
Ariel Hendel, Broadcom Limited (Invited Talk)

Ariel Hendel is a Broadcom Distinguished Engineer focusing on Data Center Networks and Switch Architecture. He joined Broadcom in 2008; prior to that, he was a Distinguished Engineer at Sun Microsystems, where he was twice a recipient of the Chairman's Award.

He earned his BS degree from the Technion, Haifa, and MS from Polytechnic University, New York. Ariel holds more than 40 patents with several more pending.

Session Chair: Charlie Perkins

Network topologies for large-scale compute centers: It's the diameter, stupid!
We discuss the history and design tradeoffs for large-scale topologies in high-performance computing. We observe that datacenters are slowly following due to the growing demand for low latency and high throughput at lowest cost. We then introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers as well as a detailed cost and power model. Slim Fly enables constructing cost effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.
Torsten Hoefler, ETH Zurich (Invited Talk)
Torsten is an Assistant Professor of Computer Science at ETH Zurich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum where he chairs the "Collective Operations and Topologies" working group. Torsten won best paper awards at the ACM/IEEE Supercomputing Conference SC10, SC13, SC14, EuroMPI'13, HPDC'15, HPDC'16, IPDPS'15, and other conferences. He published numerous peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He received the Latsis prize of ETH Zurich as well as an ERC starting grant in 2015. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel programming techniques, and performance modeling. Additional information about Torsten can be found on his homepage at htor.inf.ethz.ch.
10:00-10:30 Morning Break
Memory and Data Caching
Session Chair: Ricki Williams
  • Race Cars vs. Trailer Trucks: Switch Buffers Sizing vs. Latency Tradeoffs in Data Center Networks
    This paper raises, for data center designers, the question of the trade-off between high-buffer switches and low-latency switches. Packet buffer hardware dictates this trade-off due to the constraints of DRAM and SRAM technologies. Designers who prefer robust network solutions typically choose large-buffer switches and settle for higher latency, while designers who can adapt applications to the network behavior prefer low-latency switches to gain better application performance. In this paper, we revisit the question of switch buffer sizing in data center networks by considering switch delay in light of common data center traffic patterns. To the best of our knowledge, this is the first paper to discuss the switch buffer sizing question while considering the switch latency trade-off. We review previous work on switch buffer sizing given the typical parameters of data center networks, and survey the typical data center traffic patterns that challenge switch buffers. We also provide simulation results that show the effect of switch latency on the effective bandwidth of acknowledgement-based congestion-controlled flows. Finally, we discuss the gain that flow control provides to end-to-end network performance.

    Authors' affiliation: Mellanox, Israel

    A. Shpiner and E. Zahavi
  • A Multilevel NOSQL Cache Design Combining In-NIC and In-Kernel Caches
    Since large-scale in-memory data stores, such as key-value stores (KVS), are an important software platform for data centers, this paper focuses on FPGA-based custom hardware to further improve the efficiency of KVS. Although such FPGA-based KVS accelerators have been studied and have shown high performance per watt compared to software-based processing, their cache capacity is strictly limited by the DRAMs implemented on FPGA boards, which also limits their application domain. To address this issue, we propose a multilevel NOSQL cache architecture that utilizes both an FPGA-based hardware cache and an in-kernel software cache in a complementary style, referred to as the L1 and L2 NOSQL caches, respectively. The proposed multilevel NOSQL cache architecture motivates us to explore various design options, such as cache write and inclusion policies between the L1 and L2 NOSQL caches. We implemented a prototype system of the proposed multilevel NOSQL cache using a NetFPGA-10G board and the Linux Netfilter framework. Based on the prototype implementation, we explore the various design options for the multilevel NOSQL caches. Simulation results show that our multilevel NOSQL cache design reduces the cache miss ratio and improves throughput compared to a non-hierarchical design.

    Authors' affiliation: Keio University, Japan

    Y. Tokusashi and H. Matsutani
  • RoB-Router: Low Latency Network-on-Chip Router Microarchitecture Using Reorder Buffer
    Switch allocation is the critical pipeline stage for networks-on-chip (NoCs), and it is influenced by the order of packets in input buffers. Traditional input-queued routers in NoCs have only a small number of virtual channels (VCs), and the packets in a VC are kept in fixed order. Such a design is susceptible to head-of-line (HoL) blocking, since only the packet at the head of a VC can be allocated by the switch allocator. HoL blocking significantly degrades the efficiency of switch allocation as well as the performance of NoCs. In this paper, we propose to utilize reorder buffer (RoB) techniques to mitigate HoL blocking, accelerate switch allocation, and thus reduce NoC latency. We design VCs as RoBs that allow packets not at the head of a VC to be allocated before the head packet. RoBs reduce conflicts in switch allocation and can efficiently increase the matching count in switch allocation. We design RoB-Router based on traditional input-queued routers in a lightweight fashion, considering the trade-off between performance and cost, and our design can be extended to most state-of-the-art input-queued routers. Evaluation results show that RoB-Router achieves 46% and 15.7% improvements in packet latency under synthetic traffic and PARSEC traces, respectively, compared with TS-Router, the most efficient switch allocator to date, while the energy and area costs are moderate.

    Authors' affiliation: National University of Defense Technology, China

    C. Li, D. Dong, X. Liao, J. Wu and F. Lei
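The complementary L1/L2 lookup order in the multilevel NOSQL cache abstract above can be sketched in a few lines. This is an illustrative model only: the class and method names, and the exclusive fill policy with demotion to L2, are assumptions standing in for the design options the authors explore, not their NetFPGA/Netfilter implementation.

```python
class MultilevelCache:
    """Toy model of a two-level NOSQL cache: a small, fast L1
    (standing in for the in-NIC FPGA cache) backed by a larger L2
    (standing in for the in-kernel software cache)."""

    def __init__(self, l1_capacity):
        self.l1 = {}  # stands in for the FPGA-board DRAM cache
        self.l2 = {}  # stands in for the in-kernel software cache
        self.l1_capacity = l1_capacity

    def get(self, key, backend):
        if key in self.l1:            # L1 hit: served from the NIC
            return self.l1[key]
        if key in self.l2:            # L2 hit: served in the kernel
            value = self.l2.pop(key)  # exclusive policy: promote to L1
        else:
            value = backend(key)      # miss: fall through to the KVS
        self._fill_l1(key, value)
        return value

    def _fill_l1(self, key, value):
        # When L1 is full, demote an entry to L2 instead of dropping it,
        # so the two levels work in a complementary style.
        if len(self.l1) >= self.l1_capacity:
            victim = next(iter(self.l1))
            self.l2[victim] = self.l1.pop(victim)
        self.l1[key] = value
```

An inclusive policy (keeping promoted entries in both levels) is the other obvious design point; the paper's exploration of write and inclusion policies is exactly the choice between variants like these.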
12:00-13:30 Lunch
Session Chair: Ricki Williams

Software-Defined Everything and the New Role of Interconnects
Roy Chua, SDxCentral
14:30-14:45 Afternoon Break
Node and Network Architectures
Session Chair: Songkrant Muneenaem
  • Offloading Collective Operations to Programmable Logic on a Zynq Cluster

    This paper describes our architecture and implementation for offloading collective operations to programmable logic in the communication substrate. Collective operations, i.e., operations that involve communication among groups of cooperating processes, are widely used in parallel processing. The design and implementation strategies of collective operations play a significant role in their performance and thus affect the performance of many high performance computing applications that utilize them. Collectives are central to the widely used Message Passing Interface (MPI) programming model. The programmable logic provided by FPGAs is a powerful option for creating task-specific logic to aid applications. While our work is evaluated on the Xilinx Zynq SoC, it is generally applicable in scenarios where there is programmable logic in the communication pipeline, including FPGAs on network interface cards like the NetFPGA or new systems like Intel's Xeon with on-die Altera FPGA resources. In this paper we adapt and generalize our previous work on offloading collective operations to the NetFPGA, and present a general collective offloading framework for applications using MPI. The implementation is realized on the Xilinx Zynq reference platform, the ZedBoard, using an Ethernet daughter card called EthernetFMC. Results from microbenchmarks are presented, as well as from some scientific applications using MPI.

    Authors' affiliation: Indiana University, USA

    O. Arap and M. Swany
  • Exploring Data Vortex Network Architectures

    In this work, we present an overview of the Data Vortex interconnection network, a network designed for both traditional HPC and emerging irregular and data analytics workloads. The Data Vortex network consists of a congestion-free, high-radix network switch and a Vortex Interconnection Controller (VIC) that interfaces the compute node with the rest of the network. The Data Vortex network is designed to transfer fine-grained network packets at a high injection rate without congesting the network or negatively impacting performance. Our results show that the Data Vortex network is more efficient than traditional HPC networks for fine-grained data transfers. Moreover, our experiments show that a Data Vortex system achieves higher scalability even when using global synchronization primitives.

    Authors' affiliation: Pacific Northwest National Lab, USA
    CMU*, USA

    R. Gioiosa, T. Warfel*, J. Yin, A. Tumeo and D. Haglin
  • Exploring Wireless Technology for Off-Chip Memory Access

    The trend of shifting from multi-core to many-core processors is exceeding the data-carrying capacity of the traditional on-chip communication fabric. While the importance of the on-chip communication paradigm cannot be denied, off-chip memory access latency is fast becoming an important challenge. As more memory-intensive applications are developed, off-chip memory access will limit the performance of chip multiprocessors (CMPs). However, with shrinking transistor dimensions, the energy consumption and latency of traditional metallic interconnects are increasing due to smaller wire widths, longer wire lengths, and complex multi-hop routing requirements. In contrast, emerging wireless technology requires lower energy with single-hop communication, albeit with limited bandwidth (at a 60 GHz center frequency). In this paper, we propose several hybrid-wireless architectures to access off-chip memory by exploiting frequency division multiplexing (FDM), time division multiplexing (TDM), and space division multiplexing (SDM) techniques. We explore the design space of hybrid-wireless interconnects by considering conservative and aggressive wireless bandwidths and directionality. Our hybrid-wireless architectures require a maximum of two hops and show a 10.91% reduction in execution time compared to a baseline metallic architecture. In addition, the proposed hybrid-wireless architectures show on average 62.07% and 32.52% energy-per-byte improvements over traditional metallic interconnects for conservative and aggressive off-chip metallic link energy efficiency, respectively. Nevertheless, the proposed hybrid-wireless architectures incur an area overhead due to the higher transceiver area requirement.

    Authors' affiliation: Ohio University, USA

    M. A. Sikder, A. Kodi, S. Kaya, W. Rayess, D. Matolak and D. Ditomaso
16:15-16:30 Awards & Closing Remarks
Ryan Grant & Charlie Perkins, General Chairs
Friday, August 26 (Tutorials)
8:00-8:30 Breakfast and Registration
8:30-12:00 Tutorial 1

Accelerating Big Data Processing with Hadoop, Spark, and Memcached Over High-Performance Interconnects

D. K. Panda & Xiaoyi Lu, Ohio State University

Tutorial 2

Designing and Developing Performance Portable Network Codes

Pavel Shamis, ARM; Alina Sklarevich, Mellanox Technologies & Swen Boehm, ORNL

12:00-13:30 Lunch
13:30-17:00 Tutorial 3

Data-Center Interconnection (DCI) Technology Innovations in Transport Network Architectures

Loukas Paraschis and Abhinava Shivakumar Sadasivarao, Infinera

Tutorial 4

Efficient Communication in GPU Clusters with GPUDirect Technologies

Davide Rossetti and Sreeram Potluri, NVIDIA



Materials due: August 10