

Wednesday, August 26 (Symposium)
8:00-8:45 Breakfast and Registration
8:45-9:00 Introduction
Fabrizio Petrini General Chair
Ryan Grant/Ada Gavrilovska Technical Program Chair
9:00-9:10 Host Opening Remarks
Rodrigo Liang, Oracle
Host Keynote
Session Chair: Fabrizio Petrini

Commercial Computing Trends and Their Impact on Interconnect Technology
Rick Hetherington, Oracle
10:15-10:45 Morning Break
Best Papers
Session Chair: Ron Brightwell
  • NUMA Aware I/O in Virtualized Systems

    In a Non-Uniform Memory Access (NUMA) system, I/O to a local device is more efficient than I/O to a remote device, because a device attached to the same socket as the CPU and memory is physically closer for I/O operations. Modern microprocessors also support on-chip I/O interconnects that allow the processor to drive I/O to a local device without interacting with main memory.

    Modern systems are also highly virtualized. In NUMA systems, scheduling of Virtual Machines (VMs) is very complex because multiple VMs compete for system resources. In this paper, we study how to schedule VMs for better I/O on a NUMA system. We propose the design of a NUMA-aware I/O scheduler which aligns VMs and hypervisor threads on NUMA boundaries while extracting the most benefit from local device I/O. We evaluate our implementation on a variety of workloads and demonstrate improvements of more than 25% in throughput and packet rate, more than 10% in CPU utilization, and more than 5% in VM consolidation ratio.

    Authors' affiliation: VMware Inc., USA

    A. Banerjee, R. Mehta and Z. Shen
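The placement policy the abstract describes can be illustrated with a toy scheduler: put each VM on the NUMA node local to its I/O device when that node has room, and accept remote I/O otherwise. This is a minimal sketch; all function and field names are invented for illustration and are not from the paper or any hypervisor API.

```python
# Toy model of NUMA-aware VM placement (illustrative only).

def place_vms(vms, nodes):
    """Greedily place each VM on the NUMA node closest to its I/O device.

    vms   : list of (vm_name, preferred_node, vcpus)
    nodes : dict node_id -> free vCPU capacity (mutated in place)
    Returns dict vm_name -> node_id, falling back to the least-loaded
    node when the local node is full (remote I/O, but still scheduled).
    """
    placement = {}
    for name, preferred, vcpus in vms:
        if nodes.get(preferred, 0) >= vcpus:
            node = preferred                      # local I/O path
        else:
            node = max(nodes, key=nodes.get)      # fall back: remote I/O
        nodes[node] -= vcpus
        placement[name] = node
    return placement

if __name__ == "__main__":
    nodes = {0: 8, 1: 8}
    vms = [("web", 0, 4), ("db", 0, 6), ("cache", 1, 2)]
    print(place_vms(vms, nodes))  # {'web': 0, 'db': 1, 'cache': 1}
```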
  • The BXI Interconnect Architecture
    BXI, the Bull eXascale Interconnect, is the new interconnection network developed by Atos for High Performance Computing. In this paper, we first present an overview of the BXI network, which is designed and optimized for HPC workloads at very large scale. It is based on the Portals 4 protocol and permits a complete offload of communication primitives to hardware, thus enabling independent progress of computation and communication. We then describe the two BXI ASIC components, the network interface and the BXI switch, and the BXI software environment, with a focus on the network interface architecture. We finally explain how the Bull exascale platform integrates BXI to build a large-scale parallel system, and we give some performance estimates.

    Authors' affiliation: Atos, France

    S. Derradji, A. Poudes, J.-P. Panziera and F. Wellenreiter
  • Exploiting Offload Enabled Network Interfaces

    Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities such as lossless transmission and remote direct memory access that are now ubiquitous in high-performance systems. Prototypes of next generation network cards now offer new features that facilitate device programming.

    In this work, various possible uses of network offload features are explored. We use the Portals 4 interface specification as an example to demonstrate various techniques such as fully asynchronous, multi-schedule asynchronous, and solo collective communications. MPI collectives are used as a proof of concept for how to leverage our proposed semantics. In a solo collective, one or more processes can participate in a collective communication without being aware of it. This semantic enables fully asynchronous algorithms. We discuss how the application of the solo collectives can improve the performance of iterative methods, such as multigrid solvers. The results obtained show how this work may be used to accelerate existing MPI applications, but they also display how these techniques can greatly ease programming of algorithms outside of the Bulk Synchronous Parallel (BSP) model.

    Authors' affiliation: ETH Zurich, Switzerland
    Intel*, USA

    S. Di Girolamo, P. Jolivet, K. D. Underwood* and T. Hoefler
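The offload mechanism this work builds on, Portals-4-style triggered operations, can be modeled in a few lines: an operation is posted against a counter and fires automatically once the counter reaches a threshold, so a chain of sends can progress with no host involvement. The sketch below is our own software model, not the Portals 4 API (whose real calls, such as triggered puts on NIC counting events, run in hardware); all names are invented for illustration.

```python
# Tiny software model of triggered operations and a "solo" chained
# broadcast 0 -> 1 -> 2 -> 3: each hop's arrival counter triggers the
# forward to the next rank, so intermediate ranks never enter the
# collective explicitly.

class Counter:
    def __init__(self):
        self.value = 0
        self.triggers = []          # (threshold, action) pairs

    def inc(self):
        self.value += 1
        for threshold, action in list(self.triggers):
            if self.value >= threshold:
                self.triggers.remove((threshold, action))
                action()            # fires with no host involvement

def triggered_op(counter, threshold, action):
    """Register an action that fires once the counter reaches threshold."""
    if counter.value >= threshold:
        action()
    else:
        counter.triggers.append((threshold, action))

delivered = []
counters = [Counter() for _ in range(4)]
for rank in range(1, 3):
    nxt = rank + 1
    triggered_op(counters[rank], 1,
                 lambda r=rank, n=nxt: (delivered.append(r),
                                        counters[n].inc()))
triggered_op(counters[3], 1, lambda: delivered.append(3))
counters[1].inc()                   # root sends to rank 1; rest cascades
print(delivered)                    # [1, 2, 3]
```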
12:15-13:30 Lunch
Short Papers
Session Chair: Xinyu Que
  • A Brief Introduction to the OpenFabrics Interfaces - A New Network API for Maximizing High Performance Application Efficiency
    OpenFabrics Interfaces (OFI) is a new family of application program interfaces that exposes communication services to middleware and applications. Libfabric is the first member of this family of interfaces and was designed under the auspices of the OpenFabrics Alliance by a broad coalition of industry, academic, and national laboratory partners over the past two years. Building and expanding on the goals and objectives of the verbs interface, libfabric is specifically designed to meet the performance and scalability requirements of high performance applications such as Message Passing Interface (MPI) libraries, Symmetric Hierarchical Memory Access (SHMEM) libraries, Partitioned Global Address Space (PGAS) programming models, Database Management Systems (DBMS) and enterprise applications running in a tightly coupled network environment. A key aspect of libfabric is that it is designed to be independent of the underlying network protocols as well as the implementation of the networking devices. This paper provides a brief discussion of the motivation for creating a new API and describes the novel requirements-gathering process that was used to drive its design. Next, we provide a high-level overview of the API architecture and design, and finally we discuss the current state of development, the release schedule and future work.

    Authors' affiliation: Cray, USA
    Intel†, USA
    Cisco*, USA
    University of New Hampshire♦, USA
    Los Alamos National Laboratory♠, USA

    P. Grun, S. Hefty†, S. Sur†, D. Goodell*, R. Russell♦, H. Pritchard♠, and J. Squyres*
  • UCX: An Open Source Framework for HPC Network APIs and Beyond
    This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high-throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next-generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains. We envision these APIs satisfying the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, PGAS languages, task-based paradigms and I/O-bound applications. We present the initial design and architecture of UCX, and also provide an in-depth discussion of the API and protocols that could be used to implement MPI and OpenSHMEM. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental to implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any publicly known network stack on this hardware. UCX is open source, BSD-licensed software hosted on GitHub, currently accessible to collaborators, that will be publicly released in the near future.

    Authors' affiliation: ORNL, USA
    Mellanox†, Israel
    IBM♦, USA
    University of Tennessee Knoxville♠, USA

    P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin†, M. Dubman†, G. Shainer†, R. Graham†, L. Liss†, Y. Shahar†, S. Potluri*, D. Rossetti*, D. Becker*, D. Poole*, C. Lamb*, S. Kumar♦, C. Stunkel♦, G. Bosilca♠, and A. Bouteiller♠
Session Chair: Ryan Grant

Intel Omni-Path Architecture: Enabling Scalable, High Performance Fabrics
Todd Rimmer, Intel (Invited Paper)
14:55-15:20 Afternoon Break
Session Chair: Mitch Gusat

HPC vs. Data Center Networks

Ron Brightwell, Sandia National Laboratories
Kevin Deierling, Mellanox
Dave Goodell, Cisco
Ariel Hendel, Broadcom
Katharine Schmidtke, Facebook
Keith Underwood, Intel

Thursday, August 27 (Symposium)
8:15-9:00 Breakfast and Registration
Session Chair: Vikram Dham

Recent Advances in Machine Learning and their Application to Networking
David Meyer, Brocade
10:00-10:30 Morning Break
Session Chair: Ada Gavrilovska

Run-time Strategies for Energy-efficient Operation of Silicon-Photonic NoCs
Over the past two decades, the general-purpose compute capacity of the world increased by 1.5x every year, and we will need to maintain this rate of growth to support the increasingly sophisticated data-driven applications of the future. The computing community has migrated towards manycore computing systems with the goal of improving the computing capacity per chip through parallelism while staying within the chip power budget. Energy-efficient data communication has been identified as one of the key requirements for achieving this goal, and the silicon-photonic network-on-chip (NoC) has been proposed as one of the technologies that can meet this requirement. Silicon-photonic NoCs provide high bandwidth density, but the large power consumed in the laser sources and in tuning against on-chip thermal gradients has been a big impediment to their adoption by the wider community. In this talk, I'll present two run-time strategies for achieving energy-efficient operation of silicon-photonic NoCs in manycore systems. For reducing laser power, I'll present our approach of using cache and NoC reconfiguration at run time. The key idea here is to provide the minimum L2 cache size and the minimum NoC bandwidth (i.e. the minimum number of active silicon-photonic links) required for an application to achieve the maximum possible performance at any given point in time. For managing the thermal tuning power of the silicon-photonic NoC, I'll present a run-time job allocation technique that minimizes the temperature gradients among the ring modulators/filters to minimize localized thermal tuning power consumption, and maximizes network bandwidth to maximize application performance.
Ajay Joshi, Boston University (Invited Talk)
Ajay Joshi received his Ph.D. degree from the ECE Department at Georgia Tech in 2006. He then worked as a postdoctoral researcher in the EECS Department at MIT until 2009. He is currently an Assistant Professor in the ECE Department at Boston University. His research interests span various aspects of VLSI design, including circuits and architectures for communication and computation, and emerging device technologies including silicon photonics and memristors. He received the NSF CAREER Award in 2012 and Boston University ECE Department's Award for Excellence in Teaching in 2014.
Optics & NoC
Session Chair: Ada Gavrilovska
  • OWN: Optical and Wireless Network-on-Chips (NoCs) for Kilo-core Architectures
    Current trends of increasing core counts in chip multi-processors (CMPs) will continue, and kilo-core CMPs could be available within a decade. However, metallic interconnects may not scale to kilo-core architectures due to longer hop counts, power inefficiency and increased execution time. Emerging technologies such as 3D stacking, silicon photonics, and on-chip wireless interconnects are under serious consideration, as they show promising results for power-efficient, low-latency, scalable on-chip interconnects. In this paper, we propose an architecture combining two emerging technologies, photonics and wireless, called Optical and Wireless Network-on-Chip (OWN). OWN integrates high-bandwidth, low-latency silicon photonics with flexible on-chip wireless technology to develop a scalable, low-latency interconnect for kilo-core CMPs. Our simulations on synthetic traffic show that a 1024-core OWN consumes 40.2% less energy per bit than wireless-only architectures and 23.2% more than photonics-only architectures for uniform traffic. OWN has higher throughput than the fully wired network CMESH, the hybrid wireless network WCUBE and the hybrid photonic network ATAC for the synthetic traffic types uniform random and bit-reversal, and lower throughput than ATAC for matrix transpose.

    Authors' affiliation: Ohio University, USA
    University of Arizona*, USA

    A. Kodi, A. Sikdar, A. Louri*, S. Kaya and M. Kennedy
  • AMON: Advanced Mesh-like Optical NoC
    Optical Networks-on-Chip (ONoCs) constitute a promising approach to tackle the power wall problem present in large-scale electrical NoC design. In order to enable their adoption and ensure scalability, ONoCs have to be carefully designed with power and energy consumption in mind. In this paper, we propose AMON, an all-optical NoC design based on passive microrings and Wavelength Division Multiplexing, including the switch architecture and a contention-free routing algorithm. The goal is to obtain a design that minimizes the total number of wavelengths and microrings, the wiring complexity and the diameter. An analytical comparison with state-of-the-art design proposals for all-optical NoCs shows that the proposed design can substantially improve the most important performance metrics: hop count, chip area, energy and power consumption. Our experimental work confirms that this improvement translates into higher performance and efficiency. Finally, our design provides a tile-based structure which should facilitate VLSI integration compared with recent ring-like solutions. In general, we show it provides a more scalable solution than previous designs.

    Authors' affiliation: University of Manchester, UK

    S. Werner and J. Navaridas
12:00-13:30 Lunch
Efficient Network Design
Session Chair: Fabrizio Petrini
  • Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms

    Many high performance scientific applications rely on collective operations to move large volumes of data because of their versatility and ease of use. Such data-intensive collectives have a notable impact on the execution time of the program, and hence on energy consumption, owing to the amount of memory/processor/network resources involved in the data movement. On the other hand, mechanisms such as offload and one-sided calls, backed by RDMA-enabled interconnects like InfiniBand coupled with modern transport protocols like Dynamic Connected (DC), provide new ways to express collective communication.

    However, there have been no efforts to fundamentally redesign collective algorithms from an energy standpoint using the rich plethora of existing and upcoming communication mechanisms and transport protocols. In this paper, we take up this challenge and study the impact that RDMA and transport protocol aware designs can have on the energy and performance of dense collective operations like All-to-all. Through evaluation, we also identify that while a single transport protocol may bring both performance and energy benefits for one application it may not do so consistently for all applications. Motivated by this, we propose designs that yield both benefits for all evaluated applications. Our experimental evaluation shows that our proposed designs are able to deliver up to 1.7X savings in energy with little or no degradation in the communication performance for All-to-all collective operations on modern HPC systems.

    Authors' affiliation: Ohio State University, USA

    H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko and D. Panda
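One baseline All-to-all algorithm that a transport-protocol-aware design like the one above would choose between is the classic pairwise exchange: on a power-of-two number of ranks, in step s each rank r exchanges with rank r XOR s, so every pair communicates exactly once. The sketch below is our own illustration of that schedule, not code from the paper.

```python
# Pairwise-exchange (XOR) schedule for All-to-all on 2^k ranks.

def pairwise_schedule(nranks):
    """Return, per step, the list of disjoint (src, dst) exchange pairs."""
    steps = []
    for s in range(1, nranks):
        # r and r ^ s form a pair; sorting deduplicates (a, b) vs (b, a).
        pairs = sorted({tuple(sorted((r, r ^ s))) for r in range(nranks)})
        steps.append(pairs)
    return steps

if __name__ == "__main__":
    for step, pairs in enumerate(pairwise_schedule(4), start=1):
        print(step, pairs)
    # 1 [(0, 1), (2, 3)]
    # 2 [(0, 2), (1, 3)]
    # 3 [(0, 3), (1, 2)]
```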
  • Implementing Ultra Low Latency Data Center Services with Programmable Logic

    Data centers utilize a multitude of low-level network services to implement high-level applications. For example, scalable Key/Value Store (KVS) databases are implemented by storing keys and values in memory on remote servers, then searching for values using keys sent over Ethernet. Most existing KVS databases run variations of memcached in software and scale out to increase search throughput.

    In this paper, we take an alternate approach by implementing an ultra low latency KVS database in Field Programmable Gate Array (FPGA) logic. As with a software-based KVS, the transaction is sent over Ethernet to the machine which stores the value associated with that key. We find that the implementation in logic, however, scales up to provide much higher throughput with lower latency and power consumption.

    High-level applications store, replace, delete and search keys using standard KVS APIs. Our API hashes long keys into statistically unique identifiers and maps variable-length messages into a finite set of fixed-size values. These keys and values are then formatted into a set of compact binary messages and transported over standard Ethernet to KVS servers in the data center.

    When transporting messages over standard 10 Gigabit Ethernet and by processing OCSMs in an FPGA, the logic can search with a fiber-to-fiber latency of under 1 microsecond. Each KVS database implemented as an FPGA core processes 150 Million Searches Per Second (MSPS) per 40 Gigabits/second of link speed. The FPGA KVS was measured to process messages 7x faster while using 13x less energy than kernel-bypass software.

    Authors' affiliation: Algo-Logic Systems, USA

    J. Lockwood and M. Monga
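The key-to-identifier mapping the abstract describes, hashing long keys into statistically unique fixed-size identifiers that select a KVS server, can be sketched in software. This is a toy model only: the function names, the 64-bit identifier width, and the modulo server selection are our assumptions for illustration, not the paper's FPGA design.

```python
# Toy KVS front end: long keys -> fixed-size ids -> server selection.
import hashlib

def key_to_id(key: str) -> int:
    """Hash an arbitrary-length key to a fixed 64-bit identifier."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def id_to_server(key_id: int, nservers: int) -> int:
    """Map the fixed-size identifier to one of the KVS servers."""
    return key_id % nservers

NSERVERS = 4
store = [dict() for _ in range(NSERVERS)]   # one dict per "server"

def put(key, value):
    kid = key_to_id(key)
    store[id_to_server(kid, NSERVERS)][kid] = value

def get(key):
    kid = key_to_id(key)
    return store[id_to_server(kid, NSERVERS)].get(kid)

put("a" * 1000, b"payload")                 # long key, fixed-size id
print(get("a" * 1000))                      # b'payload'
```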
  • Enhanced Overloaded CDMA Interconnect (OCI) Bus Architecture for on-Chip Communication

    On-chip interconnect is a major building block and the main performance bottleneck in modern complex Systems-on-Chip (SoCs). The bus topology and its derivatives are the most widely deployed communication architectures in contemporary SoCs. Space switching, exemplified by crossbars and multiplexers, and time sharing are the key enablers of various bus architectures. The crossbar has quadratic complexity, while resource sharing significantly degrades the overall system's performance. In this work we motivate using Code Division Multiple Access (CDMA) as a bus sharing strategy, which offers many advantages over other topologies. Our work seeks to complement the conventional CDMA bus features by applying overloaded CDMA practices to increase bus utilization efficiency.

    We present the Difference-Overloaded CDMA Interconnect (D-OCI) bus, which leverages the balancing property of the Walsh codes to increase the number of interconnected elements by 50%. We advance two implementations of the D-OCI bus, optimized for speed and for resource utilization. We validate the bus operation and compare the performance of the D-OCI and conventional CDMA buses. We also present synthesis results for the UMC 0.13 µm design kit to give an idea of the maximum achievable bus frequency on ASIC platforms. Moreover, we advance a proof-of-concept HLS implementation of the D-OCI bus on a Xilinx Zynq-7000 SoC and compare its performance, latency, and resource utilization to the ARM AXI bus. The performance evaluation demonstrates the superiority of the D-OCI bus.

    Authors' affiliation: Alexandria University, Egypt

    K. Ahmed and M. Farag
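The conventional CDMA bus the paper builds on can be modeled in a few lines: each sender spreads its bit over an orthogonal Walsh code, the bus carries the chip-wise sum, and each receiver recovers its bit by correlating the bus signal with its own code. This sketch shows only the baseline mechanism; the paper's overloading with difference codes (the D-OCI contribution) is not reproduced here.

```python
# Minimal model of CDMA bus sharing with Walsh codes.

def walsh(n):
    """Build the n x n Walsh-Hadamard code matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-c for c in row] for row in h]
    return h

def encode(bits, codes):
    """Chip-wise sum of every sender's (bit * code) on the shared bus."""
    length = len(codes[0])
    return [sum(b * code[i] for b, code in zip(bits, codes))
            for i in range(length)]

def decode(bus, code):
    """Correlate the bus signal with one code to recover that sender's bit."""
    return sum(x * c for x, c in zip(bus, code)) // len(code)

codes = walsh(4)
bits = [1, -1, 1, -1]                  # four senders, bits as +/-1
bus = encode(bits, codes)              # one shared set of chips
print([decode(bus, c) for c in codes]) # [1, -1, 1, -1]
```

Because the codes are mutually orthogonal, each correlation cancels every other sender's contribution exactly, which is what lets all senders share one physical bus cycle.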
15:00-15:15 Afternoon Break
Session Chair: Madeleine Glick

Facebook Network Architecture and Its Impact on Interconnects
The effect of networking at scale and its impact on the strategy for optical interconnect deployment will be described.
Katharine Schmidtke, Facebook (Invited Talk)

Katharine is currently responsible for Optical Technology strategy at Facebook. She holds a Ph.D. in non-linear optics from Southampton University in the UK and was a postdoc at Stanford University. She has over 20 years of experience in the opto-electronics industry, including positions at Finisar and JDSU.

Session Chair: Ada Gavrilovska

NFV in Cloud Infrastructure - Ready or Not

VMware was formed out of a Stanford research project to virtualize the x86 architecture, a difficult feat considering the complexity of x86 architecture. Over the past several years, both Intel and AMD have added support in hardware for virtualization extensions that helped improve performance and scale for hypervisors to support more workloads with higher consolidation ratios.

Today, about 75% of x86 server workloads run in virtual machines (VMs), primarily on data center infrastructure. Network Functions Virtualization (NFV) is about leveraging standard virtualization technology to consolidate network equipment onto industry-standard high-volume servers, switches and storage. This is a natural evolutionary path for telecommunications companies, similar to their move from circuit to packet switching, or from ATM and Frame Relay to MPLS and IP. NFV will allow telcos to be more agile in meeting the rapidly increasing volume of wireless traffic fueled by mobile devices and over-the-top services, and enable them to offer differentiated services of their own.

NFV workloads, however, are different from and more performance-demanding than typical IT workloads. As a result, it is tempting for telcos to use techniques like OS and hypervisor bypass to achieve maximum performance, sometimes overlooking the flexibility, agility, security and other benefits that hypervisors are uniquely positioned architecturally to provide. The desire to achieve maximum performance is sometimes misguided by micro-benchmarks that are meaningless when translated to real NFV applications in a production environment.

In this talk we will look at a real-world use case of a telecommunications service and examine the trade-offs between performance and flexibility for NFV. We will outline how the tradeoffs can affect the overall system design. Finally, we will provide our view of directions and call to action for solutions to address both goals.

Bhavesh Davda, VMware (Invited Talk)

Bhavesh Davda is a senior engineer in the CTO Office at VMware. He is currently leading the engineering efforts to make vSphere the best-in-class infrastructure to support Telco (NFV) workloads. He also works on enabling virtualization of real-time applications, low latency applications, high rate packet processing, high performance computing, SR-IOV and RDMA, which have historically been challenging to virtualize. Previously, he managed the vSphere Networking R&D team, where the team worked on the entire virtual networking stack from guest OS drivers for paravirtual NICs, to virtualization of paravirtual and emulated NICs in the hypervisor, to efficiently implementing I/O down to the physical NIC drivers and the TCP/IP stack in the vmkernel hypervisor for uses like VMotion, Fault-Tolerance and NFS at the hypervisor level.

Bhavesh has over 20 years of experience in the systems software field. Before VMware, he was a Distinguished Member of Technical Staff at Avaya Labs in the operating systems group of their flagship Enterprise Communications Manager product.

17:00-17:15 Closing Remarks
Fabrizio Petrini, General Chair
Friday, August 28 (Tutorials)
8:00-8:30 Breakfast and Registration
8:30-12:30 Tutorial 1

Accelerating Big Data Processing with Hadoop, Spark, and Memcached Over High-Performance Interconnects

D. K. Panda & Xiaoyi Lu, Ohio State University

Tutorial 2

ONOS Tutorial

Thomas Vachuska, Madan Jampani, Ali Al-Shabibi & Brian O'Connor, ONOS

12:30-13:30 Lunch
13:30-17:30 Tutorial 3

Flow and Congestion Control for High Performance Clouds: How to Design and Tune the Datacenter Fabric & SDN for Big Data

Mitch Gusat, IBM Research, Zurich Laboratory

Tutorial 4

Software-defined Wide-Area Networking: Challenges, Opportunities and Reality

Inder Monga, Energy Sciences Network & Srini Seetharaman, Infinera



IEEE Micro: January 6
Materials due: August 7
Early Registration: August 14