||Wednesday, August 26 (Symposium)
||Breakfast and Registration
Fabrizio Petrini, General Chair
Ryan Grant and Ada Gavrilovska, Technical Program Chairs
||Host Opening Remarks
Rodrigo Liang, Oracle
|Host Keynote
Session Chair: Fabrizio Petrini
Commercial Computing Trends and Their Impact on Interconnect Technology
Rick Hetherington, Oracle
Session Chair: Ron Brightwell
NUMA Aware I/O in Virtualized Systems
A. Banerjee, R. Mehta and Z. Shen
In a Non-Uniform Memory Access (NUMA) system, I/O to a local device is more efficient than I/O to a remote device, because a device connected to the same socket
as the CPU and memory offers closer proximity for I/O operations. Modern microprocessors also support on-chip I/O interconnects, which allow the processor to drive
I/O to a local device without the need to interact with main memory.
Modern systems are also highly virtualized. In NUMA systems, scheduling of Virtual Machines (VMs) is very complex because multiple VMs compete for system
resources. In this paper, we study how to schedule VMs for better I/O on a NUMA system. We propose the design of a NUMA-aware I/O scheduler that aligns VMs and hypervisor
threads on NUMA boundaries while extracting the most benefit from local device I/O. We evaluate our implementation for a variety of workloads. We demonstrate a benefit
of more than 25% in throughput and packet rate, more than 10% in CPU utilization, and more than 5% in VM consolidation ratio.
Authors affiliation: VMware Inc., USA
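The local-device preference the paper exploits can be sketched at user level with Linux sysfs and CPU affinity. This is a minimal illustration of the underlying idea only, not the authors' hypervisor-level scheduler; the interface name in the comment is a placeholder.

```python
import os

def parse_cpulist(cpulist):
    """Parse a sysfs cpulist string like '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in cpulist.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def local_cpus_for_device(ifname):
    """CPUs on the NUMA node local to a network device (Linux sysfs)."""
    with open("/sys/class/net/%s/device/numa_node" % ifname) as f:
        node = int(f.read())
    if node < 0:  # -1: the device reports no NUMA affinity
        return set(range(os.cpu_count()))
    with open("/sys/devices/system/node/node%d/cpulist" % node) as f:
        return parse_cpulist(f.read())

# Pin the current process (e.g. an I/O-heavy VM thread) to device-local CPUs:
# os.sched_setaffinity(0, local_cpus_for_device("eth0"))
```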
The BXI Interconnect Architecture
S. Derradji, A. Poudes, J.-P. Panziera and F. Wellenreiter
BXI, Bull eXascale Interconnect, is the new interconnection network developed by Atos for High Performance Computing. In this paper, we first present an overview
of the BXI network. The BXI network is designed and optimized for HPC workloads at very large scale. It is based on the Portals 4 protocol and permits a complete
offload of communication primitives in hardware, thus enabling independent progress of computation and communication. We then describe the two BXI ASIC components,
the network interface and the BXI switch, and the BXI software environment. We focus on the network interface architecture. We finally explain how the Bull exascale
platform integrates BXI to build a large scale parallel system and we give some performance estimations.
Authors affiliation: Atos, France
Exploiting Offload Enabled Network Interfaces
S. Di Girolamo, P. Jolivet, K. D. Underwood* and T. Hoefler
Network interface cards are one of the key components to achieve efficient parallel performance. In the past, they have gained new functionalities such as
lossless transmission and remote direct memory access that are now ubiquitous in high-performance systems. Prototypes of next generation network cards now
offer new features that facilitate device programming.
In this work, various possible uses of network offload features are explored. We use the Portals 4 interface specification as an example to demonstrate
various techniques such as fully asynchronous, multi-schedule asynchronous, and solo collective communications. MPI collectives are used as a proof of concept
for how to leverage our proposed semantics. In a solo collective, one or more processes can participate in a collective communication without being aware of it.
This semantic enables fully asynchronous algorithms. We discuss how the application of the solo collectives can improve the performance of iterative methods,
such as multigrid solvers. The results obtained show how this work may be used to accelerate existing MPI applications, but they also display how these techniques
can greatly ease programming of algorithms outside of the Bulk Synchronous Parallel (BSP) model.
Authors affiliation: ETH Zurich, Switzerland
Intel*, USA
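The triggered-operation style that enables such offloaded, fully asynchronous schedules can be mimicked in a few lines. The class below is a toy model of Portals-4-style counting events with made-up names, not the real C API: an operation is pre-installed with a threshold and fires when the counter reaches it, so a whole broadcast tree can run without further host involvement.

```python
class Counter:
    """Toy model of a counting event with triggered operations."""
    def __init__(self):
        self.value = 0
        self.triggered = []   # (threshold, action) pairs awaiting the counter

    def attach(self, threshold, action):
        self.triggered.append((threshold, action))
        self._fire()

    def increment(self, n=1):
        self.value += n
        self._fire()

    def _fire(self):
        ready = [(t, a) for (t, a) in self.triggered if self.value >= t]
        self.triggered = [(t, a) for (t, a) in self.triggered if self.value < t]
        for _, action in ready:
            action()

# A 4-process binomial-tree broadcast expressed as pre-installed put chains:
log = []
counters = {rank: Counter() for rank in range(4)}

def put(src, dst):
    log.append((src, dst))
    counters[dst].increment()   # message arrival bumps the destination counter

# Install the schedule up front: once a rank has received, it forwards.
counters[1].attach(1, lambda: put(1, 3))
counters[0].attach(1, lambda: [put(0, 1), put(0, 2)])
counters[0].increment()         # the root "starts" the collective
# log now holds the complete broadcast without any further host calls
```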
Session Chair: Xinyu Que
A Brief Introduction to the OpenFabrics Interfaces -
A New Network API for Maximizing High Performance Application Efficiency
P. Grun, S. Hefty†, S. Sur†, D. Goodell*, R. Russell♦, H. Pritchard♠, and J. Squyres*
OpenFabrics Interfaces (OFI) is a new family of application program interfaces that exposes communication services to middleware and applications. Libfabric is the
first member of this family of interfaces and was designed under the auspices of the OpenFabrics Alliance by a broad coalition of industry, academic, and national laboratory
partners over the past two years. Building and expanding on the goals and objectives of the verbs interface, libfabric is specifically designed to meet the performance and
scalability requirements of high performance applications such as Message Passing Interface (MPI) libraries, Symmetric Hierarchical Memory Access (SHMEM) libraries, Partitioned
Global Address Space (PGAS) programming models, Database Management Systems (DBMS) and enterprise applications running in a tightly coupled network environment. A key aspect of
libfabric is that it is designed to be independent of the underlying network protocols as well as the implementation of the networking devices. This paper provides a brief discussion
of the motivation for creating a new API and describes the novel requirements gathering process that was used to drive its design. Next, we provide a high level overview of the API
architecture and design, and finally we discuss the current state of development, release schedule and future work.
Authors affiliation: Cray, USA
Intel†, USA
Cisco*, USA
University of New Hampshire♦, USA
Los Alamos National Laboratory♠, USA
UCX: An Open Source Framework for HPC Network APIs and Beyond
P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin†, M. Dubman†, G. Shainer†, R. Graham†, L. Liss†,
Y. Shahar†, S. Potluri*, D. Rossetti*, D. Becker*, D. Poole*, C. Lamb*, S. Kumar♦, C. Stunkel♦, G. Bosilca♠, and A. Bouteiller♠
This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort
of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems.
UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains. We envision these APIs to satisfy the networking
needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, PGAS languages, task-based paradigms and I/O bound applications. We present the initial
design and architecture of UCX, and also, provide an in-depth discussion of the API and protocols that could be used to implement MPI and OpenSHMEM. To evaluate the design we
implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system
libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we
achieved a message exchange latency of 0.89 usec, a bandwidth of 6138.5 MB/sec, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth
and message rate achieved by any publicly known network stack on this hardware. UCX is open source, BSD-licensed software hosted on GitHub, currently accessible to collaborators,
and it will be publicly released in the near future.
Authors affiliation: ORNL, USA
Mellanox Technologies†
NVIDIA*, USA
IBM♦, USA
University of Tennessee Knoxville♠, USA
Session Chair: Ryan Grant
Intel Omni-Path Architecture: Enabling Scalable, High Performance Fabrics
Todd Rimmer (Invited Paper)
Session Chair: Mitch Gusat
HPC vs. Data Center Networks
Ron Brightwell, Sandia National Laboratories
Kevin Deierling, Mellanox
Dave Goodell, Cisco
Ariel Hendel, Broadcom
Katharine Schmidtke, Facebook
Keith Underwood, Intel
||Thursday, August 27 (Symposium)
||Breakfast and Registration
Session Chair: Vikram Dham
Recent Advances in Machine Learning and their Application to Networking
David Meyer, Brocade
Session Chair: Ada Gavrilovska
Run-time Strategies for Energy-efficient Operation of Silicon-Photonic NoCs
Over the past two decades, the general-purpose compute capacity of the world increased by 1.5x every year, and we will need to maintain this rate of growth to
support the increasingly sophisticated data-driven applications of the future. The computing community has migrated towards manycore computing systems with the goal
of improving the computing capacity per chip through parallelism while staying within the chip power budget. Energy-efficient data communication has been identified
as one of the key requirements for achieving this goal, and the silicon-photonic network-on-chip (NoC) has been proposed as one of the technologies that can meet this requirement.
Silicon-photonic NoC provides high bandwidth density, but the large power consumed in the laser sources and in tuning against on-chip thermal gradients has been a big
impediment towards its adoption by the wider community. In this talk, I'll present two run-time strategies for achieving energy-efficient operation of the silicon-photonic
NoCs in manycore systems. For reducing laser power, I'll present our approach of using cache and NoC reconfiguration at run time. The key idea here is to provide the minimum
L2 cache size and the minimum NoC bandwidth (i.e. the minimum number of active silicon-photonic links) required for an application to achieve the maximum possible performance
at any given point in time. For managing the thermal tuning power of the silicon-photonic NoC, I'll present a run-time job allocation technique that minimizes the temperature
gradients among the ring modulators/filters, reducing localized thermal tuning power consumption while maximizing network bandwidth and application performance.
Ajay Joshi, Boston University (Invited Talk)
Ajay Joshi received his Ph.D. degree from the ECE Department at Georgia Tech in 2006. He then worked as a postdoctoral researcher in the EECS Department at MIT until 2009. He
is currently an Assistant Professor in the ECE Department at Boston University. His research interests span across various aspects of VLSI design including circuits and architectures
for communication and computation, and emerging device technologies including silicon photonics and memristors. He received the NSF CAREER Award in 2012 and Boston University
ECE Department's Award for Excellence in Teaching in 2014.
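The laser-power policy the talk describes, activating only the minimum number of silicon-photonic links that still reaches peak performance, reduces to a one-line selection once per-link-count performance has been profiled. A sketch under that assumption; the profile dictionary below is made up for illustration:

```python
def min_links_for_max_perf(perf_by_links):
    """Smallest active-link count whose profiled performance equals the peak.

    perf_by_links maps a candidate number of active silicon-photonic
    links to the application performance measured at that NoC bandwidth.
    """
    best = max(perf_by_links.values())
    return min(k for k, perf in perf_by_links.items() if perf == best)

# Hypothetical profile: performance saturates at 4 links, so the policy
# keeps 4 links active and shuts off the laser power for the rest.
profile = {1: 0.55, 2: 0.80, 4: 0.99, 8: 0.99, 16: 0.99}
active_links = min_links_for_max_perf(profile)
```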
|Optics & NoC
Session Chair: Ada Gavrilovska
OWN: Optical and Wireless Network-on-Chips (NoCs) for Kilo-core Architectures
A. Kodi, A. Sikdar, A. Louri*, S. Kaya and M. Kennedy
Current trends of increasing core counts in chip multi-processors (CMPs) will continue, and kilo-core CMPs could be available within a decade. However, metallic
interconnects may not scale to kilo-core architectures due to longer hop counts, power inefficiency and increased execution time. Emerging technologies such as 3D stacking,
silicon photonics, and on-chip wireless interconnects are under serious consideration, as they show promising results for power-efficient, low-latency, scalable on-chip interconnects.
In this paper, we propose an architecture combining two emerging technologies, photonics and wireless, called Optical and Wireless Network-on-Chip (OWN). OWN integrates high-bandwidth,
low-latency silicon photonics with flexible on-chip wireless technology to develop a scalable, low-latency interconnect for kilo-core CMPs. Our simulations on synthetic traffic show that
a 1024-core OWN consumes 40.2% less energy per bit than wireless-only architectures and 23.2% more than photonics-only architectures for uniform traffic. OWN has higher throughput than the
fully wired network CMESH, the hybrid wireless network WCUBE and the hybrid photonic network ATAC for the synthetic traffic patterns uniform random and bit-reversal, and lower throughput than ATAC for matrix transpose.
Authors affiliation: Ohio University, USA
University of Arizona*, USA
AMON: Advanced Mesh-like Optical NoC
S. Werner and J. Navaridas
Optical Networks-on-Chip constitute a promising approach to tackle the power wall problem present in large-scale electrical NoC design. In order to enable their
adoption and ensure scalability, oNoCs have to be carefully designed with power and energy consumption in mind. In this paper, we propose AMON, an all-optical NoC
design based on passive microrings and Wavelength Division Multiplexing, including the switch architecture and a contention-free routing algorithm. The goal is to
obtain a design that minimizes the total number of wavelengths and microrings, the wiring complexity and the diameter. An analytical comparison with state-of-the-art
design proposals of all-optical NoCs shows that the proposed design can substantially improve the most important performance metrics: hop counts, chip area, energy and
power consumption. Our experimental work confirms that this improvement translates into higher performance and efficiency. Finally, our design provides a tile-based
structure which should facilitate VLSI integration when compared with recent ring-like solutions. In general we show it provides a more scalable solution than previous designs.
Authors affiliation: University of Manchester, UK
|Efficient Network Design
Session Chair: Fabrizio Petrini
Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-all Collective Algorithms
H. Subramoni, A. Venkatesh, K. Hamidouche, K. Tomko and D. K. Panda
Many high performance scientific applications rely on collective operations to move large volumes of data because of their versatility and ease of use. Such
data-intensive collectives have a notable impact on the execution time of the program, and hence on its energy consumption, owing to the amount of
memory/processor/network resources involved in the data movement. On the other hand, mechanisms such as offload and one-sided calls, backed by RDMA-enabled
interconnects like InfiniBand coupled with modern transport protocols like Dynamic Connected (DC), provide new ways to express collective communication.
However, there have been no efforts to fundamentally redesign collective algorithms from an energy standpoint using the rich plethora of existing and
upcoming communication mechanisms and transport protocols. In this paper, we take up this challenge and study the impact that RDMA- and transport-protocol-aware
designs can have on the energy and performance of dense collective operations like All-to-all. Through evaluation, we also identify that while a single transport
protocol may bring both performance and energy benefits for one application, it may not do so consistently for all applications. Motivated by this, we propose
designs that yield both benefits for all evaluated applications. Our experimental evaluation shows that our proposed designs are able to deliver up to 1.7X savings
in energy with little or no degradation in communication performance for All-to-all collective operations on modern HPC systems.
Authors affiliation: Ohio State University, USA
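For reference, the classic pairwise-exchange All-to-all that such designs start from can be written down as a communication schedule in a few lines. This sketches only the textbook algorithm, not the authors' DC- or RDMA-aware redesign:

```python
def pairwise_alltoall_schedule(nprocs):
    """Pairwise-exchange All-to-all schedule for a power-of-two rank count.

    In step s (1 <= s < nprocs), rank r exchanges its data block with
    rank r XOR s, so every pair of ranks meets exactly once.
    """
    assert nprocs > 0 and nprocs & (nprocs - 1) == 0, "power-of-two sketch"
    return [[(r, r ^ s) for r in range(nprocs)] for s in range(1, nprocs)]

schedule = pairwise_alltoall_schedule(4)
# Across the 3 steps, rank 0's peers are exactly ranks 1, 2 and 3.
```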
Implementing Ultra Low Latency Data Center Services with Programmable Logic
J. Lockwood and M. Monga
Data centers utilize a multitude of low-level network services to implement high-level applications. For example, scalable Key/Value Store (KVS) databases are implemented
by storing keys and values in memory on remote servers, then searching for values using keys sent over Ethernet. Most existing KVS databases run variations of memcached in software
and scale out to increase search throughput.
In this paper, we take an alternate approach by implementing an ultra-low-latency KVS database in Field Programmable Gate Array (FPGA) logic. As with a software-based KVS, the transaction
is sent over Ethernet to the machine that stores the value associated with that key. We find, however, that the implementation in logic scales up to provide much higher throughput with lower
latency and power consumption.
High-level applications store, replace, delete and search keys using standard KVS APIs. Our API hashes long keys into statistically unique identifiers and maps variable-length messages into a
finite set of fixed-size values. These keys and values are then formatted into a set of compact binary messages and transported over standard Ethernet to KVS servers in the data center.
When transporting messages over standard 10 Gigabit Ethernet and by processing OCSMs in an FPGA, the logic can search with a fiber-to-fiber latency of under 1 microsecond. Each KVS database
implemented as an FPGA core processes 150 Million Searches Per Second (MSPS) per 40 Gigabits/second of link speed. The FPGA KVS was measured to process messages 7x faster while using 13x less energy
than kernel-bypass software.
Authors affiliation: Algo-Logic Systems, USA
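The key-hashing step of such an API, collapsing arbitrary-length keys into statistically unique identifiers inside compact fixed-size binary messages, can be sketched in a few lines. The concrete layout below (1 type byte plus an 8-byte truncated SHA-256 digest) is an illustrative assumption, not Algo-Logic's actual wire format:

```python
import hashlib
import struct

def make_search_message(key, msg_type=0x01):
    """Hash an arbitrary-length key into a compact, fixed-size search message.

    The digest serves as a statistically unique identifier for the key;
    equal keys always map to equal messages.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()[:8]
    return struct.pack("!B8s", msg_type, digest)

# Any key, however long, becomes the same fixed-size 9-byte message:
msg = make_search_message("user/profiles/some/very/long/application/key")
```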
Enhanced Overloaded CDMA Interconnect (OCI) Bus Architecture for on-Chip Communication
K. Ahmed and M. Farag
On-chip interconnect is a major building block and the main performance bottleneck in modern complex System-on-Chips (SoCs). The bus topology and its derivatives are the
most deployed communication architectures in contemporary SoCs. Space switching, exemplified by crossbars and multiplexers, and time sharing are the key enablers of
various bus architectures. The crossbar has quadratic complexity, while resource sharing significantly degrades the overall system's performance. In this work we motivate
using Code Division Multiple Access (CDMA) as a bus sharing strategy which offers many advantages over other topologies. Our work seeks to complement the conventional CDMA
bus features by applying overloaded CDMA practices to increase the bus utilization efficiency.
We present the Difference-Overloaded CDMA Interconnect (D-OCI) bus that leverages the balancing property of the Walsh codes to increase the number of interconnected elements
by 50%. We advance two implementations of the D-OCI bus optimized for both speed and resource utilization. We validate the bus operation and compare the performance of the D-OCI
and conventional CDMA buses. We also present synthesis results for the UMC 0.13 µm design kit to give an idea of the maximum achievable bus frequency on ASIC platforms. Moreover,
we advance a proof-of-concept HLS implementation of the D-OCI bus on a Xilinx Zynq-7000 SoC and compare its performance, latency, and resource utilization to the ARM AXI bus. The performance
evaluation demonstrates the superiority of the D-OCI bus.
Authors affiliation: Alexandria University, Egypt
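The Walsh-code machinery behind a conventional CDMA bus is compact enough to sketch. The balancing (zero-sum) property of the non-trivial codes is what D-OCI's overloading exploits for its extra 50% of elements; the sketch below shows only the conventional synchronous CDMA bus, with made-up sender data:

```python
def walsh_codes(n):
    """Generate the n Walsh codes of length n (n a power of two) via the
    Sylvester Hadamard construction; chips are +1/-1."""
    assert n > 0 and n & (n - 1) == 0
    codes = [[1]]
    while len(codes) < n:
        codes = ([c + c for c in codes] +
                 [c + [-chip for chip in c] for c in codes])
    return codes

# Synchronous CDMA bus: each sender multiplies its bit (+1/-1) by its code;
# the bus carries the chip-wise sum; receivers correlate to recover the bits.
codes = walsh_codes(4)
bits = {1: +1, 2: -1, 3: +1}        # senders on codes 1..3 (code 0 is all-ones)
bus = [sum(bits[i] * codes[i][t] for i in bits) for t in range(4)]
recovered = {i: +1 if sum(b * c for b, c in zip(bus, codes[i])) > 0 else -1
             for i in bits}
# recovered == bits: code orthogonality separates the concurrent transfers
```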
Session Chair: Madeleine Glick
Facebook Network Architecture and Its Impact on Interconnects
The effect of networking at scale and its impact on the strategy for optical interconnect deployment will be described.
Katharine Schmidtke, Facebook (Invited Talk)
Katharine is currently responsible for optical technology strategy at Facebook. She has a Ph.D. in non-linear optics from Southampton University in the UK and
completed a postdoctoral fellowship at Stanford University. She has over 20 years of experience in the opto-electronics industry, including positions at Finisar and JDSU.
Session Chair: Ada Gavrilovska
NFV in Cloud Infrastructure - Ready or Not
VMware was formed out of a Stanford research project to virtualize the x86 architecture, a difficult feat considering the complexity of the x86 architecture. Over the past several
years, both Intel and AMD have added hardware support for virtualization extensions that helped improve performance and scale for hypervisors to support more workloads.
Today, about 75% of x86 server workloads run in virtual machines (VMs), primarily on data center infrastructure. Network Functions Virtualization (NFV) is about leveraging standard
virtualization technology to consolidate network equipment onto industry standard high volume servers, switches and storage. This is a natural path of evolution for telecommunications
companies, similar to their move from circuit to packet switching, or from ATM and Frame Relay to MPLS and IP. NFV will allow telcos to be more agile in meeting the rapidly increasing volume of
wireless traffic fueled by mobile devices and over-the-top services, and enable them to offer differentiated services of their own.
NFV workloads, however, are different from and more performance-demanding than typical IT workloads. As a result, it is tempting for telcos to use techniques like OS and hypervisor bypass to
achieve maximum performance, sometimes overlooking the flexibility, agility, security and other benefits that hypervisors are uniquely positioned architecturally to provide. The desire to
achieve maximum performance is sometimes misguided by using micro-benchmarks that are meaningless when translated to real NFV applications in a production environment.
In this talk we will look at a real-world use case of a telecommunications service and examine the trade-offs between performance and flexibility for NFV. We will outline how the tradeoffs
can affect the overall system design. Finally, we will provide our view of directions and call to action for solutions to address both goals.
Bhavesh Davda, VMware (Invited Talk)
Bhavesh Davda is a senior engineer in the CTO Office at VMware. He is currently leading the engineering efforts to make vSphere the best-in-class infrastructure to support Telco (NFV) workloads.
He also works on enabling virtualization of real-time applications, low latency applications, high rate packet processing, high performance computing, SR-IOV and RDMA, which have historically been
challenging to virtualize. Previously, he managed the vSphere Networking R&D team, where the team worked on the entire virtual networking stack from guest OS drivers for paravirtual NICs, to
virtualization of paravirtual and emulated NICs in the hypervisor, to efficient I/O implementation down to the physical NIC drivers and the TCP/IP stack in the vmkernel hypervisor for uses like VMotion,
Fault-Tolerance and NFS at the hypervisor level.
Bhavesh has over 20 years of experience in the systems software field. Before VMware, he was a Distinguished Member of Technical Staff at Avaya Labs in the operating systems group of their flagship
Enterprise Communications Manager product.
Fabrizio Petrini, General Chair
||Friday, August 28 (Tutorials)
||Breakfast and Registration
Accelerating Big Data Processing with Hadoop, Spark, and Memcached Over High-Performance Interconnects
D. K. Panda & Xiaoyi Lu, Ohio State University
ONOS Tutorial at Hot Interconnects
Thomas Vachuska, Madan Jampani, Ali Al-Shabibi & Brian O'Connor, ONOS
Flow and Congestion Control for High Performance Clouds: How to Design and Tune the Datacenter Fabric & SDN for Big Data
Mitch Gusat, IBM Research, Zurich Laboratory
Software-defined Wide-Area Networking: Challenges, Opportunities and Reality
Inder Monga, Energy Sciences Network & Srini Seetharaman, Infinera
|IEEE Micro: January 6
|Materials due: August 7
|Early Registration: August 14