Wednesday, August 22 (Symposium)
Breakfast and Registration
Torsten Hoefler, Technical Program Chair
Patrick Geoffray and Hamid Ahmadi, General Chairs
Session Chair: Patrick Geoffray
The Future Of Network Technology - What is Old, is New Again
John Roese, Huawei
Session Chair: Ada Gavrilovska
ParaSplit: A Scalable Architecture on FPGA for Terabit Packet Classification
J. Fong, X. Wang, Y. Qi, J. Li and W. Jiang
Packet classification is a fundamental enabling function for various applications in switches, routers and firewalls. Due to their
performance and scalability limitations, current packet classification solutions are insufficient in addressing the challenges from
the growing network bandwidth and the increasing number of new applications. This paper presents a scalable parallel architecture,
named ParaSplit, for high-performance packet classification. We propose a rule set partitioning algorithm based on range-point conversion
to reduce the overall memory requirement. We further optimize the partitioning by applying the Simulated Annealing technique. We implement
the architecture on a Field Programmable Gate Array (FPGA) to achieve high throughput by exploiting the abundant parallelism in the hardware.
Evaluation using real-life data sets shows that ParaSplit achieves a significant reduction in memory requirement, compared with state-of-the-art
algorithms such as HyperSplit and EffiCuts. Because of the memory efficiency of ParaSplit, our FPGA design can support in the on-chip memory
multiple engines, each of which contains up to 10K complex rules. As a result, the architecture with multiple ParaSplit engines in parallel
can achieve up to Terabit per second throughput for large and complex rule sets on a single FPGA device.
A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)
J. Lockwood, A. Gupte, N. Mehta, M. Blott, T. English and K. Vissers
Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters.
The high and unpredictable latency of these systems has led the trading world to explore alternative "hybrid" architectures with
hardware acceleration. In this paper, we describe how FPGAs are being used in electronic trading to approach the goal of zero latency.
We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides
pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example
financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10Gb/s Ethernet line rate with a fixed
end-to-end latency of 1 μs - up to two orders of magnitude lower than comparable software implementations.
Rx Stack Accelerator for 10 GbE Integrated NIC
F. Abel, F. Verplanken, C. Hagleitner
The miniaturization of CMOS technology has reached a scale at which server processors are starting to integrate multi-gigabit network interface
controllers. While transistors are becoming cheap and abundant in solid-state circuits, they remain a scarce resource on a processor die where ever
more cores and caches must share a fixed amount of silicon area and power. Therefore, a successful design candidate for integration must provide high
networking performance under high logic density and low power dissipation.
This paper describes the design of an integrated accelerator to offload computation-intensive protocol-processing tasks. The accelerator combines the
concepts of the transport-triggered architecture with a programmable finite-state machine to deliver high instruction-level parallelism, efficient multiway
branching and flexibility. The flexibility is key to adapt to protocol changes and address new applications.
This receive stack accelerator was used in the construction of an integrated quad-port 10 GbE host Ethernet adapter in 45-nm CMOS technology. The ratio of
performance (15 Mfps, 20 Gb/s throughput per port) to area (0.7 mm²) and the power consumption (0.15 W) of this accelerator are core enablers for integrating a
network adapter and a processor compute complex.
Traffic Generation and Scheduling
Session Chair: Christos Kolias
Caliper: Precise and Responsive Traffic Generator
M. Ghobadi, G. Salmon, Y. Ganjali, M. Labrecque and J. G. Steffan
This paper presents Caliper, a highly accurate
packet injection tool that generates precise and responsive traffic.
Caliper takes live packets generated on a host computer and
transmits them onto a gigabit Ethernet network with precise
inter-transmission times. Existing software traffic generators rely
on generic Network Interface Cards which, as we demonstrate,
do not provide high-precision timing guarantees. Hence, performing
valid and convincing experiments becomes difficult or
impossible in the context of time-sensitive network experiments.
Our evaluations show that Caliper is able to reproduce packet
inter-transmission times from a given arbitrary distribution while
capturing the closed-loop feedback of TCP sources. Specifically,
we demonstrate that Caliper provides three orders of magnitude
better precision compared to a commodity NIC: with requested
traffic rates up to the line rate, Caliper incurs an error of 8 ns
or less in packet transmission times. Furthermore, we explore
Caliper's ability to integrate with existing network simulators
to project simulated traffic characteristics into a real network
environment. Caliper is freely available online.
Weighted Differential Scheduler
H. Eberle and W. Olesinski
The Weighted Differential Scheduler (WDS) is a new scheduling discipline for accessing shared resources. The work described here was
motivated by the need for a simple weighted scheduler for a network switch where multiple packet flows are competing for an output port.
The scheme can be implemented with simple arithmetic logic and finite state machines.
We describe several versions of WDS that can merge two or more flows. An analysis reveals that WDS has lower jitter than any other
weighted scheduler known to us.
Session Chair: Torsten Hoefler
Cray High Speed Networking
Bob Alverson, Cray Inc.
Session Chair: Fredy Neeser
How to Compare Alternative Architectures
Radia Perlman, Intel (invited talk)
There are various aspects of network infrastructure that
are orthogonal, and therefore can be compared conceptually. Examples
include the syntax of encapsulation, how forwarding tables are
calculated, and whether forwarding tables are filled in proactively
or on demand when a new flow starts. This talk will explain these
concepts and show how various proposed architectures (such as TRILL,
VXLAN, OpenFlow, etc.) compare.
Moderator: Fabrizio Petrini
The Network is Moving into the Sockets
The recent acquisitions by Intel (FulcrumMicro, Qlogic and Cray's
networking division), combined with AMD's acquisition of SeaMicro, show
that the heat is on in the networking world.
There is a clear trend to move the network interface into the socket, with
performance, scalability and power reduction wins when the network sits
close to the processing engine.
This has dramatic implications in the data-center and high-performance
networking world. In this panel we will discuss how this trend could
change the future of upcoming data centers and network architectures.
Lloyd Dickman, Bay Storage Technology
Christian Bell, Myricom
Gilad Shainer, Mellanox
Moray McLaren, HP
Greg Thorson, SGI
Keith Underwood, Intel
Thursday, August 23 (Symposium)
Breakfast and Registration
Session Chair: Fabrizio Petrini
Power-Efficient, High-Bandwidth Optical Interconnects for High Performance Computing
Fuad Doany, IBM T. J. Watson
Session Chair: Torsten Hoefler
Portals 4: Enabling Application/Architecture Co-Design for High-Performance Interconnects
Ron Brightwell, Sandia National Laboratories (invited talk)
The Portals project has entered a third decade of research and
development in scalable, high-performance networking for large-scale
scientific parallel computing systems. Portals has evolved from its
inception as a component of early lightweight operating systems to
become an important vehicle for interconnect exploration. Unlike most
user-level network programming interfaces, Portals employs a building
block approach that encapsulates the semantic requirements of a broad
range of upper-level protocols needed to support high-performance
computing applications and services. This approach has also enabled
hardware designers to focus on developing components that accelerate
key functions in Portals, facilitating the application/architecture
co-design process. I will provide an overview of the latest version of
the Portals interconnect API and describe research activities aimed at
exploiting some recently added capabilities.
Performance Evaluation of Open MPI on Cray XE/XK Systems
S. Gutierrez, M. G. Venkata, N. Hjelm, R. Graham
Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of
Open MPI, however, lack support for the Cray XE6 and XK6 architectures, both of which use the Gemini System Interconnect. In this paper, we present
extensions to natively support these architectures within Open MPI; describe and propose solutions for performance and scalability bottlenecks; and provide
an extensive evaluation of our implementation, which is the first open-source MPI implementation for the Cray XE/XK system families used at 49,152 processes.
Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to the vendor-supplied MPI's.
Micro-benchmark results show short-data 1 byte and 1024 byte message latencies of 1.20 usec and 4.13 usec, which are 10.00% and 39.71% better than the vendor-supplied
MPI's, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, which is similar to the vendor-supplied MPI's bandwidth at the same message size.
Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales up to 49,152 cores where we exhibited similar
performance and scaling characteristics when compared to the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI,
which is on par with the vendor-supplied MPI's achieved parallel efficiency.
Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing System
J. Vienne, J. Chen, Md. Wasi-ur-Rahman, N. Islam, H. Subramoni, D. K. Panda
Communication interfaces of high performance computing
(HPC) systems and clouds have been continually
evolving to meet the ever increasing communication demands
being placed on them by HPC applications and
cloud computing middlewares (e.g., Hadoop). The PCIe
interfaces can now deliver speeds up to 128 Gbps (Gen3)
and high performance interconnects (10/40 GigE,
InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE
RDMA over Converged Ethernet) are capable of delivering
speeds from 10 to 54 Gbps. However, no previous study
has demonstrated how much benefit an end user in the
HPC / cloud computing domain can expect by utilizing
newer generations of these interconnects over older ones
or how one type of interconnect (such as IB) performs in
comparison to another (such as RoCE).
In this paper, we evaluate various high performance
interconnects over the new PCIe Gen3 interface with HPC
as well as cloud computing workloads. Our comprehensive
analysis, done at different levels, provides a global scope
of the impact these modern interconnects have on the
performance of HPC applications and cloud computing
middlewares. The results of our experiments show that the
latest InfiniBand FDR interconnect gives the best performance
for HPC as well as cloud computing applications.
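The headline rates quoted in this abstract can be reproduced with simple arithmetic. The sketch below is not from the paper: it assumes a x16 PCIe Gen3 link, 4x InfiniBand links, and the standard per-lane signaling rates and line codes for each generation. The helper name data_rate_gbps is hypothetical.

```python
# Back-of-the-envelope link rates. Lane counts, signaling rates and line
# codes are assumptions based on the published PCIe and InfiniBand
# parameters, not values taken from the paper.

def data_rate_gbps(lanes, gbaud_per_lane, payload_bits, coded_bits):
    """Usable bit rate in Gb/s after subtracting line-code overhead."""
    return lanes * gbaud_per_lane * payload_bits / coded_bits

# PCIe Gen3 x16: 8 GT/s per lane. 128 Gb/s is the raw figure,
# before the 128b/130b line code is accounted for.
pcie_gen3_raw = 16 * 8.0
pcie_gen3_data = data_rate_gbps(16, 8.0, 128, 130)

# InfiniBand 4x QDR: 10 Gbaud per lane with 8b/10b coding.
ib_qdr = data_rate_gbps(4, 10.0, 8, 10)

# InfiniBand 4x FDR: 14.0625 Gbaud per lane with 64b/66b coding.
ib_fdr = data_rate_gbps(4, 14.0625, 64, 66)

print(f"PCIe Gen3 x16: {pcie_gen3_raw:.0f} Gb/s raw, {pcie_gen3_data:.1f} Gb/s after coding")
print(f"IB QDR 4x: {ib_qdr:.0f} Gb/s, IB FDR 4x: {ib_fdr:.1f} Gb/s")
```

Under these assumptions the QDR and FDR results line up with the 32 Gbps and 54 Gbps figures above, and the 128 Gbps PCIe figure corresponds to the raw x16 rate.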
Routing and Switching
Session Chair: John Lockwood
Electronic-Photonic Integration within Switches and Routers
Mike Watts, MIT (invited talk)
We review recent successes in silicon photonics and how the new
capabilities afforded by silicon photonics will impact future Ethernet,
Infiniband, and ultimately optical domain switches and routers.
Specifically, we consider the impact silicon photonics can have on the
cost, bandwidth, radix, and power consumption scaling of future switches
Bufferless Routing in Optical Gaussian Macrochip Interconnect
Z. Zhang, Z. Guo and Y. Yang
In this paper, we study bufferless routing in a novel optical multichip system, called Gaussian macrochip, where embedded chips are interconnected
by an optical Gaussian network. By taking advantage of the underlying Hamiltonian cycles in the Gaussian network, we design a bufferless routing algorithm
for the Gaussian macrochip, which routes packets along the shortest path in the absence of deflection, and guarantees that deflected packets reach their
destinations along a segment of the Hamiltonian cycle. Our extensive simulation results demonstrate that by adopting the proposed routing algorithm,
Gaussian macrochip can support much higher inter-chip communication bandwidth, has much shorter average packet delay, and is more power efficient than
the previously proposed architectures for optical multichip systems.
Occupancy Sampling for Terabit CEE Switches
F. Neeser, N. Chrysos, R. Clauberg, D. Crisan, M. Gusat, C. Minkenberg, K. Valk, C. Basso
One consequential feature of Converged Enhanced Ethernet (CEE) is losslessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN).
We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch; however, as future switches
scale to higher port counts and link speeds, purely output-queued or shared memory architectures lead to excessive memory bandwidth requirements; moreover, PFC typically requires
dedicated buffers per input.
Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's
side effects. We install QCN congestion points (CPs) at input buffers (e.g. VOQs) and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims.
Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method not requiring any per-flow state.
For CPs with arbitrarily scheduled buffers, QCN-OS is shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.
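The intuition behind sampling by occupancy can be shown with a toy model. This is only an illustration of the general idea, not the QCN-OS scheme from the paper: it assumes the buffer is a flat list of per-packet flow labels, and the function names and the "culprit"/"victim" snapshot are hypothetical. Picking one buffered packet uniformly at random selects each flow with probability equal to its share of the buffer, without keeping any per-flow counters.

```python
import random
from collections import Counter

def sample_occupancy(buffer_flows, rng):
    """Pick one buffered packet uniformly at random and return its flow id.

    No per-flow state is kept: a flow is chosen with probability equal to
    its share of the current buffer occupancy.
    """
    if not buffer_flows:
        return None
    return buffer_flows[rng.randrange(len(buffer_flows))]

def marking_shares(buffer_flows, samples, seed=0):
    """Fraction of samples that land on each flow in a fixed snapshot."""
    rng = random.Random(seed)
    hits = Counter(sample_occupancy(buffer_flows, rng) for _ in range(samples))
    return {flow: hits[flow] / samples for flow in set(buffer_flows)}

# Hypothetical snapshot: one flow holds 90 of the 100 buffered packets.
snapshot = ["culprit"] * 90 + ["victim"] * 10
shares = marking_shares(snapshot, samples=10_000)
print(shares)
```

In this toy snapshot the flow occupying most of the buffer receives most of the samples, whatever order the packets arrived in.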
Session Chair: Dan Pitt
Software Defined Networks will tame complex networks
Nick McKeown, Stanford University
Moderator: Matt Palmer
Two decades of 'closed' switch and router designs have led to networking R&D 'ossification', possibly constraining academic and startup innovation. The advent of SDN,
virtual overlays and OpenFlow in the context of datacenter networking (DCN) now challenges the status quo with new designs and players.
Once every generation, conflicting forces churn a field, triggering opportunities for innovation and re-adjusting the balance of power, e.g.: Is SDN and/or OpenFlow becoming
the Linux of networking? Is SDN equivalent to OF? Should their APIs be standardized, or should the market decide? How can we build 1M-node DCNs with centralized controls?
What's the impact of SDN and OF on the highly optimized PoDs, vs. the generic vendor fabric? Is the pendulum over-swinging from distributed towards centralized? How about
Dave Meyer, Cisco
Kireeti Kompella, Juniper
Jeff Mogul, HP
Vijoy Pandey, IBM
Dimitri Stiliadis, Lucent