

Wednesday, August 21 (Symposium)
8:00-8:45 Breakfast and Registration
8:45-9:00 Intro
Cyriel Minkenberg and Sudipta Sengupta, Technical Program Chairs
Madeleine Glick, Torsten Hoefler and Fabrizio Petrini, General Chairs
9:00-9:10 Welcome Address
Claudio DeSanti, Cisco Fellow
Keynote I
Session Chair: Cyriel Minkenberg

Scale and Programmability in Google's Software Defined Data Center WAN
Amin Vahdat, UCSD/Google
10:10-10:30 Morning Break
On-Chip Communications
Session Chair: Ada Gavrilovska
  • Deterministic Multiplexing of NoC on Grid CMPs
    As the number of cores in a chip has increased over the past several years, inter-core communication has become a bottleneck. Traditional bus architectures cannot handle the traffic load of the increasingly large number of communicating units on chip. A nanophotonic Network-on-Chip (NoC) is a proposed solution, motivated by recent research indicating higher throughput than electronic NoCs. In order to avoid optical/electronic/optical conversion at intermediate routers, all-to-all connectivity can be achieved through Time-Division Multiplexing (TDM). Previous work has focused on a non-deterministic approach to determining the multiplexing schedule in optically connected mesh NoCs. Such an approach, however, produces an irregular schedule that is not scalable, especially if TDM is to be combined with wavelength-division multiplexing (WDM) and space-division multiplexing to reduce communication delay. In this work, we present a regular multiplexing schedule for all-to-all connectivity which is at least as efficient as the previously introduced irregular schedule. Moreover, because of its regularity and systematic construction, our schedule is scalable to arbitrary-size meshes and allows for efficient combination of TDM, WDM and space-division multiplexing (the use of multiple NoCs).
    J. Carpenter and R. Melhem
  • Minimizing Delay in Shared Pipelines
    Pipelines are widely used to increase throughput in multi-core chips by parallelizing packet processing. Typically, each packet type is serviced by a dedicated pipeline. However, with the increase in the number of packet types and the number of services they require, there are not enough cores for dedicated pipelines. In this paper, we study pipeline sharing, such that a single pipeline can serve several packet types. Pipeline sharing decreases the total number of cores needed, but typically increases pipeline lengths and therefore packet delays. We consider the optimization problem of allocating cores among packet types such that the average delay is minimized. We suggest a polynomial-time algorithm that finds the optimal solution when the packet types satisfy a specific property, and present a greedy algorithm for the general case. Finally, we evaluate our solutions on synthetic examples, on packet-processing applications, and on real-life H.264 standard requirements.
    O. Rottenstreich, I. Keslassy, Y. Revah and A. Kadosh
  • Heterogeneous Multi-processor Coherent Interconnect
    The rapid increase in processor and memory integration onto a single die continues to place increasingly complex demands on the interconnect network. In addition to providing low-latency, high-speed and high-bandwidth access from all processors to all shared resources, the burdens of hardware cache coherence and resource virtualization are being placed upon the interconnect as well. This paper describes a multi-core shared memory controller interconnect (MSMC) which supports up to 12 processors, 8 independent banks of IO-coherent on-chip shared SRAM, an IO-coherent external memory controller, and high-bandwidth IO connections to the SoC infrastructure. MSMC also provides basic IO address translation and memory protection for the on-chip shared SRAM and external memory, as well as soft error (SER) protection with hardware scrubbing for the on-chip memory. MSMC formed the heart of the compute cluster for a 28-nm CMOS device comprising 8 Texas Instruments C66x DSP processors and 4 cache-coherent ARM A15 processors sharing 6 MB of on-chip SRAM running at 1.3 GHz. At this speed, MSMC gives all connected masters a combined read/write bandwidth of nearly 1 TB/s into the interconnect and a combined read/write bandwidth of 457.6 GB/s to all shared resources, in an area of 16 mm^2.
    K. Chirca, M. Pierson, J. Zbiciak, D. Thompson, D. Wu, S. Myilswamy, R. Griesmer, K. Basavaraj, T. Huynh, A. Dayal, J. You, P. Eyres, Y. Ghadiali, T. Beck, A. Hill, N. Bhoria, D. Bui, J. Tran, M. Rahman, H. Fei, S. Jagathesan and T. Anderson
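The abstract of "Deterministic Multiplexing of NoC on Grid CMPs" above does not spell out its schedule. As background, the classic round-robin construction below achieves all-to-all connectivity in N-1 TDM slots; this is a sketch of the textbook schedule for a fully connected set of nodes, not the paper's mesh-specific construction, and the function name is ours.

```python
def tdm_all_to_all_schedule(n):
    """Round-robin TDM: in slot t (1 <= t < n), node i sends to (i + t) % n.

    Every ordered pair of distinct nodes is served exactly once in n-1 slots.
    Illustrative textbook construction, not the paper's mesh-specific schedule.
    """
    return [[(i, (i + t) % n) for i in range(n)] for t in range(1, n)]

schedule = tdm_all_to_all_schedule(4)
# 3 slots of 4 sender/receiver pairs cover all 12 ordered pairs exactly once.
pairs = [p for slot in schedule for p in slot]
assert len(pairs) == len(set(pairs)) == 12
```

The regularity is the point: the slot in which (i, j) communicates is simply (j - i) mod n, so the schedule needs no table and scales with the network size.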
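For "Minimizing Delay in Shared Pipelines" above, the objective being minimized can be sketched as follows; the function names and the simplification that a packet's delay equals the length of its pipeline are our illustrative assumptions, not taken from the paper.

```python
def average_delay(pipeline_lengths, type_rates, assignment):
    """Rate-weighted average packet delay when packet type t is served by
    pipeline assignment[t]. Toy model: a packet's delay is the length of
    the (possibly shared) pipeline that serves its type."""
    total_rate = sum(type_rates)
    return sum(rate * pipeline_lengths[assignment[t]]
               for t, rate in enumerate(type_rates)) / total_rate

# Two dedicated pipelines vs. one merged pipeline serving both packet types:
dedicated = average_delay([3, 5], [1.0, 1.0], [0, 1])  # separate pipelines
shared = average_delay([7], [1.0, 1.0], [0, 0])        # one shared pipeline
assert dedicated == 4.0 and shared == 7.0  # sharing saves cores, costs delay
```

The core-allocation problem the paper studies is choosing the assignment (and the resulting pipeline lengths) to minimize this quantity under a core budget.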
12:00-13:00 Lunch
Session Chair: Rami Melhem
  • TCP Pacing in Data Center Networks
    This paper studies the effectiveness of TCP pacing in data center networks. TCP senders inject bursts of packets into the network at the beginning of each round-trip time. These bursts stress the network queues, which may cause loss, reduced throughput and increased latency. Such undesirable effects become more pronounced in data center environments, where traffic is bursty in nature and buffer sizes are small. TCP pacing is believed to reduce the burstiness of TCP traffic and to mitigate the impact of small buffers in routers. Unfortunately, the research literature has not always agreed on the overall benefits of pacing. In this paper, we present a model for the effectiveness of pacing. Our model demonstrates that for a given buffer size, as the number of concurrent flows is increased beyond a Point of Inflection (PoI), non-paced TCP outperforms paced TCP. We present lower and upper bounds for the PoI and argue that increasing the number of concurrent flows beyond the PoI increases inter-flow burstiness of paced packets and reduces the effectiveness of pacing. We validate our model using a novel and practical implementation of paced TCP in the Linux kernel and perform several experiments in a testbed.
    M. Ghobadi and Y. Ganjali
  • Clustered Linked List Forest for IPv6 Lookup
    Providing a high operating frequency and abundant parallelism, Field Programmable Gate Arrays (FPGAs) are the most promising platform on which to implement SRAM-based pipelined architectures for high-speed Internet Protocol (IP) lookup. Owing to the restrictions of state-of-the-art FPGAs on the number of I/O pins and the amount of on-chip memory, existing approaches can hardly accommodate the large and sparsely distributed IPv6 routing tables. Memory-efficient data structures are therefore in high demand. In this paper, a clustered linked list forest (CLLF) data structure is proposed for solving the longest prefix matching (LPM) problem in IP lookup. Our structure, comprising multiple parallel linked lists of prefix nodes, achieves significant memory compaction in comparison to existing approaches. The CLLF data structure is implemented on a high-throughput SRAM-based parallel and pipelined architecture on FPGAs. Utilizing a state-of-the-art FPGA device, the CLLF architecture can accommodate up to 686K IPv6 prefixes while supporting fast incremental routing table updates.
    O. Erdem and A. Carus
  • HybridCuts: A Scheme Combining Decomposition and Cutting for Packet Classification
    Packet classification is an enabling function for a variety of Internet applications such as access control, quality of service and differentiated services. Decision-tree and decomposition approaches are the most well-known algorithmic approaches. Compared to architectural solutions, both are memory- and performance-inefficient, falling short of the needs of high-speed networks. EffiCuts, the state-of-the-art decision-tree technique, significantly reduces the memory overhead of classic cutting algorithms with separated trees and equi-dense cuts. However, it suffers from too many memory accesses and a large number of separated trees. Moreover, EffiCuts needs comparator circuitry to support equi-dense cuts, which makes it less practical. Decomposition-based schemes, such as BV, can leverage the parallelism offered by modern hardware for memory accesses, but they have poor storage scalability. In this paper, we propose HybridCuts, a combination of decomposition and decision-tree techniques that improves storage and performance simultaneously. The decomposition part of HybridCuts has the benefits of traditional decomposition-based techniques without the trouble of aggregating results from a large number of bit vectors or a set of big lookup tables. Meanwhile, thanks to a clever partitioning of the rule set, an efficient cutting algorithm following the decomposition can build short decision trees with a significant reduction in rule replication. Using ClassBench, we show that HybridCuts achieves memory reduction similar to EffiCuts, but outperforms EffiCuts significantly in terms of memory accesses for packet classification. In addition, HybridCuts is more practical to implement than EffiCuts, which maintains complicated data structures, takes a huge amount of time for tree merging, and requires special hardware support for efficient cuts.
    W. Li and X. Li
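The core mechanism studied in "TCP Pacing in Data Center Networks" above reduces to one line of arithmetic: spread a window of packets evenly over one RTT instead of sending it as a burst. A minimal sketch, with function names of our own choosing:

```python
def pacing_interval(rtt_s, cwnd_pkts):
    """Even pacing: one window of cwnd_pkts packets is spread over one RTT,
    so consecutive packets depart this many seconds apart."""
    return rtt_s / cwnd_pkts

def send_times(start_s, rtt_s, cwnd_pkts):
    """Departure time of each packet in the window under even pacing."""
    gap = pacing_interval(rtt_s, cwnd_pkts)
    return [start_s + k * gap for k in range(cwnd_pkts)]

# 10 packets over a 100 us RTT leave 10 us apart rather than back-to-back.
times = send_times(0.0, 100e-6, 10)
assert abs(times[1] - times[0] - 10e-6) < 1e-12
```

The paper's point is that this smoothing helps only up to a point: with enough concurrent flows, the paced packets of different flows interleave into new bursts.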
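"Clustered Linked List Forest for IPv6 Lookup" above targets the longest prefix matching (LPM) problem; since the abstract does not detail the CLLF structure itself, the sketch below shows only the baseline problem it accelerates, using an illustrative flat table keyed by prefix length.

```python
def longest_prefix_match(table, addr_bits):
    """Naive LPM baseline: probe prefix lengths longest-first against a dict
    keyed by (length, prefix-bit-string). Illustrates the problem only; the
    CLLF structure replaces this with parallel linked lists of prefix nodes."""
    for length in range(len(addr_bits), -1, -1):
        next_hop = table.get((length, addr_bits[:length]))
        if next_hop is not None:
            return next_hop
    return None

# Hypothetical routing table: default route plus /3 and /5 prefixes.
table = {(0, ""): "default", (3, "101"): "nh1", (5, "10110"): "nh2"}
assert longest_prefix_match(table, "10110111") == "nh2"
assert longest_prefix_match(table, "10100000") == "nh1"
```

For IPv6 the naive per-length probing is exactly what makes memory-efficient, pipelined structures attractive: 128 possible prefix lengths make flat tables impractical.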
14:30-15:00 Afternoon Break
Session Chair: Cyriel Minkenberg

Architecture and Performance of the Tilera TILE-Gx8072 Manycore Processor
This talk describes the Tilera TILE-Gx processor architecture, discusses the design choices, and presents performance results on representative applications for the 72-core TILE-Gx72™, the flagship processor in Tilera's TILE-Gx™ family. This processor family comprises a series of high-performance, low-power 64-bit manycore processor SoCs, tightly coupled with high-performance packet processing. These highly integrated processors deliver exceptional performance and performance-per-watt in the embedded networking, cyber security, and high-throughput computing markets. Of particular interest is the iMesh network-on-chip, which scales to hundreds of cores and provides high-speed interconnection of all on-die elements and cache coherence across the chip.
Matthew Mattina, CTO Tilera
Mr. Mattina is the Chief Technology Officer at Tilera and is responsible for processor strategy and technology. As processor architect at Tilera, he co-led the design of the 64-core TILE-Pro and the 9- to 72-core TILE-Gx processor families. Prior to Tilera, Mr. Mattina was with Intel Corporation, where he was co-lead architect for the Tukwila Multicore Processor, supervising a team of architects and designers. At Intel, Mr. Mattina invented and designed the Intel Ring Uncore Architecture, used across Intel's x86 multicore processor designs. This technology won the Intel Achievement Award in 2010. Prior to Intel, he was an architect and circuit design engineer at Digital Equipment Corporation, working on the Alpha EV7 and EV8 processors. Mr. Mattina also served as Technical Leader at Cisco Systems in the TelePresence Infrastructure Business Unit, where he contributed to the hardware and software design of next-generation high-definition video conferencing products. He has been granted over 20 patents and has published journal and conference papers relating to CPU design, multicore processors, and cache coherence protocols. Mr. Mattina holds a BS in Computer and Systems Engineering from Rensselaer Polytechnic Institute and an MS in Electrical Engineering from Princeton University.
Session Chair: Dan Pitt

Overview and Next Steps for the Open Compute Project

Billions of people and their many devices will be coming online in the next decade, and those who are already online are living ever-more connected lives. The industry is building out a huge physical infrastructure to support this growth, but we are doing so in a largely closed fashion, inhibiting the pace of innovation and preventing us from achieving the kinds of efficiencies that might otherwise be possible.

In this talk, John Kenevey will provide an overview of the Open Compute Project, a thriving consumer-led community dedicated to promoting more openness and a greater focus on scale, efficiency, and sustainability in the development of infrastructure technologies. John will give a brief history of the project and describe its vision for the future, focusing on a new project within OCP to develop an open network switch.

John Kenevey, Facebook & the Open Compute Project
John has 18 years of experience in the technology sector, spanning startups and small-, medium- and large-cap technology companies. In 2011 John initiated, orchestrated and founded the Open Compute Project. Two years into the project, OCP has gained traction across the supplier ecosystem, with the likes of HP, Dell, AMD and Intel joining and contributing to the project. Currently, John has shifted his focus to building out an OCP incubation channel to exploit the growing opportunity that the Open Compute Project has created. John advises several startups in Silicon Valley. He holds a Master's degree in Economics from University College Dublin.
Keynote II
Session Chair: Torsten Hoefler

Networking as a Service
Tom Anderson, University of Washington
Keynote III
Session Chair: Christos Kolias

The Network is the Cloud
David Yen, SVP/GM Data Center Group, Cisco
18:00-19:00 Head Bubba Memorial Cocktail Reception
Thursday, August 22 (Symposium)
8:00-9:00 Breakfast and Registration
Keynote IV
Session Chair: Madeleine Glick

Hybrid Datacenter Networks
George Papen, University of California, San Diego
10:10-10:30 Morning Break
OpenFlow and High-Performance Computing
Session Chair: Mohammad Alizadeh
  • Efficient Security Applications Implementation in OpenFlow Controller with FleXam
    Current OpenFlow specifications provide limited access to packet-level information such as packet content, making it very inefficient, if not impossible, to deploy security and monitoring applications as controller applications. In this paper, we propose FleXam, a flexible sampling extension for OpenFlow designed to provide access to packet level information at the controller.
    FleXam's simplicity makes it easy to implement in OpenFlow switches and to operate at line rate without requiring any additional memory. At the same time, its flexibility allows various monitoring and security applications to be implemented in the controller while maintaining a balance between overhead and the level of detail in the collected information. FleXam realizes the advantages of both proactive and reactive routing schemes by providing a tunable trade-off between the visibility of individual flows and the controller load. As an example, we demonstrate how FleXam can be used to implement a port-scan detection application with extremely low overhead.
    S. Shirali-Shahreza and Y. Ganjali
  • OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management
    Dragonfly networks are appealing topologies for large-scale datacenter and HPC networks, providing high throughput with low diameter and moderate cost. However, they are prone to congestion under certain frequent traffic patterns that saturate specific network links. Adaptive non-minimal routing can be used to avoid such congestion. That kind of routing employs longer paths to circumvent local or global congested links. However, if a distance-based deadlock avoidance mechanism is employed, more Virtual Channels (VCs) are required, which increases design complexity and cost. OFAR (On-the-Fly Adaptive Routing) is a routing proposal that decouples virtual channels from deadlock avoidance, making local and global misrouting affordable. However, congestion is more severe with OFAR because it relies on an escape network with low bisection bandwidth. Additionally, OFAR allows for unlimited misrouting on the escape subnetwork, leading to unbounded paths in the network and long latencies. In this paper we propose and evaluate OFAR-CM, a variant of OFAR combined with a simple congestion management (CM) mechanism which relies only on local information, specifically the credit count of the output ports in the local router. With simple escape networks such as a Hamiltonian ring or a tree, OFAR-CM outperforms former proposals with distance-based deadlock avoidance. Additionally, although long paths are allowed in theory, in practice packets arrive at their destination in a small number of hops. Altogether, OFAR-CM constitutes the first practicable mechanism to date supporting both local and global misrouting in Dragonfly networks.
    M. Garcia, E. Vallejo, R. Beivide, M. Valero and G. Rodriguez
  • Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
    The emergence of co-processors such as the Intel Many Integrated Core (MIC) is changing the landscape of supercomputing. The MIC is a memory-constrained environment and its processors operate at slower clock rates. Further, the communication characteristics between MIC processes differ from those between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally re-design collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs improve the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe up to 52.4% improvement in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.
    K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri and D. K. Panda
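The sampling idea in "Efficient Security Applications Implementation in OpenFlow Controller with FleXam" above can be pictured as a toy flow-aware sampler; the first-k-then-probabilistic policy and the parameter names below are our assumptions for illustration, not FleXam's actual specification.

```python
import random

def sample_packet(pkt_index_in_flow, first_k, prob):
    """Toy flow-aware sampler: always pick the first k packets of a flow
    (useful for e.g. port-scan detection, where early packets matter most),
    then sample the remainder with probability prob.
    Policy and parameters are illustrative assumptions, not FleXam's spec."""
    if pkt_index_in_flow < first_k:
        return True
    return random.random() < prob

# With prob=0.0, only the first k packets of each flow reach the controller,
# bounding controller load regardless of flow length.
assert sample_packet(0, 3, 0.0) is True
assert sample_packet(7, 3, 0.0) is False
```

The tunable trade-off the abstract mentions corresponds to turning these knobs: larger first_k or prob gives more visibility at higher controller load.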
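"Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters" above optimizes collectives whose classic baseline is recursive doubling; the simulation below shows that baseline for a power-of-two number of ranks, as background only, not the paper's MIC-specific designs.

```python
def allreduce_recursive_doubling(values):
    """Simulate recursive-doubling allreduce (sum) for a power-of-two number
    of ranks: in round k, rank r exchanges its partial sum with rank r XOR 2^k,
    so after log2(n) rounds every rank holds the global sum.
    Textbook baseline, not the paper's MIC-optimized design."""
    n = len(values)
    assert n & (n - 1) == 0, "power-of-two rank counts only"
    bufs = list(values)
    k = 1
    while k < n:
        bufs = [bufs[r] + bufs[r ^ k] for r in range(n)]
        k <<= 1
    return bufs  # every rank now holds the global sum

assert allreduce_recursive_doubling([1, 2, 3, 4]) == [10, 10, 10, 10]
```

Each round halves the remaining distance in the rank space, which is why the baseline takes log2(n) communication steps; the paper's contribution is re-mapping such steps onto the asymmetric host/MIC communication channels.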
12:00-13:00 Lunch
Short Papers
Session Chair: Patrick Geoffray
  • On the Data Path Performance of Leaf-Spine Datacenter Fabrics
    Modern datacenter networks must support a multitude of diverse and demanding workloads at low cost, and even the simplest architectural choices can impact mission-critical application performance. This forces network architects to continually evaluate tradeoffs between ideal designs and pragmatic, cost-effective solutions. In real commercial environments, the number of parameters the architect can control is fairly limited, typically including only the choice of topology, link speeds, oversubscription, and switch buffer sizes. In this paper we provide some guidance to the network architect about the impact these choices have on data path performance. We analyze the behavior of Leaf-Spine topologies under realistic traffic workloads via extensive simulations, and identify what is important for performance and what is not. We present intuitive arguments that explain our findings and provide a framework for reasoning about different design tradeoffs.
    M. Alizadeh and T. Edsall
  • Can Parallel Replication Benefit HDFS for High-Performance Interconnects?
    The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications and has been adopted as the underlying file system of numerous data-intensive applications due to its reliability and fault tolerance. HDFS provides fault tolerance and availability by replicating each data block to three (the default replication factor) DataNodes. The current implementation of HDFS in Apache Hadoop supports pipelined replication, which introduces increased latency for real-time, latency-sensitive applications. In this paper, we introduce an alternative parallel replication scheme in both the socket-based and the RDMA-based design of HDFS over InfiniBand. Parallel replication allows the client to write all the replicas in parallel. We analyse the challenges and issues of parallel replication and compare its performance with pipelined replication. With modern high-performance networks, parallel replication can offer much better response times for latency-sensitive applications than pipelined replication. Experimental results show that parallel replication can reduce the execution time of the TeraGen benchmark by up to 16% over IPoIB (IP over InfiniBand), 10GigE and RDMA (Remote Direct Memory Access) over InfiniBand. The throughput of the TestDFSIO benchmark is also increased by 12% over high-performance interconnects such as IPoIB, 10GigE and RDMA over InfiniBand. Parallel replication can also enhance HBase Put operation performance by 17% for the above-mentioned interconnects and protocols. However, for slower networks such as 1GigE, and for smaller data sizes, parallel replication does not improve performance.
    N. Islam, X. Lu, M. Rahman and D. K. Panda
  • Interconnect for Tightly Coupled Accelerators Architecture
    In recent years, heterogeneous clusters using accelerators have been widely used in high-performance computing systems. In such clusters, inter-node communication among accelerators requires several memory copies via CPU memory, and the communication latency causes severe performance degradation. To address this problem, we propose the Tightly Coupled Accelerators (TCA) architecture to reduce the communication latency between accelerators on different nodes. In the TCA architecture, PCI Express packets are used directly for communication among accelerators across nodes. In addition, we designed a communication chip, named the PEACH2 chip, to realize the TCA architecture. In this paper, we introduce the design and implementation of the PEACH2 chip using an FPGA, and present the PEACH2 board, a PCI Express extension board. A GPU cluster with several tens of nodes based on the TCA architecture will be installed in our center, and this system will demonstrate the effectiveness of the TCA architecture.
    T. Hanawa, Y. Kodama, T. Boku and M. Sato
  • Low latency scheduling algorithm for Shared Memory Communication over optical networks
    Optical Networks-on-Chip (NoCs) based on silicon photonics have been proposed to reduce latency and power consumption in future chip multi-processors (CMPs). However, high-performance CMPs use a shared memory model which generates large numbers of short messages, typically of the order of 8-256 B. Messages of this length create high overhead for optical switching systems due to arbitration and switching times. Current schemes only start the arbitration process when a message arrives at the input buffer of the network. In this paper, we propose a scheme which intelligently uses information from the memory controllers to schedule optical paths. We identified predictable patterns of messages associated with memory operations for a 32-core x86 system using the MESI coherency protocol. We use the first message of each pattern to open the optical paths that will be used by all subsequent messages, thereby eliminating arbitration time for the latter. Without considering the initial request message, this scheme can therefore reduce the time of flight of a data message in the network by 29% and that of a control message by 67%. We demonstrate the benefits of this scheduling algorithm for applications in the PARSEC benchmark suite, with overall average reductions in overhead latency per message of 31.8% for the streamcluster benchmark and up to 70.6% for the swaptions benchmark.
    M. Madarbux, P. Watts and A. Van Laer
  • Bursting Data between Data Centers: Case for Transport SDN
    Public and private enterprise clouds are changing the nature of WAN data center interconnects. Datacenter WAN interconnects today are pre-allocated, static optical trunks of high capacity. These optical pipes carry aggregated packet traffic originating from within the datacenters, while routing decisions are made by devices at the datacenter edges. In this paper, we propose a software-defined networking enabled optical transport architecture (Transport SDN) that meshes seamlessly with the deployment of SDN within the data centers. The proposed programmable architecture abstracts a core transport node into a programmable virtual switch that leverages the OpenFlow protocol for control. A demonstration use case of an OpenFlow-enabled optical virtual switch managing a small optical transport network for a big-data application is described. With appropriate extensions to OpenFlow, we discuss how the programmability and flexibility SDN brings to packet-optical datacenter interconnects will be instrumental in solving some of the complex multi-vendor, multi-layer, multi-domain issues that hybrid cloud providers face.
    A. Sadasivarao, S. Syed, P. Pan, C. Liou, A. Lake, C. Guok and I. Monga
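Among the knobs listed in "On the Data Path Performance of Leaf-Spine Datacenter Fabrics" above, oversubscription is simple arithmetic worth making explicit; the port counts in the example are hypothetical.

```python
def leaf_oversubscription(host_ports, host_gbps, uplink_ports, uplink_gbps):
    """Leaf oversubscription ratio: aggregate host-facing bandwidth divided
    by aggregate uplink bandwidth toward the spines. 1.0 is non-blocking."""
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical leaf: 48x10G host ports and 4x40G uplinks is 3:1 oversubscribed.
assert leaf_oversubscription(48, 10, 4, 40) == 3.0
```

Ratios above 1.0 trade fabric cost for the risk of congestion on the leaf-to-spine uplinks, which is exactly the tradeoff the paper's simulations quantify.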
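The latency argument in "Can Parallel Replication Benefit HDFS for High-Performance Interconnects?" above can be captured by a toy model: a pipelined write is acknowledged only after the block traverses every replica in series, while a parallel write completes when the slowest replica finishes. A sketch under that simplification, with hypothetical per-replica transfer times:

```python
def pipelined_write_latency(replica_times):
    """Toy model of pipelined replication (client -> DN1 -> DN2 -> DN3):
    the ack returns after the block passes through every replica in series."""
    return sum(replica_times)

def parallel_write_latency(replica_times):
    """Toy model of parallel replication: the client writes all replicas
    concurrently, so latency is bounded by the slowest replica."""
    return max(replica_times)

hops = [1.0, 1.2, 0.9]  # hypothetical per-replica transfer times (ms)
assert parallel_write_latency(hops) < pipelined_write_latency(hops)
```

The model also hints at the paper's caveat: on slow networks the client's outbound link must carry all replicas itself, which can erase the parallel scheme's advantage.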
14:30-15:00 Afternoon Break
Keynote V
Session Chair: Christos Kolias

Changing Data Center
Tom Edsall, CTO at Insieme Networks
Evening Panel
Moderator: Mitch Gusat

Data-Center Congestion Control: the Neglected Problem

After decades of TCP and RED/ECN, and extensive experience with Valiant and hash/ECMP-based load balancing, what have we learned thus far?

The panel will debate the pros and cons of flow and congestion controls in future DC and HPC networks, considering the specifics of each layer, from L2 link level up to L4 transports and L5 applications. The panelists will weigh the balance between hardware and software solutions, their timescales, costs, and expected real-life impact.

Also considered will be the related issues of HOL blocking in various multihop topologies, load balancing, adaptive routing, global scheduling, OpenFlow options, new DC-TCP versions and application-level changes. And an intriguing new challenge: how about the SDN, a.k.a. OVN or virtual DCN?

Mohammad Alizadeh, Insieme Networks/Stanford
Claudio DeSanti, Cisco
Mehul Dholakia, Brocade
Tom Edsall, Insieme Networks
Bruce Kwan, Broadcom
Gilad Shainer, Mellanox

17:30-17:45 Closing Remarks
Friday, August 23 (Tutorials)
8:00-8:30 Breakfast and Registration
8:30-12:30 Tutorial 1

Accelerating Big Data with Hadoop and Memcached Using High Performance Interconnects: Opportunities and Challenges

D. K. Panda & Xiaoyi Lu, Ohio State University

Tutorial 2

OpenStack & SDN - A Hands-on Tutorial

Ramesh Durairaj, Oracle & Edgar Magana, PLUMgrid

12:30-13:30 Lunch
13:30-17:30 Tutorial 3

The role of optical interconnects in data-center networking and WAN optimization

Loukas Paraschis, Cisco

Tutorial 4

Flow and Congestion Controls for Multitenant Datacenters: Virtualization, Transport and Workload Impact

Mitch Gusat & Keshav Kamble, IBM