

Monday, August 28th (Tutorials)
8:00-8:30 Breakfast and Registration

Exploiting High-Performance Interconnects to Accelerate Big Data Processing with Hadoop, Spark, Memcached, and gRPC/TensorFlow

D. K. Panda & Xiaoyi Lu, Ohio State University

12:00-13:30 Lunch

High Performance Distributed Deep Learning for Dummies

D. K. Panda, Ammar Ahmad Awan, and Hari Subramoni, Ohio State University


Designing and Developing Performance Portable Network Codes

Pavel Shamis, ARM; Yossi Itigin, Mellanox Technologies


Developing to Open Fabrics Interfaces libfabric

Sean Hefty, Intel; James Swaro, Cray


The TraceR/CODES Framework for Application Simulations on HPC Networks

Nikhil Jain, LLNL; Misbah Mubarak, ANL

Tuesday, August 29th (Symposium)
8:00-8:50 Breakfast and Registration
8:50-9:05 Introduction
Ada Gavrilovska & Eitan Zahavi, General Chairs
Ryan Grant & Jitu Padhye, Technical Program Chairs
9:05-9:10 Host Opening Remarks
Meenakshi Kaul-Basu, Ericsson
Session Chair: Ada Gavrilovska

RDMA deployments: from cloud computing to machine learning
Chuanxiong Guo, MSR
10:15-10:30 Morning Break
Best Papers
Session Chair: Ryan E. Grant
  • Improving Non-Minimal and Adaptive Routing Algorithms in Slim Fly Networks

Interconnection networks must meet the communication demands of current High-Performance Computing systems. To interconnect the end nodes of these systems efficiently and with a good performance-to-cost ratio, new network topologies leveraging high-radix switches, such as Slim Fly, have been proposed in recent years.

    Adversarial-like traffic patterns, however, may severely reduce the performance of Slim Fly networks when only minimal-path routing is used. To mitigate the performance degradation in these scenarios, Slim Fly networks should be configured with an oblivious or adaptive non-minimal routing. The non-minimal routing algorithms proposed for Slim Fly usually rely on Valiant's algorithm to select the paths, at the cost of doubling the average path length, as well as the number of Virtual Channels (VCs) required to prevent deadlocks.

    Moreover, Valiant may introduce additional inefficiencies when applied to Slim Fly networks, such as the "turn-around problem" that we analyze in this work. With the aim of overcoming these drawbacks, we propose two variants of Valiant's algorithm that improve non-minimal path selection in Slim Fly networks. They are designed to be combined with adaptive routing algorithms that rely on Valiant to select non-minimal paths, such as UGAL or PAR, which we have adapted to the Slim Fly topology. Through results from simulation experiments, we show that our proposals improve network performance and/or reduce the number of VCs required to prevent deadlocks, even in scenarios with adversarial-like traffic.

    P. Y. Segura, J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, and T. Hoefler
  • Routing Keys

    The network plays a key role in High-Performance Computing (HPC) system efficiency. Unfortunately, traditional oblivious and congestion-aware routing solutions are not application-aware and cannot deal with typical HPC traffic bursts, while centralized routing solutions face a significant scalability bottleneck.

    To address this problem, we introduce Routing Keys, a new scalable routing paradigm for HPC networks that decouples intra- and inter-application flow contention. Our Application Routing Key (ARK) algorithm proactively allows each self-aware application to route its flows according to a predetermined routing key, i.e., its own intra-application contention-free routing. In addition, in our Network Routing Key (NRK) algorithm, a centralized scheduler chooses between several routing keys for the communication phases of each application, and therefore reduces inter-application contention while maintaining intra-application contention-free routing and avoiding scalability issues. Through extensive evaluation, we show that both ARK and NRK achieve a significant improvement in application performance.

    A. Samuel, E. Zahavi, and I. Keslassy
  • Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches

    The advent of non-volatile memory (NVM) technologies has added an interesting nuance to the node level memory hierarchy. With modern 100 Gb/s networks, the NVM tier of storage can often be slower than the high performance network in the system; thus, a new challenge arises in the datacenter. Whereas prior efforts have studied the impacts of multiple sources targeting one node (i.e., incast) and have studied multiple flows causing congestion in inter-switch links, it is now possible for a single flow from a single source to overwhelm the bandwidth of a key portion of the memory hierarchy. This can subsequently spread to the switches and lead to congestion trees in a flow-controlled network or excessive packet drops without flow control. In this work we describe protocols which avoid overwhelming the receiver in the case of a source/sink rate mismatch. We design our protocols on top of Portals 4, which enables us to make use of network offload. Our protocol yields up to 4x higher throughput in a 5k node Dragonfly topology for a permutation traffic pattern in which only 1% of all nodes have a memory write-bandwidth limitation of 1/8th of the network bandwidth.

    T. Schneider, M. Flajslik, J. Dinan, T. Hoefler, and K. Underwood
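As background for the routing papers above: Valiant's algorithm, which several of them build on, forwards each packet minimally to a randomly chosen intermediate node and then minimally on to its destination, balancing adversarial traffic at the cost of roughly doubling the path length. A minimal sketch (toy ring topology and helper names are illustrative only, not any paper's actual topology or code):

```python
import random
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in an unweighted topology graph."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def valiant_route(adj, src, dst, rng=random):
    """Valiant routing: minimal path to a random intermediate node,
    then minimal path from there to the destination."""
    mid = rng.choice([n for n in adj if n not in (src, dst)])
    first = shortest_path(adj, src, mid)
    return first + shortest_path(adj, mid, dst)[1:]

# Toy 8-node ring (illustration only, not a Slim Fly graph).
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
minimal = shortest_path(ring, 0, 1)
detour = valiant_route(ring, 0, 1, random.Random(42))
print(len(minimal) - 1, "hop minimal vs", len(detour) - 1, "hop Valiant")
```

The extra hops through the intermediate node are exactly the cost the Slim Fly paper's Valiant variants try to reduce.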
12:00-13:00 Lunch
Large Scale Networking
Session Chair: James Dinan
  • WaveLight: A Monolithic Low Latency Silicon-Photonics Communication Platform for the next generation Disaggregated Cloud Data Centers

    While transistor scaling has continued to improve the amount of raw computation that is possible under fixed area or power constraints, the need to move an increasing number of bits across ever larger distances has gone largely unaddressed by recent transistor improvements. Optical technologies can overcome the fundamental tradeoffs faced by electrical I/O. However, optical transceivers today are constrained to cumbersome module form factors. For emerging cloud data centers, the power and cost of current-generation optical modules prevent the use of optics as a suitable interconnect technology needed to improve utilization through compute and memory disaggregation. The power and size of discrete optical devices today prevent the formation of the high density switching needed for real-time wireless networks.

    To resolve these ongoing interconnect bottlenecks, a technology capable of deeply integrating a large number of optical and electronic devices is required. In this work, we present WaveLight, a zero-change photonics platform, whereby optical functions are designed directly into an existing high-volume CMOS process, to demonstrate an electro-optic flexible switching function with a lightweight, reliable, low-latency protocol. The integrated device leverages microring-based optical transceivers to deliver 80 Gb/s of full-duplex bandwidth at low power.
    M. Akhter, P. Somogyi, C. Sun, M. Wade, R. Meade, P. Bhargava, S. Lin, and N. Mehta
  • An FPGA Platform for Hyperscalers

    FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. Meanwhile, DC infrastructures are being redesigned to pack ever more compute capacity into the same volume and power envelopes. This redesign leads to the disaggregation of the server and its resources into a collection of standalone computing, memory, and storage modules.

    To embrace this evolution, we developed a platform that decouples the FPGA from the CPU of the server by connecting the FPGA directly to the DC network. This proposal turns the FPGA into a disaggregated standalone computing resource that can be deployed at large-scale into emerging hyperscale data centers.

    We show our platform, which integrates 64 FPGAs (Kintex® UltraScale™ XCKU060) from Xilinx in a 19″ × 2U chassis and provides a bisection bandwidth of 640 Gb/s. The platform is designed for cost-effectiveness and makes use of hot-water cooling for optimized energy efficiency. As a result, a DC rack can fit 16 platforms, for a total of 1024 FPGAs and 16 TB of DDR4 memory.

    F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss and S. Paredes
  • Throughput Models of Interconnection Networks: the Good, the Bad, and the Ugly

    Throughput performance, an important metric for interconnection networks, is often quantified by the aggregate throughput for a set of representative traffic patterns. A number of models have been developed to estimate the aggregate throughput for a given traffic pattern on an interconnection network. Since all of the models predict the same property of interconnection networks, ideally, they should give similar performance results or at least follow the same performance trend. In this work, we examine four commonly used interconnect throughput models and identify the cases when all models show similar trends, when different models yield different trends, and when different models produce contradictory results. Our study reveals important properties of the models and demonstrates the subtle differences among them, which are important for an interconnect designer/researcher to understand, in order to properly select a throughput model in the process of interconnect evaluation.

    P. Faizian, Md A. Mollah, Md S. Rahman, X. Yuan, S. Pakin, and M. Lang
Session Chair: Eitan Zahavi

Performance Isolation for Highly Efficient Shared Infrastructure Services
This talk will describe the challenges and some fundamental constructs in achieving flexible performance isolation in highly efficient shared infrastructure services.
Nandita Dukkipati, Google (Invited Talk)
14:45-15:05 Afternoon Break

Ethernet vs. HPC: Can the hyperscale ethernet data center handle all workloads?
Hyperscale ethernet data centers (HEDCs) based on pure ethernet switching have come to dominate both the market and the conversation, and many think they are all that is necessary. But for many years specialized designs for HPC have survived, though these specialized designs seem increasingly relegated to fewer and fewer special cases. Some think this will only continue, and the low cost, pervasiveness, and simplicity of a single fabric will result in HEDCs being "good enough" that special technology for HPC will no longer be needed. Others disagree and claim that HEDCs enjoy generally uniform and static workloads that do not fit the profiles of machine learning and HPC applications. They cite technologies like PCI Express fabrics that are infiltrating HEDCs to augment ethernet with capabilities that ethernet simply cannot provide and that are needed both in the massive HEDCs and in smaller versions of the same, including specialized campus DCs or clusters and those deployed at the mobile edge for mobile edge computing (MEC). In the MEC scenario, media, security, machine learning, and IoT are among the new applications driving the convergence of HPC and HEDC. The panel will debate these opposing viewpoints and will speculate on whether specialized HPC DCs and general HEDCs will converge, diverge, or continue in separate parallel worlds.
Moderator: Roy Chua, Partner, SDxCentral & Wiretap Ventures

Yogesh Bhatt, Senior Director, Ecosystem Innovation & Strategy, Ericsson
Dave Cohen, Senior Principal Engineer & System Architect, Intel
Pete Fiacco, CTO, GigaIO Networks
Bithika Khargharia, Former Principal Architect, Extreme Networks
Dave Meyer, Chief Scientist, Brocade
Ying Zhang, Software Engineer, Facebook

Wednesday, August 30th (Symposium)
8:15-9:00 Breakfast and Registration
Session Chair: Eitan Zahavi

Information Transfer in the era of 5G
David Allan, Ericsson
10:00-10:30 Morning Break
Optics & Networks for Science
Session Chair: Madeleine Glick
  • A High Speed Hardware Scheduler for 1000-port Optical Packet Switches to Enable Scalable Data Centers

    Meeting the exponential increase in the global demand for bandwidth has become a major concern for today's data centers. The scalability of any data center is defined by the maximum capacity and port count of the switching devices it employs, limited by total pin bandwidth on current electronic switch ASICs. Optical switches can provide higher capacity and port counts, and hence, can be used to transform data center scalability. We have recently demonstrated a 1000-port star-coupler based wavelength division multiplexed (WDM) and time division multiplexed (TDM) optical switch architecture offering a bandwidth of 32 Tbit/s with the use of fast wavelength-tunable transmitters and high-sensitivity coherent receivers. However, the major challenge in deploying such an optical switch to replace current electronic switches lies in designing and implementing a scalable scheduler capable of operating on packet timescales.

    In this paper, we present a pipelined and highly parallel electronic scheduler that configures the high-radix (1000-port) optical packet switch. The scheduler can process requests from 1000 nodes and allocate timeslots across 320 wavelength channels and 4000 wavelength-tunable transceivers within a time constraint of 1 µs. Using the NanGate 45nm Open Cell Library, we show that the complete 1000-port parallel scheduler occupies a circuit area of 52.7 mm², 4-8x smaller than that of a high-performance switch ASIC, with a clock period of less than 8 ns, enabling 138 scheduling iterations to be performed in 1 µs. The performance of the scheduling algorithm is evaluated in comparison to maximal matching from graph theory and conventional software-based wavelength allocation heuristics. The parallel hardware scheduler is shown to achieve similar matching performance and network throughput while being orders of magnitude faster.

    J. Benjamin, P. Watts, A. Funnell, and B. Thomsen
  • Subchannel Scheduling for Shared Optical On-chip Buses

    Maximizing bandwidth utilization of optical on-chip interconnects is essential to compensate for the impact of static laser power on power efficiency in networks-on-chip. Shared optical buses offer a modular design solution with tremendous power savings by allowing optical bandwidth to be shared between all connected nodes. Previous proposals resolve bus contention by scheduling senders sequentially on the entire optical bandwidth; however, logically splitting a bus into subchannels to allow senders to be scheduled both sequentially and in parallel has been shown to be highly efficient in electrical interconnects, and could also be applied to shared optical buses.

    In this paper, we propose an efficient subchannel scheduling algorithm that aims to minimize the number of bus utilization cycles by assigning sender-receiver pairs both to subchannels and time slots. We present both a distributed and a centralized bus arbitration scheme showing that light-weight implementations can be attained. In fact, our results show that subchannel scheduling more than doubles throughput on shared optical buses compared to sequential scheduling without incurring any power overheads in most cases. Arbitration latency overheads compared to state-of-the-art sequential arbitration schemes are at most 5-10%, and are only noticeable for very low injection rates.

    S. Werner, J. Navaridas, and M. Luján
  • Utilizing HPC Network Technologies in High Energy Physics Experiments

    Because of their performance characteristics, high-performance fabrics like InfiniBand or Omni-Path are interesting technologies for many local area network applications, including data acquisition systems for high-energy physics experiments like the ATLAS experiment at CERN. This paper analyzes existing APIs for high-performance fabrics and evaluates their suitability for data acquisition systems in terms of performance and domain applicability.

    The study finds that existing software APIs for high-performance interconnects are focused on applications in high-performance computing with specific workloads and are not compatible with the requirements of data acquisition systems. To evaluate the use of high-performance interconnects in data acquisition systems, a custom library called NetIO has been developed and is compared against existing technologies.

    NetIO has a message queue-like interface which matches the ATLAS use case better than traditional HPC APIs like MPI. The architecture of NetIO is based on an interchangeable back-end system which supports different interconnects. A libfabric-based back-end supports a wide range of fabric technologies including InfiniBand. On the front-end side, NetIO supports several high-level communication patterns that are found in typical data acquisition applications like client/server and publish/subscribe. Unlike other frameworks, NetIO distinguishes between high-throughput and low-latency communication, which is essential for applications with heterogeneous traffic patterns. This feature of NetIO allows experiments like ATLAS to use a single network for different traffic types like physics data or detector control.

    Benchmarks of NetIO in comparison with the message queue implementation ZeroMQ are presented. NetIO reaches up to 2x higher throughput on Ethernet and up to 3x higher throughput on FDR InfiniBand compared to ZeroMQ on Ethernet. The latencies measured with NetIO are comparable to ZeroMQ latencies.

    J. Schumacher
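The subchannel-scheduling idea in the shared-optical-bus paper above — assigning sender-receiver pairs to both subchannels and time slots instead of serializing them on the full bus — can be illustrated with a greedy first-fit sketch (hypothetical data structures, not the authors' algorithm):

```python
def schedule(requests, num_subchannels):
    """Greedy first-fit: place each (sender, receiver) pair into the
    earliest time slot that has a free subchannel and in which neither
    endpoint is already busy. Returns {slot: [(sender, receiver), ...]}."""
    slots = []  # each slot holds at most num_subchannels pairs
    for s, r in requests:
        for pairs in slots:
            busy = {e for p in pairs for e in p}
            if len(pairs) < num_subchannels and s not in busy and r not in busy:
                pairs.append((s, r))
                break
        else:
            slots.append([(s, r)])  # no existing slot fits: open a new one
    return {t: pairs for t, pairs in enumerate(slots)}

reqs = [(0, 1), (2, 3), (0, 2), (1, 3), (4, 5)]
# Sequential scheduling on the whole bus would need 5 cycles;
# with 2 subchannels, non-conflicting pairs share a cycle.
plan = schedule(reqs, num_subchannels=2)
print(len(plan), "slots:", plan)
```

Even this naive heuristic shows why parallel subchannel assignment can cut bus utilization cycles relative to purely sequential arbitration.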
12:00-13:30 Lunch
Topologies, Routing and Process Placement
Session Chair: Taylor Groves
  • On the Impact of Routing Algorithms in the Effectiveness of Queuing Schemes in High-Performance Interconnection Networks

    In High-Performance Computing (HPC) systems, the design of the interconnection network is crucial. Indeed, the network topology, the switch architecture and the routing scheme determine the network performance and, ultimately, that of the whole system. As the number of end nodes in HPC systems grows, and the supported applications become increasingly communication-intensive, techniques to deal with network congestion and its negative effects gain importance. For that purpose, adaptive or oblivious routing schemes try to balance the network traffic in order to prevent and/or eliminate congestion. On the other hand, there are deterministic routing schemes that balance the number of paths per link with the aim of reducing the head-of-line blocking derived from congestion situations. Furthermore, other techniques to deal with congestion are based on queuing schemes. This approach stores different packet flows separately in the port buffers, so that head-of-line blocking and/or buffer hogging are reduced. Existing queuing schemes use different policies to separate flows, and they can be implemented in different ways. However, most queuing schemes are designed and used assuming that the network is configured with deterministic routing, while they could actually also be combined with adaptive or oblivious routing.

    This paper analyzes the behavior of different queuing schemes under different routing algorithms: deterministic, adaptive or oblivious. We focus on fat-tree networks configured with the most common routing algorithms of each type suitable for that topology. To evaluate these configurations, we have run simulation experiments modeling large fat-trees built from switches with radices available in the market, and supporting several queuing schemes. The experimental results show how different the performance of the queuing schemes may be when combined with either deterministic or oblivious/adaptive routing. Indeed, from these results we conclude that some combinations of queuing schemes and routing algorithms are counterproductive.

    J. Rocher-Gonzalez, J. Escudero-Sahuquillo, P. J. García, and F. J. Quiles
  • Placement of Virtual Network Functions in Hybrid Data Center Networks

    Hybrid data center networks (HDCNs), in which each ToR switch is equipped with a directional antenna, have emerged as a candidate for alleviating the over-subscription problem in traditional data centers. Meanwhile, as virtualization techniques develop rapidly, there is a trend toward traditional hardware network functions also being virtualized into virtual machines. However, how to place virtual network functions (VNFs) in data centers to meet customer requirements in a hybrid data center network environment is a challenging problem. In this paper, we study VNF placement in hybrid data center networks and provide a joint VNF placement and antenna scheduling model, which we further reduce to a mixed integer programming (MIP) problem. Due to the hardness of the MIP problem, we develop a heuristic algorithm to solve it, and also give an online algorithm to meet the requirements of real scenarios. To the best of our knowledge, this is the first work concerning VNF placement in the context of HDCNs. Our extensive simulations demonstrate the effectiveness of the proposed algorithms, making them a suitable and promising solution for VNF placement in HDCN environments.

    Z. Li and Y. Yang
  • MPI Process and Network Device Affinitization for Optimal HPC Application Performance

    High Performance Computing (HPC) applications are highly optimized to maximize the resources allocated to the job, such as compute, memory and storage. Optimal performance for MPI applications requires the best possible affinity across all allocated resources. Typically, setting process affinity to compute resources is well defined, i.e., MPI processes on a compute node have processor affinity set for a one-to-one mapping between MPI processes and physical processing cores. Several well-defined methods exist to efficiently map MPI processes to a compute node. With the growing complexity of HPC systems, platforms are designed with complex compute and I/O subsystems. The capacity of I/O devices attached to a node is expanded with PCIe switches, resulting in large numbers of PCIe endpoint devices. With so much heterogeneity in systems, application programmers are forced to think harder about affinitizing processes, as performance depends not only on compute placement but also on the NUMA placement of I/O devices. Mapping processes to processor cores and the closest I/O device(s) is not straightforward. While operating systems do a reasonable job of trying to keep a process physically located near its processor core(s) and memory, they lack the application developer's knowledge of process workflow and optimal I/O resource allocation when more than one I/O device is connected to the compute node.

    In this paper we look at ways to assuage these affinity problems by abstracting the device selection algorithm away from the MPI application layer. MPI continues to be the dominant programming model for HPC, and hence our focus in this paper is limited to providing a solution for MPI-based applications. Our solution can be extended to other HPC programming models such as Partitioned Global Address Space (PGAS), or to hybrid MPI and PGAS applications. We propose a solution that addresses NUMA effects at the MPI runtime level, independent of MPI applications. Our experiments are conducted on a two-node system where each node is a two-socket Intel Xeon server attached to up to four Intel Omni-Path fabric devices connected over PCIe. The performance benefits MPI applications gain from affinitizing MPI processes with the best possible network device are evident from the results, where we notice up to 40% improvement in uni-directional bandwidth, 48% in bi-directional bandwidth, 32% improvement in latency measurements, and finally up to 40% improvement in message rate.

    R. B. Ganapathi, A. Gopalakrishnan, and R. W. Mcguire
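The affinity problem described in the last paper above — picking the fabric device closest to the NUMA node a rank is pinned to — reduces to a nearest-device lookup. A toy sketch (the distance table, device names, and block-pinning assumption are all hypothetical, not the authors' runtime):

```python
def pick_device(rank, ranks_per_node, numa_of_core, device_numa, numa_distance):
    """Map an MPI rank to the fabric device with the lowest NUMA
    distance from the core the rank is pinned to (ties: lowest name)."""
    core = rank % ranks_per_node           # assume block pinning of ranks to cores
    numa = numa_of_core[core]
    return min(device_numa, key=lambda d: (numa_distance[numa][device_numa[d]], d))

# Hypothetical 2-socket node: cores 0-15 on NUMA 0, cores 16-31 on NUMA 1,
# with one Omni-Path-style device attached per socket.
numa_of_core = {c: 0 if c < 16 else 1 for c in range(32)}
device_numa = {"hfi1_0": 0, "hfi1_1": 1}
numa_distance = [[10, 21], [21, 10]]       # SLIT-style: local=10, remote=21
print(pick_device(3, 32, numa_of_core, device_numa, numa_distance))   # socket-0 rank
print(pick_device(20, 32, numa_of_core, device_numa, numa_distance))  # socket-1 rank
```

Doing this lookup inside the MPI runtime, as the paper proposes, spares the application from knowing the platform's I/O topology at all.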
15:00-15:15 Afternoon Break
Session Chair: Ada Gavrilovska

Communication at the Speed of Memory
The data deluge caused by the proliferation of connected data sources is creating an unprecedented imbalance in the ability of our IT infrastructure to access and mine data. While it is relatively easy to provision compute resources, it is much more challenging to provision the data resources to feed them. This is true across memory, storage, and system interconnects, which are all going through profound transformations caused by new applications and disruptive technologies, such as storage-class memory and silicon photonics. This talk discusses why we need a new approach to architect data movement, and what role the recently announced Gen-Z protocol can play (www.genzconsortium.org). Gen-Z is an open systems interconnect designed to provide memory semantic access to data and devices via direct-attached, switched, or fabric topologies. It is designed to address fabric needs within individual servers, at rack-scale, and all the way to next-generation exascale HPC systems. Gen-Z scalability comes from a combination of protocol features and optimized implementations. For example, the routing tables support scalability at minimal space and latency overhead; the routing mechanisms provide packet-by-packet path adaptivity on many topologies; the duplicate avoidance scheme alleviates the endpoints from the overheads of tracking connections. The talk covers the key aspects of Gen-Z as a high-performance interconnect, such as CPU-initiated atomics, high efficiency RDMA via memory semantics, and fast-acting congestion control, and how these features provide the basis to enable next-generation exascale systems.
Paolo Faraboschi, HPE (Invited Talk)
Paolo Faraboschi is a Fellow and VP at Hewlett Packard Enterprise Labs. His interests are at the intersection of systems, architecture, and software. He is currently researching memory-driven computing technologies and their use towards exascale computing. Recently, he was the lead hardware architect of The Machine project, researching how we can build better memory-driven computing systems for big data problems. From 2010 to 2014, he worked on low-energy servers and HP project Moonshot. From 2004 to 2009, at HP Labs in Barcelona, he led a research activity on scalable system-level simulation and modelling. From 1995 to 2003, at HP Labs Cambridge, he was the principal architect of the Lx/ST200 family of VLIW cores, widely used in video SoCs and HP's printers. Paolo is an IEEE Fellow and an active member of the computer architecture community. He is an author on 30 patents, over 100 publications, and the book "Embedded Computing: a VLIW approach". Before joining HP in 1994, he received a Ph.D. in EECS from the University of Genoa, Italy.
Efficient Network Design & Network Architecture
Session Chair: Jitu Padhye
  • Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks

    Deep Learning over Big Data (DLoBD) is becoming one of the most important research paradigms to mine value from the massive amount of gathered data. With this emerging paradigm, more and more deep learning frameworks start running over Big Data stacks, such as Hadoop and Spark. With the convergence of HPC, Big Data, and Deep Learning, many of these emerging frameworks have taken advantage of RDMA and multi-/many-core based CPUs/GPUs. Even though a lot of activities are happening in the field, there is a lack of systematic studies on analyzing the impact of RDMA-capable networks and CPUs/GPUs on DLoBD stacks. To fill this gap, we propose a systematic characterization methodology and conduct extensive performance evaluations on three representative DLoBD stacks (i.e., CaffeOnSpark, TensorFlowOnSpark, BigDL) to expose the interesting trends in terms of performance, scalability, accuracy, and resource utilization. Our observations show that RDMA-based design for DLoBD stacks can achieve up to 2.7X speedup compared to the IPoIB-based scheme. More insights are shared in this paper to guide designing next-generation DLoBD stacks.

    X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda
  • Low-Level Host Software Stack Optimizations to Improve Aggregate Fabric Throughput

    Scientific HPC applications, along with the emerging class of Big Data and Machine Learning workloads, are rapidly driving the fabric scale both on premises and in the cloud. Achieving high aggregate fabric throughput is paramount to the overall performance of the application. However, achieving high fabric throughput at scale can be challenging - that is, the application communication pattern will need to map well onto the target fabric architecture, and the multi-layered host software stack in the middle will need to orchestrate that mapping optimally to unleash the full performance.

    In this paper, we investigate low-level optimizations to the host software stack with the goal of improving the aggregate fabric throughput, and hence, application performance. We develop and present a number of optimization and tuning techniques that are key driving factors of fabric performance at scale, such as fine-grained interleaving, improved pipelining, and careful resource utilization and management. We believe that these low-level optimizations can be commonly leveraged by several programming models and their runtime implementations, making these optimizations broadly applicable. Using a set of well-known MPI-based scientific applications, we demonstrate that these optimizations can significantly improve the overall fabric throughput and the application performance. Interestingly, we also observe that some of these optimizations are inter-related and can additively contribute to the overall performance.

    V. T. Ravi, J. Erwin, P. Sivakumar, C. Tang, J. Xiong, M. Debbage, and R. B. Ganapathi
  • Userspace RDMA Verbs on Commodity Hardware using DPDK

    RDMA (Remote Direct Memory Access) is a technology which enables user applications to perform direct data transfer between the virtual memory of processes on remote endpoints, without operating system involvement or intermediate data copies. Achieving zero intermediate data copies using RDMA requires specialized network interface hardware. Software RDMA drivers emulate RDMA semantics in software to allow the use of RDMA without investing in such hardware, although they cannot perform zero-copy transfers. Nonetheless, software RDMA drivers are useful for research, application development, testing, debugging, or as a less expensive desktop client for a centralized RDMA server application running on RDMA-capable hardware.

    Existing software RDMA drivers perform data transfer in the kernel. Data Plane Development Kit (DPDK) provides a framework for mapping Ethernet interface cards into userspace and performing bulk packet transfers. This in turn allows a software RDMA driver to perform data transfer in userspace. We present our software RDMA driver, urdma, which performs data transfer in userspace, discuss its design and implementation, and demonstrate that it can achieve lower small message latency than existing kernel-based implementations while maintaining high bandwidth utilization for large messages.

    P. MacArthur
17:30-17:45 Awards & Closing Remarks
Ada Gavrilovska & Eitan Zahavi, General Chairs