Interval-based memory reclamation
In this paper we present interval-based reclamation (IBR), a new approach to safe reclamation of disconnected memory blocks in nonblocking concurrent data structures. Safe reclamation is a difficult problem: a thread, before freeing a block, must ensure ...
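Since the abstract only states the problem, here is a minimal sketch of the interval-based idea, assuming a global epoch counter, per-thread reservations, and per-block birth/retire epochs; the names (global_epoch, Reservation, Block) are illustrative assumptions, not the paper's API.

```cpp
// Minimal sketch of interval-based reclamation (illustrative only).
#include <atomic>
#include <cstdint>
#include <vector>

std::atomic<uint64_t> global_epoch{1};

struct Block {
    uint64_t birth_epoch;   // epoch at which the block was allocated
    uint64_t retire_epoch;  // epoch at which the block was unlinked/retired
    // ... user data ...
};

struct Reservation {                   // one per thread, read by reclaimers
    std::atomic<uint64_t> lower{UINT64_MAX};
    std::atomic<uint64_t> upper{0};
};

// A reader announces the interval of epochs it may still dereference.
void start_op(Reservation& r) {
    uint64_t e = global_epoch.load();
    r.lower.store(e);
    r.upper.store(e);
}
void on_new_pointer(Reservation& r) {  // extend the reservation while traversing
    r.upper.store(global_epoch.load());
}
void end_op(Reservation& r) {
    r.lower.store(UINT64_MAX);
    r.upper.store(0);
}

// A retired block may be freed once its lifetime [birth, retire] overlaps
// no thread's reserved interval [lower, upper].
bool can_free(const Block& b, const std::vector<Reservation*>& threads) {
    for (const Reservation* r : threads)
        if (b.birth_epoch <= r->upper.load() && b.retire_epoch >= r->lower.load())
            return false;
    return true;
}
```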
Harnessing epoch-based reclamation for efficient range queries
Concurrent sets with range query operations are highly desirable in applications such as in-memory databases. However, few set implementations offer range queries. Known techniques for augmenting data structures with range queries (or operations that ...
A persistent lock-free queue for non-volatile memory
Non-volatile memory is expected to coexist with (or even displace) volatile DRAM for main memory in upcoming architectures. This has led to increasing interest in the problem of designing and specifying durable data structures that can recover from ...
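A sketch of the durability discipline such designs build on may help: persist a node's contents before publishing it, then persist the link that publishes it. The persist() helper below stands in for a cache-line write-back plus fence (e.g. clwb + sfence) and, like the rest of the code, is an illustrative assumption rather than the paper's algorithm.

```cpp
// Flush-before-publish pattern for a persistent linked queue (sketch only).
#include <atomic>
#include <cstddef>

template <typename T>
struct Node {
    T value;
    std::atomic<Node*> next{nullptr};
};

inline void persist(const void* addr, std::size_t len) {
    // Placeholder: on real NVM hardware this would write back the touched
    // cache lines (clwb/clflushopt) and then issue an sfence.
    (void)addr; (void)len;
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

template <typename T>
void durable_enqueue(std::atomic<Node<T>*>& tail, Node<T>* n) {
    persist(n, sizeof(*n));                      // 1. node contents durable first
    Node<T>* t = tail.load();
    while (true) {
        Node<T>* nxt = t->next.load();
        if (nxt != nullptr) { t = nxt; continue; }         // not at the end yet
        Node<T>* expected = nullptr;
        if (t->next.compare_exchange_strong(expected, n))  // 2. link the node
            break;
    }
    persist(&t->next, sizeof(t->next));          // 3. make the link durable
    tail.compare_exchange_strong(t, n);          // 4. swing tail (rebuildable on recovery)
}
```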
Superneurons: dynamic GPU memory management for training deep neural networks
Going deeper and wider in neural architectures improves their accuracy, but limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desirable network architectures,...

Juggler: a dependence-aware task-based execution framework for GPUs
Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, the existence of data dependences across thread blocks may ...
HPVM: heterogeneous parallel virtual machine
We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our ...
Hierarchical memory management for mutable state
It is well known that modern functional programming languages are naturally amenable to parallel programming. Achieving efficient parallelism using functional languages, however, remains difficult. Perhaps the most important reason for this is their ...
Bridging the gap between deep learning and sparse matrix format selection
This work presents a systematic exploration on the promise and special challenges of deep learning for sparse matrix format selection---a problem of determining the best storage format for a matrix to maximize the performance of Sparse Matrix Vector ...
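The format-selection problem is easiest to see against a concrete format. Below is a standard CSR sparse matrix-vector multiply, the kernel whose performance depends on which storage format is chosen; it is a generic illustration, not the paper's code.

```cpp
// Standard CSR (compressed sparse row) SpMV: y = A * x.
// The best format (CSR, COO, ELL, DIA, ...) depends on the sparsity pattern,
// which is what a learned format selector tries to predict.
#include <vector>

struct CSR {
    int nrows;
    std::vector<int> row_ptr;   // size nrows + 1
    std::vector<int> col_idx;   // size nnz
    std::vector<double> val;    // size nnz
};

void spmv_csr(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
    // y must already have size A.nrows
    for (int i = 0; i < A.nrows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}
```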
Optimizing N-dimensional, winograd-based convolution for manycore CPUs
Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They can achieve only slightly better, and often worse ...
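For reference, the textbook 1D Winograd minimal-filtering instance F(2,3) computes two outputs of a 3-tap filter with 4 multiplications instead of 6, using the standard transform matrices below; this is the basic building block, not the paper's N-dimensional formulation.

```latex
% Winograd F(2,3): d is the 4-element input tile, g the 3-element filter,
% \odot the elementwise product.
\[
B^T = \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 1 & 0\\ 0 & -1 & 1 & 0\\ 0 & 1 & 0 & -1 \end{bmatrix},\quad
G = \begin{bmatrix} 1 & 0 & 0\\ \tfrac12 & \tfrac12 & \tfrac12\\ \tfrac12 & -\tfrac12 & \tfrac12\\ 0 & 0 & 1 \end{bmatrix},\quad
A^T = \begin{bmatrix} 1 & 1 & 1 & 0\\ 0 & 1 & -1 & -1 \end{bmatrix},
\]
\[
y = A^T \bigl[ (G g) \odot (B^T d) \bigr].
\]
```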
vSensor: leveraging fixed-workload snippets of programs for performance variance detection
Performance variance is an increasingly serious challenge on current large-scale HPC systems. Even with a fixed number of computing nodes, execution times can vary significantly from run to run. Many parallel programs executing on supercomputers suffer ...
Cache-tries: concurrent lock-free hash tries with constant-time operations
Concurrent non-blocking hash tries have good cache locality and horizontally scalable operations. However, operations on most existing concurrent hash tries run in O(log n) time.
In this paper, we show that the concurrent hash trie operations can run ...
Featherlight on-the-fly false-sharing detection
Shared-memory parallel programs routinely suffer from false sharing: a performance degradation caused by different threads accessing different variables that reside on the same CPU cache line, where at least one of the variables is written. State-of-the-art tools ...
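A minimal example of the pattern the detector targets (generic illustration, not the tool's own benchmark): two threads update different counters that happen to share a cache line.

```cpp
// Classic false sharing: each thread writes only its own counter, but the two
// counters (likely) share a cache line, so the line ping-pongs between cores.
// A common fix is to pad or alignas(64) each field.
#include <iostream>
#include <thread>

struct Counters {
    long a;   // written only by thread 1
    long b;   // written only by thread 2, but on the same cache line as a
};

int main() {
    Counters c{0, 0};
    std::thread t1([&] { for (int i = 0; i < 100000000; ++i) c.a++; });
    std::thread t2([&] { for (int i = 0; i < 100000000; ++i) c.b++; });
    t1.join();
    t2.join();
    std::cout << c.a + c.b << "\n";   // correct result, but far slower than with padding
    return 0;
}
```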
Register optimizations for stencils on GPUs
- Prashant Singh Rawat
- Fabrice Rastello
- Aravind Sukumaran-Rajam
- Louis-Noël Pouchet
- Atanas Rountev
- P. Sadayappan
The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means ...
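The kind of reuse at stake is easy to see on a 1D slice of a stencil sweep: consecutive output points share most of their inputs, and keeping that sliding window in scalars lets the compiler hold it in registers. This is a generic illustration, not the paper's transformation.

```cpp
// Register reuse in a stencil sweep: along the streaming dimension, consecutive
// outputs share two of their three inputs, so a rotating window of scalars
// avoids reloading them from memory.
#include <cstddef>
#include <vector>

void stencil_1d(const std::vector<float>& in, std::vector<float>& out) {
    const std::size_t n = in.size();
    if (n < 3) return;
    float left = in[0], mid = in[1], right = 0.0f;   // sliding register window
    for (std::size_t i = 1; i + 1 < n; ++i) {
        right = in[i + 1];                           // only one new load per point
        out[i] = 0.25f * left + 0.5f * mid + 0.25f * right;
        left = mid;                                  // rotate instead of reloading
        mid = right;
    }
}
```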
FlashR: parallelize and scale R for machine learning using SSDs
R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The usual way to get an efficient algorithm in R is to implement it in C or FORTRAN and provide an R ...
DisCVar: discovering critical variables using algorithmic differentiation for transient faults
Aggressive technology scaling trends have made the hardware of high performance computing (HPC) systems more susceptible to faults. Some of these faults can lead to silent data corruption (SDC), and represent a serious problem because they alter the HPC ...
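To make the named technique concrete, here is a tiny forward-mode algorithmic-differentiation example using dual numbers: propagating a derivative alongside each value yields exact sensitivities of an output to an input, the kind of information used to rank variables by criticality. DisCVar itself relies on adjoint (reverse-mode) differentiation; this simplified forward-mode sketch is only an illustration of AD.

```cpp
// Forward-mode AD with dual numbers (generic illustration, not DisCVar's code).
#include <cmath>
#include <iostream>

struct Dual {
    double val;   // value
    double dot;   // derivative with respect to the seeded input
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.dot + b.dot}; }
Dual operator*(Dual a, Dual b) { return {a.val * b.val, a.dot * b.val + a.val * b.dot}; }
Dual sin(Dual a) { return {std::sin(a.val), std::cos(a.val) * a.dot}; }

int main() {
    Dual x{2.0, 1.0};            // seed: differentiate with respect to x
    Dual y{3.0, 0.0};
    Dual f = x * y + sin(x);     // f(x, y) = x*y + sin(x)
    // df/dx = y + cos(x); a large sensitivity suggests a corruption of x
    // would strongly perturb the output, i.e. x is a "critical" variable.
    std::cout << "f = " << f.val << ", df/dx = " << f.dot << "\n";
    return 0;
}
```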
Practical concurrent traversals in search trees
Operations of concurrent objects often employ optimistic concurrency-control schemes that consist of a traversal followed by a validation step. The validation checks if concurrent mutations interfered with the traversal to determine if the operation ...
Communication-avoiding parallel minimum cuts and connected components
We present novel scalable parallel algorithms for finding global minimum cuts and connected components, which are important and fundamental problems in graph processing. To take advantage of future massively parallel architectures, our algorithms are ...
Safe privatization in transactional memory
Transactional memory (TM) facilitates the development of concurrent applications by letting the programmer designate certain code blocks as atomic. Programmers using a TM often would like to access the same data both inside and outside transactions, ...
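The privatization idiom in question can be sketched as: transactionally detach a piece of data, then operate on it outside any transaction. The code below is a sketch assuming GCC's -fgnu-tm transactional extension (__transaction_atomic); it illustrates the pattern whose safety the paper studies, not the paper's own formalization.

```cpp
// Privatization idiom (sketch; compile with g++ -fgnu-tm).
struct Node { int data; Node* next; };
Node* head = nullptr;            // shared list, normally accessed inside transactions

// Transactionally detach the first node, then use it privately.
int privatize_and_read() {
    Node* mine = nullptr;
    __transaction_atomic {       // unlink inside a transaction...
        mine = head;
        if (mine != nullptr) head = mine->next;
    }
    if (mine == nullptr) return -1;
    // ...then access it outside any transaction. Whether this non-transactional
    // access is safe with respect to still-running transactions that observed
    // the old head is exactly the privatization-safety question.
    int v = mine->data;
    delete mine;
    return v;
}
```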
Making pull-based graph processing performant
Graph processing engines following either the push-based or pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth ...
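The two patterns the abstract refers to, in their simplest form, are shown below as a generic PageRank-style sketch over a CSR graph (not the paper's engine): push scatters writes to out-neighbors, pull gathers reads from in-neighbors.

```cpp
// Push- vs pull-based propagation: both are two-level loop nests over a CSR graph.
#include <vector>

struct Graph {
    int n;
    std::vector<int> out_offs, out_nbrs;   // outgoing edges (CSR)
    std::vector<int> in_offs, in_nbrs;     // incoming edges (CSR)
};

// Push: each vertex scatters contributions to its out-neighbors; parallelizing
// the outer loop requires atomics because different vertices may write dst[v].
void push_step(const Graph& g, const std::vector<double>& src, std::vector<double>& dst) {
    for (int u = 0; u < g.n; ++u)                      // dst assumed zero-initialized
        for (int e = g.out_offs[u]; e < g.out_offs[u + 1]; ++e)
            dst[g.out_nbrs[e]] += src[u];
}

// Pull: each vertex gathers from its in-neighbors and performs one owned write,
// which makes the inner reduction loop easier to vectorize.
void pull_step(const Graph& g, const std::vector<double>& src, std::vector<double>& dst) {
    for (int v = 0; v < g.n; ++v) {
        double sum = 0.0;
        for (int e = g.in_offs[v]; e < g.in_offs[v + 1]; ++e)
            sum += src[g.in_nbrs[e]];
        dst[v] = sum;
    }
}
```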
An effective fusion and tile size model for optimizing image processing pipelines
Effective models for fusion of loop nests continue to remain a challenge in both general-purpose and domain-specific language (DSL) compilers. The difficulty often arises from the combinatorial explosion of grouping choices and their interaction with ...
Lazygraph: lazy data coherency for replicas in distributed graph-parallel computation
Replicas of a vertex play an important role in existing distributed graph processing systems: they allow a single vertex to be processed in parallel by multiple machines and let each machine access remote neighbors locally, without any remote communication. However, replicas of ...
PAM: parallel augmented maps
Ordered (key-value) maps are an important and widely-used data type for large-scale data processing frameworks. Beyond simple search, insertion and deletion, more advanced operations such as range extraction, filtering, and bulk updates form a critical ...
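The core idea of an augmented map can be sketched as a search tree in which every node also stores an aggregate of its subtree, so queries such as "sum over a key range" take O(log n). The node layout and functions below are a hypothetical, simplified illustration, not PAM's interface.

```cpp
// Simplified sketch of an augmented search tree (sum augmentation).
#include <memory>

struct Node {
    int key;
    double value;
    double subtree_sum;                 // augmented value: sum over this subtree
    std::unique_ptr<Node> left, right;
};

double sum_of(const Node* t) { return t ? t->subtree_sum : 0.0; }

void pull(Node* t) {                    // restore the augmentation after an update
    t->subtree_sum = t->value + sum_of(t->left.get()) + sum_of(t->right.get());
}

// Sum over keys >= lo: whole right subtrees are taken in O(1) via subtree_sum.
double sum_ge(const Node* t, int lo) {
    if (!t) return 0.0;
    if (t->key < lo) return sum_ge(t->right.get(), lo);
    return t->value + sum_ge(t->left.get(), lo) + sum_of(t->right.get());
}

double sum_le(const Node* t, int hi) {  // symmetric: sum over keys <= hi
    if (!t) return 0.0;
    if (t->key > hi) return sum_le(t->left.get(), hi);
    return t->value + sum_of(t->left.get()) + sum_le(t->right.get(), hi);
}

// Sum of values with keys in [lo, hi]; descends at most two root-to-leaf paths.
double range_sum(const Node* t, int lo, int hi) {
    if (!t) return 0.0;
    if (t->key < lo) return range_sum(t->right.get(), lo, hi);
    if (t->key > hi) return range_sum(t->left.get(), lo, hi);
    return t->value + sum_ge(t->left.get(), lo) + sum_le(t->right.get(), hi);
}
```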
Efficient shuffle management with SCache for DAG computing frameworks
In large-scale data-parallel analytics, shuffle, i.e., the cross-network read and aggregation of partitioned data between tasks with data dependencies, usually introduces significant overhead. To reduce shuffle overhead, we present SCache, an open source plug-in ...
High-performance genomic analysis framework with in-memory computing
In this paper, we propose an in-memory computing framework (called GPF) that provides a set of genomic formats, APIs and a fast genomic engine for large-scale genomic data processing. Our GPF comprises two main components: (1) scalable genomic data ...
Griffin: uniting CPU and GPU in information retrieval systems for intra-query parallelism
Interactive information retrieval services, such as enterprise search and document search, must provide relevant results with consistent, low response times in the face of rapidly growing data sets and query loads. These growing demands have led ...
swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the ...
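The level-set scheme the abstract contrasts with can be summarized as: assign each unknown a level so that everything a row depends on sits in an earlier level, then solve level by level with a synchronization between levels. The sketch below shows that baseline for a lower-triangular CSR matrix; it is not the paper's sparse-level-tile layout.

```cpp
// Generic level-set SpTRSV (each row of L is assumed to contain its diagonal).
#include <algorithm>
#include <vector>

// level(i) = 1 + max level among rows that row i depends on (0 if none).
std::vector<int> build_levels(int n, const std::vector<int>& rp,
                              const std::vector<int>& ci) {
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; ++i)
        for (int k = rp[i]; k < rp[i + 1]; ++k)
            if (ci[k] < i) level[i] = std::max(level[i], level[ci[k]] + 1);
    return level;
}

// Solve L x = b level by level: rows within one level are independent (the inner
// loop could run in parallel); the barrier between levels is the inter-level
// synchronization this line of work tries to reduce.
void sptrsv_levelset(int n, const std::vector<int>& rp, const std::vector<int>& ci,
                     const std::vector<double>& val, const std::vector<double>& b,
                     std::vector<double>& x) {
    if (n == 0) return;
    std::vector<int> level = build_levels(n, rp, ci);
    int nlev = *std::max_element(level.begin(), level.end()) + 1;
    for (int l = 0; l < nlev; ++l) {
        for (int i = 0; i < n; ++i) {            // rows of level l (parallelizable)
            if (level[i] != l) continue;
            double sum = b[i], diag = 1.0;
            for (int k = rp[i]; k < rp[i + 1]; ++k) {
                if (ci[k] < i)       sum -= val[k] * x[ci[k]];
                else if (ci[k] == i) diag = val[k];
            }
            x[i] = sum / diag;
        }
    }
}
```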
VerifiedFT: a verified, high-performance precise dynamic race detector
Dynamic data race detectors are valuable tools for testing and validating concurrent software, but to achieve good performance they are typically implemented using sophisticated concurrent algorithms. Thus, they are ironically prone to the exact same ...
Efficient parallel determinacy race detection for two-dimensional dags
A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program since they lead to non-deterministic program behavior ...
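Since the abstract defines the term, a minimal example may help: two logically parallel tasks that access the same location, one of them writing. Plain C++ threads stand in here for fork-join tasks; the example is illustrative only.

```cpp
// Minimal determinacy race: two logically parallel writes to the same location
// make the program's result depend on scheduling (prints 1 or 2).
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> shared{0};

int main() {
    std::thread left([] { shared.store(1); });   // logically parallel with right
    std::thread right([] { shared.store(2); });  // same location, both writes
    left.join();
    right.join();
    std::cout << shared.load() << "\n";          // non-deterministic: 1 or 2
    return 0;
}
```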
Performance challenges in modular parallel programs
Over the past decade, many programming languages and systems for parallel-computing have been developed, including Cilk, Fork/Join Java, Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction at ...
Reducing the burden of parallel loop schedulers for many-core processors
This work proposes a low-overhead half-barrier pattern to schedule fine-grain parallel loops and considers its integration in the Intel OpenMP and Cilkplus schedulers. Experimental evaluation demonstrates that the scheduling overhead of our techniques ...