Accelerating Irregular Applications with Pipeline Parallelism

Author(s)

Nguyen, Quan Minh

Download: Thesis PDF (1.660 MB)

Advisor

Sanchez, Daniel

Terms of use

In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Irregular applications have frequent data-dependent memory accesses and control flow. They arise in many emerging and important domains, including sparse deep learning, graph analytics, and database processing. Conventional architectures cannot handle irregular applications efficiently because their techniques for improving performance, like exploiting instruction-level or data-level parallelism, are not tailored to such applications. Thus, continued progress in these crucial domains depends on exploring new avenues of parallelism. Fortunately, irregular applications contain abundant but untapped pipeline parallelism: they can be divided into networks of stages. Pipelining not only exposes parallelism but also enables decoupling, which hides the latency of long events by allowing producer stages to run ahead of consumer stages. To properly decouple these applications, though, this pipeline parallelism must be exploited at fine grain, with few operations per stage. Prior work has proposed architectures, compilers, and languages, but these focus on regular pipelines and are thus unable to overcome several challenges of irregular applications. First, architectures need to support the efficient execution of many fine-grain pipeline stages. Second, such irregular pipelines suffer from load imbalance, as the amount of work in each stage varies rapidly as the program runs. Finally, these stages must communicate and coordinate changes in control flow. This thesis demonstrates that exploiting fine-grain pipeline parallelism in irregular applications is effective and practical. To this end, this thesis proposes two hardware architectures and a compiler. Pipette, the first architecture, reuses existing structures in modern out-of-order cores to implement load-balanced decoupled communication between stages. Fifer, the second architecture, makes the acceleration benefits of coarse-grain reconfigurable arrays available to irregular applications. Pipette achieves a geometric-mean speedup of 1.9x over a data-parallel implementation, and Fifer achieves up to 47x speedup over an out-of-order multicore while using considerably less area. Both architectures further accelerate challenging memory accesses and resolve the load balancing and control flow challenges that are ubiquitous in irregular applications. Finally, Phloem is a compiler that makes it easy for programmers to use these architectures by producing high-performance pipeline-parallel implementations of irregular applications from serial code. Phloem automatically achieves 85% of the performance of manually pipelined versions.
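
As a rough software illustration of the decoupling idea in the abstract, the sketch below splits an irregular, indirect-access loop over a graph in compressed sparse row (CSR) form into three small stages connected by bounded queues. It is not the thesis's hardware or compiler design: the function name `run_pipeline`, the `DONE` sentinel, the `queue_depth` parameter, and the use of Python threads are all assumptions made for illustration. Each producer stage can run ahead of its consumer until its queue fills, and the sentinel is one simple way for stages to signal a change in control flow (end of stream).

```python
import threading
from queue import Queue

DONE = object()  # hypothetical end-of-stream marker passed between stages


def run_pipeline(offsets, neighbors, values, queue_depth=4):
    """Sum values[n] over every neighbor n of a CSR graph using three stages."""
    ranges = Queue(maxsize=queue_depth)  # stage 1 -> stage 2
    ids = Queue(maxsize=queue_depth)     # stage 2 -> stage 3
    result = []

    def enumerate_vertices():
        # Stage 1: emit the [start, end) neighbor range of each vertex.
        for v in range(len(offsets) - 1):
            ranges.put((offsets[v], offsets[v + 1]))
        ranges.put(DONE)

    def fetch_neighbors():
        # Stage 2: data-dependent amount of work per input (varies per vertex).
        while (item := ranges.get()) is not DONE:
            start, end = item
            for i in range(start, end):
                ids.put(neighbors[i])
        ids.put(DONE)

    def accumulate():
        # Stage 3: indirect access values[n], then reduce.
        total = 0
        while (n := ids.get()) is not DONE:
            total += values[n]
        result.append(total)

    stages = (enumerate_vertices, fetch_neighbors, accumulate)
    threads = [threading.Thread(target=stage) for stage in stages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    offsets = [0, 2, 3, 3, 5]    # 4 vertices; vertex 2 has no neighbors
    neighbors = [1, 2, 0, 3, 1]
    values = [10, 20, 30, 40]
    print(run_pipeline(offsets, neighbors, values))
```

The bounded queues stand in for the decoupled communication between stages: a small `queue_depth` limits how far a producer can run ahead, while a larger one tolerates more latency variation. The uneven neighbor counts per vertex give a small taste of the load imbalance the abstract mentions.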

Date issued

2022-05

URI

https://hdl.handle.net/1721.1/144589

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Doctoral Theses