DOI: 10.1145/3466752.3480124

Twig: Profile-Guided BTB Prefetching for Data Center Applications

Published: 17 October 2021

Abstract

Modern data center applications have deep software stacks, with instruction footprints that are orders of magnitude larger than typical instruction cache (I-cache) sizes. To efficiently prefetch instructions into the I-cache despite large application footprints, modern server-class processors implement a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP). In this work, we first characterize the limitations of a decoupled frontend processor with FDIP and find that FDIP suffers from significant Branch Target Buffer (BTB) misses. We also find that existing techniques (e.g., stream prefetchers and predecoders) are unable to mitigate these misses, as they rely on an incomplete understanding of a program’s branching behavior.
To address the shortcomings of existing BTB prefetching techniques, we propose Twig, a novel profile-guided BTB prefetching mechanism. Twig analyzes a production binary’s execution profile to identify critical BTB misses and inject BTB prefetch instructions into the code. Additionally, Twig coalesces multiple non-contiguous BTB prefetches to improve the BTB’s locality. Twig exposes these techniques via new BTB prefetch instructions. Since Twig prefetches BTB entries without modifying the underlying BTB organization, it is easy to adopt in modern processors. We study Twig’s behavior across nine widely used data center applications, and demonstrate that it achieves an average 20.86% (up to 145%) performance speedup over a baseline 8K-entry BTB, outperforming the state-of-the-art BTB prefetch mechanism by 19.82% on average.
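
As a rough illustration of the profile-then-prefetch idea described in the abstract, the following Python sketch replays a synthetic branch trace through a toy BTB, records which branches miss most often, and then replays the same trace with prefetch hints attached to an earlier branch. Everything in it is an assumption made for illustration: the fully associative LRU model, the names `ToyBTB`, `replay`, and `build_hints`, the fixed prefetch distance, and the trace itself. It is not Twig's algorithm, instruction encoding, or evaluation setup.

```python
from collections import Counter, OrderedDict


class ToyBTB:
    """Hypothetical fully associative BTB with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # branch PC -> target, oldest first

    def lookup(self, pc):
        """Return True on a hit and refresh the entry's LRU position."""
        if pc in self.entries:
            self.entries.move_to_end(pc)
            return True
        return False

    def insert(self, pc, target):
        """Install or refresh an entry, evicting the LRU entry if full."""
        self.entries[pc] = target
        self.entries.move_to_end(pc)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


def replay(trace, capacity, hints=None):
    """Replay (pc, target) pairs and count BTB misses per branch PC.

    `hints` maps a trigger PC to a list of (pc, target) entries that are
    installed whenever the trigger branch executes (a stand-in for an
    injected prefetch instruction).
    """
    btb = ToyBTB(capacity)
    misses = Counter()
    hints = hints or {}
    for pc, target in trace:
        if not btb.lookup(pc):
            misses[pc] += 1
            btb.insert(pc, target)
        for hint_pc, hint_target in hints.get(pc, ()):
            btb.insert(hint_pc, hint_target)
    return misses


def build_hints(trace, misses, top_n, distance):
    """For the top_n most-missing branches, pick as trigger the branch
    that most often executed `distance` branches earlier in the profile."""
    targets = dict(trace)
    hot = [pc for pc, _ in misses.most_common(top_n)]
    predecessors = {pc: Counter() for pc in hot}
    for i, (pc, _) in enumerate(trace):
        if pc in predecessors and i >= distance:
            predecessors[pc][trace[i - distance][0]] += 1
    hints = {}
    for pc in hot:
        if predecessors[pc]:
            trigger = predecessors[pc].most_common(1)[0][0]
            hints.setdefault(trigger, []).append((pc, targets[pc]))
    return hints


# Synthetic profile: six branches executed cyclically, ten times over.
# With only four entries, LRU evicts each branch just before its reuse.
loop = [(0x100 + 4 * k, 0x1000 + 16 * k) for k in range(6)]
trace = loop * 10

baseline = replay(trace, capacity=4)
hints = build_hints(trace, baseline, top_n=6, distance=1)
guided = replay(trace, capacity=4, hints=hints)

print("baseline misses:", sum(baseline.values()))
print("guided misses:  ", sum(guided.values()))
```

On this contrived cyclic trace the LRU table thrashes without hints, so the sketch only shows why a hint placed ahead of a known miss can help. It says nothing about the speedups reported in the paper, and a real BTB would be set-associative with a far larger capacity.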

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Prefetching
  2. branch target buffer
  3. data center
  4. frontend stalls

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Intel Corporation
  • NSF
  • Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA

Conference

MICRO '21

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 535
  • Downloads (last 6 weeks): 98

Download counts reflect activity up to 10 Dec 2024.

Cited By

  • (2024) Enhancing Power Efficiency in Branch Target Buffer Design with a Two-Level Prediction Mechanism. Electronics 13:7 (1185). DOI: 10.3390/electronics13071185. Online publication date: 23-Mar-2024.
  • (2024) Weeding out Front-End Stalls with Uneven Block Size Instruction Cache. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1382-1396. DOI: 10.1109/MICRO61859.2024.00102. Online publication date: 2-Nov-2024.
  • (2024) Alternate Path μ-op Cache Prefetching. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1230-1245. DOI: 10.1109/ISCA59077.2024.00092. Online publication date: 29-Jun-2024.
  • (2024) UDP: Utility-Driven Fetch Directed Instruction Prefetching. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 1188-1201. DOI: 10.1109/ISCA59077.2024.00089. Online publication date: 29-Jun-2024.
  • (2024) AVM-BTB: Adaptive and Virtualized Multi-level Branch Target Buffer. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 17-31. DOI: 10.1109/ISCA59077.2024.00012. Online publication date: 29-Jun-2024.
  • (2023) A Storage-Effective BTB Organization for Servers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1153-1167. DOI: 10.1109/HPCA56546.2023.10070938. Online publication date: Feb-2023.
  • (2023) JACO: JAva Code Layout Optimizer Enabling Continuous Optimization without Pausing Application Services. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 295-306. DOI: 10.1109/CLUSTER52292.2023.00032. Online publication date: 31-Oct-2023.
  • (2022) Debugging in the brave new world of reconfigurable hardware. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 946-962. DOI: 10.1145/3503222.3507701. Online publication date: 28-Feb-2022.
  • (2022) Shooting Down the Server Front-End Bottleneck. ACM Transactions on Computer Systems 38:3-4 (1-30). DOI: 10.1145/3484492. Online publication date: 4-Jan-2022.
  • (2022) Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, 19-34. DOI: 10.1109/MICRO56248.2022.00017. Online publication date: 1-Oct-2022.
