arXiv:2406.11016 (cs)

[Submitted on 16 Jun 2024 (v1), last revised 3 Oct 2024 (this version, v2)]

Title: Optimized Speculative Sampling for GPU Hardware Accelerators

Authors: Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet

Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
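
For orientation, the verification step of standard speculative sampling can be sketched in a few lines of plain Python. This is a minimal CPU illustration of the general technique, not the paper's GPU kernel. The helper names (`softmax`, `sigmoid`, `accept_or_resample`) are hypothetical, and the unscaled element-wise sigmoid is only an assumed stand-in for the approximation mentioned in the abstract, which does not give the exact formulation.

```python
import math
import random


def softmax(logits):
    # Exact normalization: needs a max and a sum over the whole vocabulary.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    # Element-wise score (assumed stand-in): no reduction across the
    # vocabulary, so every entry can be computed independently.
    # The outputs do not sum to 1.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def accept_or_resample(p, q, draft_token, rng):
    """One verification step of standard speculative sampling.

    p: target-model scores per token, q: draft-model scores per token.
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    ratio = p[draft_token] / q[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q).
    # Each residual entry depends on a single index, which is the kind of
    # element-wise work that can be spread across parallel threads.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    if sum(residual) == 0.0:
        residual = list(p)  # degenerate case: fall back to the target scores
    token = rng.choices(range(len(p)), weights=residual)[0]
    return token, False
```

Passing `softmax(logits)` for both models gives the exact procedure; substituting `sigmoid(logits)` illustrates how the vocabulary-wide reduction could be avoided, at the cost of scores that are no longer a normalized distribution.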

Comments:	Accepted at EMNLP 2024
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2406.11016 [cs.LG]
	(or arXiv:2406.11016v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.11016

Submission history

From: Dominik Wagner
[v1] Sun, 16 Jun 2024 17:19:23 UTC (349 KB)
[v2] Thu, 3 Oct 2024 08:05:14 UTC (321 KB)
