[CuDNN Attention] Performance Grouped Query Attention #139586
Open
@drisspg

Description


Summary

We recently landed support for grouped query attention via the `enable_gqa` flag on SDPA; however, it is only enabled for the Flash Attention backend. This leads to an odd situation where it can be more beneficial for a user to skip the `enable_gqa` flag and call `repeat_interleave` on the key/value tensors before calling SDPA, so that the cuDNN backend can be used instead. A minimal sketch of the two call patterns follows below; the shapes are made up for illustration, and the relative performance depends on which backend each path dispatches to.
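
```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 8 query heads sharing 2 KV heads (4:1 GQA ratio).
B, H_q, H_kv, S, D = 4, 8, 2, 1024, 64
q = torch.randn(B, H_q, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H_kv, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H_kv, S, D, device="cuda", dtype=torch.bfloat16)

# Path 1: native GQA; today this can only dispatch to the Flash Attention backend.
out_gqa = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)

# Path 2: manually expand the KV heads so all backends (including cuDNN) see
# matching head counts, at the cost of materializing the repeated tensors.
k_rep = k.repeat_interleave(H_q // H_kv, dim=1)
v_rep = v.repeat_interleave(H_q // H_kv, dim=1)
out_rep = F.scaled_dot_product_attention(q, k_rep, v_rep)
```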

It looks like there is explicit support for the GQA case in the cuDNN API, so we should add support for it:
https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnn-backend-operation-reduction-descriptor
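
For reference, a sketch of how one can pin SDPA to the cuDNN backend today to see the gap; `sdpa_kernel` and `SDPBackend.CUDNN_ATTENTION` are the existing backend-selection context manager, and the point is that the repeated-KV call dispatches to cuDNN while an `enable_gqa=True` call currently does not:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

B, H_q, H_kv, S, D = 4, 8, 2, 1024, 64
q = torch.randn(B, H_q, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H_kv, S, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H_kv, S, D, device="cuda", dtype=torch.bfloat16)

# Pin SDPA to the cuDNN backend. The repeated-KV call below is eligible;
# adding cuDNN eligibility for enable_gqa=True is what this issue asks for.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out_cudnn = F.scaled_dot_product_attention(
        q,
        k.repeat_interleave(H_q // H_kv, dim=1),
        v.repeat_interleave(H_q // H_kv, dim=1),
    )
```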

cc @msaroufim @mikaylagawarecki @jainapurva, @eqy, @Skylion007

    Labels

    module: performance (Issues related to performance, either of kernel code or framework glue)
    module: sdpa (All things related to torch.nn.functional.scaled_dot_product_attention)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
