CUTLASS 3.8.0 - January 2025
CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes these "moving parts" into reusable, modular software components abstracted by C++ template classes. Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policies. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
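As a minimal sketch of this composability, the device-level GEMM below mirrors the single-precision example from the Quick Start Guide; tile sizes, stage counts, and other policies are left at their defaults and can be overridden through additional template arguments.

```cpp
#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM with column-major operands. The template arguments
// select element types and layouts; further arguments (tile shapes, epilogue,
// stage count) default to reasonable values and can be tuned explicitly.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C and D

cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const* A, int lda,
                          float const* B, int ldb,
                          float beta, float* C, int ldc) {
  Gemm gemm_op;
  // Computes D = alpha * A * B + beta * C, with D aliasing C here.
  return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc}, {alpha, beta}});
}
```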
To support a wide variety of applications, CUTLASS provides extensive support for mixed-precision computations, with specialized data-movement and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, FP32 emulation via tensor core instructions, 8-bit floating point types (e5m2 and e4m3), block-scaled data types (NVIDIA NVFP4 and the OCP standard MXFP4, MXFP6, and MXFP8), narrow integer types (4-bit and 8-bit signed and unsigned integers), and binary 1-bit data types (where architectures allow native support for such data types). CUTLASS demonstrates optimal matrix multiply operations targeting the programmable, high-throughput Tensor Cores implemented by NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
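For intuition, the sketch below shows how the forward-propagation case maps convolution extents onto GEMM extents (variable names here are illustrative, assuming NHWC activations, KRSC filters, and NPQK output):

```cpp
// Implicit GEMM (fprop): each output pixel is a GEMM row, each filter a
// GEMM column, and the reduction runs over the filter footprint.
int gemm_m = N * P * Q;   // N batches of P x Q output pixels
int gemm_n = K;           // K filters
int gemm_k = R * S * C;   // R x S filter window over C input channels
// CUTLASS never materializes the im2col matrix; these logical coordinates
// are computed on the fly inside the tiled GEMM pipeline.
```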
See the Quick Start Guide to get started quickly.
See the functionality docs for a more comprehensive list of kernel-level features, data types, instructions, and the minimum CUDA toolkit versions supported by CUTLASS on each GPU architecture.
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture. For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell architecture:
- 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
- Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as `tmem` across CuTe as a first-class data locale.
- Exposure of `tmem->rmem`, `rmem->tmem`, and `smem->tmem` data movement instructions as copy atoms in CuTe.
- `make_tmem_copy()` utility method to ease creation of tiled copies for tmem copy atoms (see the tiled-copy sketch after this list).
- Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
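For context, the sketch below shows the general CuTe pattern these new atoms slot into: wrap one hardware instruction in a `Copy_Atom`, then tile it over a thread/value layout to obtain a `TiledCopy`. It uses a pre-Blackwell cp.async atom from the CuTe tutorials; per the notes above, `make_tmem_copy()` produces the analogous `TiledCopy` for the new tmem atoms, whose exact names are not reproduced here.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// An Ampere cp.async atom moving 16 bytes (4 floats) per thread per
// instruction, tiled over 256 threads:
auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>{},
    Layout<Shape<_32, _8>>{},   // thread layout: 32 x 8 threads
    Layout<Shape< _4, _1>>{});  // value layout: 4 x 1 floats per thread
// On Blackwell, make_tmem_copy() plays the same role for the
// tmem->rmem / rmem->tmem copy atoms listed above.
```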
- Support for new CUTLASS building blocks specifically for Blackwell architecture:
- Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
- Pipelines that implement Blackwell specific synchronization.
- Cluster launch control API supporting preferred and fallback cluster shapes.
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types (element-type spellings sketched after this list).
- Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, including stream-K.
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
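For reference, the FP8 element types below are long-standing CUTLASS spellings; the FP4/FP6 spellings are assumptions following the same `e<exponent>m<mantissa>` naming convention and should be checked against the 3.8 headers.

```cpp
#include "cutlass/numeric_types.h"

// OCP FP8 element types (present in earlier CUTLASS releases as well):
cutlass::float_e4m3_t fp8_a(1.5f);    // 8-bit: 4 exponent, 3 mantissa bits
cutlass::float_e5m2_t fp8_b(0.25f);   // 8-bit: 5 exponent, 2 mantissa bits

// Narrow-precision spellings assumed by analogy (verify against 3.8 headers):
// cutlass::float_e2m1_t                           -- FP4
// cutlass::float_e3m2_t, cutlass::float_e2m3_t    -- FP6 variants
```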
- Full support for Blackwell kernels in CUTLASS 3.x API:
- Blackwell specific kernel layers that
- Implement a new warp-specialization recipe tuned specifically for Blackwell.
- Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
- Support stream-K load balancing for all kernel types via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for:
- Non-block scaled data types without support for pointer array and grouped GEMM with TMA
- Non-block scaled data types with support for pointer array and grouped GEMM with TMA
- Block scaled data types without support for pointer array and grouped GEMM with TMA
- Block scaled data types with support for pointer array and grouped GEMM with TMA
- Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
- Blackwell epilogue that supports loading accumulators from `tmem` and the full set of EVT fusions.
- CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command line argument parsing to set dynamic cluster shapes.
- Support for dynamic data types via profiler command line argument parsing to set the dynamic data type in TCGen05 MMA instruction descriptors.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell:
- Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off-the-shelf kernels using the 3.x collective builder API (see the builder sketch after this list).
- GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores.
- GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
- GEMM with CLC based stream-K scheduler for load balancing.
- Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
- Convolution kernels for fprop, dgrad, and wgrad.
- Fused multi-head attention fprop kernel supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
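As the migration bullet above suggests, a 3.x kernel is assembled from builder-generated collectives. The sketch below shows the Hopper (SM90) spelling used by the existing examples; per these notes, retargeting to Blackwell is largely a matter of the architecture tag and schedule, and the SM100 spellings are not reproduced here. A tile scheduler such as `cutlass::gemm::StreamKScheduler` can be supplied as `GemmUniversal`'s optional fourth parameter.

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_2, cute::_1, cute::_1>;

// Epilogue collective: float accumulation and compute, float C/D tensors.
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,
    float, cutlass::layout::ColumnMajor, 4,
    float, cutlass::layout::ColumnMajor, 4,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Mainloop collective: FP16 inputs, float accumulation, stage count sized
// automatically around the epilogue's shared memory carveout.
using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,
    float,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,    // problem shape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;                // optional 4th param: tile scheduler

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```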
- Documentation updates:
- Quickstart - instantiating a Blackwell block-scaled GEMM.
- Detailed Blackwell block-scaled GEMM functionality documentation
- New functionality documentation specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x-supported architectures.
- Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.
See the CHANGELOG for details of all past releases and updates.
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical throughput. The figure below shows CUTLASS 3.8's performance as a % of theoretical peak utilization on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU.
The two figures below show the continual CUTLASS performance improvements on an NVIDIA H100 (NVIDIA Hopper architecture) since CUTLASS 3.1. CUTLASS 3.5.1 was compiled with the CUDA 12.5u1 Toolkit. Tensor Core operations are implemented using CUDA's mma and wgmma instructions.