CUTLASS 3.8.0 - January 2025
CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes these "moving parts" into reusable, modular software components abstracted by C++ template classes. Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policies. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
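As a minimal sketch of this composability, the device-level GEMM below mirrors the single-precision example from the Quick Start Guide; tile sizes, stage counts, and other policies are left at their defaults and can be overridden through additional template arguments.

```cpp
#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM with column-major operands. The template arguments
// select element types and layouts; further arguments (tile shapes, epilogue,
// stage count) default to reasonable values and can be tuned explicitly.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C and D

cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const* A, int lda,
                          float const* B, int ldb,
                          float beta, float* C, int ldc) {
  Gemm gemm_op;
  // Computes D = alpha * A * B + beta * C, with D aliasing C here.
  return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc}, {alpha, beta}});
}
```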
To support a wide variety of applications, CUTLASS provides extensive support for mixed-precision computations, with specialized data-movement and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, FP32 emulation via tensor core instructions, 8-bit floating point types (e5m2 and e4m3), block-scaled data types (NVIDIA NVFP4 and the OCP standard MXFP4, MXFP6, and MXFP8), narrow integer types (4-bit and 8-bit signed and unsigned integers), and binary 1-bit data types (where architectures allow native support for such data types). CUTLASS demonstrates optimal matrix multiply operations targeting the programmable, high-throughput Tensor Cores implemented by NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
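For intuition, the sketch below shows how the forward-propagation case maps convolution extents onto GEMM extents (variable names here are illustrative, assuming NHWC activations, KRSC filters, and NPQK output):

```cpp
// Implicit GEMM (fprop): each output pixel is a GEMM row, each filter a
// GEMM column, and the reduction runs over the filter footprint.
int gemm_m = N * P * Q;   // N batches of P x Q output pixels
int gemm_n = K;           // K filters
int gemm_k = R * S * C;   // R x S filter window over C input channels
// CUTLASS never materializes the im2col matrix; these logical coordinates
// are computed on the fly inside the tiled GEMM pipeline.
```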
See the Quick Start Guide to get started quickly.
See the functionality docs for a more comprehensive list of kernel-level features, data types, instructions, and the minimum CUDA toolkit versions supported by CUTLASS on each GPU architecture.
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture. For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell architecture:
- 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
- Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as `tmem` across CuTe as a first-class data locale.
- Exposure of `tmem->rmem`, `rmem->tmem`, and `smem->tmem` data movement instructions as copy atoms in CuTe.
- `make_tmem_copy()` utility method to ease creation of tiled copies for tmem copy atoms (see the tiled-copy sketch after this list).
- Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
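For context, the sketch below shows the general CuTe pattern these new atoms slot into: wrap one hardware instruction in a `Copy_Atom`, then tile it over a thread/value layout to obtain a `TiledCopy`. It uses a pre-Blackwell cp.async atom from the CuTe tutorials; per the notes above, `make_tmem_copy()` produces the analogous `TiledCopy` for the new tmem atoms, whose exact names are not reproduced here.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// An Ampere cp.async atom moving 16 bytes (4 floats) per thread per
// instruction, tiled over 256 threads:
auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEALWAYS<uint128_t>, float>{},
    Layout<Shape<_32, _8>>{},   // thread layout: 32 x 8 threads
    Layout<Shape< _4, _1>>{});  // value layout: 4 x 1 floats per thread
// On Blackwell, make_tmem_copy() plays the same role for the
// tmem->rmem / rmem->tmem copy atoms listed above.
```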
- Support for new CUTLASS building blocks specifically for Blackwell architecture:
- Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
- Pipelines that implement Blackwell specific synchronization.
- Cluster launch control API supporting preferred and fallback cluster shapes.
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types (element-type spellings sketched after this list).
- Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, including stream-K.
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
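For reference, the FP8 element types below are long-standing CUTLASS spellings; the FP4/FP6 spellings are assumptions following the same `e<exponent>m<mantissa>` naming convention and should be checked against the 3.8 headers.

```cpp
#include "cutlass/numeric_types.h"

// OCP FP8 element types (present in earlier CUTLASS releases as well):
cutlass::float_e4m3_t fp8_a(1.5f);    // 8-bit: 4 exponent, 3 mantissa bits
cutlass::float_e5m2_t fp8_b(0.25f);   // 8-bit: 5 exponent, 2 mantissa bits

// Narrow-precision spellings assumed by analogy (verify against 3.8 headers):
// cutlass::float_e2m1_t                           -- FP4
// cutlass::float_e3m2_t, cutlass::float_e2m3_t    -- FP6 variants
```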
- Full support for Blackwell kernels in CUTLASS 3.x API:
- Blackwell specific kernel layers that
- Implement a new warp-specialization recipe tuned specifically for Blackwell.
- Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
- Support stream-K load balancing for all kernel types via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for:
- Non-block scaled data types without support for pointer array and grouped GEMM with TMA
- Non-block scaled data types with support for pointer array and grouped GEMM with TMA
- Block scaled data types without support for pointer array and grouped GEMM with TMA
- Block scaled data types with support for pointer array and grouped GEMM with TMA
- Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
- Blackwell epilogue that supports loading accumulators from `tmem` and the full set of EVT fusions.
- CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command line argument parsing to set dynamic cluster shapes.
- Support for dynamic data types via profiler command line argument parsing to set the dynamic data type in TCGen05 MMA instruction descriptors.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell:
- Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off-the-shelf kernels using the 3.x collective builder API (see the builder sketch after this list).
- GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores.
- GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
- GEMM with CLC based stream-K scheduler for load balancing.
- Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
- Convolution kernels for fprop, dgrad, and wgrad.
- Fused multi-head attention fprop kernel supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
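As the migration bullet above suggests, a 3.x kernel is assembled from builder-generated collectives. The sketch below shows the Hopper (SM90) spelling used by the existing examples; per these notes, retargeting to Blackwell is largely a matter of the architecture tag and schedule, and the SM100 spellings are not reproduced here. A tile scheduler such as `cutlass::gemm::StreamKScheduler` can be supplied as `GemmUniversal`'s optional fourth parameter.

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_2, cute::_1, cute::_1>;

// Epilogue collective: float accumulation and compute, float C/D tensors.
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,
    float, cutlass::layout::ColumnMajor, 4,
    float, cutlass::layout::ColumnMajor, 4,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Mainloop collective: FP16 inputs, float accumulation, stage count sized
// automatically around the epilogue's shared memory carveout.
using CollectiveMainloop = cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor, 8,
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,
    float,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,    // problem shape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;                // optional 4th param: tile scheduler

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```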
- Documentation updates:
- Quickstart - instantiating a Blackwell block-scaled GEMM.
- Detailed Blackwell block-scaled GEMM functionality documentation
- New functionality documentation specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x-supported architectures.
- Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.
See the CHANGELOG for details of all past releases and updates.
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical throughput. The figure below shows CUTLASS 3.8's performance as a % of theoretical peak utilization on various input and output data types when run on NVIDIA Blackwell SM100 architecture GPU.
The two figures below show the continual CUTLASS performance improvements on an NVIDIA H100 (NVIDIA Hopper architecture) since CUTLASS 3.1. CUTLASS 3.5.1 was compiled with the CUDA 12.5u1 Toolkit. Tensor Core operations are implemented using CUDA's mma and wgmma instructions.