Releases: NVIDIA/cutlass

CUTLASS 3.9.2

04 May 04:25
ad7b2f5
  • Fixed a hang in Blockwise and Groupwise GEMMs when the problem size K is 128.
  • Optimal code generation with CUDA toolkit version 12.9.

CUTLASS 3.9.1

01 May 04:29
f535c33
  • Fixed a Group GEMM hang issue in CUTLASS 3.x.
  • Improved Hopper Blockwise and Groupwise GEMM performance.

CUTLASS 3.9.0

25 Apr 01:53
e94e888

CUTLASS 3.8.0

21 Feb 05:32
afa1772

CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
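
Blackwell SM100 devices report compute capability 10.x to the CUDA runtime. As a hedged illustration (not part of the release notes or the CUTLASS sources), the sketch below checks at runtime whether the active device is Blackwell-class before selecting SM100 kernels.

```cpp
// Minimal sketch (assumption: standard CUDA runtime API only, not CUTLASS code):
// check whether the active device is a Blackwell-class (SM100, compute capability 10.x) GPU.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  cudaDeviceProp props{};
  if (cudaGetDeviceProperties(&props, device) != cudaSuccess) {
    std::printf("No usable CUDA device found\n");
    return 1;
  }
  if (props.major >= 10) {
    std::printf("Blackwell-class device detected: sm_%d%d\n", props.major, props.minor);
  } else {
    std::printf("Device is sm_%d%d; SM100 kernels will not run here.\n", props.major, props.minor);
  }
  return 0;
}
```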

Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.

CUTLASS 3.7.0

18 Jan 15:07
b78588d
  • A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
  • Distributed GEMM is an experimental pipelined Tensor Parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide most of the communication behind computation.
  • Improved persistent grid launch for Hopper kernels with large cluster sizes (cluster size >= 4) using the new make_kernel_hardware_info API, as shown in example 48 and in the sketch after this list.
  • Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
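
Persistent grid launches are sized from the number of SMs on the target device. The following is a minimal sketch of that query using only the CUDA runtime; CUTLASS 3.x carries the same information through its KernelHardwareInfo struct, mentioned here for context only, with exact helper names treated as an assumption rather than reproduced.

```cpp
// Minimal sketch: query the SM count that a persistent grid launch is sized against.
// Uses only the CUDA runtime; CUTLASS 3.x exposes an equivalent query through its
// KernelHardwareInfo struct (named here for context only, not reproduced).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  cudaGetDevice(&device);

  int sm_count = 0;
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  // A persistent kernel launches on the order of sm_count CTAs (or clusters) and
  // loops over work tiles, instead of launching one CTA per output tile.
  std::printf("device %d: %d SMs available to size a persistent grid\n", device, sm_count);
  return 0;
}
```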

CUTLASS 3.6.0

25 Dec 22:19
bf9da7b

CUTLASS 3.5.1

29 Aug 20:15
f7b19de

CUTLASS 3.5.0

12 Apr 01:40
7d49e6c
  • Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
    • Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
    • Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
    • Support for Fprop, Dgrad, and Wgrad algorithms.
    • CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
    • NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Feedback on the design is welcome!
  • Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
  • Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
    • Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
    • Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
  • 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
  • Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series (a minimal cute::Tensor example follows this list).
  • Extensions to CuTe to support L2 prefetching and TMA store+reductions.
  • Removed the C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
  • Fixes to greatly reduce build warnings.
  • Updates and bugfixes from the community (thanks!)
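
As a companion to the cute::Tensor<> documentation mentioned above, here is a minimal, self-contained sketch (illustrative only; the shapes, strides, and values are assumptions, not taken from the release) that builds a static CuTe layout, views a plain array through it as a cute::Tensor, and indexes an element.

```cpp
// Minimal CuTe sketch: build a static layout, view a plain array through it as a
// cute::Tensor, and index an element. Shapes, strides, and values are illustrative only.
#include <cute/tensor.hpp>
#include <cstdio>

int main() {
  using namespace cute;

  float data[8 * 4];
  for (int i = 0; i < 8 * 4; ++i) { data[i] = float(i); }

  // 8x4 column-major layout: stride 1 down a column, stride 8 across columns.
  auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                            make_stride(Int<1>{}, Int<8>{}));

  // A cute::Tensor pairs an engine (here, a raw pointer) with a layout.
  auto tensor = make_tensor(&data[0], layout);

  print(layout); print("\n");
  std::printf("tensor(3, 2) = %g\n", double(tensor(3, 2)));  // row 3, column 2 -> data[3 + 2*8]
  return 0;
}
```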

CUTLASS 3.4.1

15 Feb 21:03
bbe579a

CUTLASS 3.4.0

16 Jan 22:39
751eb9a
  • Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
  • Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
  • Beta release of Group-GEMM, commonly used in optimizing Mixture-of-Experts models, now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
  • Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
  • Improvements to NamedBarriers, including details of the ReservedNamedBarriers used within the CUTLASS library.
  • Improved CuTe documentation, with greater clarity and depth in the Quickstart, CuTe Layout, and CuTe Layout Algebra pages. Associated code comments, post-conditions, and details in the CuTe core unit tests were also improved (a small layout-algebra sketch follows this list).
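
To go with the CuTe Layout Algebra documentation, the sketch below (illustrative only; the shapes and strides are assumptions) composes two layouts with cute::composition, the operation at the heart of the layout algebra: the result indexes layout A through layout B.

```cpp
// Minimal CuTe layout-algebra sketch: functional composition of two layouts,
// R = composition(A, B) so that R(i) == A(B(i)). Shapes and strides are illustrative only.
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A: a 4x8 layout, column-major by default, i.e. (4,8):(1,4).
  auto A = make_layout(make_shape(Int<4>{}, Int<8>{}));

  // B: a 1-D layout 8:4 that selects every 4th linear coordinate of A.
  auto B = make_layout(make_shape(Int<8>{}), make_stride(Int<4>{}));

  // R indexes A through B: R(i) == A(B(i)) == A(4 * i).
  auto R = composition(A, B);

  print(A); print("\n");
  print(B); print("\n");
  print(R); print("\n");
  return 0;
}
```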