Releases: ml-explore/mlx
v0.26.0
Highlights
- 5-bit quantization (see the sketch below)
- Significant progress on CUDA back-end by @zcbenz
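A minimal sketch of the new 5-bit option, assuming `mx.quantize`/`mx.dequantize` accept `bits=5` with the usual group-wise affine scheme (the shapes and group size below are illustrative):

```python
import mlx.core as mx

# Group-wise affine quantization of a weight matrix to 5 bits.
w = mx.random.normal((512, 512))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)

# Round-trip to check the quantization error (bits=5 is the new option).
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())
```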
Core
Features
- 5-bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- Reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- Convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes` (see the sketch after this list)
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers
- Enable vjp for quantized scale and bias
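A few of the new ops in one place. This is a sketch that assumes NumPy-style semantics for `mx.broadcast_shapes` and `mx.fft.fftshift`, and that `mx.linalg.eig` returns an (eigenvalues, eigenvectors) pair and runs on the CPU stream like the other linalg ops:

```python
import mlx.core as mx

# Resulting broadcast shape, without materializing any arrays.
print(mx.broadcast_shapes((8, 1, 3), (1, 4, 3)))  # expected (8, 4, 3)

# Center the zero-frequency component of a spectrum.
x = mx.random.normal((16,))
centered = mx.fft.fftshift(mx.fft.fft(x))

# Eigen-decomposition of a non-symmetric matrix; results are complex in general.
a = mx.random.normal((4, 4))
vals, vecs = mx.linalg.eig(a, stream=mx.cpu)
print(vals.dtype)
```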
Performance
- Optimize complex matrix multiplication using Karatsuba's algorithm (see the sketch below)
- Much faster 1D conv
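The complex matmul speedup comes from the three-multiplication (Karatsuba/Gauss) identity: with A = Ar + i*Ai and B = Br + i*Bi, the real part is Ar@Br - Ai@Bi and the imaginary part is (Ar + Ai)@(Br + Bi) - Ar@Br - Ai@Bi, so three real matmuls replace four. A sketch that checks the identity against MLX's complex matmul, assuming Python complex scalars promote arrays to `complex64`:

```python
import mlx.core as mx

def complex_mm_3mul(a, b):
    # Karatsuba/Gauss trick: three real matmuls instead of four.
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    rr = ar @ br
    ii = ai @ bi
    cross = (ar + ai) @ (br + bi)
    return (rr - ii) + 1j * (cross - rr - ii)

a = mx.random.normal((64, 64)) + 1j * mx.random.normal((64, 64))
b = mx.random.normal((64, 64)) + 1j * mx.random.normal((64, 64))
print(mx.allclose(complex_mm_3mul(a, b), a @ b, atol=1e-3))
```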
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- Include `mlx::core::version()` symbols in the mlx static library
- Fix nearest upsample
- Fix large arg reduce
- Fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple of edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- Fix `conv_general` differences between gpu and cpu
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fix shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1
v0.25.2
v0.25.1
v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark; see the sketch after this list)
- Depthwise separable convolutions
  - Up to 4x faster than PyTorch (benchmark)
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
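The logsumexp change mostly pays off in losses over large vocabularies, where materializing a full softmax dominates memory. A sketch of the standard log-softmax formulation written directly on `mx.logsumexp` (shapes are illustrative):

```python
import mlx.core as mx

logits = mx.random.normal((8, 32000))          # (batch, vocab)
targets = mx.random.randint(0, 32000, (8,))    # class indices

# log_softmax(x) = x - logsumexp(x); no explicit softmax tensor is needed.
log_probs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
nll = -mx.take_along_axis(log_probs, targets[:, None], axis=-1)
print(nll.mean())  # average negative log-likelihood
```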
Core
Performance
- Fused vector attention supports 256 dim
- Tune quantized matrix vector dispatch for small batches of vectors
Features
- Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
- Enable using MPI from all platforms; only OpenMPI is allowed
- Add a ring all gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports literal "causal" mask
- Log for complex numbers
- Distributed `all_min` and `all_max`, both for MPI and the ring backend
- Add `logcumsumexp` (see the sketch after this list)
- Add additive mask for fused vector attention
- Improve the usage of the residency set
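A sketch of the relocated memory API and the new cumulative logsumexp. It assumes the top-level names mirror the previous `mx.metal.*` helpers (`get_active_memory`, `get_peak_memory`) and that `mx.logcumsumexp` takes an `axis` argument like the other scans:

```python
import mlx.core as mx

# Memory introspection now lives directly in mlx.core.
x = mx.random.normal((1024, 1024))
mx.eval(x)
print(mx.get_active_memory(), mx.get_peak_memory())

# Running normalizer along the last axis, computed in a numerically stable way.
scores = mx.random.normal((4, 10))
running = mx.logcumsumexp(scores, axis=-1)
print(running.shape)  # (4, 10)
```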
NN
- Add sharded layers for model/tensor parallelism
Bug Fixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs
v0.24.2
v0.24.1
v0.24.0
Highlights
- Much faster fused attention with support for causal masking
- Benchmarks
- Improvements in prompt processing speed and memory use, benchmarks
- Much faster small batch fused attention for e.g. speculative decoding, benchmarks
- Major redesign of CPU back-end for faster CPU-GPU synchronization
Core
Performance
- Support fused masking in `scaled_dot_product_attention` (see the sketch after this list)
- Support transposed head/seq for fused vector `scaled_dot_product_attention`
- SDPA support for small batch (over sequence) queries
- Enable fused attention for head dim 128
- Redesign CPU back-end for faster cpu/gpu synch
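A sketch of the fused attention call with a mask. It assumes `mx.fast.scaled_dot_product_attention` takes `(B, H, L, D)` inputs with `scale` and `mask` keyword arguments, and that the mask is additive and broadcast over the score matrix (the v0.25.0 notes later add a literal `"causal"` mask as a shortcut):

```python
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# Additive causal mask: -inf above the diagonal disables future positions.
causal = mx.triu(mx.full((L, L), float("-inf")), k=1)

out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5, mask=causal)
print(out.shape)  # (1, 8, 128, 64)
```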
Features
- Allow debugging in distributed mode
- Support `mx.fast.rms_norm` without scale
- Add nuclear norm support in `mx.linalg.norm`
- Add XOR on arrays
- Added `mlx::core::version()`
- Allow non-square lu in `mx.linalg.lu`
- Double for lapack ops (`eigh`, `svd`, etc.); see the sketch after this list
- Add a prepare tb ring script
- Ring docs
- Affine quant always in fp32
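A sketch of the expanded linalg support. Assumptions: `mx.linalg.lu` follows the SciPy convention of returning a permutation/L/U triple, the nuclear norm is selected with `ord="nuc"` as in NumPy, and these ops run on the CPU stream (the double-precision part is wrapped in an `mx.stream(mx.cpu)` context to keep it off the GPU):

```python
import mlx.core as mx

# Non-square LU factorization (return convention assumed SciPy-like).
a = mx.random.normal((5, 3))
p, l, u = mx.linalg.lu(a, stream=mx.cpu)

# Nuclear norm (sum of singular values).
print(mx.linalg.norm(a, ord="nuc", stream=mx.cpu))

# Double precision inputs for the LAPACK-backed ops (CPU only).
with mx.stream(mx.cpu):
    b = mx.random.normal((4, 4)).astype(mx.float64)
    b = b + b.T  # symmetric, so eigh applies
    vals, vecs = mx.linalg.eigh(b)
    print(vals.dtype)  # float64
```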
Optimizers
- Add a multi optimizer, `optimizers.MultiOptimizer` (see the sketch below)
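A sketch of wiring up the new multi optimizer. The constructor signature is an assumption here: a list of optimizers plus a list of filter callables over (parameter path, weight), with the last optimizer acting as the fallback for anything the filters do not claim:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Assumed routing: biases use plain SGD, every other weight uses Adam.
opt = optim.MultiOptimizer(
    [optim.SGD(learning_rate=0.1), optim.Adam(learning_rate=1e-3)],
    [lambda path, weight: path.endswith("bias")],
)

def loss_fn(m, x, y):
    return nn.losses.cross_entropy(m(x), y, reduction="mean")

x = mx.random.normal((8, 32))
y = mx.random.randint(0, 10, (8,))
loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)
opt.update(model, grads)
mx.eval(model.parameters(), opt.state)
```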
Bug Fixes
- Do not define `MLX_VERSION` globally
- Reduce binary size post fast synch
- Fix vmap for flatten
- Fix copy for large arrays with JIT
- Fix grad with inplace updates
- Use same accumulation precision in gemv as gemm
- Fix slice data size
- Use a heap for small sizes
- Fix donation in scan
- Ensure linspace always contains start and stop
- Raise an exception in the rope op if input is integer
- Limit compile buffers by
- Fix `mx.float64` type promotion
- Fix CPU SIMD erf_inv
- Update smooth_l1_loss in losses.
v0.23.2
v0.23.1
v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
  - Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
  - Faster winograd convolutions, benchmarks
  - Up to 3x faster sort, benchmarks
  - Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
  - Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU (see the sketch after this list)
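A sketch of the two highlights above. Double precision is CPU-only, so the first part keeps the computation on the CPU with an `mx.stream(mx.cpu)` context; the gather call assumes NumPy-style `take_along_axis` semantics:

```python
import mlx.core as mx

# Double precision stays on the CPU.
with mx.stream(mx.cpu):
    x = mx.array([1.0, 2.0, 3.0], dtype=mx.float64)
    print((x * x).sum())  # float64 arithmetic

# Gather one element per row along axis 1.
a = mx.arange(12).reshape(3, 4)
idx = mx.array([[0], [2], [1]])
print(mx.take_along_axis(a, idx, axis=1))  # shape (3, 1)
```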
Core
Features
- Bitwise invert `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, `mx.linalg.solve_triangular` (see the sketch after this list)
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify accelerate and common back-ends
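A sketch of the new linear algebra entry points and the einsum ellipsis support, assuming the linalg routines run on the CPU stream:

```python
import mlx.core as mx

a = mx.random.normal((4, 4))
b = mx.random.normal((4, 2))

# Solve a @ x = b.
x = mx.linalg.solve(a, b, stream=mx.cpu)
print(mx.allclose(a @ x, b, atol=1e-3))

# Non-square QR factorization.
q, r = mx.linalg.qr(mx.random.normal((6, 3)), stream=mx.cpu)
print(q.shape, r.shape)

# Ellipsis broadcasts over leading batch dimensions in einsum.
t = mx.random.normal((2, 3, 5, 7))
u = mx.random.normal((2, 3, 7, 4))
print(mx.einsum("...ij,...jk->...ik", t, u).shape)  # (2, 3, 5, 4)
```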
Performance
- Faster `Fence` for CPU-GPU synchronization
- Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
- Fast winograd convolutions, benchmarks
- Allow dynamic ops per buffer based on dispatches and memory, benchmarks
- Up to 3x faster sort, benchmarks
- Faster small batch qmv, benchmarks
- Ring distributed backend
  - Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new `Simd<T, N>`
NN
- Orthogonal initializer `nn.init.orthogonal` (see the sketch after this list)
- Add dilation for conv 3d layers
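A sketch of the new initializer, assuming it behaves like the other `nn.init` helpers: calling `nn.init.orthogonal()` returns a function that maps an input array to an initialized array of the same shape:

```python
import mlx.core as mx
import mlx.nn as nn

init_fn = nn.init.orthogonal()
w = init_fn(mx.zeros((4, 4)))

# For a square output the result should be orthonormal: w.T @ w ~ I.
print(mx.allclose(w.T @ w, mx.eye(4), atol=1e-5))
```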
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when `shapeless` changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading empty list is ok when `strict = false`
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts