Releases: ml-explore/mlx
v0.26.0
Highlights
- 5-bit quantization (see the sketch below)
- Significant progress on CUDA back-end by @zcbenz
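A minimal sketch of the new 5-bit option, assuming `mx.quantize`/`mx.dequantize` accept `bits=5` with the usual group-wise affine scheme (the shapes and group size below are illustrative):

```python
import mlx.core as mx

# Group-wise affine quantization of a weight matrix to 5 bits.
w = mx.random.normal((512, 512))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)

# Round-trip to check the quantization error (bits=5 is the new option).
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())
```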
Core
Features
- 5-bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- Reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- Convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes` (see the sketch after this list)
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers
- Enable vjp for quantized scale and bias
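A few of the new ops in one place. This is a sketch that assumes NumPy-style semantics for `mx.broadcast_shapes` and `mx.fft.fftshift`, and that `mx.linalg.eig` returns an (eigenvalues, eigenvectors) pair and runs on the CPU stream like the other linalg ops:

```python
import mlx.core as mx

# Resulting broadcast shape, without materializing any arrays.
print(mx.broadcast_shapes((8, 1, 3), (1, 4, 3)))  # expected (8, 4, 3)

# Center the zero-frequency component of a spectrum.
x = mx.random.normal((16,))
centered = mx.fft.fftshift(mx.fft.fft(x))

# Eigen-decomposition of a non-symmetric matrix; results are complex in general.
a = mx.random.normal((4, 4))
vals, vecs = mx.linalg.eig(a, stream=mx.cpu)
print(vals.dtype)
```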
Performance
- Optimize complex matrix multiplication using Karatsuba's algorithm (see the sketch below)
- Much faster 1D conv
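The complex matmul speedup comes from the three-multiplication (Karatsuba/Gauss) identity: with A = Ar + i*Ai and B = Br + i*Bi, the real part is Ar@Br - Ai@Bi and the imaginary part is (Ar + Ai)@(Br + Bi) - Ar@Br - Ai@Bi, so three real matmuls replace four. A sketch that checks the identity against MLX's complex matmul, assuming Python complex scalars promote arrays to `complex64`:

```python
import mlx.core as mx

def complex_mm_3mul(a, b):
    # Karatsuba/Gauss trick: three real matmuls instead of four.
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    rr = ar @ br
    ii = ai @ bi
    cross = (ar + ai) @ (br + bi)
    return (rr - ii) + 1j * (cross - rr - ii)

a = mx.random.normal((64, 64)) + 1j * mx.random.normal((64, 64))
b = mx.random.normal((64, 64)) + 1j * mx.random.normal((64, 64))
print(mx.allclose(complex_mm_3mul(a, b), a @ b, atol=1e-3))
```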
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- Include `mlx::core::version()` symbols in the mlx static library
- Fix nearest upsample
- Fix large arg reduce
- Fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple of edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- Fix `conv_general` differences between gpu and cpu
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fix shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1
v0.25.2
v0.25.1
v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark; see the sketch after this list)
- Depthwise separable convolutions
  - Up to 4x faster than PyTorch (benchmark)
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
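The logsumexp change mostly pays off in losses over large vocabularies, where materializing a full softmax dominates memory. A sketch of the standard log-softmax formulation written directly on `mx.logsumexp` (shapes are illustrative):

```python
import mlx.core as mx

logits = mx.random.normal((8, 32000))          # (batch, vocab)
targets = mx.random.randint(0, 32000, (8,))    # class indices

# log_softmax(x) = x - logsumexp(x); no explicit softmax tensor is needed.
log_probs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
nll = -mx.take_along_axis(log_probs, targets[:, None], axis=-1)
print(nll.mean())  # average negative log-likelihood
```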
Core
Performance
- Fused vector attention supports 256 dim
- Tune quantized matrix vector dispatch for small batches of vectors
Features
- Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
- Enable using MPI from all platforms; only OpenMPI is allowed
- Add a ring all gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports literal "causal" mask
- Log for complex numbers
- Distributed `all_min` and `all_max`, both for MPI and the ring backend
- Add `logcumsumexp` (see the sketch after this list)
- Add additive mask for fused vector attention
- Improve the usage of the residency set
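A sketch of the relocated memory API and the new cumulative logsumexp. It assumes the top-level names mirror the previous `mx.metal.*` helpers (`get_active_memory`, `get_peak_memory`) and that `mx.logcumsumexp` takes an `axis` argument like the other scans:

```python
import mlx.core as mx

# Memory introspection now lives directly in mlx.core.
x = mx.random.normal((1024, 1024))
mx.eval(x)
print(mx.get_active_memory(), mx.get_peak_memory())

# Running normalizer along the last axis, computed in a numerically stable way.
scores = mx.random.normal((4, 10))
running = mx.logcumsumexp(scores, axis=-1)
print(running.shape)  # (4, 10)
```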
NN
- Add sharded layers for model/tensor parallelism
Bug Fixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs
v0.24.2
v0.24.1
v0.24.0
Highlights
- Much faster fused attention with support for causal masking
- Benchmarks
- Improvements in prompt processing speed and memory use, benchmarks
- Much faster small batch fused attention for e.g. speculative decoding, benchmarks
- Major redesign of CPU back-end for faster CPU-GPU synchronization
Core
Performance
- Support fused masking in `scaled_dot_product_attention` (see the sketch after this list)
- Support transposed head/seq for fused vector `scaled_dot_product_attention`
- SDPA support for small batch (over sequence) queries
- Enable fused attention for head dim 128
- Redesign CPU back-end for faster cpu/gpu synch
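A sketch of the fused attention call with a mask. It assumes `mx.fast.scaled_dot_product_attention` takes `(B, H, L, D)` inputs with `scale` and `mask` keyword arguments, and that the mask is additive and broadcast over the score matrix (the v0.25.0 notes later add a literal `"causal"` mask as a shortcut):

```python
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# Additive causal mask: -inf above the diagonal disables future positions.
causal = mx.triu(mx.full((L, L), float("-inf")), k=1)

out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5, mask=causal)
print(out.shape)  # (1, 8, 128, 64)
```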
Features
- Allow debugging in distributed mode
- Support `mx.fast.rms_norm` without scale
- Add nuclear norm support in `mx.linalg.norm`
- Add XOR on arrays
- Added `mlx::core::version()`
- Allow non-square lu in `mx.linalg.lu`
- Double for lapack ops (`eigh`, `svd`, etc.); see the sketch after this list
- Add a prepare tb ring script
- Ring docs
- Affine quant always in fp32
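A sketch of the expanded linalg support. Assumptions: `mx.linalg.lu` follows the SciPy convention of returning a permutation/L/U triple, the nuclear norm is selected with `ord="nuc"` as in NumPy, and these ops run on the CPU stream (the double-precision part is wrapped in an `mx.stream(mx.cpu)` context to keep it off the GPU):

```python
import mlx.core as mx

# Non-square LU factorization (return convention assumed SciPy-like).
a = mx.random.normal((5, 3))
p, l, u = mx.linalg.lu(a, stream=mx.cpu)

# Nuclear norm (sum of singular values).
print(mx.linalg.norm(a, ord="nuc", stream=mx.cpu))

# Double precision inputs for the LAPACK-backed ops (CPU only).
with mx.stream(mx.cpu):
    b = mx.random.normal((4, 4)).astype(mx.float64)
    b = b + b.T  # symmetric, so eigh applies
    vals, vecs = mx.linalg.eigh(b)
    print(vals.dtype)  # float64
```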
Optimizers
- Add a multi optimizer, `optimizers.MultiOptimizer` (see the sketch below)
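A sketch of wiring up the new multi optimizer. The constructor signature is an assumption here: a list of optimizers plus a list of filter callables over (parameter path, weight), with the last optimizer acting as the fallback for anything the filters do not claim:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Assumed routing: biases use plain SGD, every other weight uses Adam.
opt = optim.MultiOptimizer(
    [optim.SGD(learning_rate=0.1), optim.Adam(learning_rate=1e-3)],
    [lambda path, weight: path.endswith("bias")],
)

def loss_fn(m, x, y):
    return nn.losses.cross_entropy(m(x), y, reduction="mean")

x = mx.random.normal((8, 32))
y = mx.random.randint(0, 10, (8,))
loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)
opt.update(model, grads)
mx.eval(model.parameters(), opt.state)
```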
Bug Fixes
- Do not define `MLX_VERSION` globally
- Reduce binary size post fast synch
- Fix vmap for flatten
- Fix copy for large arrays with JIT
- Fix grad with inplace updates
- Use same accumulation precision in gemv as gemm
- Fix slice data size
- Use a heap for small sizes
- Fix donation in scan
- Ensure linspace always contains start and stop
- Raise an exception in the rope op if input is integer
- Limit compile buffers by
- Fix `mx.float64` type promotion
- Fix CPU SIMD erf_inv
- Update smooth_l1_loss in losses.
v0.23.2
v0.23.1
v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
  - Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
  - Faster winograd convolutions, benchmarks
  - Up to 3x faster sort, benchmarks
  - Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
  - Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU (see the sketch after this list)
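A sketch of the two highlights above. Double precision is CPU-only, so the first part keeps the computation on the CPU with an `mx.stream(mx.cpu)` context; the gather call assumes NumPy-style `take_along_axis` semantics:

```python
import mlx.core as mx

# Double precision stays on the CPU.
with mx.stream(mx.cpu):
    x = mx.array([1.0, 2.0, 3.0], dtype=mx.float64)
    print((x * x).sum())  # float64 arithmetic

# Gather one element per row along axis 1.
a = mx.arange(12).reshape(3, 4)
idx = mx.array([[0], [2], [1]])
print(mx.take_along_axis(a, idx, axis=1))  # shape (3, 1)
```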
Core
Features
- Bitwise invert `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, `mx.linalg.solve_triangular` (see the sketch after this list)
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify accelerate and common back-ends
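A sketch of the new linear algebra entry points and the einsum ellipsis support, assuming the linalg routines run on the CPU stream:

```python
import mlx.core as mx

a = mx.random.normal((4, 4))
b = mx.random.normal((4, 2))

# Solve a @ x = b.
x = mx.linalg.solve(a, b, stream=mx.cpu)
print(mx.allclose(a @ x, b, atol=1e-3))

# Non-square QR factorization.
q, r = mx.linalg.qr(mx.random.normal((6, 3)), stream=mx.cpu)
print(q.shape, r.shape)

# Ellipsis broadcasts over leading batch dimensions in einsum.
t = mx.random.normal((2, 3, 5, 7))
u = mx.random.normal((2, 3, 7, 4))
print(mx.einsum("...ij,...jk->...ik", t, u).shape)  # (2, 3, 5, 4)
```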
Performance
- Faster `Fence` for CPU-GPU synchronization
- Much faster `mx.put_along_axis` and `mx.take_along_axis`, benchmarks
- Fast winograd convolutions, benchmarks
- Allow dynamic ops per buffer based on dispatches and memory, benchmarks
- Up to 3x faster sort, benchmarks
- Faster small batch qmv, benchmarks
- Ring distributed backend
  - Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new `Simd<T, N>`
NN
- Orthogonal initializer `nn.init.orthogonal` (see the sketch after this list)
- Add dilation for conv 3d layers
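A sketch of the new initializer, assuming it behaves like the other `nn.init` helpers: calling `nn.init.orthogonal()` returns a function that maps an input array to an initialized array of the same shape:

```python
import mlx.core as mx
import mlx.nn as nn

init_fn = nn.init.orthogonal()
w = init_fn(mx.zeros((4, 4)))

# For a square output the result should be orthonormal: w.T @ w ~ I.
print(mx.allclose(w.T @ w, mx.eye(4), atol=1e-5))
```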
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when `shapeless` changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading empty list is ok when `strict = false`
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts