Highlights
- 5-bit quantization (see the usage sketch below)
- Significant progress on CUDA back-end by @zcbenz
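A rough usage sketch of the new 5-bit path, assuming it plugs into `mx.quantize`/`mx.dequantize` exactly as the existing 4- and 8-bit modes do (the `group_size` and the reconstruction check are illustrative, not from the release):

```python
import mlx.core as mx

w = mx.random.normal((512, 512))

# Quantize to 5 bits per weight, in groups of 64 along the last axis.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)

# Dequantize and inspect the reconstruction error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())
```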
Core
Features
- 5-bit quants
- Allow per-target Metal debug flags
- Add complex eigh
- reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers (see the sketch after this list)
- Enable vjp for quantized scale and bias
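A brief sketch of a few of the new ops, assuming `mx.broadcast_shapes` and `mx.fft.fftshift`/`mx.fft.ifftshift` mirror their NumPy counterparts, and that `mx.linalg.eig` returns an eigenvalue/eigenvector pair like `mx.linalg.eigh`; the explicit CPU stream is an assumption carried over from the other `mx.linalg` ops:

```python
import mlx.core as mx

# Resulting shape of broadcasting two shapes, without materializing arrays.
print(mx.broadcast_shapes((8, 1, 6), (7, 1)))  # expected: (8, 7, 6)

# Move the zero-frequency bin to the center of the spectrum and back.
x = mx.random.normal((128,))
spectrum = mx.fft.fftshift(mx.fft.fft(x))
x_back = mx.fft.ifft(mx.fft.ifftshift(spectrum))

# Non-symmetric eigendecomposition (CPU stream assumed).
a = mx.random.normal((4, 4))
values, vectors = mx.linalg.eig(a, stream=mx.cpu)
```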
Performance
- Optimize complex matrix multiplication using Karatsuba's algorithm (sketched below)
- Much faster 1D conv
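The Karatsuba trick replaces the naive four real matmuls behind a complex matmul with three. The helper below is purely illustrative (it is not the library's implementation) and assumes complex inputs can be built as `real + 1j * imag`, using the new `real`/`imag` properties:

```python
import mlx.core as mx

def complex_matmul_karatsuba(a, b):
    # (ar + i*ai) @ (br + i*bi) = (t1 - t2) + i*(t3 - t1 - t2)
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    t1 = ar @ br
    t2 = ai @ bi
    t3 = (ar + ai) @ (br + bi)
    return (t1 - t2) + 1j * (t3 - t1 - t2)

a = mx.random.normal((64, 64)) + 1j * mx.random.normal((64, 64))
b = mx.random.normal((64, 64)) + 1j * mx.random.normal((64, 64))
print(mx.allclose(complex_matmul_karatsuba(a, b), a @ b))
```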
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- include `mlx::core::version()` symbols in the mlx static library
- Fix nearest upsample
- Fix large arg reduce
- fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple of edge-case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- Fix `conv_general` differences between GPU and CPU
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fixed shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1