
Releases: ml-explore/mlx

v0.26.0

02 Jun 23:24
0408ba0

Highlights

  • 5-bit quantization
  • Significant progress on CUDA back-end by @zcbenz

Core

Features

  • 5-bit quants (usage sketch after this list)
  • Allow per-target Metal debug flags
  • Add complex eigh
  • Reduce vjp for mx.all and mx.any
  • Add real and imag properties
  • Non-symmetric mx.linalg.eig and mx.linalg.eigh
  • convolution vmap
  • Add more complex unary ops (sqrt, square, ...)
  • Complex scan
  • Add mx.broadcast_shapes
  • Added output_padding parameters in conv_transpose
  • Add random normal distribution for complex numbers
  • Add mx.fft.fftshift and mx.fft.ifftshift helpers
  • Enable vjp for quantized scale and bias
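
A minimal sketch of a few of the new core APIs, assuming bits=5 is accepted by mx.quantize/mx.dequantize and that mx.broadcast_shapes takes shape tuples and returns the broadcast shape:

```python
import mlx.core as mx

# 5-bit affine quantization round trip (bits=5 is new in this release)
w = mx.random.normal((128, 256))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)

# Broadcast-shape helper
print(mx.broadcast_shapes((1, 3), (4, 1)))  # expected: (4, 3)

# FFT shift helpers
spectrum = mx.fft.fft(mx.random.normal((8,)))
centered = mx.fft.fftshift(spectrum)
restored = mx.fft.ifftshift(centered)
```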

Performance

  • Optimize complex matrix multiplication using Karatsuba's algorithm (see the note after this list)
  • Much faster 1D conv
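
For reference, the Karatsuba (3M) trick forms a complex matrix product from three real matrix multiplications instead of four:

$$
(A + iB)(C + iD) = (AC - BD) + i\big((A + B)(C + D) - AC - BD\big)
$$

Only the real products $AC$, $BD$, and $(A + B)(C + D)$ are computed; the real and imaginary parts then follow from additions and subtractions.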

CUDA

  • Generalize gpu backend
  • Use fallbacks in fast primitives when eval_gpu is not implemented
  • Add memory cache to CUDA backend
  • Do not check event.is_signaled() in eval_impl
  • Build for compute capability 70 instead of 75 in CUDA backend
  • CUDA backend: backbone

Bug Fixes

  • Fix out-of-bounds default value in logsumexp/softmax
  • Include mlx::core::version() symbols in the mlx static library
  • Fix Nearest upsample
  • Fix large arg reduce
  • Fix conv grad
  • Fix some complex vjps
  • Fix typo in row_reduce_small
  • Fix put_along_axis for empty arrays
  • Close a couple edge case bugs: hadamard and addmm on empty inputs
  • Fix fft for integer overflow with large batches
  • Fix conv_general differences between GPU and CPU
  • Fix batched vector sdpa
  • GPU Hadamard for large N
  • Improve bandwidth for elementwise ops
  • Fix compile merging
  • Fix shapeless export to throw on dim mismatch
  • Fix mx.linalg.pinv for singular matrices
  • Fixed shift operations
  • Fix integer overflow in qmm

Contributors

Thanks to some awesome contributors!

@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1

v0.25.2

09 May 21:35
659a519

πŸš€

v0.25.1

24 Apr 23:11
eaf709b

πŸš€

v0.25.0

17 Apr 23:50
b529515

Highlights

  • Custom logsumexp for reduced memory in training (benchmark)
  • Depthwise separable convolutions
  • Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs

Core

Performance

  • Fused vector attention supports 256 dim
  • Tune quantized matrix vector dispatch for small batches of vectors

Features

  • Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
  • Enable using MPI from all platforms (only OpenMPI is supported)
  • Add a ring all gather for the ring distributed backend
  • Enable gemm for complex numbers
  • Fused attention supports literal "causal" mask
  • Log for complex numbers
  • Distributed all_min and all_max both for MPI and the ring backend
  • Add logcumsumexp (see the sketch after this list)
  • Add additive mask for fused vector attention
  • Improve the usage of the residency set
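
A minimal sketch of two of the additions; the axis keyword for mx.logcumsumexp is assumed to mirror mx.cumsum, and the q/k/v layout (batch, heads, sequence, head_dim) plus the required scale keyword follow the existing mx.fast.scaled_dot_product_attention signature:

```python
import math
import mlx.core as mx

# Cumulative log-sum-exp along the last axis
x = mx.random.normal((4, 16))
y = mx.logcumsumexp(x, axis=-1)

# Fused attention with the new literal "causal" mask
B, H, L, D = 1, 8, 128, 64
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=1.0 / math.sqrt(D), mask="causal"
)
```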

NN

  • Add sharded layers for model/tensor parallelism

Bugfixes

  • Fix possible allocator deadlock when using multiple streams
  • Ring backend supports 32 bit platforms and FreeBSD
  • Fix FFT bugs
  • Fix attention mask type for fused attention kernel
  • Fix fused attention numerical instability with masking
  • Add a fallback for float16 gemm
  • Fix simd sign for uint64
  • Fix issues in docs

v0.24.2

03 Apr 20:18
86389bf

πŸ› πŸš€

v0.24.1

24 Mar 20:19
aba899c

πŸ›

v0.24.0

20 Mar 22:31
1177d28

Highlights

  • Much faster fused attention with support for causal masking
    • Benchmarks
    • Improvements in prompt processing speed and memory use, benchmarks
    • Much faster small batch fused attention for e.g. speculative decoding, benchmarks
  • Major redesign of CPU back-end for faster CPU-GPU synchronization

Core

Performance

  • Support fused masking in scaled_dot_product_attention
  • Support transposed head/seq for fused vector scaled_dot_product_attention
  • SDPA support for small batch (over sequence) queries
  • Enabling fused attention for head dim 128
  • Redesign the CPU back-end for faster CPU/GPU synchronization

Features

  • Allow debugging in distributed mode
  • Support mx.fast.rms_norm without scale
  • Add nuclear norm support to mx.linalg.norm (see the sketch after this list)
  • Add XOR on arrays
  • Added mlx::core::version()
  • Allow non-square lu in mx.linalg.lu
  • Double precision for LAPACK ops (eigh, svd, etc.)
  • Add a prepare tb ring script
  • Ring docs
  • Affine quant always in fp32
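
A rough sketch of a few of these features; the ord="nuc" spelling, the CPU stream for the SVD-backed norm, and passing weight=None to mx.fast.rms_norm are assumptions:

```python
import mlx.core as mx

# Nuclear norm (sum of singular values); assumed to need the CPU stream like other SVD-backed ops
a = mx.random.normal((5, 3))
nuc = mx.linalg.norm(a, ord="nuc", stream=mx.cpu)

# RMS norm without a learned scale (weight=None is assumed here)
x = mx.random.normal((2, 8))
y = mx.fast.rms_norm(x, None, 1e-5)

# XOR on integer arrays
u = mx.array([0b1010, 0b1100], dtype=mx.uint8)
w = mx.array([0b0110, 0b0101], dtype=mx.uint8)
print(u ^ w, mx.bitwise_xor(u, w))
```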

Optimizers

  • Add a multi optimizer, optimizers.MultiOptimizer (sketch below)
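
A minimal sketch of MultiOptimizer usage; the filter-callable signature (assumed here to be (path, parameter) -> bool) and the convention that the last optimizer handles any unmatched parameters are assumptions:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Linear(16, 4)

# Assumed routing: the i-th filter selects parameters for the i-th optimizer,
# and the final optimizer picks up everything that no filter matched.
opt = optim.MultiOptimizer(
    [optim.SGD(learning_rate=1e-1), optim.Adam(learning_rate=1e-3)],
    [lambda path, param: "bias" in path],
)

def loss_fn(m):
    x = mx.random.normal((8, 16))
    return mx.mean(m(x) ** 2)

loss, grads = nn.value_and_grad(model, loss_fn)(model)
opt.update(model, grads)
mx.eval(model.parameters(), opt.state)
```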

Bug Fixes

  • Do not define MLX_VERSION globally
  • Reduce binary size after the fast synch redesign
  • Fix vmap for flatten
  • Fix copy for large arrays with JIT
  • Fix grad with inplace updates
  • Use same accumulation precision in gemv as gemm
  • Fix slice data size
  • Use a heap for small sizes
  • Fix donation in scan
  • Ensure linspace always contains start and stop
  • Raise an exception in the rope op if input is integer
  • Limit compile buffers by
  • Fix mx.float64 type promotion
  • Fix CPU SIMD erf_inv
  • Update smooth_l1_loss in losses.

v0.23.2

05 Mar 21:24
f599c11

πŸš€

v0.23.1

19 Feb 01:53
71de73a

🐞

v0.23.0

14 Feb 21:39
6cec78d

Highlights

  • 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
  • More performance improvements across the board:
    • Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
    • Faster winograd convolutions, benchmarks
    • Up to 3x faster sort, benchmarks
    • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
    • Faster unified CPU back-end with vector operations
  • Double precision (mx.float64) support on the CPU

Core

Features

  • Bitwise invert mx.bitwise_invert
  • mx.linalg.lu, mx.linalg.lu_factor, mx.linalg.solve, mx.linalg.solve_triangular
  • Support loading F8_E4M3 from safetensors
  • mx.float64 supported on the CPU
  • Matmul JVPs
  • Distributed launch helper mlx.launch
  • Support non-square QR factorization with mx.linalg.qr
  • Support ellipsis in mx.einsum (see the sketch after this list)
  • Refactor and unify accelerate and common back-ends
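
A brief sketch of a few of the new core APIs; running the LAPACK-backed solver and float64 arithmetic on the CPU stream is an assumption based on double precision being CPU-only:

```python
import mlx.core as mx

# Bitwise invert of integer arrays
flags = mx.array([0b0011, 0b0101], dtype=mx.uint8)
print(mx.bitwise_invert(flags))

# Ellipsis in einsum: batched matrix multiply
a = mx.random.normal((2, 3, 4))
b = mx.random.normal((2, 4, 5))
c = mx.einsum("...ij,...jk->...ik", a, b)

# Linear solve (LAPACK-backed, so run on the CPU stream)
A = mx.random.normal((4, 4))
y = mx.random.normal((4,))
x = mx.linalg.solve(A, y, stream=mx.cpu)

# Double precision on the CPU
d = mx.array([1.0, 2.0, 3.0], dtype=mx.float64)
print(mx.sum(d, stream=mx.cpu))
```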

Performance

  • Faster Fence for CPU-GPU synchronization
  • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
  • Fast winograd convolutions, benchmarks
  • Allow dynamic ops per buffer based on dispatches and memory, benchmarks
  • Up to 3x faster sort, benchmarks
  • Faster small batch qmv, benchmarks
  • Ring distributed backend
  • Some CPU ops are much faster with the new Simd<T, N>

NN

  • Orthogonal initializer nn.init.orthogonal (see the sketch after this list)
  • Add dilation for conv 3d layers
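
A small sketch of the NN additions; nn.init.orthogonal is assumed to follow the usual mlx.nn.init pattern of returning an initializer callable, and dilation is assumed to be a keyword argument of nn.Conv3d:

```python
import mlx.core as mx
import mlx.nn as nn

# Orthogonal initializer: returns a function that fills the given array's shape
init_fn = nn.init.orthogonal()
w = init_fn(mx.zeros((8, 8)))

# 3D convolution with dilation (input layout is NDHWC)
conv = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=3, dilation=2)
x = mx.random.normal((1, 16, 16, 16, 4))
y = conv(x)
```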

Bug fixes

  • Limit grad recursion depth by not recursing through non-grad inputs
  • Fix synchronization bug for GPU stream async CPU work
  • Fix shapeless compile on ubuntu24
  • Recompile when shapeless changes
  • Fix rope fallback to not upcast
  • Fix metal sort for certain cases
  • Fix a couple of slicing bugs
  • Avoid duplicate malloc with custom kernel init
  • Fix compilation error on Windows
  • Allow Python garbage collector to break cycles on custom objects
  • Fix grad with copies
  • Loading an empty list is OK when strict=False
  • Fix split vmap
  • Fix output donation for IO ops on the GPU
  • Fix creating an array with an int64 scalar
  • Catch stream errors earlier to avoid aborts