pi-quant: Prime Intellect Fast Quantization Library


Overview

Fast, multithreaded CPU quantization kernels with multiple rounding modes, outperforming PyTorch's built-in quantization routines by more than 2× on all tested hardware. The kernels are optimized with SIMD intrinsics for different CPU architectures, including AMD64 (SSE4.2, AVX2, AVX512F) and ARM64 (Neon), and the best-suited kernel is selected via runtime CPU feature detection.

What is Quantization?

Quantization is the process of mapping continuous values into a finite, discrete set of values. In machine learning and signal processing, it is commonly used to reduce the precision of numerical data, lowering memory usage and improving computational efficiency while maintaining acceptable accuracy.
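
For example, affine quantization of float32 values to uint8 picks a scale and zero point from the data's range, then rounds. A minimal NumPy sketch of the idea (illustrative only, not pi-quant's implementation):

```python
import numpy as np

# Affine quantization: q = round(x / scale) + zero_point, clamped to the
# target integer range. Illustration only, not pi-quant's code.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)

scale = (x.max() - x.min()) / 255.0   # map the observed range onto uint8
zero_point = round(-x.min() / scale)  # integer that represents 0.0

q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
x_hat = (q.astype(np.float32) - zero_point) * scale  # dequantize

print(q)      # [  0  64 128 192 255]
print(x_hat)  # close to x, up to quantization error
```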

Features

✅ Parallel De/Quantization: Efficiently quantizes and de-quantizes data using multiple threads.

✅ Rich Datatype Support: Provides f32, f64 ↔ (u)int4/8/16/32/64.

✅ Modern Python API: Use the library from Python with PyTorch, NumPy, or standalone.

✅ Architecture-Specific Optimizations: The kernels are optimized with SIMD intrinsics for different CPU architectures, including AMD64 (SSE4.2, AVX2, AVX512F) and ARM64 (Neon).

✅ Thread Pool: Reuses threads for minimal overhead.

✅ Flexible Rounding Modes: Supports both nearest and stochastic rounding modes (illustrated in the sketch after this list).

✅ C99 API: Provides a C99 API for C projects or foreign language bindings (see quant.h).

✅ Store Operators: Supports multiple store modes (SET, ADD) during dequantization — useful for ring-reduction operations.

✅ Quantization Parameters: Efficient SIMD-parallel computation of quantization scale and zero point from input data.
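
To illustrate the difference between the two rounding modes, here is a minimal NumPy sketch (conceptual only, not pi-quant's kernels): nearest rounding always picks the closest grid point, while stochastic rounding picks one of the two neighboring grid points with probability proportional to proximity, so the quantized value is unbiased in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3, dtype=np.float32)  # value between grid points 0 and 1

# Nearest rounding: deterministic, always rounds 0.3 down to 0.
nearest = np.round(x)

# Stochastic rounding: round up with probability equal to the fractional
# part (0.3), down otherwise, so E[q] == x and rounding bias cancels out.
frac = x - np.floor(x)
stochastic = np.floor(x) + (rng.random(x.shape) < frac)

print(nearest.mean())     # 0.0 -- systematic bias of -0.3
print(stochastic.mean())  # ~0.3 -- unbiased in expectation
```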

Installation

To install pi-quant from PyPI, run the following command:

pip install pypiquant
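
After installing, the PyTorch entry point used in the benchmarks below is piquant.quantize_torch. The exact signature is not documented in this README, so the argument and return shapes below are assumptions; consult the project's API docs:

```python
import torch
import piquant

x = torch.randn(1_000_000, dtype=torch.float32)

# Hypothetical call shape: the function name comes from the benchmark notes
# below, but the parameter and return value shown are illustrative assumptions.
q = piquant.quantize_torch(x, dtype=torch.uint8)
```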

Benchmarks


The benchmarks were run on a variety of hardware. We benchmark against PyTorch's torch.quantize_per_tensor and torch.ao.quantization.fx._decomposed.quantize_per_tensor. Each benchmark quantizes float32 to uint8 across 1000 runs; the number of elements and other details can be found in the benchmark code.
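
For context, the eager-mode PyTorch baseline can be invoked like this (standard PyTorch API; the scale and zero point here are arbitrary examples):

```python
import torch

# Eager-mode baseline: quantize a float32 tensor to quint8 with a fixed
# scale and zero point. Standard PyTorch API, shown for context only.
x = torch.randn(27_264_000, dtype=torch.float32)  # numel matches the benchmarks
q = torch.quantize_per_tensor(x, 0.02, 128, torch.quint8)
```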

Benchmark 1 (AMD EPYC 9654, 360 vCPUs)

1000 runs with numel 27264000
CPU: AMD EPYC 9654 96-Core Processor, Runtime: AVX512-F
Memory: 1485 GB
Linux: 6.8.0-57-generic

bench1.png

Torch FX Quant refers to torch.ao.quantization.fx._decomposed.quantize_per_tensor, Torch Builtin Quant to torch.quantize_per_tensor, and Fast Quant to pi-quant's piquant.quantize_torch.

Benchmark 2 (AMD EPYC 7742, 128 vCPUs)

1000 runs with numel 27264000
CPU: AMD EPYC 7742 64-Core Processor, Runtime: AVX2
Memory: 528 GB
Linux: 6.8.0-1023-nvidia
bench2.png

Benchmark 3 (Apple M3 Pro)

1000 runs with numel 27264000
CPU: Apple M3 Pro, Runtime: Neon
Memory: 18 GB
macOS: 15.4 (24E248)
bench3.png
