Releases: thu-pacman/chitu
v0.3.1
v0.3.0
Added support for online conversion from FP4 to FP8 and BF16, enabling the FP4-quantized version of DeepSeek-R1 671B to run on non-Blackwell GPUs.
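"Online conversion" here means the 4-bit weights are decoded to a wider dtype on the fly instead of relying on native FP4 hardware support. Below is an illustrative sketch of FP4 (E2M1) decoding with a per-block scale; it is not Chitu's actual implementation, and the packing layout (two codes per byte, low nibble first) is an assumption:

```python
# E2M1 (FP4): 1 sign bit, 2 exponent bits, 1 mantissa bit.
# The eight representable magnitudes:
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1[code & 0x7]

def dequantize_block(packed: bytes, scale: float) -> list:
    """Unpack two FP4 codes per byte (low nibble first, an assumed
    layout) and apply a per-block scale to recover wider values."""
    out = []
    for byte in packed:
        out.append(decode_fp4(byte & 0xF) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out
```

For example, `dequantize_block(bytes([0x21]), 2.0)` decodes the low nibble `0x1` (0.5) and the high nibble `0x2` (1.0), then scales both by 2.0, giving `[1.0, 2.0]`.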
v0.2.3
v0.2.2
v0.2.1
What's new:
- [HIGHLIGHT] Hybrid CPU+GPU inference (compatible with multi-GPU and multi-request).
- Support for new models (see below for the full list).
- Multiple optimizations to operator kernels.
Officially supported models:
- [NEW] QwQ-32B-FP8 (https://huggingface.co/qingcheng-ai/QWQ-32B-FP8)
  Usage: Append the `models=QwQ-32B-FP8` command line argument when starting Chitu.
- [NEW] QwQ-32B-AWQ (https://huggingface.co/Qwen/QwQ-32B-AWQ)
  Usage: Append the `models=QwQ-32B-AWQ` command line argument when starting Chitu.
- [NEW] Llama-3.3-70B-Instruct (https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
  Usage: Append the `models=Llama-3.3-70B-Instruct` command line argument when starting Chitu.
- [NEW] DeepSeek-R1-Distill-Llama-70B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)
  Usage: Append the `models=DeepSeek-R1-Distill-Llama-70B` command line argument when starting Chitu.
- Qwen2.5-32B (https://huggingface.co/Qwen/Qwen2.5-32B)
  Usage: Append the `models=Qwen2.5-32B` command line argument when starting Chitu.
- QwQ-32B (https://huggingface.co/Qwen/QwQ-32B)
  Usage: Append the `models=QwQ-32B` command line argument when starting Chitu.
- Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
  Usage: Append the `models=Mixtral-8x7B-Instruct-v0.1` command line argument when starting Chitu.
- Qwen2-72B-Instruct (https://huggingface.co/Qwen/Qwen2-72B-Instruct)
  Usage: Append the `models=Qwen2-72B-Instruct` command line argument when starting Chitu.
- Meta-Llama-3-8B-Instruct-original (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct; please use its "original" checkpoint)
  Usage: Append the `models=Meta-Llama-3-8B-Instruct-original` command line argument when starting Chitu.
- glm-4-9b-chat (https://huggingface.co/THUDM/glm-4-9b-chat)
  Usage: Append the `models=glm-4-9b-chat` command line argument when starting Chitu.
- DeepSeek-R1 (https://huggingface.co/deepseek-ai/DeepSeek-R1)
  Usage: Append the `models=DeepSeek-R1` command line argument when starting Chitu.
- DeepSeek-R1-Distill-Qwen-14B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
  Usage: Append the `models=DeepSeek-R1-Distill-Qwen-14B` command line argument when starting Chitu.
- Qwen2-7B-Instruct (https://huggingface.co/Qwen/Qwen2-7B-Instruct)
  Usage: Append the `models=Qwen2-7B-Instruct` command line argument when starting Chitu.
- DeepSeek-R1-bf16 (https://huggingface.co/opensourcerelease/DeepSeek-R1-bf16)
  Usage: Append the `models=DeepSeek-R1-bf16` command line argument when starting Chitu.
- DeepSeek-V3 (https://huggingface.co/deepseek-ai/DeepSeek-V3)
  Usage: Append the `models=DeepSeek-V3` command line argument when starting Chitu.
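Every entry above follows the same pattern: the model is selected by appending a `models=<name>` key=value argument to whatever command launches Chitu. A minimal sketch of that pattern; only the `models=` argument comes from these release notes, and the commented launch line is a placeholder, not Chitu's documented entry point:

```shell
# Only the models=<name> argument is from the release notes; the
# launch command below is an illustrative placeholder.
MODEL="QwQ-32B-FP8"
ARGS="models=${MODEL}"
echo "launching Chitu with: ${ARGS}"
# torchrun --nproc_per_node 8 your_chitu_entry_point.py ${ARGS}
```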
v0.2.0
v0.1.2
v0.1.1
NOTE: CUDA graph support in this release is broken. Use v0.1.2 instead.
What's new:
- Support for setting the activation type to `float16` for DeepSeek R1 (by appending `keep_dtype_in_checkpoint=False dtype=float16` to the command line arguments).
- Config file for QwQ-32B.
- A number of bug fixes for running with CUDA graph.
- Further optimizations of operator kernels.
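The float16 activation setting above uses the same key=value argument style as model selection. A minimal sketch; the two flags are from these release notes, while the commented launch line is a placeholder, not Chitu's documented entry point:

```shell
# Flags from the release notes: do not keep the checkpoint dtype,
# and force float16 activations. The launch command is a placeholder.
DTYPE_ARGS="keep_dtype_in_checkpoint=False dtype=float16"
echo "extra args: ${DTYPE_ARGS}"
# torchrun --nproc_per_node 8 your_chitu_entry_point.py models=DeepSeek-R1 ${DTYPE_ARGS}
```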