File Limit Request: vLLM - 800 MiB #6326
Comments
+1, it would be great to have this!
+1, many NVIDIA architectures to support, especially now that Blackwell is in! It's definitely a reasonable ask. :-)
+1
+1! I love vLLM!
@Thespi-Brain can you please help take a look? Thanks! 🙏
@Thespi-Brain kindly ping for any update, thanks!
+1 - This is currently a blocker for vLLM to support NVIDIA RTX 5000-series consumer GPUs
+1 this would help vLLM serve more users <3
+1 to include Blackwell support!
+1 - This would be critical for vLLM to ship support for more hardware architectures out of the box and make users' lives easier!
+1
+1 to this.
a gentle nudge @Thespi-Brain @cmaureir 🙏
Howdy folks, brigading issues like this is not helpful. I'm going to lock this issue. It will be unlocked when it reaches the top of the queue. |
Project URL
https://pypi.org/project/vllm/
Does this project already exist?
Yes
New Limit
800 MiB
Update issue title
Which indexes
PyPI
About the project
Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in. Originally built around the innovative PagedAttention algorithm, vLLM has grown into a comprehensive, state-of-the-art inference engine. A thriving community is also continuously adding new features and optimizations to vLLM, including pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.
Since its release, vLLM has garnered significant attention, with over 46,500 GitHub stars and more than 1,000 contributors, a testament to its popularity and active community.
Reasons for the request
Last year, we requested an increase of the size limit to 400 MiB; see #3792 for details. Since then, vLLM has kept growing, both in popularity and in the models and algorithms it supports, and it is now approaching the limit again: the release https://pypi.org/project/vllm/0.8.5.post1/#files is already 326 MiB.
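For reference, per-file sizes of any published release can be confirmed with PyPI's JSON API. The snippet below is only an illustrative sketch; the version string is an example.

```python
# Illustrative sketch: list the published file sizes of a vLLM release
# via PyPI's JSON API (the version used here is just an example).
import json
import urllib.request

VERSION = "0.8.5.post1"
url = f"https://pypi.org/pypi/vllm/{VERSION}/json"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Each entry in "urls" describes one release file, including its size in bytes.
for f in data["urls"]:
    print(f"{f['filename']}: {f['size'] / 2**20:.1f} MiB")
```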
Recently, when we added support for NVIDIA's Blackwell architecture, the wheel size grew to 450 MiB, which blocks our release process. We tried dropping support for older GPUs, but that does not help much: the binaries for the new GPUs dominate the size.
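One way to see which binaries dominate is to list the largest members of a downloaded wheel. This is a rough sketch, assuming a locally downloaded vLLM wheel; the filename below is hypothetical.

```python
# Rough sketch: rank the largest files inside a locally downloaded wheel
# (a wheel is a zip archive). The filename here is hypothetical.
import zipfile

WHEEL = "vllm-0.8.5.post1-cp38-abi3-manylinux1_x86_64.whl"  # example filename

with zipfile.ZipFile(WHEEL) as wf:
    members = sorted(wf.infolist(), key=lambda m: m.file_size, reverse=True)
    for m in members[:10]:
        print(f"{m.file_size / 2**20:8.1f} MiB  {m.filename}")
```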
In addition, Blackwell GPUs introduce new data types such as FP4 and FP6, which means we need more variants of our GPU kernels to support them. We therefore foresee that the wheel size will continue to grow in the near future.
We kindly request that the size limit be raised to 800 MiB, so that vLLM can continue to serve the community.
Code of Conduct
I agree to follow the PSF Code of Conduct.