High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
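The idea behind splitting a model across devices can be shown with a small sketch: shard the weight matrix of a matrix-vector product so each worker stores only its slice of the weights and computes its slice of the output. This is a minimal illustration only (std::thread stands in for separate devices; the function name, shapes, and layout are assumptions), not code from the repository above.

```cpp
// Minimal sketch: split the weight matrix of y = W * x into horizontal
// slices, one per worker, so each worker stores and computes only its share.
// std::thread stands in for separate devices; everything here is illustrative.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<float> sharded_matvec(const std::vector<float>& W,  // rows x cols, row-major
                                  const std::vector<float>& x,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t num_workers) {
    std::vector<float> y(rows, 0.0f);
    std::vector<std::thread> workers;
    const std::size_t chunk = (rows + num_workers - 1) / num_workers;

    for (std::size_t d = 0; d < num_workers; ++d) {
        const std::size_t begin = d * chunk;
        const std::size_t end = std::min(rows, begin + chunk);
        workers.emplace_back([&, begin, end] {
            // Each worker touches only rows [begin, end) of W and y,
            // which is what divides both RAM usage and compute.
            for (std::size_t r = begin; r < end; ++r) {
                float acc = 0.0f;
                for (std::size_t c = 0; c < cols; ++c)
                    acc += W[r * cols + c] * x[c];
                y[r] = acc;
            }
        });
    }
    for (auto& t : workers) t.join();
    // On a real home cluster, the per-device slices would be gathered over the network.
    return y;
}
```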
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
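For context on what low-bit CPU inference involves, here is a minimal sketch of symmetric per-tensor INT8 weight quantization and a dequantizing dot product. It illustrates the general technique under assumed names and layouts; it is not the RWKV project's actual quantization scheme, and the real INT4/INT5/INT8 formats use their own groupings.

```cpp
// Minimal sketch of symmetric per-tensor INT8 weight quantization plus a
// dequantizing dot product. Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<std::int8_t> data;  // values in [-127, 127]
    float scale;                    // real value ≈ data[i] * scale
};

QuantizedTensor quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    QuantizedTensor q;
    q.scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    q.data.reserve(w.size());
    for (float v : w)
        q.data.push_back(static_cast<std::int8_t>(std::lround(v / q.scale)));
    return q;
}

// Multiply quantized weights with FP32 activations, rescaling once at the end.
float dot_int8(const QuantizedTensor& w, const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size() && i < w.data.size(); ++i)
        acc += static_cast<float>(w.data[i]) * x[i];
    return acc * w.scale;
}
```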
A @ClickHouse fork that supports high-performance vector search and full-text search.
A lightweight LLM inference framework
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
🤘 TT-NN operator library and TT-Metalium low-level kernel programming model.
WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
A high-performance inference system for large language models, designed for production environments.
A great project for campus hiring, autumn/spring recruitment, and internships: build an LLM inference framework from scratch that supports LLama2/3 and Qwen2.5.
LLaVA server (llama.cpp).