feat: multi-thread (via asyncio.task) in processor #904
Generally I'm not sold on the approach of this change as described here: https://github.com/ai-dynamo/dynamo/pull/904/files#r2069528912
But I don't want to block you from improving the performance if you deem this strategy necessary over just increasing the number of processor replicas.
288915f to 6f8fdf0
…/multithread-proc
Set to auto-merge. Note that multi-process, multi-thread, and asyncio-task approaches are all temporary solutions until we refactor the processor in Rust.
Implement an asyncio tasks + queue architecture in processor.py to improve tokenization perf at high load. This partially solves #873. Note that, from eyeballing htop, we're still GIL-bound. Not sure whether the vllm tokenization process releases the GIL or not.
Another benefit of the queue architecture is that it naturally enables queuing requests at the processor instead of at the engine.
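
For readers unfamiliar with the pattern, here is a minimal sketch of an asyncio tasks + queue layout. It is not the code in processor.py: `tokenize`, `worker`, `submit`, `NUM_WORKERS`, and the queue size are all hypothetical names and values chosen for illustration, and the tokenizer is a plain `str.split` stand-in. Requests are placed on a bounded `asyncio.Queue`, and a fixed pool of worker tasks pulls from it, so backlog builds up in the queue rather than downstream.

```python
import asyncio

NUM_WORKERS = 4      # hypothetical pool size
MAX_QUEUE_SIZE = 1024  # hypothetical bound; put() waits when the queue is full


def tokenize(text: str) -> list[str]:
    # Stand-in for the real tokenizer (assumption: any blocking callable).
    return text.split()


async def worker(queue: asyncio.Queue) -> None:
    # Each worker loops forever, taking one (text, future) pair at a time.
    while True:
        text, future = await queue.get()
        try:
            # Run the blocking call in a thread. This only yields real
            # parallelism if the callable releases the GIL.
            tokens = await asyncio.to_thread(tokenize, text)
            if not future.done():
                future.set_result(tokens)
        except Exception as exc:
            if not future.done():
                future.set_exception(exc)
        finally:
            queue.task_done()


async def submit(queue: asyncio.Queue, text: str) -> list[str]:
    # Enqueue a request and wait for a worker to resolve its future.
    future = asyncio.get_running_loop().create_future()
    await queue.put((text, future))
    return await future


async def main(prompts: list[str]) -> list[list[str]]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_SIZE)
    workers = [asyncio.create_task(worker(queue)) for _ in range(NUM_WORKERS)]
    try:
        # gather() returns results in the same order as the inputs.
        return await asyncio.gather(*(submit(queue, p) for p in prompts))
    finally:
        # Stop the worker pool once all submitted requests are done.
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


if __name__ == "__main__":
    print(asyncio.run(main(["hello world", "a b c"])))
```

Because the worker count is fixed, the queue depth is a direct measure of how far the processor is behind, which is what makes queuing at this layer possible. If the blocking call holds the GIL for its whole duration, the threads in this sketch will serialize, consistent with the GIL-bound behavior noted above.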