[Core]: Support destroying all KV cache during runtime #10810

Open
wants to merge 1 commit into base: main

Conversation

@HollowMan6 (Contributor) commented on Dec 1, 2024

Implements #10714

API Design:

  • Destroy (implemented in this PR): `vllm.LLM().llm_engine._destroy_kv_caches()`
  • Re-initialize (already exists): `vllm.LLM().llm_engine._initialize_kv_caches()`
  • Stop loop (already exists): `vllm.LLM().llm_engine.model_executor.stop_remote_worker_execution_loop()`

This PR only implements `_destroy_kv_caches` for the GPU executor and workers, since I don't have other hardware available. Feel free to take over this PR to implement the remaining backends; once all the implementations are in place, we can make `destroy_cache()` an abstract method.
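
For reference, a rough sketch of what that eventual interface could look like; the class and method names below are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch only: once every backend implements KV cache destruction,
# the executor base class could declare destroy_cache() as an abstract method.
from abc import ABC, abstractmethod


class ExecutorBase(ABC):
    @abstractmethod
    def destroy_cache(self) -> None:
        """Free all KV cache blocks on every worker."""
        ...
```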

Also, since the engine cannot generate without KV caches (it will throw errors), this PR assumes that developers handle things on their side, so that no generation request is sent after `_destroy_kv_caches()` and before `_initialize_kv_caches()` (i.e., while in sleep mode).
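
As a minimal illustration of the intended cycle, here is a single-engine sketch (tensor_parallel_size=1) using only the methods listed above; it was not run as part of this PR, and the full Ray-based test follows below:

```python
# Minimal sketch: destroy the KV cache, then re-initialize it before generating again.
import vllm

llm = vllm.LLM("meta-llama/Llama-3.1-8B-Instruct")
print(llm.generate("San Francisco is a"))

# Enter "sleep mode": stop the worker execution loop, then free the KV cache.
llm.llm_engine.model_executor.stop_remote_worker_execution_loop()
llm.llm_engine._destroy_kv_caches()

# No generate() call is allowed in this window (the engine would throw).

# Wake up: re-allocate the KV cache, after which generation works again.
llm.llm_engine._initialize_kv_caches()
print(llm.generate("New York is a"))
```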

Code for testing:

```python
import ray, time
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
class LLMRayActor:
    def __init__(self, *args, **kwargs):
        import vllm

        # With tensor parallelism, let vLLM spawn its workers as Ray actors.
        if kwargs["tensor_parallel_size"] != 1:
            kwargs["worker_use_ray"] = True

        self.llm = vllm.LLM(*args, **kwargs)

    def generate(self, *args, **kwargs):
        return self.llm.generate(*args, **kwargs)

    def destroy_cache(self):
        self.stop_remote_worker_execution_loop()
        self.llm.llm_engine._destroy_kv_caches()

    def load_cache(self):
        self.stop_remote_worker_execution_loop()
        self.llm.llm_engine._initialize_kv_caches()

    def stop_remote_worker_execution_loop(self):
        self.llm.llm_engine.model_executor.stop_remote_worker_execution_loop()


def create_vllm_engines(
    num_engines: int,
    tensor_parallel_size: int,
    model: str,
):
    vllm_engines = []
    for _ in range(num_engines):
        # With TP > 1 the actor itself needs no GPU; the vLLM Ray workers claim them.
        num_gpus = int(tensor_parallel_size == 1)
        scheduling_strategy = None

        if tensor_parallel_size > 1:
            bundles = [{"GPU": 1, "CPU": 1}] * tensor_parallel_size
            pg = placement_group(bundles)
            ray.get(pg.ready())

            scheduling_strategy = PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_capture_child_tasks=True,
                placement_group_bundle_index=0,
            )

        vllm_engines.append(
            LLMRayActor.options(
                num_cpus=1,
                num_gpus=num_gpus,
                scheduling_strategy=scheduling_strategy,
            ).remote(
                model,
                tensor_parallel_size=tensor_parallel_size,
            )
        )

    return vllm_engines


if __name__ == "__main__":
    # engines = create_vllm_engines(2, 2, "meta-llama/Llama-3.1-8B-Instruct")
    engines = create_vllm_engines(4, 1, "meta-llama/Llama-3.1-8B-Instruct")

    # Generate once with the KV cache in place.
    ref = []
    for engine in engines:
        ref.append(engine.generate.remote("San Francisco is a"))
    print(f"output: {ray.get(ref)}")

    # Free the KV cache on every engine.
    ref = []
    for engine in engines:
        ref.append(engine.destroy_cache.remote())
    ray.get(ref)

    time.sleep(5)

    # Re-allocate the KV cache.
    ref = []
    for engine in engines:
        ref.append(engine.load_cache.remote())
    ray.get(ref)

    # Generate again to confirm the engines still work after the cycle.
    ref = []
    for engine in engines:
        ref.append(engine.generate.remote("New York is a"))
    print(f"output: {ray.get(ref)}")
```

github-actions bot commented Dec 1, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@HollowMan6 (Contributor, Author) commented:

> Also, since the engine cannot generate without KV caches (it will throw errors), this PR assumes that developers handle things on their side, so that no generation request is sent after `_destroy_kv_caches()` and before `_initialize_kv_caches()` (i.e., while in sleep mode).

Another possible way to handle this would be to check whether the KV cache has been initialized when the engine receives a request and, if not, initialize it, so that no manual intervention is needed. That is not implemented in this PR.
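
A rough sketch of that idea, purely hypothetical since it is not implemented in this PR; the `kv_caches_initialized` flag and its placement inside the engine's request path are assumptions for illustration:

```python
# Hypothetical fragment inside the engine (not part of this PR). Assumes a flag
# that _initialize_kv_caches() sets and _destroy_kv_caches() clears.
def add_request(self, request_id, prompt, params, **kwargs):
    if not self.kv_caches_initialized:
        # Lazily re-allocate the KV cache instead of raising an error.
        self._initialize_kv_caches()
    ...  # continue with normal request handling
```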
