feat: Add unified x86 / aarch64 (ARM) build for VLLM image by rmccorm4 · Pull Request #839 · ai-dynamo/dynamo · GitHub

Merged: rmccorm4 merged 8 commits into main from rmccormick/clean/arm64/vllm on Apr 28, 2025

Conversation

@rmccorm4 (Contributor) commented on Apr 26, 2025

Overview:

• Parameterizes ARCH for x86/arm installs of etcd, nats, dynamo, and NIXL, similar to #803 (a minimal sketch of this pattern is shown after this list).

• Extends the NIXL install from #594 with ARM support.

• vLLM ARM build/install from source was added in a separate PR (#845) for easier isolation of changes, then merged into this PR.
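For context, the ARCH parameterization follows the usual pattern of mapping the build platform to the architecture names used by upstream release artifacts. The snippet below is a minimal sketch of that pattern, not the actual Dockerfile contents; the etcd version and URL layout shown are assumptions based on etcd's standard release naming.

# Minimal sketch of the ARCH parameterization pattern (illustrative only):
# map the machine architecture to the name used by upstream release tarballs,
# then download the matching artifact.
ARCH=$(uname -m)
case "${ARCH}" in
  x86_64)  RELEASE_ARCH=amd64 ;;
  aarch64) RELEASE_ARCH=arm64 ;;
  *) echo "Unsupported arch: ${ARCH}" >&2; exit 1 ;;
esac

# ETCD_VERSION is a placeholder; etcd publishes per-arch tarballs in this layout.
ETCD_VERSION=v3.5.13
curl -fSL "https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-${RELEASE_ARCH}.tar.gz" \
  -o /tmp/etcd.tar.gz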


Manually tested with the following builds (minimal runtime testing).

x86

# or default with no args: 
#   ./container/build.sh --framework vllm
./container/build.sh --framework vllm --platform linux/amd64

# example x86 image produced:
#   gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo-ci/rmccormick:dynamo_2cee3f_vllm0.8.4_x86

ARM

# NOTE: Building vLLM from source can easily take anywhere from 1-4+ hours depending
# on the VLLM_MAX_JOBS value and available system memory.
# - With 256GB RAM and VLLM_MAX_JOBS=16, I believe it took somewhere between 3-4 hours
# - With > 1 TB RAM (GB200 machine) and VLLM_MAX_JOBS=64, I believe it took about 1-1.5 hours
./container/build.sh --framework vllm --platform linux/arm64

# Example on a system with large memory, to speed up the vLLM build on ARM
# (a hypothetical memory-based helper for picking VLLM_MAX_JOBS is sketched below)
./container/build.sh --framework vllm --platform linux/arm64 --build-args VLLM_MAX_JOBS=16

# example aarch64 image produced: 
#   gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo-ci/rmccormick:dynamo_6084b0f_vllm0.8.4_aarch64
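Because build time and memory pressure both scale with the number of parallel compile jobs, one option is to derive VLLM_MAX_JOBS from available system memory instead of hard-coding it. The helper below is hypothetical (it is not part of build.sh), and the roughly-16GB-per-job ratio is only a loose assumption based on the timings above.

# Hypothetical helper (not part of build.sh): derive a parallelism level from
# available memory, assuming very roughly 16 GB of RAM per compile job.
MEM_GB=$(free -g | awk '/^Mem:/ {print $2}')
VLLM_MAX_JOBS=$(( MEM_GB / 16 ))
[ "${VLLM_MAX_JOBS}" -lt 1 ] && VLLM_MAX_JOBS=1

./container/build.sh --framework vllm --platform linux/arm64 --build-args VLLM_MAX_JOBS=${VLLM_MAX_JOBS}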

Minimal runtime validation of the vllm/torch installs after #845:

root@gb-nvl-081-compute06:/workspace# python3
Python 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import vllm
WARNING 04-28 06:03:01 [__init__.py:23] Using ai_dynamo_vllm
WARNING 04-28 06:03:01 [__init__.py:23] Using ai_dynamo_vllm
WARNING 04-28 06:03:01 [__init__.py:23] Using ai_dynamo_vllm
WARNING 04-28 06:03:01 [__init__.py:23] Using ai_dynamo_vllm
WARNING 04-28 06:03:01 [__init__.py:23] Using ai_dynamo_vllm
INFO 04-28 06:03:01 [__init__.py:240] Automatically detected platform cuda.
INFO 04-28 06:03:01 [nixl.py:31] NIXL is available
>>> import torch
>>> torch.cuda.is_available()
True

Needs follow-up (@saturley-hall @nv-anants):

  • If the wheel needs to be published to PyPI, it may need a couple of tweaks to the package metadata to publish as ai_dynamo_vllm, similar to how it's done in the x86 download+patch+publish path:
    # WAR: Set package version check to 'vllm' instead of 'ai_dynamo_vllm' to avoid
    # platform detection issues on ARM install.
    # TODO: Rename package from vllm to ai_dynamo_vllm like x86 path below to remove this WAR.
    sed -i 's/version("ai_dynamo_vllm")/version("vllm")/g' vllm/platforms/__init__.py && \
  • On ARM only, it is currently installed from source and ends up listed under the package name vllm instead of ai_dynamo_vllm (a quick check is sketched after this list):
$ ./container/run.sh -it --image gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo-ci/rmccormick:dynamo_6084b0f_vllm0.8.4_aarch64
$ pip freeze
...
vllm @ file:///tmp/vllm/vllm-0.8.4
  • Building vLLM (and flash attention) from source is quite slow and memory-intensive (see MAX_JOBS). Hopefully, with the release of prebuilt PyTorch 2.7 wheels supporting ARM, vLLM can start publishing ARM wheels as well, so we can just download+patch as we do for x86.
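As a quick way to confirm which distribution name ended up installed in a given image (vllm built from source on ARM vs. the patched ai_dynamo_vllm wheel on x86), something like the following can be run inside the container; this is just an illustrative check, not something shipped in the image.

# Illustrative check (not part of the image): report which distribution name is installed.
pip show ai_dynamo_vllm 2>/dev/null || pip show vllm
pip freeze | grep -i vllm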

copy-pr-bot (bot) commented Apr 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rmccorm4 changed the title from "feat: Add unified x86 / aarch64 (ARM) build for VLLM image (part 1)" to "feat: Add unified x86 / aarch64 (ARM) build for VLLM image" on Apr 26, 2025
@rmccorm4 (Contributor, Author) commented Apr 27, 2025

vLLM ARM build/install from source was added in a separate PR for easier isolation of logical changes: #845

rmccorm4 marked this pull request as ready for review April 27, 2025 06:28
rmccorm4 merged commit 566068d into main Apr 28, 2025
6 checks passed
rmccorm4 deleted the rmccormick/clean/arm64/vllm branch April 28, 2025 20:47
@rmccorm4 (Contributor, Author) commented:
Sanity test

Example dynamo serve commands in the vllm image on GB200:

# Start container
./container/run.sh -it --image gitlab-master.nvidia.com:5005/dl/ai-dynamo/dynamo-ci/rmccormick:dynamo_6084b0f_vllm0.8.4_aarch64

# Starts nats/etcd
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 &

# Serve
cd /workspace/examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml &
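Optionally, before sending requests it can help to confirm that etcd and the HTTP frontend are actually up. These checks are not part of the example above; the etcd /health endpoint and the port 8000 frontend are assumed from the commands and logs in this PR.

# Optional sanity checks (not part of the example above):
curl -s http://localhost:2379/health                                  # etcd health endpoint
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://localhost:8000/  # frontend port is listening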

Example request:

MODEL=${MODEL:-"deepseek-ai/DeepSeek-R1-Distill-Llama-8B"}

curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "'${MODEL}'",
    "messages": [
        {
            "role": "user",
            "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
        }
    ],
    "stream":false,
    "max_tokens": 30
}'
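If jq happens to be available in the container (an assumption; this PR does not mention it), the generated text can be pulled straight out of the response:

# Extract just the generated text from the response (assumes jq is installed):
curl -s localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "'${MODEL}'",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "stream": false,
    "max_tokens": 30
}' | jq -r '.choices[0].message.content'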

Output

...
2025-04-28T20:48:10.936Z  INFO serve.serve: Running dynamo serve with service configs {'Common': {'model': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'block-size': 64, 'max-model-len': 16384}, 'Frontend': {'served_model_name': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'endpoint': 'dynamo.Processor.chat/completions', 'port': 8000}, 'Processor': {'router': 'round-robin', 'common-configs': ['model', 'block-size', 'max-model-len']}, 'VllmWorker': {'enforce-eager': True, 'max-num-batched-tokens': 16384, 'enable-prefix-caching': True, 'router': 'random', 'tensor-parallel-size': 1, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}, 'common-configs': ['model', 'block-size', 'max-model-len']}}   
2025-04-28T20:48:10.936Z  INFO loader.find_and_load_service: Loading service from import string: graphs.agg:Frontend   
2025-04-28T20:48:10.936Z  INFO loader.find_and_load_service: Working directory: .   
2025-04-28T20:48:10.937Z  INFO loader.find_and_load_service: Changing working directory to: /workspace/examples/llm   
2025-04-28T20:48:10.937Z  INFO loader.find_and_load_service: Adding /workspace/examples/llm to sys.path   
2025-04-28T20:48:10.937Z  INFO loader._do_import: Parsed import string - path: graphs.agg, attributes: Frontend   
2025-04-28T20:48:10.937Z  INFO loader._do_import: Importing from module name: graphs.agg   
2025-04-28T20:48:10.937Z  INFO loader._do_import: Attempting to import module: graphs.agg   
2025-04-28T20:48:17.834Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:17.835Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:17.837Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:17.849Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:17.850Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:17.852Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-04-28T20:48:18.440Z  INFO nixl: NIXL is available   
2025-04-28T20:48:20.451Z  INFO loader._do_import: Navigating attributes: Frontend   
2025-04-28T20:48:20.451Z  INFO loader._do_import: Getting attribute: Frontend   
2025-04-28T20:48:20.451Z  INFO loader.find_and_load_service: Removing /workspace/examples/llm from sys.path   
2025-04-28T20:48:20.451Z  INFO loader.find_and_load_service: Restoring working directory to: /workspace/examples/llm   
2025-04-28T20:48:20.451Z  INFO serve.serve: Loaded service: Frontend   
2025-04-28T20:48:20.451Z  INFO serve.serve: Dependencies: ['Processor', 'VllmWorker']   
╭──────────────── Dynamo Serve ────────────────╮
│ Starting Dynamo service: graphs.agg:Frontend │
╰──────────────────────────────────────────────╯
2025-04-28T20:48:20.544Z  INFO loader.find_and_load_service: Loading service from import string: graphs.agg:Frontend   
2025-04-28T20:48:20.544Z  INFO loader.find_and_load_service: Working directory: .   
2025-04-28T20:48:20.544Z  INFO loader.find_and_load_service: Changing working directory to: /workspace/examples/llm   
2025-04-28T20:48:20.544Z  INFO loader.find_and_load_service: Adding /workspace/examples/llm to sys.path   
2025-04-28T20:48:20.544Z  INFO loader._do_import: Parsed import string - path: graphs.agg, attributes: Frontend   
2025-04-28T20:48:20.544Z  INFO loader._do_import: Importing from module name: graphs.agg   
2025-04-28T20:48:20.544Z  INFO loader._do_import: Attempting to import module: graphs.agg   
2025-04-28T20:48:20.544Z  INFO loader._do_import: Navigating attributes: Frontend   
2025-04-28T20:48:20.544Z  INFO loader._do_import: Getting attribute: Frontend   
2025-04-28T20:48:20.544Z  INFO loader.find_and_load_service: Removing /workspace/examples/llm from sys.path   
2025-04-28T20:48:20.544Z  INFO loader.find_and_load_service: Restoring working directory to: /workspace/examples/llm   
2025-04-28T20:48:20.589Z  INFO resource._discover_gpus: Discovered 4 GPUs   
2025-04-28T20:48:20.635Z  INFO resource._discover_gpus: Discovered 4 GPUs   
2025-04-28T20:48:20.635Z  INFO allocator.get_resource_envs: Getting resource envs for service Frontend   
2025-04-28T20:48:20.635Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-04-28T20:48:20.635Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []   
2025-04-28T20:48:20.635Z  INFO allocator.get_resource_envs: Getting resource envs for service Processor   
2025-04-28T20:48:20.636Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-04-28T20:48:20.636Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []   
2025-04-28T20:48:20.640Z  INFO serving.create_dynamo_watcher: Created watcher for Processor's in the dynamo namespace   
2025-04-28T20:48:20.640Z  INFO allocator.get_resource_envs: Getting resource envs for service VllmWorker   
2025-04-28T20:48:20.640Z  INFO allocator.get_resource_envs: GPU requirement found: 1   
2025-04-28T20:48:20.640Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-04-28T20:48:20.641Z  INFO allocator.get_resource_envs: GPU allocation enabled   
2025-04-28T20:48:20.641Z  INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'VllmWorker'   
2025-04-28T20:48:20.641Z  INFO allocator.get_resource_envs: GPU 0 (NVIDIA Graphics Device): Memory: 184.0GB free / 185.0GB total, Utilization: 0%, Temperature: 31°C   
2025-04-28T20:48:20.641Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '0'}]   
2025-04-28T20:48:20.641Z  INFO serving.create_dynamo_watcher: Created watcher for VllmWorker's in the dynamo namespace   
2025-04-28T20:48:20.642Z  INFO serving.serve_dynamo_graph: Created watcher for Frontend with 1 workers in the dynamo namespace   
2025-04-28T20:48:20.643Z  INFO arbiter._ensure_ioloop: Installing handle_callback_exception to loop   
2025-04-28T20:48:20.643Z  INFO sighandler.__init__: Registering signals...   
2025-04-28T20:48:20.644Z  INFO arbiter.start: Starting master on pid 173   
2025-04-28T20:48:20.645Z  INFO arbiter.initialize: sockets started   
2025-04-28T20:48:20.654Z  INFO arbiter.start: Arbiter now waiting for commands   
2025-04-28T20:48:20.654Z  INFO watcher._start: dynamo_Processor started   
2025-04-28T20:48:20.660Z  INFO watcher._start: dynamo_VllmWorker started   
2025-04-28T20:48:20.665Z  INFO watcher._start: dynamo_Frontend started   
2025-04-28T20:48:20.665Z  INFO serving.<lambda>: Starting Dynamo Service Frontend (Press CTRL+C to quit)   
2025-04-28T20:48:23.794Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.796Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.796Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.797Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.797Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.798Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.817Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.818Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.820Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.820Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-04-28T20:48:23.821Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.822Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-04-28T20:48:23.936Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.937Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.940Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.959Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.960Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:23.962Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-04-28T20:48:24.091Z  INFO nixl: NIXL is available   
2025-04-28T20:48:24.091Z  INFO nixl: NIXL is available   
2025-04-28T20:48:24.216Z  INFO nixl: NIXL is available   
2025-04-28T20:48:24.897Z  INFO serve_dynamo.worker: [Processor:1] Registering component dynamo/Processor   
2025-04-28T20:48:24.898Z  INFO serve_dynamo.worker: [Processor:1] Created Processor component   
2025-04-28T20:48:24.898Z  INFO config.as_args: [Processor:1] Running Processor with args=['--model', 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', '--block-size', '64', '--max-model-len', '16384', '--router', 'round-robin']   
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 826/826 [00:00<00:00, 10.9MB/s]
2025-04-28T20:48:25.275Z  INFO serve_dynamo.worker: [VllmWorker:1] Registering component dynamo/VllmWorker   
2025-04-28T20:48:25.275Z  INFO _core: created custom lease: Lease { id: 7587886415768669961, cancel_token: CancellationToken { is_cancelled: false } }
2025-04-28T20:48:25.276Z  INFO serve_dynamo.worker: [VllmWorker:1] Created VllmWorker component with custom lease id 7587886415768669961   
2025-04-28T20:48:25.276Z  INFO config.as_args: [VllmWorker:1] Running VllmWorker with args=['--model', 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', '--block-size', '64', '--max-model-len', '16384', '--enforce-eager', '--max-num-batched-tokens', '16384', '--enable-prefix-caching', '--router', 'random', '--tensor-parallel-size', '1']   
2025-04-28T20:48:25.331Z  INFO worker.__init__: [VllmWorker:1] Prefill queue: nats://localhost:4222:vllm   
chat model deepseek-ai/DeepSeek-R1-Distill-Llama-8B removed from the public namespace: public
Added new chat model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+------------+------------------------------------------+-----------+-----------+------------------+
| MODEL TYPE | MODEL NAME                               | NAMESPACE | COMPONENT | ENDPOINT         |
+------------+------------------------------------------+-----------+-----------+------------------+
| chat       | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | dynamo    | Processor | chat/completions |
+------------+------------------------------------------+-----------+-----------+------------------+
2025-04-28T20:48:26.301Z  INFO frontend.start_http_server: [Frontend:1] Starting HTTP server   
2025-04-28T20:48:26.301Z  WARN serve_dynamo.web_worker: [Frontend:1] No API routes found, not starting FastAPI server   
2025-04-28T20:48:26.301Z  INFO serve_dynamo.web_worker: [Frontend:1] Service is running, press Ctrl+C to stop   
2025-04-28T20:48:26.307Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
2025-04-28T20:48:26.307Z  INFO dynamo_runtime::pipeline::network::tcp::server: tcp transport service on 10.115.171.13:41195
2025-04-28T20:48:26.307Z  INFO dynamo_llm::http::service::discovery: added Chat model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
2025-04-28T20:48:31.587Z  INFO config._resolve_task: This model supports multiple tasks: {'embed', 'reward', 'generate', 'score', 'classify'}. Defaulting to 'generate'.   
2025-04-28T20:48:31.668Z  INFO config._resolve_task: This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.   
2025-04-28T20:48:31.669Z  WARN cuda.is_async_output_supported: To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used   
2025-04-28T20:48:31.673Z  INFO api_server.build_async_engine_client_from_engine_args: Started engine process with PID 695   
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.07k/3.07k [00:00<00:00, 28.8MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.08M/9.08M [00:00<00:00, 24.4MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 181/181 [00:00<00:00, 2.73MB/s]
2025-04-28T20:48:33.313Z  WARN config.get_diff_sampling_param: Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.   
2025-04-28T20:48:33.313Z  INFO serving_chat.__init__: Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.95}   
2025-04-28T20:48:33.394Z  INFO serving_completion.__init__: Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.95}   
Processor init: round-robin
2025-04-28T20:48:33.395Z  INFO dynamo_runtime::pipeline::network::tcp::server: tcp transport service on 10.115.171.13:42559
2025-04-28T20:48:35.167Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:35.168Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:35.170Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:35.188Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:35.189Z  WARN __init__.vllm_version_matches_substr: Using ai_dynamo_vllm   
2025-04-28T20:48:35.190Z  INFO __init__.resolve_current_platform_cls_qualname: Automatically detected platform cuda.   
2025-04-28T20:48:35.392Z  INFO nixl: NIXL is available   
2025-04-28T20:48:36.196Z  INFO llm_engine.__init__: Initializing a V0 LLM engine (v0.8.5.dev0+gdc1b4a6.d20250428) with config: model='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Llama-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,    
2025-04-28T20:48:37.674Z  INFO cuda.get_attn_backend_cls: Using Flash Attention backend.   
2025-04-28T20:48:38.267Z  INFO parallel_state.initialize_model_parallel: rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0   
2025-04-28T20:48:38.267Z  INFO model_runner.load_model: Starting to load model deepseek-ai/DeepSeek-R1-Distill-Llama-8B...   
2025-04-28T20:48:40.748Z  INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']   
model-00002-of-000002.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.39G/7.39G [02:02<00:00, 60.4MB/s]
model-00001-of-000002.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.67G/8.67G [02:19<00:00, 62.3MB/s]
2025-04-28T20:51:00.000Z  INFO weight_utils.download_weights_from_hf: Time spent downloading weights for deepseek-ai/DeepSeek-R1-Distill-Llama-8B: 139.250745 seconds   
model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24.2k/24.2k [00:00<00:00, 151MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.28s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.01s/it]

2025-04-28T20:51:02.205Z  INFO loader.load_model: Loading weights took 2.04 seconds   
2025-04-28T20:51:02.350Z  INFO model_runner.load_model: Model loading took 14.9889 GiB and 143.940618 seconds   
2025-04-28T20:51:03.161Z  INFO worker.determine_num_available_blocks: Memory profiling takes 0.69 seconds
the current vLLM instance can use total_gpu_memory (184.00GiB) x gpu_memory_utilization (0.90) = 165.60GiB
model weights take 14.99GiB; non_torch_memory takes 0.16GiB; PyTorch activation peak memory takes 1.70GiB; the rest of the memory reserved for KV Cache is 148.76GiB.   
2025-04-28T20:51:03.284Z  INFO executor_base.initialize_cache: # cuda blocks: 19040, # CPU blocks: 512   
2025-04-28T20:51:03.284Z  INFO executor_base.initialize_cache: Maximum concurrency for 16384 tokens per request: 74.38x   
2025-04-28T20:51:24.446Z  INFO llm_engine._initialize_kv_caches: init engine (profile, create kv cache, warmup model) took 22.10 seconds   
2025-04-28T20:51:24.996Z  INFO worker.async_init: [VllmWorker:1] VllmWorker has been initialized   
2025-04-28T20:51:24.996Z  INFO serve_dynamo.worker: [VllmWorker:1] Starting VllmWorker instance with all registered endpoints   
2025-04-28T20:51:24.996Z  INFO serve_dynamo.worker: [VllmWorker:1] Serving VllmWorker with lease: 7587886415768669961   
2025-04-28T20:51:24.997Z  WARN utils.append_dynamo_state: [VllmWorker:1] Skipping append to state file /root/.dynamo/state/dynamo.json because it doesn't exist   
2025-04-28T20:51:24.997Z  INFO serve_dynamo.worker: [VllmWorker:1] Appended lease 7587886415768669961/694d967e27e74f09 to dynamo_VllmWorker   
2025-04-28T20:51:24.997Z  INFO worker.create_metrics_publisher_endpoint: [VllmWorker:1] Creating metrics publisher endpoint with lease: <builtins.PyLease object at 0xeb6ad46e3240>   
2025-04-28T20:51:25.072Z  INFO logging.check_required_workers: [Processor:1] Waiting for more workers to be ready.
 Current: 1, Required: 1   
Workers ready: [7587886415768669961]
2025-04-28T20:51:25.074Z  INFO serve_dynamo.worker: [Processor:1] Starting Processor instance with all registered endpoints   
2025-04-28T20:51:25.074Z  INFO serve_dynamo.worker: [Processor:1] Serving Processor with primary lease   

root@gb-nvl-082-compute08:/workspace/examples/llm# MODEL=${MODEL:-"deepseek-ai/DeepSeek-R1-Distill-Llama-8B"}
#MODEL=${MODEL:-"deepseek-ai/DeepSeek-R1"}

curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "'${MODEL}'",
    "messages": [
        {
            "role": "user",
            "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
        }
    ],
    "stream":false,
    "max_tokens": 30
}'
2025-04-28T20:54:49.157Z  INFO chat_utils._log_chat_template_content_format: Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.   
2025-04-28T20:54:49.310Z  INFO worker.generate: [VllmWorker:1] Prefilling locally for request bf41b932-8272-4c7c-bea5-e856587e51f3 with length 193   
2025-04-28T20:54:49.310Z  INFO engine._handle_process_request: Added request bf41b932-8272-4c7c-bea5-e856587e51f3.   
{"id":"bf41b932-8272-4c7c-bea5-e856587e51f3","choices":[{"index":0,"message":{"content":"Okay, so I'm trying to help develop a char
8000
acter background for someone exploring Aeloria, this ancient city lost beneath the sands. The user provided","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"length","logprobs":null}],"created":1745873689,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":null}
