Description
🐛 Describe the bug
Hello,
I successfully built sgl-kernel (https://github.com/sgl-project/sglang/tree/main/sgl-kernel) for sm_120 (NVIDIA RTX 50 series) with CUDA 12.8, but hit the following error when launching the server via `sglang.launch_server` with `--enable-torch-compile`. Please help.
Suspicious log
[rank0]:E0410 07:52:47.507000 255947 torch/_inductor/select_algorithm.py:2134] [18/3] No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help
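For context, a back-of-the-envelope estimate of my own reproduces the `Required: 110592` figure for the failing template config (`BLOCK_M=16, BLOCK_N=128, BLOCK_K=128, num_stages=4`, bf16), assuming Triton's software pipeliner keeps `num_stages - 1` copies of the A and B tiles in shared memory. That puts this config just above the 101376-byte limit the GPU reports:

```python
# Rough shared-memory estimate for the failing Triton GEMM config.
# Assumption (mine, not from the log): the pipeliner buffers
# num_stages - 1 copies of the A and B tiles in shared memory.
BLOCK_M, BLOCK_N, BLOCK_K = 16, 128, 128
num_stages = 4
dtype_bytes = 2  # bfloat16

tile_elems = BLOCK_M * BLOCK_K + BLOCK_N * BLOCK_K  # one A tile + one B tile
smem_bytes = tile_elems * dtype_bytes * (num_stages - 1)

print(smem_bytes)           # 110592 -> matches "Required: 110592"
print(smem_bytes > 101376)  # True   -> exceeds the reported hardware limit
```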
Full error logs:
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--dtype bfloat16 \
--context-length 8192 \
--enable-torch-compile
INFO 04-10 07:51:29 [__init__.py:256] Automatically detected platform cuda.
[2025-04-10 07:51:31] server_args=ServerArgs(model_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='bfloat16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=8192, device='cuda', served_model_name='meta-llama/Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=386970438, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode=None, enable_torch_compile=True, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, n_share_experts_fusion=0, disable_shared_experts_fusion=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
INFO 04-10 07:51:34 [__init__.py:256] Automatically detected platform cuda.
INFO 04-10 07:51:34 [__init__.py:256] Automatically detected platform cuda.
[2025-04-10 07:51:37 TP0] Attention backend not set. Use flashinfer backend by default.
[2025-04-10 07:51:37 TP0] Init torch distributed begin.
[W410 07:51:37.867666368 ProcessGroupNCCL.cpp:959] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-04-10 07:51:37 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-10 07:51:37 TP0] Load weight begin. avail mem=30.83 GB
[2025-04-10 07:51:38 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.80it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.52it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:01<00:00, 2.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.89it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.84it/s]
[2025-04-10 07:51:41 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=15.72 GB, mem usage=15.12 GB.
[2025-04-10 07:51:41 TP0] KV Cache is allocated. #tokens: 98436, K size: 6.01 GB, V size: 6.01 GB
[2025-04-10 07:51:41 TP0] Memory pool end. avail mem=3.40 GB
2025-04-10 07:51:42,094 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[2025-04-10 07:51:42 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=2.89 GB
Capturing batches (avail_mem=2.89 GB): 0%| | 0/23 [00:00<?, ?it/s]2025-04-10 07:51:42,595 - INFO - flashinfer.jit: Loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
2025-04-10 07:51:42,624 - INFO - flashinfer.jit: Finished loading JIT ops: batch_decode_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False
Capturing batches (avail_mem=1.76 GB): 83%|█████████████████████████████████████████████████████████████████▎ | 19/23 [00:46<00:36, 9.21s/it][rank0]:E0410 07:52:46.205000 255947 torch/_inductor/select_algorithm.py:1905] [18/3] Exception No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_admin2/2v/c2v53oqtsrcythafq3wmf7ttbffclvmwuddkrnvu6qbu34humvb5.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4)
[rank0]:E0410 07:52:47.507000 255947 torch/_inductor/select_algorithm.py:2134] [18/3] Runtime error during autotuning:
[rank0]:E0410 07:52:47.507000 255947 torch/_inductor/select_algorithm.py:2134] [18/3] No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help..
[rank0]:E0410 07:52:47.507000 255947 torch/_inductor/select_algorithm.py:2134] [18/3] Ignoring this choice.
AUTOTUNE mm(8x4096, 4096x128256)
strides: [4096, 1], [1, 4096]
dtypes: torch.bfloat16, torch.bfloat16
triton_mm_11 0.6508 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_1 0.6527 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
triton_mm_8 0.6529 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_16 0.6529 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
triton_mm_7 0.6533 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_4 0.6548 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
mm 0.6564 ms 99.1%
triton_mm_2 0.8950 ms 72.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_3 0.9093 ms 71.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
triton_mm_13 0.9093 ms 71.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.6529 seconds and 3.4570 seconds precompiling for 18 choices
Capturing batches (avail_mem=1.74 GB): 87%|████████████████████████████████████████████████████████████████████▋ | 20/23 [01:06<00:38, 12.72s/it][rank0]:E0410 07:52:55.281000 255947 torch/_inductor/select_algorithm.py:1905] [5/4_1] Exception No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_admin2/dq/cdqvg3b47v3ynrgzorcj3cxhzuakj73ueg6kiu3b3duuqvyoucuu.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4)
[rank0]:E0410 07:52:56.331000 255947 torch/_inductor/select_algorithm.py:2134] [5/4_1] Runtime error during autotuning:
[rank0]:E0410 07:52:56.331000 255947 torch/_inductor/select_algorithm.py:2134] [5/4_1] No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help..
[rank0]:E0410 07:52:56.331000 255947 torch/_inductor/select_algorithm.py:2134] [5/4_1] Ignoring this choice.
AUTOTUNE mm(4x4096, 4096x6144)
strides: [4096, 1], [1, 4096]
dtypes: torch.bfloat16, torch.bfloat16
triton_mm_33 0.0342 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
triton_mm_25 0.0348 ms 98.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_21 0.0348 ms 98.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
triton_mm_19 0.0411 ms 83.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
mm 0.0420 ms 81.5%
triton_mm_18 0.0430 ms 79.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
triton_mm_24 0.0430 ms 79.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_20 0.0444 ms 77.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
triton_mm_28 0.0485 ms 70.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_31 0.0518 ms 66.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.3201 seconds and 3.5530 seconds precompiling for 18 choices
[rank0]:E0410 07:53:02.558000 255947 torch/_inductor/select_algorithm.py:1905] [15/4] Exception No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_admin2/3k/c3k437xehmkbwej7ef7t5iacnc2xa4usgxigf2wf2lbrv5rdqpmt.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4)
[rank0]:E0410 07:53:03.597000 255947 torch/_inductor/select_algorithm.py:2134] [15/4] Runtime error during autotuning:
[rank0]:E0410 07:53:03.597000 255947 torch/_inductor/select_algorithm.py:2134] [15/4] No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help..
[rank0]:E0410 07:53:03.597000 255947 torch/_inductor/select_algorithm.py:2134] [15/4] Ignoring this choice.
AUTOTUNE mm(4x4096, 4096x4096)
strides: [4096, 1], [1, 4096]
dtypes: torch.bfloat16, torch.bfloat16
triton_mm_42 0.0239 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_38 0.0245 ms 97.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
triton_mm_50 0.0281 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
triton_mm_35 0.0410 ms 58.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
triton_mm_41 0.0410 ms 58.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
mm 0.0420 ms 57.0%
triton_mm_36 0.0423 ms 56.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_37 0.0424 ms 56.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
triton_mm_45 0.0424 ms 56.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_48 0.0444 ms 53.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.3096 seconds and 3.4381 seconds precompiling for 18 choices
[rank0]:E0410 07:53:08.888000 255947 torch/_inductor/select_algorithm.py:1905] [16/4] Exception No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_admin2/if/cifharvgs2sarmdz3enrviovyllg7mgznsjr5hs2nmukl35pefhj.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4)
[rank0]:E0410 07:53:10.461000 255947 torch/_inductor/select_algorithm.py:2134] [16/4] Runtime error during autotuning:
[rank0]:E0410 07:53:10.461000 255947 torch/_inductor/select_algorithm.py:2134] [16/4] No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help..
[rank0]:E0410 07:53:10.461000 255947 torch/_inductor/select_algorithm.py:2134] [16/4] Ignoring this choice.
AUTOTUNE mm(4x4096, 4096x28672)
strides: [4096, 1], [1, 4096]
dtypes: torch.bfloat16, torch.bfloat16
triton_mm_58 0.1489 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_62 0.1489 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_67 0.1495 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
triton_mm_52 0.1509 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
triton_mm_59 0.1526 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_55 0.1530 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
mm 0.1695 ms 87.9%
triton_mm_65 0.1761 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
triton_mm_53 0.1859 ms 80.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_64 0.1879 ms 79.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.5421 seconds and 5.2288 seconds precompiling for 18 choices
[rank0]:E0410 07:53:10.590000 255947 torch/_inductor/select_algorithm.py:1905] [16/4] Exception No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help. for benchmark choice TritonTemplateCaller(/tmp/torchinductor_admin2/ei/ceiw3uwk25mdxmun4j6oxjcvx5dl6iewuutgnwxz3kdr4erjmbj7.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4)
[rank0]:E0410 07:53:10.945000 255947 torch/_inductor/select_algorithm.py:2134] [16/4] Runtime error during autotuning:
[rank0]:E0410 07:53:10.945000 255947 torch/_inductor/select_algorithm.py:2134] [16/4] No valid triton configs. OutOfResources: out of resource: shared memory, Required: 110592, Hardware limit: 101376. Reducing block sizes or `num_stages` may help..
[rank0]:E0410 07:53:10.945000 255947 torch/_inductor/select_algorithm.py:2134] [16/4] Ignoring this choice.
AUTOTUNE mm(4x14336, 14336x4096)
strides: [14336, 1], [1, 14336]
dtypes: torch.bfloat16, torch.bfloat16
triton_mm_76 0.0813 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_72 0.0814 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
mm 0.0876 ms 92.8%
triton_mm_84 0.0976 ms 83.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
triton_mm_69 0.1255 ms 64.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
triton_mm_75 0.1284 ms 63.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_79 0.1366 ms 59.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
triton_mm_71 0.1368 ms 59.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
triton_mm_70 0.1407 ms 57.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
triton_mm_82 0.1572 ms 51.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.4711 seconds and 0.0004 seconds precompiling for 18 choices
Capturing batches (avail_mem=1.74 GB): 87%|████████████████████████████████████████████████████████████████████▋ | 20/23 [01:32<00:13, 4.62s/it]
[2025-04-10 07:53:14 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/admin2/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/home/admin2/Projects/sglang/python/sglang/srt/managers/scheduler.py", line 249, in __init__
self.tp_worker = TpWorkerClass(
File "/home/admin2/Projects/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
File "/home/admin2/Projects/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in __init__
self.model_runner = ModelRunner(
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/model_runner.py", line 177, in __init__
self.initialize(min_per_gpu_memory)
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/model_runner.py", line 215, in initialize
self.init_cuda_graphs()
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/model_runner.py", line 933, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 267, in __init__
self.capture()
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 351, in capture
) = self.capture_one_batch_size(bs, forward)
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 443, in capture_one_batch_size
run_once()
File "/home/admin2/Projects/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 436, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 658, in _fn
return fn(*args, **kwargs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 70, in inner
return fn(*args, **kwargs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/admin2/Projects/sglang/python/sglang/srt/models/llama.py", line 420, in forward
hidden_states = self.model(
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/home/admin2/Projects/sglang/python/sglang/srt/models/llama.py", line 309, in forward
hidden_states, residual = layer(
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/home/admin2/Projects/sglang/python/sglang/srt/models/llama.py", line 239, in forward
def forward(
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 850, in _fn
return fn(*args, **kwargs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1207, in forward
return compiled_fn(full_args)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 331, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 692, in inner_fn
outs = compiled_fn(args)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 498, in wrapper
return compiled_fn(runtime_args)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 558, in __call__
return self.current_callable(inputs)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2441, in run
return model(new_inputs)
File "/tmp/torchinductor_admin2/cn/ccn2ppbcgn45hf2ub63ainr4rid5owhgdhmtnl3ikzbqhw4nuvpz.py", line 149, in call
triton_red_fused__to_copy_add_mean_mul_pow_rsqrt_0.run(arg1_1, arg0_1, arg2_1, buf1, buf2, 4, 4096, stream=stream0)
File "/home/admin2/.virtualenvs/compile/lib/python3.10/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 1014, in run
if launcher.store_cubin and (not benchmark_run or not self.cuda_kernel_saved):
AttributeError: 'NoneType' object has no attribute 'store_cubin'
[2025-04-10 07:53:14] Received sigquit from a child process. It usually means the child failed.
Killed
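Every choice Inductor discards above is a Triton GEMM template that wants more shared memory than the GPU exposes, while the ATen `mm` fallback benchmarks fine in the same tables. As a workaround idea (untested on my setup), restricting Inductor's GEMM autotuning to the ATen backend should keep those templates from being compiled at all:

```python
# Untested workaround idea (mine): keep only the ATen backend during
# Inductor's GEMM max-autotune, so the Triton template configs that need
# 110592 bytes of shared memory are never compiled or benchmarked.
import os
os.environ["TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS"] = "ATEN"

# Shell equivalent, set before launching the server:
#   TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=ATEN \
#   python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --dtype bfloat16 --context-length 8192 --enable-torch-compile
```

This would only sidestep the problem; the root cause (Inductor proposing configs above the sm_120 shared-memory limit, and the later `'NoneType' object has no attribute 'store_cubin'` crash when a kernel whose launcher never built is run during CUDA graph capture) would remain.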
Versions
PyTorch version: 2.8.0.dev20250407+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.35
Python version: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-57-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 570.124.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 3970X 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 4549,1211
CPU min MHz: 2200,0000
BogoMIPS: 7399.85
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization: AMD-V
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 16 MiB (32 instances)
L3 cache: 128 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.8.0.87
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] torch==2.8.0.dev20250407+cu128
[pip3] torchao==0.9.0
[pip3] torchaudio==2.6.0.dev20250407+cu128
[pip3] torchvision==0.22.0.dev20250407+cu128
[pip3] triton==3.3.0+git61cb963f
[conda] Could not collect
cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @aakhundov