🐛 Describe the bug
Hi all,
I'm trying to compile the current main version (d3d655ad14e) of PyTorch with ROCm 6.4.1 for Python 3.11 on Debian 12, and I'm running into issues. I hit the same issues when trying to compile v2.7.1.
I've set up all of the requirements as per the documentation: https://github.com/pytorch/pytorch?tab=readme-ov-file#amd-rocm-support
git submodule sync
git submodule update --init --recursive
pip install -r requirements.txt
pip install mkl-static mkl-include
python tools/amd_build/build_amd.py
After that, I start the build with the following command:
env CMAKE_GENERATOR="Unix Makefiles" \
CMAKE_CXX_COMPILER="/opt/rocm/llvm/bin/clang++" \
CMAKE_C_COMPILER="/opt/rocm/llvm/bin/clang" \
python setup.py bdist_wheel
I get the following linker error when trying to link libfbgemm.a:
ld.lld: error: undefined symbol: __kmpc_barrier
>>> referenced by Utils.cc
>>> Utils.cc.o:(std::pair<unsigned char*, unsigned char*> fbgemm::radix_sort_parallel<unsigned char, unsigned char>(unsigned char*, unsigned char*, unsigned char*, unsigned char*, long, long, bool) (.omp_outlined)) in archive ../lib/libfbgemm.a
>>> referenced by Utils.cc
>>> Utils.cc.o:(std::pair<unsigned char*, unsigned char*> fbgemm::radix_sort_parallel<unsigned char, unsigned char>(unsigned char*, unsigned char*, unsigned char*, unsigned char*, long, long, bool) (.omp_outlined)) in archive ../lib/libfbgemm.a
>>> referenced by Utils.cc
>>> Utils.cc.o:(std::pair<unsigned char*, unsigned char*> fbgemm::radix_sort_parallel<unsigned char, unsigned char>(unsigned char*, unsigned char*, unsigned char*, unsigned char*, long, long, bool) (.omp_outlined)) in archive ../lib/libfbgemm.a
>>> referenced 132 more times
...
It seems that the fbgemm library doesn't properly link with OpenMP.
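For reference, a small standalone OpenMP program along the lines below (the file name is just an example, not part of the tree) is how I'd check that this clang++ can find and link its OpenMP runtime at all; the explicit barrier pulls in the same __kmpc_barrier symbol that lld complains about:
// openmp_check.cpp -- hypothetical test file, compiled with:
//   /opt/rocm/llvm/bin/clang++ -fopenmp openmp_check.cpp -o openmp_check
#include <cstdio>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    std::printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    #pragma omp barrier  // an explicit barrier lowers to a __kmpc_barrier call
  }
  return 0;
}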
If I change the compile command and add the following *FLAGS (to force linking with OpenMP):
env CMAKE_GENERATOR="Unix Makefiles" \
CMAKE_CXX_COMPILER="/opt/rocm/llvm/bin/clang++" \
CMAKE_C_COMPILER="/opt/rocm/llvm/bin/clang" \
CMAKE_CXX_FLAGS="-fopenmp" \
HIPCC_COMPILE_FLAGS_APPEND="-fopenmp" \
HIPCC_LINK_FLAGS_APPEND="-fopenmp" \
python setup.py bdist_wheel
With that, linking of libfbgemm.a seems to work, but compilation still fails because the build system tries to compile the following .c files with -std=c++17:
torch/csrc/dynamo/cpython_defs.c
torch/csrc/dynamo/eval_frame.c
If I comment out the following line in cmake/Dependencies.cmake (because it seems that HIP_CXX_FLAGS are applied to both *.cpp and *.c files):
# list(APPEND HIP_CXX_FLAGS -std=c++17)
Then the compilation fully succeeds, but if I install the built Python wheel and try to import torch, I get the following error:
ImportError: .../python3.11/site-packages/torch/lib/libtorch_hip.so: undefined symbol: _ZNK2at10TensorBase14const_data_ptrIN3c104HalfELi0EEEPKT_v
This means that the following template specialization is missing:
c10::Half const* at::TensorBase::const_data_ptr<c10::Half, 0>() const
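Presumably something in the HIP build contains a call along these lines (purely illustrative; I haven't tracked down the actual call site), which is what creates the reference to that specialization:
#include <ATen/core/TensorBase.h>
#include <c10/util/Half.h>

// Hypothetical call site: instantiating const_data_ptr for c10::Half
// references the explicit specialization that the loader reports as undefined.
const c10::Half* half_data(const at::TensorBase& t) {
  return t.const_data_ptr<c10::Half>();
}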
And indeed, if I search the entire source tree, that specialization isn't defined anywhere. There are specializations for other types (uint16_t, uint32_t, ...), but one for c10::Half is missing.
// Found in: aten/src/ATen/templates/TensorMethods.cpp:21
// Found in: torchgen/packaged/ATen/templates/TensorMethods.cpp:21
#define DEFINE_CAST(T, name)                                                  \
  template <>                                                                 \
  TORCH_API const T* TensorBase::const_data_ptr() const {                     \
    check_type(*this, ScalarType::name, #name);                               \
    return this->unsafeGetTensorImpl()->data_ptr_impl<T>();                   \
  }                                                                           \
                                                                              \
  template <>                                                                 \
  TORCH_API const T* TensorBase::const_data_ptr<const T>() const {            \
    check_type(*this, ScalarType::name, #name);                               \
    return this->unsafeGetTensorImpl()->data_ptr_impl<std::remove_const_t<T>>(); \
  }                                                                           \
                                                                              \
  template <>                                                                 \
  TORCH_API T* TensorBase::mutable_data_ptr() const {                         \
    check_type(*this, ScalarType::name, #name);                               \
    return this->unsafeGetTensorImpl()->mutable_data_ptr_impl<T>();           \
  }                                                                           \
                                                                              \
  template <>                                                                 \
  TORCH_API T* TensorBase::data_ptr() const {                                 \
    return mutable_data_ptr<T>();                                             \
  }
AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_CAST)
AT_FORALL_QINT_TYPES(DEFINE_CAST)
DEFINE_CAST(uint16_t, UInt16)
DEFINE_CAST(uint32_t, UInt32)
DEFINE_CAST(uint64_t, UInt64)
#undef DEFINE_CAST
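For comparison, if the macro were also instantiated for Half, the first specialization it expands to would look roughly like this (a sketch of the expansion, assuming ScalarType::Half is the corresponding enumerator); as far as I can tell, this is exactly the symbol the import error refers to:
// Sketch of the const accessor that DEFINE_CAST(c10::Half, Half) would emit;
// as far as I can tell, this definition is not present in the source tree.
template <>
TORCH_API const c10::Half* TensorBase::const_data_ptr() const {
  check_type(*this, ScalarType::Half, "Half");
  return this->unsafeGetTensorImpl()->data_ptr_impl<c10::Half>();
}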
In the above examples I'm using AMD's clang compiler from the ROCm bundle, because with Debian's compilers (both gcc and clang) I get compilation errors much earlier. Its version is:
AMD clang version 19.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.4.1 25184 c87081df219c42dc27c5b6d86c0525bc7d01f727)
I'm a developer and I have experience with C++/CMake/Python so I can debug and provide whatever information is necessary, although I don't have experience with your codebase.
I've successfully compiled both onnxruntime and CTranslate2 on this machine with this toolchain, so I don't think that it's a toolchain issue...
Thanks.
Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version: 14.0.6
CMake version: version 4.0.2
Libc version: glibc-2.36
Python version: 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-6.1.0-37-amd64-x86_64-with-glibc2.36
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 9950X 16-Core Processor
CPU family: 26
Model: 68
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 71%
CPU max MHz: 4300.0000
CPU min MHz: 3000.0000
BogoMIPS: 8599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
Virtualization: AMD-V
L1d cache: 768 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 16 MiB (16 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==2.3.0
[pip3] optree==0.16.0
[conda] Could not collect
cc @malfet @seemethere @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd