
DeepSpeed compilation (cpu_adam issue) #288

Open
JohnTailor opened this issue Jun 26, 2023 · 0 comments
JohnTailor commented Jun 26, 2023

Thanks for the repo. If I use torchrun as suggested on the webpage (see the command below), training fails with an error while compiling cpu_adam within the DeepSpeed library. The actual error comes from an nvcc compile command (also shown below). The error is:

/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:

This leads to a follow-up error: AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
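
For reference, the failing JIT build can be reproduced in isolation, without train.py (a minimal sketch; it only constructs a DeepSpeedCPUAdam so that the cpu_adam extension gets compiled):

# Trigger the cpu_adam JIT build on its own (minimal sketch, no training involved)
python -c "import torch; from deepspeed.ops.adam import DeepSpeedCPUAdam; DeepSpeedCPUAdam([torch.nn.Parameter(torch.zeros(1))])"
# DeepSpeed's environment report also lists which ops it considers compatible
ds_report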

I googled and tried a few things:
NVlabs/instant-ngp#119
microsoft/DeepSpeed#1846

but that did not help. Does anyone have any ideas?

My guess is that there is some version issue with gcc or the CUDA environment. But since I installed Alpaca into a fresh virtual environment (I tried both conda and venv), versioning issues should not really happen. So maybe it is something else...
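
These are the checks I would start with, since this std_function.h error is commonly reported as an incompatibility between an older nvcc and the gcc 11 headers on the system, rather than anything in the Python environment:

# Host compiler and CUDA toolkit that the JIT build picks up
gcc --version
g++ --version
/usr/bin/nvcc --version
# CUDA version PyTorch was built against, for comparison
python -c "import torch; print(torch.__version__, torch.version.cuda)"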

Commands causing errors:

torchrun --nproc_per_node=4 --master_port=23222 train.py --model_name_or_path /home/johannes/modelhf/llama-7b/ --data_path /home/johannes/alpaca/alpaca_data.json --bf16 True --output_dir /home/johannes/outalpaca --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --deepspeed "./configs/default_offload_opt_param.json" --tf32 True

Later it invokes the following nvcc command, which raises the above error in /usr/include/c++/11/bits/std_function.h:

/usr/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/johannes/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /home/johannes/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
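
One workaround I may still try (a sketch under assumptions: gcc-10 is installable via apt, and the extension build honors CC/CXX): pre-build the cpu_adam op at install time with an older host compiler instead of letting it JIT-compile during training:

# Possible workaround sketch (untested): older host compiler + ahead-of-time build
sudo apt-get install -y gcc-10 g++-10
export CC=gcc-10 CXX=g++-10   # assumption: the build picks up CC/CXX for the host compiler
DS_BUILD_CPU_ADAM=1 pip install --no-cache-dir --force-reinstall deepspeed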
