ColossalAI: [BUG]: colossalai/kernel/cuda_native/csrc/moe_cuda_kernel.cu:5:10: fatal error: cub/cub.cuh: No such file or directory (update: now with more build errors!)
🐛 Describe the bug
Trying to run a finetuning script via torchrun, I get the error below. ColossalAI was built from source as directed, but it still fails (a quick check for the compiled extensions is sketched after the log).
anon@linuxmint:/media/anon/bighdd/ai/toolbox/training$ ./finetune.bash
+ export BATCH_SIZE=4
+ BATCH_SIZE=4
+ export MODEL=/media/anon/bighdd/ai/models/opt-350m
+ MODEL=/media/anon/bighdd/ai/models/opt-350m
+ export NUMBER_OF_GPUS=1
+ NUMBER_OF_GPUS=1
+ export OUTPUT_DIR=checkpoints
+ OUTPUT_DIR=checkpoints
++ date +%Y-%m-%d_%H-%M-%S
+ LOG_NAME=2022-12-22_14-15-45
+ export HF_DATASETS_OFFLINE=1
+ HF_DATASETS_OFFLINE=1
+ mkdir -p checkpoints/logs
+ mkdir -p checkpoints/runs
+ torchrun --nproc_per_node 1 --master_port 19198 ./colossalai/run_clm.py --train_file ./data/train.json --learning_rate 2e-5 --checkpointing_steps 64 --mem_cap 0 --model_name_or_path /media/anon/bighdd/ai/models/opt-350m --output_dir checkpoints --per_device_eval_batch_size 4 --per_device_train_batch_size 4
+ tee checkpoints/logs/2022-12-22_14-15-45.log
2022-12-22 14:15:51.339450: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Colossalai should be built with cuda extension to use the FP16 optimizer
If you want to activate cuda mode for MoE, please install with cuda_ext!
[12/22/22 14:15:54] INFO colossalai - colossalai - INFO:
/home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[12/22/22 14:15:55] INFO colossalai - colossalai - INFO:
/home/anon/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/anon/.local/lib/python3.8/site-packages/colossalai/initialize.py:117
launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline
parallel size: 1, tensor parallel size: 1
INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:309 main
INFO colossalai - colossalai - INFO: Start preparing dataset
Using custom data configuration default-ced548c04fa8d0c8
Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 597.82it/s]
Using custom data configuration default-ced548c04fa8d0c8
Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Using custom data configuration default-ced548c04fa8d0c8
Found cached dataset json (/home/anon/.cache/huggingface/datasets/json/default-ced548c04fa8d0c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:350 main
INFO colossalai - colossalai - INFO: Dataset is prepared
INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:366 main
INFO colossalai - colossalai - INFO: Model config has been created
load model from /media/anon/bighdd/ai/models/opt-350m
INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:373 main
INFO colossalai - colossalai - INFO: GPT2Tokenizer has been created
INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:388 main
INFO colossalai - colossalai - INFO: Finetune a pre-trained model
[12/22/22 14:16:04] INFO colossalai - ProcessGroup - INFO:
/home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/process_group.py:24 get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
[12/22/22 14:16:07] INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:400 main
INFO colossalai - colossalai - INFO: using Colossal-AI version 0.1.13
searching chunk configuration is completed in 0.67 s.
used number: 315.85 MB, wasted number: 3.01 MB
total wasted percentage is 0.95%
/home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:40: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
return tensor.storage().size() == 0
/home/anon/.local/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py:45: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
tensor.storage().resize_(0)
[12/22/22 14:16:09] INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:415 main
INFO colossalai - colossalai - INFO: GeminiDDP has been created
Running tokenizer on dataset: 100%|██████████| 10/10 [00:23<00:00, 2.34s/ba]
Running tokenizer on dataset: 100%|██████████| 1/1 [00:01<00:00, 1.18s/ba]
[12/22/22 14:16:37] WARNING colossalai - colossalai - WARNING: ./colossalai/run_clm.py:444 main
WARNING colossalai - colossalai - WARNING: The tokenizer picked seems to have a very large `model_max_length`
(1000000000000000019884624838656). Picking 1024 instead. You can change that default value by passing
--block_size xxx.
Grouping texts in chunks of 1024: 100%|██████████| 10/10 [00:05<00:00, 1.92ba/s]
Grouping texts in chunks of 1024: 100%|██████████| 1/1 [00:00<00:00, 3.61ba/s]
[12/22/22 14:16:42] INFO colossalai - colossalai - INFO: ./colossalai/run_clm.py:503 main
INFO colossalai - colossalai - INFO: Dataloaders have been created
/home/anon/.local/lib/python3.8/site-packages/colossalai/tensor/colo_tensor.py:182: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
ret = func(*args, **kwargs)
/home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor._storage() instead of tensor.storage()
numel += p.storage().size()
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py:80 in │
│ __init__ │
│ │
│ 77 │ │ super(HybridAdam, self).__init__(model_params, default_args, nvme_offload_fracti │
│ 78 │ │ self.adamw_mode = adamw_mode │
│ 79 │ │ try: │
│ ❱ 80 │ │ │ import colossalai._C.cpu_optim │
│ 81 │ │ │ import colossalai._C.fused_optim │
│ 82 │ │ except ImportError: │
│ 83 │ │ │ raise ImportError('Please install colossalai from source code to use HybridA │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'colossalai._C.cpu_optim'
During handling of the above exception, another exception occurred:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py:643 in <module> │
│ │
│ 640 │
│ 641 │
│ 642 if __name__ == "__main__": │
│ ❱ 643 │ main() │
│ 644 │
│ │
│ /media/anon/bighdd/ai/toolbox/training/./colossalai/run_clm.py:519 in main │
│ │
│ 516 │ │ }, │
│ 517 │ ] │
│ 518 │ │
│ ❱ 519 │ optimizer = HybridAdam(optimizer_grouped_parameters, lr=args.learning_rate) │
│ 520 │ optimizer = ZeroOptimizer(optimizer, model, initial_scale=2**14) │
│ 521 │ │
│ 522 │ # Scheduler and math around the number of training steps. │
│ │
│ /home/anon/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py:83 in │
│ __init__ │
│ │
│ 80 │ │ │ import colossalai._C.cpu_optim │
│ 81 │ │ │ import colossalai._C.fused_optim │
│ 82 │ │ except ImportError: │
│ ❱ 83 │ │ │ raise ImportError('Please install colossalai from source code to use HybridA │
│ 84 │ │ │
│ 85 │ │ self.cpu_adam_op = colossalai._C.cpu_optim.CPUAdamOptimizer(lr, betas[0], betas[ │
│ 86 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ adamw_mode) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: Please install colossalai from source code to use HybridAdam
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 206247) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/anon/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/anon/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./colossalai/run_clm.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-12-22_14:16:47
host : linuxmint
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 206247)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
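The traceback bottoms out at `ModuleNotFoundError: No module named 'colossalai._C.cpu_optim'`, i.e. the compiled CPU/CUDA extensions are missing from the installed package. Below is a minimal sketch of how to verify that and to rebuild with the extensions enabled; the module names come straight from the traceback, while the `CUDA_EXT=1` flag is taken from the ColossalAI README of that era and should be treated as an assumption.

# Check whether the compiled extensions that HybridAdam imports actually exist
# (module names taken verbatim from the traceback above):
python -c "import colossalai._C.cpu_optim, colossalai._C.fused_optim; print('extensions OK')"

# If that import fails, rebuild from source with the extensions enabled.
# CUDA_EXT=1 is the flag from the ColossalAI README for 0.1.x (assumption):
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
CUDA_EXT=1 pip install .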
Environment
- Python 3.8.10
- torch 2.0.0.dev20221215+cu117
- colossalai 0.1.13
- GPU: NVIDIA RTX 3060 12GB
- NVIDIA-SMI 525.60.11, Driver Version: 525.60.11, CUDA Version: 12.0
- nvcc: Cuda compilation tools, release 10.1, V10.1.243
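Note the mismatch above: nvcc reports a 10.1 toolkit, while torch is a cu117 build and the driver reports CUDA 12.0. CUB only started shipping with the CUDA Toolkit in 11.0, which would explain the `cub/cub.cuh: No such file or directory` error in the title when the MoE kernel is compiled with the 10.1 toolkit; PyTorch's extension builder also expects the toolkit to match the CUDA version torch was built with. A quick way to compare the versions in play (standard CUDA/PyTorch commands):

# CUDA toolkit that nvcc (and hence the extension build) will use:
nvcc --version
# CUDA version the installed driver supports (reported by nvidia-smi):
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA version this torch build was compiled against (11.7 here):
python -c "import torch; print(torch.version.cuda)"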
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 15 (6 by maintainers)
Update: downgraded to PyTorch 1.10 with CUDA 10.2, and now the ColossalAI build itself fails.
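For reference, a minimal sketch of the downgrade-and-rebuild path that was attempted; the exact wheel specifier and index URL are assumptions based on the standard PyTorch 1.10 install instructions, not the precise commands used.

# Install a torch 1.10 build that matches a CUDA 10.2 toolchain
# (wheel specifier/index URL assumed from the standard PyTorch 1.10 instructions):
pip install "torch==1.10.2+cu102" -f https://download.pytorch.org/whl/torch_stable.html
# Rebuild ColossalAI from source with the extensions enabled (CUDA_EXT=1 as above):
cd ColossalAI
CUDA_EXT=1 pip install .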