LMFlow: run_chatbot.sh fails with RuntimeError: Tensors must be contiguous
Hello, I downloaded the gpt-neo-2.7B model and ran run_chatbot.sh, which failed with the error below. How can I resolve it?
(lmflow) root@shenma:~/LMFlow# ./scripts/run_chatbot.sh output_models/gpt-neo-2.7B
[2023-04-06 09:55:55,760] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0: setting --include=localhost:0
[2023-04-06 09:55:55,769] [INFO] [runner.py:550:main] cmd = /root/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None examples/chatbot.py --deepspeed configs/ds_config_chatbot.json --model_name_or_path output_models/gpt-neo-2.7B
[2023-04-06 09:55:56,692] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-06 09:55:56,692] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-06 09:55:56,692] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-06 09:55:56,692] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-06 09:55:56,692] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
Traceback (most recent call last):
  File "/root/LMFlow/examples/chatbot.py", line 117, in <module>
    main()
  File "/root/LMFlow/examples/chatbot.py", line 44, in main
    model = AutoModel.get_model(
  File "/root/LMFlow/src/lmflow/models/auto_model.py", line 14, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/root/LMFlow/src/lmflow/models/hf_decoder_model.py", line 224, in __init__
    self.ds_engine = deepspeed.initialize(model=self.backend_model, config_params=ds_config)[0]
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 297, in __init__
    self._configure_distributed_model(model)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1182, in _configure_distributed_model
    self._broadcast_model()
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1105, in _broadcast_model
    dist.broadcast(p,
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 228, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 78, in broadcast
    return torch.distributed.broadcast(tensor=tensor,
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1555, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Tensors must be contiguous
[2023-04-06 09:56:28,731] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 439575
[2023-04-06 09:56:28,731] [ERROR] [launch.py:324:sigkill_handler] ['/root/anaconda3/envs/lmflow/bin/python', '-u', 'examples/chatbot.py', '--local_rank=0', '--deepspeed', 'configs/ds_config_chatbot.json', '--model_name_or_path', 'output_models/gpt-neo-2.7B'] exits with return code = 1
DeepSpeed version: 0.8.3
About this issue
- State: closed
- Created a year ago
- Comments: 17 (8 by maintainers)
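The traceback shows the failure inside DeepSpeed's _broadcast_model, where torch.distributed.broadcast rejects a parameter whose memory layout is not contiguous. The fix that was actually applied is not recorded here. As a rough sketch only, one workaround that may help is to copy any non-contiguous parameters into contiguous storage before deepspeed.initialize is called. The helper name make_parameters_contiguous and the tiny nn.Linear stand-in model below are hypothetical, and applying the helper to self.backend_model in hf_decoder_model.py is an assumption rather than a confirmed LMFlow change.

```python
import torch
from torch import nn


def make_parameters_contiguous(model: nn.Module) -> int:
    """Copy every non-contiguous parameter into contiguous storage.

    Hypothetical helper (not part of LMFlow or DeepSpeed).
    Returns the number of parameters that were changed.
    """
    fixed = 0
    for param in model.parameters():
        if not param.data.is_contiguous():
            # .contiguous() returns a copy with a dense, row-major layout.
            param.data = param.data.contiguous()
            fixed += 1
    return fixed


# Stand-in model for illustration: a transposed view is non-contiguous.
demo = nn.Linear(4, 3)
demo.weight.data = torch.randn(4, 3).t()  # shape (3, 4), strided view

num_fixed = make_parameters_contiguous(demo)
print(f"made {num_fixed} parameter(s) contiguous")
```

In LMFlow the same loop would presumably run on the loaded model right before the deepspeed.initialize(...) call shown in the traceback. It costs one extra copy per affected parameter and does not change the values.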
Okay, thank you. I have closed the issue.