wandb: CLI: Windows with Anaconda prompt — W&B process failed to launch
Wandb version: 0.8.1, Python 3.6, Amazon DLAMI 23 environment, local-only mode (no API key set, using local wandb logging)
Wandb is crashing with the following message:
ec2-user@ip-172-31-6-116 ~]$ { python nccl_multiversion.py --internal_role=worker --internal_cmd='/home/ec2-user/anaconda3/bin/mpirun -n 16 -N 8 -x FI_PROVIDER="efa" -x FI_OFI_RXR_RX_COPY_UNEXP=1 -x FI_OFI_RXR_RX_COPY_OOO=1 -x FI_EFA_MR_CACHE_ENABLE=1 -x FI_OFI_RXR_INLINE_MR_ENABLE=1 -x LD_LIBRARY_PATH=/home/ec2-user/nccl/nccl-2.4.6/aws-ofi-nccl/install/lib/:/home/ec2-user/nccl/nccl-2.4.6/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:/home/ec2-user/anaconda3/lib:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 --host 172.31.6.116,172.31.6.125 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none --oversubscribe /home/ec2-user/nccl/nccl-2.4.6/nccl-tests/build/all_reduce_perf -b 8 -e 1024M -f 2 -g 1 -c 1 ' --internal_info=80037D7100285809000000435544415F484F4D45710158140000002F7573722F6C6F63616C2F637564612D31302E30710258080000004D50495F484F4D45710358180000002F686F6D652F6563322D757365722F616E61636F6E646133710458080000004E554D5F4750555371054B1058090000004E5045525F4E4F444571064B08580700000053495A455F4D4271074D00045806000000646F5F65666171084B0158090000006F66695F706174636871098958100000004E43434C5F56455253494F4E5F544147710A5805000000322E342E36710B580B000000464F4C4445525F524F4F54710C581E0000002F686F6D652F6563322D757365722F6E63636C2F6E63636C2D322E342E36710D58090000004E43434C5F484F4D45710E58290000002F686F6D652F6563322D757365722F6E63636C2F6E63636C2D322E342E362F6E63636C2F6275696C64710F58080000004546415F484F4D457110580F0000002F6F70742F616D617A6F6E2F6566617111752E; } > >(tee -a /tmp/ncluster/0.nccl_multiversion-kyezc/8.out) 2> >(tee -a /tmp/ncluster/0.nccl_multiversion-kyezc/8.out >&2); echo $? > /tmp/ncluster/0.nccl_multiversion-kyezc/8.status
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: No credentials found. Run "wandb login" to visualize your metrics
wandb: Started W&B process version 0.8.1 with PID 26257
wandb: ERROR W&B process (PID 26257) did not respond
wandb: ERROR Failed to kill wandb process, PID 26257
wandb: ERROR W&B process failed to launch, see: wandb/debug.log
Traceback (most recent call last):
File "nccl_multiversion.py", line 281, in <module>
main()
File "nccl_multiversion.py", line 275, in main
worker()
File "nccl_multiversion.py", line 251, in worker
wandb.init(project='nccl_multiversion', name=name)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/wandb/__init__.py", line 776, in init
_init_headless(run, False)
File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/wandb/__init__.py", line 248, in _init_headless
"W&B process failed to launch, see: {}".format(path))
wandb.run_manager.LaunchError: W&B process failed to launch, see: wandb/debug.log
Here’s the debug log:
2019-06-05 15:23:49,504 DEBUG MainThread:2899 [wandb_config.py:_load_defaults():81] no defaults not found in config-defaults.yaml
2019-06-05 15:23:49,513 DEBUG MainThread:2899 [cmd.py:execute():722] Popen(['git', 'cat-file', '--batch-check'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:49,529 DEBUG MainThread:2899 [cmd.py:execute():722] Popen(['git', 'rev-parse', '--show-toplevel'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:49,560 DEBUG MainThread:2899 [run_manager.py:__init__():452] Initialized sync for nccl_multiversion/7t5zzgq4
2019-06-05 15:23:49,566 INFO MainThread:2899 [run_manager.py:wrap_existing_process():994] wrapping existing process 2848
2019-06-05 15:23:49,566 WARNING MainThread:2899 [io_wrap.py:register():104] SIGWINCH handler was not None: <Handlers.SIG_DFL: 0>
2019-06-05 15:23:49,571 DEBUG MainThread:2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): pypi.org:443
2019-06-05 15:23:49,739 DEBUG MainThread:2899 [connectionpool.py:_make_request():393] https://pypi.org:443 "GET /pypi/wandb/json HTTP/1.1" 200 31736
2019-06-05 15:23:49,776 INFO MainThread:2899 [run_manager.py:init_run():810] system metrics and metadata threads started
2019-06-05 15:23:49,776 INFO MainThread:2899 [run_manager.py:init_run():844] upserting run before process can begin, waiting at most 10 seconds
2019-06-05 15:23:49,786 DEBUG Thread-13 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): api.wandb.ai:443
2019-06-05 15:23:49,977 DEBUG Thread-13 :2899 [connectionpool.py:_make_request():393] https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 408
2019-06-05 15:23:49,979 INFO Thread-13 :2899 [run_manager.py:_upsert_run():908] saving patches
2019-06-05 15:23:49,980 DEBUG Thread-13 :2899 [cmd.py:execute():722] Popen(['git', 'rev-parse', '--show-toplevel'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:49,994 DEBUG Thread-13 :2899 [cmd.py:execute():722] Popen(['git', 'diff', '--cached', '--abbrev=40', '--full-index', '--raw'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:50,012 DEBUG Thread-13 :2899 [cmd.py:execute():722] Popen(['git', 'diff', '--abbrev=40', '--full-index', '--raw'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:50,030 DEBUG Thread-13 :2899 [cmd.py:execute():722] Popen(['git', 'version'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:50,073 DEBUG Thread-13 :2899 [cmd.py:execute():722] Popen(['git', 'merge-base', 'HEAD', '567306fda7365c06b3a049f3e4ff3f7c252e84e7'], cwd=/Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks, universal_newlines=False, shell=None)
2019-06-05 15:23:50,091 INFO Thread-13 :2899 [run_manager.py:_upsert_run():910] saving pip packages
2019-06-05 15:23:50,093 INFO Thread-13 :2899 [run_manager.py:_upsert_run():912] initializing streaming files api
2019-06-05 15:23:50,094 INFO Thread-13 :2899 [run_manager.py:_upsert_run():919] unblocking file change observer, beginning sync with W&B servers
2019-06-05 15:23:50,094 INFO MainThread:2899 [run_manager.py:wrap_existing_process():1011] informing user process we are ready to proceed
2019-06-05 15:23:50,095 INFO MainThread:2899 [run_manager.py:_sync_etc():1067] entering loop for messages from user process
2019-06-05 15:23:50,103 INFO MainThread:2899 [run_manager.py:_sync_etc():1090] received message from user process: {"exitcode": 1}
2019-06-05 15:23:50,104 INFO MainThread:2899 [run_manager.py:_sync_etc():1171] closing log streams and sending exitcode to W&B
2019-06-05 15:23:50,104 INFO MainThread:2899 [run_manager.py:shutdown():926] shutting down system stats and metadata service
2019-06-05 15:23:50,511 INFO Thread-2 :2899 [run_manager.py:_on_file_created():576] file/dir created: /Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks/wandb/run-20190605_222348-7t5zzgq4/wandb-metadata.json
2019-06-05 15:23:50,512 INFO Thread-2 :2899 [run_manager.py:_on_file_created():576] file/dir created: /Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks/wandb/run-20190605_222348-7t5zzgq4/diff.patch
2019-06-05 15:23:50,513 INFO Thread-2 :2899 [run_manager.py:_on_file_created():576] file/dir created: /Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks/wandb/run-20190605_222348-7t5zzgq4/wandb-events.jsonl
2019-06-05 15:23:50,513 INFO Thread-2 :2899 [run_manager.py:_on_file_created():576] file/dir created: /Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks/wandb/run-20190605_222348-7t5zzgq4/requirements.txt
2019-06-05 15:23:50,514 INFO Thread-2 :2899 [run_manager.py:_on_file_created():576] file/dir created: /Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks/wandb/run-20190605_222348-7t5zzgq4/wandb-history.jsonl
2019-06-05 15:23:50,778 INFO MainThread:2899 [run_manager.py:shutdown():931] stopping streaming files and file change observer
2019-06-05 15:23:51,511 INFO Thread-2 :2899 [run_manager.py:_on_file_modified():587] file/dir modified: /Users/yaroslavvb/Dropbox/git0/aws-network-benchmarks/wandb/run-20190605_222348-7t5zzgq4/wandb-metadata.json
2019-06-05 15:23:51,521 DEBUG Thread-17 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): api.wandb.ai:443
2019-06-05 15:23:51,521 DEBUG Thread-16 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): api.wandb.ai:443
2019-06-05 15:23:51,525 DEBUG Thread-18 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): api.wandb.ai:443
2019-06-05 15:23:51,705 DEBUG Thread-17 :2899 [connectionpool.py:_make_request():393] https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 772
2019-06-05 15:23:51,710 DEBUG Thread-17 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): storage.googleapis.com:443
2019-06-05 15:23:51,711 DEBUG Thread-16 :2899 [connectionpool.py:_make_request():393] https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 806
2019-06-05 15:23:51,715 DEBUG Thread-16 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): storage.googleapis.com:443
2019-06-05 15:23:51,717 DEBUG Thread-18 :2899 [connectionpool.py:_make_request():393] https://api.wandb.ai:443 "POST /graphql HTTP/1.1" 200 780
2019-06-05 15:23:51,721 DEBUG Thread-18 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): storage.googleapis.com:443
2019-06-05 15:23:52,099 DEBUG Thread-16 :2899 [connectionpool.py:_make_request():393] https://storage.googleapis.com:443 "PUT /wandb-production.appspot.com/yaroslavvb/nccl_multiversion/7t5zzgq4/wandb-metadata.json?Expires=1559773491&GoogleAccessId=gorilla-cloud-storage%40wandb-production.iam.gserviceaccount.com&Signature=YcBz%2F5Si%2BE9T1pJ%2FpLum5DWlaz7l79gtgck7nEaJ2bGpGTPinplW5BLyi%2BNdVLwClNk4I7DLxaaZz363D0mgeSbJb3FFGDNC%2B6Ci4f4tihQrBptXbIm3%2FDnyPazYRacJ5PcanJf6GsWoH7aEPnNsHyFm%2FJy%2FprivF8eG3R0AMnTh%2FfsywV9CLuhLnUKcwnjvcdn3Q%2BslC%2FhuK64ADvUb3R97lCy%2FdUFT2rfhDhY7vnSch6M3AFZT%2FRlvFjcgWJP2gEg%2Fhm1f7iPRW16FKpYjbTXx3MW4vdLaM%2FkD19%2BrU7lFW69bU7X3%2F1nfb4u4%2FPeW6mBjfnJLiMA6i7Et2QeKoQ%3D%3D HTTP/1.1" 200 0
2019-06-05 15:23:52,099 DEBUG Thread-18 :2899 [connectionpool.py:_make_request():393] https://storage.googleapis.com:443 "PUT /wandb-production.appspot.com/yaroslavvb/nccl_multiversion/7t5zzgq4/requirements.txt?Expires=1559773491&GoogleAccessId=gorilla-cloud-storage%40wandb-production.iam.gserviceaccount.com&Signature=esIUFRofyDV8K4R%2FeRqfVtBLRyWysT9QvfM6nbIDMHT%2FtPaiminOAN4Xow6Be34NRhGleW2KdJjzQIyHp%2Fpsn59T8Kcfo1vjsJzVNxKgQuFQxc9sgf3yB%2B2Hoa76AfZpM9jeAkBsWXO3Td7QEF03BTUAjVkpFEIGxIGYYOwANiT3MiCZVWiWkBXdz%2BwYcmjaBQ3imTfxN2Btm0qL9%2F6vP5WiMF8CrBzK16eToecPRSK8PNFHpedzDvJjCKThVfhxlxNyHgKjPXjgTHlgBOqi9nHeUeeonnyMj2BvCYNigObfJLZIEPmuKFxk%2FYawUm8KFHDhEQflMcYjg6sSLo%2BBVw%3D%3D HTTP/1.1" 200 0
2019-06-05 15:23:52,101 DEBUG Thread-17 :2899 [connectionpool.py:_make_request():393] https://storage.googleapis.com:443 "PUT /wandb-production.appspot.com/yaroslavvb/nccl_multiversion/7t5zzgq4/diff.patch?Expires=1559773491&GoogleAccessId=gorilla-cloud-storage%40wandb-production.iam.gserviceaccount.com&Signature=ZuTl9J8ajP8P712VQCIv%2BQPGYBgQZnE6PoumJE4W8FFJGsLfsRL%2F8t3o7I8zqDuHvK1TM0sOBnBm7o6w7cV3nrkxTkAMtLc%2Be7KdvSB78UjJHtChBkKCp4Y1BZTzJS0%2BvN%2BqSPC9M%2BBGDon02SmQ0hkmXhEO3vWCRzvDmzXFrbRlETzT9uKncU8Wgv4iXFz5H%2FTIHz5YpG5eX2yo54VgBQlKt9gg680f58bdOQV1XEYYGlX7CI0Ay0olLSBOZaKzSCWLZ0RoIVwG%2FxlqVzqZ9AKVtDfcaBmBL6EJBsu92vSRPEVGs08S9TbU%2BYv24gxzxhERquUymy30QB5wu%2Fqa2A%3D%3D HTTP/1.1" 200 0
2019-06-05 15:23:52,519 DEBUG Thread-6 :2899 [connectionpool.py:_new_conn():813] Starting new HTTPS connection (1): api.wandb.ai:443
2019-06-05 15:23:52,718 DEBUG Thread-6 :2899 [connectionpool.py:_make_request():393] https://api.wandb.ai:443 "POST /yaroslavvb/nccl_multiversion/7t5zzgq4/file_stream HTTP/1.1" 200 311
2019-06-05 15:23:52,804 DEBUG Thread-6 :2899 [connectionpool.py:_make_request():393] https://api.wandb.ai:443 "POST /yaroslavvb/nccl_multiversion/7t5zzgq4/file_stream HTTP/1.1" 200 308
2019-06-05 15:23:52,805 INFO MainThread:2899 [run_manager.py:_sync_etc():1182] process only ran for 4 seconds, not syncing files
2019-06-05 15:23:52,805 INFO MainThread:2899 [7t5zzgq4:run_manager.py:_sync_etc():1182] process only ran for 4 seconds, not syncing files
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 22 (5 by maintainers)
For anyone hitting this error on Windows: try our new Windows-optimized version at https://github.com/wandb/client-ng. You can use it by running
pip install wandb-ng
@harpone That init call looks like you are reusing the run id between workers (you typically need to incorporate $RANK to make it unique). IMHO it’s better not to touch the id and instead set only run_name (when logging from the chief process only) or group_name (when logging from all processes). In the latter case, you need to make group_name unique between reruns, or the wandb display will merge them.
Here’s how I’m setting it
https://github.com/cybertronai/imagenet18/blob/218ef7e63894c8a107eb764f27d7cd27309e960d/training/train_imagenet_nv.py#L110
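For readers who do not want to follow the link, the advice above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked script: the helper names (wandb_init_kwargs, init_for_this_process), the launch_tag parameter, and the naming scheme are hypothetical. It assumes the launcher exports RANK (Open MPI's mpirun sets OMPI_COMM_WORLD_RANK instead) and that the installed wandb version accepts the name and group keyword arguments of wandb.init.

```python
import os


def wandb_init_kwargs(project, run_name, rank, launch_tag, log_all_ranks=False):
    """Build keyword arguments for wandb.init without ever setting `id`.

    Returns None when this process should not log at all.
    `launch_tag` is a hypothetical string chosen once per launch (for
    example a timestamp picked by the launcher) and passed to every worker.
    """
    if log_all_ranks:
        # Every worker logs: all workers share one group, and the group is
        # unique per launch so that reruns are not merged in the display.
        return {
            "project": project,
            "group": f"{run_name}-{launch_tag}",
            "name": f"{run_name}-rank{rank}",
        }
    if rank == 0:
        # Only the chief process logs: a run name is enough.
        return {"project": project, "name": run_name}
    return None


def init_for_this_process(project, run_name, launch_tag, log_all_ranks=False):
    """Read the rank from the environment and call wandb.init if appropriate."""
    # Assumption: RANK is exported by the launcher; fall back to Open MPI's variable.
    rank = int(os.environ.get("RANK", os.environ.get("OMPI_COMM_WORLD_RANK", "0")))
    kwargs = wandb_init_kwargs(project, run_name, rank, launch_tag, log_all_ranks)
    if kwargs is None:
        return None
    import wandb  # imported lazily so non-logging workers never start a W&B process

    return wandb.init(**kwargs)
```

Leaving id unset lets wandb generate a fresh one for each process, which avoids the collision between workers described above.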