wandb: Sync with server is extremely slow

wandb --version && python --version && uname

  • wandb, version 0.10.5
  • Python, version 3.7.6
  • Linux Ubuntu 20.04

Python script to reproduce:

import wandb
wandb.init()
wandb.log({'a':1})

The script takes more than 10 minutes to finish, it get stuck at sync phase. On another machine with the same configuration it works, I am not able to pinpoint the problem. The beginning of debug.log is the following:

2020-10-09 17:56:17,590 INFO    MainThread:14442 [wandb_init.py:_log_setup():292] Logging user logs to wandb/run-20201009_175617-2p03aens/logs/debug.log
2020-10-09 17:56:17,590 INFO    MainThread:14442 [wandb_init.py:_log_setup():293] Logging internal logs to wandb/run-20201009_175617-2p03aens/logs/debug-internal.log
2020-10-09 17:56:17,590 INFO    MainThread:14442 [wandb_setup.py:_flush():68] setting env: {}
2020-10-09 17:56:17,590 INFO    MainThread:14442 [wandb_setup.py:_flush():68] setting user settings: {}
2020-10-09 17:56:17,590 INFO    MainThread:14442 [wandb_setup.py:_flush():68] multiprocessing start_methods=fork,spawn,forkserver
2020-10-09 17:56:23,256 INFO    MainThread:14442 [wandb_run.py:_console_start():1265] atexit reg
2020-10-09 17:56:23,257 INFO    MainThread:14442 [wandb_run.py:_redirect():1135] redirect: SettingsConsole.REDIRECT
2020-10-09 17:56:23,257 INFO    MainThread:14442 [wandb_run.py:_redirect():1138] Redirecting console.
2020-10-09 17:56:23,257 INFO    MainThread:14442 [redirect.py:install():196] install start
2020-10-09 17:56:23,258 INFO    MainThread:14442 [redirect.py:install():211] install stop
2020-10-09 17:56:23,258 INFO    MainThread:14442 [redirect.py:install():196] install start
2020-10-09 17:56:23,258 INFO    MainThread:14442 [redirect.py:install():211] install stop
2020-10-09 17:56:23,258 INFO    MainThread:14442 [wandb_run.py:_redirect():1182] Redirects installed.
2020-10-09 17:56:23,259 INFO    MainThread:14442 [wandb_run.py:_atexit_cleanup():1238] got exitcode: 0
2020-10-09 17:56:23,260 INFO    MainThread:14442 [wandb_run.py:_restore():1210] restore
2020-10-09 17:56:23,260 INFO    MainThread:14442 [redirect.py:uninstall():215] uninstall start
2020-10-09 17:56:23,260 INFO    MainThread:14442 [redirect.py:_stop():264] _stop: stdout
2020-10-09 17:56:23,260 INFO    MainThread:14442 [redirect.py:_stop():270] _stop closed: stdout
2020-10-09 17:56:23,260 INFO    stdout    :14442 [redirect.py:_pipe_relay():119] relay done saw last write: stdout
2020-10-09 17:56:23,260 INFO    stdout    :14442 [redirect.py:_pipe_relay():135] relay done done: stdout
2020-10-09 17:56:23,260 INFO    MainThread:14442 [redirect.py:_stop():276] _stop joined: stdout
2020-10-09 17:56:23,260 INFO    MainThread:14442 [redirect.py:_stop():278] _stop rd closed: stdout
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:uninstall():219] uninstall done
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:uninstall():215] uninstall start
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:_stop():264] _stop: stderr
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:_stop():270] _stop closed: stderr
2020-10-09 17:56:23,261 INFO    stderr    :14442 [redirect.py:_pipe_relay():119] relay done saw last write: stderr
2020-10-09 17:56:23,261 INFO    stderr    :14442 [redirect.py:_pipe_relay():135] relay done done: stderr
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:_stop():276] _stop joined: stderr
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:_stop():278] _stop rd closed: stderr
2020-10-09 17:56:23,261 INFO    MainThread:14442 [redirect.py:uninstall():219] uninstall done
2020-10-09 17:56:23,716 INFO    MainThread:14442 [wandb_run.py:_wait_for_finish():1349] got exit ret: uuid: "977487dc190d4a7b85170a7da99da7ca"
response {
  poll_exit_response {
    file_counts {
      wandb_count: 1
    }
    pusher_stats {
      total_bytes: 472
    }
  }
}

2020-10-09 17:56:25,723 INFO    MainThread:14442 [wandb_run.py:_wait_for_finish():1349] got exit ret: uuid: "ae39742a10bb4754aadce46e0e72d889"
response {
  poll_exit_response {
    file_counts {
      wandb_count: 4
    }
    pusher_stats {
      total_bytes: 1089
    }
  }
}

2020-10-09 17:56:27,730 INFO    MainThread:14442 [wandb_run.py:_wait_for_finish():1349] got exit ret: uuid: "0c5c130d2f1e4edf98d2e9ff0a1a040e"
response {
  poll_exit_response {
    file_counts {
      wandb_count: 4
    }
    pusher_stats {
      total_bytes: 1089
    }
  }
}

2020-10-09 17:56:29,736 INFO    MainThread:14442 [wandb_run.py:_wait_for_finish():1349] got exit ret: uuid: "8725216e2144469e83762b70c7f20e62"
response {
  poll_exit_response {
    file_counts {
      wandb_count: 4
    }
    pusher_stats {
      total_bytes: 1089
    }
  }
}

debug files: debug.log debug-internal.log

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 20 (5 by maintainers)

Most upvoted comments

Hi, I have the exact same problem and my setting is:

wandb, version 0.13.1 Python version 3.9.7 Linux Ubuntu 20.04

The sync is very slow and takes around 15minutes for a file of size 0.005MB

The same problem here. It used to work all of a sudden, it started getting stucked in syncing step. I also check if there any network issue (via git push and pull, some network testing), there was no problem with any other network related task.

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.79. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.