wandb: [CLI]: wandb saving to local does not work when soft links to the current project are used

Describe the bug

I am having a strange issue: I moved all my code & data to a different location with more disk space, then soft-linked my projects & data to those new locations. I assume there is some file-handle problem, because wandb's logger keeps throwing errors. So my questions:

  1. How do I have wandb log only online and not locally? That is, how do I stop it from writing anything to ./wandb (or any other hidden location), since that is what is creating the issues? Note that my code ran fine once I stopped logging to wandb, so I assume that was the cause, and that dir=None is the default for wandb.init's dir parameter. (See the sketch after these questions.)
  2. How do I resolve this issue entirely, so that wandb works seamlessly with all my projects soft-linked somewhere else?
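
A minimal sketch of what I mean by question 1 (not from the thread; the project name and path here are hypothetical): wandb.init accepts dir= and mode= arguments, so the local run directory can at least be redirected off the soft-linked tree, and mode="disabled" turns logging off entirely:

import os

import wandb

# Assumption: /tmp/wandb_runs is on a local, non-symlinked filesystem.
local_dir = "/tmp/wandb_runs"
os.makedirs(local_dir, exist_ok=True)

run = wandb.init(
    project="my-project",  # hypothetical project name
    dir=local_dir,         # wandb creates its ./wandb run dir under this path
    mode="online",         # "offline" and "disabled" are the other options
)
run.log({"loss": 0.0})
wandb.finish()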

More details on the error

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 199, in run
    self.dispatch_events(self.event_queue, self.timeout)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/observers/api.py", line 368, in dispatch_events
    handler.dispatch(event)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/vendor/watchdog/events.py", line 454, in dispatch
    _method_map[event_type](event)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/filesync/dir_watcher.py", line 275, in _on_file_created
    logger.info("file/dir created: %s", event.src_path)
Message: 'file/dir created: %s'
Arguments: ('/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/wandb/run-20221023_170722-1tfzh49r/files/output.log',)
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 50, in run
    self._run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal_util.py", line 101, in _run
    self._process(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/internal.py", line 263, in _process
    self._hm.handle(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 130, in handle
    handler(record)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/internal/handler.py", line 138, in handle_request
    logger.debug(f"handle_request: {request_type}")
Message: 'handle_request: stop_status'
Arguments: ()
N/A% (0 of 100000) |      | Elapsed Time: 0:00:00 | ETA:  --:--:-- |   0.0 s/it

Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1814, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1747, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1794, in train
    meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 167, in meta_train_iterations_ala_l2l
    log_zeroth_step(args, meta_learner)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/meta_learning.py", line 92, in log_zeroth_step
    log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
    _log_train_val_stats(args=args,
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 116, in _log_train_val_stats
    args.logger.log('\n')
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logger.py", line 89, in log
    print(msg, flush=flush)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
    self._old_write(data)
OSError: [Errno 116] Stale file handle
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: Synced vit_mi Adam_rfs_cifarfs Adam_cosine_scheduler_rfs_cifarfs 0.001: args.jobid=101161: https://wandb.ai/brando/entire-diversity-spectrum/runs/1tfzh49r
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20221023_170722-1tfzh49r/logs
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
    resp = self._sock_client.read_server_response(timeout=1)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 283, in read_server_response
    data = self._read_packet_bytes(timeout=timeout)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 269, in _read_packet_bytes
    raise SockClientClosedError()
wandb.sdk.lib.sock_client.SockClientClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
    msg = self._read_message()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router_sock.py", line 29, in _read_message
    raise MessageRouterClosedError
wandb.sdk.interface.router.MessageRouterClosedError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1087, in emit
    self.flush()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1067, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/interface/router.py", line 77, in message_loop
    logger.warning("message_loop has been closed")
Message: 'message_loop has been closed'
Arguments: ()
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpmvf78q6owandb'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmpt5etqpw_wandb-artifacts'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmp55lzwviywandb-media'>
  _warnings.warn(warn_message, ResourceWarning)
/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/tempfile.py:817: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/srv/condor/execute/dir_27749/tmprmk7lnx4wandb-media'>
  _warnings.warn(warn_message, ResourceWarning)

Error:

====> about to start train loop
Starting training!
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)'))': /api/5288891/envelope/
--- Logging error ---
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/logging/__init__.py", line 1086, in emit
    stream.write(msg + self.terminator)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/wandb/sdk/lib/redirect.py", line 640, in write
    self._old_write(data)
OSError: [Errno 116] Stale file handle
Call stack:
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/worker.py", line 128, in _target
    callback()
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 467, in send_envelope_wrapper
    self._send_envelope(envelope)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 384, in _send_envelope
    self._send_request(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/sentry_sdk/transport.py", line 230, in _send_request
    response = self._pool.request(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 78, in request
    return self.request_encode_body(
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/request.py", line 170, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/poolmanager.py", line 375, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/urllib3/connectionpool.py", line 780, in urlopen
    log.warning(
Message: "Retrying (%r) after connection broken by '%r': %s"
Arguments: (Retry(total=2, connect=None, read=None, redirect=None, status=None), SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')), '/api/5288891/envelope/')

Additional Files

No response

Environment

WandB version:

OS: linux

Python version: 3.9

Versions of relevant libraries: nothing else, just wandb and soft links on Linux

Additional Context

No response

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 71 (12 by maintainers)

Most upvoted comments

It is not ok to delete it before a run has finished. finish() will upload any data from that directory, which could take multiple minutes depending on your logging. Deleting it before finishing will break our syncing logic and you will lose data.

OK, so I should first:

  1. get the wandb dir to delete
  2. then call finish()
  3. then delete the run folder

Right? @vanpelt

Yep, wandb.run.dir is available in Python only after wandb.init(...) has been called and before wandb.finish() is called. Every run has its own unique directory, so deleting it won't impact other runs. Here's the pattern in code:

import shutil

import wandb

dir_to_delete = None
wandb.init()
# ... training / logging ...
dir_to_delete = wandb.run.dir  # capture before finish(); wandb.run is None afterwards
wandb.finish()  # uploads any remaining data from the run directory first
if dir_to_delete is not None:
    shutil.rmtree(dir_to_delete, ignore_errors=True)  # each run has its own dir, so this is safe

You can run df -h from anywhere on the file system. It will list all of the filesystems available, how much free space they have, and where they are mounted.
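
The same check is available from Python's standard library, if that is more convenient (the path below is just an example):

import shutil

# Free space on the filesystem containing the given path.
usage = shutil.disk_usage("/tmp")  # example: point this at your wandb dir
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")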

@brando90 we write a debug log to the file system. You need to set the wandb dir to a non-network filesystem. Generally, /tmp, or whatever is in the TEMP env variable (echo $TEMP), is writable and not a network filesystem.

If you don’t have a non-network writable filesystem, you could mount a ramdisk or tmpfs (search Stack Overflow to learn how). The main thing is that you want the wandb dir not to be on your home filesystem.
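
As a sketch of that advice (mine, not from the thread; both the WANDB_DIR env var and the dir= argument are documented wandb settings, the directory prefix is arbitrary):

import os
import tempfile

import wandb

# Create a per-run directory on the local temp filesystem (normally not networked).
local_wandb_dir = tempfile.mkdtemp(prefix="wandb_")
os.environ["WANDB_DIR"] = local_wandb_dir  # env-var route...
wandb.init(dir=local_wandb_dir)            # ...or pass it explicitly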