gsutil: gsutil cp hangs on many small files when running in parallel

I have a GCS bucket with millions of small files in different folders. When I run:

$ gsutil -m cp -r gs://my-bucket .

The process will eventually hang before completion, sometimes after 5 minutes and sometimes after several hours. This seems to be 100% reproducible. I’m using version 4.27, but this has happened in older versions as well. As a workaround I have to use:

$ gsutil cp -r gs://my-bucket .

which works, but it takes several days to download everything, so it’s not optimal.

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 17
  • Comments: 29

Most upvoted comments

For what it’s worth, I found that using only threads for parallelization (and not child processes) appears to avoid the underlying deadlock here, e.g. -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24

I am still facing the issue; it reaches 99% of copied files and then terminates.

Revisiting this after many years, and unfortunately I still run into the same problem even though I use:

gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24 -m cp -r gs://my-bucket .

I have an Apple M1 Pro.

Are you using macOS? If yes, then this is a known issue on macOS because multiprocessing uses fork. You can set parallel_process_count=1 to disable multiprocessing, as mentioned here: https://github.com/GoogleCloudPlatform/gsutil/issues/464#issuecomment-633334888

Google Cloud SDK 362.0.0, bq 2.0.71, core 2021.10.21, gsutil 5.4

Passing -o GSUtil:parallel_process_count=1 to the gsutil command when using -m worked for me as well.
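
If you would rather not pass -o on every invocation, the same settings can also be made persistent in the boto configuration file. A minimal sketch, assuming the default ~/.boto location (the thread count is illustrative):

# ~/.boto (location is an assumption; gsutil reads these options from the [GSUtil] section)
[GSUtil]
parallel_process_count = 1
parallel_thread_count = 24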

Are you all running these as subprocesses by chance? I had this issue when running gsutil through Python with subprocess. Calling gsutil with its full path resolved the hanging for me.

Specifically, my Python code now looks like this:

import subprocess

def get_gsutil_path():
    # Resolve the absolute path to the gsutil executable.
    cmd = "which gsutil"
    path = subprocess.check_output(cmd, shell=True, stderr=subprocess.PIPE)
    return path.decode("utf-8").strip()

def glob_bucket(bucket):
    # List every object in the bucket and return the URLs as a list of strings.
    cmd = f"{get_gsutil_path()} ls gs://{bucket}/**"
    bucket_files = subprocess.check_output(cmd, shell=True, stderr=subprocess.PIPE)
    return [line for line in bucket_files.decode("utf-8").splitlines() if line]
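
For instance (the bucket name is just a placeholder), the helper can then be used like this:

# Hypothetical usage of the helper above; "my-bucket" is a placeholder name.
files = glob_bucket("my-bucket")
print(f"Found {len(files)} objects")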

Is there any chance you’re running another instance of gsutil separately while that one is still working? This is the only thing I’ve found that will almost certainly make gsutil hang, because the separate gsutil processes/threads try to use the same directory ("$HOME/.gsutil" unless otherwise specified) to manage state and “locked” auth cache files.

If you’re not sure, I’d recommend invoking that command with the option to tell gsutil to use a different, non-default state dir, e.g.:

gsutil -o "GSUtil:state_dir=/absolute/path/to/some/directory" -m cp -r gs://my-bucket .

(and don’t have any other invocations of gsutil specify that directory). If that still hangs, we’ll know the state directory isn’t the problem… at which point, you might want to run the command with the top-level -D option and see if the debug logs reveal anything fishy, or even show at what point gsutil stopped doing anything useful.
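
For instance (the paths and the second bucket name are illustrative), two concurrent invocations, e.g. in separate terminals, could each be given their own state directory so they do not contend for the same state files:

$ gsutil -o "GSUtil:state_dir=/tmp/gsutil-state-a" -m cp -r gs://my-bucket ./a
$ gsutil -o "GSUtil:state_dir=/tmp/gsutil-state-b" -m cp -r gs://my-other-bucket ./b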

10k files and 100% reproducible. 99% Done forever. 😢

ation/json]...
- [10.0k/10.0k files][ 12.0 MiB/ 12.0 MiB]  99% Done  82.7 KiB/s ETA 00:00:00   

I get the same issue with Google Cloud SDK 361.0.0 and gsutil 5.4.

Adding -o GSUtil:parallel_process_count=1 to my gsutil command worked.

Thanks, you saved my day

Are you using macOS? If yes, then this is a known issue for mac because of multiprocessing using fork. You can set parallel_process_count=1 to disable multiprocessing as mentioned here #464 (comment)

Yes, I am using macOS. Thanks for the information; parallel_process_count=1 did work.