huggingface_hub: StableDiffusionImg2ImgPipeline OSError: Consistency check failed
Describe the bug
I’m trying to run the DreamPose repository. When fine-tuning of the UNet finished, the code saved the fine-tuned network with this snippet:
if accelerator.is_main_process and global_step % 500 == 0:
    # Rebuild the pipeline around the fine-tuned components and save a
    # full checkpoint folder.
    pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
        args.pretrained_model_name_or_path,
        # adapter=accelerator.unwrap_model(adapter),
        unet=accelerator.unwrap_model(unet),
        tokenizer=tokenizer,
        image_encoder=accelerator.unwrap_model(clip_encoder),
        clip_processor=accelerator.unwrap_model(clip_processor),
        revision=args.revision,
    )
    pipeline.save_pretrained(os.path.join(args.output_dir, f'checkpoint-{epoch}'))

    # Also save the raw state dicts for the UNet and the adapter.
    model_path = args.output_dir + f'/unet_epoch_{epoch}.pth'
    torch.save(unet.state_dict(), model_path)
    adapter_path = args.output_dir + f'/adapter_{epoch}.pth'
    torch.save(adapter.state_dict(), adapter_path)
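As an aside, the .pth files saved above are later reloaded through the --custom_chkpt flag visible in the launch command in the Logs section. A minimal reload sketch (variable and path names assumed from the snippet above):

# Restore the fine-tuned weights (sketch; `unet` and `adapter` must be
# freshly constructed models with matching architectures).
unet.load_state_dict(torch.load(model_path, map_location="cpu"))
adapter.load_state_dict(torch.load(adapter_path, map_location="cpu"))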
The from_pretrained call above failed with: OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors). (You can find the full output in the Logs section.)
- I set the force_download parameter to True (see the retry sketch below this list), but nothing changed.
- I have enough disk space to save the model.
- I’m using the latest huggingface_hub version.
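For reference, this is how the retry with the flags suggested by the error message can be expressed (a minimal sketch; the model id and revision are taken from the launch command in the Logs section):

from huggingface_hub import snapshot_download

# Force a clean re-download of the whole snapshot instead of resuming a
# possibly corrupted partial file.
snapshot_download(
    repo_id="CompVis/stable-diffusion-v1-4",
    revision="ebb811dd71cdc38a204ecbdd6ac5d580f529fd8c",
    force_download=True,
    resume_download=False,
)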
Reproduction
No response
Logs
Fetching 14 files: 0%| | 0/14 [00:00<?, ?it/s]Force download: True
Force download: True
Fetching 14 files: 21%|█████▎ | 3/14 [00:06<00:23, 2.11s/it]
Traceback (most recent call last):
File "finetune-unet.py", line 458, in <module>92M/492M [00:05<00:00, 85.9MB/s]
main(args)
File "finetune-unet.py", line 438, in main
pipeline = StableDiffusionImg2ImgPipeline.from_pretrained(
File "***/anaconda3/envs/***/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 908, in from_pretrained
cached_folder = cls.download(
File "***/anaconda3/envs/***/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 1349, in download
cached_folder = snapshot_download(
File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 235, in snapshot_download
thread_map(
File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
File "***/anaconda3/envs/***/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "***/anaconda3/envs/***/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
return hf_hub_download(
File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1365, in hf_hub_download
http_get(
File "***/anaconda3/envs/***/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 547, in http_get
raise EnvironmentError(
OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).
We are sorry for the inconvenience. Please retry download and pass `force_download=True, resume_download=False` as argument.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.
Downloading model.safetensors: 100%|█████████| 492M/492M [00:05<00:00, 83.3MB/s]
Steps: 100%|██████████████| 500/500 [06:10<00:00, 1.35it/s, loss=0.95, lr=1e-5]
Traceback (most recent call last):
File "***/anaconda3/envs/***/bin/accelerate", line 8, in <module>
sys.exit(main())
File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/launch.py", line 941, in launch_command
simple_launcher(args)
File "***/anaconda3/envs/***/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['***/anaconda3/envs/***/bin/python', 'finetune-unet.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=demo/sample_emre/train', '--output_dir=demo/custom-chkpts_default', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--learning_rate=1e-5', '--num_train_epochs=500', '--dropout_rate=0.0', '--custom_chkpt=checkpoints/unet_epoch_20.pth', '--revision', 'ebb811dd71cdc38a204ecbdd6ac5d580f529fd8c', '--use_8bit_adam']' returned non-zero exit status 1.
System info
- huggingface_hub version: 0.15.1
- Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: ***.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 1.13.1+cu116
- Jinja2: N/A
- Graphviz: N/A
- Pydot: N/A
- Pillow: 10.0.0
- hf_transfer: N/A
- gradio: N/A
- numpy: 1.24.4
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: ***.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: ***.cache/huggingface/assets
- HF_TOKEN_PATH: ***.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
About this issue
- State: open
- Comments: 15 (7 by maintainers)
Ok, thanks for confirming. That’s so weird 😬 I’ll try to reproduce it myself and let you know.
Wow, actually the issue is very intriguing 🤯 It seems that for some reason the safety_checker/model.safetensors and text_encoder/model.safetensors files have been mixed. Here are the actual sizes of the files on S3, matching the two numbers in the error message:
- safety_checker/model.safetensors: 1215981833 bytes
- text_encoder/model.safetensors: 492265879 bytes
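A quick way to double-check the sizes the Hub reports for both files (a sketch; the repo id CompVis/stable-diffusion-v1-4 is assumed from the launch command in the logs):

from huggingface_hub import get_hf_file_metadata, hf_hub_url

# HEAD-request each file and print the size the Hub expects for it.
for filename in (
    "safety_checker/model.safetensors",
    "text_encoder/model.safetensors",
):
    url = hf_hub_url("CompVis/stable-diffusion-v1-4", filename)
    meta = get_hf_file_metadata(url)
    print(filename, meta.size)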
Given the error message you got (OSError: Consistency check failed: file should be of size 1215981833 but has size 492265879 (model.safetensors).), this cannot be a coincidence.

The error is the same, but I think it may be related to the force_download parameter that I hardcoded into the huggingface_hub library. The code tries to download the text_encoder safetensors file. I’ll revert the library to its default version and give it a try, and I’ll report back here whether it works.
I’ll share the result in 5 min.
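If reverting the hard-coded flag does not help, another option is to delete the cached copy of the repo and let from_pretrained re-download it from scratch. A sketch using huggingface_hub’s cache utilities (repo id assumed from the logs):

from huggingface_hub import scan_cache_dir

# Locate every cached revision of the repo and delete them all, so the
# next download starts from a clean slate.
cache_info = scan_cache_dir()
revisions = [
    rev.commit_hash
    for repo in cache_info.repos
    if repo.repo_id == "CompVis/stable-diffusion-v1-4"
    for rev in repo.revisions
]
strategy = cache_info.delete_revisions(*revisions)
print(f"Will free {strategy.expected_freed_size_str}")
strategy.execute()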