huggingface_hub: 413 Client Error: Payload Too Large when using upload_folder on a lot of files

Describe the bug

When trying to commit a folder with many CSV files, I got the following error:

HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/datasets/nateraw/test-upload-folder-bug/preupload/main

I assume there is a limit on the total payload size when uploading a folder and that I am exceeding it here. I confirmed the failure has nothing to do with the number of files, but rather with the total size of the files being uploaded. In the short term, it would be great to clearly document this limit in the upload_folder function.
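
In the meantime, a possible workaround (a sketch only, assuming the 413 comes from the size of a single preupload request) is to commit files one at a time with upload_file, so that each request carries only one file; the folder_path and repo_id values below are placeholders:

import os

from huggingface_hub import upload_file

folder_path = './data'                   # placeholder: folder to upload
repo_id = 'user/test-upload-folder-bug'  # placeholder: your own dataset repo

for root, _, files in os.walk(folder_path):
    for name in files:
        local_path = os.path.join(root, name)
        # keep the relative layout of the folder inside the repo
        path_in_repo = os.path.relpath(local_path, folder_path)
        upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type='dataset',
        )

Note that this creates one commit per file, so it is slower and noisier than a single upload_folder call.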

Reproduction

The following fails on the last line. I wrote it so you can run it yourself without updating the repo ID or anything, so if you're logged in, the snippet below should work (assuming you have torchvision installed).

import os

from torchvision.datasets.utils import download_and_extract_archive
from huggingface_hub import upload_folder, whoami, create_repo

# Create a throwaway dataset repo under the logged-in user
user = whoami()['name']
repo_id = f'{user}/test-upload-folder-bug'
create_repo(repo_id, exist_ok=True, repo_type='dataset')

# Download and unpack an archive containing many CSV files
os.mkdir('./data')
download_and_extract_archive(
    url='https://zenodo.org/api/files/f7f7377b-8405-4d4f-b814-f021df5593b1/hyperbard_data.zip',
    download_root='./data',
    remove_finished=True
)

# This call fails with "413 Client Error: Payload Too Large"
upload_folder(
    folder_path='./data',
    path_in_repo="",
    repo_id=repo_id,
    repo_type='dataset'
)

Logs

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-2-91516b1ea47f> in <module>()
     18     path_in_repo="",
     19     repo_id=repo_id,
---> 20     repo_type='dataset'
     21 )

3 frames
/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in upload_folder(self, repo_id, folder_path, path_in_repo, commit_message, commit_description, token, repo_type, revision, create_pr)
   2115             token=token,
   2116             revision=revision,
-> 2117             create_pr=create_pr,
   2118         )
   2119 

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in create_commit(self, repo_id, operations, commit_message, commit_description, token, repo_type, revision, create_pr, num_threads)
   1813             token=token,
   1814             revision=revision,
-> 1815             endpoint=self.endpoint,
   1816         )
   1817         upload_lfs_files(

/usr/local/lib/python3.7/dist-packages/huggingface_hub/_commit_api.py in fetch_upload_modes(additions, repo_type, repo_id, token, revision, endpoint)
    380         headers=headers,
    381     )
--> 382     resp.raise_for_status()
    383 
    384     preupload_info = validate_preupload_info(resp.json())

/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    939 
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942 
    943     def close(self):

HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/datasets/nateraw/test-upload-folder-bug/preupload/main


System Info

Colab

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 22 (15 by maintainers)

Most upvoted comments

It should be soon!! cc @Wauplin

@fcakyon or use pip install huggingface_hub==0.11.0rc0 which is about to be publicly released and will be a more robust future-proof fix 😃

ah yes, we probably want to chunk client-side in this use case (you're probably hitting the 10MB POST size limit), and enforce a reasonable total max size regardless of chunking (maybe 100MB)

Note that this only applies to non-LFS files, so 100MB is more than reasonable IMO.

Also cc @coyotte508 and @Pierrci for visibility
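
For illustration, a minimal sketch of the "chunk client-side" idea under a ~10MB per-request budget; chunk_by_size is a hypothetical helper, not part of huggingface_hub:

import os

MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # the ~10MB POST limit mentioned above

def chunk_by_size(paths, max_bytes=MAX_PAYLOAD_BYTES):
    # Yield lists of file paths whose combined size stays under max_bytes.
    batch, batch_size = [], 0
    for path in paths:
        size = os.path.getsize(path)
        if batch and batch_size + size > max_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(path)
        batch_size += size
    if batch:
        yield batch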

@Wauplin amazing news!

@nateraw thanks a lot, snippet is very clear and simple!

@nateraw I opened #920 with a fix

Can you try it out and confirm it fixes your issue, please?

Then there are likely so many files that the 250kB limit is exceeded just by the preupload call.

Either the hub library should batch the preupload calls (in chunks of 250 files, for example), or we should allow a bigger body on the hub side.
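
A rough sketch of that batching idea, assuming the list of pending file additions can simply be split into fixed-size groups before each preupload request; batched is a hypothetical helper, and additions stands in for whatever upload_folder builds internally:

def batched(items, batch_size=250):
    # Yield successive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# e.g. one preupload request per group of 250 files:
# for group in batched(additions):
#     ...  # send only this group's metadata in the request body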

I think the limit is already 100MB on the hub side.

But since the Python library sends the content in base64 (to be able to send files with non-UTF-8 characters), it's closer to 70~75MB max.

What's the size of the files, @nateraw?
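
For reference, the ~70-75MB figure follows from base64 encoding 3 raw bytes as 4 characters, so only about 3/4 of a 100MB request body is usable payload:

import base64

raw = b'x' * 1000
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))        # ~1.33: base64 inflates the payload by 4/3
print(100 * len(raw) / len(encoded))  # ~75: usable MB out of a 100MB body limit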