nltk: Failed to download NLTK data: HTTP ERROR 405 / 403

>>> nltk.download("all")
[nltk_data] Error loading all: HTTP Error 405: Not allowed.

>>> nltk.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)

I also tried visiting https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip and got the same HTTP 405 error.

I found the same problem on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta

Any comments would be appreciated.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 12
  • Comments: 53 (24 by maintainers)

Most upvoted comments

@plaihonen you should be able to use this alternative index by doing something like python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

It seems like GitHub is down or is blocking access to the raw content on the repo.

Meanwhile the temporary solution is something like this:

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

For now, downloading gh-pages.zip and replacing the nltk_data directory is the working solution.

Until we find another channel to distribute nltk_data, please use the solution above.
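
For anyone unsure where the unpacked files should go: NLTK searches a fixed list of directories, which you can inspect directly. A minimal sketch (~/nltk_data is normally in that list):

import nltk.data

# The unpacked packages (corpora/, tokenizers/, taggers/, ...) must sit
# directly under one of these directories for nltk.data.find() to see them.
print(nltk.data.path)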


~Strangely, it only seems to affect the nltk user account. It works fine on the fork: https://raw.githubusercontent.com/alvations/nltk_data/gh-pages/index.xml~


@dfridman1 @prakruthi-karuna read the issue above. The workaround is:

python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj all

Currently this is the only working solution:

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

Since the root of the problem is bandwidth abuse, it seems a poor idea to recommend manual fetching of the entire nltk_data tree as a work-around. How about you show us how resource ids map to URLs, @alvations, so I can wget just the punkt bundle, for example?
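
For what it's worth, the gh-pages layout suggests that a resource id maps to packages/&lt;category&gt;/&lt;id&gt;.zip in the data repo, so a single bundle can be fetched on its own once the raw content is reachable again. A sketch in Python (the URL pattern and target paths are my reading of the repo layout, not a documented API):

import os
import urllib.request
import zipfile

# punkt lives under the "tokenizers" category in the gh-pages tree.
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip"
target = os.path.expanduser("~/nltk_data/tokenizers")
os.makedirs(target, exist_ok=True)

zip_path, _ = urllib.request.urlretrieve(url, os.path.join(target, "punkt.zip"))
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(target)  # unpack so nltk.data.find("tokenizers/punkt") succeeds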

The long-term solution, I believe, is to make it less trivial for beginning users to fetch the entire data bundle (638MB compressed, when I checked). Instead of arranging (and paying for) more bandwidth to waste on pointless downloads, stop providing "all" as a download option; the documentation should instead show the inattentive scripter how to download the specific resource(s) they need. And in the meantime, get out of the habit of writing nltk.download("all") (or equivalent) as sample or recommended usage, on Stack Overflow (I’m looking at you, @alvations) and in the downloader docstrings. (For exploring the NLTK, nltk.download("book"), not "all", is just as useful and much smaller.)

At present it is difficult to figure out which resource needs to be downloaded; if I install the NLTK and try out nltk.pos_tag(["hello", "friend"]), there’s no way to map the error message to a resource ID that I can pass to nltk.download(<resource id>). Downloading everything is the obvious work-around in such cases. If nltk.data.load() or nltk.data.find() can be patched to look up the resource id in such cases, I think you’ll see your usage of nltk_data go down significantly over the long term.
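
Until such a lookup exists, a rough self-help pattern is possible, because recent NLTK versions embed the suggested nltk.download(...) call in the LookupError message. A sketch (the helper name and regex are mine, and it assumes that message format holds):

import re

import nltk

def call_with_auto_download(func, *args):
    # Retry once after downloading whatever resource the LookupError names.
    try:
        return func(*args)
    except LookupError as error:
        match = re.search(r"nltk\.download\('([^']+)'\)", str(error))
        if match is None:
            raise
        nltk.download(match.group(1))
        return func(*args)

print(call_with_auto_download(nltk.pos_tag, ["hello", "friend"]))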

The files seem to be back up? It seems wise, though, to continue seeking alternative distribution mechanisms. In my own organization we plan to move to hosting the data internally and checking for updates quarterly.

@darshanlol @alvations mentioned a solution above. If you are trying to build a Docker image, the following worked for me:

ENV PATH_TO_NLTK_DATA $HOME/nltk_data/
RUN apt-get -qq update
RUN apt-get -qq -y install wget unzip
RUN wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
RUN unzip gh-pages.zip -d $PATH_TO_NLTK_DATA
# move the packages to the top level of the nltk_data directory
RUN mv $PATH_TO_NLTK_DATA/nltk_data-gh-pages/packages/* $PATH_TO_NLTK_DATA/

@alvations it seems only the zip download works for now, and the packages need to be moved directly under the /home/username/nltk_data/ folder.

export PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages $PATH_TO_NLTK_DATA
# move the packages to the top level of the nltk_data directory
mv $PATH_TO_NLTK_DATA/nltk_data-gh-pages/packages/* $PATH_TO_NLTK_DATA/

I’ve just opened a ticket with them via the contact page.

@adsk2050 if you are on the Jio network, switch to some other network and try again.

Glad that it works now! I’m afraid there’s not much I can do to help, nor am I sure who I could contact about the issue. There’s some more discussion on it here: https://github.com/community/community/discussions/32889

@sthakur369 Perhaps some of the solutions from https://github.com/nltk/nltk/issues/3092#issuecomment-1366526151 will help. Alternatively, others have reported success with a VPN or mobile hotspot. I believe the problem is caused by a DNS issue somewhere, and it cannot be reproduced by everyone. For reference, nltk.download() first tries to load this page, which is hosted by GitHub: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. For some users, this page simply will not load.
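
A quick way to check whether that index is what is failing for you (a plain connectivity probe, nothing NLTK-specific):

import urllib.request

url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml"
try:
    with urllib.request.urlopen(url, timeout=10) as response:
        print("index reachable, HTTP", response.status)
except Exception as error:
    print("index not reachable:", error)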

For those running NLTK in a CI environment, I’d like to propose GH-1795, which allows users to specify an alternative URL for downloading. The idea here is that one can set up a local copy of nltk_data on a webserver (or even python -m http.server) and then have a global variable that can override the download URL.

This is so that we can override the URL from a CI system like Jenkins without modifying each project’s local commands to include -u.
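
Something similar can already be approximated from Python, since the downloader accepts an alternative index programmatically. A sketch (the NLTK_DOWNLOAD_URL variable name is made up for this example; server_index_url mirrors the CLI's -u flag):

import os

from nltk.downloader import Downloader

# Point NLTK_DOWNLOAD_URL at an internal mirror's index.xml; when it is
# unset, the downloader falls back to the default GitHub-hosted index.
mirror = os.environ.get("NLTK_DOWNLOAD_URL")
downloader = Downloader(server_index_url=mirror)
downloader.download("punkt")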

Do we have a temporary work-around yet?