nltk: Failed to download NLTK data: HTTP ERROR 405 / 403
>>> nltk.download("all")
[nltk_data] Error loading all: HTTP Error 405: Not allowed.
>>> nltk.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
Also, I tried to visit https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip. I got the same HTTP 405 error.
I found the same problem on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta
Any comments would be appreciated.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 12
- Comments: 53 (24 by maintainers)
@plaihonen you should be able to use this alternative index by doing something like
python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt
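If you prefer to stay inside Python rather than use the command line, the same alternative index can be passed to the downloader programmatically. This is only a sketch, assuming a reasonably recent NLTK whose `nltk.downloader.Downloader` accepts a `server_index_url` argument:

```python
ALT_INDEX = "https://pastebin.com/raw/D3TBY4Mj"  # the alternative index from the comment above

def download_with_index(package_id, index_url=ALT_INDEX):
    """Download a single NLTK package using a non-default index URL."""
    import nltk.downloader  # imported lazily so the helper is importable without nltk installed
    nltk.downloader.Downloader(server_index_url=index_url).download(package_id)

if __name__ == "__main__":
    download_with_index("punkt")
```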
It seems like GitHub is down or blocking access to the raw content on the repo.
Meanwhile the temporary solution is something like this:
Currently, downloading the `gh-pages.zip` and replacing the `nltk_data` directory is the working solution for now. Until we find another channel to distribute `nltk_data`, please use the above solution.

~~Strangely, it only seems to affect the `nltk` user account. It works fine on the fork: https://raw.githubusercontent.com/alvations/nltk_data/gh-pages/index.xml~~

~~Doing this would work too:~~
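The manual `gh-pages.zip` route can be scripted. This is only a sketch, under the assumptions that the archive is still published at `https://github.com/nltk/nltk_data/archive/gh-pages.zip` and that its `packages/` subtree is what belongs under `~/nltk_data` (individual package zips inside it may still need unzipping, as noted later in this thread):

```python
import os
import shutil
import urllib.request
import zipfile

GHPAGES_ZIP = "https://github.com/nltk/nltk_data/archive/gh-pages.zip"

def default_nltk_data_dir():
    # One of the per-user locations that nltk.data searches by default.
    return os.path.join(os.path.expanduser("~"), "nltk_data")

def install_from_ghpages(target=None, workdir="nltk_data_tmp"):
    """Fetch the gh-pages archive and copy its packages/ tree into place."""
    target = target or default_nltk_data_dir()
    archive_path, _ = urllib.request.urlretrieve(GHPAGES_ZIP)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(workdir)
    # The archive unpacks to nltk_data-gh-pages/.
    src = os.path.join(workdir, "nltk_data-gh-pages", "packages")
    shutil.copytree(src, target, dirs_exist_ok=True)  # requires Python 3.8+

if __name__ == "__main__":
    install_from_ghpages()
```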
@dfridman1 @prakruthi-karuna read the issue above. The workaround is:
Currently this is the only working solution:
Since the root of the problem is bandwidth abuse, it seems a poor idea to recommend manually fetching the entire `nltk_data` tree as a workaround. How about you show us how resource ids map to URLs, @alvations, so I can `wget` just the `punkt` bundle, for example?

The long-term solution, I believe, is to make it less trivial for beginning users to fetch the entire data bundle (638 MB compressed, when I checked). Instead of arranging (and paying for) more bandwidth to waste on pointless downloads, stop providing `"all"` as a download option; the documentation should instead show the inattentive scripter how to download the specific resource(s) they need. And in the meantime, get out of the habit of writing `nltk.download("all")` (or equivalent) as sample or recommended usage, on Stack Overflow (I'm looking at you, @alvations) and in the downloader docstrings. (For exploring the NLTK, `nltk.download("book")`, not `"all"`, is just as useful and much smaller.)

At present it is difficult to figure out which resource needs to be downloaded; if I install the NLTK and try out `nltk.pos_tag(["hello", "friend"])`, there's no way to map the error message to a resource id that I can pass to `nltk.download(<resource id>)`. Downloading everything is the obvious workaround in such cases. If `nltk.data.load()` or `nltk.data.find()` can be patched to look up the resource id in such cases, I think you'll see your usage on `nltk_data` go down significantly over the long term.

Files seem back up? Seems wise, though, to continue to seek alternative distribution mechanisms. In my own organization we plan to move to hosting internally and check in quarterly.
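On the question above of how resource ids map to URLs: the authoritative mapping is in `index.xml`, but in practice the packages live at predictable paths under the `gh-pages` branch, following the same pattern as the `corpora/cmudict.zip` URL quoted at the top of this issue. A sketch; treat the category/id pairings as assumptions to verify against `index.xml`:

```python
BASE = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages"

def package_url(category, package_id):
    """Map an (assumed) category and resource id to its raw download URL."""
    return f"{BASE}/{category}/{package_id}.zip"

# Examples (verify the category against index.xml before relying on it):
print(package_url("tokenizers", "punkt"))
print(package_url("corpora", "cmudict"))
```

With a URL in hand, `wget <url>` (or `urllib.request.urlretrieve`) fetches just that one bundle, which is the lightweight path the comment above asks for.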
@darshanlol @alvations mentioned a solution. If you are trying to build a docker, the following worked for me:
@alvations seems only the gzip download works for now. And packages needed to be moved under the `/home/username/nltk_data/` folder.

I've just opened a ticket with them via the contact page.
@adsk2050 if you are using JIO network then switch it to some other network and try again.
Glad that it works now! I’m afraid there’s not much I can do to help, nor am I sure who I could contact about the issue. There’s some more discussion on it here: https://github.com/community/community/discussions/32889
@sthakur369 Perhaps some of the solutions from https://github.com/nltk/nltk/issues/3092#issuecomment-1366526151 will help. Alternatively, others have reported success with a VPN or mobile hotspot. I believe the problem is caused by a DNS issue somewhere, and it cannot be reproduced by everyone. For reference, `nltk.download` will try to load this page: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml which is hosted by GitHub. For some users, this page will simply not load.

For those running NLTK in a CI environment, I'd like to propose GH-1795, which allows us to specify an alternative URL for downloading. The idea here is that one can set up a local copy of nltk_data on a webserver (or even `python -m http.server`) and then have a global variable that can override the download URL.
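For the local-mirror idea, the standard library is enough to serve a checkout of `nltk_data` over HTTP; the downloader can then be pointed at it with `-u`. A minimal sketch, where the directory path is a placeholder:

```python
import functools
import http.server
import threading

def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on 127.0.0.1; port=0 picks a free port."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    srv = serve_directory("/path/to/local/nltk_data")  # placeholder path
    print(f"serving on port {srv.server_address[1]}")
    # then, for example:
    #   python -m nltk.downloader -u http://127.0.0.1:<port>/index.xml punkt
```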
This is so that we can override it from a CI system like Jenkins without modifying projects' local command calls to include `-u`.

Do we have a temporary workaround yet?