datasets: Some languages in wikipedia dataset are not loading
Hi,
I am working with the wikipedia dataset and I have a script that goes over 92 of the available languages in that dataset. So far I have detected that ar, af, and an are not loading. Other languages like fr and en are working fine. Here’s how I am loading them:
import nlp

langs = ['ar', 'af', 'an']
for lang in langs:
    # Config names follow the "<dump date>.<language code>" pattern.
    data = nlp.load_dataset('wikipedia', f'20200501.{lang}', beam_runner='DirectRunner', split='train')
    print(lang, len(data))
Here’s what I see for ‘ar’ (it gets stuck there):
Downloading and preparing dataset wikipedia/20200501.ar (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/gaguilar/.cache/huggingface/datasets/wikipedia/20200501.ar/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50...
Note that those languages are indeed in the list of expected languages. Any suggestions on how to work around this? Thanks!
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (11 by maintainers)
Hey @gaguilar ,
I just found the “char2subword” paper and I’m really interested in trying it out on my own vocabs/datasets, e.g. for historical texts (I’ve already trained some LMs on newspaper articles with OCR errors).
Do you plan to release the code for your paper, or is it possible to get the implementation 🤔 Many thanks 🤗
@lhoestq Any updates on this? I have similar issues with the Romanian dump, tnx.
Ok, thanks for clarifying, that makes sense. I will time those examples later today and post back here.
Also, it seems that not all dumps should use the same date. For instance, I was checking the Spanish dump by doing the following:
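(The exact snippet was omitted from the comment; presumably it was a call along these lines, reusing the same 20200501 config name as the script above. Treat this as a reconstruction, not the original code.)
import nlp

# Hypothetical reconstruction: request the Spanish dump with the same
# 20200501 date used for the other languages.
data = nlp.load_dataset('wikipedia', '20200501.es', beam_runner='DirectRunner', split='train')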
I got the error below because this URL does not exist: https://dumps.wikimedia.org/eswiki/20200501/dumpstatus.json. So I checked the actual available dates here https://dumps.wikimedia.org/eswiki/ and there is no 20200501. If one tries a date that is available in that listing, the nlp library does not allow such a request because it is not in the list of expected datasets.
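One way to check up front whether a given (language, date) dump exists is to probe the same dumpstatus.json URL that the loader fails on. A minimal sketch using requests (not part of the original thread):
import requests

def dump_exists(lang: str, date: str) -> bool:
    # The error above shows the loader fetching this file, so a 404 here
    # means that (language, date) combination is not available.
    url = f'https://dumps.wikimedia.org/{lang}wiki/{date}/dumpstatus.json'
    return requests.head(url).status_code == 200

print(dump_exists('es', '20200501'))  # was missing at the time of the issue
print(dump_exists('es', '20200601'))  # try a date listed at https://dumps.wikimedia.org/eswiki/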
Hi ! This looks related to this issue: https://github.com/huggingface/datasets/issues/1994 Basically the parser that is used (mwparserfromhell) has some issues for some pages in es. We already reported some issues for es on their repo at https://github.com/earwig/mwparserfromhell/issues/247 but it looks like there are still a few issues. Might be a good idea to open a new issue on the mwparserfromhell repo.
Hi ! The link https://dumps.wikimedia.org/idwiki/20210501/dumpstatus.json seems to be working fine for me.
Regarding the time outs, they must come either from an issue on the wikimedia host side, or from your internet connection. Feel free to try again several times.
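If the failures are indeed transient, a small retry wrapper is usually enough. A minimal sketch (not from the original thread; the number of attempts and wait time are arbitrary):
import time
import nlp

def load_with_retries(config, attempts=3, wait=60):
    # Retry a flaky download a few times before giving up.
    for attempt in range(1, attempts + 1):
        try:
            return nlp.load_dataset('wikipedia', config, beam_runner='DirectRunner', split='train')
        except Exception as err:  # e.g. connection time outs from dumps.wikimedia.org
            if attempt == attempts:
                raise
            print(f'attempt {attempt} failed ({err}), retrying in {wait}s')
            time.sleep(wait)

data = load_with_retries('20200501.ar')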
This is an issue on the mwparserfromhell side. You could try to update mwparserfromhell and see if it fixes the issue. If it doesn’t, we’ll have to create an issue on their repo for them to fix it. But first let’s see if the latest version of mwparserfromhell does the job.
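A quick way to check which parser version is installed before and after upgrading (a sketch, not from the original thread):
import mwparserfromhell

# Print the currently installed parser version; upgrade with
#   pip install --upgrade mwparserfromhell
# and then re-run the wikipedia loading script.
print(mwparserfromhell.__version__)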
Thanks ! This will be very helpful.
About the date issue, I think it’s possible to use another date by passing a different configuration (see the sketch below). However we’ve not processed wikipedia dumps for other dates than 20200501 (yet ?)
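A sketch of what such a call could look like, assuming the wikipedia builder accepts language and date as configuration parameters (these keyword names are an assumption here, since the exact snippet was omitted from the comment above; check the builder signature in your installed version):
import nlp

# Hypothetical: pass the dump date and language explicitly instead of a
# predefined "<date>.<lang>" config name. Pick a date that is listed at
# https://dumps.wikimedia.org/eswiki/.
data = nlp.load_dataset('wikipedia', language='es', date='20200601',
                        beam_runner='DirectRunner', split='train')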
One more thing that is specific to 20200501.es: it was available once, but mwparserfromhell was not able to parse it for some reason, so we didn’t manage to get a processed version of 20200501.es (see #321 )