acl-anthology: Broken & missing links on the server
I have cross-checked a full file list from the aclweb.org server (created by @mjpost on 29.03.2019) against the files that would be expected after parsing the Anthology XML.
The result is a list of files that are either missing (= they should currently be linked on the website, but will 404) or unexpected (= they are on the server, but not currently linked).
Most recent status in this comment.
It reveals a swath of problems, for example:
- Journals that have front matter (as discussed in #181) will show up as “unexpected”, e.g.:
  Unexpected: J98-1000.pdf
- Attachments that appear to have wrong names in the XML, e.g.:
  Unexpected: P16-1070.Notes.pdf
  Missing: P16-1070.Notes.zip
- Something weird is going on with EACL 1997: papers are listed twice, once as E97-* and once as P97-* (probably a joint meeting?), with the E97-* files not actually existing on the server.
- Some of them are also false alarms, e.g., a bunch of TACL papers show up as missing, such as:
  Missing: Q18-1006.pdf
  Missing: Q18-1034.pdf
  Missing: Q18-1035.pdf
  But the URLs for them actually work: Q18-1006, Q18-1034, Q18-1035. The same applies to (many of?) the seemingly missing revisions & errata. Maybe there’s some redirection magic going on on the server to places that are not included in the file list I’ve got?
- Potentially many more.
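For reference, the cross-check itself boils down to two set differences. The following is only a minimal sketch, not the script actually used for this issue; the function name `crosscheck` and the sample file lists are made up for illustration, reusing file names from the examples above.

```python
def crosscheck(server_files, expected_files):
    """Compare files present on the server with files expected from the XML.

    Returns (missing, unexpected):
      missing    = expected from the XML, but not on the server (would 404)
      unexpected = on the server, but not referenced by the XML
    """
    server = set(server_files)
    expected = set(expected_files)
    missing = sorted(expected - server)
    unexpected = sorted(server - expected)
    return missing, unexpected


# Hypothetical inputs: in practice these would come from the server's
# file listing and from parsing the Anthology XML, respectively.
on_server = ["J98-1000.pdf", "P16-1070.pdf", "P16-1070.Notes.pdf"]
from_xml = ["P16-1070.pdf", "P16-1070.Notes.zip"]

missing, unexpected = crosscheck(on_server, from_xml)
for name in missing:
    print(f"Missing: {name}")
for name in unexpected:
    print(f"Unexpected: {name}")
```

A check like this only sees what is in the file list, so files that are served through redirects (as seems to be the case for the TACL papers) would still be reported as missing.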
What next?
Lines that stem from clear mistakes in the XML could obviously be manually fixed.
For journals that have front matter, and also for full volume PDFs, we could mark in the XML if these files actually exist or not (e.g., by providing, and relying on, an explicit <file> or <url type="internal"> tag, as discussed in #156).
I can also update the gist after we commit corrections and/or I get an updated file list.
About this issue
- State: open
- Created 5 years ago
- Comments: 22 (20 by maintainers)
Commits related to this issue
- Nested volumes and explicit <url> tags (#324) A summary of changes: - Introduces a nested format (closes #317) - URLs are stored using a relative format for internal links (closes #156), which fa... — committed to acl-org/acl-anthology by mjpost 5 years ago
- added 346 missing attachments (#264 #535) — committed to acl-org/acl-anthology by mjpost 5 years ago
- Added missing supplementary material, tightened schema (#536) * added 346 missing attachments (#264, closes #535) * added script used * dataset, presentation, software tags renamed as attachments ... — committed to acl-org/acl-anthology by mjpost 5 years ago
- Removing XML entries for files missing on the server (re #264, #598) — committed to acl-org/acl-anthology by mbollmann 4 years ago
- Link several unallocated revisions, errata, attachments (re #264) — committed to acl-org/acl-anthology by mbollmann 4 years ago
- Nested volumes and explicit <url> tags (#324) A summary of changes: - Introduces a nested format (closes #317) - URLs are stored using a relative format for internal links (closes #156), which fa... — committed to ir-anthology/ir-anthology by mjpost 5 years ago
- Added missing supplementary material, tightened schema (#536) * added 346 missing attachments (#264, closes #535) * added script used * dataset, presentation, software tags renamed as attachments ... — committed to ir-anthology/ir-anthology by mjpost 5 years ago
- Removing XML entries for files missing on the server (re #264, #598) — committed to ir-anthology/ir-anthology by mbollmann 4 years ago
- Link several unallocated revisions, errata, attachments (re #264) — committed to ir-anthology/ir-anthology by mbollmann 4 years ago
http://cs.jhu.edu/~post/tmp/ls-RQ1-2020-04-18.gz
I think papers should always have an explicit <url> field, and its absence should indicate that no file exists. As for the whole-volume PDFs… maybe they should be automatically created during the build if they are missing?
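To illustrate the "explicit <url>" idea, here is a rough sketch of how the list of expected files could be derived if a missing <url> simply meant "no file". The XML snippet is a simplified, hypothetical stand-in rather than the real Anthology schema, and the function name `expected_pdfs`, the paper ID, and the ".pdf" suffix rule are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML; the real Anthology format differs.
SAMPLE = """
<volume>
  <paper id="1">
    <title>Paper with a PDF</title>
    <url>P19-1001</url>
  </paper>
  <paper id="2">
    <title>Paper without a PDF</title>
  </paper>
</volume>
"""


def expected_pdfs(xml_text):
    """Return the PDF names expected on the server for papers with a <url>."""
    root = ET.fromstring(xml_text.strip())
    expected = []
    for paper in root.iter("paper"):
        url = paper.findtext("url")
        if url:  # no <url> element: no file is expected for this paper
            expected.append(url.strip() + ".pdf")
    return expected


print(expected_pdfs(SAMPLE))
```

With this convention, a paper without a file never shows up as "missing" in the cross-check, because nothing is expected for it in the first place.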