acl-anthology: Broken & missing links on the server

I have cross-checked a full file list from the aclweb.org server (created by @mjpost on 29.03.2019) against the files we would expect to find after parsing the Anthology XML.

The result is a list of files that are either missing (= they should currently be linked on the website, but will 404) or unexpected (= they are on the server, but not currently linked).
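As a rough illustration of the comparison (not the actual script used for this issue), the cross-check boils down to two set differences. The function name `crosscheck` and the sample file names are assumptions for the sketch; in particular, whether `P16-1070.pdf` itself is on the server is not stated here.

```python
def crosscheck(server_files, expected_files):
    """Compare files found on the server with files the XML implies."""
    on_server = set(server_files)
    expected = set(expected_files)
    missing = sorted(expected - on_server)      # linked on the site, but would 404
    unexpected = sorted(on_server - expected)   # on the server, but not linked
    return missing, unexpected


# Illustrative data only (hypothetical file lists).
server_files = ["P16-1070.pdf", "P16-1070.Notes.pdf"]
expected_files = ["P16-1070.pdf", "P16-1070.Notes.zip"]

missing, unexpected = crosscheck(server_files, expected_files)
for name in unexpected:
    print(f"Unexpected:  {name}")
for name in missing:
    print(f"   Missing:  {name}")
```

Using sets keeps the comparison independent of ordering and of duplicate entries in either list.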

Most recent status in this comment.

It reveals a swath of problems, for example:

  • Journals that have front matter (as discussed in #181) will show up as “unexpected”, e.g.:

    Unexpected:  J98-1000.pdf
    
  • Attachments that appear to have wrong names in the XML, e.g.:

    Unexpected:  P16-1070.Notes.pdf
       Missing:  P16-1070.Notes.zip
    
  • EACL 1997 looks odd: its papers are listed twice, once as E97-* and once as P97-* (probably a joint meeting?), but the E97-* files do not actually exist on the server.

  • Some entries are also false alarms; e.g., a bunch of TACL papers show up as missing, such as:

       Missing:  Q18-1006.pdf
       Missing:  Q18-1034.pdf
       Missing:  Q18-1035.pdf
    

    But the URLs for them actually work: Q18-1006, Q18-1034, Q18-1035. The same applies to (many of?) the seemingly missing revisions & errata. Maybe the server redirects these requests to locations that are not included in the file list I’ve got?

  • Potentially many more.

What next?

Lines that stem from clear mistakes in the XML could obviously be manually fixed.

For journals that have front matter, and also for full volume PDFs, we could mark in the XML whether these files actually exist (e.g., by providing, and relying on, an explicit <file> or <url type="internal"> tag, as discussed in #156).

I can also update the gist after we commit corrections and/or I get an updated file list.

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 22 (20 by maintainers)

Most upvoted comments

I think papers should always have an explicit <url> field and the absence of it should indicate that there isn’t one. As for the whole-volume PDFs…maybe they should be automatically created during the build if they are missing?
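A minimal sketch of what "explicit <url> or nothing" could mean for the cross-check: only papers that carry a <url> element contribute an expected file. The XML layout below (a `volume` with `paper` children, and a `url` holding an Anthology ID) is a simplified assumption for illustration, not the actual Anthology schema, and `expected_pdfs` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; the real Anthology files may differ.
SAMPLE = """
<volume id="J98">
  <paper id="1000"><title>Front matter</title></paper>
  <paper id="1001"><title>Some paper</title><url>J98-1001</url></paper>
</volume>
"""


def expected_pdfs(xml_text):
    """Return PDF names only for papers with an explicit, non-empty <url>."""
    root = ET.fromstring(xml_text.strip())
    files = []
    for paper in root.iter("paper"):
        url = paper.findtext("url")  # None when the element is absent
        if url and url.strip():
            files.append(url.strip() + ".pdf")
    return files


print(expected_pdfs(SAMPLE))
```

Under this convention, a paper without a <url> is never reported as missing, so any file that does exist for it would show up as unexpected and flag the XML as needing an explicit entry.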