acl-anthology: PDFs missing on the server

According to the Search Console, Google was unable to index quite a few paper PDFs.

Most of these are not problems on our part, but some are. Here is the list of 404s on our side (papers are linked from the Anthology):

Second block of missing PDFs (from search console -> “submitted URL has crawl issue”):

More:

2020 edition:

Created using magic^tm

Files that could not be downloaded

Files with checksum mismatch

Please mark the papers that are fixed to track progress.

For posterity's sake, this is how I got that list:

  • download CSV with problematic URLs from search console
  • run this:
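# Take the first CSV column (the URL) and record every URL that answers with a 404.
for i in $(sed 's/,.*//' https___www.aclweb.org_anthology_\ Index\Coverage\ Drilldown\ 2019-10-22.csv); do
  # -s hides the progress meter, -I fetches headers only; drop the trailing CR
  status=$(curl -sI "$i" | head -n1 | tr -d '\r')
  if [[ "$status" =~ "404" ]]; then
    echo "$i $status"
  fi
done > anthology-404.txt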

The 2020 edition list comes from the mirroring effort and should cover all files we reference.
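
The mirroring script itself is not shown in this issue. As a rough sketch of how the two 2020 categories ("could not be downloaded" and "checksum mismatch") might be derived for already-downloaded files, the following assumes a hypothetical manifest with one `<sha256>  <path>` pair per line (the format `sha256sum` emits). The function names `check_file` and `check_manifest` and the use of SHA-256 are assumptions for illustration; the Anthology's actual checksum scheme may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one local file against its expected SHA-256.
# Prints "missing" (absent or empty), "mismatch", or "ok".
check_file() {
  local path=$1 expected=$2 actual
  if [[ ! -s "$path" ]]; then
    echo "missing"
    return
  fi
  actual=$(sha256sum "$path" | cut -d' ' -f1)
  if [[ "$actual" == "$expected" ]]; then
    echo "ok"
  else
    echo "mismatch"
  fi
}

# Hypothetical manifest format (an assumption): "<sha256>  <path>" per line.
# Prints "<result> <path>" for every entry.
check_manifest() {
  local expected path
  while read -r expected path; do
    [[ -z "$expected" ]] && continue
    echo "$(check_file "$path" "$expected") $path"
  done < "$1"
}
```

Running something like `check_manifest manifest.txt | grep -v '^ok '` would then list only the problematic files, which could be pasted into a checklist like the ones above.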

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (16 by maintainers)

Most upvoted comments

Okay, I put in place:

  • O11-1010.pdf
  • O00-1007.pdf
  • O07-1014.pdf
  • O98-1011.pdf

Spot-checking some filenames, most of these seem to have already been reported in my list in https://github.com/acl-org/acl-anthology/issues/264#issuecomment-506827374.