acl-anthology: PDFs missing on the server
According to the search console, Google was unable to index quite a few paper PDFs.
Most of these are not problems on our part, but some are. Here is the list of 404s on our side (papers are linked from the anthology):
- https://www.aclweb.org/anthology/L18-1254.pdf
- https://www.aclweb.org/anthology/O95-1002.pdf (volume)
- https://www.aclweb.org/anthology/O00-1011.pdf (volume)
- https://www.aclweb.org/anthology/O10-2006.pdf
- https://www.aclweb.org/anthology/O91-1004.pdf (volume)
- https://www.aclweb.org/anthology/E06-1029.pdf
- https://www.aclweb.org/anthology/O90-1005.pdf (volume)
- https://www.aclweb.org/anthology/O96-2004.pdf
- https://www.aclweb.org/anthology/O97-4001.pdf
- https://www.aclweb.org/anthology/W04-2303.pdf
- https://www.aclweb.org/anthology/W03-2413.pdf
- https://www.aclweb.org/anthology/W03-2410.pdf
- https://www.aclweb.org/anthology/W10-0409.pdf
- https://www.aclweb.org/anthology/I08-2128.pdf
- https://www.aclweb.org/anthology/attachments/W19-5212.OptionalSupplementaryMaterial.pdf
- https://www.aclweb.org/anthology/N16-1085v1.pdf
- https://www.aclweb.org/anthology/S19-2021v2.pdf
- https://www.aclweb.org/anthology/S19-2021v1.pdf
- https://www.aclweb.org/anthology/P12-1031v2.pdf
- https://www.aclweb.org/anthology/P12-1031v1.pdf
- https://www.aclweb.org/anthology/D15-2.pdf
- https://www.aclweb.org/anthology/W14-2104e1.pdf.pdf
- https://www.aclweb.org/anthology/attachments/S19-2207.Software.tex
Second block of missing PDFs (from search console -> “submitted URL has crawl issue”):
- https://www.aclweb.org/anthology/M92-1003.pdf
- https://www.aclweb.org/anthology/W09-2009.pdf
- https://www.aclweb.org/anthology/W09-1008.pdf
- https://www.aclweb.org/anthology/O07-1014.pdf
- https://www.aclweb.org/anthology/W03-3022.pdf
- https://www.aclweb.org/anthology/I08-7008.pdf
- https://www.aclweb.org/anthology/I08-2120.pdf
- https://www.aclweb.org/anthology/O98-1011.pdf
- https://www.aclweb.org/anthology/W10-0707.pdf
- https://www.aclweb.org/anthology/O11-1010.pdf
- https://www.aclweb.org/anthology/O00-1007.pdf
More:
2020 edition:
Created using magic^tm
Files that could not be downloaded
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.2.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.3.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.4.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.5.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.7.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.12.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.13.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.14.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.15.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.27.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.28.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.29.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.30.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.31.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.32.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.33.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.34.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.36.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.37.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.38.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.39.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2011.eamt-1.41.Presentation.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.3.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.5.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.9.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.10.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.17.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.19.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.33.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.37.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.38.Supplementary_Attachment.pdf
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.42.Supplementary_Attachment.zip
- https://www.aclweb.org/anthology//attachments/2020.inlg-1.45.Supplementary_Attachment.zip
Files with checksum mismatch
- https://www.aclweb.org/anthology//2020.amta-research.11.pdf
- https://www.aclweb.org/anthology//2020.wosp-1.pdf
- https://www.aclweb.org/anthology//2020.wosp-1.12.pdf
- https://www.aclweb.org/anthology//N18-2078v1.pdf
Please mark the papers that are fixed to track progress.
For posterity's sake, this is how I got that list:
- download CSV with problematic URLs from search console
- run this:
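# Take the first CSV column (the URL) of every row and request only the headers.
for i in $(sed 's/,.*//' https___www.aclweb.org_anthology_\ Index\Coverage\ Drilldown\ 2019-10-22.csv); do
  # -s silences the progress meter; tr strips the trailing carriage return of the status line.
  status=$(curl -sI "$i" | head -n1 | tr -d '\r')
  if [[ "$status" =~ "404" ]]; then
    echo "$i $status"
  fi
done > anthology-404.txt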
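A slightly more defensive variant of the loop above, as a sketch only: it asks curl for the numeric status code instead of matching against the status line, and it skips rows that do not start with a URL (such as a CSV header row, if there is one). The function names `first_column` and `list_404s` and the CSV name in the usage comment are made up for illustration; the assumption that the URL is the first comma-separated field comes from the `sed` call above.

```shell
# Sketch only: function names are hypothetical, not part of any Anthology tooling.

# Print the first comma-separated field of every line on stdin,
# keeping only fields that look like http(s) URLs.
first_column() {
  sed 's/,.*//' | grep -E '^https?://' || true
}

# Read URLs on stdin; print those that answer a HEAD request with HTTP 404.
list_404s() {
  while IFS= read -r url; do
    # -s: no progress meter, -I: HEAD request, -o /dev/null: discard the headers,
    # -w '%{http_code}': print only the numeric status code.
    code=$(curl -s -I -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "404" ]; then
      echo "$url $code"
    fi
  done
}

# Usage (the CSV name is a placeholder):
#   first_column < problematic-urls.csv | list_404s > anthology-404.txt
```

Reading the URLs line by line avoids word splitting, and comparing the status code exactly avoids accidental matches on other parts of the status line.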
The 2020 edition list is from the mirroring effort and should check all files we reference.
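The checksum half of such a mirroring check could look roughly like the sketch below. This is not the script that produced the list above: the manifest format (one `<sha256>  <path>` pair per line, as written by `sha256sum` in its default text mode) and the function name `list_mismatches` are assumptions made for illustration, and the checksum algorithm actually used may well differ.

```shell
# Sketch only: manifest format and function name are hypothetical.
# Reads "<sha256>  <path>" lines on stdin and prints every path whose
# file is missing or whose current SHA-256 digest differs from the manifest.
list_mismatches() {
  while read -r expected path; do
    if [ ! -f "$path" ]; then
      echo "$path missing"
      continue
    fi
    # sha256sum prints "<digest>  <path>"; keep only the digest.
    actual=$(sha256sum "$path" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
      echo "$path mismatch"
    fi
  done
}

# Usage (manifest.txt is a placeholder name):
#   list_mismatches < manifest.txt
```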
About this issue
- State: closed
- Created 5 years ago
- Comments: 19 (16 by maintainers)
Commits related to this issue
- Missing PDF fixes (#598) (#600) * removed D15-2 PDF * removed .pdf from W14-2104 * removed S19-2207 attachment * remove W19-5212 attachment * removed nonexistent correction for P12-1031 * found ... — committed to acl-org/acl-anthology by mjpost 5 years ago
- added URL to PDF (#598) — committed to acl-org/acl-anthology by mjpost 4 years ago
- Merge pull request #807 from acl-org/add-eamt2015-pdf added URL to PDF (#598) — committed to acl-org/acl-anthology by danielgildea 4 years ago
- Removing XML entries for files missing on the server (re #264, #598) — committed to acl-org/acl-anthology by mbollmann 4 years ago
- Fixed PDF that was actually HTML (#598) — committed to acl-org/acl-anthology by mjpost 3 years ago
- Added 2020 full PDF (closes #1197) (#1198) * Fixed PDF that was actually HTML (#598) * Added EMNLP 2020 PDF — committed to acl-org/acl-anthology by mjpost 3 years ago
- Missing PDF fixes (#598) (#600) * removed D15-2 PDF * removed .pdf from W14-2104 * removed S19-2207 attachment * remove W19-5212 attachment * removed nonexistent correction for P12-1031 * found ... — committed to ir-anthology/ir-anthology by mjpost 5 years ago
- added URL to PDF (#598) — committed to ir-anthology/ir-anthology by mjpost 4 years ago
- Merge pull request #807 from acl-org/add-eamt2015-pdf added URL to PDF (#598) — committed to ir-anthology/ir-anthology by danielgildea 4 years ago
- Removing XML entries for files missing on the server (re #264, #598) — committed to ir-anthology/ir-anthology by mbollmann 4 years ago
- Fixed PDF that was actually HTML (#598) — committed to ir-anthology/ir-anthology by mjpost 3 years ago
Okay, I put in place:
Spot-checking some filenames, most of these seem to have already been reported in my list in https://github.com/acl-org/acl-anthology/issues/264#issuecomment-506827374.