almanac.httparchive.org: Investigate 404 errors

In the production server logs I’m seeing lots of ambiguous error messages like this:

werkzeug.exceptions.NotFound: 404 Not Found: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.
at match (/env/lib/python3.7/site-packages/werkzeug/routing.py:1799)
at match_request (/env/lib/python3.7/site-packages/flask/ctx.py:336)
at raise_routing_exception (/env/lib/python3.7/site-packages/flask/app.py:1774)
at dispatch_request (/env/lib/python3.7/site-packages/flask/app.py:1791)
at full_dispatch_request (/env/lib/python3.7/site-packages/flask/app.py:1813)

At times the server spikes to 200 404s per minute, which is suspiciously high.

Sometimes this happens for something innocuous, like a site not having a favicon, but I can’t imagine why we’d have this many 404s unless there’s a broken link somewhere.

Two things:

  • Improve error logging so we know what the broken link is and where it’s coming from (cc @mikegeyser)
  • Rerun the SEO-style audit of the website so that we can more easily/proactively find broken links (#286 cc @AymenLoukil @catalinred @rachellcostello)
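
For the first item, one possible approach is to log the requested path and the Referer header whenever a response goes out with a 404 status. The sketch below is only an illustration, not the site's actual code: the class name NotFoundLogger and the logger name "almanac.404" are made up here, and it relies only on the standard WSGI interface.

```python
import logging

# Hypothetical logger name, chosen for this sketch.
logger = logging.getLogger("almanac.404")


class NotFoundLogger:
    """WSGI middleware that logs the path and referrer of every 404 response."""

    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        def logging_start_response(status, headers, exc_info=None):
            # status is a string such as "404 NOT FOUND"
            if status.startswith("404"):
                logger.warning(
                    "404 for %s (referrer: %s)",
                    environ.get("PATH_INFO", ""),
                    environ.get("HTTP_REFERER", "unknown"),
                )
            return start_response(status, headers, exc_info)

        # Delegate to the wrapped app, intercepting only the status line.
        return self.wsgi_app(environ, logging_start_response)
```

In a Flask app this could presumably be attached with app.wsgi_app = NotFoundLogger(app.wsgi_app), so each 404 log line names the missing URL and the page that linked to it.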

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

OK I got it.

We don’t have a working 404 page, except under the routes we have defined (e.g. /static/XXX or /lang/year/XXX).

Requesting http://127.0.0.1:8080/en/ reproduces the error, for example, as does http://127.0.0.1:8080/anythingrandom, because we have no routes matching those patterns.

The server shows a generic error page instead of the 404 page and returns a 500 to the browser, even though the request started life as a 404:

ERROR:root:An error occurred during a request due to page not found: /en/
Traceback (most recent call last):
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1791, in dispatch_request
    self.raise_routing_exception(req)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1774, in raise_routing_exception
    raise request.routing_exception
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/ctx.py", line 336, in match_request
    self.url_adapter.match(return_rule=True)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/werkzeug/routing.py", line 1799, in match
    raise NotFound()
werkzeug.exceptions.NotFound: 404 Not Found: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.
INFO:werkzeug:127.0.0.1 - - [11/Nov/2019 20:04:47] "GET /en/ HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1791, in dispatch_request
    self.raise_routing_exception(req)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1774, in raise_routing_exception
    raise request.routing_exception
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/ctx.py", line 336, in match_request
    self.url_adapter.match(return_rule=True)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/werkzeug/routing.py", line 1799, in match
    raise NotFound()
werkzeug.exceptions.NotFound: 404 Not Found: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 2295, in wsgi_app
    response = self.handle_exception(e)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1741, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1713, in handle_user_exception
    return self.handle_http_exception(e)
  File "/Users/barry/sources/almanac.httparchive.org/src/env/lib/python3.8/site-packages/flask/app.py", line 1644, in handle_http_exception
    return handler(e)
  File "/Users/barry/almanac.httparchive.org/src/main.py", line 145, in page_not_found
    return render_template('error/404.html', error=e), 404
  File "/Users/barry/almanac.httparchive.org/src/main.py", line 18, in render_template
    year = request.view_args.get('year', DEFAULT_YEAR)
AttributeError: 'NoneType' object has no attribute 'get'
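
The last frame shows the immediate cause: in Flask, request.view_args is None when no route matched the request, so calling .get on it fails inside the 404 handler. As an illustration only, a defensive lookup could look like the sketch below. The helper name year_from_view_args is hypothetical, and the value given to DEFAULT_YEAR is a placeholder rather than the project's real setting.

```python
DEFAULT_YEAR = "2019"  # placeholder value, assumed for this sketch


def year_from_view_args(view_args, default=DEFAULT_YEAR):
    """Return the 'year' route argument, tolerating view_args being None."""
    # "view_args or {}" substitutes an empty dict when routing failed.
    return (view_args or {}).get("year", default)
```

With a guard like this, a call such as year_from_view_args(request.view_args) would fall back to the default year for unmatched URLs instead of raising.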

Adding a default route like this fixes it:

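# Catch-all route: any URL not matched by a more specific rule lands here
# and is turned into a 404, which the page_not_found handler then renders.
@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def catch_all(path):
    abort(404, 'barry was here')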

I know this fixes it because the server returns our correct 404 page showing that exact error message (barry was here), so the request is reaching this route.

Other posts seem to suggest this is how it should work. I’ve tested that the other routes still work (home page, chapters, methodology, etc.), as well as static pages, sitemap.xml, etc.

Will submit a PR, though I suppose I should change the 404 error message 😀

However, I’m also going to add a route to handle the /en/ case and redirect to the default year:

# Redirect a bare language URL such as /en/ to the home page for the default year.
@app.route('/<lang>/')
@validate
def lang_only(lang):
    return redirect(url_for('home', lang=lang, year=DEFAULT_YEAR))

I’m still seeing vague 404 error messages in Stackdriver:

[Image: screenshot of the 404 error messages in Stackdriver]

However, the actual App Engine server logs are no longer showing any meaningful errors on things like broken images or bad requests, so I’m comfortable closing this issue.

Good find!

I ran a crawl, and here are the links generating errors:

https://almanac.httparchive.org/en/2019/](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) from https://almanac.httparchive.org/en/2019/resource-hints

https://almanac.httparchive.org/static/images/2019/05_Third_Parties/fig7.png from https://almanac.httparchive.org/en/2019/third-parties

https://almanac.httparchive.org/static/images/2019/08_Security/fig1.png from https://almanac.httparchive.org/en/2019/security

https://www.ssllabs.com/ssl-pulse/) from https://almanac.httparchive.org/en/2019/security

https://almanac.httparchive.org/static/images/2019/08_Security/fig8.png from https://almanac.httparchive.org/en/2019/security

https://almanac.httparchive.org/static/images/2019/08_Security/fig3.png from https://almanac.httparchive.org/en/2019/security

https://almanac.httparchive.org/static/images/2019/08_Security/fig2.png from https://almanac.httparchive.org/en/2019/security

https://fonts.gstatic.com/ from https://almanac.httparchive.org/en/2019/fonts

https://rainy-periwinkle.glitch.me/permalink/bc8f154a95dfe06a6d0fdb099b6c8df61727b2289141a0ef16dc17b2b57d3068.html from https://almanac.httparchive.org/en/2019/markup

https://rainy-periwinkle.glitch.me/permalink/3214f840b6ae3ef1074291f60fa1be4b9d9df401fe0190bfaff4bb078c8614a5.html from https://almanac.httparchive.org/en/2019/markup
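
Two of the URLs in this list (the one containing "](" and the one ending in a stray ")") look like leftover Markdown link syntax rather than genuinely missing pages. A rough heuristic for flagging such cases in crawl output might look like the sketch below. The function name has_markdown_residue is hypothetical, and the check is an assumption about what counts as residue, not a full URL validator.

```python
def has_markdown_residue(url):
    """Heuristic: True if the URL still carries Markdown link syntax."""
    # "](" is the seam between link text and target in [text](target).
    if "](" in url:
        return True
    # More closing than opening parentheses suggests a stray ")".
    return url.count(")") > url.count("(")
```

URLs flagged this way probably point at a typo in the chapter's Markdown source rather than at a page that needs to be created.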

Change these links to HTTPS:

http://speedcurve.com/ from https://almanac.httparchive.org/en/2019/contributors

http://paulcalvano.com/ from https://almanac.httparchive.org/en/2019/contributors

http://www.filamentgroup.com/ from https://almanac.httparchive.org/en/2019/fonts
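
If there turn out to be many such links, the scheme change could be scripted. The following is a small sketch using only the standard library. The function name upgrade_to_https is hypothetical, and it assumes each target host actually serves HTTPS, which should be checked before the links are changed.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_to_https(url):
    """Return the URL with an http scheme swapped for https."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        # Leave https, mailto, relative links, etc. untouched.
        return url
    return urlunsplit(parts._replace(scheme="https"))
```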