newspaper: Article `download()` failed with 404 Client Error

Hi,

I keep getting this error message - Article download() failed with 404 Client Error: Not Found for url: http://www.foxnews.com/2017/09/22/sheriff-clarke-trump-wins-either-way-luther-strange-roy-moore-alabama-senate-race on URL http://www.foxnews.com/2017/09/22/sheriff-clarke-trump-wins-either-way-luther-strange-roy-moore-alabama-senate-race

It happens for various article URLs.

Here is the code I am using:

    import time

    import newspaper

    news_content = newspaper.build(url)

    for i, article in enumerate(news_content.articles, start=1):
        article.download()  # download and parse each article
        article.parse()
        article.nlp()

        backupfile.write("\n" + "--------------------------------------------------------------" + "\n")
        backupfile.write(str(article.keywords))

        datasetfile.write("\n" + "----SUMMARY ARTICLE-> No. " + str(i) + "\n")
        datasetfile.write(article.summary)  # only the summary of each article is written to the dataset directory

        backupfile.write("\n" + "----SUMMARY ARTICLE---" + "\n")
        backupfile.write(article.summary)
        backupfile.write("\n" + "----TEXT INSIDE ARTICLE---" + "\n")
        backupfile.write(article.text)
        time.sleep(2)

Attached below is a screenshot of the error (screenshot from 2017-09-23 14-46-29).

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 17 (2 by maintainers)

Most upvoted comments

I posted the solution here:

    from newspaper import Article, Config

    # pretend to be a regular browser instead of newspaper's default user agent
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

    config = Config()
    config.browser_user_agent = user_agent

    url = 'https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830'.strip()

    page = Article(url, config=config)
    page.download()
    page.parse()
    print(page.text)

Here is the link: https://stackoverflow.com/a/63060794/2414957
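
If you are building a whole source with `newspaper.build()` like the code in the issue, the same `Config` can be passed there too, so every article in the loop is fetched with the custom user agent. A minimal sketch, reusing the `url` variable from the issue code; everything else here is just illustrative:

    import newspaper
    from newspaper import Config

    config = Config()
    config.browser_user_agent = (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) '
        'Gecko/20100101 Firefox/78.0'
    )

    # build() forwards the config to every Article it creates
    news_content = newspaper.build(url, config=config)

    for article in news_content.articles:
        article.download()
        article.parse()
        print(article.title)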

I just used a simple try/except structure. It seems to work just fine, at least for the 404 error I was seeing (code below; don't mind the splitting and stuff 😃).


    try:
        article.download()
        article.parse()
        article2 = article.text.split()
    except:
        # skip any article whose download or parse fails (e.g. the 404s above)
        print('***FAILED TO DOWNLOAD***', article.url)
        continue
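
If you want something narrower than a bare `except:`, newspaper3k raises `ArticleException` from `parse()` when the download did not succeed, so you can catch just that. A minimal sketch, assuming it sits inside the same loop over `news_content.articles` as above (if the top-level import does not work in your version, the class is also available as `newspaper.article.ArticleException`):

    from newspaper import ArticleException

    try:
        article.download()
        article.parse()
    except ArticleException as exc:
        # a 404 (or any other failed download/parse) lands here
        print('***FAILED TO DOWNLOAD***', article.url, exc)
        continue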

url.strip() will not fix a bad URL. See the URL above returned by the cnn object. Click on it; it is a bad URL.

First you told me just to do `except:`, and now you are telling me there is no error handling?

One of my colleagues had the same problem. She stripped off the newline characters in the URL strings using url.strip() and the error stopped.
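
For reference, a minimal sketch of that clean-up, with a hypothetical `raw_urls` list standing in for however the URLs are read in (e.g. from a file, where each line ends in a newline):

    from newspaper import Article

    # hypothetical input: URLs read from a file, each with a trailing newline
    raw_urls = [
        'http://www.foxnews.com/2017/09/22/sheriff-clarke-trump-wins-either-way-luther-strange-roy-moore-alabama-senate-race\n',
    ]

    for raw_url in raw_urls:
        url = raw_url.strip()  # drop the trailing newline/whitespace before building the Article
        article = Article(url)
        article.download()
        article.parse()
        print(article.title)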

If it's just the text that you want, since you already fetch the page with curl or Python requests, then use newspaper's fulltext:


    from newspaper import fulltext

    text = fulltext(html_content)  # html_content is the raw HTML you already fetched
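
A self-contained version of that, assuming the HTML is fetched with the `requests` library (the URL and user agent are just the examples from earlier in this thread):

    import requests
    from newspaper import fulltext

    url = 'https://www.newsweek.com/new-mexico-compound-charges-dropped-children-1096830'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) '
                      'Gecko/20100101 Firefox/78.0',
    }

    # fetch the page yourself, then let newspaper extract just the article body
    html_content = requests.get(url, headers=headers, timeout=10).text
    text = fulltext(html_content)
    print(text)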