sparkler: De-Duplicate documents in CrawlDB (Solr)

Now that we have a more sophisticated definition of the id field (with a timestamp included), we have to think about de-duplicating documents.

I am opening a discussion channel here to define de-duplication. Some suggestions are:

  • Compare the SHA-256 hash of raw_content, i.e. the signature field (but this still requires fetching the duplicate document even though we are not storing it)
  • Compare the url field
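
The trade-off between the two suggestions can be sketched as follows. This is a minimal illustration, not Sparkler's implementation; the function names are hypothetical, and only the field names `raw_content`, `signature`, and `url` come from the discussion above.

```python
import hashlib

def content_signature(raw_content: bytes) -> str:
    """SHA-256 over the fetched bytes (the signature field): catches
    identical content served under different URLs, but the document
    must be fetched before the hash can be computed."""
    return hashlib.sha256(raw_content).hexdigest()

def url_signature(url: str) -> str:
    """SHA-256 over the url field alone: cheap, no fetch needed, but
    misses the same content served from two different URLs."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

The first option is more accurate but pays the fetch cost the issue warns about; the second can filter duplicates before fetching.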

We can refer here for the implementation.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

De-duplication of Outlinks:

Agreed, you can de-duplicate by crawl_id + url.
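
A minimal sketch of what a crawl_id + url id could look like; the helper name `dedupe_id` is hypothetical and the hashing scheme is just one plausible choice, not Sparkler's actual code.

```python
import hashlib

def dedupe_id(crawl_id: str, url: str) -> str:
    # Hypothetical helper: the same URL may legitimately reappear in a
    # later crawl, so the id combines crawl_id and url rather than
    # hashing the url alone.
    return hashlib.sha256(f"{crawl_id}:{url}".encode("utf-8")).hexdigest()
```

Within one crawl the same URL maps to the same id (a duplicate), while the same URL in a different crawl gets a fresh id.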

De-duplication of Content:

We need to flesh out more details on this and how it will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.

The crawl_id + url combination is better for what?

Better for the dedupe_id

Let me elaborate on my point. There are two types of de-duplication.

  • De-duplication of Outlinks: This is what we are discussing here. The objective is to de-duplicate outlinks so that we don’t end up crawling the same page again and again. For example, if every page of a website points back to its home page, we would like to remove the home page URL from the outlinks so that Sparkler doesn’t fetch it again.

  • De-duplication of Content: This is what I think you are talking about. This applies when you are refreshing the crawl, or when we want the crawler to fetch the page again based on the property retry_interval_seconds. This is not implemented yet; when it is, we will add the newly fetched document to our index and it will have the same dedupe_id. We can handle this with different Solr handlers.
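
The first case (outlink de-duplication) can be sketched as a simple filter against the set of URLs already seen in this crawl. This is an illustration only; `dedupe_outlinks` and `seen` are hypothetical names, not part of Sparkler.

```python
def dedupe_outlinks(outlinks, seen):
    """Drop outlinks whose URL was already queued or fetched in this
    crawl, so a page linked from everywhere (e.g. the home page) is
    fetched only once."""
    fresh = [u for u in outlinks if u not in seen]
    seen.update(fresh)
    return fresh

# The home page is already crawled, so only the about page survives:
seen = {"http://example.com/"}
fresh = dedupe_outlinks(["http://example.com/", "http://example.com/about"], seen)
# fresh == ["http://example.com/about"]
```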

I didn’t get what you are trying to say here. What fields contribute to de-duplication? And what are you trying to de-duplicate: the url, the content, or something else?

I was thinking along the lines of generalization and giving control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let’s push this back because it was just a random thought and not helping the issue.
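
For completeness, the deferred idea could be sketched as a function that hashes whichever schema fields the user configures. Everything here is hypothetical: the function name, the field separator, and the configuration mechanism are assumptions, not anything Sparkler implements.

```python
import hashlib

def generic_dedupe_id(doc: dict, fields: list) -> str:
    # Hypothetical generalization: the user picks the schema fields
    # (e.g. ["crawl_id", "url"]) that define the de-duplication id.
    key = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Configuring `["crawl_id", "url"]` reproduces the behavior discussed above, while `["signature"]` would switch to content-based de-duplication without code changes.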