sparkler: De-Duplicate documents in CrawlDB (Solr)

Now that we have a more sophisticated definition of the id field (with a timestamp included), we have to think about de-duplicating documents.

I am opening a discussion channel here to define de-duplication. Some suggestions are:

  • Compare the SHA-256 hash of raw_content, i.e. the signature field (but this still requires fetching the duplicate document even though we are not storing it)
  • Compare the url field
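
The trade-off between the two suggestions can be sketched as follows. This is a minimal illustration, not Sparkler's implementation; the function names are hypothetical, and only the field names `raw_content`, `signature`, and `url` come from the discussion above.

```python
import hashlib

def content_signature(raw_content: bytes) -> str:
    """SHA-256 over the fetched bytes (the signature field): catches
    identical content served under different URLs, but the document
    must be fetched before the hash can be computed."""
    return hashlib.sha256(raw_content).hexdigest()

def url_signature(url: str) -> str:
    """SHA-256 over the url field alone: cheap, no fetch needed, but
    misses the same content served from two different URLs."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```

The first option is more accurate but pays the fetch cost the issue warns about; the second can filter duplicates before fetching.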

We can refer here for the implementation.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

De-duplication of Outlinks:

Agreed, you can de-duplicate by crawl_id + url.
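
A minimal sketch of what a crawl_id + url id could look like; the helper name `dedupe_id` is hypothetical and the hashing scheme is just one plausible choice, not Sparkler's actual code.

```python
import hashlib

def dedupe_id(crawl_id: str, url: str) -> str:
    # Hypothetical helper: the same URL may legitimately reappear in a
    # later crawl, so the id combines crawl_id and url rather than
    # hashing the url alone.
    return hashlib.sha256(f"{crawl_id}:{url}".encode("utf-8")).hexdigest()
```

Within one crawl the same URL maps to the same id (a duplicate), while the same URL in a different crawl gets a fresh id.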

De-duplication of Content:

We need to flesh out more details on this and how it will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.

The crawl_id + url combination is better for what?

Better for the dedupe_id

Let me elaborate on my point. There are two types of de-duplication.

  • De-duplication of Outlinks: This is what we are discussing here. The objective is to de-duplicate outlinks so that we don’t end up crawling the same page again and again. For example, if every page of a website points back to its home page, we would like to remove the home page URL from the outlinks so that Sparkler doesn’t fetch it again.

  • De-duplication of Content: This is what I think you are talking about. This applies when you are refreshing the crawl, or when we want the crawler to fetch the page again based on the property retry_interval_seconds. This is not implemented yet; when it is, we will add the newly fetched document to our index and it will have the same dedupe_id. We can handle this with different Solr handlers.
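
The first case (outlink de-duplication) can be sketched as a simple filter against the set of URLs already seen in this crawl. This is an illustration only; `dedupe_outlinks` and `seen` are hypothetical names, not part of Sparkler.

```python
def dedupe_outlinks(outlinks, seen):
    """Drop outlinks whose URL was already queued or fetched in this
    crawl, so a page linked from everywhere (e.g. the home page) is
    fetched only once."""
    fresh = [u for u in outlinks if u not in seen]
    seen.update(fresh)
    return fresh

# The home page is already crawled, so only the about page survives:
seen = {"http://example.com/"}
fresh = dedupe_outlinks(["http://example.com/", "http://example.com/about"], seen)
# fresh == ["http://example.com/about"]
```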

I didn’t get what you are trying to say here. What fields contribute to de-duplication? And what are you trying to de-duplicate: the url, the content, or something else?

I was thinking along the lines of generalization and giving control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let’s push this back because it was just a random thought and not helping the issue.
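
For completeness, the deferred idea could be sketched as a function that hashes whichever schema fields the user configures. Everything here is hypothetical: the function name, the field separator, and the configuration mechanism are assumptions, not anything Sparkler implements.

```python
import hashlib

def generic_dedupe_id(doc: dict, fields: list) -> str:
    # Hypothetical generalization: the user picks the schema fields
    # (e.g. ["crawl_id", "url"]) that define the de-duplication id.
    key = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Configuring `["crawl_id", "url"]` reproduces the behavior discussed above, while `["signature"]` would switch to content-based de-duplication without code changes.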