sparkler: De-Duplicate documents in CrawlDB (Solr)
Now that we have a more sophisticated definition of the id field (with the timestamp included), we need to think about de-duplication of documents.
I am opening a discussion channel here to define de-duplication. Some of the suggestions are:
- Compare the SHA256 hash of the `raw_content`, i.e. the `signature` field (but this forces us to fetch the duplicate document even though we are not storing it)
- Compare the `url` field
We can refer here for the implementation.
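To make the two options concrete, here is a rough sketch of what each key could look like. This is only a sketch; the object and method names are illustrative, not Sparkler's actual schema or API:

```scala
import java.security.MessageDigest

object DedupeKeys {
  // Option 1: SHA-256 over the fetched bytes, stored in the signature field.
  // Catches byte-identical pages served under different URLs, but the
  // duplicate has to be fetched before it can be detected.
  def contentSignature(rawContent: Array[Byte]): String =
    MessageDigest.getInstance("SHA-256")
      .digest(rawContent)
      .map("%02x".format(_))
      .mkString

  // Option 2: key on the (lightly normalized) URL. Cheap and available
  // before fetching, but misses identical content behind different URLs.
  def urlKey(url: String): String =
    url.trim.toLowerCase.stripSuffix("/")
}
```

Option 2 can be applied before a fetch is ever scheduled, which is exactly the advantage the content-hash approach lacks.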
About this issue
- State: closed
- Created 7 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Merge pull request #72 from spicule-kythera/mvn2sbt trigger build — committed to USCDataScience/sparkler by buggtb 3 years ago
Agreed, you can dedup by `crawlid` + `url`. We need to flesh out more details on this and on how it will be implemented. I am open to starting a discussion on this if it is on the timeline right now; otherwise we can defer it until it comes under development.
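A minimal sketch of such a crawl-scoped key, assuming a SHA-256 over the two fields (the helper name and field handling are illustrative):

```scala
import java.security.MessageDigest

// Hypothetical helper: one stable id per (crawl, url) pair.
def dedupeId(crawlId: String, url: String): String = {
  val key = s"$crawlId:$url"
  MessageDigest.getInstance("SHA-256")
    .digest(key.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
}
```

With a key like this, re-discovering the same URL within a crawl maps to the same id, while a fresh crawl (new crawl id) produces a new document.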
Better for the dedupe_id
Let me elaborate on my point. There are two types of de-duplication.
De-duplication of Outlinks: This is what we are discussing here. The objective is to de-duplicate outlinks so that we don't end up crawling the same page again and again. For example, if every page of a website points back to its home page, we would like to remove the home page URL from the outlinks so that Sparkler doesn't fetch it again.
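A minimal sketch of that outlink filtering, assuming we have the parsed outlinks of a page and the set of URLs already known to the CrawlDB (hypothetical names, not Sparkler's actual API):

```scala
// Drop outlinks we have already seen, so the home page (or any other
// heavily linked URL) is scheduled at most once.
def filterOutlinks(outlinks: Seq[String], knownUrls: Set[String]): Seq[String] =
  outlinks.distinct.filterNot(knownUrls.contains)
```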
De-duplication of Content: This is, I think, what you are talking about. It applies when you are refreshing the crawl, or when we want the crawler to fetch the page again based on the `retry_interval_seconds` property. This is not implemented yet; when it is, we will add the newly fetched document to our index and it will have the same dedupe_id. We can handle this with different Solr handlers.

I was thinking along the lines of generalization and giving the control to the user, i.e. letting them define which combination of schema fields makes up the de-duplication id. Let's push this back, because it was just a random thought and is not helping the issue.
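Even though that idea is being parked, a minimal sketch of a user-defined field combination might look like the following (hypothetical names; it simply generalizes the `crawlid` + `url` key above):

```scala
import java.security.MessageDigest

// Hypothetical: the de-duplication id is a hash over whichever schema
// fields the user configures, e.g. Seq("crawlid", "url") or Seq("signature").
def dedupeId(doc: Map[String, String], configuredFields: Seq[String]): String = {
  val key = configuredFields.flatMap(doc.get).mkString("|")
  MessageDigest.getInstance("SHA-256")
    .digest(key.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
}
```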