openverse: Duplicates identified in SMK data
Description
While preparing an announcement for the recent reingestion of SMK data, we noticed duplicates in our data. For instance, this search yields a few duplicate images: https://wordpress.org/openverse/search/image/?q=nature&source=smk
@obulat identified that this was because the same images had two different foreign identifiers (and URLs):
On that page we can see that two of the Nature morte pictures are the same, but they have different URLs and probably different foreign IDs (which is how we determine whether images are the same): https://iip.smk.dk/iiif/jp2/KMSKMSsp210.tif.reconstructed.tif.jp2/full/!2048,/0/default.jpg
@aetherunbound, do you know if we had any SMK items previously?
The only difference between the two items seems to be the prefix before (what I think is) the foreign ID: 9306t251c_
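To make the prefix observation concrete, here is a minimal sketch of how such duplicates could be grouped. It is not the catalog's actual code. The helper names are hypothetical, and the prefix pattern (nine lowercase alphanumeric characters followed by an underscore) is an assumption generalized from the single example `9306t251c_`, so it would need checking against real data before use.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: the extra prefix is 9 lowercase alphanumeric characters plus an
# underscore, generalized from the single example "9306t251c_" in this thread.
PREFIX_PATTERN = re.compile(r"^[0-9a-z]{9}_")


def iiif_identifier(url: str) -> str:
    """Return the path segment that follows /iiif/jp2/ in an SMK image URL."""
    parts = urlparse(url).path.split("/")
    return parts[parts.index("jp2") + 1]


def normalized_identifier(url: str) -> str:
    """Strip the assumed prefix so both variants of an image compare equal."""
    return PREFIX_PATTERN.sub("", iiif_identifier(url))


def group_duplicates(urls):
    """Map each normalized identifier to its URLs, keeping only real duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalized_identifier(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Passing the URL quoted above together with a variant that carries the `9306t251c_` prefix should place both under the key `KMSKMSsp210.tif.reconstructed.tif.jp2`, while unrelated images are left out of the result.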
Additional context
Resolution
- 🙋 I would be interested in resolving this bug.
About this issue
- State: closed
- Created 2 years ago
- Comments: 16 (15 by maintainers)
To summarize:
My only hesitancy here is deleting data, since it could create dead links for folks.
@AetherUnbound Oh, I see. Yes, I think we could remove the code that retrieves alternative images, delete the SMK images, and re-run the DAG. If they’re just supplemental images to the primary images, I don’t think we need to index them.
Although it would be nice in the future to support images that are really galleries in some form.
I’ve run the following commands to gather the SMK data, and uploaded the resulting file here: s3://openverse-catalog/image/smk/smk_deleted_2023_02_16.csv
I have not yet deleted the data, since we’ll want to do that only once an updated version of the DAG is deployed (which I will work on next).
@AetherUnbound I don’t think we need to delay the announcement as long as SMK results are available to users. I’m also okay with deleting the duplicates even if it’s not all of them.
It does seem like we should implement this. What do you think? I don’t think we should ever return more results for SMK than their API does, which is currently around 38k:
https://api.smk.dk/api/v1/art/search/?keys=*&lang=en&filters=[has_image:true],[public_domain:true]
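The "never more than the API" rule above could be checked with something like the following sketch. The query parameters are taken from the URL above. The assumption that the response carries its total hit count under a `found` key has not been confirmed in this thread, and all function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SMK_SEARCH_URL = "https://api.smk.dk/api/v1/art/search/"


def provider_total(response: dict) -> int:
    """Read the total hit count from a parsed search response."""
    # Assumption: the total is exposed under a "found" key.
    return int(response["found"])


def fetch_provider_total(timeout: float = 30.0) -> int:
    """Query the SMK search endpoint with the filters quoted in this thread."""
    query = urlencode(
        {
            "keys": "*",
            "lang": "en",
            "filters": "[has_image:true],[public_domain:true]",
        }
    )
    with urlopen(f"{SMK_SEARCH_URL}?{query}", timeout=timeout) as resp:
        return provider_total(json.load(resp))


def surplus_records(catalog_count: int, api_total: int) -> int:
    """Return how many more records we hold than the provider reports."""
    return max(0, catalog_count - api_total)
```

A non-zero result from `surplus_records` would suggest that duplicates or stale records are still present for the source.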