openverse: Duplicates identified in SMK data

Description

While preparing an announcement for the recent reingestion of SMK data, we noticed duplicates in our data. For instance, this search yields a few duplicate images: https://wordpress.org/openverse/search/image/?q=nature&source=smk

@obulat identified that this was because the same images had two different foreign identifiers (and URLs):

On that page we can see that two of the Nature morte pictures are the same, but they have different URLs and probably different foreign IDs (which is how we determine whether two images are the same): https://iip.smk.dk/iiif/jp2/KMSKMSsp210.tif.reconstructed.tif.jp2/full/!2048,/0/default.jpg

@aetherunbound, do you know if we had any SMK items previously? The only difference between the two items seems to be the prefix before (what I think is) the foreign ID: 9306t251c_.
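
To illustrate how duplicates of this kind could be spotted, here is a minimal sketch. It assumes that a duplicated item's foreign ID is the same image name with an optional `<prefix>_` in front, which is only the guess made above and has not been confirmed. The function name and the sample IDs are hypothetical.

```python
from collections import defaultdict


def group_by_unprefixed_id(foreign_ids):
    """Group IDs that differ only by a leading '<prefix>_' segment."""
    groups = defaultdict(list)
    for fid in foreign_ids:
        # Assumption: everything before the first underscore is a prefix.
        _, sep, rest = fid.partition("_")
        key = rest if sep else fid
        groups[key].append(fid)
    # Keep only the keys shared by more than one ID (suspected duplicates).
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample IDs; only the prefix "9306t251c_" comes from the thread.
sample = ["9306t251c_KMSsp210.tif", "KMSsp210.tif", "KMS1.tif"]
duplicates = group_by_unprefixed_id(sample)
```

If real foreign IDs can contain underscores for other reasons, this grouping would over-match, so treat it as a way to find candidates rather than as a definitive check.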


About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

To summarize:

  • Remove the alternative images logic from the provider ingestion script
  • Delete all SMK data from the catalog
  • Re-run the SMK ingestion DAG (prior to the next data refresh)

My only hesitancy here is deleting data, since it could create dead links for folks.

@AetherUnbound Oh, I see. Yes I think we could remove the code to retrieve alternative images, delete the SMK images and re-run the DAG. If they’re just supplemental images to the primary images I don’t think we need to index them.

Although it would be nice in the future to support images that are really galleries in some form.

I’ve run the following commands to back up the SMK data, and uploaded the resulting file to: s3://openverse-catalog/image/smk/smk_deleted_2023_02_16.csv

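-- Snapshot every SMK row into a temporary table before anything is deleted
create temporary table smk_deleted_2023_02_16 as select * from image where provider='smk';
-- Export the snapshot to a local CSV file (psql client-side copy)
\copy smk_deleted_2023_02_16 to '/tmp/smk_deleted_2023_02_16.csv' DELIMITER ',' CSV HEADER;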

I have not yet deleted the data, since we’ll want to do that once an updated version of the DAG is deployed (which I will work on next).

@AetherUnbound I don’t think we need to delay the announcement as long as SMK results are available to users. I’m also okay with deleting the duplicates even if it’s not all of them.

Use the foreign landing URL as the foreign identifier (rather than the image ID), so we only get one image per actual result

It does seem like we should implement this. What do you think? I don’t think we should ever return more results for SMK than their API does, which is currently around 38k:

https://api.smk.dk/api/v1/art/search/?keys=*&lang=en&filters=[has_image:true],[public_domain:true]
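
As a rough illustration of the idea quoted above (one result per foreign landing URL), here is a minimal sketch. It is not the provider script's actual code: the function name is hypothetical, the record keys `foreign_landing_url` and `foreign_identifier` are assumed field names, and the sample URLs are made up.

```python
def dedupe_by_landing_url(records):
    """Keep the first record seen for each foreign landing URL."""
    seen = {}
    for record in records:
        # setdefault only stores a record if the URL has not been seen yet.
        seen.setdefault(record["foreign_landing_url"], record)
    return list(seen.values())


# Hypothetical records: the first two share a landing URL.
records = [
    {"foreign_identifier": "a", "foreign_landing_url": "https://example.org/art/1"},
    {"foreign_identifier": "b", "foreign_landing_url": "https://example.org/art/1"},
    {"foreign_identifier": "c", "foreign_landing_url": "https://example.org/art/2"},
]
unique = dedupe_by_landing_url(records)
```

Keyed this way, the number of ingested records cannot exceed the number of distinct landing pages, which should keep the SMK total at or below what their API reports.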