science-on-schema.org: Linking a checksum to DataDownload

Can we use the schema:identifier property, with a URN scheme to indicate the checksum?

Proposal:

  1. Use schema:PropertyValue
  2. Use schema:identifier to specify the URN of the checksum (e.g. md5:9e85e71b33f71ac738e4793ff142c464)
  3. Use schema:propertyID to specify the type of checksum as text
  4. Use schema:additionalType to specify the type of checksum using controlled vocabularies
  5. Use schema:value to specify the value of the checksum

Examples:

MD5:

{
  "@type": "DataDownload",
  "identifier": [
    ...DOI and other identifiers go here...,
    {
      "@type": "PropertyValue",
      "additionalType": ["http://www.wikidata.org/entity/Q185235", "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/md5"],
      "identifier": "md5:9e85e71b33f71ac738e4793ff142c464",
      "propertyID": "MD5",
      "value": "9e85e71b33f71ac738e4793ff142c464"
    }
  ]
}

SHA256:

{
  "@type": "DataDownload",
  "identifier": [
    ...DOI and other identifiers go here...,
    {
      "@type": "PropertyValue",
      "additionalType": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256",
      "identifier": "sha256:8808ACDC7FB7DC2F941EBACC7906B32D2676044494A740C21F6E0DC20893A2A6",
      "propertyID": "SHA256",
      "value": "8808ACDC7FB7DC2F941EBACC7906B32D2676044494A740C21F6E0DC20893A2A6"
    }
  ]
}
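
As a rough illustration of the proposal above, a provider could generate the PropertyValue entry directly from the file contents. This is only a sketch: the function name `checksum_identifier` and the `LOC_BASE` constant are hypothetical, and it assumes that the Library of Congress vocabulary URI ends in the same lowercase algorithm name that Python's hashlib uses, which holds for the md5 and sha256 URIs shown in the examples but is not checked for other algorithms.

```python
import hashlib

# Assumption: the LOC vocabulary URI ends in the hashlib algorithm name.
# This matches the md5 and sha256 URIs in the examples above only.
LOC_BASE = "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/"


def checksum_identifier(data: bytes, algorithm: str = "sha256") -> dict:
    """Build a schema:PropertyValue describing the checksum of `data`."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {
        "@type": "PropertyValue",
        "additionalType": LOC_BASE + algorithm,
        "identifier": f"{algorithm}:{digest}",  # URN-like form, e.g. md5:9e85...
        "propertyID": algorithm.upper(),        # checksum type as text
        "value": digest,                        # raw hex digest
    }


# Example: an entry that could be appended to a DataDownload's "identifier" list.
entry = checksum_identifier(b"example file contents", "md5")
```

The returned dictionary can be serialized with `json.dumps` and placed alongside the DOI and other identifiers.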

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 23 (1 by maintainers)

Most upvoted comments

@cboettig I did consider using spdx:ChecksumAlgorithm and spdx:Checksum, and spdx:checksumValue, but I thought there were some issues:

  • it separates the checksum algorithm from the value representation, so complicates parsing and introduces blank nodes unless we are careful
  • the spdx class definitions have a number of domain and range entailments in SPDX (like to spdx:File) and are defined specifically for software
  • it puts yet another term in our vocabulary outside of SO. But we’d done that several times already…
  • spdx:algorithm has a defined range that only includes md5, sha1, and sha256, which is probably an oversight that could be fixed.

The benefits would be

  • easier to recognize it as a checksum because of the dedicated class
    • doesn’t conflate identifier and checksum semantics
  • consistency with the direction of DCAT3 as @andrea-perego points out above

So, given the direction of DCAT3, here’s an alternative proposal that would be very clear about the semantics of the checksum, and avoids blank nodes by using the hash uri serialization as the “@id”:

{
    "@context": {
        "@vocab": "https://schema.org/",
        "spdx": "http://spdx.org/rdf/terms#"
    },
    "@type": "Dataset",
    "@id": "https://dataone.org/datasets/doi%3A10.18739%2FA2NK36607",
    "sameAs": "https://doi.org/10.18739/A2NK36607",
    "name": "Conductivity-Temperature-Depth (CTD) data along DBO5 (Distributed Biological Observatory - Barrow Canyon), from the 2009 Circulation, Cross-shelf Exchange, Sea Ice, and Marine Mammal Habitat on the Alaskan Beaufort Sea Shelf cruise on USCGC Healy (HLY0904)",
    "distribution": {
      "@type": "DataDownload",
      "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae",
      "identifier": [
        {
          "@type": "PropertyValue",
          "@id": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae",
          "propertyID": "https://rfc-editor.org/rfc/rfc4122.txt",
          "value": "urn:uuid:2646d817-9897-4875-9429-9c196be5c2ae"
        }
      ],
      "spdx:Checksum": {
        "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
        "spdx:checksumAlgorithm": { 
          "@id": "spdx:checksumAlgorithm_sha256" 
        }
      }
    }
}

The triples that are related to the checksum would then be:

<https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae> <http://spdx.org/rdf/terms#Checksum> <hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> .
<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> <http://spdx.org/rdf/terms#checksumAlgorithm> <http://spdx.org/rdf/terms#checksumAlgorithm_sha256> .
<hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51> <http://spdx.org/rdf/terms#checksumValue> "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" .
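
A minimal sketch of how the checksum node and the two triples that have it as their subject could be produced, using only the Python standard library. The helper names `checksum_node` and `checksum_triples` are hypothetical, and the N-Triples lines are built by plain string formatting rather than by a JSON-LD processor, so this only works for the simple node shape shown in the example.

```python
import hashlib

SPDX = "http://spdx.org/rdf/terms#"


def checksum_node(data: bytes, algorithm: str = "sha256") -> dict:
    """Build the checksum node, using the hash URI as "@id" to avoid a blank node."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {
        "@id": f"hash://{algorithm}/{digest}",
        "spdx:checksumValue": digest,
        "spdx:checksumAlgorithm": {"@id": f"spdx:checksumAlgorithm_{algorithm}"},
    }


def checksum_triples(node: dict) -> list:
    """Return N-Triples lines for the statements whose subject is the checksum node."""
    subject = f"<{node['@id']}>"
    value = node["spdx:checksumValue"]
    # Expand the compact "spdx:" prefix to the full namespace IRI.
    algorithm_iri = node["spdx:checksumAlgorithm"]["@id"].replace("spdx:", SPDX, 1)
    return [
        f"{subject} <{SPDX}checksumAlgorithm> <{algorithm_iri}> .",
        f'{subject} <{SPDX}checksumValue> "{value}" .',
    ]


node = checksum_node(b"")
triples = checksum_triples(node)
```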

So, let’s call our two options:

  1. Option 1: Place checksum as schema:identifier
  2. Option 2: Place checksum as spdx:Checksum

Right now, I think I like Option 2 better. Thoughts?

On the minimalism front, I hear what folks are saying and agree with some aspects of it, but I think there is room and need for guidance supporting different discovery (and other) use cases. As Carl lays out, there are important discovery use cases for checksums, so simple guidance of the form “if you want to provide a checksum, do it like this…” will go a long way towards overcoming the ambiguity and multiplicity of approaches in schema.org. At no point are we saying that people must provide checksums; we’re simply trying to provide implementation guidance on how to do it in a simple, interoperable way for those who want to provide one. We can continue this discussion on minimalism, but I think someone should open a new issue for it: it’s not really the topic here for Checksum, and the minimalism discussion applies to many other fields in the already released SOSO guidance docs. In addition, folks who would like to see the SOSO effort change direction and be more minimalist might consider joining our twice-monthly calls so we can discuss that strategic direction in more detail.

To summarize the Checksum discussion thus far, and try to reach agreement on it, we have proposed two options: 1) to include checksum as an identifier, or 2) to include checksum as spdx:checksum. While these approaches are not exclusive, my read of the conversation thus far is that people think it is better to use spdx:checksum because it specifically signals the intent of the field, and doesn’t conflate it with using a checksum as an identifier (which can be done as well but for different reasons). I think we’ve explored the options pretty thoroughly in this thread, and so I propose that we follow the examples of spdx:checksum in previous comments, and that we discuss this to get agreement on our next call. I will write up guidance docs in our proposed decision format for that meeting if that’s ok. I’ve added it to the agenda for the May 27th call.

@fils thanks, seems we are pretty aligned then 😃 My hope is that SO can be useful for many of the small research stations and communities that have important data but lack the ability to properly announce them today. I very much support your statement on guidance for ranking and decoration, I think that is crucial to achieve this goal.

Definitely agree with @fils on this!

for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I’m already using this quite a bit and the latest proposed guidance seems fine on quick inspection.

That’s also why I like the proposal of identifiers based on the checksums and independent of the party exposing the data. Certainly we can always extract that information if it’s in the checksum field like in these examples, but it’s also easily lost from there – e.g. like @fils says there’s the temptation to just json-ld frame it out and have a pure-schema.org representation, or any of the other representations that don’t have a native checksum attribute. So I still support @mbjones original suggestion that it would be nice to also normalize including this in an identifier field, which is (for better or worse) a much more widely implemented field.

If I had my druthers it would be the ‘canonical’ identifier or @id for any downloadable content object (schema:DataDownload), because "@id": "hash://sha256/39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51" is not subject to link rot and not provider specific the way that "@id": "https://dataone.org/datasets/urn%3Euuid%3E2646d817-9897-4875-9429-9c196be5c2ae" is, but I also recognize that maybe we need to walk before we run. Having checksums in the checksum field is at least a nice start.
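
To illustrate why a content-based identifier is independent of the provider, here is a hedged sketch of checking downloaded bytes against a hash URI. The function name `matches_hash_uri` is hypothetical, and the parsing assumes the simple `hash://<algorithm>/<hex digest>` layout used in this thread, with the algorithm name being one that Python's hashlib recognizes.

```python
import hashlib
import hmac

HASH_PREFIX = "hash://"


def matches_hash_uri(data: bytes, hash_uri: str) -> bool:
    """Return True if `data` hashes to the digest named in `hash_uri`."""
    if not hash_uri.startswith(HASH_PREFIX):
        raise ValueError(f"not a hash URI: {hash_uri!r}")
    # Split "sha256/39ae..." into the algorithm name and the expected hex digest.
    algorithm, _, expected = hash_uri[len(HASH_PREFIX):].partition("/")
    actual = hashlib.new(algorithm, data).hexdigest()
    # Hex digests may be written in upper or lower case; compare case-insensitively.
    return hmac.compare_digest(actual, expected.lower())


empty_sha256 = "hash://sha256/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
ok = matches_hash_uri(b"", empty_sha256)
```

The same check succeeds no matter which repository served the bytes, which is the property the comment above is after.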

A related question is whether to encourage providing more than one checksum. If so, I think it would look something like:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "spdx": "http://spdx.org/rdf/terms#",
    "spdx:checksumAlgorithm": {"@id": "spdx:checksumAlgorithm", "@type": "@id"}
  },
  "spdx:checksum": [
    {
      "@type": "spdx:Checksum",
      "spdx:checksumValue": "39ae639d33cea4a287198bbcdca5e6856e6607a7c91dc4c54348031be2ad4c51",
      "spdx:checksumAlgorithm": "spdx:checksumAlgorithm_sha256"
    },
    {
      "@type": "spdx:Checksum",
      "spdx:checksumValue": "65d3616852dbf7b1a6d4b53b00626032",
      "spdx:checksumAlgorithm": "spdx:checksumAlgorithm_md5"
    }
  ]
}
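
If more than one checksum is wanted, they can all be computed in a single pass over the file. The sketch below is one possible way to build the `spdx:checksum` array; the function name `spdx_checksums` and its parameters are hypothetical, and the algorithm IRIs are formed by appending the hashlib name to `spdx:checksumAlgorithm_`, which matches the sha256 and md5 values in the example but is an assumption for anything else.

```python
import hashlib
import io
from typing import BinaryIO, Sequence


def spdx_checksums(
    stream: BinaryIO,
    algorithms: Sequence[str] = ("sha256", "md5"),
    chunk_size: int = 65536,
) -> list:
    """Read `stream` once and return one spdx:Checksum node per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Feed each chunk to every hasher so large files are only read once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return [
        {
            "@type": "spdx:Checksum",
            "spdx:checksumValue": hasher.hexdigest(),
            "spdx:checksumAlgorithm": f"spdx:checksumAlgorithm_{name}",
        }
        for name, hasher in hashers.items()
    ]


# Example with an in-memory stream; a real caller would pass open(path, "rb").
checksums = spdx_checksums(io.BytesIO(b""))
```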

@steingod for me a primary interest in this is to aid potential disambiguation of data being exposed by multiple parties. So I view this as a vital element for data. I’m already using this quite a bit and the latest proposed guidance seems fine on quick inspection.

In most cases for the end user the goal is to easily get the data. This is trivial to JSON-LD Frame out and no more complex in SPARQL space than any guidance (which can be taken many ways) 😉

I’m all for embedding the checksum in the identifier, but it is surprising to me that DCAT2 and Schema.org don’t have a more native concept to express a checksum just as a checksum.

I’m not sure if it wouldn’t be cleaner to use spdx:ChecksumAlgorithm and spdx:Checksum as the property/value pair for the raw checksum, and separately list the hash URI as an associated identifier… I see the appeal of using ni:/// and linking to RFC 6920, but as noted above the ni:/// syntax is somewhat cumbersome from a developer perspective: having a base64-encoded string with optional and non-optional rules about which characters should then be percent-encoded is a bit tricky, and means that more than one valid string can be used for the same identifier. Having an RFC specification for the hash URI would be a nice resolution to all of this…
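
For comparison, the two identifier forms discussed here can be derived from the same digest. This is a sketch under assumptions: the function names are hypothetical, the hash URI follows the `hash://sha256/<hex>` layout used earlier in the thread, and the ni form follows the basic RFC 6920 shape (algorithm name `sha-256`, base64url digest with padding removed) without any authority or query parameters.

```python
import base64
import hashlib


def hash_uri(data: bytes) -> str:
    """hash://sha256/<lowercase hex digest>, as used in the examples in this thread."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def ni_uri(data: bytes) -> str:
    """ni:///sha-256;<base64url digest without '=' padding>, basic RFC 6920 form."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "ni:///sha-256;" + encoded


content = b"example file contents"
hex_form = hash_uri(content)
ni_form = ni_uri(content)
```

The hex form can be compared directly with the output of common command-line checksum tools, whereas the ni form needs a base64url decoding step first, which is part of the developer friction mentioned above.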

Use of schema:identifier is mentioned as a potential solution here: https://github.com/schemaorg/schemaorg/issues/1831