quadstore: BACKEND: Blank nodes should not match existing data

When inserting data containing blank nodes, the blank subject or object is stored verbatim, with the same blank node identifier as in the input. This breaks the requirement that blank nodes be scoped to the input document. For example (I tried adding this as a unit test in quadstore.prototype.put.js):

    it('should not re-use blank nodes', async function () {
      const { dataFactory, store } = this;
      await store.put(dataFactory.quad(
        dataFactory.blankNode('_:s'),
        dataFactory.namedNode('ex://p'),
        dataFactory.namedNode('ex://o'),
        dataFactory.namedNode('ex://g'),
      ));
      await store.put(dataFactory.quad(
        dataFactory.blankNode('_:s'),
        dataFactory.namedNode('ex://p'),
        dataFactory.namedNode('ex://o'),
        dataFactory.namedNode('ex://g'),
      ));
      const { items: foundQuads } = await store.get({});
      should(foundQuads).have.length(2);
    });

This test fails because the two invocations of put use the same blank node label, so only a single quad ends up in the store. Instead, they should result in two different quads with disjoint blank subjects.

For more complex data, such as lists, the accidental re-use of blank node identifiers (for example after a restart) could badly affect data integrity.
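For illustration, the expected behavior amounts to relabeling blank nodes per write operation: every incoming blank node label is swapped for a freshly generated one before the quad reaches the indexes, with the mapping discarded at the end of the operation. A minimal sketch against the RDF/JS DataFactory interface; relabelTerm, relabelQuad and mapping are illustrative names, not part of quadstore's API:

    function relabelTerm(dataFactory, term, mapping) {
      if (term.termType !== 'BlankNode') return term;
      if (!mapping.has(term.value)) {
        // blankNode() with no argument yields a new unique identifier (RDF/JS spec)
        mapping.set(term.value, dataFactory.blankNode());
      }
      return mapping.get(term.value);
    }

    function relabelQuad(dataFactory, quad, mapping) {
      return dataFactory.quad(
        relabelTerm(dataFactory, quad.subject, mapping),
        quad.predicate,
        relabelTerm(dataFactory, quad.object, mapping),
        relabelTerm(dataFactory, quad.graph, mapping),
      );
    }

With a fresh mapping per put call, the two quads in the test above would receive distinct internal subjects and the final assertion of two stored quads would hold.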

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (19 by maintainers)

Most upvoted comments

Published in version 7.3.0!

Hi all!

@gsvarovsky:

  1. Use an incrementing integer, stored as a plain key-value and updated with every batch, to generate internal blank node identifiers.

A nanoid-labeled blank node is still significantly smaller than the average named node and seems comparable to shortened named nodes when prefixes are used. I don't think slightly longer blank node labels are likely to become an issue on their own, except as part of a bigger issue related to the comparatively low quad/MB ratio that can be achieved with quadstore's indexing strategy.
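For comparison, here is a rough sketch of the two labeling strategies being weighed, an incrementing counter versus nanoid; the df- prefix and function names are made up for illustration and neither snippet reflects quadstore's actual implementation:

    import { nanoid } from 'nanoid';

    // Strategy 1: incrementing integer, kept as a plain key-value entry in the
    // store and bumped with every batch (persistence omitted in this sketch).
    let counter = 0;
    const nextCounterLabel = () => `df-${counter++}`;   // e.g. "df-42"

    // Strategy 2: random nanoid labels, no shared counter to persist.
    const nextNanoidLabel = () => `df-${nanoid()}`;     // ~21 extra chars per label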

  2. Apply a scope to the regular write methods if one is not provided – needs a small change in the API, I think.

I do agree that the default behavior is not correct, but it's also simple to maintain, easily understood, and easily extended. Furthermore, I suspect that it matches expectations of how a low-level RDF/JS library should work, as per @rubensworks' comment. I think that forcing a scope when none is provided would break a lot of assumptions, both spoken and unspoken.

  3. Apply a scope to read methods too, which generates new blank nodes that have no correspondence with internal ones – in other words, hide internal blank nodes completely.

I agree in principle but I can’t come up with a sane way to do this without adding unreasonable amounts of complexity.

  4. Make scopes themselves persist-able (optionally) so that worriers like me can safely recover from crashes.

At what point should a scope be persisted? For example, imagine we're importing a stream. The scope would have to be persisted to disk whenever its internal cache is updated, which would mean serializing its entire cache quite frequently… Actually, now that I think of it, the scope could be persisted to disk incrementally, with each newly-cached blank node persisted in the (K, V) form (scope-<scopeId>-<originalLabel>, <newLabel>) as part of the first batch operation that contains it.
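A rough sketch of that incremental idea, assuming LevelDB-style batch operations; scopeEntriesForBatch and the key layout are illustrative, not quadstore's actual preWrite API:

    // Turn the blank nodes first seen in this batch into key-value put
    // operations, so the mapping is committed atomically with the quads
    // that rely on it.
    function scopeEntriesForBatch(scopeId, newlyCached) {
      // newlyCached: Map<originalLabel, newLabel> discovered during this batch
      return [...newlyCached].map(([originalLabel, newLabel]) => ({
        type: 'put',
        key: `scope-${scopeId}-${originalLabel}`,
        value: newLabel,
      }));
    }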

In any case, preWrite should make this relatively easy, although I suspect that persist-able scopes would benefit from a (much) more integrated API.

    const scopeId = await store.createScope(); // inits a new scope
    // or: const scopeId = await store.loadScope('some-id'); // re-hydrates a previously-created scope
    await store.putStream(stream, { scope: scopeId }); // updates the scope with each new blank node
    await store.multiPut(quads, { scope: scopeId }); // updates the scope with each new blank node
    await store.deleteScope(scopeId); // drops the scope

Does it even make sense to provide scoping support without persist-able scopes?

TL;DR - don’t like how bnodes work - don’t use bnodes 😃

I think this is a valuable suggestion, @namedgraph. It could be that scoping is simply too dependent on each specific use-case to be easily implemented in a low-level library such as quadstore.

True, we only need to remember original labels insofar as we need to look them up while performing further write operations. We don't need to store them in order to return them, as returning them could lead to the very collisions we're trying to avoid.
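Put differently, the scope's cache is only ever consulted on the write path, and reads never translate back. A minimal sketch with illustrative names:

    // Write path: reuse the label this scope already generated for the
    // original one, or mint a fresh label and remember it.
    function internalBlankNode(scope, originalLabel, dataFactory) {
      let newLabel = scope.cache.get(originalLabel);
      if (!newLabel) {
        newLabel = dataFactory.blankNode().value; // fresh, unique label
        scope.cache.set(originalLabel, newLabel);
      }
      return dataFactory.blankNode(newLabel);
    }
    // Read path: internal blank nodes are returned as-is; original labels are
    // never reconstructed, which is exactly what avoids the collisions above.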