quadstore: BACKEND: Blank nodes should not match existing data
When inserting data containing blank nodes, the blank subject or object is stored verbatim with the same blank node identifier as the input. This breaks the requirement that blank nodes are scoped to the input document. For example (I tried adding this as a unit test in quadstore.prototype.put.js):
it('should not re-use blank nodes', async function () {
  const { dataFactory, store } = this;
  await store.put(dataFactory.quad(
    dataFactory.blankNode('_:s'),
    dataFactory.namedNode('ex://p'),
    dataFactory.namedNode('ex://o'),
    dataFactory.namedNode('ex://g'),
  ));
  await store.put(dataFactory.quad(
    dataFactory.blankNode('_:s'),
    dataFactory.namedNode('ex://p'),
    dataFactory.namedNode('ex://o'),
    dataFactory.namedNode('ex://g'),
  ));
  const { items: foundQuads } = await store.get({});
  should(foundQuads).have.length(2);
});
This test fails because the two invocations of put are using the same blank node label. Instead, they should result in different quads with disjoint subjects.
For more complex examples, such as lists, the accidental re-use of blank node identifiers (for example after a re-start) could badly affect data integrity.
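As an illustration of the behaviour requested above, here is a minimal sketch of per-document relabeling. It is not quadstore's API: BlankNodeScope, freshLabel, and the plain-object terms are hypothetical stand-ins for RDF/JS terms, and a real implementation would probably generate labels with something like nanoid rather than the helper shown here.

```javascript
// Hypothetical label generator (an assumption for this sketch).
// Real code might use nanoid or crypto.randomUUID() instead.
const freshLabel = () =>
  `b${Date.now().toString(36)}${Math.random().toString(36).slice(2, 10)}`;

// A scope remembers which fresh label replaced each input label, so a label
// repeated within one input document keeps referring to the same node.
class BlankNodeScope {
  constructor(generate = freshLabel) {
    this.generate = generate;
    this.labels = new Map(); // original label -> fresh label
  }

  // Replace a blank node with its scoped counterpart; other terms pass through.
  relabel(term) {
    if (term.termType !== 'BlankNode') return term;
    if (!this.labels.has(term.value)) {
      this.labels.set(term.value, this.generate());
    }
    return { termType: 'BlankNode', value: this.labels.get(term.value) };
  }

  // Relabel every position of a quad that may hold a blank node.
  relabelQuad(quad) {
    return {
      subject: this.relabel(quad.subject),
      predicate: quad.predicate,
      object: this.relabel(quad.object),
      graph: this.relabel(quad.graph),
    };
  }
}
```

Creating one BlankNodeScope per put call (or per imported document) and passing each quad through relabelQuad before writing would give the two quads in the test above different subjects, while two quads sharing a label inside the same document would still share a subject.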
About this issue
- State: closed
- Created 4 years ago
- Comments: 23 (19 by maintainers)
Published in version 7.3.0
Hi all!
@gsvarovsky:
A nanoid-labeled blank node is still significantly smaller than the average named node and seems to be comparable to shortened named nodes when using prefixes. I don’t think slightly longer blank nodes are likely to become an issue on their own, except as part of a bigger issue related to the comparatively low quad/MB ratio that can be achieved using quadstore’s indexing strategy.
I do agree that the default behavior is not correct, but it’s also simple to maintain, easily understood and easily extendable. Furthermore, I suspect that it matches expectations of how a low-level RDF/JS library should work, as per @rubensworks’ comment. I think that forcing a scope when none is provided would break a lot of assumptions, both spoken and unspoken.
I agree in principle but I can’t come up with a sane way to do this without adding unreasonable amounts of complexity.
At what point should a scope be persisted? For example, imagine we’re import-ing a stream. The scope would have to be persisted to disk whenever its internal cache is updated, which would mean serializing its entire cache quite frequently… Actually, now that I think of it, the scope could be persisted to disk incrementally, with each newly-cached blank node persisted in the (K, V) form (scope-<scopeId>-<originalLabel>, <newLabel>) in the first batch operation that contains it.
In any case, preWrite should make this relatively easy, although I suspect that persist-able scopes would benefit from a (much) more integrated API.
Does it even make sense to provide scoping support without persist-able scopes?
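The incremental persistence idea above could look roughly like the following sketch. Everything in it is an assumption made for illustration: PersistedScope and commit are hypothetical names, a plain Map stands in for the underlying key-value store (a real store would be asynchronous), and the batch entries are simple { type, key, value } objects rather than any particular library's batch format.

```javascript
// Hypothetical persist-able scope: each newly seen label is appended to the
// current write batch as (scope-<scopeId>-<originalLabel>, <newLabel>).
class PersistedScope {
  constructor(scopeId, kv, generate) {
    this.scopeId = scopeId;
    this.kv = kv;             // stand-in for the backing key-value store
    this.generate = generate; // label generator, e.g. nanoid in real code
    this.cache = new Map();   // in-memory cache: original label -> new label
  }

  keyFor(originalLabel) {
    return `scope-${this.scopeId}-${originalLabel}`;
  }

  // Returns the scoped label. A put operation is added to `batch` only the
  // first time a label is seen, so the mapping is written together with the
  // first batch of quads that uses it.
  relabel(originalLabel, batch) {
    let newLabel = this.cache.get(originalLabel);
    if (newLabel === undefined) {
      const key = this.keyFor(originalLabel);
      newLabel = this.kv.get(key); // mapping persisted by an earlier session?
      if (newLabel === undefined) {
        newLabel = this.generate();
        batch.push({ type: 'put', key, value: newLabel });
      }
      this.cache.set(originalLabel, newLabel);
    }
    return newLabel;
  }
}

// Applies a batch to the stand-in store. In a real store this would be the
// same atomic batch write that also contains the quads themselves.
function commit(kv, batch) {
  for (const op of batch) kv.set(op.key, op.value);
}
```

With this shape, a scope that is re-created with the same scopeId after a restart finds the earlier mappings in the store and keeps handing out the same new labels, without ever serializing the whole cache at once. The sketch does not handle a batch that fails to commit, in which case the in-memory cache would need to be rolled back.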
I think this is a valuable suggestion, @namedgraph. It could be that scoping is simply too dependent on each specific use-case to be easily implemented in a low-level library such as quadstore.
True, we only need to remember original labels insofar as we’re looking for them while performing further write operations. We don’t need to store them in the quads themselves, as returning them could lead to the very collisions we’re trying to avoid.