arangojs: Transactions don't play nicely with round-robin load balancing strategy
Currently running with arangojs version 6.14.0, a 3-node cluster, and loadBalancingStrategy: 'ROUND_ROBIN'.
(Aside: yes, we are in the process of upgrading to arangojs version 7.x.)
We have a part of our software which makes use of ArangoDB transactions, like:
```javascript
try {
  const transaction = await db.beginTransaction({
    read: myEdgeCollection,
    write: myEdgeCollection
  });
  const savedEdge = await transaction.run(() => myEdgeCollection.save(thingToSave));
  await transaction.run(() =>
    myEdgeCollection.update({ _key: existingObject._key }, { next: savedEdge._key })
  );
  await transaction.commit();
} catch (arangoError) {
  log.error('Error encountered:', arangoError);
  return Promise.reject(createError(500, 'Internal Server Error'));
}
```
This works most of the time, but intermittently throws errors like this:
Error encountered: ArangoError: transaction '506880805423764' not found
at new ArangoError (/opt/my-service/node_modules/arangojs/lib/async/error.js:71:21)
at Object.resolve (/opt/my-service/node_modules/arangojs/lib/async/connection.js:273:32)
at callback (/opt/my-service/node_modules/arangojs/lib/async/connection.js:139:26)
at IncomingMessage.<anonymous> (/opt/my-service/node_modules/arangojs/lib/async/util/request.node.js:78:21)
at IncomingMessage.emit (events.js:327:22)
at endReadableNT (_stream_readable.js:1221:12)
at /opt/my-service/node_modules/async-listener/glue.js:188:31
at processTicksAndRejections (internal/process/task_queues.js:84:21) {
isArangoError: true,
response: IncomingMessage {
_readableState: ReadableState {
objectMode: false,
highWaterMark: 16384,
buffer: BufferList { head: null, tail: null, length: 0 },
length: 0,
pipes: null,
pipesCount: 0,
flowing: true,
ended: true,
endEmitted: true,
reading: false,
sync: true,
needReadable: false,
emittedReadable: false,
readableListening: false,
resumeScheduled: false,
emitClose: true,
autoDestroy: false,
destroyed: false,
defaultEncoding: 'utf8',
awaitDrainWriters: null,
multiAwaitDrain: false,
readingMore: true,
decoder: null,
encoding: null,
[Symbol(kPaused)]: false
},
readable: false,
_events: [Object: null prototype] { end: [Array], data: [Function] },
_eventsCount: 2,
_maxListeners: undefined,
socket: Socket {
connecting: false,
_hadError: false,
_parent: null,
_host: null,
_readableState: [ReadableState],
readable: true,
_events: [Object: null prototype],
_eventsCount: 6,
_maxListeners: undefined,
_writableState: [WritableState],
writable: true,
allowHalfOpen: false,
_sockname: null,
_pendingData: null,
_pendingEncoding: '',
server: null,
_server: null,
parser: null,
_httpMessage: null,
timeout: 0,
[Symbol(asyncId)]: -1,
[Symbol(kHandle)]: [TCP],
[Symbol(kSetNoDelay)]: false,
[Symbol(lastWriteQueueSize)]: 0,
[Symbol(timeout)]: null,
[Symbol(kBuffer)]: null,
[Symbol(kBufferCb)]: null,
[Symbol(kBufferGen)]: null,
[Symbol(kCapture)]: false,
[Symbol(kBytesRead)]: 0,
[Symbol(kBytesWritten)]: 0
},
connection: Socket {
connecting: false,
_hadError: false,
_parent: null,
_host: null,
_readableState: [ReadableState],
readable: true,
_events: [Object: null prototype],
_eventsCount: 6,
_maxListeners: undefined,
_writableState: [WritableState],
writable: true,
allowHalfOpen: false,
_sockname: null,
_pendingData: null,
_pendingEncoding: '',
server: null,
_server: null,
parser: null,
_httpMessage: null,
timeout: 0,
[Symbol(asyncId)]: -1,
[Symbol(kHandle)]: [TCP],
[Symbol(kSetNoDelay)]: false,
[Symbol(lastWriteQueueSize)]: 0,
[Symbol(timeout)]: null,
[Symbol(kBuffer)]: null,
[Symbol(kBufferCb)]: null,
[Symbol(kBufferGen)]: null,
[Symbol(kCapture)]: false,
[Symbol(kBytesRead)]: 0,
[Symbol(kBytesWritten)]: 0
},
httpVersionMajor: 1,
httpVersionMinor: 1,
httpVersion: '1.1',
complete: true,
headers: {
'x-content-type-options': 'nosniff',
server: 'ArangoDB',
connection: 'Keep-Alive',
'content-type': 'application/json; charset=utf-8',
'content-length': '98'
},
rawHeaders: [
'X-Content-Type-Options',
'nosniff',
'Server',
'ArangoDB',
'Connection',
'Keep-Alive',
'Content-Type',
'application/json; charset=utf-8',
'Content-Length',
'98'
],
trailers: {},
rawTrailers: [],
aborted: false,
upgrade: false,
url: '',
method: null,
statusCode: 404,
statusMessage: 'Not Found',
client: Socket {
connecting: false,
_hadError: false,
_parent: null,
_host: null,
_readableState: [ReadableState],
readable: true,
_events: [Object: null prototype],
_eventsCount: 6,
_maxListeners: undefined,
_writableState: [WritableState],
writable: true,
allowHalfOpen: false,
_sockname: null,
_pendingData: null,
_pendingEncoding: '',
server: null,
_server: null,
parser: null,
_httpMessage: null,
timeout: 0,
[Symbol(asyncId)]: -1,
[Symbol(kHandle)]: [TCP],
[Symbol(kSetNoDelay)]: false,
[Symbol(lastWriteQueueSize)]: 0,
[Symbol(timeout)]: null,
[Symbol(kBuffer)]: null,
[Symbol(kBufferCb)]: null,
[Symbol(kBufferGen)]: null,
[Symbol(kCapture)]: false,
[Symbol(kBytesRead)]: 0,
[Symbol(kBytesWritten)]: 0
},
_consuming: false,
_dumped: false,
req: ClientRequest {
_events: [Object: null prototype],
_eventsCount: 3,
_maxListeners: undefined,
outputData: [],
outputSize: 0,
writable: true,
_last: false,
chunkedEncoding: false,
shouldKeepAlive: true,
useChunkedEncodingByDefault: true,
sendDate: false,
_removedConnection: false,
_removedContLen: false,
_removedTE: false,
_contentLength: null,
_hasBody: true,
_trailer: '',
finished: true,
_headerSent: true,
socket: [Socket],
connection: [Socket],
_header: 'PATCH /_db/my-db/_api/document/my_edge_collection/1482195603? HTTP/1.1\r\n' +
'authorization: Basic ZmxleDpmbGV4\r\n' +
'content-type: application/json\r\n' +
'x-arango-version: 30000\r\n' +
'x-arango-trx-id: 506880805423764\r\n' +
'content-length: 21\r\n' +
'Host: 10.20.15.184:7001\r\n' +
'Connection: keep-alive\r\n' +
'\r\n',
_onPendingData: [Function: noopPendingOutput],
agent: [Agent],
socketPath: undefined,
method: 'PATCH',
insecureHTTPParser: undefined,
path: '/_db/my-db/_api/document/my_edge_collection/1482195603?',
_ended: true,
res: [Circular],
aborted: false,
timeoutCb: null,
upgradeOrConnect: false,
parser: null,
maxHeadersCount: null,
reusedSocket: true,
onSocket: [Function],
[Symbol(kCapture)]: false,
[Symbol(kNeedDrain)]: false,
[Symbol(corked)]: 0,
[Symbol(kOutHeaders)]: [Object: null prototype]
},
request: ClientRequest {
_events: [Object: null prototype],
_eventsCount: 3,
_maxListeners: undefined,
outputData: [],
outputSize: 0,
writable: true,
_last: false,
chunkedEncoding: false,
shouldKeepAlive: true,
useChunkedEncodingByDefault: true,
sendDate: false,
_removedConnection: false,
_removedContLen: false,
_removedTE: false,
_contentLength: null,
_hasBody: true,
_trailer: '',
finished: true,
_headerSent: true,
socket: [Socket],
connection: [Socket],
_header: 'PATCH /_db/my-dbh/_api/document/my_edge_collection/1482195603? HTTP/1.1\r\n' +
'authorization: Basic ZmxleDpmbGV4\r\n' +
'content-type: application/json\r\n' +
'x-arango-version: 30000\r\n' +
'x-arango-trx-id: 506880805423764\r\n' +
'content-length: 21\r\n' +
'Host: 10.20.15.184:7001\r\n' +
'Connection: keep-alive\r\n' +
'\r\n',
_onPendingData: [Function: noopPendingOutput],
agent: [Agent],
socketPath: undefined,
method: 'PATCH',
insecureHTTPParser: undefined,
path: '/_db/my-db/_api/document/my_edge_collection/1482195603?',
_ended: true,
res: [Circular],
aborted: false,
timeoutCb: null,
upgradeOrConnect: false,
parser: null,
maxHeadersCount: null,
reusedSocket: true,
onSocket: [Function],
[Symbol(kCapture)]: false,
[Symbol(kNeedDrain)]: false,
[Symbol(corked)]: 0,
[Symbol(kOutHeaders)]: [Object: null prototype]
},
body: {
code: 404,
error: true,
errorMessage: "transaction '506880805423764' not found",
errorNum: 1655
},
arangojsHostId: 2,
[Symbol(kCapture)]: false
},
statusCode: 404,
errorNum: 1655,
code: 404
}
It seems to me that what is likely happening is:
- the transaction is being created on one node in the cluster (say node 1)
- the next operation, referencing the transaction, is sent to another node (say node 2)
- the existence of the transaction has not yet been replicated to node 2, so the error is thrown.
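To make the suspected interleaving concrete, here is a toy simulation (invented names, not real ArangoDB or arangojs code) in which each "node" only knows about transactions begun on it, and the client dispatches requests round-robin:

```javascript
// Toy model of the suspected race. Each FakeNode only knows about
// transactions begun on it; the client round-robins requests across nodes,
// so a follow-up request can land on a node that never saw the begin.
class FakeNode {
  constructor(id) {
    this.id = id;
    this.transactions = new Set();
  }
  beginTransaction(trxId) {
    this.transactions.add(trxId);
  }
  runInTransaction(trxId) {
    if (!this.transactions.has(trxId)) {
      throw new Error(`transaction '${trxId}' not found`);
    }
    return 'ok';
  }
}

class RoundRobinClient {
  constructor(nodes) {
    this.nodes = nodes;
    this.next = 0;
  }
  pick() {
    const node = this.nodes[this.next];
    this.next = (this.next + 1) % this.nodes.length;
    return node;
  }
}

const client = new RoundRobinClient([new FakeNode(1), new FakeNode(2), new FakeNode(3)]);
client.pick().beginTransaction('506880805423764'); // lands on node 1
// The next request would land on node 2, which has never heard of it:
// client.pick().runInTransaction('506880805423764'); // throws "transaction ... not found"
```

This is only a model of the hypothesis above, not a claim about ArangoDB internals, but it reproduces the observed 404 shape exactly.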
This seems to me like a fundamental issue with either ArangoDB itself or the ArangoJS library. As far as I can tell, the only options we have right now are:
- sleep after creating a transaction (yuck)
- retry every transaction operation in case of this issue (yuck)
- switch to another load balancing strategy (not ideal; we chose round robin for a reason).
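For anyone stuck on the retry option in the meantime, a minimal sketch of what that could look like is below. The helper name is invented; it relies only on the isArangoError and errorNum fields visible in the error dump above (1655 = "transaction not found"):

```javascript
// Hypothetical stopgap: retry an operation that may hit a coordinator
// that does not know about the transaction. Retries only on ArangoDB
// error 1655 ("transaction not found"), as seen in the log above.
const ERROR_TRANSACTION_NOT_FOUND = 1655;

async function retryOnTrxNotFound(operation, { attempts = 3, delayMs = 50 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (err) {
      if (!err || !err.isArangoError || err.errorNum !== ERROR_TRANSACTION_NOT_FOUND) {
        throw err; // unrelated error: do not retry
      }
      lastError = err;
      // Back off a little longer on each attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
    }
  }
  throw lastError;
}
```

Usage would look like `await retryOnTrxNotFound(() => transaction.run(() => myEdgeCollection.save(thingToSave)))`. Note this only papers over the race; it does not fix it.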
I’d love to learn of a different way of dealing with this, so I look forward to hearing your thoughts. Many thanks for taking the time to read this issue.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 33 (20 by maintainers)
@radicaled I was able to solve the deadlock problem via the idea I hinted at. Namely, I’ve introduced a priority queue for transaction-related requests enqueued while the main queue is at capacity. To ensure there’s always room for these priority requests, arangojs now reserves at least one socket for them (unless agentOptions.maxTotalSockets is set to 1). This change is currently in a dedicated branch and not in the main branch.
This results in the test passing in standalone mode as well as when using round-robin. However, in round-robin we were able to observe that every coordinator returns the first "begin", i.e. there will always be three conflicting transactions running in parallel. While the tests did initially pass, we were able to provoke a failure by introducing a trivial delay in each transaction, resulting in timeouts and dropped transactions, as expected.
The ArangoDB core team is investigating this behavior. I was able to reproduce it for N coordinators (where N > 1) in round-robin with as few as 2N-1 concurrent transactions. This does not explain why individual transactions would randomly fail when there are no other concurrent transactions, but it suggests there is a problem with the behavior of exclusive transactions in multi-coordinator scenarios.
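The reserved-socket idea can be sketched roughly like this. This is a purely illustrative toy (class and method names are invented; it is not arangojs's actual implementation): a dispatcher with maxSockets slots lets normal requests use at most maxSockets - 1 of them, keeping one slot free for priority (transaction-related) tasks.

```javascript
// Illustrative sketch (invented names, not arangojs internals): a dispatcher
// that reserves one of `maxSockets` slots for priority tasks, so they can
// proceed even when normal traffic saturates the pool.
class ReservedSlotDispatcher {
  constructor(maxSockets) {
    this.maxSockets = maxSockets;
    this.active = 0;
    this.normalQueue = [];
    this.priorityQueue = [];
  }

  run(task, { priority = false } = {}) {
    return new Promise((resolve, reject) => {
      (priority ? this.priorityQueue : this.normalQueue).push({ task, resolve, reject });
      this._drain();
    });
  }

  _drain() {
    while (this.active < this.maxSockets) {
      // Normal tasks may only use maxSockets - 1 slots; the last slot is
      // reserved for priority tasks (unless there is only one slot total).
      const normalAllowed = this.maxSockets === 1 ? 1 : this.maxSockets - 1;
      let entry;
      if (this.priorityQueue.length > 0) {
        entry = this.priorityQueue.shift();
      } else if (this.normalQueue.length > 0 && this.active < normalAllowed) {
        entry = this.normalQueue.shift();
      } else {
        return; // nothing runnable right now
      }
      this.active++;
      Promise.resolve()
        .then(entry.task)
        .then(entry.resolve, entry.reject)
        .finally(() => {
          this.active--;
          this._drain();
        });
    }
  }
}
```

The point of the design is that a priority task submitted while normal tasks are queued never waits behind them, which is the property the deadlock fix described above needs.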
@radicaled I was able to reproduce the problem as above and added a branch with a test case, which @dothebart is investigating.
@oliverlockwood It would explain the scenario for two transactional operations at a time if there are other unrelated requests being processed at the same time. I’m unable to reproduce the problem of a streaming transaction simply failing in a round robin setup in isolation (i.e. with no other transaction getting in the way).
@pluma
While the reproduction sample I provided is high-concurrency, the problem first occurred in my development environment under very low traffic: there were fewer than 5 users active when the problem was first revealed. Afterward I was able to reproduce it sporadically locally myself, as a single user, but "sporadic" isn’t a good enough test case for a GitHub issue. The real-life transaction that first brought this issue to my attention was a fairly simple 1-read, 1-write exclusive transaction, under very low traffic, with very few possible concurrent transactions pending via arangojs.
Since these transactions are fine when done via server-side transactions (and also blazing fast), it feels like this is a problem with streaming transactions in general. Even in the high-concurrency example, I would expect arangojs or ArangoDB to process them sequentially until the queue is drained or some arbitrary timeout hits. And given that the example reproduction is a single collection write, I wouldn’t expect any timeouts.
So, I guess I’m asking: what’s the end analysis here? Can arangojs only queue up 1-2 exclusive streaming transactions at a time, and is it expected that all other transaction attempts submitted will fail with a 30-second timeout?
On the subject of alternative approaches:
We require an exclusive lock on the collection to avoid phantom reads, and this operation happens as a result of a client interfacing with the system via a UI, so creating an entirely new sequential job pipeline to run exclusive transactions is overkill and may not even work for certain client operations without a large change to how we communicate with those clients. We don’t foresee any scaling problems given the circumstances in which these exclusive transactions are used.
On the subject of server-side transactions:
The server-side version of our transactions works fine and is incredibly fast, well within tolerance, and as expected given how simple our queries are: mostly reads and usually only 1 write. The pain this issue is causing us is an inability to reuse code: we’re living in TypeScript land over here, with a considerable amount of pre-existing code for querying and transforming data. Not only do we lose that with server-side transactions; they’re also effectively a second copy of some elements of our business logic, so we have to maintain two separate versions of the same logic that are implicitly tied together.
Yes, that was what I was asking for. I.e., cursors live on one coordinator. If you hit the wrong one, it has to forward your request. If it’s not able to reach the other coordinator (either by not knowing the right name or due to TCP connectivity issues) it won’t be able to forward it.