amazon-neptune-tools: problems with streaming to kinesis with neptune-export tool
We’re using the neptune-export tool for graph data export and have encountered the following problem:
- if one exports from the Neptune DB cluster using the following config:
```json
{
  "command": "export-pg-from-queries",
  "params": {
    "endpoint": "[your prod endpoint here]",
    "batchSize": 8,
    "concurrency": 8,
    "queriesFile": "sharded_queries/export_queries.json",
    "format": "json",
    "logLevel": "info",
    "maxContentLength": 100000000,
    "serializer": "GRAPHBINARY_V1D0"
  }
}
```
the rows in the resulting files look fine: one per line, with no unexpectedly terminated lines and no garbage at the end of the file (last line).
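For reference, a quick way to sanity-check an export file in this format (one JSON document per line, ending with a complete, newline-terminated record) is a small script along these lines; the function name and file path are illustrative, not part of the tool:

```python
import json

def check_jsonl(path):
    """Verify that every line of an export file is a complete JSON document
    and that the file ends with a newline (i.e. no truncated last record)."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.endswith(b"\n"):
        return False, "file does not end with a newline (possible truncation)"
    for lineno, line in enumerate(data.splitlines(), start=1):
        try:
            json.loads(line.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as e:
            return False, f"line {lineno}: {e}"
    return True, "ok"
```

Run against files in the export dir, a healthy export (as in the first config above) passes; a file truncated mid-record fails on its last line.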
HOWEVER:
- if one specifies the following config for streaming to Kinesis:
```json
{
  "command": "export-pg-from-queries",
  "params": {
    "endpoint": "[your endpoint here]",
    "batchSize": 8,
    "concurrency": 8,
    "queriesFile": "sharded_queries/export_queries.json",
    "format": "json",
    "logLevel": "info",
    "maxContentLength": 100000000,
    "serializer": "GRAPHBINARY_V1D0",
    "streamName": "[stream-name]",
    "output": "stream"
  }
}
```
the data gets sent to the stream, but if you examine the files in the export dir that the Kinesis producer is tailing during the export, you’ll sometimes see rows separated by ‘[’ characters, and the last line will be incompletely written, as if the output was abruptly terminated mid-record.
When reading from the Kinesis stream the data was written to, you’ll encounter byte sequences that are not UTF-8 decodable, and reading the data will fail.
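A minimal sketch of the kind of consumer-side check that surfaces this. The payloads below are illustrative stand-ins; in practice they would be the `Data` field of records fetched from the stream (e.g. via a Kinesis GetRecords call):

```python
def find_undecodable(records):
    """Return (index, error) pairs for record payloads that are not valid UTF-8.

    `records` is a list of raw byte payloads, e.g. the Data field of
    Kinesis records read back from the stream.
    """
    failures = []
    for i, payload in enumerate(records):
        try:
            payload.decode("utf-8")
        except UnicodeDecodeError as e:
            failures.append((i, str(e)))
    return failures

# Illustrative payloads: the second contains stray non-UTF-8 bytes,
# similar to what we observed in records written by the export.
sample = [b'{"id": "v1"}', b'{"id": "v2"}\xe0\x00[', b'{"id": "v3"}']
print(find_undecodable(sample))
```

With a healthy stream this returns an empty list; with the export described above, some records fail to decode.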
Could you please take a look at it? It’d also be nice to have an option to turn off file-system writes completely (no buffering to the export dir, just stream to Kinesis directly).
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (7 by maintainers)
@alxcnt Thanks. I’ve opened an enhancement issue for this: https://github.com/aws/neptune-export/issues/12
Please comment there if you’ve anything else to add. Thanks.
Hi @alxcnt
You need to run the tool with credentials that can access the Neptune Management API for your Neptune cluster. If you’re running it in a different account from your Neptune cluster, the tool won’t be able to access any metadata about the cluster. At the moment there isn’t a way to supply credentials from another account, though we can consider adding that as a feature.
Access to the Management API for inferring the clusterId has been necessary since the end of Nov 2022, when the clusterId inferencing was improved. The clone cluster feature, which is necessary for ensuring a static view of the data for databases undergoing a write workload, has needed access to the Management API for a couple of years.
The Management API client uses the default credentials provider chain. These are the locations where it will look for the relevant credentials: