lambda-streams-to-firehose: Duplicated records

Hello @IanMeyers and others

I’m getting random duplicate rows. For the last day:

"id","count"
2535282,2
2543816,2
2543817,2
2549680,2
2549679,2
2535281,2
2555470,2
2565819,2

I can see the duplicate entries in the S3 file, but not when reading the Kinesis stream content.

I’m trying to investigate.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 17 (13 by maintainers)

Most upvoted comments

The other thing to consider is that you’ll get relatively few duplicates while running normally, but the two places I see large numbers of duplicates are:

  1. Lambda/KCL failure - if the consumer reading from your Kinesis stream happens to process part of a batch, transmit it, and then die or crash before it checkpoints, you can end up with a chunk of duplicates when it starts up again (see the sketch after this list).

  2. Backfilling - if you ever need to replay old data (recovering from a crash or user error, for instance), having a system that can drop duplicates makes the process MUCH easier. You can just replay data from the last N hours or days and not worry about introducing even more duplicates.
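To make failure mode #1 concrete, here’s a minimal sketch of a consumer loop. The `forward()` and `save_checkpoint()` callbacks are hypothetical placeholders standing in for the real KCL/Lambda plumbing, not an actual API; the point is only the ordering of the two steps:

```python
# Minimal sketch of at-least-once delivery from a Kinesis consumer.
# forward() and save_checkpoint() are hypothetical placeholders.

def process_batch(records, forward, save_checkpoint):
    for record in records:
        forward(record)  # the record reaches Firehose/S3 here

    # If the process dies before this line runs, the checkpoint still points
    # at the start of the batch, so every record above is delivered again
    # when the consumer restarts; those replays are your duplicates.
    save_checkpoint(records[-1]["sequence_number"])
```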

That being said, you’re right, I don’t think Firehose is a great fit for this at the moment; it’s a lot of effort for potentially not much gain, like you said. If you really want to use Firehose to write all the way to Redshift, then you’ll either have to do something complicated like you mentioned above or live with the dupes. I’ll try to find the actual SQL I use tomorrow to compare against what you posted above, but yours looks about like what I’d expect, with the SELECT DISTINCT and the LEFT JOIN on id (which I assume is your unique ID).
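In the meantime, here’s roughly the shape of the staging-table merge being discussed (SELECT DISTINCT plus a LEFT JOIN on id), wrapped in Python. The table names `events` / `events_staging`, the `id` column, and the psycopg2 connection are all assumptions to adapt to your own schema, not anything this project provides:

```python
import psycopg2  # any Redshift-compatible Postgres driver works

# Assumed schema: Firehose/COPY loads into "events_staging"; deduplicated
# rows are merged into "events", whose "id" column is the unique identifier.
MERGE_SQL = """
INSERT INTO events
SELECT DISTINCT s.*
FROM events_staging s
LEFT JOIN events e ON e.id = s.id
WHERE e.id IS NULL  -- skip ids that are already in the target table
"""

def merge_staging(dsn):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(MERGE_SQL)
            # DELETE keeps everything in one transaction; TRUNCATE would be
            # faster but commits implicitly in Redshift.
            cur.execute("DELETE FROM events_staging")
        conn.commit()
    finally:
        conn.close()
```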

If you’re not concerned about #2 above (i.e. you only care about duplicates that happen within a limited time window), you could also consider using the new Firehose embedded Lambda function feature (https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html) to do some clever deduping. Maybe you could have it store the last hour’s or day’s worth of unique event IDs it has seen in a DynamoDB table and try to dedupe that way. I haven’t tried it… just an idea.
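A rough sketch of what that transformation Lambda might look like. It assumes a DynamoDB table named `seen-event-ids` with partition key `event_id` and a TTL attribute `expires_at`, and that each record’s JSON payload carries a unique `id` field; those names are placeholders, and I haven’t run this against a real delivery stream:

```python
import base64
import json
import time

import boto3

# Hypothetical DynamoDB table used to remember recently seen event ids.
table = boto3.resource("dynamodb").Table("seen-event-ids")
DEDUP_WINDOW_SECONDS = 24 * 60 * 60  # remember ids for roughly one day


def handler(event, context):
    """Firehose data-transformation handler that drops duplicate records."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        event_id = str(payload["id"])
        try:
            # Conditional put fails if this id was already written within the
            # TTL window, which marks the record as a duplicate to drop.
            table.put_item(
                Item={
                    "event_id": event_id,
                    "expires_at": int(time.time()) + DEDUP_WINDOW_SECONDS,
                },
                ConditionExpression="attribute_not_exists(event_id)",
            )
            result = "Ok"
        except table.meta.client.exceptions.ConditionalCheckFailedException:
            result = "Dropped"  # duplicate within the window
        output.append(
            {
                "recordId": record["recordId"],
                "result": result,
                "data": record["data"],
            }
        )
    return {"records": output}
```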

Good luck!