seqkit: Idea: Robust FASTA/Q checksum
Prerequisites
- make sure you’re are using the latest version by
seqkit version - read the usage
Describe your issue
- describe the problem
- provide a reproducible example
Hi there, thanks for writing Seqkit. I’ve found it really useful for my research.
Maybe you’re like me, where you have tons of very similar FASTA/Q files, and it can be a challenge to determine which files are the same in terms of sequence content, and which ones are different.
For example, file-based checksums like md5sum, are insufficient., as the FASTA/FASTQ record IDs/descriptions can vary in a trivial way (e.g. scaffold000001 vs scf1). Or, just the order of the record/scaffolds within the FASTA/Q file can be changed, which is inconsequential for most analyses (e.g. mapping or assemlbly). Similarly, seqkit stat only provides high level summary statistics & can’t detect subtle changes to the file.
As a solution, I’ve made a FASTA/Q record & sequence based checksum using Seqkit. See below for a Bash function implementing this:
function faschk() {
MD5CHK=$(cat $1 | openssl md5 | cut -f 2 -d " " | cut -c1-4)
SEQNUM=$(seqkit stat -T $1 | tail -n 1 | cut -f 4)
IDCHK=$(seqkit seq --quiet -n -i $1 | sort | openssl md5 | cut -f 2 -d " " | cut -c1-8)
SEQCHK=$(seqkit seq --quiet -u $1 | seqkit sort --quiet -s | seqkit seq --quiet -s | openssl md5 | cut -f 2 -d " " | cut -c1-8)
ESEQCHK=$(seqkit sort --quiet -s $1 | seqkit seq --quiet -s | openssl md5 | cut -f 2 -d " " | cut -c1-4)
echo $1 faschk:${SEQNUM}-${MD5CHK}-${IDCHK}-${SEQCHK}-${ESEQCHK}
}
Hopefully what it does is self explanatory from the code. Let me know if any parts need more explanation, but in short, it sorts the sequence or record IDs, and pipes them into an md5sum, making a final summary string which is a concatenation of these smaller checksums.
Here is a real life example: I uploaded some raw read sets to SRA multiple times accidentally, here I can confirm that the sequence content of SRR15608978_1.fastq & SRR15608979.fastq is the same, even if the md5sum & record IDs are different.
SRR15608978_1.fastq faschk:26394840-a5f8-552576a8-b7ebb9f8-b7eb
SRR15608978_2.fastq faschk:26394840-3592-552576a8-e0dfc296-e0df
SRR15608979.fastq faschk:26394840-d80c-198f7632-b7ebb9f8-b7eb
But, this approach is pretty slow, since it is a dumb algorithm that sorts the dataset multiple times. It would be great to have an internal seqkit command which did something similar but ran with better time & memory efficiency. What do you think? I don’t know Go, but maybe with some guidance could learn Go & help implement this.
P.S. I looked briefly but didn’t see implementation of a similar idea in fasta_utilities, fastx_toolkit, pyfaidx , seqmagick, or seqtk.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 26 (23 by maintainers)
Commits related to this issue
- new command: sum. #262 — committed to shenwei356/seqkit by shenwei356 2 years ago
- seqkit sum: support cirular genomes. #262 — committed to shenwei356/seqkit by shenwei356 2 years ago
- seqkit sum: fix bug for real data #262 — committed to shenwei356/seqkit by shenwei356 2 years ago
- seqkit sum: add metadata to the digest #262 — committed to shenwei356/seqkit by shenwei356 2 years ago
- add prefix 'seqkit.' to digest. #262 — committed to shenwei356/seqkit by shenwei356 2 years ago
I use this tool on the Genbank viral collection and find 279 genomes are duplicated.
sum.summary2.tar.gz
Here are the top 10:
Steps
I would leave it to
seqkit fx2tabwhich already support output md5 sum of every sequence, we can make it support circular sequence.Circular genomes supported:
Usage
Examples
A, B, C, D are the same vircular genomes with different starting positions or strands:
Sum of all files (the sequences order does not matter):
Circular genomes (The same genomes with different start positions or in reverse complement strand will not affect the result):