seqkit: Idea: Robust FASTA/Q checksum

Prerequisites

  • make sure you’re are using the latest version by seqkit version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

Hi there, thanks for writing Seqkit. I’ve found it really useful for my research.

Maybe you’re like me, where you have tons of very similar FASTA/Q files, and it can be a challenge to determine which files are the same in terms of sequence content, and which ones are different.

For example, file-based checksums like md5sum, are insufficient., as the FASTA/FASTQ record IDs/descriptions can vary in a trivial way (e.g. scaffold000001 vs scf1). Or, just the order of the record/scaffolds within the FASTA/Q file can be changed, which is inconsequential for most analyses (e.g. mapping or assemlbly). Similarly, seqkit stat only provides high level summary statistics & can’t detect subtle changes to the file.

As a solution, I’ve made a FASTA/Q record & sequence based checksum using Seqkit. See below for a Bash function implementing this:

function faschk() {
MD5CHK=$(cat $1 | openssl md5 | cut -f 2 -d " " | cut -c1-4)
SEQNUM=$(seqkit stat -T $1 | tail -n 1 | cut -f 4)
IDCHK=$(seqkit seq --quiet -n -i $1 | sort | openssl md5 | cut -f 2 -d " " | cut -c1-8)
SEQCHK=$(seqkit seq --quiet -u $1 | seqkit sort --quiet -s | seqkit seq --quiet -s | openssl md5 | cut -f 2 -d " " | cut -c1-8)
ESEQCHK=$(seqkit sort --quiet -s $1 | seqkit seq --quiet -s | openssl md5 | cut -f 2 -d " " | cut -c1-4)
echo $1 faschk:${SEQNUM}-${MD5CHK}-${IDCHK}-${SEQCHK}-${ESEQCHK}
}

Hopefully what it does is self explanatory from the code. Let me know if any parts need more explanation, but in short, it sorts the sequence or record IDs, and pipes them into an md5sum, making a final summary string which is a concatenation of these smaller checksums.

Here is a real life example: I uploaded some raw read sets to SRA multiple times accidentally, here I can confirm that the sequence content of SRR15608978_1.fastq & SRR15608979.fastq is the same, even if the md5sum & record IDs are different.

SRR15608978_1.fastq faschk:26394840-a5f8-552576a8-b7ebb9f8-b7eb
SRR15608978_2.fastq faschk:26394840-3592-552576a8-e0dfc296-e0df
SRR15608979.fastq faschk:26394840-d80c-198f7632-b7ebb9f8-b7eb

But, this approach is pretty slow, since it is a dumb algorithm that sorts the dataset multiple times. It would be great to have an internal seqkit command which did something similar but ran with better time & memory efficiency. What do you think? I don’t know Go, but maybe with some guidance could learn Go & help implement this.

P.S. I looked briefly but didn’t see implementation of a similar idea in fasta_utilities, fastx_toolkit, pyfaidx , seqmagick, or seqtk.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 26 (23 by maintainers)

Commits related to this issue

Most upvoted comments

I use this tool on the Genbank viral collection and find 279 genomes are duplicated.

sum.summary2.tar.gz

Here are the top 10:

species genome-size #genomes files
Flavobacterium phage vB_FspP_elemoA_9-1A 60073 27 GCA_014071075.1 (Flavobacterium phage vB_FspP_elemoA_9-1A); GCA_014071015.1 (Flavobacterium phage vB_FspP_elemoA_8-1B); GCA_014071005.1 (Flavobacterium phage vB_FspP_elemoA_8-1A); GCA_014070725.1 (Flavobacterium phage vB_FspP_elemoA_14-9B); GCA_014070655.1 (Flavobacterium phage vB_FspP_elemoA_13-1E); GCA_014071055.1 (Flavobacterium phage vB_FspP_elemoA_8-9B); GCA_014071025.1 (Flavobacterium phage vB_FspP_elemoA_8-1D); GCA_014070765.1 (Flavobacterium phage vB_FspP_elemoA_15-9B); GCA_014070985.1 (Flavobacterium phage vB_FspP_elemoA_7-9B); GCA_014070855.1 (Flavobacterium phage vB_FspP_elemoA_3-9A); GCA_014070615.1 (Flavobacterium phage vB_FspP_elemoA_12-1B); GCA_014070595.1 (Flavobacterium phage vB_FspP_elemoA_11-9B); GCA_014071035.1 (Flavobacterium phage vB_FspP_elemoA_8-3A); GCA_014070565.1 (Flavobacterium phage vB_FspP_elemoA_10-9A); GCA_014070535.1 (Flavobacterium phage vB_FspP_elemoA_1-9B); GCA_014070775.1 (Flavobacterium phage vB_FspP_elemoA_15-9C); GCA_014070715.1 (Flavobacterium phage vB_FspP_elemoA_14-9A); GCA_014070675.1 (Flavobacterium phage vB_FspP_elemoA_13-3D); GCA_014070665.1 (Flavobacterium phage vB_FspP_elemoA_13-3B); GCA_014070625.1 (Flavobacterium phage vB_FspP_elemoA_12-3A); GCA_014070895.1 (Flavobacterium phage vB_FspP_elemoA_5-9A); GCA_014070835.1 (Flavobacterium phage vB_FspP_elemoA_2-9A); GCA_014070825.1 (Flavobacterium phage vB_FspP_elemoA_2-5G); GCA_014070735.1 (Flavobacterium phage vB_FspP_elemoA_15-3B); GCA_014070875.1 (Flavobacterium phage vB_FspP_elemoA_4-9A); GCA_014070785.1 (Flavobacterium phage vB_FspP_elemoA_2-5A); GCA_014070695.1 (Flavobacterium phage vB_FspP_elemoA_13-9B)
Vibrio phage 1.136.O._10N.261.45.E11 31617 8 GCA_003928295.1 (Vibrio phage 1.136.O._10N.261.45.E11); GCA_003928415.1 (Vibrio phage 1.142.O._10N.261.49.E11); GCA_003926555.1 (Vibrio phage 1.028.O._10N.286.45.B6); GCA_003928615.1 (Vibrio phage 1.156.O._10N.261.45.A6); GCA_003929835.1 (Vibrio phage 1.219.O._10N.261.45.E2); GCA_003931415.1 (Vibrio phage 2.159.B._10N.261.46.F12); GCA_003929815.1 (Vibrio phage 1.217.O._10N.261.45.A1); GCA_003931395.1 (Vibrio phage 2.159.A._10N.261.46.F12)
Vibrio phage 36E38.1 36824 6 GCA_020477115.1 (Vibrio phage 36E38.1); GCA_020483865.1 (Vibrio phage 75E35.1); GCA_020477065.1 (Vibrio phage 34E29.1); GCA_020477405.1 (Vibrio phage 44E38.1); GCA_020477205.1 (Vibrio phage 41E34.2); GCA_020477415.1 (Vibrio phage 44E38.2)
Rheinheimera phage vB_RspM_barba_10-9E 80556 5 GCA_014515185.1 (Rheinheimera phage vB_RspM_barba_10-9E); GCA_014517155.1 (Rheinheimera phage vB_RspM_barba_12-7E); GCA_014517095.1 (Rheinheimera phage vB_RspM_barba_12-6I); GCA_014517035.1 (Rheinheimera phage vB_RspM_barba_12-6C); GCA_014517265.1 (Rheinheimera phage vB_RspM_barba_12-8G)
Rheinheimera phage vB_RspM_barba_12-6H 80348 5 GCA_014517085.1 (Rheinheimera phage vB_RspM_barba_12-6H); GCA_014517065.1 (Rheinheimera phage vB_RspM_barba_12-6F); GCA_014517275.1 (Rheinheimera phage vB_RspM_barba_12-8H); GCA_014517025.1 (Rheinheimera phage vB_RspM_barba_12-6B); GCA_014517305.1 (Rheinheimera phage vB_RspM_barba_12-9B)
Stx2a-converting phage Stx2_PV03-14 65397 5 GCA_019334045.1 (Stx2a-converting phage Stx2_PV03-14); GCA_019334055.1 (Stx2a-converting phage Stx2_08E027); GCA_019333895.1 (Stx2a-converting phage Stx2_579); GCA_019334035.1 (Stx2a-converting phage Stx2_8048); GCA_019334025.1 (Stx2a-converting phage Stx2_15436)
Escherichia phage PA44 62708 5 GCA_002594945.1 (Escherichia phage PA44); GCA_002594885.1 (Escherichia phage PA33); GCA_002594765.1 (Escherichia phage PA18); GCA_002594685.1 (Escherichia phage PA4); GCA_002594845.1 (Escherichia phage PA30)
Corynebacterium phage LGCM-V7 36671 5 GCA_002622225.1 (Corynebacterium phage LGCM-V7); GCA_002622245.1 (Corynebacterium phage LGCM-V8); GCA_002622185.1 (Corynebacterium phage LGCM-V5); GCA_002622205.1 (Corynebacterium phage LGCM-V6); GCA_002622265.1 (Corynebacterium phage LGCM-V9)
Streptococcus satellite phage Javan282 14560 5 GCA_004773155.1 (Streptococcus satellite phage Javan282); GCA_004773135.1 (Streptococcus satellite phage Javan281); GCA_004773175.1 (Streptococcus satellite phage Javan283); GCA_004773115.1 (Streptococcus satellite phage Javan280); GCA_004773075.1 (Streptococcus satellite phage Javan277)

Steps

time seqkit sum -c -a -b --infile-list <(fd gz$ files.renamed/) -j 160 > sum.tsv 2>sum.tsv.log

p=sum

cp $p.tsv $p.tsv.backup

sed 's/.fna.gz//' -i $p.tsv

# add species name
csvtk join -Ht -f '2;1' $p.tsv name.map > $p.tsv.t
mv $p.tsv.t $p.tsv

csvtk replace -Ht -f 2 -p '(.+)' -r '$1 ({kv})' -k name.map $p.tsv \
    | csvtk fold -Ht -f 1 -v 2 \
    | csvtk join -Ht - $p.tsv <(csvtk freq -Ht $p.tsv) \
    | csvtk uniq -Ht -f 1 \
    | csvtk cut -Ht -f 6,5,7,2 \
    | csvtk sort -Ht -k 3:nr -k 2:nr \
    | csvtk add-header -t -n "species,genome-size,#genomes,files" \
    > $p.summary.tsv
    
csvtk filter2 -t -f '$3 > 1' $p.summary.tsv > $p.summary2.tsv

I would leave it to seqkit fx2tab which already support output md5 sum of every sequence, we can make it support circular sequence.

Circular genomes supported:

Usage

compute message digest for all sequences in FASTA/Q files

Attentions:
  1. Sequence headers and qualities are skipped, only sequences matter.
  2. The order of sequences records does not matter.
  3. Circular complete genomes are supported with the flag -c/--circular.
     - The same genomes with different start positions or in reverse
       complement strand will not affect the result.
     - The message digest would change with different values of k-mer size.
  4. Multiple files are processed in parallel (-j/--threads).

Method:
  1. Converting the sequences to low cases, optionally removing gaps (-g).
  2. Computing the hash (xxhash) for all sequences or k-mers of a circular
     complete genome (-c/--circular).
  3. Sorting all hash values, for ignoring the order of sequences.
  4. Computing MD5 digest from the hash values, sequences length, and
     the number of sequences.

Usage:
  seqkit sum [flags]

Flags:
  -a, --all                  all information, including the sequences length and the number of sequences
  -b, --basename             only output basename of files
  -c, --circular             the file contains a single cicular genome sequence
  -G, --gap-letters string   gap letters (default "- \t.")
  -h, --help                 help for sum
  -k, --kmer-size int        k-mer size for processing circular genomes (default 1000)
  -g, --remove-gaps          remove gaps

Examples

A, B, C, D are the same vircular genomes with different starting positions or strands:

$ cat virus-{A,B,C,D}.fasta
>seq
TGGTAGGGAGTTGAGTAGCATGGGTATAGTATAGTGTCATGATGCCAGATTTTAAAAAAA
>seq.revcom
TTTTTTTAAAATCTGGCATCATGACACTATACTATACCCATGCTACTCAACTCCCTACCA
>seq.new-start
GGTAGGGAGTTGAGTAGCATGGGTATAGTATAGTGTCATGATGCCAGATTTTAAAAAAAT
>seq.revcom.new-start
TTTTTTAAAATCTGGCATCATGACACTATACTATACCCATGCTACTCAACTCCCTACCAT

# cat to one file
$ cat virus-{A,B,C,D}.fasta > virues.fasta

# shuffle and rename
$ cat virus-{A,B,C,D}.fasta \
    | seqkit shuffle \
    | seqkit replace -p '.*' -r '{nr}' \
    | tee virues.shuffled.fasta
>1
TTTTTTAAAATCTGGCATCATGACACTATACTATACCCATGCTACTCAACTCCCTACCAT
>2
TGGTAGGGAGTTGAGTAGCATGGGTATAGTATAGTGTCATGATGCCAGATTTTAAAAAAA
>3
GGTAGGGAGTTGAGTAGCATGGGTATAGTATAGTGTCATGATGCCAGATTTTAAAAAAAT
>4
TTTTTTTAAAATCTGGCATCATGACACTATACTATACCCATGCTACTCAACTCCCTACCA

Sum of all files (the sequences order does not matter):

$ seqkit sum viru*.fasta
9bbe0abefc26013dffdde952a6725b17        virues.fasta
9bbe0abefc26013dffdde952a6725b17        virues.shuffled.fasta
176250c8d1cde6c385397df525aa1a94        virus-A.fasta
7a813339f9ae686b376b1df55cd596ca        virus-B.fasta
0fd51028bfbfa85ddbdd2b86ef7bd1c1        virus-C.fasta
88b1d20dd0fe0dbf41c00b075fee4e4e        virus-D.fasta

Circular genomes (The same genomes with different start positions or in reverse complement strand will not affect the result):

$ seqkit sum -c -k 21  virus-*.fasta
a7bc5eca47c14bb9db5568f7c5446b92        virus-A.fasta
a7bc5eca47c14bb9db5568f7c5446b92        virus-B.fasta
a7bc5eca47c14bb9db5568f7c5446b92        virus-C.fasta
a7bc5eca47c14bb9db5568f7c5446b92        virus-D.fasta

$ seqkit sum -c -k 51  virus-*.fasta
90cf26400dc1853e178582dfe8204a3a        virus-A.fasta
90cf26400dc1853e178582dfe8204a3a        virus-B.fasta
90cf26400dc1853e178582dfe8204a3a        virus-C.fasta
90cf26400dc1853e178582dfe8204a3a        virus-D.fasta