seqkit: Inconsistency with `seqkit fx2tab` and input fasta identifiers

Prerequisites

make sure you’re are using the latest version by seqkit version
read the usage

Describe your issue

describe the problem
provide a reproducible example

Here’s my seqkit version:

(base) [jespinoz@exp-15-02 EukProt]$ seqkit version
seqkit v2.5.1

Here’s my protein EukProt protein directory:

(base) [jespinoz@exp-15-02 EukProt]$ ls proteins/*.fasta | head
proteins/EP00001_Collodictyon_triciliatum.fasta
proteins/EP00002_Diphylleia_rotans.fasta
proteins/EP00003_Mantamonas_plastica.fasta
proteins/EP00004_Rigifila_ramosa.fasta
proteins/EP00005_Dracoamoeba_jomungandri.fasta
proteins/EP00006_Acanthamoeba_castellanii.fasta
proteins/EP00007_Cunea_sp_BSH-02190019.fasta
proteins/EP00008_Paramoeba_atlantica.fasta
proteins/EP00009_Paramoeba_aestuarina.fasta
proteins/EP00010_Vexillifera_abyssalis.fasta

Here’s my command to get the hash ids for all the sequences:

(base) [jespinoz@exp-15-02 EukProt]$ pv proteins/*.fasta | seqkit seq | seqkit fx2tab -s -n > id_to_hash.tsv
12.8GiB 0:05:05 [43.0MiB/s] [================================================================================================================================================>] 100%

Now I can check how my sequences are in the hash table:

(base) [jespinoz@exp-15-02 EukProt]$ wc -l id_to_hash.tsv
28592257 id_to_hash.tsv

Which agrees with the number of protein sequences:

(base) [jespinoz@exp-15-02 EukProt]$ pv proteins/*.fasta | grep -c "^>"
12.8GiB 0:00:25 [ 523MiB/s] [================================================================================================================================================>] 100%
28592257

However, this protein that is in one of the fasta files:

(base) [jespinoz@exp-15-02 EukProt]$ grep "EP00908_Phaeocystis_cordata_P000001" proteins/EP00908_Phaeocystis_cordata.fasta
>EP00908_Phaeocystis_cordata_P000001 Transcript_1.p1 GENE.Transcript_1~~Transcript_1.p1  ORF type:internal len:122 (+),score=4.23 Transcript_1:2-364(+)

Is not in the hash table:

(base) [jespinoz@exp-15-02 EukProt]$ grep "EP00908_Phaeocystis_cordata_P000001" id_to_hash.tsv

Here’s the proteins I’m testing this on:

https://figshare.com/articles/dataset/EukProt_a_database_of_genome-scale_predicted_proteins_across_the_diversity_of_eukaryotic_life/12417881?file=34434377

Looks like it’s not parsing the id correctly (maybe some weird unicode characters?)

(base) [jespinoz@exp-15-02 EukProt]$ grep "EP00908_Phaeocystis_cordata" id_to_hash.tsv  | head
EP00908_Phaeocystis_cordata_P000002 Transcript_10.p1 GENE.Transcript_10~~Transcript_10.p1  ORF type:internal len:115 (+),score=35.34 Transcript_10:1-342(+)	d3c38e690156e4908bc9486d7ab7e206
EP00908_Phaeocystis_cordata_P000003 Transcript_100.p2 GENE.Transcript_100~~Transcript_100.p2  ORF type:internal len:87 (-),score=32.37 Transcript_100:3-260(-)	15ce6e46b4c9b6703376adc90e95cbce
EP00908_Phaeocystis_cordata_P000004 Transcript_1000.p4 GENE.Transcript_1000~~Transcript_1000.p4  ORF type:5prime_partial len:336 (+),score=115.35 Transcript_1000:2-1009(+)	cc04edc8423df02f72e4f9dbc8f315db
EP00908_Phaeocystis_cordata_P000005 Transcript_10000.p2 GENE.Transcript_10000~~Transcript_10000.p2  ORF type:internal len:336 (-),score=182.16 Transcript_10000:2-1006(-)	11cc3cc21fa7209ba9ff0c6fe4b4a95e
EP00908_Phaeocystis_cordata_P000006 Transcript_10001.p1 GENE.Transcript_10001~~Transcript_10001.p1  ORF type:complete len:247 (-),score=24.03 Transcript_10001:357-1028(-)	a5e8a9195b560f5f6592b73fcfe39d51
EP00908_Phaeocystis_cordata_P000007 Transcript_10002.p1 GENE.Transcript_10002~~Transcript_10002.p1  ORF type:complete len:417 (-),score=105.01 Transcript_10002:109-1359(-)	e7dc17314119ebca1bddcb10b98eae80
EP00908_Phaeocystis_cordata_P000008 Transcript_10005.p1 GENE.Transcript_10005~~Transcript_10005.p1  ORF type:5prime_partial len:238 (-),score=40.87 Transcript_10005:73-786(-)	b235ed434f43be0ae5341d1b80a5ea3a
EP00908_Phaeocystis_cordata_P000009 Transcript_10009.p1 GENE.Transcript_10009~~Transcript_10009.p1  ORF type:3prime_partial len:186 (+),score=25.92 Transcript_10009:176-730(+)	b8a74828dd334604a1458a8b1a18937f
EP00908_Phaeocystis_cordata_P000010 Transcript_1001.p1 GENE.Transcript_1001~~Transcript_1001.p1  ORF type:5prime_partial len:964 (-),score=325.22 Transcript_1001:308-3199(-)	dd1939ef7b124d45abaa38728b2bf4c4
EP00908_Phaeocystis_cordata_P000011 Transcript_10010.p1 GENE.Transcript_10010~~Transcript_10010.p1  ORF type:internal len:301 (+),score=57.55 Transcript_10010:3-902(+)	695073337d79bcffd36e12f603871372

About this issue

Original URL
State: closed
Created 8 months ago
Comments: 18 (9 by maintainers)

Most upvoted comments

Try this:

 mamba install -c shenwei356 rush
 # conda install -c shenwei356 rush

shenwei356 on Nov 13, 2023

I see, so pv concatenates all files without adding line breaks, while it’s not a problem when seqkit or other tools handle each file separately.

shenwei356 on Nov 10, 2023

seqkit replace --by-seq -p '\*$'

shenwei356 on Nov 8, 2023

cat hairpin.fa | seqkit fx2tab -isQ | awk '{print $3"\t"$2}' | seqkit tab2fx

version without quotes

cat hairpin.fa | seqkit fx2tab -isQ | csvtk cut -Ht -f 3,2 | seqkit tab2fx

shenwei356 on Nov 7, 2023