seqkit: Inconsistency with `seqkit fx2tab` and input fasta identifiers
Prerequisites
- make sure you’re are using the latest version by
seqkit version - read the usage
Describe your issue
- describe the problem
- provide a reproducible example
Here’s my seqkit version:
(base) [jespinoz@exp-15-02 EukProt]$ seqkit version
seqkit v2.5.1
Here’s my protein EukProt protein directory:
(base) [jespinoz@exp-15-02 EukProt]$ ls proteins/*.fasta | head
proteins/EP00001_Collodictyon_triciliatum.fasta
proteins/EP00002_Diphylleia_rotans.fasta
proteins/EP00003_Mantamonas_plastica.fasta
proteins/EP00004_Rigifila_ramosa.fasta
proteins/EP00005_Dracoamoeba_jomungandri.fasta
proteins/EP00006_Acanthamoeba_castellanii.fasta
proteins/EP00007_Cunea_sp_BSH-02190019.fasta
proteins/EP00008_Paramoeba_atlantica.fasta
proteins/EP00009_Paramoeba_aestuarina.fasta
proteins/EP00010_Vexillifera_abyssalis.fasta
Here’s my command to get the hash ids for all the sequences:
(base) [jespinoz@exp-15-02 EukProt]$ pv proteins/*.fasta | seqkit seq | seqkit fx2tab -s -n > id_to_hash.tsv
12.8GiB 0:05:05 [43.0MiB/s] [================================================================================================================================================>] 100%
Now I can check how my sequences are in the hash table:
(base) [jespinoz@exp-15-02 EukProt]$ wc -l id_to_hash.tsv
28592257 id_to_hash.tsv
Which agrees with the number of protein sequences:
(base) [jespinoz@exp-15-02 EukProt]$ pv proteins/*.fasta | grep -c "^>"
12.8GiB 0:00:25 [ 523MiB/s] [================================================================================================================================================>] 100%
28592257
However, this protein that is in one of the fasta files:
(base) [jespinoz@exp-15-02 EukProt]$ grep "EP00908_Phaeocystis_cordata_P000001" proteins/EP00908_Phaeocystis_cordata.fasta
>EP00908_Phaeocystis_cordata_P000001 Transcript_1.p1 GENE.Transcript_1~~Transcript_1.p1 ORF type:internal len:122 (+),score=4.23 Transcript_1:2-364(+)
Is not in the hash table:
(base) [jespinoz@exp-15-02 EukProt]$ grep "EP00908_Phaeocystis_cordata_P000001" id_to_hash.tsv
Here’s the proteins I’m testing this on:
Looks like it’s not parsing the id correctly (maybe some weird unicode characters?)
(base) [jespinoz@exp-15-02 EukProt]$ grep "EP00908_Phaeocystis_cordata" id_to_hash.tsv | head
EP00908_Phaeocystis_cordata_P000002 Transcript_10.p1 GENE.Transcript_10~~Transcript_10.p1 ORF type:internal len:115 (+),score=35.34 Transcript_10:1-342(+) d3c38e690156e4908bc9486d7ab7e206
EP00908_Phaeocystis_cordata_P000003 Transcript_100.p2 GENE.Transcript_100~~Transcript_100.p2 ORF type:internal len:87 (-),score=32.37 Transcript_100:3-260(-) 15ce6e46b4c9b6703376adc90e95cbce
EP00908_Phaeocystis_cordata_P000004 Transcript_1000.p4 GENE.Transcript_1000~~Transcript_1000.p4 ORF type:5prime_partial len:336 (+),score=115.35 Transcript_1000:2-1009(+) cc04edc8423df02f72e4f9dbc8f315db
EP00908_Phaeocystis_cordata_P000005 Transcript_10000.p2 GENE.Transcript_10000~~Transcript_10000.p2 ORF type:internal len:336 (-),score=182.16 Transcript_10000:2-1006(-) 11cc3cc21fa7209ba9ff0c6fe4b4a95e
EP00908_Phaeocystis_cordata_P000006 Transcript_10001.p1 GENE.Transcript_10001~~Transcript_10001.p1 ORF type:complete len:247 (-),score=24.03 Transcript_10001:357-1028(-) a5e8a9195b560f5f6592b73fcfe39d51
EP00908_Phaeocystis_cordata_P000007 Transcript_10002.p1 GENE.Transcript_10002~~Transcript_10002.p1 ORF type:complete len:417 (-),score=105.01 Transcript_10002:109-1359(-) e7dc17314119ebca1bddcb10b98eae80
EP00908_Phaeocystis_cordata_P000008 Transcript_10005.p1 GENE.Transcript_10005~~Transcript_10005.p1 ORF type:5prime_partial len:238 (-),score=40.87 Transcript_10005:73-786(-) b235ed434f43be0ae5341d1b80a5ea3a
EP00908_Phaeocystis_cordata_P000009 Transcript_10009.p1 GENE.Transcript_10009~~Transcript_10009.p1 ORF type:3prime_partial len:186 (+),score=25.92 Transcript_10009:176-730(+) b8a74828dd334604a1458a8b1a18937f
EP00908_Phaeocystis_cordata_P000010 Transcript_1001.p1 GENE.Transcript_1001~~Transcript_1001.p1 ORF type:5prime_partial len:964 (-),score=325.22 Transcript_1001:308-3199(-) dd1939ef7b124d45abaa38728b2bf4c4
EP00908_Phaeocystis_cordata_P000011 Transcript_10010.p1 GENE.Transcript_10010~~Transcript_10010.p1 ORF type:internal len:301 (+),score=57.55 Transcript_10010:3-902(+) 695073337d79bcffd36e12f603871372
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 18 (9 by maintainers)
Try this:
I see, so
pvconcatenates all files without adding line breaks, while it’s not a problem when seqkit or other tools handle each file separately.version without quotes