Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to identiify species, even in the representative genome of E.coli #87

Open
Daniel-Tichy opened this issue Oct 5, 2022 · 7 comments

Comments

@Daniel-Tichy
Copy link

Daniel-Tichy commented Oct 5, 2022

First, I want to thank you for your work in this pipeline, but I have been trying to run ECTyper since yesterday without success and it seems to be a problem with it.

  1. I created a new conda environment with conda create --name ectyper
  2. I installed the module with conda install -c bioconda ectyper
  3. I downloaded the representative genome of E.coli (Escherichia coli O157:H7 str. Sakai) Refseq: NC_002695.2
  4. Move the file to a new directory and rename O157H7.fasta
  5. Execute Ectyper with ectyper -i O157H7.fasta --verify -o output_dir

And it the results indicates that

2022-10-05 15:00:13,591 ectyper.predictionFunctions INFO     Name	Species	O-type	H-type	Serotype	QC	Evidence	GeneScores	AlleleKeys	GeneIdentities(%)	GeneCoverages(%)	GeneContigNames	GeneRanges	GeneLengths	Database	Warnings
2022-10-05 15:00:13,591 ectyper.predictionFunctions INFO     O157H7	-	-	-	-:-	WARNING (WRONG SPECIES)	-	-							v1.0 (11-03-2020)	Sample identified as -: serotyping results are only available for E.coli samples.If sure that sample is E.coli run without --verify parameter.Sample was not identified as valid E.coli sample but as -
2022-10-05 15:00:13,591 ectyper      INFO     
ECTyper has finished successfully. 

It seems to identify correctly the serotype without the --verify argument but I need to assign the species as E.coli

Also, it seems to be something with MASH and the database, because previous to that result I get this:

2022-10-05 15:00:13,588 ectyper.speciesIdentification INFO     Following top hits returned by MASH ['GCF_000002435.1_GL2_genomic.fna.gz', 'GCF_000003955.1_ASM395v1_genomic.fna.gz', 'GCF_000005845.2_ASM584v2_genomic.fna.gz', 'GCF_000006665.1_ASM666v1_genomic.fna.gz', 'GCF_000006825.1_ASM682v1_genomic.fna.gz', '']
2022-10-05 15:00:13,589 ectyper.speciesIdentification WARNING  
Top MASH sketch hit GCF_000002435.1_GL2_genomic.fna.gz with 1/1000 shared hashes.
Could not assign species based on MASH distance to reference sketch file.
Due to either:
1. MASH sketch meta data accessions do not start with the GCF_ prefix in assembly_summary_refseq.txt or
2. Number of shared hashes to reference is less than 100 (i.e. too distant).
3. Genome coverage is very limited causing species verification to fail.
If sample is E.coli, try running without --verify parameter

but GCF_000002435.1 is the ID of Giardia lamblia ATCC 50803

I also tried with the docker version (use docker pull kbessonov/ectyper:1.0.0 because docker pull kbessonov/ectyperdoest work) but I get the same issues.

INFO     Name	Species	O-type	H-type	Serotype	QC	Evidence	GeneScores	AlleleKeys	GeneIdentities(%)	GeneCoverages(%)	GeneContigNames	GeneRanges	GeneLengths	Database	Warnings
2022-10-05 18:22:37,697 ectyper.predictionFunctions INFO     input	-	-	-	-:-	WARNING (WRONG SPECIES)	-	-							v1.0 (11-03-2020)	File /ectyper/input.fasta not found!Sample was not identified as valid E.coli sample but as -
2022-10-05 18:22:37,698 ectyper      INFO     
ECTyper has finished successfully.

The Galaxy version seems to be working fine at least, but I need this to work locally for APECtyper

@kbessonov1984
Copy link
Collaborator

Hi Daniel,
Species identification feature in ECTYPER is a bonus feature and is not always perfect. It was used to primarily filter out Shigella vs E.coli samples. There are many other tools that allow to do more thorough species identification.

The reason for behaviour fluctuations is that every time you install ECTYPER it pull the freshest RefSeq genomes metadata from http://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt but MASH sketch is stale coming from MASH tool and so some accessions might be deprecated and assigned to other species.

I was planning for the next release to make RefSeq data more static so that metadata will be 100% linked to the self-made RefSeq genomes sketch (which will take some time to generate).

Thus as you can see there is no easy solution to the problem of keeping species identification current and accurate. Thus I suggest using tool without --verify flag and use kraken2 of other tools to do species ID.

@Daniel-Tichy
Copy link
Author

Hi Kirill! Thank you very much for your fast answer.
Your suggestion would be enough to bypass the problem and identify the isolates but the work in this paper that I need to replicate (APECTyper ) uses ECTyper as a dependency and if It cannot identify the isolate as E.coli the whole analysis fails.

Is there a suggestion that you can give me to modify ECTYPER so it doesnt pull the freshest RefSeq genomes metadata from NCBI and does it from a local table of some sort?

Thank you again in advance.

@kbessonov1984
Copy link
Collaborator

kbessonov1984 commented Oct 5, 2022

OK, I see.

I checked that paper and there is mistake as they refer to ECTYPER version 1.0.2 which is not available at this moment. I think they meant v1.0.0. But regardless, I was not able to replicate your error on my end, but I agree that this issue needs attention.

I will create an auxiliary script that will download RefSeq metadata, genome sequences, create a mash sketch and update existing database. The metadata currently is updated every 6 months. I will remove this check in the next release and provide both metadata and mash sketch with the tool for greater consistency in the results. This way user can update their database at will or use outdated stock database included with the tool.

Here are my results:

Command-line arguments Namespace(cores=1, dbpath=None, debug=False, input='NC_002695.2.fasta', output='ectyper', percentCoverageHtype=50, percentCoverageOtype=90, percentIdentityHtype=95, percentIdentityOtype=90, refseq=None, sequence=False, verify=True)
RefSeq MASH refseq.genomes.k21s1000.msh age is 1 days downloaded on 2022-10-04 at 15:19:43 located /home/CSCScience.ca/kbessono/.conda/envs/ectyper-1.0.0/lib/python3.7/site-packages/ectyper/Data/refseq.genomes.k21s1000.msh
RefSeq sketch (refseq.genomes.k21s1000.msh) and assembly meta data (assembly_summary_refseq.txt) is in good health and does not need to be downloaded
Gathering genome files
Using genomes in file NC_002695.2.fasta
Identifying genome file types
Folowing files were not found in the input:
Creating combined serotype and identification fasta file
Assembling final list of fasta files
Following top hits returned by MASH ['GCF_001516935.2_ASM151693v2_genomic.fna.gz', 'GCF_000008865.1_ASM886v1_genomic.fna.gz', 'GCF_000303895.2_ASM30389v2_genomic.fna.gz', 'GCF_000267645.2_ASM26764v2_genomic.fna.gz', 'GCF_001865295.1_ASM186529v1_genomic.fna.gz', '']
GCF_001516935
MASH species RefSeq top hit GCF_001516935.2_ASM151693v2_genomic.fna.gz with distance 0.000312573 and shared hashes ratio 987/1000
MASH dist predicted species name: Escherichia coli O157:H7
Standardizing the E.coli genome headers based on file names
Predicting serotype from blast output
Serotype prediction completed
BLAST output file against reference alleles is written at ectyper/blast_output_alleles.txt
Reporting results:
Name    Species O-type  H-type  Serotype        QC      Evidence        GeneScores      AlleleKeys      GeneIdentities(%)       GeneCoverages(%)        GeneContigNames GeneRanges      GeneLengths     Database        Warnings
NC_002695.2     Escherichia coli O157:H7        O157    H7      O157:H7 PASS (REPORTABLE)       Based on 3 allele(s)    wzx:1;wzy:1;fliC:1;     O157-1-wzx-origin;O157-118-wzy;H7-1-fliC-origin;        100;100;100;    100;100;100;    NC_002695.2;NC_002695.2;NC_002695.2;    2783354-2784745;2785447-2786613;2624516-2626273;        1392;1167;1758; v1.0 (11-03-2020)       -

What concerns me is that you only get 1/1000 k-mers matched to GCF_001516935.2 genome while in my case I've got 987/1000 possible k-mers. I guess there might be an issue with assembly reading so that ECTYPER does not read the full assembly and so you get this error.

MASH species RefSeq top hit GCF_001516935.2_ASM151693v2_genomic.fna.gz with distance 0.000312573 and shared hashes ratio 987/1000

The NCBI metadata on this accession gives me E.coli O157:H7 and not Gardia lamblia. Also NCBI web-gui confirms E.coli mapping at https://www.ncbi.nlm.nih.gov/search/all/?term=GCF_001516935

GCF_001516935.2 PRJNA224116     SAMN04347375    LPWC00000000.2  na      83334   562     Escherichia coli O157:H7        strain=EDL 932          latest  Contig  Major   Full    2016/05/24      ASM151693v2     USDA/ARS/Eastern Regional Research Center       GCA_001516935.2 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/516/935/GCF_001516935.2_ASM151693v2

Check your assembly_summary_refseq.txt file at ECTYPER install location at ../lib/python<version>/site-packages/ectyper/Data/assembly_summary_refseq.txt

I also tried to pull a Docker image using Singularity as I am working on a cluster and here are my results. You can also pull it with Docker.

singularity pull ectyper.sif docker://quay.io/biocontainers/ectyper:1.0.0--pyhdfd78af_1
singularity run  -B $PWD:/mnt ectyper.sif ectyper -i /mnt/NC_002695.2.fasta --verify -o /mnt/ectyper_singularity

2022-10-05 16:09:32,141 ectyper      INFO     Database structure QC is OK at /usr/local/lib/python3.9/site-packages/ectyper/Data/ectyper_alleles_db.json
2022-10-05 16:09:32,141 ectyper      INFO     Starting ectyper v1.0.0 running on allele database v1.0 (11-03-2020)
2022-10-05 16:09:32,144 ectyper      INFO     Output_directory is /mnt/ectyper_singularity
2022-10-05 16:09:32,145 ectyper      INFO     Command-line arguments Namespace(input='/mnt/NC_002695.2.fasta', cores=1, percentIdentityOtype=90, percentIdentityHtype=95, percentCoverageOtype=90, percentCoverageHtype=50, verify=True, output='/mnt/ectyper_singularity', refseq=None, sequence=False, debug=False, dbpath=None)
2022-10-05 16:09:32,145 ectyper.speciesIdentification INFO     RefSeq sketch (refseq.genomes.k21s1000.msh) and assembly meta data (assembly_summary_refseq.txt) is in good health and does not need to be downloaded
2022-10-05 16:09:32,147 ectyper      INFO     Gathering genome files
2022-10-05 16:09:32,147 ectyper.genomeFunctions INFO     Using genomes in file /mnt/NC_002695.2.fasta
2022-10-05 16:09:32,147 ectyper      INFO     Identifying genome file types
2022-10-05 16:09:32,257 ectyper.genomeFunctions INFO     Folowing files were not found in the input:
2022-10-05 16:09:32,284 ectyper.genomeFunctions INFO     Creating combined serotype and identification fasta file
2022-10-05 16:09:32,310 ectyper      INFO     Assembling final list of fasta files
2022-10-05 16:09:39,374 ectyper.speciesIdentification INFO     Following top hits returned by MASH ['GCF_001516935.2_ASM151693v2_genomic.fna.gz', 'GCF_000008865.1_ASM886v1_genomic.fna.gz', 'GCF_000303895.2_ASM30389v2_genomic.fna.gz', 'GCF_000267645.2_ASM26764v2_genomic.fna.gz', 'GCF_001865295.1_ASM186529v1_genomic.fna.gz', '']
2022-10-05 16:09:39,375 ectyper.speciesIdentification INFO     GCF_001516935
2022-10-05 16:09:39,858 ectyper.speciesIdentification INFO     MASH species RefSeq top hit GCF_001516935.2_ASM151693v2_genomic.fna.gz with distance 0.000312573 and shared hashes ratio 987/1000
2022-10-05 16:09:39,858 ectyper.speciesIdentification INFO     MASH dist predicted species name: Escherichia coli O157:H7
2022-10-05 16:09:39,861 ectyper      INFO     Standardizing the E.coli genome headers based on file names
2022-10-05 16:09:44,867 ectyper.predictionFunctions INFO     Predicting serotype from blast output
2022-10-05 16:09:44,968 ectyper.predictionFunctions INFO     Serotype prediction completed
2022-10-05 16:09:44,983 ectyper      INFO     BLAST output file against reference alleles is written at /mnt/ectyper_singularity/blast_output_alleles.txt
2022-10-05 16:09:45,008 ectyper      INFO     Reporting results:
2022-10-05 16:09:45,009 ectyper.predictionFunctions INFO     Name	Species	O-type	H-type	Serotype	QC	Evidence	GeneScores	AlleleKeys	GeneIdentities(%)	GeneCoverages(%)	GeneContigNames	GeneRanges	GeneLengths	Database	Warnings
2022-10-05 16:09:45,010 ectyper.predictionFunctions INFO     NC_002695.2	Escherichia coli O157:H7	O157	H7	O157:H7	PASS (REPORTABLE)	Based on 3 allele(s)	wzx:1;wzy:1;fliC:1;	O157-1-wzx-origin;O157-118-wzy;H7-1-fliC-origin;	100;100;100;	100;100;100;	NC_002695.2;NC_002695.2;NC_002695.2;	2783354-2784745;2785447-2786613;2624516-2626273;	1392;1167;1758;	v1.0 (11-03-2020)	-
2022-10-05 16:09:45,013 ectyper      INFO
ECTyper has finished successfully.

@Daniel-Tichy
Copy link
Author

Thank you again for your fast answer Kirill!
I will check the file assembly_summary_refseq.txt tomorrow.
But, a big detail is that in your case you are getting the right match (GCF_001516935.2) for the analyzed file (NC_002695.2 ), for some reason for the same file I am getting that poor match (1/1000) with GCF_000002435.1, no with GCF_001516935.2.

I will keep you updated.
PD: Just in case I am working in Ubuntu 20.04.5, the conda environment uses:
Python 3.6.15
bcftools 1.16
blastn 2.13.0
seqtk 1.3
samtools 1.16.1
bowtie2 2.4.5
mash 2.3

pandas 1.1.5
biopython 1.79
requests 2.12.5

I had a warning about urllib3, which I had to install separately in the version 1.26.12

@kbessonov1984
Copy link
Collaborator

kbessonov1984 commented Oct 6, 2022

Yes, take a time to explore the issue. Make sure that in your conda environment ECTYPER has the following files and sizes are approximately the same ../lib/python<version>/site-packages/ectyper/Data/:

@Daniel-Tichy
Copy link
Author

Daniel-Tichy commented Oct 6, 2022

Hi Kirill!
I am here with good news and bad news.
The good news are that my files seems to be of the expected size.
-rw-rw-r-- 1 daniel daniel 83073K oct 6 13:34 assembly_summary_refseq.txt
-rw-rw-r-- 2 daniel daniel 2355K abr 25 2021 ectyper_alleles_db.json
-rw-rw-r-- 1 daniel daniel 736441K oct 6 13:34 refseq.genomes.k21s1000.msh

So It seems the problem is not there.

The bad news is that I kinda identify exactly the problem and it seems to be on my end. I am extremely sorry.

I used the file refseq.genomes.k21s1000.msh as the query and my file O157H7.fasta as the reference to estimate the distance of the genome with each entry in the sketches file with this command:
mash dist O157H7.fasta /FILEPATH/ectyper/Data/refseq.genomes.k21s1000.msh > dist_mash.txt
and then
sort -gk3 dist_mash.txt > dist_mash2.txt

Guess which one entry was on top? GCF_000002435.1_GL2_genomic.fna.gz (Giardia lamblia ATCC 50803)

It seems that because my PC is in Spanish it alters the way that sort works, if I run ECTyper with this command:
LC_ALL=C ectyper -i O157H7.fasta --verify -o output_dir

It works perfectly fine, maybe LC_ALL=C should be added to the sort_cmd function in line 192 of speciesIdentification.py

Thank you again for your time and greetings from Chile.

@kbessonov1984
Copy link
Collaborator

kbessonov1984 commented Oct 6, 2022

You are absolutely right about the different sort behaviour under different locale settings. I did not realize that it will be an issue. Thank you for discovering it!

The best way is to set locale for all external subprocess calls with the help of env flag (subprocess.run(..., env={'LC_ALL':'C'}).

def run_subprocess(cmd, input_data=None, un=False, ignorereturncode=False):

PS: Saludos de Canada y somos muy agradecidos por informarnos!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants