Unable to identify species, even in the representative genome of E.coli #87
Comments
Hi Daniel, The behaviour fluctuates because every time you install ECTYPER it pulls the freshest RefSeq genomes metadata from http://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt, but the MASH sketch that comes from the MASH tool is stale, so some accessions might be deprecated and assigned to other species. For the next release I was planning to make the RefSeq data more static, so that the metadata will be 100% linked to a self-made RefSeq genomes sketch (which will take some time to generate). As you can see, there is no easy solution to the problem of keeping species identification current and accurate. Thus I suggest using the tool without |
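To illustrate the mismatch described above, here is a minimal sketch (not ECTYPER code) that checks whether the accessions present in a sketch still appear in a freshly downloaded assembly_summary_refseq.txt. It assumes the standard NCBI layout where the first column is the assembly accession and the eighth is the organism name, and that sketch IDs look like GCF_000002435.1_GL2_genomic.fna.gz. The function names are hypothetical.

```python
import csv
import io


def load_refseq_species(summary_text):
    """Map assembly accession -> organism name from assembly_summary_refseq.txt content."""
    species = {}
    # Comment/header lines in the NCBI summary start with '#'; skip them.
    rows = (line for line in io.StringIO(summary_text) if not line.startswith("#"))
    for fields in csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) > 7:
            # Assumption: column 1 is the accession, column 8 is organism_name.
            species[fields[0]] = fields[7]
    return species


def accession_of(sketch_id):
    """Turn 'GCF_000002435.1_GL2_genomic.fna.gz' into 'GCF_000002435.1'."""
    return "_".join(sketch_id.split("_")[:2])


def missing_from_metadata(sketch_ids, species):
    """Return the sketch IDs whose accession is absent from the metadata table."""
    return [s for s in sketch_ids if accession_of(s) not in species]
```

Any ID returned by `missing_from_metadata` points at a genome that is in the sketch but no longer in the metadata, which is the situation where the species label can go wrong.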
Hi Kirill! Thank you very much for your fast answer. Is there a way you could suggest to modify ECTYPER so that it doesn't pull the freshest RefSeq genomes metadata from NCBI and instead reads it from a local table of some sort? Thank you again in advance. |
OK, I see. I checked that paper and there is a mistake, as they refer to ECTYPER version 1.0.2, which is not available at this moment. I think they meant v1.0.0. Regardless, I was not able to replicate your error on my end, but I agree that this issue needs attention. I will create an auxiliary script that will download RefSeq metadata and genome sequences, create a mash sketch, and update the existing database. The metadata is currently updated every 6 months. I will remove this check in the next release and provide both the metadata and the mash sketch with the tool for greater consistency in the results. This way users can update their database at will or use the stock (possibly outdated) database included with the tool. Here are my results:
What concerns me is that you only get 1/1000 k-mers matched to
The NCBI metadata on this accession gives me E.coli O157:H7 and not
Check your I also tried to pull a Docker image using Singularity as I am working on a cluster and here are my results. You can also pull it with Docker.
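The six-month metadata refresh mentioned above could look roughly like the following. This is only a sketch under the assumption that the age of the local metadata file is judged by its modification time; the function name and the 183-day threshold are illustrative, not what ECTYPER actually does.

```python
import os
import time

# Roughly six months, expressed in seconds (an illustrative threshold).
SIX_MONTHS_SECONDS = 183 * 24 * 60 * 60


def metadata_is_stale(path, now=None, max_age=SIX_MONTHS_SECONDS):
    """Return True if the file is missing or was last modified more than max_age seconds ago."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_age
```

Shipping the metadata and the sketch together, as proposed, would make a check like this unnecessary, because both would always describe the same set of genomes.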
Thank you again for your fast answer, Kirill! I will keep you updated. pandas 1.1.5 I had a warning about urllib3, which I had to install separately at version 1.26.12 |
Yes, take your time to explore the issue. Make sure that in your conda environment ECTYPER has the following files and that their sizes are approximately the same
Hi Kirill! So it seems the problem is not there. The bad news is that I have pretty much pinpointed the problem, and it seems to be on my end. I am extremely sorry. I used the file refseq.genomes.k21s1000.msh as the query and my file O157H7.fasta as the reference to estimate the distance between the genome and each entry in the sketches file with this command: Guess which entry was on top? GCF_000002435.1_GL2_genomic.fna.gz (Giardia lamblia ATCC 50803). It seems that because my PC is in Spanish it alters the way that It works perfectly fine, maybe Thank you again for your time and greetings from Chile. |
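One way to take the system locale out of the ranking step is to parse the distances in Python instead of relying on an external sort. The sketch below assumes the usual tab-separated `mash dist` output (reference ID, query ID, distance, p-value, shared hashes such as 1/1000). The function names are hypothetical and this is not how ECTYPER itself does it.

```python
def parse_mash_dist(text):
    """Parse tab-separated mash dist output into a list of dicts."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        reference, query, distance, p_value, shared = line.split("\t")
        matched, total = shared.split("/")
        hits.append({
            "reference": reference,
            "query": query,
            # float() always expects '.' as the decimal separator,
            # whatever the system locale is.
            "distance": float(distance),
            "p_value": float(p_value),
            "matched": int(matched),
            "total": int(total),
        })
    return hits


def best_hits(text, n=1):
    """Return the n hits with the smallest Mash distance."""
    return sorted(parse_mash_dist(text), key=lambda hit: hit["distance"])[:n]
```

With this approach a hit that shares only 1/1000 hashes cannot end up on top just because the distances were compared as text or with a different decimal separator.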
You are absolutely right about the different sort behaviour under different The best way is to set locale for all external
PS: Greetings from Canada, and we are very grateful to you for letting us know!
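As a rough illustration of fixing the locale for external commands, the sketch below runs a command with LC_ALL=C added to the environment, so tools such as sort parse and order numbers the same way on every machine. The helper names are hypothetical, and this is an assumption about how such a fix could be written, not the actual ECTYPER patch.

```python
import os
import subprocess


def c_locale_env(base_env=None):
    """Copy the environment and force the C locale for child processes."""
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = "C"  # LC_ALL overrides LANG and the individual LC_* settings
    return env


def run_external(cmd, input_text=""):
    """Run an external command under the C locale and return its stdout."""
    result = subprocess.run(
        cmd,
        input=input_text,
        capture_output=True,
        text=True,
        env=c_locale_env(),
        check=True,
    )
    return result.stdout
```

For example, `run_external(["sort", "-g"], "0.5\n0.001\n")` sorts the values numerically with `.` as the decimal separator even on a machine configured for Spanish.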
First, I want to thank you for your work on this pipeline, but I have been trying to run ECTyper since yesterday without success and there seems to be a problem with it.
conda create --name ectyper
conda install -c bioconda ectyper
ectyper -i O157H7.fasta --verify -o output_dir
And the results indicate that
It seems to identify the serotype correctly without the --verify argument, but I need to assign the species as E.coli
Also, there seems to be something wrong with MASH and the database, because before that result I get this:
but GCF_000002435.1 is the ID of Giardia lamblia ATCC 50803
I also tried with the Docker version (use docker pull kbessonov/ectyper:1.0.0 because docker pull kbessonov/ectyper doesn't work) but I get the same issues. The Galaxy version seems to be working fine at least, but I need this to work locally for APECtyper