Commit d3bbde4 (0 parents): 41 changed files with 69,388 additions and 0 deletions.
Cerulean Hybrid Genome Assembler v0.1.1

This software extends contigs assembled from short-read datasets, such as
Illumina paired-end reads, using long reads such as PacBio RS reads.

The method is fully described in:
Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013).
Cerulean: A hybrid assembly using high throughput short and long reads.
arXiv preprint arXiv:1307.7933.

A] Requirements:
Ubuntu 12.04 (may run on other operating systems, but not tested)
Python 2.7.1 (may run on older versions, but not tested)
numpy, matplotlib libraries for Python
ABySS assembler: http://www.bcgsc.ca/platform/bioinfo/software/abyss
SMRT Analysis toolkit (for BLASR): http://pacbiodevnet.com/
PBJelly: https://sourceforge.net/projects/pb-jelly/

B] Inputs and Pre-processing:
i) Assembled contigs from the ABySS short-read assembler
ii) Mapping of PacBio reads to ABySS contigs using BLASR

i) Assembly of Illumina paired-end reads:
If the paired-end reads are stored in fastq format in the files reads1.fastq
and reads2.fastq, then contigs may be assembled by:
$ abyss-pe k=64 n=10 in='reads1.fastq reads2.fastq' name=<dataname>
This will generate two files used as inputs to Cerulean:
* <dataname>-contigs.fa   # contains the contig sequences
* <dataname>-contigs.dot  # contains the graph structure

ii) Mapping PacBio reads to ABySS contigs using BLASR:
Note: sawriter and blasr are part of the SMRT Analysis toolkit.
Note: You need to set the environment variables and path:
$ export SEYMOUR_HOME=/opt/smrtanalysis/
$ source $SEYMOUR_HOME/etc/setup.sh

Suppose the PacBio reads are stored in <dataname>_pacbio.fasta:
$ sawriter <dataname>-contigs.fa
$ blasr <dataname>_pacbio.fasta <dataname>-contigs.fa -minMatch 10 \
    -minPctIdentity 70 -bestn 30 -nCandidates 30 -maxScore -500 \
    -nproc <numthreads> -noSplitSubreads \
    -out <dataname>_pacbio_contigs_mapping.fasta.m4
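
The two mapping commands above can be scripted so that the file names stay consistent with what Cerulean expects later. The following is a minimal sketch, not part of Cerulean: the function names (blasr_command, map_reads) are hypothetical, the flags are copied from the command above, and it assumes sawriter and blasr are already on the PATH after sourcing setup.sh.

```python
import subprocess


def blasr_command(reads_fasta, contigs_fa, out_m4, numthreads=1):
    """Build the blasr argument list using the flags given in this README."""
    return [
        "blasr", reads_fasta, contigs_fa,
        "-minMatch", "10",
        "-minPctIdentity", "70",
        "-bestn", "30",
        "-nCandidates", "30",
        "-maxScore", "-500",
        "-nproc", str(numthreads),
        "-noSplitSubreads",
        "-out", out_m4,
    ]


def map_reads(dataname, numthreads=1, run=False):
    """Return the sawriter and blasr commands; execute them only if run=True."""
    contigs = dataname + "-contigs.fa"
    commands = [
        ["sawriter", contigs],
        blasr_command(
            dataname + "_pacbio.fasta",
            contigs,
            dataname + "_pacbio_contigs_mapping.fasta.m4",
            numthreads,
        ),
    ]
    if run:
        for command in commands:
            # check_call raises CalledProcessError if a tool exits non-zero
            subprocess.check_call(command)
    return commands
```

Calling map_reads("<dataname>", numthreads=8) only returns the commands for inspection; pass run=True to execute them in the current directory.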

Make sure the fasta.m4 file generated has the following format:
qname tname qstrand tstrand score pctsimilarity tstart tend tlength \
    qstart qend qlength ncells
The file format may be verified by adding the option -header to blasr.
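
The column check described above can also be done programmatically. This is a minimal sketch under stated assumptions: the names M4_FIELDS, M4Record, parse_m4_line and check_m4_file are hypothetical, fields are assumed to be whitespace-separated, and a header row (if written by -header) is assumed to start with the column names "qname tname". Values are kept as strings because the README does not specify their types.

```python
from collections import namedtuple

# Column order as listed in this README
M4_FIELDS = [
    "qname", "tname", "qstrand", "tstrand", "score", "pctsimilarity",
    "tstart", "tend", "tlength", "qstart", "qend", "qlength", "ncells",
]
M4Record = namedtuple("M4Record", M4_FIELDS)


def parse_m4_line(line):
    """Split one mapping line into named fields; raise if the column count is wrong."""
    parts = line.split()
    if len(parts) != len(M4_FIELDS):
        raise ValueError(
            "expected %d columns, found %d" % (len(M4_FIELDS), len(parts))
        )
    return M4Record(*parts)


def check_m4_file(path):
    """Return the number of well-formed records; raise ValueError on a bad line."""
    count = 0
    with open(path) as handle:
        for lineno, line in enumerate(handle, 1):
            if not line.strip():
                continue  # skip blank lines
            if line.split()[:2] == ["qname", "tname"]:
                continue  # assumed header row
            try:
                parse_m4_line(line)
            except ValueError as err:
                raise ValueError("line %d: %s" % (lineno, err))
            count += 1
    return count
```

Running check_m4_file on <dataname>_pacbio_contigs_mapping.fasta.m4 before starting Cerulean would catch a mapping file written in a different output format.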

C] Execute:
Cerulean requires that all input files are in the same directory <basedir>:
i) <basedir>/<dataname>-contigs.fa
ii) <basedir>/<dataname>-contigs.dot
iii) <basedir>/<dataname>_pacbio_contigs_mapping.fasta.m4
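
Since all three inputs must sit in the same directory under fixed names, a quick pre-flight check can save a failed run. The sketch below is illustrative only: expected_inputs and missing_inputs are hypothetical helper names, and the file names are taken from the list above.

```python
import os


def expected_inputs(basedir, dataname):
    """Return the three input paths Cerulean expects, per this README."""
    return [
        os.path.join(basedir, dataname + "-contigs.fa"),
        os.path.join(basedir, dataname + "-contigs.dot"),
        os.path.join(basedir, dataname + "_pacbio_contigs_mapping.fasta.m4"),
    ]


def missing_inputs(basedir, dataname):
    """Return the expected input paths that do not exist as regular files."""
    return [
        path for path in expected_inputs(basedir, dataname)
        if not os.path.isfile(path)
    ]
```

An empty list from missing_inputs(<basedir>, <dataname>) means all three files are in place.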

To run:
$ python src/Cerulean.py --dataname <dataname> --basedir <basedir> \
    --nproc <numthreads>

This will generate:
i) <basedir>/<dataname>_cerulean.fasta
ii) <basedir>/<dataname>_cerulean.dot
Note: The dot file does not contain the same contigs as the fasta file; it
describes an intermediate graph.

D] Post-processing:
Currently Cerulean does not include the consensus sequence of PacBio reads in gaps.
The gaps may be filled using PBJelly.
$ python $JELLYPATH/fakeQuals.py <dataname>_cerulean.fasta <dataname>_cerulean.qual
$ python $JELLYPATH/fakeQuals.py <dataname>_pacbio.fasta <dataname>_pacbio.qual
$ cp $JELLYPATH/lambdaExample/Protocol.xml .
$ mkdir PBJelly
Modify Protocol.xml as follows:
    Set <reference> to $PATH_TO_<basedir>/<dataname>_cerulean.fasta
    Set <outputDir> to $PATH_TO_<basedir>/PBJelly
    Set <baseDir> to $PATH_TO_<basedir>
    Set <job> to <dataname>_pacbio.fasta
    Set <blasr> option -nproc <numthreads>
Note: PBJelly requires that the suffix be .fasta and not .fa
Next run PBJelly:
($ source $JELLYPATH/exportPaths.sh)
$ python $JELLYPATH/Jelly.py <stage> Protocol.xml
where <stage> must be run once for each of the following, in this order:
    setup
    mapping
    support
    extraction
    assembly
    output
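
Because the six stages must run strictly in sequence, a small driver can run them one after another and stop at the first failure. This is a minimal sketch, not part of PBJelly or Cerulean: jelly_commands and run_jelly are hypothetical names, and it assumes exportPaths.sh has already been sourced in the shell that launches the script.

```python
import os
import subprocess

# Stage order as listed in this README
JELLY_STAGES = ["setup", "mapping", "support", "extraction", "assembly", "output"]


def jelly_commands(jellypath, protocol="Protocol.xml"):
    """Build one Jelly.py command per stage, in the required order."""
    script = os.path.join(jellypath, "Jelly.py")
    return [["python", script, stage, protocol] for stage in JELLY_STAGES]


def run_jelly(jellypath, protocol="Protocol.xml"):
    """Run every stage in order; a failing stage raises and halts the sequence."""
    for command in jelly_commands(jellypath, protocol):
        subprocess.check_call(command)
```

For example, run_jelly(os.environ["JELLYPATH"]) would run all stages against the Protocol.xml in the current directory.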

The assembled contigs may be viewed in <basedir>/PBJelly/assembly/jellyOutput.fasta

In case of any questions or errors, please contact vdeshpan eng DT ucsd DT edu