Skip to content

Latest commit

 

History

History
193 lines (134 loc) · 5.7 KB

HiFi.md

File metadata and controls

193 lines (134 loc) · 5.7 KB

Pacbio HiFi

If this is not the case yet remember to activate the correct conda env:

conda activate LongReads

Dataset

There are 2 sample, available at:

~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/
  • HumanReal: Sample from a pool of vegans and omnivore
  • Zymo: mock community

Chose either for the rest of the tutorial.

As previously we can use the command seqkit stats to assess these samples.

Solution

seqkit stats ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/HumanReal_sample1e5.fastq.gz

Assembly

Mostly 2 pipeline have been developed for assembling HiFi reads for metagenomic datasets.

Which one is the best?

In this tutorial we are going to use hifiasm-meta. As usual, try to craft your own command line to do so. But first be sure to work in this directory:

mkdir -p ~/data/mydatalocal/HiFi
cd ~/data/mydatalocal/HiFi
Solution

cd ~/data/mydatalocal/HiFi
hifiasm_meta -o asm ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/samples/HumanReal_sample1e5.fastq.gz -t 4

This should take about 4858.465 seconds

So instead lets comment on the pre-run version.

cd ~/data/mydatalocal/HiFi
ln -s ~/data/public/teachdata/ebame-2022/metagenomics/HIFI_datasets/HumanReal_asm prerun_asm

Here assembly is also available in the form of a .gfa. Hifiasm uses the GFA1 format. but also adds on entry relative to reads (line starting by A).

There is a quite a diversity of output file, what are they? What is a unitig a contig? Let's check the documentation

Let's use Bandage to assess this assembly.

cd ~/data/mydatalocal/HiFi/prerun_asm
Bandage load asm.p_ctg.gfa
If you have issues with Bandage

Fix1

Did you use -X or -Y when connecting to the VM? If not, please disconect and retype ssh with that flag:

Fix2

If you have Bandage on your laptop, use the scp command to download the gfa file on your laptop:

scp [email protected]:~/data/mydatalocal/HiFi/prerun_asm/asm.p_ctg.gfa	.

This will copy the file to the directory you executed that command from. Also to be clear this command should not be run on the vm. This is a command for your laptop to request that file from the distant server. So it should be run on a terminal before you connect to the vm.

Fix3

Try and follow explanation on how to forward display from this google doc: https://docs.google.com/document/d/1VPnL-5mXXQimkXQNiQagPhgzRn8j1JBHCLV42r8-Wqc/edit#

--> look at assembly statistics

--> look at actual size of smaller size contigs

--> What are the circular components?

The Zymo is a bit more exciting than the HumanReal in terms of circular components, however some of those long contigs even if not circular are could already satisfy medium or high MAGs criterion for quality.

MAGs?

Let's focus here only on circular contigs.

Extract circular contigs

Here, no command will help you, you need to learn about the .gfa format and write a script in your favourite scripting language.

Here is the path to homemade script, try to run it:

~/repos/Ebame/scripts/Circ_cont_from_gfa.py 
It should not be too hard if you use the -h

cd ~/data/mydatalocal/HiFi
~/repos/Ebame/scripts/Circ_cont_from_gfa.py prerun_asm/asm.p_ctg.gfa circ_contigs

Check circular contigs

Clearly some of these are not genomes, we can still launch checkm on everything:

conda activate STRONG
cd ~/data/mydatalocal/HiFi
checkm lineage_wf circ_contigs checkm -r -x .fa -t 4 --tab_table

But wait what if I want to look at one of the non circular contigs?

Retry to use checkm on a contigs you chose and saved with Bandage.

--> find a long contigs from Bandage. Copy the name --> on the server, use the following command line replacing <NODE>:

cd ~/data/mydatalocal/HiFi
mkdir linear_contigs
Bandage reduce  prerun_asm/asm.p_ctg.gfa linear_contigs/<Node>.gfa  --scope aroundnodes --nodes <NODE> --distance 0

--> use a bash online to extract, name and sequence from that gfa graph:

cd ~/data/mydatalocal/HiFi/linear_contigs
awk '/^S/{print ">"$2"\n"$3}' <Node>.gfa > <Node>.fa

--> use checkm

Plasmids?

Some of these smaller contigs are likely to be plasmids. Let use a machine learning approach to verify this. Try to use a bash loop to apply this command to all files in circ_contigs:

conda activate plasmidnet
mamba install prodigal -y
mkdir plasmidnet
prodigal -i circ_contigs/s1123.ctg001147c.fa -a plasmidnet/s1123.ctg001147c.faa -p meta
plasmidnet.py -f plasmidnet/s1123.ctg001147c.faa  -o plasmidnet -m ~/repos/PlasmidNet/model.zip -j 4

Try to write a bash loop to apply the same treatment to all circular contigs

Solution

conda activate plasmidnet
mamba install prodigal -y
mkdir -p plasmidnet
for file in  circ_contigs/*.fa 
do
	faa=$(basename $file)"a"
	echo $faa
	prodigal -i $file -a plasmidnet/$faa -p meta
	plasmidnet.py -f plasmidnet/$faa -o plasmidnet -m ~/repos/PlasmidNet/model.zip -j 4
done

Once all circular contigs ran, you can merge the results with:

cat plasmidnet/*_results.tab|head -n1  > plasmidnet_results.tab
cat plasmidnet/*_results.tab| sed '/^contig/d'  >> plasmidnet_results.tab