Author: Jason Chin, Casey Bergman and Sergey Koren
Latest Update: Aug. 28, 2014
This tutorial provides a working example of how to assemble high-coverage
PacBio data from the Drosophila melanogaster genome using a new overlapper,
the Min-hash Mapper (MHAP), for PacBio raw sequences with the updated pacBioToCA
script developed by Sergey Koren and Konstantin Berlin, using the StarCluster
toolkit on Amazon Web Services. MHAP is designed to map PacBio reads to
each other to facilitate the “pre-assembly” step, which converts the raw PacBio
reads into more accurate “pre-assembled”/“error corrected” reads (abbreviated as
“p-reads”), so that a typical overlap-layout-consensus assembly algorithm
can be used to assemble a genome.
This tutorial assumes you are familiar with the basics of using the StarCluster framework to launch SGE clusters on AWS EC2 virtual machines.
The workflow described here uses a development version of StarCluster, since the stable version does not support the kind of instance that we need to use in AWS EC2.
You can install the development version by cloning the StarCluster GitHub repository directly. The following is a simple example of how to install the development version of StarCluster from GitHub. You might have to install other Python packages that StarCluster depends on.
```
git clone https://github.com/jtriley/StarCluster.git
cd StarCluster
# you can check out the exact revision that I am using for this document
git checkout 4149bbed292b0298478756d778d8fbf1dd210daf
sudo python setup.py install
```
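If the install step fails because of missing Python dependencies, you can install them with pip first. A minimal sketch; the exact package set depends on your system, and the package names below are just examples of libraries StarCluster relies on:
```
# install a couple of StarCluster's Python dependencies by hand
# (setup.py normally resolves the full list; these are examples)
sudo pip install boto paramiko
```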
To make this version of StarCluster remove unused cluster nodes properly,
comment out one line of code in starcluster/plugins/sge.py:
```
class SGEPlugin(clustersetup.DefaultClusterSetup):
    def _remove_from_sge(self, node):
        ...
        # comment out the following line in the code
        #master.ssh.execute('qconf -de %s' % node.alias)
```
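If you prefer not to edit the file by hand, a one-liner like the following applies the same change before running setup.py install. This is a sketch that assumes the string qconf -de appears only on that line of the plugin:
```
# prefix '#' to the line that calls `qconf -de` in the SGE plugin
sed -i '/qconf -de/ s/^/#/' starcluster/plugins/sge.py
```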
To use StarCluster you will need an active AWS EC2 account and to know how to find your AWS access key, secret access key, and user id. This assembly workflow also uses a pre-built EC2 AMI (ami-234e514a) and a public EC2 EBS snapshot (snap-7f6e98d3) containing the raw sequence fasta/fastq files and examples of the assembly that will be produced by this process.
The AMI used in this protocol is a 64-bit Ubuntu-based image derived from an AMI provided by the StarCluster team: ami-234e514a is based on ami-3393a45a us-east-1 starcluster-base-ubuntu-13.04-x86_64 (EBS).
You will need to create a volume in your AWS account from the Dmel_asm_pacbio_07312014 / snap-7f6e98d3 public EBS snapshot before creating your StarCluster configuration file. To do this, follow this link in the AWS us-east-1 region, click Create Volume from the Actions menu, select Standard from the Type menu in the Create Volume window, and then click Create.
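If you prefer the command line over the console, the same volume can be created with the AWS CLI. A hedged sketch, assuming the aws tool is installed and configured with your credentials:
```
# create a standard EBS volume from the public snapshot in us-east-1a
aws ec2 create-volume --region us-east-1 --availability-zone us-east-1a \
    --snapshot-id snap-7f6e98d3 --volume-type standard
```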
The AMI image used here is pre-built with most of the software packages necessary for the assembly process. To understand how this AMI was built, and for help building your own customised AMI for assembly using FALCON, please see this script:
https://raw.github.com/PacificBiosciences/FALCON/v0.1.2/examples/install_note.sh
The FALCON package within the image ami-234e514a has been updated to an experimental branch v0.1.2 of FALCON (commit 51003c8e).
Here is an example of the configuration for StarCluster:
```
[aws info]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_access_key
aws_user_id = your_user_id
[volume DMEL]
volume_id=your_dmel_data_EBS_id # e.g. volume_id=vol-xxxxxxxx
mount_path=/mnt/dmel_asm #do not change this
[key starcluster]
key_location = ~/.ssh/starcluster.rsa
[cluster asm-starcluster-testing]
keyname = starcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-234e514a
master_instance_type = t1.micro #$0.00031 per Hour
node_image_id = ami-234e514a
node_instance_type = t1.micro
availability_zone = us-east-1a
volumes = DMEL
[cluster asm]
keyname = starcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-234e514a
master_instance_type = c3.8xlarge # $1.68 per partial or full instance hour
node_image_id = ami-234e514a
node_instance_type = c3.8xlarge
availability_zone = us-east-1a
volumes = DMEL
[global]
default_template = asm
ENABLE_EXPERIMENTAL=True
```
The example configuration file above defines two different cluster configurations.
The first, asm-starcluster-testing, is provided for testing your StarCluster
setup. The second, asm, is for the assembly process. The c3.8xlarge virtual machine
has 32 vCPUs and 60 GB RAM (see the Amazon instance type documentation).
This instance type is chosen because it has enough CPU power at a relatively low cost.
Amazon may impose limits on how many c3.8xlarge instances a new user can launch.
In that case you may need to use a different instance type, such as cr1.8xlarge,
for the asm cluster configuration.
Once you have created your config file, save it as ~/.starcluster/config. You will then need to create the StarCluster ssh key pair that will allow you to connect securely to AWS.
```
starcluster createkey starcluster -o ~/.ssh/starcluster.rsa
```
Now let’s test that your StarCluster configuration is working by launching a test cluster, logging in and out, adding nodes, then shutting the test cluster down.
You can build a test cluster with one node as follows:
```
starcluster start -c asm-starcluster-testing asm
```
Once the cluster is built, you can log in to the master node via ssh as follows:
```
starcluster sshmaster asm
```
You will now be logged in as root on the master node, and you can look at the contents of the DMEL volume:
```
ls -lrt /mnt/dmel_asm/
```
You will currently have only the master node running, which you can verify as follows:
```
qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
master linux-x64 1 0.02 590.2M 97.0M 0.0 0.0
```
Now let’s test that submitting a job to SGE is working:
```
qsub -V -b y -cwd -o /dev/null -e /dev/null "date > test.date"
qstat
```
This will create a file called test.date in your home directory, verifying that you can submit jobs to SGE.
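Once qstat shows the job has finished, you can confirm that it ran by checking the file's contents:
```
cat ~/test.date
```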
Next, log out of the StarCluster master by typing exit. This will return you to your own machine, where we can test adding an additional node to the cluster and log back in as follows:
```
starcluster addnode -n 1 asm
starcluster sshmaster asm
```
You will now have an additional worker node in your cluster, which you can see by typing qhost again:
```
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
master linux-x64 1 0.02 590.2M 97.0M 0.0 0.0
node001 linux-x64 1 0.04 590.2M 87.9M 0.0 0.0
```
To shut down your cluster but maintain the EC2 instances, log out and invoke the following from your own machine:
```
starcluster stop asm
```
A cluster in the stopped state can be restarted as follows:
```
starcluster start -x asm
```
To completely terminate all of the instances in your cluster, execute:
```
starcluster terminate asm
```
The data in your EBS volume will still be maintained after you shut down or terminate your cluster.
At this stage you should have a working StarCluster configuration and you can now move on to generating preassembled reads and performing the assembly.
There are existing assembly results in the snapshot, so you can examine them to get familiar with the setup and the expected outputs.
Once you log in, you can change the working directory to /mnt/dmel_asm and see what is in there:
```
cd /mnt/dmel_asm
ls
```
You will see the following directories:
```
mhap_WGS # a working directory for a new assembly, running script and configuration files included
mhap_WGS_template # script, configuration template and assembly results as examples
orig_data # sequence data
pacBioToCA.lastest.pl # the latest version of the script pacBioToCA used for generating the DMEL assemblies
packages # software packages for building the environment, just for reference
sge_setup # simple scripts to set up the SGE environment
wgs-assembler-experimental # the directory for the binaries and libraries for generating the DMEL assemblies
wgs-8.2beta # the final version of the WGS assembler
ecoli_data # single SMRTCell data for E. coli
ecoli_example # a working directory for a new E. coli assembly with wgs-8.2beta, running script and configuration files included
ecoli_example_template # E. coli assembly results done by wgs-8.2beta
yeast_data # yeast dataset
yeast_WGS # yeast assembly results done by wgs-8.2beta
```
You might want to check out the previously generated assemblies in /mnt/dmel_asm/mhap_WGS_template so you can get some idea of what results to expect:
```
cd /mnt/dmel_asm/mhap_WGS_template
ls
```
This is what you will see:
```
/mnt/dmel_asm/mhap_WGS_template# ls
disk_usage dmel.frg falcon_asm run_CA_asm.sh
dmel dmel.log fastalength run_pre_asm.sh
dmel.correction.err dmel.longest25.fasta fetch_read.py tempdmel
dmel.correction.hist dmel.longest25.fasta.qual filter_CA_contigs.sh time_usage.sh
dmel.err dmel.longest25.fasta.qv filtered_subreads.fastq
dmel.fasta dmel.longest25.frg pacbio_asm.spec
dmel.fastq dmel.qual pacbio_pre_asm.spec
```
Here are some directories and files whose roles are useful to know:
```
filtered_subreads.fastq # the input sequences for the whole assembly process; a symbolic link to the file in /mnt/dmel_asm/orig_data
pacbio_pre_asm.spec # the configuration for the pre-assembly step using MHAP and pacBioToCA
pacbio_asm.spec # the configuration for the assembly step using CA
run_pre_asm.sh # the command line code to run pacBioToCA for the pre-assembly part
run_CA_asm.sh # the command line code to run CA for the assembly part
```
Some of the important output files are:
```
tempdmel # the working directory of the pre-assembly step
dmel.fasta # all of the pre-assembled reads
dmel.longest25.fasta # the fasta file used for the assembly step
dmel # the working directory for the CA assembly step
falcon_asm # the working directory for the FALCON assembly step
```
There are several auxiliary scripts and binaries for further processing and evaluating the assemblies:
```
fetch_read.py # fetch sequences from a fasta file given a list of identifiers
filter_CA_contigs.sh # a script to filter out contigs with fewer than 50 supporting p-reads from the CA output
time_usage.sh # a simple one-liner to calculate the time usage for the MHAP step
```
The assembly outputs are at
```
/mnt/dmel_asm/mhap_WGS_template/dmel/9-terminator/asm.ctg.fasta # raw CA contigs
/mnt/dmel_asm/mhap_WGS_template/dmel/asm.ctg.over50.fa # filtered CA output
/mnt/dmel_asm/mhap_WGS_template/falcon_asm/primary_tigs_c.fa # the "primary contigs" from FALCON assembler
/mnt/dmel_asm/mhap_WGS_template/falcon_asm/pa-tigs_nodup.fa # all non redundant contigs from FALCON assembler
```
Here are the assembly statistics reported by the summaryAssembly.py script from PBSuite:
```
For asm.ctg.over50.fa
====================
Contig Stats
====================
#Seqs 126
Min 60700
1st Qu. 139751
Median 240762
Mean 1129317
3rd Qu. 577131
Max 25750697
Total 142294054
n50 20985507
n90 366384
n95 196318
For pa-tigs_nodup.fa
====================
Contig Stats
====================
#Seqs 346
Min 606
1st Qu. 14697
Median 37813
Mean 418258
3rd Qu. 144473
Max 25639991
Total 144717436
n50 16788716
n90 237357
n95 104117
```
One important remark: to get the best results, an extra step is needed to polish the consensus sequences using the Quiver algorithm. That process is not covered in this tutorial.
The examples shown in this and the next section use the code in /mnt/dmel_asm/wgs-assembler-experimental. You can generate the assembly using the latest beta code in /mnt/dmel_asm/wgs-8.2beta by changing the related paths for the executables.
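For example, if the wgs-8.2beta build follows the usual Celera Assembler directory layout (an assumption worth verifying on the image), one way to switch is to put its binaries first on your PATH:
```
# prefer the wgs-8.2beta executables over the experimental build
# (the Linux-amd64/bin layout is an assumption -- check with `ls`)
export PATH=/mnt/dmel_asm/wgs-8.2beta/Linux-amd64/bin:$PATH
```
Note that the run scripts may also reference executables by absolute path, in which case those paths need editing as well.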
The following contains the instructions for generating your own assembly in
two steps: (1) pre-assembly and (2) assembly. The reason for this two-step
process is that pre-assembly is more computationally intensive, so it is better
to use multiple compute nodes, while the assembly step has some I/O overhead, so it is
cheaper to use a single node for that process. pacBioToCA is capable of running
the whole process at once; if that is your preferred way to go, change
assemble = 0 to assemble = 1 in pacbio_pre_asm.spec. I noticed
some non-optimized NFS I/O load during several steps, so it is also better to
turn off the journaling feature of the ext4 file system (or use another file system)
to improve performance, as sketched below.
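Here is a minimal sketch of disabling ext4 journaling on the data volume with tune2fs. The device name is an assumption (check the output of mount), and the file system must be unmounted first:
```
# find the device backing /mnt/dmel_asm, e.g. /dev/xvdz (assumption)
mount | grep dmel_asm
umount /mnt/dmel_asm
tune2fs -O ^has_journal /dev/xvdz   # turn off the ext4 journal
mount /dev/xvdz /mnt/dmel_asm
```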
“Pre-assembly” is the process of error-correcting PacBio reads to generate
“preassembled reads” (p-reads), which are accurate enough to be assembled by
traditional overlap-layout-consensus assembly algorithms directly. In this
tutorial, we use a min-hash algorithm to overlap the reads for error
correction. It is much faster than using the blasr aligner for the same purpose.
First, let's start a one-node EC2 cluster to set up a few things by running the following starcluster command on your local machine:
```
starcluster start -c asm asm
```
Once the cluster is built, you can log in to the master node from your local machine by:
```
starcluster sshmaster asm
```
Once you log in, you should see a command prompt like:
```
(HBAR_ENV)root@master:~#
```
The (HBAR_ENV) prefix indicates that you are using the Python virtualenv in /home/HBAR_ENV. (The “activation” script /home/HBAR_ENV/bin/activate is executed within /root/.bashrc during the login process.)
We will first need to update the Grid Engine environment. It may be wise to do the following inside a screen session to ensure persistence of your environment (see the sketch after the commands):
```
cd /mnt/dmel_asm/sge_setup
bash sge_setup.sh
```
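If you follow the screen suggestion above, a minimal pattern for wrapping the setup commands looks like this:
```
screen -S sge_setup            # start a named screen session
cd /mnt/dmel_asm/sge_setup
bash sge_setup.sh
# detach with Ctrl-a d; reattach later with: screen -r sge_setup
```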
Before starting the pre-assembly process, you will need to delete some files
in the assembly example. The current CA generates a large amount of intermediate
files for debugging purposes, and the directory mhap_WGS_template/ uses 293G of
the 500G snapshot. If you do not delete some files, the volume cannot hold
two CA assembly results. However, you might want to keep some results for
comparison. I suggest removing dmel/5-consensus/asm.*.* as these occupy about
209G and are just intermediate files for one of the CA steps.
```
rm /mnt/dmel_asm/mhap_WGS_template/dmel/5-consensus/asm.*.*
```
Once you release the 209G of space, the rest of the volume should be enough.
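You can check that enough space has been freed with df:
```
df -h /mnt/dmel_asm
```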
At this point you can add more compute nodes to speed up the process by invoking the following on your local machine:
```
starcluster addnode -n 1 asm
```
Replace 1 with a larger number to run more jobs in parallel. The assembly example was generated using 4 nodes (1 master and 3 compute nodes), and it took about 7 hours to finish the MHAP step.
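For example, to reproduce the 4-node setup used for the example assembly (1 master plus 3 compute nodes), you would add three nodes at once:
```
starcluster addnode -n 3 asm
```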
Once the new nodes have been added, you need to configure the SGE environment on your cluster to allow the new nodes to submit to the queue. For example, if you added 3 compute nodes above, you would need to add all three to the SGE configuration as follows:
```
qconf -as node001
qconf -as node002
qconf -as node003
```
To start the assembly process, change to the directory /mnt/dmel_asm/mhap_WGS and run the script run_pre_asm.sh:
```
cd /mnt/dmel_asm/mhap_WGS
bash run_pre_asm.sh
```
This will submit the jobs to the Grid Engine. The default is to run the jobs until the pre-assembled reads are generated.
In some rare cases, some of the pre-assembly consensus jobs might fail. One can remove certain files to restart that process; we will document how to troubleshoot that later. In my last test, everything went smoothly.
Once the final output dmel.fasta is generated in /mnt/dmel_asm/mhap_WGS and no more pacBioToCA processes are running, you can start the CA part. I prefer doing that in a single-node SGE mode. You can use the starcluster removenode command on your local machine to remove unwanted nodes, e.g.:
```
starcluster removenode asm -a node001
starcluster removenode asm -a node002
starcluster removenode asm -a node003
```
On the cluster, make sure you are in the right directory and execute the CA assembly as follows:
```
cd /mnt/dmel_asm/mhap_WGS
bash run_CA_asm.sh
```
There will be some error messages, which can mostly be ignored. However, you
might want to check whether some assembly processes are running, as shown below. If, for some
reason, all processes are dead, you can just re-run the run_CA_asm.sh
script; typically it will pick up where it failed and restart correctly. I had
to run it twice to get my CA assembly. Alternatively, you can set the flag
assemble = 1 in pacbio_pre_asm.spec, in which case the whole
assembly is done in one command.
This part took 4 to 6 hours in my test using a single node.
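To check whether assembly processes are still alive, you can poll the queue and the process table. A small sketch; the runCA process name is an assumption based on the standard CA driver:
```
qstat                    # any jobs still queued or running?
ps aux | grep -i runCA   # any CA driver processes alive on this node?
```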
A script, filter_CA_contigs.sh, is provided to filter the CA contigs. It will remove all contigs that have fewer than 50 supporting p-reads. To run it, execute bash ../filter_CA_contigs.sh inside the dmel/ working directory.
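Concretely, for the assembly generated in this tutorial:
```
cd /mnt/dmel_asm/mhap_WGS/dmel
bash ../filter_CA_contigs.sh
```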
You can also generate an alternative assembly using an experimental assembler FALCON. To do that, you have to create a working directory and copy the necessary scripts from the example to set it up:
```
cd /mnt/dmel_asm/mhap_WGS
mkdir falcon_asm
cd falcon_asm
cp ../../mhap_WGS_template/falcon_asm/*.py .
cp ../../mhap_WGS_template/falcon_asm/run_asm.sh .
cp ../../mhap_WGS_template/falcon_asm/post_asm.sh .
ln -s ../dmel.longest25.fasta preads.fa
```
Then, you can run run_asm.sh to start the assembly process:
```
bash run_asm.sh
```
In my test, it took about 3 to 5 hours for the initial overlapping jobs. The rest of the layout and contig-generation jobs are relatively fast; they can be finished in tens of minutes.
Here is a list of the output files:
```
full_string_graph.adj # the adjacent nodes of the edges in the full string graph
string_graph.gexf # the gexf file of the string graph for graph visualization
string_graph.adj # the adjacent nodes of the edges in the string graph after transitive reduction
edges_list # full edge list
paths # paths for the unitigs
unit_edges.dat # path and sequence of the unitigs
uni_graph.gexf # unitig graph in gexf format
unitgs.fa # fasta file of the unitigs
all_tigs_paths # paths for all final contigs (= primary contigs + associated contigs)
all_tigs.fa # fasta file for all contigs
primary_tigs_paths # paths for all primary contigs
primary_tigs.fa # fasta file for the primary contigs
primary_tigs_paths_c # paths for all primary contigs, with detectable mis-assemblies broken
primary_tigs_c.fa # fasta file for the primary contigs, with detectable mis-assemblies broken
asm_graph.gexf # the assembly graph where the edges are the contigs
There might be some redundant contigs. The following script (post_asm.sh) can be used to remove them:
```
export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
cd /mnt/dmel_asm/mhap_WGS/falcon_asm   # the FALCON working directory created above
nucmer -maxmatch all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null #~2hr
show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids
remove_dup_ctg.py
cat p-tigs_nodup.fa a-tigs_nodup.fa > pa-tigs_nodup.fa
```
The non-redundant set of contigs in pa-tigs_nodup.fa will be suitable for further correction by the Quiver algorithm.
The following directories contain the assemblies for E. coli and yeast assembled with the final wgs-8.2beta:
```
wgs-8.2beta # the final version of the WGS assembler
ecoli_data # single SMRTCell data for E. coli
ecoli_example # a working directory for a new E. coli assembly with wgs-8.2beta, running script and configuration files included
ecoli_example_template # E. coli assembly results done by wgs-8.2beta
yeast_data # yeast dataset
yeast_WGS # yeast assembly results done by wgs-8.2beta
```
You can assemble both E. coli and yeast using a single-node configuration. To
assemble E. coli, change the working directory to /mnt/dmel_asm/ecoli_example
and run the script run.sh to re-generate the assembly shown in the directory
/mnt/dmel_asm/ecoli_example_template; see the example below. The configuration
is set up to run the assembly in local mode, so no SGE setup is required for
the E. coli assembly. An assembled yeast example using wgs-8.2beta is provided
in yeast_WGS. You can re-generate that assembly with its run.sh and
pacbio_asm.spec. However, the default setting there requires an SGE setup,
even though the assembly can be done on a single node. Please follow the
instructions for the DMEL assembly for setting up SGE if you would like to
reproduce the results for the yeast assembly.
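For example, the E. coli assembly can be re-generated with:
```
cd /mnt/dmel_asm/ecoli_example
bash run.sh
```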