Dmel Assembly using MHAP, Celera Assembly and/or FALCON on Amazon EC2

Author: Jason Chin, Casey Bergman and Serge Koren Latest Update: Aug. 28, 2014

Introduction

This tutorial provides a working example of how to assemble high coverage PacBio data from the Drosophila melanogaster genome using a new overlapper Min-hash Mapper (MHAP) for PacBio raw sequences with the undated pacBioToCA script developed by Sergey Koren and Konsantin Berlin using the StarCluster toolkit on Amazon Web Services. The MHAP is designed to map PacBio reads to each other to facilitate the “pre-assembly” step which converts the raw PacBio reads into more accurate “pre-assembled”/“error corrected” reads (abbr. as “p-reads”) so a typical overlap-layout-consensus assembly algorithm/software can be used to assemble a genome.

This tutorial assumes you have are familiar with the basics of using the StarCluster framework to launch SGE clusters on AWS EC2 virtual machines.

Installing StarCluster

The workflow described here uses a development version of StarCluster since the stable version does not support the kind of instance that we need to use in AWS EC2.

You can install the development version by directly cloning the StarCluster GitHub repository. The following is a simple example of how to instal the development version of StarCluster from github. You might have to install other python packages that StarCluster is dependent on.

    git clone https://github.com/jtriley/StarCluster.git
    cd StarCluster
    # you can check out the exact revision that I am using for this document
    git checkout 4149bbed292b0298478756d778d8fbf1dd210daf 
    sudo python setup.py install

To make this version of StarCluster work properly to remove unused cluster nodes, comment out one line of code in starcluster/plugins/sge.py:

    class SGEPlugin(clustersetup.DefaultClusterSetup):
        def _remove_from_sge(self, node):
            #comment out the following line in the code
            #master.ssh.execute('qconf -de %s' % node.alias)

Required AWS resources

To use StarCluster you will need an active AWS EC2 account and know how to find your AWS access key, secret access key and user id. This assembly workflow also uses a pre-built EC2 AMI (ami–234e514a) and a public EC2 EBS snapshot (snap–7f6e98d3) containing raw sequence fasta/fastq files and the examples of the assembly that will be produced by this process.

The AMI that is used in this protocol are a 64-bit Ubuntu-based image that is derived from AMI provided by the StarCluster team. ami-234e514a is based on ami-3393a45a us-east-1 starcluster-base-ubuntu-13.04-x86_64 (EBS).

You will need to create a volume in your AWS account from the Dmel_asm_pacbio_07312014 / snap-7f6e98d3 public EBS snapshot before creating your StarCluster configuration file. To do this, click this link in the AWS us-east–1 region above and then click Create Volume from the Actions menu above, then select Standard from the Type menu in the Create Volume window, then click Create.

The AMI image used here are pre-built with most of the software packages necessary for the assembly process. To understand how these AMIs were built, and to help with building your own customised AMI for assembly using FALCON, please see this script:

    https://raw.github.com/PacificBiosciences/FALCON/v0.1.2/examples/install_note.sh

The FALCON package within the image ami-234e514a has be updated to an experimental branch v0.1.2 of FALCON commit 51003c8e.

Creating a StarCluster Configuration File

Here is an example of the configuration for StarCluster::

    [aws info]
    aws_access_key_id = your_access_key
    aws_secret_access_key = your_secret_access_key
    aws_user_id = your_user_id

    [volume DMEL]
    volume_id=your_dmel_data_EBS_id #e.g volume_id=vol-xxxxxxxx
    mount_path=/mnt/dmel_asm #do not change this

    [key starcluster]
    key_location = ~/.ssh/starcluster.rsa

    [cluster asm-starcluster-testing]
    keyname = starcluster
    cluster_size = 1
    cluster_user = sgeadmin
    cluster_shell = bash
    master_image_id = ami-234e514a 
    master_instance_type = t1.micro #$0.00031 per Hour
    node_image_id = ami-234e514a
    node_instance_type = t1.micro
    availability_zone = us-east-1a
    volumes = DMEL

    [cluster asm]
    keyname = starcluster
    cluster_size = 1
    cluster_user = sgeadmin
    cluster_shell = bash
    master_image_id = ami-234e514a
    master_instance_type = c3.8xlarge # $1.68 per partial or full instance hour
    node_image_id = ami-234e514a 
    node_instance_type = c3.8xlarge
    availability_zone = us-east-1a
    volumes = DMEL

    [global]
    default_template = asm
    ENABLE_EXPERIMENTAL=True

The example configuration file above uses two different cluster configurations. The first asm-starcluster-testing is provided for testing your starcluster setup. The second asm is for the assembly process. The c3.8xlarge virtual machine has 32 vCPU and 60G RAM (see Amazon instance type). This instance type is chosen because it has enough CPU power but has a relatively low cost. Amazon may impose limits on how many c3.8xlarge instances that a new user can initiate with. In this case you may need to use a different AMI type such as cr1.8xlarge for the asm cluster configuration.

Testing Your StarCluster Configuration

Once you have created your config file, save it as ~/.starcluster/config. You will then need create the starcluster ssh key pair that will allow you to connect securely to AWS.

```
    starcluster createkey starcluster -o ~/.ssh/starcluster.rsa
```

Now let’s test that your StarCluster configuration is working by launching a test cluster, logging in and out, adding nodes, then shutting the test cluster down.

You can build a test cluster with one node as follows:

```
    starcluster start -c asm-starcluster-testing asm
```

Once the cluster is built, one can login the master node by ssh as follows:

```
    starcluster sshmaster asm
```

You will now be logged in as root on the master node, and you can look at the contents of the DMEL volume:

```
    ls -lrt /mnt/dmel_asm/
```

You will currently have only the master node running, which you can verify as follows:

```
    qhost

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  linux-x64       1  0.02  590.2M   97.0M     0.0     0.0
```

Now let’s test that submitting a job to SGE is working: qsub -V -b y -cwd -o /dev/null -e /dev/null "date > test.date" qstat

This will create a file called test.date in your home directory, verifying that you can submit jobs to SGE.

Next, logout of the Starcluster by typing exit. This will send you back to your own machine, where we can test adding an additional node to the cluster and log back into the cluster as follows:

``` 
    starcluster addnode -n 1 asm 
    starcluster sshmaster asm
```

You will now have an additional worker node added to your cluster, which you can see by typing qhost again.

```
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  linux-x64       1  0.02  590.2M   97.0M     0.0     0.0
node001                 linux-x64       1  0.04  590.2M   87.9M     0.0     0.0
```

To shut down your cluster but maintain the EC2 instances, logout and invoke the following from your own machine:

```
    starcluster stop asm
```

A starcluster in the stopped state can be restarted as follows:

```
    starcluster start -x asm
```

To completely terminate all of the instances in your cluster, execute:

```
    starcluster terminate asm
```

The data in your EBS volume will still be maintained after you shutdown or terminate your cluster.

At this stage you should have a working StarCluster configuration and you can now move on to generating preassembled reads and performing the assembly.

Assembly Examples

There is existing assembly results in the snapshot so you can examine the results to get familiar about the setup and what outputs are expected.

Once you login, you can change the working directory to /mnt/dmel_asm and see what is in there:

```
    cd /mnt/dmel_asm
    ls
```

You will see the following directories:

```
    mhap_WGS # a working directory for a new assembly, running script and configuration files included
    mhap_WGS_template # script, configuration template and assembly results as examples
    orig_data # sequence data
    pacBioToCA.lastest.pl # the latest version of the script pacBioToCA used for generating the DMEL assemblies
    packages # software packages for building the environment, just for reference 
    sge_setup # simple scripts to setup SGE envrionment
    wgs-assembler-experimental # the directory for the binaries and libraries for generating the DMEL assemblies
    wgs-8.2beta # the final version of the WGS assembler
    ecoli_data # single SMRTCell data for E. coli
    ecoli_example # a working directory for a new E. coli assembly with wgs-8.2beta, running script and configuration files included
    ecoli_example_template # E. coli assembly results done by wgs-8.2beta
    yeast_data  # yeast dataset
    yeast_WGS  # yeast assembly results done by wgs-8.2beta  
```

You might want to check out the previous generated assemblies in /mnt/dmel_asm/mhap_WGS_template so you can get some idea what results are expected to see:

```
    cd /mnt/dmel_asm/mhap_WGS_template
    ls 
```

This is what you will see:

```
    /mnt/dmel_asm/mhap_WGS_template# ls

    disk_usage            dmel.frg                   falcon_asm               run_CA_asm.sh
    dmel                  dmel.log                   fastalength              run_pre_asm.sh
    dmel.correction.err   dmel.longest25.fasta       fetch_read.py            tempdmel
    dmel.correction.hist  dmel.longest25.fasta.qual  filter_CA_contigs.sh     time_usage.sh
    dmel.err              dmel.longest25.fasta.qv    filtered_subreads.fastq
    dmel.fasta            dmel.longest25.frg         pacbio_asm.spec
    dmel.fastq            dmel.qual                  pacbio_pre_asm.spec

```

Here are some directories and files that is useful to know what they are:

```
    filtered_subreads.fastq # the input sequences for the whole assembly process, it is a symbolic link to the file in /mnt/dmel_asm/orig_data
    pacbio_pre_asm.spec # the configuration for the pre-assembly step using MHAP and pacBioToCA
    pacbio_asm.spec # the configuration for the assembly step using CA
    run_pre_asm.sh # the command line code to run pacBioToCA for the preassembly part
    run_CA_as.sh # the command line code to run the CA for the assembly
```

Some of the important output files are:

```
    tempdeml # the working directory of the pre-assembly step
    dmel.fasta # All output of the pre-assembled reads
    dmel.longest25.fasta # the fasta file used for the assembly step
    dmel # the working directory for the CA assembly step
    falcon_asm # the working directory for the FALCON assembly step
```

There are several auxiliary scripts and binaries for further processing and evaluating the assemblies:

```
    fetch_read.py # fetch sequences from a fasta file given a list of identifier
    filter_CA_contigs.sh # a script to filter out contig with less 50 p-reads support in the output of CA
    time_usage.sh # simple one-liner to calculate the time usage for the MHAP step
```

The assembly outputs are at

```
    /mnt/dmel_asm/mhap_WGS_template/dmel/9-terminator/asm.ctg.fasta # raw CA contigs
    /mnt/dmel_asm/mhap_WGS_template/dmel/asm.ctg.over50.fa # filtered CA output
    /mnt/dmel_asm/mhap_WGS_template/falcon_asm/primary_tigs_c.fa # the "primary contigs" from FALCON assembler
    /mnt/dmel_asm/mhap_WGS_template/falcon_asm/pa-tigs_nodup.fa  # all non redundant contigs from FALCON assembler 
```

Here are the assembly statistics reported by using the summaryAssembly.py from (PBSuite):

``` For asm.ctg.over50.fa 
    ====================
    Contig Stats
    ====================
    #Seqs   126
    Min     60700
    1st Qu. 139751
    Median  240762
    Mean    1129317
    3rd Qu. 577131
    Max     25750697
    Total   142294054
    n50     20985507
    n90     366384
    n95     196318

    For pa-tigs_nodup.fa 
    ====================
    Contig Stats
    ====================
    #Seqs   346
    Min     606
    1st Qu. 14697
    Median  37813
    Mean    418258
    3rd Qu. 144473
    Max     25639991
    Total   144717436
    n50     16788716
    n90     237357
    n95     104117
```

One important remark is that to get the best results, an extra step is needed to polish the consensus sequence using the Quiver Algorithm. This process is not included in this tutorial.

Generate the DMEL assembly

The example shown in this and the next section uses the code in /mnt/dmel_asm/wgs-assembler-experimental. One can generate the assembly using the latest beta code in /mnt/dmel_asm/wgs-8.2beta by changing related path for the executables.

The following contains the instructions for generating your own assembly in 2 steps: (1) pre-assembly and (2) assembly. The reason that for this two step process is that pre-assembly is more computational intensive so it is better to use multiple computational nodes. The assembly has some I/O overhead so it is cheaper to use single node for the process. pacBioToCA is capable to run the whole process at once, if that is the preferred way to go for you, you can change the assemble = 0 in the pacbio_pre_asm.spec to assemble = 1. I notice there is some non-optimized NFS I/O load during several steps, so it is better turn off the journaling feature of the ext4 file system (or use other file system) to improve the performance too.

“Pre-assembly” is the process to error correct PacBio reads to generate “preassembled reads” (p-reads) which have good accuracy to be assembled by traditional Overlap-Layout-Consensus assembly algorithms directly. In this tutorial, we use a min-hash algorithm to overlap the reads for error correction. It is much faster than using blasr aligner for the same purpose.

First, let start an EC2 cluster of one node to set up a few things by running following starcluster command on your local machine:

```
    starcluster start -c asm asm
```

Once the cluster is built, one can login the master node from your local machine by:

```
    starcluster sshmaster asm
```

Once you login, you should see a command prompt like:

```
    (HBAR_ENV)root@master:~#
```

The (HBAR_ENV) indicates that you are using the python virtualenv in /home/HBAR_ENV. (The “activation” script /home/HBAR_ENV/bin/activate was executed with in /root/.bashrc during login process.)

We will need to first update the Grid Engine environment. It may be wise to do the following inside a screen session to ensure persistence if your environment:

```
    cd /mnt/dmel_asm/sge_setup
    bash sge_setup.sh
```

Before starting up the preassembly process, you will need to delete some files in the assembly example. The current CA generates large amount of intermediate files for debugging purpose. The directory mhap_WGS_template/ uses 293G of the 500G snapshot. If you do not delete some files, the volume can not fit in two CA assembly results. However, you might want to keep some results for comparison. I suggest to remove dmel/5-consensus/asm.*.* as they occupy about 209G space and are just intermediate files for one of the CA step.

```
    rm /mnt/dmel_asm/mhap_WGS_template/dmel/5-consensus/asm.*.*
```

Once you release the 209G space, the rest of the volume should be enough.

At this point you can add more compute nodes to speed up the process by invoking the following on your local machine:

```
    starcluster addnode -n 1 asm
```

Replace the number of 1 by a large one to run more jobs in parallel. The assembly example was generated using 4 nodes (1 master and 3 compute nodes), and it took about 7 hours to finish the MHAP step.

Once the new nodes have been added, you need to configure the SGE environment on your starcluster to allow the new nodes to submit to the queue. For example, if you added 3 compute nodes above, you would need to add all three compute nodes to the SGE configuration as follows:

```
    qconf -as node001
    qconf -as node002
    qconf -as node003
```

To start the assembly process, change to the directory /mnt/dmel_asm/mhap_WGS and run the script run_pre_asm.sh.

```
    cd /mnt/dmel_asm/mhap_WGS
    bash run_pre_asm.sh
```

This will submit the SGE job to the Grid Engine. The default is to run the jobs until the pre-assembled reads are generated.

In some rare cases, some of the pre-assembly consensus step jobs might fail. One can remove certain files just to restart that process. We will document on how to troubleshooting that late. In my last test, every thing went smoothly.

Once the final output dmel.fasta is generated in /mnt/dmel_asm/mhap_WGS and no more pacBioToCA processes are running, you can start the CA part. I prefer doing that in a single node SGE mode. You can use starcluster removenode command on your local machine to remove unwanted nodes, e.g.

```
    starcluster removenode asm -a node001
    starcluster removenode asm -a node002
    starcluster removenode asm -a node003
```

On the cluster, make sure you are in the right directory and execute the CA assembly as follows:

```
    cd /mnt/dmel_asm/mhap_WGS
    bash run_CA_asm.sh 
```

There will be some error message which one can mostly ignored. However, you might want to check if some assembly processes are running. If, for some reason, all processes are dead, you can just re-rerun the run_CA_asm.sh script. Typically, it might pick where it fails and restart correctly. I had to run twice to get my CA assembly. Alternatively, you can set the flag assemble = 1 in pacbio_pre_asm.spec, in this case, you will do the whole assembly in one command.

This part takes 4 to 6 hours in my test using single node.

A script filter_CA_contigs.sh is provided to filter out the CA contig. It will remove all contigs that have less than 50 p-reads support. To run that, execute bash ../filter_CA_contigs.sh inside the dmel/ working directory.

FALCON Assembly for DMEL

You can also generate an alternative assembly using an experimental assembler FALCON. To do that, you have to create a working directory and copy the necessary scripts from the example to set it up:

```
    cd /mnt/dmel_asm/mhap_WGS
    mkdir falcon_asm
    cd falcon_asm
    cp ../../mhap_WGS_template/falcon_asm/*.py .
    cp ../../mhap_WGS_template/falcon_asm/run_asm.sh .
    cp ../../mhap_WGS_template/falcon_asm/post_asm.sh .
    ln -s ../dmel.longest25.fasta preads.fa
```

Then, you can run the run_asm.sh to start the assembly process:

```
    bash run_asm.sh
```

In my test, it takes about 3 to 5 hours for the initial overlapping jobs. The rest of the layout and generating contig jobs are relatively fast. They can be finished in 10s of minutes.

Here is a list of the output files:

```
    full_string_graph.adj  # the adjacent nodes of the edges in the full string graph
    string_graph.gexf      # the gexf file of the string graph for graph visualization
    string_graph.adj       # the adjecent nodes of the edges in the string graph after transitive reduction
    edges_list             # full edge list 
    paths                  # path for the unitigs
    unit_edges.dat         # path and sequence of the untigs
    uni_graph.gexf         # unitig graph in gexf format 
    unitgs.fa              # fasta files of the unitigs
    all_tigs_paths         # paths for all final contigs (= primary contigs + associated contigs)
    all_tigs.fa            # fasta file for all contigs
    primary_tigs_paths     # paths for all primary contigs 
    primary_tigs.fa        # fasta file fot the primary contigs
    primary_tigs_paths_c   # paths for all primary contigs, detectable mis-assemblies are broken 
    primary_tigs_c.fa      # fasta file fot the primary contigs, detectable mis-assemblies are broken
    asm_graph.gexf         # the assembly graph where the edges are the contigs
```

There might be some redundant contigs. The following script (post_asm.sh) can be used to remove redundant contigs:

```
    export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
    cd /mnt/dmel_asm/mhap_WGS
    cd /mnt/dmel_asm/new_asm/3-asm-falcon
    nucmer -maxmatch all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null  #~2hr
    show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids
    remove_dup_ctg.py
    cat p-tigs_nodup.fa a-tigs_nodup.fa > pa-tigs_nodup.fa
```

The non-reduant set of contigs in pa-tigs_nodup.fa will be suitable for further correction by the Quiver algorithm.

E. coli and Yeast Assemblies using the final wgs–8.2beta

The following directories contained the assemblies for E. coli and Yeast assembled with the final wgs-8.2beta.

```
    wgs-8.2beta # the final version of the WGS assembler
    ecoli_data # single SMRTCell data for E. coli
    ecoli_example # a working directory for a new E. coli assembly with wgs-8.2beta, running script and configuration files included
    ecoli_example_template # E. coli assembly results done by wgs-8.2beta
    yeast_data  # yeast dataset
    yeast_WGS  # yeast assembly results done by wgs-8.2beta  
```

You can assemble both E. coli and Yeast using single node configuration. To assemble E. coli, one can change the working directory to /mnt/dmel_asm/ecoli_example and run the script run.sh to re-generate the assembly in the directory /mnt/dmel_asm/ecoli_example_template. The configuration is set up to run the assembly in local mode. No SGE set up is required for the E. coli assembly. An assembled example using wgs-8.2beta is provided in the yeast_WGS. You can re-generate the assembly with the run.sh and pacbio_asm.spec. However, the default setting here require a SEG setup, even though the assembly can be done with one single node. Please follow the instruction for DMEL assembly for setting SGE if you like to reproduce the results for the yeast assembly.