PacBio Corrected Reads (PBcR) Pipeline

Pre-Compiled binaries and source
Datasets
MHAP assemblies
Amazon Web Services (AWS) Instance

Preprint

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing, on bioRxiv.

Pre-Compiled binaries and source

The Celera Assembler (CA) PBcR pipeline is for the correction (also referred to as "pre-assembly") and assembly of long-read sequencing data. The CA 8.2 release includes a novel overlapping algorithm for single-molecule noisy sequencing data. The overlapper provides increased sensitivity and speed over existing methods. The overlapping algorithm reference implementation, MHAP, is also available standalone.
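MHAP takes its name from MinHash, a locality sensitive hashing technique. The sketch below illustrates the general MinHash idea only (estimating how many k-mers two reads share by comparing small fixed-size sketches); it is not MHAP's implementation. The function names, the k-mer size, the number of hash functions, and the use of BLAKE2 as the hash family are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k=16):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Stable 64-bit hash of a k-mer; the seed selects one hash function."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=16, num_hashes=128):
    """For each hash function, keep the minimum hash value over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of matching positions estimates the Jaccard similarity
    of the two underlying k-mer sets."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two reads that overlap share many k-mers and so agree at many sketch positions, while unrelated reads agree at almost none. Comparing sketches costs time proportional to the sketch length rather than to the read length, which is the general reason this family of methods is attractive for long, noisy reads.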

CA 8.2 source and pre-compiled binaries.

MHAP source and pre-compiled jar

The list of commands used for correction and assembly is available here, along with a spec file (human spec). Note that the latest release auto-detects the available RAM and CPUs on a single machine, so you can supply an empty spec file if you want the run to use all of the machine's resources. The spec file above restricts the run to 32 GB of memory and 16 cores. The human spec file is designed for an SGE cluster.
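To check how a machine compares with the 32 GB and 16-core limits mentioned above, a small script like the following can help. It is a minimal sketch, not the pipeline's own detection logic: the function names are hypothetical, and the RAM query relies on os.sysconf, which is only available on Unix-like systems.

```python
import os


def detect_resources():
    """Return (cpu_count, total_ram_gb); RAM is None if it cannot be queried."""
    cpus = os.cpu_count() or 1
    try:
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = ram_bytes / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are missing on some platforms.
        ram_gb = None
    return cpus, ram_gb


def capped_resources(cpus, ram_gb, max_cpus=16, max_ram_gb=32):
    """Apply the caps from the text (16 cores, 32 GB) to detected values.

    If the RAM size is unknown, the cap itself is returned for memory.
    """
    use_cpus = min(cpus, max_cpus)
    use_ram_gb = max_ram_gb if ram_gb is None else min(ram_gb, max_ram_gb)
    return use_cpus, use_ram_gb


if __name__ == "__main__":
    detected = detect_resources()
    print("detected (cpus, ram_gb):", detected)
    print("capped   (cpus, ram_gb):", capped_resources(*detected))
```

If the detected values are below the caps, the capped spec file would not constrain the run on that machine, and an empty spec file would behave the same way under the auto-detection described above.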

For the latest usage information and announcements, please visit the PBcR wiki

A package to produce a chromosome tiling for the human genome, given the reference and an assembly, is available here.

Alignments of 102 BACs from Chaisson et al. to the human quivered assembly are available here.

A public AWS AMI and snapshot are available. Documentation on running MHAP on AWS, including a eukaryote example (D. melanogaster using StarCluster), is also available. The D. melanogaster run will cost approximately $300. The image also includes data for E. coli and S. cerevisiae, with assembly instructions.

Datasets

Below you will find all the datasets used for testing the assembly/correction pipeline.

E. coli K12 MG1655

MiSeq 85X, 150 bp reads, originally from the Illumina scientific data website. Used for comparison with SPAdes v3.1.1; download 1.fastq and 2.fastq.

PacBio RS C2 SRA

PacBio RS C2 filtered subreads

PacBio RS P4 SRA

PacBio RS P4 filtered subreads

PacBio RS P5 SRA

PacBio RS P5 filtered subreads

S. cerevisiae W303

PacBio RS P4 SRA

PacBio RS P4 filtered subreads

A. thaliana Ler-0

PacBio RS P4 SRA

PacBio RS P4 filtered subreads

PacBio RS P5 SRA

PacBio RS P5 filtered subreads

D. melanogaster ISO1

PacBio RS P5 SRA

PacBio RS P5 filtered subreads

Human CHM1htert

PacBio RS P5 SRA

PacBio RS P5 filtered subreads

MHAP Polished sequences and assembly

Below are the PBcR polished sequences and assemblies.

E. coli K12 MG1655

PacBio RS P5 polished sequences

PacBio RS P5 contigs

PacBio RS P5 Quivered contigs

S. cerevisiae W303

PacBio RS P4 polished sequences

PacBio RS P4 contigs

PacBio RS P4 full assembly

PacBio RS P4 Quivered contigs

PacBio RS P4 Quivered full assembly

A. thaliana Ler-0

PacBio RS P5 polished sequences

PacBio RS P5 contigs

PacBio RS P5 full assembly

PacBio RS P5 Quivered contigs

PacBio RS P5 Quivered full assembly

D. melanogaster ISO1

PacBio RS P5 polished sequences

PacBio RS P5 contigs

PacBio RS P5 full assembly

PacBio RS P5 Quivered contigs

PacBio RS P5 Quivered full assembly

Human CHM1htert

PacBio RS P5 polished sequences

PacBio RS P5 contigs

PacBio RS P5 full assembly

PacBio RS P5 Quivered contigs

PacBio RS P5 Quivered full assembly

Amazon Web Services (AWS) Image and Instructions

To simplify CA use and facilitate assembly for researchers without access to computational resources, we have created an AWS image that includes the CA 8.2 release as well as example datasets.

The data drive includes instructions to reproduce the E. coli, S. cerevisiae, and D. melanogaster assemblies. For E. coli, the full run (filtering the H5 file, polishing/assembly, Quiver) can be reproduced for under $3 in under 2 hours (20 minutes for polishing/assembly). The D. melanogaster polishing/assembly costs approximately $300 and takes approximately 10 hours.
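For budgeting, the figures above can be turned into a rough hourly rate and scale factors. The arithmetic below uses only the numbers quoted in the text; the variable names are illustrative. The E. coli values are upper bounds, and the two runs are not strictly comparable, since the E. coli figure covers the full run while the D. melanogaster figure covers polishing/assembly only.

```python
# Figures quoted in the text above.
ECOLI_MAX_COST_USD = 3.0    # full E. coli run costs less than this
ECOLI_MAX_HOURS = 2.0       # full E. coli run takes less than this
DMEL_COST_USD = 300.0       # approximate D. melanogaster polishing/assembly cost
DMEL_HOURS = 10.0           # approximate D. melanogaster polishing/assembly time

# Implied aggregate spend rate for the D. melanogaster run.
dmel_usd_per_hour = DMEL_COST_USD / DMEL_HOURS

# Because the E. coli numbers are upper bounds, these ratios are lower bounds.
min_cost_ratio = DMEL_COST_USD / ECOLI_MAX_COST_USD
min_time_ratio = DMEL_HOURS / ECOLI_MAX_HOURS

print(f"D. melanogaster: about ${dmel_usd_per_hour:.0f} per hour in aggregate")
print(f"At least {min_cost_ratio:.0f}x the cost of the E. coli run")
print(f"At least {min_time_ratio:.0f}x the wall-clock time of the E. coli run")
```

The cost grows much faster than the wall-clock time, which is consistent with the D. melanogaster run being spread over a cluster of instances (the documentation mentions StarCluster) rather than a single machine.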

AWS AMI and snapshot

AWS documentation