Preprint
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing on BioRxiv.
Pre-compiled binaries and source
The Celera Assembler (CA) PBcR pipeline corrects (a step also referred to as "pre-assembly") and assembles long-read sequencing data. The CA 8.2 release includes a novel overlapping algorithm for noisy single-molecule sequencing data. The overlapper provides increased sensitivity and speed over existing methods. The reference implementation of the overlapping algorithm, MHAP, is also available standalone.
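As the preprint title indicates, the overlapper builds on locality sensitive hashing; MHAP uses MinHash sketches to find candidate overlaps between reads. The sketch below illustrates only the general MinHash idea (estimating k-mer set similarity from per-seed minimum hash values). It is not MHAP's implementation: the function names, the hash function (BLAKE2b), and the defaults for `k` and `num_hashes` are assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer, salted with a seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(seq, k=8, num_hashes=128):
    """MinHash sketch: for each seed, keep the smallest hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching sketch positions, an estimate of Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)
```

Two reads whose sketches share many positions are likely to overlap and can be passed to a more expensive alignment step; reads with no shared positions can be skipped. Comparing fixed-size sketches instead of full k-mer sets is what makes this kind of filtering fast.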
- CA 8.2 source and pre-compiled binaries.
- MHAP source and pre-compiled jar
- The list of commands used for correction and assembly is available here, as well as a spec file (human spec). Note that the latest release auto-detects the available RAM and CPUs on a single machine, so you can specify an empty spec file if you want it to use all of the machine's resources. The spec file above restricts the run to 32 GB and 16 cores. The human spec file is designed for an SGE cluster.
- For the latest usage information and announcements, please visit the PBcR wiki
- A package that produces a chromosome tiling for the human genome, given the reference and an assembly, is available here
- Alignments of 102 BACs from Chaisson et al. to the human quivered assembly are available here
- A public AWS AMI and snapshot. Documentation on running MHAP on AWS, including a eukaryote example (D. melanogaster using StarCluster), is available. The D. melanogaster run will cost approximately $300. The image also includes data for E. coli and S. cerevisiae, with assembly instructions.
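Before choosing between an empty spec file and the restricted one, it can help to check what a machine actually offers. The following is a minimal sketch, assuming a POSIX system for the RAM query; it is not CA's detection logic, and the function names and the default 16-core / 32 GB thresholds (taken from the spec file description above) are illustrative.

```python
import os


def detect_resources():
    """Return (cpu_count, ram_gb); ram_gb is None when it cannot be determined."""
    cpus = os.cpu_count() or 1
    try:
        # Physical memory = page size * number of physical pages (POSIX only).
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = ram_bytes / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        ram_gb = None
    return cpus, ram_gb


def meets_limits(cpus, ram_gb, need_cpus=16, need_ram_gb=32):
    """True if the machine has at least the requested cores and memory."""
    return ram_gb is not None and cpus >= need_cpus and ram_gb >= need_ram_gb


if __name__ == "__main__":
    cpus, ram_gb = detect_resources()
    print(f"CPUs: {cpus}, RAM (GB): {ram_gb}")
    print("Meets 16-core / 32 GB limits:", meets_limits(cpus, ram_gb))
```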
Datasets
Below you will find all the datasets used for testing the assembly/correction pipeline.
- E. coli K12 MG1655
  - MiSeq 85X, 150 bp reads, originally from the Illumina scientific data website. Used for comparison with SPAdes v3.1.1; download 1.fastq and 2.fastq
  - PacBio RS C2 SRA
  - PacBio RS C2 filtered subreads
  - PacBio RS P4 SRA
  - PacBio RS P4 filtered subreads
  - PacBio RS P5 SRA
  - PacBio RS P5 filtered subreads
- S. cerevisiae W303
  - PacBio RS P4 SRA
  - PacBio RS P4 filtered subreads
- A. thaliana Ler-0
  - PacBio RS P4 SRA
  - PacBio RS P4 filtered subreads
  - PacBio RS P5 SRA
  - PacBio RS P5 filtered subreads
- D. melanogaster ISO1
  - PacBio RS P5 SRA
  - PacBio RS P5 filtered subreads
- Human CHM1htert
  - PacBio RS P5 SRA
  - PacBio RS P5 filtered subreads
MHAP polished sequences and assemblies
Below are the PBcR polished sequences and assemblies for each dataset.
- E. coli K12 MG1655
  - PacBio RS P5 polished sequences
  - PacBio RS P5 contigs
  - PacBio RS P5 Quivered contigs
- S. cerevisiae W303
  - PacBio RS P4 polished sequences
  - PacBio RS P4 contigs
  - PacBio RS P4 full assembly
  - PacBio RS P4 Quivered contigs
  - PacBio RS P4 Quivered full assembly
- A. thaliana Ler-0
  - PacBio RS P5 polished sequences
  - PacBio RS P5 contigs
  - PacBio RS P5 full assembly
  - PacBio RS P5 Quivered contigs
  - PacBio RS P5 Quivered full assembly
- D. melanogaster ISO1
  - PacBio RS P5 polished sequences
  - PacBio RS P5 contigs
  - PacBio RS P5 full assembly
  - PacBio RS P5 Quivered contigs
  - PacBio RS P5 Quivered full assembly
- Human CHM1htert
  - PacBio RS P5 polished sequences
  - PacBio RS P5 contigs
  - PacBio RS P5 full assembly
  - PacBio RS P5 Quivered contigs
  - PacBio RS P5 Quivered full assembly
Amazon Web Services (AWS) Image and Instructions
To simplify CA use and facilitate assembly for researchers without access to computational resources, we have created an AWS image that includes the CA 8.2 release as well as example datasets.
The data drive includes instructions to reproduce the E. coli, S. cerevisiae, and D. melanogaster assemblies. For E. coli, the full run (filtering the H5 file, polishing/assembly, Quiver) can be reproduced for under $3 in under 2 hours (20 minutes for polishing/assembly). The D. melanogaster polishing/assembly costs approximately $300 and takes approximately 10 hours.
- AWS AMI and snapshot
- AWS documentation