Preprint

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing on bioRxiv.


Pre-Compiled binaries and source

The Celera Assembler (CA) PBcR pipeline corrects (a step also referred to as "pre-assembly") and assembles long-read sequencing data. The CA 8.2 release includes a novel overlapping algorithm for noisy single-molecule sequencing data, which provides increased sensitivity and speed over existing methods. The reference implementation of the overlapping algorithm, MHAP, is also available standalone.
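To give a feel for how hashing can find overlaps between noisy reads, the sketch below illustrates MinHash, a common form of locality-sensitive hashing: each read is reduced to a short sketch of minimum k-mer hash values, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets. This is a minimal illustration, not MHAP's implementation. The function names, the use of BLAKE2b as the hash family, and the parameter values (k = 16, 128 hashes) are all assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=16, num_hashes=128):
    """Build a MinHash sketch: the minimum hash value over all k-mers, per hash function.

    k and num_hashes are illustrative defaults, not values taken from MHAP.
    """
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate the Jaccard similarity as the fraction of sketch positions that agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two reads can then be compared with `estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))`. Reads that share a long stretch of sequence tend to score well above unrelated reads, so candidate overlaps can be screened by comparing fixed-size sketches rather than the full reads.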

  • CA 8.2 source and pre-compiled binaries.
  • MHAP source and pre-compiled jar
  • A list of the commands used for correction and assembly is available here, along with a spec file (human spec). Note that the latest release auto-detects the available RAM and CPUs on a single machine, so you can specify an empty spec file if you want it to use all of the machine's resources. The spec file above restricts the run to 32 GB of RAM and 16 cores. The human spec file is designed for an SGE cluster.
  • For the latest usage information and announcements, please visit the PBcR wiki
  • A package to produce a chromosome tiling for the human genome, given the reference and an assembly, is available here
  • Alignments of 102 BACs from Chaisson et al. to the quivered human assembly are available here
  • A public AWS AMI and snapshot. Documentation on running MHAP on AWS, including a eukaryote example (D. melanogaster using StarCluster), is available. The D. melanogaster run costs approximately $300. The image also includes data for E. coli and S. cerevisiae, along with assembly instructions.

Datasets

Below you will find all the datasets used for testing the assembly/correction pipeline.

  1. E. coli K12 MG1655
  2. S. cerevisiae W303
  3. A. thaliana Ler-0
  4. D. melanogaster ISO1
  5. Human CHM1htert

MHAP Polished sequences and assembly

Below are the PBcR sequences and assemblies.

  1. E. coli K12 MG1655
  2. S. cerevisiae W303
  3. A. thaliana Ler-0
  4. D. melanogaster ISO1
  5. Human CHM1htert

Amazon Web Services (AWS) Image and Instructions

To simplify CA use and facilitate assembly for researchers without access to computational resources, we have created an AWS image that includes the CA 8.2 release as well as example datasets.

The data drive includes instructions to reproduce the E. coli, S. cerevisiae, and D. melanogaster assemblies. For E. coli, the full run (filtering the H5 file, polishing/assembly, and Quiver) can be reproduced for under $3 in under 2 hours (20 minutes for polishing/assembly). The D. melanogaster polishing/assembly requires approximately $300 and approximately 10 hours.