Trypanosoma cruzi

Arthur Delcher
January 2005

Whole-genome shotgun sequencing and assembly.

A total of 1,192,680 end-sequences were generated from five libraries of different insert sizes from T. cruzi CL-Brener, providing 768,436,632 nt of high-quality sequence. The bulk (96%) of the sequence coverage came from 2-kb (701,082 reads) and 10-kb (435,593 reads) plasmid libraries constructed at TIGR, with the remainder generated from a BAC TC3 library (16,405 reads) constructed by Denis LePaslier (CEPH, Paris) and from 35-kb fosmid (16,843 reads) and 90-kb BAC (22,757 reads) libraries constructed at Children’s Hospital Oakland Research Institute (http://bacpac.chori.org/tcruzi105.htm). Ninety percent of the reads represented mate-pairs. The mean and variance of the insert size of each library were estimated by agarose gel electrophoresis and then re-evaluated by mapping the reads against previously closed BACs, to ensure an accurate measure of the distance between mate-pairs. The results of these analyses and the coverage of each library are shown in Table S1. Genome assembly was carried out using the Celera Assembler, with an error rate of no more than 1.5% accepted during the overlapping stage and with the “Bubble Smoothing” sub-routine turned off, since it resulted in a large number of frame-shifts.
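The re-evaluation of insert sizes from mate-pairs mapped onto finished BACs can be illustrated with a minimal sketch. It assumes each mate-pair has already been placed on a closed BAC and is summarized by the coordinates of its two outer ends. The function name and the example coordinates are hypothetical and are not data from this study.

```python
from statistics import mean, variance

def insert_size_stats(mate_pairs):
    """Return (mean, sample variance) of insert sizes.

    mate_pairs: iterable of (left_end, right_end) coordinates of the two
    outer read ends on a finished BAC (hypothetical input format).
    """
    sizes = [abs(right - left) for left, right in mate_pairs]
    if len(sizes) < 2:
        raise ValueError("need at least two mapped mate-pairs")
    return mean(sizes), variance(sizes)

# Made-up coordinates for a nominal 2-kb plasmid library.
example_pairs = [(100, 2100), (500, 2450), (1000, 3050)]
avg_size, var_size = insert_size_stats(example_pairs)
print(avg_size, var_size)
```

In practice, pairs that map in an inconsistent orientation or at implausible distances would probably be filtered out before computing these statistics; that step is omitted here.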

The assembly resulted in 5517 scaffolds (8780 contigs) totaling 67.2 Mb, with half of the nucleotides incorporated into scaffolds longer than 150.8 kb (i.e., N50 = 150.8 kb). The scaffolds contained 3263 gaps with an average estimated length of 983 bp (accounting for less than 5% of the scaffold sequence). The mean contig size was 7.7 kb and the N50 for contig bases was 25,824 bp. In addition, another 24,123 contigs (N50 = 947 bp) did not fall into scaffolds. Twenty-eight scaffolds and 117 contigs not assigned to scaffolds were excluded from this dataset because they corresponded to bacterial contamination. The annotated genome corresponds to scaffolds greater than 5 kb (784 scaffolds containing 3954 contigs), in addition to 54 unassigned contigs greater than 5 kb (Table S3). The longest scaffold and the longest contig were 987 kb and 256 kb, respectively, although the contig N50 was only 25.8 kb. In addition, the numerous repetitive regions of the genome, especially those containing large tandem arrays of multi-gene families, were collapsed and/or misassembled. The quality of the assembly was evaluated by mapping the existing scaffolds onto two large contigs: one sequenced and assembled from the 106-kb BAC CH105-42O19, and a previously published 93.4-kb contig from chromosome 3. The scaffolds covered both contigs entirely, with 149/151 protein-coding genes correctly represented in the assembled data. Both haplotypes were assembled separately in those areas, since all sequences were represented by two overlapping scaffolds, one with >99.8% identity and the other with <96% identity.
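For readers unfamiliar with the N50 statistic used above, the following minimal sketch computes it from a list of sequence lengths: the N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of all bases. The function name and the toy lengths are illustrative only. The last lines repeat the gap arithmetic from the text, assuming that the 67.2 Mb scaffold total is the appropriate denominator.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy example: total is 30, and the two longest sequences (10 + 8) pass 15.
toy_n50 = n50([2, 4, 6, 8, 10])

# Gap arithmetic using the figures quoted in the text
# (assumption: gaps are measured against the 67.2 Mb scaffold total).
n_gaps, mean_gap_bp, scaffold_total_bp = 3263, 983, 67.2e6
gap_fraction = n_gaps * mean_gap_bp / scaffold_total_bp
print(toy_n50, round(gap_fraction, 3))
```

Under that assumption the estimated gap sequence comes to roughly 3.2 Mb, which is consistent with the statement that gaps account for less than 5% of the scaffold sequence.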

A total of 154,155 reads totaling 126 Mb were generated from 2- and 10-kb plasmid libraries of T. cruzi Esmeraldo and used to classify CL-Brener contigs into the following categories based on NUCmer matches: (i) similar to the Esmeraldo haplotype (lineage IIb), (ii) dissimilar to the Esmeraldo haplotype, (iii) homozygous or haploid regions, (iv) repetitive regions, and (v) merged regions. For Esmeraldo reads matching exactly two contigs, the contig with the higher percentage identity (98–100%) was deemed the IIb haplotype, while the corresponding region of the other contig was assigned to the non-Esmeraldo-like haplotype. Regions where the Esmeraldo reads matched only a single contig were interpreted according to coverage and/or single nucleotide polymorphism (SNP) density: if these were low, the region was taken to represent a haploid region corresponding to the IIb parent; if they were high, it was taken to represent a homozygous region, or a heterozygous region of very similar sequence, that had merged during assembly. Conversely, contigs showing no match to Esmeraldo reads (or matches covering less than 90% of the Esmeraldo read) were presumed to represent haploid regions from the non-Esmeraldo-like haplotype, although they may formally represent regions of the Esmeraldo genome that were not sampled because of low sequence coverage. When the Esmeraldo reads matched three or more contigs, the corresponding regions were classified as repetitive. About 30.5 Mb of the T. cruzi assembled sequence corresponds to heterozygous regions (with 15.2 Mb corresponding to IIb and 15.3 Mb to the non-Esmeraldo-like haplotype), 2 Mb corresponds to homozygous regions, 5.4 Mb apparently represents regions where the two haplotypes were merged, and 22.5 Mb corresponds to repetitive regions that could not be assigned to a particular haplotype (Table S2). Reassuringly, few contigs (22%) appear to represent chimeras between the two haplotypes, pointing to the relative accuracy of the assembly.
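The decision rules described above can be summarized in a short sketch. This is an illustration of the logic as stated in the text, not the pipeline actually used. The function name, the tuple layout, and the boolean flag standing in for "coverage and/or SNP density was high" are assumptions, and the sketch does not enforce the 98–100% identity range mentioned for the IIb haplotype.

```python
def classify_read_matches(matches, high_density=False, min_read_fraction=0.90):
    """Label the contigs matched by one Esmeraldo read.

    matches: list of (contig_id, percent_identity, fraction_of_read_covered),
             e.g. as parsed from NUCmer output (hypothetical layout).
    high_density: True if coverage and/or SNP density in the region is high.
    Returns a dict mapping contig_id to a label; empty if no match qualifies.
    """
    # Matches covering less than 90% of the read are treated as no match.
    kept = [m for m in matches if m[2] >= min_read_fraction]
    if not kept:
        return {}
    if len(kept) >= 3:
        return {contig: "repetitive" for contig, _, _ in kept}
    if len(kept) == 2:
        # The contig with the higher identity is taken as the IIb haplotype.
        best, other = sorted(kept, key=lambda m: m[1], reverse=True)
        return {best[0]: "Esmeraldo-like (IIb)", other[0]: "non-Esmeraldo-like"}
    contig = kept[0][0]
    label = "homozygous or merged" if high_density else "haploid (IIb)"
    return {contig: label}

two_hits = classify_read_matches([("c1", 99.5, 1.0), ("c2", 95.0, 0.95)])
print(two_hits)
```

Contigs that never appear in any returned dictionary would correspond to the case of no Esmeraldo match, which the text presumes to be haploid regions of the non-Esmeraldo-like haplotype or unsampled regions.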