Trypanosoma cruzi
Arthur Delcher
January 2005
Whole-genome shotgun sequencing and assembly.
A total of 1,192,680 end-sequences were
generated from five libraries of different insert size from T. cruzi CL-Brener, yielding
768,436,632 nt of high-quality sequence. The bulk (96%) of the sequence coverage came from
2-kb (701,082 reads) and 10-kb (435,593 reads) plasmid libraries constructed at TIGR, with the
remainder generated from a BAC TC3 library (16,405 reads) constructed by Denis LePaslier
(CEPH, Paris) and 35-kb fosmid (16,843 reads) and 90-kb BAC (22,757 reads) libraries
constructed at Children’s Hospital Oakland Research Institute
(http://bacpac.chori.org/tcruzi105.htm). Ninety percent of the reads belonged to mate-pairs.
The mean and variance of insert size for each library were estimated by agarose gel
electrophoresis and re-evaluated by mapping the reads against previously closed BACs to ensure
an accurate measure of the distance between mate-pairs. The results of these analyses and the
coverage of each library are shown in Table S1. Genome assembly was carried out using the
Celera Assembler, with a maximum error rate of 1.5% accepted during the overlapping stage
and the “Bubble Smoothing” sub-routine turned off, since enabling it introduced a large number of
frame-shifts.
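The read counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis; the dictionary keys, the function name `summarize`, and the derived quantities are illustrative assumptions, while the numbers are copied from the text.

```python
# Read counts per library, copied from the text above.
# The library labels are illustrative names, not official identifiers.
LIBRARY_READS = {
    "2-kb plasmid (TIGR)": 701_082,
    "10-kb plasmid (TIGR)": 435_593,
    "BAC TC3 (CEPH)": 16_405,
    "35-kb fosmid (CHORI)": 16_843,
    "90-kb BAC (CHORI)": 22_757,
}
TOTAL_HIGH_QUALITY_NT = 768_436_632  # total high-quality sequence, from the text


def summarize(reads, total_nt):
    """Return the total read count, plasmid share of reads, and mean nt per read."""
    total_reads = sum(reads.values())
    plasmid_reads = sum(n for name, n in reads.items() if "plasmid" in name)
    return {
        "total_reads": total_reads,
        "plasmid_read_fraction": plasmid_reads / total_reads,
        "mean_high_quality_nt_per_read": total_nt / total_reads,
    }


summary = summarize(LIBRARY_READS, TOTAL_HIGH_QUALITY_NT)
print(summary)
```

The five library counts sum to the stated 1,192,680 reads. The plasmid libraries contribute roughly 95% of reads by count; the 96% quoted in the text refers to sequence coverage, which also depends on read length, so the two figures need not agree exactly.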
The assembly resulted in 5517 scaffolds (8780 contigs) totaling 67.2 Mb, with half of the
nucleotides incorporated into scaffolds longer than 150.8 kb (i.e., N50 = 150.8 kb). The scaffolds
contained 3263 gaps of average estimated length of 983 bp (accounting for less than 5% of the
scaffold sequence). The mean contig size was 7.7 kb and the N50 for contig bases was 25,824 bp.
In addition, another 24,123 contigs (N50 = 947 bp) did not fall into scaffolds. Twenty-eight
scaffolds and 117 contigs not assigned to scaffolds were excluded from this dataset because they
correspond to bacterial contamination. The annotated genome corresponds to scaffolds greater
than 5 kb (784 scaffolds containing 3954 contigs), in addition to 54 unassigned contigs greater
than 5 kb (Table S3). The longest scaffold and contig were 987 kb and 256 kb, respectively, but
the contig N50 was only 25.8 kb. In addition, the numerous repetitive regions of the genome,
especially those containing large tandem arrays of multi-gene families, were collapsed and/or
misassembled. The quality of the assembly was evaluated by mapping the existing scaffolds onto
two large reference contigs: one sequenced and assembled from the 106-kb BAC CH105-42O19, and a
previously published 93.4-kb contig from chromosome 3. The scaffolds covered both contigs entirely,
with 149 of 151 protein-coding genes correctly represented in the assembled data. Both haplotypes
were assembled separately in these regions, since all reference sequence was represented by two
overlapping scaffolds, one with >99.8% identity and the other with <96% identity.
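The N50 statistic and the gap figures used in this paragraph can be illustrated as follows. This is a minimal sketch under stated assumptions: the function `n50` implements the usual definition given in the text (half of the nucleotides lie in sequences at least this long), the toy scaffold lengths are hypothetical, and the gap-fraction check assumes the 67.2 Mb total includes the estimated gap lengths.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical toy lengths (kb), only to show the calculation.
toy_scaffolds_kb = [400, 300, 150, 100, 50]
toy_n50 = n50(toy_scaffolds_kb)

# Consistency checks using figures quoted in the text.
N_SCAFFOLDS = 5517
N_CONTIGS = 8780
MEAN_GAP_BP = 983
SCAFFOLD_TOTAL_BP = 67_200_000  # 67.2 Mb, rounded as in the text

# A scaffold built from k contigs has k - 1 gaps, so gaps = contigs - scaffolds.
gap_count = N_CONTIGS - N_SCAFFOLDS
# Assumption: the scaffold total includes the estimated gap lengths.
gap_fraction = gap_count * MEAN_GAP_BP / SCAFFOLD_TOTAL_BP

print(toy_n50, gap_count, round(gap_fraction, 4))
```

The contig and scaffold counts reproduce the 3263 gaps reported above, and the estimated gap total stays below the stated 5% of scaffold sequence.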
A total of 154,155 reads totaling 126 Mb were generated from 2- and 10-kb plasmid libraries
of T. cruzi Esmeraldo and used to classify CL-Brener contigs into the following categories based
on NUCmer matches: (i) similar to the Esmeraldo haplotype (lineage IIb), (ii) dissimilar to
Esmeraldo haplotype, (iii) homozygous or haploid regions, (iv) repetitive regions, and (v)
merged regions. For Esmeraldo reads matching exactly two contigs, the contig with the higher
percentage identity (98–100%) was deemed the IIb haplotype, while the corresponding region of
the other contig was assigned to the non-Esmeraldo-like haplotype. Regions where the
Esmeraldo reads matched only a single contig were interpreted according to coverage and
single nucleotide polymorphism (SNP) density. If the coverage and/or SNP density was low, the
region was taken to be haploid and derived from the IIb parent. If the coverage and/or SNP
density was high, it was taken to be a homozygous region, or a heterozygous region whose very
similar haplotypes had merged during assembly. Conversely, contigs
showing no match to Esmeraldo reads (or matches covering less than 90% of the Esmeraldo
read) were presumed to represent haploid regions from the non-Esmeraldo-like haplotype,
although they could equally be regions of the Esmeraldo genome left unsampled because of low
sequence coverage. When the Esmeraldo reads matched three or more contigs, the
corresponding regions were classified as repetitive. About 30.5 Mb of the assembled T. cruzi
sequence corresponds to heterozygous regions (with 15.2 Mb corresponding to IIb and 15.3 Mb
to the non-Esmeraldo-like haplotype), 2 Mb corresponds to homozygous regions, 5.4 Mb
apparently represents regions where the two haplotypes were merged, and 22.5 Mb corresponds
to repetitive regions which could not be assigned to a particular haplotype (Table S2).
Reassuringly, few contigs (22%) appear to represent chimeras between the two haplotypes,
which supports the relative accuracy of the assembly.
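The classification rules described above can be summarized in a short sketch. This is an illustration of the decision logic only, not the pipeline actually used: the `Match` record, the `classify` function, and the label strings are hypothetical, the input is assumed to be already parsed from NUCmer output, and the coverage/SNP-density judgment is reduced to a single boolean because the text gives no numeric thresholds.

```python
from typing import NamedTuple


class Match(NamedTuple):
    contig: str           # CL-Brener contig hit by an Esmeraldo read
    identity: float       # percent identity of the alignment
    read_fraction: float  # fraction of the Esmeraldo read covered by the match


MIN_READ_FRACTION = 0.90  # from the text: matches covering < 90% of the read are ignored


def classify(matches, high_coverage_or_snp_density=False):
    """Return {contig: label} for one region, following the rules in the text."""
    # Keep the best identity per contig among sufficiently complete matches.
    best = {}
    for m in matches:
        if m.read_fraction >= MIN_READ_FRACTION:
            best[m.contig] = max(best.get(m.contig, 0.0), m.identity)

    if not best:
        # No usable match: the caller would treat such contigs as
        # non-Esmeraldo-like haploid (or unsampled Esmeraldo sequence).
        return {}
    if len(best) == 1:
        (contig,) = best
        label = "homozygous or merged" if high_coverage_or_snp_density else "haploid (IIb)"
        return {contig: label}
    if len(best) == 2:
        # Higher identity is deemed the IIb haplotype; ties are broken arbitrarily here.
        esmeraldo_like, other = sorted(best, key=best.get, reverse=True)
        return {esmeraldo_like: "Esmeraldo-like (IIb)", other: "non-Esmeraldo-like"}
    return {contig: "repetitive" for contig in best}


example = classify([Match("contigA", 99.5, 1.0), Match("contigB", 95.0, 0.97)])
print(example)
```

The sketch does not enforce the 98–100% identity range mentioned for the IIb haplotype; a stricter version could reject two-contig cases where the better match falls below that range.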