(American Journal of Botany. 2007;94:302-312.)
© 2007 Botanical Society of America, Inc.


Genetics

A comparative analysis of the Lactuca and Helianthus (Asteraceae) plastid genomes: identification of divergent regions and categorization of shared repeats

Ruth E. Timme, Jennifer V. Kuehl, Jeffrey L. Boore and Robert K. Jansen

Section of Integrative Biology and Institute of Cellular and Molecular Biology, The University of Texas at Austin, 1 University Station C0930, Austin, Texas 78712 USA; DOE Joint Genome Institute and Lawrence Berkeley National Laboratory, 2800 Mitchell Drive, Walnut Creek, California, 94598 USA; Department of Integrative Biology, University of California, 3060 Valley Life Sciences Building #3140, Berkeley, California 94720 USA; Genome Project Solutions, 1024 Promenade Street, Hercules, California 94547 USA

Received for publication July 5, 2006. Accepted for publication January 5, 2007.

ABSTRACT

We have sequenced two complete chloroplast genomes in the Asteraceae, Helianthus annuus (sunflower), and Lactuca sativa (lettuce), which belong to the distantly related subfamilies, Asteroideae and Cichorioideae, respectively. The Helianthus chloroplast genome is 151 104 bp and the Lactuca genome is 152 772 bp long, which is within the usual size range for chloroplast genomes in flowering plants. When compared to tobacco, both genomes have two inversions: a large 22.8-kb inversion and a smaller 3.3-kb inversion nested within it. Pairwise sequence divergence across all genes, introns, and spacers in Helianthus and Lactuca has resulted in the discovery of new, fast-evolving DNA sequences for use in species-level phylogenetics, such as the trnY-rpoB, trnL-rpl32, and ndhC-trnV spacers. Analysis and categorization of shared repeats resulted in seven classes useful for future repeat studies: double tandem repeats, three or more tandem repeats, direct repeats dispersed in the genome, repeats found in reverse complement orientation, hairpin loops, runs of A's or T's in excess of 12 bp, and gene or tRNA similarity. Results from BLAST searches of our genomic sequence against expressed sequence tag (EST) databases for both genomes produced eight likely RNA edited sites (C -> U changes). These detailed analyses in Asteraceae contribute to a broader understanding of plastid evolution across flowering plants.

Key Words: Asteraceae • chloroplast DNA • comparative genomics • divergent sequence • genomic repeats • Helianthus annuus • Lactuca sativa • RNA editing

Asteraceae is the second largest family of plants, with over 20 000 species (Bremer, 1994). For the past two decades, numerous phylogenetic studies using chloroplast DNA sequence data have contributed to our understanding of the evolutionary relationships within this family. These include comparisons of the chloroplast genes rbcL (Kim et al., 1992) and ndhF (Kim and Jansen, 1995), as well as noncoding DNA from the trnL intron plus the trnL-trnF intergenic spacer (Jansen and Kim, 1994; Bayer and Starr, 1998), matK (Denda et al., 1999), and with lesser resolution, psbA-trnH (Kim et al., 1999). This research culminated in a study by Panero and Funk (2002) that used over 13 000 bp per taxon for the largest family-wide classification revision of Asteraceae in over a hundred years. Still, many uncertainties remain with regard to species-, generic-, and tribal-level relationships. More information on the relative rates of sequence evolution among the Asteraceae plastid genes, and on genome organization as a potential set of characters, would help guide future phylogenetic studies.

To contribute to this area of research, we report two complete chloroplast genome sequences from members of the Asteraceae, Helianthus annuus and Lactuca sativa. These plants belong to two distantly related subfamilies, Asteroideae and Cichorioideae, respectively (Panero and Funk, 2002). Apart from these two genomes, only two other chloroplast genome sequences have been published for any plant within the large group euasterids II: Panax ginseng (Araliaceae) (Kim and Lee, 2004) and Daucus carota (Apiaceae) (Ruhlman et al., 2006).

Early chloroplast genome mapping studies demonstrated that Helianthus annuus and Lactuca sativa share a 22.8-kb inversion relative to members of the subfamily Barnadesioideae (Heyraud et al., 1987 ; Jansen and Palmer, 1987a , b ). By comparison to outgroups, this inversion was shown to be derived, indicating that the Asteroideae and Cichorioideae are more closely related than either is to Barnadesioideae. A later mapping study (Knox et al., 1993 ) and subsequent sequencing study (Kim et al., 2005 ) found that taxa that share this 22.8-kb inversion also contain within this region a second, smaller, 3.3-kb inversion.

The complete chloroplast genome sequences of Helianthus and Lactuca enable analysis of repeat patterns in the genomes and of RNA editing by comparison to available expressed sequence tag (EST) sequences. In addition, because both of these genomes are from crop plants, their sequences will facilitate development of chloroplast genetic engineering technology as demonstrated in recent studies by Daniell and colleagues (Daniell et al., 2004 , 2005 ; Ruiz and Daniell, 2005 ; Saski et al., 2005 ; Lee et al., 2006 ). Knowing the exact sequence of spacer regions is crucial for introducing transgenes into the chloroplast genome (Daniell et al., 2005 ). From a broader perspective, these two genomes will enable Asteraceae, the second largest plant family, to be included in larger analyses of chloroplast genome rearrangement and rates of gene evolution across flowering plants. This is important because plastids are uniform enough to perform interesting comparative studies across flowering plants, but divergent enough to capture interesting evolutionary genomic events. To understand these larger processes, it is necessary at first to perform smaller, more detailed comparative studies, which will then inform our understanding of genome evolution on a broader scale.

MATERIALS AND METHODS

Chloroplast isolation, amplification, and sequencing
Fresh leaf material from Lactuca (Lactuca sativa strain Salinas) and Helianthus (Helianthus annuus line HA383) was used for the chloroplast isolation. These strains are the same ones used in the EST and nuclear genome sequencing efforts of the Compositae Genome Project (Michelmore et al., 2006 ). Chloroplasts were isolated from the fresh leaves by the sucrose-gradient method (Palmer, 1986 ). They were then lysed and amplified using the REPLI-g whole genome amplification kit (Molecular Staging, New Haven, Connecticut, USA). The product was then digested with EcoRI and BstBI, and the clear banding pattern ensured that the amplification product was indeed chloroplast and not nuclear DNA. A detailed description of these steps is outlined in Jansen et al. (2005) . Purified cpDNA was sheared by serial passage through a narrow aperture using a Hydroshear device (Gene Machines, Genomic Solutions, Ann Arbor, Michigan, USA). These fragments were enzymatically treated to repair blunt ends, were gel purified, and then ligated into pUC18 plasmids. These clones were introduced into E. coli by electroporation, plated onto nutrient agar with antibiotic selection, and grown overnight. Colonies were randomly selected and robotically processed through rolling circle amplification of plasmid clones, sequencing reactions using BigDye chemistry (Applied Biosystems, Foster City, California, USA), reaction cleanup using solid-phase reversible immobilization, and sequencing determination using an ABI 3730 XL automated DNA sequencer (Applied Biosystems). Detailed protocols are available at http://www.jgi.doe.gov/sequencing/protocols/protsproduction.html.

Genome assembly and annotation
Sequences from randomly chosen clones were processed using the computer program phred and assembled based on overlapping sequences into a draft genome sequence using the program phrap (Ewing and Green, 1998 ). Quality of sequence determination and assembly was verified by eye using the program Consed (Gordon et al., 1998 ). PCR and sequencing at The University of Texas at Austin were used to bridge gaps and mend low-quality areas of the genome. Additional sequences were added until a completely contiguous consensus was created representing the entire cpDNA. Throughout the entire consensus, we verified that all regions had a quality of Q40 or greater and included at least two overlapping reads. For both Lactuca and Helianthus, most of the genome far exceeds these minimum requirements. The beginning of each genome was standardized for gene annotation to be the first base pair after the IRa (in this case both started right before trnH). The program DOGMA [Dual Organellar GenoMe Annotator (Wyman et al., 2004 )] was used to assist in fully annotating all genes and to identify coding sequence, rRNAs, and tRNAs using the plastid/bacterial genetic code.
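
The Q40 threshold and two-read minimum described above can be illustrated with a short sketch. It relies on the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong, so Q40 corresponds to an error probability of 1 in 10 000. The function names and the example quality and depth values are hypothetical and are not taken from the assembly itself.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to the estimated base-call error probability."""
    return 10 ** (-q / 10)


def passes_minimum(qualities, depths, min_q=40, min_depth=2):
    """Return True if every consensus position meets both thresholds."""
    return all(q >= min_q and d >= min_depth for q, d in zip(qualities, depths))


# Hypothetical per-position consensus qualities and read depths.
example_qualities = [40, 52, 60, 45]
example_depths = [2, 5, 8, 3]

print(phred_to_error_probability(40))
print(passes_minimum(example_qualities, example_depths))
```

A real check would read the per-base qualities and read depths from the assembler output rather than from hard-coded lists.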

Calculating sequence divergence
The whole genome sequence and annotation of Lactuca and Helianthus were compared to the reference genome, Nicotiana tabacum, by a percent identity plot produced by the program MultiPipMaker (Schwartz et al., 2000). The individual genes, rRNAs, tRNAs, introns, and intergenic spacers were also exported from both genomes in DOGMA and aligned by hand in MacClade (Maddison and Maddison, 2002) for a more detailed quantification of sequence divergence. Because we compared only two genomes, we quantified sequence divergence as the p-distance: the proportion (p) of aligned nucleotide sites within a specified region that differ. A Perl script was written to call PAUP* (Swofford, 2003) on each nexus file, calculate the p-distance for each region, and write the results to a tab-delimited file. Indels were counted by hand-aligning each pair of sequences and then counting the number of gaps in the alignment.
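
As an illustration of the p-distance measure, the following minimal sketch computes it for one pair of pre-aligned sequences. It is not the Perl/PAUP* pipeline used in this study. It assumes gaps are written as "-" and, in keeping with the statement in the Discussion that the p-distance excludes any position containing a gap, skips gapped columns. The function names and the toy alignment are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the calculation
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


def count_gap_characters(seq_a: str, seq_b: str, gap: str = "-") -> int:
    """Total number of gap characters in a pairwise alignment."""
    return seq_a.count(gap) + seq_b.count(gap)


# Toy alignment: one substitution among nine ungapped sites, one gap character.
aligned_a = "ACGT-ACGTA"
aligned_b = "ACGTTACCTA"
print(p_distance(aligned_a, aligned_b))
print(count_gap_characters(aligned_a, aligned_b))
```

Counting gap characters separately mirrors the way indels are reported alongside, but not within, the divergence values.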

Examination of repeat structure
REPuter (Kurtz et al., 2001) is a widely used program that identifies repeated sequences in genomes; however, two issues skew the results when using the program. The first is the use of Hamming distance (HD) as the measure of similarity between repeat copies. HD is a fixed parameter: the user defines a single number of allowed differences per repeat, which applies regardless of repeat length. In effect, this inflates the number of smaller repeats found in the genome because a greater percentage of differences is allowed for smaller repeats. The second issue is that REPuter reports overlapping repeats, which overestimates the number of actual repeats present. We solved these problems using the program Comparative Repeat Analysis (CRA) (N. Holtshulte and S. K. Wyman at Williams College, http://bugmaster.jgi-psf.org/repeats/, unpublished program), which runs REPuter and filters its output, identifying the shared and unique repeats among the input genomes. We ran CRA on both the Lactuca and Helianthus genomes and compared them to the reference genome, Nicotiana tabacum. The following constraints were set in CRA to solve the first issue of HD as a measure of similarity: (1) minimum repeat size of 21 bp, and (2) 90% or greater sequence identity for each 10-bp bin (i.e., HD = 2 for 21–30 bp, HD = 3 for 31–40 bp, HD = 4 for 41–50 bp, etc., until no further repeats were found). CRA solves the second issue of overlapping repeats by sifting through the REPuter output and excluding repeats contained within others. Because of time constraints, only repeats of 23 bp or larger were examined by eye and placed into author-defined repeat categories.
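
The two constraints can be sketched as follows. This is not the CRA program, whose internals are not described here. It is a simplified illustration that assumes each repeat copy can be represented as a half-open (start, end) interval on the genome. The function names and the example coordinates are hypothetical.

```python
def allowed_hamming_distance(repeat_length: int, min_length: int = 21) -> int:
    """Length-dependent HD: 2 for 21-30 bp, 3 for 31-40 bp, 4 for 41-50 bp, and so on."""
    if repeat_length < min_length:
        raise ValueError(f"repeats shorter than {min_length} bp are not considered")
    return (repeat_length - 1) // 10


def drop_contained(intervals):
    """Remove intervals that lie entirely within another interval in the list."""
    kept = []
    # Sort by start, then longest first, so any containing interval is seen first.
    ordered = sorted(set(intervals), key=lambda iv: (iv[0], -(iv[1] - iv[0])))
    for start, end in ordered:
        if any(k_start <= start and end <= k_end for k_start, k_end in kept):
            continue
        kept.append((start, end))
    return kept


for length in (21, 30, 31, 40, 41):
    print(length, allowed_hamming_distance(length))

# Hypothetical repeat coordinates; two are nested inside the first interval.
example_intervals = [(100, 130), (105, 126), (200, 221), (100, 121)]
print(drop_contained(example_intervals))
```

Scaling the allowed number of differences with length keeps the permitted divergence near 10% in every bin instead of favoring short repeats.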

Variation between coding sequences and cDNAs
Expressed sequence tags (ESTs) for both Lactuca and Helianthus were downloaded from two different databases: the Compositae Genome Project Database (CGPDB) (Michelmore et al., 2006) and the TIGR Gene Index Database (TIGR, 2005). The complete set of coding sequences from our direct sequencing of Lactuca and Helianthus was searched for similarity by BLAST against their respective EST databases. Significant hits, those with an E-value of 1e-10 or below, were examined by eye for base-pair differences and summarized in a table as possible RNA edited sites.
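
The comparison of genomic and EST sequences was done by eye in this study. Purely as an illustration of the logic, the sketch below scans a pre-aligned pair of sequences and separates candidate editing sites from other mismatches. Because ESTs are cDNA sequences, a C -> U change in the transcript appears as a genomic C aligned to a T in the EST. The function name and the toy sequences are hypothetical, both sequences are assumed to be on the coding strand, and gapped columns are skipped.

```python
def classify_differences(genomic: str, est: str, gap: str = "-"):
    """Split mismatches in an aligned pair into candidate C->U sites and others."""
    if len(genomic) != len(est):
        raise ValueError("sequences must be aligned to the same length")
    candidate_edits = []    # 1-based positions where genomic C aligns to EST T
    other_mismatches = []   # (position, genomic base, EST base)
    for position, (g, e) in enumerate(zip(genomic.upper(), est.upper()), start=1):
        if g == e or gap in (g, e):
            continue  # identical sites and indel columns are not classified
        if g == "C" and e == "T":
            candidate_edits.append(position)
        else:
            other_mismatches.append((position, g, e))
    return candidate_edits, other_mismatches


# Toy alignment: position 5 is a candidate edit, position 8 is another mismatch.
toy_genomic = "ATGCCATCA"
toy_est = "ATGCTATGA"
print(classify_differences(toy_genomic, toy_est))
```

A candidate flagged this way still needs manual review, since a sequencing error in the EST produces the same pattern as a true editing event.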

RESULTS

Size, gene content, order, and organization
The Lactuca chloroplast genome (DQ383816) was 152 772 bp in length (Fig. 1) and contained a pair of inverted repeats (IRs) of 25 034 bp each, separated by a large and small single-copy (LSC and SSC) region of 84 105 bp and 18 599 bp, respectively. The Helianthus chloroplast genome (NC_007977) was 151 104 bp in length, with IRs of 24 633 bp each, separated by an LSC of 83 530 bp and a SSC of 18 308 bp. The G+C content of both Helianthus and Lactuca was 38% across the whole cp genome. Gene content and arrangement were identical in both cpDNAs. They also shared one large (Inv 1) and one small inversion (Inv 2) with respect to Nicotiana tabacum. There were 81 unique protein-coding genes in both genomes, six of which were duplicated in the IR. The four rRNA genes were contained completely within the IR, so they were doubled in the genome. There were 29 unique tRNA genes, of which seven were in the IR, which brought the total number to 36 in the genome. There were 18 unique intron-containing genes, five of these were duplicated in the IR; 16 genes had a single intron, and two genes had two introns.
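
The reported region sizes can be checked against the total genome lengths, since a chloroplast genome with this quadripartite structure consists of the LSC, the SSC, and two copies of the IR. The sketch below uses only the values given in this paragraph. The function and variable names are illustrative.

```python
def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total genome length: LSC + SSC + two copies of the inverted repeat."""
    return lsc + ssc + 2 * ir


# Region sizes (bp) as reported in the text.
GENOMES = {
    "Lactuca sativa": {"total": 152_772, "lsc": 84_105, "ssc": 18_599, "ir": 25_034},
    "Helianthus annuus": {"total": 151_104, "lsc": 83_530, "ssc": 18_308, "ir": 24_633},
}

for name, g in GENOMES.items():
    computed = quadripartite_length(g["lsc"], g["ssc"], g["ir"])
    print(name, computed, computed == g["total"])
```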


Fig. 1. Chloroplast genome map for Helianthus (NC_007977) and Lactuca (DQ383816). Gene order and content are the same in both genomes; they differ slightly in their extent of the IR. Thick lines in the inner circle indicate extent of inverted repeats (IRa and IRb). Genes on outside of the map are transcribed in the clockwise direction, and genes on the inside are transcribed in the counterclockwise direction

 
Sequence divergence
The p-distance (proportion of base pairs that differed between two sequences) for the 25 most divergent noncoding regions of cpDNA is listed in Table 1, with values ranging from 0.084 to 0.226. Figure 2 shows the average p-distance for four classes of genomic regions: protein-coding genes, introns, intergenic spacers, and RNA genes (both rRNA and tRNAs). The intergenic spacer divergence was almost double the next highest class (introns). RNAs held the lowest sequence divergence, at an average of only 0.8%. Table 2 shows the 10 most divergent protein-coding sequences, which ranged from 0.102 to 0.036. These top 10 genes spanned all but two (ATP synthase and RNA polymerase) of the seven gene classes highlighted in Fig. 1. Sequence divergence across the whole genome of Helianthus and Lactuca is graphically summarized in Fig. 3 by a percent identity plot. Nicotiana was included for comparison and the annotation from Helianthus was used for gene locations. As expected, the introns and intergenic spacers were most divergent, but the graph also shows variable regions within coding sequences. The variable regions are scattered across the LSC and SSC regions. The p-distances for all regions analyzed are summarized in Appendix S1 (see Supplemental Data accompanying online version of this article).


Table 1. The 25 genomic regions with the largest p-distances between Lactuca and Helianthus genomes, rank-ordered from most to least divergent. No. indels is the total number of gap characters required for the pair-wise alignment

 

Fig. 2. Average p-distances for four classes of genomic regions in Lactuca and Helianthus

 

Table 2. The 10 most-divergent coding regions between Lactuca and Helianthus genomes

 

Fig. 3. A percent identity plot comparing the Helianthus chloroplast genome to Lactuca and Nicotiana. The top line contains the genes of H. annuus in order with their transcription direction indicated by arrows above the genes. Sequence similarity of aligned regions in L. sativa and N. tabacum (determined by BlastZ) is shown as horizontal bars indicating average percent identity between 50–100% (shown on y-axis of graph). The x-axis represents the coordinate in the chloroplast genome. Parallel lines, as in psbA, indicate repeated sequence in the genome. Highlighted in horizontal shading are the intergenic spacers in Table 1, in gray are the genes in Table 2

 
There is one other Lactuca sativa chloroplast genome on GenBank (NC_007578) whose entire sequence differed by only six single base-pair indel events. Most of these indels were in homopolymer runs, all in noncoding regions. This level of difference could be due to slight differences in cultivars of L. sativa (the NC_007578 strain used was not noted in GenBank). In addition to these minor differences in sequence similarity, the annotation of genes and intron boundaries differed in several locations. The intron boundaries in our annotations were assigned using intron-splicing sites to guide the annotation; in most cases our introns split a codon. And, where the annotation of the 5' or 3' ends of genes was ambiguous we matched our annotation to Nicotiana tabacum, which has been verified with expression studies. We also annotated two additional genes compared to the NC_007578 accession: atpI and ycf1.

Repeat analysis
Because the raw REPuter (Kurtz et al., 2001) output contains many redundant repeats, we used the filtering program Comparative Repeat Analysis (CRA) (N. Holtshulte and S. K. Wyman, Williams College, unpublished data), which identifies and excludes repeats that are contained entirely within other repeats. CRA also identifies shared repeats by using BLAST similarity searches to locate the repeats in other input genomes. The direct output of the CRA analysis is found in Fig. 4A. Most of the repeats were less than 40 bp, with only two larger than 90 bp. Only repeats that are 23 bp or larger were examined by eye for both Helianthus and Lactuca. Because we were interested in the role of repeats in genome organization, we attempted to categorize these repeats and arrived at seven classes (Fig. 4B): (1) three or more tandem repeats, (2) direct repeats dispersed in the genome, (3) repeats found in reverse complement orientation dispersed in the genome, (4) hairpin loops with a predicted secondary (2°) structure based on mfold (Zuker, 2003) (palindromic repeats), (5) double tandem repeats, (6) runs of A's or T's in excess of 12 bp (no repeats of G or C of those lengths are found), and (7) repeats of tRNAs (e.g., similarity between trnS-GCU and trnS-UGA) or portions of protein-encoding genes. Figure 4C shows the updated, more accurate histogram of repeat frequency after recategorizing the repeats by length. For example, the four largest repeats from Fig. 4A were actually composed of smaller tandem repeats, so these were reclassified with their shorter length. In comparison to Fig. 4A, there are far fewer large repeats. This number decreased further when we recognized that two of our categories were not considered "real" repeats for our purposes: gene similarity and tRNA repeats provide evidence of gene duplication, which is shared among most land plants, and poly-A and poly-T runs are actually single subunit repeats (SSRs).
Figure 4D omits these categories of repeats and identifies which of the remaining ones were shared and unique among the genomes. Only two were shared by Nicotiana plus both Asteraceae genomes, four repeats were shared only among Asteraceae genomes, and the rest were unique to Helianthus or Lactuca. The two repeats that were shared among Helianthus, Lactuca, and Nicotiana were as follows: a 32-bp tandem repeat in the rrn4.5-rrn5 spacer and a 42-bp repeat that occurred in the second intron of ycf3, the ndhA intron, and the rps12-ycf15 intergenic spacer. Most repeats were found in noncoding DNA (Fig. 4E). Although more repeats were present in spacers than in introns, the numbers were almost identical when corrected for the number of regions in each class: 3 repeats/18 introns vs. 17 repeats/112 spacers = 0.167 vs. 0.152, respectively. A table with more specific repeat information is located in Appendix S2 (see Supplemental Data accompanying online version of this article).
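
The correction for the number of regions is simple arithmetic and can be reproduced in a few lines. The counts are those given in the paragraph above. The variable and function names are illustrative.

```python
def repeats_per_region(n_repeats: int, n_regions: int) -> float:
    """Average number of repeats per region of a given class."""
    return n_repeats / n_regions


intron_density = repeats_per_region(3, 18)     # 3 repeats across 18 introns
spacer_density = repeats_per_region(17, 112)   # 17 repeats across 112 spacers

print(f"introns: {intron_density:.3f}")
print(f"spacers: {spacer_density:.3f}")
```

The two per-region values differ by less than 0.02, which is the basis for describing them as almost identical.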


Fig. 4. Repeat analyses. (A) REPuter output filtered by the program CRA for repeats 21 bp or larger given a ≥90% sequence similarity. (B) Proportions of 23 bp or larger repeats in seven repeat categories. Numbers include repeats from both Helianthus (HEL) and Lactuca (LAC). (C) Corrected frequency histogram of repeats for Helianthus and Lactuca after manual examination and reassignment of some repeats. Some larger repeats and hairpin loops were actually composed of smaller tandem repeats. Repeats shorter than 23 bp were removed. (D) Summary of shared repeats among Helianthus (HEL), Lactuca (LAC), and Nicotiana (NIC). (E) Location of repeats from Fig. 4D. Repeats that occurred in two regions were counted in both

 
Variation between coding sequences and cDNAs
Expressed sequence tags were available in the databases for only a subset of the relevant chloroplast genes. There were 40 ESTs of Helianthus and eight Lactuca chloroplast genes present between the two databases (CGPDB, http://cgpdb.ucdavis.edu/sitemap.html, and TIGR, http://compbio.dfci.harvard.edu/tgi/plant.html). The differences between the genomic and EST sequences are summarized in Table 3. C -> U changes, which are thought to be conventional angiosperm RNA editing changes (Hirose et al., 1999), occurred in 14 genes in the Helianthus and Lactuca genomes. Eleven of these caused amino-acid changes in the mRNA, two of which induced stop codons (rps3 and psbC). Several indels of a single nucleotide also occurred throughout the EST sequences from both databases. These indel events are not summarized in Table 3.


Table 3. Base pair differences between genomic sequences and processed mRNA in the form of expressed sequence tags (ESTs) for Helianthus and Lactuca

 
DISCUSSION

Genome organization
Although the Helianthus and Lactuca chloroplast genomes are identical in gene content and arrangement, they differ in length. Some of this length difference is due to the length difference in the IRs: the Lactuca IR is 401 bp longer than the Helianthus IR. Even though the Lactuca genome IR is longer, the Helianthus IR extends further into the genes at both its margins relative to Lactuca by 146 bp total. The Helianthus IR extends an additional 105 bp into the coding region of ycf1 compared with Lactuca and an additional 41 bp into rps19. The general boundaries of the Asteraceae IRs (i.e., within ycf1 and rps19) are similar to others reported, although the exact extent into the single-copy genes varies among other published genomes, such as Glycine max, Nicotiana tabacum, Gossypium hirsutum, Eucalyptus globulus, and Panax ginseng (Wakasugi et al., 1998 ; Kim and Lee, 2004 ; Saski et al., 2005 ; Steane, 2005 ; Lee et al., 2006 ).

There is a significant length difference between the IR regions in the Helianthus and Lactuca chloroplast genomes due to a large gene deletion. The genic IR boundaries are expanded in Helianthus, but the overall length of its IR is shorter than Lactuca's due to a deletion of 456 bp in ycf2. This 152 amino-acid (aa) deletion is relative to Lactuca and Nicotiana. The gene ycf2 is absent from the chloroplast genomes of some species (Millen et al., 2001), e.g., the monocot grasses maize, rice, and sugarcane (Maier et al., 1995; Matsuoka et al., 2002; Asano et al., 2004). However, knockout studies of ycf2 have confirmed that it is essential for survival in Nicotiana tabacum (Drescher et al., 2000). If ycf2 is likewise essential in all dicots, then the Helianthus copy must still be functional, because Helianthus persists despite this deletion. No studies on Helianthus have looked at the possible transfer of this gene to the nucleus. Other supporting evidence that the ycf2 gene in Helianthus is functional is that the rest of the gene is highly conserved relative to the Lactuca copy, with only 1.31% sequence divergence. If the large deletion in the Helianthus copy rendered it a pseudogene, we would expect there to be higher sequence divergence and/or internal stop codons unless the deletion were very recent. Early RFLP studies identified a deletion of similar size and location (Schilling and Jansen, 1989, 1997), which was shown to be derived within subtribe Helianthinae. Once a well-resolved subtribe phylogeny is available, the exact timing of this deletion can be better determined.

Other differences between the chloroplast genomes occur with respect to gene length. In Helianthus, the start codon in the accD gene occurs 15 aa further into the gene than it does in Lactuca, a position that matches the annotation in Lotus and Arabidopsis. Lactuca also has a 25-aa insertion in the middle of the accD gene. As with ycf2, we assume the gene is still functional because sequence divergence is otherwise low across the rest of the gene. There are a few other instances where the lengths of genes differ (matK, rbcL, rpl22, rpl33, rpoC2, ycf1, ycf15) by a few amino acids, but the majority of genes between Helianthus and Lactuca have no indels. The tRNAs are even lower in indel events: one involves a 5-bp deletion in trnS-UGA that is shared between Helianthus and Lactuca, and two others are a 1-bp indel in both trnV-UAC and trnI-GAU. We assume these events do not affect tRNA function for similar reasons to those stated earlier.

Our exploration of using previously published EST databases as a comparative tool against our direct genomic sequence gave us some unexpected results. From our experience, we recommend that users of online EST databases be wary of basing detailed conclusions on sequences without accompanying quality scores. None of the possible C -> U editing sites are shared between Helianthus and Lactuca (Table 3), nor are they shared with other published angiosperm RNA edited sites (Tsudzuki et al., 2001 ). Although edited sites can be shared among distantly related taxa (Hirose et al., 1999 ), they might be more recently derived as seen in other studies (Tsudzuki et al., 2001 ). The other bp differences are not considered editing sites because only C -> U changes have been reported in angiosperms (Tsudzuki et al., 2001 ). This finding is interesting because, at least for the CGPDB database, the ESTs were made from the exact same strain of plant as was used in the chloroplast genome sequencing. These differences could be due either to intraspecific polymorphisms or to low-quality sequence in the ESTs (our stringent phred–phrap requirement across the genomic sequence makes it very unlikely that low-quality sequence could be present in the genomic sequence). Both Daniell et al. (2006) and Lee et al. (2006) showed a similar pattern of intraspecific polymorphism between DNA and EST sequences, and in Lee et al.'s case only two of 11 polymorphisms were C -> U edits. Because the CGPDB posts the raw data along with the EST contigs, we checked the chromatograms for the Helianthus gene, psbC, which had two indels present in the EST sequence. Both indels in this gene were miscalled peaks resulting from low-quality sequence data. This also calls into question any base-pair difference, including the C -> U changes, between the ESTs and our genomic sequence. The TIGR database does not post the raw data so we could not determine the authenticity of polymorphisms from this database. 
For this reason, we estimated the expected number of C -> U changes given the number of polymorphic sites we collected. Because there are 12 possible single-base substitutions, we would expect only 1/12 of incorrect base calls to be C -> U changes if they occur at random. To get an unbiased estimate of base-pair differences, we added up the non-C -> U changes between genomic and EST sequences, which totaled 80. Therefore the expected number of C -> U changes, if they occurred by chance, is 7.3 (80 non-C -> U changes / 11 other possible changes). We observed 16 C -> U changes, so we estimate that on average seven to eight of our possible C -> U editing sites are probably due to low-quality reads and eight to nine are possible RNA edited sites. From this experience, we recommend that users of online EST databases exercise caution in using these types of sequence databases without quality scores.
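
The estimate above can be reproduced with a short calculation. It assumes, as the text does, that miscalled bases are spread evenly over the 12 possible substitution types (4 bases × 3 alternatives), so that each of the 11 non-C -> U types contributes on average 80/11 of the observed differences. The variable and function names are illustrative.

```python
N_SUBSTITUTION_TYPES = 4 * 3  # each of 4 bases can be miscalled as any of 3 others


def expected_by_chance(other_changes: int, n_types: int = N_SUBSTITUTION_TYPES) -> float:
    """Expected count for one substitution type, given the total of all other types."""
    return other_changes / (n_types - 1)


observed_c_to_u = 16   # C -> U differences between genomic and EST sequences
non_c_to_u = 80        # all other base-pair differences

expected_c_to_u = expected_by_chance(non_c_to_u)
excess_c_to_u = observed_c_to_u - expected_c_to_u

print(f"expected by chance: {expected_c_to_u:.1f}")
print(f"excess, possible edited sites: {excess_c_to_u:.1f}")
```

The even-spread assumption is a simplification, because sequencing errors are not necessarily uniform across substitution types, so the result is best read as a rough guide.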

Evolutionary implications
Past analyses of repeated sequences in chloroplast genomes have focused primarily on simple sequence repeats (SSRs) (Powell et al., 1995 ; Marshall et al., 2001 ; Provan et al., 2001 ), which are useful for population-level studies. But, tools for identifying and summarizing larger and more complex repeats have only recently emerged as current studies showed they were associated with rearranged genomes (Hupfer et al., 2000 ; Kim et al., 2005 ; Saski et al., 2005 ). We have attempted to place these larger repeats into classes instead of lumping them all together. This will make future comparative repeat studies much more direct and informative. We showed that REPuter vastly overestimates the number of repeats, and even with helpful filters like CRA, the number of larger repeats is still inflated (see Fig. 4A vs. 4C).

Because repeats have been implicated in the rearrangement of chloroplast genomes, we looked for them at our three rearrangement endpoints (Table 4). The two copies of the 31-bp repeat at positions 12 333 and 31 010 in the Helianthus genome are close to the second and third rearrangement endpoints, respectively, although the copy at coordinate 31 010 is 173 bp away from the third endpoint, which is somewhat farther than its partner is from the second endpoint. None of the other repeats stand out as being correlated with the rearrangement. Our analysis only looked at repeats of 23 bp and larger, so further examination of smaller repeats might reveal a higher density of repeats in this area. Another possible explanation for the lack of repeats associated with rearrangement endpoints may relate to the presence of tRNAs flanking all three of our rearrangement endpoints. Other researchers have noticed this association (Hiratsuka et al., 1989) and have hypothesized that tRNA-associated recombination, rather than repeats, may facilitate large inversions.


Table 4. Repeats associated with rearrangements. Repeats in close proximity to genome rearrangements; locations are for the Helianthus genome, based on estimated rearrangement endpoints from Kim et al. (2005).

 
Perhaps the most directly practical result of this analysis is the identification of new genomic regions for use in phylogenetic studies. The study of evolutionary relationships in Asteraceae has ballooned over the past 15 yr, with studies focused both within and between genera. Most studies use a combination of several chloroplast regions and one or two nuclear genes. Usually the chloroplast regions used have a lower rate of evolution than the nuclear DNA, so more sequencing of cpDNA is needed to achieve equivalent resolution (Bayer et al., 2002; Funk et al., 2004). We listed the 25 most divergent regions between Helianthus and Lactuca (Table 1) along with their length and the number of indels. Our p-distance measure excludes any position containing a gap, so indels are not included in the divergence calculation, although the gaps might be useful for phylogenetic reconstruction. For comparison, we included results from a few relevant studies. Panero and Crozier (2003) reviewed the phylogenetic utility of different chloroplast regions specifically for Asteraceae, while the more recent review by Shaw et al. (2005) covered the phylogenetic utility of currently used cpDNA markers across flowering plants. Interestingly, their most informative regions only partially overlap with our results. Daniell et al. (2006) also calculated genome-wide average p-distances in the Solanaceae, which provides a more direct comparison to our study in a different family. Their top 25 regions with the highest p-distance overlap with our set in less than half of the cases (Table 1). Given that a few of these regions are not directly comparable because of rearrangement and that Shaw et al.'s phylogenetic utility is not directly correlated with p-distance, for the most part our results are very different from theirs. This shows that, while there is a trend for some regions to be phylogenetically more useful than others (Shaw et al., 2005), a narrow set of the most variable regions shared across flowering plants is not the most common scenario. Instead, each family or major lineage will most likely have a unique set of variable regions. Of these most divergent regions in Asteraceae longer than 300 bp, seven have not been widely utilized, if ever, for phylogenetic inference. This is a promising finding because plant systematists are constantly searching for more variable chloroplast sequences for resolving species-level relationships. Three of these regions are currently being utilized for phylogenetic analyses in Helianthus (R. E. Timme, unpublished data), and the primer sequences used to amplify these regions are listed in Table 5.
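
The gap-excluding p-distance described above can be illustrated with a minimal sketch. The function name, the gap symbol, and the toy alignment are assumptions made for illustration; the study computed its values with its own tools, and the sequences below are not real Helianthus or Lactuca data.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences,
    skipping every alignment column that contains a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # indel column: excluded from the divergence calculation
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no gap-free positions to compare")
    return differences / compared


# Toy alignment, invented for illustration only
toy_a = "ATGC-TTAGC"
toy_b = "ATGCATTGGA"

# 2 substitutions over 9 gap-free columns; the single indel column is ignored
print(p_distance(toy_a, toy_b))
```

Because gapped columns are dropped from both the numerator and the denominator, a region can carry many indels and still show a low p-distance, which is why the number of indels is reported separately in Table 1.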


Table 5. Primer sequences for three rapidly evolving chloroplast spacer regions

 
Conclusions
Two newly sequenced chloroplast genomes of economically important crop plants, lettuce (Lactuca sativa) and sunflower (Helianthus annuus), are important to a broad array of researchers, from evolutionary biologists to those involved with chloroplast engineering. The analyses performed on these two genomes resulted in the following conclusions. First, our analysis of repeats in rearranged genomes found no correlation between large repeats and rearrangement endpoints. Our novel approach to characterizing repeats found far fewer repeats than are reported by current methods, which greatly overestimate their occurrence. This approach will provide a framework for future comparative repeat analyses. Second, our comparison of coding regions to EST databases uncovered 16 C -> U changes that are possible RNA-edited sites. From our calculation of the error attributable to the database's poor sequence quality, we estimate that about half of these are possible edited sites. This finding is important because it alerts the community to the potential problems of using EST databases for fine-scale analyses. Finally, our analyses of sequence divergence between these two Asteraceae genomes identified the fastest evolving genomic regions, both coding and noncoding. This provides the plant systematics community working in this family with regions to target for phylogenetic analyses. Also, from a broader perspective, these two genomes will enable Asteraceae, the second largest plant family, to be included in broader analyses of plastid evolution, from genome rearrangement to rates of gene evolution across all plastid-containing organisms. Ultimately, a comprehensive understanding of the genomic contribution that a plastid provides its plant host will be of great value to the plant research community.

FOOTNOTES

1 The authors thank R. Linder, B. Simpson, and the reviewers for providing valuable comments on this manuscript. Leaf material was provided by S. Knapp (Helianthus annuus) and R. Michelmore (Lactuca sativa). Z. Cai assisted with the perl scripts. This research was supported in part by a grant from the National Science Foundation (DEB 0120709) and the Sidney F. and Doris Blake Centennial Professorship in Systematic Botany to R.K.J. and a National Science Foundation IGERT grant (0114387) to R.E.T. Part of this work was performed under the auspices of the U.S. Department of Energy, Office of Biological and Environmental Research, by the University of California, Lawrence Berkeley National Laboratory, under contract no. DE-AC02-05CH11231.

2 Author for correspondence (e-mail: retimme{at}mail.utexas.edu)

LITERATURE CITED

Asano T., Tsudzuki T., Takahashi S., Shimada H., Kadowaki K. 2004. Complete nucleotide sequence of the sugarcane (Saccharum officinarum) chloroplast genome: a comparative analysis of four monocot chloroplast genomes. DNA Research 11: 93-99.

Bayer R. J., Greber D. G., Bagnall N. H. 2002. Phylogeny of Australian Gnaphalieae (Asteraceae) based on chloroplast and nuclear sequences, the trnL intron, trnL/trnF intergenic spacer, matK, and ETS. Systematic Botany 27: 801-814.

Bayer R. J., Starr J. R. 1998. Tribal phylogeny of the Asteraceae based on two non-coding chloroplast sequences, the trnL intron and trnL/trnF intergenic spacer. Annals of the Missouri Botanical Garden 85: 242-256.

Bremer K. 1994. Asteraceae: cladistics and classification. Timber Press, Portland, Oregon, USA.

Daniell H., Dhingra A., Ruiz O. 2004. Chloroplast genetic engineering to confer desired plant traits. Methods in Molecular Biology 286: 111-137.

Daniell H., Kumar S., Ruiz O. 2005. Breakthrough in chloroplast genetic engineering of agronomically important crops. Trends in Biotechnology 23: 238-245.

Daniell H., Lee S.-B., Grevich J., Saski C., Quesada-Vargas T., Guda C., Tomkins J., Jansen R. K. 2006. Complete chloroplast genome sequences of Solanum bulbocastanum, Solanum lycopersicum and comparative analyses with other Solanaceae genomes. Theoretical and Applied Genetics 112: 1503-1518.

Denda T., Watanabe K., Kosuge K., Yahara T., Ito M. 1999. Molecular phylogeny of Brachycome (Asteraceae). Plant Systematics and Evolution 217: 299-311.

Drescher A., Ruf S., Calsa T., Carrer H., Bock R. 2000. The two largest chloroplast genome-encoded open reading frames of higher plants are essential genes. Plant Journal 22: 97-104.

Ewing B., Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8: 186-194.

Funk V. A., Chan R., Keeley S. C. 2004. Insights into the evolution of the tribe Arctoteae (Compositae: subfamily Cichorioideae s.s.) using trnL-F, ndhF, and ITS. Taxon 53: 637-655.

Gordon D., Abajian C., Green P. 1998. Consed: a graphical tool for sequence finishing. Genome Research 8: 195-202.

Heyraud F., Serror P., Kuntz M., Steinmetz A., Heizmann P. 1987. Physical map and gene localization on sunflower (Helianthus annuus) chloroplast DNA: evidence for an inversion of a 32.5-kbp segment in the large single copy region. Plant Molecular Biology 9: 485-496.

Hiratsuka J., Shimada H., Whittier R., Ishibashi T., Sakamoto M., Mori M., Kondo C., Honji Y., Sun C. R., Meng B. Y., Li Y. Q., Kanno A., Nishizawa Y., Hirai A., Shinozaki K., Sugiura M. 1989. The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct transfer-RNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Molecular & General Genetics 217: 185-194.

Hirose T., Kusumegi T., Tsudzuki T., Sugiura M. 1999. RNA editing sites in tobacco chloroplast transcripts: editing as a possible regulator of chloroplast RNA polymerase activity. Molecular & General Genetics 262: 462-467.

Hupfer H., Swiatek M., Hornung S., Herrmann R. G., Maier R. M., Chiu W. L., Sears B. 2000. Complete nucleotide sequence of the Oenothera elata plastid chromosome, representing plastome I of the five distinguishable Euoenothera plastomes. Molecular Genetics and Genomics 263: 581-585.

Jansen R. K., Kim K. J. 1994. Implications of chloroplast DNA data for the classifications and phylogeny of the Asteraceae. Compositae: Systematics, Proceedings of the International Compositae Conference, Kew 1: 317-339.

Jansen R. K., Palmer J. D. 1987a. A chloroplast DNA inversion marks an ancient evolutionary split in the sunflower family (Asteraceae). Proceedings of the National Academy of Sciences, USA 84: 5818-5822.

Jansen R. K., Palmer J. D. 1987b. Chloroplast DNA from lettuce and Barnadesia (Asteraceae): structure, gene localization, and characterization of a large inversion. Current Genetics 11: 553-564.

Jansen R. K., Raubeson L. A., Boore J. L., de Pamphilis C. W., Chumley T. W., Haberle R. C., Wyman S. K., Alverson A. J., Peery R., Herman S. J., Fourcade H. M., Kuehl J., McNeal J. R., Leebens-Mack J., Cui L. 2005. Methods for obtaining and analyzing whole chloroplast genome sequences. Methods in Enzymology 348-384.

Kim H.-G., Choi K.-S., Jansen R. K. 2005. Two chloroplast DNA inversions originated simultaneously during the early evolution of the sunflower family (Asteraceae). Molecular Biology and Evolution 22: 1-10.

Kim K.-J., Jansen R. K. 1995. ndhF sequence evolution and the major clades in the sunflower family. Proceedings of the National Academy of Sciences, USA 92: 10379-10383.

Kim K. J., Jansen R. K., Wallace R. S., Michaels H. H., Palmer J. D. 1992. Phylogenetic implications of rbcL sequence variation in the Asteraceae. Annals of the Missouri Botanical Garden 79: 428-445.

Kim K. J., Lee H. L. 2004. Complete chloroplast genome sequences from Korean ginseng (Panax schinseng Nees) and comparative analysis of sequence evolution among 17 vascular plants. DNA Research 11: 247-261.

Kim S. C., Crawford D. J., Jansen R. K., Santos-Guerra A. 1999. The use of a non-coding region of chloroplast DNA in phylogenetic studies of the subtribe Sonchinae (Asteraceae: Lactuceae). Plant Systematics and Evolution 215: 85-99.

Knox E. B., Downie S. R., Palmer J. D. 1993. Chloroplast genome rearrangements and the evolution of giant Lobelias from herbaceous ancestors. Molecular Biology and Evolution 10: 414-430.

Kurtz S., Choudhuri J. V., Ohlebusch E., Schleiermacher C., Stoye J., Giegerich R. 2001. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Research 29: 4633-4642.

Lee S.-B., Kaittanis C., Jansen R. K., Hostetler J. B., Tallon L. J., Town C. D., Daniell H. 2006. The complete chloroplast genome sequence of Gossypium hirsutum: organization and phylogenetic relationships to other angiosperms. BMC Genomics 7: 61.

Maddison D. R., Maddison W. P. 2002. MacClade: analysis of phylogeny and character evolution, 4.05. Sinauer, Sunderland, Massachusetts, USA.

Maier R. M., Neckermann K., Igloi G. L., Kossel H. 1995. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. Journal of Molecular Biology 251: 614-628.

Marshall H. D., Newton C., Ritland K. 2001. Sequence-repeat polymorphisms exhibit the signature of recombination in lodgepole pine chloroplast DNA. Molecular Biology and Evolution 18: 2136-2138.

Matsuoka Y., Yamazaki Y., Ogihara Y., Tsunewaki K. 2002. Whole chloroplast genome comparison of rice, maize, and wheat: implications for chloroplast gene diversification and phylogeny of cereals. Molecular Biology and Evolution 19: 2084-2091.

Michelmore R., Knapp S. J., Bradford K. J., Rieseberg L. H., Jackson L. E., Kesseli R. V. Compositae Genome Project Database. Website http://cgpdb.ucdavis.edu/sitemap.html [accessed November 2005].

Millen R. S., Olmstead R. G., Adams K. L., Palmer J. D., Lao N. T., Heggie L., Kavanagh T. A., Hibberd J. M., Gray J. C., Morden C. W., Calie P. J., Jermiin L. S., Wolfe K. H. 2001. Many parallel losses of infA from chloroplast DNA during angiosperm evolution with multiple independent transfers to the nucleus. Plant Cell 13: 645-658.

Palmer J. D. 1986. Isolation and structural analysis of chloroplast DNA. Methods in Enzymology 118: 167-186.

Panero J. L., Crozier B. S. 2003. Primers for PCR amplification of Asteraceae chloroplast DNA. Lundellia 6: 1-9.

Panero J. L., Funk V. A. 2002. Toward a phylogenetic subfamilial classification for the Compositae (Asteraceae). Proceedings of the Biological Society of Washington 115: 909-922.

Powell W., Morgante M., McDevitt R., Vendramin G. G., Rafalski J. A. 1995. Polymorphic simple sequence repeat regions in chloroplast genomes: applications to the population genetics of pines. Proceedings of the National Academy of Sciences, USA 92: 7759-7763.

Provan J., Powell W., Hollingsworth P. M. 2001. Chloroplast microsatellites: new tools for studies in plant ecology and evolution. Trends in Ecology and Evolution 16: 142-148.

Ruhlman T., Lee S.-B., Jansen R., Hostetler J., Tallon L., Town C., Daniell H. 2006. Complete plastid genome sequence of Daucus carota: implications for biotechnology and phylogeny of angiosperms. BMC Genomics 7: 222.

Ruiz O., Daniell H. 2005. Engineering cytoplasmic male sterility via the chloroplast genome. Plant Physiology 138: 1232-1246.

Saski C., Lee S., Daniell H., Wood T., Tomkins J., Kim H.-G., Jansen R. K. 2005. Complete chloroplast genome sequence of Glycine max and comparative analyses with other legume genomes. Plant Molecular Biology 59: 309-322.

Schilling E. E. 1997. Phylogenetic analysis of Helianthus (Asteraceae) based on chloroplast DNA restriction site data. Theoretical and Applied Genetics 94: 925-933.

Schilling E. E., Jansen R. K. 1989. Restriction fragment analysis of chloroplast DNA and systematics of Viguiera and related genera (Asteraceae: Heliantheae). American Journal of Botany 76: 1769-1778.

Schwartz S., Zhang Z., Frazer K. A., Smit A., Riemer C., Bouck J., Gibbs R., Hardison R., Miller W. 2000. PipMaker: a web server for aligning two genomic DNA sequences. Genome Research 10: 577-586.

Shaw J., Lickey E. B., Beck J. T., Farmer S. B., Liu W., Miller J., Siripun K. C., Winder C. T., Schilling E. E., Small R. L. 2005. The tortoise and the hare II: relative utility of 21 noncoding chloroplast DNA sequences for phylogenetic analysis. American Journal of Botany 92: 142-166.

Steane D. A. 2005. Complete nucleotide sequence of the chloroplast genome from Tasmanian blue gum, Eucalyptus globulus (Myrtaceae). DNA Research 12: 215-220.

Swofford D. L. 2003. PAUP*: phylogenetic analysis using parsimony (*and other methods). Sinauer, Sunderland, Massachusetts, USA.

[TIGR] The Institute for Genomic Research. 2005. TIGR Gene Index Database. Website http://compbio.dfci.harvard.edu/tgi/plant.html [accessed November 2005].

Tsudzuki T., Wakasugi T., Sugiura M. 2001. Comparative analysis of RNA editing sites in higher plant chloroplasts. Journal of Molecular Evolution 53: 327-332.

Wakasugi T., Sugita M., Tsudzuki T., Sugiura M. 1998. Updated gene map of tobacco chloroplast DNA. Plant Molecular Biology Reporter 16: 231-241.

Wyman S. K., Boore J. L., Jansen R. K. 2004. Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20: 3252-3255.

Zuker M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research 31: 3406-3415.

