Am. J. Bot. Cross-Journal Searching
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Supplemental Data
Right arrow Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when eLetters are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via ISI Web of Science (58)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Cronn, R. C.
Right arrow Articles by Wendel, J. F.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Cronn, R. C.
Right arrow Articles by Wendel, J. F.
Agricola
Right arrow Articles by Cronn, R. C.
Right arrow Articles by Wendel, J. F.
(American Journal of Botany. 2002;89:707-725.)
© 2002 Botanical Society of America, Inc.


Systematics

Rapid diversification of the cotton genus (Gossypium: Malvaceae) revealed by analysis of sixteen nuclear and chloroplast genes1

Richard C. Cronn2,4, Randall L. Small, Tamara Haselkorn2 and Jonathan F. Wendel2,5

2Department of Botany, Iowa State University, Ames, Iowa 50011 USA; 3Department of Botany, University of Tennessee, Knoxville, Tennessee 37996 USA

Received for publication May 3, 2001. Accepted for publication September 20, 2001.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 LITERATURE CITED
 
Previous molecular phylogenetic studies have failed to resolve the branching order among the major cotton (Gossypium) lineages, and it has been unclear whether this reflects actual history (rapid radiation) or sampling properties of the genes evaluated. In this paper, we reconsider the phylogenetic relationships of diploid cotton genome groups using DNA sequences from 11 single-copy nuclear loci (10 293 base pairs [bp]), nuclear ribosomal DNA (695 bp), and four chloroplast loci (7370 bp). Results from individual loci and combined nuclear and chloroplast DNA partitions reveal that the cotton genome groups radiated in rapid succession following the formation of the genus. Maximum likelihood analysis of nuclear synonymous sites shows that this radiation occurred within a time span equivalent to 17% of the time since the separation of Gossypium from its nearest extant relatives in the genera Kokia and Gossypioides. Chloroplast and nuclear phylogenies differ significantly with respect to resolution of the basal divergence in the genus and to interrelationships among African cottons. This incongruence is due to limited character evolution in cpDNA and either previously unsuspected hybridization or unreliable phylogenetic performance of the cpDNA characters. This study highlights the necessity of using multiple, independent data sets for resolving phylogenetic relationships of rapidly diverged lineages.

Key Words: chloroplast DNA • Gossypium • nuclear DNA • nuclear markers • phylogenetic incongruence


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 LITERATURE CITED
 
One of the more vexing problems in phylogenetic analysis is the occurrence of short interior branches. Because short internodes often have weak support as measured by jackknife and bootstrap resampling, decay analysis, or other indicators of relative confidence, subsequent phylogenetic analyses using additional molecular markers often fail to yield the same topology. Arising from long branch attraction, insufficient signal, or other causes, short internodes are frequently observed in phylogenetic trees and are thought to be a common cause of misleading phylogenetic inference as well as topological incongruence among data sets (reviewed in Wendel and Doyle, 1998 ). The short internode problem is a relative concept, dependent both on the scale of temporal divergence and the relative rates of character change, but it is applicable to all taxonomic levels in both plants and animals (e.g., Olmstead and Sweere, 1994 ; Fehrer, 1996 ; Lara, Patton, and DaSilva, 1996 ; Baldwin, 1997 ; Brinkmann and Philippe, 1999 ; Hughes and Baker, 1999 ; Kennedy et al., 1999 ; Waits et al., 1999 ). A common response when confronted with poorly supported or conflicting short interior branches is to sample more taxa to break up terminal branches, more characters to increase true signal, or both (Hillis, Huelsenbeck, and Cunningham, 1994 ; Cummings, Otto, and Wakeley, 1995 ; Hillis, 1996, 1998 ; but see Naylor and Brown, 1998 ; Nei, Kumar, and Takahashi, 1998 ; Poe, 1998 ; Soltis et al., 1998 ; Bremer et al., 1999 ; Poe and Swofford, 1999 ). Because phylogenetically inferred short interior branches may reflect relatively rapid radiations on a temporal scale that is compressed relative to the rate of character-state change, it may be exceedingly difficult, even impossible, to confidently resolve particular branch orders.

A probable example of rapid and global diversification is offered by the cotton genus Gossypium L. (Malvaceae). This genus of about 50 species (Fryxell, 1979, 1992 ) is differentiated cytogenetically into eight "genome groups" (designated "A" through "G", and "K") that differ in DNA content and chromosome size but not in chromosome number (reviewed in Endrizzi, Turcotte, and Kohel, 1985 ; Percival, Stewart, and Wendel, 1999 ; Wendel et al., 1999 ). The inclusivity of each genome group is congruent with the alignment of species in the most recent taxonomic treatment (Fryxell, 1979, 1992 ), in which three subgenera of diploid species are recognized: the African/Arabian subgenus Gossypium, containing A-, B-, E-, and F-genome cottons; the Australian subgenus Sturtia (R. Brown) Todaro (C-, G-, and K-genomes); and subgenus Houzingenia Fryxell, comprising the New World, D-genome cottons.

Present molecular evidence indicates that each genome group is monophyletic (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ; Seelanan et al., 1999 ), but relationships among genome groups remain in question (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ; Seelanan et al., 1999 ). As a consequence, branching orders among the three subgenera are unresolved, as is monophyly for the African/Arabian subgenus Gossypium. Using cpDNA restriction site data, Wendel and Albert (1992) suggested that the Australian cotton lineage (C-, G-, and K-genomes) represents the earliest divergence within the genus (Fig. 1). This interpretation was later shown to be based on modest character support (Seelanan, Schnabel, and Wendel, 1997 ); moreover, sequence data for the plastid gene ndhF failed to provide corroborating evidence of this relationship (Fig. 1). Uncertainty regarding relationships among genome groups is also apparent in the cpDNA restriction site and ndhF sequence analysis, most notably in the equivocal placement of the African B-genome lineage (Wendel and Albert, 1992 ). This result raised doubts as to the monophyly of subgenus Gossypium, as have later studies based on phylogenetic analyses of nuclear rDNA sequences (Seelanan, Schnabel, and Wendel, 1997 ). Sequence data from the internal transcribed spacer region of the nuclear ribosomal internal transcribed spacer (nrITS) suggested that Arabian E-genome and Australian C-genome clades are sister groups, with the (E + C)-genome clade sister to the New World D-genome clade. This assemblage resolves as sister to the remaining African genome groups (Fig. 1; Seelanan, Schnabel, and Wendel, 1997 ). Thus, existing data sets are contradictory in two major aspects, first with regard to the monophyly of African/Arabian cottons and second with regard to the earliest divergence event in the genus.



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 1. Prior phylogenetic hypotheses for diploid Gossypium genome groups derived from cpDNA restriction sites (Wendel and Albert, 1992 ), chloroplast ndhF gene sequences, and nuclear ribosomal nrITS sequences (Seelanan, Schnabel, and Wendel, 1997 ). Illustrated are the strict consensus trees for four (cpDNA), six (ndhF), and four (nrITS) most parsimonious trees recovered from exhaustive searches. Decay values (d, italicized) and jackknife percentages (j) are shown for nodes where d > 0 and j ≥ 50%, and the treelength (L) and retention index (RI) is shown below each tree. Thespesia populnea was used as the outgroup (OG) for all analyses. Note the phylogenetic alternatives for the primary cladogenic event, as well as the equivocal resolution of the African B- and E-genome clades

 
In each of the foregoing examples, inspection revealed short internal branches for the conflicting resolutions. This incongruence may reflect intrinsic evolutionary properties of the genes used in these analyses, or it may instead reflect an important feature of Gossypium evolution, namely, a relatively rapid diversification of the genus after its origin (Seelanan, Schnabel, and Wendel, 1997 ; Wendel and Doyle, 1998 ). Our purpose in the present paper is to explore these possibilities using multiple data sets derived from the nuclear and plastid genomes. If the major clades within the genus originated in a temporally brief time period, we expect to consistently recover short interior branches, and possibly additional incongruence among newly generated gene trees, as a consequence of spurious relationships inferred from cases in which homoplasy overwhelms true signal.

Our second objective was to elucidate the actual cladistic affinities of the major diploid cotton lineages, notwithstanding the challenges imposed by the suspected history of rapid radiation. Toward this end, we employed two rational approaches to resolving poorly supported nodes, namely, we used multiple genes and generated larger data sets (e.g., Hillis, Huelsenbeck, and Cunningham, 1994 ; Olmstead and Sweere, 1994 ; Cummings, Otto, and Wakeley, 1995 ; Hillis, 1996 ; Nei, Kumar, and Takahashi, 1998 ; Soltis et al., 1998 ; Bremer et al., 1999 ; Hughes and Baker, 1999 ; Walsh et al., 1999 ). In this paper, we reevaluate the issue of the cotton genomes phylogeny using three additional chloroplast DNA regions and 11 single-copy nuclear genes. By combining these data with preexisting chloroplast ndhF and nuclear nrITS sequence information (Seelanan, Schnabel, and Wendel, 1997 ), we were able to evaluate the phylogenetic signal contained within 18 kilobases (kb) of sequence data derived from the chloroplast (7 kb) and nuclear (11 kb) genomes. This substantial data set bolsters the relatively weak signal present in short interior branches and provides complete resolution of the branching order among major cotton lineages. The resolutions provided by both chloroplast and nuclear DNA enable comparisons among individual loci and data partitions, permit a multilocus evaluation of interior node resolution, and provide a detailed picture of the relative timing of cladogenic events that have given rise to extant diploid cotton genome groups.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 LITERATURE CITED
 
Plant materials
Taxa included in this study are listed in Table 1, and GenBank accession numbers for each taxon–locus combination have been archived under Appendix 1 at the Botanical Society of America website (http://ajbsupp.botany.org/v89/Cronn/.doc). Since the monophyly of Gossypium genome groups has been convincingly established (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ; Liu et al., 2001 ), we sampled one representative species from each: these include G. herbaceum (A-genome) G. anomalum (B-genome), G. robinsonii (C-genome), G. raimondii (D-genome), G. somalense (E-genome), G. longicalyx (F-genome), and G. bickii (G-genome). In three cases, we substituted placeholders for these genomes: G. stocksii and G. arboreum represented the E- and A-genomes, respectively, for ndhF analysis, and the A-subgenome sequence from the AD-genome allotetraploid G. hirsutum was used to represent the A-genome in AdhC analysis (Small et al., 1998 ). The distinctive K-genome cottons from northwest Australia were omitted from this study because all phylogenetic analyses to date place K-genome species in a monophyletic clade allied with other Australian cottons (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ; Seelanan et al., 1999 ; Liu et al., 2001 ). We chose Gossypioides kirkii and Kokia drynarioides as outgroup taxa because these two genera comprise the sister lineage to Gossypium (Seelanan, Schnabel, and Wendel, 1997 ). Leaf DNA from each species was extracted using the method of Paterson, Brubaker, and Wendel (1993) . Voucher specimens were deposited in the Hayden Herbarium (ISC) at Iowa State University, Ames, Iowa, USA.


View this table:
[in this window]
[in a new window]
 
Table 1. Taxa included in the present study

 
Chloroplast loci examined
Three loci from the plastid genome were amplified using the polymerase chain reaction (PCR), isolated and sequenced from all ingroup and outgroup taxa: the matK gene and flanking trnK intron sequences (Steele and Vilgalys, 1994 ), the rpl16 intron (Jordan, Courtney, and Neigel, 1996 ), and the trnT-trnL spacer (Taberlet et al., 1991 ). Primers used to amplify these loci follow the original references and are summarized in Fig. 2. Amplification conditions for each locus followed previously optimized protocols (Small et al., 1998 ). Following amplification, reaction products were gel-purified and directly sequenced using the ABI "Big Dye" dideoxynucleotide termination kit (Applied Biosystems, Foster City, California, USA). All reactions used approximately 100 ng of PCR product and 5 pmol of sequencing primer. Sequences were resolved on an ABI377 (Applied Biosystems) automated sequencer at the Iowa State University DNA Sequencing and Synthesis Facility, and automated traces were evaluated using Sequence Navigator (version 1.0.1, Applied Biosystems). In addition to the cpDNA regions noted above, we included ndhF sequences reported previously (Seelanan, Schnabel, and Wendel, 1997 ).



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 2. Gene diagrams for the 12 nuclear and 4 chloroplast loci included in this study. Gene diagrams show the overall length sampled from each locus, as well the location of exon (shaded bars) and intron or noncoding (black line) regions. Primers used for amplification and sequencing are indicated by arrows either above (forward primers) or below (reverse primers) each gene diagram; the name and nucleotide sequence for all primers used are indicated below gene diagrams

 
Nuclear loci examined
To evaluate phylogenetic signal from the nuclear genome, we generated gene trees for 12 nuclear loci (11 reported here, plus the nrITS sequences from Seelanan, Schnabel, and Wendel, 1997 ). Gene diagrams showing primer sequences and intron/exon structures are shown in Fig. 2. All loci were amplified, isolated, and sequenced from all ingroup and outgroup taxa using primers and conditions described in Cronn, Small, and Wendel (1999) . These loci include six anonymous markers that correspond to PstI mapping probes A1341, A1713, A1751, G1121, G1134, and G1262 (Brubaker and Wendel, 1994 ; Cronn and Wendel, 1998 ; Cronn, Small, and Wendel, 1999 ), the two alcohol dehydrogenase genes AdhA and AdhC (Small et al., 1998 ; Cronn, Small, and Wendel, 1999 ; Small and Wendel, 2000 ), two cellulose synthase genes (CesA1 and CesA1b, formerly CelA1 and CelA2; Cronn, Small, and Wendel, 1999 ; R. Cronn et al., unpublished data) and a 5' untranslated intron from a microsomal fatty acid desaturase gene (FAD2-1; Liu et al., 2001 ). An important criterion in selecting the nuclear loci employed in this study is that, with the exception of rDNA, all genes have been shown to be low-copy or unique in A- and D-genome diploid cottons by Southern hybridization experiments (Brubaker and Wendel, 1994 ; Cronn and Wendel, 1998 ; Brubaker, Paterson, and Wendel, 1999 ; Small and Wendel, 2000 ; R. Cronn et al., unpublished data). Additionally, orthological relationships for all markers (except FAD2-1) have been established by comparative genetic mapping, whereby the placement of putative orthologues were compared on A-genome and D-genome linkage maps that include over 300 molecular markers (Cronn and Wendel, 1998 ; Brubaker, Paterson, and Wendel, 1999 ; Small and Wendel, 2000 ; R. Cronn and R. Small, unpublished data). The combination of low-copy number and localization of all genes to syntenic, colinear linkage groups in the diploid A- and D-genomes constitutes rigorous evidence of orthology.

Amplification primers and protocols used to amplify each of these loci follow previously detailed protocols (Small et al., 1998 ; Cronn, Small, and Wendel, 1999 ; Liu et al., 2001 ). Polymerase chain reaction products were gel-purified and directly sequenced for all loci except the FAD2-1 intron. Due to apparent polymerase slippage in the numerous stretches of AT-rich sequence, we found it necessary to clone FAD2-1 PCR products (using pGem-T Easy; Promega, Madison, Wisconsin, USA) prior to sequencing. Sequencing reactions for cloned FAD2-1 inserts used 500 ng of plasmid and 5 pmol of sequencing primer. The consensus sequence of three clones was determined from each taxon to reduce the errors induced by Taq polymerase.

Data analysis
Sequences were aligned with the Clustal function of Sequence Navigator using default parameters. Sequence alignment led to the recognition of numerous gaps in both the individual and combined data sets. While many of these gaps could be aligned with high confidence in assessments of positional homology, five regions from four genes showed complex, possibly superimposed indel patterns. These regions, which consequently were excluded from all analyses, include positions 473–521 of the chloroplast rpl16 intron (2544–2587 of the combined chloroplast data set); positions 294–450 of the chloroplast trnT-trnL spacer (3516–3672 of the combined chloroplast data set); positions 2510–2552 of the trnK intron downstream from the chloroplast matK gene (7328–7370 of the combined chloroplast data set); and positions 935–1015 and 1529–1587 of the nuclear FAD2-1 locus (10 150–10 230 and 10 744–10 802 of the combined nuclear data set). The remaining alignment gaps were treated as missing data in all analyses. To evaluate the phylogenetic signal contributed by insertion/deletion (indel) events, we scored all indels in the cpDNA and nDNA data sets for which we had confidence in positional homology (archived as Appendix 2 at the Botanical Society of America website, http://ajbsupp.botany.org/v89/Cronn/.doc) and analyzed these data separately from nucleotide data using parsimony analysis. Indels were coded as unordered characters with binary states (in the case of simple presence/absence indels) or multistate characters (in the case of indels with variable length but one identical 5' or 3' end) using the criteria described in Simmons and Ochoterena (2000) . Due to their high lability, di- and trinucleotide repeats found in chloroplast rpl16 intron (TA2-4), chloroplast trnK (matK) intron (TTA2-4), chloroplast trnT-trnL intron (GAA3-5), and nuclear FAD2-1 intron (GAA4-10) data sets were specifically excluded from this analysis. Complete sequence data for all loci are available from the author at http://www.botany.iastate.edu/~jfw/HomePage/jfwdata_sets.html or from the Botanical Society of America website (http://www.botany.org/bsa/ajbsupp/).

Phylogenetic analyses, based on maximum parsimony (MP) and maximum likelihood (ML) methods, were performed using PAUP* version 4.0b8 (Swofford, 2001 ). Most-parsimonious trees were found from exhaustive searches with equal weighting of all characters. Maximum likelihood trees were estimated using the Hasegawa-Kishino-Yano (1985) model of sequence evolution, with heuristic searches and tree bisection-reconstruction (TBR) branch swapping. For each analysis, we used empirical base frequencies and estimated the following parameters: the transition/transversion rate, the gamma shape parameter (using four rate categories), and the proportion of invariable sites. We explored other models of sequence evolution and combinations of input parameters but found that these perturbations had no effect on tree topologies and did not significantly improve the likelihood scores.

To evaluate "combinability" (Bull et al., 1993 ; Johnson and Soltis, 1998 ) of the 16 loci included in this study, incongruence length difference tests (ILD; Farris et al., 1995 ) were performed pairwise on all loci within a genome partition (either "chloroplast" or "nuclear"), on two combined data sets representing all sequences from either the chloroplast (cpDNA) or nuclear (nDNA) genome, and on the "indel" data sets derived from cpDNA and nDNA. These tests were implemented using PAUP* 4.0 (as the partition-homogeneity test). Each analysis utilized 1000 replicates, in which random partitions of equal size were created by sampling sites without replacement from the original data. Replicates were analyzed using the parsimony criterion, and branch swapping using TBR was performed on trees held at each step during the stepwise addition.

Several methods were used to evaluate support of tree topologies derived from individual and combined data sets. For parsimony analyses, jackknife resampling (Farris, 1996 ) was performed using 1000 replicates (branch and bound searches) and with 50% of the characters deleted in each replicate. Decay analysis (Bremer, 1988 ) was performed by searching exhaustively for all trees up to ten steps longer than the most-parsimonious and noting how many steps longer than the shortest tree each clade survived. For ML analyses, the likelihood ratio test of Kishino and Hasegawa (KH test; 1989 ) was performed on the combined nuclear and chloroplast data sets to test among alternative topologies obtained from prior phylogenetic studies (detailed in RESULTS).

To estimate the relative divergence times of the major diploid cotton lineages, ML methods were used to evaluate clocklike rate constancy across lineages and to calculate likelihood-based branch lengths. In cases of rate constancy, branch lengths from the optimal ML topology are directly proportional to time (Baum, Small, and Wendel, 1998 ; Sanderson, 1998 ). To test the assumption of clocklike sequence divergence, the likelihoods of clock-enforced and clock-unenforced topologies were computed (using the parameters described above) from reduced data sets composed of nuclear DNA synonymous sites (7978 aligned sites). Since clock-enforced and clock-unenforced analyses yielded identical, single trees from the nuclear DNA data set, we used the likelihood-ratio test (Sanderson, 1998 ) to evaluate whether the clock-enforced likelihood score differed significantly from the clock-unenforced alternative. To convert relative divergence time into illustrative absolute divergence dates, we first computed pairwise synonymous divergences (KS, using Jukes-Cantor transformation) for silent sites in exons and introns between all taxa, using the computer program DnaSP version 3 (Rozas and Rozas, 1999 ). We then computed the mean KS between all ingroup taxa (N = 7) and the two outgroup taxa. This value (KS = 0.0672 ± 0.0556 substitutions per site; N = 14 comparisons) was converted into absolute time using the following equation: Tdivergence = KS /(2)(r), where r corresponds to the absolute rate of synonymous site divergence in nuclear Adh genes from palms (2.6 x 10–9 substitutions · synonymous site–1 · yr–1; Morton, Gaut, and Clegg, 1996 ) or members of the Brassicaceae (1.5 x 10–8 substitutions · synonymous site–1 · yr–1; Koch, Haubold, and Mitchell-Olds, 2000 .) Although there are numerous potential sources for error in molecular clock estimates of divergence (Sanderson, 1998 ), estimates of absolute divergence times are presented for heuristic purposes, as well as to compare to other divergence time estimates for Gossypium (Endrizzi, Katterman, and Geever, 1989 ; Geever, Katterman, and Endrizzi, 1989 ; Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ).


    RESULTS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 LITERATURE CITED
 
Sequence data were generated for 14 different genes and spacer regions, yielding final data sets that include 4 chloroplast (combined length = 7370 bp) and 12 nuclear loci (combined length = 10 988 bp). Results from two of the genes included in the present study, chloroplast ndhF and nuclear nrITS, have already previously been described in the context of evaluating the phylogeny of the Gossypieae (Seelanan, Schnabel, and Wendel, 1997 ). These two data sets were included in the present analysis for comparative analyses of relative rates of sequence evolution and so that the conflicting phylogenies derived from ndhF and nrITS sequence data could be evaluated with respect to the other loci reported here. Due to space limitations, we summarize descriptive statistics for the 16 data sets in Table 2. This table highlights the most important features of each data set and includes absolute and aligned length, number of exon and intron nucleotides, the number of variable and phylogenetically informative sites, and the number of phylogenetically informative indels in each data set.


View this table:
[in this window]
[in a new window]
 
Table 2. Description of regions sequenced from the plastid (top) and nuclear (bottom) genomes.a

 
Characteristics of combined data sets
Combined chloroplast sequences
Aligned sequences from ndhF (2064 bp), matK (2553 bp), the rpl16 intron (1157 bp), and the trnT-trnL spacer (1596 bp) were concatenated into a single data set (the "cpDNA" data set). Incongruence-length difference (ILD) tests on all pairwise combinations of the four individual chloroplast sequences yielded no significant incongruence (P = 0.20–1.00); accordingly, we combined the sequences into a concatenated data set. Concatenated sequences ranged in length from 6875 bp (Kokia drynarioides) to 7109 bp (Gossypioides kirkii). Alignment of concatenated sequences yielded a data set 7370 bp in length, consisting of 3573 exon nucleotides and 3797 intron or spacer nucleotides. Of the 7370 aligned nucleotide positions, 249 positions were excluded due to alignment uncertainty (51 bp from rpl16 intron, 156 bp from trnT-trnL spacer, 42 bp from matK), yielding a final data set of 7121 bp, with a nucleotide composition of 16.8% G, 16.2% C, 29.1% A, and 37.9% T. A total of 54 fixed substitutions separated the ingroup and outgroup, 208 sites (2.9%) were variable, and 49 sites (0.7%) were phylogenetically informative within the ingroup. Pairwise Jukes-Cantor distances ranged from 0.0205 between G. robinsonii and Gossypioides kirkii to 0.0048 between the A- and F-genomes (Table 3). The average Jukes-Cantor distance among ingroup taxa was 0.0103 ± 0.0024 (mean ± standard deviation; N = 21), and the average distance between all ingroup taxa and the two outgroup taxa was 0.0182 ± 0.0017.


View this table:
[in this window]
[in a new window]
 
Table 3. Summary of Jukes-Cantor distancesa for all sites (bold typeface), silent sitesb (italic typeface), and replacement sites (plain typeface) in pairwise comparisons of the cpDNA data set (four chloroplast genes; above diagonal) and the nDNA data set (12 nuclear loci; below diagonal)

 
Combined nuclear sequences
Aligned sequences from A1341 (681 bp), A1713 (595 bp), A1751 (858 bp), AdhA (953 bp), AdhC (940 bp), CesA1 (1086 bp), CesA1b (1177 bp), FAD2-1 intron (1774 bp), G1121 (748 bp), G1134 (547 bp), G1262 (934 bp), and nrITS (695 bp) were concatenated into a single data set (the "nDNA" data set). Pairwise ILD tests on all individual nuclear sequences (66 possible pairwise tests) yielded only a single instance of statistical conflict, namely, between the FAD2-1 intron and nrITS data sets (P = 0.001). These two data sets exhibited the greatest phylogenetic signal among nuclear genes, so their pattern heterogeneity likely reflects different evolutionary constraints or histories. In this regard, we note that nrITS from diploid and polyploid Gossypium have previously been shown to exhibit unexpected patterns of sequence evolution, including non-Mendelian inheritance in allopolyploids (Wendel, Schnabel, and Seelanan, 1995a ) and apparent sequence chimera formation in suspected diploid hybrids (Wendel, Schnabel, and Seelanan, 1995b ). Despite this single result of pattern heterogeneity, comparisons between FAD2-1 intron and nrITS and the remaining ten nuclear genes exhibited no additional heterogeneity. For this reason, we chose to analyze nuclear data sets sequences individually, as well as combined. The concatenated sequences ranged in length from 10 330 bp (in G. raimondii) to 10 839 bp (Kokia drynarioides). Alignment of these sequences yielded a data set that was 10 988 bp in length, consisting of 4318 bp of exon (39.2%) and 6668 bp of intron or spacer DNA (60.8%). This data set provided a cumulative total of 33 phylogenetically informative gaps. Of 10 988 bp, 138 positions (10 150–10 230 and 10 744–10 802 in the combined alignment) were excluded due to alignment uncertainties in the FAD2-1 intron data set. The final nuclear DNA data set used for phylogenetic analysis included 10 850 bp and showed a nucleotide composition of 20.8% G, 18.2% C, 28.9% A, and 32.1% T. A total of 226 fixed substitutions separated the ingroup from outgroup. The data matrix included 704 sites (6.89%) that were variable within the ingroup, and of these, 85 (0.83%) were phylogenetically informative. Comparisons among the composite sequences yielded pairwise nucleotide distances that ranged from a high of 0.0557 between G. somalense and Gossypioides kirkii, to a low of 0.0140 between the C- and G-genome representatives (Table 3). The average Jukes-Cantor distance among ingroup taxa was 0.0242 ± 0.0046, and the average distance between all ingroup taxa and the two outgroup taxa was 0.0494 ± 0.0041.

Combined chloroplast and nuclear sequences
Pairwise ILD test results from a comparison of the combined cpDNA vs. the nDNA data set revealed significant heterogeneity (P = 0.001). This heterogeneity is due primarily to the conflicting phylogenetic resolution of B- and E-genome representatives (discussed in "phylogenetic analysis" sections below). Because of the numerous possible biological sources for partition incongruence between chloroplast DNA and nuclear DNA evolution (Wendel and Doyle, 1998 ), we chose not to combine the cpDNA and nDNA data sets into a single data set for "total evidence" analysis.

Individual and combined indel data sets
Because of the large size of the combined cpDNA and nDNA data sets, a substantial number of indels were identified with high confidence of positional homology (Appendix 2; http://ajbsupp.botany.org/v89/Cronn/.doc). Using outgroup sequences to polarize changes in the ingroup, we scored 24 and 25 potentially phylogenetically informative indels in the cpDNA and nDNA data sets, respectively. The majority of indels scored in these data sets (38 of 49, or 77.6%) were contiguous gaps with identical 5' and 3' ends and hence likely represent the product of single indel events. These indels varied in length from a low of 1 bp to a high of 49 bp. The remaining 11 indels included six perfect duplications 4 bp or greater in length, and 5 indels of variable length with one conserved end (either the 5' or 3'). In contrast to the combined cpDNA and nDNA nucleotide data sets, tests for incongruence between the two indel data sets using the ILD test did not return a significant value (P = 0.092). Accordingly, we chose to combine the cpDNA + nDNA indel data sets into a single indel data set composed of 49 characters.

Phylogenetic analysis of individual and combined data sets
Phylogenetic inference from chloroplast genes
Maximum likelihood topologies obtained for each of the individual chloroplast loci (matK, ndhF, rpl16 intron and trnT-trnL spacer) are summarized in Fig. 3, and results for the combined cpDNA data set are shown in Fig. 4A. Analyses of the individual chloroplast loci yielded more than one MP tree in all cases. Since the single tree produced by ML was always included among the parsimony trees, the trees shown in Fig. 3 can be considered one representative example of the 5 trees (ndhF), 7 trees (matK), 32 trees (rpl16 intron), and 6 trees (trnT-trnL spacer) recovered from MP analysis for individual chloroplast loci. Despite minor topological differences between ML and MP trees, Fig. 3 incorporates relevant information for both of these forms of inference, including branch lengths for ML trees, and descriptive statistics (tree lengths, consistency, and retention indices) and measures of support for MP trees (decay indices [d] greater than zero, and jackknife percentages [j] greater than 50%).



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 3. Phylogenetic trees resulting from analyses of individual data sets. Taxa are designated by their genome designation. For each data set the ML (maximum likelihood) topology is shown. In all cases except one (AdhC), the ML topology was identical to one of the MP (maximum parsimony) trees recovered. The MP length (L), consistency index with autapomorphies excluded (CI), and the retention index (RI) are also shown. Branch lengths are indicated by a three decimal-place value representing the ML estimate, and an integer representing the MP-inferred branch length. Additional values on some branches indicate decay values (>0) and parsimony jackknife percentages (≥50%) for a given node. For AdhC, ML and MP analyses identified different optimal topologies. The ML analysis resulted in an ingroup topology as shown by solid lines connecting [(D + B)(C + G)(F + A')]. The strict consensus of the three most parsimonious trees, as shown by dashed lines, had an ingroup topology of [D{(C + G)(B)(F + A')}]. For this tree, the ML branch lengths are shown above the branches, while MP-inferred branch lengths are shown below the branches

 


View larger version (21K):
[in this window]
[in a new window]
 
Fig. 4. Phylogenetic trees derived from combined data sets. (A) Maximum likelihood (ML) topology derived from the 7121-bp combined chloroplast data set; topology is identical to the single most-parsimonious tree recovered from cladistic analysis. (B) ML topology derived from the 10 850-bp combined nuclear data set; topology is identical to the single most-parsimonious tree recovered from cladistic analysis. (C) Single most-parsimonious tree derived from 49 phylogenetically-informative chloroplast and nuclear indels. Branch lengths are reported in substitutions per site [from ML in panels (A) and (B)], and as evolutionarily inferred steps from maximum parsimony in (A), (B), and (C). Decay values and jackknife support percentages are given for nodes where d > 0 and j > 50%; Maximum parsimony (MP) length (L), consistency index with autapomorphies excluded (CI), and the retention index (RI) are also provided

 
Results obtained from the matK and trnT-trnL spacer data sets generally agree with those previously obtained from cpDNA restriction site and ndhF gene sequence analyses (Fig. 1; Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ). The six most-parsimonious trees obtained from trnT-trnL spacer show a basal dichotomy that divides Gossypium into two clades, one composed of the Australian C- and G-genomes plus the African B-genome and a second composed of African A-, E-, and F-genomes and the New World D-genome. This dichotomy is weakly supported, as the node supporting the B + (C + G) clade shows a jackknife support value of 68%. Similarly, decay analysis shows that this clade collapses in trees one step longer than the most parsimonious, yielding an ingroup trichotomy composed of a C + G clade, an A + E + F + D clade, and a solitary B-genome lineage. Strict consensus trees obtained from matK (seven trees) and ndhF (five trees) indicate an unresolved trichotomy in the genus, consisting of a B + (C + G) clade, an (A + F) clade, and a solitary D-genome clade (ndhF) or E-genome clade (matK). In contrast to trnT-trnL spacer data, the association between the B-genome and the (C + G)-genomes appears robust in matK and ndhF; the B + (C + G) clade survives three steps of decay in the matK data set (j = 95%), and one step of decay in the ndhF data set (j = 74%). In matK and ndhF, the African E-genome shows phylogenetic instability. The strict-consensus MP trees from ndhF place the E-genome in a clade sister to the A + F genomes (an association that survives one step of decay), while the strict consensus MP tree from matK fails to unite the E-genome with any of the African genome groups. Remarkably, only four potentially phylogenetically informative sites were recovered in the rpl16 intron data set despite the fact that over 1100 bp were sequenced from each taxon; as expected, this gene provided little cladistic structure, as indicated by the 32 MP trees recovered.

In contrast with the inconsistent basal branching patterns, individual plastid sequences provide support for two sister-taxon pairs, an Australian clade composed of the (C + G)-genomes and an African clade composed of the (A + F)-genomes. All four cpDNA loci provide unanimous support for a (C + G)-genome clade, with decay and jacknife values ranging from robust in matK (d = 5, j = 100%) to weak in the rpl16 intron (d = 0, j = 62%). Three of the four data sets also include the African B-genome in a monophyletic clade with the (C + G)-genomes, while the fourth data set (rpl16 intron) simply lacks character support for this node. Three of the four loci analyzed (matK, ndhF, and rpl16 intron) also showed support for a sister-taxon relationship between the A- and F-genome cottons (ndhF, d = 1 and j = 84%; matK, d = 1 and j = 97%; rpl16 intron, d = 0 and j < 50%). The trnT-trnL spacer data set fails to support this relationship, as the F-genome is weakly resolved (d = 0, j < 50%) as sister to the E-genome. Aside from evidence supporting a monophyletic Australian (C + G)-genome clade (perhaps associated with the African B-genome lineage) and a sister relationship between the A- and F-genomes, little new phylogenetic insight emerged from the analysis of three individual cpDNA data sets. Nodes left unresolved by these analyses include the placement of the New World D-genome and African-Arabian E-genome, as these taxa exhibited affinities to either the (A + F)-genome clade (matK and trnT-trnL spacer for the D-genome; ndhF, rpl16 intron and trnT-trnL spacer for the E-genome) or were left unresolved in a genus-wide polytomy (ndhF and rpl16 intron for the D-genome; matK for the E-genome).

In sharp contrast to the four incompletely resolved individual chloroplast gene trees, ML and MP analyses of the concatenated cpDNA data sets (7121 included characters per taxon) yield a single most parsimonious tree that fully resolves the brancing order of the seven ingroup taxa (Fig. 4A). While the cpDNA tree combines features of individual cpDNA gene trees, it is important to note that no individual gene tree yielded a topology identical to that inferred from the combined cpDNA data set. This tree shows Gossypium to be divided into two primary lineages, with one clade composed of B + (C + G)-genomes, and the other composed of the D + [E + (A + F)]-genomes. The clade composed of the B + (C + G)-genome cottons is supported by nine synapomorphies and is strongly supported by decay analysis (d = 6) and jackknife resampling (j = 98%); similarly, the cladistic affinity between the Australian (C + G)-genomes remains well supported (d > 10, j = 100%). Conversely, there is only limited internal support for two of the three nodes included in the D + [E + (A + B)]-genome clade. Although the basal position for the D-genome (d = 2, j = 88%) is relatively robust, the placement of the E-genome (d = 0, j = 60%) is only weakly resolved. The sister relationship between the (A + F)-genomes is convincingly indicated, with high decay (d = 8) and jackknife (j = 99%) support.

Phylogenetic inference from nuclear genes
Maximum likelihood topologies for individual nuclear genes are summarized in Fig. 3, and results for the concatenated nDNA data set are summarized in Fig. 4B. Superimposed on these ML trees are statistics and measures of support derived from parsimony analysis, as in nearly all cases the topology obtained from ML analysis did not conflict with the tree(s) obtained from MP analysis. The sole exception to this topological congruence was for AdhC, as ML and MP analyses identified different optimal topologies. In this case, the ML analysis resulted in an ingroup topology of [(D + B)(C + G)(F + A)], as indicated by bold lines in Fig. 3. The strict consensus of the three MP trees showed an ingroup topology of [D{(C + G)(B)(F + A)}] and is indicated by dashed lines. For this tree, ML branch lengths are shown above the branches, while MP-inferred branch lengths are shown in parentheses below the branches.

In a manner similar to cpDNA, 11 of 12 individual nuclear gene trees (all but AdhA) provide strong support for a sister-taxon association between the Australian (C + G)-cottons. The sister relationship indicated for the African (A + F)-genome cottons by cpDNA was less apparent in nuclear trees, as this clade appeared in only 3 of the 12 gene trees (A1713, AdhC, G1121); an equally common resolution, indicated by A1341, CesA1b, and G1262, was for the A- and F-genomes to be placed in an unresolved trichotomy that includes the B-genome. No locus positively contradicted the (A + F) grouping. Beyond the (C + G)-genome pair (and possibly an A + F-genome pair), trees obtained from individual (Fig. 3) and combined (Fig. 4B) nDNA data sets bear little resemblance to those obtained from cpDNA sequences (Figs. 3 and 4A). Features unique to the nuclear gene trees include: (1) placement of the D-genome as sister to the remainder of the genus, creating a basal "New World/Old World" split; (2) placement of the B-genome in an African clade consisting of (B + A + F)-genome cottons; and (3) a weak affinity of the African E-genome to the Australian (C + G)-genome cottons.

The most striking difference between trees based on cpDNA data (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ) and those obtained from nuclear sequences is that the latter repeatedly place the D-genome lineage sister to the remainder of the genus (Figs. 3 and 4B). This finding is supported by both ML and MP analyses of sequences from six different loci (A1341, AdhA, FAD2-1, G1121, G1134, and nrITS), and by parsimony analysis of AdhC. Four of the included nuclear loci (A1713, A1751, AdhC, and G1262) place the D-genome in an unresolved basal polytomy, a placement that does not conflict with a D-genome basal split. In fact, of the 12 nuclear loci sampled, only 1 (CesA1) shows a conflicting alternative for this earliest divergence in the genus. Results from CesA1 provide weak support (d = 0) for a basal split between the Australian clade and the remainder of the genus, a result similar to that obtained from chloroplast trnT-trnL spacer data. While placement of the D-genome in a sister relationship to the remainder of the genus is only weakly supported among individual loci, the combined nDNA data set provides robust support for this placement (Fig. 4B). The combined Old World lineage shares a total of 33 inferred synapomorphies, and the node resolving New World D-genome cottons sister to Old World cottons has high support for decay (d = 9) and jackknife resampling (j = 98%).

A second major difference between cpDNA-derived phylogenies and the trees obtained from nDNA concerns the placement of the African B-genome lineage (represented by G. anomalum). Chloroplast DNA sequence data resolve the B-genome as sister to the Australian (C + G)-genome cottons, rather than with other African cottons (Figs. 3 and 4A). In contrast, the nuclear data sets repeatedly unite the B-genome with one or both of the African A- and F-genome cottons. Five loci (A1341, A1713, CesA1b, G1262, and nrITS) independently support a monophyletic clade comprising the B-, A-, and F-genomes, while a sixth locus (FAD2-1) includes the Arabian E-genome in this assemblage. Although the (B + A + F)-genome clade fails to survive even a single step of decay in most data sets, the association is relatively robust in CesA1b (d = 1, j = 72%) and nrITS (d = 5, j = 99%). Three other loci reveal a close association between the B-genome and either the A-genome or F-genome, but do not support the inclusivity of a (B + A + F)-clade; the B-genome is sister to the A-genome without the F-genome in the AdhA data set, and it resolves as sister to the F-genome without the A-genome in the AdhC and G1121 data sets. For two data sets, AdhC and CesA1, the B-genome resolves as sister to taxa other than A- or F-genome cottons, although these associations have no decay and little jackknife support. For AdhC, the B-genome is placed sister to the D-genome, but only in the ML analysis; this association is not seen in MP, as the B-genome falls into a polytomy composed of Old World cottons that are sister to the New World D-genome. Also, CesA1 shows the B-genome to be the sister taxon of the E-genome, an association not evident in the other data sets. The preponderance of evidence (9 of 12 nuclear genes) indicates that the B-genome is closely allied to the A- and F-genome African cottons, and that these three taxa likely form a monophyletic lineage. Not surprisingly, analysis of the combined nDNA data set provided strong support for this association, as the preferred [B + (A + F)]-genome clade survives 12 steps of decay and yields jackknife resampling support of 99%.

As observed with cpDNA-derived phylogenies, trees obtained from nuclear genes show equivocal placement for the African-Arabian E-genome representative G. somalense. While plastid data generally show the E-genome to belong to a clade composed of African cottons (Fig. 4A), individual nuclear loci show that the E-genome is phylogenetically unstable, occupying alternative positions in the different gene trees (Figs. 3 and 4B). Three loci, AdhA, G1121, and nrITS, support an association of the E-genome with C- and G-genome cottons, albeit only weakly as only a single, nonhomoplasious character unites the E + (C + G)-genomes in each data set. Additionally, this node collapses after one step of decay and appears in only 58% of the jackknife replicates (Fig. 4B) in the combined nDNA data set. Alternative placements for the E-genome (all of which lack decay or jackknife support) include a basal position in the "Old World" cotton clade (A1341), a sister-taxon position with the B-genome (CesA1), as part of a monophyletic "African clade" that includes the A-, B-, and F-genome cottons (FAD2-1 and nrITS) and as one of several lineages comprising a genus-wide, basal polytomy (A1713, A1751, CesA1b, G1134, and G1262). The position of the E-genome in the AdhC gene tree could not be determined, as amplification products could not be obtained from either G. somalense or G. stocksii. This locus has been partially eliminated (G. arboreum) or completely removed (G. herbaceum) from A-genome diploid cottons (Small et al., 1998 ; Small and Wendel, 2000 ), and it seems that E-genome AdhC may have experienced the same fate.

Phylogenetic inferences from indel events
Analysis of the combined cpDNA + nDNA indel data set produced a single most-parsimonious tree (length [L] = 65, retention index [RI] = 0.76; Fig. 4C) that is fully resolved. This tree shows greater overall similarity to the tree produced by the nDNA nucleotide data than by the cpDNA nucleotide data, as a New World/Old World cotton split represents the primary divergence event and because Old World taxa are divided into monophyletic Australian (C + G-genome) and African (A + B + E + F-genome) lineages. The primary divergence between the D-genome/Old World is supported by five synapomorphic indels (all from the nDNA data set), resulting in appreciable jackknife and decay support (j = 86%, d = 2). A similarly robust node supports the Australian (C + G)-genome cotton clade (j = 96%, d = 4), as four cpDNA and three nDNA indels mark this lineage. The final clade, composed entirely of African cottons, shows complete resolution but poor support values (j = 53%, d = 0) at two of the three critical nodes.

While overall support for a monophyletic African clade (e.g., A-, F-, B-, and E-genomes) is low, inspection of cpDNA gap characters highlight an interesting inconsistency within the chloroplast data set. In the combined data set, five characters (ndhF-1, trnT-trnL-6, trnT-trnL-8 from cpDNA; AdhA-3 and CesA1b-3 from nDNA) provide support for placing the B-genome in the clade composed exclusively of African-Arabian taxa. In contrast, only a single character (trnT-trnL-7) from the cpDNA data set provides support for the combined B-genome–Australian clade. In this regard, indels reveal a remarkable feature of the cotton cpDNA data set, namely that chloroplast nucleotides and chloroplast indels inconsistently resolve the B-genome.

Statistical analysis of incongruence between nucleotide data sets
Despite the limited number of taxa included in this study, substantial topological disagreement was evident in comparisons between the cpDNA and nDNA nucleotide-derived trees (e.g., Fig. 4A vs. 4B). Indeed, agreement between the two topologies is limited only to sister relationships for the African A- and F-genomes and the Australian C- and G-genomes, associations that have been previously resolved using cpDNA (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ) and nrITS (Seelanan, Schnabel, and Wendel, 1997 ). Conflict among the remaining genome groups is most evident in the basal divergence event in the genus. The cpDNA tree reveals a basal divergence separating the common ancestor of the (C + G + B)-genomes from the common ancestor of the (D + E + F + A)-genomes. In contrast, nDNA indicates a geographically more coherent divergence, separating the ancestor of the New World D-genome from the ancestor of all Old World (A + F + B + E + C + G genomes) cottons. These two data sets show relatively robust character support for these alternative phylogenetic interpretations, as the basal node in the cpDNA tree (Fig. 4A) survives two steps of decay and exhibits a jackknife value of 88%, while the basal node from the nDNA tree (Fig. 4B) survives nine steps of decay and shows a jackknife value of 98%. Conflict is also evident in the affinity of the African B-genome, as represented here by G. anomalum. This taxon shows a strong affinity to African A- and F-cottons based on nDNA (d > 10, j = 99%), yet it exhibits a closer relationship to the Australian (C + G)-genome lineage based on cpDNA (d = 6, j = 98%).

To statistically evaluate the cpDNA and nDNA alternatives, we compared several topological variations using the Kishino-Hasegawa (KH) {Delta}lnL test. As might be anticipated from the strong character conflict between the cpDNA and nDNA data sets (indicated by the ILD test results), enforcing the alternative topology on either of these two concatenated data sets significantly increased the likelihood score, as indicated by the {Delta}lnL test: enforcing the nDNA-derived topology on the combined cpDNA data set resulted in a significant increase in the maximum likelihood score (unconstrained = 11 943.28; constrained to nDNA topology = 12 003.69; {Delta}lnL = 60.41; P = 0.0016), whereas even greater conflict appears when the nDNA data are forcibly constrained to the topology of the optimal cpDNA ML tree (unconstrained = 23 026.74; constrained to cpDNA topology = 23 099.55; {Delta}lnL = 72.81; P = 0.0001). These results indicate that the topological differences between the cpDNA and nDNA maximum likelihood trees are statistically significant and that the nDNA data set is more significantly perturbed by enforcing the alternative topology.

Additional topological tests were performed to evaluate whether alternative arrangements of selected taxa could be tolerated by the nDNA and cpDNA data sets. Because the testing of alternative, user-defined trees a posteriori may constitute a violation of the underlying assumptions of the KH test (Goldman, Anderson, and Rodrigo, 2000 ), we limited our comparisons to alternative topologies indicated from a priori phylogenetic hypotheses derived from cpDNA restriction site data (Wendel and Albert, 1992 ; Seelanan, Schnabel, and Wendel, 1997 ) and rDNA (Seelanan, Schnabel, and Wendel, 1997 ). To execute these tests, we pruned the number of taxa from nine (seven Gossypium and two outgroups) to a total of four (three ingroup and one outgroup) so that conflict among the most relevant lineages could be evaluated.

To evaluate conflict concerning the primary divergence event among Gossypium genome lineages, we pruned the cpDNA and nDNA data sets to include the A-genome (G. herbaceum), C-genome (G. robinsonii), D-genome (G. raimondii), and one outgroup (Gossypioides kirkii). As three ingroup taxa are considered, there are three possible topologies for evaluation (Fig. 5, upper panel): (A + D) sister to C, which is the cpDNA topology reported by Wendel and Albert (1992) , Seelanan, Schnabel, and Wendel (1997) , and supported by our cpDNA analysis; (C + D) sister to A, the topology reported by Seelanan, Schnabel, and Wendel (1997) based upon nrITS using the distantly related taxon Thespesia populnea as an outgroup; and (A + C) sister to D, a topology that was indicated by analysis of the Adh multigene family in cotton (Small and Wendel, 2000 ) and is robustly supported by the nDNA data presented herein.



View larger version (38K):
[in this window]
[in a new window]
 
Fig. 5. Statistical evaluation of alternative Gossypium topologies using select taxa. Statistical support for each of three possible ingroup topologies based on the cpDNA or nDNA data sets was evaluated by computing Maximum likelihood (ML) scores (–ln L) for each user-input tree using the sequence parameters described in MATERIALS AND METHODS. The resulting increase in the ML score ({Delta}ln L) was evaluated for significance using the likelihood ratio test of Kishino and Hasegawa (1989) . Comparisons that produced statistically significant scores (P ≤ 0.05) are indicated by an asterisk. Upper panel: statistical support for alternative topologies involving the basalmost divergence event in the genus. Taxa selected to represent this divergence event include the A-, C-, and D-genome groups. Lower panel: statistical support for alternative topologies involving the phylogenetic affinity of the African B-genome lineage. Taxa selected to represent this divergence event include the A-, B-, and C-genome groups. The outgroup taxon for all analyses was Gossypioides kirkii

 
As shown in Fig. 5 (upper panel), the preferred topology for the cpDNA data set ([i.e., (A + D) sister to C]) yields an ML score that is statistically indistinguishable from the other two topological alternatives, i.e., (A + C) sister to D (P = 0.391) or (C + D) sister to A (P = 0.551). This result indicates that while ML and MP analyses resolve this node identically, character support for the preferred topology, a basal divergence of the Australian lineage from all other cottons, is limited. In contrast, the preferred topology for the nDNA data set [e.g., (A + C) sister to D] yields a maximum likelihood score that is statistically different from those obtained by the topological alternatives, i.e., (A + D) sister to C, or (C + D) sister to A (P = 0.037 in both cases). It should be noted that while we chose the A- and C-genomes to represent the African and Australian lineages in this analysis, identical results are obtained if the African F- and/or Australian G-genomes are substituted (results not shown).

Conflict in the placement of the B-genome lineage was evaluated in a similar manner by pruning the cpDNA and nDNA data sets to include the A-genome (G. herbaceum), B-genome (G. anomalum), C-genome (G. robinsonii), and one outgroup (Gossypioides kirkii). Hence, this analysis evaluates the strength of association between the B-genome and African vs. Australian cottons (Fig. 5, lower panel). With these included taxa, the three possible topologies are: (A + B) sister to C, the tree topology recovered from cpDNA by Wendel and Albert (1992) , nrITS by Seelanan, Schnabel, and Wendel (1997) , and from our analysis of nDNA; (B + C) sister to A, the tree topology recovered from cpDNA by Seelanan, Schnabel, and Wendel (1997) , and supported by our analysis of cpDNA; and (A + C) sister to B, a topology that has yet to be proposed in literature and is unsupported by our analysis of nDNA and cpDNA.

As shown in Fig. 5, the preferred topology for the cpDNA data set, i.e., (B + C) sister to A, yields an ML score that statistically preferred to the other topological alternatives [e.g., (A + B) sister to C, or (A + C) sister to B; P = 0.036]. Similarly, the preferred topology for the nDNA data set [e.g., African (A + B) sister to Australian C] yields a ML score that is statistically more likely than those obtained from the topological alternatives, i.e., (B + C) sister to A, and (A + C) sister to B (P = 0.008 and 0.009, respectively). Again, while we chose the A- and C-genomes to represent the African and Australian lineages, identical results are obtained if the F- and/or G-genomes are used (results not shown).

The most revealing result from this exercise is the finding that placing the D-genome lineage sister to Old World cottons (in essence, the nDNA topology) results in an insignificant {Delta}lnL score in cpDNA data set (P = 0.391). This result suggests that while cpDNA and nDNA trees differ in their resolution of the basal divergence in Gossypium, placement of the D-genome is sufficiently equivocal in the cpDNA tree that a sister relationship between New World and Old World cottons is consistent with both data sets. Conversely, we find that the optimal placement of the B-genome, as indicated by ML analysis of cpDNA vs. nDNA, is statistically different in these data sets (Fig. 4A and 4B). Phylogenetic discord evident in the variable placement of the B-genome indicates that one of these data partitions has an evolutionary history that leads to phylogenetic incongruence. Events potentially involved include cytoplasmic introgression due to ancient hybridization, differential retention of ancestral chloroplast haplotypes, or poorly understood molecular-evolutionary phenomena that bias character evolution in a way that misleads phylogenetic reconstruction.

Timing the divergence of Gossypium lineages
To gain an understanding of the relative and absolute timing of the divergence events leading to the major extant cotton lineages, we used an ML approach to estimate internal and terminal branch lengths while enforcing a molecular clock constraint (Baum, Small, and Wendel, 1998 ; Sanderson, 1998 ). Maximum likelihood analyses of reduced data sets containing only synonymous sites showed that the cpDNA (4739 aligned positions) and nDNA (7978 aligned positions) synonymous site data sets deviated significantly from molecular clock expectations. Enforcing the clock constraint on the cpDNA data set yielded a significant increase in the ML score (–ln L unconstrained = 8402.92, –ln L clock enforced = 8438.71, {Delta}ln L = 35.79, P = 1.9 x 10–13); similarly, enforcing the clock constraint on the nDNA data set yielded a significant increase in the ML score (–ln L unconstrained = 17 926.08, –ln L clock enforced = 17 940.94, {Delta}lnL = 14.86, P = 4.4 x 10–5). By inspecting the trees visually and evaluating sequence heterogeneity using the 1D rate test of Tajima (Tajima, 1993 ), we found evidence of localized rate heterogeneity (results not shown). In particular, cpDNA from the C- and G-genomes showed elevated rates of divergence as compared to the B- and F-genomes. Similarly, nuclear sequences from the C-genome showed a significantly reduced rate of divergence as compared with nDNA from the A-, D-, and E-genome representatives (results not shown). By pruning the C-genome from the nDNA data set, reanalysis of the data using clock and nonclock models resulted in acceptance of the molecular clock (–lnL unconstrained = 17 500.87, –lnL clock enforced = 17 506.36; {Delta}ln L = 5.49; P = 0.089). Unfortunately, simple pruning of one taxon (B-, C-, F-, or G-genome, singly) or pairs of taxa (C + G, or B + F) from the cpDNA data did not produce data sets that met molecular clock expectations (data not shown). For this reason, only the nDNA data set, pruned of the C-genome, is used to estimate the timing of major divergence events between cotton genome groups.

To illustrate the temporal scale and rapidity of these divergence events, we estimated absolute divergence dates by applying molecular clock rates derived from nuclear genes of plant groups that are represented in the fossil record. Fossil-calibrated synonymous rates of Adh divergence for angiosperms vary about sixfold, with absolute rates ranging from 2.6 x 10–9 substitutions · synonymous site–1 · yr–1 for palms (Morton, Gaut and Clegg, 1996 ), 7.0 x 10–9 substitutions · synonymous site–1 · yr–1 for grasses (Gaut, 1998 ), and 1.5 x 10–8 substitutions · synonymous site–1 · yr–1 for Brassica (Koch, Haubold, and Mitchell-Olds, 2000 ). In the absence of Gossypium fossils, it is not possible to know which value is most appropriate in the present application. Nevertheless, this exercise provides an effective demonstration of the rapidity of cotton radiation, as discussed below, and permits us to consider the merits of earlier hypotheses regarding the timing of radiation of the major Gossypium clades.

Average pairwise synonymous site divergence between the two outgroup species and all seven ingroup cottons was 0.0702 substitutions/synonymous site (standard deviation = 0.0055, N = 6521 synonymous sites per taxon). By using the lowest of the published Adh divergence rates (2.6 x 10–9 substitutions · synonymous site–1 · yr–1 for palms), we estimate that the separation between extant Gossypium diploids and Gossypioides kirkii (node 1, Fig. 6) occurred approximately 13.4 million years (MY) ago. This estimate provides a maximum likelihood-based conversion factor of 13.4 MY : 0.03913 ML units for the distance separating Gossypium from the combined outgroup Gossypioides kirkii + Kokia drynarioides (node 1 to terminals).



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 6. Maximum likelihood (ML) tree obtained from nuclear synonymous site data (7978 bp) using a molecular clock constraint and inferences for absolute timing of divergence among diploid Gossypium lineages. Divergence events inferred by nodes 1–6 are superimposed on a relative scale of ML distances (upper bar), and an absolute timescale (in millions of years before present [MYBP]; lower bar) derived from synonymous site divergence estimates between diploid Gossypium and the outgroup taxa Gossypiodes kirkii (Gk) and Kokia drynarioides (Kd). Absolute divergence times are based on absolute rates of divergence estimated for palm Adh (upper time scale, 2.6 x 10–9 substitutions · synonymous site–1 · yr–1; Morton, Gaut, and Clegg, 1996 ) and Brassica Adh (lower time scale, 1.5 x 10–8 substitutions · synonymous site–1 · yr–1; Koch, Haubold, and Mitchell-Olds, 2000 )

 
On the ML tree shown in Fig. 6, the basalmost radiation within Gossypium (i.e., divergence of the D-genome lineage from all cottons at node 2) is 0.01928 ML units distant from node 1 and 0.01985 ML units from terminal Gossypium taxa. The latter of these two distances (0.01985) approximates the relative age of the genus, although this estimate is conservative because extinct, Gossypium-like taxa may have diverged from extant cottons prior to this node. Using the conversion factor of 13.4 MY : 0.03913 ML units, the primary divergence event indicated by node 2 occurred approximately 6.8 MY ago (assuming a rate of 2.6 x 10–9 substitutions · synonymous site–1 · yr–1). Lineages comprising the Old World clade (A-, B-, E-, F-, G-genomes) share an internal branch of 0.00286 ML units (nodes 2–3), a value that accounts for 14.4% of the age of the genus and corresponds to about 1.0 MY of common history. Members of the African clade (A-, B-, F-genomes) share an internal branch of 0.00287 ML units (node 3 to 5), a value that corresponds to 1.0 MY of common history. The shortest internal branches on the tree unite the A- and F-genomes (0.00104 ML units = 350 000 yr), and the E- and G-genomes (0.00042 ML units = 150 000 yr). If instead we use the faster of the published Adh rates (1.5 x 10–8 for Brassica), the separation between extant Gossypium taxa and Gossypioides kirkii (node 1) is reduced to approximately 2.3 MY ago. Using a conversion ratio of 2.3 MY : 0.03913 ML units, we can infer the separation of the D-genome lineage from the Old World lineage to have occurred about 1.2 MY ago (node 2). Members of the Old World clade exhibit 170 000 yr of shared history (nodes 2–3) and subsequently diverged to form their present-day genome lineages in the following 240 000 yr (nodes 3–6).


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 DISCUSSION
 LITERATURE CITED
 
Rates of divergence and phylogenetic content of chloroplast and nuclear DNA in cotton
The pace and pattern of divergence in chloroplast and nuclear DNA sequences have been described for most diploid cotton species using chloroplast restriction site variation (Wendel a