|
|
||||||||
Systematics |
2Department of Botany, Iowa State University, Ames, Iowa 50011 USA; 3Department of Botany, University of Tennessee, Knoxville, Tennessee 37996 USA
Received for publication May 3, 2001. Accepted for publication September 20, 2001.
| ABSTRACT |
|---|
|
|
|---|
Key Words: chloroplast DNA Gossypium nuclear DNA nuclear markers phylogenetic incongruence
| INTRODUCTION |
|---|
|
|
|---|
A probable example of rapid and global diversification is offered by the cotton genus Gossypium L. (Malvaceae). This genus of about 50 species (Fryxell, 1979, 1992
) is differentiated cytogenetically into eight "genome groups" (designated "A" through "G", and "K") that differ in DNA content and chromosome size but not in chromosome number (reviewed in Endrizzi, Turcotte, and Kohel, 1985
; Percival, Stewart, and Wendel, 1999
; Wendel et al., 1999
). The inclusivity of each genome group is congruent with the alignment of species in the most recent taxonomic treatment (Fryxell, 1979, 1992
), in which three subgenera of diploid species are recognized: the African/Arabian subgenus Gossypium, containing A-, B-, E-, and F-genome cottons; the Australian subgenus Sturtia (R. Brown) Todaro (C-, G-, and K-genomes); and subgenus Houzingenia Fryxell, comprising the New World, D-genome cottons.
Present molecular evidence indicates that each genome group is monophyletic (Wendel and Albert, 1992
; Seelanan, Schnabel, and Wendel, 1997
; Seelanan et al., 1999
), but relationships among genome groups remain in question (Wendel and Albert, 1992
; Seelanan, Schnabel, and Wendel, 1997
; Seelanan et al., 1999
). As a consequence, branching orders among the three subgenera are unresolved, as is monophyly for the African/Arabian subgenus Gossypium. Using cpDNA restriction site data, Wendel and Albert (1992)
suggested that the Australian cotton lineage (C-, G-, and K-genomes) represents the earliest divergence within the genus (Fig. 1). This interpretation was later shown to be based on modest character support (Seelanan, Schnabel, and Wendel, 1997
); moreover, sequence data for the plastid gene ndhF failed to provide corroborating evidence of this relationship (Fig. 1). Uncertainty regarding relationships among genome groups is also apparent in the cpDNA restriction site and ndhF sequence analysis, most notably in the equivocal placement of the African B-genome lineage (Wendel and Albert, 1992
). This result raised doubts as to the monophyly of subgenus Gossypium, as have later studies based on phylogenetic analyses of nuclear rDNA sequences (Seelanan, Schnabel, and Wendel, 1997
). Sequence data from the internal transcribed spacer region of the nuclear ribosomal internal transcribed spacer (nrITS) suggested that Arabian E-genome and Australian C-genome clades are sister groups, with the (E + C)-genome clade sister to the New World D-genome clade. This assemblage resolves as sister to the remaining African genome groups (Fig. 1; Seelanan, Schnabel, and Wendel, 1997
). Thus, existing data sets are contradictory in two major aspects, first with regard to the monophyly of African/Arabian cottons and second with regard to the earliest divergence event in the genus.
|
Our second objective was to elucidate the actual cladistic affinities of the major diploid cotton lineages, notwithstanding the challenges imposed by the suspected history of rapid radiation. Toward this end, we employed two rational approaches to resolving poorly supported nodes, namely, we used multiple genes and generated larger data sets (e.g., Hillis, Huelsenbeck, and Cunningham, 1994
; Olmstead and Sweere, 1994
; Cummings, Otto, and Wakeley, 1995
; Hillis, 1996
; Nei, Kumar, and Takahashi, 1998
; Soltis et al., 1998
; Bremer et al., 1999
; Hughes and Baker, 1999
; Walsh et al., 1999
). In this paper, we reevaluate the issue of the cotton genomes phylogeny using three additional chloroplast DNA regions and 11 single-copy nuclear genes. By combining these data with preexisting chloroplast ndhF and nuclear nrITS sequence information (Seelanan, Schnabel, and Wendel, 1997
), we were able to evaluate the phylogenetic signal contained within 18 kilobases (kb) of sequence data derived from the chloroplast (7 kb) and nuclear (11 kb) genomes. This substantial data set bolsters the relatively weak signal present in short interior branches and provides complete resolution of the branching order among major cotton lineages. The resolutions provided by both chloroplast and nuclear DNA enable comparisons among individual loci and data partitions, permit a multilocus evaluation of interior node resolution, and provide a detailed picture of the relative timing of cladogenic events that have given rise to extant diploid cotton genome groups.
| MATERIALS AND METHODS |
|---|
|
|
|---|
|
|
Amplification primers and protocols used to amplify each of these loci follow previously detailed protocols (Small et al., 1998
; Cronn, Small, and Wendel, 1999
; Liu et al., 2001
). Polymerase chain reaction products were gel-purified and directly sequenced for all loci except the FAD2-1 intron. Due to apparent polymerase slippage in the numerous stretches of AT-rich sequence, we found it necessary to clone FAD2-1 PCR products (using pGem-T Easy; Promega, Madison, Wisconsin, USA) prior to sequencing. Sequencing reactions for cloned FAD2-1 inserts used 500 ng of plasmid and 5 pmol of sequencing primer. The consensus sequence of three clones was determined from each taxon to reduce the errors induced by Taq polymerase.
Data analysis
Sequences were aligned with the Clustal function of Sequence Navigator using default parameters. Sequence alignment led to the recognition of numerous gaps in both the individual and combined data sets. While many of these gaps could be aligned with high confidence in assessments of positional homology, five regions from four genes showed complex, possibly superimposed indel patterns. These regions, which consequently were excluded from all analyses, include positions 473521 of the chloroplast rpl16 intron (25442587 of the combined chloroplast data set); positions 294450 of the chloroplast trnT-trnL spacer (35163672 of the combined chloroplast data set); positions 25102552 of the trnK intron downstream from the chloroplast matK gene (73287370 of the combined chloroplast data set); and positions 9351015 and 15291587 of the nuclear FAD2-1 locus (10 15010 230 and 10 74410 802 of the combined nuclear data set). The remaining alignment gaps were treated as missing data in all analyses. To evaluate the phylogenetic signal contributed by insertion/deletion (indel) events, we scored all indels in the cpDNA and nDNA data sets for which we had confidence in positional homology (archived as Appendix 2 at the Botanical Society of America website, http://ajbsupp.botany.org/v89/Cronn/.doc) and analyzed these data separately from nucleotide data using parsimony analysis. Indels were coded as unordered characters with binary states (in the case of simple presence/absence indels) or multistate characters (in the case of indels with variable length but one identical 5' or 3' end) using the criteria described in Simmons and Ochoterena (2000)
. Due to their high lability, di- and trinucleotide repeats found in chloroplast rpl16 intron (TA2-4), chloroplast trnK (matK) intron (TTA2-4), chloroplast trnT-trnL intron (GAA3-5), and nuclear FAD2-1 intron (GAA4-10) data sets were specifically excluded from this analysis. Complete sequence data for all loci are available from the author at http://www.botany.iastate.edu/
jfw/HomePage/jfwdata_sets.html or from the Botanical Society of America website (http://www.botany.org/bsa/ajbsupp/).
Phylogenetic analyses, based on maximum parsimony (MP) and maximum likelihood (ML) methods, were performed using PAUP* version 4.0b8 (Swofford, 2001
). Most-parsimonious trees were found from exhaustive searches with equal weighting of all characters. Maximum likelihood trees were estimated using the Hasegawa-Kishino-Yano (1985)
model of sequence evolution, with heuristic searches and tree bisection-reconstruction (TBR) branch swapping. For each analysis, we used empirical base frequencies and estimated the following parameters: the transition/transversion rate, the gamma shape parameter (using four rate categories), and the proportion of invariable sites. We explored other models of sequence evolution and combinations of input parameters but found that these perturbations had no effect on tree topologies and did not significantly improve the likelihood scores.
To evaluate "combinability" (Bull et al., 1993
; Johnson and Soltis, 1998
) of the 16 loci included in this study, incongruence length difference tests (ILD; Farris et al., 1995
) were performed pairwise on all loci within a genome partition (either "chloroplast" or "nuclear"), on two combined data sets representing all sequences from either the chloroplast (cpDNA) or nuclear (nDNA) genome, and on the "indel" data sets derived from cpDNA and nDNA. These tests were implemented using PAUP* 4.0 (as the partition-homogeneity test). Each analysis utilized 1000 replicates, in which random partitions of equal size were created by sampling sites without replacement from the original data. Replicates were analyzed using the parsimony criterion, and branch swapping using TBR was performed on trees held at each step during the stepwise addition.
Several methods were used to evaluate support of tree topologies derived from individual and combined data sets. For parsimony analyses, jackknife resampling (Farris, 1996
) was performed using 1000 replicates (branch and bound searches) and with 50% of the characters deleted in each replicate. Decay analysis (Bremer, 1988
) was performed by searching exhaustively for all trees up to ten steps longer than the most-parsimonious and noting how many steps longer than the shortest tree each clade survived. For ML analyses, the likelihood ratio test of Kishino and Hasegawa (KH test; 1989
) was performed on the combined nuclear and chloroplast data sets to test among alternative topologies obtained from prior phylogenetic studies (detailed in RESULTS).
To estimate the relative divergence times of the major diploid cotton lineages, ML methods were used to evaluate clocklike rate constancy across lineages and to calculate likelihood-based branch lengths. In cases of rate constancy, branch lengths from the optimal ML topology are directly proportional to time (Baum, Small, and Wendel, 1998
; Sanderson, 1998
). To test the assumption of clocklike sequence divergence, the likelihoods of clock-enforced and clock-unenforced topologies were computed (using the parameters described above) from reduced data sets composed of nuclear DNA synonymous sites (7978 aligned sites). Since clock-enforced and clock-unenforced analyses yielded identical, single trees from the nuclear DNA data set, we used the likelihood-ratio test (Sanderson, 1998
) to evaluate whether the clock-enforced likelihood score differed significantly from the clock-unenforced alternative. To convert relative divergence time into illustrative absolute divergence dates, we first computed pairwise synonymous divergences (KS, using Jukes-Cantor transformation) for silent sites in exons and introns between all taxa, using the computer program DnaSP version 3 (Rozas and Rozas, 1999
). We then computed the mean KS between all ingroup taxa (N = 7) and the two outgroup taxa. This value (KS = 0.0672 ± 0.0556 substitutions per site; N = 14 comparisons) was converted into absolute time using the following equation: Tdivergence = KS /(2)(r), where r corresponds to the absolute rate of synonymous site divergence in nuclear Adh genes from palms (2.6 x 109 substitutions · synonymous site1 · yr1; Morton, Gaut, and Clegg, 1996
) or members of the Brassicaceae (1.5 x 108 substitutions · synonymous site1 · yr1; Koch, Haubold, and Mitchell-Olds, 2000
.) Although there are numerous potential sources for error in molecular clock estimates of divergence (Sanderson, 1998
), estimates of absolute divergence times are presented for heuristic purposes, as well as to compare to other divergence time estimates for Gossypium (Endrizzi, Katterman, and Geever, 1989
; Geever, Katterman, and Endrizzi, 1989
; Wendel and Albert, 1992
; Seelanan, Schnabel, and Wendel, 1997
).
| RESULTS |
|---|
|
|
|---|
|
|
Combined chloroplast and nuclear sequences
Pairwise ILD test results from a comparison of the combined cpDNA vs. the nDNA data set revealed significant heterogeneity (P = 0.001). This heterogeneity is due primarily to the conflicting phylogenetic resolution of B- and E-genome representatives (discussed in "phylogenetic analysis" sections below). Because of the numerous possible biological sources for partition incongruence between chloroplast DNA and nuclear DNA evolution (Wendel and Doyle, 1998
), we chose not to combine the cpDNA and nDNA data sets into a single data set for "total evidence" analysis.
Individual and combined indel data sets
Because of the large size of the combined cpDNA and nDNA data sets, a substantial number of indels were identified with high confidence of positional homology (Appendix 2; http://ajbsupp.botany.org/v89/Cronn/.doc). Using outgroup sequences to polarize changes in the ingroup, we scored 24 and 25 potentially phylogenetically informative indels in the cpDNA and nDNA data sets, respectively. The majority of indels scored in these data sets (38 of 49, or 77.6%) were contiguous gaps with identical 5' and 3' ends and hence likely represent the product of single indel events. These indels varied in length from a low of 1 bp to a high of 49 bp. The remaining 11 indels included six perfect duplications 4 bp or greater in length, and 5 indels of variable length with one conserved end (either the 5' or 3'). In contrast to the combined cpDNA and nDNA nucleotide data sets, tests for incongruence between the two indel data sets using the ILD test did not return a significant value (P = 0.092). Accordingly, we chose to combine the cpDNA + nDNA indel data sets into a single indel data set composed of 49 characters.
Phylogenetic analysis of individual and combined data sets
Phylogenetic inference from chloroplast genes
Maximum likelihood topologies obtained for each of the individual chloroplast loci (matK, ndhF, rpl16 intron and trnT-trnL spacer) are summarized in Fig. 3, and results for the combined cpDNA data set are shown in Fig. 4A. Analyses of the individual chloroplast loci yielded more than one MP tree in all cases. Since the single tree produced by ML was always included among the parsimony trees, the trees shown in Fig. 3 can be considered one representative example of the 5 trees (ndhF), 7 trees (matK), 32 trees (rpl16 intron), and 6 trees (trnT-trnL spacer) recovered from MP analysis for individual chloroplast loci. Despite minor topological differences between ML and MP trees, Fig. 3 incorporates relevant information for both of these forms of inference, including branch lengths for ML trees, and descriptive statistics (tree lengths, consistency, and retention indices) and measures of support for MP trees (decay indices [d] greater than zero, and jackknife percentages [j] greater than 50%).
|
|
In contrast with the inconsistent basal branching patterns, individual plastid sequences provide support for two sister-taxon pairs, an Australian clade composed of the (C + G)-genomes and an African clade composed of the (A + F)-genomes. All four cpDNA loci provide unanimous support for a (C + G)-genome clade, with decay and jacknife values ranging from robust in matK (d = 5, j = 100%) to weak in the rpl16 intron (d = 0, j = 62%). Three of the four data sets also include the African B-genome in a monophyletic clade with the (C + G)-genomes, while the fourth data set (rpl16 intron) simply lacks character support for this node. Three of the four loci analyzed (matK, ndhF, and rpl16 intron) also showed support for a sister-taxon relationship between the A- and F-genome cottons (ndhF, d = 1 and j = 84%; matK, d = 1 and j = 97%; rpl16 intron, d = 0 and j < 50%). The trnT-trnL spacer data set fails to support this relationship, as the F-genome is weakly resolved (d = 0, j < 50%) as sister to the E-genome. Aside from evidence supporting a monophyletic Australian (C + G)-genome clade (perhaps associated with the African B-genome lineage) and a sister relationship between the A- and F-genomes, little new phylogenetic insight emerged from the analysis of three individual cpDNA data sets. Nodes left unresolved by these analyses include the placement of the New World D-genome and African-Arabian E-genome, as these taxa exhibited affinities to either the (A + F)-genome clade (matK and trnT-trnL spacer for the D-genome; ndhF, rpl16 intron and trnT-trnL spacer for the E-genome) or were left unresolved in a genus-wide polytomy (ndhF and rpl16 intron for the D-genome; matK for the E-genome).
In sharp contrast to the four incompletely resolved individual chloroplast gene trees, ML and MP analyses of the concatenated cpDNA data sets (7121 included characters per taxon) yield a single most parsimonious tree that fully resolves the brancing order of the seven ingroup taxa (Fig. 4A). While the cpDNA tree combines features of individual cpDNA gene trees, it is important to note that no individual gene tree yielded a topology identical to that inferred from the combined cpDNA data set. This tree shows Gossypium to be divided into two primary lineages, with one clade composed of B + (C + G)-genomes, and the other composed of the D + [E + (A + F)]-genomes. The clade composed of the B + (C + G)-genome cottons is supported by nine synapomorphies and is strongly supported by decay analysis (d = 6) and jackknife resampling (j = 98%); similarly, the cladistic affinity between the Australian (C + G)-genomes remains well supported (d > 10, j = 100%). Conversely, there is only limited internal support for two of the three nodes included in the D + [E + (A + B)]-genome clade. Although the basal position for the D-genome (d = 2, j = 88%) is relatively robust, the placement of the E-genome (d = 0, j = 60%) is only weakly resolved. The sister relationship between the (A + F)-genomes is convincingly indicated, with high decay (d = 8) and jackknife (j = 99%) support.
Phylogenetic inference from nuclear genes
Maximum likelihood topologies for individual nuclear genes are summarized in Fig. 3, and results for the concatenated nDNA data set are summarized in Fig. 4B. Superimposed on these ML trees are statistics and measures of support derived from parsimony analysis, as in nearly all cases the topology obtained from ML analysis did not conflict with the tree(s) obtained from MP analysis. The sole exception to this topological congruence was for AdhC, as ML and MP analyses identified different optimal topologies. In this case, the ML analysis resulted in an ingroup topology of [(D + B)(C + G)(F + A)], as indicated by bold lines in Fig. 3. The strict consensus of the three MP trees showed an ingroup topology of [D{(C + G)(B)(F + A)}] and is indicated by dashed lines. For this tree, ML branch lengths are shown above the branches, while MP-inferred branch lengths are shown in parentheses below the branches.
In a manner similar to cpDNA, 11 of 12 individual nuclear gene trees (all but AdhA) provide strong support for a sister-taxon association between the Australian (C + G)-cottons. The sister relationship indicated for the African (A + F)-genome cottons by cpDNA was less apparent in nuclear trees, as this clade appeared in only 3 of the 12 gene trees (A1713, AdhC, G1121); an equally common resolution, indicated by A1341, CesA1b, and G1262, was for the A- and F-genomes to be placed in an unresolved trichotomy that includes the B-genome. No locus positively contradicted the (A + F) grouping. Beyond the (C + G)-genome pair (and possibly an A + F-genome pair), trees obtained from individual (Fig. 3) and combined (Fig. 4B) nDNA data sets bear little resemblance to those obtained from cpDNA sequences (Figs. 3 and 4A). Features unique to the nuclear gene trees include: (1) placement of the D-genome as sister to the remainder of the genus, creating a basal "New World/Old World" split; (2) placement of the B-genome in an African clade consisting of (B + A + F)-genome cottons; and (3) a weak affinity of the African E-genome to the Australian (C + G)-genome cottons.
The most striking difference between trees based on cpDNA data (Wendel and Albert, 1992
; Seelanan, Schnabel, and Wendel, 1997
) and those obtained from nuclear sequences is that the latter repeatedly place the D-genome lineage sister to the remainder of the genus (Figs. 3 and 4B). This finding is supported by both ML and MP analyses of sequences from six different loci (A1341, AdhA, FAD2-1, G1121, G1134, and nrITS), and by parsimony analysis of AdhC. Four of the included nuclear loci (A1713, A1751, AdhC, and G1262) place the D-genome in an unresolved basal polytomy, a placement that does not conflict with a D-genome basal split. In fact, of the 12 nuclear loci sampled, only 1 (CesA1) shows a conflicting alternative for this earliest divergence in the genus. Results from CesA1 provide weak support (d = 0) for a basal split between the Australian clade and the remainder of the genus, a result similar to that obtained from chloroplast trnT-trnL spacer data. While placement of the D-genome in a sister relationship to the remainder of the genus is only weakly supported among individual loci, the combined nDNA data set provides robust support for this placement (Fig. 4B). The combined Old World lineage shares a total of 33 inferred synapomorphies, and the node resolving New World D-genome cottons sister to Old World cottons has high support for decay (d = 9) and jackknife resampling (j = 98%).
A second major difference between cpDNA-derived phylogenies and the trees obtained from nDNA concerns the placement of the African B-genome lineage (represented by G. anomalum). Chloroplast DNA sequence data resolve the B-genome as sister to the Australian (C + G)-genome cottons, rather than with other African cottons (Figs. 3 and 4A). In contrast, the nuclear data sets repeatedly unite the B-genome with one or both of the African A- and F-genome cottons. Five loci (A1341, A1713, CesA1b, G1262, and nrITS) independently support a monophyletic clade comprising the B-, A-, and F-genomes, while a sixth locus (FAD2-1) includes the Arabian E-genome in this assemblage. Although the (B + A + F)-genome clade fails to survive even a single step of decay in most data sets, the association is relatively robust in CesA1b (d = 1, j = 72%) and nrITS (d = 5, j = 99%). Three other loci reveal a close association between the B-genome and either the A-genome or F-genome, but do not support the inclusivity of a (B + A + F)-clade; the B-genome is sister to the A-genome without the F-genome in the AdhA data set, and it resolves as sister to the F-genome without the A-genome in the AdhC and G1121 data sets. For two data sets, AdhC and CesA1, the B-genome resolves as sister to taxa other than A- or F-genome cottons, although these associations have no decay and little jackknife support. For AdhC, the B-genome is placed sister to the D-genome, but only in the ML analysis; this association is not seen in MP, as the B-genome falls into a polytomy composed of Old World cottons that are sister to the New World D-genome. Also, CesA1 shows the B-genome to be the sister taxon of the E-genome, an association not evident in the other data sets. The preponderance of evidence (9 of 12 nuclear genes) indicates that the B-genome is closely allied to the A- and F-genome African cottons, and that these three taxa likely form a monophyletic lineage. Not surprisingly, analysis of the combined nDNA data set provided strong support for this association, as the preferred [B + (A + F)]-genome clade survives 12 steps of decay and yields jackknife resampling support of 99%.
As observed with cpDNA-derived phylogenies, trees obtained from nuclear genes show equivocal placement for the African-Arabian E-genome representative G. somalense. While plastid data generally show the E-genome to belong to a clade composed of African cottons (Fig. 4A), individual nuclear loci show that the E-genome is phylogenetically unstable, occupying alternative positions in the different gene trees (Figs. 3 and 4B). Three loci, AdhA, G1121, and nrITS, support an association of the E-genome with C- and G-genome cottons, albeit only weakly as only a single, nonhomoplasious character unites the E + (C + G)-genomes in each data set. Additionally, this node collapses after one step of decay and appears in only 58% of the jackknife replicates (Fig. 4B) in the combined nDNA data set. Alternative placements for the E-genome (all of which lack decay or jackknife support) include a basal position in the "Old World" cotton clade (A1341), a sister-taxon position with the B-genome (CesA1), as part of a monophyletic "African clade" that includes the A-, B-, and F-genome cottons (FAD2-1 and nrITS) and as one of several lineages comprising a genus-wide, basal polytomy (A1713, A1751, CesA1b, G1134, and G1262). The position of the E-genome in the AdhC gene tree could not be determined, as amplification products could not be obtained from either G. somalense or G. stocksii. This locus has been partially eliminated (G. arboreum) or completely removed (G. herbaceum) from A-genome diploid cottons (Small et al., 1998
; Small and Wendel, 2000
), and it seems that E-genome AdhC may have experienced the same fate.
Phylogenetic inferences from indel events
Analysis of the combined cpDNA + nDNA indel data set produced a single most-parsimonious tree (length [L] = 65, retention index [RI] = 0.76; Fig. 4C) that is fully resolved. This tree shows greater overall similarity to the tree produced by the nDNA nucleotide data than by the cpDNA nucleotide data, as a New World/Old World cotton split represents the primary divergence event and because Old World taxa are divided into monophyletic Australian (C + G-genome) and African (A + B + E + F-genome) lineages. The primary divergence between the D-genome/Old World is supported by five synapomorphic indels (all from the nDNA data set), resulting in appreciable jackknife and decay support (j = 86%, d = 2). A similarly robust node supports the Australian (C + G)-genome cotton clade (j = 96%, d = 4), as four cpDNA and three nDNA indels mark this lineage. The final clade, composed entirely of African cottons, shows complete resolution but poor support values (j = 53%, d = 0) at two of the three critical nodes.
While overall support for a monophyletic African clade (e.g., A-, F-, B-, and E-genomes) is low, inspection of cpDNA gap characters highlight an interesting inconsistency within the chloroplast data set. In the combined data set, five characters (ndhF-1, trnT-trnL-6, trnT-trnL-8 from cpDNA; AdhA-3 and CesA1b-3 from nDNA) provide support for placing the B-genome in the clade composed exclusively of African-Arabian taxa. In contrast, only a single character (trnT-trnL-7) from the cpDNA data set provides support for the combined B-genomeAustralian clade. In this regard, indels reveal a remarkable feature of the cotton cpDNA data set, namely that chloroplast nucleotides and chloroplast indels inconsistently resolve the B-genome.
Statistical analysis of incongruence between nucleotide data sets
Despite the limited number of taxa included in this study, substantial topological disagreement was evident in comparisons between the cpDNA and nDNA nucleotide-derived trees (e.g., Fig. 4A vs. 4B). Indeed, agreement between the two topologies is limited only to sister relationships for the African A- and F-genomes and the Australian C- and G-genomes, associations that have been previously resolved using cpDNA (Wendel and Albert, 1992
; Seelanan, Schnabel, and Wendel, 1997
) and nrITS (Seelanan, Schnabel, and Wendel, 1997
). Conflict among the remaining genome groups is most evident in the basal divergence event in the genus. The cpDNA tree reveals a basal divergence separating the common ancestor of the (C + G + B)-genomes from the common ancestor of the (D + E + F + A)-genomes. In contrast, nDNA indicates a geographically more coherent divergence, separating the ancestor of the New World D-genome from the ancestor of all Old World (A + F + B + E + C + G genomes) cottons. These two data sets show relatively robust character support for these alternative phylogenetic interpretations, as the basal node in the cpDNA tree (Fig. 4A) survives two steps of decay and exhibits a jackknife value of 88%, while the basal node from the nDNA tree (Fig. 4B) survives nine steps of decay and shows a jackknife value of 98%. Conflict is also evident in the affinity of the African B-genome, as represented here by G. anomalum. This taxon shows a strong affinity to African A- and F-cottons based on nDNA (d > 10, j = 99%), yet it exhibits a closer relationship to the Australian (C + G)-genome lineage based on cpDNA (d = 6, j = 98%).
To statistically evaluate the cpDNA and nDNA alternatives, we compared several topological variations using the Kishino-Hasegawa (KH)
lnL test. As might be anticipated from the strong character conflict between the cpDNA and nDNA data sets (indicated by the ILD test results), enforcing the alternative topology on either of these two concatenated data sets significantly increased the likelihood score, as indicated by the
lnL test: enforcing the nDNA-derived topology on the combined cpDNA data set resulted in a significant increase in the maximum likelihood score (unconstrained = 11 943.28; constrained to nDNA topology = 12 003.69;
lnL = 60.41; P = 0.0016), whereas even greater conflict appears when the nDNA data are forcibly constrained to the topology of the optimal cpDNA ML tree (unconstrained = 23 026.74; constrained to cpDNA topology = 23 099.55;
lnL = 72.81; P = 0.0001). These results indicate that the topological differences between the cpDNA and nDNA maximum likelihood trees are statistically significant and that the nDNA data set is more significantly perturbed by enforcing the alternative topology.
Additional topological tests were performed to evaluate whether alternative arrangements of selected taxa could be tolerated by the nDNA and cpDNA data sets. Because the testing of alternative, user-defined trees a posteriori may constitute a violation of the underlying assumptions of the KH test (Goldman, Anderson, and Rodrigo, 2000
), we limited our comparisons to alternative topologies indicated from a priori phylogenetic hypotheses derived from cpDNA restriction site data (Wendel and Albert, 1992
; Seelanan, Schnabel, and Wendel, 1997
) and rDNA (Seelanan, Schnabel, and Wendel, 1997
). To execute these tests, we pruned the number of taxa from nine (seven Gossypium and two outgroups) to a total of four (three ingroup and one outgroup) so that conflict among the most relevant lineages could be evaluated.
To evaluate conflict concerning the primary divergence event among Gossypium genome lineages, we pruned the cpDNA and nDNA data sets to include the A-genome (G. herbaceum), C-genome (G. robinsonii), D-genome (G. raimondii), and one outgroup (Gossypioides kirkii). As three ingroup taxa are considered, there are three possible topologies for evaluation (Fig. 5, upper panel): (A + D) sister to C, which is the cpDNA topology reported by Wendel and Albert (1992)
, Seelanan, Schnabel, and Wendel (1997)
, and supported by our cpDNA analysis; (C + D) sister to A, the topology reported by Seelanan, Schnabel, and Wendel (1997)
based upon nrITS using the distantly related taxon Thespesia populnea as an outgroup; and (A + C) sister to D, a topology that was indicated by analysis of the Adh multigene family in cotton (Small and Wendel, 2000
) and is robustly supported by the nDNA data presented herein.
|
Conflict in the placement of the B-genome lineage was evaluated in a similar manner by pruning the cpDNA and nDNA data sets to include the A-genome (G. herbaceum), B-genome (G. anomalum), C-genome (G. robinsonii), and one outgroup (Gossypioides kirkii). Hence, this analysis evaluates the strength of association between the B-genome and African vs. Australian cottons (Fig. 5, lower panel). With these included taxa, the three possible topologies are: (A + B) sister to C, the tree topology recovered from cpDNA by Wendel and Albert (1992)
, nrITS by Seelanan, Schnabel, and Wendel (1997)
, and from our analysis of nDNA; (B + C) sister to A, the tree topology recovered from cpDNA by Seelanan, Schnabel, and Wendel (1997)
, and supported by our analysis of cpDNA; and (A + C) sister to B, a topology that has yet to be proposed in literature and is unsupported by our analysis of nDNA and cpDNA.
As shown in Fig. 5, the preferred topology for the cpDNA data set, i.e., (B + C) sister to A, yields an ML score that statistically preferred to the other topological alternatives [e.g., (A + B) sister to C, or (A + C) sister to B; P = 0.036]. Similarly, the preferred topology for the nDNA data set [e.g., African (A + B) sister to Australian C] yields a ML score that is statistically more likely than those obtained from the topological alternatives, i.e., (B + C) sister to A, and (A + C) sister to B (P = 0.008 and 0.009, respectively). Again, while we chose the A- and C-genomes to represent the African and Australian lineages, identical results are obtained if the F- and/or G-genomes are used (results not shown).
The most revealing result from this exercise is the finding that placing the D-genome lineage sister to Old World cottons (in essence, the nDNA topology) results in an insignificant
lnL score in cpDNA data set (P = 0.391). This result suggests that while cpDNA and nDNA trees differ in their resolution of the basal divergence in Gossypium, placement of the D-genome is sufficiently equivocal in the cpDNA tree that a sister relationship between New World and Old World cottons is consistent with both data sets. Conversely, we find that the optimal placement of the B-genome, as indicated by ML analysis of cpDNA vs. nDNA, is statistically different in these data sets (Fig. 4A and 4B). Phylogenetic discord evident in the variable placement of the B-genome indicates that one of these data partitions has an evolutionary history that leads to phylogenetic incongruence. Events potentially involved include cytoplasmic introgression due to ancient hybridization, differential retention of ancestral chloroplast haplotypes, or poorly understood molecular-evolutionary phenomena that bias character evolution in a way that misleads phylogenetic reconstruction.
Timing the divergence of Gossypium lineages
To gain an understanding of the relative and absolute timing of the divergence events leading to the major extant cotton lineages, we used an ML approach to estimate internal and terminal branch lengths while enforcing a molecular clock constraint (Baum, Small, and Wendel, 1998
; Sanderson, 1998
). Maximum likelihood analyses of reduced data sets containing only synonymous sites showed that the cpDNA (4739 aligned positions) and nDNA (7978 aligned positions) synonymous site data sets deviated significantly from molecular clock expectations. Enforcing the clock constraint on the cpDNA data set yielded a significant increase in the ML score (ln L unconstrained = 8402.92, ln L clock enforced = 8438.71,
ln L = 35.79, P = 1.9 x 1013); similarly, enforcing the clock constraint on the nDNA data set yielded a significant increase in the ML score (ln L unconstrained = 17 926.08, ln L clock enforced = 17 940.94,
lnL = 14.86, P = 4.4 x 105). By inspecting the trees visually and evaluating sequence heterogeneity using the 1D rate test of Tajima (Tajima, 1993
), we found evidence of localized rate heterogeneity (results not shown). In particular, cpDNA from the C- and G-genomes showed elevated rates of divergence as compared to the B- and F-genomes. Similarly, nuclear sequences from the C-genome showed a significantly reduced rate of divergence as compared with nDNA from the A-, D-, and E-genome representatives (results not shown). By pruning the C-genome from the nDNA data set, reanalysis of the data using clock and nonclock models resulted in acceptance of the molecular clock (lnL unconstrained = 17 500.87, lnL clock enforced = 17 506.36;
ln L = 5.49; P = 0.089). Unfortunately, simple pruning of one taxon (B-, C-, F-, or G-genome, singly) or pairs of taxa (C + G, or B + F) from the cpDNA data did not produce data sets that met molecular clock expectations (data not shown). For this reason, only the nDNA data set, pruned of the C-genome, is used to estimate the timing of major divergence events between cotton genome groups.
To illustrate the temporal scale and rapidity of these divergence events, we estimated absolute divergence dates by applying molecular clock rates derived from nuclear genes of plant groups that are represented in the fossil record. Fossil-calibrated synonymous rates of Adh divergence for angiosperms vary about sixfold, with absolute rates ranging from 2.6 x 109 substitutions · synonymous site1 · yr1 for palms (Morton, Gaut and Clegg, 1996
), 7.0 x 109 substitutions · synonymous site1 · yr1 for grasses (Gaut, 1998
), and 1.5 x 108 substitutions · synonymous site1 · yr1 for Brassica (Koch, Haubold, and Mitchell-Olds, 2000
). In the absence of Gossypium fossils, it is not possible to know which value is most appropriate in the present application. Nevertheless, this exercise provides an effective demonstration of the rapidity of cotton radiation, as discussed below, and permits us to consider the merits of earlier hypotheses regarding the timing of radiation of the major Gossypium clades.
Average pairwise synonymous site divergence between the two outgroup species and all seven ingroup cottons was 0.0702 substitutions/synonymous site (standard deviation = 0.0055, N = 6521 synonymous sites per taxon). By using the lowest of the published Adh divergence rates (2.6 x 109 substitutions · synonymous site1 · yr1 for palms), we estimate that the separation between extant Gossypium diploids and Gossypioides kirkii (node 1, Fig. 6) occurred approximately 13.4 million years (MY) ago. This estimate provides a maximum likelihood-based conversion factor of 13.4 MY : 0.03913 ML units for the distance separating Gossypium from the combined outgroup Gossypioides kirkii + Kokia drynarioides (node 1 to terminals).
|
| DISCUSSION |
|---|
|
|
|---|