Quartet Decomposition


*   Check the trees

*   Convert tree leaf names

*   Generate bootstrap trees from multiple sequence alignment

*   Start my Quartet Decomposition job

Contact: fenglou@csbl.bmb.uga.edu for any problems.

 

*    Introduction about quartet theory

1.      What is a quartet?

2.      Why do we want to decompose a phylogenetic tree to quartets?

3.      How do we identify plurality quartets?

4.      How do we draw the quartet spectrum?

5.      What are the quartet filters?

6.      What is a split network?

7.      Why do we want to generate a split network from quartets?

8.      How do we generate a split network?

 

*    Frequently Asked Questions about the Quartet Decomposition (QD) server

1.      What is the overall structure of the QD server?

2.      What kind of data should I prepare in order to use QD server? Do you provide scripts to help prepare the input data?

3.      Can I see a set of sample input files and the sample result page?

4.      What should I do after I submit my job?

5.      Why do I have to wait for the job to be done?

6.      How do I generate a quartet spectrum in QD server?

7.      What can I do after I generate the quartet spectrum?

8.      How is the agreement score calculated? And what is the meaning of the agreement score?

9.      How could I download the quartets I wanted?

10.  How are the paralogs processed?

11.  How could I draw a split network in QD server?

12.  How could I obtain the Nexus file which I could use in my own SplitTree?

13.  What is a reference tree and how do we generate quartet spectrum by using a reference tree?

 

Introduction about quartet theory

1.       What is a quartet?

Before introducing the concept of a quartet, we should mention the tree of life. Most evolutionary biologists agree that all existing organisms and their ancestors can be organized as a tree, which we call the Tree of Life (ToL) or a species tree. A species tree describes the relationships among the organisms in the tree.

There is another kind of tree in evolutionary studies, which we call a gene tree. A gene tree is constructed from a set of homologous genes and describes the relationships among the genes in the tree.

Many articles study species trees and gene trees. If there were no exchange of genetic material during evolution, and species evolved only by mutation and speciation, a gene tree would closely resemble the species tree. Unfortunately this is not the case: there have been many genetic material exchange events in evolutionary history, which makes it difficult to infer the species tree from gene trees. (Other processes can lead to phylogenetic conflict, such as unrecognized gene duplications, lineage sorting and systematic artifacts due to lineage sorting and long-branch attraction. We do not discuss these here; however, the user should be especially aware of the systematic artifacts that can create a consistent phylogenetic signal.)

Figure 1 shows a gene tree. The leaf nodes 1, 4, 9 and 10 also form a small tree, which we call an embedded quartet. A quartet is an unrooted tree for a quadruple of taxa (Zhaxybayeva, et al., 2006). By studying quartets from multiple gene trees, we can infer rich information about the evolutionary relationships between species as well as genes.


Figure 1. An unrooted gene tree, and an embedded quartet (1,4,9,10)

 

2.       Why do we want to decompose a phylogenetic tree into quartets?

Gene trees constructed from different gene families may have different topologies. A whole tree is complicated and thus hard to analyze; for example, it is hard to compare two trees. A quartet, as the minimal informative element of a tree, is simpler: it can adopt only three topologies (Figure 2), which we denote TOP1, TOP2 and TOP3 from left to right. We can decompose trees into quartets and use the common quartets to rebuild a supertree or to construct a split network (see below), which provides another view of possible species relationships and depicts signals caused by Horizontal Gene Transfer (HGT). Other kinds of analysis can of course be performed on the extracted quartets.

 


Figure 2: Three possible quartet topologies
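
As an illustration of what decomposing a tree into quartets involves, the sketch below resolves the topology of every four-leaf subset of a small tree. It relies on the four-point condition for tree distances: of the three ways to pair four leaves, the pairing with the smallest sum of within-pair path lengths is the one separated by the internal branch. This is a minimal sketch, not the QD server's implementation; the toy tree, the node names and the helper functions are all assumptions made for the example, and topologies are labelled by their split (e.g. AB|CD) rather than as TOP1 to TOP3, because that numbering depends on the drawing in Figure 2.

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn (node, node, branch_length) triples into an adjacency map."""
    adj = {}
    for a, b, length in edges:
        adj.setdefault(a, []).append((b, length))
        adj.setdefault(b, []).append((a, length))
    return adj

def leaf_distances(adj, leaves):
    """Path length between every ordered pair of leaves in the tree."""
    dist = {}
    for src in leaves:
        seen = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nbr, length in adj[node]:
                if nbr not in seen:
                    seen[nbr] = seen[node] + length
                    stack.append(nbr)
        for dst in leaves:
            dist[src, dst] = seen[dst]
    return dist

def quartet_topology(dist, a, b, c, d):
    """Return (split, internal_branch_length) for the quartet a, b, c, d.

    An unresolved (star) quartet gives three equal sums; the first pairing
    is then returned with an internal branch length of 0.
    """
    sums = {
        f"{a}{b}|{c}{d}": dist[a, b] + dist[c, d],
        f"{a}{c}|{b}{d}": dist[a, c] + dist[b, d],
        f"{a}{d}|{b}{c}": dist[a, d] + dist[b, c],
    }
    best = min(sums, key=sums.get)
    internal = (max(sums.values()) - sums[best]) / 2
    return best, internal

# Hypothetical toy tree ((A,B),C,(D,E)) with internal nodes u, v, w.
edges = [("A", "u", 1.0), ("B", "u", 1.0), ("u", "v", 0.5), ("C", "v", 1.0),
         ("v", "w", 0.5), ("D", "w", 1.0), ("E", "w", 1.0)]
leaves = ["A", "B", "C", "D", "E"]
dist = leaf_distances(build_adjacency(edges), leaves)
quartets = {q: quartet_topology(dist, *q) for q in combinations(leaves, 4)}
```

The same distances also yield the internal branch length of each quartet, which is the quantity the branch-length filters described later operate on.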

 

3.       How do we identify the plurality quartet topology?

Suppose we are given a list of genomes and multiple homologous gene families collected from these genomes. (We initially assume that the members of a gene family are orthologs, but further quartet based analyses can reveal homologous replacement and unrecognized paralogy.) The plurality topology is then identified as follows:

a.    Construct bootstrap gene trees for each gene family, e.g., 100 bootstrap gene trees per gene family.

b.   Decompose each tree into quartets. Here a quartet is a combination of four taxa; each quartet adopts one of the three topologies in Figure 2, which we number TOP1, TOP2 and TOP3 from left to right.

c.    Calculate the bootstrap support value for each topology. For example, suppose we have 100 bootstrap gene trees for a specific gene family X. Each bootstrap tree contributes one quartet for a specific combination of four taxa A, B, C and D, so we have 100 quartets. If 80 of them adopt TOP1, 15 adopt TOP2 and 5 adopt TOP3, then for the quartet (A,B,C,D) TOP1 has a bootstrap support value of 80%, TOP2 of 15% and TOP3 of 5%.

d.   Choose a bootstrap support value threshold, e.g., 70%. In the example above, gene family X supports TOP1 of the quartet (A,B,C,D) because its bootstrap support value (80%) is larger than the threshold (70%). For every gene family that includes the quartet (A,B,C,D), we calculate the bootstrap support values of all three topologies, and we count the number of gene families that support TOP1 with a bootstrap support value larger than the threshold; call this number n1. We do the same for TOP2 and TOP3 to obtain n2 and n3. If n1 is larger than n2 and n3, then TOP1 is the plurality topology for the quartet (A,B,C,D).
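
Steps c and d can be sketched in a few lines for a single quartet (A,B,C,D). This is only an illustration: the gene family names and topology counts are made up (family X reuses the 80/15/5 example above), the function names are hypothetical, and tie handling between n1, n2 and n3 is not specified in the text, so the sketch simply returns the first topology with the largest count.

```python
from collections import Counter

TOPOLOGIES = ("TOP1", "TOP2", "TOP3")

def support_values(topologies):
    """Step c: bootstrap support (as a fraction) of each topology,
    given one topology label per bootstrap tree of a gene family."""
    counts = Counter(topologies)
    total = sum(counts.values())
    return {top: counts[top] / total for top in TOPOLOGIES}

def plurality_topology(families, threshold=0.7):
    """Step d: count the gene families supporting each topology with a
    bootstrap support larger than the threshold, then pick the largest."""
    n = Counter({top: 0 for top in TOPOLOGIES})
    for topologies in families.values():
        for top, support in support_values(topologies).items():
            if support > threshold:
                n[top] += 1
    best = max(TOPOLOGIES, key=lambda top: n[top])
    return best, dict(n)

# Hypothetical data for the quartet (A,B,C,D) in three gene families.
families = {
    "X": ["TOP1"] * 80 + ["TOP2"] * 15 + ["TOP3"] * 5,
    "Y": ["TOP1"] * 75 + ["TOP2"] * 25,
    "Z": ["TOP1"] * 20 + ["TOP2"] * 72 + ["TOP3"] * 8,
}
support_x = support_values(families["X"])
plurality, counts = plurality_topology(families, threshold=0.7)
```

With these numbers two gene families support TOP1 above the threshold and one supports TOP2, so TOP1 is the plurality topology.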

4.       How do we draw the quartet spectrum?

Once we have identified the plurality topology for each quartet, we can sort the quartets by the number of gene families supporting the plurality topology and draw a histogram of these numbers. Since the plurality topology is only one of the three topologies of each quartet, we also draw a second histogram on the negative side of the Y-axis for the sum of the numbers of gene families supporting the two minor topologies. We call this combined histogram the quartet spectrum. See Figure 3 for an example. This spectrum was generated by the QD server using the data set from the paper (Zhaxybayeva, et al., 2006); it is also available at the link below. The data set can be downloaded from FAQ 2 or FAQ 3.

http://csbl.bmb.uga.edu/QD/jobstatus.php?jobid=QDkj65zE


Figure 3: a sample quartet spectrum

Please remember that the number of supporting gene families may differ for different bootstrap support threshold values. The numbers in the upper-right corner represent the different bootstrap support threshold values. Normally we draw the spectra for several threshold values in the same picture for easy comparison.
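
The bar heights of a quartet spectrum can be derived from the per-quartet counts n1, n2 and n3 as sketched below. The counts are invented for illustration, the function name is hypothetical, and the descending sort order is an assumption; the actual plotting is left out.

```python
def spectrum_bars(counts):
    """For each quartet, return (quartet, positive_bar, negative_bar):
    the plurality count above the axis and minus the summed counts of
    the two minor topologies below it, sorted by the positive bar."""
    bars = []
    for quartet, (n1, n2, n3) in counts.items():
        plurality = max(n1, n2, n3)
        minors = n1 + n2 + n3 - plurality
        bars.append((quartet, plurality, -minors))
    bars.sort(key=lambda bar: bar[1], reverse=True)
    return bars

# Hypothetical numbers of supporting gene families (n1, n2, n3) per quartet
# at one bootstrap support threshold.
counts = {
    ("A", "B", "C", "D"): (5, 1, 0),
    ("A", "B", "C", "E"): (2, 9, 3),
}
bars = spectrum_bars(counts)
```

Repeating the calculation with counts obtained at several thresholds gives the overlaid series shown in Figure 3.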

5.       What are the quartet filters?

The purpose of quartet decomposition is to find informative quartets. Unfortunately, not all quartets are informative, normally because the tree construction was inaccurate or difficult. We designed rules to remove such quartets and so avoid noise in later quartet analysis; we call these rules quartet filters.

Currently we provide three filters. As Figure 2 shows, each quartet topology has four external branches and one internal branch.

a.    Short internal branch: the first filter removes quartets with a very short internal branch. The user specifies a threshold (default value = 0.02 substitutions per site, or branch length units of the user-supplied gene trees); all quartets with an internal branch shorter than the threshold are removed during quartet decomposition. Please note that this removes a specific quartet only from the specific bootstrap tree of the specific gene family; it has no effect on other trees or other gene families. When we calculate the bootstrap support value (a percentage), the removed quartet is no longer counted in the total.

b.   Long external branch: the second filter removes quartets with a very long external branch. We calculate the ratio between the longest external branch and the internal branch; any quartet whose ratio is larger than a user-specified value (default value = 10) is removed. As with the previous filter, the removal affects only the specific bootstrap tree. Please see (Felsenstein, 1978) and (Kennedy, et al., 2005) for details.

c.    Remove less supported quartets: Since the number of quartets equals the number of combinations of any four species in a given gene tree, this number can be very large for a relatively large gene tree. Generally, less well resolved quartets are less informative and more likely to produce noise that results in erroneous significant conflicts with the plurality (false positives). We therefore implemented the following filter, which requires two thresholds: the first (T1) is a bootstrap support value threshold, and the second (T2) is the percentage of gene families supporting the quartet topology with a bootstrap support value of at least T1. For a specific quartet plurality topology, if the percentage of gene families supporting it with a bootstrap support value of at least T1 is less than T2, the quartet is removed from further analyses, e.g., from the display of a quartet spectrum. Please note that this removal differs from the two previous filters: it removes the quartet and all three of its topologies from all gene families and all bootstrap trees. (The aim of this filter is to remove false positives for identified conflict. An observation from simulations was that quartets that are difficult to resolve by most gene families create a small number of false positives, i.e., conflicts with significant bootstrap support.)

This last filter can be applied multiple times to generate different quartet spectra even after the decomposition process is done, whereas the other two filters have to be specified before the decomposition process starts. The decomposition process, which is highly CPU intensive, has to be run again if a user wants to change the values of the first two filters.
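
The three filters can be expressed as two small predicates, sketched below under stated assumptions. The function and parameter names are hypothetical; the defaults 0.02 and 10 are the ones given above. For the third filter, the text does not say which gene families form the denominator of the percentage, so the sketch assumes it is the set of gene families that contain the quartet.

```python
def passes_branch_filters(internal, externals, min_internal=0.02, max_ratio=10.0):
    """Filters a and b, applied to one quartet in one bootstrap tree.

    internal  : length of the quartet's internal branch
    externals : lengths of its four external branches
    """
    if internal <= 0 or internal < min_internal:
        return False  # filter a: internal branch too short
    if max(externals) > max_ratio * internal:
        return False  # filter b: longest external branch too long
    return True

def passes_support_filter(plurality_supports, t1, t2):
    """Filter c, applied to one quartet across gene families.

    plurality_supports : bootstrap support (fraction) of the plurality
                         topology in each gene family containing the quartet
    t1 : bootstrap support threshold; t2 : required fraction of families
    """
    if not plurality_supports:
        return False
    supporting = sum(1 for s in plurality_supports if s >= t1)
    return supporting / len(plurality_supports) >= t2

short_internal = passes_branch_filters(0.01, [0.1, 0.1, 0.1, 0.1])
long_external = passes_branch_filters(0.05, [0.1, 0.2, 0.1, 0.6])
kept = passes_branch_filters(0.05, [0.1, 0.2, 0.1, 0.4])
well_supported = passes_support_filter([0.9, 0.8, 0.3, 0.2], t1=0.7, t2=0.5)
poorly_supported = passes_support_filter([0.9, 0.8, 0.3, 0.2], t1=0.7, t2=0.6)
```

Because the third predicate needs only the stored support values, it can be re-evaluated with new thresholds without repeating the decomposition, which matches the behaviour described above.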

6.       What is a split network?

A split network is a network representation of the relationships among a set of taxa (Bryant and Moulton, 2004). A split network can still look like a tree, provided the data are compatible with a tree, but it may include alternative splits to represent conflicting relationships between taxa. See Figure 4 for an example.


Figure 4: A sample split network

7.       Why do we want to generate a split network from quartets?

The significantly supported quartets can be combined into a supertree using the matrix representation using parsimony approach (Baum, 1992; Ragan, 1992). However, we believe a split network is a more reasonable representation of the relationships among taxa than a tree, especially in complex situations, and the quartet decomposition result can be used to generate a split network. The QD server uses the software SplitsTree4 (Huson and Bryant, 2006) to generate split networks.

8.       How do we generate a split network?

The NeighborNet algorithm can be used to generate a split network from a matrix. The matrix can be generated in many ways; one of them is from a set of quartets, which can be obtained from quartet decomposition. The QD server generates a Nexus file containing the matrix and uses SplitsTree4 (Huson and Bryant, 2006) to generate the split network.

FAQ about Quartet Decomposition (QD) server

1.       What is the overall structure of the QD server?

Please see Figure 5 below for the overall structure of the server; the other FAQ answers give detailed explanations. The boxes with a green border are parameterized filters which can be applied multiple times to generate different quartet spectra. The green arrows represent the repeatable steps.

Figure 5: the overall structure of QD server.

2.       What kind of data should I prepare in order to use QD server?

In order to use the QD server, the user needs to prepare two files: one file provides the genome list, and the other is a compressed file containing all of the gene trees. We provide two sample files below:

a.       Sample genome list file

b.      Sample compressed tree file

The genome list file must be a plain text file with one genome name per line; formatted files such as Microsoft Word documents or PDF files will not work in the QD server. The tree file must be a compressed file in .tar.gz, .rar or .zip format. A .tar.gz file is easily created on Unix/Linux/MacOS X platforms with the tar and gzip commands. A .rar file can be created with WinRAR on Windows. The .zip format is also popular and is supported on many platforms by tools such as zip/unzip (Linux) and WinZip (Windows). After the compressed file is uncompressed, the restored files should be Newick format tree files. Newick tree files can be produced by tree construction programs such as PHYLIP. Each Newick tree file should be generated from one gene family, and all the bootstrap trees for the same gene family should be in the same Newick tree file.
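
For users who prefer a script to the tar and gzip commands, the sketch below builds a .tar.gz archive with Python's standard tarfile module. The file names, the .tre extension and the flat archive layout (no sub-directories) are assumptions made for the example, not requirements stated by the server.

```python
import tarfile
import tempfile
from pathlib import Path

def pack_trees(tree_dir, archive_path):
    """Pack every file in tree_dir into a gzip-compressed tar archive,
    storing each file under its bare name. Returns the packed names."""
    tree_dir = Path(tree_dir)
    members = sorted(p for p in tree_dir.iterdir() if p.is_file())
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in members:
            archive.add(path, arcname=path.name)
    return [p.name for p in members]

# Demonstration with two tiny, made-up Newick files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    trees = Path(tmp) / "trees"
    trees.mkdir()
    (trees / "family_0001.tre").write_text("((A:0.1,B:0.2):0.05,C:0.3,D:0.4);\n")
    (trees / "family_0002.tre").write_text("((A:0.1,C:0.2):0.05,B:0.3,D:0.4);\n")
    archive_file = Path(tmp) / "trees.tar.gz"
    packed = pack_trees(trees, archive_file)
    with tarfile.open(archive_file, "r:gz") as check:
        listed = sorted(check.getnames())
```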

There is one important issue: the taxon names in a gene tree file produced directly by tree construction software are normally gene names, and the user needs to convert them to genome names. We provide a perl script to help users with this conversion. The user has to provide a map file between gene names and genome names, and trees in which each tip name contains the gene name as a substring. A short usage description is shown if the script is run without arguments. Sample data files are also provided here. Uncompress the zip file to obtain all sample files.
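
The idea behind the conversion can be illustrated with the short sketch below. It is not the provided perl script and does not reproduce its options or map file format; the gene names, the mapping and the function name are hypothetical. It replaces each tip label of a Newick string with the genome name of the first gene name found as a substring of the label, and leaves unmatched labels unchanged.

```python
import re

# A tip label follows "(" or "," and runs up to the next Newick delimiter.
TIP = re.compile(r"(?<=[(,])\s*([^\s(),:;]+)")

def rename_tips(newick, gene_to_genome):
    """Replace gene-based tip labels with genome names."""
    def swap(match):
        tip = match.group(1)
        for gene, genome in gene_to_genome.items():
            if gene in tip:
                return genome
        return match.group(0)  # no gene name matched: keep the label
    return TIP.sub(swap, newick)

# Hypothetical gene names mapped to genome names from the first sample set.
mapping = {
    "geneA": "Anabaena",
    "geneB": "Nostoc",
    "geneC": "Synechocys",
    "geneD": "Gloeobacte",
}
newick = "((gi123_geneA:0.1,gi456_geneB:0.2):0.05,geneC_x:0.3,geneD:0.4);"
renamed = rename_tips(newick, mapping)
```

Substring matching is convenient but can be ambiguous when one gene name is a prefix of another, so the map file should use names that are unique within the tip labels.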

We also provide another perl script that checks the format of the input trees. The user needs to run this script before submitting the trees to avoid problems on the server.

The user needs to install BioPerl in order to run the provided perl scripts.

The URL below will lead you to the quartet spectrum generated by the two sample files above.

http://csbl.bmb.uga.edu/QD/jobstatus.php?jobid=QDkj65zE

 

3.      Can I see a set of sample input files and the sample result page?
Please see the sample input files below. Please be aware that a job submitted with the sample files takes about one hour to complete. You can go to the result page directly to see the generated quartet spectrum.

The first sample data set:

a.       The sample genome list file

The genome names are listed below; the names in parentheses are used in the genome list file and the tree files.

(Anabaena) Anabaena sp. PCC7120;
(Trichodesm) Trichodesmium erythraeum IMS101;
(Synechocys) Synechocystis sp. PCC6803;
(1Prochloro) Prochlorococcus marinus CCMP1375;
(2Prochloro) Prochlorococcus marinus MED4;
(3Prochloro) Prochlorococcus marinus MIT9313;
(Synechococ) Synechococcus WH8102;
(Thermosyne) Thermosynechococcus elongatus BP-1;
(Gloeobacte) Gloeobacter violaceus PCC7421;
(Nostoc) Nostoc punctiforme ATCC29133;
(Crocosphae) Crocosphaera watsonii WH8501.

b.      The sample compressed tree file

The gene families were constructed by the method described in (Zhaxybayeva and Gogarten, 2002). 1128 gene families were obtained by requiring that at least nine of the 11 genomes contain the gene. The trees were generated with the PHYLIP package. Please see the paper (Zhaxybayeva, et al., 2006) for details.

Result page: http://csbl.bmb.uga.edu/QD/jobstatus.php?jobid=QDkj65zE, a snapshot of the spectrum is shown in Figure 3.

The second sample data set:

a.       The sample genome list file

The genome names are listed below; the names in parentheses are used in the genome list file and the tree files.

(AS9601)  Prochlorococcus marinus AS9601
(CCMP1375) Prochlorococcus marinus CCMP1375
(CCMP1986) Prochlorococcus marinus MED4
(MIT9313) Prochlorococcus marinus MIT9313
(MIT9303) Prochlorococcus marinus MIT9303
(MIT9312) Prochlorococcus marinus MIT9312
(MIT9515) Prochlorococcus marinus MIT9515
(NATL1A)  Prochlorococcus marinus NATL1A
(NATL2A)  Prochlorococcus marinus NATL2A
(CC9311)  Synechococcus CC9311
(CC9605)  Synechococcus CC9605
(CC9902)  Synechococcus CC9902
(WH8102)  Synechococcus sp WH8102
(MIT9211) Prochlorococcus marinus MIT9211
(MIT9215) Prochlorococcus marinus MIT9215
(RCC307)  Synechococcus RCC307
(WH7803)  Synechococcus WH7803
(MIT9301) Prochlorococcus marinus MIT 9301
(PCC7002) Synechococcus PCC 7002

b.      The sample compressed tree file

This data set is from the paper (Zhaxybayeva, et al., 2009). It consists of 1,812 gene families present in at least 4 of 18 cyanobacterial genomes of Prochlorococcus marinus and marine Synechococcus spp.

Result page: http://csbl.bmb.uga.edu/QD/jobstatus.php?jobid=QDf17dEr (this page takes about a minute to load because the data set is big). A snapshot of the spectrum is shown in Figure 6. The real spectrum is much bigger; we reduced its size for easier viewing. The original spectrum opens in a new window if you click on it.


Figure 6. The quartet spectrum for the second sample data set.

4.       What should I do after I submit my job?

Wait until the job has completed.

5.       Why do I have to wait for the job to be completed?

Decomposing trees into quartets is quite time consuming; we use a small Linux cluster to do the job. Normally it takes from half an hour to several hours, and the actual time depends on the size of your data set. It might take several hours if the data set has several hundred gene families and many genomes (e.g., more than 15). You have to refresh your job status page to check whether the job is done. Once it is done, a quartet spectrum is generated from the plurality quartets.

6.       How do I generate a quartet spectrum in QD server?

A quartet spectrum based on plurality quartets is generated automatically once your job is done. You can also provide a reference tree, and the server will generate a quartet spectrum based on it. A reference tree is a regular Newick tree in which all of your genome names are present.

7.       What can I do after I generate the quartet spectrum?

a.       Click on the spectrum bars to show the detailed information about the quartet, such as the quartet topology and the supporting gene families.

b.      Download quartets based on a variety of options.

c.       Apply the 3rd filter to remove unwanted quartets.

d.      Calculate agreement scores.

e.      Draw another quartet spectrum by using a (different) reference tree.

f.        Generate split network and download the Nexus file.

8.       How is the agreement score calculated? And what is the meaning of the agreement score?

The agreement score (Zhaxybayeva, et al., 2009) is calculated for each gene family using the formula below:

\[ S = \frac{1}{N \cdot M} \sum_{i=1}^{M} n_i(\mathrm{plurality}) \]

where N is the number of bootstrap trees for the gene family (N=100 as the default); M is the number of possible quartets,

\[ M = \binom{x}{4} = \frac{x!}{4!\,(x-4)!} \]

with x being the number of genomes being considered; N*M is the total number of quartets in all bootstrap gene trees for this gene family; and n_i(plurality) is the number of topologies which agree with the plurality topology for the ith quartet. The score S is 1 if all the bootstrap trees are the same and equal to the plurality tree, and it is less than 1 if some bootstrap trees are inconsistent with it. The more conflicts there are, the closer the score is to 0.

If the quartet spectrum is generated from a reference tree, n_i(plurality) is replaced by n_i(reference), the number of topologies which have the same topology as the reference tree for the ith quartet.
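
As a worked illustration of the score, the sketch below computes M for a given number of genomes and the fraction of quartet topologies, over all bootstrap trees, that agree with the plurality (or reference) topology. The function name and the agreement counts are made up for the example.

```python
from math import comb

def agreement_score(agree_counts, n_bootstrap):
    """Agreement score S of one gene family.

    agree_counts : for each of the M quartets, the number of bootstrap
                   trees whose topology agrees with the plurality
                   (or reference) topology
    n_bootstrap  : N, the number of bootstrap trees of the gene family
    """
    m = len(agree_counts)
    return sum(agree_counts) / (n_bootstrap * m)

# M for x genomes is "x choose 4".
quartets_for_11_genomes = comb(11, 4)

# Hypothetical gene family with three quartets and N = 100 bootstrap trees.
perfect = agreement_score([100, 100, 100], n_bootstrap=100)
conflicting = agreement_score([100, 50, 0], n_bootstrap=100)
```

For the 11 genomes of the first sample data set there are 330 possible quartets; the two toy gene families score 1.0 and 0.5 respectively.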

9.       How could I download the quartets I wanted?

We provide two download options:

a. The first option downloads a specific subset of the quartets, selected by two thresholds: the first threshold (T1) is a bootstrap support value and the second threshold (T2) is an integer. The user can download the subset of the decomposed quartets that are supported with a bootstrap support value of at least T1 in at least T2 gene families.

b. The second option is based on the quartet spectrum. Quartet topologies with positive y-values are considered plurality quartet topologies, and those with negative y-values are considered conflicting quartet topologies. The user chooses whether to download the plurality or the conflicting quartet topologies and provides a bootstrap support value threshold (T1) and an integer threshold (T2). The plurality or conflicting quartet topologies with a bootstrap support value larger than T1 in at least T2 gene families are downloaded. Note that the spectrum is affected by the third filter mentioned above, which removes some unwanted quartets; this option downloads only the quartets that pass the filter.

10.   How are the paralogs processed?

If there are paralogs in a bootstrap tree, the distribution of quartet topologies is calculated as follows. Given a tree and four genomes A, B, C and D, let a, b, c and d be the number of paralogs from each genome respectively. The total number of quartets formed with one tip from each of the four genomes is t=a×b×c×d. Since each of these quartets adopts one of TOP1, TOP2 or TOP3 (see Figure 2), we can count the number of quartets with TOP1, TOP2 and TOP3 as t1, t2 and t3; the sum of t1, t2 and t3 equals t. For the given tree, we calculate the ratios of TOP1, TOP2 and TOP3 as t1/t, t2/t and t3/t, respectively. The three ratios sum to 1, just as for a tree without paralogs. In addition, quartets with two tips from the same genome are ignored.
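
The ratio calculation can be sketched as follows. The tip names and the lookup table standing in for a real topology classifier are hypothetical. Taking one tip per genome with itertools.product enumerates exactly the t=a×b×c×d combinations and never pairs two tips from the same genome.

```python
from collections import Counter
from itertools import product

def paralog_topology_ratios(copies, classify):
    """Return t1/t, t2/t and t3/t for one tree and four genomes.

    copies   : dict mapping each of the four genomes to its tips in the tree
    classify : function taking a 4-tuple of tips (one per genome) and
               returning "TOP1", "TOP2" or "TOP3"
    """
    counts = Counter(classify(tips) for tips in product(*copies.values()))
    t = sum(counts.values())
    return {top: counts[top] / t for top in ("TOP1", "TOP2", "TOP3")}

# Hypothetical example: genome A has two paralogs, so t = 2 * 1 * 1 * 1 = 2.
copies = {"A": ["A1", "A2"], "B": ["B1"], "C": ["C1"], "D": ["D1"]}
lookup = {
    ("A1", "B1", "C1", "D1"): "TOP1",
    ("A2", "B1", "C1", "D1"): "TOP2",
}
ratios = paralog_topology_ratios(copies, lookup.__getitem__)
```

In this example the tree contributes half a vote to TOP1 and half a vote to TOP2 for the quartet (A,B,C,D).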

11.   How could I draw a split network in QD server?

A split network can be generated from any quartet download page, using the quartet topologies you are about to download. The generated split network is therefore also affected by the quartet download thresholds.

12.   How could I obtain the Nexus files which I could use in my own SplitsTree analysis?

A link to download the Nexus file can be found on both the quartet download page and the split network picture page.

13.   What is a reference tree and how do we generate a quartet spectrum using a reference tree?

A reference tree is similar to a regular gene tree, except that it must contain all possible taxa, whereas a gene tree may lack some of them. We use the quartet topologies in the reference tree, in place of the plurality topologies, to generate the plurality quartet bars in the quartet spectrum. The agreement score calculation is also affected by the reference tree.

 

References

Baum, B.R. (1992) Combining Trees as a Way of Combining Data Sets for Phylogenetic Inference, and the Desirability of Combining Gene Trees, Taxon, 41, 3-10.

Bryant, D. and Moulton, V. (2004) Neighbor-net: an agglomerative method for the construction of phylogenetic networks, Mol Biol Evol, 21, 255-265.

Felsenstein, J. (1978) Cases in which Parsimony or Compatibility Methods Will be Positively Misleading, Systematic Zoology, 27, 401-410.

Huson, D.H. and Bryant, D. (2006) Application of phylogenetic networks in evolutionary studies, Mol Biol Evol, 23, 254-267.

Kennedy, M., Holland, B.R., Gray, R.D. and Spencer, H.G. (2005) Untangling long branches: identifying conflicting phylogenetic signals using spectral analysis, neighbor-net, and consensus networks, Syst Biol, 54, 620-633.

Ragan, M.A. (1992) Phylogenetic inference based on matrix representation of trees, Mol Phylogenet Evol, 1, 53-58.

Zhaxybayeva, O., Doolittle, W.F., Papke, R.T. and Gogarten, J.P. (2009) Intertwined Evolutionary Histories of Marine Synechococcus and Prochlorococcus marinus, Genome Biol Evol, 2009, 325-339.

Zhaxybayeva, O. and Gogarten, J.P. (2002) Bootstrap, Bayesian probability and maximum likelihood mapping: exploring new tools for comparative genome analyses, BMC Genomics, 3, 4.

Zhaxybayeva, O., Gogarten, J.P., Charlebois, R.L., Doolittle, W.F. and Papke, R.T. (2006) Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events, Genome Res, 16, 1099-1108.

 

The material and programs available through this server are based upon work supported by the National Science Foundation under Grant No. 0830024. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.