2019-05-21: Using ASTRAL to build a species tree from multiple genes

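ASTRAL is a Java program that estimates a species tree from a set of unrooted gene trees.

ASTRAL needs no installation, but it does require a Java runtime.

ASTRAL has no graphical interface and must be run from the command line.

Once it is launched, ASTRAL prints its options. If this happens without errors, the setup works.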

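-o: the output file.

The input is a single file containing all the gene trees in Newick format. Input gene trees are treated as unrooted trees whether or not they are rooted, and ASTRAL's output should likewise be treated as unrooted. Input gene trees may contain polytomies.

The output is in Newick format and can be viewed in many programs.

ASTRAL measures branch lengths in coalescent units. The support values on its branches are local posterior probabilities, not the bootstrap values we usually expect.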

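The -q option

Scoring an existing tree with -q gives its quartet score, branch lengths, and branch support values. A normalized quartet score of 0.9 means that 90% of the quartet trees induced by the gene trees are present in the species tree. To score a tree, pass it with -q and supply the gene trees with -i.

Here the species tree simulated_14taxon.default.tre is scored against the gene trees in simulated_14taxon.gene.tre.

The result means that 4803 quartet trees induced by the gene trees are also present in the species tree, which is 47.98% of all quartet trees. This dataset has a very high level of ILS (incomplete lineage sorting), and that is what produces this result: the gene trees are highly discordant with the species tree.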
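To see where the 47.98% comes from, here is a minimal sketch of the arithmetic. It assumes every gene tree is fully resolved and contains all taxa, so each gene tree contributes C(n, 4) quartet trees. The 14 taxa come from the file name; the count of 10 gene trees is my assumption, chosen because it reproduces the reported percentage.

```python
from math import comb


def normalized_quartet_score(matched, n_taxa, n_gene_trees):
    """Fraction of gene-tree quartet trees that agree with the species tree.

    Assumes each gene tree is fully resolved and has all n_taxa leaves,
    so each one contributes comb(n_taxa, 4) quartet trees.
    """
    total = comb(n_taxa, 4) * n_gene_trees
    return matched / total


# 4803 is the score quoted in the text; n_gene_trees=10 is an assumption.
score = normalized_quartet_score(4803, n_taxa=14, n_gene_trees=10)
print(f"{score:.2%}")
```

With real data, gene trees with missing taxa or polytomies contribute fewer quartets, so the denominator has to be counted from the trees themselves.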

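When you infer a species tree, or score one with -q, you get a branch length and a local posterior support for every branch. Beyond these defaults, other per-branch information can be output. Every internal branch of an unrooted tree has four groups around it: the first child (L), the second child (R), the sister group (S), and everything else (O). Pairing these groups up gives three possible topologies. One is the topology of the current tree; the other two are the alternatives. ASTRAL can report the local posterior probability not only for the current topology but also for the two alternatives. This is requested with the -t option.

Read through all the values reported for a few branches and make sure you understand them.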
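The three topologies around a branch are simply the three ways of pairing up the four groups. A small sketch that enumerates them (the order in which they are printed here is illustrative and is not meant to match the order ASTRAL uses in its annotations):

```python
def quartet_topologies(groups=("L", "R", "S", "O")):
    """Return the three ways to split four groups into two unordered pairs."""
    first, *rest = groups
    splits = []
    for partner in rest:
        others = tuple(g for g in rest if g != partner)
        splits.append(((first, partner), others))
    return splits


for (a, b), (c, d) in quartet_topologies():
    print(f"{a}{b}|{c}{d}")
```

The split that puts L with R is the one found in the current tree, because L and R are the two children on the same side of the branch.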

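Local posterior probabilities and branch lengths are computed with a Yule prior on the branch lengths of the species tree. The speciation rate of the Yule process (in coalescent units) defaults to 0.5, which gives a flat prior on the quartet frequencies over the range [1/3, 1]. (I do not really understand this.) The -c option adjusts this hyper-parameter.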
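My attempt at making sense of the flat prior, as a sketch under two assumptions. First, the standard coalescent result that an internal branch of length d (in coalescent units) gives the species-tree quartet topology a frequency of 1 - (2/3)·exp(-d). Second, that under a Yule process with rate λ the internal branch lengths are exponentially distributed with rate 2λ (this is how I read it, not something stated in the tutorial). With λ = 0.5 the resulting distribution of the quartet frequency coincides with a uniform distribution on [1/3, 1].

```python
import math


def quartet_frequency(branch_length):
    """Frequency of the species-tree quartet topology for an internal
    branch of the given length (coalescent units)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-branch_length)


def prior_cdf(theta, lam=0.5):
    """P(quartet frequency <= theta), ASSUMING branch length ~ Exp(2 * lam)."""
    return 1.0 - (1.5 * (1.0 - theta)) ** (2.0 * lam)


def uniform_cdf(theta, lo=1.0 / 3.0, hi=1.0):
    """CDF of a uniform distribution on [lo, hi]."""
    return (theta - lo) / (hi - lo)


for theta in (0.4, 0.6, 0.9):
    print(theta, prior_cdf(theta), uniform_cdf(theta), prior_cdf(theta, lam=1.0))
```

The first two printed columns agree for every theta, while the column for λ = 1.0 does not, which is one way to read "the default rate gives a flat prior".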

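ASTRAL can output branch support values without bootstrapping, and this support is more reliable than bootstrapping (on the authors' data). Even so, you may still want bootstrap support. ASTRAL can perform multi-locus bootstrapping, and for that it needs access to the bootstrap replicate trees of every gene.

You therefore need to provide the locations of all the gene tree bootstrap replicates. To try bootstrapping on the test data: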

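1. Go to the test_data directory.

2. Unzip the file called song_mammals.424genes.bs-trees.zip.

3. Then run ASTRAL with the -i and -b options described below.

This performs 100 bootstrap replicates.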

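1. -i gives the file with all the ML gene trees (the same input you would supply without bootstrapping).

2. -b tells ASTRAL that bootstrap values should be computed. The file given after -b, bs-files, lists the paths to the gene tree bootstrap files, one gene per line. For example: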

424genes/100/raxmlboot.gtrgamma/RAxML_bootstrap.allbs
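Writing this list by hand for hundreds of genes is tedious, so here is a sketch that generates the lines. It assumes every gene follows the same directory layout as the example path above, with only the gene identifier changing; the identifiers 101 and 102 are hypothetical.

```python
# Path layout copied from the example line; that all genes follow it is an assumption.
TEMPLATE = "424genes/{gene}/raxmlboot.gtrgamma/RAxML_bootstrap.allbs"


def bs_file_lines(gene_ids, template=TEMPLATE):
    """Return one bootstrap-file path per gene, in the order given."""
    return [template.format(gene=gene) for gene in gene_ids]


def write_bs_files(path, gene_ids, template=TEMPLATE):
    """Write the path list to a file, one gene per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(bs_file_lines(gene_ids, template)) + "\n")


print("\n".join(bs_file_lines([100, 101, 102])))
```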

The output file then contains, in order:

1. 100 bootstrapped replicate trees, each the result of running ASTRAL on one set of bootstrap gene trees.

2. A greedy consensus of the 100 bootstrapped replicate trees; this tree has support values drawn on branches based on the bootstrap replicate trees. Support values show the percentage of bootstrap replicates that contain a branch.

3. The “main” ASTRAL tree; this is the result of running ASTRAL on the best_ml input gene trees. This main tree also includes support values, which are again drawn based on the 100 bootstrap replicate trees. (I do not understand this part.)

Note: bootstrap support values are shown as percentages, whereas local posterior probabilities are numbers between 0 and 1. While bootstrapping, ASTRAL keeps writing out the ASTRAL tree of each bootstrap replicate. So if the number of replicates is set to 100, it outputs 100 trees and then the greedy consensus of those 100 bootstrapped trees. (I do not understand this part.) Finally, it runs the main analysis (on the file given with -i) and computes the branch support of the main tree. In this example the output therefore holds 102 trees.
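Since the main tree is the last of the 102 trees, it helps to split the output before opening it in a viewer. A minimal sketch, assuming the output file has one Newick tree per line in the order described above:

```python
def split_bootstrap_output(trees, n_reps=100):
    """Split output trees into (replicate trees, greedy consensus, main tree).

    Assumes the order described in the text: n_reps replicate trees,
    then the consensus, then the main tree.
    """
    if len(trees) != n_reps + 2:
        raise ValueError(f"expected {n_reps + 2} trees, got {len(trees)}")
    return trees[:n_reps], trees[n_reps], trees[n_reps + 1]


def read_trees(path):
    """Read non-empty lines from a tree file, one Newick string per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, `split_bootstrap_output(read_trees(out_path))[2]` would give the main tree with bootstrap support, where `out_path` stands for whatever file was passed to -o.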

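The default is 100 replicates; the -r option sets any other number. Make sure, though, that each gene's bootstrap file contains at least as many bootstrap replicates as the number given to -r.

By default ASTRAL performs site-only resampling; gene/site resampling is enabled with the -g option.

With gene/site resampling we need more gene tree replicates. With -g -r 100, some genes may need as many as 150 replicates, because when genes are resampled some of them are drawn more often than others.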
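A rough simulation of why some genes need more than -r replicates. The assumption here (my reading, not a description of ASTRAL's internals) is that each replicate draws as many genes as there are in the dataset, with replacement, and that every time a gene is drawn it uses up one of its not-yet-used bootstrap trees.

```python
import random
from collections import Counter


def replicates_used(n_genes, n_reps, seed=692):
    """Count how many bootstrap trees each gene would consume.

    Each replicate samples n_genes genes with replacement; every draw of a
    gene is ASSUMED to consume one fresh bootstrap tree of that gene.
    """
    rng = random.Random(seed)
    used = Counter()
    for _ in range(n_reps):
        for gene in rng.choices(range(n_genes), k=n_genes):
            used[gene] += 1
    return used


used = replicates_used(n_genes=424, n_reps=100)
print("most demanding gene needs", max(used.values()), "bootstrap trees")
```

On average a gene is drawn 100 times, but the most frequently drawn gene always needs at least that many and usually noticeably more, which is the reason for keeping spare replicates.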

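ASTRAL can also do gene-only bootstrapping with the --gene-only option. This needs only one input file, given with -i; do not use -b in this case.

Because bootstrapping involves a random process, we can give ASTRAL a seed number to make runs reproducible. The seed is set with -s; the default is 692.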

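ASTRAL has an exact and a heuristic version. When the number of taxa is small, the exact version can finish in a reasonable time, but it cannot handle more than 37 taxa.

The -x option turns on the exact version; the run takes about 30 seconds. Alternatively, we can use the default heuristic search.

That takes only 1 second. So how do the results differ? In fact, they are identical.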

The default primate dataset we used in the previous step had 424 genes and 14 taxa. Since we have a relatively large number of gene trees, we could reasonably expect the exact and heuristic versions to generate identical output. The key point here is that as the number of genes increases, the probability that each bipartition of the species tree appears in at least one input gene tree increases. Thus, with 424 genes all bipartitions from the species tree are in at least one input gene tree, and therefore, the exact and the heuristic versions are identical.
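The condition described here can be checked directly: collect the bipartitions of all gene trees and see whether every bipartition of a candidate species tree is among them. This is a simplified sketch with a tiny hand-written Newick reader (no quoted labels) and made-up five-taxon trees; the real search space in ASTRAL may contain additional bipartitions beyond those in the gene trees.

```python
import re


def clades(newick):
    """Return (all leaf labels, leaf set under each internal node)."""
    text = re.sub(r"\s+", "", newick)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    stack, found, prev = [set()], [], None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.append(frozenset(done))
            stack[-1] |= done
        elif tok not in (",", ";") and prev in ("(", ","):
            # A token right after "(" or "," is a leaf; drop its branch length.
            stack[-1].add(tok.split(":")[0])
        prev = tok
    return frozenset(stack[0]), found


def bipartitions(newick):
    """Non-trivial bipartitions of a tree, each as a pair of leaf sets."""
    leaves, found = clades(newick)
    return {
        frozenset((clade, leaves - clade))
        for clade in found
        if 2 <= len(clade) <= len(leaves) - 2
    }


# Hypothetical toy trees, not taken from the test data.
gene_trees = ["((A,B),(C,D),E);", "((A,C),(B,D),E);"]
species_tree = "((A,B),(C,E),D);"

search_space = set().union(*(bipartitions(t) for t in gene_trees))
missing = bipartitions(species_tree) - search_space
print(len(missing), "species-tree bipartition(s) absent from the gene trees")
```

With only two gene trees one bipartition is missing; as more gene trees are added, the chance that anything is missing shrinks, which matches the explanation above.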

We tried hard to find a subset of genes in the biological primates dataset where the exact and the heuristic versions did not match. We couldn't! So we had to resort to simulations. We simulated a 14-taxon dataset with extreme levels of ILS (average 87% RF between gene trees and the species tree). Now, with this simulated dataset, if you take only 10 genes, something interesting happens.

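When both versions are run on these 10 genes, the scores differ slightly and so do the topologies. So under extreme conditions (a high level of ILS, many gene tree errors, or few gene trees relative to the number of taxa, such as only 10 genes for 14 taxa compared with the earlier 424 genes), the difference between the two algorithms becomes visible.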

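The search space can be expanded with extra trees. Here the -e option supplies a set of extra trees that ASTRAL uses to expand its search space; the file provides 200 bootstrap replicates for the 10 simulated genes. Use -f instead when the extra trees carry species labels rather than gene labels.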

For large datasets (more than 500 taxa), increase the memory available to Java, for instance with the JVM's -Xmx option when launching ASTRAL.

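-m: removes genes whose trees have fewer than the specified number of leaves. This is useful when a certain level of taxon occupancy is required. The number is given after the option.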
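To preview what such a filter would remove, the leaves of each gene tree can be counted beforehand. A minimal sketch that illustrates the idea rather than ASTRAL's own code; it assumes unquoted leaf labels and one Newick tree per string.

```python
import re

# A leaf label is whatever follows "(" or "," up to the next delimiter.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def leaf_labels(newick):
    """Return the leaf labels of a Newick tree (unquoted labels only)."""
    return LEAF.findall(newick)


def drop_small_trees(newick_trees, min_leaves):
    """Keep only the trees with at least min_leaves leaves."""
    return [t for t in newick_trees if len(leaf_labels(t)) >= min_leaves]


# Hypothetical toy trees for illustration.
trees = ["((A:1,B:1)90:0.5,C,D);", "(A,B,C);"]
print(drop_small_trees(trees, min_leaves=4))
```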

-k completed: To build the set X (and not to score the species tree), ASTRAL internally completes the gene trees. To see these completed gene trees, run this option. This option is usable only when you also have -o. (I do not understand this.)

-k bootstrapped and -k bootstraps_norun: these options output the bootstrap replicate inputs to ASTRAL. These are useful if you want to run ASTRAL separately on each bootstrap replicate on a cluster.

-k searchspace_norun: outputs the search space and then exits.

--polylimit:

--samplingrounds: For multi-individual datasets, this option controls how many rounds of individual sampling are used in building the constraint set. Adjust it to reduce or increase the search space for multi-individual datasets.

Reference: [ /smirarab/ASTRAL/blob/master/astral-tutorial.md#running-on-a-multi-individual-datasets]