Dynamics of cross-species transmission in SIV

S1

Figure S1: Extensive divergence makes sitewise measures of genetic linkage ineffective

For pairs of biallelic sites (ignoring rare variants), R² was used to estimate how strongly the allele at one site predicts the allele at the second site, with values of 0 indicating no linkage and 1 indicating perfect linkage. The mean value of R² was 0.044, indicating very low levels of linkage overall.
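As an illustration of the statistic described above, the following is a minimal sketch of R² for a single pair of biallelic sites. It assumes the standard definition (the squared correlation between allele indicators, D² / (pA(1-pA) pB(1-pB))); the function name and input format are hypothetical and are not taken from the analysis pipeline, and filtering of rare variants is left to the caller.

```python
def r_squared(site_a, site_b):
    """Squared correlation between alleles at two biallelic sites.

    site_a, site_b: equal-length strings (or lists) holding the allele
    observed at each site, one entry per sequence in the alignment.
    Hypothetical helper; assumes rare variants were already removed.
    """
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same sequences")
    alleles_a = sorted(set(site_a))
    alleles_b = sorted(set(site_b))
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        raise ValueError("both sites must be biallelic")

    n = len(site_a)
    a, b = alleles_a[0], alleles_b[0]  # arbitrary reference alleles
    p_a = sum(x == a for x in site_a) / n
    p_b = sum(y == b for y in site_b) / n
    p_ab = sum(x == a and y == b for x, y in zip(site_a, site_b)) / n

    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


perfect = r_squared("AAGG", "CCTT")    # alleles always co-occur
unlinked = r_squared("AAGG", "CTCT")   # alleles are independent
```

The choice of reference allele at each site does not affect the result, because swapping it only flips the sign of D.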


S2

Figure S2: No evidence of linkage between nonadjacent segments of the SIV genome.

The alignment used for GARD analyses (LANL compendium with HIV overrepresentation reduced) was split along the breakpoints identified by GARD to yield the 12 genomic segments, and a maximum likelihood tree was constructed for each. The number of steps required to turn one tree topology into another was assessed for each pair of trees with the Rooted Subtree-Prune-and-Regraft (rSPR) package. Segment pairs with similar topologies have lower scores than segments with less similar topologies.


S3

Figure S3: Distribution of the number of sequences per host included in analyses

A: All available high-quality lentivirus sequences were randomly subsampled up to 25 sequences per host for the main dataset. We included the 24 hosts with at least 5 sequences available in this dataset. B: For the supplemental dataset, we randomly subsampled up to 40 sequences per host, and included the 15 hosts with at least 16 sequences available in this dataset. For both datasets, a small number of additional sequences were permitted for the few hosts that are infected by multiple viral lineages in order to represent the full breadth of known genetic diversity of lentiviruses in each host population.
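The subsampling rule described above can be sketched as follows. This is an illustrative sketch only: the function name, the `(sequence_id, host)` record format, and the fixed random seed are assumptions, and the manual allowance of extra sequences for hosts infected by multiple viral lineages is not modeled.

```python
import random
from collections import defaultdict


def subsample_by_host(records, max_per_host=25, min_per_host=5, seed=0):
    """Randomly keep up to max_per_host sequences for each host and drop
    hosts with fewer than min_per_host sequences available.

    records: iterable of (sequence_id, host) pairs (hypothetical format).
    Returns a dict mapping host -> sorted list of retained sequence ids.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    by_host = defaultdict(list)
    for seq_id, host in records:
        by_host[host].append(seq_id)

    kept = {}
    for host, ids in by_host.items():
        if len(ids) < min_per_host:
            continue  # too few sequences: host excluded
        kept[host] = sorted(rng.sample(ids, min(len(ids), max_per_host)))
    return kept


# Toy input: host A has 30 sequences, host B has 3, host C has 7.
toy = ([(f"A{i}", "A") for i in range(30)]
       + [(f"B{i}", "B") for i in range(3)]
       + [(f"C{i}", "C") for i in range(7)])
main_like = subsample_by_host(toy, max_per_host=25, min_per_host=5)
```

Under the same assumptions, the supplemental dataset would correspond to `max_per_host=40, min_per_host=16`.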


S4

Figure S4: Actual rates and Bayes factors for main dataset discrete trait analyses

Values for the asymmetric transition rates between hosts, as estimated by the CTMC, were calculated as rate * indicator (element-wise for each state logged). We report the average posterior values above. Bayes factors represent the ratio of posterior odds to prior odds that a given actual rate is non-zero. Because each of the 12 segments contributes to the likelihood but the segments have not evolved independently, we divide all Bayes factors by 12 and report the adjusted values above (and throughout the text).
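The calculation described above can be sketched for a single host-to-host transition as follows. This is a hedged illustration, not the code used for the analyses: the function names are hypothetical, the posterior odds are taken from the fraction of logged states in which the indicator equals 1, and the prior inclusion probability `prior_prob` is an assumed input because its value is not given here.

```python
def actual_rate(rates, indicators):
    """Posterior mean of rate * indicator across logged states."""
    products = [r * i for r, i in zip(rates, indicators)]
    return sum(products) / len(products)


def adjusted_bayes_factor(indicators, prior_prob, n_segments=12):
    """Bayes factor that the actual rate is non-zero, divided by the
    number of genomic segments.

    indicators: 0/1 indicator values, one per logged state.
    prior_prob: prior probability that the indicator is 1 (assumed input).
    """
    post_prob = sum(indicators) / len(indicators)
    if post_prob >= 1.0:
        return float("inf")  # indicator on in every logged state
    posterior_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds / n_segments


# Toy log of four states for one transition (values are illustrative).
toy_rates = [2.0, 4.0, 6.0, 8.0]
toy_indicators = [1, 0, 1, 1]
rate = actual_rate(toy_rates, toy_indicators)
bf = adjusted_bayes_factor(toy_indicators, prior_prob=0.5)
```

Dividing by the number of segments is the adjustment described in the legend; it makes the reported support more conservative than the unadjusted Bayes factor.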


S5

Figure S5: Maximum clade credibility trees for each of the 12 GARD-identified genomic segments of the lentiviral genome

Tips are color coded by known host state; branches and internal nodes are color coded by inferred host state, with color saturation indicating the confidence of these assignments. Monophyletic clades of viruses from the same lineage are collapsed, with the triangle width proportional to the number of represented sequences.


S6

Figure S6: Most lentiviruses are the product of ancient cross-species transmissions (supplemental dataset).

The phylogeny of the host species' mitochondrial genomes forms the outer circle (gray: not included in supplemental dataset). Arrows with filled arrowheads represent transmission events inferred by the model with Bayes factor (BF) >= 3.0; black arrows have BF >= 10.0, and the opacity of gray arrows is scaled for BF between 3.0 and 10.0. Transmissions with 2.0 <= BF < 3.0 have open arrowheads (see discussion). Width of the arrow indicates the rate of transmission (actual rates = rates * indicators). Circle sizes represent network centrality scores for each host. Transmissions from chimps to humans; chimps to gorillas; gorillas to humans; sooty mangabeys to humans; sabaeus to tantalus; and vervets to baboons have been previously documented. To our knowledge, all other transmissions illustrated are novel identifications.


S7

Figure S7: Actual rates and Bayes factors for supplemental dataset discrete trait analyses

Values for the asymmetric transition rates between hosts, as estimated by the CTMC, were calculated as rate * indicator (element-wise for each state logged). We report the average posterior values above. Bayes factors represent the ratio of posterior odds to prior odds that a given actual rate is non-zero. Because each of the 12 segments contributes to the likelihood but the segments have not evolved independently, we divide all Bayes factors by 12 and report the adjusted values above (and throughout the text).


S8

Figure S8: Maximum clade credibility trees for each of the 12 GARD-identified genomic segments of the lentiviral genome (supplemental dataset)

Tips are color coded by known host state; branches and internal nodes are color coded by inferred host state, with color saturation indicating the confidence of these assignments. Monophyletic clades of viruses from the same lineage are collapsed, with the triangle width proportional to the number of represented sequences.  


S9

Figure S9: Comparison of main and supplemental dataset discrete trait analysis results

Each datapoint represents one of the 210 possible directed transmissions (15 × 14 ordered host pairs) among the 15 hosts present in both datasets. The black dashed line shows y=x; the linear regression and 95% CI are shown in gray.