Predictor data for phylogeographic GLM.
Currently planned predictors
Fraction of population living in an urban center. Source: data downloaded as CSV from The World Bank | Data.
Destination country population size. Source: CIA World Factbook and UNdata, same as above.
Degree of air traffic between countries. Source: Bluedot. Please note that due to licensing agreements with IATA, this predictor cannot be made publicly available on Github.
Vector abundance in each country. Possible source: Messina et al. paper.
Latitudinal direction of ZIKV migration. This pairwise predictor has a value of 1 in the cell if the origin's population-weighted centroid is north of the destination's, and a value of -1 if the origin's centroid is south of the destination's.
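The latitudinal direction predictor described above can be sketched in a few lines of Python. This is only an illustration: the function name, the input dictionary of centroid latitudes, and the example values are hypothetical, and the handling of equal latitudes (0 here) is an assumption, since the description does not cover that case.

```python
def latitudinal_direction(centroid_lat):
    """Map each (origin, destination) pair to 1 or -1.

    centroid_lat: dict of country -> latitude of its population-weighted
    centroid, in degrees north (hypothetical input format).
    """
    predictor = {}
    for origin, lat_o in centroid_lat.items():
        for dest, lat_d in centroid_lat.items():
            if origin == dest:
                continue  # no self-pairs in a pairwise predictor
            if lat_o > lat_d:
                predictor[(origin, dest)] = 1   # origin is north of destination
            elif lat_o < lat_d:
                predictor[(origin, dest)] = -1  # origin is south of destination
            else:
                predictor[(origin, dest)] = 0   # assumption: ties are not specified
    return predictor


# Illustrative latitudes only, not project data.
example_latitudes = {"A": 10.0, "B": -5.0, "C": 25.0}
direction = latitudinal_direction(example_latitudes)
```

By construction the predictor is antisymmetric: swapping origin and destination flips the sign.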
Possible other predictors
- Size of each country's native-born or foreign-born population (available from UNdata) as a possible measure of population-level migrancy?
Workflow for data standardization and making predictor matrices in BEAST format.
1) Standardize country names. See
indexed-countries-50.tsv for canonical project names. Cleaned up data are written to
Importantly, at this point there are two types of tsv files: those that contain data that are not pairwise (e.g. country population size) and those that contain data that are inherently pairwise (e.g. number of air traffic passengers flowing between countries). Pairwise predictor tsv files are written in the following format:
origin \t destination \t predictor_value
Non-pairwise tsv files are written as:
country \t value
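A minimal sketch of this step, assuming name standardization is done with a lookup table of aliases. The alias table, the canonical name set, and all values below are made up for illustration; the project's actual canonical names live in the indexed countries file.

```python
import csv
import io

# Hypothetical alias table mapping raw source names to canonical names.
ALIASES = {"Venezuela, RB": "Venezuela", "Bahamas, The": "Bahamas"}


def canonical_name(raw, canonical):
    """Return the canonical name for raw, or None if it is not recognized."""
    name = ALIASES.get(raw.strip(), raw.strip())
    return name if name in canonical else None


def to_tsv(rows):
    """Serialize rows (tuples) as tab-separated lines."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


canonical = {"Venezuela", "Bahamas", "Brazil"}  # stand-in for the canonical list
raw_population = [("Venezuela, RB", 100), ("Brazil", 200), ("Atlantis", 5)]  # made-up values

# Non-pairwise format: country \t value (unrecognized names are dropped).
non_pairwise = [
    (canonical_name(country, canonical), value)
    for country, value in raw_population
    if canonical_name(country, canonical) is not None
]
non_pairwise_tsv = to_tsv(non_pairwise)

# Pairwise format: origin \t destination \t predictor_value.
pairwise_tsv = to_tsv([("Brazil", "Venezuela", 12)])  # made-up value
```

Returning None for unrecognized names makes it easy to log and review them before they are dropped, rather than silently passing through a misspelled country.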
2) tsv files that are not pairwise are collated within a single dataframe by left-joining on
indexed-countries-45.tsv. This filters out countries that are not included in the analysis.
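The left join can be illustrated without a dataframe library. In this plain-Python sketch, the function name, predictor names, and values are hypothetical; the index list stands in for the countries listed in the indexed countries file.

```python
def left_join(index_countries, predictors):
    """Collate non-pairwise predictors into one table keyed by the index.

    index_countries: ordered list of countries included in the analysis.
    predictors: dict of predictor name -> {country: value}.
    Countries absent from the index are dropped; missing values become None.
    """
    table = []
    for country in index_countries:
        row = {"country": country}
        for name, values in predictors.items():
            row[name] = values.get(country)  # None marks a missing value
        table.append(row)
    return table


# Made-up example data.
index_countries = ["Brazil", "Colombia", "Haiti"]
predictors = {
    "population": {"Brazil": 200, "Colombia": 50, "France": 60},
    "urban_fraction": {"Brazil": 0.8, "Haiti": 0.5},
}
table = left_join(index_countries, predictors)
```

Because the index is the left side of the join, "France" is filtered out here, while index countries with no data for a predictor are kept with a None that can be checked before transformation.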
3) Make pairwise-format
tsv files from the non-pairwise data by assigning either the origin value or the destination value to all origin-destination pairs. These
tsv files are the input files for making the predictor matrices. They are stored in the
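As a rough sketch of step 3, a non-pairwise predictor can be expanded into origin-destination rows as follows. The function name and the `side` argument are hypothetical, and the values are made up.

```python
from itertools import permutations


def to_pairwise(values, side):
    """Expand {country: value} into (origin, destination, value) rows.

    side: "origin" assigns the origin's value to each pair,
          "destination" assigns the destination's value.
    """
    if side not in ("origin", "destination"):
        raise ValueError("side must be 'origin' or 'destination'")
    rows = []
    # permutations yields every ordered pair of distinct countries
    for origin, dest in permutations(values, 2):
        value = values[origin] if side == "origin" else values[dest]
        rows.append((origin, dest, value))
    return rows


population = {"A": 1, "B": 2}  # made-up values
origin_rows = to_pairwise(population, "origin")
destination_rows = to_pairwise(population, "destination")
```

Each non-pairwise predictor can therefore give rise to two pairwise predictors, one for the origin and one for the destination.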
4) Import all origin-destination pair data from each of the
tsv files, then make flattened matrices. Note that BEAST expects a non-standard element ordering for flattened GLM predictor matrices; our code accounts for this.
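The sketch below shows one way to flatten a pairwise matrix. It rests on two assumptions that should be checked against the project's code and the BEAST version in use: that rows index origins and columns index destinations, and that the flattened order is the upper triangle read row by row, followed by the lower triangle read column by column. Function names and values are hypothetical.

```python
def build_matrix(rows, countries):
    """Build a square matrix from (origin, destination, value) rows.

    Assumption: matrix[i][j] holds the value for origin i, destination j.
    """
    index = {country: k for k, country in enumerate(countries)}
    n = len(countries)
    matrix = [[0.0] * n for _ in range(n)]
    for origin, dest, value in rows:
        matrix[index[origin]][index[dest]] = value
    return matrix


def flatten_for_glm(matrix):
    """Flatten off-diagonal entries in the assumed ordering.

    Assumed order: upper triangle row-wise, then lower triangle column-wise.
    The diagonal is skipped.
    """
    n = len(matrix)
    upper = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    lower = [matrix[i][j] for j in range(n) for i in range(j + 1, n)]
    return upper + lower


# Made-up 3-country example; each value encodes its own (row, column) position.
countries = ["A", "B", "C"]
rows = [("A", "B", 1), ("A", "C", 2), ("B", "C", 12),
        ("B", "A", 10), ("C", "A", 20), ("C", "B", 21)]
flat = flatten_for_glm(build_matrix(rows, countries))
```

For K countries the flattened vector has K(K-1) entries, one per ordered origin-destination pair.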
5) Take ln of all values in the matrix, then standardize the log-transformed values by applying the following to each value in the matrix:
(value - mean)/(standard deviation)
Note that the predictor for latitudinal direction of ZIKV migration is not ln-transformed, but is still standardized.
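A minimal sketch of the transformation, assuming the mean and standard deviation are computed over all values of a single flattened predictor. Whether the population or sample standard deviation is intended is not stated; the population form is used here as an assumption. The function name and values are hypothetical.

```python
import math
import statistics


def transform(values, log_transform=True):
    """Optionally ln-transform, then standardize to mean 0 and SD 1.

    Assumption: population standard deviation (statistics.pstdev).
    Note: math.log raises ValueError for zero or negative inputs.
    """
    if log_transform:
        values = [math.log(v) for v in values]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Made-up predictor values.
population_z = transform([1000.0, 20000.0, 350000.0, 4000000.0])

# The latitudinal direction predictor skips the ln step but is still standardized.
direction_z = transform([1, -1, 1, -1], log_transform=False)
```

Skipping the log for the direction predictor is necessary as well as intended, since the natural log of -1 is undefined.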
6) Export log-transformed and standardized flattened matrices to
tsv. The contents of these tsv files can be copied and pasted directly into the BEAST xml file for the phylogeographic analysis. These
tsv files are stored in the
transformed-linearized-matrices directory.