Abstract

With the expansion of DNA sequencing technology, quantifying evolution during emerging viral outbreaks has become an important tool for scientists and public health officials. Although it is known that the degree of sequence divergence significantly affects the calculation of evolutionary metrics in viral outbreaks, the extent and duration of this effect during an actual outbreak remain unclear. We analyzed how limited divergence time during an early viral outbreak affects the accuracy of molecular evolutionary metrics. Using sequence data from the first 25 months of the 2009 pandemic H1N1 (pH1N1) outbreak, we calculated three standard evolutionary metrics (molecular clock rate, i.e., evolutionary rate; whole-gene dN/dS; and site-wise dN/dS) for hemagglutinin and neuraminidase, using increasingly long time windows, from 1 month to 25 months. For the molecular clock rate, we found that at least 3–4 months of temporal divergence from the start of sampling were required to make precise estimates that also agreed with long-term values. For whole-gene dN/dS, we found that at least 2 months of data were required to generate precise estimates, but 6–9 months were required for estimates to approach their long-term values. For site-wise dN/dS estimates, we found that at least 6 months of sampling divergence were required before the majority of sites had at least one mutation and were thus evolutionarily informative. Furthermore, 8 months of sampling divergence were required before the site-wise estimates appropriately reflected the distribution of values expected from known protein-structure-based evolutionary pressure in influenza. In summary, we found that evolutionary metrics calculated from gene sequence data in early outbreaks should be expected to deviate from their long-term estimates for at least several months after the initial emergence and sequencing of the virus.