Actual v. predicted distances of Atlantic cyclones

Machine learning

Statistical models

Nonlinear forecasting

"Mechanistic" modeling

Inference of gene regulatory networks

from knockout, observational, and synthetic data

Competing methods: Lasso, random forests, Bayesian networks, mutual information, ANOVA, etc.

An artificial neural network that included microbial interactions performed best.

Reconstruct attractor ("library") from time series

Use attractor to make short-term predictions

Trajectories in chaotic attractors diverge

$$ \lvert \delta \textbf{Z}(t)\rvert \approx e^{\lambda t}\lvert\delta\textbf{Z}_0\rvert$$

$\lambda$ is the Lyapunov exponent

(so with chaos, $\lambda>0$)
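To make the sign of $\lambda$ concrete, here is a minimal sketch that estimates the Lyapunov exponent of a toy chaotic system. The logistic map $x_{t+1} = r\,x_t(1-x_t)$ with $r=4$ is an assumption chosen for illustration (it is not one of the systems discussed here), and the function name and defaults are hypothetical. For a one-dimensional map, $\lambda$ can be estimated as the long-run average of $\log\lvert f'(x_t)\rvert$.

```python
import math

def lyapunov_logistic(r=4.0, x0=0.2, n_steps=20000, burn_in=100):
    """Estimate the Lyapunov exponent of the logistic map x -> r*x*(1-x).

    Averages log|f'(x)| along one trajectory, where f'(x) = r*(1 - 2x).
    The map and all defaults are illustrative assumptions.
    """
    x = x0
    total = 0.0
    for step in range(burn_in + n_steps):
        if step >= burn_in:
            # Floor the derivative magnitude to avoid log(0) if x hits 0.5.
            total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-12))
        x = r * x * (1.0 - x)
    return total / n_steps

lam = lyapunov_logistic()
print(f"estimated lambda = {lam:.3f}")
```

A positive estimate means nearby trajectories separate roughly exponentially, which is why only short-term prediction is feasible.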

- Choose an embedding dimension $E$ and lag $\tau$
- Each point in $E$-dimensional space: $\{x_t, x_{t-\tau}, x_{t-2\tau}, \ldots, x_{t-(E-1)\tau}\}$
- Construct these points from the time series
- Define a point to predict ("predictee")
- See where predictee's $E+1$ nearest neighbors wind up $t$ steps into the future
- Measure correlations $\rho$ between predictee's observed future state and neighbors' weighted predictions
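The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the exponential distance weighting $w = e^{-d/d_{\min}}$ is one common choice (the list above only says "weighted"), the library/predictee split is arbitrary, and the test series is a toy logistic map. All function names are hypothetical.

```python
import math

def embed(series, E, tau):
    """Return (t, vector) pairs with vector = (x_t, x_{t-tau}, ..., x_{t-(E-1)tau})."""
    start = (E - 1) * tau
    return [(t, tuple(series[t - k * tau] for k in range(E)))
            for t in range(start, len(series))]

def simplex_predict(series, E, tau, steps, lib_end):
    """Forecast `steps` ahead for points at or after lib_end, using series[:lib_end] as library."""
    points = embed(series, E, tau)
    # Library points must have a known future that also lies inside the library.
    library = [(t, v) for t, v in points if t + steps < lib_end]
    observed, predicted = [], []
    for t, v in points:
        if t < lib_end or t + steps >= len(series):
            continue
        # E+1 nearest neighbors of the predictee in the embedding space.
        nearest = sorted((math.dist(v, w), s) for s, w in library)[:E + 1]
        d_min = max(nearest[0][0], 1e-12)
        weights = [math.exp(-d / d_min) for d, _ in nearest]  # assumed weighting
        forecast = sum(wt * series[s + steps]
                       for wt, (_, s) in zip(weights, nearest)) / sum(weights)
        observed.append(series[t + steps])
        predicted.append(forecast)
    return observed, predicted

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Toy chaotic series (an assumption for the demo): logistic map with r = 3.8.
series = [0.4]
for _ in range(399):
    series.append(3.8 * series[-1] * (1.0 - series[-1]))

observed, predicted = simplex_predict(series, E=2, tau=1, steps=1, lib_end=200)
rho = pearson(observed, predicted)
print(f"rho = {rho:.3f} over {len(observed)} predictions")
```

Repeating the call over a range of `E` and `tau` values and comparing `rho` is one way to pick the embedding that predicts best.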

An unsolved problem

Use $E$ (and $\tau$) that yield best predictions

where $I(s)$ is the incidence in week $s$, $h$ is a strain-specific incidence threshold, $T$ is the time of crossing $h$, and $Y$ is the strain's whole-season cumulative incidence proxy.

Aim: Predict peak timing and magnitude

Achieves 37% accuracy with 1-3 week lead, ~50% at 0 week lead

The general strategy for antigenically evolving seasonal influenza viruses is to match the vaccine strain as closely as possible to circulating viruses.

Because of manufacturing and distribution lead times, vaccine strain selection occurs in February for a fall vaccination campaign ahead of the seasonal influenza epidemic.

Future frequency $x_i(t+\Delta t)$ of strain $i$ derives from strain fitness $f_i$ and present day frequency $x_i(t)$, such that

$$x_i(t+\Delta t) = \frac{1}{Z(t)} \, x_i(t) \, \mathrm{exp}(f_i \, \Delta t)$$

Strain frequencies at each timepoint are normalized by total frequency $Z(t)$. This captures clonal interference between competing lineages.
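As a small worked example of the projection above, the sketch below grows each strain's frequency exponentially and renormalizes by the total. The frequencies, fitness values, and function name are hypothetical and chosen only for illustration.

```python
import math

def project_frequencies(freqs, fitness, dt):
    """Apply x_i(t + dt) = x_i(t) * exp(f_i * dt) / Z(t) to every strain."""
    unnormalized = [x * math.exp(f * dt) for x, f in zip(freqs, fitness)]
    z = sum(unnormalized)  # normalizing total Z(t)
    return [u / z for u in unnormalized]

current = [0.6, 0.3, 0.1]    # hypothetical present-day frequencies
fitness = [-0.5, 0.0, 1.0]   # hypothetical strain fitnesses f_i
projected = project_frequencies(current, fitness, dt=1.0)
print([round(x, 3) for x in projected])
```

Because of the shared normalization, a strain's frequency can fall even when its own fitness is zero, simply because a fitter competitor is growing faster.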

The fitness $f$ of strain $i$ is estimated as

$$f_i = \beta^\mathrm{A} \, f_i^\mathrm{A} + \beta^\mathrm{B} \, f_i^\mathrm{B} + \ldots$$

where $f^\mathrm{A}$, $f^\mathrm{B}$, etc. are different standardized viral attributes and the coefficients $\beta^\mathrm{A}$, $\beta^\mathrm{B}$, etc. are trained on historical evolution.
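The fitness estimate can be sketched as a weighted sum of standardized attributes. In this sketch the raw attribute values and the coefficients are invented for illustration, and z-scoring across strains is an assumption about what "standardized" means here. In practice the coefficients would be trained on historical data rather than set by hand.

```python
import statistics

def standardize(values):
    """Z-score a list of raw attribute values across strains (assumed standardization)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical raw attributes for four strains.
raw_a = [2.0, 0.0, 1.0, 3.0]
raw_b = [0.1, 0.4, 0.2, 0.3]

# Hypothetical coefficients; real values would be learned, not hand-picked.
beta_a, beta_b = 1.2, 0.5

f_a = standardize(raw_a)
f_b = standardize(raw_b)
fitness = [beta_a * a + beta_b * b for a, b in zip(f_a, f_b)]
print([round(f, 3) for f in fitness])
```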

| Antigenic drift | Intrinsic fitness | Recent growth |
|---|---|---|
| epitope mutations | non-epitope mutations | local branching index |
| HI titers | DMS data (via Bloom lab) | delta frequency |

Strain fitness $f_i$ is largely fixed by the "fundamentals" of the strain rather than being learned from frequency behavior.

$x' = \frac{x \, (1+s)}{x \, (1+s) + (1-x)}$ for frequency $x$ in one generation with selective advantage $s$

$x(t) = \frac{x_0 \, (1+s)^t}{x_0 \, (1+s)^t + (1-x_0)}$ for initial frequency $x_0$ over $t$ generations

Trajectories are linear once logit transformed via $\log\left(\frac{x}{1 - x}\right)$
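A quick numerical check of the linearity claim: dividing $x(t)$ by $1 - x(t)$ gives $\frac{x_0}{1-x_0}(1+s)^t$, so the logit should increase by $\log(1+s)$ every generation. The values of $x_0$ and $s$ below are arbitrary illustrative choices.

```python
import math

def next_frequency(x, s):
    """One generation of selection: x' = x(1+s) / (x(1+s) + (1-x))."""
    return x * (1 + s) / (x * (1 + s) + (1 - x))

def frequency(x0, s, t):
    """Closed-form frequency after t generations."""
    growth = x0 * (1 + s) ** t
    return growth / (growth + (1 - x0))

def logit(x):
    return math.log(x / (1 - x))

x0, s = 0.01, 0.1  # illustrative values
logits = [logit(frequency(x0, s, t)) for t in range(50)]
slopes = [b - a for a, b in zip(logits, logits[1:])]

# Iterating the one-generation update should reproduce the closed form.
x = x0
for _ in range(10):
    x = next_frequency(x, s)

print(f"per-generation logit slope = {slopes[0]:.4f}, log(1+s) = {math.log(1 + s):.4f}")
```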

Multinomial logistic regression models the probability of a virus sampled at time $t$ belonging to variant $i$ as

$$\mathrm{Pr}(X = i) = x_i(t) = \frac{p_i \, \mathrm{exp}(f_i \, t)}{\sum_{1 \le j \le n} p_j \, \mathrm{exp}(f_j \, t) }$$

where, for $n$ variants, the model has $2n$ parameters: $p_i$, the frequency of variant $i$ at the initial timepoint, and $f_i$, the growth rate (fitness) of variant $i$.

The model is fit by minimizing the log loss between predicted and observed variants across all observations in the dataset.
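A minimal sketch of the quantities involved, assuming each observation is a (time, variant index) pair. It only evaluates the variant frequencies and the average log loss for given parameters; the optimizer that would search over $p_i$ and $f_i$ is omitted. The observations and parameter values are invented for illustration, and this is not the implementation from the cited paper.

```python
import math

def variant_frequencies(p, f, t):
    """x_i(t) = p_i * exp(f_i * t) / sum_j p_j * exp(f_j * t)."""
    weights = [p_i * math.exp(f_i * t) for p_i, f_i in zip(p, f)]
    total = sum(weights)
    return [w / total for w in weights]

def log_loss(p, f, observations):
    """Average negative log probability of the observed variant at each sampled time."""
    return -sum(math.log(variant_frequencies(p, f, t)[i])
                for t, i in observations) / len(observations)

# Hypothetical samples: (time, observed variant index); variant 1 is taking over.
observations = [(0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (3, 1), (3, 1)]

loss_flat = log_loss([0.5, 0.5], [0.0, 0.0], observations)    # no growth difference
loss_growth = log_loss([0.9, 0.1], [0.0, 1.5], observations)  # variant 1 is fitter
print(f"flat model: {loss_flat:.3f}, growth model: {loss_growth:.3f}")
```

The model that lets variant 1 grow assigns higher probability to the observed samples and so has the lower loss. Fitting amounts to searching for the parameters that push this loss as low as possible.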

Model from Figgins and Bedford. 2022. medRxiv.

Multinomial logistic regression should work well for SARS-CoV-2 prediction, except that new variants have been emerging fast enough that the prediction horizon is quite short.