Pick a model, including parameter values

Evaluate how likely the observed data are, given the model

Tweak the model to make the observations more likely

Is this model superior to other models?

In maximum likelihood (ML) inference, you have some set of data $D$ and a model for generating these data. This model has parameters $\theta$. The probability of observing the data given the parameters is $\mathrm{Pr}(D \, | \, \theta)$. The best point estimate $\hat{\theta}$ is simply the value of $\theta$ that maximizes $\mathrm{Pr}(D \, | \, \theta)$.

For example, if we have data $D$ from a Bernoulli observation model representing $k$ successes in $n$ trials, then the probability of observing this outcome given success probability $p$ is simply $$\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}.$$

For the Bernoulli model $\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}$, the maximum likelihood estimate is $\hat{p} = k/n$. For example, with $k=8$ and $n=10$, $\hat{p}=0.8$, and the likelihood curve $L(p) = p^8 \, (1-p)^2$ peaks at that value.
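A minimal sketch of this example in R (the grid resolution and plotting choices are mine): it evaluates the likelihood over a grid of $p$ and confirms that the curve peaks at $\hat{p} = 0.8$.

```r
# Bernoulli likelihood for k = 8 successes in n = 10 trials
k <- 8
n <- 10
lik <- function(p) p^k * (1 - p)^(n - k)

p_grid <- seq(0, 1, by = 0.001)           # candidate values of p
p_hat  <- p_grid[which.max(lik(p_grid))]  # grid maximizer, ~0.8 = k/n

plot(p_grid, lik(p_grid), type = "l", xlab = "p", ylab = "likelihood")
abline(v = p_hat, lty = 2)
```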

Observed trajectory $D=(y_1,\dots,y_T)$ depends on unknown parameter(s) $\theta$

Probability density function of $D$ is $f_{\theta}$

$$L_D(\theta)=f_{\theta}(D)$$

Problem: These data aren't independent.

Solution: Factor the joint density into conditional and marginal

e.g., $f(y_3,y_2,y_1)=f(y_3 \, | \, y_2,y_1) \cdot f(y_2,y_1)$

$$f(y_3,y_2,y_1)=f(y_3 \, | \, y_2,y_1) \cdot f(y_2 \, | \, y_1) \cdot f(y_1)$$

$$L(\theta)=\prod_{t=2}^{T}f(y_t|l_{t-1})\cdot f(y_1)$$

where $l_{t-1}$ is information through $t-1$ (i.e., $y_{t-1},...,y_1$), and $T$ is the time series length
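To make the factorization concrete, here is a small sketch using a Gaussian AR(1) model as a stand-in (the AR(1) choice and simulated data are mine, not the course's models): the log-likelihood is the log-density of the first observation plus the sum of the conditional log-densities of each later observation given its past.

```r
# Log-likelihood of a Gaussian AR(1) series via the conditional factorization
# L(theta) = f(y_1) * prod_{t=2}^T f(y_t | y_{t-1})
ar1_loglik <- function(y, phi, sigma) {
  # marginal (stationary) density of the first observation
  ll <- dnorm(y[1], mean = 0, sd = sigma / sqrt(1 - phi^2), log = TRUE)
  # conditional density of each later observation given its predecessor
  for (t in 2:length(y)) {
    ll <- ll + dnorm(y[t], mean = phi * y[t - 1], sd = sigma, log = TRUE)
  }
  ll
}

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 100))  # simulated AR(1) data
ar1_loglik(y, phi = 0.7, sigma = 1)
```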

Case counts at different times

Sequences

Titers

or some composite of observations

Maximizing the likelihood means maximizing the log-likelihood,

or, equivalently, minimizing the negative log-likelihood

Can be analytically tractable

For our models, it's not

Brute force

Derivative-based methods

Simplex

Simulated annealing

Sequential Monte Carlo

Many others... but few tried and true
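As a sketch of the simplex and derivative-based options above, base R's `optim()` can minimize a negative log-likelihood directly (here reusing the AR(1) sketch from earlier; the starting values are arbitrary):

```r
# Minimize the negative log-likelihood; Nelder-Mead is the simplex method,
# and method = "BFGS" would give a derivative-based alternative
negloglik <- function(par, y) {
  phi   <- par[1]
  sigma <- exp(par[2])              # log scale keeps sigma positive
  if (abs(phi) >= 1) return(1e10)   # keep phi in the stationary region
  -ar1_loglik(y, phi = phi, sigma = sigma)
}

fit <- optim(par = c(0.5, 0), fn = negloglik, y = y, method = "Nelder-Mead")
fit$par[1]        # estimate of phi (simulated with 0.7)
exp(fit$par[2])   # estimate of sigma (simulated with 1)
```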

POMP (partially observed Markov process)

pMCMC (particle Markov chain Monte Carlo)

TSIR, if conditions met

Hold parameter(s) constant, fit the rest

Generally, it's difficult to make probability statements using frequentist statistics. You cannot directly say that model 1 is twice as likely as model 2. People misuse *p* values in this sort of fashion all the time.

Bayes' rule forms the basis of Bayesian inference. It states: $$ \mathrm{Pr}(A \, | \, B) = \cfrac{ \mathrm{Pr}(B \, | \, A) \, \mathrm{Pr}(A) }{ \mathrm{Pr}(B) } $$

Bayesian inference applies Bayes' rule in a likelihood context, so that $$ \mathrm{Pr}(\theta \, | \, D) = \cfrac{ \mathrm{Pr}(D \, | \, \theta) \, \mathrm{Pr}(\theta) }{ \mathrm{Pr}(D) }, $$ where $D$ is data and $\theta$ are parameters. $\mathrm{Pr}(D)$ is constant with respect to $\theta$, so that $ \mathrm{Pr}(\theta \, | \, D) \propto \mathrm{Pr}(D \, | \, \theta) \, \mathrm{Pr}(\theta)$. This relationship is often referred to as $ \mathrm{posterior} \propto \mathrm{likelihood} \times \mathrm{prior}$.

Following our previous Bernoulli example, we've observed $k$ successes in $n$ trials, and so the likelihood $\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}$. We'll assume a flat prior $\mathrm{Pr}(p) = 1$. In this case, the marginal likelihood follows $$\mathrm{Pr}(k,n) = \int_0^1 \mathrm{Pr}(k,n \, | \, p) \, \mathrm{Pr}(p) \, dp = \cfrac{k! \, (n-k)!}{(n+1)!}.$$ And the full posterior follows $$\mathrm{Pr}(p \, | \, k,n) = \cfrac{(n+1)! \, p^k \, (1-p)^{n-k}}{k! \, (n-k)!}.$$

If $k=8$ and $n=10$, the posterior is a $\mathrm{Beta}(k+1, \, n-k+1) = \mathrm{Beta}(9, 3)$ distribution: the posterior mean is $\mathrm{E}[p] = 0.75$, and the 95% credible interval extends from $0.482$ to $0.896$.
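These summaries are easy to check numerically: with the flat prior, the posterior above is a Beta($k+1$, $n-k+1$) density, so base R's Beta functions reproduce them (a quick check, not part of the original analysis):

```r
k <- 8; n <- 10
(k + 1) / (n + 2)                          # posterior mean: 0.75
qbeta(c(0.025, 0.975), k + 1, n - k + 1)   # 95% credible interval: ~(0.482, 0.896)

# posterior density, proportional to p^k (1 - p)^(n - k)
p_grid <- seq(0, 1, by = 0.001)
plot(p_grid, dbeta(p_grid, k + 1, n - k + 1), type = "l",
     xlab = "p", ylab = "posterior density")
```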

Compare the maximum likelihood estimate for the same data: $\hat{p} = k/n = 0.8$, slightly higher than the posterior mean of $0.75$.

Markov Chain Monte Carlo

Metropolis-Hastings MCMC

Particle MCMC

Hybrid/Hamiltonian Monte Carlo

Many others

e.g., R-package pomp
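As a concrete illustration of the Metropolis-Hastings idea listed above, here is a minimal random-walk sampler for the Bernoulli posterior from the earlier example (the proposal scale, chain length, and burn-in are arbitrary choices):

```r
# Random-walk Metropolis-Hastings for Pr(p | k, n) with a flat prior on (0, 1)
set.seed(2)
k <- 8; n <- 10
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)   # zero prior density outside (0, 1)
  k * log(p) + (n - k) * log(1 - p)    # log-likelihood, up to a constant
}

n_iter <- 20000
chain  <- numeric(n_iter)
p_cur  <- 0.5
for (i in seq_len(n_iter)) {
  p_prop <- p_cur + rnorm(1, sd = 0.1)                      # symmetric proposal
  if (log(runif(1)) < log_post(p_prop) - log_post(p_cur)) { # accept/reject
    p_cur <- p_prop
  }
  chain[i] <- p_cur
}

mean(chain[-(1:1000)])                        # ~0.75, the posterior mean
quantile(chain[-(1:1000)], c(0.025, 0.975))   # ~(0.48, 0.90)
```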

Fit models to time series or (new) longitudinal data

$\lambda_{i,j,t}=\lambda_{0_j}f(\overrightarrow{\theta_j} \boldsymbol{X_{it}}) + \text{I(prev. inf.)}d_{jc_i}e^{-w_j(t-t_\mathrm{clr})}$

Approximate Bayesian Computation

Time-series probes

Many parameters

Multiple minima

Noise, nonstationarity, etc.

Assume we have time series of cases of each.

- Assume this morning's model structure. (Potential modifications: allow coinfections, noise in rates.)
- Create a mapping from latent states (number infected, $X(t)$) to data (cases, $Y(t)$), e.g., $$Y(t) \sim \text{nbinom}(\text{mean} = \rho X(t), \text{size} = \psi )$$ with $\rho$ representing the reporting rate. (N.B. Potential problems with overcounting here; better to use an accumulator variable.)
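A minimal sketch of that measurement step in R (the latent incidence series `X`, the reporting rate, and the dispersion below are made up purely for illustration):

```r
# Map latent incidence X(t) to observed cases Y(t) via a negative binomial
set.seed(3)
rho <- 0.3   # reporting rate
psi <- 10    # negative binomial size (dispersion)
X   <- c(120, 180, 260, 310, 240, 150, 90)   # made-up latent incidence

Y <- rnbinom(length(X), size = psi, mu = rho * X)   # simulated case reports

# the corresponding measurement log-density, used inside the likelihood
dmeas <- function(Y, X, rho, psi) sum(dnbinom(Y, size = psi, mu = rho * X, log = TRUE))
dmeas(Y, X, rho, psi)
```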

Constrain/fix some parameters, fit (vary) $\alpha$, *a*, maybe $\rho$, etc., to find the parameter combinations that maximize the likelihood of the data, e.g., multiple iterated filtering using a particle filter (sequential Monte Carlo)
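A bare-bones version of the particle filter (sequential Monte Carlo) step, reusing the simulated `Y`, `rho`, and `psi` from the previous sketch; the latent process here is a log-normal random walk in incidence, a stand-in chosen only to keep the example short, not the course's transmission model:

```r
# Bootstrap particle filter: estimate log Pr(Y_1..T | theta) by simulating
# particles forward and weighting them with the measurement density
pf_loglik <- function(Y, rho, psi, sigma_proc, n_part = 1000) {
  logX <- rnorm(n_part, mean = log(Y[1] / rho + 1), sd = 1)  # initial particles
  ll <- 0
  for (t in seq_along(Y)) {
    logX <- logX + rnorm(n_part, sd = sigma_proc)               # process step
    w    <- dnbinom(Y[t], size = psi, mu = rho * exp(logX))     # measurement weights
    ll   <- ll + log(mean(w))                                   # likelihood contribution
    idx  <- sample.int(n_part, n_part, replace = TRUE, prob = w)  # resample
    logX <- logX[idx]
  }
  ll
}

pf_loglik(Y, rho = 0.3, psi = 10, sigma_proc = 0.2)
```

Iterated filtering and particle MCMC both wrap a filter like this in an outer loop that perturbs and updates the parameters toward higher likelihood.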

Confirm convergence

AIC and WAIC

Leave-one-out cross-validation (LOO)

Out-of-sample prediction

Replicate on simulated data
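For the information-criterion route, the bookkeeping is simple once the maximized log-likelihoods are in hand (the numbers below are placeholders, not results):

```r
# AIC = 2 * (number of fitted parameters) - 2 * (maximized log-likelihood)
aic <- function(loglik, n_par) 2 * n_par - 2 * loglik

aic(loglik = -432.1, n_par = 4)   # model 1 (placeholder values)
aic(loglik = -430.8, n_par = 6)   # model 2; the smaller AIC is preferred
```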

$$H'=aH-bHL$$

$$L'=cHL-dL$$

$H$ hares, $L$ lynxes

hare birth rate $a$, predation rate $b$,

consumption rate $c$, death rate $d$
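A small sketch integrating these equations (simple Euler steps, with initial conditions and parameter values of my choosing; a proper ODE solver such as deSolve would be preferable in practice):

```r
# Lotka-Volterra predator-prey dynamics: H' = aH - bHL, L' = cHL - dL
a <- 1.0; b <- 0.1; c <- 0.02; d <- 0.5    # illustrative values only
dt    <- 0.001
times <- seq(0, 50, by = dt)
H <- numeric(length(times)); L <- numeric(length(times))
H[1] <- 40; L[1] <- 9

for (i in seq_along(times)[-1]) {          # forward Euler steps
  H[i] <- H[i - 1] + dt * (a * H[i - 1] - b * H[i - 1] * L[i - 1])
  L[i] <- L[i - 1] + dt * (c * H[i - 1] * L[i - 1] - d * L[i - 1])
}

matplot(times, cbind(H, L), type = "l", lty = 1,
        xlab = "time", ylab = "abundance")  # cycles in hares and lynxes
```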

$$x'=\sigma(y-x)$$

$$y'=x(\rho-z)-y$$

$$z'=xy-\beta z$$
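The same Euler-step sketch works for the Lorenz system, using the classic parameter values $\sigma=10$, $\rho=28$, $\beta=8/3$ (the step size is kept small because the dynamics are chaotic):

```r
# Lorenz system: x' = sigma(y - x), y' = x(rho - z) - y, z' = xy - beta z
sigma <- 10; rho <- 28; beta <- 8 / 3
dt <- 0.001; n_steps <- 50000
x <- y <- z <- numeric(n_steps)
x[1] <- 1; y[1] <- 1; z[1] <- 1

for (i in 2:n_steps) {
  x[i] <- x[i - 1] + dt * sigma * (y[i - 1] - x[i - 1])
  y[i] <- y[i - 1] + dt * (x[i - 1] * (rho - z[i - 1]) - y[i - 1])
  z[i] <- z[i - 1] + dt * (x[i - 1] * y[i - 1] - beta * z[i - 1])
}

plot(x, z, type = "l", xlab = "x", ylab = "z")   # the butterfly attractor
```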

We can detect underlying structure

We can detect and predict without understanding

New claim: We can infer causal interactions

Very roughly, the system's attractor is **diffeomorphic to** (can be mapped without loss of information onto) the individual attractors of the state variables in some delay-embedding space.

Through their shadow manifolds, variables in the same dynamical system can predict each other.

If $X$ drives $Y$, increasing the number of observations of $Y$ should improve predictions of states of $X$.

To infer if $X$ drives $Y$:

- Construct the shadow manifold of $Y$, $\pmb{M}_Y$ (for some $E$, $\tau$). (Each point in $\pmb{M}_Y$ is given by $\vec{y}(t) = \{y_t,y_{t-\tau},y_{t-2\tau},...,y_{t-(E-1)\tau}\}$.)
- For each $X(t)$, identify its analogues $\vec{x}(t)$ and $\vec{y}(t)$.
- Find the $E+1$ nearest neighbors of $\vec{y}(t)$ and weight them by their Euclidean distances to $\vec{y}(t)$.
- To make a prediction $\hat{X}(t)$, multiply these weights by the corresponding points in $\pmb{M}_X$ (i.e., the values of $X$ at the neighbors' time indices). Let $\rho$ be the correlation between the observed $X(t)$ and the predictions $\hat{X}(t)$.
- First make predictions from $\pmb{M}_Y$ constructed with only a few points in the time series, $L_\text{min}$, and then with many, $L_\text{max}$.
- If $\rho$ increases with more information on $\pmb{M}_Y$, $X$ drives $Y$.
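Below is a compact sketch of this recipe in R. The test data are a pair of coupled logistic maps of my own choosing in which $X$ forces $Y$ but not vice versa; cross-map skill $\rho$ from $Y$'s shadow manifold to $X$ should grow as the library length increases:

```r
# Convergent cross mapping sketch: does X drive Y?
set.seed(4)
n <- 1000
X <- Y <- numeric(n); X[1] <- 0.4; Y[1] <- 0.2
for (t in 1:(n - 1)) {
  X[t + 1] <- X[t] * (3.8 - 3.8 * X[t])               # X evolves on its own
  Y[t + 1] <- Y[t] * (3.5 - 3.5 * Y[t] - 0.2 * X[t])  # X drives Y
}

E <- 2; tau <- 1   # embedding dimension and lag

# Cross-map skill: reconstruct X from the shadow manifold of Y built on the
# first `lib` observations, following the nearest-neighbor recipe above
ccm_rho <- function(X, Y, lib, E, tau) {
  idx <- seq((E - 1) * tau + 1, lib)                       # usable time indices
  MY  <- sapply(0:(E - 1), function(j) Y[idx - j * tau])   # shadow manifold of Y
  X_hat <- numeric(length(idx))
  for (i in seq_along(idx)) {
    d <- sqrt(rowSums((MY - matrix(MY[i, ], nrow(MY), E, byrow = TRUE))^2))
    d[i] <- Inf                                     # exclude the point itself
    nn <- order(d)[1:(E + 1)]                       # E + 1 nearest neighbors
    w  <- exp(-d[nn] / d[nn[1]]); w <- w / sum(w)   # distance-based weights
    X_hat[i] <- sum(w * X[idx[nn]])                 # cross-mapped estimate of X(t)
  }
  cor(X[idx], X_hat)
}

ccm_rho(X, Y, lib = 100, E, tau)   # short library: modest skill
ccm_rho(X, Y, lib = 800, E, tau)   # long library: higher skill => X drives Y
```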

Predict something else

Exploit natural and unnatural disturbances

Carried by 20-80% of young children

Transmitted mostly between healthy carriers

>90 serotypes

Some serotypes seem better at everything

For each value of serotype-specific immunity, $\sigma$

Fit the transmission rate to obtain 40% prevalence in kids

(Later, sensitivity analysis on fixed parameters)
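As a schematic of this calibration step, one can solve numerically for the transmission rate that yields the target carriage prevalence. Here a single-serotype SIS equilibrium (prevalence $= 1 - \gamma/\beta$) and an assumed clearance rate stand in for the full multi-serotype simulation; both are my assumptions, not the study's:

```r
# Calibrate the transmission rate beta so that equilibrium carriage prevalence
# in children hits the 40% target (SIS equilibrium: prevalence = 1 - gamma/beta)
gamma_clear <- 1 / 30     # assumed clearance rate (per day)
target_prev <- 0.40

prevalence <- function(beta) max(0, 1 - gamma_clear / beta)

fit <- uniroot(function(beta) prevalence(beta) - target_prev,
               interval = c(gamma_clear, 10 * gamma_clear))
fit$root   # calibrated transmission rate; repeat for each value of sigma
```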

Increase in serotype diversity with age

Stable rank order

Decrease in carriage duration with age

Frequency of co-colonizations

Epidemics of rarer serotypes