Pick a model, including parameter values
Evaluate how likely the observed data are, given the model
Tweak the model to make the observations more likely
Is this model superior to other models?
In maximum likelihood (ML) inference, you have some set of data $D$ and a model for generating these data. This model has parameters $\theta$. The probability of observing the data is $\mathrm{Pr}(D \, | \, \theta)$. The best point estimate $\hat{\theta}$ is simply the value that maximizes $\mathrm{Pr}(D \, | \, \theta)$.
For example, if we have data $D$ from a Bernoulli observation model representing $k$ successes in $n$ trials, then the probability of observing $k$ successes in $n$ trials given coin-flip probability parameter $p$ is simply $$\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}$$ (dropping the binomial coefficient, which does not depend on $p$).
For the Bernoulli model $\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}$, we have $\hat{p} = k/n$. For example, with $k=8$ and $n=10$, $\hat{p}=0.8$, and the likelihood curve follows
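As a quick sketch (mine, not the original figure), the likelihood curve and its maximum can be computed on a grid of $p$ values; the grid resolution is an arbitrary choice:

```python
import numpy as np

k, n = 8, 10  # observed successes and trials

# Likelihood Pr(k, n | p) = p^k (1 - p)^(n - k) on a grid of p values
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**k * (1 - p_grid)**(n - k)

# The grid maximum agrees with the analytical MLE p_hat = k / n
print(p_grid[np.argmax(likelihood)])  # ~0.8
```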
Observed trajectory $D=(y_{t_0},\ldots,y_{t_n})$ depends on unknown parameter(s) $\theta$
Probability density function of $D$ is $f_{\theta}$
$$L_D(\theta)=f_{\theta}(D)$$
Problem: These data aren't independent.
Solution: Factor the joint density into conditional and marginal
e.g., $f(y_3,y_2,y_1)=f(y_3 \, | \, y_2,y_1) \cdot f(y_2,y_1)$
$$f(y_3,y_2,y_1)=f(y_3 \, | \, y_2,y_1) \cdot f(y_2 \, | \, y_1) \cdot f(y_1)$$
$$L(\theta)=f_{\theta}(y_1)\prod_{t=2}^{T}f_{\theta}(y_t \,|\, l_{t-1})$$
where $l_{t-1}$ is information through $t-1$ (i.e., $y_{t-1},...,y_1$), and $T$ is the time series length
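As a hedged illustration of this factorization (a toy model of my choosing, not one from the lecture), the log-likelihood of a Gaussian AR(1) process splits into the marginal density of the first observation plus one-step conditionals; for a Markov model the conditioning information $l_{t-1}$ reduces to $y_{t-1}$:

```python
import numpy as np
from scipy.stats import norm

def ar1_loglik(y, phi, sigma):
    """Log-likelihood of a stationary Gaussian AR(1): y_t = phi * y_{t-1} + eps_t."""
    # Marginal density of the first observation, f(y_1)
    ll = norm.logpdf(y[0], loc=0.0, scale=sigma / np.sqrt(1 - phi**2))
    # One-step conditionals f(y_t | l_{t-1}); for this Markov model, f(y_t | y_{t-1})
    ll += np.sum(norm.logpdf(y[1:], loc=phi * y[:-1], scale=sigma))
    return ll

# Simulate a series at known parameters and evaluate its log-likelihood
rng = np.random.default_rng(1)
y = np.zeros(200)
y[0] = rng.normal(scale=1.0 / np.sqrt(1 - 0.7**2))
for t in range(1, 200):
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=1.0)
print(ar1_loglik(y, phi=0.7, sigma=1.0))
```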
Case counts at different times
Sequences
Titers
or some composite of observations
Maximizing the likelihood means maximizing the log-likelihood
or minimizing the negative log-likelihood
Can be analytically tractable
For our models, it's not
Brute force
Derivative-based methods
Simplex (Nelder–Mead)
Simulated annealing
Sequential Monte Carlo
Many others... but few tried and true
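A minimal sketch of one such approach, derivative-free Nelder–Mead via scipy, applied to the Bernoulli negative log-likelihood from earlier (the starting value of 0.5 is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

k, n = 8, 10

def neg_loglik(params):
    p = params[0]
    if not 0 < p < 1:          # keep the optimizer inside the valid range
        return np.inf
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

fit = minimize(neg_loglik, x0=[0.5], method="Nelder-Mead")
print(fit.x)                   # ~0.8, matching the analytical MLE k / n
```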
POMP (partially observed Markov process)
pMCMC (particle Markov chain Monte Carlo)
TSIR, if conditions met
Hold parameter(s) constant, fit the rest
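One common reading of this is profile likelihood: fix one parameter on a grid and maximize over the rest. A sketch for a normal model with unknown mean and standard deviation (an example of my own, not from the lecture):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=50)

def profile_loglik(mu):
    """Profile log-likelihood for mu: hold mu fixed, maximize over sigma."""
    res = minimize_scalar(lambda log_sigma: -np.sum(norm.logpdf(data, mu, np.exp(log_sigma))),
                          bounds=(-3, 3), method="bounded")
    return -res.fun

mu_grid = np.linspace(1.0, 3.0, 41)
profile = [profile_loglik(m) for m in mu_grid]
print(mu_grid[np.argmax(profile)])   # close to the sample mean
```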
Generally, it's difficult to make direct probability statements about models or parameters using frequentist statistics. You cannot say that model 1 is twice as likely as model 2. $p$-values are misused in exactly this fashion all the time.
Bayes' rule forms the basis of Bayesian inference. It states: $$ \mathrm{Pr}(A \, | \, B) = \cfrac{ \mathrm{Pr}(B \, | \, A) \, \mathrm{Pr}(A) }{ \mathrm{Pr}(B) } $$
Bayesian inference applies Bayes' rule in a likelihood context, so that $$ \mathrm{Pr}(\theta \, | \, D) = \cfrac{ \mathrm{Pr}(D \, | \, \theta) \, \mathrm{Pr}(\theta) }{ \mathrm{Pr}(D) }, $$ where $D$ is data and $\theta$ are parameters. $\mathrm{Pr}(D)$ is constant with respect to $\theta$, so that $ \mathrm{Pr}(\theta \, | \, D) \propto \mathrm{Pr}(D \, | \, \theta) \, \mathrm{Pr}(\theta)$. This relationship is often referred to as $ \mathrm{posterior} \propto \mathrm{likelihood} \times \mathrm{prior}$.
Following our previous Bernoulli example, we've observed $k$ successes in $n$ trials, and so the likelihood $\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}$. We'll assume a flat prior $\mathrm{Pr}(p) = 1$. In this case, the marginal likelihood follows $$\mathrm{Pr}(k,n) = \int_0^1 \mathrm{Pr}(k,n \, | \, p) \, \mathrm{Pr}(p) \, dp = \cfrac{k! \, (n-k)!}{(n+1)!}.$$ And the full posterior follows $$\mathrm{Pr}(p \, | \, k,n) = \cfrac{(n+1)! \, p^k \, (1-p)^{n-k}}{k! \, (n-k)!}.$$
If $k=8$ and $n=10$, the posterior mean $\mathrm{E}[p] = 0.75$, the 95% credible interval extends from $0.482$ to $0.940$, and the posterior distribution follows
Compare with maximum likelihood: for the Bernoulli model $\mathrm{Pr}(k,n \, | \, p) = p^k \, (1-p)^{n-k}$, the ML estimate is $\hat{p} = k/n = 0.8$, slightly higher than the posterior mean of $0.75$.
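Both sets of numbers are easy to check numerically; with the flat prior the posterior is Beta$(k+1, n-k+1)$ (a minimal sketch using scipy):

```python
from scipy.stats import beta

k, n = 8, 10
posterior = beta(k + 1, n - k + 1)     # Beta(9, 3) under a flat prior

print(posterior.mean())                # posterior mean, 0.75
print(posterior.ppf([0.025, 0.975]))   # equal-tailed 95% credible interval
print(k / n)                           # ML estimate for comparison, 0.8
```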
Markov Chain Monte Carlo
Metropolis-Hastings MCMC
Particle MCMC
Hybrid/Hamiltonian Monte Carlo
Many others
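A minimal Metropolis–Hastings sketch targeting the Bernoulli posterior above; the random-walk step size of 0.1, the chain length, and the burn-in are arbitrary illustrative choices:

```python
import numpy as np

k, n = 8, 10

def log_posterior(p):
    if not 0 < p < 1:                  # flat prior on (0, 1)
        return -np.inf
    return k * np.log(p) + (n - k) * np.log(1 - p)   # log-likelihood, up to a constant

rng = np.random.default_rng(2)
p_current, samples = 0.5, []
for _ in range(20000):
    p_proposed = p_current + rng.normal(scale=0.1)   # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_posterior(p_proposed) - log_posterior(p_current):
        p_current = p_proposed                       # Metropolis acceptance step
    samples.append(p_current)

draws = np.array(samples[2000:])                     # discard burn-in
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```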
e.g., R-package pomp
Fit models to time series or (new) longitudinal data
$\lambda_{i,j,t}=\lambda_{0_j}f(\overrightarrow{\theta_j} \boldsymbol{X_{it}}) + \text{I(prev. inf.)}d_{jc_i}e^{-w_j(t-t_\mathrm{clr})}$
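A hedged, line-for-line translation of that hazard into code, purely to make the terms concrete; the link function $f$ is assumed here to be $\exp$, and the argument names simply mirror the symbols above:

```python
import numpy as np

def hazard(lambda0_j, theta_j, x_it, prev_inf, d_jci, w_j, t, t_clr):
    """Force of infection lambda_{i,j,t} following the formula above.
    The exponential link f(.) = exp(.) is an assumption for illustration."""
    baseline = lambda0_j * np.exp(np.dot(theta_j, x_it))                   # lambda_0j * f(theta_j . X_it)
    prior_infection_term = prev_inf * d_jci * np.exp(-w_j * (t - t_clr))   # I(prev. inf.) * d_jci * exp(-w_j (t - t_clr))
    return baseline + prior_infection_term
```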
Approximate Bayesian Computation
Time-series probes
Many parameters
Multiple minima
Noise, nonstationarity, etc.
Assume we have time series of cases of each.
Constrain/fix some parameters, fit (vary) $\alpha$, $a$, maybe $\rho$, etc., to find the parameter combinations that maximize the likelihood of the data, e.g., via multiple iterated filtering with a particle filter (sequential Monte Carlo); a minimal particle filter is sketched below
Confirm convergence
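A minimal bootstrap particle filter (sequential Monte Carlo) sketch, estimating the log-likelihood of a case time series under a toy chain-binomial SIR with Poisson reporting. The model, the particle count `J`, and the reporting probability `rho` are all my own illustrative choices, not the lecture's model; iterated filtering would wrap a parameter-perturbation loop around a filter like this:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)

def simulate_step(S, I, beta, gamma, N, dt=1.0):
    """One stochastic (chain-binomial) step of an SIR-type process, vectorized over particles."""
    p_inf = 1 - np.exp(-beta * I / N * dt)
    p_rec = 1 - np.exp(-gamma * dt)
    new_inf = rng.binomial(S, p_inf)
    new_rec = rng.binomial(I, p_rec)
    return S - new_inf, I + new_inf - new_rec, new_inf

def particle_loglik(cases, beta, gamma, N=10_000, I0=10, J=1000, rho=0.5):
    """Bootstrap particle filter estimate of the log-likelihood of a case time series."""
    S = np.full(J, N - I0)
    I = np.full(J, I0)
    loglik = 0.0
    for y in cases:
        S, I, new_inf = simulate_step(S, I, beta, gamma, N)
        w = poisson.pmf(y, rho * new_inf + 1e-10)   # observation model: cases ~ Poisson(rho * new infections)
        if w.sum() == 0:                            # all particles inconsistent with the data
            return -np.inf
        loglik += np.log(w.mean())
        idx = rng.choice(J, size=J, p=w / w.sum())  # resample proportional to weights
        S, I = S[idx], I[idx]
    return loglik

# Simulate a short outbreak at known parameters, then evaluate the likelihood at a few beta values
true_beta, true_gamma, report = 0.6, 0.25, 0.5
S, I = np.array([9990]), np.array([10])
cases = []
for _ in range(30):
    S, I, new_inf = simulate_step(S, I, true_beta, true_gamma, 10_000)
    cases.append(rng.poisson(report * new_inf[0]))

for b in (0.4, 0.6, 0.8):
    print(b, particle_loglik(cases, beta=b, gamma=true_gamma))  # typically peaks near the true beta
```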
AIC and WAIC
Leave-one-out cross-validation (LOO)
Out-of-sample prediction
Replicate on simulated data
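For example, AIC is immediate once each candidate model's maximized log-likelihood is known (lower is better); the numbers below are purely hypothetical:

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik

# Hypothetical maximized log-likelihoods for two competing models
print(aic(loglik=-312.4, n_params=4))   # simpler model
print(aic(loglik=-310.9, n_params=7))   # better fit, but penalized for extra parameters
```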
$$H'=aH-bHL$$
$$L'=cHL-dL$$
$H$ hares, $L$ lynxes
hare birth rate $a$, predation rate $b$,
consumption rate $c$, death rate $d$
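A sketch integrating these equations with scipy; the parameter values and initial conditions are arbitrary choices for illustration:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, y, a, b, c, d):
    H, L = y
    return [a * H - b * H * L,      # H' = aH - bHL
            c * H * L - d * L]      # L' = cHL - dL

sol = solve_ivp(lotka_volterra, t_span=(0, 100), y0=[10.0, 5.0],
                args=(1.0, 0.1, 0.02, 0.4), dense_output=True)
t = np.linspace(0, 100, 500)
hares, lynxes = sol.sol(t)          # oscillating predator-prey trajectories
```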
$$x'=\sigma(y-x)$$
$$y'=x(\rho-z)-y$$
$$z'=xy-\beta z$$
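The Lorenz system can be integrated the same way; the second half of this sketch delay-embeds the single coordinate $x$, which (roughly, by Takens' theorem) reconstructs a shadow of the full attractor. The lag of 8 steps and embedding dimension 3 are arbitrary illustrative choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.linspace(0, 50, 5001)
sol = solve_ivp(lorenz, (0, 50), y0=[1.0, 1.0, 1.0], t_eval=t)
x = sol.y[0]

# Delay embedding of x alone: points (x_t, x_{t-tau}, x_{t-2*tau})
tau = 8
shadow = np.column_stack([x[2 * tau:], x[tau:-tau], x[:-2 * tau]])
```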
We can detect underlying structure
We can detect and predict without understanding
New claim: We can infer causal interactions
Very roughly, the system's attractor is diffeomorphic to (can be mapped without loss of information onto) the individual attractors of the state variables in some delay-embedding space.
Through their shadow manifolds, variables in the same dynamical system can predict each other.
If $X$ drives $Y$, increasing the number of observations of $Y$ should improve predictions of states of $X$.
To infer whether $X$ drives $Y$: reconstruct the shadow manifold of $Y$, use it to predict contemporaneous states of $X$, and check that prediction skill improves (converges) as more observations of $Y$ are included.
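A compact sketch of that logic (a miniature convergent cross mapping, my own toy implementation rather than the lecture's code): two coupled logistic maps in which $X$ drives $Y$ but not vice versa; we embed $Y$, predict $X$ from nearest neighbors on $Y$'s shadow manifold, and check that skill grows with library length:

```python
import numpy as np

# Coupled logistic maps: X drives Y (coupling 0.3); Y does not feed back on X
T = 1000
X, Y = np.empty(T), np.empty(T)
X[0], Y[0] = 0.4, 0.2
for t in range(T - 1):
    X[t + 1] = X[t] * (3.8 - 3.8 * X[t])
    Y[t + 1] = Y[t] * (3.5 - 3.5 * Y[t] - 0.3 * X[t])

def cross_map_skill(driver, response, lib_len, E=2, tau=1):
    """Embed `response`, predict `driver` from neighbors on its shadow manifold,
    and return the correlation between predictions and truth."""
    emb = np.column_stack([response[(E - 1 - i) * tau: len(response) - i * tau]
                           for i in range(E)])          # shadow manifold of the response
    target = driver[(E - 1) * tau:]
    emb, target = emb[:lib_len], target[:lib_len]
    preds = np.empty(lib_len)
    for i in range(lib_len):
        dist = np.linalg.norm(emb - emb[i], axis=1)
        dist[i] = np.inf                                 # exclude the point itself
        nn = np.argsort(dist)[:E + 1]                    # E+1 nearest neighbors
        w = np.exp(-dist[nn] / max(dist[nn[0]], 1e-12))
        preds[i] = np.sum(w * target[nn]) / w.sum()
    return np.corrcoef(preds, target)[0, 1]

for L in (50, 200, 800):
    print(L, cross_map_skill(X, Y, L))                   # skill should rise with library length
```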
Predict something else
Exploit natural and unnatural disturbances
Carried by 20-80% of young children
Transmitted mostly between healthy carriers
>90 serotypes
Some serotypes seem better at everything
For each value of serotype-specific immunity $\sigma$:
Fit the transmission rate to obtain 40% carriage prevalence in kids (see the sketch below)
(Later, sensitivity analysis on fixed parameters)
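A hedged sketch of this calibration step: `child_prevalence` below is a stand-in (a simple SIS-style equilibrium, not the lecture's simulation model), used only to show how the transmission rate can be solved to hit the 40% target at each value of $\sigma$ with a root finder:

```python
from scipy.optimize import brentq

def child_prevalence(beta, sigma, clearance=1.0 / 30):
    """Placeholder equilibrium carriage prevalence for an SIS-like model in which
    serotype-specific immunity sigma discounts the transmission rate.
    This is NOT the actual simulation model; it only stands in for one."""
    effective_beta = beta * (1 - sigma)
    return max(0.0, 1 - clearance / effective_beta) if effective_beta > 0 else 0.0

target = 0.40
for sigma in (0.0, 0.2, 0.4):
    beta_fit = brentq(lambda b: child_prevalence(b, sigma) - target, 1e-6, 10.0)
    print(sigma, beta_fit)   # fitted transmission rate rises as immunity sigma increases
```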
Increase in serotype diversity with age
Stable rank order
Decrease in carriage duration with age
Frequency of co-colonizations
Epidemics of rarer serotypes