**Predictive Densities / Distributions**

This gives a distribution over future $x$ values, as opposed to the classification functions of predictive modeling.

<latex>\begin{align*}
p(x|D) &= \int p(x,\theta|D) \, d\theta & \text{by LTP} \\
&= \int p(x|\theta,D) \, p(\theta|D) \, d\theta \\
&= \int p(x|\theta) \, p(\theta|D) \, d\theta & \text{by IID assumption} \\
&= \int (\text{density for $x$ given a particular $\theta$}) \cdot (\text{posterior for $\theta$}) \, d\theta
\end{align*}</latex>

$x \sim \mathcal{N}(\mu,\sigma^2)$ , $p(\mu) = \mathcal{N}(\mu_0,s^2)$

Saw in the gaussian_parameter_example that $p(\mu|D) = \mathcal{N}(\mu | \mu_N , \sigma_N^2)$

<latex>\begin{align*}
p(x|D) &= \int \mathcal{N}(x ; \mu,\sigma^2) \, \mathcal{N}(\mu ; \mu_N,\sigma_N^2) \, d\mu \\
&= \int (\text{density for $x$ given a particular $\mu$}) \cdot (\text{posterior for $\mu$}) \, d\mu \\
&= \mathcal{N}(x ; \mu_N , \sigma_N^2 + \sigma^2) & \text{(after some manipulation)} \\
& & \mu_N \text{\ is the posterior mean,} \\
& & \sigma_N^2 \text{\ is the uncertainty about the mean $\mu$,} \\
& & \sigma^2 \text{\ is the known variance of the data}
\end{align*}</latex>
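The closed form above can be sanity-checked numerically: draw $\mu$ from the posterior, then $x$ given that $\mu$, and the resulting sample should have mean $\mu_N$ and variance $\sigma_N^2 + \sigma^2$. A minimal sketch, with hypothetical data and prior values chosen for illustration:

```python
import math
import random

random.seed(0)

# Illustrative values: known data variance and Gaussian prior on the mean.
sigma2 = 2.0          # sigma^2: known variance of the data
mu0, s2 = 0.0, 1.0    # prior: mu ~ N(mu0, s^2)

# Hypothetical observed data D.
D = [1.2, 0.7, 1.9, 1.1, 0.8]
N = len(D)
xbar = sum(D) / N

# Standard conjugate update: posterior p(mu|D) = N(mu_N, sigma_N^2).
sigma_N2 = 1.0 / (1.0 / s2 + N / sigma2)
mu_N = sigma_N2 * (mu0 / s2 + N * xbar / sigma2)

# Monte Carlo draw from the predictive: mu ~ p(mu|D), then x ~ N(mu, sigma^2).
S = 200_000
draws = []
for _ in range(S):
    mu = random.gauss(mu_N, math.sqrt(sigma_N2))
    draws.append(random.gauss(mu, math.sqrt(sigma2)))

mc_mean = sum(draws) / S
mc_var = sum((x - mc_mean) ** 2 for x in draws) / S

# These should be close to the closed-form N(mu_N, sigma_N^2 + sigma^2).
print(mc_mean, mu_N)
print(mc_var, sigma_N2 + sigma2)
```

The extra $\sigma_N^2$ term in the predictive variance is exactly the uncertainty about $\mu$ being propagated into predictions, which a plug-in estimate $\mathcal{N}(x;\hat\mu,\sigma^2)$ would ignore.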

Consider $K$ possible models $M_1, \dots , M_K$ each with parameters $\theta_1, \dots , \theta_K$.

Because the models have different numbers of parameters, we can't use the training-data likelihood $P(D_{train}|\hat \theta_k^{ML})$ to choose among them: more complex models have more degrees of freedom and can fit the training data better.

<latex>\begin{align*}
p(M_k|D_{train}) &\propto p(D_{train}|M_k) \, p(M_k) & \text{can assume\ } p(M_k)=\frac{1}{K} \\
p(D_{train}|M_k) &= \int p(D_{train}|\theta,M_k) \, p(\theta|M_k) \, d\theta \\
&= \text{“marginal likelihood”}
\end{align*}</latex>
The marginal likelihood often cannot be computed in closed form, and the approximations it requires can be unreliable, so it can be difficult to use in practice.
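To see both the integral and why naive approximations can be shaky, here is a sketch of the simplest Monte Carlo estimator, $p(D|M) \approx \frac{1}{S}\sum_s p(D|\theta^{(s)})$ with $\theta^{(s)}$ drawn from the prior, for the conjugate Gaussian model above (data and prior values are hypothetical). In this conjugate case the exact answer is also available, via Bayes' rule rearranged: $p(D) = p(D|\mu)\,p(\mu)/p(\mu|D)$ at any $\mu$.

```python
import math
import random

random.seed(1)

# Hypothetical data; Gaussian model with known variance sigma2 and
# Gaussian prior N(mu0, s2) on the mean.
D = [1.2, 0.7, 1.9, 1.1, 0.8]
N = len(D)
sigma2 = 2.0
mu0, s2 = 0.0, 1.0

def log_gauss(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_lik(D, mu):
    return sum(log_gauss(x, mu, sigma2) for x in D)

# Naive Monte Carlo: p(D|M) = E_{theta ~ prior}[ p(D|theta) ].
S = 100_000
vals = [math.exp(log_lik(D, random.gauss(mu0, math.sqrt(s2))))
        for _ in range(S)]
log_ml_mc = math.log(sum(vals) / S)

# Exact value via p(D) = p(D|mu) p(mu) / p(mu|D), evaluated at mu = mu_N.
sigma_N2 = 1.0 / (1.0 / s2 + N / sigma2)
mu_N = sigma_N2 * (mu0 / s2 + sum(D) / sigma2)
log_ml_exact = (log_lik(D, mu_N) + log_gauss(mu_N, mu0, s2)
                - log_gauss(mu_N, mu_N, sigma_N2))

print(log_ml_mc, log_ml_exact)  # close here, but the MC estimate is noisy
```

This estimator works here because the prior overlaps the likelihood well; with a vague prior or higher-dimensional $\theta$, almost all prior samples get negligible likelihood and the estimate becomes unreliable, which is the practical difficulty noted above.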

e.g. **Bayesian Information Criterion** (**BIC**): we want to maximize the log-likelihood of the training data given the model with a penalty term.

Objective function: $$\log p(D_{train}|\hat \theta_k^{ML}) - \frac{P_k}{2} \log N $$ where $P_k$ is the number of free parameters of model $M_k$ and $N$ is the number of training points.

This is a good approximation to the marginal likelihood for certain types of models, but in general it is only a heuristic.
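A small sketch of BIC in action, comparing two nested Gaussian models on hypothetical data (the data values and model pair are chosen only for illustration): the richer model never loses on raw training likelihood, but the $\frac{P_k}{2}\log N$ penalty charges it for the extra parameter.

```python
import math

# Hypothetical training data.
D = [1.2, 0.7, 1.9, 1.1, 0.8, 1.5, 0.9]
N = len(D)

def gauss_loglik(xs, mu, sigma2):
    c = -0.5 * math.log(2 * math.pi * sigma2)
    return sum(c - (x - mu) ** 2 / (2 * sigma2) for x in xs)

# Model 1: mean fixed at 0, variance fit by ML  (P_1 = 1 parameter).
s2_m1 = sum(x * x for x in D) / N
ll_m1 = gauss_loglik(D, 0.0, s2_m1)
bic_m1 = ll_m1 - (1 / 2) * math.log(N)

# Model 2: mean and variance both fit by ML  (P_2 = 2 parameters).
mu_m2 = sum(D) / N
s2_m2 = sum((x - mu_m2) ** 2 for x in D) / N
ll_m2 = gauss_loglik(D, mu_m2, s2_m2)
bic_m2 = ll_m2 - (2 / 2) * math.log(N)

# ll_m2 >= ll_m1 always, since model 1 is nested in model 2;
# the BIC scores may or may not preserve that ordering.
print(ll_m1, ll_m2)
print(bic_m1, bic_m2)
```

With this particular data the free-mean model wins even after the penalty; with data centered near 0, the penalty would instead favor the simpler model.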

Alternatively, use a held-out test set: select the model that maximizes $\log P(D_{test}|\hat \theta_k,M_k)$ over $k$. Here $\hat \theta_k$ can be $\hat \theta^{ML}$, $\hat \theta^{MAP}$, etc.

Can also use an average over the parameters $\theta$ with respect to $P(\theta|D_{train})$, known as **Bayesian predictive density**. This has the form $\log P(D_{test}|D_{train},M_k)$.

“How well are we predicting new data?” This is widely used in text prediction. It avoids the issues above with computing the marginal likelihood or relying on heuristic assumptions; the main issue is having enough data to afford a held-out set.

This is not a “fully Bayesian” technique because we aren't computing a posterior or marginal likelihood, and we aren't learning from the test data.
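The held-out selection procedure can be sketched in a few lines, reusing the same hypothetical Gaussian model pair as before (the data split and values are made up for illustration): fit each model's parameters by ML on $D_{train}$ only, then score on $D_{test}$.

```python
import math

def gauss_loglik(xs, mu, sigma2):
    c = -0.5 * math.log(2 * math.pi * sigma2)
    return sum(c - (x - mu) ** 2 / (2 * sigma2) for x in xs)

# Hypothetical split of the data.
D_train = [1.2, 0.7, 1.9, 1.1, 0.8]
D_test = [1.0, 1.4, 0.6]

# ML fits on the training data only.
# M1: zero mean, ML variance.  M2: ML mean and ML variance.
n = len(D_train)
s2_1 = sum(x * x for x in D_train) / n
mu_2 = sum(D_train) / n
s2_2 = sum((x - mu_2) ** 2 for x in D_train) / n

# Score each model by log P(D_test | theta_hat_k, M_k) and pick the best.
scores = {
    "M1": gauss_loglik(D_test, 0.0, s2_1),
    "M2": gauss_loglik(D_test, mu_2, s2_2),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Because the test points never touch the fitting step, extra parameters only help a model here if they actually improve prediction, so no explicit complexity penalty is needed.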