When the joint distribution can't be sampled from directly, but the full conditionals can, Gibbs sampling is an MCMC algorithm that generates a sequence of samples from the joint probability distribution of two or more random variables.
Each variable is resampled in turn from its conditional given the current values of all the others: $x_i \sim p(x_i \mid x_{\neg i})$.
Samples are dependent and form a Markov chain with stationary distribution $P(x \mid \text{evidence})$.
The process can be understood as a random walk in the space of all joint instantiations of the variables.
Converges iff the chain is ergodic (irreducible and aperiodic). These conditions are satisfied when all conditional probabilities are strictly positive.
After an initial burn-in period, the samples $x', y', z'$ will be samples from the true joint distribution $P(x, y, z \mid \text{data})$, providing empirical estimates.
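A minimal sketch of the algorithm, assuming a standard bivariate normal target with correlation $\rho$ (an illustrative choice, not from the notes), whose full conditionals are the univariate normals $x \mid y \sim N(\rho y, 1 - \rho^2)$ and $y \mid x \sim N(\rho x, 1 - \rho^2)$:

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Both full conditionals are univariate normals, so each Gibbs step
    is a single draw from rng.gauss.
    """
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0                  # arbitrary starting point
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # x ~ p(x | y)
        y = rng.gauss(rho * x, sd)   # y ~ p(y | x)
        if t >= burn_in:             # discard burn-in draws
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
mean_x = sum(s[0] for s in samples) / len(samples)            # ≈ 0
corr = sum(s[0] * s[1] for s in samples) / len(samples)       # E[xy] ≈ rho
```

Even though consecutive samples are correlated, the empirical moments converge to those of the target joint.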
Reducing the variance of the estimator can speed up convergence: for an unbiased estimator, lower variance means lower error, so fewer samples are needed to reach a given accuracy.
Rao-Blackwellisation, replacing a sampled quantity with its conditional expectation, never increases (and typically reduces) the variance of the estimator.
Blocking: sampling groups of correlated variables jointly, rather than one at a time, can also reduce variance and improve mixing.
Want to estimate $P(x_i \mid e)$. Counting method: count the fraction of samples in which $X_i = x_i$.
Averaging method: $\hat{P}(X_i = x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(X_i = x_i \mid \text{MB}_i^{(t)})$, where $\text{MB}_i^{(t)}$ is the state of the Markov blanket of $X_i$ at sweep $t$.
The averaging method is better: it is a Rao-Blackwellised estimator with lower variance, and it assigns nonzero probability to any value the Markov network assigns nonzero probability, even if that value was never sampled.
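A tiny sketch contrasting the two estimators on a hypothetical joint over two binary variables (the table is made up for illustration). Both estimates converge to $P(A = 1) = 0.5$, but the averaging estimate uses the exact conditional at every sweep rather than a 0/1 indicator:

```python
import random

# Hypothetical joint over binary A, B; keys are (a, b).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_a1_given_b(b):
    """P(A=1 | B=b), read off the joint table."""
    return joint[(1, b)] / (joint[(0, b)] + joint[(1, b)])

def p_b1_given_a(a):
    """P(B=1 | A=a)."""
    return joint[(a, 1)] / (joint[(a, 0)] + joint[(a, 1)])

rng = random.Random(1)
a, b = 0, 0
count_est, avg_est, T = 0.0, 0.0, 50000
for t in range(T):
    a = int(rng.random() < p_a1_given_b(b))  # Gibbs step for A
    b = int(rng.random() < p_b1_given_a(a))  # Gibbs step for B
    count_est += a                 # counting: indicator of A=1
    avg_est += p_a1_given_b(b)     # averaging: exact conditional at sweep t

count_est /= T   # both estimate P(A=1) = 0.1 + 0.4 = 0.5
avg_est /= T     # the averaged (Rao-Blackwellised) estimate has lower variance
```

Here $B$ is the entire Markov blanket of $A$, so the averaged term is exactly $P(A = 1 \mid \text{MB}_A^{(t)})$.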
Collapsed Gibbs sampling: sample a subset of the variables and integrate out the rest analytically. In a probability model, the parameters can be treated as variables to integrate out.
When processing a data point, compute $p(x_i \mid x_C)$ directly, where $x_C = \{x_j \mid x_j \in C,\ x_j \neq x_i\}$ is the set of points assigned to cluster $C$, excluding $x_i$.
Do this by integrating over the parameters of the base distribution.
$\displaystyle p(x_i \mid x_{-i}) = \int_{\mu} \int_{\Sigma} p(x_i \mid \mu, \Sigma)\, p(\mu, \Sigma \mid x_{-i})\, d\Sigma\, d\mu$
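A minimal one-dimensional sketch of this kind of integral (assumed setup, not from the notes): a Gaussian likelihood $x \sim N(\mu, 1)$ with known unit variance and a conjugate prior $\mu \sim N(0, \tau_0^2)$. Integrating $\mu$ out of $p(x_i \mid \mu)\, p(\mu \mid x_{-i})$ gives the closed-form predictive $x_i \mid x_{-i} \sim N(\mu_n,\ 1 + \tau_n^2)$, which a Monte Carlo average over posterior draws of $\mu$ reproduces:

```python
import random

def predictive_params(x_rest, tau0=10.0):
    """Closed-form predictive for x_i | x_rest under the conjugate model.

    Posterior: mu | x_rest ~ N(mu_n, tau_n^2) with precision 1/tau0^2 + n.
    Predictive: x_i | x_rest ~ N(mu_n, 1 + tau_n^2).
    """
    n = len(x_rest)
    tau_n2 = 1.0 / (1.0 / tau0 ** 2 + n)   # posterior variance of mu
    mu_n = tau_n2 * sum(x_rest)            # posterior mean of mu
    return mu_n, 1.0 + tau_n2              # predictive mean, variance

# Monte Carlo check of the integral: draw mu from its posterior,
# then x_i | mu; the draws follow the closed-form predictive.
rng = random.Random(0)
x_rest = [1.2, 0.7, 1.9, 1.1]              # hypothetical other data points
mu_n, pred_var = predictive_params(x_rest)
draws = []
for _ in range(100000):
    mu = rng.gauss(mu_n, (pred_var - 1.0) ** 0.5)  # mu ~ p(mu | x_rest)
    draws.append(rng.gauss(mu, 1.0))               # x_i ~ p(x_i | mu)
mc_mean = sum(draws) / len(draws)                  # ≈ mu_n
```

In the notes' Gaussian case both $\mu$ and $\Sigma$ are integrated out, which with a conjugate normal-inverse-Wishart prior likewise yields a closed-form (Student-t) predictive.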