# Michael's Wiki

When a joint distribution can't be sampled from directly, but each variable's conditional distribution can, Gibbs sampling is an MCMC algorithm that generates a sequence of samples from the joint probability distribution of two or more random variables.

1. Sample new variable values one at a time, each from that variable's conditional distribution given the current values of all the others:

$x_i \sim p(x_i \mid x_{\neg i})$

Samples are dependent and form a Markov chain with stationary distribution $P(x \mid \text{evidence})$.

The process can be understood as a random walk in the space of all joint instantiations of the variables.

Converges iff the chain is ergodic (irreducible and aperiodic). These conditions are satisfied when all conditional probabilities are positive.

##### Example
• For three variables x, y, z,
• initialize x', y', z' (e.g. randomly)
• iterate:
• sample new x' ~ p(x|y',z',data)
• sample new y' ~ p(y|x',z',data)
• sample new z' ~ p(z|x',y',data)
• continue for a large number of iterations
• each iteration is a sweep through the hidden variables or parameters (here x, y, z)

After an initial burn-in period, the samples x',y',z' will be samples from the true joint distribution $P(x,y,z|data)$, providing empirical estimates.
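The sweep above can be sketched on the classic bivariate-normal example, where both full conditionals are themselves normal (the correlation `rho`, the zero initialization, and the iteration counts are illustrative choices, not from the source):

```python
import numpy as np

# Minimal Gibbs-sampling sketch for a standard bivariate normal with
# correlation rho. Its full conditionals are known in closed form:
#   x | y ~ N(rho * y, 1 - rho^2),   y | x ~ N(rho * x, 1 - rho^2)
rng = np.random.default_rng(0)
rho = 0.8
n_iters, burn_in = 20000, 1000

x, y = 0.0, 0.0  # arbitrary initialization
samples = []
for t in range(n_iters):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))  # sample x | y'
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))  # sample y | x'
    if t >= burn_in:                              # discard burn-in sweeps
        samples.append((x, y))

samples = np.array(samples)
print(samples.mean(axis=0))           # should be near [0, 0]
print(np.corrcoef(samples.T)[0, 1])   # should be near rho
```

After the burn-in period, the retained pairs behave like (dependent) draws from the joint, so their empirical mean and correlation recover the true parameters.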

##### Hacks

Reducing the variance of the estimates can speed up convergence: by the bias-variance tradeoff, lower variance means lower error for a fixed number of samples, so fewer sweeps are needed to reach a given accuracy.

Blocking: sample groups of highly correlated variables jointly rather than one at a time.

Want to know $P(x_i|e)$. Counting method: estimate by the fraction of samples in which $X_i = x_i$.

Averaging method: $P(X_i = x_i) \approx \frac{1}{T} \sum_{t=1}^{T} P(X_i = x_i \mid \text{Markov}_i^{t})$, where $\text{Markov}_i^{t}$ is the state of $X_i$'s Markov blanket at sweep $t$.

The averaging method is better because it assigns nonzero probability to every value the Markov network assigns nonzero probability, even values that were never actually sampled.

##### Collapsed Gibbs Sampling

Sample a subset of the variables, and integrate out the rest. In a probability model, the parameters can be treated as variables to integrate out.

When processing a data point, compute $p(x_i \mid x_C)$ directly, where $x_C = \{x_j : j \in C,\ j \neq i\}$.

Do this by integrating over the parameters of the base distribution.
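A minimal sketch of that integration in one conjugate case: a Normal likelihood with known variance and a Normal prior on the mean, where the parameter can be integrated out in closed form (the hyperparameter names `m0`, `s0` and the example data are illustrative assumptions):

```python
import numpy as np

# Conjugate case: x_j ~ N(mu, sigma2) with sigma2 known, prior mu ~ N(m0, s0).
# Integrating mu out gives a closed-form predictive p(x_i | x_rest).

def predictive(x_rest, sigma2=1.0, m0=0.0, s0=10.0):
    n = len(x_rest)
    # Posterior over mu given the other points (standard conjugate update).
    post_var = 1.0 / (1.0 / s0 + n / sigma2)
    post_mean = post_var * (m0 / s0 + np.sum(x_rest) / sigma2)
    # Predictive: the likelihood averaged over the posterior on mu,
    # which is Normal(post_mean, post_var + sigma2).
    return post_mean, post_var + sigma2

mean, var = predictive(np.array([1.8, 2.1, 2.3, 1.9]))
print(mean, var)
```

The predictive variance is the data noise `sigma2` plus the remaining posterior uncertainty about `mu`, which shrinks as more of the other points are observed.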

##### MOG Example

$\displaystyle p(x_i|x_{-i}) = \int_{\mu} \int_{\Sigma} p(x_i|\mu , \Sigma)\, p(\mu , \Sigma | x_{-i})\, d\Sigma\, d\mu$