Directed Graphical Models
A directed graphical model, or Bayesian network, is a directed acyclic graph: a type of dependency graph in which nodes represent random variables (each with an associated probability distribution) and edges encode the conditional dependence of a child on its parents. Root nodes carry unconditional (prior) probabilities over their variable.
“Directed graphical models” were invented in genetics in the 1920s, then revived in the 1970s as “Bayesian networks.” They are a systematic way to represent, and compute with, sets of random variables and their associated conditional independence assumptions.
A graphical model with no independence assumptions is called saturated.
A node has converging arrows when it is dependent on multiple other variables. A node has diverging arrows when multiple other variables depend on it.
Can be represented as a Dependency Graph in which nodes represent propositional variables and arcs represent dependencies among related propositions.
Can be transformed into a Tree Decomposition via the Fill-In method.
Can express non-chordal dependencies through the use of auxiliary variables.
This is an example of the conditional dependence of converging nodes which can be unintuitive. A bell rings only if two coins flip to the same value. The coins are independent of each other, but dependent when conditioned on an observation of the bell.
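A minimal enumeration of this example, assuming fair coins and a bell that rings exactly when the coins match (the variable names are illustrative):

```python
from itertools import product

# Enumerate the joint distribution of two fair coins and a bell that
# rings exactly when the coins match.
joint = {}
for c1, c2 in product([0, 1], repeat=2):
    bell = int(c1 == c2)
    joint[(c1, c2, bell)] = 0.25  # the coins are independent and fair

def prob(pred):
    """Total probability of all outcomes satisfying pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# Marginally, the coins are independent:
p_c1 = prob(lambda c1, c2, b: c1 == 1)                                            # 0.5
p_c1_given_c2 = (prob(lambda c1, c2, b: c1 == 1 and c2 == 1)
                 / prob(lambda c1, c2, b: c2 == 1))                               # 0.5

# Conditioned on the bell ringing, they become dependent:
p_bell = prob(lambda c1, c2, b: b == 1)
p_c1_given_bell = prob(lambda c1, c2, b: c1 == 1 and b == 1) / p_bell             # 0.5
p_c1_given_bell_c2 = (prob(lambda c1, c2, b: c1 == 1 and c2 == 1 and b == 1)
                      / prob(lambda c1, c2, b: c2 == 1 and b == 1))               # 1.0

print(p_c1, p_c1_given_c2)                  # equal -> independent
print(p_c1_given_bell, p_c1_given_bell_c2)  # unequal -> dependent given the bell
```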
Applies to undirected graphs. Two nodes are separated when there are no unblocked paths between them. A path is blocked if a node in the path has been conditioned upon. In general, the Markov blanket separates a node from the rest of the graph, and the Markov boundary is a minimal Markov blanket.
Markov blanket: a set of nodes which makes a node independent of the rest of the graph when conditioned upon. The Markov boundary is the minimal blanket: the parents, children, and spouses (co-parents) of a node.
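A small sketch of computing a Markov boundary, assuming the DAG is given as a dictionary mapping each node to its list of parents (the toy graph below is made up):

```python
def markov_boundary(node, parents):
    """parents: dict mapping each node to the list of its parents in the DAG."""
    children = [v for v, ps in parents.items() if node in ps]
    spouses = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | set(children) | spouses

# Toy DAG: A -> C <- B, C -> D
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(markov_boundary("C", parents))  # {'A', 'B', 'D'}
print(markov_boundary("A", parents))  # {'B', 'C'}  (B is a spouse via the shared child C)
```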
Applies to directed graphs. Can be evaluated by transforming the directed graph into an undirected one via moralization. The major complication here is conditional dependence, as in the coins-and-bell example. Moralization produces an undirected graph with equivalent dependency relationships, eliminating the conditional-dependence headache. $Z$ d-separates $X$ and $Y$ (denoted $\langle X | Z | Y \rangle$) when $Z$ blocks all paths between $X$ and $Y$; d-separation implies $I(X,Y|Z)$. $Z$ blocks a path through a non-convergent node by containing that node. If $Z$ contains a convergent node, then conditioning on that node actually “opens” the path through it due to conditional dependence.
Testing: Is X d-separated from Y given Z?
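A sketch of the moralization-based test, using the same parent-dictionary representation as above; the helper name `d_separated` and the toy coins-and-bell DAG are illustrative:

```python
from collections import deque

def d_separated(x, y, z, parents):
    """Check whether x and y are d-separated given the set z in a DAG.

    Uses the moralization method: restrict to the ancestral graph of
    {x, y} union z, marry co-parents, drop edge directions, delete z,
    and test whether x can still reach y.
    """
    # 1. Ancestral subgraph of the relevant nodes.
    relevant, stack = set(), [x, y, *z]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents[v])

    # 2. Moralize: undirected edges for parent-child pairs and co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents[v] if p in relevant]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])

    # 3. Delete the conditioned nodes and test reachability.
    blocked = set(z)
    seen, queue = {x}, deque([x])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return y not in seen

# Coins-and-bell DAG: coin1 -> bell <- coin2
parents = {"coin1": [], "coin2": [], "bell": ["coin1", "coin2"]}
print(d_separated("coin1", "coin2", set(), parents))     # True: marginally independent
print(d_separated("coin1", "coin2", {"bell"}, parents))  # False: dependent given the bell
```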
Markov boundaries
Can construct a Bayesian network from a Markov network via the chain rule.
In general (chain rule), $p(x_1,x_2,\dots,x_n) = \prod_j p(x_j \mid x_1,\dots,x_{j-1})$; in a Bayesian network this reduces to $\prod_j p(x_j \mid \mathrm{parents}(x_j))$.
Markov model: $p(x_1,x_2,\dots,x_n) = p(x_1)\prod_{j=2}^{n} p(x_j \mid x_{j-1})$.
First-order Markov assumption: given the previous entry in a sequence, the next entry is independent of all earlier entries. For higher orders, $P(X_t)$ depends on more than one preceding variable. $$P(X_t|X_{t-1}) = P(X_t|X_{t-1},X_{t-2},\dots)$$
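A small sketch of the first-order factorization for a hypothetical two-state chain; the initial distribution `p1` and transition table `T` below are made-up numbers:

```python
import numpy as np

# Hypothetical two-state chain (states 0 and 1): initial distribution and
# a shared transition matrix T[i, j] = p(x_t = j | x_{t-1} = i).
p1 = np.array([0.6, 0.4])
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def sequence_prob(states):
    """p(x_1, ..., x_n) = p(x_1) * prod_j p(x_j | x_{j-1})."""
    p = p1[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= T[prev, cur]
    return p

print(sequence_prob([0, 0, 1, 1]))  # 0.6 * 0.9 * 0.1 * 0.8
```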
$\underline{x} \in \mathbb{R}^d$ for a d-dimensional model.
Characterized by parameters $\mu$ and $\Sigma$, the mean and covariance matrix.
* For $d$ dimensions, the number of parameters needed grows as $d^2$ due to the covariance matrix.
* The covariance matrix is positive semi-definite.
* The Gaussian assumes only pairwise linear dependencies.
$\displaystyle P(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} = \frac{1}{C} e^{-\frac{1}{2} d(x,\mu)}$
$\displaystyle d(x,\mu) = (x-\mu)^T \Sigma^{-1} (x-\mu)$. In two dimensions with $\mu = 0$,
$$d(x,\mu) =
\begin{pmatrix} x_1 & x_2 \end{pmatrix}
\begin{pmatrix} \sigma_1^2 & \rho \\ \rho & \sigma_2^2 \end{pmatrix}^{-1}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},$$
which for $\rho = 0$ reduces to $\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}$.
This $d(x,\mu)$ is effectively a distance function between the observation and the mean, known as the Mahalanobis Distance. As this distance increases, the exponential component of the equation becomes a smaller and smaller value.
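A sketch of the density written in terms of the Mahalanobis distance, using a made-up diagonal covariance so the distance visibly splits into $x_i^2/\sigma_i^2$ terms:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate normal density written in terms of the Mahalanobis distance."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(sigma) @ diff          # d(x, mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

# Example with a diagonal covariance: the distance splits into x_i^2 / sigma_i^2 terms.
mu = np.zeros(2)
sigma = np.diag([4.0, 1.0])          # sigma_1^2 = 4, sigma_2^2 = 1
x = np.array([2.0, 1.0])
print(gaussian_density(x, mu, sigma))
# The Mahalanobis term here is 2^2/4 + 1^2/1 = 2, so the exponential factor is e^{-1}.
```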
The covariance matrix $\Sigma$ can be expressed through eigendecomposition as $\Sigma = E \Lambda E^T$, where the columns of $E$ are the eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_i$.
The Mahalanobis distance can be interpreted as a weighted Euclidean distance in a transformed space: observations are shifted by $\mu$ and rotated by $E^T$ so that the components are decorrelated.
Define $\lambda_i$ = variance in direction $e_i$; the diagonal entries of $\Lambda$ are exactly these variances.
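A quick numerical check of this interpretation, using an arbitrary $2\times 2$ covariance; it verifies $\Sigma = E \Lambda E^T$ and that the Mahalanobis distance equals the $\lambda_i$-weighted Euclidean distance after shifting by $\mu$ and rotating by $E^T$:

```python
import numpy as np

# Made-up 2x2 covariance with correlated components.
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
lam, E = np.linalg.eigh(sigma)        # eigenvalues lam_i, eigenvectors as columns of E

# Reconstruction: Sigma = E diag(lam) E^T
print(np.allclose(E @ np.diag(lam) @ E.T, sigma))   # True

# Mahalanobis distance as a weighted Euclidean distance in the rotated frame:
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])
y = E.T @ (x - mu)                    # shift by mu, rotate by E^T
maha_rotated = np.sum(y ** 2 / lam)   # sum_i y_i^2 / lambda_i
maha_direct = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
print(np.isclose(maha_rotated, maha_direct))        # True
```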
Why did Prof. Dechter call these incorrect? “Incorrect” methods:
“Correct” methods:
Insemination has an 87% chance of causing pregnancy. Pregnancy gives a 90% chance of detectable progesterone; no pregnancy gives a 1% chance of detectable progesterone.
What is the probability of pregnancy if all three tests return negative? How do you find limits of positive/negative errors to ensure a certain confidence in the result?
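A sketch of the first question, assuming the three progesterone tests are conditionally independent given the true pregnancy state (an assumption not stated in the problem):

```python
# Posterior probability of pregnancy after three negative progesterone tests,
# assuming (as a modeling choice) the tests are conditionally independent
# given the true pregnancy state.
p_preg = 0.87                 # P(pregnant | insemination)
p_pos_given_preg = 0.90       # P(test positive | pregnant)
p_pos_given_not = 0.01        # P(test positive | not pregnant)
n_negative = 3

# Unnormalized posteriors from Bayes' rule.
like_preg = p_preg * (1 - p_pos_given_preg) ** n_negative
like_not = (1 - p_preg) * (1 - p_pos_given_not) ** n_negative

posterior = like_preg / (like_preg + like_not)
print(posterior)   # about 0.0069 -- three negatives make pregnancy very unlikely
```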
Diagnose whether components X, Y, Z are faulty based on inputs A, B, internal signals C, D, and output E.
The variables for each component represent states of function/faultiness, while the variables for electronic signals represent signal state.
Can represent variables and clauses as nodes in a Bayesian network.
e.g. a first-order Markov chain with four variables A, B, C, D (written $x_1,\dots,x_4$ below). Given $A = a$, compute the new probability of $D = d$.
$p(x_4|x_1) = \sum_{x_2,x_3} p(x_4,x_3,x_2|x_1)$ by law of total probability.
$ = \sum_{x_2,x_3} p(x_4,x_3,x_2,x_1) / p(x_1)$
$ = \sum_{x_2,x_3} p(x_4|x_3) p(x_3|x_2) p(x_2|x_1) p(x_1) / p(x_1)$
$ = \sum_{x_2,x_3} p(x_4|x_3) p(x_3|x_2) p(x_2|x_1)$
$ = \sum_{x_3} p(x_4|x_3) \sum_{x_2} p(x_3|x_2) p(x_2|x_1)$
Each is a sum over $O(k^2)$ terms, but there are $T$ variables to do it over, so the complexity of the whole operation is $O(Tk^2)$.
This is known as the message-passing algorithm in Bayesian networks.
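A sketch of this computation for a chain with $k=3$ states per variable; the conditional probability tables are random placeholders, and the matrix products play the role of the messages:

```python
import numpy as np

k = 3  # number of states per variable; the transition tables are made-up examples
rng = np.random.default_rng(0)

def random_cpt(k):
    """Random k x k table with rows p(x_next | x_prev = row)."""
    t = rng.random((k, k))
    return t / t.sum(axis=1, keepdims=True)

P21, P32, P43 = random_cpt(k), random_cpt(k), random_cpt(k)  # p(x2|x1), p(x3|x2), p(x4|x3)

# Brute force: sum over all x2, x3 -- O(k^2) terms per (x1, x4) pair, O(k^4) total.
brute = np.einsum('ij,jk,kl->il', P21, P32, P43)   # indexed [x1, x4]

# Message passing: push the sums inward, one O(k^2)-sized matrix product at a time.
m2 = P21 @ P32            # sum_{x2} p(x3|x2) p(x2|x1)  -> p(x3|x1)
p_x4_given_x1 = m2 @ P43  # sum_{x3} p(x4|x3) p(x3|x1)

print(np.allclose(brute, p_x4_given_x1))   # True
```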
Compute $P(d_j|g_k)$. The brute-force method is $p(d_j,g_k) = \sum_{a,b,c,e,f} p(d_j,a,b,c,e,f,g_k)$, a sum over $m^5$ terms.
Using the model: $p(d_j|g_k) = \sum_r p(d_j|b_r,g_k) p(b_r|g_k)$ by the law of total probability, $= \sum_r p(d_j|b_r) p(b_r|g_k)$ since $d$ is independent of $g$ given $b$. This is $O(m)$ once $p(b_r|g_k)$ is known for $r=1\dots m$.
$p(b_r|g_k) = \sum_i p(b_r|a_i)p(a_i|g_k)$ this is $O(m)$ for each fixed value of $b_r$, $O(m^2)$ total.
$p(a_i|g_k) = \frac{1}{p(g_k)} p(a_i,g_k)$
$p(a_i,g_k) = \sum_t p(a_i,c_t,g_k) = \sum_t p(a_i) p(c_t|a_i) p(g_k|c_t)$; this is $O(m)$ for each fixed value of $a_i$, $O(m^2)$ total.
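A sketch of the whole computation, assuming the network structure implied by the derivation ($a \to c \to g$, $a \to b \to d$) and random placeholder tables:

```python
import numpy as np

m = 4  # states per variable; all tables below are made-up placeholders
rng = np.random.default_rng(1)
norm = lambda t: t / t.sum(axis=-1, keepdims=True)

p_a = norm(rng.random(m))             # p(a)
p_c_a = norm(rng.random((m, m)))      # p(c | a), rows indexed by a
p_g_c = norm(rng.random((m, m)))      # p(g | c)
p_b_a = norm(rng.random((m, m)))      # p(b | a)
p_d_b = norm(rng.random((m, m)))      # p(d | b)

# p(a, g) = sum_c p(a) p(c|a) p(g|c): O(m) per (a, g) entry.
p_ag = p_a[:, None] * (p_c_a @ p_g_c)
p_g = p_ag.sum(axis=0)                # p(g)
p_a_g = p_ag / p_g                    # p(a | g), columns indexed by g

# p(b | g) = sum_a p(b|a) p(a|g), then p(d | g) = sum_b p(d|b) p(b|g).
p_b_g = p_b_a.T @ p_a_g               # [b, g]
p_d_g = p_d_b.T @ p_b_g               # [d, g]

print(p_d_g.sum(axis=0))              # each column sums to 1: a valid conditional
```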