# Michael's Wiki

Directed Graphical Models

A directed graphical model or Bayesian network is a directed acyclic graph and a type of dependency graph where nodes represent probabilities of a variable and edges show a conditional relationship between child and parent. Root nodes represent unconditional probabilities over a variable.

“Directed graphical models” were invented in genetics in the 1920s, then were revived in the 1970s as “Bayesian Networks”. They are a systematic way to represent and compute sets of random variables and associated conditional independence assumptions.

##### Terminology

A graphical model with no independence assumptions is called saturated.

A node has converging arrows when it is dependent on multiple other variables. A node has diverging arrows when multiple other variables depend on it.

##### Properties

Can be represented as a Dependency Graph in which nodes represent propositional variables and arcs represent dependencies among related propositions.

Can be transformed into a Tree Decomposition via the Fill-In method

Can express non-chordal dependencies through the use of auxiliary variables.

#### Coins and Bell Example

This is an example of the conditional dependence of converging nodes which can be unintuitive. A bell rings only if two coins flip to the same value. The coins are independent of each other, but dependent when conditioned on an observation of the bell.

#### Separation

Applies to undirected graphs. Two nodes are separated when there are no unblocked paths between them. A path is blocked if a node in the path has been conditioned upon. In general, the Markov blanket separates a node from the rest of the graph, and the Markov boundary is a minimal Markov blanket.

Markov blanket: the set of nodes which make a node independent of the rest of the graph when conditioned upon. Markov boundary is the minimal blanket: the parent, childre, and spouses of a node.

#### d Separation

Applies to directed graphs. Can be evaluated by a transformation into an undirected grapd-separation via moralization. The major complication here is conditional dependence, like in the coins and bell example. Moralization transforms the directed graph into an undirected graph with equivalent dependency relationships, eliminating the conditional dependence headache. $Z$ d-separates $X$ and $Y$ if not $I(X,Y|Z)$. This can occur when $Z$ blocks all paths between $X$ and $Y$. (denoted $< X | Z | Y >$) $Z$ blocks a path that involves a non-convergent node by containing that node. If $Z$ contains a convergent node, then conditioning on that node actually “opens” the nodes around it due to conditional dependence.

Testing: Is X d-separated from Y given Z?

1. Take the ancestral graph of X,Y,Z
2. Moralize the result
3. * create a connection between all nodes of the same level
4. * edges are now undirected
5. Apply regular undirected graph separation
##### Construction

Markov boundaries

1. Create an ordering of the variables
2. * The ordering determines how full or sparse a resulting minimal I-map will be
3. Initialize empty graph
4. For each variable, in order:
1. Add the variable to the graph
2. Connect each variable in its Markov boundary present in the graph to the new variable, as a parent

Can construct a Bayesian network from a Markov network via the chain rule.

##### Common Examples

In general, $p(x_1,x_2,\dots) = \Pi{p(x_j|x_{j-1})}$

#### Markov Model

Markov model: $p(x_1,x_2,\dots) = \Pi_{d=2}\left[p(x_j|parents(x_j))\right]p(x_1)$

First-orderMarkov Assumption: Given the previous entry in a sequence, the next entry is independent on all previous parts of the sequence. For higher orders, $P(X_t)$ will depend on more than one other variable. $$P(X_t|X_{t-1}) = P(X_t|X_{t-1},X_{t-2},\dots)$$

#### Naive Bayes Model

Naive Bayes model: $p(x_1,x_2,\dots|C) = \Pi{p(x_j|C)p(x_1)}$

#### Gaussian Multivariate Model

$\underline{x} \in \mathbb{R}^d$ for a d-dimensional model.

### Gaussian Multivariate Density Function

Characterized by parameters $\mu$ and $\Sigma$, the mean and covariance matrix. * For $d$ dimensions, the number of parameters needed grows as $d^2$ due to the covariance matrix. * The covariance matrix is positive semi-definite. * The Gaussian assumes only pairwise linear dependencies.

$\displaystyle P(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}[(x-\mu)^t \Sigma (x-\mu)]} = \frac{1}{C} e^{-\frac{1}{2} d(x,\mu)}$

$\displaystyle d(x,\mu) = \begin{pmatrix} x_1 & x_2 \end{pmatrix} \begin{pmatrix} 1 & \rho \rho & 1 \end{pmatrix} \begin{pmatrix} x_1 x_2 \end{pmatrix} = \frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}$

This $d(x,\mu)$ is effectively a distance function between the observation and the mean, known as the Mahalanobis Distance. As this distance increases, the exponential component of the equation becomes a smaller and smaller value.

The covariance matrix $\Sigma$ can be expressed through eigendecomposition as $\Sigma = E \Lambda E$. Here $\Lambda$ is a diagonal matrix of $\lambda_i$ eigenvalues.

MH distance can be interpreted as a weighted Euclidean distance in a transformed space. In this space, values are rotated by $E$ and shifted by $\mu$.

### Principal Components Interpretation

Define $\lambda_i$ = variance in direction $e_i$. The eigenvalues of $\Lambda$ contain these variances when an optimal eigendecomposition is performed.

##### Inference

Why did Prof. Dechter call these incorrect? “Incorrect” methods:

• MPE
• Most probable explanation for each network variable given the evidence
• $p(a | e) =$ ? where $a = X$
• Given some evidence, identify an instantiation of remaining variables which maximizes probability of the evidence.
• Marginal
• $p(a | e) =$ ? where $a = X_i$
• MAP
• $p(a | e) =$ ? using $p(a)$ and $p(e|a)$

“Correct” methods:

1. Build a network
1. Identify the variables
2. Connect and quantify the links
2. Belief Updating by marginalization

#### Cow Example

Insemination has 87% chance of causing pregnancy. Pregnancy causes 90% chance of detectable progesterone. No pregnancy causes 1% chance of detectable progesterone.

• Scanning test
• 1% false positive rate
• 10% false negative rate
• Urine test
• 10% false positive
• 20% false negative
• Blood test
• 10% false positive
• 30% false negative

What is the probability of pregnancy if all three tests return negative? How do you find limits of positive/negative errors to ensure a certain confidence in the result?

#### Circuit Network

Diagnose whether components X Y Z are faulty based on input A,B , internal signals C,D, and output E. The variables for each component represent states of function/faultiness, where the variables for electronic signals represent signal state.

#### Satisfiability Problems

Can represent variables and clauses as nodes in a Bayesian network.

#### 1st Order Markov Chain

e.g. a 1st-order Markov chain with four variables A,B,C,D. Given $A = a$, compute new probability of $D = d$.

$p(x_4|x_1) = \sum_{x_2,x_3} p(x_4,x_3,x_2|x_1)$ by law of total probability.

$= \sum_{x_2,x_3} p(x_4,x_3,x_2,x_1) / p(x_1)$

$= \sum_{x_2,x_3} p(x_4|x_3) p(x_3|x_2) p(x_2|x_1) p(x_1) / p(x_1)$

$= \sum_{x_2,x_3} p(x_4|x_3) p(x_3|x_2) p(x_2|x_1)$

$= \sum_{x_3} p(x_4|x_3) \sum_{x_2} p(x_3|x_2) p(x_2|x_1)$

Each is a sum over $O(k^2)$ terms, but there are $T$ variables to do it over, so the complexity of this operation will be $O(Tk^2)$.

#### Binary Tree

This is known as the message-passing algorithm in bayesian networks.

Compute $P(d_j|g_k)$. Brute force method is $p(d_j,g_k) = \sum_{a,b,c,e,f} p(d_j,a,b,c,e,f,g_k)$ summing over a $m^5$ term there.

Using the model: $p(d_j,g_k) = \sum_r p(d_j|b_r,g_k) p(b_r|g_k)$ by LTP. $= \sum_r p(d_j|b_r) p(b_r|g_k)$ this is $O(m)$ because the sum of $p(b_r|g_k)$ is known for $i=1\dots m$.

$p(b_r|g_k) = \sum_i p(b_r|a_i)p(a_i|g_k)$ this is $O(m)$ for each fixed value of $b_r$, $O(m^2)$ total.

$p(a_i|g_k) = \frac{1}{p(g_k)} p(a_i,g_k)$

$p(a_i|g_k) = \sum_t p(a_i,c_t,g_k) = \sum_t p(a_i) p(c_t|a_i) p(g_k|c_t)$ this is $O(m)$ for each fixed value of $a_i$, $O(m^2)$ total.