Given a set of observed variables $X$, the Bayesian framework of model fitting aims to find the most likely parameters $\theta$ that maximize the likelihood of observing $X$. To put it in probabilistic terms, we try to maximize the function $p(\theta \mid X)$.

This is known as **MAP** (maximum a posteriori) estimation.

Another formulation of the problem is to find a $\theta$ that maximizes the probability of observing $X$,

$$\hat{\theta} = \arg\max_{\theta} \; p(X \mid \theta)$$

The process behind an actual observation can be very complex. Therefore we often resort to a Bayesian network describing the generative process that leads to the observation. This kind of model is more expressive, since we can make explicit claims about the generative process itself.

This more complex formalism involves a set of intermediate variables that we don't directly estimate, but which help us put more constraints on the model. These variables are commonly denoted by $Z$.

Here we don't show $Z$ in the model, but they are there.

The goal is still the same: to maximize the log-likelihood of the observation,

$$\log p(X \mid \theta) = \log \int p(X, Z \mid \theta) \, dZ$$

We use **Jensen's** inequality, which says that for a concave function $f$, any point on the straight line connecting two points on the curve (i.e. $\alpha f(x_1) + (1-\alpha) f(x_2)$) is always lower than the corresponding point on the curve itself (i.e. $f(\alpha x_1 + (1-\alpha) x_2)$); therefore $f(\mathbb{E}[x]) \ge \mathbb{E}[f(x)]$. If $f$ is $\log$ here, then for any distribution $q(Z)$ we have

$$\log p(X \mid \theta) = \log \int q(Z) \, \frac{p(X, Z \mid \theta)}{q(Z)} \, dZ \;\ge\; \int q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} \, dZ = \mathbb{E}_{q(Z)}\!\left[\log \frac{p(X, Z \mid \theta)}{q(Z)}\right]$$
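As a quick numeric sanity check of Jensen's inequality for the (concave) log function; the sample points and probability weights below are arbitrary illustrations:

```python
import math

# Arbitrary positive sample points and probability weights over them
xs = [0.5, 1.0, 4.0]
ws = [0.2, 0.3, 0.5]

log_of_mean = math.log(sum(w * x for w, x in zip(ws, xs)))  # log(E[x])
mean_of_log = sum(w * math.log(x) for w, x in zip(ws, xs))  # E[log x]

assert log_of_mean >= mean_of_log  # Jensen: log is concave
```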

There are many terms used to denote equation (9), such as the **variational lower bound** or the **E**vidence **L**ower **BO**und (**ELBO**). Intuitively, it is the average (with respect to a distribution $q(Z)$) of the log-fold-change between the joint likelihood of $X$ and $Z$ and a fictitious variational distribution $q(Z)$. There is another way of writing (9) if we take the denominator out,

$$\mathbb{E}_{q(Z)}\big[\log p(X, Z \mid \theta)\big] - \mathbb{E}_{q(Z)}\big[\log q(Z)\big]$$

The negative expectation of a log (i.e. $-\mathbb{E}_{q(Z)}[\log q(Z)]$) is also known as Shannon's entropy.
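For a small discrete distribution this entropy is a one-liner; the probabilities below are an arbitrary example:

```python
import math

# Shannon entropy H(q) = -E_q[log q(z)] for a small discrete q
q = [0.1, 0.4, 0.5]
entropy = -sum(p * math.log(p) for p in q)

# Entropy is non-negative, and maximal (log K) for the uniform distribution
assert 0.0 <= entropy <= math.log(len(q))
```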

So we see that there is a gap between the LHS and the RHS in equation (9). We can examine the gap further,

As $\log p(X \mid \theta)$ depends only on $\theta$ ($X$ is given when we evaluate it), it is a constant with respect to $Z$ and can come inside the expectation $\mathbb{E}_{q(Z)}[\cdot]$.

The RHS of (14) denotes the Kullback-Leibler divergence between the fictitious distribution $q(Z)$ of the hidden variables and the true posterior $p(Z \mid X, \theta)$, generally denoted by $D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big)$,

$$D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big) = \int q(Z) \log \frac{q(Z)}{p(Z \mid X, \theta)} \, dZ$$

The ELBO, $\mathcal{L}(q, \theta)$, can also be rewritten as $\mathbb{E}_{q(Z)}\!\left[\log \frac{p(X, Z \mid \theta)}{q(Z)}\right]$, and, subsequently,

$$\log p(X \mid \theta) = \mathcal{L}(q, \theta) + D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big)$$

The KL divergence is a non-negative quantity.
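A small numeric check of this non-negativity for discrete distributions; the two distributions are arbitrary illustrations:

```python
import math

def kl(q, p):
    # D_KL(q || p) for two discrete distributions on the same support
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

q = [0.2, 0.5, 0.3]
p = [0.1, 0.6, 0.3]

assert kl(q, p) > 0.0    # positive whenever the distributions differ
assert kl(q, q) == 0.0   # and exactly zero when they match
```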

Equation 17 gives us another definition of $\mathcal{L}(q, \theta)$, the **variational lower bound** or **E**vidence **L**ower **BO**und (**ELBO**),

$$\mathcal{L}(q, \theta) = \log p(X \mid \theta) - D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big)$$

Equation 19 will be useful later.

An iterative algorithm known as EM (Expectation-Maximization) is applied: we optimize $q$ and $\theta$ stepwise. We can use EM when these two steps can be executed exactly.

Let's assume we already have a current estimate of the parameters, $\theta^{(t)}$.

In this step we will minimize the KL divergence. The KL divergence becomes 0 if,

$$q(Z) = p(Z \mid X, \theta^{(t)})$$

It would be awesome if we could evaluate $p(Z \mid X, \theta^{(t)})$. Remember it amounts to solving the following expression,

$$p(Z \mid X, \theta^{(t)}) = \frac{p(X, Z \mid \theta^{(t)})}{\int p(X, Z \mid \theta^{(t)}) \, dZ}$$

The integration in the denominator of equation (22) is the hardest part here, for many reasons: $p(X, Z \mid \theta)$ might not give us an analytical closed form, or might not be integrable at all.

Say by some superpower (or because the numerator of equation 20 is analytically so simple that you can just integrate it) you can solve the equation.

Finding this posterior is not where the E-step ends, although that's the crux of it. We also evaluate the ELBO by plugging $q(Z) = p(Z \mid X, \theta^{(t)})$ into 18.

We move to the next step to obtain a better estimate of $\theta$.

Given $q(Z) = p(Z \mid X, \theta^{(t)})$, the next step is to maximize $\mathcal{L}(q, \theta)$ with respect to $\theta$. There can be many ways to solve it. You might want to plug this $q$ into the ELBO and evaluate it. But that would just yield a function of $\theta$,

A very naive way to solve equation (20) is to just differentiate it (if you can) with respect to $\theta$ and solve for the maximizer.

That's how the EM algorithm proceeds,

The initial $\theta^{(0)}$ can be any value you deem reasonable (a poorly chosen value might elongate the convergence).
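The full loop above can be sketched for a two-component, one-dimensional Gaussian mixture, a classic case where both steps are exact. The synthetic data and initial values below are arbitrary choices for illustration:

```python
import math
import random

random.seed(0)

# Synthetic 1-D data from two well-separated Gaussians
data = ([random.gauss(-2.0, 0.5) for _ in range(200)]
        + [random.gauss(3.0, 0.5) for _ in range(200)])

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Initial guesses theta^(0); any reasonable values work, though poor
# choices slow down convergence
mu, var, pi = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]

loglik = []
for step in range(50):
    # E-step: responsibilities r[i][k] = p(z_i = k | x_i, theta), the
    # exact posterior over the hidden cluster assignments
    r = []
    for x in data:
        w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
        s = sum(w)
        r.append([wk / s for wk in w])
    # Track the observed-data log-likelihood; EM never decreases it
    loglik.append(sum(math.log(sum(pi[k] * normal_pdf(x, mu[k], var[k])
                                   for k in (0, 1))) for x in data))
    # M-step: closed-form maximizers of the expected complete-data
    # log-likelihood under the responsibilities
    for k in (0, 1):
        nk = sum(ri[k] for ri in r)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
        var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / nk
        pi[k] = nk / len(data)
```

The recovered component means should approach the generating means, and the tracked log-likelihood should increase monotonically and saturate, exactly the diagnostic curve discussed later.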

FAQ

The name comes from evaluating the expectation in equation 27. We in essence evaluate the *expectation* of the complete-data log-likelihood *under* the fictitious distribution $q(Z)$.

Because $q(Z)$ is called the variational distribution. In the case of EM we got super lucky and evaluated $q(Z) = p(Z \mid X, \theta)$ exactly; that's not the case in most problems, in which case we resort to a variational distribution, and given that *variational* distribution the ELBO is a *lower bound*. You can evaluate the ELBO $\mathcal{L}(q^{(t)}, \theta^{(t)})$ after each $t$-th step and plot this value. If the implementation is successful we will see an increasing curve that saturates with $t$, denoting that the lower bound is getting improved.

If we can't solve the integration, then we have to figure out other ways to lower the KL divergence. We of course cannot make it 0, but maybe we can get close, so that the KL divergence is a small value.

We still know that our best bet is to evaluate $p(Z \mid X, \theta)$, and therefore to evaluate the integral $\int p(X, Z \mid \theta) \, dZ$. If we can't do that exactly, then at the very least we can use *numerical* algorithms to estimate this integration approximately. There are many approximation algorithms that can give us numeric estimates of an actual integral.
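For instance, a plain deterministic quadrature can approximate such an integral; here a trapezoidal rule on an illustrative Gaussian integrand whose normalizer is known in closed form:

```python
import math

# Approximate the evidence-style integral Z = ∫ exp(-z^2 / 2) dz with a
# trapezoidal rule on a truncated grid; the true value is sqrt(2*pi)
a, b, n = -8.0, 8.0, 10_000
h = (b - a) / n

total = 0.5 * (math.exp(-a * a / 2) + math.exp(-b * b / 2))
for i in range(1, n):
    z = a + i * h
    total += math.exp(-z * z / 2)
integral = h * total
```

Such grid-based rules break down quickly as the dimensionality of $Z$ grows, which is why sampling-based estimates come next.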

A very naive technique would be to sample a bunch of points from the distribution (which can be an intimidating task on its own) and evaluate the function at them.

To be more concrete, you sample (there are many good algorithms for well-structured sampling, such as MCMC and Gibbs sampling) a vector of hidden variables $Z_i$ (if you can) and evaluate the function at it. Say you do it $N$ (a bunch of) times; then our best hope is to calculate

$$\mathbb{E}_{q(Z)}[f(Z)] \approx \frac{1}{N} \sum_{i=1}^{N} f(Z_i), \qquad Z_i \sim q(Z)$$
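A minimal sketch of this sample-and-average procedure, using a toy $q$ that we *can* sample from exactly and a toy $f$ whose expectation is known; both are illustrative assumptions:

```python
import random

random.seed(1)

# Toy setting: q(z) is a standard normal, and we estimate E_q[f(z)]
# for f(z) = z^2, whose exact value is 1 (the variance of z)
def f(z):
    return z * z

n = 100_000
estimate = sum(f(random.gauss(0.0, 1.0)) for _ in range(n)) / n
```

As the caveats that follow point out, this only works when the samples genuinely cover the distribution.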

This approximation looks simple, **but** there are caveats to such a solution, such as,

- The estimate can be far from the truth when the samples are not well distributed, or are autocorrelated (and therefore do not capture the space of the true hidden-variable distribution).
- We can't simply sample from $p(Z \mid X, \theta)$, because there is no analytical form for it. (This happens a lot in real-world problems.)

(To be written)

Equation 19 states $\log p(X \mid \theta) = \mathcal{L}(q, \theta) + D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big)$, where we had taken the help of a distribution $q(Z)$. We realized that the E-step finds a $q$ that minimizes the KL-divergence part given an initial estimate of the parameters, $\theta^{(t)}$. When we are lucky, we can set $q(Z)$ to the true posterior $p(Z \mid X, \theta^{(t)})$. In that case we are certain that we can fully evaluate $q$, by getting a solution to equation 22 (doing the integration etc.).

When we are *not* so lucky (which is most of the time), we parameterize $q$ and try to find its parameters accordingly. Before getting into that, let's first realize a fact: when we used $q(Z) = p(Z \mid X, \theta)$ in equation 23, we actually did the following,

Therefore the posterior $p(Z \mid X, \theta)$ can be treated as a key component in this entire exercise.

Now, when $p(Z \mid X, \theta)$ is intractable, we cannot set $q(Z)$ to it; *however*, we can define another parameterized function $q_\phi(Z)$. (Notice we removed $\theta$ here: we assume this function does not depend on the model parameters, but rather on another set of parameters $\phi$ which are all hidden in the function $q_\phi$.) Our hope is that $q_\phi(Z)$ will be close to $p(Z \mid X, \theta)$.

Now, coming back to our discussion about not having a good $q$: one can think about parameterizing $q$ itself as a function of some variables and trying to improve those parameters. In other words, if we can get a $q$ that is so flexible that it can be moulded in any way we want, that would be great. One such family of functions is neural networks. We can assume the parameters of the network (all the weights and nonlinear activation parameters) are encoded by $\phi$; then, instead of writing $q(Z)$, we can call such a neural network $q_\phi(Z \mid X)$.

In fact, you can think of $q_\phi(Z \mid X)$ as the output of the *encoder* in an auto-encoder.

In the light of this new notation we can rewrite equation 19 as

$$\log p(X \mid \theta) = \mathcal{L}(\phi, \theta) + D_{KL}\big(q_\phi(Z \mid X) \,\|\, p(Z \mid X, \theta)\big)$$

We will simplify the above equation a bit by assuming that we first want to find the best parameters for one data point $x_i$. We will also assume $x_i$ depends on a subset of hidden variables $z_i$. With these assumptions, we can write the likelihood for one data point,

$\log p(x_i \mid \theta)$ does not contain anything to do with $z_i$. We can rewrite it as $\mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p(x_i \mid \theta)\big]$, since $q_\phi$ is a probability distribution and therefore $\int q_\phi(z_i \mid x_i) \, dz_i = 1$. We used the similar trick while absorbing $\log p(X \mid \theta)$ inside the expectation in equation 12.

We can massage the expression further,

The point of this mathematical juggling will become clear in a moment,

Our new **variational lower bound** or **E**vidence **L**ower **BO**und (**ELBO**) is,

$$\mathcal{L}(\phi, \theta; x_i) = \mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p(x_i \mid z_i, \theta)\big] - D_{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i \mid \theta)\big)$$

We are again in trouble, since the first part is also an expectation; but since $q_\phi$ is something that we can choose, we can take enough samples and approximate the expression.

We need two different outcomes: firstly, we want to optimize the ELBO, $\mathcal{L}(\phi, \theta)$, and get the best values for $\phi$ and $\theta$; on the other hand, we need to evaluate the bound itself.

**E-like** step (finding the optimal $\phi$ given $\theta$)

We follow something similar to EM, but a bit more complex: in the *E-like* step we find a $\phi$ that maximizes $\mathcal{L}(\phi, \theta^{(t)})$ (given an initial value $\theta^{(t)}$).

We differentiate $\mathcal{L}(\phi, \theta)$ w.r.t. $\phi$,

To understand how we optimize the ELBO, first we have to understand how, in reality, such an expression is formed, because we are not going to differentiate the exact form of equation 39 directly. This optimization scheme is very much tied to the actual generation process. To understand that briefly, let's review the flow of inference.

Given the data $x_i$, a trainable parameterized function (such as an *encoder* network) is used to generate a set of hidden variables (often called **latent** variables) $z_i$; for their distribution we use another *easy* function (such as a normal distribution). Given this super-artificial distribution for $z_i$, we evaluate $\mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p(x_i \mid z_i, \theta)\big]$. Now again, this expectation could be complex, so we resort to sampling.

So the approximation process could be: draw a sample $z_i^{(l)} \sim q_\phi(z_i \mid x_i)$ and then evaluate $\log p(x_i \mid z_i^{(l)}, \theta)$ on it, subsequently,

Summing them up and averaging would give us an approximation of the expectation.
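This pipeline (an encoder proposing the parameters of $q$, sampling $z$, then evaluating the generative model on each sample) can be sketched with toy stand-in functions; the linear maps and constants below are arbitrary illustrations, not real networks:

```python
import random

random.seed(0)

def encoder(x):
    # Stand-in for the trainable network q_phi: maps a data point x to
    # the parameters (mu, sigma) of an "easy" normal distribution over z
    return 0.5 * x, 0.3

def decoder(z):
    # Stand-in for the generative network p_theta(x | z)
    return 2.0 * z

x = 1.7                          # one data point
mu, sigma = encoder(x)           # q(z | x) = N(mu, sigma^2)

n = 10_000
total = 0.0
for _ in range(n):
    z = random.gauss(mu, sigma)  # draw a sample z ~ q(z | x)
    total += decoder(z)          # evaluate the generative model on it

avg_recon = total / n            # Monte Carlo average over the samples
```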

The last part of equation 40 is another neural network that can be termed a **decoder**, with a set of parameters $\theta$ (this part is very similar to a normal hierarchical Bayesian model). With this sampling process we redefine the definition of the ELBO,

Now we can think about differentiating the above expression. If we want to differentiate the first part of 48, we will face a problem, since the gradient w.r.t. $\phi$ is hard to evaluate: the sampling inside the summation depends on $\phi$; specifically, each $z_i^{(l)}$ is itself drawn from a distribution parameterized by $\phi$.

This is a classical problem, and it does not have anything to do with our sampling in particular,

In short, this problem is formulated as follows,

Here the probability distribution over which the expectation is taken depends on a variable $\phi$, and we are differentiating w.r.t. that variable. One way to solve this problem is something called the *log-derivative trick*, which by now might be familiar: we multiply and divide the expression inside the integral by $q_\phi(z)$,

$$\nabla_\phi \, \mathbb{E}_{q_\phi(z)}\big[f(z)\big] = \int f(z) \, \nabla_\phi q_\phi(z) \, dz = \int f(z) \, q_\phi(z) \, \nabla_\phi \log q_\phi(z) \, dz = \mathbb{E}_{q_\phi(z)}\big[f(z) \, \nabla_\phi \log q_\phi(z)\big]$$

Effectively this is a good solution if we go back to 42,

In practice we do sampling to evaluate this expectation,

Although the expression might look simple, it leads to very bad sampling behavior. Why?

Let's think about a real-world scenario, where we start with some random $\phi$ and the likelihood denotes the probability of generating an image from some random initialization; this probability can be a very low number (the log-likelihood of images is highly negative). So while sampling we choose a $z$ and take the derivative of $\log q_\phi(z)$ with respect to $\phi$, which could be negative or positive, and multiply it by a highly negative number, yielding a very large or very small value. So the sampling mechanisms (such as MCMC) will take a long time to converge.
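A minimal sketch of this score-function (log-derivative) estimator on a toy objective where the exact gradient is known; the objective and parameter values are illustrative assumptions:

```python
import random

random.seed(2)

# Toy objective: J(theta) = E_{z ~ N(theta, 1)}[z^2], whose exact
# gradient is dJ/dtheta = 2 * theta
theta = 1.5
n = 200_000

# Score-function estimator: grad ≈ mean of f(z) * d/dtheta log p(z; theta);
# for a unit-variance normal, d/dtheta log N(z; theta, 1) = z - theta
grads = []
for _ in range(n):
    z = random.gauss(theta, 1.0)
    grads.append((z * z) * (z - theta))

grad_est = sum(grads) / n                              # ≈ 2 * theta = 3
var_est = sum((g - grad_est) ** 2 for g in grads) / n  # per-sample variance
```

Even in this tiny example the per-sample variance is large, and it blows up further when $f(z)$ is a very negative log-likelihood, which is exactly the pathology described above.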

Another way of solving this problem is a *change of variables*, or *reparameterization*. As opposed to the *log-derivative trick*, we use a change of variables in the calculation of the expectation. This method is widely popular in calculus,

One could easily think about this in terms of expectations,

If the transformed variable can be represented by a probability distribution, then

So basically we can change the expectation altogether,

In our particular case, let's assume we have a one-dimensional world (just for the ease of calculation),

Sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ is the same as first sampling $\epsilon \sim \mathcal{N}(0, 1)$, multiplying by $\sigma$, and shifting by $\mu$,

$$z = \mu + \sigma \epsilon$$
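A sketch of this reparameterized estimator on the same toy objective used for the score-function discussion (an illustrative assumption, not the real encoder/decoder setup); the randomness now comes from a fixed distribution, and the parameters enter only through a deterministic transform we can differentiate:

```python
import random

random.seed(3)

mu, sigma = 1.5, 0.8
n = 100_000

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1)
grads = []
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    z = mu + sigma * eps
    # Pathwise gradient of f(z) = z^2 w.r.t. mu: (df/dz) * (dz/dmu) = 2z
    grads.append(2.0 * z)

# Exact answer: d/dmu E[z^2] = d/dmu (mu^2 + sigma^2) = 2 * mu = 3
grad_est = sum(grads) / n
```

The per-sample variance of this pathwise estimator is just $4\sigma^2$ here, far smaller than the score-function version, which is why reparameterization is the workhorse of variational auto-encoders.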