Note: these notes are largely based on the course at https://space.bilibili.com/491707363/.
Approximate the posterior distribution \(p(z|x)\) with \(q(z)\), i.e. minimize \(KL(q(z)|p(z|x))\).
Maximizing ELBO is equivalent to minimizing \(KL(q(z)|p(z|x))\).
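This equivalence follows from the standard decomposition of the log evidence:
\[\begin{align} log(p(x))&=\int_{z}q(z)log\left(\frac{p(x,z)}{q(z)}\right)dz+\int_{z}q(z)log\left(\frac{q(z)}{p(z|x)}\right)dz\\ &=ELBO(q)+KL(q(z)|p(z|x)) \end{align} \]Since \(log(p(x))\) does not depend on \(q\), maximizing the first term minimizes the second.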
Decompose \(q(z)\) into a product of \(n\) factors (the mean-field assumption), i.e. \(q(z)=\prod_{i=1}^{n}q_{i}(z_{i})\). The best \(q(z)\) can then be found by iteratively optimizing each factor \(q_{i}(z_{i})\).
\[q_{i}(z)=\frac{exp\{E_{q_{-j}}[log(p(x,z))]\}}{\int_{z_{j}}exp\{E_{q_{-j}}[log(p(x,z))]\}dz_{j}} \]Define a new distribution \(\tilde{p}_{j}\), such that \(log(\tilde{p}_{j}(x,z_{j}))=E_{q_{-j}}[log(p(x,z))]+const\).
Then,
\[\begin{align} ELBO(q)&=\int_{z_{j}}q_{j}(z_{j})log(\tilde{p}_{j}(x,z_{j}))dz_{j}-\int_{z_{j}}q_{j}(z_{j})log(q_{j}(z_{j}))dz_{j}+const\\ &=\int_{z_{j}}q_{j}(z_{j})log\left(\frac{\tilde{p}_{j}(x,z_{j})}{q_{j}(z_{j})}\right)dz_{j}+const\\ &=-KL(q_{j}(z_{j})|\tilde{p}_{j}(x,z_{j}))+const \end{align} \]The ELBO is therefore maximized when
\[\begin{align} q_{j}(z_{j})&=\tilde{p}_{j}(x,z_{j})\\ &\propto exp\{E_{q_{-j}}[log(p(x,z))]\} \end{align} \]Consider a toy example, in which we are trying to approximate \(p(z)=\mathcal{N}(z|\mu,\Lambda^{-1})\) with \(q(z)=q_{1}(z_{1})q_{2}(z_{2})\).
\[\mu= \left( \begin{array}{c} \mu_{1}\\ \mu_{2} \end{array} \right), \Lambda= \left( \begin{array}{cc} \Lambda_{11} & \Lambda_{12}\\ \Lambda_{21} & \Lambda_{22} \end{array} \right). \]Applying the update above gives \(q^{*}_{1}(z_{1})=\mathcal{N}(z_{1}|m_{1},\Lambda^{-1}_{11})\), where \(m_{1}=E[z_{1}]=\mu_{1}-\Lambda_{11}^{-1}\Lambda_{12}(E[z_{2}]-\mu_{2})\).
Similarly, \(q^{*}_{2}(z_{2})=\mathcal{N}(z_{2}|m_{2},\Lambda^{-1}_{22})\), where \(m_{2}=E[z_{2}]=\mu_{2}-\Lambda_{22}^{-1}\Lambda_{21}(E[z_{1}]-\mu_{1})\).
These coupled equations are solved by \(m_{1}=\mu_{1}\) and \(m_{2}=\mu_{2}\), so the closed-form solution is:
\[\begin{align} q^{*}(z)&=q^{*}_{1}(z_{1})q^{*}_{2}(z_{2})\\ &=\mathcal{N}(z_{1}|\mu_{1},\Lambda_{11}^{-1})\mathcal{N}(z_{2}|\mu_{2},\Lambda_{22}^{-1}) \end{align} \]
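As a quick numerical check, the coordinate updates above can be iterated until the means converge to \(\mu\). This is a minimal sketch; the particular values of \(\mu\) and \(\Lambda\) are illustrative assumptions, not from the notes.

```python
import numpy as np

# Toy CAVI on a 2-D Gaussian target p(z) = N(mu, Lambda^{-1}).
# mu and Lambda are made-up example values (Lambda symmetric positive definite).
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])  # precision matrix

# Mean-field q(z) = q1(z1) q2(z2); each factor is Gaussian with fixed
# precision Lambda_jj, so only the means m1, m2 need to be updated.
m = np.zeros(2)
for _ in range(50):
    # Coordinate updates derived in the text:
    # m1 = mu1 - Lam11^{-1} Lam12 (E[z2] - mu2), and symmetrically for m2.
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m)  # converges to mu
```

Each sweep contracts the error by a factor of \(\Lambda_{11}^{-1}\Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}\), so the iteration converges quickly to the fixed point \(m=\mu\).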
Now consider the reverse direction: minimize \(KL(p(z|x)|q(z))\).
Under the same factorization \(q(z)=\prod^{m}_{i=1}q_{i}(z_{i})\), this is equivalent to minimizing \(-\int p(z|x)\left[\sum^{m}_{i=1}log(q_{i}(z_{i}))\right]dz+const\), since the entropy of \(p(z|x)\) does not depend on \(q\). Then,
\[q^{*}_{j}(z_{j})=argmin_{q_{j}}KL(p(z|x)|q(z))=\int_{z_{-j}} p(z|x)dz_{-j} \]Optimizing \(KL(p(z|x)|q(z))\) w.r.t. \(q_{j}(z_{j})\):
\[\begin{align} KL(p(z|x)|q(z))&=-\int_{z}p(z|x)\left[\sum^{m}_{i=1}log(q_{i}(z_{i}))\right]dz+const\\ &=-\int_{z}p(z|x)log(q_{j}(z_{j}))dz+const\\ &=-\int_{z_{j}}log(q_{j}(z_{j}))\left[\int_{z_{-j}}p(z|x)dz_{-j}\right]dz_{j}+const \end{align} \]The last line is the cross entropy between the marginal \(\int_{z_{-j}}p(z|x)dz_{-j}\) and \(q_{j}(z_{j})\), which is minimized when the two coincide. Therefore, the optimal value is \(q^{*}_{j}(z_{j})=\int_{z_{-j}} p(z|x)dz_{-j}\).
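This marginal-optimality result can be verified numerically on a small discrete joint distribution. The 3x3 joint below is an arbitrary made-up example, used only for illustration.

```python
import numpy as np

# Check that, with q2 held at its optimum, the marginal of z1 minimizes
# KL(p || q1 q2) over q1 for a discrete joint p.
rng = np.random.default_rng(0)
p = rng.random((3, 3))
p /= p.sum()  # arbitrary joint distribution over (z1, z2)

marg1 = p.sum(axis=1)  # marginal of z1: the claimed optimum for q1
q2 = p.sum(axis=0)     # marginal of z2: the claimed optimum for q2

def kl_p_q(q1, q2):
    """KL(p || q1 q2), summed over the grid."""
    q = np.outer(q1, q2)
    return np.sum(p * (np.log(p) - np.log(q)))

best = kl_p_q(marg1, q2)
# Random perturbations of q1 away from the marginal only increase the KL.
for _ in range(100):
    q1 = np.abs(marg1 + rng.normal(size=3) * 0.05)
    q1 /= q1.sum()
    assert kl_p_q(q1, q2) >= best - 1e-12
```

Note that `best` equals the mutual information between \(z_1\) and \(z_2\): the product of marginals is the best factorized fit under this KL direction, but the residual gap is exactly the dependence that the factorization discards.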
VI (minimizing \(KL(q|p)\)) tends to under-estimate the posterior variance, while EP (minimizing \(KL(p|q)\)) tends to over-estimate it.
Equivalently, VI is zero-forcing: \(q\) is pushed toward zero wherever \(p\) is small, so it tends to lock onto a single mode. EP is zero-avoiding: \(q\) must stay positive wherever \(p\) has mass, so it spreads to cover all modes.
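The contrast can be seen by fitting a single Gaussian to a bimodal target under each KL direction. This is a coarse grid-search sketch; the target mixture and the search grids are illustrative assumptions.

```python
import numpy as np

# Fit q = N(mu, s^2) to a bimodal target p by minimizing each KL direction.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normal(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Target: equal-weight mixture of N(-3, 1) and N(3, 1).
p = 0.5 * normal(x, -3.0, 1.0) + 0.5 * normal(x, 3.0, 1.0)

def kl(a, b):
    # Discretized KL(a || b); tiny epsilon guards against log(0).
    return np.sum(a * (np.log(a + 1e-300) - np.log(b + 1e-300))) * dx

best_fwd = best_rev = None  # (kl value, mu, s)
for mu in np.linspace(-5, 5, 41):
    for s in np.linspace(0.5, 5, 46):
        q = normal(x, mu, s)
        fwd, rev = kl(p, q), kl(q, p)  # KL(p||q) vs KL(q||p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, s)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, s)

# Zero-avoiding KL(p||q): a wide q centered between the modes.
# Zero-forcing KL(q||p): a narrow q locked onto one mode.
print("KL(p||q) optimum (mu, s):", best_fwd[1:])
print("KL(q||p) optimum (mu, s):", best_rev[1:])
```

The forward-KL fit roughly moment-matches the mixture (mean near 0, standard deviation near \(\sqrt{10}\)), while the reverse-KL fit sits on one mode with a standard deviation near 1.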