The Variational Autoencoder (Kingma & Welling, 2013) is a [[Latent Variable Model]]
Instead of directly targeting
\(\min_\theta D_{KL}[q(\mathbf x) || p_\theta(\mathbf x)],\)
latent variable models introduce a latent variable $\mathbf z$ over which inference is performed. If the prior $p(\mathbf z)$ can be sampled, the model can generate new data by sampling $\mathbf z \sim p(\mathbf z)$ and decoding it through $p_\theta(\mathbf x \mid \mathbf z)$ (as in the sketch further below).
The latent variable enables unsupervised objectives such as clustering. It can also be easier to model the data with more expressive conditional models $p_\theta(\mathbf x \mid \mathbf z)$ than to target the data likelihood directly. See [[Variational Autoencoder]].
that targets
\(\begin{aligned}
\min_{\phi, \theta} D_{KL}[q_\phi(\mathbf x, \mathbf z) || p_\theta(\mathbf x, \mathbf z)] &= \mathbb{E}_{q(\mathbf x)\, q_\phi(\mathbf z \mid \mathbf x)}[\log q(\mathbf x) + \log q_\phi(\mathbf z \mid \mathbf x) - \log p_\theta(\mathbf x \mid \mathbf z) - \log p_\theta(\mathbf z)] \\
&= \underbrace{\mathbb{E}_{q(\mathbf x)}\left[\mathbb{E}_{q_\phi(\mathbf z \mid \mathbf x)}[-\log p_\theta(\mathbf x \mid \mathbf z)] + D_{KL}[q_\phi(\mathbf z \mid \mathbf x) || p_\theta(\mathbf z)]\right]}_{=:\ \text{Variational Free Energy}} - H[q(\mathbf x)]
\end{aligned}\)
Since the data entropy $H[q(\mathbf x)] = -\mathbb{E}_{q(\mathbf x)}[\log q(\mathbf x)]$ does not depend on $\phi$ or $\theta$, minimizing the joint KL is equivalent to minimizing the variational free energy, i.e. the expected negative ELBO.
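The bracketed variational free energy is the quantity a VAE implementation actually minimizes. Below is a minimal sketch (not from the source; `Encoder`, `Decoder`, `free_energy`, and `generate` are hypothetical names), assuming a Gaussian encoder $q_\phi(\mathbf z \mid \mathbf x)$, a standard normal prior $p(\mathbf z) = \mathcal N(\mathbf 0, I)$, and a Bernoulli decoder $p_\theta(\mathbf x \mid \mathbf z)$, with the reparameterization trick handling the expectation over $q_\phi$:

```python
# Hypothetical sketch of the variational free energy (negative ELBO) above,
# assuming a Gaussian encoder, a N(0, I) prior, and a Bernoulli decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Maps x to the mean and log-variance of q_phi(z | x)."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)


class Decoder(nn.Module):
    """Maps z to the Bernoulli logits of p_theta(x | z)."""
    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, z):
        return self.net(z)


def free_energy(x, encoder, decoder):
    """Per-batch variational free energy, one z-sample per data point."""
    mu, logvar = encoder(x)
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    # Single-sample estimate of E_{q_phi(z|x)}[-log p_theta(x | z)]
    recon = F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(-1)
    # Closed-form D_KL[q_phi(z | x) || p(z)] between two Gaussians
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)
    return (recon + kl).mean()


@torch.no_grad()
def generate(decoder, n: int, z_dim: int):
    """Generate new data: sample z ~ p(z) = N(0, I), then decode."""
    z = torch.randn(n, z_dim)
    return torch.sigmoid(decoder(z))
```

A single sample of $\mathbf z$ per data point is usually sufficient for training; the KL term is computed in closed form here because both $q_\phi(\mathbf z \mid \mathbf x)$ and $p(\mathbf z)$ are Gaussian.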
By the chain rule for KL divergences, \(D_{KL}[q_\phi(\mathbf x, \mathbf z) || p_\theta(\mathbf x, \mathbf z)] = D_{KL}[q(\mathbf x) || p_\theta(\mathbf x)] + \mathbb{E}_{q(\mathbf x)}\left[D_{KL}[q_\phi(\mathbf z \mid \mathbf x) || p_\theta(\mathbf z \mid \mathbf x)]\right],\) so the objective simultaneously minimizes the marginal likelihood objective $D_{KL}[q(\mathbf x) || p_\theta(\mathbf x)]$ and the expected divergence between the variational posterior $q_\phi(\mathbf z \mid \mathbf x)$ and the true posterior of the generative model $p_\theta(\mathbf z \mid \mathbf x)$.
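For reference, this decomposition follows directly from the factorizations $q_\phi(\mathbf x, \mathbf z) = q(\mathbf x)\, q_\phi(\mathbf z \mid \mathbf x)$ and $p_\theta(\mathbf x, \mathbf z) = p_\theta(\mathbf x)\, p_\theta(\mathbf z \mid \mathbf x)$:
\(\begin{aligned}
D_{KL}[q_\phi(\mathbf x, \mathbf z) || p_\theta(\mathbf x, \mathbf z)] &= \mathbb{E}_{q(\mathbf x)\, q_\phi(\mathbf z \mid \mathbf x)}[\log q(\mathbf x) - \log p_\theta(\mathbf x) + \log q_\phi(\mathbf z \mid \mathbf x) - \log p_\theta(\mathbf z \mid \mathbf x)] \\
&= D_{KL}[q(\mathbf x) || p_\theta(\mathbf x)] + \mathbb{E}_{q(\mathbf x)}\left[D_{KL}[q_\phi(\mathbf z \mid \mathbf x) || p_\theta(\mathbf z \mid \mathbf x)]\right]
\end{aligned}\)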