In this post, I will introduce the denoising score matching (DSM) loss in diffusion models. I recommend reading my previous posts before reading this one.
Score matching loss in diffusion models.
DSM loss
The score matching (SM) loss cannot be used in practice because the score function \nabla_{x_t} \log p_t(x_t) is intractable. As an alternative, the DSM loss \mathcal{L}_{DSM}(\theta) is optimized to train diffusion models:
\mathcal{L}_{DSM}(\theta) = \mathbb{E}_{t, x_0, x_t} \lVert s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t | x_0) \rVert^2 \tag{DSM loss}
where t is randomly sampled from U[0, T], a uniform distribution between 0 and T. The term p_{0t}(x_t | x_0), called a perturbation kernel, is the conditional probability density function (pdf) of x_t at time t given x_0 at time 0, and the x_t in the expectation \mathbb{E} is sampled from p_{0t}(x_t | x_0).
With the DSM loss, the score model s_\theta is trained to approximate \nabla_{x_t} \log p_{0t}(x_t | x_0), instead of \nabla_{x_t} \log p_t(x_t) as when the SM loss is optimized.
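To make the expectation concrete, here is a minimal PyTorch sketch of one DSM loss evaluation, assuming a variance-exploding (VE) Gaussian perturbation kernel p_{0t}(x_t | x_0) = N(x_t; x_0, \sigma(t)^2 I). The noise schedule \sigma(t), its hyperparameters, and the score_model(x_t, t) interface are my own illustrative choices, not something fixed by the DSM loss itself.

```python
# Minimal DSM loss sketch, assuming a VE Gaussian perturbation kernel
# p_{0t}(x_t | x_0) = N(x_t; x_0, sigma(t)^2 I). Schedule and model
# interface are illustrative assumptions, not prescribed by the post.
import torch

T = 1.0
sigma_min, sigma_max = 0.01, 50.0

def sigma(t):
    # Geometric noise schedule between sigma_min and sigma_max (an assumption)
    return sigma_min * (sigma_max / sigma_min) ** (t / T)

def dsm_loss(score_model, x0):
    # t ~ U[0, T], one time per sample in the batch
    t = torch.rand(x0.shape[0], device=x0.device) * T
    std = sigma(t).view(-1, *([1] * (x0.dim() - 1)))

    # Sample x_t from the perturbation kernel p_{0t}(x_t | x_0)
    noise = torch.randn_like(x0)
    xt = x0 + std * noise

    # For a Gaussian kernel, grad_{x_t} log p_{0t}(x_t | x_0) = -(x_t - x_0) / sigma(t)^2
    target = -(xt - x0) / std**2

    score = score_model(xt, t)
    return ((score - target) ** 2).flatten(1).sum(dim=1).mean()
```

Because the perturbation kernel is Gaussian, its score has the closed form in the code, which is what makes the DSM loss tractable. In practice the per-sample loss is often weighted, e.g. by \sigma(t)^2, so that all noise levels contribute comparably; I omit such weighting here for clarity.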
In the next post, I will explain that the DSM loss and the SM loss are equivalent in the sense that optimizing either one aims to make the score model approximate the score function.