Adaptive Annealing
Annealing is a technique that parametrically smooths a target density to improve sampling efficiency and accuracy during inference. In the discrete case, this is achieved by incrementing an inverse temperature \(t_{k}\) and setting \(p_k(\boldsymbol{z},\boldsymbol{x}) = p^{t_k}(\boldsymbol{z},\boldsymbol{x}),\,\,\text{for } k=0,\dots,K\), where \(0 < t_{0} < \cdots < t_{K} \le 1\). For a sufficiently small \(t_0\), exponentiation produces a smooth unimodal distribution, and the target density is recovered as \(t_{k}\) approaches 1. In other words, annealing provides a continuous deformation from an easier-to-approximate unimodal distribution to the desired target density.
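The flattening effect of the exponent \(t_k\) can be illustrated with a minimal sketch. The bimodal density `log_p` below is a hypothetical one-dimensional example chosen only for illustration and is not part of the library. The sketch compares the annealed density at a mode with its value in the valley between the modes.

```python
import math

def log_p(z):
    """Hypothetical unnormalized log-density with two Gaussian bumps at -3 and +3."""
    a = -0.5 * ((z + 3.0) / 0.5) ** 2
    b = -0.5 * ((z - 3.0) / 0.5) ** 2
    m = max(a, b)  # log-sum-exp for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_p_annealed(z, t):
    """Log of the annealed density p^t, up to a normalizing constant."""
    return t * log_p(z)

def peak_to_valley(t):
    """Ratio of the annealed density at a mode (z = 3) to the valley (z = 0)."""
    return math.exp(log_p_annealed(3.0, t) - log_p_annealed(0.0, t))

# Smaller inverse temperatures give a flatter, easier-to-approximate density.
ratios = {t: peak_to_valley(t) for t in (0.01, 0.1, 1.0)}
```

In this toy case the peak-to-valley ratio shrinks toward 1 as \(t\) decreases, which is the sense in which the annealed density is easier to approximate than the target.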
A linear annealing scheduler (see, e.g., Rezende and Mohamed [RM15]) with fixed temperature increments is often used in practice, where \(t_j=t_{0} + j (1-t_{0})/K\) for \(j=0,\ldots,K\) with constant increments \(\epsilon = (1-t_{0})/K\). Intuitively, small temperature changes are desirable to carefully explore the parameter space at the beginning of the annealing process, whereas larger changes can be taken as \(t_{k}\) increases, after annealing has helped to capture important features of the target distribution (e.g., locating all the relevant modes).
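As a minimal sketch, the linear schedule can be generated directly from its definition. The function name `linear_schedule` is an illustrative assumption, not a library API.

```python
def linear_schedule(t0, K):
    """Return the K + 1 inverse temperatures t_j = t0 + j * (1 - t0) / K, j = 0..K."""
    eps = (1.0 - t0) / K  # constant increment
    return [t0 + j * eps for j in range(K + 1)]

# Example: five temperatures from t0 = 0.01 up to 1.
temperatures = linear_schedule(t0=0.01, K=4)
```

Every step has the same size regardless of how quickly the annealed distribution changes, which is the limitation an adaptive scheduler is meant to address.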
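The proposed AdaAnn scheduler determines the increment \(\epsilon_{k}\) that approximately produces a pre-defined change in the KL divergence between two distributions annealed at \(t_{k}\) and \(t_{k+1}=t_{k}+\epsilon_{k}\), respectively. Letting the KL divergence equal a constant \(\tau^2/2\), where \(\tau\) is referred to as the KL tolerance, the step size \(\epsilon_k\) becomes
\[\epsilon_k = \frac{\tau}{\sqrt{\mathbb{V}_{p^{t_k}}\left[\log p(\boldsymbol{z},\boldsymbol{x})\right]}},\]
where \(\mathbb{V}_{p^{t_k}}[\cdot]\) denotes the variance with respect to the annealed distribution \(p^{t_{k}}(\boldsymbol{z},\boldsymbol{x})\).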
The denominator is large when the support of the annealed distribution \(p^{t_{k}}(\boldsymbol{z},\boldsymbol{x})\) is wider than the support of the target \(p(\boldsymbol{z},\boldsymbol{x})\), and progressively decreases with increasing \(t_{k}\), so the step size grows as annealing proceeds. Further detail on the derivation of the expression for \(\epsilon_{k}\) can be found in Cobian et al. [CHLS23].
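The following is a sketch of an adaptive schedule under stated assumptions, not the library's implementation. It takes the step size to be the KL tolerance \(\tau\) divided by the standard deviation of \(\log p\) under the annealed distribution, estimated by Monte Carlo. The target is assumed to be a standard Gaussian, so that \(p^{t}\) is a Gaussian with variance \(1/t\) and can be sampled exactly; in practice the samples would have to come from an approximation of the annealed distribution. The names `adaann_step` and `adaptive_schedule` are hypothetical.

```python
import math
import random
import statistics

def adaann_step(log_p_values, tau):
    """Step size tau / std(log p), with log p evaluated at samples from p^{t_k}."""
    return tau / math.sqrt(statistics.pvariance(log_p_values))

def adaptive_schedule(t0, tau, n_samples=10000, seed=0):
    """Build an increasing list of inverse temperatures from t0 up to 1."""
    rng = random.Random(seed)
    t, schedule = t0, [t0]
    while t < 1.0:
        # Toy assumption: target is N(0, 1), so p^t is N(0, 1/t).
        zs = [rng.gauss(0.0, 1.0 / math.sqrt(t)) for _ in range(n_samples)]
        log_p_values = [-0.5 * z * z for z in zs]  # log p up to a constant
        eps = adaann_step(log_p_values, tau)
        t = min(1.0, t + eps)  # clip the final step at t = 1
        schedule.append(t)
    return schedule

schedule = adaptive_schedule(t0=0.01, tau=0.05)
```

For this Gaussian toy target the variance of \(\log p\) under \(p^{t}\) is \(1/(2t^2)\), so the step size is approximately \(\sqrt{2}\,\tau\,t\): steps are small at low inverse temperature and grow as \(t\) approaches 1, in contrast with the constant increments of the linear scheduler.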