The ANF model is an invertible latent variable model. It is composed of multiple autoencoding transforms, each of which comprises a pair of encoding and decoding transforms, as depicted in Fig. (a). Consider the example of ANF with one autoencoding transform (i.e. one-step ANF). It converts the input $x$, coupled with an independent noise $e$, into their latent representation $(y, z)$ with one pair of encoding and decoding transforms:
$g_\pi^{enc} (x, e) = (x, s^{enc}_\pi(x) \odot e + m^{enc}_\pi (x)) = (x, z)$
$g_\pi^{dec} (x, z) = ((x - \mu_\pi^{dec}(z)) / \sigma^{dec}_\pi(z), z) = (y, z)$
where $s^{enc}_\pi, m^{enc}_\pi, \mu^{dec}_\pi,$ and $\sigma^{dec}_\pi$ are element-wise affine transformation parameters. These learnable parameters are produced by the encoding and decoding neural networks, whose weights are collectively referred to as $\pi$. Compared with ordinary flow models, ANF augments the input with an independent noise. It has been shown that the augmented input space allows a smoother transformation to the required latent space.
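For illustration, a minimal PyTorch-style sketch of one affine autoencoding transform is given below. The class and method names, channel counts, and the choice of resolution-preserving convolutions are illustrative assumptions for readability, not the actual ANFIC architecture; the scales are parameterized through an exponential purely to keep them positive.

```python
import torch
import torch.nn as nn


def _param_net(c_in, c_out, hidden=128):
    """Small conv stack; spatial resolution is kept unchanged for simplicity."""
    return nn.Sequential(nn.Conv2d(c_in, hidden, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(hidden, c_out, 3, padding=1))


class OneStepANF(nn.Module):
    """Hypothetical one-step autoencoding transform (affine form of the equations above)."""

    def __init__(self, ch=3):
        super().__init__()
        self.enc_net = _param_net(ch, 2 * ch)   # predicts (log s, m) from x
        self.dec_net = _param_net(ch, 2 * ch)   # predicts (mu, log sigma) from z

    def g_enc(self, x, e):
        log_s, m = self.enc_net(x).chunk(2, dim=1)
        return x, torch.exp(log_s) * e + m               # z = s(x) * e + m(x)

    def g_dec(self, x, z):
        mu, log_sigma = self.dec_net(z).chunk(2, dim=1)
        return (x - mu) / torch.exp(log_sigma), z        # y = (x - mu(z)) / sigma(z)

    def g_dec_inv(self, y, z):
        mu, log_sigma = self.dec_net(z).chunk(2, dim=1)
        return y * torch.exp(log_sigma) + mu, z          # x = y * sigma(z) + mu(z)

    def g_enc_inv(self, x, z):
        log_s, m = self.enc_net(x).chunk(2, dim=1)
        return x, (z - m) / torch.exp(log_s)             # e = (z - m(x)) / s(x)
```

Calling `step.g_dec(*step.g_enc(x, e))` maps $(x, e)$ to $(y, z)$, and the two inverse methods recover $(x, e)$ exactly, reflecting that the transform is bijective.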
Fig. (b) describes the framework of ANFIC. From bottom to top, it stacks two autoencoding transforms (i.e. two-step ANF), with the top one extended further to the right to form a hierarchical ANF that implements the hyperprior. More autoencoding transforms can be added straightforwardly to create a multi-step ANF. In particular, the $g^{enc}_\pi$ and $g^{dec}_\pi$ in the autoencoding transforms are given by
$g^{enc}_\pi(x, e) = (x, e + m^{enc}_\pi (x)) = (x, z)$
$g^{dec}_\pi(x, z) = (x - \mu^{dec}_\pi(z), z) = (y, z)$
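Continuing the sketch above, this purely additive variant (motivated next) simply drops the scale factors; the class name is again an illustrative assumption.

```python
class AdditiveANF(OneStepANF):
    """Purely additive variant: the scale terms s and sigma are removed."""

    def g_enc(self, x, e):
        _, m = self.enc_net(x).chunk(2, dim=1)
        return x, e + m                                  # z = e + m(x)

    def g_dec(self, x, z):
        mu, _ = self.dec_net(z).chunk(2, dim=1)
        return x - mu, z                                 # y = x - mu(z)

    def g_dec_inv(self, y, z):
        mu, _ = self.dec_net(z).chunk(2, dim=1)
        return y + mu, z                                 # x = y + mu(z)

    def g_enc_inv(self, x, z):
        _, m = self.enc_net(x).chunk(2, dim=1)
        return x, z - m                                  # e = z - m(x)
```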
We make them purely additive by removing $s^{enc}_\pi(x)$ and $\sigma^{dec}_\pi(z)$ for better convergence, as is done in some other flow-based schemes. The autoencoding transform of the hyperprior, which assumes each sample in the latent representation $z_2$ follows a Gaussian distribution, is defined as
$h^{enc}_\pi (z_2, e_h) = (z_2, e_h + m^{enc}_{\pi_3}(z_2)) = (z_2, \hat{h}_2)$
$h^{dec}_\pi (z_2, \hat{h}_2) = (\lfloor z_2 - \mu^{dec}_{\pi_3}(\hat{h}_2) \rceil, \hat{h}_2) = (\hat{z}_2, \hat{h}_2)$
where $\lfloor \rceil$ (depicted as Q in Fig. (b)) denotes nearest-integer rounding, which quantizes the residual between $z_2$ and the predicted mean $\mu^{dec}_{\pi_3}(\hat{h}_2)$ of its Gaussian distribution, derived from the hyperprior $\hat{h}_2$. This part implements the autoregressive hyperprior, with $z_2$ denoting the image latents whose distribution parameters are signaled by the side information $\hat{h}_2$.
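A hedged sketch of this hyperprior transform, continuing the modules above, is shown below. For brevity it keeps $\hat{h}_2$ at the same shape as $z_2$ and predicts only the mean; a practical hyperprior would downsample $z_2$ and typically also predict a scale for entropy coding, so this is a schematic rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class HyperANF(nn.Module):
    """Hypothetical hyperprior autoencoding transform with rounding (Q)."""

    def __init__(self, ch=3, hidden=128):
        super().__init__()
        self.m_enc = nn.Sequential(nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(hidden, ch, 3, padding=1))
        self.mu_dec = nn.Sequential(nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(hidden, ch, 3, padding=1))

    def h_enc(self, z2, e_h):
        return z2, e_h + self.m_enc(z2)                  # h2_hat = e_h + m(z2)

    def h_dec(self, z2, h2_hat):
        mu = self.mu_dec(h2_hat)                         # predicted Gaussian mean of z2
        return torch.round(z2 - mu), h2_hat              # z2_hat = round(z2 - mu), i.e. Q

    def h_dec_inv(self, z2_hat, h2_hat):
        return z2_hat + self.mu_dec(h2_hat), h2_hat      # approximate inverse (lossy due to Q)
```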
The encoding of ANFIC proceeds by passing the augmented input $(x, e_z, e_h)$ through the autoencoding and hyperprior transforms, i.e. $G_\pi = g^{dec}_{\pi_2} \circ h^{dec}_{\pi_3} \circ h^{enc}_{\pi_3} \circ g^{enc}_{\pi_2} \circ g^{dec}_{\pi_1} \circ g^{enc}_{\pi_1}$, to obtain the latent representation $(x_2, \hat{z}_2, \hat{h}_2)$. In particular, $x$ represents the input image, $e_z = 0$ denotes the augmented input, and $e_h \sim U(-0.5, 0.5)$, another augmented input, simulates the additive quantization noise of the hyperprior during training. To achieve lossy compression, we want $\hat{z}_2$ and $\hat{h}_2$ to capture most of the information about the input $x$, and we regularize $x_2$ during training so that it approximates zero. As such, only $\hat{z}_2$ and $\hat{h}_2$ are entropy coded into bitstreams.
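Under the same simplifying assumptions, the encoding path $G_\pi$ can be sketched by composing the hypothetical modules above; the function name `anfic_encode` and the `training` flag are illustrative, not part of the authors' code.

```python
import torch


def anfic_encode(x, g1, g2, h, training=True):
    """Hypothetical encoding path G_pi, composing the modules sketched above."""
    e_z = torch.zeros_like(x)                                # augmented input e_z = 0
    x1, z1 = g1.g_dec(*g1.g_enc(x, e_z))                     # first autoencoding transform
    x1, z2 = g2.g_enc(x1, z1)                                # second encoding transform
    e_h = (torch.rand_like(z2) - 0.5) if training else torch.zeros_like(z2)   # e_h ~ U(-0.5, 0.5)
    z2, h2_hat = h.h_enc(z2, e_h)                            # hyperprior encoding transform
    z2_hat, h2_hat = h.h_dec(z2, h2_hat)                     # rounds the residual (Q in Fig. (b))
    x2, _ = g2.g_dec(x1, z2_hat)                             # second decoding transform
    return x2, z2_hat, h2_hat                                # only z2_hat, h2_hat are entropy coded
```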
To decode the input $x$, we apply the inverse mapping function $G_\pi^{-1}$ to the quantized latents $(0, \hat{z}_2, \hat{h}_2)$, where $x_2$ is set to zero. In ANFIC, there are two sources of distortion that cause the reconstruction to be lossy: the quantization error of $z_2$ and the error of setting $x_2$ to zero during the inverse operation. Essentially, ANFIC is an ANF model, which is bijective and invertible. The errors between the encoding latents $(x_2, z_2)$ and their quantized versions $(0, \hat{z}_2)$ introduce distortion into the reconstructed image, as shown in Fig. (c). To mitigate the effect of quantization errors on the decoded image quality, we incorporate a quality enhancement (QE) network at the end of the reverse path, as illustrated in Fig. (c).
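A matching sketch of the inverse path, with the optional QE network passed in as a callable `qe_net`, might look as follows; as before, all names are illustrative assumptions rather than the authors' implementation.

```python
import torch


def anfic_decode(z2_hat, h2_hat, g1, g2, h, qe_net=None):
    """Hypothetical inverse path G_pi^{-1} applied to (0, z2_hat, h2_hat)."""
    x2 = torch.zeros_like(z2_hat)                    # x2 is set to zero at the decoder
    x1, _ = g2.g_dec_inv(x2, z2_hat)                 # invert the second decoding transform
    z2, _ = h.h_dec_inv(z2_hat, h2_hat)              # invert h_dec; e_h is not needed further
    x1, z1 = g2.g_enc_inv(x1, z2)                    # invert the second encoding transform
    x_hat, _ = g1.g_enc_inv(*g1.g_dec_inv(x1, z1))   # invert the first autoencoding transform
    return qe_net(x_hat) if qe_net is not None else x_hat   # optional quality enhancement
```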