Abstract

This paper introduces ANFIC, an end-to-end learned image compression system based on Augmented Normalizing Flows (ANF). ANF is a recent type of flow model that stacks multiple variational autoencoders (VAEs) for greater model expressiveness. VAE-based image compression has become mainstream, showing promising compression performance. Our work presents the first attempt to leverage VAE-based compression in a flow-based framework. ANFIC advances compression efficiency further by hierarchically stacking and extending multiple VAEs. The invertibility of ANF, together with our training strategies, enables ANFIC to support a wide range of quality levels without changing the encoding and decoding networks. Extensive experimental results show that, in terms of PSNR-RGB, ANFIC performs comparably to or better than state-of-the-art learned image compression. Moreover, it performs close to VVC intra coding, from low-rate compression up to perceptually lossless compression. In particular, ANFIC achieves state-of-the-art performance when extended with conditional convolution for variable-rate compression with a single model. The source code of ANFIC can be found at https://github.com/dororojames/ANFIC.

System overview

Fig. (b) describes the framework of ANFIC. From bottom to top, it stacks two autoencoding transforms (i.e., a two-step ANF), with the top one extended further to the right to form a hierarchical ANF that implements the hyperprior. More autoencoding transforms can be added straightforwardly to create a multi-step ANF. In particular, the autoencoding transforms $g^{enc}_\pi$ and $g^{dec}_\pi$ are given by

$g^{enc}_\pi(x, e) = (x, e + m^{enc}_\pi(x)) = (x, z)$
$g^{dec}_\pi(x, z) = (x - \mu^{dec}_\pi(z), z) = (y, z)$

We make them purely additive by removing $s^{enc}_\pi(x)$ and $\sigma^{dec}_\pi$ for better convergence, as is done in some other flow-based schemes. The autoencoding transform of the hyperprior, which assumes that each sample of the latent representation $z_2$ follows a Gaussian distribution, is defined as
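The additive couplings are invertible by construction: given $z$, the augmented input is recovered as $z - m^{enc}_\pi(x)$, and $x$ is recovered as $y + \mu^{dec}_\pi(z)$, without ever inverting the networks themselves. A minimal NumPy sketch, using hypothetical linear stand-ins for the learned networks $m^{enc}_\pi$ and $\mu^{dec}_\pi$ (in ANFIC these are convolutional networks):

```python
import numpy as np

# Hypothetical stand-ins for the learned mean networks; in ANFIC these are
# convolutional networks parameterized by pi.
m_enc = lambda x: 0.5 * x    # mean added to the augmented input e
mu_dec = lambda z: 2.0 * z   # mean subtracted from x

def g_enc(x, e):
    # Additive coupling: x passes through unchanged; z = e + m_enc(x).
    return x, e + m_enc(x)

def g_dec(x, z):
    # Additive coupling: z passes through unchanged; y = x - mu_dec(z).
    return x - mu_dec(z), z

# Forward pass (g_dec after g_enc), then the exact inverse.
x0, e0 = np.array([1.0, -2.0, 0.3]), np.zeros(3)
x1, z = g_enc(x0, e0)
y, z = g_dec(x1, z)

x_rec = y + mu_dec(z)        # inverts g_dec
e_rec = z - m_enc(x_rec)     # inverts g_enc
assert np.allclose(x_rec, x0) and np.allclose(e_rec, e0)
```

Because each coupling only adds or subtracts a function of the untouched half, inversion is exact regardless of what the networks compute, which is what makes the purely additive variant attractive for stable training.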

$h^{enc}_{\pi_3} (z_2, e_h) = (z_2, e_h + m^{enc}_{\pi_3}(z_2)) = (z_2, \hat{h}_2)$
$h^{dec}_{\pi_3} (z_2, \hat{h}_2) = (\lfloor z_2 - \mu^{dec}_{\pi_3}(\hat{h}_2) \rceil, \hat{h}_2) = (\hat{z}_2, \hat{h}_2)$

where $\lfloor \rceil$ (depicted as Q in Fig. (b)) denotes the nearest-integer rounding for quantizing the residual between $z_2$ and the predicted mean $\mu^{dec}_{\pi_3}(\hat{h}_2)$ of the Gaussian distribution from the hyperprior $\hat{h}_2$. This part implements the autoregressive hyperprior, with $z_2$ denoting the image latents whose distributions are signaled as the side information $\hat{h}_2$.
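Because only the residual $z_2 - \mu^{dec}_{\pi_3}(\hat{h}_2)$ is rounded, the dequantized value $\hat{z}_2 + \mu^{dec}_{\pi_3}(\hat{h}_2)$ deviates from $z_2$ by at most 0.5 per element, however inaccurate the mean prediction is. A small sketch with hypothetical stand-ins for the hyperprior networks (here $e_h$ is set to zero, as at test time; during training it would be drawn from $U(-0.5, 0.5)$):

```python
import numpy as np

# Hypothetical stand-ins for the hyperprior networks m_enc_pi3 and mu_dec_pi3.
m_enc_h = lambda z2: 0.1 * z2
mu_dec_h = lambda h: np.floor(h)   # any deterministic predictor works here

def h_enc(z2, e_h):
    # Hyper-encoder half: z2 passes through; h_hat = e_h + m_enc_h(z2).
    return z2, e_h + m_enc_h(z2)

def h_dec(z2, h_hat):
    # Quantize the residual between z2 and its predicted Gaussian mean.
    return np.rint(z2 - mu_dec_h(h_hat)), h_hat

z2 = np.array([3.2, -1.7, 0.4])
z2_, h_hat = h_enc(z2, np.zeros(3))
z2_hat, h_hat = h_dec(z2_, h_hat)

# The decoder reconstructs z2 by adding the predicted mean back.
z2_rec = z2_hat + mu_dec_h(h_hat)
assert np.all(np.abs(z2_rec - z2) <= 0.5)   # rounding error is bounded
```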

The encoding of ANFIC proceeds by passing the augmented input $(x, e_z, e_h)$ through the autoencoding and hyperprior transforms, i.e. $G_\pi = g^{dec}_{\pi_2} \circ h^{dec}_{\pi_3} \circ h^{enc}_{\pi_3} \circ g^{enc}_{\pi_2} \circ g^{dec}_{\pi_1} \circ g^{enc}_{\pi_1}$, to obtain the latent representation $(x_2, \hat{z}_2, \hat{h}_2)$. Here, $x$ is the input image, $e_z = 0$ is the augmented input, and $e_h \sim U(-0.5, 0.5)$, another augmented input, simulates the additive quantization noise of the hyperprior during training. To achieve lossy compression, we want $\hat{z}_2$ and $\hat{h}_2$ to capture most of the information about the input $x$, and we regularize $x_2$ during training so that it approximates zero. As such, only $\hat{z}_2$ and $\hat{h}_2$ are entropy coded into bitstreams.

To decode the input $x$, we apply the inverse mapping $G_\pi^{-1}$ to the quantized latents $(0, \hat{z}_2, \hat{h}_2)$, where $x_2$ is set to zero. Although ANFIC is an ANF model and thus bijective and invertible, two sources of distortion make the reconstruction lossy: the quantization error of $z_2$ and the error of setting $x_2$ to zero during the inverse operation. The errors between the encoding latents $(x_2, z_2)$ and their quantized version $(0, \hat{z}_2)$ introduce distortion into the reconstructed image, as shown in Fig. (c). To mitigate the effect of these errors on the decoded image quality, we incorporate a quality enhancement (QE) network at the end of the reverse path, as illustrated in Fig. (c).
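The whole pipeline can be sketched end to end. The stand-in networks below are hypothetical linear maps (the real ones are learned CNNs) and the QE network is omitted; the sketch only illustrates the composition order of $G_\pi$, its exact invertibility when nothing is quantized or zeroed, and the two sources of loss described above:

```python
import numpy as np

# Hypothetical linear stand-ins for the learned networks (illustrative only).
m1  = lambda x: 0.50 * x     # m_enc_pi1
mu1 = lambda z: 1.00 * z     # mu_dec_pi1
m2  = lambda x: 0.25 * x     # m_enc_pi2
mu2 = lambda z: 1.00 * z     # mu_dec_pi2
mh  = lambda z: 0.10 * z     # m_enc_pi3
muh = lambda h: np.floor(h)  # mu_dec_pi3

def encode(x, e_z, e_h, quantize=True):
    # G_pi = g_dec_pi2 . h_dec_pi3 . h_enc_pi3 . g_enc_pi2 . g_dec_pi1 . g_enc_pi1
    z1 = e_z + m1(x)                       # g_enc_pi1
    x1 = x - mu1(z1)                       # g_dec_pi1
    z2 = z1 + m2(x1)                       # g_enc_pi2
    h_hat = e_h + mh(z2)                   # h_enc_pi3
    r = z2 - muh(h_hat)                    # h_dec_pi3: residual ...
    z2_hat = np.rint(r) if quantize else r # ... rounded at test time
    x2 = x1 - mu2(z2_hat)                  # g_dec_pi2
    return x2, z2_hat, h_hat

def decode(x2, z2_hat, h_hat):
    x1 = x2 + mu2(z2_hat)                  # invert g_dec_pi2
    z2 = z2_hat + muh(h_hat)               # invert h_dec_pi3 (dequantize)
    z1 = z2 - m2(x1)                       # invert g_enc_pi2
    return x1 + mu1(z1)                    # invert g_dec_pi1 (QE net omitted)

x = np.array([0.7, -1.3, 2.4])

# Without quantization and with the true x2, the flow is exactly invertible.
x2, z2c, hh = encode(x, np.zeros(3), np.zeros(3), quantize=False)
assert np.allclose(decode(x2, z2c, hh), x)

# Actual decoding: x2 is replaced by zeros and z2 is quantized, so the
# reconstruction is lossy; training drives x2 toward zero to keep this small.
_, z2q, hh = encode(x, np.zeros(3), np.zeros(3), quantize=True)
x_lossy = decode(np.zeros(3), z2q, hh)
```

With untrained stand-in networks, `x_lossy` deviates noticeably from `x`; the regularization of $x_2$ toward zero during training is precisely what keeps this deviation small in ANFIC.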

Subjective Quality Comparison

Reconstruction quality on an image selected from the Kodak dataset.