Abstract

This paper presents an end-to-end learning-based video compression system, termed CANF-VC, based on conditional augmented normalizing flows (ANF). Most learned video compression systems adopt the same hybrid coding architecture as traditional codecs. Recent research on conditional coding has shown the sub-optimality of this hybrid architecture and opened up opportunities for deep generative models to take a key role in creating new coding frameworks. CANF-VC represents a new attempt that leverages the conditional ANF to learn a video generative model for conditional inter-frame coding. We choose ANF because it is a special type of generative model that includes the variational autoencoder as a special case and is able to achieve better expressiveness. CANF-VC also extends the idea of conditional coding to motion coding, forming a purely conditional coding framework. Extensive experimental results on commonly used datasets confirm the superiority of CANF-VC over state-of-the-art methods.
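
To make the sub-optimality argument concrete, here is the standard entropy bound from the conditional-coding literature (restated for intuition, not a derivation from this paper): residual coding transmits the difference $x_t - x_c$ between the current frame and its prediction unconditionally, whereas conditional coding transmits $x_t$ given $x_c$, and

$$H(x_t \mid x_c) = H(x_t - x_c \mid x_c) \le H(x_t - x_c),$$

where the equality holds because $x_t$ and $x_t - x_c$ determine each other once $x_c$ is known, and the inequality holds because conditioning never increases entropy. The achievable rate of conditional coding is therefore never worse than that of residual coding.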

System overview

Fig. (a) depicts our CANF-based video compression system, abbreviated as CANF-VC. It includes two major components: (1) the CANF-based inter-frame coder $\{G_\pi, G_\pi^{-1}\}$ and (2) the CANF-based motion coder $\{F_\pi, F_\pi^{-1}\}$. The inter-frame coder encodes a video frame $x_t$ conditionally, given the motion-compensated frame $x_c$. It departs from conventional residual coding by maximizing the conditional log-likelihood $p(x_t|x_c)$ with a conditional, multi-step ANF model. The motion coder follows the same design as the inter-frame coder, with $x_t, x_c$ replaced by $f_t, f_c$, respectively. It extends conditional coding to motion coding, in order to signal the flow map $f_t$, which characterizes the motion between $x_t$ and its reference frame $\hat{x}_{t-1}$. In our work, $f_t$ is estimated by PWC-Net. The compressed flow map $\hat{f}_t$ serves to warp the reference frame $\hat{x}_{t-1}$, and the warped result is further enhanced by a motion compensation network to arrive at $x_c$. To formulate a condition for conditional motion coding, we introduce a flow extrapolation network that extrapolates a flow map $f_c$ from three previously decoded frames $\hat{x}_{t-1}, \hat{x}_{t-2}, \hat{x}_{t-3}$ and two decoded flow maps $\hat{f}_{t-1}, \hat{f}_{t-2}$. Note that we expand the condition of $p(x_t \mid \hat{x}_{<t})$ from previously decoded frames $\{\hat{x}_{<t}\}$ to also include previously decoded flows $\{\hat{f}_{<t}\}$.
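
To make the coding pipeline concrete, the following is a minimal PyTorch sketch of one inter-frame coding step. It is a sketch under stated assumptions, not the authors' implementation: every sub-network is a placeholder convolution, the names (`CANFVC`, `Stub`, `backward_warp`, and the channel counts) are hypothetical, and in the real system the motion and inter-frame coders are multi-step conditional ANF models with entropy coding, while the flow estimator is PWC-Net.

```python
# Minimal sketch of the CANF-VC data flow (placeholder sub-networks, not the
# authors' code). Channel counts assume RGB frames and 2-channel flow maps.
import torch
import torch.nn as nn
import torch.nn.functional as F


def backward_warp(frame, flow):
    """Warp `frame` toward the current frame with a dense flow map."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=frame.device),
        torch.arange(W, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).float()         # (H, W, 2) as (x, y)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)  # displace by (u, v)
    gx = 2.0 * grid[..., 0] / (W - 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)


class Stub(nn.Module):
    """Stand-in for a learned sub-network: concatenate inputs, apply one conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, *xs):
        return self.net(torch.cat(xs, dim=1))


class CANFVC(nn.Module):
    """One coding step of the pipeline described above (data flow only)."""
    def __init__(self):
        super().__init__()
        self.flow_net = Stub(6, 2)            # stands in for PWC-Net
        self.flow_extrapolator = Stub(13, 2)  # 3 frames + 2 flows -> f_c
        self.motion_coder = Stub(4, 2)        # conditional coder {F_pi, F_pi^-1}
        self.motion_comp = Stub(8, 3)         # motion compensation network
        self.inter_coder = Stub(6, 3)         # conditional coder {G_pi, G_pi^-1}

    def forward(self, x_t, x_ref, prev_frames, prev_flows):
        f_t = self.flow_net(x_t, x_ref)                          # estimate motion
        f_c = self.flow_extrapolator(*prev_frames, *prev_flows)  # condition for motion coding
        f_hat = self.motion_coder(f_t, f_c)                      # code f_t given f_c
        x_warp = backward_warp(x_ref, f_hat)                     # warp reference frame
        x_c = self.motion_comp(x_warp, x_ref, f_hat)             # enhance warp -> x_c
        x_hat = self.inter_coder(x_t, x_c)                       # code x_t given x_c
        return x_hat, f_hat


if __name__ == "__main__":
    x = torch.randn(1, 3, 64, 64)
    flows = [torch.zeros(1, 2, 64, 64)] * 2
    x_hat, f_hat = CANFVC()(x, x, [x, x, x], flows)
    print(x_hat.shape, f_hat.shape)  # (1, 3, 64, 64), (1, 2, 64, 64)
```

Only the data flow is meaningful here: which tensors condition which coder, and where warping and motion compensation sit between the motion coder and the inter-frame coder.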

Subjective Quality Comparison

Reconstruction quality on sequences selected from the UVG, HEVC Class B, and MCL-JCV datasets (shown in that order below).

Note: To ensure a fair comparison with DCVC, we apply the same intra-frame coder (ANFIC) to DCVC.

Sequence from UVG (the original figure shows the ground-truth frame alongside the four reconstructions):

| Metric  | DCVC (ANFIC)       | CANF-VC            | DCVC-ssim (ANFIC)  | CANF-VC-ssim       |
|---------|--------------------|--------------------|--------------------|--------------------|
| Quality | PSNR-RGB: 33.84 dB | PSNR-RGB: 34.50 dB | MS-SSIM-RGB: 0.967 | MS-SSIM-RGB: 0.966 |
| Rate    | 0.0184 bpp         | 0.0109 bpp         | 0.0336 bpp         | 0.0271 bpp         |

Sequence from HEVC Class B:

| Metric  | DCVC (ANFIC)       | CANF-VC            | DCVC-ssim (ANFIC)  | CANF-VC-ssim       |
|---------|--------------------|--------------------|--------------------|--------------------|
| Quality | PSNR-RGB: 27.71 dB | PSNR-RGB: 29.00 dB | MS-SSIM-RGB: 0.952 | MS-SSIM-RGB: 0.955 |
| Rate    | 0.0441 bpp         | 0.0396 bpp         | 0.0425 bpp         | 0.0465 bpp         |

Sequence from MCL-JCV:

| Metric  | DCVC (ANFIC)       | CANF-VC            | DCVC-ssim (ANFIC)  | CANF-VC-ssim       |
|---------|--------------------|--------------------|--------------------|--------------------|
| Quality | PSNR-RGB: 32.68 dB | PSNR-RGB: 33.26 dB | MS-SSIM-RGB: 0.972 | MS-SSIM-RGB: 0.970 |
| Rate    | 0.0343 bpp         | 0.0267 bpp         | 0.0506 bpp         | 0.0390 bpp         |