Human View Synthesis using a Single Sparse RGB-D Input
Novel view synthesis for humans in motion is a challenging computer vision problem that enables applications such as free-viewpoint video. Existing methods typically use complex setups with multiple input views, 3D supervision, or pre-trained models that do not generalize well to new identities. To address these limitations, we present a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single-view sensor with sparse RGB-D, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that learns dense features in novel views obtained by sphere-based neural rendering and creates complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior human view synthesis methods and is robust to different levels of input sparsity.
Comparison of 3D point cloud transformations.
From a single RGB-D input, we obtain the warped image using three methods: a depth-based warping transformation, a neural point-based renderer, and a sphere-based renderer (Pulsar). The novel image warped by Pulsar is significantly denser because the Pulsar renderer not only supports a per-sphere radius parameter but also provides gradients for these radii, which allows them to be set dynamically.
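The depth-based warping baseline in this comparison can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a rigid source-to-target transform `R`, `t`; the function name `warp_points` and its signature are hypothetical and not taken from the HVS-Net code. Each valid depth pixel is back-projected to 3D, moved into the target camera frame, and projected again.

```python
def warp_points(depth, fx, fy, cx, cy, R, t):
    """Warp valid depth pixels from a source view to a target view.

    depth: list of rows; a value <= 0 marks a missing sample (sparse input).
    R: 3x3 rotation as nested lists, t: 3-vector (source -> target camera).
    Returns a list of (u_target, v_target, z_target, (u_source, v_source)).
    """
    warped = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # no depth measured at this pixel
            # Back-project pixel (u, v) to a 3D point in the source camera frame.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Apply the rigid transform into the target camera frame.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # point lies behind the target camera
            # Perspective projection into the target image plane.
            warped.append((fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2], (u, v)))
    return warped
```

Because every input point lands on a single target location, the warped image is only as dense as the input depth, which is consistent with the sparse result of plain depth-based warping in the comparison.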
Sphere-based view synthesis network
The feature predictor F learns the radius and feature vector of each sphere in the sphere set S. We then use the sphere-based differentiable renderer Ω to densify the learned input features M and warp them to the target camera T. The projected features are passed through the global context inpainting module G to generate the foreground mask, confidence map, and novel image. Brighter colors in the confidence map indicate lower confidence.
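To see why a per-sphere radius densifies the projected features, consider the following simplified sketch. It is not Pulsar and it is not differentiable: it uses a hard z-buffer in place of Pulsar's blending, and the function name `splat_spheres` and the pixel-space radii are assumptions made for illustration only.

```python
import math

def splat_spheres(points, radii, feats, width, height):
    """Rasterize spheres as screen-space disks with a hard z-buffer.

    points: (u, v, z) per sphere in target pixel coordinates.
    radii:  disk radius per sphere, in pixels.
    feats:  one feature value per sphere (any Python object).
    Returns a height x width grid holding the nearest sphere's feature, or None.
    """
    zbuf = [[float("inf")] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (u, v, z), r, f in zip(points, radii, feats):
        # Visit only the bounding box of the disk, clipped to the image.
        for y in range(max(0, math.floor(v - r)), min(height, math.floor(v + r) + 1)):
            for x in range(max(0, math.floor(u - r)), min(width, math.floor(u + r) + 1)):
                inside = (x - u) ** 2 + (y - v) ** 2 <= r * r
                if inside and z < zbuf[y][x]:
                    zbuf[y][x] = z  # keep the closest sphere at this pixel
                    image[y][x] = f
    return image
```

With a radius of zero each sphere covers one pixel, as in point-based warping; a larger radius fills the gaps between neighbouring samples, which is the effect a learned radius can exploit.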
Using an additional occlusion-free input, we refine the initially estimated novel view by training the Enhancer network. We infer dense correspondences for both the predicted novel view and the occlusion-free image using a novel HD-IUV module. The occlusion-free image is warped to the target view and then refined by an auto-encoder. The refined novel view shows better results in the occluded areas than the initial estimate.
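The correspondence-based warping step can be pictured with a small sketch. It assumes each foreground pixel carries a code `(part, u, v)` with `u` and `v` in [0, 1], and it matches pixels by quantizing these codes into bins. The function name `warp_by_iuv`, the `bins` parameter, and the nearest-bin lookup are all assumptions for illustration; the source does not describe how HD-IUV correspondences are matched.

```python
def warp_by_iuv(src_image, src_iuv, tgt_iuv, bins=64):
    """Copy source colors to target pixels that share a quantized IUV code.

    src_image: grid of colors; src_iuv / tgt_iuv: grids of (part, u, v) or None.
    Returns a grid shaped like tgt_iuv; unmatched pixels are None.
    """
    def key(code):
        part, u, v = code
        # Quantize the continuous surface coordinates into a discrete bin.
        return (part, min(int(u * bins), bins - 1), min(int(v * bins), bins - 1))

    # Index the occlusion-free image by surface coordinate.
    lookup = {}
    for img_row, iuv_row in zip(src_image, src_iuv):
        for color, code in zip(img_row, iuv_row):
            if code is not None:
                lookup[key(code)] = color

    # For every target pixel, fetch the color of the matching surface point.
    return [
        [lookup.get(key(code)) if code is not None else None for code in row]
        for row in tgt_iuv
    ]
```

Target pixels whose surface point is not visible in the occlusion-free image stay empty in this sketch; in the pipeline described above, the auto-encoder refines the warped result.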
We train our method on the synthetic data of the RenderPeople dataset and test the trained model on real 3DMD scans. Qualitative results show that HVS-Net generalizes to unseen humans at test time.
Given a single sparse RGB-D input, the point-based rendering method (SynSin) is unable to render realistic textures in the occluded regions at the target viewpoint, and the rendered human skin clearly differs from the input image. Using just 10% of the input depth points from a single view, HVS-Net also outperforms the recently proposed LookingGood method, which uses a multi-view capture setup.
Input depth sparsity robustness
Finally, we show the novel views generated by HVS-Net using different levels of input depth sparsity. Performance drops when only 5% of the points in the input depth map are used, while there is no significant difference between novel views generated from 10% and 25% of the input depth points.
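A sparsity level such as 5%, 10%, or 25% can be simulated from a dense depth map as sketched below. Uniform random sampling of the valid pixels is an assumption here, since the sampling scheme is not stated above, and the function name `sparsify_depth` is hypothetical.

```python
import random

def sparsify_depth(depth, keep_fraction, seed=0):
    """Keep a random fraction of the valid depth pixels and zero out the rest.

    depth: list of rows; values <= 0 are treated as already missing.
    keep_fraction: e.g. 0.05, 0.10 or 0.25.
    """
    rng = random.Random(seed)  # fixed seed for repeatable experiments
    valid = [(v, u) for v, row in enumerate(depth) for u, z in enumerate(row) if z > 0]
    keep = set(rng.sample(valid, round(len(valid) * keep_fraction)))
    return [
        [z if (v, u) in keep else 0.0 for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]
```

For example, `sparsify_depth(depth, 0.10)` retains one in ten valid depth samples, which mirrors the 10% setting discussed above.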