SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation

¹School of Artificial Intelligence, Jilin University, Changchun, China
²Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
³School of Intelligence Science and Technology, Nanjing University, Suzhou, China
[Figure: semantic-aware virtual try-on (a) and controllable image synthesis (b)]

(a) Semantic-aware virtual try-on. Given a real image, we first employ GAN inversion to obtain its semantic latent code. We then replace the top and bottom garments by manipulating this semantic latent code: here, the top is randomly generated by our model, and the bottom is disentangled from another GAN inversion result. (b) Controllable image synthesis. Our method can generate the same person in different poses as well as render them from different viewpoints.
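To make the manipulation concrete, below is a minimal sketch of the latent-swap idea, assuming the generator's latent code is a simple concatenation of per-part codes. The part names, dimensions, and the way codes are obtained are illustrative placeholders, not our actual implementation:

```python
import numpy as np

PARTS = ("body", "top", "bottom")  # hypothetical semantic parts
PART_DIM = 64                      # hypothetical per-part code size

def split(z):
    """Split a flat latent code into a dict of per-part codes."""
    return {p: z[i * PART_DIM:(i + 1) * PART_DIM] for i, p in enumerate(PARTS)}

def merge(codes):
    """Concatenate per-part codes back into one flat latent code."""
    return np.concatenate([codes[p] for p in PARTS])

# Placeholders: z_real would come from GAN inversion of the input photo,
# z_other from inverting a second photo, z_rand from random sampling.
z_real  = np.random.randn(len(PARTS) * PART_DIM)
z_other = np.random.randn(len(PARTS) * PART_DIM)
z_rand  = np.random.randn(len(PARTS) * PART_DIM)

codes = split(z_real)
codes["top"]    = split(z_rand)["top"]      # randomly generated top
codes["bottom"] = split(z_other)["bottom"]  # bottom from another inversion
z_tryon = merge(codes)                      # fed to the generator for try-on
```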

Pose Control

With pose control, the generated 3D humans, together with their 3D garments, can be viewed in motion.

View Control

We render the generated results from different viewpoints. Notably, the renderings include the 3D garments.

Abstract

With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explores semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, due to the high computational cost of neural radiance fields, existing methods are limited to synthesizing images at 512² resolution.

To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantic disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at 1024² resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost.
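As a rough illustration of how such guidance can cut the sampling budget, the sketch below skips background rays using a semantic mask and concentrates the remaining ray samples in a narrow band around an estimated surface depth, instead of spanning the full near/far range. The band width and sample counts are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def full_range_samples(near, far, n=96):
    """Baseline: sample the entire [near, far] interval on every ray."""
    return np.linspace(near, far, n)

def guided_samples(depth, is_foreground, band=0.05, n=16):
    """Guided: skip background rays entirely (semantic mask) and sample
    only a thin band around the estimated surface depth (depth map)."""
    if not is_foreground:
        return np.empty(0)  # no samples needed off the body
    return np.linspace(depth - band, depth + band, n)

t_full   = full_range_samples(near=0.5, far=2.5)          # 96 samples per ray
t_guided = guided_samples(depth=1.3, is_foreground=True)  # 16 samples per ray
print(f"{t_full.size} -> {t_guided.size} samples per foreground ray")
```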

Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis.

Results

Semantic disentanglement

[Figure: semantic disentanglement]

By modifying the latent code of a specified semantic part, we can alter that part in the synthesized image while leaving the other parts unchanged.

Out-of-domain image synthesis

[Figure: out-of-domain image synthesis]

To achieve out-of-domain image synthesis, we assign different semantic labels to the various semantic parts. For example, if we set the semantic label corresponding to the body to "male" and the label corresponding to the top to "dress", we can synthesize an image of a man wearing a dress.
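In other words, out-of-domain combinations amount to mixing per-part condition labels that never co-occur in the training data. A hypothetical conditioning dictionary might look like the following (the label names and part keys are illustrative assumptions):

```python
# Hypothetical per-part semantic labels for out-of-domain synthesis.
# Because each part is conditioned independently, label combinations
# unseen during training remain valid inputs to the generator.
semantic_labels = {
    "body":   "male",
    "top":    "dress",  # a garment label normally paired with "female"
    "bottom": "trousers",
}
```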

Garment generation

[Figure: garment generation]

(a) Results randomly generated by our model. (b) Results obtained from GAN inversion.

Conditional image synthesis

[Figure: conditional image synthesis]

For each pair of images, the image on the right is synthesized conditioned on the semantic label L_s and human pose P of the real image on the left.
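In pseudocode terms, each right-hand image is produced by conditioning the generator on the (L_s, P) pair extracted from the left-hand real image; the function and argument names below are illustrative placeholders, not our released API:

```python
# Illustrative sketch of conditional sampling. L_s and P come from the
# real reference image; varying the latent z changes the appearance
# while the semantic labels and pose are preserved.
def synthesize(generator, z, L_s, P):
    """Render an image conditioned on semantic labels L_s and pose P."""
    return generator(z, semantic_label=L_s, pose=P)
```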

Semantic-aware interpolation

[Figure: semantic-aware interpolation]

Red dashed rectangles mark the semantic parts chosen for interpolation.
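Since each semantic part has its own latent code, interpolating a chosen part reduces to a per-part linear interpolation while the remaining codes stay fixed. A minimal sketch, where the part names and code dimensions are again placeholder assumptions:

```python
import numpy as np

def interpolate_part(codes_a, codes_b, part, t):
    """Linearly interpolate one semantic part's code; keep the rest from A."""
    out = dict(codes_a)
    out[part] = (1.0 - t) * codes_a[part] + t * codes_b[part]
    return out

codes_a = {p: np.random.randn(64) for p in ("body", "top", "bottom")}
codes_b = {p: np.random.randn(64) for p in ("body", "top", "bottom")}

# Interpolate only the "top" garment across five steps; body and bottom
# are held fixed, so only the chosen part changes between frames.
frames = [interpolate_part(codes_a, codes_b, "top", t)
          for t in np.linspace(0.0, 1.0, 5)]
```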

3D garment interpolation

[Figure: 3D garment interpolation]

3D garment interpolation, including images and normal maps. Please zoom in for a closer view of the details.