Title: Human-centric 3D representation learning
Authors: Hong, Fangzhou
Keywords: Computer and Information Science
Issue Date: 2025
Publisher: Nanyang Technological University
Source: Hong, F. (2025). Human-centric 3D representation learning. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/184558
Abstract: Understanding the world in three dimensions has long been a scientific challenge with significant practical implications. In the era of deep learning, researchers have made substantial progress in developing 3D representations using deep neural networks. Among the myriad entities in our complex environment, humans stand out due to their general significance in science and their specific relevance in various applications. This thesis focuses on human-centric 3D representation learning within the framework of deep learning. Specifically, we explore three key areas: human-centric 3D perception, human reconstruction, and 3D human generation. The thesis presents five studies that collectively address these topics.

In the realm of human-centric 3D perception, we introduce a versatile multi-modal pre-training approach. By harnessing the diverse modalities of human data, such as RGB images, depth, and 2D keypoints, we present HCMoCo, a general framework for effective human-centric representation learning. This framework achieves state-of-the-art performance across four human perception tasks, including DensePose prediction, human parsing, and 3D keypoint prediction.

In the area of human reconstruction, we investigate garment reconstruction from 4D point clouds of dressed individuals. To address the ambiguities inherent in 2D images, we propose a principled framework called Garment4D, which enables separable and interpretable garment reconstruction. Notably, Garment4D can effectively reconstruct and model the non-rigid deformations of loose garments (e.g., skirts) that do not share the same topology as the human body.

We present two distinct approaches to 3D human generation. In our first work, AvatarCLIP, we generate and animate avatars from text descriptions of body shapes, appearances, and motions. By leveraging differentiable rendering and large-scale vision-language pre-trained models, we achieve avatar and motion synthesis without supervised training or paired data. To further enhance the quality of the generated avatars, we introduce a second approach, EVA3D, a high-quality unconditional 3D human generative model that requires only 2D image collections for training. We design an efficient compositional human NeRF representation that enables high-resolution 3D human sampling and rendering for adversarial training.

Finally, by harnessing the capabilities of large language models (LLMs), we explore 3D human representation learning with a focus on human motion and the integration of perception, reconstruction, and generation. We introduce EgoLM, a versatile framework for understanding egocentric motion from multi-modal data. The framework incorporates rich contextual information from egocentric videos and the motion sensors of wearable devices. EgoLM unifies various motion learning tasks, including motion understanding from video and motion data, as well as motion tracking and generation from text or sparse sensor input. It also enables a novel task unique to wearable devices: generating text descriptions from sparse sensor data.

By investigating 3D human representation learning from three distinct perspectives, we attain a comprehensive understanding of humans that spans both high-level semantics and low-level geometry. This thesis offers a cohesive exploration of human-centric 3D representations, contributing significantly to the field of human-centric vision.
URI: https://hdl.handle.net/10356/184558
Schools: College of Computing and Data Science 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: CCDS Theses

Files in This Item:
File: Thesis Revised.pdf
Size: 86.98 MB
Format: Adobe PDF

