Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/165042
Title: Detecting human-object interactions for human activity analysis
Authors: Wang, Suchen
Keywords: Engineering::Computer science and engineering
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Wang, S. (2023). Detecting human-object interactions for human activity analysis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165042
Abstract: A long-standing goal in computer vision is to develop models that can understand the rich visual world and recognize the diverse activities within it. We have witnessed significant strides toward this goal over the last few years, driven by the availability of large-scale data and rapid advances in computing resources and deep learning algorithms. Computers can now detect person instances in images or videos, classify actions, and recognize the interacting objects. However, most of these advances focus on assigning one or a few labels from a small, pre-determined category space (e.g., riding a bicycle, opening a bottle), which uncovers only the tip of the iceberg of diverse daily human activities. In this thesis, we develop models that detect human interactions with a wide range of common objects.
First, we assemble a large-vocabulary dataset and propose a one-stage detector that takes an image as input and directly outputs a set of interaction tuples. We demonstrate that human visual cues (e.g., human pose and spatial location) provide informative priors for locating interacting objects and recognizing interactions. We empirically show that the proposed one-stage HOI detector can detect 23 times more interactions (from 600 to 14,000) and achieves a 25% mAP improvement over state-of-the-art methods.
Second, we develop a model that embeds visual objects and category names into a joint embedding space, and present a way to identify novel objects based on knowledge obtained from known object categories. We empirically show that the proposed zero-shot HOI detector achieves over 24% mAP improvement on human interactions with unseen objects.
Third, we introduce a model that learns to detect human-object interactions from natural language descriptions instead of pre-determined discrete labels. We demonstrate that this model transfers to 1,800 unseen interactions with a significant mAP improvement (from 6.21 to 10.04).
Finally, we argue that these models offer many practical benefits and immediately valuable applications. The proposed HOI detectors can be applied to extract discriminative action features for downstream tasks, e.g., video summarization and human activity understanding. We expect these techniques to serve as a stepping stone toward a more comprehensive understanding of human activities.
One side objective of this thesis is to address the challenges brought by limited training data. Compared with the unbounded range of human activities in the visual world, only a small portion of interactions can inherently be represented by labeled data. From this perspective, our contribution lies in designing algorithms that handle potential novel interactions beyond the collected category space, including unseen objects and novel combinations of seen actions and objects. From the modeling perspective, instead of designing complex multi-stage frameworks, our contribution lies in one-stage architectures that take an image and directly produce the interaction tuples with a single network. We formulate the task as a multi-task optimization problem and learn all module components with a shared objective function. We show that our methods outperform state-of-the-art HOI detection approaches, and they can help facilitate the visual understanding of rich human activities in our visual world.
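As a rough illustration of the "interaction tuple" output format the abstract describes (all names, fields, and values below are hypothetical, not taken from the thesis), a one-stage HOI detector can be viewed as a single function from an image to a set of tuples, each pairing a human box and an object box with an action label:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class InteractionTuple:
    human_box: Box      # detected person
    object_box: Box     # detected interacting object
    action: str         # e.g. "ride"
    object_cat: str     # e.g. "bicycle"
    score: float        # detector confidence

def detect_interactions(image: Optional[object]) -> List[InteractionTuple]:
    """Placeholder for a one-stage HOI detector: a single network maps an
    image directly to a set of interaction tuples (no separate detection
    and pairing stages). A real model would run a forward pass here; this
    sketch returns a fixed dummy result."""
    return [
        InteractionTuple((12.0, 30.0, 96.0, 210.0),
                         (40.0, 120.0, 180.0, 260.0),
                         "ride", "bicycle", 0.91)
    ]

dets = detect_interactions(None)
print(dets[0].action, dets[0].object_cat)  # ride bicycle
```

The point of the one-stage formulation is that the tuple set is produced directly by one network trained with a shared multi-task objective, rather than assembled from separate detection and interaction-classification stages.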
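The zero-shot idea can be sketched in a few lines (vectors and categories below are invented for illustration; in practice the name embeddings would come from a learned text encoder): a detected object's visual feature is compared against category-name embeddings in a shared space, so an object category unseen during training can still be matched by nearest neighbour:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy joint embedding space: category-name embeddings.
name_embeddings = {
    "bicycle":    [0.9, 0.1, 0.0],
    "skateboard": [0.7, 0.6, 0.1],   # unseen during detector training
    "bottle":     [0.0, 0.2, 0.95],
}

# Visual feature of a detected object of a novel category.
visual_feature = [0.72, 0.58, 0.05]

best = max(name_embeddings,
           key=lambda c: cosine(visual_feature, name_embeddings[c]))
print(best)  # skateboard
```

Because the matching is done by similarity in the joint space rather than by a fixed classifier head, adding a new object category only requires embedding its name, not retraining the detector.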
URI: https://hdl.handle.net/10356/165042
DOI: 10.32657/10356/165042
Schools: School of Electrical and Electronic Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:EEE Theses

Files in This Item:
thesis_suchen.pdf (11.44 MB, Adobe PDF)

Page view(s): 343 (updated on Mar 20, 2025)
Download(s): 210 (updated on Mar 20, 2025)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.