Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/171489
Title: Real-world object detection
Authors: Zang, Yuhang
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Zang, Y. (2023). Real-world object detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/171489
Project: NTU NAP 
RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) 
Singapore MOE AcRF Tier 2 (MOE-T2EP20120-0001) 
Abstract: Object detection is a fundamental computer vision task that estimates object classification labels and location coordinates in images. Previous studies have consistently boosted the performance of object detectors. However, real-world scenarios introduce significant obstacles that hinder their overall effectiveness. This thesis concentrates on two such challenges. The first is the long-tailed data distribution: real-world data often exhibits a severe imbalance in the number of images per category, and directly training an object detector on long-tailed data introduces a bias toward head-class objects, causing tail-class objects to be missed. The second is generalizing to test samples from unseen classes that are not included in the training set: detectors frequently misclassify objects from these unseen classes, either as background or as known categories.

This thesis explores solutions to both challenges. For the long-tailed problem, we concentrate on two approaches: data augmentation (FASA) and semi-supervised learning (CascadeMatch). To enhance the detector's generalization ability, we investigate leveraging prior knowledge from vision-language models (OV-DETR, UPT) or multimodal large language models (ContextDET).

First, we propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), that addresses the long-tailed issue by augmenting the feature space, especially for rare classes. FASA does not require any elaborate loss design and removes the need for inter-class transfer learning, which often involves large costs and manually defined head/tail class groups. We show that FASA is a fast, generic method that can be easily plugged into standard or long-tailed segmentation frameworks, with consistent performance gains and little added cost.

Second, we propose CascadeMatch, a novel pseudo-labeling-based object detector that uses semi-supervised learning to effectively tackle the long-tailed problem. CascadeMatch features a cascade network architecture consisting of multi-stage detection heads with incremental confidence thresholds. To avoid confirmation bias, each detection head is trained with the ensembled pseudo labels of all detection heads. To account for the class imbalance in real-world data, which causes neural networks to assign higher confidence to many-shot classes and lower confidence to few-shot classes, we propose class-specific self-adaptive confidence thresholds that are automatically tuned from labeled data with minimal human intervention.

Third, to achieve generalization to unseen classes during testing, we propose a novel open-vocabulary detector called OV-DETR. Once trained, OV-DETR can detect any object given its class name or an exemplar image. During training, we condition the Transformer decoder on input embeddings obtained from a pre-trained vision-language model such as CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR achieves non-trivial improvements over baseline methods.
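The class-specific self-adaptive confidence thresholds mentioned above are described only at a high level. The following is a minimal, illustrative PyTorch sketch of the general idea, assuming thresholds are estimated from per-class confidence statistics on labeled data; the function names, the mean-based estimate, and the `floor` parameter are assumptions for illustration, not the thesis's exact procedure.

```python
import torch

def estimate_class_thresholds(labeled_scores, labeled_labels, num_classes, floor=0.5):
    """Derive per-class confidence thresholds from predictions on labeled data.

    labeled_scores: (N, num_classes) softmax scores predicted on labeled samples.
    labeled_labels: (N,) ground-truth class indices for those predictions.
    Returns a (num_classes,) tensor of thresholds: classes the model is already
    confident on receive higher thresholds, rare classes receive lower ones.
    NOTE: the mean-based rule and the `floor` value are illustrative assumptions.
    """
    thresholds = torch.full((num_classes,), floor)
    for c in range(num_classes):
        mask = labeled_labels == c
        if mask.any():
            # Use the average confidence the model assigns to class-c samples.
            thresholds[c] = torch.clamp(labeled_scores[mask, c].mean(), min=floor)
    return thresholds

def select_pseudo_labels(unlabeled_scores, thresholds):
    """Keep pseudo labels whose confidence exceeds their class-specific threshold."""
    conf, pred = unlabeled_scores.max(dim=1)   # (M,), (M,)
    keep = conf > thresholds[pred]             # class-wise cutoff, not one global value
    return pred[keep], keep
```

The point of a class-wise cutoff is that a single global threshold would accept many head-class pseudo labels while rejecting nearly all tail-class ones, reinforcing the long-tailed bias the method is trying to remove.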
Fourth, we present a systematic study of unimodal prompt tuning methods, which serve as popular transfer learning paradigms for vision-language models like CLIP. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variance, while visual prompt tuning cannot handle low inter-class variance. To combine the best of both worlds, we propose a conceptually simple approach called Unified Prompt Tuning (UPT), which learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than its unimodal counterparts on existing benchmarks.

Finally, we introduce a novel research problem of contextual object detection: understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Extensive experiments show the advantages of ContextDET on a series of tasks, including our proposed contextual object detection, open-vocabulary detection, and referring image segmentation.
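To make the "tiny neural network that jointly optimizes prompts across different modalities" more concrete, here is a minimal PyTorch sketch under stated assumptions: the prompt count, embedding dimension, and the MLP transform are illustrative choices, not the architecture reported in the thesis.

```python
import torch
import torch.nn as nn

class UnifiedPromptGenerator(nn.Module):
    """Illustrative sketch: one shared set of learnable prompts, transformed by a
    small network and then split into text-side and visual-side prompt tokens."""

    def __init__(self, n_prompts=4, dim=512, hidden=128):
        super().__init__()
        # A single "unified" bank of learnable prompt embeddings for both modalities.
        self.unified = nn.Parameter(torch.randn(2 * n_prompts, dim) * 0.02)
        # Tiny shared transform applied before splitting per modality.
        self.transform = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        self.n_prompts = n_prompts

    def forward(self):
        prompts = self.transform(self.unified)       # (2 * n_prompts, dim)
        text_prompts = prompts[: self.n_prompts]     # to prepend to text token embeddings
        visual_prompts = prompts[self.n_prompts:]    # to prepend to image patch embeddings
        return text_prompts, visual_prompts
```

In the usual prompt-tuning setup, only a module like this would be trained while the pre-trained text and image encoders stay frozen, so the shared prompts become the single mechanism for adapting both branches.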
URI: https://hdl.handle.net/10356/171489
DOI: 10.32657/10356/171489
Schools: School of Computer Science and Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:SCSE Theses

Files in This Item:
File: Thesis_yuhangzang.pdf (24.63 MB, Adobe PDF)


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.