Please use this identifier to cite or link to this item:
https://hdl.handle.net/10356/172024
Title: Contrastive knowledge transfer from CLIP for open vocabulary object detection
Authors: Zhang, Chuhan
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Issue Date: 2023
Publisher: Nanyang Technological University
Source: Zhang, C. (2023). Contrastive knowledge transfer from CLIP for open vocabulary object detection. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/172024
Abstract: Object detection has made remarkable progress in recent years, yet in real-world scenarios a model is expected to generalize to novel objects it was never explicitly trained on. Although pre-trained vision-language models have shown powerful results on zero-shot classification, adapting them to detection is non-trivial, because detection involves region-level reasoning as well as non-semantic localization. This dissertation proposes a method built on a DETR-style architecture and contrastive distillation. It uses the CLIP model to provide semantically rich features as priors for querying novel objects. In addition, the model is trained to align with CLIP in a latent space via a contrastive loss, enabling it to distinguish unseen classes. The effectiveness of the proposed method is supported by experimental results of 65.3 novel AR and 23.4 novel mAP on the MSCOCO dataset; its variants outperform their counterparts by 3.5 mAP and 3.1 mAP respectively. The proposed contrastive distillation loss can also be integrated into other frameworks, where it achieves the best performance. The significance of the individual modules is revealed through ablation and visualization studies, and a qualitative analysis demonstrates the potential of the proposed method as an effective on-the-fly detector. A final discussion section analyzes the critical factors contributing to open vocabulary object detection, providing a unified perspective on reconstruction loss and contrastive loss and offering an interpretation of feature transfer in open vocabulary scenarios.
URI: https://hdl.handle.net/10356/172024
DOI: 10.32657/10356/172024
Schools: School of Computer Science and Engineering
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
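The contrastive distillation described in the abstract — aligning the detector's region embeddings with CLIP's embeddings in a shared latent space — is commonly realized as an InfoNCE-style objective. The sketch below is an illustrative reconstruction, not the thesis's exact formulation: the function name, the `temperature` value, and the choice of in-batch negatives are all assumptions.

```python
import numpy as np

def contrastive_distillation_loss(region_feats, clip_feats, temperature=0.07):
    """InfoNCE-style loss aligning detector region embeddings with the
    CLIP embeddings of the same regions (illustrative sketch only).

    region_feats, clip_feats: arrays of shape (N, D); row i of each
    describes the same region, forming the positive pair.
    """
    # L2-normalize so the dot product is cosine similarity, as in CLIP.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)

    # Pairwise similarity matrix, sharpened by the temperature.
    logits = r @ c.T / temperature

    # Row-wise log-softmax (with max-subtraction for numerical stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; every other pair in the batch is a
    # negative. Minimizing this pulls matched embeddings together.
    return -np.mean(np.diag(log_probs))
```

Under this formulation, perfectly aligned detector and CLIP features drive the loss toward zero, while misaligned features are penalized relative to the in-batch negatives.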
Appears in Collections: | SCSE Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
MEng_Thesis_Zhang Chuhan_revised.pdf | | 16.12 MB | Adobe PDF
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.