Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/150772
Title: Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection
Authors: Gao, Wei
Liao, Guibiao
Ma, Siwei
Li, Ge
Liang, Yongsheng
Lin, Weisi
Keywords: Engineering::Computer science and engineering
Issue Date: 2021
Source: Gao, W., Liao, G., Ma, S., Li, G., Liang, Y. & Lin, W. (2021). Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection. IEEE Transactions on Circuits and Systems for Video Technology. https://dx.doi.org/10.1109/TCSVT.2021.3082939
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Abstract: The use of complementary information, namely depth or thermal information, has shown its benefits to salient object detection (SOD) during recent years. However, the RGB-D or RGB-T SOD problems are currently only solved independently, and most of them directly extract and fuse raw features from backbones. Such methods can be easily restricted by low-quality modality data and redundant cross-modal features. In this work, a unified end-to-end framework is designed to simultaneously analyze RGB-D and RGB-T SOD tasks. Specifically, to effectively tackle multi-modal features, we propose a novel multi-stage and multi-scale fusion network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the visual color stage doctrine in the human visual system (HVS), the proposed CMFM aims to explore important feature representations in the feature response stage, and integrate them into cross-modal features in the adversarial combination stage. Moreover, the proposed BMD learns the combination of multi-level cross-modal fused features to capture both local and global information of salient objects, and can further boost the multi-modal SOD performance. The proposed unified cross-modality feature analysis framework based on two-stage and multi-scale information fusion can be used for diverse multi-modal SOD tasks. Comprehensive experiments (∼92K image-pairs) demonstrate that the proposed method consistently outperforms the other 21 state-of-the-art methods on nine benchmark datasets. This validates that our proposed method can work well on diverse multi-modal SOD tasks with good generalization and robustness, and provides a good multi-modal SOD benchmark.
URI: https://hdl.handle.net/10356/150772
ISSN: 1051-8215
DOI: 10.1109/TCSVT.2021.3082939
Rights: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: https://doi.org/10.1109/TCSVT.2021.3082939.
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Journal Articles

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.