Title: Machine learning based audio event recognition
Authors: Lu, Yujing
Keywords: Engineering::Electrical and electronic engineering
Issue Date: 2020
Publisher: Nanyang Technological University
Abstract: Sound is an important information carrier. It conveys abundant information about the environment and is often used to assist environment perception and video surveillance. In audio event recognition, features are extracted from an analysis of the environmental sound, classified, and assigned semantic labels such as beach, library, or forest. Audio scene recognition can be applied in fields such as military reconnaissance, smart homes, security monitoring, and medical monitoring. Deep learning uses neural networks with multiple layers of perceptrons and has achieved great success in image recognition, machine translation, and other applications. It can also serve as the classifier in audio event recognition. Under supervision, deep learning can learn audio features automatically, which overcomes many disadvantages of manual feature selection, including long processing time, heavy manual work, and unstable results. This project therefore studies sound event recognition based on a variety of deep learning models. Deep neural networks with different structures are used for information extraction and representation learning on sound event samples, with the aim of improving the recognition accuracy of sound event recognition systems. First, a DNN-based audio scene recognition system is built, in which Mel-frequency cepstral coefficients (MFCC) are used as audio features and the network consists of 10 dense layers and a dropout layer. This model achieved a training accuracy of 84.5%, but its test accuracy was below 40%. A CNN-based audio scene recognition system is also built.
CNN was chosen because it is currently the most mainstream network structure in deep learning and performs well in image recognition and speech recognition. The system consists of 4 convolutional layers, 4 pooling layers, 1 tiled (flatten) layer, 2 fully connected layers, and a dropout layer, which helps prevent the network from overfitting during training. This model reached a training accuracy of 80.5%, but its test accuracy was only around 77%. Finally, a CRNN-based audio scene recognition model was built, but its accuracy was lower than that of the CNN model and it took longer to train.
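The CNN layer stack described in the abstract (4 convolutional layers, 4 pooling layers, 1 tiled layer, 2 fully connected layers, and a dropout layer) can be illustrated with a small shape trace. This is only a sketch under stated assumptions, not the thesis implementation: the abstract gives no input size, filter counts, kernel sizes, or number of classes, so the 40 x 128 MFCC input, 3 x 3 kernels with "same" padding, 2 x 2 pooling, the filter counts, the dense widths, and the placement of dropout are all hypothetical values chosen for illustration. The "tiled" layer is assumed to be a flatten layer.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_cnn(input_hw=(40, 128), in_channels=1,
              filters=(16, 32, 64, 128), kernel=3, pool=2,
              dense_units=(128, 10)):
    """Trace output shapes and count trainable parameters.

    All default values are assumptions for illustration; the abstract
    only specifies the number and type of layers.
    """
    h, w = input_hw
    c = in_channels
    layers = []
    params = 0

    # 4 x (convolution -> pooling), one pair per entry in `filters`
    for f in filters:
        pad = kernel // 2  # "same" padding for an odd kernel size
        h = conv_out(h, kernel, 1, pad)
        w = conv_out(w, kernel, 1, pad)
        params += (kernel * kernel * c + 1) * f  # weights + one bias per filter
        c = f
        layers.append(("conv", (c, h, w)))

        h = conv_out(h, pool, pool)
        w = conv_out(w, pool, pool)
        layers.append(("pool", (c, h, w)))

    # "Tiled" layer, assumed here to flatten the feature maps into a vector
    n_in = c * h * w
    layers.append(("flatten", (n_in,)))

    # 2 fully connected layers, with dropout assumed after the first one
    for i, units in enumerate(dense_units):
        params += (n_in + 1) * units  # weights + one bias per unit
        layers.append(("dense", (units,)))
        if i == 0:
            layers.append(("dropout", (units,)))  # dropout has no parameters
        n_in = units

    return layers, params


if __name__ == "__main__":
    layers, params = trace_cnn()
    for kind, shape in layers:
        print(f"{kind:8s} {shape}")
    print("trainable parameters:", params)
```

With these assumed sizes, most of the parameters sit in the first fully connected layer, which is one reason a dropout layer is commonly placed there to limit overfitting.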
Fulltext Permission: restricted
Fulltext Availability: With Fulltext
Appears in Collections: EEE Theses

Files in This Item:
Description: Restricted Access
Size: 2.06 MB
Format: Adobe PDF

Page view(s)

Updated on Feb 2, 2023


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.