Acoustic Scene Classification Using Various Features and DNN Model: A Monolithic and Hierarchical Approach

Circuits, Systems and Signal Processing (IF 1.8; Zone 3, Engineering & Technology; JCR Q3, Engineering, Electrical & Electronic). Pub Date: 2024-09-06. DOI: 10.1007/s00034-024-02836-6

Chandrasekhar Paseddula, Suryakanth V. Gangashetty


Citations: 0

Abstract

An acoustic scene is a complex phenomenon, so it is difficult to extract scene-specific information from its foreground and background sound sources. Accurately discerning sound scenes and pinpointing distinct sound events in realistic soundscapes still requires further study. A good feature representation is helpful for acoustic scene classification (ASC). This study investigated several common acoustic features for ASC: mel-frequency cepstral coefficients (MFCC), log-mel band energy (LOGMEL), linear prediction cepstral coefficients (LPCC), and all-pole group delay (APGD). To represent acoustic scenes, we also proposed a variety of features drawn from speaker and music recognition: inverted mel-frequency cepstral coefficients, spectral centroid magnitude coefficients, sub-band spectral flux coefficients, and single frequency filtering cepstral coefficients. Using DNN classification models, we investigated how these features affect the classification of acoustic scenes in the DCASE 2017 dataset. Our analysis shows that no single feature performs better than the others across all acoustic scenes. In general, a single classifier may struggle to identify every class as the number of acoustic scenes grows. We therefore proposed a two-level hierarchical classification approach: the first level determines the meta-category of the acoustic scene, and the second performs fine-grained classification within that meta-category. Without DNN score fusion at level 2 as post-processing, the hierarchical approach (81.0%) performs better than the monolithic classification approach (79.9%). The performance of the ASC system can be further improved by exploring more sophisticated complementary features. The monolithic system based on the fusion of MFCC and LOGMEL features achieved an accuracy of 90.5%. The proposed hierarchical system achieves an accuracy of 82.6% with DNN score fusion at level 2 as post-processing.
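
The two-level decision and the level-2 score fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the meta-category grouping (`META_TO_SCENES`), the equal-weight averaging rule, and the toy score values are all assumptions, since the abstract does not specify them. In a real system the scores would come from trained DNNs, one per feature type.

```python
# Hypothetical meta-category grouping; the abstract does not list the actual groups.
META_TO_SCENES = {
    "indoor": ["home", "office", "library"],
    "outdoor": ["park", "beach", "forest_path"],
    "vehicle": ["bus", "car", "train"],
}


def argmax_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def fuse_scores(score_sets, labels, weights=None):
    """Weighted average of per-class scores from several feature-specific models.

    Equal weights are assumed when none are given.
    """
    if weights is None:
        weights = [1.0 / len(score_sets)] * len(score_sets)
    return {
        label: sum(w * s[label] for w, s in zip(weights, score_sets))
        for label in labels
    }


def classify_hierarchical(level1_scores, level2_scores_by_feature):
    """Level 1 picks a meta-category; level 2 fuses per-feature scores within it."""
    meta = argmax_label(level1_scores)
    per_feature = [scores[meta] for scores in level2_scores_by_feature.values()]
    fused = fuse_scores(per_feature, labels=META_TO_SCENES[meta])
    return meta, argmax_label(fused)


# Toy scores standing in for DNN outputs (made-up numbers for illustration only).
level1 = {"indoor": 0.2, "outdoor": 0.7, "vehicle": 0.1}
level2 = {
    "MFCC": {"outdoor": {"park": 0.5, "beach": 0.4, "forest_path": 0.1}},
    "LOGMEL": {"outdoor": {"park": 0.2, "beach": 0.6, "forest_path": 0.2}},
}

meta_category, scene = classify_hierarchical(level1, level2)
print(meta_category, scene)
```

With these toy numbers the MFCC model alone would pick `park`, while the fused scores favor `beach`, which illustrates how score fusion can change the level-2 decision. A monolithic system would instead apply a single `argmax_label` over all scene classes at once.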




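Source journal

Circuits, Systems and Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 4.80
Self-citation rate: 13.00%
Publication volume: 321
Review time: 4.6 months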

About the journal: Rapid developments in the analog and digital processing of signals for communication, control, and computer systems have made the theory of electrical circuits and signal processing a burgeoning area of research and design. The aim of Circuits, Systems, and Signal Processing (CSSP) is to help meet the need for outlets for significant research papers and state-of-the-art review articles in the area. The scope of the journal is broad, ranging from mathematical foundations to practical engineering design. It encompasses, but is not limited to, such topics as linear and nonlinear networks, distributed circuits and systems, multi-dimensional signals and systems, analog filters and signal processing, digital filters and signal processing, statistical signal processing, multimedia, computer aided design, graph theory, neural systems, communication circuits and systems, and VLSI signal processing. The Editorial Board is international, and papers are welcome from throughout the world. The journal is devoted primarily to research papers, but survey, expository, and tutorial papers are also published. Circuits, Systems, and Signal Processing (CSSP) is published twelve times annually.