{"title":"Nonspeech7k dataset: Classification and analysis of human non-speech sound","authors":"Muhammad Mamunur Rashid, Guiqing Li, Chengrui Du","doi":"10.1049/sil2.12233","DOIUrl":null,"url":null,"abstract":"<p>Human non-speech sounds occur during expressions in a real-life environment. Realising a person's incapability to prompt confident expressions by non-speech sounds may assist in identifying premature disorder in medical applications. A novel dataset named Nonspeech7k is introduced that contains a diverse set of human non-speech sounds, such as the sounds of breathing, coughing, crying, laughing, screaming, sneezing, and yawning. The authors then conduct a variety of classification experiments with end-to-end deep convolutional neural networks (CNN) to show the performance of the dataset. First, a set of typical deep classifiers are used to verify the reliability and validity of Nonspeech7k. Involved CNN models include 1D-2D deep CNN EnvNet, deep stack CNN M11, deep stack CNN M18, intense residual block CNN ResNet34, modified M11 named M12, and the authors’ baseline model. Among these, M12 achieves the highest accuracy of 79%. Second, to verify the heterogeneity of Nonspeech7k with respect to two typical datasets, FSD50K and VocalSound, the authors design a series of experiments to analyse the classification performance of deep neural network classifier M12 by using FSD50K, FSD50K + Nonspeech7k, VocalSound, VocalSound + Nonspeech7k as training data, respectively. Experimental results show that the classifier trained with existing datasets mixed with Nonspeech7k achieves the highest accuracy improvement of 15.7% compared to that without Nonspeech7k mixed. Nonspeech7k is 100% annotated, completely checked, and free of noise. It is available at https://doi.org/10.5281/zenodo.6967442.</p>","PeriodicalId":56301,"journal":{"name":"IET Signal Processing","volume":"17 6","pages":""},"PeriodicalIF":1.1000,"publicationDate":"2023-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/sil2.12233","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/sil2.12233","RegionNum":4,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Human non-speech sounds occur during expression in real-life environments. Recognising a person's inability to produce certain expressions through non-speech sounds may assist in the early identification of disorders in medical applications. A novel dataset named Nonspeech7k is introduced that contains a diverse set of human non-speech sounds: breathing, coughing, crying, laughing, screaming, sneezing, and yawning. The authors then conduct a variety of classification experiments with end-to-end deep convolutional neural networks (CNNs) to demonstrate the performance of the dataset. First, a set of typical deep classifiers is used to verify the reliability and validity of Nonspeech7k. The CNN models involved include the 1D-2D deep CNN EnvNet, the deep stack CNNs M11 and M18, the residual-block CNN ResNet34, a modified M11 named M12, and the authors' baseline model. Among these, M12 achieves the highest accuracy, 79%. Second, to verify the heterogeneity of Nonspeech7k with respect to two typical datasets, FSD50K and VocalSound, the authors design a series of experiments analysing the classification performance of the deep neural network classifier M12 trained on FSD50K, FSD50K + Nonspeech7k, VocalSound, and VocalSound + Nonspeech7k, respectively. Experimental results show that a classifier trained on an existing dataset mixed with Nonspeech7k achieves an accuracy improvement of up to 15.7% over the same classifier trained without Nonspeech7k. Nonspeech7k is 100% annotated, completely checked, and free of noise. It is available at https://doi.org/10.5281/zenodo.6967442.
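For orientation, the sketch below shows what an end-to-end raw-waveform classifier of this kind might look like. It is a minimal, hypothetical PyTorch example loosely in the spirit of the M11/M12-style deep stack CNNs named in the abstract; the layer sizes, sampling rate, and clip length are illustrative assumptions, not the authors' released configuration. Only the seven class names come from the paper.

```python
# Minimal sketch of an end-to-end 1D-CNN over raw waveforms, loosely in the
# spirit of the deep stack CNNs (M11/M12) used on Nonspeech7k. All
# hyperparameters below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

NUM_CLASSES = 7  # breathing, coughing, crying, laughing, screaming, sneezing, yawning

class DeepStackCNN(nn.Module):
    """Stacked Conv1d blocks over raw audio: a wide first kernel to form a
    learned front end, then small-kernel stacks and global pooling."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4),   # wide receptive field on raw samples
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(128, 256, kernel_size=3),
            nn.BatchNorm1d(256), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),                      # global pooling -> fixed-size embedding
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples) raw waveform, e.g. a few seconds at 16 kHz
        z = self.features(x).squeeze(-1)                  # (batch, 256)
        return self.classifier(z)                         # (batch, num_classes)

if __name__ == "__main__":
    model = DeepStackCNN()
    waveforms = torch.randn(8, 1, 16000 * 4)  # dummy batch: 8 clips, 4 s at 16 kHz
    logits = model(waveforms)
    print(logits.shape)  # torch.Size([8, 7])
```

Under the same assumptions, the heterogeneity experiments described above would amount to concatenating Nonspeech7k with FSD50K or VocalSound at training time (e.g. via torch.utils.data.ConcatDataset, with labels mapped to a shared scheme) and comparing accuracy against training on the single dataset alone.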
About the journal:
IET Signal Processing publishes research on a diverse range of signal processing and machine learning topics, covering a variety of applications, disciplines, modalities, and techniques in detection, estimation, inference, and classification problems. The research published includes advances in algorithm design for the analysis of single- and multi-dimensional data, sparsity, linear and non-linear systems, recursive and non-recursive digital filters and multi-rate filter banks, as well as a range of topics spanning from sensor array processing and deep convolutional neural network-based approaches to the application of chaos theory, and more.
Topics covered by the scope include, but are not limited to:
advances in single and multi-dimensional filter design and implementation
linear and nonlinear, fixed and adaptive digital filters and multirate filter banks
statistical signal processing techniques and analysis
classical, parametric and higher order spectral analysis
signal transformation and compression techniques, including time-frequency analysis
system modelling and adaptive identification techniques
machine learning based approaches to signal processing
Bayesian methods for signal processing, including Markov chain Monte Carlo and particle filtering techniques
theory and application of blind and semi-blind signal separation techniques
signal processing techniques for analysis, enhancement, coding, synthesis and recognition of speech signals
direction-finding and beamforming techniques for audio and electromagnetic signals
analysis techniques for biomedical signals
baseband signal processing techniques for transmission and reception of communication signals
signal processing techniques for data hiding and audio watermarking
sparse signal processing and compressive sensing
Special Issue Call for Papers:
Intelligent Deep Fuzzy Model for Signal Processing - https://digital-library.theiet.org/files/IET_SPR_CFP_IDFMSP.pdf