{"title":"Toward end-to-end interpretable convolutional neural networks for waveform signals","authors":"Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan","doi":"arxiv-2405.01815","DOIUrl":null,"url":null,"abstract":"This paper introduces a novel convolutional neural networks (CNN) framework\ntailored for end-to-end audio deep learning models, presenting advancements in\nefficiency and explainability. By benchmarking experiments on three standard\nspeech emotion recognition datasets with five-fold cross-validation, our\nframework outperforms Mel spectrogram features by up to seven percent. It can\npotentially replace the Mel-Frequency Cepstral Coefficients (MFCC) while\nremaining lightweight. Furthermore, we demonstrate the efficiency and\ninterpretability of the front-end layer using the PhysioNet Heart Sound\nDatabase, illustrating its ability to handle and capture intricate long\nwaveform patterns. Our contributions offer a portable solution for building\nefficient and interpretable models for raw waveform data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"111 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.01815","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, offering advances in both efficiency and explainability. In benchmarking experiments on three standard speech emotion recognition datasets with five-fold cross-validation, the framework outperforms Mel spectrogram features by up to seven percent, and it can potentially replace Mel-frequency cepstral coefficients (MFCCs) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer on the PhysioNet Heart Sound Database, illustrating its ability to capture intricate patterns in long waveforms. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.
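To make the idea concrete, here is a minimal sketch of a learnable waveform front-end of the general kind the abstract describes: a bank of 1-D convolutions applied directly to raw audio, standing in for a fixed Mel-spectrogram or MFCC extractor. This is an assumption-laden illustration, not the authors' architecture; the class name, filter count, kernel size, and stride are all hypothetical choices.

```python
# Hypothetical sketch of a learnable front-end for raw waveforms (NOT the
# paper's exact design): a 1-D convolutional filter bank whose output plays
# the role of a spectrogram, but whose filters are learned end-to-end.
import torch
import torch.nn as nn


class ConvFrontEnd(nn.Module):
    def __init__(self, n_filters: int = 40, kernel_size: int = 401, stride: int = 160):
        super().__init__()
        # Each output channel learns a band-pass-like filter from data;
        # the stride acts like the hop length of a spectrogram.
        self.filters = nn.Conv1d(
            in_channels=1,
            out_channels=n_filters,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw audio, e.g. at 16 kHz
        x = self.filters(waveform.unsqueeze(1))  # (batch, n_filters, frames)
        # Log compression of the rectified response, analogous to a
        # log-Mel spectrogram, keeps the features in a trainable range.
        return torch.log1p(x.abs())


if __name__ == "__main__":
    frontend = ConvFrontEnd()
    audio = torch.randn(2, 16000)  # two 1-second dummy clips
    features = frontend(audio)
    print(features.shape)  # torch.Size([2, 40, 100])
```

Because the filters are ordinary convolution weights, they can be inspected after training (e.g., by plotting each kernel's frequency response), which is one common route to the kind of front-end interpretability the paper targets.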