Classification of Speech vs. Speech with Background Music
Authors: Mrinmoy Bhattacharjee, S. Prasanna, P. Guha
Venue: 2020 International Conference on Signal Processing and Communications (SPCOM), July 2020
DOI: 10.1109/SPCOM50965.2020.9179491
Citations: 1
Abstract
Applications that enhance speech containing background music require a critical preprocessing step that can efficiently detect such segments. This work proposes a preprocessing method to detect speech with background music mixed at different SNR levels, using a bag-of-words approach. Representative dictionaries are first learned from speech and music data: the signals are processed as spectrograms over 1 s intervals, and the rows of these spectrograms are used to learn separate speech and music dictionaries. A weighting scheme is proposed to reduce confusion by suppressing codewords of one class that are similar to the other class. The proposed feature is a weighted histogram of codeword assignments, computed over 1 s audio intervals from the learned dictionaries. Classification is performed with a deep neural network classifier. The approach is validated against a baseline and benchmarked on two publicly available datasets. The proposed feature shows promising results, both individually and in combination with the baseline.
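The pipeline the abstract describes (dictionary learning over spectrogram rows, then a weighted codeword histogram per 1 s interval) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the abstract does not state the dictionary-learning algorithm, so k-means with nearest-neighbour codeword assignment is assumed here, and `cross_class_weights` stands in for the paper's unspecified suppression scheme.

```python
import numpy as np

def learn_dictionary(rows, n_codewords, n_iter=20, seed=0):
    """Toy k-means dictionary learner over spectrogram rows.
    (Assumption: the paper's dictionary-learning method is not given
    in the abstract; k-means is a common choice for bag-of-words.)"""
    rng = np.random.default_rng(seed)
    centers = rows[rng.choice(len(rows), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign each spectrogram row to its nearest codeword
        d = np.linalg.norm(rows[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(labels == k):
                centers[k] = rows[labels == k].mean(axis=0)
    return centers

def bow_histogram(rows, speech_dict, music_dict, cross_class_weights):
    """Weighted, normalized histogram of codeword assignments for the
    spectrogram rows of one 1 s interval. `cross_class_weights` is a
    hypothetical per-codeword weight vector that down-weights codewords
    of one class that resemble the other class."""
    dictionary = np.vstack([speech_dict, music_dict])
    d = np.linalg.norm(rows[:, None, :] - dictionary[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(dictionary)).astype(float)
    hist *= cross_class_weights
    total = hist.sum()
    return hist / total if total > 0 else hist
```

The resulting fixed-length histogram per interval is what would then be fed to a classifier such as the deep neural network mentioned in the abstract.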