RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction

IF 5.1 2区计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI:10.1109/TASLP.2024.3473302

Takahiro Fukumori;Taito Ishida;Yoichi Yamashita

{"title":"RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction","authors":"Takahiro Fukumori;Taito Ishida;Yoichi Yamashita","doi":"10.1109/TASLP.2024.3473302","DOIUrl":null,"url":null,"abstract":"The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains wide variety types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4434-4444"},"PeriodicalIF":5.1000,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704045","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10704045/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains wide variety types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

RISC：用于呼喊类型分类和呼喊强度预测的语料库

在音频监控中，对喊叫语音的检测至关重要。虽然安防系统最好能够识别紧急情况，但现有的语料库只为每个语音样本提供二进制标签（即喊叫或正常），因此很难预测喊叫强度。此外，大多数语料库只包含危险情况下的典型语句，这意味着分类器无法学习如何将此类语句与欢呼等危险性较低情况下的典型喊叫区分开来。因此，本文提出了一个新的研究来源--RItsumeikan 喊声语料库（RISC），其中包含在录音实验中收集的各种类型的喊声语音样本。RISC 中的每个呐喊语音样本都有一个呐喊类型，并通过众包服务为其分配了呐喊强度评级。我们还介绍了深度学习方法在语音类型分类任务和呐喊强度预测任务中的综合性能比较。结果表明，无论使用哪种网络架构，基于频谱和倒频谱域的特征学习都能实现高性能。结果还表明，喊叫声类型分类和强度预测仍然是具有挑战性的任务，RISC有望为这一研究领域的进一步发展做出贡献。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

期刊最新文献

List of Reviewers IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation Online Neural Speaker Diarization With Target Speaker Tracking Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach