Automatic speech recognition and the transcription of indistinct forensic audio: how do the new generation of systems fare?

Frontiers in Communication (IF 1.5, JCR Q2, Communication) | Pub Date: 2024-02-14 | DOI: 10.3389/fcomm.2024.1281407
Debbie Loakes
Citations: 0

Abstract

This study provides an update on an earlier study in the “Capturing Talk” research topic, which aimed to demonstrate how automatic speech recognition (ASR) systems work with indistinct forensic-like audio, in comparison with good-quality audio. Since that time, there has been rapid technological advancement, with newer systems having access to extremely large language models and having their performance proclaimed as being human-like in accuracy. This study compares various ASR systems, including OpenAI’s Whisper, to continue to test how well automatic speech recognition works with forensic-like audio. The results show that the transcription of a good-quality audio file is at ceiling for some systems, with no errors. For the poor-quality (forensic-like) audio, Whisper was the best performing system but had only 50% of the entire speech material correct. The results for the poor-quality audio were also generally variable across the systems, with differences depending on whether a .wav or .mp3 file was used and differences between earlier and later versions of the same system. Additionally, and against expectations, Whisper showed a drop in performance over a 2-month period. While more material was transcribed in the later attempt, more was also incorrect. This study concludes that forensic-like audio is not suitable for automatic analysis.
Source journal metrics
CiteScore: 3.30
Self-citation rate: 8.30%
Articles published: 284
Review time: 14 weeks