
Latest publications: SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology

SPEECH RECOGNITION FOR ANALYSIS OF POLICE RADIO COMMUNICATION.
Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul

Police departments around the world use two-way radio for coordination. These broadcast police communications (BPC) are a unique source of information about everyday police activity and emergency response. Yet BPC are not transcribed, and their naturalistic audio properties make automatic transcription challenging. We collect a corpus of roughly 62,000 manually transcribed radio transmissions (~46 hours of audio) to evaluate the feasibility of automatic speech recognition (ASR) using modern recognition models. We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription is challenging in this domain. Large off-the-shelf ASR models perform poorly, but fine-tuned models can reach the approximate range of human performance. Our work suggests directions for future work, including analysis of short utterances and potential miscommunication in police radio interactions. We make our corpus and data annotation pipeline available to other researchers, to enable further research on recognition and analysis of police communication.
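Below is a minimal sketch, not from the paper, of the kind of off-the-shelf evaluation the abstract describes: transcribing each transmission with a generic Whisper checkpoint and scoring word error rate against the manual transcripts. The manifest file name and its JSON-lines format are assumptions for illustration.

```python
# Hedged sketch: evaluate an off-the-shelf recognizer on a manually
# transcribed radio corpus. "bpc_manifest.jsonl" and its fields are
# hypothetical; any Whisper-style checkpoint could stand in for the model.
import json
from transformers import pipeline   # pip install transformers
from jiwer import wer               # pip install jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

refs, hyps = [], []
with open("bpc_manifest.jsonl") as f:   # one {"audio": ..., "text": ...} per line (assumed)
    for line in f:
        item = json.loads(line)
        refs.append(item["text"].lower())
        hyps.append(asr(item["audio"])["text"].lower())  # transmissions are short clips

print(f"Word error rate over {len(refs)} transmissions: {wer(refs, hyps):.3f}")
```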

{"title":"SPEECH RECOGNITION FOR ANALYSIS OF POLICE RADIO COMMUNICATION.","authors":"Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul","doi":"10.1109/slt61566.2024.10832157","DOIUrl":"10.1109/slt61566.2024.10832157","url":null,"abstract":"<p><p>Police departments around the world use two-way radio for coordination. These broadcast police communications (BPC) are a unique source of information about everyday police activity and emergency response. Yet BPC are not transcribed, and their naturalistic audio properties make automatic transcription challenging. We collect a corpus of roughly 62,000 manually transcribed radio transmissions (<sup>~</sup>46 hours of audio) to evaluate the feasibility of automatic speech recognition (ASR) using modern recognition models. We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription is challenging in this domain. Large off-the-shelf ASR models perform poorly, but fine-tuned models can reach the approximate range of human performance. Our work suggests directions for future work, including analysis of short utterances and potential miscommunication in police radio interactions. We make our corpus and data annotation pipeline available to other researchers, to enable further research on recognition and analysis of police communication.</p>","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"2024 ","pages":"906-912"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12180137/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144478193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
STUTTER-SOLVER: END-TO-END MULTI-LINGUAL DYSFLUENCY DETECTION.
Xuanru Zhou, Cheol Jun Cho, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Boon Lead Tee, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli

Current de-facto dysfluency modeling methods [1, 2] utilize template matching algorithms which are not generalizable to out-of-domain real-world dysfluencies across languages, and are not scalable with increasing amounts of training data. To handle these problems, we propose Stutter-Solver: an end-to-end framework that detects dysfluency with accurate type and time transcription, inspired by the YOLO [3] object detection algorithm. Stutter-Solver can handle co-dysfluencies and is a natural multi-lingual dysfluency detector. To leverage scalability and boost performance, we also introduce three novel dysfluency corpora: VCTK-Pro, VCTK-Art, and AISHELL3-Pro, simulating natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation through articulatory-encodec and TTS-based methods. Our approach achieves state-of-the-art performance on all available dysfluency corpora. Code and datasets are open-sourced at https://github.com/eureka235/Stutter-Solver.
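The detection idea can be pictured with a small sketch. The following PyTorch module is our illustrative reconstruction, not the released Stutter-Solver code: applying the YOLO recipe to one-dimensional audio, every encoded frame emits an objectness score, class logits over the five dysfluency types named in the abstract, and a (center-offset, width) box.

```python
# Hedged sketch of YOLO-style detection over audio frames. Layer sizes and
# the feature dimension are assumptions; only the head layout follows the
# abstract's description of type-plus-time transcription.
import torch
import torch.nn as nn

NUM_CLASSES = 5  # repetition, block, missing, replacement, prolongation

class Dysfluency1DDetector(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # per-frame head: 1 objectness + NUM_CLASSES logits + 2 box parameters
        self.head = nn.Conv1d(hidden, 1 + NUM_CLASSES + 2, kernel_size=1)

    def forward(self, feats):                  # feats: (batch, feat_dim, frames)
        out = self.head(self.encoder(feats))   # (batch, 1+NUM_CLASSES+2, frames)
        objectness = torch.sigmoid(out[:, :1])           # P(dysfluency centered here)
        class_logits = out[:, 1:1 + NUM_CLASSES]
        boxes = out[:, 1 + NUM_CLASSES:]                  # (center offset, log width)
        return objectness, class_logits, boxes

x = torch.randn(2, 80, 300)                    # e.g. 3 s of 10 ms mel frames
obj, cls, box = Dysfluency1DDetector()(x)
print(obj.shape, cls.shape, box.shape)         # (2,1,300) (2,5,300) (2,2,300)
```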

{"title":"STUTTER-SOLVER: END-TO-END MULTI-LINGUAL DYSFLUENCY DETECTION.","authors":"Xuanru Zhou, Cheol Jun Cho, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Boon Lead Tee, Maria Luisa Gorno-Tempini, Jiachen Lian, Gopala Anumanchipalli","doi":"10.1109/slt61566.2024.10832222","DOIUrl":"10.1109/slt61566.2024.10832222","url":null,"abstract":"<p><p>Current de-facto dysfluency modeling methods [1, 2] utilize template matching algorithms which are not generalizable to out-of-domain real-world dysfluencies across languages, and are not scalable with increasing amounts of training data. To handle these problems, we propose <i>Stutter-Solver</i>: an end-to-end framework that detects dysfluency with accurate type and time transcription, inspired by the YOLO [3] object detection algorithm. <i>Stutter-Solver</i> can handle <i>co-dysfluencies</i> and is a natural multi-lingual dysfluency detector. To leverage scalability and boost performance, we also introduce three novel dysfluency corpora: <i>VCTK-Pro</i>, <i>VCTK-Art</i>, and <i>AISHELL3-Pro</i>, simulating natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation through articulatory-encodec and TTS-based methods. Our approach achieves <i>state-of-the-art</i> performance on all available dysfluency corpora. Code and datasets are open-sourced at https://github.com/eureka235/Stutter-Solver.</p>","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"2024 ","pages":"1039-1046"},"PeriodicalIF":0.0,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12233913/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144585834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
STYLETTS-VC: ONE-SHOT VOICE CONVERSION BY KNOWLEDGE TRANSFER FROM STYLE-BASED TTS MODELS.
Yinghao Aaron Li, Cong Han, Nima Mesgarani

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity from the speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach yields disentangled speech representations without needing the input text. The subjective evaluation shows that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.
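As a rough illustration of the teacher-student step, the sketch below trains a hypothetical mel-spectrogram encoder to match frame-aligned representations from the TTS text encoder, here replaced by a random stand-in tensor. It is a schematic assumption, not the StyleTTS-VC implementation.

```python
# Hedged sketch: distill a text-encoder representation into a mel encoder
# so conversion no longer needs input text. Shapes and the teacher stand-in
# are illustrative assumptions.
import torch
import torch.nn as nn

class MelEncoder(nn.Module):               # student
    def __init__(self, n_mels=80, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, 5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, 5, padding=2),
        )
    def forward(self, mel):                # (batch, n_mels, frames)
        return self.net(mel)               # (batch, dim, frames)

student = MelEncoder()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

mel = torch.randn(4, 80, 200)              # batch of mel spectrograms
with torch.no_grad():
    teacher_repr = torch.randn(4, 512, 200)  # stand-in for frame-aligned text-encoder output

opt.zero_grad()
loss = nn.functional.l1_loss(student(mel), teacher_repr)  # match the teacher frame by frame
loss.backward()
opt.step()
print(f"distillation L1 loss: {loss.item():.3f}")
```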

{"title":"STYLETTS-VC: ONE-SHOT VOICE CONVERSION BY KNOWLEDGE TRANSFER FROM STYLE-BASED TTS MODELS.","authors":"Yinghao Aaron Li,&nbsp;Cong Han,&nbsp;Nima Mesgarani","doi":"10.1109/slt54892.2023.10022498","DOIUrl":"https://doi.org/10.1109/slt54892.2023.10022498","url":null,"abstract":"<p><p>One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that still remains challenging. Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models. With cycle consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through a teacher-student knowledge transfer and novel data augmentation scheme, our approach results in disentangled speech representation without needing the input text. The subjective evaluation shows that our approach can significantly outperform the previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.</p>","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"2022 ","pages":"920-927"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10417535/pdf/nihms-1919646.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9990482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
COMPUTATIONAL ANALYSIS OF TRAJECTORIES OF LINGUISTIC DEVELOPMENT IN AUTISM.
Emily Prud'hommeaux, Eric Morley, Masoud Rouhizadeh, Laura Silverman, Jan van Santen, Brian Roark, Richard Sproat, Sarah Kauper, Rachel DeLaHunta

Deficits in semantic and pragmatic expression are among the hallmark linguistic features of autism. Recent work in deriving computational correlates of clinical spoken language measures has demonstrated the utility of automated linguistic analysis for characterizing the language of children with autism. Most of this research, however, has focused either on young children still acquiring language or on small populations covering a wide age range. In this paper, we extract numerous linguistic features from narratives produced by two groups of children with and without autism from two narrow age ranges. We find that although many differences between diagnostic groups remain constant with age, certain pragmatic measures, particularly the ability to remain on topic and avoid digressions, seem to improve. These results confirm findings reported in the psychology literature while underscoring the need for careful consideration of the age range of the population under investigation when performing clinically oriented computational analysis of spoken language.
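One way to make such a pragmatic measure concrete is sketched below. The "topic maintenance" score is our own illustrative stand-in, not the authors' feature set: the mean pairwise TF-IDF cosine similarity among a narrative's sentences, so frequent digressions pull the score down.

```python
# Hedged sketch of one possible "remaining on topic" feature; the definition
# is invented for illustration, not taken from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topic_maintenance(sentences):
    """Mean off-diagonal cosine similarity among a narrative's sentences."""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # (n_sents, vocab)
    sims = cosine_similarity(tfidf)                      # pairwise matrix, 1.0 on the diagonal
    n = len(sentences)
    return float((sims.sum() - n) / (n * (n - 1)))

story = [
    "the boy lost his frog",
    "he searched the woods for the frog",
    "my sister has a red bicycle",      # a digression lowers the score
    "finally the boy found his frog by the pond",
]
print(f"topic-maintenance score: {topic_maintenance(story):.2f}")
```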

{"title":"COMPUTATIONAL ANALYSIS OF TRAJECTORIES OF LINGUISTIC DEVELOPMENT IN AUTISM.","authors":"Emily Prud'hommeaux,&nbsp;Eric Morley,&nbsp;Masoud Rouhizadeh,&nbsp;Laura Silverman,&nbsp;Jan van Santen,&nbsp;Brian Roark,&nbsp;Richard Sproat,&nbsp;Sarah Kauper,&nbsp;Rachel DeLaHunta","doi":"10.1109/SLT.2014.7078585","DOIUrl":"https://doi.org/10.1109/SLT.2014.7078585","url":null,"abstract":"<p><p>Deficits in semantic and pragmatic expression are among the hallmark linguistic features of autism. Recent work in deriving computational correlates of clinical spoken language measures has demonstrated the utility of automated linguistic analysis for characterizing the language of children with autism. Most of this research, however, has focused either on young children still acquiring language or on small populations covering a wide age range. In this paper, we extract numerous linguistic features from narratives produced by two groups of children with and without autism from two narrow age ranges. We find that although many differences between diagnostic groups remain constant with age, certain pragmatic measures, particularly the ability to remain on topic and avoid digressions, seem to improve. These results confirm findings reported in the psychology literature while underscoring the need for careful consideration of the age range of the population under investigation when performing clinically oriented computational analysis of spoken language.</p>","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"2014 ","pages":"266-271"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/SLT.2014.7078585","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35532885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
ROBUST DETECTION OF VOICED SEGMENTS IN SAMPLES OF EVERYDAY CONVERSATIONS USING UNSUPERVISED HMMS.
Meysam Asgari, Izhak Shafran, Alireza Bayestehtashk

We investigate methods for detecting voiced segments in everyday conversations from ambient recordings. Such recordings contain a high diversity of background noise, making it difficult or infeasible to collect representative labelled samples for estimating noise-specific HMMs. The popular utility get-f0 and its derivatives compute normalized cross-correlation for detecting voiced segments, which unfortunately is sensitive to different types of noise. Exploiting the fact that voiced speech is not just periodic but also rich in harmonics, we model voiced segments by adopting harmonic models, which have recently gained considerable attention. In previous work, the parameters of the model were estimated independently for each frame using a maximum likelihood criterion. However, since the distribution of harmonic coefficients depends on the speaker's articulators, we estimate the model parameters more robustly using a maximum a posteriori criterion. We use the likelihood of voicing, computed from the harmonic model, as the observation probability of an HMM and detect speech using this unsupervised HMM. The one caveat of the harmonic model is that it fails to distinguish speech from other stationary harmonic noise. We rectify this weakness by taking advantage of the non-stationary property of speech. We evaluate our models empirically on a task of detecting speech in a large corpus of everyday speech and demonstrate that these models perform significantly better than the standard voice detection algorithms employed in popular tools.
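The decoding idea can be sketched compactly: fit a harmonic basis to each frame by least squares, treat the fraction of frame energy it explains as a voicing score, and smooth the scores with a two-state HMM via Viterbi. The transition probability and emission heuristic below are illustrative assumptions, not the paper's MAP-estimated model.

```python
# Hedged sketch: harmonic-fit voicing scores smoothed by a two-state HMM.
import numpy as np

def harmonic_voicing_score(frame, f0, sr, n_harm=5):
    """Fraction of frame energy explained by a least-squares harmonic fit at f0."""
    t = np.arange(len(frame)) / sr
    basis = [np.ones_like(t)]
    for k in range(1, n_harm + 1):
        basis += [np.cos(2 * np.pi * k * f0 * t), np.sin(2 * np.pi * k * f0 * t)]
    A = np.stack(basis, axis=1)
    coef, *_ = np.linalg.lstsq(A, frame, rcond=None)
    resid = frame - A @ coef
    return 1.0 - np.sum(resid**2) / (np.sum(frame**2) + 1e-9)

def viterbi_voiced(scores, p_stay=0.95):
    """Two-state (0=unvoiced, 1=voiced) Viterbi over per-frame scores in [0,1]."""
    log_emit = np.log(np.stack([1 - scores, scores], axis=1) + 1e-9)
    log_trans = np.log(np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]]))
    dp = np.zeros((len(scores), 2)); back = np.zeros_like(dp, dtype=int)
    dp[0] = np.log(0.5) + log_emit[0]
    for i in range(1, len(scores)):
        cand = dp[i - 1][:, None] + log_trans          # [prev, cur]
        back[i] = cand.argmax(axis=0)
        dp[i] = cand.max(axis=0) + log_emit[i]
    path = [int(dp[-1].argmax())]
    for i in range(len(scores) - 1, 0, -1):            # backtrace
        path.append(back[i][path[-1]])
    return np.array(path[::-1])

sr = 16000
t = np.arange(sr) / sr
audio = np.where(t < 0.5, np.sin(2 * np.pi * 120 * t), 0.05 * np.random.randn(sr))
frames = audio.reshape(-1, 400)                        # 25 ms frames
scores = np.array([harmonic_voicing_score(f, 120.0, sr) for f in frames])
print(viterbi_voiced(np.clip(scores, 0, 1)))           # ~1s for the tone half, 0s after
```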

{"title":"ROBUST DETECTION OF VOICED SEGMENTS IN SAMPLES OF EVERYDAY CONVERSATIONS USING UNSUPERVISED HMMS.","authors":"Meysam Asgari, Izhak Shafran, Alireza Bayestehtashk","doi":"10.1109/slt.2012.6424264","DOIUrl":"10.1109/slt.2012.6424264","url":null,"abstract":"<p><p>We investigate methods for detecting voiced segments in everyday conversations from ambient recordings. Such recordings contain high diversity of background noise, making it difficult or infeasible to collect representative labelled samples for estimating noise-specific HMM models. The popular utility <i>get-f0</i> and its derivatives compute normalized cross-correlation for detecting voiced segments, which unfortunately is sensitive to different types of noise. Exploiting the fact that voiced speech is not just periodic but also rich in harmonic, we model voiced segments by adopting harmonic models, which have recently gained considerable attention. In previous work, the parameters of the model were estimated independently for each frame using maximum likelihood criterion. However, since the distribution of harmonic coefficients depend on articulators of speakers, we estimate the model parameters more robustly using a maximum <i>a posteriori</i> criterion. We use the likelihood of voicing, computed from the harmonic model, as an observation probability of an HMM and detect speech using this unsupervised HMM. The one caveat of the harmonic model is that they fail to distinguish speech from other stationary harmonic noise. We rectify this weakness by taking advantage of the non-stationary property of speech. We evaluate our models empirically on a task of detecting speech on a large corpora of everyday speech and demonstrate that these models perform significantly better than standard voice detection algorithm employed in popular tools.</p>","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"2012 ","pages":"438-442"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7909075/pdf/nihms-1670854.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25414977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient prior and incremental beam width control to suppress excessive speech recognition time based on score range estimation
Satoshi Kobashikawa, Takaaki Hori, Y. Yamaguchi, Taichi Asami, H. Masataki, Satoshi Takahashi
{"title":"Efficient prior and incremental beam width control to suppress excessive speech recognition time based on score range estimation","authors":"Satoshi Kobashikawa, Takaaki Hori, Y. Yamaguchi, Taichi Asami, H. Masataki, Satoshi Takahashi","doi":"10.1109/SLT.2012.6424209","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424209","url":null,"abstract":"","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"214 1","pages":"125-130"},"PeriodicalIF":0.0,"publicationDate":"2012-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72783333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Speech Technology Opportunities and Challenges
D. Nahamoo
Summary form only given. Two forces are in pursuit of discovering the possibilities of speech technology automation. First is the global research and development community, which has been hard at work improving the performance and usability of the technology. Second is the business community, which constantly evaluates the performance of the technology against the expectations of the user community for delivering solutions such as a spoken car navigation system. While the performance improvement has been on a constant positive progress curve, the market opportunity has been on a much more uncertain curve. For example, the early vision of delivering a dictation solution has been on hold in recent years, while it enjoyed enormous interest in the 1990s. At the same time, some industry experts predict that this vision will be fulfilled soon because of the usability needs of the billions of mobile devices in use today. Analogies can be drawn for the use of speech technologies for call-center self-service interaction. While we have seen a much bigger market success there, some industry experts predict that web self-service will slow down the use of speech self-service. So, where does the truth lie? Which market opportunities are clear winners? What opportunities will open up in the future, and what are their technical challenges? In this talk, we will address some of these questions.
{"title":"Speech Technology Opportunities and Challenges","authors":"D. Nahamoo","doi":"10.1109/SLT.2006.326778","DOIUrl":"https://doi.org/10.1109/SLT.2006.326778","url":null,"abstract":"Summary form only given. Two forces are in pursuit of discovering the possibilities of speech technology automation. First is the global research and development community which has been hard at work for improving the performance and usability of the technology. Second is the business community which constantly evaluates the performance of the technology against the expectation of the user community for delivering solutions such as a spoken car navigation system. While the performance improvement has been on a constant positive progress curve, the market opportunity has been on a much more uncertain curve. For example, the early vision of delivering a dictation solution has been on hold in recent years while it enjoyed enormous interest in the 90 s. At the same time, some industry experts predict that this vision will be fulfilled soon because of the usability needs of billions of mobile devices in use today. Analogies can be drawn for the use of speech technologies for call centers self service interaction. While we have seen a much bigger market success, some industry experts predict that web self services will slow down the use of speech self service. So, where does the truth lie? What are those market opportunities that are clear winners? What opportunities will open up in future and what are their technical challenges? In this talk, we will address some of these questions.","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"34 1","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82726673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Information Extraction from speech
J. Makhoul
Summary form only given. The state of the art in automatic speech recognition has reached the point that searching for and extracting information from large speech repositories or streaming audio has become a growing reality. This paper summarizes the technologies that have been instrumental in making audio as searchable as text, including speech recognition, speaker clustering, segmentation, and identification; topic classification; and story segmentation. Once speech is turned into text, information extraction methods can then be applied, such as named entity extraction, finding relationships between named entities, and resolution of anaphoric references. Examples of deployed systems for information extraction from speech, which incorporate some of the aforementioned technologies, will be given.
{"title":"Information Extraction from speech","authors":"J. Makhoul","doi":"10.1109/SLT.2006.326780","DOIUrl":"https://doi.org/10.1109/SLT.2006.326780","url":null,"abstract":"Summary form only given. The state of the art in automatic speech recognition has reached the point that searching for and extracting information from large speech repositories or streaming audio has become a growing reality. This paper summarizes the technologies that have been instrumental in making audio as searchable as text, including speech recognition, speaker clustering, segmentation, and identification; topic classification; and story segmentation. Once speech is turned into text, information extraction methods can then be applied, such as named entity extraction, finding relationships between named entities, and resolution of anaphoric references. Examples of deployed systems for information extraction from speech, which incorporate some of the aforementioned technologies, will be given.","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"38 1","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80964566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
No More Strings, please
Kevin Knight
Summary form only given. In natural language research, many (grammar) trees were felled in 1992, to make room for the highly successful string-based HMM industry. A small literature survived on parsing (putting a tree on a string) and syntactic language modeling (putting a weight on a string). However, trees are making a comeback. Tree transformations are turning out to be very useful in large-scale machine translation (MT), and we will cover recent developments in this area. Most of the tree techniques used in MT turn out to be generic, leading to tools and software for manipulating tree automata in general. Tree acceptors and transducers generalize HMM techniques to the world of trees, raising many interesting theoretical and practical problems.
{"title":"No More Strings, please","authors":"Kevin Knight","doi":"10.1109/SLT.2006.326779","DOIUrl":"https://doi.org/10.1109/SLT.2006.326779","url":null,"abstract":"Summary form only given. In natural language research, many (grammar) trees were felled in 1992, to make room for the highly successful string-based HMM industry. A small literature survived on parsing (putting a tree on a string) and syntactic language modeling (putting a weight on a string). However, trees are making a comeback. Tree transformations are turning out to be very useful in large-scale machine translation (MT), and we will cover recent developments in this area. Most of the tree techniques used in MT turn out to be generic, leading to tools and software for manipulating tree automata in general. Tree acceptors and transducers generalize HMM techniques to the world of trees, raising many interesting theoretical and practical problems.","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"60 1","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82633077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Widening the NLP Pipeline for spoken Language Processing
S. Bangalore
Summary form only given. A typical text-based natural language application (e.g., machine translation, summarization, information extraction) consists of a pipeline of preprocessing steps such as tokenization, stemming, part-of-speech tagging, named entity detection, chunking, and parsing. Information flows downstream through the preprocessing steps along a narrow pipe: each step transforms a single input string into a single best-solution string. However, this narrow pipe is limiting for two reasons. First, since each of the preprocessing steps is error-prone, producing a single best solution can magnify error propagation down the pipeline. Second, the preprocessing steps are forced to resolve genuine ambiguity prematurely. While widening the pipeline can potentially benefit text-based language applications, it is imperative for spoken language processing, where the output of the speech recognizer is typically a word lattice/graph. In this talk, we review how such a goal has been accomplished in tasks such as spoken language understanding, speech translation, and multimodal language processing. We will also sketch methods that encode the preprocessing steps as finite-state transductions in order to exploit composition of finite-state transducers as a general constraint-propagation method.
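A toy example of that composition view, with an invented lattice and tag table: the recognizer's word lattice is composed with a one-state tagging transducer, so both word and tag ambiguity survive until a final shortest-path search.

```python
# Hedged sketch: lattice-preserving composition instead of 1-best pipelining.
# The lattice, tag table, and costs are invented for illustration.
lattice = [  # (from_state, to_state, word, cost) from a hypothetical recognizer
    (0, 1, "flies", 0.4), (0, 1, "flees", 0.9),
    (1, 2, "like", 0.1),
    (2, 3, "a", 0.1),
    (3, 4, "flower", 0.3), (3, 4, "flour", 0.5),
]
tagger = {  # one-state transducer: word -> possible (tag, cost) pairs
    "flies": [("NOUN", 0.2), ("VERB", 0.3)], "flees": [("VERB", 0.1)],
    "like": [("PREP", 0.2), ("VERB", 0.4)], "a": [("DET", 0.0)],
    "flower": [("NOUN", 0.0)], "flour": [("NOUN", 0.0)],
}

def compose(lattice, tagger):
    """Pair each lattice arc with every tag arc for its word, adding costs."""
    return [(s, d, w, tag, c + tc)
            for (s, d, w, c) in lattice
            for (tag, tc) in tagger.get(w, [])]

def best_path(arcs, start=0, final=4):
    """Cheapest start->final path; states are assumed topologically numbered."""
    best = {start: (0.0, [])}
    for s, d, w, tag, c in sorted(arcs, key=lambda a: a[0]):
        if s in best and (d not in best or best[s][0] + c < best[d][0]):
            best[d] = (best[s][0] + c, best[s][1] + [(w, tag)])
    return best[final]

cost, path = best_path(compose(lattice, tagger))
print(cost, path)   # lowest-cost joint word/tag sequence, chosen only at the end
```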
{"title":"Widening the NLP Pipeline for spoken Language Processing","authors":"S. Bangalore","doi":"10.1109/SLT.2006.326787","DOIUrl":"https://doi.org/10.1109/SLT.2006.326787","url":null,"abstract":"Summary form only given. A typical text-based natural language application (eg. machine translation, summarization, information extraction) consists of a pipeline of preprocessing steps such as tokenization, stemming, part-of-speech tagging, named entity detection, chunking, parsing. Information flows downstream through the preprocessing steps along a narrow pipe: each step transforms a single input string into a single best solution string. However, this narrow pipe is limiting for two reasons: First, since each of the preprocessing steps are erroneous, producing a single best solution could magnify the error propogation down the pipeline. Second, the preprocessing steps are forced to resolve genuine ambiguity prematurely. While the widening of the pipeline can potentially benefit text-based language applications, it is imperative for spoken language processing where the output from the speech recognizer is typically a word lattice/graph. In this talk, we review how such a goal has been accomplished in tasks such as spoken language understanding, speech translation and multimodal language processing. We will also sketch methods that encode the preprocessing steps as finite-state transductions in order to exploit composition of finite-state transducers as a general constraint propogation method.","PeriodicalId":74811,"journal":{"name":"SLT ... : ... IEEE Workshop on Spoken Language Technology : proceedings. IEEE Workshop on Spoken Language Technology","volume":"48 1","pages":"15"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85810428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0