
Latest publications from the 2022 IEEE Spoken Language Technology Workshop (SLT)

Distilling Sequence-to-Sequence Voice Conversion Models for Streaming Conversion Applications
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023432
Kou Tanaka, H. Kameoka, Takuhiro Kaneko, Shogo Seki
This paper describes a method for distilling a recurrent-based sequence-to-sequence (S2S) voice conversion (VC) model. Although recent VC models achieve increasingly high quality, streaming conversion remains a challenge for practical applications. To achieve streaming VC, the conversion model needs a streamable structure, i.e., causal rather than non-causal layers. Motivated by this constraint and recent advances in S2S learning, we apply the teacher-student framework to recurrent-based S2S VC models. A major challenge is how to minimize the degradation caused by causal layers, which mask future input information. Experimental evaluations show that, except for male-to-female speaker conversion, our approach maintains the teacher model's performance in subjective evaluations despite the streamable student model structure. Audio samples can be accessed at http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/dists2svc.
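The streamability constraint the abstract mentions, causal rather than non-causal layers, can be illustrated with a minimal self-contained sketch (plain Python, not the paper's recurrent model): a causal 1-D convolution whose output at time t never looks at future frames.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] depends only on x[t-k+1 .. t].

    Left-padding with zeros gives each output frame past context only,
    which is the property a streamable (causal) layer must satisfy.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(padded[t + j] * kernel[k - 1 - j] for j in range(k))
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
y = causal_conv1d(x, [0.5, 0.5])  # [0.5, 1.5, 2.5, 3.5]

# Changing a future input never changes earlier outputs:
y_future = causal_conv1d(x[:3] + [100.0], [0.5, 0.5])
assert y_future[:3] == y[:3]
```

A non-causal layer would center the kernel on frame t and thus need future frames, which forces the recognizer to wait; the distillation question in the paper is how much quality is lost when the student is restricted to the causal form above.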
Citations: 2
Streaming Bilingual End-to-End ASR Model Using Attention Over Multiple Softmax
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022475
Aditya Patil, Vikas Joshi, Purvi Agrawal, Rupeshkumar Mehta
Even with several advancements in multilingual modeling, it is challenging to recognize multiple languages with a single neural model when the input language is unknown, and most multilingual models assume the input language is given. In this work, we propose a novel bilingual end-to-end (E2E) modeling approach in which a single neural model can recognize both languages and also supports switching between them, without any language input from the user. The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism. As the language-specific posteriors are combined, the model produces a single posterior probability over all output symbols, enabling a single beam-search decoding and allowing dynamic switching between the languages. The proposed approach outperforms the conventional bilingual baseline with relative word error rate reductions of 13.3%, 8.23%, and 1.3% on Hindi, English, and code-mixed test sets, respectively.
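The posterior combination the abstract describes can be sketched numerically. The values below are invented for illustration; in the real model the branch weights come from a learned self-attention mechanism and vary per frame.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical logits from two language-specific joint networks over a
# shared output vocabulary (three symbols here for brevity).
logits_a = [2.0, 0.5, -1.0]
logits_b = [-1.0, 0.2, 2.5]

# Scalar branch weights; in the model these are attention scores and
# can shift per frame, which is what enables dynamic language switching.
w_a, w_b = softmax([1.2, 0.3])

# A single posterior over all output symbols: one beam search suffices.
posterior = [w_a * pa + w_b * pb
             for pa, pb in zip(softmax(logits_a), softmax(logits_b))]
assert abs(sum(posterior) - 1.0) < 1e-9
```

Because the mixture of two valid distributions is itself a valid distribution, the decoder never needs to know which language is active.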
Citations: 0
Code-Switched Language Modelling Using a Code Predictive Lstm in Under-Resourced South African Languages
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022517
Joshua Jansen van Vüren, T. Niesler
We present a new LSTM language model architecture for code-switched speech incorporating a neural structure that explicitly models language switches. Experimental evaluation of this code predictive model for four under-resourced South African languages shows consistent improvements in perplexity as well as perplexity specifically over code-switches compared to an LSTM baseline. Substantial reductions in absolute speech recognition word error rates (0.5%-1.2%) as well as errors specifically at code-switches (0.6%-2.3%) are also achieved during n-best rescoring. When used for both data augmentation and n-best rescoring, our code predictive model reduces word error rate by a further 0.8%-2.6% absolute and consistently outperforms a baseline LSTM. The similar and consistent trends observed across all four language pairs allows us to conclude that explicit modelling of language switches by a dedicated language model component is a suitable strategy for code-switched speech recognition.
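The two evaluation quantities in the abstract, overall perplexity and perplexity specifically over code-switches, can be computed from per-token log-probabilities as below (the log-probs and switch mask are made up for illustration):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token natural-log probabilities, with a mask marking
# the tokens that immediately follow a language switch.
logp   = [-1.2, -0.7, -3.1, -0.9, -2.8]
switch = [False, False, True, False, True]

overall_ppl = perplexity(logp)
switch_ppl  = perplexity([lp for lp, s in zip(logp, switch) if s])

# Switch points are typically harder for the LM, as reflected here.
assert switch_ppl > overall_ppl
```

Reporting both numbers, as the paper does, shows whether a model's gains come from the easy monolingual stretches or from the switch points it explicitly models.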
Citations: 3
NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023323
Tsendsuren Munkhdalai, Zelin Wu, G. Pundak, K. Sim, Jiayang Li, Pat Rondon, Tara N. Sainath
Attention-based biasing techniques for end-to-end ASR systems are able to achieve large accuracy gains without requiring the inference-algorithm adjustments and parameter tuning common to fusion approaches. However, it is challenging to simultaneously scale attention-based biasing up to realistic numbers of biased phrases, maintain in-domain WER gains while minimizing out-of-domain losses, and run in real time. We present NAM+, an attention-based biasing approach that achieves a 16X inference speedup per acoustic frame over prior work when run with 3,000 biasing entities, as measured on a typical mobile CPU. NAM+ achieves these run-time gains through a combination of Two-Pass Hierarchical Attention and Dilated Context Update. Compared to the adapted baseline, NAM+ further decreases the in-domain WER by up to 12.6% relative, while incurring an out-of-domain WER regression of 20% relative. Compared to the non-adapted baseline, the out-of-domain WER regression is 7.1% relative.
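The WER figures in the abstract are relative changes, not absolute percentage-point deltas. As a reminder of the convention (the baseline value below is hypothetical, not from the paper):

```python
def relative_change_pct(baseline_wer, new_wer):
    """Relative WER change in percent; negative means improvement."""
    return 100.0 * (new_wer - baseline_wer) / baseline_wer

# A 12.6% relative reduction from a hypothetical 10.0% baseline WER
# lands at 8.74% absolute -- a much smaller absolute delta.
baseline = 10.0
improved = baseline * (1.0 - 0.126)
assert abs(relative_change_pct(baseline, improved) + 12.6) < 1e-9
```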
Citations: 7
Cover Page
Pub Date : 2023-01-09 DOI: 10.1109/slt54892.2023.10022896
Citations: 0
Hackathon
Pub Date : 2023-01-09 DOI: 10.1109/slt54892.2023.10023077
The theme of the "Hackathon DAF e PCTec/UnB" will be "UnB na palma da sua mão" ("UnB in the palm of your hand"). The idea is to approach innovation in a way that generates direct benefits for the Universidade de Brasília, by building an app through which students, professors, staff, and the academic community at large can monitor the services provided by the companies responsible for the various contracts currently in force with the university.
Citations: 2
Response Timing Estimation for Spoken Dialog Systems Based on Syntactic Completeness Prediction
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023458
Jin Sakuma, S. Fujie, Tetsunori Kobayashi
Appropriate response timing is very important for achieving smooth dialog progression. Conventionally, prosodic, temporal and linguistic features have been used to determine timing. In addition to the conventional parameters, we propose to utilize the syntactic completeness after a certain time, which represents whether the other party is about to finish speaking. We generate the next token sequence from intermediate speech recognition results using a language model and obtain the probability of the end of utterance appearing K tokens ahead, where K varies from 1 to M. We obtain an M-dimensional vector, which we denote as estimates of syntactic completeness (ESC). We evaluated this method on a simulated dialog database of a restaurant information center. The results confirmed that considering ESC improves the performance of response timing estimation, especially the accuracy in quick responses, compared with the method using only conventional features.
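The ESC vector, i.e., the probability that the utterance ends exactly K tokens ahead for K = 1..M, can be sketched with a toy bigram language model. The vocabulary and probabilities below are invented; the paper derives these from a neural LM run over intermediate ASR results.

```python
# Toy bigram LM over a tiny vocabulary; "</s>" marks end of utterance.
bigram = {
    "want":  {"a": 0.7, "</s>": 0.3},
    "a":     {"table": 1.0},
    "table": {"</s>": 0.8, "for": 0.2},
    "for":   {"two": 1.0},
    "two":   {"</s>": 1.0},
}

def esc_vector(last_token, M):
    """P(end of utterance exactly k tokens ahead) for k = 1..M."""
    out = []
    frontier = {last_token: 1.0}  # distribution over the current token
    for _ in range(M):
        p_end, nxt = 0.0, {}
        for tok, p in frontier.items():
            for w, pw in bigram.get(tok, {}).items():
                if w == "</s>":
                    p_end += p * pw
                else:
                    nxt[w] = nxt.get(w, 0.0) + p * pw
        out.append(p_end)
        frontier = nxt
    return out

esc = esc_vector("want", 3)  # approximately [0.3, 0.0, 0.56]
```

A dialog system can react quickly when early entries of the vector are large (the speaker is probably done) and hold back when the mass sits further out.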
Citations: 6
Peppanet: Effective Mispronunciation Detection and Diagnosis Leveraging Phonetic, Phonological, and Acoustic Cues
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022472
Bi-Cheng Yan, Hsin-Wei Wang, Berlin Chen
Mispronunciation detection and diagnosis (MDD) aims to detect erroneous pronunciation segments in an L2 learner's articulation and subsequently provide informative diagnostic feedback. Most existing neural methods follow a dictation-based modeling paradigm that finds out pronunciation errors and returns diagnostic feedback at the same time by aligning the recognized phone sequence uttered by an L2 learner to the corresponding canonical phone sequence of a given text prompt. However, the main downside of these methods is that the dictation process and alignment process are mostly made independent of each other. In view of this, we present a novel end-to-end neural method, dubbed PeppaNet, building on a unified structure that can jointly model the dictation process and the alignment process. The model of our method learns to directly predict the pronunciation correctness of each canonical phone of the text prompt and in turn provides its corresponding diagnostic feedback. In contrast to the conventional dictation-based methods that rely mainly on a free-phone recognition process, PeppaNet makes good use of an effective selective gating mechanism to simultaneously incorporate phonetic, phonological and acoustic cues to generate corrections that are more proper and phonetically related to the canonical pronunciations. Extensive sets of experiments conducted on the L2-ARCTIC benchmark dataset seem to show the merits of our proposed method in comparison to some recent top-of-the-line methods.
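The selective gating idea, scaling each cue stream before fusion, reduces in miniature to something like the sketch below. The feature values and gate logits are fixed stand-ins for what the model learns; nothing here is PeppaNet's actual architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-dimensional features for the three cue streams.
phonetic     = [0.2, 0.9]
phonological = [0.5, 0.1]
acoustic     = [0.8, 0.4]

# One scalar gate per stream (learned in the model, fixed here); each
# gate decides how much of its cue stream passes into the fusion.
g_p, g_h, g_a = (sigmoid(z) for z in (1.5, -0.5, 0.8))

fused = [g_p * p + g_h * h + g_a * a
         for p, h, a in zip(phonetic, phonological, acoustic)]
assert len(fused) == len(phonetic)
```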
Citations: 5
Welcome Page
Pub Date : 2023-01-09 DOI: 10.1109/slt54892.2023.10022398
D. Glaser
Professor Donald A. Glaser was a master of experimental science throughout his career. Born in Cleveland and educated at Case Institute of Technology, he earned a doctorate at Caltech and taught at the University of Michigan before accepting a post at UC Berkeley in 1959. Early in his career, Dr. Glaser experimented with ways to make the workings of sub-atomic particles visible. For his subsequent invention of the bubble chamber, he was awarded the 1960 Nobel Prize in Physics. He then began exploring the new field of molecular biology, improving techniques for working with bacterial phages, bacteria, and mammalian cells. By designing equipment to automate his experiments and scale them up, he could run thousands of experiments simultaneously, generating enough data to move the science forward. Recognizing the implications for medicine, Dr. Glaser and two friends created the pioneering biotech company Cetus Corporation in 1971, thus launching the genetic engineering industry.
Citations: 0
Flickering Reduction with Partial Hypothesis Reranking for Streaming ASR
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023016
A. Bruguier, David Qiu, Trevor Strohman, Yanzhang He
Incremental speech recognizers start displaying results while the user is still speaking. These partial results benefit users who value the responsiveness of the system. However, as new partial results arrive, words that were previously displayed can change or disappear. The results appear unstable, and this unwanted phenomenon is called flickering. Typical remediation approaches can increase latency and reduce the quality of the partial results, but little work has been done to measure these effects. We first introduce two new metrics that measure the quality and latency of the partials. We then propose a new, lightweight approach that reranks the partial results in favor of a more stable prefix without changing the beam search. This allows us to reduce flickering without impacting the final result. We show that we can roughly halve the amount of flickering with negligible impact on the quality and latency of the partial results.
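The reranking idea, preferring the n-best partial that extends what is already on screen without touching the beam search itself, can be sketched as follows. The scoring weight `alpha` and the example hypotheses are illustrative, not from the paper.

```python
def common_prefix_len(a, b):
    """Number of leading words two word sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def rerank_partials(nbest, shown, alpha=0.1):
    """Choose the partial maximizing beam score + a stability bonus.

    nbest: list of (score, words) pairs from the unchanged beam search.
    shown: the words currently displayed to the user.
    alpha: weight of the prefix-stability bonus (illustrative value).
    """
    return max(nbest,
               key=lambda h: h[0] + alpha * common_prefix_len(h[1], shown))[1]

shown = ["play", "some", "jazz"]
nbest = [(-1.0, ["play", "sum", "jazz", "music"]),
         (-1.1, ["play", "some", "jazz", "music"])]

# The slightly lower-scoring hypothesis wins because it keeps the
# displayed prefix intact, so the screen does not flicker.
best = rerank_partials(nbest, shown)
```

Since the bonus only reorders existing beam hypotheses, the final (non-partial) result is unchanged, which matches the paper's claim of no impact on the final output.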
Citations: 1