PULMO：用于语音反欺骗的精确语篇级建模

IF 4.3 2区物理与天体物理 Q1 ACOUSTICS Applied Acoustics Pub Date : 2025-01-05 Epub Date: 2024-08-24 DOI:10.1016/j.apacoust.2024.110221

Sunghyun Yoon

{"title":"PULMO：用于语音反欺骗的精确语篇级建模","authors":"Sunghyun Yoon","doi":"10.1016/j.apacoust.2024.110221","DOIUrl":null,"url":null,"abstract":"<div><p>In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Therefore, for minibatch training, each utterance is first either padded or truncated because utterances are variable-length sequences and thus cannot be directly fed into networks in minibatch units. However, modeling either a padded or truncated utterance, rather than the original one, makes it unfeasible to capture the entire context as is: padding could propagate even unwanted information, like artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, model could get stuck in a suboptimal solution. To fill this gap, we proposeÚ a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in the minibatch unit by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without the uncertainty in input length. Segment aggregation plays a role in aggregating the segment embeddings by utterance to encode the entire information of each utterance. The experimental results of the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method shows up to 84.9 % and 97.6 % relative equal error rate reductions on logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6 %.</p></div>","PeriodicalId":55506,"journal":{"name":"Applied Acoustics","volume":"227 ","pages":"Article 110221"},"PeriodicalIF":4.3000,"publicationDate":"2025-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PULMO: Precise utterance-level modeling for speech anti-spoofing\",\"authors\":\"Sunghyun Yoon\",\"doi\":\"10.1016/j.apacoust.2024.110221\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Therefore, for minibatch training, each utterance is first either padded or truncated because utterances are variable-length sequences and thus cannot be directly fed into networks in minibatch units. However, modeling either a padded or truncated utterance, rather than the original one, makes it unfeasible to capture the entire context as is: padding could propagate even unwanted information, like artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, model could get stuck in a suboptimal solution. To fill this gap, we proposeÚ a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in the minibatch unit by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without the uncertainty in input length. Segment aggregation plays a role in aggregating the segment embeddings by utterance to encode the entire information of each utterance. The experimental results of the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method shows up to 84.9 % and 97.6 % relative equal error rate reductions on logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6 %.</p></div>\",\"PeriodicalId\":55506,\"journal\":{\"name\":\"Applied Acoustics\",\"volume\":\"227 \",\"pages\":\"Article 110221\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-01-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Acoustics\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0003682X24003724\",\"RegionNum\":2,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/8/24 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Acoustics","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0003682X24003724","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/24 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

摘要

近年来，最先进的欺骗语音检测方法大多基于卷积神经网络（CNN）。包括 CNN 在内的大多数神经网络都是以 minibatch 为单位进行训练的，每个 minibatch 中的所有输入数据都必须具有相同的形状。因此，在进行小批量训练时，首先要对每个语篇进行填充或截断，因为语篇是长度可变的序列，因此不能直接输入到小批量单元的网络中。然而，对经过填充或截断的语篇而不是原始语篇进行建模，就不可能捕捉到整个语境的原貌：填充甚至会传播原始语篇中不需要的信息，如人工痕迹，而截断则不可避免地会丢失一些信息。在这些信息失真的情况下，模型可能会陷入次优解。为了填补这一空白，我们提出了一种精确的语篇级建模方法，它可以对长度可变的语篇进行小批量的语篇级建模，同时将信息失真降到最低。我们提出的方法包括序列分割和语段聚合。序列分割通过将每个变长语句分解为固定长度的语段，将其送入小批单元，从而实现对变长语句的并行处理，而不会影响输入长度的不确定性。分段聚合的作用是按语句聚合分段嵌入，以编码每个语句的全部信息。ASVspoof 2019 和 2021 的评估试验结果表明，所提出的方法在逻辑访问和物理访问场景中分别降低了高达 84.9% 和 97.6% 的相对相等错误率。此外，所提出的方法还将每一纪元的 FLOPs 减少了 6%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

PULMO: Precise utterance-level modeling for speech anti-spoofing

In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Therefore, for minibatch training, each utterance is first either padded or truncated because utterances are variable-length sequences and thus cannot be directly fed into networks in minibatch units. However, modeling either a padded or truncated utterance, rather than the original one, makes it unfeasible to capture the entire context as is: padding could propagate even unwanted information, like artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, model could get stuck in a suboptimal solution. To fill this gap, we proposeÚ a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in the minibatch unit by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without the uncertainty in input length. Segment aggregation plays a role in aggregating the segment embeddings by utterance to encode the entire information of each utterance. The experimental results of the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method shows up to 84.9 % and 97.6 % relative equal error rate reductions on logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6 %.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Applied Acoustics 物理-声学

CiteScore

7.40

自引率

11.80%

发文量

618

审稿时长

7.5 months

期刊介绍： Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways: • Complete Papers • Short Technical Notes • Review Articles; and thereby provides a wealth of technological information that can be used to solve related problems. Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.