{"title":"PULMO:用于语音反欺骗的精确语篇级建模","authors":"Sunghyun Yoon","doi":"10.1016/j.apacoust.2024.110221","DOIUrl":null,"url":null,"abstract":"<div><p>In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Therefore, for minibatch training, each utterance is first either padded or truncated because utterances are variable-length sequences and thus cannot be directly fed into networks in minibatch units. However, modeling either a padded or truncated utterance, rather than the original one, makes it unfeasible to capture the entire context as is: padding could propagate even unwanted information, like artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, model could get stuck in a suboptimal solution. To fill this gap, we proposeÚ a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in the minibatch unit by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without the uncertainty in input length. Segment aggregation plays a role in aggregating the segment embeddings by utterance to encode the entire information of each utterance. The experimental results of the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method shows up to 84.9 % and 97.6 % relative equal error rate reductions on logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6 %.</p></div>","PeriodicalId":55506,"journal":{"name":"Applied Acoustics","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PULMO: Precise utterance-level modeling for speech anti-spoofing\",\"authors\":\"Sunghyun Yoon\",\"doi\":\"10.1016/j.apacoust.2024.110221\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Therefore, for minibatch training, each utterance is first either padded or truncated because utterances are variable-length sequences and thus cannot be directly fed into networks in minibatch units. However, modeling either a padded or truncated utterance, rather than the original one, makes it unfeasible to capture the entire context as is: padding could propagate even unwanted information, like artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, model could get stuck in a suboptimal solution. To fill this gap, we proposeÚ a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in the minibatch unit by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without the uncertainty in input length. Segment aggregation plays a role in aggregating the segment embeddings by utterance to encode the entire information of each utterance. The experimental results of the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method shows up to 84.9 % and 97.6 % relative equal error rate reductions on logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6 %.</p></div>\",\"PeriodicalId\":55506,\"journal\":{\"name\":\"Applied Acoustics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Acoustics\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0003682X24003724\",\"RegionNum\":2,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Acoustics","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0003682X24003724","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
PULMO: Precise utterance-level modeling for speech anti-spoofing
In recent years, most state-of-the-art approaches for spoofed speech detection have been based on convolutional neural networks (CNNs). Most neural networks, including CNNs, are trained in minibatch units, where all input data in each minibatch must have the same shape. Therefore, for minibatch training, each utterance is first either padded or truncated because utterances are variable-length sequences and thus cannot be directly fed into networks in minibatch units. However, modeling either a padded or truncated utterance, rather than the original one, makes it unfeasible to capture the entire context as is: padding could propagate even unwanted information, like artifacts, in the original utterance, and truncation inevitably loses some information. With these information distortions, model could get stuck in a suboptimal solution. To fill this gap, we proposeÚ a method for precise utterance-level modeling that enables minibatch-wise utterance-level modeling of variable-length utterances while minimizing the information distortions. The proposed method comprises sequence segmentation followed by segment aggregation. Sequence segmentation feeds variable-length utterances in the minibatch unit by decomposing each of them into fixed-length segments, which enables parallel processing of variable-length utterances without the uncertainty in input length. Segment aggregation plays a role in aggregating the segment embeddings by utterance to encode the entire information of each utterance. The experimental results of the evaluation trials of ASVspoof 2019 and 2021 indicate that the proposed method shows up to 84.9 % and 97.6 % relative equal error rate reductions on logical and physical access scenarios, respectively. Furthermore, the proposed method reduced the FLOPs for an epoch by 6 %.
期刊介绍:
Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways: • Complete Papers • Short Technical Notes • Review Articles; and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.