Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979937
Kotaro Onishi, Toru Nakashika
Non-parallel voice conversion with deep neural networks often disentangles speaker individuality from speech content. However, these methods rely on external models, text data, or implicit constraints to achieve the disentanglement: they may require training additional models or annotating text, or it may be unclear how the latent representations are acquired. Therefore, we propose voice conversion with momentum contrastive representation learning (MoCoVC), a method that explicitly constrains intermediate features using contrastive representation learning, a self-supervised learning method. Applying contrastive representation learning with transformations that preserve utterance content allows us to explicitly constrain the intermediate features to preserve utterance content. We present transformations for contrastive representation learning that are suitable for voice conversion and verify the effectiveness of each in an experiment. Moreover, in subjective evaluation experiments, MoCoVC demonstrates performance that is higher than or comparable to a vector-quantization-constrained method in terms of both naturalness and speaker individuality.
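Momentum contrastive learning optimizes the InfoNCE objective: each encoded utterance (the query) should be close to the momentum encoder's output for a content-preserving transformation of the same utterance (the positive) and far from a queue of keys from other content (the negatives). A minimal NumPy sketch of that objective — the array shapes and temperature value are illustrative, not taken from the paper:

```python
import numpy as np

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss used in MoCo-style contrastive learning.

    query:         (N, D) encoder outputs for the anchor utterances
    positive_key:  (N, D) momentum-encoder outputs for content-preserving
                   transformations of the same utterances
    negative_keys: (K, D) queue of momentum-encoder outputs for other content
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k_pos = positive_key / np.linalg.norm(positive_key, axis=1, keepdims=True)
    k_neg = negative_keys / np.linalg.norm(negative_keys, axis=1, keepdims=True)

    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1) positive logits
    l_neg = q @ k_neg.T                                # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature

    # Cross-entropy with the positive always at class index 0
    m = logits.max(axis=1, keepdims=True)
    log_softmax = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -log_softmax[:, 0].mean()
```

Minimizing this loss pulls the two encodings of the same content together, which is exactly the explicit content-preservation constraint on intermediate features described above.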
Title: MoCoVC: Non-parallel Voice Conversion with Momentum Contrastive Representation Learning
Venue: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980044
Ahmed Khan, Koksheik Wong, Vishnu Monn Baskaran
An ideal image watermarking (IW) scheme aims to manage the trade-off among quality, capacity, and robustness. However, our literature survey reveals flaws in the form of poor robustness and quality or low embedding capacity. In this paper, a multiple-frequency-domain image watermarking scheme using salient (eye-catching) object detection is proposed. Specifically, the host and watermark images are partitioned into background and foreground regions by the proposed multi-dimension decomposition, which accumulates image patches and combines them to form the saliency map. Next, the watermark image is encrypted by multiple applications of the 3D Arnold and logistic maps, then embedded into both the identified foreground and background regions of the host image using different embedding strengths. The proposed method can embed one color pixel of the watermark image into one color pixel of the host image while maintaining high image quality. In the best case, we could embed a 24-bit image as the watermark into a 24-bit image of the same dimensions while maintaining an average RGB-SSIM of 0.9999. Experiments are carried out (with 10K MSRA dataset images) to verify the performance of the proposed method and to compare it against state-of-the-art (SOTA) watermarking methods.
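The Arnold-map encryption step can be illustrated with the classical 2-D cat map (the paper uses a 3-D Arnold map combined with a logistic map; this 2-D version, with an illustrative iteration count, only shows the scrambling idea):

```python
import numpy as np

def arnold_cat_map(img, iterations=1):
    """Scramble a square image with the classical 2-D Arnold cat map:
    (x, y) -> (x + y, x + 2y) mod N.  The map is a bijection on the pixel
    grid and is periodic, so repeated application eventually restores the
    image; the iteration count acts as part of the key."""
    n = img.shape[0]
    assert img.shape[0] == img.shape[1], "Arnold map needs a square image"
    out = img.copy()
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    for _ in range(iterations):
        out = out[(x + y) % n, (x + 2 * y) % n]
    return out
```

Because the transform is area-preserving and invertible, decryption is just applying the remaining iterations of the map's period (or the inverse map).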
Title: Image Watermarking based on Saliency Detection and Multiple Transformations
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979986
Shou-Hong Liu, Chun-Tai Liu, Wei-Hung Chou, JenYi Pan
In recent years, the 3GPP (3rd Generation Partnership Project) has studied and developed standards for non-terrestrial networks (NTN). One of the newest work items for NTN is coverage enhancement. In this paper, we construct the NTN channel described by 3GPP. Moreover, we summarize the NTN channel model and current coverage enhancements. Since the NTN scenario is very different from traditional terrestrial network systems, we also summarize the challenges and phenomena of NTN. To achieve high communication quality for voice-over-Internet-protocol (VoIP) service in NTN, we evaluate the performance and discuss the benefit of the PUSCH repetition technique in the NTN low-Earth-orbit (LEO) scenario.
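As background for why repetition helps coverage: under ideal combining, K identical PUSCH transmissions add coherently, so the post-combining SNR grows linearly with K. A back-of-the-envelope helper (idealized; a real LEO link gains less because the channel changes across the repetition window):

```python
import math

def repetition_gain_db(k):
    """Ideal SNR gain in dB from combining k PUSCH repetitions.

    Assumes perfect chase combining of identical transmissions, so the
    combined SNR is k times the single-shot SNR: gain = 10*log10(k) dB.
    """
    if k < 1:
        raise ValueError("repetition factor must be >= 1")
    return 10.0 * math.log10(k)
```

For example, doubling the repetitions buys roughly 3 dB of link budget, which is the first-order reason repetition is attractive for the long, lossy LEO uplink.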
Title: Evaluation of Voice Service in LEO Communication with 3GPP PUSCH Repetition Enhancement
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980147
Kenta Tsunomori, Yuma Yamasaki, M. Kuribayashi, N. Funabiki, I. Echizen
An effective way to defend against adversarial examples (AEs), which are used, for example, to attack applications such as face recognition, is to detect in advance whether an input image is an AE. Some AE defense methods focus on the response characteristics of image classifiers when denoising filters are applied to the input image. However, several filters are required, which results in a large amount of computation. Because JPEG compression of AEs effectively removes adversarial perturbations, the difference between an image before and after JPEG compression should be highly correlated with the perturbations, although not completely consistent with them. We have developed a filtering operation that modulates this difference, varying its magnitude and sign, and adds it back to the image so that adversarial perturbations are effectively removed. We consider that perturbations that cannot be removed by JPEG compression alone can be removed by modulating this difference. Furthermore, resizing the image after adding these distortions removes perturbations that could not be removed otherwise. The filtering operation removes the adversarial noise and reconstructs corrected samples from AEs. We also present a simple but effective reconstruction method based on these filtering operations. Experiments in which the adversarial attack was unknown to the detector demonstrated that the proposed method achieves better detection accuracy with reasonable computational complexity. In addition, the percentage of correct classification results after applying the proposed filter to non-targeted attacks was higher than that of JPEG compression and scaling. These results suggest that the proposed method effectively removes adversarial perturbations and is an effective filter for detecting AEs.
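The core operation can be sketched generically: treat the difference between an image and a lossy-compressed copy of it as an estimate of the perturbation, then scale that difference by a signed factor before adding it back. In this sketch the `compress` callable stands in for JPEG compression, and the function name and `alpha` parameter are illustrative, not the paper's:

```python
import numpy as np

def diff_modulation_filter(img, compress, alpha=-1.0):
    """Modulate the compression-derived difference and add it back.

    img:      uint8 image array
    compress: any lossy operation (JPEG in the paper); its output minus the
              input approximates the adversarial perturbation
    alpha:    signed scale; alpha = -1 subtracts the full difference
              (yielding the compressed image), alpha = 0 is the identity
    """
    diff = img.astype(np.float64) - compress(img).astype(np.float64)
    return np.clip(img + alpha * diff, 0, 255).astype(np.uint8)
```

A resizing step, as described above, would then be applied to the returned array to remove residual perturbations.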
Title: Detection and Correction of Adversarial Examples Based on JPEG-Compression-Derived Distortion
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980216
Yingyi Ma, Xueliang Zhang
There are three main interferences in the FM signal transmission process: the multipath effect, the Doppler effect, and white noise. These interferences significantly degrade speech. We propose a method that uses a masking or mapping approach for single-channel speech enhancement in wireless communication. Since the method improves speech quality by addressing the three interferences simultaneously, it is simpler than conventional methods. Experiments are conducted on a dataset that we simulated ourselves. Because PESQ and STOI require reference targets, it is hard to evaluate performance on real-world data, so we only give a spectral comparison of the real-data enhancement results. Simulation results show excellent speech enhancement performance on the unprocessed mixture and a significant improvement in speech quality on the actually collected data, verifying the feasibility of deep learning for this kind of task. Future studies will improve real-time performance and compress the number of network parameters.
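The masking branch of such a system can be summarized in a few lines: the network predicts a ratio mask over time-frequency bins, which scales the noisy magnitude while the noisy phase is reused. This is a generic sketch of mask-based enhancement, not the paper's exact architecture:

```python
import numpy as np

def apply_mask_enhancement(noisy_stft, mask):
    """Apply a predicted ratio mask to a complex noisy spectrogram.

    noisy_stft: complex array of STFT coefficients (freq x time)
    mask:       real-valued ratio mask of the same shape, clipped to [0, 1]
    Returns the enhanced complex spectrogram, reusing the noisy phase.
    """
    magnitude = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    enhanced_mag = np.clip(mask, 0.0, 1.0) * magnitude
    return enhanced_mag * np.exp(1j * phase)
```

The mapping alternative mentioned above would instead have the network regress `enhanced_mag` directly from the noisy magnitude.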
Title: Application of Deep Learning-based Single-channel Speech Enhancement for Frequency-modulation Transmitted Speech
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980311
Chandler Timm C. Doloriel, R. Cajote
Object detection is a computer vision technique used to identify objects that are usually present in natural scenes. However, methods designed for natural scenes do not transfer easily to aerial images, where objects are mostly arbitrarily oriented, small, and set against complex backgrounds rather than upright and well focused. To effectively detect objects in aerial images, we propose a new regression loss function based on the attention mechanism through attention weights. Using the relative position of the attention weights to the bounding box, the foreground is given more attention, which highlights the target object and effectively suppresses noise and background. Preliminary experiments are conducted on an attention-based object detector using the DOTA dataset to test the capability of the attention mechanism in extracting the contextual information of objects, especially in complex environments.
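The weighting idea can be sketched as follows: per-box regression errors are scaled by attention weights so that high-attention (foreground) boxes dominate the loss and background noise is down-weighted. This is a hypothetical illustration; how the weights are derived from the detector's attention maps follows the paper and is not reproduced here:

```python
import numpy as np

def attention_weighted_l1(pred_boxes, target_boxes, attention_weights):
    """Attention-weighted L1 regression loss over N boxes.

    pred_boxes, target_boxes: (N, 4) arrays of box coordinates
    attention_weights:        (N,) non-negative weights, not all zero;
                              larger weight = more foreground attention
    """
    err = np.abs(pred_boxes - target_boxes).sum(axis=1)  # (N,) per-box L1
    w = attention_weights / attention_weights.sum()       # normalize weights
    return float((w * err).sum())
```

With uniform weights this reduces to a plain averaged L1 loss; skewing the weights toward foreground boxes is what suppresses the background's contribution.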
Title: Object Detection in Aerial Images with Attention-based Regression Loss
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980002
Rui Lin, Kazunori Hayashi
Compressed sensing is a technique to recover a sparse vector from its underdetermined linear measurements. Since a naive $\ell_0$ optimization approach is hard to tackle due to the discreteness and non-convexity of the $\ell_0$ norm, a relaxed $\ell_1$-$\ell_2$ optimization problem is often employed for reconstructing the sparse vector, especially when the measurement noise is not negligible. FISTA (fast iterative shrinkage-thresholding algorithm) is one of the most popular algorithms for the $\ell_1$-$\ell_2$ optimization and is known to achieve the optimal convergence rate among first-order methods. Recently, the use of optical circuits for various signal processing tasks, including deep neural networks, has been considered intensively, but it is difficult to implement FISTA with an optical circuit because the algorithm requires division by a dynamic value. In this paper, assuming implementation with an optical circuit, we propose an ADMM (alternating direction method of multipliers) based algorithm for the $\ell_1$-$\ell_2$ optimization. An ADMM-based algorithm for the $\ell_1$-$\ell_2$ optimization has already been proposed in the literature, but the proposed algorithm is derived from a different formulation and, unlike the existing ADMM-based algorithm, does not include the calculation of the inverse of a matrix.
Computer simulation results demonstrate that the proposed algorithm achieves performance comparable to FISTA and the existing ADMM-based algorithm while requiring no division operations and no matrix inversions.
Title: An Approximated ADMM based Algorithm for $\ell_1$-$\ell_2$ Optimization Problem
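Both FISTA and ADMM attack the $\ell_1$-$\ell_2$ problem $\min_x \tfrac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$ through the soft-thresholding (shrinkage) operator, which needs only additions, multiplications, and a sign test. The sketch below pairs it with plain ISTA as the simplest division-free iteration; the paper's approximated ADMM differs in its update structure:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding (shrinkage), the proximal operator of t*||x||_1:
    S_t(x) = sign(x) * max(|x| - t, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, step, iters=200):
    """Plain ISTA for min_x 0.5*||y - Ax||^2 + lam*||x||_1:
    x <- S_{lam*step}(x - step * A^T (A x - y)).
    `step` should satisfy step <= 1 / ||A||_2^2 for convergence; note there
    is no per-iteration division and no matrix inversion."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - step * A.T @ (A @ x - y), lam * step)
    return x
```

FISTA adds a momentum term with a dynamically updated scalar divisor on top of this iteration, which is exactly the operation that is awkward in an optical circuit.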
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979991
Koi Yee Ng, Simying Ong
In this paper, an improved scrambling-embedding technique, namely a row-rotational data hiding method, is proposed to hide data in partially encrypted images. The partially encrypted images are generated with a bit-wise XOR cipher to investigate the feasibility of applying the proposed method at various encryption levels. The proposed method divides each row into multiple non-overlapping contiguous partitions. These partitions are arranged in a rotational manner to create different states, and each state represents specific data in binary form. During decoding, an α notation is introduced to reduce the number of failure rows, which would otherwise cause further image degradation and incorrect data extraction. The BSDS300 dataset, encrypted with different encryption strengths, is used for the experiments. The results show that when the least significant bits are encrypted, the proposed scrambling-embedding data hiding method still performs as well as in the plain-image domain.
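The rotational-state idea can be sketched as follows: splitting a row into n partitions yields n distinct cyclic arrangements, and the chosen arrangement carries one of n symbols. The function names are illustrative, and the non-blind extraction here (which compares against the original row) is a simplification — the paper's decoder works without the original and handles ambiguous "failure" rows separately:

```python
import numpy as np

def embed_in_row(row, n_parts, symbol):
    """Split a row into n_parts equal partitions and rotate the partition
    sequence by `symbol` positions; the rotation state (0 .. n_parts-1)
    carries log2(n_parts) bits of hidden data."""
    parts = np.split(row, n_parts)
    rotated = parts[symbol:] + parts[:symbol]
    return np.concatenate(rotated)

def extract_from_row(stego_row, original_row, n_parts):
    """Recover the symbol by finding which rotation of the original matches."""
    for symbol in range(n_parts):
        if np.array_equal(embed_in_row(original_row, n_parts, symbol), stego_row):
            return symbol
    return None  # decoding failure (cf. the paper's failure-row handling)
```

Rows whose partitions are identical to each other decode ambiguously, which is precisely the failure-row situation the α notation above is designed to mitigate.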
Title: Scrambling-Embedding in Partially-Encrypted Images
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980322
Aastha Kachhi, Anand Therattil, Ankur T. Patil, Hardik B. Sailor, H. Patil
Dysarthria is a neuro-motor speech impairment that renders speech unintelligible; its severity levels are generally imperceptible to humans. Dysarthric speech classification acts as a diagnostic tool for evaluating the progression of a patient's condition and also aids automatic dysarthric speech recognition systems (an important assistive speech technology). This study investigates the significance of Teager Energy Cepstral Coefficients (TECC) in dysarthric speech classification using three deep learning architectures, namely, Convolutional Neural Network (CNN), Light-CNN (LCNN), and Residual Networks (ResNet). The performance of TECC is compared with state-of-the-art features, such as the Short-Time Fourier Transform (STFT), Mel Frequency Cepstral Coefficients (MFCC), and Linear Frequency Cepstral Coefficients (LFCC). In addition, this study investigates the effectiveness of cepstral features over spectral features for this problem. The highest classification accuracies achieved on the UA-Speech corpus are 97.18%, 94.63%, and 98.02% (absolute improvements of 1.98%, 1.41%, and 1.69% over MFCC) with CNN, LCNN, and ResNet, respectively. Further, we evaluate feature discriminative capability using the $F1$-score, Matthews Correlation Coefficient (MCC), Jaccard index, and Hamming loss. Finally, an analysis of the latency period w.r.t. state-of-the-art feature sets indicates the potential of TECC for practical deployment of the severity-level classification system.
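The operator at the heart of TECC is the discrete Teager energy operator, a three-sample nonlinear operator; for a pure tone $A\sin(\Omega n)$ it returns the constant $A^2\sin^2\Omega$, so it jointly reflects amplitude and frequency. A NumPy sketch of the operator alone (the full TECC pipeline adds filterbank and cepstral stages on top of it):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
    Returns len(x) - 2 values (the two boundary samples are dropped)."""
    x = np.asarray(x, dtype=np.float64)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

Replacing the usual squared-magnitude energy with this output is what gives TECC its sensitivity to the airflow irregularities characteristic of dysarthric speech.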
Title: Teager Energy Cepstral Coefficients For Classification of Dysarthric Speech Severity-Level
Voice can represent a person's identity. Thus, it can be used in automatic speaker verification (ASV) systems for authenticating secure applications. Unfortunately, existing ASV systems are vulnerable to spoofing attacks. A replay attack is a widely used spoofing technique because it is simple to mount but difficult to detect. Hence, many methods have been proposed as countermeasures against replay attacks. Most prior work treats voice and non-voice sections inseparably when evaluating detection performance. In this work, we investigate spoof-detection performance when voice sections, non-voice sections, and combinations with different percentages of voice are used, in order to determine the optimal section. We also propose a method for detecting replay attacks using the optimal section of a signal. Mel-frequency cepstral coefficients are calculated from the optimal section as features, and a ResNet-34 model is used for classification. We evaluated the proposed method on a dataset from the ASVspoof 2019 challenge. The results show that the optimal section for replay attack detection is obtained when 10% and 20% of voice are included in the non-voice sections. They also show that the proposed method outperforms the baselines with a 7.52% relative improvement, corresponding to an equal error rate of 1.72%.
{"title":"Replay Attack Detection Based on Voice and Non-voice Sections for Speaker Verification","authors":"Ananda Garin Mills, Patthranit Kaewcharuay, Pannathorn Sathirasattayanon, Suradej Duangpummet, Kasorn Galajit, Jessada Karnjana, P. Aimmanee","doi":"10.23919/APSIPAASC55919.2022.9980225","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980225","url":null,"abstract":"Voice can represent a person's identity. Thus, it can be used in automatic speaker verification (ASV) systems for authenticating secure applications. Unfortunately, existing ASV systems are vulnerable to spoofing attacks. A replay attack is a widely used spoofing technique because it is simple to mount but difficult to detect. Hence, many methods have been proposed as countermeasures against replay attacks. Most prior work treats voice and non-voice sections inseparably when evaluating detection performance. In this work, we investigate spoof-detection performance when voice sections, non-voice sections, and combinations with different percentages of voice are used, in order to determine the optimal section. We also propose a method for detecting replay attacks using the optimal section of a signal. Mel-frequency cepstral coefficients are calculated from the optimal section as features, and a ResNet-34 model is used for classification. We evaluated the proposed method on a dataset from the ASVspoof 2019 challenge. The results show that the optimal section for replay attack detection is obtained when 10% and 20% of voice are included in the non-voice sections. They also show that the proposed method outperforms the baselines with a 7.52% relative improvement, corresponding to an equal error rate of 1.72%.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130867969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
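The 1.72% figure above is an equal error rate (EER), the operating point at which the false-acceptance rate (spoofed utterances accepted) equals the false-rejection rate (genuine utterances rejected). The paper does not include evaluation code; the sketch below is our own illustrative computation on toy score lists, with the function name `equal_error_rate` and the decision rule "accept if score ≥ threshold" assumed:

```python
import numpy as np

def equal_error_rate(genuine, spoof):
    """Sweep every candidate threshold and return the EER: the average of
    false-acceptance and false-rejection rates at the threshold where the
    two are closest."""
    genuine, spoof = np.asarray(genuine, float), np.asarray(spoof, float)
    best_gap, eer = np.inf, None
    for t in np.sort(np.unique(np.concatenate([genuine, spoof]))):
        far = np.mean(spoof >= t)   # spoof trials accepted
        frr = np.mean(genuine < t)  # genuine trials rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# One spoof score (0.5) overlaps the genuine range, so EER is 1/3:
print(equal_error_rate([0.8, 0.6, 0.4], [0.5, 0.3, 0.1]))  # ≈ 0.333
```

In practice EER is usually computed from a full ROC sweep with interpolation (e.g., via `sklearn.metrics.roc_curve`), but the discrete sweep above captures the definition the reported 1.72% refers to.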