Emotion information plays an important role in improving the expressiveness of synthesized speech. At present, researchers mainly use style or emotion encoders to extract emotion information from the mel-spectrogram produced by a mel filterbank. However, the mel filterbank does not account for masking effects in the human auditory system, so the mel-spectrogram cannot model complete auditory information. The multi-resolution modulation-filtered cochleagram (MMCG) simulates the auditory signal processing mechanism and reflects the function of the human auditory system; it extracts high-level auditory representations and significantly improves emotion recognition performance. We therefore propose extracting emotion information from the MMCG rather than the mel-spectrogram to improve the expressiveness of synthesized speech, and we design three kinds of MMCG encoders based on the characteristics of the MMCG. Subjective and objective experiments demonstrate that using the MMCG as an input feature not only improves the naturalness and style transfer performance of synthesized speech but also reduces the fundamental frequency error. Our proposed MMCG encoders extract more complete and richer emotion information from the MMCG, further improving the expressiveness of synthesized speech.
{"title":"Learning Emotion Information for Expressive Speech Synthesis Using Multi-resolution Modulation-filtered Cochleagram","authors":"Kaili Zhang, M. Unoki","doi":"10.23919/APSIPAASC55919.2022.9979810","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979810","url":null,"abstract":"Emotion information plays an important role in improving the expressiveness of synthesized speech. At present, researchers mainly use style or emotion encoder to extract emotion information from mel-spectrogram extracted by mel fil-terbank. The mel filterbank does not consider the masking effect in the human auditory system, which results in mel-spectrogram not modeling complete auditory information. The multi-resolution modulation-filtered cochleagram (MMCG) simulates the auditory signal processing mechanism and reflects the function of the human auditory system. It can extract high-level auditory representations and significantly improve the emotion recognition performance. Therefore, we propose extracting emotion information from MMCG rather than mel-spectrogram to improve the expressiveness of synthesized speech. We propose three different kinds of MMCG encoders based on the characteristics of MMCG. Subjective and objective experiments demonstrate that using MMCG as an input feature can not only improve the naturalness and style transfer performance of synthesized speech but also reduce the fundamental frequency error. Our proposed MMCG encoders can extract more complete and rich emotion information from MMCG to further improve the expressiveness of synthesized speech.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114684450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper develops an attention-based phishing detector by performing sub-word tokenization and fine-tuning the Bidirectional Encoder Representations from Transformers (BERT) model; we call it the BERT embedding attention model (BEAM). The proposed BEAM method contains five building blocks: a data pre-processing block that extracts components according to the URL structure, a tokenization block that tokenizes the individual URL components into sub-words, an embedding block that produces a numerical sequence representation, an encoding block that yields a context feature vector, and a classification block for phishing URL detection. Sub-word tokenization allows us to characterize the relationship among connected sub-words, while the attention mechanism in BERT allows the model to focus selectively on the parts that contribute to phishing behavior. We compare BEAM with existing state-of-the-art phishing detection methods, including CNN, Bi-LSTM, and machine learning models (random forest and XGBoost). Experimental results confirm that BEAM effectively detects phishing links and outperforms the other methods.
{"title":"BEAM - An Algorithm for Detecting Phishing Link","authors":"Sea Ran Cleon Liew, N. F. Law","doi":"10.23919/APSIPAASC55919.2022.9979860","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979860","url":null,"abstract":"This paper aims to develop an attention-based phishing detector by performing sub-word tokenization and fme-tuning the Bidirectional Encoder Representation from Transformers (BERT) model. It is called BERT embedding attention model (BEAM). Our proposed BEAM method contains five building blocks: a data pre-processing block to extract components according to the URL structure, a tokenization block to tokenize the individual URL components into a number of sub-words, an embedding block to produce a numerical sequence representation, an encoding block to give a context feature vector and a classification block for phishing URL detection. The subword tokenization allows us to characterize the relationship among connecting subwords, while the attention mechanism in the BERT allows the proposed model to focus selectively on important parts contributing to phishing behavior. We have compared our proposed BEAM method with other existing state-of-the-art phishing detection methods such as CNN, Bi-LSTM, and machine learning models (random forest and XGBoost). Experimental results confirm that our proposed BEAM method effectively detects phishing links and outperforms other existing methods.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127416519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional unsupervised domain adaptation (UDA) has achieved great success in many computer vision tasks, especially semantic segmentation, which requires costly pixel-wise annotations. However, the performance of UDA methods still falls far behind that of supervised learning due to the lack of annotations. Researchers have therefore introduced semi-supervised learning (SSL) and proposed a more practical setting, semi-supervised domain adaptation (SSDA), in which labeled source domain data and a small amount of labeled target domain data are available. To address the inter-domain gap, current research focuses on domain alignment by mixing annotated data from the two domains, but we argue that adapting to the target domain data distribution through model transfer is a better solution. In this paper, we propose a two-stage SSDA framework based on this assumption. First, we adapt the model from the source domain to the labeled data in the target domain; to verify the assumption, we choose a basic transfer mode, finetuning. Then, to align the labeled and unlabeled subspaces of the target domain, we use a teacher-student model with class-level data augmentation to realize online self-training. We also provide a variant that mitigates overfitting when only a small number of annotated target samples are available. Extensive experiments on two synthetic-to-real benchmarks demonstrate the correctness of our idea and the effectiveness of our method. In most SSDA scenarios, our approach matches or even exceeds supervised performance.
{"title":"A Two-stage Cascading Method Based on Finetuning in Semi-supervised Domain Adaptation Semantic Segmentation","authors":"Huiying Chang, Kaixin Chen, Ming Wu","doi":"10.23919/APSIPAASC55919.2022.9980206","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980206","url":null,"abstract":"The traditional unsupervised domain adaptation (UDA) has achieved great success in many computer vision tasks, especially semantic segmentation, which requires high cost of pixel-wise annotations. However, the final performance of UDA method is still far behind that of supervised learning due to the lack of annotations. Researchers introduce the semi-supervised learning (SSL) and propose a more practical setting, semi-supervised domain adaptation (SSDA), that is, having labeled source domain data and a small number of labeled target domain data. To address the inter-domain gap, current researches focus on domain alignment by mixing annotated data from two domains, but we argue that adapting the target domain data distribution through model transfer is a better solution. In this paper, we propose a two-stage SSDA framework based on this assumption. Firstly, we adapt the model from the source domain to the labeled dataset in the target domain. To verify the assumption, we choose a basic transfer mode: finetuning. Then, to align the labeled subspace and the unlabeled subspace of the target domain, we choose teacher-student model with class-level data augmentation as the basis to realize online self-training. We also provide a deformation to solve overfitting on the target domain with a small number of annotated data. Extensive experiments on two synthetic-to-real benchmarks have demonstrated the correctness of our idea and the effectiveness of our method. In most SSDA scenarios, our approach can achieve supervised performance or even better.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127523779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic speaker verification (ASV) is vulnerable to the threat of rapidly developing speech synthesis and voice conversion techniques, so developing anti-spoofing systems is an urgent need. This paper proposes a novel spoofed speech detection model that better utilizes augmented data at the training stage. The model adopts a light convolutional neural network (LCNN) with a split batch normalization (SBN) structure to alleviate the data pollution caused by data augmentation. A pre-trained wav2vec 2.0 model is used to extract features from input speech waveforms. Three data augmentation strategies, audio compression, mixup, and channel simulation, are compared in our experiments. Experimental results demonstrate that our method achieves a state-of-the-art equal error rate (EER) of 0.258% on the ASVspoof2019 LA task. Further analysis confirms the effectiveness of the pre-trained feature extractor, the data augmentation strategies, and the proposed SBN-LCNN model in improving spoofed speech detection.
{"title":"A Light CNN with Split Batch Normalization for Spoofed Speech Detection Using Data Augmentation","authors":"Haojian Lin, Yang Ai, Zhenhua Ling","doi":"10.23919/APSIPAASC55919.2022.9980260","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980260","url":null,"abstract":"The vulnerability of automatic speaker verification (ASV) is exposed to the threat of rapidly developing speech synthesis and voice conversion techniques. Developing anti-spoofing systems is an urgent need. This paper proposes a novel spoofed speech detection model for better utilizing the augmented data at the training stage. This model adopts a light convolutional neural network (LCNN) with the split batch normalization (SBN) structure to alleviate the issue of data pollution caused by data augmentation. The pre-trained wav2vec 2.0 model is used to extract features from input speech waveforms. Three data augmentation strategies, including audio compression, mixup and channel simulation, are compared in our experiments. Experimental results demonstrate that our proposed method achieves the state-of-the-art equal error rate (ERR) of 0.258% on the ASVspoof2019 LA task. Further analysis also confirms the effectiveness of the pre-trained model for feature extraction, the data augmentation strategies, and our proposed SBNLCNN model on improving the performance of spoofed speech detection.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"13 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123642923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We explore the performance of permutation invariant training and location-based training (PIT and LBT, respectively) for sound event localization and detection (SELD). Because SELD is intrinsically a multi-output, multi-class, multi-task problem, the design space of loss functions is large and, as yet, rather unexplored. Our study revolves around the multiple activity-coupled direction-of-arrival target format, which cleverly combines direction and event probability into a single mean squared error loss. While PIT and its variant, auxiliary duplicating PIT (ADPIT), have featured prominently in recent DCASE challenges, LBT has not yet been applied to SELD. In this work, we investigate modifications to PIT and ADPIT as well as the application of LBT to SELD. First, the PIT loss is changed to allow a variable number of tracks per event class, providing extra flexibility. Second, we propose auxiliary duplicating or silence PIT (ADPIT-S), where unused tracks are indifferently filled with a duplicate event or nothing. Finally, we propose using LBT with events ordered by Cartesian or polar coordinates, and we give two ways of padding the unused tracks: with zeros or by repeating the last event. We conduct experiments on the STARSS22 dataset from the DCASE 2022 Challenge. For LBT, ordering by Cartesian coordinates with repeat padding works best. Comparing all loss functions, we surprisingly find that PIT performs best, while LBT is competitive with PIT and ADPIT. ADPIT-S has slightly worse overall performance but does better on the error rate and F-score metrics.
{"title":"On Sorting and Padding Multiple Targets for Sound Event Localization and Detection with Permutation Invariant and Location-based Training","authors":"Robin Scheibler, Tatsuya Komatsu, Yusuke Fujita, Michael Hentschel","doi":"10.23919/APSIPAASC55919.2022.9979815","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979815","url":null,"abstract":"We explore the performance of permutation invariant and location-based training (PIT and LBT, respectively) for sound event localization and detection (SELD). Due to being intrinsically a multi-output multi-class and multi-task problem, the design space of loss functions for SELD is large, and, as of yet, rather unexplored. Our study revolves around the multiple activity coupled direction of arrival target format which cleverly combines direction and event probability into a single mean squared error loss. While PIT, and its variant auxiliary duplicating PIT (ADPIT), have been prominently featured in recent DCASE challenges, LBT has not yet been applied to SELD. In this work, we investigate some modifications to PIT and ADPIT, as well as the application of LBT to SELD. First, the PIT loss is changed to have a variable number of tracks per event class, providing extra flexibility. Second, we propose auxiliary duplicating or silence PIT (ADPIT-S), where unused tracks are indifferently filled with a duplicate event, or nothing. Finally, we propose to use LBT with ordering of the events by Cartesian or polar coordinates. We give two ways of padding the unused tracks, with zeros or by repeating the last event. We conduct experiments using the STARSS22 dataset from the DCASE Challenge 2022. We find that ordering by Cartesian coordinates with repeat padding is best for LBT. When comparing all loss functions, we suprisingly found that PIT performed the best. In addition, LBT turned out to be competitive with PIT and ADPIT. While ADPIT-S had slightly worse overall performance, it did better in terms of error rate and F-score metrics.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121532515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In cross-modal distillation, e.g., from a text-based inference module to a spoken language understanding module, it is difficult to determine the teacher's influence because the two modalities differ in nature, introducing heterogeneity in their uncertainty. Although error-rate- or entropy-based schemes have been suggested to avoid the heuristics of time-based scheduling, the confidence of the teacher's inference has not necessarily been taken into account when deciding the teacher's influence. In this paper, we propose a dropout-based confidence measure that determines the teacher's confidence and its influence on the student's loss. On the widely used spoken language understanding dataset Fluent Speech Commands, we show that our weight decision scheme enhances performance in combination with conventional scheduling strategies, achieving up to a 20% relative error reduction over the model trained without distillation.
{"title":"Cross-Modal Knowledge Distillation with Dropout-Based Confidence","authors":"Won Ik Cho, Jeunghun Kim, N. Kim","doi":"10.23919/APSIPAASC55919.2022.9980213","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980213","url":null,"abstract":"In cross-modal distillation, e.g., from text-based inference modules to spoken language understanding module, it is difficult to determine the teacher's influence due to the different nature of both modalities that bring the heterogeneity in the aspect of uncertainty. Though error rate or entropy-based schemes have been suggested to cope with the heuristics of time-based scheduling, the confidence of the teacher inference has not been necessarily taken into deciding the teacher's influence. In this paper, we propose a dropout-based confidence that decides the teacher's confidence and to-student influence of the loss. On the widely used spoken language understanding dataset, Fluent Speech Command, we show that our weight decision scheme enhances performance in combination with the conventional scheduling strategies, displaying a maximum 20% relative error reduction concerning the model with no distillation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123749479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel graph representation that tightly integrates two information sources, the node embedding matrix and the weight matrix, in graph representation learning. A new parameter updating method is proposed to dynamically represent the graph network using a specialized transformer. This graph evolved and embedded transformer is built from the weights and node embeddings of graph-structured data, and an attention-based graph learning machine is implemented. In the proposed method, each transformer layer is composed of two attention layers. The first layer calculates the weight matrix of the graph convolutional network along with the self-attention within the matrix itself. The second layer estimates the node embedding and weight matrices along with the cross-attention between them. Graph learning representation is enhanced by these two attention layers. Experiments on three financial prediction tasks demonstrate that this transformer captures temporal information and improves the F1 score and the mean reciprocal rank.
{"title":"Graph Evolving and Embedding in Transformer","authors":"Jen-Tzung Chien, Chia-Wei Tsao","doi":"10.23919/APSIPAASC55919.2022.9979949","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979949","url":null,"abstract":"This paper presents a novel graph representation which tightly integrates the information sources of node embed-ding matrix and weight matrix in a graph learning representation. A new parameter updating method is proposed to dynamically represent the graph network by using a specialized transformer. This graph evolved and embedded transformer is built by using the weights and node embeddings from graph structural data. The attention-based graph learning machine is implemented. Using the proposed method, each transformer layer is composed of two attention layers. The first layer is designed to calculate the weight matrix in graph convolutional network, and also the self attention within the matrix itself. The second layer is used to estimate the node embedding and weight matrix, and also the cross attention between them. Graph learning representation is enhanced by using these two attention layers. Experiments on three financial prediction tasks demonstrate that this transformer captures the temporal information and improves the Fl score and the mean reciprocal rank.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123770577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the development of silicon photonics and wavelength division multiplexing, the advantages of on-chip optical interconnection, such as low loss, low delay, and high bandwidth, can compensate for the disadvantages of electrical interconnection. However, as network scale and complexity increase, optical interconnection networks suffer from a series of problems, such as communication congestion, low utilization of microring resonators, and increased insertion loss. Traditional optical interconnection network structures are relatively fixed and cannot meet the needs of reconfigurable array processors. Therefore, this paper designs a configurable, non-blocking, scalable, low-loss optical interconnection network structure, ReLONEONoC. Depending on the array size, electrical interconnection is used within clusters, and optical communication is used for bulk data transmission between clusters. A simulation and verification model of the optical link is built with the Waveshaper 500A/SP configurable optical device, and the coupling screening effect of the microring resonator is simulated to verify the functional correctness of the optical link. A prototype system of ReLONEONoC was designed by combining the Waveshaper with an UltraScale+ VU440 development platform. Statistical results show that optical communication between clusters reduces both delay and loss.
{"title":"Design and system implementation of a configurable optical interconnection network","authors":"Bowen Yang, Junyong Deng, Jiaying Luo, Yu Feng","doi":"10.23919/APSIPAASC55919.2022.9979816","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979816","url":null,"abstract":"With the development of silicon photonics and wavelength division multiplexing, the advantages of on-chip optical interconnection, such as low loss, low delay and high bandwidth, can make up for the disadvantages of electrical interconnection. However, with the increase of network scale and complexity, a series of problems, such as communication congestion, low utilization rate of microring resonator and increase of insertion loss, appear in optical interconnection network. The traditional optical interconnection network structure is relatively fixed and cannot meet the needs of reconfigurable array processors. Therefore, this paper designs a configurable, non-blocking, scalable, low loss optical interconnection network structure ReLONEONoC. Depending on the array size, electrical interconnection is used within clusters, and optical communication is used for mass data transmission between clusters. Finally, the simulation and verification model of optical link is built by Waveshaper 500A/SP configurable optical device, and the coupling screening effect of microring resonator is simulated to verify the functional correctness of optical link. The prototype system of ReLONEONoC was designed by combining Waveshaper and $mathbf{UltraScale} +mathbf{VU}mathbf{440}$ development platform. Statistical results show that optical communication between clusters improves both delay and loss.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126255490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a non-parallel voice conversion method based on minimizing the free energy of a restricted Boltzmann machine (RBM). The proposed method uses an RBM that learns the generative probability of acoustic features conditioned on a target speaker, and it iteratively updates the input acoustic features until their free energy reaches a local minimum, yielding the converted features. Because the method is based on an RBM, only a few hyperparameters need to be set and the number of training parameters is very small, so training is stable. In determining the step size of the update formula via the Newton-Raphson method to find the feature that locally minimizes the free energy, we found that the Hessian matrix of the free energy can be approximated by a diagonal matrix, allowing an efficient update with little computation. In objective evaluations, the proposed method outperforms StarGAN-VC in mel-cepstral distortion; in subjective evaluations, its performance is comparable to that of StarGAN-VC in similarity MOS.
{"title":"Non-Parallel Voice Conversion Based on Free-Energy Minimization of Speaker-Conditional Restricted Boltzmann Machine","authors":"Takuya Kishida, Toru Nakashika","doi":"10.23919/APSIPAASC55919.2022.9980151","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980151","url":null,"abstract":"In this paper, we propose a non-parallel voice conversion method based on the minimization of the free energy of a restricted Boltzmann machine (RBM). The proposed method uses an RBM that learns the generative probability of acoustic features conditioned on a target speaker, and it iteratively updates the input acoustic features until their free energy reaches a local minimum to obtain converted features. Since it is based on the RBM, only a few hyperparameters need to be set, and the number of training parameters is very small. Therefore, training is stable. In determining the step size of the update formula in accordance with the Newton-Raphson method to obtain the feature that gives the local minimum of the free energy, we found that the Hesse matrix of the free energy can be approximated by a diagonal matrix, and the update can be performed efficiently with a small amount of calculation. In objective evaluation experiments, the proposed method outperforms StarGAN-VC in Mel-cepstral distortions. In subjective evaluation experiments, the performance of the proposed method is comparable to that of StarGAN-VC in similarity MOS.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"C-31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126486092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Target tracking based on millimeter-wave radar is susceptible to multipath effects and target crossover, and most existing methods are unsatisfactory in high-noise, complex environments. We propose a method covering target positioning, tracking, and track re-association using a top-mounted millimeter-wave radar, achieving stable and accurate counting and tracking of multiple targets. First, polar-coordinate-based tracking is performed using an extended Kalman filter with linear regression correction. Then, a density-based classification algorithm with group signal-to-noise-ratio analysis removes ghost targets. To handle the track fractures caused by target crossover, we propose using a Hankel matrix. Our experiments demonstrate the robustness of the proposed method, which not only achieves a tracking precision within 0.1 m but also successfully handles most of the target crossover situations considered. Moreover, in scenarios with up to six people, the fraction of frames in which the people-counting error is at most one exceeds 95%.
{"title":"Continuous Tracking of Indoor Human Targets Based on Millimeter Wave Radar","authors":"Meiqiu Jiang, Shisheng Guo, Haolan Luo, G. Cui","doi":"10.23919/APSIPAASC55919.2022.9979904","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979904","url":null,"abstract":"The effect of target tracking based on millime-ter wave radar is susceptible to multi path effect and target crossover. Most existing methods are unsatisfactory in high-noise, complex environments. In contrast, we propose a method covering target positioning, tracking, and track re-association by using top-mounted millimeter-wave radar, achieving stable and accurate counting and tracking of multiple targets. First, a polar-coordinate- based tracking is performed using an extended Kalman filter with linear regression correction. Then, a density-based classification algorithm with group signal-to-noise ratio analysis is performed to remove ghost targets. In terms of the track fracture problem caused by the target intersection, we propose to use the Hankel matrix to solve this situation. Our experiments prove the robustness of the proposed method, which not only has a high tracking precision within O.lm but also successfully handles most target crossover situations considered. At the same time, in the cases within six people, the ratio between the number of frames in which personnel counting error is less than or equal to 1 and the total number of frames is more than 95%.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125515749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}