Pub Date: 2024-05-13 | DOI: 10.1007/s11063-024-11615-y
Qi An, Shan Jiang
Recognizing the pivotal role that the choice of distance metric plays in designing a clustering algorithm, we focus on innovating the k-means method by redefining the distance metric in its distortion measure. In this study, we introduce a novel k-means clustering algorithm utilizing a distance metric derived from the ℓ_p quasi-norm with p ∈ (0, 1). Through an illustrative example, we showcase the advantageous properties of the proposed distance metric, compared to commonly used alternatives, for revealing natural groupings in data. Subsequently, we present a novel k-means-type heuristic integrating this sub-one quasi-norm-based distance, offer a step-by-step iterative relocation scheme, and prove convergence to a Kuhn-Tucker point. Finally, we empirically validate the effectiveness of our clustering method through experiments on synthetic and real-life datasets, both in their original form and with additional noise introduced. We also investigate the performance of the proposed method as a subroutine in a deep-learning clustering algorithm. Our results demonstrate the efficacy of the proposed k-means algorithm in capturing distinctive patterns exhibited by certain data types.
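The abstract does not reproduce the paper's relocation scheme, but the assignment step under the sub-one distance can be sketched in NumPy. This is a minimal illustration, not the paper's algorithm: the ℓ_p^p distortion (p = 0.5) drives assignments, and the coordinate-wise median is used as a robust stand-in for the center update, which has no closed form for p < 1; the `init` parameter is a convenience of this sketch.

```python
import numpy as np

def lp_distance(x, c, p=0.5):
    """Sub-one distortion: sum_j |x_j - c_j|^p with p in (0, 1).
    (The 1/p-th root is omitted: it does not change argmin assignments.)"""
    return np.sum(np.abs(x - c) ** p, axis=-1)

def kmeans_lp(X, k, p=0.5, iters=50, init=None, seed=0):
    """k-means-style relocation under the l_p quasi-norm distortion."""
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float).copy()
    for _ in range(iters):
        # assignment step: nearest center under the l_p^p distortion
        d = np.stack([lp_distance(X, c, p) for c in centers])  # (k, n)
        labels = d.argmin(axis=0)
        # update step: for p < 1 the minimizer has no closed form; the
        # coordinate-wise median is used here as a robust stand-in
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return labels, centers
```

Because |·|^p with p < 1 grows sub-linearly, large coordinate deviations are penalized less aggressively than under the Euclidean norm, which is what gives the metric its robustness to noisy features.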
{"title":"Sub-One Quasi-Norm-Based k-Means Clustering Algorithm and Analyses","authors":"Qi An, Shan Jiang","doi":"10.1007/s11063-024-11615-y","DOIUrl":"https://doi.org/10.1007/s11063-024-11615-y","url":null,"abstract":"<p>Recognizing the pivotal role of choosing an appropriate distance metric in designing the clustering algorithm, our focus is on innovating the <i>k</i>-means method by redefining the distance metric in its distortion. In this study, we introduce a novel <i>k</i>-means clustering algorithm utilizing a distance metric derived from the <span>(ell _p)</span> quasi-norm with <span>(pin (0,1))</span>. Through an illustrative example, we showcase the advantageous properties of the proposed distance metric compared to commonly used alternatives for revealing natural groupings in data. Subsequently, we present a novel <i>k</i>-means type heuristic by integrating this sub-one quasi-norm-based distance, offer a step-by-step iterative relocation scheme, and prove the convergence to the Kuhn-Tucker point. Finally, we empirically validate the effectiveness of our clustering method through experiments on synthetic and real-life datasets, both in their original form and with additional noise introduced. We also investigate the performance of the proposed method as a subroutine in a deep learning clustering algorithm. 
Our results demonstrate the efficacy of the proposed <i>k</i>-means algorithm in capturing distinctive patterns exhibited by certain data types.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"46 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-11 | DOI: 10.1007/s11063-024-11449-8
Lei Xia, Jianfeng Tang, Guangli Li, Jun Fu, Shukai Duan, Lidan Wang
The echo state network (ESN) is an efficient recurrent neural network that has achieved good results in time series prediction tasks, but its application to time series classification tasks remains underdeveloped. In this study, we address the time series classification problem with echo state networks. We propose a new framework called the forward echo state convolutional network (FESCN). It consists of two parts, an encoder and a decoder: the encoder is a forward-topology echo state network (FT-ESN), and the decoder mainly consists of a convolutional layer and a max-pooling layer. We apply the proposed framework to the univariate time series datasets of the UCR archive and compare it with six traditional methods and four neural network models. The experimental findings demonstrate that FESCN outperforms the other methods in overall classification accuracy. Additionally, we investigated the impact of reservoir size on network performance and observed that the best classification results were obtained with a reservoir size of 32. Finally, we examined the network's performance under noise interference; the results show that FESCN is more stable than the echo memory network (EMN).
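The encoder side of such a framework rests on the standard ESN state update. The sketch below is illustrative only: the abstract does not specify the FT-ESN construction, so a strictly lower-triangular ("forward-only") reservoir mask is used here as one plausible reading of the forward topology.

```python
import numpy as np

def init_reservoir(n_in, n_res, scale=0.9, seed=0):
    """Random input weights plus a reservoir restricted to forward
    connections (strictly lower-triangular mask) -- an assumption,
    not the paper's exact FT-ESN construction."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W = np.tril(rng.uniform(-0.5, 0.5, (n_res, n_res)), k=-1)
    # a strictly triangular matrix has spectral radius 0, so bound the
    # largest singular value instead to keep the echo state property
    W *= scale / np.linalg.norm(W, 2)
    return W_in, W

def run_reservoir(W_in, W, u_seq, leak=1.0):
    """Collect leaky-tanh reservoir states for a 1-D input sequence."""
    x = np.zeros(W.shape[0])
    states = []
    for u in u_seq:
        x = (1 - leak) * x + leak * np.tanh(W_in @ np.atleast_1d(u) + W @ x)
        states.append(x.copy())
    return np.array(states)  # (T, n_res): the encoding a decoder would consume
```

The fixed, untrained reservoir is what makes ESN encoders cheap: only the decoder (here, the convolutional classifier) needs gradient training.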
{"title":"Time Series Classification Based on Forward Echo State Convolution Network","authors":"Lei Xia, Jianfeng Tang, Guangli Li, Jun Fu, Shukai Duan, Lidan Wang","doi":"10.1007/s11063-024-11449-8","DOIUrl":"https://doi.org/10.1007/s11063-024-11449-8","url":null,"abstract":"<p>The Echo state network (ESN) is an efficient recurrent neural network that has achieved good results in time series prediction tasks. Still, its application in time series classification tasks has yet to develop fully. In this study, we work on the time series classification problem based on echo state networks. We propose a new framework called forward echo state convolutional network (FESCN). It consists of two parts, the encoder and the decoder, where the encoder part is composed of a forward topology echo state network (FT-ESN), and the decoder part mainly consists of a convolutional layer and a max-pooling layer. We apply the proposed network framework to the univariate time series dataset UCR and compare it with six traditional methods and four neural network models. The experimental findings demonstrate that FESCN outperforms other methods in terms of overall classification accuracy. Additionally, we investigated the impact of reservoir size on network performance and observed that the optimal classification results were obtained when the reservoir size was set to 32. 
Finally, we investigated the performance of the network under noise interference, and the results show that FESCN has a more stable network performance compared to EMN (echo memory network).</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"49 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-10 | DOI: 10.1007/s11063-024-11606-z
Xin Ye, Xiang Tian, Bolun Zheng, Fan Zhou, Yaowu Chen
Knowledge distillation is a model compression technique that transfers knowledge learned by teacher networks to student networks. Existing knowledge distillation methods greatly expand the forms of knowledge, but also make the distillation models complex and symmetric. However, few studies have explored the commonalities among these methods. In this study, we propose a concise distillation framework to unify these methods and a method to construct asymmetric knowledge distillation under the framework. Asymmetric distillation aims to enable differentiated knowledge transfers for different distillation objects. We designed a multi-stage shallow-wide branch bifurcation method to distill different knowledge representations and a grouping ensemble strategy to supervise the network to teach and learn selectively. Finally, we conducted experiments using image classification benchmarks to verify the proposed method. Experimental results show that our implementation achieves considerable improvements over existing methods, demonstrating the effectiveness of the method and the potential of the framework.
{"title":"A Unified Asymmetric Knowledge Distillation Framework for Image Classification","authors":"Xin Ye, Xiang Tian, Bolun Zheng, Fan Zhou, Yaowu Chen","doi":"10.1007/s11063-024-11606-z","DOIUrl":"https://doi.org/10.1007/s11063-024-11606-z","url":null,"abstract":"<p>Knowledge distillation is a model compression technique that transfers knowledge learned by teacher networks to student networks. Existing knowledge distillation methods greatly expand the forms of knowledge, but also make the distillation models complex and symmetric. However, few studies have explored the commonalities among these methods. In this study, we propose a concise distillation framework to unify these methods and a method to construct asymmetric knowledge distillation under the framework. Asymmetric distillation aims to enable differentiated knowledge transfers for different distillation objects. We designed a multi-stage shallow-wide branch bifurcation method to distill different knowledge representations and a grouping ensemble strategy to supervise the network to teach and learn selectively. Consequently, we conducted experiments using image classification benchmarks to verify the proposed method. Experimental results show that our implementation can achieve considerable improvements over existing methods, demonstrating the effectiveness of the method and the potential of the framework.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"21 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140942261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-10 | DOI: 10.1007/s11063-024-11630-z
Qian Lang, Jing Xu, Huiwen Zhang, Zhengxin Wang
In this paper, group consensus is investigated for a class of nonlinear multi-agent systems subject to denial-of-service (DoS) attacks. First, a first-order nonlinear multi-agent system is constructed, which is divided into M subsystems, each with a unique leader. A protocol is then proposed and a Lyapunov function candidate is chosen. By means of stability theory, a sufficient criterion involving the duration of DoS attacks, the coupling strength, and the control gain is obtained for achieving group consensus in the first-order system; that is, the nodes in each subsystem can track the leader of their group. Furthermore, the result is extended to nonlinear second-order multi-agent systems, and the controller is improved accordingly to obtain sufficient conditions for group consensus. Additionally, the lower bounds on the coupling strength and the average interval of DoS attacks can be determined from the obtained sufficient conditions. Finally, several numerical simulations are presented to illustrate the effectiveness of the proposed controllers and the derived theoretical results.
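The tracking behaviour described above can be previewed with a much-simplified simulation. The sketch below drops the nonlinear agent dynamics, the group structure, and the DoS interruptions treated in the paper, and just Euler-integrates a linear pinned-consensus protocol so that followers converge to a pinned leader value; all function and parameter names are of this sketch, not the paper.

```python
import numpy as np

def simulate_pinned_consensus(A, s, pinned, c=1.0, g=2.0, dt=0.01, steps=20000, x0=None):
    """Euler simulation of the linear pinned-consensus protocol
        x_i' = -c * sum_j L[i, j] * x_j - d_i * (x_i - s),
    where d_i = g on pinned nodes and 0 elsewhere."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    d = np.array([g if i in pinned else 0.0 for i in range(n)])
    L = np.diag(A.sum(axis=1)) - A  # graph Laplacian of the follower network
    for _ in range(steps):
        x = x + dt * (-c * (L @ x) - d * (x - s))
    return x
```

With a connected follower graph and at least one pinned node, all states converge to the leader value s, mirroring the "nodes track the leader of their group" conclusion in the linear special case.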
{"title":"Pinning Group Consensus of Multi-agent Systems Under DoS Attacks","authors":"Qian Lang, Jing Xu, Huiwen Zhang, Zhengxin Wang","doi":"10.1007/s11063-024-11630-z","DOIUrl":"https://doi.org/10.1007/s11063-024-11630-z","url":null,"abstract":"<p>In this paper, group consensus is investigated for a class of nonlinear multi-agent systems suffered from the DoS attacks. Firstly, a first-order nonlinear multi-agent system is constructed, which is divided into <i>M</i> subsystems and each subsystem has an unique leader. Then a protocol is proposed and a Lyapunov function candidate is chosen. By means of the stability theory, a sufficient criterion, which involves the duration of DoS attacks, coupling strength and control gain, is obtained for achieving group consensus in first-order system. That is, the nodes in each subsystem can track the leader of that group. Furthermore, the result is extended to nonlinear second-order multi-agent systems and the controller is also improved to obtain sufficient conditions for group consensus. Additionally, the lower bounds of the coupling strength and average interval of DoS attacks can be determined from the obtained sufficient conditions. Finally, several numerical simulations are presented to explain the effectiveness of the proposed controllers and the derived theoretical results.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"27 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-09 | DOI: 10.1007/s11063-024-11623-y
Manu Augustine, Om Prakash Yadav, Ashish Nayyar, Dheeraj Joshi
Fuzzy cognitive maps (FCMs) provide a rapid and efficient approach to system modeling and simulation. The literature demonstrates numerous successful applications of FCMs in identifying failure modes. The standard process of failure mode identification using FCMs involves monitoring crucial concept/node values for excesses. Threshold functions are used to limit node values to a pre-specified range, usually [0, 1] or [-1, +1]. However, traditional FCMs using the tanh threshold function have two crucial drawbacks for this particular purpose: (i) a tendency to reduce the values of state vector components, and (ii) a potential inability to reach a limit state with clearly identifiable failure states. The reason is the inherent mathematical nature of the tanh function, which is asymptotic to the horizontal lines demarcating the edges of the specified range. To overcome these limitations, this paper introduces a novel modified tanh threshold function that effectively addresses both issues.
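The asymptotic-boundary problem is easy to demonstrate numerically. The abstract does not give the paper's modified function, so the sketch below uses a hypothetical gain-then-clip variant purely to illustrate the issue: plain tanh never attains ±1 under iteration, while the clipped variant reaches the limit state exactly.

```python
import numpy as np

def tanh_threshold(x):
    return np.tanh(x)

def modified_tanh_threshold(x, gain=1.5):
    """Hypothetical fix (not the paper's exact function): amplify, then
    clip, so saturated nodes can actually reach the limit values -1/+1,
    which plain tanh only approaches asymptotically."""
    return np.clip(gain * np.tanh(x), -1.0, 1.0)

def fcm_run(W, a0, f, steps=50):
    """Iterate an FCM state vector: a(t+1) = f(a(t) + a(t) @ W)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a = f(a + a @ W)
    return a
```

A node whose limit value sits exactly on the range boundary gives an unambiguous failure indication, which is the point of drawback (ii) above.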
{"title":"Use of a Modified Threshold Function in Fuzzy Cognitive Maps for Improved Failure Mode Identification","authors":"Manu Augustine, Om Prakash Yadav, Ashish Nayyar, Dheeraj Joshi","doi":"10.1007/s11063-024-11623-y","DOIUrl":"https://doi.org/10.1007/s11063-024-11623-y","url":null,"abstract":"<p>Fuzzy cognitive maps (FCMs) provide a rapid and efficient approach for system modeling and simulation. The literature demonstrates numerous successful applications of FCMs in identifying failure modes. The standard process of failure mode identification using FCMs involves monitoring crucial concept/node values for excesses. Threshold functions are used to limit the value of nodes within a pre-specified range, which is usually [0, 1] or [-1, + 1]. However, traditional FCMs using the <i>tanh</i> threshold function possess two crucial drawbacks for this particular.Purpose(i) a tendency to reduce the values of state vector components, and (ii) the potential inability to reach a limit state with clearly identifiable failure states. The reason for this is the inherent mathematical nature of the <i>tanh</i> function in being asymptotic to the horizontal line demarcating the edge of the specified range. To overcome these limitations, this paper introduces a novel modified <i>tanh</i> threshold function that effectively addresses both issues.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"25 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-09 | DOI: 10.1007/s11063-024-11621-0
Peng Guo, Shuguo Pan, Peng Hu, Ling Pei, Baoguo Yu
In the unsupervised domain adaptation (UDA) depth estimation task (Akada et al., Self-supervised learning of domain invariant features for depth estimation, in: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3377-3387 (2022), 10.1109/WACV51458.2022.00107), a recent adaptive approach is to use a bidirectional transformation network to transfer style between the target- and source-domain inputs, and then train a depth estimation network in each domain. However, the domain adaptation process and the style transfer may introduce defects and biases, often leading to depth holes and missing instance-edge depth in the target domain's depth output. To address these issues, we propose a training network improved in both model structure and supervision constraints. First, we introduce an edge-guided self-attention mechanism into the task network of each domain to enhance the network's attention to high-frequency edge features, maintain clear boundaries, and fill in missing depth regions. Furthermore, we utilize an edge detection algorithm to extract edge features from the target-domain input. Then we establish edge consistency constraints between inter-domain entities, narrowing the gap between domains and easing domain-to-domain transfer. Our experiments demonstrate that the proposed method effectively solves the aforementioned problems, producing higher-quality depth maps and outperforming existing state-of-the-art methods.
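An edge consistency constraint of the kind described can be sketched concretely. The abstract does not name the edge detector, so a Sobel operator is used here as a stand-in; the loss penalizes the L1 distance between the edge maps of two domain renderings of the same scene.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, k):
    """Naive 'valid' 2-D convolution with a 3x3 kernel."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def edge_map(img):
    gx, gy = conv2d_valid(img, SOBEL_X), conv2d_valid(img, SOBEL_Y)
    return np.hypot(gx, gy)  # gradient magnitude

def edge_consistency_loss(img_a, img_b):
    """L1 distance between the edge maps of two renderings of a scene."""
    return np.mean(np.abs(edge_map(img_a) - edge_map(img_b)))
```

Because style transfer should change appearance but not geometry, edges are a natural domain-invariant signal to anchor the constraint on.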
{"title":"Unsupervised Domain Adaptation Depth Estimation Based on Self-attention Mechanism and Edge Consistency Constraints","authors":"Peng Guo, Shuguo Pan, Peng Hu, Ling Pei, Baoguo Yu","doi":"10.1007/s11063-024-11621-0","DOIUrl":"https://doi.org/10.1007/s11063-024-11621-0","url":null,"abstract":"<p>In the unsupervised domain adaptation (UDA) (Akada et al. Self-supervised learning of domain invariant features for depth estimation, in: 2022 IEEE/CVF winter conference on applications of computer vision (WACV), pp 3377–3387 (2022). 10.1109/WACV51458.2022.00107) depth estimation task, a new adaptive approach is to use the bidirectional transformation network to transfer the style between the target and source domain inputs, and then train the depth estimation network in their respective domains. However, the domain adaptation process and the style transfer may result in defects and biases, often leading to depth holes and instance edge depth missing in the target domain’s depth output. To address these issues, We propose a training network that has been improved in terms of model structure and supervision constraints. First, we introduce a edge-guided self-attention mechanism in the task network of each domain to enhance the network’s attention to high-frequency edge features, maintain clear boundaries and fill in missing areas of depth. Furthermore, we utilize an edge detection algorithm to extract edge features from the input of the target domain. Then we establish edge consistency constraints between inter-domain entities in order to narrow the gap between domains and make domain-to-domain transfers easier. 
Our experimental demonstrate that our proposed method effectively solve the aforementioned problem, resulting in a higher quality depth map and outperforming existing state-of-the-art methods.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"2 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-08 | DOI: 10.1007/s11063-024-11466-7
Chao Huang, Zhao Kang, Hong Wu
Image anomaly detection and localization perform not only image-level anomaly classification but also pixel-level localization of anomalous regions. Recently, this problem has received much research attention due to its wide application in various fields. This paper proposes ProtoAD, a prototype-based neural network for image anomaly detection and localization. First, the patch features of normal images are extracted by a deep network pre-trained on natural images. Then, the prototypes of the normal patch features are learned by non-parametric clustering. Finally, we construct an image anomaly localization network (ProtoAD) by appending to the feature extraction network an L2 feature normalization, a 1×1 convolutional layer, a channel max-pooling, and a subtraction operation. We use the prototypes as the kernels of the 1×1 convolutional layer; therefore, our neural network does not need a training phase and can conduct anomaly detection and localization in an end-to-end manner. Extensive experiments on two challenging industrial anomaly detection datasets, MVTec AD and BTAD, demonstrate that ProtoAD achieves competitive performance compared to state-of-the-art methods at a higher inference speed. The code and pre-trained models are publicly available at https://github.com/98chao/ProtoAD.
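The four appended operations map directly onto per-pixel cosine similarity against the prototypes, which the following NumPy sketch makes explicit (the function name and shapes are illustrative, not the released code's API): L2 normalization plus a 1×1 convolution with prototype kernels computes per-pixel dot products, channel max-pooling keeps the closest prototype, and the subtraction turns similarity into an anomaly score.

```python
import numpy as np

def protoad_score(feat_map, prototypes):
    """Anomaly map from prototypes used as 1x1 conv kernels.

    feat_map:   (H, W, C) patch features of a test image
    prototypes: (K, C) prototypes clustered from normal patch features
    """
    # L2 feature normalization (for both features and prototypes)
    f = feat_map / (np.linalg.norm(feat_map, axis=-1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + 1e-8)
    sim = f @ p.T            # 1x1 convolution == per-pixel dot products, (H, W, K)
    best = sim.max(axis=-1)  # channel max-pooling: similarity to closest prototype
    return 1.0 - best        # subtraction: high score == anomalous pixel
```

Since the prototypes are fixed after clustering, the whole pipeline involves no gradient training, which is why the method needs no training phase.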
{"title":"A Prototype-Based Neural Network for Image Anomaly Detection and Localization","authors":"Chao Huang, Zhao Kang, Hong Wu","doi":"10.1007/s11063-024-11466-7","DOIUrl":"https://doi.org/10.1007/s11063-024-11466-7","url":null,"abstract":"<p>Image anomaly detection and localization perform not only image-level anomaly classification but also locate pixel-level anomaly regions. Recently, it has received much research attention due to its wide application in various fields. This paper proposes ProtoAD, a prototype-based neural network for image anomaly detection and localization. First, the patch features of normal images are extracted by a deep network pre-trained on nature images. Then, the prototypes of the normal patch features are learned by non-parametric clustering. Finally, we construct an image anomaly localization network (ProtoAD) by appending the feature extraction network with <i>L</i>2 feature normalization, a <span>(1times 1)</span> convolutional layer, a channel max-pooling, and a subtraction operation. We use the prototypes as the kernels of the <span>(1times 1)</span> convolutional layer; therefore, our neural network does not need a training phase and can conduct anomaly detection and localization in an end-to-end manner. Extensive experiments on two challenging industrial anomaly detection datasets, MVTec AD and BTAD, demonstrate that ProtoAD achieves competitive performance compared to the state-of-the-art methods with a higher inference speed. 
The code and pre-trained models are publicly available at https://github.com/98chao/ProtoAD.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"45 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140938881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-08 | DOI: 10.1007/s11063-024-11613-0
Kyungdeuk Ko, Donghyeon Kim, Kyungseok Oh, Hanseok Ko
Voice conversion (VC) is the task of changing the speech of a source speaker to a target voice while preserving the linguistic information of the source speech. Existing VC methods typically use the mel-spectrogram as both input and output, so a separate vocoder is required to transform the mel-spectrogram into a waveform. Consequently, VC performance varies with vocoder performance, and noisy speech can be generated due to problems such as train-test mismatch. In this paper, we propose WaveVC, a speech- and fundamental-frequency-consistent raw audio voice conversion method. Unlike other methods, WaveVC does not require a separate vocoder and can perform VC directly on the raw audio waveform using 1D convolutions, eliminating the performance degradation caused by the train-test mismatch of the vocoder. In the training phase, WaveVC employs a speech loss and an F0 loss to preserve the content of the source speech and generate F0-consistent speech using pre-trained networks. In the test phase, the F0 feature of the source speech is concatenated with a content embedding vector to ensure that the converted speech follows the fundamental frequency contour of the source speech. WaveVC achieves higher performance than baseline methods in both many-to-many and any-to-any VC. The converted samples are available online.
{"title":"WaveVC: Speech and Fundamental Frequency Consistent Raw Audio Voice Conversion","authors":"Kyungdeuk Ko, Donghyeon Kim, Kyungseok Oh, Hanseok Ko","doi":"10.1007/s11063-024-11613-0","DOIUrl":"https://doi.org/10.1007/s11063-024-11613-0","url":null,"abstract":"<p>Voice conversion (VC) is a task for changing the speech of a source speaker to the target voice while preserving linguistic information of the source speech. The existing VC methods typically use mel-spectrogram as both input and output, so a separate vocoder is required to transform mel-spectrogram into waveform. Therefore, the VC performance varies depending on the vocoder performance, and noisy speech can be generated due to problems such as train-test mismatch. In this paper, we propose a speech and fundamental frequency consistent raw audio voice conversion method called WaveVC. Unlike other methods, WaveVC does not require a separate vocoder and can perform VC directly on raw audio waveform using 1D convolution. This eliminates the issue of performance degradation caused by the train-test mismatch of the vocoder. In the training phase, WaveVC employs speech loss and F0 loss to preserve the content of the source speech and generate F0 consistent speech using the pre-trained networks. WaveVC is capable of converting voices while maintaining consistency in speech and fundamental frequency. In the test phase, the F0 feature of the source speech is concatenated with a content embedding vector to ensure the converted speech follows the fundamental frequency flow of the source speech. WaveVC achieves higher performances than baseline methods in both many-to-many VC and any-to-any VC. 
The converted samples are available online.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"37 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140887887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-08 | DOI: 10.1007/s11063-024-11614-z
Jingyu Zhao, Ruwei Li, Maocun Tian, Weidong An
To address the poor representation capability and low data utilization of end-to-end speech recognition models in deep learning, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR), trained under a multi-task learning paradigm. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model's representation capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to exploit the data effectively. Our approach is rigorously evaluated on the Aishell-1 dataset, and its effectiveness is further validated on the English WSJ corpus. The experimental results show a noteworthy 4.6% reduction in character error rate, indicating significantly improved speech recognition performance. These findings showcase the effectiveness and potential of the proposed MM-ASR model for end-to-end speech recognition tasks.
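One common realization of fusing inter-layer information in a shared encoder is a softmax-weighted sum over layer outputs. The abstract does not detail the fusion module, so the sketch below is only a plausible reading: the weights would be learned in the model, whereas here they default to uniform.

```python
import numpy as np

def multi_scale_fusion(layer_feats, logits=None):
    """Fuse hidden states from several encoder layers by a softmax-weighted
    sum over layers -- one plausible reading of 'multi-scale feature fusion'."""
    F = np.stack(layer_feats)                  # (L, T, D) stacked layer outputs
    if logits is None:
        logits = np.zeros(len(layer_feats))    # uniform weights as a stand-in
    w = np.exp(logits) / np.exp(logits).sum()  # normalized fusion weights
    return np.tensordot(w, F, axes=1)          # (T, D) fused representation
```

Lower layers tend to carry acoustic detail and higher layers more phonetic abstraction, so exposing all of them to the decoder is the motivation for this kind of fusion.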
{"title":"Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition","authors":"Jingyu Zhao, Ruwei Li, Maocun Tian, Weidong An","doi":"10.1007/s11063-024-11614-z","DOIUrl":"https://doi.org/10.1007/s11063-024-11614-z","url":null,"abstract":"<p>To address the challenges of the poor representation capability and low data utilization rate of end-to-end speech recognition models in deep learning, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR). It adopts a multi-task learning paradigm for training. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model’s characterization capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to effectively exploit data information. Our approach is rigorously evaluated on the Aishell-1 dataset and further validated its effectiveness on the English corpus WSJ. The experimental results demonstrate a noteworthy 4.6<span>(%)</span> reduction in character error rate, indicating significantly improved speech recognition performance . These findings showcase the effectiveness and potential of our proposed MM-ASR model for end-to-end speech recognition tasks.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"29 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140942262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-05-08 | DOI: 10.1007/s11063-024-11605-0
Shuangmei Wang, Yang Cao, Tieru Wu
Few-shot class-incremental learning (FSCIL) struggles to incrementally recognize novel classes from few examples without catastrophic forgetting of old classes or overfitting to new classes. We propose TLCE, which ensembles multiple pre-trained models to improve separation of novel and old classes. Specifically, we use episodic training to map images from old classes to quasi-orthogonal prototypes, which minimizes interference between old and new classes. Then, we incorporate the use of ensembling diverse pre-trained models to further tackle the challenge of data imbalance and enhance adaptation to novel classes. Extensive experiments on various datasets demonstrate that our transfer learning ensemble approach outperforms state-of-the-art FSCIL methods.
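The two ingredients named above, quasi-orthogonal prototypes and an ensemble of pre-trained embedders, can be sketched as follows. This is an assumption-laden illustration, not the paper's method: random high-dimensional unit vectors stand in for the episodically trained prototypes (they are nearly orthogonal by concentration of measure), and the ensemble simply averages cosine similarities before a nearest-prototype decision.

```python
import numpy as np

def quasi_orthogonal_prototypes(n_classes, dim, seed=0):
    """Random unit vectors: in high dimension these are nearly orthogonal,
    a cheap stand-in for episodically trained class prototypes."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(n_classes, dim))
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def ensemble_predict(feats_per_model, protos_per_model):
    """Average cosine similarities from several pre-trained embedding
    models, then pick the nearest class prototype for each query."""
    scores = None
    for f, P in zip(feats_per_model, protos_per_model):
        f = f / np.linalg.norm(f, axis=-1, keepdims=True)
        s = f @ P.T                       # (n_queries, n_classes)
        scores = s if scores is None else scores + s
    return np.argmax(scores, axis=-1)
```

Keeping old-class prototypes (quasi-)orthogonal limits interference when new-class prototypes are added, which is the mechanism behind the forgetting resistance claimed above.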
{"title":"TLCE: Transfer-Learning Based Classifier Ensembles for Few-Shot Class-Incremental Learning","authors":"Shuangmei Wang, Yang Cao, Tieru Wu","doi":"10.1007/s11063-024-11605-0","DOIUrl":"https://doi.org/10.1007/s11063-024-11605-0","url":null,"abstract":"<p>Few-shot class-incremental learning (FSCIL) struggles to incrementally recognize novel classes from few examples without catastrophic forgetting of old classes or overfitting to new classes. We propose TLCE, which ensembles multiple pre-trained models to improve separation of novel and old classes. Specifically, we use episodic training to map images from old classes to quasi-orthogonal prototypes, which minimizes interference between old and new classes. Then, we incorporate the use of ensembling diverse pre-trained models to further tackle the challenge of data imbalance and enhance adaptation to novel classes. Extensive experiments on various datasets demonstrate that our transfer learning ensemble approach outperforms state-of-the-art FSCIL methods.</p>","PeriodicalId":51144,"journal":{"name":"Neural Processing Letters","volume":"12 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140887658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}