
Latest Publications in Computer Science

Cro-MTVITS: An end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan based on VITS
IF 3.4 · Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-10-01 · Epub Date: 2026-02-24 · DOI: 10.1016/j.csl.2026.101956
Weizhao Zhang , Mengjuan Wang , Junzhi Li , Hongwu Yang
Cross-lingual speech synthesis is a key research focus in speech synthesis, allowing a single model to generate speech in multiple languages for one speaker. In China, while Mandarin is the official language, approximately 4 million people speak Tibetan as their native language. Previous Mandarin–Tibetan cross-lingual research has largely concentrated on the Lhasa dialect, often overlooking the Kham and Amdo dialects, and has relied on autoregressive models, which still produce speech quality inferior to that of major languages. To address these challenges, we propose Cro-MTVITS, an end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan. First, we constructed a large-scale multi-dialect Tibetan corpus covering the Lhasa, Kham, and Amdo dialects, totaling 52.2 h. Then, we developed a baseline model based on VITS, incorporating speaker and language embeddings into the text encoder, posterior encoder, decoder, stochastic duration predictor (SDP), and flow to enable cross-lingual synthesis. Finally, we enhanced this baseline with an improved posterior encoder, an improved SDP, and pre-trained language and speech models, yielding significant performance gains. Cro-MTVITS consistently achieved higher mean opinion score (MOS) values than the VITS baseline across all languages and scenarios, with improvements ranging from 0.07 to 0.21 points. Statistical tests confirmed that Cro-MTVITS significantly outperforms the baseline. Overall, experimental results demonstrate that our model surpasses the baseline in both subjective and objective evaluations, enabling high-quality cross-lingual speech synthesis between Mandarin and multi-dialect Tibetan. Synthesized speech samples are available on the demo page.
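The reported MOS improvements (0.07–0.21 points) are differences of averaged listener ratings on a 1–5 scale. A minimal sketch of how a mean opinion score and the resulting improvement are computed; the rating values below are invented for illustration and are not from the paper:

```python
from statistics import mean

def mos(ratings):
    """Mean opinion score: the average of listener ratings on a 1-5 scale."""
    return mean(ratings)

# Hypothetical per-listener ratings for one synthesized utterance.
baseline_ratings = [3.5, 4.0, 3.5, 4.0]      # VITS baseline
cro_mtvits_ratings = [4.0, 4.0, 4.0, 4.5]    # enhanced model

improvement = mos(cro_mtvits_ratings) - mos(baseline_ratings)
print(round(improvement, 2))
```

In practice such differences are reported with statistical significance tests over many utterances and listeners, as the abstract notes.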
Citations: 0
Deepening graph-based approaches for Portuguese open information extraction with LLM augmentation
IF 3.4 · Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-10-01 · Epub Date: 2026-02-23 · DOI: 10.1016/j.csl.2026.101963
Gabriel Silva , Mário Rodrigues , António Teixeira , Marlene Amorim
Utilizing richer information, such as structural and syntactic details, can enhance Natural Language Processing (NLP) tasks like Open Information Extraction (Open IE), particularly for languages with limited resources like Portuguese. Knowledge Graphs (KGs) offer a robust solution by unifying diverse annotations and enabling the application of Graph Machine Learning (Graph ML).
This paper presents an advanced framework for Portuguese Open IE, integrating KGs and Graph ML with Large Language Model (LLM) augmentation. Our framework employs a three-stage process: (1) initial Knowledge Graph (KG) construction from text, followed by (2) Predicate Extraction and (3) Subject/Object Extraction, both leveraging GraphSAGE models. Large Language Models (LLMs), specifically DeepSeek, are used for augmentation when Graph ML predictions are absent, or for refining and validating extractions.
We present two versions of a system, evaluated on a Portuguese dataset. Automatic (word-based) evaluation of the best version yielded an F1-score of 64.9% for Predicate extraction and 89.7% for Subject/Object extraction. The final end-to-end performance of the system is an F1-score of 58.2%.
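The word-based F1-scores quoted above are the harmonic mean of precision and recall over extracted words. A minimal sketch of the standard computation from true-positive, false-positive, and false-negative counts; the counts are illustrative, not the paper's actual confusion figures:

```python
def f1_score(tp, fp, fn):
    """Word-level F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical word counts for one extraction run.
print(round(f1_score(tp=90, fp=40, fn=60), 3))
```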
A human evaluation was conducted on 51 Portuguese sentences (yielding 100 triples) by two annotators, achieving a substantial agreement (Cohen’s Kappa of 0.67). The system extracted an average of 1.84 triples per sentence, with 53.9% deemed correct. Notably, this version significantly reduced invalid/wrong extractions to 6.6% from 31.7% in the previous version, demonstrating improved Precision while maintaining the ability to extract multiple meaningful triples.
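Cohen's Kappa, used above to quantify inter-annotator agreement (0.67), corrects the raw agreement rate for the agreement expected by chance. A minimal sketch with invented annotator labels (not the study's actual annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators judging ten extracted triples as correct (c) or wrong (w).
ann1 = ["c", "c", "c", "c", "c", "c", "c", "w", "w", "w"]
ann2 = ["c", "c", "c", "c", "c", "c", "w", "w", "w", "c"]
print(round(cohens_kappa(ann1, ann2), 2))
```

Values between 0.61 and 0.80 are conventionally read as "substantial" agreement, matching the paper's characterization.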
Citations: 0
Dual-Resource constrained flexible job-shop scheduling with ergonomic considerations in conventional and human-robot systems using an enhanced NSGA-II with teaching-learning effect
IF 11.4 · Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2026-10-01 · Epub Date: 2026-03-06 · DOI: 10.1016/j.rcim.2026.103272
Shaban Usman , Tianrun Ye , Haotian Xue , Lei Liu , Weiwei Qin , Ping Zhang , Ailong Yuan , Chueh Ting , Yanli Gong , Chunming Gao
The dual-resource constrained flexible job-shop scheduling problem (DRCFJSP) addresses practical challenges in modern production systems, especially where human and robotic resources are jointly managed. This study proposes a DRCFJSP model with ergonomic consideration (DRCFJSP-ER), aiming to simultaneously enhance productivity and the well-being of workers in both conventional and human-robot systems. Ergonomic load in a job-shop environment is assessed using the rapid upper limb assessment (RULA) score by introducing three novel evaluation metrics: the weighted average RULA score for operations, the cumulative RULA score for operations, and the cumulative RULA score for the entire job-shop cycle. To efficiently solve the DRCFJSP-ER, we propose an enhanced NSGA-II with teaching-learning effect (ENSGA-TL) to simultaneously minimize the makespan and maximum cumulative RULA score. A comprehensive analysis based on standard performance metrics is conducted to evaluate the effectiveness of ENSGA-TL for DRCFJSP-ER using newly generated test instances. Additionally, two real-world case studies in an agricultural production environment, selected for their labor-intensive and robotics-relevant characteristics, demonstrate the model’s effectiveness and adaptability to conventional and smart robotic production systems. The results validate the potential of the DRCFJSP-ER model and the ENSGA-TL algorithm in improving production efficiency and protecting worker well-being.
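NSGA-II ranks candidate solutions by Pareto dominance over the objectives, here makespan and maximum cumulative RULA score, both minimized. A minimal sketch of the dominance test and the first non-dominated front used in NSGA-II's ranking step; the schedule objective values are invented for illustration:

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective and strictly better on one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """First non-dominated front: solutions no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Hypothetical schedules as (makespan, cumulative RULA score) pairs.
schedules = [(120, 48), (130, 40), (125, 45), (140, 39), (135, 50)]
print(pareto_front(schedules))
```

The full algorithm additionally sorts later fronts and breaks ties by crowding distance; the teaching-learning enhancement in ENSGA-TL is a further search operator beyond this sketch.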
Citations: 0
Multi-modal fusion-enhanced fuzzy adaptive variable impedance control with improved DBN for robotic constant force blade grinding
IF 11.4 · Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2026-10-01 · Epub Date: 2026-03-11 · DOI: 10.1016/j.rcim.2026.103294
Yong Tao , Jiao Xue , Yazui Liu , Lin Yang , Jiewu Leng , Pai Zheng , Baicun Wang , Xiaotong Wang , Hongxing Wei
During the grinding of aeroengine blade edges, complex time-varying nonlinear coupling and uncertain disturbances pose challenges to the adaptive regulation of constant-force grinding, reducing process stability and precision. This paper proposes a multi-modal fusion-enhanced fuzzy adaptive variable impedance control with an improved deep belief network (DBN) for robotic constant-force blade grinding. Specifically, the three-dimensional model and point cloud model of the blade are integrated to extract accurate geometric information and generate reference grinding trajectories. Further, the DBN training hyperparameters are optimized using linear success-history-based adaptive differential evolution (LSHADE). This improves the DBN configuration and overcomes the limitations of conventional DBN-based force compensation with fixed network structures and single-modality inputs. On this basis, a fuzzy adaptive variable impedance control method based on the improved DBN is developed. Geometric, force/pose, and error modalities are fused to dynamically adjust the force compensation term. This design enables the controller to outperform conventional adaptive variable impedance methods under strongly time-varying conditions. It improves the interaction between the robot and the environment and realizes adaptive, actively compliant constant-force control in robotic grinding. Comparative experiments demonstrate the stability and reliability of the proposed method. Compared with mainstream methods, the proposed method reduces the grinding force error by 66.7% and 28.6%, respectively. The key error metrics MSE, RMSE, MAPE, and MAE are reduced by more than 71% and 20%, and the average surface roughness is reduced by approximately 15.6% and 5.8%, respectively.
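Setting the fuzzy/DBN machinery aside, the underlying impedance relation maps a contact-force error to a compliant position correction around the reference trajectory. A minimal simulated sketch of the textbook mass-damper-stiffness impedance loop with invented gains (this is the generic relation, not the authors' controller, which adapts these gains online):

```python
def impedance_step(x, dx, force_error, m, b, k, dt):
    """One semi-implicit Euler step of m*ddx + b*dx + k*x = force_error.
    x is the position correction added to the reference grinding trajectory."""
    ddx = (force_error - b * dx - k * x) / m
    dx_new = dx + ddx * dt
    x_new = x + dx_new * dt
    return x_new, dx_new

# Constant 5 N force error; the correction settles at force_error / k = 0.005.
x, dx = 0.0, 0.0
for _ in range(20000):  # 20 s at 1 kHz
    x, dx = impedance_step(x, dx, force_error=5.0, m=1.0, b=40.0, k=1000.0, dt=0.001)
print(round(x, 4))
```

Variable-impedance schemes adjust m, b, and k (or add a learned compensation term to force_error) while the loop structure stays the same.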
Citations: 0
EDNPOS: An open-set skeleton-based human action recognition approach for human-robot collaboration enabled by outlier exposure
IF 11.4 · Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2026-10-01 · Epub Date: 2026-03-03 · DOI: 10.1016/j.rcim.2026.103278
Ci Song , Baicun Wang , Xingyu Li , Huayong Yang , Lihui Wang
With the advent of the human-centric manufacturing paradigm in the context of Industry 5.0, human-robot collaboration (HRC) has become a crucial strategy for achieving enhanced flexibility and adaptability in manufacturing systems. Serving as a foundation for HRC deployment, human action recognition (HAR) infers human operational intent and enables robots to respond accordingly. However, existing HAR methods embedded in HRC systems mainly focus on accurately classifying actions into known categories encountered during training, with limited consideration of unknown samples in real scenarios, which may undermine the stability and safety of HRC systems. To address this issue, this work proposes a novel skeleton-based HAR algorithm with open-set recognition ability. The model ensembles backbones over three parallel branches for feature extraction, and a corresponding Energy-based Diverse Non-Parametric Outlier Synthesis (EDNPOS) learning framework is designed that generates virtual outliers as supervision signals and optimizes the decision boundary between known and unknown data. Comprehensive experiments are conducted on three public datasets: NTU RGB+D 60 (NTU 60), NW-UCLA, and InHARD. Results verify the outstanding open-set recognition ability of our model while maintaining competitive closed-set accuracy. Finally, quantitative and qualitative evaluations on a compressor assembly case demonstrate the effectiveness and promise of our method in HRC applications. This work is expected to serve as a reference for realizing a more reliable HAR function in HRC systems.
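Energy-based open-set rejection, the family EDNPOS builds on, scores a sample by the negative log-sum-exp of its classifier logits and rejects high-energy inputs as unknown actions. A minimal sketch of that decision rule with invented logits and threshold (the paper's model and threshold selection are more elaborate):

```python
import math

def energy_score(logits):
    """Negative log-sum-exp of the logits; higher energy suggests an unknown input."""
    m = max(logits)  # subtract the max for numerical stability
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def classify(logits, threshold):
    """Reject as 'unknown' when energy exceeds the threshold, else return the argmax class."""
    if energy_score(logits) > threshold:
        return "unknown"
    return max(range(len(logits)), key=lambda i: logits[i])

# A confident known action vs. a flat, uncertain one (illustrative logits).
print(classify([8.0, 0.5, 0.3], threshold=-2.0))
print(classify([0.2, 0.1, 0.3], threshold=-2.0))
```

Outlier exposure then trains the network so that synthesized outliers receive high energy, sharpening this known/unknown boundary.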
Citations: 0
Leveraging synthetic speech: TTS-driven data augmentation for effective dysarthric speech recognition
IF 3.4 · Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-10-01 · Epub Date: 2026-02-19 · DOI: 10.1016/j.csl.2026.101961
P. Vijayalakshmi , Anushiya Rachel Gladston , B. Ramani , M.P. Actlin Jeeva , K. Anantha Krishnan , T. Lavanya , T. Nagarajan
Dysarthria is a neuro-motor speech disorder that impairs a person’s ability to communicate. This necessitates a communication aid to enable interaction with both individuals and computers, typically in the form of an automatic speech recognition (ASR) system. However, conventional ASR systems exhibit high word error rates (WER) when applied to dysarthric speech, necessitating a dysarthric ASR (DASR) system. In the current work, DASR systems are developed using the SSN TDSC (Tamil Dysarthric Speech Corpus) dataset, targeting mild and moderate dysarthria. Initially, a baseline DASR system is developed with original dysarthric speech data, resulting in WERs of 9.71% for mild and 19.54% for moderate dysarthria, respectively. Developing a DASR system with low WER requires an enormous amount of dysarthric speech data. However, recording several hours of speech data from dysarthric speakers is difficult owing to their medical condition. To address this data scarcity, we explore data augmentation using text-to-speech (TTS) synthesis to generate additional dysarthric speech data. In this study, various TTS models, namely hidden Markov model-based TTS (HTS), FastSpeech2, and Tacotron2, are used for synthesizing dysarthric speech. The current work focuses on identifying the properties that the synthetic speech must exhibit to aid in improving the performance of DASR systems and on deriving the required amount of dysarthric speech data. Based on the subjective and objective evaluations carried out on the synthetic speech, FastSpeech2 outperforms the other TTS models considered in terms of preserving dysarthric speech properties. Training the DASR systems using FastSpeech2-derived augmented data reduced WERs to 3.49% for mild and 13.17% for moderate dysarthria. Further experiments revealed that a further reduction in WER (to 2.67% and 8.32% for mild and moderate dysarthria, respectively) is achieved when a moderate amount of augmented data from multiple synthesizers (FastSpeech2 and Tacotron2) is used for training. These results demonstrate the effectiveness of TTS-based data augmentation in improving DASR performance.
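WER, the metric reported throughout this abstract, is the word-level edit distance (substitutions, insertions, deletions) between the reference and hypothesis transcripts divided by the reference length. A minimal sketch of the standard dynamic-programming computation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference and j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("open the door", "open a door"))  # one substitution out of three words
```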
Citations: 0
TriTSP: A triangular joint reasoning networks for target–stance prediction
IF 3.4 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-10-01 Epub Date : 2026-02-21 DOI: 10.1016/j.csl.2026.101962
JiaYu Zhang, HongLi Zhang, ChunYu Liu, ZeShu Tian, Chao Meng, YuXiang Ma
Target–stance prediction is a novel task evolved from traditional stance detection, aiming to predict the pair of target and stance for each tweet. The task is currently solved by a two-stage method. Although this method effectively alleviates the dependence on manually labeled target information, errors generated in the first-stage target identification task directly degrade the performance of the second-stage stance detection task, resulting in obvious error cascades. Moreover, it is difficult to establish effective feature interactions between the two subtasks. To tackle these problems, we propose a triangular joint reasoning model named TriTSP. The proposed model unifies target features and stance features in a joint prediction manner to capture the correlations and interactions between them. Furthermore, inspired by the way humans express stances, we incorporate an expanded stance triangle framework into our model to infer the specified target–stance pair through the explicit pairs contained in social media. Our proposed model not only eliminates error cascades but also effectively improves the performance of the target–stance prediction task. Experiments on two benchmark datasets demonstrate that our proposed model has significant advantages over the current state-of-the-art models.
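The error-cascade argument can be made concrete with toy additive scores. A pipeline would first argmax the target, then the stance, so a first-stage mistake is locked in; scoring every (target, stance) pair jointly, which is the core idea behind joint prediction, avoids that. The additive scoring and the `interaction` term below are illustrative stand-ins, not TriTSP's actual architecture:

```python
import math

def joint_target_stance(target_scores, stance_scores, interaction):
    """Pick the (target, stance) pair with the highest joint probability.

    `interaction[(t, s)]` is a hypothetical compatibility score between
    target t and stance s; a pipeline model has no way to use it, because
    the target is frozen before the stance is scored.
    """
    logits = {}
    for t, ts in target_scores.items():
        for s, ss in stance_scores.items():
            logits[(t, s)] = ts + ss + interaction.get((t, s), 0.0)
    # softmax over the joint label space
    z = math.log(sum(math.exp(v) for v in logits.values()))
    probs = {k: math.exp(v - z) for k, v in logits.items()}
    return max(probs, key=probs.get), probs
```

In the toy setup below, a pipeline would pick "gun control" (highest target score) and then "against" (highest stance score), while the joint model recovers ("abortion", "favor") because the interaction term makes that pair the most probable overall.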
Computer Speech and Language, vol. 100, Article 101962.
Citations: 0
3D residual optimization-based trajectory planning for robotic grinding of complex curved blades
IF 11.4 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2026-10-01 Epub Date : 2026-02-27 DOI: 10.1016/j.rcim.2026.103275
Chong Lv, Lai Zou, Heng Li, Lei Ren, Feng Jiao, Xinli Wang
In robotic belt grinding of complex curved blades, the elastic contact characteristics and variable curvature distribution of the blade result in non-uniform residual height distributions in both the chordwise and spanwise directions, thereby hindering the attainment of stringent dimensional tolerances. In this work, a novel trajectory planning method for robotic grinding of blades is presented to effectively improve surface residual uniformity. Initially, a 3D residual theoretical model is established from the geometric properties of the curved surface. Subsequently, the maximum chord height between adjacent cutter contact (CC) points is recalculated by an iterative verification algorithm, and an optimized chord-height method is proposed to maximize the step length within the allowable error. Furthermore, an isoparametric trajectory and an isoscallop trajectory for 3D residual optimization are proposed, which dynamically adjust the row spacing based on the curvature changes at the CC points. Simulation and experimental results demonstrate the effectiveness of the proposed methods in terms of machining efficiency and machining quality. The machining efficiency of the optimized isoscallop method is improved by 7.4% compared with that before optimization, and the fluctuation ranges of the surface profile error of the two proposed trajectories decreased by 28.7% and 38.5%, respectively. The presented trajectory planning method provides a valuable reference for improving the consistency of machined surface quality in robotic grinding of complex curved surfaces.
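The iterative chord-height verification idea can be sketched in two dimensions: propose a parameter step between adjacent CC points, measure the actual chord deviation against the curve, and shrink the step until the deviation fits the tolerance. This is a simplified planar illustration under assumed helper names (`max_chord_deviation`, `next_cc_point`); the paper works on 3D blade surfaces with an elastic-contact residual model:

```python
import math

def max_chord_deviation(curve, t0, t1, n=50):
    """Maximum distance from the curve segment to the chord joining its endpoints."""
    x0, y0 = curve(t0)
    x1, y1 = curve(t1)
    cx, cy = x1 - x0, y1 - y0
    clen = math.hypot(cx, cy) or 1e-12
    worst = 0.0
    for i in range(1, n):
        t = t0 + (t1 - t0) * i / n
        px, py = curve(t)
        # perpendicular distance of (px, py) from the chord
        d = abs(cx * (py - y0) - cy * (px - x0)) / clen
        worst = max(worst, d)
    return worst

def next_cc_point(curve, t0, h_allow, dt0=0.5, t_max=1.0):
    """Shrink the parameter step until the chord height is within h_allow,
    a toy version of the paper's iterative verification step."""
    dt = min(dt0, t_max - t0)
    while dt > 1e-6 and max_chord_deviation(curve, t0, t0 + dt) > h_allow:
        dt *= 0.8  # geometric back-off toward the allowable chord height
    return t0 + dt
```

On a circle of radius R the analytic chord height is h = R(1 - cos(dt/2)), so with R = 10 and h_allow = 0.01 the admissible step is about dt = 0.0895; the iteration lands just under that bound while keeping the measured deviation within tolerance.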
Robotics and Computer-Integrated Manufacturing, vol. 101, Article 103275.
Citations: 0
Exploring efficient attention strategies in conformer-based sound event detection
IF 3.4 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-10-01 Epub Date : 2026-03-04 DOI: 10.1016/j.csl.2026.101967
Sara Barahona, Juan Ignacio Alvarez-Trejos, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano
Sound Event Detection (SED) requires models that can accurately localize and classify overlapping audio events within complex acoustic environments. Conformer-based architectures have demonstrated promising performance by leveraging self-attention to capture long-range dependencies. However, this global attention can accumulate across layers, which can blur local temporal boundaries and reduce detection accuracy, especially for short or closely spaced events. While increasing the input sequence length can help recover temporal detail, the quadratic complexity of the Conformer's self-attention significantly increases computational costs. To address this, we propose integrating the Efficient Conformer architecture, which introduces subsampling along the input sequence, effectively reducing the temporal dimension within blocks. This design enables processing longer input sequences at finer temporal resolution, enhancing localization accuracy without extending the output length. Using the DCASE Challenge 2023 Task 4 benchmark, system performance is evaluated via the threshold-independent Polyphonic Sound Detection Score (PSDS), which measures both localization precision (PSDS1) and class robustness (PSDS2). Experiments on the DESED validation dataset demonstrate that the Efficient Conformer not only improves temporal resolution and long-range dependency modeling but also outperforms standard Conformer and Convolutional Recurrent Neural Network (CRNN) baselines on PSDS2. Additionally, we explore lightweight attention mechanisms employing squeeze-and-excitation blocks to emulate the frequency-axis translation invariance of Frequency Dynamic Convolutions (FDY). Our approach achieves performance comparable to heavier models such as FDY+Conformer while reducing computational cost by over 69%, showing promising results for Conformer-based systems in terms of precision and model efficiency.
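The squeeze-and-excitation mechanism the abstract mentions can be illustrated with a generic, dependency-free sketch: global average pooling squeezes each channel to a scalar, a small bottleneck network turns those scalars into per-channel sigmoid gates, and the gates rescale the channels. The toy weights `w1`/`w2` stand in for learned parameters, and this is a plain SE block, not the paper's frequency-axis variant:

```python
import math

def squeeze_excite(features, w1, w2):
    """Channel attention via squeeze-and-excitation, in plain Python.

    `features` is a C x T list of channels, each a list of frame values.
    `w1` (C x C/r) and `w2` (C/r x C) are toy excitation weights.
    """
    # squeeze: global average pooling over the time axis
    z = [sum(ch) / len(ch) for ch in features]
    # excitation: bottleneck -> ReLU -> expand -> sigmoid gate per channel
    hidden = [max(0.0, sum(z[c] * w1[c][j] for c in range(len(z))))
              for j in range(len(w1[0]))]
    gates = [1.0 / (1.0 + math.exp(-sum(hidden[j] * w2[j][c]
                                        for j in range(len(hidden)))))
             for c in range(len(z))]
    # scale: reweight every frame of each channel by its gate
    return [[v * gates[c] for v in ch] for c, ch in enumerate(features)]
```

The appeal is cost: the gating path operates on C pooled scalars rather than the full C x T map, which is why SE blocks are a cheap surrogate for the heavier dynamic-convolution machinery.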
Computer Speech and Language, vol. 100, Article 101967.
Citations: 0
LaFresCat: A studio-quality Catalan multi-accent speech dataset for text-to-speech synthesis
IF 3.4 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-10-01 Epub Date : 2026-01-21 DOI: 10.1016/j.csl.2026.101945
Alex Peiró-Lilja, Carme Armentano-Oller, José Giraldo, Wendy Elvira-García, Ignasi Esquerra, Rodolfo Zevallos, Cristina España-Bonet, Martí Llopart-Font, Baybars Külebi, Mireia Farrús
Current text-to-speech (TTS) systems can learn the phonetics of a language accurately provided that the speech data used to train them covers all phonetic phenomena; for languages with different varieties, this includes all their richness and accents. This is the case for Catalan, a mid-resourced language with several dialects or accents. Although various corpora are publicly available, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, its accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recorded speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat rather than trained from scratch, starting from a version pre-trained on Central Catalan data from Common Voice. Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings, and although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.
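The Pillai score used to quantify vowel overlap is Pillai's trace from a MANOVA over the two vowel categories' formant measurements. A from-scratch sketch for the two-group, two-dimensional (F1, F2) case, computing trace(B T^-1) with B the between-group and T the total SSCP matrix; real analyses usually obtain this from a statistics package's MANOVA routine rather than hand-rolled linear algebra like this:

```python
def pillai_score(group_a, group_b):
    """Pillai's trace for two vowel categories in F1-F2 space.

    Values near 0 mean complete acoustic overlap (a likely merger);
    values near 1 mean well-separated vowel categories.
    Each group is a list of (F1, F2) measurements in Hz.
    """
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

    def scatter(pts, m):  # within-group sum-of-squares-and-cross-products
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            s[0][0] += dx * dx; s[0][1] += dx * dy
            s[1][0] += dy * dx; s[1][1] += dy * dy
        return s

    ma, mb = mean(group_a), mean(group_b)
    mg = mean(group_a + group_b)
    # between-group SSCP matrix B
    B = [[0.0, 0.0], [0.0, 0.0]]
    for m_k, n_k in ((ma, len(group_a)), (mb, len(group_b))):
        dx, dy = m_k[0] - mg[0], m_k[1] - mg[1]
        B[0][0] += n_k * dx * dx; B[0][1] += n_k * dx * dy
        B[1][0] += n_k * dy * dx; B[1][1] += n_k * dy * dy
    # within-group SSCP matrix W, total T = B + W
    Wa, Wb = scatter(group_a, ma), scatter(group_b, mb)
    T = [[B[i][j] + Wa[i][j] + Wb[i][j] for j in range(2)] for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    Tinv = [[T[1][1] / det, -T[0][1] / det],
            [-T[1][0] / det, T[0][0] / det]]
    # Pillai's trace = trace(B @ inv(T)), bounded to [0, 1] for two groups
    return sum(B[i][k] * Tinv[k][i] for i in range(2) for k in range(2))
```

Two vowel clouds far apart in formant space (e.g. an /i/-like and an /a/-like cluster) score close to 1, while the same cloud shifted by only a couple of hertz scores close to 0, which is exactly the contrast the overlap analysis relies on.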
Computer Speech and Language, vol. 100, Article 101945.
Citations: 0