Pub Date: 2026-10-01 | Epub Date: 2026-03-06 | DOI: 10.1016/j.rcim.2026.103272
Shaban Usman , Tianrun Ye , Haotian Xue , Lei Liu , Weiwei Qin , Ping Zhang , Ailong Yuan , Chueh Ting , Yanli Gong , Chunming Gao
The dual-resource constrained flexible job-shop scheduling problem (DRCFJSP) addresses practical challenges in modern production systems, especially where human and robotic resources are jointly managed. This study proposes a DRCFJSP model with ergonomic consideration (DRCFJSP-ER), aiming to simultaneously enhance productivity and the well-being of workers in both conventional and human-robot systems. Ergonomic load in a job-shop environment is assessed using the rapid upper limb assessment (RULA) score by introducing three novel evaluation metrics: the weighted average RULA score for operations, the cumulative RULA score for operations, and the cumulative RULA score for the entire job-shop cycle. To efficiently solve the DRCFJSP-ER, we propose an enhanced NSGA-II with teaching-learning effect (ENSGA-TL) to simultaneously minimize the makespan and maximum cumulative RULA score. A comprehensive analysis based on standard performance metrics is conducted to evaluate the effectiveness of ENSGA-TL for DRCFJSP-ER using newly generated test instances. Additionally, two real-world case studies in an agricultural production environment, selected for their labor-intensive and robotics-relevant characteristics, demonstrate the model’s effectiveness and adaptability to conventional and smart robotic production systems. The results validate the potential of the DRCFJSP-ER model and the ENSGA-TL algorithm in improving production efficiency and protecting worker well-being.
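As a rough illustration of the three RULA-based metrics, the sketch below reflects our reading of the definitions above, not the paper's code (the exact weighting in DRCFJSP-ER may differ): a time-weighted average RULA over a worker's operations, a cumulative score over assigned operations, and the max-cumulative objective minimized alongside makespan.

```python
def weighted_avg_rula(ops):
    """Time-weighted mean RULA over one worker's operations: ops = [(rula, duration)]."""
    total_t = sum(d for _, d in ops)
    return sum(r * d for r, d in ops) / total_t

def cumulative_rula(ops):
    """Cumulative RULA load: sum of score x duration over assigned operations."""
    return sum(r * d for r, d in ops)

def max_cumulative_rula(schedule):
    """Second objective: worst per-worker cumulative load over the whole job-shop cycle."""
    return max(cumulative_rula(ops) for ops in schedule.values())
```

For a hypothetical schedule `{"w1": [(3, 10), (5, 4)], "w2": [(7, 6)]}`, worker w1 accumulates 50 and w2 accumulates 42, so the ergonomic objective value is 50.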
{"title":"Dual-Resource constrained flexible job-shop scheduling with ergonomic considerations in conventional and human-robot systems using an enhanced NSGA-II with teaching-learning effect","authors":"Shaban Usman , Tianrun Ye , Haotian Xue , Lei Liu , Weiwei Qin , Ping Zhang , Ailong Yuan , Chueh Ting , Yanli Gong , Chunming Gao","doi":"10.1016/j.rcim.2026.103272","DOIUrl":"10.1016/j.rcim.2026.103272","url":null,"abstract":"<div><div>The dual-resource constrained flexible job-shop scheduling problem (DRCFJSP) addresses practical challenges in modern production systems, especially where human and robotic resources are jointly managed. This study proposes a DRCFJSP model with ergonomic consideration (DRCFJSP-ER), aiming to simultaneously enhance productivity and the well-being of workers in both conventional and human-robot systems. Ergonomic load in a job-shop environment is assessed using the rapid upper limb assessment (RULA) score by introducing three novel evaluation metrics: the weighted average RULA score for operations, the cumulative RULA score for operations, and the cumulative RULA score for the entire job-shop cycle. To efficiently solve the DRCFJSP-ER, we propose an enhanced NSGA-II with teaching-learning effect (ENSGA-TL) to simultaneously minimize the makespan and maximum cumulative RULA score. A comprehensive analysis based on standard performance metrics is conducted to evaluate the effectiveness of ENSGA-TL for DRCFJSP-ER using newly generated test instances. Additionally, two real-world case studies in an agricultural production environment, selected for their labor-intensive and robotics-relevant characteristics, demonstrate the model’s effectiveness and adaptability to conventional and smart robotic production systems. 
The results validate the potential of the DRCFJSP-ER model and the ENSGA-TL algorithm in improving production efficiency and protecting worker well-being.</div></div>","PeriodicalId":21452,"journal":{"name":"Robotics and Computer-integrated Manufacturing","volume":"101 ","pages":"Article 103272"},"PeriodicalIF":11.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147387361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-10-01 | Epub Date: 2026-03-11 | DOI: 10.1016/j.rcim.2026.103294
Yong Tao , Jiao Xue , Yazui Liu , Lin Yang , Jiewu Leng , Pai Zheng , Baicun Wang , Xiaotong Wang , Hongxing Wei
During the grinding of aeroengine blade edges, complex time-varying nonlinear coupling and uncertain disturbances pose challenges to the adaptive regulation of constant-force grinding, reducing process stability and precision. This paper proposes a multi-modal fusion-enhanced fuzzy adaptive variable impedance control with an improved deep belief network (DBN) for robotic constant-force blade grinding. Specifically, the three-dimensional model and the point cloud model of the blade are integrated to extract accurate geometric information and generate reference grinding trajectories. Further, the DBN training hyperparameters are optimized using linear success-history-based adaptive differential evolution (LSHADE). This improves the DBN configuration and overcomes the limitations of conventional DBN-based force compensation, namely fixed network structures and single-modality inputs. On this basis, a fuzzy adaptive variable impedance control method based on the improved DBN is developed. Geometric, force/pose, and error modalities are fused to dynamically adjust the force compensation term. This design enables the controller to outperform conventional adaptive variable impedance methods under strongly time-varying conditions, improves the interaction between the robot and the environment, and realizes adaptive, actively compliant constant-force control in robotic grinding. Comparative experiments demonstrate the stability and reliability of the proposed method. Compared with two mainstream methods, the proposed method reduces the grinding force error by 66.7% and 28.6%, respectively. The key error metrics MSE, RMSE, MAPE, and MAE are reduced by more than 71% and 20%, and the average surface roughness is reduced by approximately 15.6% and 5.8%, respectively.
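The paper's controller combines fuzzy adaptation with DBN-based multi-modal force compensation; as background only, a minimal textbook constant-force admittance loop (fixed gains, hypothetical parameter values, no fuzzy or DBN terms) can be sketched as follows. The force error drives a virtual mass-damper whose output adjusts the tool position normal to the surface.

```python
def simulate(f_d=10.0, k_env=1000.0, m=1.0, b=20.0, dt=0.001, steps=5000):
    """Constant-force admittance loop against a linear-elastic contact model.

    f_d: desired contact force; k_env: environment stiffness (both hypothetical).
    Returns the contact force after `steps` Euler integration steps.
    """
    x = v = 0.0                       # normal displacement and velocity
    for _ in range(steps):
        f_ext = k_env * max(x, 0.0)   # contact force (zero when out of contact)
        a = (f_d - f_ext - b * v) / m # virtual mass-damper driven by force error
        v += a * dt
        x += v * dt
    return k_env * max(x, 0.0)
```

With these gains the loop is underdamped but stable, and the contact force settles near the 10 N target; a variable-impedance scheme like the paper's instead adapts `b` (and the compensation term) online as conditions change.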
{"title":"Multi-modal fusion-enhanced fuzzy adaptive variable impedance control with improved DBN for robotic constant force blade grinding","authors":"Yong Tao , Jiao Xue , Yazui Liu , Lin Yang , Jiewu Leng , Pai Zheng , Baicun Wang , Xiaotong Wang , Hongxing Wei","doi":"10.1016/j.rcim.2026.103294","DOIUrl":"10.1016/j.rcim.2026.103294","url":null,"abstract":"<div><div>During the grinding of aeroengine blade edges, complex time-varying nonlinear coupling and uncertain disturbances pose challenges to the adaptive regulation of constant force grinding, reducing process stability and precision. This paper proposed a multi-modal fusion-enhanced fuzzy adaptive variable impedance control with improved deep belief network (DBN) for robotic constant force blade grinding. Specifically, the three-dimensional model and point cloud model of the blade are integrated to extract accurate geometric information and generate reference grinding trajectories. Furtherly, the DBN training hyperparameters are optimized using linear success history-based adaptive differential evolution (LSHADE). This improves the DBN configuration and overcomes the limitations of conventional DBN based force compensation with fixed network structures and single modality inputs. On this basis, a fuzzy adaptive variable impedance control method based on the improved DBN is developed. Geometric, force/pose, and error modalities are fused to dynamically adjust the force compensation term. This design enables the controller to outperform conventional adaptive variable impedance methods under strongly time-varying conditions. It improves the interaction between the robot and the environment and realizes adaptive active compliant constant-force control in robotic grinding. Comparative experiments demonstrate the stability and reliability of the proposed method. Compared with mainstream methods, the proposed method reduces the grinding force error by 66.7% and 28.6%, respectively. 
The key error metrics MSE, RMSE, MAPE, and MAE are reduced by more than 71% and 20%, and the average surface roughness is reduced by approximately 15.6% and 5.8%, respectively</div></div>","PeriodicalId":21452,"journal":{"name":"Robotics and Computer-integrated Manufacturing","volume":"101 ","pages":"Article 103294"},"PeriodicalIF":11.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147387362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-10-01 | Epub Date: 2026-02-24 | DOI: 10.1016/j.csl.2026.101956
Weizhao Zhang , Mengjuan Wang , Junzhi Li , Hongwu Yang
Cross-lingual speech synthesis is a key research focus in speech synthesis, allowing a single model to generate speech in multiple languages for one speaker. In China, while Mandarin is the official language, approximately 4 million people speak Tibetan as their native language. Previous Mandarin–Tibetan cross-lingual research has largely concentrated on the Lhasa dialect, often overlooking the Kham and Amdo dialects, and has relied on autoregressive models, which still produce speech quality inferior to that of major languages. To address these challenges, we propose Cro-MTVITS, an end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan. First, we constructed a large-scale multi-dialect Tibetan corpus covering the Lhasa, Kham, and Amdo dialects, totaling 52.2 h. Then, we developed a baseline model based on VITS, incorporating speaker and language embeddings into the text encoder, posterior encoder, decoder, stochastic duration predictor (SDP), and flow to enable cross-lingual synthesis. Finally, we enhanced this baseline with an improved posterior encoder, an improved SDP, and pre-trained language and speech models, yielding significant performance gains. Cro-MTVITS consistently achieved higher mean opinion score (MOS) values than the VITS baseline across all languages and scenarios, with improvements ranging from 0.07 to 0.21 points. Statistical tests confirmed that Cro-MTVITS significantly outperforms the baseline. Overall, experimental results demonstrate that our model surpasses the baseline in both subjective and objective evaluations, enabling high-quality cross-lingual speech synthesis between Mandarin and multi-dialect Tibetan. The synthesized speech samples can be found on the demos page [1].
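The embedding-conditioning idea can be illustrated in heavily simplified form (VITS uses learned embedding tables and conditions each submodule separately; the vectors and names below are purely hypothetical): a speaker vector and a language vector are added to every hidden frame of a module's input.

```python
# Toy embedding tables (learned parameters in a real model).
SPK = {"spk0": [0.1, -0.2], "spk1": [0.3, 0.0]}
LANG = {"mandarin": [0.5, 0.5], "tibetan_amdo": [-0.5, 0.2]}

def condition(hidden, speaker, language):
    """Add speaker and language embeddings to every frame of `hidden`
    (a list of frames, each a list of floats)."""
    s, l = SPK[speaker], LANG[language]
    return [[h + a + b for h, a, b in zip(frame, s, l)] for frame in hidden]
```

Because the same mechanism is applied in the encoder, decoder, SDP, and flow, the model can pair any seen speaker with any seen language at synthesis time, which is what enables cross-lingual generation.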
{"title":"Cro-MTVITS: An end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan based on VITS","authors":"Weizhao Zhang , Mengjuan Wang , Junzhi Li , Hongwu Yang","doi":"10.1016/j.csl.2026.101956","DOIUrl":"10.1016/j.csl.2026.101956","url":null,"abstract":"<div><div>Cross-lingual speech synthesis is a key research focus in speech synthesis, allowing a single model to generate speech in multiple languages for one speaker. In China, while Mandarin is the official language, approximately 4 million people speak Tibetan as their native language. Previous Mandarin–Tibetan cross-lingual researches have largely concentrated on the Lhasa dialect, often overlooking the Kham and Amdo dialects, and have relied on autoregressive models, which still produce speech quality inferior to that of major languages. To address these challenges, we propose Cro-MTVITS, an end-to-end cross-lingual speech synthesis model for Mandarin and multi-dialect Tibetan. Firstly, we constructed a large-scale multi-dialect Tibetan corpus covering Lhasa, Kham, and Amdo dialects, totaling 52.2 h. Then, we developed a baseline model based on VITS, incorporating speaker and language embeddings into the text encoder, posterior encoder, decoder, stochastic duration predictor (SDP) and flow to enable cross-lingual synthesis. Finally, we enhanced this baseline model with an improved posterior encoder, SDP, and pre-trained language and speech models, yielding significant performance gains. Cro-MTVITS consistently achieved higher mean opinion score (MOS) values than the VITS baseline across all languages and scenarios, with improvements ranging from 0.07 to 0.21 points. Statistical tests confirmed that Cro-MTVITS significantly outperforms the baseline. Overall, experimental results demonstrate that our model surpasses the baseline in both subjective and objective evaluations, enabling high-quality cross-lingual speech synthesis between Mandarin and multi-dialect Tibetan. 
The synthesized speech samples can be found on demos<span><span><sup>1</sup></span></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101956"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Utilizing richer information, such as structural and syntactic details, can enhance Natural Language Processing (NLP) tasks like Open Information Extraction (Open IE), particularly for languages with limited resources like Portuguese. Knowledge Graphs (KGs) offer a robust solution by unifying diverse annotations and enabling the application of Graph Machine Learning (Graph ML).
This paper presents an advanced framework for Portuguese Open IE, integrating KGs and Graph ML with Large Language Model (LLM) augmentation. Our framework employs a three-stage process: (1) initial Knowledge Graph (KG) construction from text, followed by (2) Predicate Extraction and (3) Subject/Object Extraction, both leveraging GraphSAGE models. Large Language Models (LLMs) (DeepSeek) are used for augmentation when Graph ML predictions are absent or for refining/validating extractions.
We present two versions of a system that was evaluated on a Portuguese dataset. Automatic evaluation (word-based) for the best version of the system yielded an F1-score of 64.9% for Predicate extraction and 89.7% for Subject/Object extraction. The final end-to-end performance of the system is an F1-score of 58.2%.
A human evaluation was conducted on 51 Portuguese sentences (yielding 100 triples) by two annotators, achieving a substantial agreement (Cohen’s Kappa of 0.67). The system extracted an average of 1.84 triples per sentence, with 53.9% deemed correct. Notably, this version significantly reduced invalid/wrong extractions to 6.6% from 31.7% in the previous version, demonstrating improved Precision while maintaining the ability to extract multiple meaningful triples.
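The three-stage flow described above can be sketched as a pipeline skeleton with the LLM as fallback and validator (function names and signatures are ours, purely illustrative, not the paper's API):

```python
def extract_triples(sentence, build_kg, predict_predicates, predict_args,
                    llm_augment, llm_validate):
    """Three-stage Open IE: KG construction, predicate extraction,
    subject/object extraction, with LLM augmentation and validation."""
    kg = build_kg(sentence)                 # stage 1: KG from text
    predicates = predict_predicates(kg)     # stage 2: GraphSAGE over the KG
    if not predicates:                      # no Graph ML prediction ->
        predicates = llm_augment(sentence)  # fall back to LLM augmentation
    triples = []
    for pred in predicates:
        for subj, obj in predict_args(kg, pred):     # stage 3
            triple = (subj, pred, obj)
            if llm_validate(sentence, triple):       # LLM refinement/validation
                triples.append(triple)
    return triples
```

The design point is that each stage is a pluggable callable, so the Graph ML models and the LLM can be swapped or ablated independently.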
{"title":"Deepening graph-based approaches for Portuguese open information extraction with LLM augmentation","authors":"Gabriel Silva , Mário Rodrigues , António Teixeira , Marlene Amorim","doi":"10.1016/j.csl.2026.101963","DOIUrl":"10.1016/j.csl.2026.101963","url":null,"abstract":"<div><div>Utilizing richer information, such as structural and syntactic details, can enhance Natural Language Processing (NLP) tasks like Open Information Extraction (Open IE), particularly for languages with limited resources like Portuguese. Knowledge Graphs (KGs) offer a robust solution by unifying diverse annotations and enabling the application of Graph Machine Learning (Graph ML).</div><div>This paper presents an advanced framework for Portuguese Open IE, integrating KGs and Graph ML with Large Language Model (LLM) augmentation. Our framework employs a three-stage process: (1) initial Knowledge Graph (KG) construction from text, followed by (2) Predicate Extraction and (3) Subject/Object Extraction, both leveraging GraphSAGE models. Large Language Models (LLMs) (DeepSeek) are used for augmentation when Graph ML predictions are absent or for refining/validating extractions.</div><div>We present two versions of a system that was evaluated on a Portuguese dataset. Automatic evaluation (word-based) for the best version of the system yielded an F1-score of 64.9% for Predicate extraction and 89.7% for Subject/Object extraction. The final end-to-end performance of the system is an F1-score of 58.2%.</div><div>A human evaluation was conducted on 51 Portuguese sentences (yielding 100 triples) by two annotators, achieving a substantial agreement (Cohen’s Kappa of 0.67). The system extracted an average of 1.84 triples per sentence, with 53.9% deemed correct. 
Notably, this version significantly reduced invalid/wrong extractions to 6.6% from 31.7% in the previous version, demonstrating improved Precision while maintaining the ability to extract multiple meaningful triples.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101963"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147386292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-10-01 | Epub Date: 2026-03-03 | DOI: 10.1016/j.rcim.2026.103278
Ci Song , Baicun Wang , Xingyu Li , Huayong Yang , Lihui Wang
With the advent of the human-centric manufacturing paradigm in the context of Industry 5.0, human-robot collaboration (HRC) has become a crucial strategy for achieving enhanced flexibility and adaptability in manufacturing systems. Serving as a foundation for HRC deployment, human action recognition (HAR) infers human operational intent and enables robots to respond accordingly. However, existing HAR methods embedded in HRC systems mainly focus on accurately classifying actions into known categories encountered during training, with limited consideration of unknown samples in real scenarios, which may undermine the stability and safety of HRC systems. To address this issue, this work proposes a novel skeleton-based HAR algorithm with open-set recognition ability. The model extracts features with three parallel ensembled backbone branches, and a corresponding Energy-based Diverse Non-Parametric Outlier Synthesis (EDNPOS) learning framework is designed that generates virtual outliers as supervision signals and optimizes the decision boundary between known and unknown data. Comprehensive experiments are conducted on three public datasets: NTU RGB+D 60 (NTU 60), NW-UCLA, and InHARD. Results verify the outstanding open-set recognition ability of our model while maintaining competitive closed-set accuracy. Finally, quantitative and qualitative evaluations on a compressor assembly case demonstrate the effectiveness and promise of our method in HRC applications. This work is expected to serve as a reference for realizing a more reliable HAR function in HRC systems.
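Energy-based open-set decisions of the kind EDNPOS builds on typically score a sample with the free energy of its logits and reject high-energy inputs as unknown; a minimal sketch (threshold and logit values hypothetical; EDNPOS additionally trains the boundary with synthesized virtual outliers):

```python
import math

def energy(logits):
    """Free energy of a logit vector: -log-sum-exp, numerically stabilized.
    Confident (peaked) logits give low energy; flat logits give high energy."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def classify(logits, classes, tau):
    """Reject as 'unknown' when the energy exceeds threshold tau,
    otherwise return the argmax class."""
    if energy(logits) > tau:
        return "unknown"
    return classes[max(range(len(logits)), key=logits.__getitem__)]
```

For example, peaked logits such as `[10, 0, 0]` have energy near -10 and are accepted, while near-uniform logits over three classes have energy near -1.1 and are rejected at a threshold of -5.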
{"title":"EDNPOS: An open-set skeleton-based human action recognition approach for human-robot collaboration enabled by outlier exposure","authors":"Ci Song , Baicun Wang , Xingyu Li , Huayong Yang , Lihui Wang","doi":"10.1016/j.rcim.2026.103278","DOIUrl":"10.1016/j.rcim.2026.103278","url":null,"abstract":"<div><div>With the advent of human-centric manufacturing paradigm in the context of Industry 5.0, human-robot collaboration (HRC) becomes a crucial strategy to achieving enhanced flexibility and adaptability in manufacturing systems. Serving as a foundation for HRC deployment, human action recognition (HAR) infers human operational intent and enables robots to respond accordingly. However, existing HAR methods embedded in HRC systems mainly focus on accurately classifying actions into a known category encountered during training, with limited consideration of unknown sample in real scenarios, which may undermine the stability and safety of HRC systems. To address this issue, this work proposes a novel skeleton-based HAR algorithm with open-set recognition ability. The model features ensembled backbones for feature extraction using three parallel branches, and a corresponding Energy-based Diverse Non-Parametric Outlier Synthesis (EDNPOS) learning framework is designed which is able to generate virtual outliers as supervision signals and optimize the decision boundary between known and unknown data. Comprehensive experiments are conducted on three public datasets NTU RGB+<em>D</em> 60 (NTU 60), NW-UCLA and InHARD. Results verify the outstanding open-set recognition ability of our model while maintaining competitive closed-set accuracy. Finally, quantitative and qualitative evaluations on a compressor assembly case demonstrate the effectiveness and promise of our method in HRC applications. 
This work is expected to serve as a reference for realizing a more reliable HAR function in HRC systems.</div></div>","PeriodicalId":21452,"journal":{"name":"Robotics and Computer-integrated Manufacturing","volume":"101 ","pages":"Article 103278"},"PeriodicalIF":11.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147360721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-10-01 | Epub Date: 2026-02-19 | DOI: 10.1016/j.csl.2026.101961
P. Vijayalakshmi , Anushiya Rachel Gladston , B. Ramani , M.P. Actlin Jeeva , K. Anantha Krishnan , T. Lavanya , T. Nagarajan
Dysarthria is a neuro-motor speech disorder that impairs a person’s ability to communicate. This necessitates a communication aid that enables interaction with both individuals and computers, typically in the form of an automatic speech recognition (ASR) system. However, conventional ASR systems exhibit high word error rates (WER) when applied to dysarthric speech, necessitating a dysarthric ASR (DASR) system. In the current work, DASR systems are developed using the SSN TDSC (Tamil Dysarthric Speech Corpus) dataset, targeting mild and moderate dysarthria. Initially, a baseline DASR system is developed with the original dysarthric speech data, resulting in WERs of 9.71% for mild and 19.54% for moderate dysarthria, respectively. Developing a DASR system with a low WER requires a large amount of dysarthric speech data; however, recording several hours of speech from dysarthric speakers is difficult owing to their medical condition. To address this data scarcity, we explore data augmentation using text-to-speech (TTS) synthesis to generate additional dysarthric speech data. Various TTS models, namely hidden Markov model-based TTS (HTS), FastSpeech2, and Tacotron2, are used to synthesize dysarthric speech. The current work focuses on identifying the properties that the synthetic speech must exhibit to improve the performance of DASR systems and on deriving the required amount of dysarthric speech data. Based on the subjective and objective evaluations of the synthetic speech, FastSpeech2 outperforms the other TTS models considered in preserving dysarthric speech properties. Training the DASR systems with FastSpeech2-derived augmented data reduced the WERs to 3.49% for mild and 13.17% for moderate dysarthria. Further experiments revealed that a reduction in WER (2.67% and 8.32% for mild and moderate dysarthria) is achieved when a moderate amount of augmented data from multiple synthesizers (FastSpeech2 and Tacotron2) is used for training. These results demonstrate the effectiveness of TTS-based data augmentation in improving DASR performance.
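WER, the metric used throughout, is the word-level Levenshtein distance (substitutions, deletions, and insertions) normalized by the reference length; a self-contained implementation:

```python
def wer(ref, hyp):
    """Word error rate between a reference and a hypothesis transcript."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))          # edit distances for the empty reference
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion from the reference
                         cur[j - 1] + 1,     # insertion into the hypothesis
                         prev[j - 1] + cost) # substitution or match
        prev = cur
    return prev[len(h)] / len(r)
```

For instance, recognizing "the cat sat" as "the cat" is one deletion out of three reference words, a WER of 33.3%.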
{"title":"Leveraging synthetic speech: TTS-driven data augmentation for effective dysarthric speech recognition","authors":"P. Vijayalakshmi , Anushiya Rachel Gladston , B. Ramani , M.P. Actlin Jeeva , K. Anantha Krishnan , T. Lavanya , T. Nagarajan","doi":"10.1016/j.csl.2026.101961","DOIUrl":"10.1016/j.csl.2026.101961","url":null,"abstract":"<div><div>Dysarthria is a neuro-motor speech disorder that impairs a person’s ability to communicate. This necessitates a communication aid to enable interaction with both individuals and computers, typically in the form of an automatic speech recognition (ASR) system. However, conventional ASR systems exhibit high word error rates (WER) when applied to dysarthric speech necessitating a dysarthric ASR (DASR) system. In the current work, DASR systems are developed using SSN TDSC (Tamil Dysarthric Speech Corpus) dataset, targeting mild and moderate dysarthria. Initially, a baseline DASR system is developed with original dysarthric speech data resulting in WER of 9.71% for mild and 19.54 % for moderate dysarthria respectively. In order to develop a DASR system with low WER enormous amount of dysarthric speech data is required. However, recording several hours of speech data from dysarthric speakers is difficult owing to their medical condition. To address this data scarcity, we explore data augmentation using text-to-speech (TTS) synthesis to generate additional dysarthric speech data. In this study, various TTS models, namely, hidden Markov model-based TTS (HTS), FastSpeech2 and Tacotron2 are used for synthesizing dysarthric speech. The current work focuses on identifying the properties that the synthetic speech must exhibit to aid in improving the performance of DASR systems and to derive the required amount of dysarthric speech data. 
Based on the subjective and objective evaluations carried out on the synthetic speech, FastSpeech2 outperforms the other TTS models considered in terms of preserving the dysarthric speech properties. Training the DASR systems using FastSpeech2-derived augmented data resulted in reduced WERs of 3.49% for mild and 13.17% for moderate dysarthria. Further experiments revealed that a reduction in WER (2.67% & 8.32% for mild and moderate dysarthria) is achieved when moderate amount of augmented data from multiple synthesizers (Fastspeech2 & Tacotron2) is used for training. These results demonstrate the effectiveness of TTS-based data augmentation in improving DASR performance.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101961"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Target–stance prediction is a novel task that evolved from the traditional stance detection task, aiming to predict the target–stance pair for each tweet. The task is currently solved by two-stage methods. Although this approach effectively alleviates the dependence on manually labeled target information, errors generated in the first-stage target identification task directly degrade the performance of the second-stage stance detection task, resulting in obvious error cascades. Moreover, it is difficult to establish effective feature interactions between the two subtasks. To tackle these problems, we propose a triangular joint reasoning model named TriTSP. The proposed model unifies target features and stance features in a joint prediction manner to capture the correlations and interactions between them. Furthermore, inspired by the way humans express stances, we incorporate the expanded stance triangle framework into our model to infer the specified target–stance pair from the explicit pairs contained in social media. Our proposed model not only eliminates error cascades but also effectively improves the performance of the target–stance prediction task.
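The contrast with two-stage pipelines can be sketched as scoring the (target, stance) product space jointly, so that committing to a wrong target can never force a wrong stance (toy scorer below, not TriTSP's learned triangular networks):

```python
def joint_predict(tweet, targets, stances, score):
    """Return the (target, stance) pair maximizing a single joint score,
    instead of picking a target first and a stance second."""
    return max(((t, s) for t in targets for s in stances),
               key=lambda pair: score(tweet, pair[0], pair[1]))
```

A two-stage system that misranks the targets is locked into its first-stage error; the joint argmax above lets strong stance evidence rescue a borderline target, which is the cascade-elimination argument in the abstract.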
{"title":"TriTSP: A triangular joint reasoning networks for target–stance prediction","authors":"JiaYu Zhang, HongLi Zhang, ChunYu Liu, ZeShu Tian, Chao Meng, YuXiang Ma","doi":"10.1016/j.csl.2026.101962","DOIUrl":"10.1016/j.csl.2026.101962","url":null,"abstract":"<div><div>Target–stance prediction is a novel task evolved from the traditional stance detection task, aiming to predict the pair of target and stance from each tweet. The target–stance prediction task is currently solved by the two-stage method. Although this method effectively alleviates the dependence on manually labeled target information, the errors generated in the first-stage target identification task will directly have a negative impact on the performance of the second-stage stance detection task, resulting in obvious error cascades. Moreover, it is difficult to establish effective feature interactions between the two subtasks. To tackle the above problems, we propose a triangular joint reasoning model named TriTSP. The proposed model unifies the target features and stance features in the joint prediction manner to capture the correlations and interactions between them. Furthermore, inspired by the way humans express stances, we incorporate expanded stance triangle framework into our model to infer the specified target–stance pair through the explicit pairs contained in social media. Our proposed model not only eliminates error cascades, but also effectively improves the performance of the target–stance prediction task. 
Experiments on two benchmark datasets demonstrate that our proposed model has significant advantages over the current state-of-the-art models.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101962"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-10-01 | Epub Date: 2026-02-27 | DOI: 10.1016/j.rcim.2026.103275
Chong Lv , Lai Zou , Heng Li , Lei Ren , Feng Jiao , Xinli Wang
In robotic belt grinding of complex curved blades, the elastic contact characteristics and the variable curvature distribution of the blade result in non-uniform residual height distributions in both the chordwise and spanwise directions, thereby hindering the attainment of stringent dimensional tolerances. In this work, a novel trajectory planning method for robotic grinding of blades is presented to effectively improve surface residual uniformity. Initially, a 3D residual theoretical model is established from the geometric properties of the curved surface. Subsequently, the maximum chord height between adjacent cutter contact (CC) points is recalculated by an iterative verification algorithm, and an optimized chord height method is proposed to maximize the step length within the allowable chord error. Furthermore, an isoparametric trajectory and an isoscallop trajectory for 3D residual optimization are proposed to dynamically adjust the row spacing based on the curvature changes at the CC points. Simulation and experimental results demonstrate the effectiveness of the proposed methods in terms of both machining efficiency and machined quality: the machining efficiency of the optimized isoscallop method is improved by 7.4% compared with that before optimization, while the fluctuation ranges of the surface profile error of the two proposed trajectories are decreased by 28.7% and 38.5%, respectively. The presented trajectory planning method provides a valuable reference for improving the consistency of machined surface quality in robotic grinding of complex curved surfaces.
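The closed-form starting points behind chord-height step selection and residual(scallop)-limited row spacing are standard; a sketch follows (the paper's iterative verification and 3D residual model refine these estimates, and the effective-radius combination is a common approximation, not the paper's exact formula):

```python
import math

def max_step(rho, eps):
    """Largest step between adjacent CC points so that the chord height on a
    curve of radius rho stays within the allowable error eps."""
    return 2.0 * math.sqrt(eps * (2.0 * rho - eps))

def row_spacing(r_tool, h, R_surf=float("inf"), convex=True):
    """Row spacing keeping the residual (scallop) height below h, from the
    flat-surface relation h ~ w^2 / (8 r) with a curvature-combined radius."""
    if math.isinf(R_surf):
        r_eff = r_tool                                 # flat surface
    elif convex:
        r_eff = r_tool * R_surf / (r_tool + R_surf)    # convexity tightens spacing
    else:
        r_eff = r_tool * R_surf / (R_surf - r_tool)    # concavity relaxes it
    return math.sqrt(8.0 * r_eff * h)
```

This makes the adjustment direction explicit: on convex regions the spacing must shrink and on concave regions it may grow, which is exactly the curvature-driven row-spacing adaptation the isoscallop trajectory performs.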
{"title":"3D residual optimization-based trajectory planning for robotic grinding of complex curved blades","authors":"Chong Lv , Lai Zou , Heng Li , Lei Ren , Feng Jiao , Xinli Wang","doi":"10.1016/j.rcim.2026.103275","DOIUrl":"10.1016/j.rcim.2026.103275","url":null,"abstract":"<div><div>In robotic belt grinding of complex curved blades, the elastic contact characteristics and variable curvature distribution of the blade results in non-uniform residual height distributions in both the chordwise and spanwise directions, thereby hindering the attainment of stringent dimensional tolerances. In this work, a novel trajectory planning method for robotic grinding of blades is presented to effectively improve surface residual uniformity. Initially, a 3D residual theoretical model is established through the curved surface geometric properties. Subsequently, the maximum chord height between adjacent cutter contact (CC) points is recalculated by the iterative verification algorithm, and an optimized chord height method is proposed to maximize the step length within the allowable. Furthermore, the isoparametric trajectory and the isoscallop trajectory for 3D residual optimization are proposed respectively to dynamically adjust the row spacing based on the curvature changes of CC points. Simulation and experimental results demonstrate the effectiveness of the proposed methods from the perspectives of machined efficiency and machined quality. The machining efficiency of the optimized isoscallop method is improved by 7.4 % compared with that before optimization, the fluctuation ranges of the surface profile error of these two proposed trajectories decreased by 28.7 % and 38.5 %, respectively. 
The presented trajectory planning method provides a valuable reference for improving the machined surface quality consistency in robotic grinding of complex curved surfaces.</div></div>","PeriodicalId":21452,"journal":{"name":"Robotics and Computer-integrated Manufacturing","volume":"101 ","pages":"Article 103275"},"PeriodicalIF":11.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147330051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-10-01Epub Date: 2026-03-04DOI: 10.1016/j.csl.2026.101967
Sara Barahona, Juan Ignacio Alvarez-Trejos, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano
Sound Event Detection (SED) requires models that can accurately localize and classify overlapping audio events within complex acoustic environments. Conformer-based architectures have demonstrated promising performance by leveraging self-attention to capture long-range dependencies. However, the effects of this global attention accumulate across layers, which can blur local temporal boundaries and reduce detection accuracy, especially for short or closely spaced events. While increasing the input sequence length can help recover temporal detail, the quadratic complexity of Conformers’ self-attention significantly increases computational costs. To address this, we propose integrating the Efficient Conformer architecture, which introduces subsampling along the input sequence length, effectively reducing the temporal dimension within blocks. This design enables processing longer input sequences at finer temporal resolution, enhancing localization accuracy without extending output length. Using the DCASE Challenge 2023 Task 4 benchmark, system performance is evaluated via the threshold-independent Polyphonic Sound Detection Score (PSDS), measuring both localization precision (PSDS1) and class robustness (PSDS2). Experiments on the DESED validation dataset demonstrate that the Efficient Conformer not only improves temporal resolution and long-range dependency modeling, but also outperforms standard Conformer and Convolutional Recurrent Neural Network (CRNN) baselines in PSDS2. Additionally, we explore lightweight attention mechanisms employing squeeze-and-excitation blocks to emulate the frequency-axis translation invariance of Frequency Dynamic Convolutions (FDY). Our approach achieves performance comparable to heavier models like FDY+Conformer, while reducing computational cost by over 69%, showing promising results for Conformer-based systems in terms of precision and model efficiency.
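The quadratic-cost argument can be made concrete with a back-of-the-envelope FLOP count. The sequence lengths and model width below are hypothetical illustrations, not taken from the paper; the point is that subsampling the temporal dimension inside blocks (in the spirit of the Efficient Conformer) shrinks self-attention cost quadratically.

```python
def self_attention_cost(seq_len: int, d_model: int) -> int:
    """Dominant FLOPs of one self-attention layer: the QK^T score matrix
    and the attention-weighted sum over V each cost ~ seq_len^2 * d_model."""
    return 2 * seq_len * seq_len * d_model

# Hypothetical setting: a 1000-frame input with d_model = 144.
full = self_attention_cost(1000, 144)
# With progressive subsampling, later blocks see the sequence
# at 1/2 and 1/4 of the input length.
half = self_attention_cost(500, 144)
quarter = self_attention_cost(250, 144)
```

Halving the sequence length cuts per-layer attention cost by 4x, and quartering it by 16x, which is what makes longer, finer-resolution inputs affordable.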
{"title":"Exploring efficient attention strategies in conformer-based sound event detection","authors":"Sara Barahona, Juan Ignacio Alvarez-Trejos, Alicia Lozano-Diez, Daniel Ramos, Doroteo T. Toledano","doi":"10.1016/j.csl.2026.101967","DOIUrl":"10.1016/j.csl.2026.101967","url":null,"abstract":"<div><div>Sound Event Detection (SED) requires models that can accurately localize and classify overlapping audio events within complex acoustic environments. Conformer-based architectures have demonstrated promising performance by leveraging self-attention to capture long-range dependencies. However, this global attention can be accumulated across layers, which can blur local temporal boundaries and reduce detection accuracy, especially for short or closely spaced events. While increasing the input sequence length can help recover temporal detail, the quadratic complexity of Conformers’ self-attention significantly increases computational costs. To address this, we propose integrating the Efficient Conformer architecture, which introduces subsampling along the input sequence length, effectively reducing the temporal dimension within blocks. This design enables processing longer input sequences at finer temporal resolution, enhancing localization accuracy without extending output length. Using the DCASE Challenge 2023 Task 4 benchmark, system performance is evaluated via the threshold-independent Polyphonic Sound Detection Score (PSDS), measuring both localization precision (PSDS1) and class robustness (PSDS2). Experiments on the DESED validation dataset demonstrate that the Efficient Conformer not only improves temporal resolution and long-range dependency modeling, but also outperforms standard Conformer and Convolutional Recurrent Neural Network (CRNN) baselines in PSDS2. Additionally, we explore lightweight attention mechanisms employing squeeze-and-excitation blocks to emulate frequency-axis translation invariance of Frequency Dynamic Convolutions (FDY). 
Our approach achieves performance comparable to heavier models like FDY+Conformer, while reducing computational cost by over 69%, showing promising results for Conformer-based systems in terms of precision and model efficiency.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101967"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147385931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current text-to-speech (TTS) systems are capable of learning the phonetics of a language accurately, provided that the speech data used to train them covers all phonetic phenomena. For languages with multiple varieties, this coverage must include the full richness of their accents. This is the case for Catalan, a mid-resourced language with several dialects or accents. Although there are various publicly available corpora, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recorded speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat from a version pre-trained on Central Catalan data from Common Voice, rather than trained from scratch. 
Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings. However, although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.
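The Pillai score used here to quantify acoustic vowel overlap can be sketched in a few lines. The implementation below is a minimal pure-Python illustration for two-dimensional formant observations (e.g. F1, F2 pairs), not the authors' analysis pipeline, and the vowel data in the usage example are made up: it computes Pillai's trace V = tr(H(H + E)⁻¹), where H is the between-category and E the pooled within-category sum-of-squares-and-cross-products matrix. With two categories, V ranges from 0 (complete acoustic overlap) to 1 (complete separation).

```python
def _mean(rows):
    n = len(rows)
    return (sum(x for x, _ in rows) / n, sum(y for _, y in rows) / n)

def _sscp(rows, center):
    # 2x2 sum-of-squares-and-cross-products matrix about `center`
    a = b = c = 0.0
    for x, y in rows:
        dx, dy = x - center[0], y - center[1]
        a += dx * dx
        b += dx * dy
        c += dy * dy
    return [[a, b], [b, c]]

def pillai_trace(groups):
    """Pillai's trace V = tr(H (H + E)^-1) for 2-D observations."""
    grand = _mean([r for g in groups for r in g])
    H = [[0.0, 0.0], [0.0, 0.0]]
    E = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        m = _mean(g)
        n = len(g)
        dx, dy = m[0] - grand[0], m[1] - grand[1]
        H[0][0] += n * dx * dx
        H[0][1] += n * dx * dy
        H[1][0] += n * dx * dy
        H[1][1] += n * dy * dy
        W = _sscp(g, m)
        for i in range(2):
            for j in range(2):
                E[i][j] += W[i][j]
    T = [[H[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    inv = [[T[1][1] / det, -T[0][1] / det],
           [-T[1][0] / det, T[0][0] / det]]
    # trace of the 2x2 product H @ inv
    return (H[0][0] * inv[0][0] + H[0][1] * inv[1][0]
            + H[1][0] * inv[0][1] + H[1][1] * inv[1][1])
```

For example, two vowel categories whose (F1, F2) clouds are far apart yield a trace near 1, while identical clouds yield 0, which is what makes the score a convenient per-vowel measure of how well a synthesizer keeps accent-specific vowel qualities distinct.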
{"title":"LaFresCat: A studio-quality Catalan multi-accent speech dataset for text-to-speech synthesis","authors":"Alex Peiró-Lilja , Carme Armentano-Oller , José Giraldo , Wendy Elvira-García , Ignasi Esquerra , Rodolfo Zevallos , Cristina España-Bonet , Martí Llopart-Font , Baybars Külebi , Mireia Farrús","doi":"10.1016/j.csl.2026.101945","DOIUrl":"10.1016/j.csl.2026.101945","url":null,"abstract":"<div><div>Current text-to-speech (TTS) systems are capable of learning the phonetics of a language accurately given that the speech data used to train such models covers all phonetic phenomena. For languages with different varieties, this includes all their richness and accents. This is the case of Catalan, a mid-resourced language with several dialects or accents. Although there are various publicly available corpora, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recording speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. 
We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat rather than trained from scratch, starting from a pre-trained version based on Central Catalan data from Common Voice. Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings. However, although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101945"},"PeriodicalIF":3.4,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146147489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}