Stefano Bini, Vincenzo Carletti, Alessia Saggese, Mario Vento
Computer Communications, Volume 228, Article 107938. Published 2024-09-02. DOI: 10.1016/j.comcom.2024.107938
https://www.sciencedirect.com/science/article/pii/S0140366424002858
Robust speech command recognition in challenging industrial environments
Speech is among the main forms of communication between humans and robots in industrial settings, being the most natural way for a human worker to issue commands. However, pervasive and loud environmental noise poses significant challenges to the adoption of Speech-Command Recognition systems onboard manufacturing robots: such systems are expected to run in real time on hardware with limited computational capabilities while remaining robust and accurate in these complex environments. In this paper, we propose a system based on an End-to-End architecture with a Conformer backbone. The system is specifically designed to achieve high accuracy in noisy industrial environments and to keep the computational burden low enough to meet stringent real-time requirements on computing devices embedded in robots. To increase the generalization capability of the system, the training procedure is driven by a Curriculum Learning strategy combined with dynamic data augmentation techniques, which progressively increase the complexity of the input samples by raising the noise level during the training phase. We have conducted extensive experiments to assess the effectiveness of our system, using a dataset of more than 50,000 samples, of which about 2,000 were acquired during the daily operations of a Stellantis Italian factory. The results confirm that the proposed approach is suitable for adoption in a real industrial environment: on both English and Italian commands it achieves an accuracy higher than 90%, while maintaining a compact model size (the network is 1.81 MB) and running in real time on an industrial embedded device (namely 41 ms on an NVIDIA Xavier NX).
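The abstract does not specify the noise schedule, the SNR range, or the augmentation operations, so the following is only a minimal sketch of the general idea of curriculum-driven noise augmentation, not the authors' procedure. The function names, the linear schedule, and the 30 dB to 0 dB range are all assumptions chosen for illustration: early epochs mix in little noise, later epochs mix in more.

```python
import math
import random


def curriculum_snr_db(epoch, total_epochs, start_snr_db=30.0, end_snr_db=0.0):
    """Hypothetical linear schedule: the target SNR falls (noise rises) with the epoch."""
    if total_epochs <= 1:
        return end_snr_db
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_snr_db + frac * (end_snr_db - start_snr_db)


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so speech power / noise power equals snr_db, then add it."""
    p_noise = power(noise)
    if p_noise == 0.0:
        return list(speech)  # silent noise clip: nothing to mix in
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, noise_bank, epoch, total_epochs, rng):
    """Dynamic augmentation: draw a random noise segment at the current curriculum SNR.

    Assumes every clip in noise_bank is at least as long as the speech clip.
    """
    noise = rng.choice(noise_bank)
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    snr_db = curriculum_snr_db(epoch, total_epochs)
    return mix_at_snr(speech, segment, snr_db), snr_db


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for a speech command and a bank of factory noise clips.
    speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    noise_bank = [[rng.gauss(0.0, 1.0) for _ in range(3200)] for _ in range(3)]
    for epoch in range(5):
        noisy, snr_db = augment(speech, noise_bank, epoch, 5, rng)
        print(f"epoch {epoch}: target SNR {snr_db:.1f} dB, {len(noisy)} samples")
```

Because the noise segment and the SNR are drawn again on every call, the same clean command yields a different noisy sample each epoch, which is what makes the augmentation dynamic rather than a fixed pre-mixed dataset.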
About the journal:
Computer and communication networks are key infrastructures of the information society, with high socio-economic value because they support the correct operation of many critical services (from healthcare to finance and transportation). The Internet is the core of today's computer-communication infrastructures. It has evolved from a robust network for data transfer between computers into a global, content-rich communication and information system in which content is increasingly generated by users and distributed according to human social relations. Next-generation network technologies, architectures and protocols are therefore required to overcome the limitations of the legacy Internet and add new capabilities and services. The future Internet should be ubiquitous, secure, resilient, and closer to human communication paradigms.
Computer Communications is a peer-reviewed international journal that publishes high-quality scientific articles (both theory and practice) and survey papers covering all aspects of future computer communication networks (on all layers except the physical layer), with special attention to the evolution of the Internet architecture, protocols, services, and applications.