Speech-Section Extraction Using Lip Movement and Voice Information in Japanese

Journal of Advanced Computational Intelligence and Intelligent Informatics (IF 0.8, JCR Q4, Computer Science, Artificial Intelligence) | Pub Date: 2023-01-20 | Pages: 54-63 | DOI: 10.20965/jaciii.2023.p0054

Etsuro Nakamura, Y. Kageyama, Satoshi Hirose


Citations: 0

Abstract



In recent years, several Japanese companies have attempted to improve the efficiency of their meetings, which has been a significant challenge. For instance, voice recognition technology is used to considerably streamline the creation of meeting minutes. In an automatic minutes-creation system, identifying the speaker so that speaker information can be added to the text would substantially improve the overall efficiency of the process. A few companies and research groups have therefore proposed speaker estimation methods; however, these methods have drawbacks, such as requiring advance preparation, special equipment, and multiple microphones. These problems can be solved by using speech sections extracted from lip movements and voice information. When a person speaks, voice and lip movements occur simultaneously, so the speaker's speech section can be extracted from video by using both cues. When the speech section is obtained from voice information alone, the voiceprint of each meeting participant is required for speaker identification. When lip movements are used, the speech section and the speaker's position can be extracted without voiceprint information. Therefore, in this study, we propose a speech-section extraction method for Japanese that uses image and voice information for speaker identification. The proposed method consists of three processes: i) the extraction of speech frames using lip movements, ii) the extraction of speech frames using voices, and iii) the classification of speech sections using these extraction results. We used video data to evaluate the functionality of the method and compared it with state-of-the-art techniques. The average F-measure of the proposed method was higher than that of conventional methods based on state-of-the-art techniques. The evaluation results showed that the proposed method achieves state-of-the-art performance using a simpler process than the conventional methods.
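The three processes and the F-measure evaluation can be illustrated with a minimal sketch. The abstract does not say how the lip-based and voice-based frame results are merged, so the logical-AND rule, the `min_len` threshold, the frame-level scoring, and every function name below are assumptions made for illustration only, not the authors' method.

```python
from typing import List, Tuple


def combine_frames(lip: List[bool], voice: List[bool]) -> List[bool]:
    """Mark a frame as speech only when both cues agree (assumed AND rule)."""
    if len(lip) != len(voice):
        raise ValueError("lip and voice sequences must have the same length")
    return [l and v for l, v in zip(lip, voice)]


def frames_to_sections(frames: List[bool], min_len: int = 1) -> List[Tuple[int, int]]:
    """Group consecutive speech frames into (start, end) sections, end exclusive.

    Runs shorter than min_len frames are discarded (hypothetical smoothing step).
    """
    sections = []
    start = None
    for i, is_speech in enumerate(frames):
        if is_speech and start is None:
            start = i  # a new run of speech frames begins
        elif not is_speech and start is not None:
            if i - start >= min_len:
                sections.append((start, i))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(frames) - start >= min_len:
        sections.append((start, len(frames)))
    return sections


def f_measure(predicted: List[bool], truth: List[bool]) -> float:
    """Frame-level F-measure: harmonic mean of precision and recall."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy per-frame detections (made-up values, not data from the paper).
lip_frames = [False, True, True, True, False, False, True, True, False, False]
voice_frames = [False, False, True, True, True, False, True, True, True, False]
ground_truth = [False, False, True, True, True, False, True, True, False, False]

speech_frames = combine_frames(lip_frames, voice_frames)
print(frames_to_sections(speech_frames, min_len=2))  # [(2, 4), (6, 8)]
print(round(f_measure(speech_frames, ground_truth), 3))  # 0.889
```

In this toy case the AND rule rejects frames where only one cue fires (for example, a lip movement without sound), which costs one true speech frame and gives a recall of 0.8 at a precision of 1.0. A different merging rule would trade these off differently.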


Source journal

Journal of Advanced Computational Intelligence and Intelligent Informatics (Computer Science, Artificial Intelligence)

CiteScore: 1.50

Self-citation rate: 14.30%

Journal description: JACIII focuses on advanced computational intelligence and intelligent informatics. The topics include, but are not limited to: fuzzy logic, fuzzy control, neural networks, GA and evolutionary computation, hybrid systems, adaptation and learning systems, distributed intelligent systems, network systems, multimedia, human interface, biologically inspired evolutionary systems, artificial life, chaos, complex systems, fractals, robotics, medical applications, pattern recognition, virtual reality, wavelet analysis, scientific applications, industrial applications, and artistic applications.