Speech-Section Extraction Using Lip Movement and Voice Information in Japanese

Journal of Advanced Computational Intelligence and Intelligent Informatics (IF 0.7, JCR Q4, Computer Science, Artificial Intelligence) · Publication date: 2023-01-20 · Pages: 54-63 · DOI: 10.20965/jaciii.2023.p0054
Etsuro Nakamura, Y. Kageyama, Satoshi Hirose
Citations: 0

Abstract

In recent years, several Japanese companies have attempted to improve the efficiency of their meetings, which has been a significant challenge. For instance, voice recognition technology is used to considerably improve the creation of meeting minutes. In an automatic minutes-creation system, identifying the speaker in order to add speaker information to the text would substantially improve the overall efficiency of the process. A few companies and research groups have therefore proposed speaker-estimation methods; however, these methods pose challenges such as requiring advance preparation, special equipment, and multiple microphones. These problems can be solved by using speech sections extracted from lip movements and voice information. When a person speaks, voice and lip movements occur simultaneously, so the speaker's speech section can be extracted from video using lip-movement and voice information. However, when a speech section contains only voice information, the voiceprint of each meeting participant is required for speaker identification. When lip movements are used, the speech section and speaker position can be extracted without voiceprint information. Therefore, in this study, we propose a speech-section extraction method for Japanese that uses image and voice information for speaker identification. The proposed method consists of three processes: i) extraction of speech frames using lip movements, ii) extraction of speech frames using voices, and iii) classification of speech sections using these extraction results. We used video data to evaluate the functionality of the method, and compared the proposed method with state-of-the-art techniques. The average F-measure of the proposed method is higher than that of conventional methods based on state-of-the-art techniques. The evaluation results show that the proposed method achieves state-of-the-art performance with a simpler process than the conventional methods.
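The three processes named in the abstract can be sketched as a simple per-frame pipeline. This is a minimal illustration under stated assumptions, not the paper's implementation: the mouth-opening values, audio-energy values, thresholds, and the frame-wise AND used to combine the two cues are all hypothetical stand-ins for the actual lip-movement and voice features.

```python
# Minimal sketch of the abstract's three-stage pipeline (hypothetical features):
# i) speech frames from lip movement, ii) speech frames from voice,
# iii) classification of speech sections by combining both results.

def speech_frames_from_lips(mouth_openings, threshold=0.2):
    """Stage i: flag frames whose mouth-opening measure exceeds a threshold."""
    return [o > threshold for o in mouth_openings]

def speech_frames_from_voice(energies, threshold=0.5):
    """Stage ii: flag frames whose audio energy exceeds a threshold."""
    return [e > threshold for e in energies]

def classify_speech_sections(lip_flags, voice_flags):
    """Stage iii: keep frames where both cues agree, then group runs of
    consecutive speech frames into (start_frame, end_frame) sections."""
    speech = [l and v for l, v in zip(lip_flags, voice_flags)]
    sections, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i                      # a speech run begins
        elif not is_speech and start is not None:
            sections.append((start, i - 1))  # a speech run ends
            start = None
    if start is not None:                  # run extends to the last frame
        sections.append((start, len(speech) - 1))
    return sections

# Example: 8 frames; speech is detected only where both cues fire.
lips  = [0.1, 0.3, 0.4, 0.3, 0.1, 0.3, 0.4, 0.1]
voice = [0.2, 0.7, 0.8, 0.6, 0.1, 0.2, 0.9, 0.1]
print(classify_speech_sections(
    speech_frames_from_lips(lips),
    speech_frames_from_voice(voice)))      # → [(1, 3), (6, 6)]
```

Requiring both cues to agree is what lets this scheme reject lip movement without voice (e.g., mouthing) and voice without lip movement (e.g., another speaker off-camera); the paper's actual classification step is more involved than a frame-wise AND.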
Source Journal
CiteScore: 1.50 · Self-citation rate: 14.30% · Articles published: 89
Journal introduction: JACIII focuses on advanced computational intelligence and intelligent informatics. The topics include, but are not limited to: Fuzzy logic, Fuzzy control, Neural Networks, GA and Evolutionary Computation, Hybrid Systems, Adaptation and Learning Systems, Distributed Intelligent Systems, Network systems, Multi-media, Human interface, Biologically inspired evolutionary systems, Artificial life, Chaos, Complex systems, Fractals, Robotics, Medical applications, Pattern recognition, Virtual reality, Wavelet analysis, Scientific applications, Industrial applications, and Artistic applications.
Latest articles from this journal
The Impact of Individual Heterogeneity on Household Asset Choice: An Empirical Study Based on China Family Panel Studies
Private Placement, Investor Sentiment, and Stock Price Anomaly
Does Increasing Public Service Expenditure Slow the Long-Term Economic Growth Rate?—Evidence from China
Prediction and Characteristic Analysis of Enterprise Digital Transformation Integrating XGBoost and SHAP
Industrial Chain Map and Linkage Network Characteristics of Digital Economy