Detection of Inconsistency Between Subject and Speaker Based on the Co-occurrence of Lip Motion and Voice Towards Speech Scene Extraction from News Videos
S. Kumagai, Keisuke Doman, Tomokazu Takahashi, Daisuke Deguchi, I. Ide, H. Murase
{"title":"Detection of Inconsistency Between Subject and Speaker Based on the Co-occurrence of Lip Motion and Voice Towards Speech Scene Extraction from News Videos","authors":"S. Kumagai, Keisuke Doman, Tomokazu Takahashi, Daisuke Deguchi, I. Ide, H. Murase","doi":"10.1109/ISM.2011.56","DOIUrl":null,"url":null,"abstract":"We propose a method to detect the inconsistency between a subject and the speaker for extracting speech scenes from news videos. Speech scenes in news videos contain a wealth of multimedia information, and are valuable as archived material. In order to extract speech scenes from news videos, there is an approach that uses the position and size of a face region. However, it is difficult to extract them with only such approach, since news videos contain non-speech scenes where the speaker is not the subject, such as narrated scenes. To solve this problem, we propose a method to discriminate between speech scenes and narrated scenes based on the co-occurrence between a subject's lip motion and the speaker's voice. The proposed method uses lip shape and degree of lip opening as visual features representing a subject's lip motion, and uses voice volume and phoneme as audio feature representing a speaker's voice. Then, the proposed method discriminates between speech scenes and narrated scenes based on the correlations of these features. We report the results of experiments on videos captured in a laboratory condition and also on actual broadcast news videos. Their results showed the effectiveness of our method and the feasibility of our research goal.","PeriodicalId":339410,"journal":{"name":"2011 IEEE International Symposium on Multimedia","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Symposium on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISM.2011.56","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8
Abstract
We propose a method to detect the inconsistency between the subject and the speaker, for the purpose of extracting speech scenes from news videos. Speech scenes in news videos contain a wealth of multimedia information and are valuable as archived material. One approach to extracting speech scenes from news videos uses the position and size of a face region. However, such an approach alone is insufficient, since news videos also contain non-speech scenes in which the speaker is not the subject, such as narrated scenes. To solve this problem, we propose a method that discriminates between speech scenes and narrated scenes based on the co-occurrence between a subject's lip motion and the speaker's voice. The proposed method uses lip shape and degree of lip opening as visual features representing the subject's lip motion, and voice volume and phoneme as audio features representing the speaker's voice. It then discriminates between speech scenes and narrated scenes based on the correlations of these features. We report experimental results on videos captured under laboratory conditions as well as on actual broadcast news videos. The results demonstrate the effectiveness of our method and the feasibility of our research goal.
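To illustrate the co-occurrence idea, here is a minimal Python sketch that correlates a per-frame lip-opening sequence with a frame-level voice-volume sequence and thresholds the correlation to label a scene. It assumes the feature sequences have already been extracted and time-aligned; the function names, the use of Pearson correlation, and the threshold value are illustrative assumptions, not the paper's actual implementation (which also incorporates lip-shape and phoneme features).

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation between two equal-length feature sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    denom = np.sqrt((xm ** 2).sum() * (ym ** 2).sum())
    return float((xm * ym).sum() / denom) if denom > 0 else 0.0

def classify_scene(lip_opening, voice_volume, threshold=0.5):
    """Label a scene 'speech' if the subject's lip opening co-occurs
    (correlates) with the voice volume, else 'narrated'.
    The threshold is a hypothetical value for illustration only."""
    corr = pearson_corr(lip_opening, voice_volume)
    return ("speech" if corr >= threshold else "narrated"), corr

# Toy example: per-frame degree of lip opening vs. frame-level volume.
lip = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
vol = [0.2, 0.7, 0.8, 0.1, 0.6, 0.2]  # tracks the lip motion -> speech scene
label, corr = classify_scene(lip, vol)
print(label, round(corr, 3))
```

In a narrated scene, the on-screen subject's lips either do not move or move independently of the narrator's voice, so the correlation stays low and the sketch labels the scene "narrated".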