An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology

Emre Beray Boztepe, Bedirhan Karakaya, B. Karasulu, İsmet Ünlü
{"title":"基于多模态深度学习方法的视频视听内容理解方法","authors":"Emre Beray Boztepe, Bedirhan Karakaya, B. Karasulu, İsmet Ünlü","doi":"10.35377/saucis...1139765","DOIUrl":null,"url":null,"abstract":"This study contains an approach for recognizing the sound environment class from a video to understand the spoken content with its sentimental context via some sort of analysis that is achieved by the processing of audio-visual content using multimodal deep learning methodology. This approach begins with cutting the parts of a given video which the most action happened by using deep learning and this cutted parts get concanarated as a new video clip. With the help of a deep learning network model which was trained before for sound recognition, a sound prediction process takes place. The model was trained by using different sound clips of ten different categories to predict sound classes. These categories have been selected by where the action could have happened the most. Then, to strengthen the result of sound recognition if there is a speech in the new video, this speech has been taken. By using Natural Language Processing (NLP) and Named Entity Recognition (NER) this speech has been categorized according to if the word of a speech has connotation of any of the ten categories. Sentiment analysis and Apriori Algorithm from Association Rule Mining (ARM) processes are preceded by identifying the frequent categories in the concanarated video and helps us to define the relationship between the categories owned. According to the highest performance evaluation values from our experiments, the accuracy for sound environment recognition for a given video's processed scene is 70%, average Bilingual Evaluation Understudy (BLEU) score for speech to text with VOSK speech recognition toolkit's English language model is 90% on average and for Turkish language model is 81% on average. Discussion and conclusion based on scientific findings are included in our study.","PeriodicalId":257636,"journal":{"name":"Sakarya University Journal of Computer and Information Sciences","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"An Approach for Audio-Visual Content Understanding of Video using Multimodal Deep Learning Methodology\",\"authors\":\"Emre Beray Boztepe, Bedirhan Karakaya, B. Karasulu, İsmet Ünlü\",\"doi\":\"10.35377/saucis...1139765\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This study contains an approach for recognizing the sound environment class from a video to understand the spoken content with its sentimental context via some sort of analysis that is achieved by the processing of audio-visual content using multimodal deep learning methodology. This approach begins with cutting the parts of a given video which the most action happened by using deep learning and this cutted parts get concanarated as a new video clip. With the help of a deep learning network model which was trained before for sound recognition, a sound prediction process takes place. The model was trained by using different sound clips of ten different categories to predict sound classes. These categories have been selected by where the action could have happened the most. Then, to strengthen the result of sound recognition if there is a speech in the new video, this speech has been taken. 
By using Natural Language Processing (NLP) and Named Entity Recognition (NER) this speech has been categorized according to if the word of a speech has connotation of any of the ten categories. Sentiment analysis and Apriori Algorithm from Association Rule Mining (ARM) processes are preceded by identifying the frequent categories in the concanarated video and helps us to define the relationship between the categories owned. According to the highest performance evaluation values from our experiments, the accuracy for sound environment recognition for a given video's processed scene is 70%, average Bilingual Evaluation Understudy (BLEU) score for speech to text with VOSK speech recognition toolkit's English language model is 90% on average and for Turkish language model is 81% on average. Discussion and conclusion based on scientific findings are included in our study.\",\"PeriodicalId\":257636,\"journal\":{\"name\":\"Sakarya University Journal of Computer and Information Sciences\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Sakarya University Journal of Computer and Information Sciences\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.35377/saucis...1139765\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Sakarya University Journal of Computer and Information Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.35377/saucis...1139765","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

This study presents an approach for recognizing the sound-environment class of a video and understanding its spoken content, together with its sentimental context, by processing the audio-visual content with a multimodal deep learning methodology. The approach first uses deep learning to cut out the parts of a given video in which the most action happens, and these parts are concatenated into a new video clip. A deep learning network model previously trained for sound recognition then predicts the sound class; the model was trained on sound clips from ten categories, selected as the environments in which action is most likely to occur. To strengthen the sound-recognition result, any speech present in the new clip is extracted. Using Natural Language Processing (NLP) and Named Entity Recognition (NER), the speech is categorized according to whether its words connote any of the ten categories. Sentiment analysis is then applied, and the Apriori algorithm from Association Rule Mining (ARM) identifies the frequent categories in the concatenated video and helps define the relationships between them. At the highest performance evaluation values from our experiments, the accuracy of sound-environment recognition for a given video's processed scene is 70%, and the average Bilingual Evaluation Understudy (BLEU) score for speech-to-text with the VOSK speech recognition toolkit is 90% for the English language model and 81% for the Turkish language model. A discussion and conclusions based on the scientific findings are included in the study.
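The authors do not publish their sound-recognition model, but the step can be pictured concretely. Below is a minimal sketch of a ten-class sound-environment classifier built from log-mel spectrogram features and a small convolutional network; the sample rate, clip length, and architecture are assumptions for illustration, not the paper's actual model.

```python
# Sketch of a ten-class sound-environment classifier.
# The feature settings and architecture are assumptions, not the paper's model.
import numpy as np
import librosa
import tensorflow as tf

NUM_CLASSES = 10   # ten environment categories, per the abstract
SR = 22050         # assumed sample rate
N_MELS = 64

def log_mel(path, duration=4.0):
    """Load a clip and convert it to a fixed-size log-mel spectrogram."""
    y, _ = librosa.load(path, sr=SR, duration=duration)
    y = librosa.util.fix_length(y, size=int(SR * duration))
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)[..., np.newaxis]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_MELS, 173, 1)),   # 4 s at 22.05 kHz
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```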
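The speech-to-text step uses the VOSK toolkit, which provides offline models for both English and Turkish. A minimal transcription sketch follows; the model directory name is one of VOSK's published English models and is an assumption here, and a Turkish model directory would be substituted for Turkish audio.

```python
# Sketch of offline speech-to-text with VOSK (expects a 16-bit PCM mono WAV).
import json
import wave
from vosk import Model, KaldiRecognizer

def transcribe(wav_path, model_dir="vosk-model-small-en-us-0.15"):
    """Transcribe a WAV file with a VOSK model; returns the joined text."""
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    parts = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            parts.append(json.loads(rec.Result())["text"])
    parts.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in parts if p)

# e.g. transcribe("speech.wav"); pass a Turkish model_dir for Turkish audio.
```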
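For the Association Rule Mining step, the abstract names the Apriori algorithm. The sketch below uses the mlxtend implementation to mine frequent category sets and rules; the per-scene transaction data is illustrative, not taken from the paper.

```python
# Sketch of the Apriori step: frequent category sets and association rules
# over per-scene category labels (the transactions here are illustrative).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each transaction: the categories detected in one processed scene.
scenes = [["street", "car"], ["street", "crowd"], ["street", "car", "crowd"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(scenes).transform(scenes), columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```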
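The reported 90% and 81% figures are average BLEU scores comparing recognized transcripts against references. A per-sentence BLEU score can be computed as in this sketch, using NLTK's implementation with illustrative sentences.

```python
# Sketch of scoring a speech-to-text hypothesis against a reference with BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the quick brown fox jumps over the lazy dog".split()
hypothesis = "the quick brown fox jumped over the lazy dog".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"BLEU: {score:.2f}")
```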