{"title":"Efficient Video Retrieval Method Based on Transition Detection and Video Metadata Information","authors":"Nhat-Tuong Do-Tran, Vu-Hoang Tran, Tuan-Ngoc Nguyen, Thanh-Le Nguyen","doi":"10.1109/ICSSE58758.2023.10227191","DOIUrl":null,"url":null,"abstract":"In this paper, we propose an event retrieval support system that quickly finds videos in a large database based on user-entered content. The system addresses the challenges of providing fast and relevant results for a dataset of over 400 hours of videos and developing user-friendly tools. To achieve fast retrieval, we convert the videos into compact semantic features. This involves two steps: (1) Identifying keyframes that represent different content and (2) Extracting semantic features from these frames. We first use the TransNet model to find transition frames, which split the video into scenes with different content. Then we will extract the keyframes which are evenly distributed in these scenes. Finally, the CLIP model is used to extract features from these keyframes and connect them with text. This forms a compact and semantic feature database. When users search with text, we convert it into features and measure similarity with the database using cosine distance, then the most similar video is retrieved. In cases where CLIP model fails, we recommend leveraging news headlines and audio by applying Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) on videos to form a text database and comparing the entered text with this text database. Experimental results on a Vietnamese media news dataset demonstrate the effectiveness and accuracy of our method.","PeriodicalId":280745,"journal":{"name":"2023 International Conference on System Science and Engineering (ICSSE)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on System Science and Engineering (ICSSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSSE58758.2023.10227191","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
In this paper, we propose an event retrieval support system that quickly finds videos in a large database based on user-entered content. The system addresses the challenges of providing fast, relevant results over a dataset of more than 400 hours of video while offering user-friendly tools. To achieve fast retrieval, we convert the videos into compact semantic features in two steps: (1) identifying keyframes that represent distinct content and (2) extracting semantic features from these frames. We first use the TransNet model to detect transition frames, which split each video into scenes with different content. We then extract keyframes sampled evenly within each scene. Finally, the CLIP model extracts features from these keyframes and aligns them with text, forming a compact semantic feature database. When a user searches with text, we convert the query into features, measure similarity against the database using cosine distance, and return the most similar video. In cases where the CLIP model fails, we recommend leveraging news headlines and audio by applying Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) to the videos to build a text database, and comparing the entered text against it. Experimental results on a Vietnamese media news dataset demonstrate the effectiveness and accuracy of our method.
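The sketch below illustrates the text-to-keyframe retrieval step described in the abstract: keyframe images (already selected, e.g., sampled evenly between shot boundaries) are encoded offline with CLIP, and a text query is matched against them by cosine similarity. It is a minimal illustration, not the authors' implementation; it assumes the Hugging Face `transformers` CLIP interface, and names such as `encode_keyframes`, `search`, and the example file names are hypothetical.

```python
# Minimal sketch of CLIP-based text-to-keyframe retrieval (illustrative only).
# Assumes keyframes were already extracted from scenes split by a shot-boundary
# detector; the model checkpoint and helper names are assumptions, not the
# paper's exact configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_keyframes(image_paths):
    """Encode keyframe images into L2-normalized CLIP features (the offline index)."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(query_text, keyframe_feats, top_k=5):
    """Encode the text query and rank keyframes by cosine similarity."""
    inputs = processor(text=[query_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    # Features are unit-norm, so the dot product equals cosine similarity.
    sims = (q @ keyframe_feats.T).squeeze(0)
    scores, idx = sims.topk(min(top_k, sims.numel()))
    return list(zip(idx.tolist(), scores.tolist()))

# Example usage (file names are placeholders):
# index = encode_keyframes(["keyframe_001.jpg", "keyframe_002.jpg"])
# results = search("flood coverage on the evening news", index)
```

In practice, the retrieved keyframe indices would be mapped back to their source videos and timestamps, and a fallback text index built from OCR and ASR output could be consulted when the visual match is weak.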