Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269808
Audio and video cues for geo-tagging online videos in the absence of metadata
Xavier Sevillano, X. Valero, Francesc Alías
Tagging videos with the geo-coordinates of the place where they were filmed (i.e. geo-tagging) enables indexing online multimedia repositories using geographical criteria. However, millions of non-geo-tagged videos available online are invisible to geo-oriented applications, which calls for the development of automatic techniques for estimating the location where a video was filmed. The most successful approaches to this problem rely largely on exploiting the textual metadata associated with the video, but it is not rare to encounter videos with no title, description, or tags. This work focuses on this adverse scenario and proposes a purely audiovisual approach to geo-tagging. Using a subset of the MediaEval 2011 Placing task data set, we evaluate the ability of several visual and acoustic features to estimate a video's location, and demonstrate that the optimally configured version of the proposed system outperforms the only audiovisual participant in the MediaEval 2011 Placing task.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269851
A presentation of the REPERE challenge
Juliette Kahn, Olivier Galibert, L. Quintard, Matthieu Carré, Aude Giraudel, P. Joly
The REPERE Challenge aims to support research on people recognition in multimodal conditions. To assess technological progress, annual evaluation campaigns will be organized from 2012 to 2014. In this context, the REPERE corpus, a French video corpus with multimodal annotation, has been developed. The systems have to answer the following questions: Who is speaking? Who is present in the video? What names are cited? What names are displayed? The challenge is to combine the various pieces of information coming from the speech and the images.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269811
Hierarchical clustering relevance feedback for content-based image retrieval
Ionut Mironica, B. Ionescu, C. Vertan
In this paper we address the issue of relevance feedback in the context of content-based image retrieval. We propose a method that uses a hierarchical cluster representation of the relevant and non-relevant images in a query. The main advantage of this strategy is that it operates on the initial set of retrieved images (user feedback is provided only once, for a small number of retrieved images) instead of performing additional queries, as most approaches do. Experimental tests conducted on several standard image databases and using state-of-the-art content descriptors (e.g. MPEG-7, SURF) show that the proposed method provides a significant improvement in retrieval performance, outperforming several other classic approaches.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269852
Toward an assisted context based collaborative annotation
Nesrine Ksentini, Mohamed Zarka, A. Ammar, A. Alimi
This paper introduces a novel approach to video annotation that provides context-based assistance to the annotator. The notion of context plays a significant role in multimedia content search and retrieval systems: a semantic interpretation (or concept) separated from its context conveys incomplete information, and the interpretation of each concept varies across contexts. The assistance that we introduce relies on intelligent structures such as a context ontology and previous annotations. The evaluation of the proposed assisted annotation prototype shows that using context in the annotation process leads to promising results.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269802
VOXALEADNEWS: A scalable content based video search engine
Julien Law-To, R. Landais, G. Grefenstette
Video search is still largely based on text search over human-supplied metadata, sometimes supplemented by extracted thumbnails. We have been developing a broadcast news search system that builds on recent progress in automatic speech recognition (ASR), natural language processing (NLP), and video and image processing to provide rich content-based search of news. Our public online demonstrator of the Voxalead application, described here, currently indexes daily broadcast news content from 60 sources in English, French, Chinese, Arabic, Spanish, Dutch, Italian, German and Russian, and makes it searchable shortly after it has been published.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269835
Automatic difference measure between movies using dissimilarity measure fusion and rank correlation coefficients
Nicolas Voiron, A. Benoît, P. Lambert
Given the growth of multimedia databases, one current challenge is to design accurate navigation tools. Basic end-user needs, such as exploration, similarity search and favorite suggestions, lead us to investigate how to find semantically similar media. One way is to build numerous continuous dissimilarity measures from low-level image features; another is to build discrete dissimilarities from the textual information that may be available with video sequences. However, how should such different measures be selected as relevant and fused? To this end, this paper compares these various dissimilarities and proposes a suitable method for fusing the rankings they produce. Subjective tests with human observers on the CITIA animation movie database have been carried out to validate the model.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269850
Comparing segmentation strategies for efficient video passage retrieval
Christian Wartena
We compare the effect of different text segmentation strategies on speech-based passage retrieval of video. Passage retrieval has mainly been studied to improve document retrieval and to enable question answering. In these domains the best results were obtained using passages defined by the paragraph structure of the source documents or by using arbitrary overlapping passages. For the retrieval of relevant passages in a video, using speech transcripts, no author-defined segmentation is available. We compare retrieval results from four different types of segments based on the speech channel of the video: fixed-length segments, a sliding window, semantically coherent segments and prosodic segments. We evaluated the methods on the corpus of the MediaEval 2011 Rich Speech Retrieval task. Our main conclusion is that retrieval results depend strongly on the right choice of segment length. However, results using the segmentation into semantically coherent parts depend much less on the segment length. In particular, the quality of fixed-length and sliding-window segmentation drops quickly when the segment length increases, while the quality of the semantically coherent segments is much more stable. Thus, if coherent segments are defined, longer segments can be used and consequently fewer segments have to be considered at retrieval time.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269807
Examining the applicability of virtual reality technique for video retrieval
Kimiaki Shirahama, K. Uehara, M. Grzegorzek
The Query-By-Example (QBE) approach retrieves shots which are visually similar to example shots provided by a user. However, QBE cannot work if example shots are unavailable. To overcome this, this paper develops a Query-By-Virtual-Example (QBVE) approach in which example shots for a query (virtual examples) are created using virtual reality techniques. A virtual example is created by synthesizing the user's gesture, a 3D object and a background image. Using large-scale video data, we examine the effectiveness of virtual examples from the perspective of video retrieval. In particular, we study the comparison between virtual examples and example shots selected from real videos, the importance of camera movements, the strategy for combining gestures, 3D objects and backgrounds, and individual differences between users.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269812
Automatic chaptering of VoD content based on DVD content
F. Thudor, Ingrid Autier, B. Chupeau, F. Lefèbvre, Lionel Oisel
In this paper we propose a framework for automatically chaptering VoD content based on its DVD version. The idea is to benefit from the artistic work performed for DVD chapter creation, even if the DVD and VoD video content are not exactly the same. The framework is based on sparse-to-dense frame synchronization, combining both global and local image descriptions, together with adaptive video sequence splitting to enable the processing of very long sequences. A way to extract specific information from a DVD is also embedded in the framework. Results of the evaluation performed on official movie releases are presented in the paper.
Pub Date: 2012-06-27 | DOI: 10.1109/CBMI.2012.6269815
DCT sign based robust privacy preserving image copy detection for cloud-based systems
M. Diephuis, S. Voloshynovskiy, O. Koval, F. Beekhof
In this paper we propose an architecture for message-privacy-preserving copy detection and content identification for images based on the signs of the Discrete Cosine Transform (DCT) coefficients. The architecture allows for searching in encrypted data and places the computational burden on the server. Sign components of the low-frequency DCT coefficients of an image are used to generate a dual set of keys that in turn are used to encrypt the source image and serve as a robust hash that can be queried for content identification. The statistical properties of these DCT sign vectors are modelled, and we analyse their robustness against real-world image distortions. Finally, the trade-off between the discriminative power of such vectors, the offered security and the resilience against errors is demonstrated.