Fast geometric consistency test for real time logo detection
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153636
N. Zikos, A. Delopoulos
In this paper we present a method for logo detection in image collections and streams. The proposed method is based on features extracted from reference logo images and test images. Extracted features are matched first with respect to their similarity in descriptor space and then with respect to their geometric consistency on the image plane. The contribution of this paper is a novel method for a fast geometric consistency test. Using state-of-the-art fast matching methods, it produces pairs of similar features between the test image and the reference logo image and then examines which pairs form a consistent geometry in both the test and the reference logo image. It is noteworthy that the proposed method is scale-, rotation- and translation-invariant. The key advantage of the proposed method is that it exhibits much lower computational complexity and better performance than state-of-the-art methods. Experimental results on large-scale datasets are presented to support these statements.
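The abstract leaves the test itself at a high level. As a rough sketch of how a scale-, rotation- and translation-invariant pairwise consistency check over matched keypoints can work (this is not the authors' actual algorithm; the function name, thresholds and voting rule are illustrative assumptions), one can exploit the fact that a similarity transform preserves distance ratios and relative angles between pairs of matches:

```python
import numpy as np

def consistent_matches(ref_pts, test_pts, scale_tol=0.2, angle_tol=0.26, min_support=0.5):
    """Filter matched keypoints by pairwise geometric consistency.

    ref_pts, test_pts: (N, 2) arrays with the coordinates of the N matched
    keypoints in the reference logo and in the test image.  Under a
    similarity transform (scale + rotation + translation) every pair of true
    matches induces the same distance ratio and the same rotation angle, so
    both are estimated robustly and matches supported by enough consistent
    pairs are kept.  Returns a boolean mask over the N matches.
    """
    n = len(ref_pts)
    log_ratios, angles, pairs = [], [], []
    for i in range(n):
        for j in range(i + 1, n):
            dr, dt = ref_pts[j] - ref_pts[i], test_pts[j] - test_pts[i]
            nr, nt = np.linalg.norm(dr), np.linalg.norm(dt)
            if nr < 1e-6 or nt < 1e-6:
                continue
            log_ratios.append(np.log(nt / nr))                    # log of the scale factor
            ang = np.arctan2(dt[1], dt[0]) - np.arctan2(dr[1], dr[0])
            angles.append(np.arctan2(np.sin(ang), np.cos(ang)))   # wrap to (-pi, pi]
            pairs.append((i, j))

    log_s, rot = np.median(log_ratios), np.median(angles)         # robust global estimates
    votes = np.zeros(n)
    for (i, j), lr, a in zip(pairs, log_ratios, angles):
        if abs(lr - log_s) < np.log(1 + scale_tol) and abs(a - rot) < angle_tol:
            votes[i] += 1
            votes[j] += 1
    return votes >= min_support * (n - 1)

# Toy usage: a rotated + scaled copy of the reference points plus one outlier.
ref = np.array([[0, 0], [10, 0], [10, 10], [0, 10], [5, 5]], float)
theta, s = np.pi / 6, 1.5
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
test = s * ref @ R.T + np.array([30.0, 40.0])
test[4] = [200.0, 5.0]                                            # a wrong match
print(consistent_matches(ref, test))                              # last entry should be False
```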
{"title":"Fast geometric consistency test for real time logo detection","authors":"N. Zikos, A. Delopoulos","doi":"10.1109/CBMI.2015.7153636","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153636","url":null,"abstract":"In this paper we present a method for logo detection in image collections and streams. The proposed method is based on features, extracted from reference logo images and test images. Extracted features are combined with respect to their similarity in their descriptors' space and afterwards with respect to their geometric consistency on the image plane. The contribution of this paper is a novel method for fast geometric consistency test. Using state of the art fast matching methods, it produces pairs of similar features between the test image and the reference logo image and then examines which pairs are forming a consistent geometry on both the test and the reference logo image. It is noteworthy that the proposed method is scale, rotation and translation invariant. The key advantage of the proposed method is that it exhibits a much lower computational complexity and better performance than the state of the art methods. Experimental results on large scale datasets are presented to support these statements.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"37 28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125704353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pruning near-duplicate images for mobile landmark identification: A graph theoretical approach
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153635
T. Danisman, J. Martinet, Ioan Marius Bilasco
Automatic landmark identification is one of the most active research topics in the computer vision domain. Efficient and robust identification of landmark points is a challenging task, especially in a mobile context. This paper addresses the pruning of near-duplicate images for creating representative training image sets in order to minimize overall query processing complexity and time. We prune different perspectives of real-world landmarks to find the smallest set of the most representative images. Inspired by graph theory, we represent each class in a separate graph using geometric verification with the well-known RANSAC algorithm. Our iterative method uses maximum-coverage information in each iteration to find a minimal representative set that reduces and prioritizes the images of the initial dataset. Experiments on the Paris dataset show that the proposed method provides robust and accurate results using smaller subsets.
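The abstract outlines the graph-based selection only at a high level. Below is a minimal sketch of one plausible greedy maximum-coverage selection over a near-duplicate graph; the data layout and function name are assumptions for illustration, and the RANSAC-based graph construction is taken as given upstream:

```python
def greedy_representatives(adjacency):
    """Greedy maximum-coverage selection of representative images.

    adjacency: dict mapping image id -> set of image ids verified as
    near-duplicates of it (e.g. via RANSAC-based geometric verification).
    Returns an ordered list of representatives; earlier picks cover more
    images, which also gives a natural prioritisation of the dataset.
    """
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # an image covers itself plus its verified near-duplicates
        best = max(uncovered,
                   key=lambda img: len(({img} | adjacency[img]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | adjacency[best]
    return chosen

# Toy landmark class: image "a" matches most others, "e" is isolated.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": set(),
}
print(greedy_representatives(graph))   # ['a', 'e']
```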
{"title":"Pruning near-duplicate images for mobile landmark identification: A graph theoretical approach","authors":"T. Danisman, J. Martinet, Ioan Marius Bilasco","doi":"10.1109/CBMI.2015.7153635","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153635","url":null,"abstract":"Automatic landmark identification is one of the hot research topics in computer vision domain. Efficient and robust identification of landmark points is a challenging task, especially in a mobile context. This paper addresses the pruning of near-duplicate images for creating representative training image sets to minimize overall query processing complexity and time. We prune different perspectives of real world landmarks to find the smallest set of the most representative images. Inspired from graph theory, we represent each class in a separate graph using geometric verification of well-known RANSAC algorithm. Our iterative method uses maximum coverage information in each iteration to find the minimum representative set to reduce and prioritize the images of the initial dataset. Experiments on Paris dataset show that the proposed method provides robust and accurate results using smaller subsets.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132223460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153603
Esra Acar, F. Hopfgartner, S. Albayrak
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense-trajectory-based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
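As an illustration of the general late-fusion pattern described here (per-modality classifiers whose scores are combined before the final decision), the sketch below trains one multi-class SVM per modality on synthetic stand-in features and averages their class probabilities with illustrative weights; it does not reproduce the paper's CNN, MFCC or dense-trajectory pipelines:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, classes = 200, 4                      # 4 affective classes = VA quadrants

# Stand-ins for per-modality mid-level representations (purely synthetic).
y = rng.integers(0, classes, n)
audio_feat  = rng.normal(size=(n, 32)) + y[:, None] * 0.5
visual_feat = rng.normal(size=(n, 64)) + y[:, None] * 0.3
motion_feat = rng.normal(size=(n, 48)) + y[:, None] * 0.4

def modality_scores(X, y):
    """One multi-class SVM per modality, returning class-probability scores
    on a held-out portion of the data."""
    clf = SVC(kernel="rbf", probability=True, random_state=0)
    clf.fit(X[:150], y[:150])
    return clf.predict_proba(X[150:])

# Late fusion: weighted average of the per-modality class scores.
weights = {"audio": 0.3, "visual": 0.3, "motion": 0.4}   # illustrative weights
scores = (weights["audio"]  * modality_scores(audio_feat, y)
        + weights["visual"] * modality_scores(visual_feat, y)
        + weights["motion"] * modality_scores(motion_feat, y))
pred = scores.argmax(axis=1)
print("fused accuracy:", (pred == y[150:]).mean())
```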
{"title":"Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos","authors":"Esra Acar, F. Hopfgartner, S. Albayrak","doi":"10.1109/CBMI.2015.7153603","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153603","url":null,"abstract":"When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently from the chosen representation.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129380878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learned features versus engineered features for semantic video indexing
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153637
Mateusz Budnik, Efrain-Leonardo Gutierrez-Gomez, Bahjat Safadi, G. Quénot
In this paper, we compare “traditional” engineered (hand-crafted) features (or descriptors) and learned features for content-based semantic indexing of video documents. Learned (or semantic) features are obtained by training classifiers for other target concepts on other data. These classifiers are then applied to the current collection. The vector of classification scores is the new feature used for training a classifier for the current target concepts on the current collection. If the classifiers used on the other collection are of the Deep Convolutional Neural Network (DCNN) type, it is possible to use as a new feature not only the score values provided by the last layer but also the intermediate values corresponding to the output of all the hidden layers. We made an extensive comparison of the performance of such features with traditional engineered ones, as well as with combinations of them. The comparison was made in the context of the TRECVid semantic indexing task. Our results confirm those obtained for still images: features learned from other training data generally outperform engineered features for concept recognition. Additionally, we found that directly training SVM classifiers using these features does significantly better than partially retraining the DCNN to adapt it to the new data. We also found that, even though the learned features performed better than the engineered ones, the fusion of both performs significantly better, indicating that engineered features are still useful, at least in this case.
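A minimal sketch of the generic pattern discussed here: taking the output of a hidden layer of a pretrained DCNN as a feature vector and training an SVM classifier on it. The network (torchvision's AlexNet), the chosen layer and the random stand-in images are assumptions for illustration, not the models or data used in the paper:

```python
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# A pretrained network used as a fixed feature extractor (stand-in model).
net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
net.eval()

def hidden_layer_features(images, layer=5):
    """Return the output of one hidden fully-connected layer as a feature
    vector per image.  `images` is a (N, 3, 224, 224) float tensor."""
    with torch.no_grad():
        x = net.features(images)
        x = net.avgpool(x).flatten(1)
        # run the classifier head only up to the requested hidden layer
        for module in list(net.classifier)[:layer]:
            x = module(x)
    return x.numpy()

# Illustrative only: random tensors standing in for real video keyframes.
frames = torch.rand(8, 3, 224, 224)
labels = [0, 0, 1, 1, 0, 1, 0, 1]
feats = hidden_layer_features(frames)

svm = LinearSVC(C=1.0)        # train a concept classifier on the learned features
svm.fit(feats, labels)
print(svm.predict(feats[:2]))
```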
{"title":"Learned features versus engineered features for semantic video indexing","authors":"Mateusz Budnik, Efrain-Leonardo Gutierrez-Gomez, Bahjat Safadi, G. Quénot","doi":"10.1109/CBMI.2015.7153637","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153637","url":null,"abstract":"In this paper, we compare “traditional” engineered (hand-crafted) features (or descriptors) and learned features for content-based semantic indexing of video documents. Learned (or semantic) features are obtained by training classifiers for other target concepts on other data. These classifiers are then applied to the current collection. The vector of classification scores is the new feature used for training a classifier for the current target concepts on the current collection. If the classifiers used on the other collection are of the Deep Convolutional Neural Network (DCNN) type, it is possible to use as a new feature not only the score values provided by the last layer but also the intermediate values corresponding to the output of all the hidden layers. We made an extensive comparison of the performance of such features with traditional engineered ones as well as with combinations of them. The comparison was made in the context of the TRECVid semantic indexing task. Our results confirm those obtained for still images: features learned from other training data generally outperform engineered features for concept recognition. Additionally, we found that directly training SVM classifiers using these features does significantly better than partially retraining the DCNN for adapting it to the new data. We also found that, even though the learned features performed better that the engineered ones, the fusion of both of them perform significantly better, indicating that engineered features are still useful, at least in this case.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128932636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Web image size prediction for efficient focused image crawling
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153609
K. Andreadou, S. Papadopoulos, Y. Kompatsiaris
In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. A serious bottleneck in such set-ups pertains to the fetching of image content, since for each web page a large number of HTTP requests need to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that there is often no dimension information in the HTML img tag of images, to filter out small images, an image crawler would still need to issue a GET request and download the respective files before deciding whether to index them. To address this limitation, in this paper, we explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. We present two different methodologies: The first one is based on a common text classification approach using the n-grams or tokens of the image URLs and the second one relies on the HTML elements surrounding the image. Eventually, we combine these two techniques, and achieve considerable improvement in terms of accuracy, leading to a highly effective filtering component that can significantly improve the speed and efficiency of the image crawler.
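The first of the two methodologies, classifying image URLs from their n-grams, can be sketched as a standard character n-gram text classifier; the URLs, labels and the choice of logistic regression below are made up for illustration and do not reproduce the paper's training data or exact classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = "big" image (worth downloading), 0 = small.
urls = [
    "http://example.com/images/2015/06/holiday-photo-1920x1080.jpg",
    "http://example.com/uploads/article-header-large.jpg",
    "http://example.com/static/icons/rss-16.png",
    "http://example.com/img/button-small.gif",
    "http://example.com/gallery/full/beach_panorama.jpg",
    "http://example.com/assets/sprite/arrow.png",
]
labels = [1, 1, 0, 0, 1, 0]

# Character n-grams of the URL serve as tokens, since URLs rarely split into
# natural words; a linear model then predicts the size class.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(urls, labels)

# The toy model should lean towards [0, 1] here, though a training set this
# small is of course far too tiny to be reliable.
print(model.predict(["http://example.com/icons/star-24.png",
                     "http://example.com/photos/wedding-full-size.jpg"]))
```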
{"title":"Web image size prediction for efficient focused image crawling","authors":"K. Andreadou, S. Papadopoulos, Y. Kompatsiaris","doi":"10.1109/CBMI.2015.7153609","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153609","url":null,"abstract":"In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. A serious bottleneck in such set-ups pertains to the fetching of image content, since for each web page a large number of HTTP requests need to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that there is often no dimension information in the HTML img tag of images, to filter out small images, an image crawler would still need to issue a GET request and download the respective files before deciding whether to index them. To address this limitation, in this paper, we explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. We present two different methodologies: The first one is based on a common text classification approach using the n-grams or tokens of the image URLs and the second one relies on the HTML elements surrounding the image. Eventually, we combine these two techniques, and achieve considerable improvement in terms of accuracy, leading to a highly effective filtering component that can significantly improve the speed and efficiency of the image crawler.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116147786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On efficient content-based near-duplicate video detection
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153633
M. S. Uysal, C. Beecks, T. Seidl
The heavy usage of the Internet, in particular video-sharing and social networking websites, has recently led to an enormous amount of video data, raising demand for effective and efficient content-based near-duplicate video detection approaches. In this paper, we propose to search efficiently for near-duplicate videos by utilizing efficient approximation techniques for the well-known and effective Earth Mover's Distance (EMD) similarity measure. To this end, we model keyframes by flexible feature representations which are then exploited in a filter-and-refine architecture to reduce query processing time. Experiments on real data indicate high efficiency with a guaranteed reduction in the number of EMD computations, which benefits near-duplicate detection in video datasets.
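The abstract describes a filter-and-refine architecture around EMD approximations without giving details. The sketch below illustrates only the filter-and-refine principle on simple 1-D histograms, using the absolute difference of means, which is a valid lower bound of the 1-D EMD (Wasserstein-1 distance), as the cheap filter; the paper's actual feature signatures and approximation techniques are not reproduced here:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mean_lower_bound(u_values, u_weights, v_values, v_weights):
    """|E[U] - E[V]| is a lower bound of the 1-D EMD (Wasserstein-1),
    because x -> x is 1-Lipschitz; it is far cheaper than the exact EMD."""
    mu = np.average(u_values, weights=u_weights)
    mv = np.average(v_values, weights=v_weights)
    return abs(mu - mv)

def near_duplicates(query, database, threshold):
    """Filter-and-refine range query: skip the exact EMD whenever the cheap
    lower bound already exceeds the threshold.  Each entry is a pair
    (bin_positions, bin_weights) of a normalised 1-D histogram."""
    results, exact_computations = [], 0
    q_vals, q_wts = query
    for idx, (vals, wts) in enumerate(database):
        if mean_lower_bound(q_vals, q_wts, vals, wts) > threshold:
            continue                        # pruned: cannot be a near-duplicate
        exact_computations += 1
        if wasserstein_distance(q_vals, vals, q_wts, wts) <= threshold:
            results.append(idx)
    return results, exact_computations

# Toy data: 8-bin grey-level histograms for a query keyframe and a database.
bins = np.arange(8, dtype=float)
rng = np.random.default_rng(1)
query = (bins, np.array([0.3, 0.25, 0.2, 0.1, 0.05, 0.05, 0.03, 0.02]))
database = [(bins, w / w.sum()) for w in rng.random((100, 8))]

matches, n_exact = near_duplicates(query, database, threshold=1.0)
print(f"{len(matches)} matches, exact EMD computed {n_exact}/100 times")
```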
{"title":"On efficient content-based near-duplicate video detection","authors":"M. S. Uysal, C. Beecks, T. Seidl","doi":"10.1109/CBMI.2015.7153633","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153633","url":null,"abstract":"The high usage of the Internet, in particular videosharing and social networking websites, have led to enormous amount of video data recently, raising demand on effective and efficient content-based near-duplicate video detection approaches. In this paper, we propose to efficiently search for near-duplicate videos via the utilization of efficient approximation techniques of the well-known effective similarity measure Earth Mover's Distance (EMD). To this end, we model keyframes by flexible feature representations which are then exploited in a filter-and-refine architecture to alleviate the query processing time. The experiments on real data indicate high efficiency guaranteeing reduced number of EMD computations, which contributes to the near-duplicate detection in video datasets.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116624896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empirical evaluation of dissimilarity measures for 3D object retrieval with application to multi-feature retrieval
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153629
Robert Gregor, Andreas Lamprecht, I. Sipiran, T. Schreck, B. Bustos
A common approach for implementing content-based multimedia retrieval tasks resorts to extracting high-dimensional feature vectors from the multimedia objects. In combination with an appropriate dissimilarity function, such as the well-known Lp functions or statistical measures like χ2, one can rank objects by dissimilarity with respect to a query. For many multimedia retrieval problems, a large number of feature extraction methods have been proposed and experimentally evaluated for their effectiveness. Much less work has been done to systematically study the impact of the choice of dissimilarity function on retrieval effectiveness. Inspired by previous work which compared dissimilarity functions for image retrieval, we provide an extensive comparison of dissimilarity measures for 3D object retrieval. Our study is based on an encompassing set of feature extractors, dissimilarity measures and benchmark data sets. We identify the best-performing dissimilarity measures and in turn identify dependencies between well-performing dissimilarity measures and types of 3D features. Based on these findings, we show that the effectiveness of 3D retrieval can be improved by a feature-dependent choice of measure. In addition, we apply different normalization schemes to the dissimilarity distributions in order to show improved retrieval effectiveness for late fusion of multi-feature combinations. Finally, we present preliminary findings on the correlation of rankings for dissimilarity measures, which could be exploited for further improvement of retrieval effectiveness for single features as well as combinations.
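For concreteness, the sketch below shows the kinds of dissimilarity functions being compared, an Lp (Minkowski) distance and the χ2 measure, and how objects are ranked by dissimilarity to a query; the toy descriptors are synthetic, and the feature extractors and benchmarks of the study are not reproduced:

```python
import numpy as np

def minkowski(x, y, p):
    """Lp dissimilarity between two feature vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chi2(x, y, eps=1e-10):
    """Chi-squared dissimilarity, common for histogram-like descriptors."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def rank(query, database, dissim):
    """Rank database objects by increasing dissimilarity to the query."""
    d = np.array([dissim(query, obj) for obj in database])
    return np.argsort(d)

# Toy example: non-negative descriptors (e.g. normalised shape histograms).
rng = np.random.default_rng(0)
database = rng.random((50, 64))
query = database[7] + 0.01 * rng.random(64)    # a slightly perturbed copy

# Object 7 should come first under all three measures.
print("L1 top-3  :", rank(query, database, lambda a, b: minkowski(a, b, 1))[:3])
print("L2 top-3  :", rank(query, database, lambda a, b: minkowski(a, b, 2))[:3])
print("chi2 top-3:", rank(query, database, chi2)[:3])
```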
{"title":"Empirical evaluation of dissimilarity measures for 3D object retrieval with application to multi-feature retrieval","authors":"Robert Gregor, Andreas Lamprecht, I. Sipiran, T. Schreck, B. Bustos","doi":"10.1109/CBMI.2015.7153629","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153629","url":null,"abstract":"A common approach for implementing content-based multimedia retrieval tasks resorts to extracting high-dimensional feature vectors from the multimedia objects. In combination with an appropriate dissimilarity function, such as the well-known Lp functions or statistical measures like χ2, one can rank objects by dissimilarity with respect to a query. For many multimedia retrieval problems, a large number of feature extraction methods have been proposed and experimentally evaluated for their effectiveness. Much less work has been done to systematically study the impact of the choice of dissimilarity function on the retrieval effectiveness. Inspired by previous work which compared dissimilarity functions for image retrieval, we provide an extensive comparison of dissimilarity measures for 3D object retrieval. Our study is based on an encompassing set of feature extractors, dissimilarity measures and benchmark data sets. We identify the best performing dissimilarity measures and in turn identify dependencies between well-performing dissimilarity measures and types of 3D features. Based on these findings, we show that the effectiveness of 3D retrieval can be improved by a feature-dependent measure choice. In addition, we apply different normalization schemes to the dissimilarity distributions in order to show improved retrieval effectiveness for late fusion of multi-feature combination. Finally, we present preliminary findings on the correlation of rankings for dissimilarity measures, which could be exploited for further improvement of retrieval effectiveness for single features as well as combinations.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"18 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113961386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multi-dimensional meter-adaptive method for automatic segmentation of music
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153601
Cyril Gaudefroy, H. Papadopoulos, M. Kowalski
Music structure appears on a wide variety of temporal levels (notes, bars, phrases, etc.). Its highest-level expression is therefore dependent on music's lower-level organization, especially beats and bars. We propose a method for automatic structure segmentation that uses musically meaningful information and is content-adaptive. It relies on a meter-adaptive signal representation that avoids the use of empirically set parameters. Moreover, our method is designed to combine multiple signal features to account for various musical dimensions. Finally, it also combines multiple structural principles that yield complementary results. The resulting algorithm already outperforms state-of-the-art methods, especially within small tolerance windows, and yet offers several encouraging directions for improvement.
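The meter-adaptive representation and the combination of features and structural principles are specific to the paper and not reproduced here; the sketch below only illustrates a common underlying building block, namely averaging frame-level features over (possibly irregular) bar spans and locating boundaries as peaks of a checkerboard novelty curve on the bar-level self-similarity matrix (Foote-style), on synthetic data:

```python
import numpy as np

def bar_sync(features, bar_boundaries):
    """Average frame-level features (frames x dims) within each bar, so that
    the time axis of the representation adapts to the metrical structure."""
    return np.array([features[s:e].mean(axis=0)
                     for s, e in zip(bar_boundaries[:-1], bar_boundaries[1:])])

def novelty_curve(bar_features, half_kernel=4):
    """Checkerboard-kernel novelty on the bar-level self-similarity matrix;
    peaks suggest structural boundaries."""
    X = bar_features / (np.linalg.norm(bar_features, axis=1, keepdims=True) + 1e-9)
    ssm = X @ X.T                                    # bar-level self-similarity
    k = half_kernel
    signs = np.r_[np.ones(k), -np.ones(k)]
    kernel = np.outer(signs, signs)                  # +1 on diagonal blocks, -1 elsewhere
    nov = np.zeros(len(ssm))
    for t in range(k, len(ssm) - k):
        nov[t] = np.sum(kernel * ssm[t - k:t + k, t - k:t + k])
    return nov

# Toy signal: two "sections" with different (synthetic) feature statistics.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0, 1, (400, 12)), rng.normal(3, 1, (400, 12))])
bar_boundaries = np.arange(0, 801, 20)               # pretend every bar spans ~20 frames
bars = bar_sync(frames, bar_boundaries)
nov = novelty_curve(bars)
print("strongest boundary near bar", int(np.argmax(nov)))   # expect ~20 (frame 400)
```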
{"title":"A multi-dimensional meter-adaptive method for automatic segmentation of music","authors":"Cyril Gaudefroy, H. Papadopoulos, M. Kowalski","doi":"10.1109/CBMI.2015.7153601","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153601","url":null,"abstract":"Music structure appears on a wide variety of temporal levels (notes, bars, phrases, etc). Its highest-level expression is therefore dependent on music's lower-level organization, especially beats and bars. We propose a method for automatic structure segmentation that uses musically meaningful information and is content-adaptive. It relies on a meter-adaptive signal representation that prevents from the use of empirical parameters. Moreover, our method is designed to combine multiple signal features to account for various musical dimensions. Finally, it also combines multiple structural principles that yield complementary results. The resulting algorithm proves to already outperform state-of-the-art methods, especially within small tolerance windows, and yet offers several encouraging improvement directions.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121690604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An ontology framework for automated visual surveillance system
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153628
Faranak Sobhani, N. F. Kahar, Qianni Zhang
This paper presents the analysis and development of a forensic domain ontology to support an automated visual surveillance system. The proposed domain ontology is built on a specific use case based on the severe riots that swept across major UK cities with devastating effects during the summer of 2011. The proposed ontology aims at facilitating the description of activities, entities, relationships, resources and consequences of the event. The study exploits 3.07 TB of data provided by London's Metropolitan Police (Scotland Yard) as part of the European LASIE project. The data has been analyzed and used to guarantee adherence to a real-world application scenario. A top-down development approach to the ontology design has been taken. The ontology is also used to demonstrate how high-level reasoning can be incorporated into an automated forensic system. Thus, the designed ontology is also the basis for future development of knowledge inference in response to domain-specific queries.
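The LASIE ontology itself is not reproduced here; as a loose illustration of how a small fragment of such a domain ontology (events, participants, resources, consequences) could be encoded and queried, the sketch below uses rdflib with made-up class, property and instance names:

```python
from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/forensic#")    # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# A tiny, made-up fragment: classes for events, participants and evidence.
for cls in ("Event", "Riot", "Person", "Resource", "CCTVFootage", "Consequence"):
    g.add((EX[cls], RDF.type, RDFS.Class))
g.add((EX.Riot, RDFS.subClassOf, EX.Event))
g.add((EX.CCTVFootage, RDFS.subClassOf, EX.Resource))

# Properties relating entities, activities, resources and consequences.
for prop in ("hasParticipant", "documentedBy", "resultsIn", "occursAt"):
    g.add((EX[prop], RDF.type, RDF.Property))

# One example instance of an event and its supporting evidence.
g.add((EX.event_042, RDF.type, EX.Riot))
g.add((EX.event_042, EX.occursAt, Literal("Tottenham, London")))
g.add((EX.event_042, EX.documentedBy, EX.cctv_clip_17))
g.add((EX.cctv_clip_17, RDF.type, EX.CCTVFootage))

# A simple domain-specific query: which resources document which events?
q = "SELECT ?event ?resource WHERE { ?event ex:documentedBy ?resource . }"
for row in g.query(q, initNs={"ex": EX}):
    print(row.event, "is documented by", row.resource)
```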
{"title":"An ontology framework for automated visual surveillance system","authors":"Faranak Sobhani, N. F. Kahar, Qianni Zhang","doi":"10.1109/CBMI.2015.7153628","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153628","url":null,"abstract":"This paper presents analysis and development of a forensic domain ontology to support an automated visual surveillance system. The proposed domain ontology is built on a specific use case based on the severe riots that swept across major UK cities with devastating effects during the summer 2011. The proposed ontology aims at facilitating the description of activities, entities, relationships, resources and consequences of the event. The study exploits 3.07 TB data provided by the Londons Metropolitan Police (Scotland Yard) as a part of European LASIE project1. The data has been analyzed and used to guarantee adherence to a real-world application scenario. A `top-down development' approach to the ontology design has been taken. The ontology is also used to demonstrate how high level reasoning can be incorporated into an automatop-ted forensic system. Thus, the designed ontology is also the base for future development of knowledge inference as response to domain specific queries.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125260227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies
Pub Date: 2015-06-10 | DOI: 10.1109/CBMI.2015.7153600
J. Pont-Tuset, Miquel A. Farre, A. Smolic
For applications that require very accurate video object segmentations, semi-automatic algorithms are typically used, which help operators to minimize the annotation time, as off-the-shelf automatic segmentation techniques are still far from being precise enough in this context. This paper presents a novel interface based on a click-and-drag interaction that allows operators to rapidly select regions from state-of-the-art segmentation hierarchies. The interface is very responsive, allows very accurate segmentations to be obtained, and is designed to minimize human interaction. To evaluate the results, we provide a new set of video object ground-truth data.
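The interface and the segmentation hierarchies used in the paper are not reproduced here; the sketch below only illustrates the kind of data-structure operation a click-and-drag selection could map to, with a toy merge tree in which a click picks the smallest region under the cursor and dragging selects progressively larger ancestors (all names and the interaction mapping are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """One node of a segmentation hierarchy (a merge tree)."""
    name: str
    pixels: set = field(default_factory=set)    # pixel ids covered by this node
    parent: "Region" = None
    children: list = field(default_factory=list)

def make_node(name, children=()):
    node = Region(name)
    for c in children:
        c.parent = node
        node.children.append(c)
        node.pixels |= c.pixels
    return node

def leaf_under(root, pixel):
    """Smallest region of the hierarchy containing the clicked pixel."""
    node = root
    while node.children:
        node = next(c for c in node.children if pixel in c.pixels)
    return node

def drag_up(node, levels):
    """Climbing the hierarchy: each drag step selects the parent region."""
    for _ in range(levels):
        if node.parent is not None:
            node = node.parent
    return node

# Toy hierarchy: 4 superpixels merged into 2 parts, then the full object.
a, b, c, d = (Region(n, {i}) for i, n in enumerate("abcd"))
left, right = make_node("left", (a, b)), make_node("right", (c, d))
root = make_node("object", (left, right))

clicked = leaf_under(root, pixel=2)          # click lands on superpixel "c"
print(clicked.name)                          # -> c
print(drag_up(clicked, 1).name)              # -> right (drag expands selection)
print(drag_up(clicked, 2).name)              # -> object
```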
{"title":"Semi-automatic video object segmentation by advanced manipulation of segmentation hierarchies","authors":"J. Pont-Tuset, Miquel A. Farre, A. Smolic","doi":"10.1109/CBMI.2015.7153600","DOIUrl":"https://doi.org/10.1109/CBMI.2015.7153600","url":null,"abstract":"For applications that require very accurate video object segmentations, semi-automatic algorithms are typically used, which help operators to minimize the annotation time, as off-the-shelf automatic segmentation techniques are still far from being precise enough in this context. This paper presents a novel interface based on a click-and-drag interaction that allows to rapidly select regions from state-of-the-art segmentation hierarchies. The interface is very responsive, allows to obtain very accurate segmentations, and is designed to minimize the human interaction. To evaluate the results, we provide a new set of object video ground truth data.","PeriodicalId":387496,"journal":{"name":"2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127590209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}