YFCC100M HybridNet fc6 Deep Features for Content-Based Image Retrieval
Giuseppe Amato, F. Falchi, C. Gennaro, F. Rabitti
This paper presents a corpus of deep features extracted from the YFCC100M images, taken from the fc6 hidden-layer activations of the HybridNet deep convolutional neural network. For a set of randomly selected queries, we make available the k-NN results obtained by sequentially scanning the entire feature set, comparing the features using both the Euclidean distance and the Hamming distance on a binarized version of the features. This set of results serves as ground truth for evaluating Content-Based Image Retrieval (CBIR) systems that use approximate similarity search methods for efficient and scalable indexing. Moreover, we present experimental results obtained by indexing this corpus with two distinct approaches: the Metric Inverted File and Lucene Quantization. These two CBIR systems are publicly available online and allow real-time search using both internal and external queries.
DOI: https://doi.org/10.1145/2983554.2983557

Developing Benchmarks: The Importance of the Process and New Paradigms
R. Ordelman
The value and importance of benchmark evaluations is widely acknowledged, and benchmarks play a key role in many research projects. It takes time, a well-balanced team of domain specialists, preferably with links to the user community and industry, and a strong involvement of the research community itself to establish a sound evaluation framework that includes (annotated) data sets, well-defined tasks that reflect needs in the 'real world', a proper evaluation methodology, ground truth, a strategy for repeated assessments, and, last but not least, funding. Although the benefits of an evaluation framework are typically reviewed from the perspective of 'research output' -- e.g., a scientific publication demonstrating an advance of a certain methodology -- it is important to be aware of the value of the process of creating a benchmark itself: it significantly increases the understanding of the problem we want to address and, as a consequence, also the impact of the evaluation outcomes. In this talk I will give an overview of the history of a series of tasks focusing on audiovisual search, emphasizing their 'multimodal' aspects, starting in 2006 with the workshop on 'Searching Spontaneous Conversational Speech' that led to tasks in CLEF and MediaEval ("Search and Hyperlinking"), and recently also TRECVid ("Video Hyperlinking"). The focus of my talk will be on the process rather than on the results of these evaluations themselves, and will address cross-benchmark connections and new benchmark paradigms, specifically the integration of benchmarking in industrial 'living labs' that are becoming popular in some domains.
DOI: https://doi.org/10.1145/2983554.2983562

In-depth Exploration of Geotagging Performance using Sampling Strategies on YFCC100M
Giorgos Kordopatis-Zilos, S. Papadopoulos, Y. Kompatsiaris
Evaluating multimedia analysis and retrieval systems is a highly challenging task, whose outcomes can be highly volatile depending on the selected test collection. In this paper, we focus on the problem of multimedia geotagging, i.e. estimating the geographical location of a media item based on its content and metadata, in order to show that very different evaluation outcomes may be obtained depending on the test collection at hand. To alleviate this problem, we propose an evaluation methodology based on an array of sampling strategies over a reference test collection, together with a way of quantifying and summarizing the volatility of performance measurements. We report experimental results on the MediaEval 2015 Placing Task dataset and demonstrate that the proposed methodology can help capture the performance of geotagging systems in a comprehensive manner that is complementary to existing evaluation approaches.
{"title":"In-depth Exploration of Geotagging Performance using Sampling Strategies on YFCC100M","authors":"Giorgos Kordopatis-Zilos, S. Papadopoulos, Y. Kompatsiaris","doi":"10.1145/2983554.2983558","DOIUrl":"https://doi.org/10.1145/2983554.2983558","url":null,"abstract":"Evaluating multimedia analysis and retrieval systems is a highly challenging task, of which the outcomes can be highly volatile depending on the selected test collection. In this paper, we focus on the problem of multimedia geotagging, i.e. estimating the geographical location of a media item based on its content and metadata, in order to showcase that very different evaluation outcomes may be obtained depending on the test collection at hand. To alleviate this problem, we propose an evaluation methodology based on an array of sampling strategies over a reference test collection, and a way of quantifying and summarizing the volatility of performance measurements. We report experimental results on the MediaEval 2015 Placing Task dataset, and demonstrate that the proposed methodology could help capture the performance of geotagging systems in a comprehensive manner that is complementary to existing evaluation approaches.","PeriodicalId":340803,"journal":{"name":"Proceedings of the 2016 ACM Workshop on Multimedia COMMONS","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128452802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Concept-Level Multimodal Ranking of Flickr Photo Tags via Recall Based Weighting
R. Shah, Yi Yu, Suhua Tang, S. Satoh, Akshay Verma, Roger Zimmermann
Social media platforms allow users to annotate photos with tags, which significantly facilitates effective semantic understanding, search, and retrieval of photos. However, due to the manual, ambiguous, and personalized nature of user tagging, many tags of a photo are in a random order and some are even irrelevant to the visual content. Aiming to automatically compute tag relevance for a given photo, we propose a tag ranking scheme based on voting from photo neighbors derived from multimodal information. Specifically, we determine photo neighbors by leveraging geo, visual, and semantic concepts derived from spatial information, visual content, and textual metadata, respectively. We leverage high-level features instead of traditional low-level features to compute tag relevance. Experimental results on a representative set of 203,840 photos from the YFCC100M dataset confirm that the above-mentioned multimodal concepts complement each other in computing tag relevance. Moreover, we explore the fusion of multimodal information to refine the tag ranking, using recall-based weighting. Experimental results on this representative set confirm that the proposed algorithm outperforms the state of the art.
{"title":"Concept-Level Multimodal Ranking of Flickr Photo Tags via Recall Based Weighting","authors":"R. Shah, Yi Yu, Suhua Tang, S. Satoh, Akshay Verma, Roger Zimmermann","doi":"10.1145/2983554.2983555","DOIUrl":"https://doi.org/10.1145/2983554.2983555","url":null,"abstract":"Social media platforms allow users to annotate photos with tags that significantly facilitate an effective semantics understanding, search, and retrieval of photos. However, due to the manual, ambiguous, and personalized nature of user tagging, many tags of a photo are in a random order and even irrelevant to the visual content. Aiming to automatically compute tag relevance for a given photo, we propose a tag ranking scheme based on voting from photo neighbors derived from multimodal information. Specifically, we determine photo neighbors leveraging geo, visual, and semantics concepts derived from spatial information, visual content, and textual metadata, respectively. We leverage high-level features instead traditional low-level features to compute tag relevance. Experimental results on a representative set of 203,840 photos from the YFCC100M dataset confirm that above-mentioned multimodal concepts complement each other in computing tag relevance. Moreover, we explore the fusion of multimodal information to refine tag ranking leveraging recall based weighting. Experimental results on the representative set confirm that the proposed algorithm outperforms state-of-the-arts.","PeriodicalId":340803,"journal":{"name":"Proceedings of the 2016 ACM Workshop on Multimedia COMMONS","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Which Languages do People Speak on Flickr?: A Language and Geo-Location Study of the YFCC100m Dataset
Alireza Koochali, Sebastian Kalkowski, A. Dengel, Damian Borth, Christian Schulze
Recently, the Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset was introduced to the computer vision and multimedia research community. This dataset consists of millions of images and videos spread over the globe. This geo-distribution hints at a potentially large set of different languages being used in the titles, descriptions, and tags of these images and videos. Since the YFCC100m metadata does not provide any information about the languages used in the dataset, this paper presents the first analysis of this kind. The language and geo-location characteristics of the YFCC100m dataset are described by providing (a) an overview of the languages used, (b) language-to-country associations, and (c) second-language usage in this dataset. Knowing the languages used in titles, descriptions, and tags, users of the dataset can make language-specific decisions when selecting subsets of images, e.g., for proper training of classifiers, or analyze user behavior specific to a spoken language. This language information is also essential for further linguistic studies on the metadata of the YFCC100m dataset.
{"title":"Which Languages do People Speak on Flickr?: A Language and Geo-Location Study of the YFCC100m Dataset","authors":"Alireza Koochali, Sebastian Kalkowski, A. Dengel, Damian Borth, Christian Schulze","doi":"10.1145/2983554.2983560","DOIUrl":"https://doi.org/10.1145/2983554.2983560","url":null,"abstract":"Recently, the Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset was introduced to the computer vision and multimedia research community. This dataset consists of millions of images and videos spread over the globe. This geo-distribution hints at a potentially large set of different languages being used in titles, descriptions, and tags of these images and videos. Since the YFCC100m metadata does not provide any information about the languages used in the dataset, this paper presents the first analysis of this kind. The language and geo-location characteristics of the YFCC100m dataset is described by providing (a) an overview of used languages, (b) language to country associations, and (c) second language usage in this dataset. Being able to know the language spoken in titles, descriptions, and tags, users of the dataset can make language specific decisions to select subsets of images for, e.g., proper training of classifiers or analyze user behavior specific to their spoken language. Also, this language information is essential for further linguistic studies on the metadata of the YFCC100m dataset.","PeriodicalId":340803,"journal":{"name":"Proceedings of the 2016 ACM Workshop on Multimedia COMMONS","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122662700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Analysis of Spatial, Temporal, and Content Characteristics of Videos in the YFCC100M Dataset
Jun-Ho Choi, Jong-Seok Lee
The Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M) is one of the largest public databases of images and videos with annotations for research on multimedia analysis. In this paper, we present a study of the characteristics of the 0.8 million videos in the dataset from spatial, temporal, and content perspectives. For this, all the video frames and the metadata of the videos are examined. In addition, a user-wise analysis of these characteristics is conducted. We make the obtained results publicly available in the form of a metadata dataset for the research community.
DOI: https://doi.org/10.1145/2983554.2983559
{"title":"Proceedings of the 2016 ACM Workshop on Multimedia COMMONS","authors":"","doi":"10.1145/2983554","DOIUrl":"https://doi.org/10.1145/2983554","url":null,"abstract":"","PeriodicalId":340803,"journal":{"name":"Proceedings of the 2016 ACM Workshop on Multimedia COMMONS","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128315188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}