A Robust Ensemble of ResNets for Character Level End-to-end Text Detection in Natural Scene Images
Jinsu Kim, Yoonhyung Kim, Changick Kim
DOI: 10.1145/3095713.3095724

Detecting text in natural scene images is a challenging task. In this paper, we propose a character-level end-to-end text detection algorithm for natural scene images. In general, text detection tasks are categorized into three parts: text localization, text segmentation, and text recognition. The proposed method aims not only to localize but also to recognize text. To accomplish these tasks, the method consists of four steps: character candidate patch extraction, patch classification using an ensemble of ResNets, non-character region elimination, and character region grouping via self-tuning spectral clustering. In the character candidate patch extraction step, candidate patches are extracted from the image using both edge information from multi-scale images and Maximally Stable Extremal Regions (MSERs). Each patch is then classified as either a character patch or a non-character patch by a deep network composed of three ResNets with different hyper-parameters. Text regions are determined by filtering out non-character patches. To further reduce classification errors, character-specific characteristics are used to refine the results of the ResNet ensemble. To evaluate text detection performance, character regions are grouped via self-tuning spectral clustering. The proposed method shows competitive performance on the ICDAR 2013 dataset.
Learning Selection of User Generated Event Videos
W. Bailer, M. Winter, Stefanie Wechtitsch
DOI: 10.1145/3095713.3095715

User-generated images and videos can enhance the coverage of live events on social and online media, as well as in broadcasts. However, the quality, relevance, and complementarity of the received contributions vary greatly. In a live scenario, it is often not feasible for the editorial team to review all content and make selections. We propose to support this work by automatic selection based on captured metadata and extracted quality and content features. Since a human in the loop is usually desired, the automatic system does not make a final decision but provides a ranked list of content items. As the operator makes selections, the automatic system learns from these decisions, which may change over time. Due to the need for online learning and quick adaptation, we propose the use of online random forests for this task. We show on data from three real live events that the approach is able to provide a ranking based on the predicted selection likelihood after an initial adjustment phase.
Connecting the Dots: Enhancing the Usability of Indexed Multimedia Data for AR Cultural Heritage Applications through Storytelling
Jae-eun Shin, Hyerim Park, Woontack Woo
DOI: 10.1145/3095713.3095725

This paper proposes a method to effectively utilize multimedia databases created and indexed by a metadata schema designed specifically for AR applications used at cultural heritage sites. We do so by incorporating storytelling principles that employ video data to provide useful and meaningful guidance at Changdeokgung Palace, a UNESCO World Heritage Site that contains multiple points of interest (PoIs). We designed a themed narrative that embeds video data related to the PoIs, creating a guide route that connects each PoI in a fixed order. An extensive between-group user evaluation comparing a search-based AR experience with the proposed narrative-based one was conducted to assess the validity and effectiveness of our approach. Our results show that storytelling is a powerful tool for enhancing the level of immersion users experience through video data in an AR environment at cultural heritage sites.
FuseMe: Classification of sMRI images by fusion of Deep CNNs in 2D+ε projections
Karim Aderghal, J. Benois-Pineau, K. Afdel, G. Catheline
DOI: 10.1145/3095713.3095749

Methods for content-based visual information indexing and retrieval are increasingly entering healthcare and becoming popular in computer-aided diagnosis. Multimedia in medical imaging means not only different imaging modalities but also multiple views of the same physiological object, such as the human brain. In this paper we propose a multi-projection fusion approach with CNNs for the diagnosis of Alzheimer's Disease. Instead of working with the whole brain volume, it fuses CNNs for each brain projection (sagittal, coronal, and axial), each ingesting the 2D+ε limited volume we have previously proposed. Three binary classification tasks are considered, separating Alzheimer's Disease (AD) patients, Mild Cognitive Impairment (MCI) patients, and normal control subjects (NC). Two fusion methods, one at the FC layer and one on the single-projection CNN outputs, improve performance, reaching up to 91%, and are competitive with state-of-the-art approaches that rely on heavier algorithmic chains.
Harvesting Deep Models for Cross-Lingual Image Annotation
Qijie Wei, Xiaoxu Wang, Xirong Li
DOI: 10.1145/3095713.3095751

This paper considers cross-lingual image annotation, harvesting deep visual models from one language to annotate images with labels from another language. This task cannot be accomplished by machine translation, as labels can be ambiguous and a translated vocabulary leaves limited freedom to annotate images with appropriate labels. Given non-overlapping vocabularies between two languages, we formulate cross-lingual image annotation as a zero-shot learning problem. For cross-lingual label matching, we adapt zero-shot learning by replacing the usual monolingual semantic embedding space with a bilingual alternative. To reduce both label ambiguity and redundancy, we propose a simple yet effective approach called label-enhanced zero-shot learning. Using three state-of-the-art deep visual models, i.e., ResNet-152, GoogleNet-Shuffle, and OpenImages, experiments on the test set of Flickr8k-CN demonstrate the viability of the proposed approach for cross-lingual image annotation.
Question Part Relevance and Editing for Cooperative and Context-Aware VQA (C2VQA)
Andeep S. Toor, H. Wechsler, M. Nappi
DOI: 10.1145/3095713.3095718

Visual Question Answering (VQA), a task that requires the ability to provide an answer to a question given an image, has recently become an important benchmark for computer vision. However, current VQA approaches are unable to adequately handle questions that are "irrelevant", such as asking about a cat in an image that contains no cat. To date, only one paper has examined the idea of question relevance in VQA, using a binary classification model to assign a relevance label to the entire question/image pair. Truly robust VQA models, however, must not only identify potentially irrelevant questions but also discover the source of irrelevance and seek to correct it. We therefore introduce two novel problems, question part relevance and question editing, and approaches for solving each. In question part relevance, our models go beyond binary question relevance by assigning a classification probability to the portion of the question that is irrelevant. The best question part relevance classifier is then used in question editing to rank possible corrections to the irrelevant portion of a given question. Two custom datasets are developed for these problems using the Visual Genome dataset as a source. Our best models show promising results on these novel tasks over baseline approaches and models adapted from whole-question relevance classification. This work contributes directly to the development of more context-aware and cooperative VQA models, dubbed C2VQA.
The 3D-Pitoti Dataset: A Dataset for high-resolution 3D Surface Segmentation
Georg Poier, Markus Seidl, M. Zeppelzauer, Christian Reinbacher, M. Schaich, G. Bellandi, A. Marretta, H. Bischof
DOI: 10.1145/3095713.3095719

The development of powerful 3D scanning hardware and reconstruction algorithms has strongly promoted the generation of 3D surface reconstructions in different domains. An area of special interest for such reconstructions is the cultural heritage domain, where surface reconstructions are generated to digitally preserve historical artifacts. While reconstruction quality is nowadays sufficient in many cases, the robust analysis (e.g., segmentation, matching, and classification) of reconstructed 3D data is still an open topic. In this paper, we target the automatic segmentation of high-resolution 3D surface reconstructions of petroglyphs. To foster research in this field, we introduce a fully annotated, large-scale benchmark dataset of 3D surfaces, including high-resolution meshes, depth maps, and point clouds, which we make publicly available. Additionally, we provide baseline results for a random forest as well as a convolutional neural network based approach. The results show the complementary strengths and weaknesses of both approaches and point out that the dataset represents an open challenge for future research.
{"title":"Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing","authors":"","doi":"10.1145/3095713","DOIUrl":"https://doi.org/10.1145/3095713","url":null,"abstract":"","PeriodicalId":310224,"journal":{"name":"Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126656551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}