"RobustiQ: A Robust ANN Search Method for Billion-scale Similarity Search on GPUs." Wei Chen, Jincai Chen, F. Zou, Yuan-Fang Li, Ping Lu, Wei Zhao. DOI: 10.1145/3323873.3325018
GPU-based methods represent the state of the art in approximate nearest neighbor (ANN) search, as they are scalable (billion-scale), accurate (high recall), and efficient (sub-millisecond query speed). Faiss, the representative GPU-based ANN system, achieves considerably faster query speed than the representative CPU-based systems. The query accuracy of Faiss depends critically on the number of indexing regions, which in turn depends on the amount of available memory. At the same time, query speed deteriorates dramatically as the number of partition regions increases. Faiss thus lacks robustness: fine-grained partitioning of the dataset is achieved at the expense of search speed, and vice versa. In this paper, we introduce a new GPU-based ANN search method, Robust Quantization (RobustiQ), that addresses the robustness limitations of existing GPU-based methods in a holistic way. We design a novel hierarchical indexing structure using vector and bilayer line quantization. This indexing structure, together with our indexing and encoding methods, allows RobustiQ to avoid maintaining a large lookup table, reducing both memory consumption and query complexity. Our extensive evaluation on two public billion-scale benchmark datasets, SIFT1B and DEEP1B, shows that RobustiQ consistently obtains a 2-3× speedup over Faiss while achieving better query accuracy for different codebook sizes. Compared to the best CPU-based ANN systems, RobustiQ achieves even more pronounced average speedups of 51.8× and 11×, respectively.
"qwLSH." Omid Jafari, John Ossorgin, P. Nagarkar. DOI: 10.1145/3323873.3325048
Similarity search queries in high-dimensional spaces are an important type of query in many domains, such as image processing and machine learning. Since exact similarity search indexing techniques suffer from the well-known curse of dimensionality in high-dimensional spaces, approximate search techniques are often used instead. Locality Sensitive Hashing (LSH) has been shown to be an effective approximate search method for solving similarity search queries in high-dimensional spaces. Often, queries in real-world settings arrive as part of a query workload. LSH and its variants are designed to solve individual queries effectively, but they suffer from one major drawback when executing query workloads: their index structures are not designed with the data characteristics that matter for effective cache utilization in mind. In this paper, we present qwLSH, an index structure for efficiently processing similarity search query workloads in high-dimensional spaces that intelligently divides a given cache during the processing of a query workload by using novel cost models. Experimental results show that, given a query workload, qwLSH performs faster than existing techniques due to its unique cost models and strategies for reducing cache misses. We evaluate the proposed design and cost models of qwLSH on real datasets against state-of-the-art LSH-based techniques.
"Multimodal Multimedia Retrieval with vitrivr." Ralph Gasser, Luca Rossetto, H. Schuldt. DOI: 10.1145/3323873.3326921
The steady growth of multimedia collections - both in terms of size and heterogeneity - necessitates systems that are able to conjointly deal with several types of media as well as large volumes of data. This is especially true when it comes to satisfying a particular information need, i.e., retrieving a particular object of interest from a large collection. Nevertheless, existing multimedia management and retrieval systems are mostly organized in silos and treat different media types separately. Hence, they are limited when it comes to crossing these silos for accessing objects. In this paper, we present vitrivr, a general-purpose content-based multimedia retrieval stack. In addition to the keyword search provided by most media management systems, vitrivr also exploits the object's content in order to facilitate different types of similarity search. This can be done within and, most importantly, across different media types, giving rise to new, interesting use cases. To the best of our knowledge, the full vitrivr stack is unique in that it seamlessly integrates support for four different types of media, namely images, audio, videos, and 3D models.
"Adversary Guided Asymmetric Hashing for Cross-Modal Retrieval." Wen Gu, Xiaoyan Gu, Jingzi Gu, B. Li, Zhi Xiong, Weiping Wang. DOI: 10.1145/3323873.3325045
Cross-modal hashing has attracted considerable attention for large-scale multimodal retrieval tasks, and many hashing methods have been proposed for cross-modal retrieval. However, these methods pay insufficient attention to the feature learning process and cannot fully preserve the ranking correlation of item pairs or the multi-label semantics of each item, so the quality of the resulting binary codes may be degraded. To tackle these problems, we propose a novel deep cross-modal hashing method called Adversary Guided Asymmetric Hashing (AGAH). Specifically, it employs an adversarial-learning-guided multi-label attention module to enhance the feature learning part, which learns discriminative feature representations and preserves cross-modal invariance. Furthermore, in order to generate hash codes that fully preserve the multi-label semantics of all items, we propose an asymmetric hashing method that utilizes a multi-label binary code map to equip the hash codes with multi-label semantic information. In addition, to ensure that all similar item pairs rank higher than dissimilar ones, we adopt a new triplet-margin constraint and a cosine quantization technique for Hamming-space similarity preservation. Extensive empirical studies show that AGAH outperforms several state-of-the-art methods for cross-modal retrieval.
"Weakly Supervised Image Retrieval via Coarse-scale Feature Fusion and Multi-level Attention Blocks." Xinyao Nie, Hong Lu, Zijian Wang, Jingyuan Liu, Zehua Guo. DOI: 10.1145/3323873.3325017
In this paper, we propose an end-to-end Attention-Block network for image retrieval (ABIR), which greatly increases retrieval accuracy without requiring human annotations such as bounding boxes. Specifically, our network uses coarse-scale feature fusion, which generates attentive local features by combining information from different intermediate layers. Detailed feature information is extracted through two attention blocks. Extensive experiments show that our method outperforms the state of the art by a significant margin on four public image retrieval datasets.
"A Hierarchical Attentive Deep Neural Network Model for Semantic Music Annotation Integrating Multiple Music Representations." Qianqian Wang, Feng Su, Yuyang Wang. DOI: 10.1145/3323873.3325031
Automatically assigning a group of appropriate semantic tags to a music piece provides an effective way for people to efficiently utilize the massive and ever-increasing volume of on-line and off-line music data. In this paper, we propose a novel content-based automatic music annotation model that hierarchically combines attentive convolutional networks and recurrent networks for music representation learning, structure modelling, and tag prediction. The model first exploits two separate attentive convolutional networks composed of multiple gated linear units (GLUs) to learn effective representations from both the 1-D raw waveform signal and the 2-D Mel-spectrogram of the music, which captures informative features for the annotation task better than any single representation channel. The model then exploits bidirectional Long Short-Term Memory (LSTM) networks to depict the time-varying structures embedded in the description sequences of the music, and further introduces a dual-state LSTM network to encode temporal correlations between the two representation channels, which effectively enriches the descriptions of the music. Finally, the model adaptively aggregates the music descriptions generated at every time step with a self-attentive multi-weighting mechanism for tag prediction. The proposed model achieves state-of-the-art results on the public MagnaTagATune dataset, demonstrating its effectiveness for music annotation.
{"title":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","authors":"","doi":"10.1145/3323873","DOIUrl":"https://doi.org/10.1145/3323873","url":null,"abstract":"","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"673 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122972016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Similarity Search in 3D Human Motion Data." J. Sedmidubský, P. Zezula. DOI: 10.1145/3323873.3326589
Motion capture technologies can digitize human movements into a discrete sequence of 3D skeletons. Such spatio-temporal data have great application potential in many fields, ranging from computer animation through security and sports to medicine, but their computerized processing is a difficult problem. The objective of this tutorial is to explain fundamental principles and technologies designed for searching, subsequence matching, classification, and action detection in 3D human motion data. These operations inherently require the concept of similarity to determine the degree of accordance between pairs of 3D skeleton sequences. Such similarity can be modeled using the generic metric-space approach by extracting effective deep features and comparing them with efficient distance functions. The metric-space approach also enables applying traditional index structures to efficiently access large datasets of skeleton sequences. We demonstrate the functionality of selected motion-processing operations with interactive web applications.
"Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks." Po-Yao (Bernie) Huang, Vaibhav, Xiaojun Chang, Alexander Hauptmann. DOI: 10.1145/3323873.3325043
Although significant progress has been made on cross-modal retrieval models in recent years, few studies have explored what those models truly learn and what makes one model superior to another. Starting by training two state-of-the-art text-to-image retrieval models with adversarial text inputs, we investigate and quantify the importance of syntactic structure and lexical information in learning the joint visual-semantic embedding space for cross-modal retrieval. The results show that the retrieval power mainly comes from localizing and connecting the visual objects and their cross-modal counterparts, the textual phrases. Inspired by this observation, we propose a novel model that employs object-oriented encoders along with inter- and intra-modal attention networks to improve inter-modal dependencies for cross-modal retrieval. In addition, we develop a new multimodal structure-preserving objective that additionally emphasizes intra-modal hard negative examples to promote intra-modal discrepancies. Extensive experiments show that the proposed approach outperforms the previous best method by a large margin (16.4% and 6.7% relative improvement in Recall@1 for text-to-image retrieval on the Flickr30K and MS-COCO datasets, respectively).
"EAGER." J. He, Xiaobing Liu, Shiliang Zhang. DOI: 10.1145/3323873.3326925
Image understanding is a fundamental task for many multimedia and computer vision applications, such as self-driving, multimedia retrieval, and augmented reality. In this paper, we demonstrate that edge detection can aid image understanding tasks such as semantic segmentation, optical flow estimation, and object proposal generation. Based on our recent research on edge detection, we develop a robust and efficient Edge-Aided imaGe undERstanding system named EAGER. EAGER is built on a compact and efficient edge detection module constructed with a bi-directional cascade network, multi-scale feature enhancement, and layer-specific training supervision. Based on the detected edges, EAGER achieves accurate semantic segmentation, optical flow estimation, and object bounding-box proposal generation for user-uploaded images and videos.