A Memory- and Accuracy-Aware Gaussian Parameter-Based Stereo Matching Using Confidence Measure
Pub Date: 2021-06-01 | Epub Date: 2021-05-11 | DOI: 10.1109/TPAMI.2019.2959613
Yeongmin Lee, Chong-Min Kyung
Accurate stereo matching requires a large amount of memory at high bandwidth, which restricts its use in resource-limited systems such as mobile devices. This problem is compounded by the recent trend toward applications requiring significantly higher pixel resolution and more disparity levels. To alleviate this, we present a memory-efficient and robust stereo matching algorithm. For cost aggregation, we employ a semiglobal parametric approach, which significantly reduces the memory bandwidth by representing the costs of all disparities as a Gaussian mixture model. All costs on multiple paths in an image are aggregated by updating the Gaussian parameters, and the aggregation is performed while scanning in the forward and backward directions. To reduce the memory needed for intermediate results during the forward scan, we suggest storing only the Gaussian parameters that contribute significantly to the final disparity selection. We also propose a method to enhance the overall procedure with a learning-based confidence measure: a random forest is trained on various features extracted from the cost and intensity profiles. Experimental results on the KITTI dataset show that the proposed method reduces the memory requirement to less than 3 percent of that of semiglobal matching (SGM) while providing depth maps whose robustness is comparable to those of state-of-the-art SGM-based algorithms.
{"title":"A Memory- and Accuracy-Aware Gaussian Parameter-Based Stereo Matching Using Confidence Measure.","authors":"Yeongmin Lee, Chong-Min Kyung","doi":"10.1109/TPAMI.2019.2959613","DOIUrl":"https://doi.org/10.1109/TPAMI.2019.2959613","url":null,"abstract":"<p><p>Accurate stereo matching requires a large amount of memory at a high bandwidth, which restricts its use in resource-limited systems such as mobile devices. This problem is compounded by the recent trend of applications requiring significantly high pixel resolution and disparity levels. To alleviate this, we present a memory-efficient and robust stereo matching algorithm. For cost aggregation, we employ the semiglobal parametric approach, which significantly reduces the memory bandwidth by representing the costs of all disparities as a Gaussian mixture model. All costs on multiple paths in an image are aggregated by updating the Gaussian parameters. The aggregation is performed during the scanning in the forward and backward directions. To reduce the amount of memory for the intermediate results during the forward scan, we suggest to store only the Gaussian parameters which contribute significantly to the final disparity selection. We also propose a method to enhance the overall procedure through a learning-based confidence measure. The random forest framework is used to train various features which are extracted from the cost and intensity profile. The experimental results on KITTI dataset show that the proposed method reduces the memory requirement to less than 3 percent of that of semiglobal matching (SGM) while providing a robust depth map compared to those of state-of-the-art SGM-based algorithms.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"43 6","pages":"1845-1858"},"PeriodicalIF":23.6,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TPAMI.2019.2959613","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37484221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Speed and High Dynamic Range Video with an Event Camera
Pub Date: 2021-06-01 | Epub Date: 2021-05-11 | DOI: 10.1109/TPAMI.2019.2963386
Henri Rebecq, Rene Ranftl, Vladlen Koltun, Davide Scaramuzza
Event cameras are novel sensors that report brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages over conventional cameras: high temporal resolution, high dynamic range, and no motion blur. While the stream of events encodes in principle the complete visual signal, the reconstruction of an intensity image from a stream of events is an ill-posed problem in practice. Existing reconstruction approaches are based on hand-crafted priors and strong assumptions about the imaging process as well as the statistics of natural images. In this work we propose to learn to reconstruct intensity images from event streams directly from data instead of relying on any hand-crafted priors. We propose a novel recurrent network to reconstruct videos from a stream of events and train it on a large amount of simulated event data. During training we use a perceptual loss to encourage reconstructions to follow natural image statistics. We further extend our approach to synthesize color images from color event streams. Our quantitative experiments show that our network surpasses state-of-the-art reconstruction methods by a large margin in terms of image quality, while comfortably running in real time. We show that the network is able to synthesize high-frame-rate videos of high-speed phenomena (e.g., a bullet hitting an object) and can provide high-dynamic-range reconstructions in challenging lighting conditions. As an additional contribution, we demonstrate the effectiveness of our reconstructions as an intermediate representation for event data. We show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as object classification and visual-inertial odometry, and that this strategy consistently outperforms algorithms that were specifically designed for event data. We release the reconstruction code, a pre-trained model, and the datasets to enable further research.
{"title":"High Speed and High Dynamic Range Video with an Event Camera.","authors":"Henri Rebecq, Rene Ranftl, Vladlen Koltun, Davide Scaramuzza","doi":"10.1109/TPAMI.2019.2963386","DOIUrl":"https://doi.org/10.1109/TPAMI.2019.2963386","url":null,"abstract":"<p><p>Event cameras are novel sensors that report brightness changes in the form of a stream of asynchronous \"events\" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high temporal resolution, high dynamic range, and no motion blur. While the stream of events encodes in principle the complete visual signal, the reconstruction of an intensity image from a stream of events is an ill-posed problem in practice. Existing reconstruction approaches are based on hand-crafted priors and strong assumptions about the imaging process as well as the statistics of natural images. In this work we propose to learn to reconstruct intensity images from event streams directly from data instead of relying on any hand-crafted priors. We propose a novel recurrent network to reconstruct videos from a stream of events, and train it on a large amount of simulated event data. During training we propose to use a perceptual loss to encourage reconstructions to follow natural image statistics. We further extend our approach to synthesize color images from color event streams. Our quantitative experiments show that our network surpasses state-of-the-art reconstruction methods by a large margin in terms of image quality ( ), while comfortably running in real-time. We show that the network is able to synthesize high framerate videos ( frames per second) of high-speed phenomena (e.g., a bullet hitting an object) and is able to provide high dynamic range reconstructions in challenging lighting conditions. As an additional contribution, we demonstrate the effectiveness of our reconstructions as an intermediate representation for event data. We show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as object classification and visual-inertial odometry and that this strategy consistently outperforms algorithms that were specifically designed for event data. We release the reconstruction code, a pre-trained model and the datasets to enable further research.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"43 6","pages":"1964-1980"},"PeriodicalIF":23.6,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TPAMI.2019.2963386","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37512254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geodesic Multi-Class SVM with Stiefel Manifold Embedding
Pub Date: 2021-03-30 | DOI: 10.1109/TPAMI.2021.3069498
Rui Zhang, Xuelong Li, Hongyuan Zhang, Ziheng Jiao
The geodesic manifold plays an essential role in characterizing the intrinsic geometry of data. However, existing SVM methods have largely neglected this manifold structure, so functional degeneration may occur when the training data are polluted; in the worst case, the entire SVM model can collapse under excessive training contamination. To address these issues, this paper devises a manifold SVM method based on a novel ξ-measure geodesic, whose primary design objective is to extract and preserve the data manifold structure in the presence of training noise. To further cope with heavily contaminated training data, we introduce Kullback-Leibler (KL) regularization with a steerable sparsity constraint. In this way, each loss weight is adaptively obtained by obeying the prior distribution and sparse activation during model training, yielding a robust fit. Moreover, the optimal scale for the Stiefel manifold can be learned automatically to improve the model's flexibility. Extensive experiments verify and validate the superiority of the proposed method.
{"title":"Geodesic Multi-Class SVM with Stiefel Manifold Embedding.","authors":"Rui Zhang, Xuelong Li, Hongyuan Zhang, Ziheng Jiao","doi":"10.1109/TPAMI.2021.3069498","DOIUrl":"10.1109/TPAMI.2021.3069498","url":null,"abstract":"<p><p>Manifold of geodesic plays an essential role in characterizing the intrinsic data geometry. However, the existing SVM methods have largely neglected the manifold structure. As such, functional degeneration may occur due to the potential polluted training. Even worse, the entire SVM model might collapse in the presence of excessive training contamination. To address these issues, this paper devises a manifold SVM method based on a novel ξ -measure geodesic, whose primary design objective is to extract and preserve the data manifold structure in the presence of training noises. To further cope with overly contaminated training data, we introduce Kullback-Leibler (KL) regularization with steerable sparsity constraint. In this way, each loss weight is adaptively obtained by obeying the prior distribution and sparse activation during model training for robust fitting. Moreover, the optimal scale for Stiefel manifold can be automatically learned to improve the model flexibility. Accordingly, extensive experiments verify and validate the superiority of the proposed method.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"PP ","pages":""},"PeriodicalIF":23.6,"publicationDate":"2021-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25531112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals
Pub Date: 2020-07-15 | DOI: 10.1109/TPAMI.2020.3009287
Umur Aybars Ciftci, Ilke Demir, Lijun Yin
The recent proliferation of fake portrait videos poses direct threats to society, law, and privacy [1]. Believing the fake video of a politician, distributing fake pornographic content of celebrities, and fabricating impersonated fake videos as evidence in court are just a few real-world consequences of deep fakes. We present a novel approach to detect synthetic content in portrait videos as a preventive solution to the emerging threat of deep fakes; in other words, we introduce a deep fake detector. We observe that detectors that blindly utilize deep learning are not effective in catching fake content, as generative models produce formidably realistic results. Our key assertion is that biological signals hidden in portrait videos can be used as an implicit descriptor of authenticity, because they are neither spatially nor temporally preserved in fake content. To prove and exploit this assertion, we first apply several signal transformations to the pairwise separation problem, achieving 99.39% accuracy. Second, we use these findings to formulate a generalized classifier for fake content by analyzing the proposed signal transformations and the corresponding feature sets. Third, we generate novel signal maps and employ a CNN to improve our traditional classifier for detecting synthetic content. Lastly, we release an "in the wild" dataset of fake portrait videos that we collected as part of our evaluation process. We evaluate FakeCatcher on several datasets, achieving accuracies of 96%, 94.65%, 91.50%, and 91.07% on Face Forensics [2], Face Forensics++ [3], CelebDF [4], and our new Deep Fakes Dataset, respectively. In addition, our approach produces a significantly higher detection rate than the baselines and does not depend on the source, generator, or properties of the fake content. We also analyze signals from various facial regions, under image distortions, with varying segment durations, from different generators, against unseen datasets, and under several dimensionality reduction techniques.
{"title":"FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals.","authors":"Umur Aybars Ciftci, Ilke Demir, Lijun Yin","doi":"10.1109/TPAMI.2020.3009287","DOIUrl":"10.1109/TPAMI.2020.3009287","url":null,"abstract":"<p><p>The recent proliferation of fake portrait videos poses direct threats on society, law, and privacy [1]. Believing the fake video of a politician, distributing fake pornographic content of celebrities, fabricating impersonated fake videos as evidence in courts are just a few real world consequences of deep fakes. We present a novel approach to detect synthetic content in portrait videos, as a preventive solution for the emerging threat of deep fakes. In other words, we introduce a deep fake detector. We observe that detectors blindly utilizing deep learning are not effective in catching fake content, as generative models produce formidably realistic results. Our key assertion follows that biological signals hidden in portrait videos can be used as an implicit descriptor of authenticity, because they are neither spatially nor temporally preserved in fake content. To prove and exploit this assertion, we first engage several signal transformations for the pairwise separation problem, achieving 99.39% accuracy. Second, we utilize those findings to formulate a generalized classifier for fake content, by analyzing proposed signal transformations and corresponding feature sets. Third, we generate novel signal maps and employ a CNN to improve our traditional classifier for detecting synthetic content. Lastly, we release an \"in the wild\" dataset of fake portrait videos that we collected as a part of our evaluation process. We evaluate FakeCatcher on several datasets, resulting with 96%, 94.65%, 91.50%, and 91.07% accuracies, on Face Forensics [2], Face Forensics++ [3], CelebDF [4], and on our new Deep Fakes Dataset respectively. In addition, our approach produces a significantly superior detection rate against baselines, and does not depend on the source, generator, or properties of the fake content. We also analyze signals from various facial regions, under image distortions, with varying segment durations, from different generators, against unseen datasets, and under several dimensionality reduction techniques.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"PP ","pages":""},"PeriodicalIF":23.6,"publicationDate":"2020-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38228143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential 3D Facial Recognition: Adding 3D to Your State-of-the-Art 2D Method
Pub Date: 2020-07-01 | Epub Date: 2020-04-13 | DOI: 10.1109/TPAMI.2020.2986951
J Matias Di Martino, Fernando Suzacq, Mauricio Delbracio, Qiang Qiu, Guillermo Sapiro
Active illumination is a prominent complement that can make 2D face recognition more robust, e.g., to spoofing attacks and low-light conditions. In the present work we show that it is possible to adopt active illumination to enhance state-of-the-art 2D face recognition approaches with 3D features, while bypassing the complicated task of 3D reconstruction. The key idea is to project a high-spatial-frequency pattern onto the test face, which allows us to simultaneously recover real 3D information as well as a standard 2D facial image. State-of-the-art 2D face recognition solutions can therefore be applied transparently, while complementary 3D facial features are extracted from the high-frequency component of the input image. Experimental results on the ND-2006 dataset show that the proposed ideas can significantly boost face recognition performance and dramatically improve robustness to spoofing attacks.
{"title":"Differential 3D Facial Recognition: Adding 3D to Your State-of-the-Art 2D Method.","authors":"J Matias Di Martino, Fernando Suzacq, Mauricio Delbracio, Qiang Qiu, Guillermo Sapiro","doi":"10.1109/TPAMI.2020.2986951","DOIUrl":"10.1109/TPAMI.2020.2986951","url":null,"abstract":"<p><p>Active illumination is a prominent complement to enhance 2D face recognition and make it more robust, e.g., to spoofing attacks and low-light conditions. In the present work we show that it is possible to adopt active illumination to enhance state-of-the-art 2D face recognition approaches with 3D features, while bypassing the complicated task of 3D reconstruction. The key idea is to project over the test face a high spatial frequency pattern, which allows us to simultaneously recover real 3D information plus a standard 2D facial image. Therefore, state-of-the-art 2D face recognition solution can be transparently applied, while from the high frequency component of the input image, complementary 3D facial features are extracted. Experimental results on ND-2006 dataset show that the proposed ideas can significantly boost face recognition performance and dramatically improve the robustness to spoofing attacks.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"42 7","pages":"1582-1593"},"PeriodicalIF":23.6,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7892197/pdf/nihms-1668137.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9150865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent Temporal Aggregation Framework for Deep Video Inpainting
Pub Date: 2020-05-01 | Epub Date: 2019-12-11 | DOI: 10.1109/TPAMI.2019.2958083
Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon
Video inpainting aims to fill in spatio-temporal holes in videos with plausible content. Despite tremendous progress in deep learning-based inpainting of a single image, it is still challenging to extend these methods to the video domain because of the additional time dimension. In this paper, we propose a recurrent temporal aggregation framework for fast deep video inpainting. In particular, we construct an encoder-decoder model in which the encoder takes multiple reference frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply recurrent feedback in an auto-regressive manner to enforce temporal consistency in the video results. We propose two architectural designs based on this framework. Our first model is a blind video decaptioning network (BVDNet) designed to automatically remove and inpaint text overlays in videos without any mask information; BVDNet won first place in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. Second, we propose a network for more general video inpainting (VINet) to handle larger and more arbitrary holes. Video results demonstrate the advantage of our framework over state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/mcahny/Deep-Video-Inpainting and https://github.com/shwoo93/video_decaptioning.
{"title":"Recurrent Temporal Aggregation Framework for Deep Video Inpainting.","authors":"Dahun Kim, Sanghyun Woo, Joon-Young Lee, In So Kweon","doi":"10.1109/TPAMI.2019.2958083","DOIUrl":"https://doi.org/10.1109/TPAMI.2019.2958083","url":null,"abstract":"<p><p>Video inpainting aims to fill in spatio-temporal holes in videos with plausible content. Despite tremendous progress on deep learning-based inpainting of a single image, it is still challenging to extend these methods to video domain due to the additional time dimension. In this paper, we propose a recurrent temporal aggregation framework for fast deep video inpainting. In particular, we construct an encoder-decoder model, where the encoder takes multiple reference frames which can provide visible pixels revealed from the scene dynamics. These hints are aggregated and fed into the decoder. We apply a recurrent feedback in an auto-regressive manner to enforce temporal consistency in the video results. We propose two architectural designs based on this framework. Our first model is a blind video decaptioning network (BVDNet) that is designed to automatically remove and inpaint text overlays in videos without any mask information. Our BVDNet wins the first place in the ECCV Chalearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. Second, we propose a network for more general video inpainting (VINet) to deal with more arbitrary and larger holes. Video results demonstrate the advantage of our framework compared to state-of-the-art methods both qualitatively and quantitatively. The codes are available at https://github.com/mcahny/Deep-Video-Inpainting, and https://github.com/shwoo93/video_decaptioning.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"42 5","pages":"1038-1052"},"PeriodicalIF":23.6,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TPAMI.2019.2958083","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37452544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structured Label Inference for Visual Understanding
Pub Date: 2020-05-01 | Epub Date: 2019-01-16 | DOI: 10.1109/TPAMI.2019.2893215
Nelson Nauata, Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, Greg Mori
Visual data such as images and videos contain a rich source of structured semantic labels as well as a wide range of interacting components. Visual content can be assigned fine-grained labels describing major components, coarse-grained labels depicting high-level abstractions, or a set of labels revealing attributes. Such categorization over different, interacting layers of labels evinces the potential for a graph-based encoding of label information. In this paper, we exploit this rich structure to perform graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos. We consider the use of the Bidirectional Inference Neural Network (BINN) and the Structured Inference Neural Network (SINN) for performing graph-based inference in label space, and propose a Long Short-Term Memory (LSTM) based extension for exploiting activity progression in untrimmed videos. The methods were evaluated on (i) the Animals with Attributes (AwA), Scene Understanding (SUN), and NUS-WIDE datasets for multi-label image classification, (ii) the first two releases of the YouTube-8M large-scale dataset for multi-label video classification, and (iii) the THUMOS'14 and MultiTHUMOS video datasets for action detection. Our results demonstrate the effectiveness of structured label inference in these challenging tasks, achieving significant improvements over baselines.
{"title":"Structured Label Inference for Visual Understanding.","authors":"Nelson Nauata, Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, Greg Mori","doi":"10.1109/TPAMI.2019.2893215","DOIUrl":"https://doi.org/10.1109/TPAMI.2019.2893215","url":null,"abstract":"<p><p>Visual data such as images and videos contain a rich source of structured semantic labels as well as a wide range of interacting components. Visual content could be assigned with fine-grained labels describing major components, coarse-grained labels depicting high level abstractions, or a set of labels revealing attributes. Such categorization over different, interacting layers of labels evinces the potential for a graph-based encoding of label information. In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos. We consider the use of the Bidirectional Inference Neural Network (BINN) and Structured Inference Neural Network (SINN) for performing graph-based inference in label space and propose a Long Short-Term Memory (LSTM) based extension for exploiting activity progression on untrimmed videos. The methods were evaluated on (i) the Animal with Attributes (AwA), Scene Understanding (SUN) and NUS-WIDE datasets for multi-label image classification, (ii) the first two releases of the YouTube-8M large scale dataset for multi-label video classification, and (iii) the THUMOS'14 and MultiTHUMOS video datasets for action detection. Our results demonstrate the effectiveness of structured label inference in these challenging tasks, achieving significant improvements against baselines.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"42 5","pages":"1257-1271"},"PeriodicalIF":23.6,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TPAMI.2019.2893215","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36875059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shared Multi-View Data Representation for Multi-Domain Event Detection
Pub Date: 2020-05-01 | Epub Date: 2019-01-18 | DOI: 10.1109/TPAMI.2019.2893953
Zhenguo Yang, Qing Li, Wenyin Liu, Jianming Lv
Internet platforms provide new ways for people to share experiences, generating massive amounts of data related to various real-world concepts. In this paper, we present an event detection framework to discover real-world events from multiple data domains, including online news media and social media. As multi-domain data possess multiple heterogeneous data views, initial dictionaries consisting of labeled data samples are exploited to align the multi-view data. Furthermore, a shared multi-view data representation (SMDR) model is devised, which learns the underlying and intrinsic structures shared among the data views by considering the structures underlying the data, data variations, and the informativeness of the dictionaries. SMDR incorporates various constraints in the objective function, including shared representation, low-rank, local invariance, reconstruction error, and dictionary independence constraints. Given the data representations obtained by SMDR, class-wise residual models are designed to discover the events underlying the data based on the reconstruction residuals. Extensive experiments on two real-world event detection datasets, i.e., the Multi-domain and Multi-modality Event Detection dataset and the MediaEval Social Event Detection 2014 dataset, indicate the effectiveness of the proposed approaches.
{"title":"Shared Multi-View Data Representation for Multi-Domain Event Detection.","authors":"Zhenguo Yang, Qing Li, Wenyin Liu, Jianming Lv","doi":"10.1109/TPAMI.2019.2893953","DOIUrl":"https://doi.org/10.1109/TPAMI.2019.2893953","url":null,"abstract":"<p><p>Internet platforms provide new ways for people to share experiences, generating massive amounts of data related to various real-world concepts. In this paper, we present an event detection framework to discover real-world events from multiple data domains, including online news media and social media. As multi-domain data possess multiple data views that are heterogeneous, initial dictionaries consisting of labeled data samples are exploited to align the multi-view data. Furthermore, a shared multi-view data representation (SMDR) model is devised, which learns underlying and intrinsic structures shared among the data views by considering the structures underlying the data, data variations, and informativeness of dictionaries. SMDR incorpvarious constraints in the objective function, including shared representation, low-rank, local invariance, reconstruction error, and dictionary independence constraints. Given the data representations achieved by SMDR, class-wise residual models are designed to discover the events underlying the data based on the reconstruction residuals. Extensive experiments conducted on two real-world event detection datasets, i.e., Multi-domain and Multi-modality Event Detection dataset, and MediaEval Social Event Detection 2014 dataset, indicating the effectiveness of the proposed approaches.</p>","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"42 5","pages":"1243-1256"},"PeriodicalIF":23.6,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TPAMI.2019.2893953","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36928018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}