In this paper, we propose an activity prediction method for molecule generation within a reinforcement learning framework. The method serves as a scoring module for the molecule generation process. By introducing information about known active molecules for a specific set of target conformations, it overcomes the limitation of traditional molecular optimization strategies that rely only on computable properties, and thereby improves the quality of the generated molecules. The prediction method uses fusion features that combine conventional computable molecular properties, such as atomic number, with the binding properties of the molecule to the target. Furthermore, this paper designs an ultra-large-scale parallel computing method for molecular docking, which greatly improves the performance of the molecular docking [1] scoring process and makes high-quality docking computation for predicting molecular activity feasible. The final experimental results show that a molecule generation model using the prediction method can produce nearly twenty percent active molecules, demonstrating that the proposed method effectively improves the performance of molecule generation.
{"title":"A Reinforcement Learning-Based Reward Mechanism for Molecule Generation that Introduces Activity Information","authors":"Hao Liu, Jinmeng Yan, Yuandong Zhou","doi":"10.1145/3469877.3497700","DOIUrl":"https://doi.org/10.1145/3469877.3497700","url":null,"abstract":"In this paper, we propose an activity prediction method for molecule generation based on the framework of reinforcement learning. The method is used as a scoring module for the molecule generation process. By introducing information about known active molecules for specific set of target conformations, it overcomes the traditional molecular optimization strategy where the method only uses computable properties. Eventually, our prediction method improves the quality of the generated molecules. The prediction method utilized fusion features that consist of traditional countable properties of molecules such as atomic number and the binding property of the molecule to the target. Furthermore, this paper designs a ultra large-scale molecular docking parallel computing method, which greatly improves the performance of the molecular docking [1] scoring process. The computing method makes the high-quality docking computing to predict molecular activity possible. The final experimental result shows that the molecule generation model using the prediction method can produce nearly twenty percent active molecules, which shows that the method proposed in this paper can effectively improve the performance of molecule generation.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115944114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increasingly advanced deepfake approaches have made the detection of deepfake videos very challenging. We observe that deepfake videos often exhibit appearance-level temporal inconsistencies in some facial components between frames, resulting in discriminable spatiotemporal latent patterns among semantic-level feature maps. Inspired by this finding, we propose a predictive representation learning approach termed Latent Pattern Sensing to capture these semantic change characteristics for deepfake video detection. The approach cascades a CNN-based encoder, a ConvGRU-based aggregator and a single-layer binary classifier. The encoder and aggregator are pre-trained in a self-supervised manner to form representative spatiotemporal context features. Finally, the classifier is trained to classify the context features, distinguishing fake videos from real ones. In this manner, the extracted features describe the latent patterns of videos across frames both spatially and temporally in a unified way, leading to an effective deepfake video detector. Extensive experiments demonstrate our approach's effectiveness, e.g., surpassing 10 state-of-the-art methods by at least 7.92% AUC on the challenging Celeb-DF(v2) benchmark.
{"title":"Latent Pattern Sensing: Deepfake Video Detection via Predictive Representation Learning","authors":"Shiming Ge, Fanzhao Lin, Chenyu Li, Daichi Zhang, Jiyong Tan, Weiping Wang, Dan Zeng","doi":"10.1145/3469877.3490586","DOIUrl":"https://doi.org/10.1145/3469877.3490586","url":null,"abstract":"Increasingly advanced deepfake approaches have made the detection of deepfake videos very challenging. We observe that the general deepfake videos often exhibit appearance-level temporal inconsistencies in some facial components between frames, resulting in discriminable spatiotemporal latent patterns among semantic-level feature maps. Inspired by this finding, we propose a predictive representative learning approach termed Latent Pattern Sensing to capture these semantic change characteristics for deepfake video detection. The approach cascades a CNN-based encoder, a ConvGRU-based aggregator and a single-layer binary classifier. The encoder and aggregator are pre-trained in a self-supervised manner to form the representative spatiotemporal context features. Finally, the classifier is trained to classify the context features, distinguishing fake videos from real ones. In this manner, the extracted features can simultaneously describe the latent patterns of videos across frames spatially and temporally in a unified way, leading to an effective deepfake video detector. Extensive experiments prove our approach’s effectiveness, e.g., surpassing 10 state-of-the-arts at least 7.92%@AUC on challenging Celeb-DF(v2) benchmark.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125289993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep hashing has shown great potential in large-scale visual similarity search due to its preferable storage and computation efficiency. Typically, deep hashing encodes visual features into compact binary codes by preserving representative semantic visual features. Works in this area mainly focus on building the relationship between the visual space and the objective hash space, while they seldom study the triadic cross-domain semantic knowledge transfer among the visual, semantic and hashing spaces, leading to a serious semantic ignorance problem during space transformation. In this paper, we propose a novel deep tripartite semantically interactive hashing framework, dubbed Semantically Cycle-consistent Hashing Networks (SCHN), for discriminative hash code learning. In particular, we construct a flexible semantic space and a transitive latent space which, in conjunction with the visual space, jointly deduce the privileged discriminative hash space. Specifically, the semantic space is conceived to strengthen the flexibility and completeness of categories in feature inference, while the transitive latent space is formulated to explore the shared semantic interactivity embedded in visual and semantic features. Our SCHN, for the first time, establishes the cyclic principle of deep semantic-preserving hashing through adaptive semantic parsing across different spaces in visual similarity search. In addition, the entire learning framework is jointly optimized in an end-to-end manner. Extensive experiments on diverse large-scale datasets demonstrate the superiority of our method over other state-of-the-art deep hashing algorithms.
{"title":"Towards Discriminative Visual Search via Semantically Cycle-consistent Hashing Networks","authors":"Zheng Zhang, Jianning Wang, Guangming Lu","doi":"10.1145/3469877.3490583","DOIUrl":"https://doi.org/10.1145/3469877.3490583","url":null,"abstract":"Deep hashing has shown great potentials in large-scale visual similarity search due to preferable storage and computation efficiency. Typically, deep hashing encodes visual features into compact binary codes by preserving representative semantic visual features. Works in this area mainly focus on building the relationship between the visual and objective hash space, while they seldom study the triadic cross-domain semantic knowledge transfer among visual, semantic and hashing spaces, leading to serious semantic ignorance problem during space transformation. In this paper, we propose a novel deep tripartite semantically interactive hashing framework, dubbed Semantically Cycle-consistent Hashing Networks (SCHN), for discriminative hash code learning. Particularly, we construct a flexible semantic space and a transitive latent space, in conjunction with the visual space, to jointly deduce the privileged discriminative hash space. Specifically, a semantic space is conceived to strengthen the flexibility and completeness of categories in feature inference. Moreover, a transitive latent space is formulated to explore the shared semantic interactivity embedded in visual and semantic features. Our SCHN, for the first time, establishes the cyclic principle of deep semantic-preserving hashing by adaptive semantic parsing across different spaces in visual similarity search. In addition, the entire learning framework is jointly optimized in an end-to-end manner. Extensive experiments performed on diverse large-scale datasets evidence the superiority of our method against other state-of-the-art deep hashing algorithms.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Object detection under low-light surveillance is a crucial problem on which relatively little effort has been made. In this paper, we propose a hybrid method, namely Dedark+Detection, that jointly uses enhancement and object detection to address this challenge. In this method, the low-light surveillance video is processed by the proposed de-dark method, converting it to an appearance under normal lighting conditions. This enhancement benefits the subsequent object detection stage. After that, an object detection network is trained on the enhanced dataset for practical applications under low-light surveillance. Experiments are performed on 18 low-light surveillance video test sequences, and the method shows superior performance compared to the state-of-the-art.
{"title":"Dedark+Detection: A Hybrid Scheme for Object Detection under Low-light Surveillance","authors":"Xiaolei Luo, S. Xiang, Yingfeng Wang, Qiong Liu, You Yang, Kejun Wu","doi":"10.1145/3469877.3497691","DOIUrl":"https://doi.org/10.1145/3469877.3497691","url":null,"abstract":"Object detection under low-light surveillance is a crucial problem that less efforts have been made on it. In this paper, we proposed a hybrid method that jointly use enhancement and object detection for the above challenge, namely Dedark+Detection. In this method, the low-light surveillance video is processed by the proposed de-dark method, and the video can thus be converted to appearance under normal lighting condition. This enhancement bring more benefits to the subsequent stage of object detection. After that, an object detection network is trained on the enhanced dataset for practical applications under low-light surveillance. Experiments are performed on 18 low-light surveillance video test sequences, and superior performance can be found when comparing to state-of-the-arts.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131183674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-shot learning (ZSL) aims to recognize images from unseen (novel) classes using training images from seen classes. The attributes of each class are exploited as auxiliary semantic information. Most recent ZSL approaches focus on learning visual-semantic embeddings to transfer knowledge from the seen classes to the unseen classes. However, few works study whether class-level auxiliary semantic information is extensive enough for the ZSL task. To tackle this problem, we propose a hierarchical coupled dictionary learning (HCDL) approach to hierarchically align the visual-semantic structures at both the class level and the image level. First, the class-level coupled dictionary is trained to establish a basic connection between the visual space and the semantic space. Then, image attributes are generated based on this connection. Finally, fine-grained information can be embedded by training the image-level coupled dictionary. Zero-shot recognition is performed in multiple spaces by searching for the nearest-neighbor class of the unseen image. Experiments on two widely used benchmark datasets show the effectiveness of the proposed approach.
{"title":"Zero-shot Recognition with Image Attributes Generation using Hierarchical Coupled Dictionary Learning","authors":"Shuang Li, Lichun Wang, Shaofan Wang, Dehui Kong, Baocai Yin","doi":"10.1145/3469877.3490613","DOIUrl":"https://doi.org/10.1145/3469877.3490613","url":null,"abstract":"Zero-shot learning (ZSL) aims to recognize images from unseen (novel) classes with the training images from seen classes. The attributes of each class is exploited as auxiliary semantic information. Recently most ZSL approaches focus on learning visual-semantic embeddings to transfer knowledge from the seen classes to the unseen classes. However, few works study whether the auxiliary semantic information in the class-level is extensive enough or not for the ZSL task. To tackle such problem, we propose a hierarchical coupled dictionary learning (HCDL) approach to hierarchically align the visual-semantic structures in both the class-level and the image-level. Firstly, the class-level coupled dictionary is trained to establish a basic connection between visual space and semantic space. Then, the image attributes are generated based on the basic connection. Finally, the fine-grained information can be embedded by training the image-level coupled dictionary. Zero-shot recognition is performed in multiple spaces by searching the nearest neighbor class of the unseen image. Experiments on two widely used benchmark datasets show the effectiveness of the proposed approach.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133254279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most existing methods focus on improving the clarity and semantic consistency of the image with respect to a given text, but pay little attention to controlling multiple aspects of the generated content, such as the position of the object in the generated image. In this paper, we introduce a novel position-based generative network (PBNet) that can generate fine-grained images with the object at a specified location. PBNet combines an iterative structure with a generative adversarial network (GAN). A location information embedding module (LIEM) is proposed to combine the location information extracted from the boundary-block image with the semantic information extracted from the text. In addition, a silhouette generation module (SGM) is proposed to train the generator to generate the object based on location information. Experimental results on the CUB dataset demonstrate that PBNet effectively controls the location of the object in the generated image.
{"title":"PBNet: Position-specific Text-to-image Generation by Boundary","authors":"Tian Tian, Li Liu, Huaxiang Zhang, Dongmei Liu","doi":"10.1145/3469877.3493594","DOIUrl":"https://doi.org/10.1145/3469877.3493594","url":null,"abstract":"Most existing methods focus on improving the clarity and semantic consistency of the image with a given text, but do not pay attention to the multiple control of generated image content, such as the position of the object in generated image. In this paper, we introduce a novel position-based generative network (PBNet) which can generate fine-grained images with the object at the specified location. PBNet combines iterative structure with generative adversarial network (GAN). A location information embedding module (LIEM) is proposed to combine the location information extracted from the boundary block image with the semantic information extracted from the text. In addition, a silhouette generation module (SGM) is proposed to train the generator to generate object based on location information. The experimental results on CUB dataset demonstrate that PBNet effectively controls the location of the object in the generated image.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131580782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph convolutional networks (GCNs) have been widely used for processing graph and network data. However, recent studies show that existing graph convolutional networks have issues when integrating node features and topology structure. To remedy this weakness, we propose a new GCN architecture. First, the proposed architecture introduces cross-stitch networks into GCN with improved cross-stitch units. The cross-stitch networks spread information between node features and topology structure, and obtain a consistent learned representation by integrating information from node features and topology structure at the same time; the proposed model can thus capture diverse information through multiple channels. Second, an attention mechanism is used to further extract the most relevant information between channel embeddings. Experiments on six benchmark datasets show that our method outperforms all comparison methods across different evaluation metrics.
{"title":"Adaptive Cross-stitch Graph Convolutional Networks","authors":"Zehui Hu, Zidong Su, Yangding Li, Junbo Ma","doi":"10.1145/3469877.3495643","DOIUrl":"https://doi.org/10.1145/3469877.3495643","url":null,"abstract":"Graph convolutional networks (GCN) have been widely used in processing graphs and networks data. However, some recent research experiments show that the existing graph convolutional networks have isseus when integrating node features and topology structure. In order to remedy the weakness, we propose a new GCN architecture. Firstly, the proposed architecture introduces the cross-stitch networks into GCN with improved cross-stitch units. Cross-stitch networks spread information/knowledge between node features and topology structure, and obtains consistent learned representation by integrating information of node features and topology structure at the same time. Therefore, the proposed model can capture various channel information in all images through multiple channels. Secondly, an attention mechanism is to further extract the most relevant information between channel embeddings. Experiments on six benchmark datasets shows that our method outperforms all comparison methods on different evaluation indicators.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123061015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite being the sixth most widely spoken language in the world, Bangla has barely received any attention in the domain of audio-visual news classification. In this work, we collect, annotate, and prepare a comprehensive news audio dataset in Bangla, comprising 5120 news clips with around 820 hours of total duration. We also conduct practical experiments to obtain a human baseline for the news audio classification task. We then implement one of the human approaches by performing news classification directly on the audio features using various state-of-the-art classifiers and several transfer learning models. To the best of our knowledge, this is the first work to develop a benchmark dataset for news audio classification in Bangla.
{"title":"BAND: A Benchmark Dataset forBangla News Audio Classification","authors":"Md. Rafi Ur Rashid, Mahim Mahbub, Muhammad Abdullah Adnan","doi":"10.1145/3469877.3490575","DOIUrl":"https://doi.org/10.1145/3469877.3490575","url":null,"abstract":"Despite being the sixth most widely spoken language in the world, Bangla has barely received any attention in the domain of audio-visual news classification. In this work, we collect, annotate, and prepare a comprehensive news audio dataset in Bangla, comprising 5120 news clips, with around 820 hours of total duration. We also conduct practical experiments to obtain a human baseline for the news audio classification task. Later, we implement one of the human approaches by performing news classification directly on the audio features using various state-of-the-art classifiers and a few transfer learning models. To the best of our knowledge, this is the very first work developing a benchmark dataset for news audio classification in Bangla.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124485321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The paradigm shift in news consumption towards online platforms has cultivated the growth of digital journalism. Contrary to traditional media, lower entry barriers and the ability for everyone to take part in content creation have dismantled centralized gatekeeping in digital journalism. This, in turn, has triggered the production of fake news. Current studies have made significant efforts towards multimodal fake news detection, with less emphasis on exploring the discordance between the different multimedia elements present in a news article. We hypothesize that fabrication of either modality leads to dissonance between the modalities, resulting in misrepresented, misinterpreted and misleading news. In this paper, we inspect the authenticity of news from online media outlets by exploiting the relationship (discordance) between textual and multiple visual cues. We develop an inter-modality discordance-based fake news detection framework to achieve this goal. The modality-specific discriminative features are learned using the cross-entropy loss and a modified version of the contrastive loss that explores the inter-modality discordance. To the best of our knowledge, this is the first work that leverages information from different components of a news article (i.e., headline, body, and multiple images) for multimodal fake news detection. Extensive experiments on real-world datasets show that our approach outperforms the state-of-the-art by an average F1-score of 6.3%.
{"title":"Inter-modality Discordance for Multimodal Fake News Detection","authors":"Shivangi Singhal, Mudit Dhawan, R. Shah, P. Kumaraguru","doi":"10.1145/3469877.3490614","DOIUrl":"https://doi.org/10.1145/3469877.3490614","url":null,"abstract":"The paradigm shift in the consumption of news via online platforms has cultivated the growth of digital journalism. Contrary to traditional media, lowering entry barriers and enabling everyone to be part of content creation have disabled the concept of centralized gatekeeping in digital journalism. This in turn has triggered the production of fake news. Current studies have made a significant effort towards multimodal fake news detection with less emphasis on exploring the discordance between the different multimedia present in a news article. We hypothesize that fabrication of either modality will lead to dissonance between the modalities, and resulting in misrepresented, misinterpreted and misleading news. In this paper, we inspect the authenticity of news coming from online media outlets by exploiting relationship (discordance) between the textual and multiple visual cues. We develop an inter-modality discordance based fake news detection framework to achieve the goal. The modal-specific discriminative features are learned, employing the cross-entropy loss and a modified version of contrastive loss that explores the inter-modality discordance. To the best of our knowledge, this is the first work that leverages information from different components of the news article (i.e., headline, body, and multiple images) for multimodal fake news detection. We conduct extensive experiments on the real-world datasets to show that our approach outperforms the state-of-the-art by an average F1-score of 6.3%.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124346528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised cross-domain object detection transfers a detection model trained on a source domain to a target domain whose data distribution differs from that of the source. Conventional domain adaptation detection protocols require source-domain data during adaptation. However, for reasons such as data security, privacy and storage, the source data cannot be accessed in many practical applications. In this paper, we focus on source-data-free domain adaptive object detection, which uses the pre-trained source model instead of the source data for cross-domain adaptation. Because the source data are unavailable, the domain distributions cannot be aligned directly. To address this, we propose the Source-style transferred Mean Teacher (SMT) for source-data-free object detection. The batch normalization layers in the pre-trained model contain the style information and data distribution of the unobserved source data. We therefore use the batch normalization information from the pre-trained source model to transfer target-domain features to source-like style features, making full use of the knowledge in the pre-trained source model. Meanwhile, we use the consistency regularization of the Mean Teacher to further distill knowledge from the source domain to the target domain. Furthermore, we found that adding perturbations associated with the target-domain distribution increases the model's robustness to domain-specific variation, thus helping the learned model generalize to the target domain. Experiments on multiple domain-adaptation object detection benchmarks verify that our method achieves state-of-the-art performance.
{"title":"Source-Style Transferred Mean Teacher for Source-data Free Object Detection","authors":"Dan Zhang, Mao Ye, Lin Xiong, Shuaifeng Li, Xue Li","doi":"10.1145/3469877.3490584","DOIUrl":"https://doi.org/10.1145/3469877.3490584","url":null,"abstract":"Unsupervised cross-domain object detection transfers a detection model trained on a source domain to the target domain that has a different data distribution from the source domain. Conventional domain adaptation detection protocols need source domain data during adaptation. However, due to some reasons such as data security, privacy and storage, we cannot access the source data in many practical applications. In this paper, we focus on source-data free domain adaptive object detection, which uses the pre-trained source model instead of the source data for cross-domain adaptation. Due to the lack of source data, we cannot directly align domain distribution between domains. To challenge this, we propose the Source style transferred Mean Teacher (SMT) for source-data free Object Detection. The batch normalization layers in the pre-trained model contain the style information and the data distribution of the non-observed source data. Thus we use the batch normalization information from the pre-trained source model to transfer the target domain feature to the source-like style feature to make full use of the knowledge from the pre-trained source model. Meanwhile, we use the consistent regularization of the Mean Teacher to further distill the knowledge from the source domain to the target domain. Furthermore, we found that by adding perturbations associated with the target domain distribution, the model can increase the robustness of domain-specific information, thus making the learned model generalized to the target domain. Experiments on multiple domain adaptation object detection benchmarks verify that our method is able to achieve state-of-the-art performance.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"211 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115937888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}