Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang
Recent work on visual question answering demonstrates that grid features can work as well as region features on vision-language tasks. Meanwhile, transformer-based models and their variants have shown remarkable performance on image captioning. However, two issues remain unexplored: the loss of object-contextual information caused by the single-granularity nature of grid features on the encoder side, and the loss of future contextual information due to the left-to-right decoding paradigm of the transformer decoder. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module, in which the keys/values are expanded with adjacent rectangular regions, each containing two or more aggregated grid features; this gives grid features varying granularity and stores adequate contextual information for objects of different scales. (ii) On the decoder side, we incorporate a dual-way decoding strategy, in which left-to-right and right-to-left decoding are conducted simultaneously and interactively, utilizing both past and future contextual information when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.
{"title":"Improving Image Captioning via Enhancing Dual-Side Context Awareness","authors":"Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang","doi":"10.1145/3512527.3531379","DOIUrl":"https://doi.org/10.1145/3512527.3531379","url":null,"abstract":"Recent work on visual question answering demonstrate that grid features can work as well as region feature on vision language tasks. In the meantime, transformer-based model and its variants have shown remarkable performance on image captioning. However, the object-contextual information missing caused by the single granularity nature of grid feature on the encoder side, as well as the future contextual information missing due to the left2right decoding paradigm of transformer decoder, remains unexplored. In this work, we tackle these two problems by enhancing contextual information at dual-side:(i) at encoder side, we propose Context-Aware Self-Attention module, in which the key/value is expanded with adjacent rectangle region where each region contains two or more aggregated grid features; this enables grid feature with varying granularity, storing adequate contextual information for object with different scale. (ii) at decoder side, we incorporate a dual-way decoding strategy, in which left2right and right2left decoding are conducted simultaneously and interactively. It utilizes both past and future contextual information when generates current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer(CATNet) achieves a new state-of-the-art on MSCOCO benchmark.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129099885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cell segmentation and counting are time-consuming yet important experimental steps in traditional biomedical research. Many current counting methods require exact cell locations, but few cell datasets provide detailed object coordinates; most existing datasets offer only the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divide the cell counting task into cell number prediction and cell segmentation. This paper proposes MFGAN, a lightweight, fast multi-task multi-scale feature-fusion model based on generative adversarial networks. To coordinate the learning of the two tasks, we propose a Combined Hybrid Loss function (CH Loss) and train the network as a conditional GAN. We propose a Lightweight Fast Multitask Generator (LFMG), which reduces the number of parameters by 20% compared with U-Net while achieving better cell segmentation performance, and we use multi-scale feature fusion to improve the quality of the reconstructed segmentation images. In addition, we propose a Structure Fusion Discrimination (SFD) module to refine the accuracy of feature details. Our method achieves non-point-based counting that no longer requires annotating the exact position of each cell during training, and it obtains excellent results on both cell counting and cell segmentation.
{"title":"MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN","authors":"Lijia Deng, Yu-dong Zhang","doi":"10.1145/3512527.3531410","DOIUrl":"https://doi.org/10.1145/3512527.3531410","url":null,"abstract":"Cell segmentation and counting is a time-consuming task and an important experimental step in traditional biomedical research. Many current counting methods require exact cell locations. However, there are few such cell datasets with detailed object coordinates. Most existing cell datasets only have the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divided the cell counting task into cell number prediction and cell segmentation respectively. This paper proposed a lightweight fast multi-task multi-scale feature fusion model based on generative adversarial networks (MFGAN). To coordinate the learning of these two tasks, we proposed a Combined Hybrid Loss function (CH Loss) and used conditional GAN to train our network. We proposed a Lightweight Fast Multitask Generator (LFMG) which reduced the number of parameters by 20% compared with U-Net but got better performance on cell segmentation. We used multi-scale feature fusion technology to improve the quality of reconstructed segmentation images. In addition, we also proposed a Structure Fusion Discrimination (SFD) to refine the accuracy of the details of the features. Our method achieved non-Point-based counting that no longer needs to annotate the exact position of each cell in the image during the training and successfully achieved excellent results on cell counting and cell segmentation.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"34 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114273828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it into a high-resolution one. These methods typically refine all granularity features output from the previous stage indiscriminately. However, the ability to express different granularity features is not consistent across stages, and it is difficult to express precise semantics by further refining poor-quality features generated in the previous stage. Because current methods cannot refine different granularity features independently, it is challenging to clearly express all factors of semantics in the generated image, and some features even become worse. To address this issue, we propose Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN), which generates photo-realistic images by explicitly disentangling and individually modeling the factors of semantics in the image. HDR-GAN introduces a novel multi-granularity feature disentangled encoder that represents image information comprehensively by explicitly disentangling multi-granularity features, including pose, shape, and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) module containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (i.e., CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.
{"title":"Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis","authors":"Pei Dong, L. Wu, Lei Meng, Xiangxu Meng","doi":"10.1145/3512527.3531389","DOIUrl":"https://doi.org/10.1145/3512527.3531389","url":null,"abstract":"In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it to a high-resolution one. These methods typically indiscriminately refine all granularity features output from the previous stage. However, the ability to express different granularity features in each stage is not consistent, and it is difficult to express precise semantics by further refining the features with poor quality generated in the previous stage. Current methods cannot refine different granularity features independently, resulting in that it is challenging to clearly express all factors of semantics in generated image, and some features even become worse. To address this issue, we propose a Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN) to generate photo-realistic images by explicitly disentangling and individually modeling the factors of semantics in the image. HDR-GAN introduces a novel component called multi-granularity feature disentangled encoder to represent image information comprehensively through explicitly disentangling multi-granularity features including pose, shape and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (i.e., CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114681329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rise of human-machine interaction, it has become necessary for machines to better understand humans in order to respond appropriately. Hence, to improve communication and interaction, it would be ideal for machines to automatically detect human emotions. Speech Emotion Recognition (SER) has been the focus of many studies in the past few years, but accuracy remains limited and must be improved. In our work, we propose a new loss function that aims to encode utterances instead of classifying them directly, as the majority of existing models do. The encoding is learned so that utterances with the same labels have similar encodings. The encoded utterances were tested on two datasets: we achieved 88.19% accuracy on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset and 91.66% accuracy on the RML (Ryerson Multimedia Research Lab) dataset.
{"title":"MuLER: Multiplet-Loss for Emotion Recognition","authors":"Anwer Slimi, M. Zrigui, H. Nicolas","doi":"10.1145/3512527.3531406","DOIUrl":"https://doi.org/10.1145/3512527.3531406","url":null,"abstract":"With the rise of human-machine interactions, it has become necessary for machines to better understand humans in order to respond appropriately. Hence, in order to increase communication and interaction, it would be ideal for machines to automatically detect human emotions. Speech Emotion Recognition (SER) has been a focus of a lot of studies in the past few years. However, they can be considered poor in accuracy and must be improved. In our work, we propose a new loss function that aims to encode speeches instead of classifying them directly as the majority of the existing models do. The encoding will be done in a way that utterances with the same labels would have similar encodings. The encoded speeches were tested on two datasets and we managed to get 88.19% accuracy with the RAVDESS (Ryerson Audiovisual Database of Emotional Speech and Song) dataset and 91.66% accuracy with the RML (Ryerson Multimedia Research Lab) dataset.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131969991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous information networks (HINs) are widely used in recommender system research due to their ability to model complex auxiliary information beyond historical interactions and thus alleviate the data sparsity problem. Existing HIN-based recommendation studies have achieved great success by performing graph convolution operations between pairs of nodes on predefined metapath-induced graphs, but they have the following major limitations. First, existing heterogeneous network construction strategies tend to exploit item attributes while failing to effectively model user relations. In addition, previous HIN-based recommendation models mainly convert the heterogeneous graph into homogeneous graphs by defining metapaths, ignoring the complicated relation dependencies along the metapath. To tackle these limitations, we propose a novel recommendation model with a two-way metapath encoder for top-N recommendation, which models metapath similarity and sequential relation dependency in the HIN to learn node representations. Specifically, our model first learns initial node representations through a pre-training module, and then identifies potential friends and item relations based on their similarity to construct a unified HIN. We then develop a two-way encoder module, with a similarity encoder and an instance encoder, to capture the similarity-based collaborative signals and relational dependencies on different metapaths. Finally, the representations on different metapaths are aggregated through an attention fusion layer to yield rich representations. Extensive experiments on three real datasets demonstrate the effectiveness of our method.
{"title":"An Effective Two-way Metapath Encoder over Heterogeneous Information Network for Recommendation","authors":"Yanbin Jiang, Huifang Ma, Xiaohui Zhang, Zhixin Li, Liang Chang","doi":"10.1145/3512527.3531402","DOIUrl":"https://doi.org/10.1145/3512527.3531402","url":null,"abstract":"Heterogeneous information networks (HINs) are widely used in recommender system research due to their ability to model complex auxiliary information beyond historical interactions to alleviate data sparsity problem. Existing HIN-based recommendation studies have achieved great success via performing graph convolution operators between pairs of nodes on predefined metapath induced graphs, but they have the following major limitations. First, existing heterogeneous network construction strategies tend to exploit item attributes while failing to effectively model user relations. In addition, previous HIN-based recommendation models mainly convert heterogeneous graph into homogeneous graphs by defining metapaths ignoring the complicated relation dependency involved on the metapath. To tackle these limitations, we propose a novel recommendation model with two-way metapath encoder for top-N recommendation, which models metapath similarity and sequence relation dependency in HIN to learn node representations. Specifically, our model first learns the initial node representation through a pre-training module, and then identifies potential friends and item relations based on their similarity to construct a unified HIN. We then develop the two-way encoder module with similarity encoder and instance encoder to capture the similarity collaborative signals and relational dependency on different metapaths. Finally, the representations on different meta-paths are aggregated through the attention fusion layer to yield rich representations. Extensive experiments on three real datasets demonstrate the effectiveness of our method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"3 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131452485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras
In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, which relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the limited ability to parallelize the training of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling frame dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix, and enriches the available information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates of the significance of different parts of the video and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of frame uniqueness and diversity, shows their relative contributions to the overall summarization performance.
{"title":"Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames","authors":"Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras","doi":"10.1145/3512527.3531404","DOIUrl":"https://doi.org/10.1145/3512527.3531404","url":null,"abstract":"In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames' dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115834424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo
Currently, many approaches to the sentence-query-based moment localization (SQML) task emphasize (inter-)modality interaction between the video and the language query via transformer-based cross-attention or contrastive learning. However, they still face two issues: 1) modality interaction can be unexpectedly friendly to modality-specific learning that merely learns modality-specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes the time cues in the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address these two issues. In particular, a conceptually simple modality synergy strategy keeps features modality-specific while absorbing complementary information from the other modality, using both a carefully designed cross-attention unit and non-contrastive learning. As a result, modality-specific semantics can be calibrated progressively in a safer way. To preserve the time cues of the original video, we further purify the video representation into spatial and temporal parts to enhance localization resolution with two proposed lightweight sentence-aware filtering operations. Experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets show that our model outperforms state-of-the-art approaches by a large margin.
{"title":"Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization","authors":"Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo","doi":"10.1145/3512527.3531396","DOIUrl":"https://doi.org/10.1145/3512527.3531396","url":null,"abstract":"Currently, many approaches to the sentence query based moment location (SQML) task emphasize (inter-)modality interaction between video and language query via transformer-based cross-attention or contrastive learning. However, they could still face two issues: 1) modality interaction could be unexpectedly friendly to modality specific learning that merely learns modality specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes time cues in the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification method (MS2P) for SQML to address the above two issues. Particularly, a conceptually simple modality synergy strategy is explored to keep features modality specific while absorbing the other modality complementary information with both carefully designed cross-attention unit and non-contrastive learning. As a result, modality specific semantics can be calibrated progressively in a safer way. To preserve time cues in original video, we further purify video representation into spatial and temporal parts to enhance localization resolution by the proposed two light-weight sentence-aware filtering operations. Experiments on Charades-STA, TACoS, and ActivityNet Caption datasets show our model outperforms the state-of-the-art approaches by a large margin.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121236117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuhui Guo, Xun Liang, Tang Hui, Bo Wu, Xiangping Zheng
Weakly supervised semantic segmentation with image-level labels is a challenging problem that typically relies on the initial responses generated by the classification network to locate object regions. However, such initial responses cover only the most discriminative parts of the object and may incorrectly activate in background regions. To address this problem, we propose a Cross-pixel Dependency with Boundary-feature Transformation (CDBT) method for weakly supervised semantic segmentation. Specifically, we develop a boundary-feature transformation mechanism to build strong connections among pixels belonging to the same object but weak connections between different objects. Moreover, we design a cross-pixel dependency module to enhance the initial responses, which exploits contextual appearance information and refines the predictions of current pixels through their relations with global channel pixels, thus generating higher-quality pseudo labels for training the semantic segmentation network. Extensive experiments on the PASCAL VOC 2012 segmentation benchmark demonstrate that our method outperforms state-of-the-art methods using image-level labels as weak supervision.
{"title":"Cross-Pixel Dependency with Boundary-Feature Transformation for Weakly Supervised Semantic Segmentation","authors":"Yuhui Guo, Xun Liang, Tang Hui, Bo Wu, Xiangping Zheng","doi":"10.1145/3512527.3531360","DOIUrl":"https://doi.org/10.1145/3512527.3531360","url":null,"abstract":"Weakly supervised semantic segmentation with image-level labels is a challenging problem that typically relies on the initial responses generated by the classification network to locate object regions. However, such initial responses only cover the most discriminative parts of the object and may incorrectly activate in the background regions. To address this problem, we propose a Cross-pixel Dependency with Boundary-feature Transformation (CDBT) method for weakly supervised semantic segmentation. Specifically, we develop a boundary-feature transformation mechanism, to build strong connections among pixels belonging to the same object but weak connections among different objects. Moreover, we design a cross-pixel dependency module to enhance the initial responses, which exploits context appearance information and refines the prediction of current pixels by the relations of global channel pixels, thus generating pseudo labels of higher quality for training the semantic segmentation network. Extensive experiments on the PASCAL VOC 2012 segmentation benchmark demonstrate that our method outperforms state-of-the-art methods using image-level labels as weak supervision.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124132525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In addition to classical art types such as paintings and sculptures, new types of artworks are emerging with the advancement of deep learning, social platforms, media capturing devices, and media processing tools. Large volumes of machine-/user-generated content and professionally edited content are shared and disseminated on the Web, so novel multimedia artworks emerge rapidly in the era of social media and big data. The ever-increasing amount of illustrations, comics, and animations gives rise to challenges of automatic classification, indexing, and retrieval that have been studied widely in other areas but not necessarily for this emerging type of artwork. Beyond objective entities such as objects, events, and scenes, studies of cognitive properties are also emerging; among the various kinds of computational cognitive analyses, this workshop focuses on attractiveness analysis. The topics of the accepted papers cover the affective analysis of texts, images, and music. The actual MMArt-ACM 2022 Proceedings are available at: https://dl.acm.org/citation.cfm?id=3512730.
{"title":"MMArt-ACM 2022: 5th Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia","authors":"Naoko Nitta, Anita Hu, Kensuke Tobitani","doi":"10.1145/3512527.3531442","DOIUrl":"https://doi.org/10.1145/3512527.3531442","url":null,"abstract":"In addition to classical art types like paintings and sculptures, new types of artworks emerge following the advancement of deep learning, social platforms, media capturing devices, and media processing tools. Large volumes of machine-/user-generated content or professionally-edited content are shared and disseminated on the Web. Novel multimedia artworks, therefore, emerge rapidly in the era of social media and big data. The ever-increasing amount of illustrations/comics/animations on this platform gives rise to challenges of automatic classification, indexing, and retrieval that have been studied widely in other areas but not necessarily for this emerging type of artwork. In addition to objective entities like objects, events, and scenes, studies of cognitive properties emerge. Among various kinds of computational cognitive analyses, we focus on attractiveness analysis in this workshop. The topics of the accepted papers cover the affective analysis of texts, images, and music. The actual MMArt-ACM 2022 Proceedings are available at: https://dl.acm.org/citation.cfm?id=3512730.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124858684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, a storefront accessibility image dataset is collected from Google Street View and labeled with three main objects for storefront accessibility: doors (store entrances), doorknobs (for accessing the entrances), and stairs (leading to the entrances). Then MultiCLU, a new multi-stage context learning and utilization approach, is proposed with the following four stages: Context in Labeling (CIL), Context in Training (CIT), Context in Detection (CID), and Context in Evaluation (CIE). The CIL stage automatically extends the label for each knob to include more local contextual information. In the CIT stage, a deep learning method is used to project the visual information extracted by a Faster R-CNN based object detector into the semantic space generated by a Graph Convolutional Network. The CID stage uses spatial relation reasoning between categories to refine the confidence scores. Finally, in the CIE stage, a new loose evaluation metric for storefront accessibility, especially for the knob category, is proposed to efficiently help BLV (blind and low-vision) users find estimated knob locations. Our experimental results show that the proposed MultiCLU framework achieves significantly better performance than the baseline Faster R-CNN detector, with gains of +13.4% mAP and +15.8% recall. Our new evaluation metric also introduces a new way to evaluate storefront accessibility objects, which could benefit the BLV community in real life.
{"title":"MultiCLU: Multi-stage Context Learning and Utilization for Storefront Accessibility Detection and Evaluation","authors":"X. Wang, Jiajun Chen, Hao Tang, Zhigang Zhu","doi":"10.1145/3512527.3531361","DOIUrl":"https://doi.org/10.1145/3512527.3531361","url":null,"abstract":"In this work, a storefront accessibility image dataset is collected from Google street view and is labeled with three main objects for storefront accessibility: doors (for store entrances), doorknobs (for accessing the entrances) and stairs (for leading to the entrances). Then MultiCLU, a new multi-stage context learning and utilization approach, is proposed with the following four stages: Context in Labeling (CIL), Context in Training (CIT), Context in Detection (CID) and Context in Evaluation (CIE). The CIL stage automatically extends the label for each knob to include more local contextual information. In the CIT stage, a deep learning method is used to project the visual information extracted by a Faster R-CNN based object detector to semantic space generated by a Graph Convolutional Network. The CID stage uses the spatial relation reasoning between categories to refine the confidence score. Finally in the CIE stage, a new loose evaluation metric for storefront accessibility, especially for knob category, is proposed to efficiently help BLV users to find estimated knob locations. Our experiment results show that the proposed MultiCLU framework can achieve significantly better performance than the baseline detector using Faster R-CNN, with +13.4% on mAP and +15.8% on recall, respectively. Our new evaluation metric also introduces a new way to evaluate storefront accessibility objects, which could benefit BLV group in real life.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132396519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}