Facial soft-biometrics obfuscation through adversarial attacks
Vincenzo Carletti, Pasquale Foggia, Antonio Greco, Alessia Saggese, Mario Vento
Sharing facial pictures through online services, especially on social networks, has become a common habit for thousands of users. This practice hides a potential threat to privacy: the owners of such services, as well as malicious users, can automatically extract information from faces using modern and effective neural networks. In this paper, we propose a harmless use of adversarial attacks, i.e., image perturbations that are almost imperceptible to the human eye and are typically generated with the malicious purpose of misleading Convolutional Neural Networks (CNNs). Here, such attacks are instead adopted to (i) obfuscate soft biometrics (gender, age, ethnicity) (ii) without degrading the quality of the face images posted online. We achieve these two conflicting goals by modifying the implementations of four of the most popular adversarial attacks, namely FGSM, PGD, DeepFool and C&W, so as to constrain both the average amount of noise they add to the image and the maximum perturbation they apply to any single pixel. We demonstrate, in an experimental framework including three popular CNNs, namely VGG16, SENet and MobileNetV3, that the considered obfuscation method, which requires at most four seconds per image, is effective not only when we have complete knowledge of the neural network that extracts the soft biometrics (white-box attacks), but also when the adversarial attacks are generated in a more realistic black-box scenario. Finally, we show that an opponent can implement defense techniques to partially reduce the effect of the obfuscation, but only at a substantial cost in accuracy on clean images; this result, confirmed by experiments with three popular defense methods, namely adversarial training, a denoising autoencoder and a Kullback-Leibler autoencoder, shows that defending is not convenient for the opponent and that the proposed approach is robust to defenses.
{"title":"Facial soft-biometrics obfuscation through adversarial attacks","authors":"Vincenzo Carletti, Pasquale Foggia, Antonio Greco, Alessia Saggese, Mario Vento","doi":"10.1145/3656474","DOIUrl":"https://doi.org/10.1145/3656474","url":null,"abstract":"<p>Sharing facial pictures through online services, especially on social networks, has become a common habit for thousands of users. This practice hides a possible threat to privacy: the owners of such services, as well as malicious users, could automatically extract information from faces using modern and effective neural networks. In this paper, we propose the harmless use of adversarial attacks, i.e. variations of images that are almost imperceptible to the human eye and that are typically generated with the malicious purpose to mislead Convolutional Neural Networks (CNNs). Such attacks have been instead adopted to (i) obfuscate soft biometrics (gender, age, ethnicity) but (ii) without degrading the quality of the face images posted online. We achieve the above mentioned two conflicting goals by modifying the implementations of four of the most popular adversarial attacks, namely FGSM, PGD, DeepFool and C&W, in order to constrain the average amount of noise they generate on the image and the maximum perturbation they add on the single pixel. We demonstrate, in an experimental framework including three popular CNNs, namely VGG16, SENet and MobileNetV3, that the considered obfuscation method, which requires at most four seconds for each image, is effective not only when we have a complete knowledge of the neural network that extracts the soft biometrics (white box attacks), but also when the adversarial attacks are generated in a more realistic black box scenario. Finally, we prove that an opponent can implement defense techniques to partially reduce the effect of the obfuscation, but substantially paying in terms of accuracy over clean images; this result, confirmed by the experiments carried out with three popular defense methods, namely adversarial training, denoising autoencoder and Kullback-Leibler autoencoder, shows that it is not convenient for the opponent to defend himself and that the proposed approach is robust to defenses.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"16 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MEDUSA: A Dynamic Codec Switching Approach in HTTP Adaptive Streaming
Daniele Lorenzi, Farzad Tashtarian, Hermann Hellwagner, Christian Timmerer
HTTP Adaptive Streaming (HAS) solutions utilize various Adaptive BitRate (ABR) algorithms to dynamically select appropriate video representations, aiming to adapt to fluctuations in network bandwidth. However, current ABR implementations are designed to work with a single set of video representations, i.e., the bitrate ladder, whose entries differ in bitrate and resolution but are encoded with the same video codec. When multiple codecs are available, current ABR algorithms select one of them prior to the streaming session and stick to it for the entire session. Although newer codecs are generally preferred over older ones, their compression efficiency depends on the content's complexity, which varies over time. Therefore, it is necessary to select the appropriate codec for each video segment to reduce the requested data while delivering the highest possible quality. In this paper, we first provide a practical example comparing the compression efficiency of different codecs on a set of video sequences. Based on this analysis, we formulate the optimization problem of selecting the appropriate codec for each user and video segment (in the extreme case, on a per-segment basis), refining the selection made by ABR algorithms by exploiting key metrics such as the perceived segment quality and the segment size. Subsequently, to address the scalability issues of this centralized model, we introduce MEDUSA, a novel distributed plug-in ABR algorithm for Video on Demand (VoD) applications deployed on top of existing ABR algorithms. MEDUSA enhances the user's Quality of Experience (QoE) by utilizing a multi-objective function that considers both the quality and the size of video segments when selecting the next representation. Using quality information and segment sizes from the modified Media Presentation Description (MPD), MEDUSA leverages buffer occupancy to prioritize quality or size by assigning specific weights in the objective function. To show the impact of MEDUSA, we compare the proposed plug-in approach on top of state-of-the-art techniques with their original implementations and analyze the results for different network traces, video content, and buffer capacities. According to the experimental findings, MEDUSA improves QoE for various test videos and scenarios, with gains in the QoE score of up to 42% according to the ITU-T P.1203 model (mode 0). Additionally, MEDUSA can reduce the transmitted data volume by more than 40% while achieving a QoE similar to that of the compared techniques, reducing delivery costs for streaming service providers.
{"title":"MEDUSA: A Dynamic Codec Switching Approach in HTTP Adaptive Streaming","authors":"Daniele Lorenzi, Farzad Tashtarian, Hermann Hellwagner, Christian Timmerer","doi":"10.1145/3656175","DOIUrl":"https://doi.org/10.1145/3656175","url":null,"abstract":"<p><i>HTTP Adaptive Streaming</i> (HAS) solutions utilize various Adaptive BitRate (ABR) algorithms to dynamically select appropriate video representations, aiming to adapt to fluctuations in network bandwidth. However, current ABR implementations have a limitation in that they are designed to function with one set of video representations, <i>i.e.</i>, the bitrate ladder, which differ in bitrate and resolution, but are encoded with the same video codec. When multiple codecs are available, current ABR algorithms select one of them prior to the streaming session and stick to it throughout the entire streaming session. Although newer codecs are generally preferred over older ones, their compression efficiencies differ depending on the content’s complexity, which varies over time. Therefore, it is necessary to select the appropriate codec for each video segment to reduce the requested data while delivering the highest possible quality. In this paper, we first provide a practical example where we compare compression efficiencies of different codecs on a set of video sequences. Based on this analysis, we formulate the optimization problem of selecting the appropriate codec for each user and video segment (on a per-segment basis in the outmost case), refining the selection of the ABR algorithms by exploiting key metrics, such as the perceived segment quality and size. Subsequently, to address the scalability issues of this centralized model, we introduce a novel distributed plug-in ABR algorithm for Video on Demand (VoD) applications called MEDUSA to be deployed on top of existing ABR algorithms. MEDUSA enhances the user’s Quality of Experience (QoE) by utilizing a multi-objective function that considers the quality and size of video segments when selecting the next representation. Using quality information and segment size from the modified <i>Media Presentation Description (MPD)</i>, MEDUSA utilizes buffer occupancy to prioritize quality or size by assigning specific weights in the objective function. To show the impact of MEDUSA, we compare the proposed plug-in approach on top of state-of-the-art techniques with their original implementations and analyze the results for different network traces, video content, and buffer capacities. According to the experimental findings, MEDUSA shows the ability to improve QoE for various test videos and scenarios. The results reveal an impressive improvement in the QoE score of up to 42% according to the ITU-T P.1203 model (mode 0). 
Additionally, MEDUSA can reduce the transmitted data volume by up to more than 40% achieving a QoE similar to the techniques compared, reducing the burden on streaming service providers for delivery costs.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"10 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
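To illustrate the buffer-driven quality/size trade-off described in the abstract, here is a minimal selection sketch; the weighting scheme, field names, and numbers are assumptions for illustration rather than MEDUSA's actual objective.

```python
from dataclasses import dataclass

@dataclass
class Representation:
    codec: str        # e.g. "avc", "hevc", "av1"
    quality: float    # perceptual quality of this segment, normalized to [0, 1]
    size_mb: float    # segment size as advertised in the (modified) MPD

def select_representation(candidates, buffer_s, buffer_capacity_s):
    """Pick the representation maximizing a quality/size trade-off.
    With a full buffer we can afford larger segments (weight quality more);
    with a draining buffer we favor smaller segments to avoid stalls."""
    fill = max(0.0, min(1.0, buffer_s / buffer_capacity_s))
    w_quality, w_size = fill, 1.0 - fill
    max_size = max(c.size_mb for c in candidates) or 1.0
    return max(candidates,
               key=lambda c: w_quality * c.quality - w_size * (c.size_mb / max_size))

# Example: a low buffer pushes the choice toward the smaller HEVC segment.
reps = [Representation("avc", 0.80, 3.0), Representation("hevc", 0.78, 1.8)]
print(select_representation(reps, buffer_s=5, buffer_capacity_s=30).codec)
```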
Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection
Haoran Gao, Yiming Su, Fasheng Wang, Haojie Li
While significant progress has been made in recent years in the field of salient object detection (SOD), there are still limitations in heterogeneous modality fusion and salient feature integrity learning. The former is primarily due to the limited attention paid to fusing cross-scale information between different modalities when processing multi-modal heterogeneous data, coupled with the absence of methods that adaptively control their respective contributions. The latter stems from the shortcomings of existing approaches in predicting the integrity of salient regions. To address these problems, we propose a Heterogeneous Fusion and Integrity Learning Network for RGB-D salient object detection, denoted as HFIL-Net. In response to the first challenge, we design an Advanced Semantic Guidance Aggregation (ASGA) module, which utilizes three fusion blocks to aggregate three types of information: within-scale cross-modal, within-modal cross-scale, and cross-modal cross-scale. In addition, we embed local fusion factor matrices in the ASGA module and utilize global fusion factor matrices in the Multi-modal Information Adaptive Fusion (MIAF) module to adaptively control the contributions from different perspectives during the fusion process. For the second issue, we introduce the Feature Integrity Learning and Refinement (FILR) module. It leverages the idea of "part-whole" relationships from capsule networks to learn feature integrity and further refines the learned features through attention mechanisms. Extensive experimental results demonstrate that our proposed HFIL-Net outperforms over 17 state-of-the-art (SOTA) detection methods on seven challenging standard datasets. Code and results are available at https://github.com/BojueGao/HFIL-Net.
{"title":"Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection","authors":"Haoran Gao, Yiming Su, Fasheng Wang, Haojie Li","doi":"10.1145/3656476","DOIUrl":"https://doi.org/10.1145/3656476","url":null,"abstract":"<p>While significant progress has been made in recent years in the field of salient object detection (SOD), there are still limitations in heterogeneous modality fusion and salient feature integrity learning. The former is primarily attributed to a paucity of attention from researchers to the fusion of cross-scale information between different modalities during processing multi-modal heterogeneous data, coupled with an absence of methods for adaptive control of their respective contributions. The latter constraint stems from the shortcomings in existing approaches concerning the prediction of salient region’s integrity. To address these problems, we propose a Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection, denoted as HFIL-Net. In response to the first challenge, we design an Advanced Semantic Guidance Aggregation (ASGA) module, which utilizes three fusion blocks to achieve the aggregation of three types of information: within-scale cross-modal, within-modal cross-scale, and cross-modal cross-scale. In addition, we embed the local fusion factor matrices in the ASGA module and utilize the global fusion factor matrices in the Multi-modal Information Adaptive Fusion (MIAF) module to control the contributions adaptively from different perspectives during the fusion process. For the second issue, we introduce the Feature Integrity Learning and Refinement (FILR) Module. It leverages the idea of ”part-whole” relationships from capsule networks to learn feature integrity and further refine the learned features through attention mechanisms. Extensive experimental results demonstrate that our proposed HFIL-Net outperforms over 17 state-of-the-art (SOTA) detection methods in testing across seven challenging standard datasets. Codes and results are available on https://github.com/BojueGao/HFIL-Net.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"2015 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Domain Image-to-Image Translation with Cross-Granularity Contrastive Learning
Huiyuan Fu, Jin Liu, Ting Yu, Xin Wang, Huadong Ma
The objective of multi-domain image-to-image translation is to learn the mapping from a source domain to a target domain across multiple image domains while preserving the content representation of the source domain. Despite its importance and recent efforts, most previous studies disregard the large style discrepancy between images and instances in various domains, or fail to capture instance details and boundaries properly, resulting in poor translation results for rich scenes. To address these problems, we present an effective architecture for multi-domain image-to-image translation that requires only one generator. Specifically, we provide detailed procedures for capturing instance features throughout the learning process, as well as for learning the relationship between the style of the global image and that of a local instance by enforcing cross-granularity consistency. To capture local details within the content space, we employ a dual contrastive learning strategy that operates at both the instance and patch levels. Extensive studies on different multi-domain image-to-image translation datasets reveal that our proposed method outperforms state-of-the-art approaches.
{"title":"Multi-Domain Image-to-Image Translation with Cross-Granularity Contrastive Learning","authors":"Huiyuan Fu, Jin Liu, Ting Yu, Xin Wang, Huadong Ma","doi":"10.1145/3656048","DOIUrl":"https://doi.org/10.1145/3656048","url":null,"abstract":"<p>The objective of multi-domain image-to-image translation is to learn the mapping from a source domain to a target domain in multiple image domains while preserving the content representation of the source domain. Despite the importance and recent efforts, most previous studies disregard the large style discrepancy between images and instances in various domains, or fail to capture instance details and boundaries properly, resulting in poor translation results for rich scenes. To address these problems, we present an effective architecture for multi-domain image-to-image translation that only requires one generator. Specifically, we provide detailed procedures for capturing the features of instances throughout the learning process, as well as learning the relationship between the style of the global image and that of a local instance in the image by enforcing the cross-granularity consistency. In order to capture local details within the content space, we employ a dual contrastive learning strategy that operates at both the instance and patch levels. Extensive studies on different multi-domain image-to-image translation datasets reveal that our proposed method outperforms state-of-the-art approaches.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"1 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Universal Relocalizer for Weakly Supervised Referring Expression Grounding
Panpan Zhang, Meng Liu, Xuemeng Song, Da Cao, Zan Gao, Liqiang Nie
This paper introduces the Universal Relocalizer, a novel approach for weakly supervised referring expression grounding. Our method strives to pinpoint the target proposal that corresponds to a specific query, eliminating the need for region-level annotations during training. To bolster localization precision and enrich the semantic understanding of the target proposal, we devise three key modules: the category module, the color module, and the spatial relationship module. The category and color modules assign category and color labels to region proposals, enabling the computation of category and color scores. Simultaneously, the spatial relationship module integrates spatial cues, yielding a spatial score for each proposal to further enhance localization accuracy. By combining the category, color, and spatial scores, we derive a refined grounding score for every proposal. Comprehensive evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the effectiveness of the Universal Relocalizer, which achieves strong performance across all three benchmarks.
{"title":"Universal Relocalizer for Weakly Supervised Referring Expression Grounding","authors":"Panpan Zhang, Meng Liu, Xuemeng Song, Da Cao, Zan Gao, Liqiang Nie","doi":"10.1145/3656045","DOIUrl":"https://doi.org/10.1145/3656045","url":null,"abstract":"<p>This paper introduces the Universal Relocalizer, a novel approach designed for weakly supervised referring expression grounding. Our method strives to pinpoint a target proposal that corresponds to a specific query, eliminating the need for region-level annotations during training. To bolster the localization precision and enrich the semantic understanding of the target proposal, we devise three key modules: the category module, the color module, and the spatial relationship module. The category and color modules assign respective category and color labels to region proposals, enabling the computation of category and color scores. Simultaneously, the spatial relationship module integrates spatial cues, yielding a spatial score for each proposal to enhance localization accuracy further. By adeptly amalgamating the category, color, and spatial scores, we derive a refined grounding score for every proposal. Comprehensive evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets manifest the prowess of the Universal Relocalizer, showcasing its formidable performance across the board.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"17 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual Dynamic Threshold Adjustment Strategy
Xiruo Jiang, Yazhou Yao, Sheng Liu, Fumin Shen, Liqiang Nie, Xian-Sheng Hua
Loss functions and sample mining strategies are essential components of deep metric learning algorithms. However, existing loss functions and mining strategies often require additional hyperparameters, notably a threshold that defines whether a sample pair is informative. The threshold provides a stable numerical standard for deciding whether to retain a pair and is a vital parameter for reducing the number of redundant sample pairs participating in training. Nonetheless, finding the optimal threshold can be time-consuming, often requiring extensive grid searches; because the threshold cannot be adjusted dynamically during training, many repeated experiments are needed to determine it. Therefore, we introduce a novel approach for adjusting the thresholds associated with both the loss function and the sample mining strategy. We design a static Asymmetric Sample Mining Strategy (ASMS) and its dynamic version, Adaptive Tolerance ASMS (AT-ASMS), tailored for sample mining methods. ASMS utilizes differentiated thresholds to address the problems (too few positive pairs and too many redundant negative pairs) caused by applying a single threshold to filter samples. AT-ASMS adaptively regulates the ratio of positive and negative pairs during training according to the ratio of the currently mined positive and negative pairs. This meta-learning-based threshold generation algorithm uses a single-step gradient descent to obtain new thresholds. We combine these two threshold adjustment algorithms to form the Dual Dynamic Threshold Adjustment Strategy (DDTAS). Experimental results show that our algorithm achieves competitive performance on the CUB200, Cars196, and SOP datasets. Our code is available at https://github.com/NUST-Machine-Intelligence-Laboratory/DDTAS.
{"title":"Dual Dynamic Threshold Adjustment Strategy","authors":"Xiruo Jiang, Yazhou Yao, Sheng Liu, Fumin Shen, Liqiang Nie, Xian-Sheng Hua","doi":"10.1145/3656047","DOIUrl":"https://doi.org/10.1145/3656047","url":null,"abstract":"<p>Loss functions and sample mining strategies are essential components in deep metric learning algorithms. However, the existing loss function or mining strategy often necessitate the incorporation of additional hyperparameters, notably the threshold, which defines whether the sample pair is informative. The threshold provides a stable numerical standard for determining whether to retain the pairs. It is a vital parameter to reduce the redundant sample pairs participating in training. Nonetheless, finding the optimal threshold can be a time-consuming endeavor, often requiring extensive grid searches. Because the threshold cannot be dynamically adjusted in the training stage, we should conduct plenty of repeated experiments to determine the threshold. Therefore, we introduce a novel approach for adjusting the thresholds associated with both the loss function and the sample mining strategy. We design a static Asymmetric Sample Mining Strategy (ASMS) and its dynamic version Adaptive Tolerance ASMS (AT-ASMS), tailored for sample mining methods. ASMS utilizes differentiated thresholds to address the problems (too few positive pairs and too many redundant negative pairs) caused by only applying a single threshold to filter samples. AT-ASMS can adaptively regulate the ratio of positive and negative pairs during training according to the ratio of the currently mined positive and negative pairs. This meta-learning-based threshold generation algorithm utilizes a single-step gradient descent to obtain new thresholds. We combine these two threshold adjustment algorithms to form the Dual Dynamic Threshold Adjustment Strategy (DDTAS). Experimental results show that our algorithm achieves competitive performance on CUB200, Cars196, and SOP datasets. Our codes are available at https://github.com/NUST-Machine-Intelligence-Laboratory/DDTAS.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"40 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inter-Camera Identity Discrimination for Unsupervised Person Re-Identification
Mingfu Xiong, Kaikang Hu, Zhihan Lv, Fei Fang, Zhongyuan Wang, Ruimin Hu, Khan Muhammad
Unsupervised person re-identification (Re-ID) has garnered significant attention because of its data-friendly nature: it does not require labeled data. Existing approaches primarily address this challenge by employing feature-clustering techniques to generate pseudo-labels. In addition, camera-proxy-based methods have emerged because of their impressive ability to cluster sample identities. However, these methods often blur the distinctions between individuals across inter-camera views, which is crucial for effective person Re-ID. To address this issue, this study introduces an inter-camera-identity-difference-based contrastive learning framework for unsupervised person Re-ID. The proposed framework comprises two key components: (1) a different-sample cross-view close-range penalty module and (2) a same-sample cross-view long-range constraint module. The former penalizes excessive similarity among different subjects across inter-camera views, whereas the latter mitigates excessive dissimilarity of the same subject across camera views. To validate the performance of our method, we conducted extensive experiments on three existing person Re-ID datasets (Market-1501, MSMT17, and PersonX). The results demonstrate the effectiveness of the proposed method, which shows promising performance. The code is available at https://github.com/hooldylan/IIDCL.
{"title":"Inter-Camera Identity Discrimination for Unsupervised Person Re-Identification","authors":"Mingfu Xiong, Kaikang Hu, Zhihan Lv, Fei Fang, Zhongyuan Wang, Ruimin Hu, Khan Muhammad","doi":"10.1145/3652858","DOIUrl":"https://doi.org/10.1145/3652858","url":null,"abstract":"<p>Unsupervised person re-identification (Re-ID) has garnered significant attention because of its data-friendly nature, as it does not require labeled data. Existing approaches primarily address this challenge by employing feature-clustering techniques to generate pseudo-labels. In addition, camera-proxy-based methods have emerged because of their impressive ability to cluster sample identities. However, these methods often blur the distinctions between individuals within inter-camera views, which is crucial for effective person re-ID. To address this issue, this study introduces an inter-camera-identity-difference-based contrastive learning framework for unsupervised person Re-ID. The proposed framework comprises two key components: (1) a different sample cross-view close-range penalty module and (2) the same sample cross-view long-range constraint module. The former aims to penalize excessive similarity among different subjects across inter-camera views, whereas the latter mitigates the challenge of excessive dissimilarity among the same subject across camera views. To validate the performance of our method, we conducted extensive experiments on three existing person Re-ID datasets (Market-1501, MSMT17, and PersonX). The results demonstrate the effectiveness of the proposed method, which shows a promising performance. The code is available at https://github.com/hooldylan/IIDCL.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"52 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition
Xiaolong Shen, Zhedong Zheng, Yi Yang
The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches fall into two lines, i.e., skeleton-based and RGB-based methods, but both have limitations. Skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework called Spatial-temporal Part-aware network (StepNet), based on RGB parts. As its name suggests, it is made up of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling automatically captures appearance-based properties, such as hands and faces, in the feature space without any keypoint-level annotations. Part-level Temporal Modeling, in turn, implicitly mines long- and short-term context to capture the relevant attributes over time. Extensive experiments demonstrate that, thanks to its spatial-temporal modules, StepNet achieves competitive Top-1 per-instance accuracy on three commonly used SLR benchmarks, i.e., 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with optical flow input and produces superior performance when fused with it. For those who are hard of hearing, we hope that our work can act as a preliminary step.
{"title":"StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition","authors":"Xiaolong Shen, Zhedong Zheng, Yi Yang","doi":"10.1145/3656046","DOIUrl":"https://doi.org/10.1145/3656046","url":null,"abstract":"<p>The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches can be typically divided into two lines, <i>i.e.</i>, Skeleton-based and RGB-based methods, but both the two lines of methods have their limitations. Skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework called Spatial-temporal Part-aware network (StepNet), based on RGB parts. As its name suggests, it is made up of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling, in particular, automatically captures the appearance-based properties, such as hands and faces, in the feature space without the use of any keypoint-level annotations. On the other hand, Part-level Temporal Modeling implicitly mines the long-short term context to capture the relevant attributes over time. Extensive experiments demonstrate that our StepNet, thanks to spatial-temporal modules, achieves competitive Top-1 Per-instance accuracy on three commonly-used SLR benchmarks, <i>i.e.</i>, 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with the optical flow input and can produce superior performance if fused. For those who are hard of hearing, we hope that our work can act as a preliminary step.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"31 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition
Kankana Roy
With the advent of egocentric cameras, new challenges arise that traditional computer vision techniques are not sufficient to handle. Moreover, egocentric cameras often offer multiple modalities, which need to be modeled jointly to exploit complementary information. In this paper, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN to encode RGB and depth information and produce classification probabilities; a novel bilinear score pooling block to generate a score matrix; a sparse low-rank matrix recovery block to reduce the redundant features that are common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. We propose to fuse classification probabilities instead of traditional CNN features from the RGB and depth modalities, using an effective yet simple sparse low-rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, namely THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. We observe that the proposed method outperforms state-of-the-art methods and achieves accuracies of 78.55% and 96.87% on the THU-READ dataset in cross-subject and cross-group settings, respectively. Further, we achieve accuracies of 91.59% and 43.87% on the FPHA and GUN-71 datasets, respectively.
{"title":"Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition","authors":"Kankana Roy","doi":"10.1145/3656044","DOIUrl":"https://doi.org/10.1145/3656044","url":null,"abstract":"<p>With the advent of egocentric cameras, there are new challenges where traditional computer vision are not sufficient to handle this kind of videos. Moreover, egocentric cameras often offer multiple modalities which need to be modeled jointly to exploit complimentary information. In this paper, we proposed a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN to encode RGB and depth information for producing classification probabilities; a novel bilinear score pooling block to generate a score matrix; a sparse low rank matrix recovery block to reduce redundant features, which is common in bilinear pooling; a one layer CNN for frame-level classification; and an RNN for video level classification. We proposed to fuse classification probabilities instead of traditional CNN features from RGB and depth modality, involving an effective yet simple sparse low rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments over two large-scale hand action datasets, namely, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. We observe that the proposed method outperforms state-of-the-art methods and achieves accuracies of 78.55% and 96.87% over the THU-READ dataset in cross-subject and cross-group settings, respectively. Further, we achieved accuracies of 91.59% and 43.87% over the FPHA and Gun-71 datasets, respectively.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"52 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Double Reference Guided Interactive 2D and 3D Caricature Generation
Xin Huang, Dong Liang, Hongrui Cai, Yunfeng Bai, Juyong Zhang, Feng Tian, Jinyuan Jia
In this paper, we propose the first interactive 2D and 3D caricature generation and editing method guided by both geometry and texture (double) references. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We address this challenge by utilizing semantic segmentation maps as an intermediate domain, removing the influence of photo texture while preserving person-specific geometry features. Specifically, our proposed method consists of two main components: 3D-CariNet and CariMaskGAN. 3D-CariNet uses sketches or caricatures to exaggerate the input photo into several types of 3D caricatures. To generate a CariMask, we geometrically exaggerate the photo using the projection of the exaggerated 3D landmarks; CariMask is then converted into a caricature by CariMaskGAN. In this step, users can freely edit and adjust the geometry of the caricatures. Moreover, we propose a semantic detail preprocessing approach that considerably increases the detail of generated caricatures and allows modification of hair strands, wrinkles, and beards. By rendering high-quality 2D caricatures as textures, we produce 3D caricatures with a variety of texture styles. Extensive experimental results demonstrate that our method produces higher-quality caricatures and supports interactive modification with ease.
{"title":"Double Reference Guided Interactive 2D and 3D Caricature Generation","authors":"Xin Huang, Dong Liang, Hongrui Cai, Yunfeng Bai, Juyong Zhang, Feng Tian, Jinyuan Jia","doi":"10.1145/3655624","DOIUrl":"https://doi.org/10.1145/3655624","url":null,"abstract":"<p>In this paper, we propose the first geometry and texture (double) referenced interactive 2D and 3D caricature generating and editing method. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We address this challenge by utilizing the semantic segmentation maps as an intermediary domain, removing the influence of photo texture while preserving the person-specific geometry features. Specifically, our proposed method consists of two main components: 3D-CariNet and CariMaskGAN. 3D-CariNet uses sketches or caricatures to exaggerate the input photo into several types of 3D caricatures. To generate a CariMask, we geometrically exaggerate the photos using the projection of exaggerated 3D landmarks, after which CariMask is converted into a caricature by CariMaskGAN. In this step, users can edit and adjust the geometry of caricatures freely. Moreover, we propose a semantic detail preprocessing approach that considerably increases the details of generated caricatures and allows modification of hair strands, wrinkles, and beards. By rendering high-quality 2D caricatures as textures, we produce 3D caricatures with a variety of texture styles. Extensive experimental results have demonstrated that our method can produce higher-quality caricatures as well as support interactive modification with ease.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"79 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}