Facial soft-biometrics obfuscation through adversarial attacks
Vincenzo Carletti, Pasquale Foggia, Antonio Greco, Alessia Saggese, Mario Vento
Sharing facial pictures through online services, especially on social networks, has become a common habit for thousands of users. This practice hides a potential threat to privacy: the owners of such services, as well as malicious users, can automatically extract information from faces using modern and effective neural networks. In this paper, we propose a harmless use of adversarial attacks, i.e., image perturbations that are almost imperceptible to the human eye and are typically generated with the malicious purpose of misleading Convolutional Neural Networks (CNNs). Here, such attacks are instead adopted to (i) obfuscate soft biometrics (gender, age, ethnicity) (ii) without degrading the quality of the face images posted online. We achieve these two conflicting goals by modifying the implementations of four of the most popular adversarial attacks, namely FGSM, PGD, DeepFool and C&W, so as to constrain both the average amount of noise they add to the image and the maximum perturbation they apply to any single pixel. We demonstrate, in an experimental framework including three popular CNNs, namely VGG16, SENet and MobileNetV3, that the considered obfuscation method, which requires at most four seconds per image, is effective not only when we have complete knowledge of the neural network that extracts the soft biometrics (white-box attacks), but also when the adversarial attacks are generated in a more realistic black-box scenario. Finally, we show that an opponent can implement defense techniques to partially reduce the effect of the obfuscation, but only at a substantial cost in accuracy on clean images; this result, confirmed by experiments with three popular defense methods, namely adversarial training, a denoising autoencoder and a Kullback-Leibler autoencoder, shows that defending is not convenient for the opponent and that the proposed approach is robust to defenses.
{"title":"Facial soft-biometrics obfuscation through adversarial attacks","authors":"Vincenzo Carletti, Pasquale Foggia, Antonio Greco, Alessia Saggese, Mario Vento","doi":"10.1145/3656474","DOIUrl":"https://doi.org/10.1145/3656474","url":null,"abstract":"<p>Sharing facial pictures through online services, especially on social networks, has become a common habit for thousands of users. This practice hides a possible threat to privacy: the owners of such services, as well as malicious users, could automatically extract information from faces using modern and effective neural networks. In this paper, we propose the harmless use of adversarial attacks, i.e. variations of images that are almost imperceptible to the human eye and that are typically generated with the malicious purpose to mislead Convolutional Neural Networks (CNNs). Such attacks have been instead adopted to (i) obfuscate soft biometrics (gender, age, ethnicity) but (ii) without degrading the quality of the face images posted online. We achieve the above mentioned two conflicting goals by modifying the implementations of four of the most popular adversarial attacks, namely FGSM, PGD, DeepFool and C&W, in order to constrain the average amount of noise they generate on the image and the maximum perturbation they add on the single pixel. We demonstrate, in an experimental framework including three popular CNNs, namely VGG16, SENet and MobileNetV3, that the considered obfuscation method, which requires at most four seconds for each image, is effective not only when we have a complete knowledge of the neural network that extracts the soft biometrics (white box attacks), but also when the adversarial attacks are generated in a more realistic black box scenario. Finally, we prove that an opponent can implement defense techniques to partially reduce the effect of the obfuscation, but substantially paying in terms of accuracy over clean images; this result, confirmed by the experiments carried out with three popular defense methods, namely adversarial training, denoising autoencoder and Kullback-Leibler autoencoder, shows that it is not convenient for the opponent to defend himself and that the proposed approach is robust to defenses.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"16 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MEDUSA: A Dynamic Codec Switching Approach in HTTP Adaptive Streaming
Daniele Lorenzi, Farzad Tashtarian, Hermann Hellwagner, Christian Timmerer
HTTP Adaptive Streaming (HAS) solutions utilize various Adaptive BitRate (ABR) algorithms to dynamically select appropriate video representations, aiming to adapt to fluctuations in network bandwidth. However, current ABR implementations are designed to work with a single set of video representations, i.e., the bitrate ladder, whose entries differ in bitrate and resolution but are encoded with the same video codec. When multiple codecs are available, current ABR algorithms select one of them prior to the streaming session and stick to it for the entire session. Although newer codecs are generally preferred over older ones, their compression efficiency depends on the content's complexity, which varies over time. Therefore, it is necessary to select the appropriate codec for each video segment to reduce the requested data while delivering the highest possible quality. In this paper, we first provide a practical example comparing the compression efficiency of different codecs on a set of video sequences. Based on this analysis, we formulate the optimization problem of selecting the appropriate codec for each user and video segment (in the extreme case, on a per-segment basis), refining the selection made by ABR algorithms by exploiting key metrics such as the perceived segment quality and the segment size. Subsequently, to address the scalability issues of this centralized model, we introduce MEDUSA, a novel distributed plug-in ABR algorithm for Video on Demand (VoD) applications deployed on top of existing ABR algorithms. MEDUSA enhances the user's Quality of Experience (QoE) by utilizing a multi-objective function that considers both the quality and the size of video segments when selecting the next representation. Using quality information and segment sizes from the modified Media Presentation Description (MPD), MEDUSA leverages buffer occupancy to prioritize quality or size by assigning specific weights in the objective function. To show the impact of MEDUSA, we compare the proposed plug-in approach on top of state-of-the-art techniques with their original implementations and analyze the results for different network traces, video content, and buffer capacities. According to the experimental findings, MEDUSA improves QoE for various test videos and scenarios, with gains in the QoE score of up to 42% according to the ITU-T P.1203 model (mode 0). Additionally, MEDUSA can reduce the transmitted data volume by more than 40% while achieving a QoE similar to that of the compared techniques, reducing delivery costs for streaming service providers.
{"title":"MEDUSA: A Dynamic Codec Switching Approach in HTTP Adaptive Streaming","authors":"Daniele Lorenzi, Farzad Tashtarian, Hermann Hellwagner, Christian Timmerer","doi":"10.1145/3656175","DOIUrl":"https://doi.org/10.1145/3656175","url":null,"abstract":"<p><i>HTTP Adaptive Streaming</i> (HAS) solutions utilize various Adaptive BitRate (ABR) algorithms to dynamically select appropriate video representations, aiming to adapt to fluctuations in network bandwidth. However, current ABR implementations have a limitation in that they are designed to function with one set of video representations, <i>i.e.</i>, the bitrate ladder, which differ in bitrate and resolution, but are encoded with the same video codec. When multiple codecs are available, current ABR algorithms select one of them prior to the streaming session and stick to it throughout the entire streaming session. Although newer codecs are generally preferred over older ones, their compression efficiencies differ depending on the content’s complexity, which varies over time. Therefore, it is necessary to select the appropriate codec for each video segment to reduce the requested data while delivering the highest possible quality. In this paper, we first provide a practical example where we compare compression efficiencies of different codecs on a set of video sequences. Based on this analysis, we formulate the optimization problem of selecting the appropriate codec for each user and video segment (on a per-segment basis in the outmost case), refining the selection of the ABR algorithms by exploiting key metrics, such as the perceived segment quality and size. Subsequently, to address the scalability issues of this centralized model, we introduce a novel distributed plug-in ABR algorithm for Video on Demand (VoD) applications called MEDUSA to be deployed on top of existing ABR algorithms. MEDUSA enhances the user’s Quality of Experience (QoE) by utilizing a multi-objective function that considers the quality and size of video segments when selecting the next representation. Using quality information and segment size from the modified <i>Media Presentation Description (MPD)</i>, MEDUSA utilizes buffer occupancy to prioritize quality or size by assigning specific weights in the objective function. To show the impact of MEDUSA, we compare the proposed plug-in approach on top of state-of-the-art techniques with their original implementations and analyze the results for different network traces, video content, and buffer capacities. According to the experimental findings, MEDUSA shows the ability to improve QoE for various test videos and scenarios. The results reveal an impressive improvement in the QoE score of up to 42% according to the ITU-T P.1203 model (mode 0). 
Additionally, MEDUSA can reduce the transmitted data volume by up to more than 40% achieving a QoE similar to the techniques compared, reducing the burden on streaming service providers for delivery costs.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"10 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
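To illustrate the buffer-driven quality/size trade-off described in the abstract, here is a minimal selection sketch; the weighting scheme, field names, and numbers are assumptions for illustration rather than MEDUSA's actual objective.

```python
from dataclasses import dataclass

@dataclass
class Representation:
    codec: str        # e.g. "avc", "hevc", "av1"
    quality: float    # perceptual quality of this segment, normalized to [0, 1]
    size_mb: float    # segment size as advertised in the (modified) MPD

def select_representation(candidates, buffer_s, buffer_capacity_s):
    """Pick the representation maximizing a quality/size trade-off.
    With a full buffer we can afford larger segments (weight quality more);
    with a draining buffer we favor smaller segments to avoid stalls."""
    fill = max(0.0, min(1.0, buffer_s / buffer_capacity_s))
    w_quality, w_size = fill, 1.0 - fill
    max_size = max(c.size_mb for c in candidates) or 1.0
    return max(candidates,
               key=lambda c: w_quality * c.quality - w_size * (c.size_mb / max_size))

# Example: a low buffer pushes the choice toward the smaller HEVC segment.
reps = [Representation("avc", 0.80, 3.0), Representation("hevc", 0.78, 1.8)]
print(select_representation(reps, buffer_s=5, buffer_capacity_s=30).codec)
```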
Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection
Haoran Gao, Yiming Su, Fasheng Wang, Haojie Li
While significant progress has been made in recent years in the field of salient object detection (SOD), there are still limitations in heterogeneous modality fusion and salient feature integrity learning. The former is primarily due to the limited attention paid to fusing cross-scale information between different modalities when processing multi-modal heterogeneous data, coupled with the absence of methods that adaptively control their respective contributions. The latter stems from the shortcomings of existing approaches in predicting the integrity of salient regions. To address these problems, we propose a Heterogeneous Fusion and Integrity Learning Network for RGB-D salient object detection, denoted as HFIL-Net. In response to the first challenge, we design an Advanced Semantic Guidance Aggregation (ASGA) module, which utilizes three fusion blocks to aggregate three types of information: within-scale cross-modal, within-modal cross-scale, and cross-modal cross-scale. In addition, we embed local fusion factor matrices in the ASGA module and utilize global fusion factor matrices in the Multi-modal Information Adaptive Fusion (MIAF) module to adaptively control the contributions from different perspectives during the fusion process. For the second issue, we introduce the Feature Integrity Learning and Refinement (FILR) module. It leverages the idea of "part-whole" relationships from capsule networks to learn feature integrity and further refines the learned features through attention mechanisms. Extensive experimental results demonstrate that our proposed HFIL-Net outperforms over 17 state-of-the-art (SOTA) detection methods on seven challenging standard datasets. Code and results are available at https://github.com/BojueGao/HFIL-Net.
{"title":"Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection","authors":"Haoran Gao, Yiming Su, Fasheng Wang, Haojie Li","doi":"10.1145/3656476","DOIUrl":"https://doi.org/10.1145/3656476","url":null,"abstract":"<p>While significant progress has been made in recent years in the field of salient object detection (SOD), there are still limitations in heterogeneous modality fusion and salient feature integrity learning. The former is primarily attributed to a paucity of attention from researchers to the fusion of cross-scale information between different modalities during processing multi-modal heterogeneous data, coupled with an absence of methods for adaptive control of their respective contributions. The latter constraint stems from the shortcomings in existing approaches concerning the prediction of salient region’s integrity. To address these problems, we propose a Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection, denoted as HFIL-Net. In response to the first challenge, we design an Advanced Semantic Guidance Aggregation (ASGA) module, which utilizes three fusion blocks to achieve the aggregation of three types of information: within-scale cross-modal, within-modal cross-scale, and cross-modal cross-scale. In addition, we embed the local fusion factor matrices in the ASGA module and utilize the global fusion factor matrices in the Multi-modal Information Adaptive Fusion (MIAF) module to control the contributions adaptively from different perspectives during the fusion process. For the second issue, we introduce the Feature Integrity Learning and Refinement (FILR) Module. It leverages the idea of ”part-whole” relationships from capsule networks to learn feature integrity and further refine the learned features through attention mechanisms. Extensive experimental results demonstrate that our proposed HFIL-Net outperforms over 17 state-of-the-art (SOTA) detection methods in testing across seven challenging standard datasets. Codes and results are available on https://github.com/BojueGao/HFIL-Net.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"2015 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Domain Image-to-Image Translation with Cross-Granularity Contrastive Learning
Huiyuan Fu, Jin Liu, Ting Yu, Xin Wang, Huadong Ma
The objective of multi-domain image-to-image translation is to learn the mapping from a source domain to a target domain across multiple image domains while preserving the content representation of the source domain. Despite its importance and recent efforts, most previous studies disregard the large style discrepancy between images and instances in various domains, or fail to capture instance details and boundaries properly, resulting in poor translation results for rich scenes. To address these problems, we present an effective architecture for multi-domain image-to-image translation that requires only one generator. Specifically, we provide detailed procedures for capturing instance features throughout the learning process, as well as for learning the relationship between the style of the global image and that of a local instance by enforcing cross-granularity consistency. To capture local details within the content space, we employ a dual contrastive learning strategy that operates at both the instance and patch levels. Extensive studies on different multi-domain image-to-image translation datasets reveal that our proposed method outperforms state-of-the-art approaches.
{"title":"Multi-Domain Image-to-Image Translation with Cross-Granularity Contrastive Learning","authors":"Huiyuan Fu, Jin Liu, Ting Yu, Xin Wang, Huadong Ma","doi":"10.1145/3656048","DOIUrl":"https://doi.org/10.1145/3656048","url":null,"abstract":"<p>The objective of multi-domain image-to-image translation is to learn the mapping from a source domain to a target domain in multiple image domains while preserving the content representation of the source domain. Despite the importance and recent efforts, most previous studies disregard the large style discrepancy between images and instances in various domains, or fail to capture instance details and boundaries properly, resulting in poor translation results for rich scenes. To address these problems, we present an effective architecture for multi-domain image-to-image translation that only requires one generator. Specifically, we provide detailed procedures for capturing the features of instances throughout the learning process, as well as learning the relationship between the style of the global image and that of a local instance in the image by enforcing the cross-granularity consistency. In order to capture local details within the content space, we employ a dual contrastive learning strategy that operates at both the instance and patch levels. Extensive studies on different multi-domain image-to-image translation datasets reveal that our proposed method outperforms state-of-the-art approaches.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"1 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Universal Relocalizer for Weakly Supervised Referring Expression Grounding
Panpan Zhang, Meng Liu, Xuemeng Song, Da Cao, Zan Gao, Liqiang Nie
This paper introduces the Universal Relocalizer, a novel approach for weakly supervised referring expression grounding. Our method strives to pinpoint the target proposal that corresponds to a specific query, eliminating the need for region-level annotations during training. To bolster localization precision and enrich the semantic understanding of the target proposal, we devise three key modules: the category module, the color module, and the spatial relationship module. The category and color modules assign category and color labels to region proposals, enabling the computation of category and color scores. Simultaneously, the spatial relationship module integrates spatial cues, yielding a spatial score for each proposal to further enhance localization accuracy. By combining the category, color, and spatial scores, we derive a refined grounding score for every proposal. Comprehensive evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the effectiveness of the Universal Relocalizer, which achieves strong performance across all three benchmarks.
{"title":"Universal Relocalizer for Weakly Supervised Referring Expression Grounding","authors":"Panpan Zhang, Meng Liu, Xuemeng Song, Da Cao, Zan Gao, Liqiang Nie","doi":"10.1145/3656045","DOIUrl":"https://doi.org/10.1145/3656045","url":null,"abstract":"<p>This paper introduces the Universal Relocalizer, a novel approach designed for weakly supervised referring expression grounding. Our method strives to pinpoint a target proposal that corresponds to a specific query, eliminating the need for region-level annotations during training. To bolster the localization precision and enrich the semantic understanding of the target proposal, we devise three key modules: the category module, the color module, and the spatial relationship module. The category and color modules assign respective category and color labels to region proposals, enabling the computation of category and color scores. Simultaneously, the spatial relationship module integrates spatial cues, yielding a spatial score for each proposal to enhance localization accuracy further. By adeptly amalgamating the category, color, and spatial scores, we derive a refined grounding score for every proposal. Comprehensive evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets manifest the prowess of the Universal Relocalizer, showcasing its formidable performance across the board.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"17 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual Dynamic Threshold Adjustment Strategy
Xiruo Jiang, Yazhou Yao, Sheng Liu, Fumin Shen, Liqiang Nie, Xian-Sheng Hua
Loss functions and sample mining strategies are essential components of deep metric learning algorithms. However, existing loss functions and mining strategies often require additional hyperparameters, notably a threshold that defines whether a sample pair is informative. The threshold provides a stable numerical standard for deciding whether to retain a pair and is a vital parameter for reducing the number of redundant sample pairs participating in training. Nonetheless, finding the optimal threshold can be time-consuming, often requiring extensive grid searches; because the threshold cannot be adjusted dynamically during training, many repeated experiments are needed to determine it. Therefore, we introduce a novel approach for adjusting the thresholds associated with both the loss function and the sample mining strategy. We design a static Asymmetric Sample Mining Strategy (ASMS) and its dynamic version, Adaptive Tolerance ASMS (AT-ASMS), tailored for sample mining methods. ASMS utilizes differentiated thresholds to address the problems (too few positive pairs and too many redundant negative pairs) caused by applying a single threshold to filter samples. AT-ASMS adaptively regulates the ratio of positive and negative pairs during training according to the ratio of the currently mined positive and negative pairs. This meta-learning-based threshold generation algorithm uses a single-step gradient descent to obtain new thresholds. We combine these two threshold adjustment algorithms to form the Dual Dynamic Threshold Adjustment Strategy (DDTAS). Experimental results show that our algorithm achieves competitive performance on the CUB200, Cars196, and SOP datasets. Our code is available at https://github.com/NUST-Machine-Intelligence-Laboratory/DDTAS.
{"title":"Dual Dynamic Threshold Adjustment Strategy","authors":"Xiruo Jiang, Yazhou Yao, Sheng Liu, Fumin Shen, Liqiang Nie, Xian-Sheng Hua","doi":"10.1145/3656047","DOIUrl":"https://doi.org/10.1145/3656047","url":null,"abstract":"<p>Loss functions and sample mining strategies are essential components in deep metric learning algorithms. However, the existing loss function or mining strategy often necessitate the incorporation of additional hyperparameters, notably the threshold, which defines whether the sample pair is informative. The threshold provides a stable numerical standard for determining whether to retain the pairs. It is a vital parameter to reduce the redundant sample pairs participating in training. Nonetheless, finding the optimal threshold can be a time-consuming endeavor, often requiring extensive grid searches. Because the threshold cannot be dynamically adjusted in the training stage, we should conduct plenty of repeated experiments to determine the threshold. Therefore, we introduce a novel approach for adjusting the thresholds associated with both the loss function and the sample mining strategy. We design a static Asymmetric Sample Mining Strategy (ASMS) and its dynamic version Adaptive Tolerance ASMS (AT-ASMS), tailored for sample mining methods. ASMS utilizes differentiated thresholds to address the problems (too few positive pairs and too many redundant negative pairs) caused by only applying a single threshold to filter samples. AT-ASMS can adaptively regulate the ratio of positive and negative pairs during training according to the ratio of the currently mined positive and negative pairs. This meta-learning-based threshold generation algorithm utilizes a single-step gradient descent to obtain new thresholds. We combine these two threshold adjustment algorithms to form the Dual Dynamic Threshold Adjustment Strategy (DDTAS). Experimental results show that our algorithm achieves competitive performance on CUB200, Cars196, and SOP datasets. Our codes are available at https://github.com/NUST-Machine-Intelligence-Laboratory/DDTAS.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"40 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inter-Camera Identity Discrimination for Unsupervised Person Re-Identification
Mingfu Xiong, Kaikang Hu, Zhihan Lv, Fei Fang, Zhongyuan Wang, Ruimin Hu, Khan Muhammad
Unsupervised person re-identification (Re-ID) has garnered significant attention because of its data-friendly nature: it does not require labeled data. Existing approaches primarily address this challenge by employing feature-clustering techniques to generate pseudo-labels. In addition, camera-proxy-based methods have emerged because of their impressive ability to cluster sample identities. However, these methods often blur the distinctions between individuals across inter-camera views, which is crucial for effective person Re-ID. To address this issue, this study introduces an inter-camera-identity-difference-based contrastive learning framework for unsupervised person Re-ID. The proposed framework comprises two key components: (1) a different-sample cross-view close-range penalty module and (2) a same-sample cross-view long-range constraint module. The former penalizes excessive similarity among different subjects across inter-camera views, whereas the latter mitigates excessive dissimilarity of the same subject across camera views. To validate the performance of our method, we conducted extensive experiments on three existing person Re-ID datasets (Market-1501, MSMT17, and PersonX). The results demonstrate the effectiveness of the proposed method, which shows promising performance. The code is available at https://github.com/hooldylan/IIDCL.
{"title":"Inter-Camera Identity Discrimination for Unsupervised Person Re-Identification","authors":"Mingfu Xiong, Kaikang Hu, Zhihan Lv, Fei Fang, Zhongyuan Wang, Ruimin Hu, Khan Muhammad","doi":"10.1145/3652858","DOIUrl":"https://doi.org/10.1145/3652858","url":null,"abstract":"<p>Unsupervised person re-identification (Re-ID) has garnered significant attention because of its data-friendly nature, as it does not require labeled data. Existing approaches primarily address this challenge by employing feature-clustering techniques to generate pseudo-labels. In addition, camera-proxy-based methods have emerged because of their impressive ability to cluster sample identities. However, these methods often blur the distinctions between individuals within inter-camera views, which is crucial for effective person re-ID. To address this issue, this study introduces an inter-camera-identity-difference-based contrastive learning framework for unsupervised person Re-ID. The proposed framework comprises two key components: (1) a different sample cross-view close-range penalty module and (2) the same sample cross-view long-range constraint module. The former aims to penalize excessive similarity among different subjects across inter-camera views, whereas the latter mitigates the challenge of excessive dissimilarity among the same subject across camera views. To validate the performance of our method, we conducted extensive experiments on three existing person Re-ID datasets (Market-1501, MSMT17, and PersonX). The results demonstrate the effectiveness of the proposed method, which shows a promising performance. The code is available at https://github.com/hooldylan/IIDCL.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"52 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition
Xiaolong Shen, Zhedong Zheng, Yi Yang
The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches fall into two lines, i.e., skeleton-based and RGB-based methods, but both have limitations. Skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework called Spatial-temporal Part-aware network (StepNet), based on RGB parts. As its name suggests, it is made up of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling automatically captures appearance-based properties, such as hands and faces, in the feature space without any keypoint-level annotations. Part-level Temporal Modeling, in turn, implicitly mines long- and short-term context to capture the relevant attributes over time. Extensive experiments demonstrate that, thanks to its spatial-temporal modules, StepNet achieves competitive Top-1 per-instance accuracy on three commonly used SLR benchmarks, i.e., 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with optical flow input and produces superior performance when fused with it. For those who are hard of hearing, we hope that our work can act as a preliminary step.
{"title":"StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition","authors":"Xiaolong Shen, Zhedong Zheng, Yi Yang","doi":"10.1145/3656046","DOIUrl":"https://doi.org/10.1145/3656046","url":null,"abstract":"<p>The goal of sign language recognition (SLR) is to help those who are hard of hearing or deaf overcome the communication barrier. Most existing approaches can be typically divided into two lines, <i>i.e.</i>, Skeleton-based and RGB-based methods, but both the two lines of methods have their limitations. Skeleton-based methods do not consider facial expressions, while RGB-based approaches usually ignore the fine-grained hand structure. To overcome both limitations, we propose a new framework called Spatial-temporal Part-aware network (StepNet), based on RGB parts. As its name suggests, it is made up of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. Part-level Spatial Modeling, in particular, automatically captures the appearance-based properties, such as hands and faces, in the feature space without the use of any keypoint-level annotations. On the other hand, Part-level Temporal Modeling implicitly mines the long-short term context to capture the relevant attributes over time. Extensive experiments demonstrate that our StepNet, thanks to spatial-temporal modules, achieves competitive Top-1 Per-instance accuracy on three commonly-used SLR benchmarks, <i>i.e.</i>, 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Additionally, the proposed method is compatible with the optical flow input and can produce superior performance if fused. For those who are hard of hearing, we hope that our work can act as a preliminary step.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"31 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition
Kankana Roy
With the advent of egocentric cameras, new challenges arise that traditional computer vision techniques are not sufficient to handle. Moreover, egocentric cameras often offer multiple modalities, which need to be modeled jointly to exploit complementary information. In this paper, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN to encode RGB and depth information and produce classification probabilities; a novel bilinear score pooling block to generate a score matrix; a sparse low-rank matrix recovery block to reduce the redundant features that are common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. We propose to fuse classification probabilities instead of traditional CNN features from the RGB and depth modalities, using an effective yet simple sparse low-rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, namely THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. We observe that the proposed method outperforms state-of-the-art methods and achieves accuracies of 78.55% and 96.87% on the THU-READ dataset in cross-subject and cross-group settings, respectively. Further, we achieve accuracies of 91.59% and 43.87% on the FPHA and GUN-71 datasets, respectively.
{"title":"Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition","authors":"Kankana Roy","doi":"10.1145/3656044","DOIUrl":"https://doi.org/10.1145/3656044","url":null,"abstract":"<p>With the advent of egocentric cameras, there are new challenges where traditional computer vision are not sufficient to handle this kind of videos. Moreover, egocentric cameras often offer multiple modalities which need to be modeled jointly to exploit complimentary information. In this paper, we proposed a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN to encode RGB and depth information for producing classification probabilities; a novel bilinear score pooling block to generate a score matrix; a sparse low rank matrix recovery block to reduce redundant features, which is common in bilinear pooling; a one layer CNN for frame-level classification; and an RNN for video level classification. We proposed to fuse classification probabilities instead of traditional CNN features from RGB and depth modality, involving an effective yet simple sparse low rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments over two large-scale hand action datasets, namely, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. We observe that the proposed method outperforms state-of-the-art methods and achieves accuracies of 78.55% and 96.87% over the THU-READ dataset in cross-subject and cross-group settings, respectively. Further, we achieved accuracies of 91.59% and 43.87% over the FPHA and Gun-71 datasets, respectively.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"52 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Double Reference Guided Interactive 2D and 3D Caricature Generation
Xin Huang, Dong Liang, Hongrui Cai, Yunfeng Bai, Juyong Zhang, Feng Tian, Jinyuan Jia
In this paper, we propose the first interactive 2D and 3D caricature generation and editing method guided by both geometry and texture (double) references. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We address this challenge by utilizing semantic segmentation maps as an intermediate domain, removing the influence of photo texture while preserving person-specific geometry features. Specifically, our proposed method consists of two main components: 3D-CariNet and CariMaskGAN. 3D-CariNet uses sketches or caricatures to exaggerate the input photo into several types of 3D caricatures. To generate a CariMask, we geometrically exaggerate the photo using the projection of the exaggerated 3D landmarks; CariMask is then converted into a caricature by CariMaskGAN. In this step, users can freely edit and adjust the geometry of the caricatures. Moreover, we propose a semantic detail preprocessing approach that considerably increases the detail of generated caricatures and allows modification of hair strands, wrinkles, and beards. By rendering high-quality 2D caricatures as textures, we produce 3D caricatures with a variety of texture styles. Extensive experimental results demonstrate that our method produces higher-quality caricatures and supports interactive modification with ease.
{"title":"Double Reference Guided Interactive 2D and 3D Caricature Generation","authors":"Xin Huang, Dong Liang, Hongrui Cai, Yunfeng Bai, Juyong Zhang, Feng Tian, Jinyuan Jia","doi":"10.1145/3655624","DOIUrl":"https://doi.org/10.1145/3655624","url":null,"abstract":"<p>In this paper, we propose the first geometry and texture (double) referenced interactive 2D and 3D caricature generating and editing method. The main challenge of caricature generation lies in the fact that it not only exaggerates the facial geometry but also refreshes the facial texture. We address this challenge by utilizing the semantic segmentation maps as an intermediary domain, removing the influence of photo texture while preserving the person-specific geometry features. Specifically, our proposed method consists of two main components: 3D-CariNet and CariMaskGAN. 3D-CariNet uses sketches or caricatures to exaggerate the input photo into several types of 3D caricatures. To generate a CariMask, we geometrically exaggerate the photos using the projection of exaggerated 3D landmarks, after which CariMask is converted into a caricature by CariMaskGAN. In this step, users can edit and adjust the geometry of caricatures freely. Moreover, we propose a semantic detail preprocessing approach that considerably increases the details of generated caricatures and allows modification of hair strands, wrinkles, and beards. By rendering high-quality 2D caricatures as textures, we produce 3D caricatures with a variety of texture styles. Extensive experimental results have demonstrated that our method can produce higher-quality caricatures as well as support interactive modification with ease.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"79 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}