Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang
We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method, aimed at accurately and efficiently creating a dynamic digital world containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. By utilizing body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human motions in unconstrained space without the need for external devices and pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capturing in various environments. Considering that IMUs can capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations, HiSC4D employs a joint optimization method that harmonizes all sensors and utilizes environment cues, yielding promising results for long-term capture in large scenes. To promote research on egocentric human interaction in large scenes and facilitate downstream tasks, we also present a dataset containing 8 sequences in 4 large scenes (200 to 5,000 $m^2$), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene meshes of the environments. A variety of scenarios, such as a basketball gym and a commercial street, alongside challenging human motions, such as daily greetings, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and generalization ability of HiSC4D. The dataset and code will be made publicly available at www.lidarhumanmotion.net/hisc4d for research purposes.
{"title":"HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR","authors":"Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang","doi":"arxiv-2409.04398","DOIUrl":"https://doi.org/arxiv-2409.04398","url":null,"abstract":"We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture\u0000method, aimed at accurately and efficiently creating a dynamic digital world,\u0000containing large-scale indoor-outdoor scenes, diverse human motions, rich\u0000human-human interactions, and human-environment interactions. By utilizing\u0000body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human\u0000motions in unconstrained space without the need for external devices and\u0000pre-built maps. This affords great flexibility and accessibility for\u0000human-centered interaction and 4D scene capturing in various environments.\u0000Taking into account that IMUs can capture human spatially unrestricted poses\u0000but are prone to drifting for long-period using, and while LiDAR is stable for\u0000global localization but rough for local positions and orientations, HiSC4D\u0000employs a joint optimization method, harmonizing all sensors and utilizing\u0000environment cues, yielding promising results for long-term capture in large\u0000scenes. To promote research of egocentric human interaction in large scenes and\u0000facilitate downstream tasks, we also present a dataset, containing 8 sequences\u0000in 4 large scenes (200 to 5,000 $m^2$), providing 36k frames of accurate 4D\u0000human motions with SMPL annotations and dynamic scenes, 31k frames of cropped\u0000human point clouds, and scene mesh of the environment. A variety of scenarios,\u0000such as the basketball gym and commercial street, alongside challenging human\u0000motions, such as daily greeting, one-on-one basketball playing, and tour\u0000guiding, demonstrate the effectiveness and the generalization ability of\u0000HiSC4D. The dataset and code will be publicated on\u0000www.lidarhumanmotion.net/hisc4d available for research purposes.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Large Language Models (MLLMs) have shown excellent performance in question answering on single-event videos. In this paper, we present question answering over dense video events, a novel task that requires answering and grounding dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events on 10.6K long videos. We then benchmark existing MLLMs and show that models excelling at single-event QA struggle to perform well on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach built around a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module, which respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1 percent and 3.7 percent in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively.
{"title":"Question-Answering Dense Video Events","authors":"Hangyu Qin, Junbin Xiao, Angela Yao","doi":"arxiv-2409.04388","DOIUrl":"https://doi.org/arxiv-2409.04388","url":null,"abstract":"Multimodal Large Language Models (MLLMs) have shown excellent performance in\u0000question-answering of single-event videos. In this paper, we present\u0000question-answering dense video events, a novel task that requires answering and\u0000grounding the dense-event questions in long videos, thus challenging MLLMs to\u0000faithfully comprehend and reason about multiple events occurring over extended\u0000time periods. To facilitate the study, we construct DeVE-QA - a dataset\u0000featuring 78K questions about 26K events on 10.6K long videos. We then\u0000benchmark and show that existing MLLMs excelling at single-event QA struggle to\u0000perform well in DeVE-QA. For improvement, we propose DeVi, a novel\u0000training-free MLLM approach that highlights a hierarchical captioning module, a\u0000temporal event memory module, and a self-consistency checking module to\u0000respectively detect, contextualize and memorize, and ground dense-events in\u0000long videos for question answering. Extensive experiments show that DeVi is\u0000superior at answering dense-event questions and grounding relevant video\u0000moments. Compared with existing MLLMs, it achieves a remarkable increase of 4.1\u0000percent and 3.7 percent for G(round)QA accuracy on DeVE-QA and NExT-GQA\u0000respectively.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"392 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione
Nowadays, deep neural networks are used to solve complex tasks in several critical applications, and protecting both their integrity and intellectual property rights (IPR) has become of utmost importance. To this end, we advance WaterMAS, a substitutive, white-box neural network watermarking method that improves the trade-off among robustness, imperceptibility, and computational complexity, while making provisions for increased data payload and security. WaterMAS insertion keeps the watermarked weights unchanged while sharpening their underlying gradient space. Robustness is thus ensured by limiting the attack's strength: even small alterations of the watermarked weights would impact the model's performance. Imperceptibility is ensured by inserting the watermark during the training process. The relationship among the WaterMAS data payload, imperceptibility, and robustness properties is discussed. The secret key is represented by the positions of the weights conveying the watermark, randomly chosen across multiple layers of the model. Security is evaluated by investigating the case in which an attacker intercepts the key. The experimental validation considers 5 models and 2 tasks (VGG16, ResNet18, MobileNetV3, and SwinT for CIFAR10 image classification, and DeepLabV3 for Cityscapes image segmentation) as well as 4 types of attacks (Gaussian noise addition, pruning, fine-tuning, and quantization). The code will be released open-source upon acceptance of the article.
{"title":"WaterMAS: Sharpness-Aware Maximization for Neural Network Watermarking","authors":"Carl De Sousa Trias, Mihai Mitrea, Attilio Fiandrotti, Marco Cagnazzo, Sumanta Chaudhuri, Enzo Tartaglione","doi":"arxiv-2409.03902","DOIUrl":"https://doi.org/arxiv-2409.03902","url":null,"abstract":"Nowadays, deep neural networks are used for solving complex tasks in several\u0000critical applications and protecting both their integrity and intellectual\u0000property rights (IPR) has become of utmost importance. To this end, we advance\u0000WaterMAS, a substitutive, white-box neural network watermarking method that\u0000improves the trade-off among robustness, imperceptibility, and computational\u0000complexity, while making provisions for increased data payload and security.\u0000WasterMAS insertion keeps unchanged the watermarked weights while sharpening\u0000their underlying gradient space. The robustness is thus ensured by limiting the\u0000attack's strength: even small alterations of the watermarked weights would\u0000impact the model's performance. The imperceptibility is ensured by inserting\u0000the watermark during the training process. The relationship among the WaterMAS\u0000data payload, imperceptibility, and robustness properties is discussed. The\u0000secret key is represented by the positions of the weights conveying the\u0000watermark, randomly chosen through multiple layers of the model. The security\u0000is evaluated by investigating the case in which an attacker would intercept the\u0000key. The experimental validations consider 5 models and 2 tasks (VGG16,\u0000ResNet18, MobileNetV3, SwinT for CIFAR10 image classification, and DeepLabV3\u0000for Cityscapes image segmentation) as well as 4 types of attacks (Gaussian\u0000noise addition, pruning, fine-tuning, and quantization). The code will be\u0000released open-source upon acceptance of the article.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu
Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address these challenges, we propose a novel framework called SegTalker that decouples lip movements from image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of an image produced by a parsing network, we first leverage the speech to drive the mask and generate a talking segmentation. Then we disentangle the semantic regions of the image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frames. In this way, most textures are fully preserved. Moreover, our approach inherently achieves background separation and facilitates mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g., hair, lips, eyebrows), our approach enables seamless facial editing when generating talking face videos. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
{"title":"SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing","authors":"Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu","doi":"arxiv-2409.03605","DOIUrl":"https://doi.org/arxiv-2409.03605","url":null,"abstract":"Audio-driven talking face generation aims to synthesize video with lip\u0000movements synchronized to input audio. However, current generative techniques\u0000face challenges in preserving intricate regional textures (skin, teeth). To\u0000address the aforementioned challenges, we propose a novel framework called\u0000SegTalker to decouple lip movements and image textures by introducing\u0000segmentation as intermediate representation. Specifically, given the mask of\u0000image employed by a parsing network, we first leverage the speech to drive the\u0000mask and generate talking segmentation. Then we disentangle semantic regions of\u0000image into style codes using a mask-guided encoder. Ultimately, we inject the\u0000previously generated talking segmentation and style codes into a mask-guided\u0000StyleGAN to synthesize video frame. In this way, most of textures are fully\u0000preserved. Moreover, our approach can inherently achieve background separation\u0000and facilitate mask-guided facial local editing. In particular, by editing the\u0000mask and swapping the region textures from a given reference image (e.g. hair,\u0000lip, eyebrows), our approach enables facial editing seamlessly when generating\u0000talking face video. Experiments demonstrate that our proposed approach can\u0000effectively preserve texture details and generate temporally consistent video\u0000while remaining competitive in lip synchronization. Quantitative and\u0000qualitative results on the HDTF and MEAD datasets illustrate the superior\u0000performance of our method over existing methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin
One common belief is that, with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than state-of-the-art (SOTA) transformer-based methods.
{"title":"Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression","authors":"Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin","doi":"arxiv-2409.03385","DOIUrl":"https://doi.org/arxiv-2409.03385","url":null,"abstract":"One common belief is that with complex models and pre-training on large-scale\u0000datasets, transformer-based methods for referring expression comprehension\u0000(REC) perform much better than existing graph-based methods. We observe that\u0000since most graph-based methods adopt an off-the-shelf detector to locate\u0000candidate objects (i.e., regions detected by the object detector), they face\u0000two challenges that result in subpar performance: (1) the presence of\u0000significant noise caused by numerous irrelevant objects during reasoning, and\u0000(2) inaccurate localization outcomes attributed to the provided detector. To\u0000address these issues, we introduce a plug-and-adapt module guided by\u0000sub-expressions, called dynamic gate constraint (DGC), which can adaptively\u0000disable irrelevant proposals and their connections in graphs during reasoning.\u0000We further introduce an expression-guided regression strategy (EGR) to refine\u0000location prediction. Extensive experimental results on the RefCOCO, RefCOCO+,\u0000RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the\u0000effectiveness of the DGC module and the EGR strategy in consistently boosting\u0000the performances of various graph-based REC methods. Without any pretaining,\u0000the proposed graph-based method achieves better performance than the\u0000state-of-the-art (SOTA) transformer-based methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction, and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model, LongLLaVA (Long-Context Large Language and Vision Assistant), is the first hybrid MLLM and achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks but also maintains high throughput and low memory consumption. In particular, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
{"title":"LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture","authors":"Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang","doi":"arxiv-2409.02889","DOIUrl":"https://doi.org/arxiv-2409.02889","url":null,"abstract":"Expanding the long-context capabilities of Multi-modal Large Language\u0000Models~(MLLMs) is crucial for video understanding, high-resolution image\u0000understanding, and multi-modal agents. This involves a series of systematic\u0000optimizations, including model architecture, data construction and training\u0000strategy, particularly addressing challenges such as textit{degraded\u0000performance with more images} and textit{high computational costs}. In this\u0000paper, we adapt the model architecture to a hybrid of Mamba and Transformer\u0000blocks, approach data construction with both temporal and spatial dependencies\u0000among multiple images and employ a progressive training strategy. The released\u0000model textbf{LongLLaVA}~(textbf{Long}-Context textbf{L}arge\u0000textbf{L}anguage textbf{a}nd textbf{V}ision textbf{A}ssistant) is the first\u0000hybrid MLLM, which achieved a better balance between efficiency and\u0000effectiveness. LongLLaVA not only achieves competitive results across various\u0000benchmarks, but also maintains high throughput and low memory consumption.\u0000Especially, it could process nearly a thousand images on a single A100 80GB\u0000GPU, showing promising application prospects for a wide range of tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua
Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.
{"title":"ExpLLM: Towards Chain of Thought for Facial Expression Recognition","authors":"Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua","doi":"arxiv-2409.02828","DOIUrl":"https://doi.org/arxiv-2409.02828","url":null,"abstract":"Facial expression recognition (FER) is a critical task in multimedia with\u0000significant implications across various domains. However, analyzing the causes\u0000of facial expressions is essential for accurately recognizing them. Current\u0000approaches, such as those based on facial action units (AUs), typically provide\u0000AU names and intensities but lack insight into the interactions and\u0000relationships between AUs and the overall expression. In this paper, we propose\u0000a novel method called ExpLLM, which leverages large language models to generate\u0000an accurate chain of thought (CoT) for facial expression recognition.\u0000Specifically, we have designed the CoT mechanism from three key perspectives:\u0000key observations, overall emotional interpretation, and conclusion. The key\u0000observations describe the AU's name, intensity, and associated emotions. The\u0000overall emotional interpretation provides an analysis based on multiple AUs and\u0000their interactions, identifying the dominant emotions and their relationships.\u0000Finally, the conclusion presents the final expression label derived from the\u0000preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed\u0000to construct this expression CoT and generate instruction-description data for\u0000training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets\u0000demonstrate that ExpLLM outperforms current state-of-the-art FER methods.\u0000ExpLLM also surpasses the latest GPT-4o in expression CoT generation,\u0000particularly in recognizing micro-expressions where GPT-4o frequently fails.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, CoarseNet and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses, and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low to high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realism than text-only or audio-only conditioning, and our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
{"title":"PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation","authors":"Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song","doi":"arxiv-2409.02657","DOIUrl":"https://doi.org/arxiv-2409.02657","url":null,"abstract":"While previous audio-driven talking head generation (THG) methods generate\u0000head poses from driving audio, the generated poses or lips cannot match the\u0000audio well or are not editable. In this study, we propose textbf{PoseTalk}, a\u0000THG system that can freely generate lip-synchronized talking head videos with\u0000free head poses conditioned on text prompts and audio. The core insight of our\u0000method is using head pose to connect visual, linguistic, and audio signals.\u0000First, we propose to generate poses from both audio and text prompts, where the\u0000audio offers short-term variations and rhythm correspondence of the head\u0000movements and the text prompts describe the long-term semantics of head\u0000motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to\u0000generate motion latent from text prompts and audio cues in a pose latent space.\u0000Second, we observe a loss-imbalance problem: the loss for the lip region\u0000contributes less than 4% of the total reconstruction loss caused by both pose\u0000and lip, making optimization lean towards head movements rather than lip\u0000shapes. To address this issue, we propose a refinement-based learning strategy\u0000to synthesize natural talking videos using two cascaded networks, i.e.,\u0000CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce\u0000animated images in novel poses and the RefineNet focuses on learning finer lip\u0000motions by progressively estimating lip motions from low-to-high resolutions,\u0000yielding improved lip-synchronization performance. Experiments demonstrate our\u0000pose prediction strategy achieves better pose diversity and realness compared\u0000to text-only or audio-only, and our video generator model outperforms\u0000state-of-the-art methods in synthesizing talking videos with natural head\u0000motions. Project: https://junleen.github.io/projects/posetalk.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recognizing objects in low-resolution images is challenging due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to situations where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model that delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student's learning is supervised with a contrastive relational distillation loss, which preserves the similarities of various relational structures in the contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
{"title":"Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation","authors":"Kangkai Zhang, Shiming Ge, Ruixin Shi, Dan Zeng","doi":"arxiv-2409.02555","DOIUrl":"https://doi.org/arxiv-2409.02555","url":null,"abstract":"Recognizing objects in low-resolution images is a challenging task due to the\u0000lack of informative details. Recent studies have shown that knowledge\u0000distillation approaches can effectively transfer knowledge from a\u0000high-resolution teacher model to a low-resolution student model by aligning\u0000cross-resolution representations. However, these approaches still face\u0000limitations in adapting to the situation where the recognized objects exhibit\u0000significant representation discrepancies between training and testing images.\u0000In this study, we propose a cross-resolution relational contrastive\u0000distillation approach to facilitate low-resolution object recognition. Our\u0000approach enables the student model to mimic the behavior of a well-trained\u0000teacher model which delivers high accuracy in identifying high-resolution\u0000objects. To extract sufficient knowledge, the student learning is supervised\u0000with contrastive relational distillation loss, which preserves the similarities\u0000in various relational structures in contrastive representation space. In this\u0000manner, the capability of recovering missing details of familiar low-resolution\u0000objects can be effectively enhanced, leading to a better knowledge transfer.\u0000Extensive experiments on low-resolution object classification and\u0000low-resolution face recognition clearly demonstrate the effectiveness and\u0000adaptability of our approach.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Fu (University of the Arts London, Creative Computing Institute, London, United Kingdom), Shun Fu (Bloks Technology Company, Shanghai, China), Mick Grierson (University of the Arts London, Creative Computing Institute, London, United Kingdom)
With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs. The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive "artworks," best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information such as coral species and habitat, enriching the user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.
{"title":"Coral Model Generation from Single Images for Virtual Reality Applications","authors":"Jie FuUniversity of the Arts London, Creative Computing Institute, London, United Kingdom, Shun FuBloks Technology Company, Shanghai, China, Mick GriersonUniversity of the Arts London, Creative Computing Institute, London, United Kingdom","doi":"arxiv-2409.02376","DOIUrl":"https://doi.org/arxiv-2409.02376","url":null,"abstract":"With the rapid development of VR technology, the demand for high-quality 3D\u0000models is increasing. Traditional methods struggle with efficiency and quality\u0000in large-scale customization. This paper introduces a deep-learning framework\u0000that generates high-precision 3D coral models from a single image. Using the\u0000Coral dataset, the framework extracts geometric and texture features, performs\u00003D reconstruction, and optimizes design and material blending. Advanced\u0000optimization and polygon count control ensure shape accuracy, detail retention,\u0000and flexible output for various complexities, catering to high-quality\u0000rendering and real-time interaction needs.The project incorporates Explainable\u0000AI (XAI) to transform AI-generated models into interactive \"artworks,\" best\u0000viewed in VR and XR. This enhances model interpretability and human-machine\u0000collaboration. Real-time feedback in VR interactions displays information like\u0000coral species and habitat, enriching user experience. The generated models\u0000surpass traditional methods in detail, visual quality, and efficiency. This\u0000research offers an intelligent approach to 3D content creation for VR, lowering\u0000production barriers, and promoting widespread VR applications. Additionally,\u0000integrating XAI provides new insights into AI-generated visual content and\u0000advances research in 3D vision interpretability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}