Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection (arXiv:2408.02901, 2024-08-06)
Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu
We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers have proposed various MR-HD approaches, the research community faces two main issues. The first is a lack of comprehensive and reproducible experiments across methods, datasets, and video-text features, because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design: because previous works rely on different libraries, researchers must set up a separate environment for each, and most works release only training code, leaving users to implement the entire MR-HD inference pipeline themselves. Lighthouse addresses these issues with a unified, reproducible codebase that covers six models, three feature types, and five datasets. It also provides an inference API and a web demo that make these methods easily accessible to researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the scores reported in the reference papers. The code is available at https://github.com/line/lighthouse.
{"title":"Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection","authors":"Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu","doi":"arxiv-2408.02901","DOIUrl":"https://doi.org/arxiv-2408.02901","url":null,"abstract":"We propose Lighthouse, a user-friendly library for reproducible video moment\u0000retrieval and highlight detection (MR-HD). Although researchers proposed\u0000various MR-HD approaches, the research community holds two main issues. The\u0000first is a lack of comprehensive and reproducible experiments across various\u0000methods, datasets, and video-text features. This is because no unified training\u0000and evaluation codebase covers multiple settings. The second is user-unfriendly\u0000design. Because previous works use different libraries, researchers set up\u0000individual environments. In addition, most works release only the training\u0000codes, requiring users to implement the whole inference process of MR-HD.\u0000Lighthouse addresses these issues by implementing a unified reproducible\u0000codebase that includes six models, three features, and five datasets. In\u0000addition, it provides an inference API and web demo to make these methods\u0000easily accessible for researchers and developers. Our experiments demonstrate\u0000that Lighthouse generally reproduces the reported scores in the reference\u0000papers. The code is available at https://github.com/line/lighthouse.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving (arXiv:2408.03185, 2024-08-06)
Babajide Alamu Owoyele, Martin Schilling, Rohan Sawahn, Niklas Kaemer, Pavel Zherebenkov, Bhuvanesh Verma, Wim Pouw, Gerard de Melo
This paper introduces MaskAnyone, a novel toolkit designed to navigate some privacy and ethical concerns of sharing audio-visual data in research. MaskAnyone offers a scalable, user-friendly solution for de-identifying individuals in video and audio content through face-swapping and voice alteration, supporting multi-person masking and real-time bulk processing. By integrating this tool within research practices, we aim to enhance data reproducibility and utility in social science research. Our approach draws on Design Science Research, proposing that MaskAnyone can facilitate safer data sharing and potentially reduce the storage of fully identifiable data. We discuss the development and capabilities of MaskAnyone, explore its integration into ethical research practices, and consider the broader implications of audio-visual data masking, including issues of consent and the risk of misuse. The paper concludes with a preliminary evaluation framework for assessing the effectiveness and ethical integration of masking tools in such research settings.
{"title":"MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving","authors":"Babajide Alamu Owoyele, Martin Schilling, Rohan Sawahn, Niklas Kaemer, Pavel Zherebenkov, Bhuvanesh Verma, Wim Pouw, Gerard de Melo","doi":"arxiv-2408.03185","DOIUrl":"https://doi.org/arxiv-2408.03185","url":null,"abstract":"This paper introduces MaskAnyone, a novel toolkit designed to navigate some\u0000privacy and ethical concerns of sharing audio-visual data in research.\u0000MaskAnyone offers a scalable, user-friendly solution for de-identifying\u0000individuals in video and audio content through face-swapping and voice\u0000alteration, supporting multi-person masking and real-time bulk processing. By\u0000integrating this tool within research practices, we aim to enhance data\u0000reproducibility and utility in social science research. Our approach draws on\u0000Design Science Research, proposing that MaskAnyone can facilitate safer data\u0000sharing and potentially reduce the storage of fully identifiable data. We\u0000discuss the development and capabilities of MaskAnyone, explore its integration\u0000into ethical research practices, and consider the broader implications of\u0000audio-visual data masking, including issues of consent and the risk of misuse.\u0000The paper concludes with a preliminary evaluation framework for assessing the\u0000effectiveness and ethical integration of masking tools in such research\u0000settings.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer (arXiv:2408.03284, 2024-08-06)
Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long videos for clip-specific training or retain visible artifacts. In this paper, we propose ReSyncer, a unified and effective framework that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply reconfiguring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.
{"title":"ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer","authors":"Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu","doi":"arxiv-2408.03284","DOIUrl":"https://doi.org/arxiv-2408.03284","url":null,"abstract":"Lip-syncing videos with given audio is the foundation for various\u0000applications including the creation of virtual presenters or performers. While\u0000recent studies explore high-fidelity lip-sync with different techniques, their\u0000task-orientated models either require long-term videos for clip-specific\u0000training or retain visible artifacts. In this paper, we propose a unified and\u0000effective framework ReSyncer, that synchronizes generalized audio-visual facial\u0000information. The key design is revisiting and rewiring the Style-based\u0000generator to efficiently adopt 3D facial dynamics predicted by a principled\u0000style-injected Transformer. By simply re-configuring the information insertion\u0000mechanisms within the noise and style space, our framework fuses motion and\u0000appearance with unified training. Extensive experiments demonstrate that\u0000ReSyncer not only produces high-fidelity lip-synced videos according to audio,\u0000but also supports multiple appealing properties that are suitable for creating\u0000virtual presenters and performers, including fast personalized fine-tuning,\u0000video-driven lip-syncing, the transfer of speaking styles, and even face\u0000swapping. Resources can be found at\u0000https://guanjz20.github.io/projects/ReSyncer.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multitask and Multimodal Neural Tuning for Large Models (arXiv:2408.03001, 2024-08-06)
Hao Sun, Yu Song, Jihong Hu, Yen-Wei Chen, Lanfen Lin
In recent years, large-scale multimodal models have demonstrated impressive capabilities across various domains. However, enabling these models to effectively perform multiple multimodal tasks simultaneously remains a significant challenge. To address this, we introduce a novel tuning method called neural tuning, designed to handle diverse multimodal tasks concurrently, including reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. Neural tuning emulates sparse distributed representation in the human brain, where only specific subsets of neurons are activated for each task. Additionally, we present a new benchmark, MMUD, where each sample is annotated with multiple task labels. By applying neural tuning to pretrained large models on the MMUD benchmark, we achieve simultaneous task handling in a streamlined and efficient manner. All models, code, and datasets will be made publicly available upon publication, facilitating further research and development in this field.
{"title":"Multitask and Multimodal Neural Tuning for Large Models","authors":"Hao Sun, Yu Song, Jihong Hu, Yen-Wei Chen, Lanfen Lin","doi":"arxiv-2408.03001","DOIUrl":"https://doi.org/arxiv-2408.03001","url":null,"abstract":"In recent years, large-scale multimodal models have demonstrated impressive\u0000capabilities across various domains. However, enabling these models to\u0000effectively perform multiple multimodal tasks simultaneously remains a\u0000significant challenge. To address this, we introduce a novel tuning method\u0000called neural tuning, designed to handle diverse multimodal tasks concurrently,\u0000including reasoning segmentation, referring segmentation, image captioning, and\u0000text-to-image generation. Neural tuning emulates sparse distributed\u0000representation in human brain, where only specific subsets of neurons are\u0000activated for each task. Additionally, we present a new benchmark, MMUD, where\u0000each sample is annotated with multiple task labels. By applying neural tuning\u0000to pretrained large models on the MMUD benchmark, we achieve simultaneous task\u0000handling in a streamlined and efficient manner. All models, code, and datasets\u0000will be publicly available after publication, facilitating further research and\u0000development in this field.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark (arXiv:2408.02272, 2024-08-05)
Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku
Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data; consequently, existing works often use web videos as training resources, making it challenging to query instructional content from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants prepared food based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. On top of this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verify the capabilities and limitations of current web-video-based SOTA methods on these tasks.
{"title":"COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark","authors":"Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku","doi":"arxiv-2408.02272","DOIUrl":"https://doi.org/arxiv-2408.02272","url":null,"abstract":"Procedural video understanding is gaining attention in the vision and\u0000language community. Deep learning-based video analysis requires extensive data.\u0000Consequently, existing works often use web videos as training resources, making\u0000it challenging to query instructional contents from raw video observations. To\u0000address this issue, we propose a new dataset, COM Kitchens. The dataset\u0000consists of unedited overhead-view videos captured by smartphones, in which\u0000participants performed food preparation based on given recipes. Fixed-viewpoint\u0000video datasets often lack environmental diversity due to high camera setup\u0000costs. We used modern wide-angle smartphone lenses to cover cooking counters\u0000from sink to cooktop in an overhead view, capturing activity without in-person\u0000assistance. With this setup, we collected a diverse dataset by distributing\u0000smartphones to participants. With this dataset, we propose the novel\u0000video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video\u0000captioning domain Dense Video Captioning on unedited Overhead-View videos\u0000(DVC-OV). Our experiments verified the capabilities and limitations of current\u0000web-video-based SOTA methods in handling these tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"467 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple Contexts and Frequencies Aggregation Network for Deepfake Detection (arXiv:2408.01668, 2024-08-03)
Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang
Deepfake detection faces increasing challenges as the rapid growth of generative models yields massive and diverse Deepfake technologies. Recent advances rely on introducing heuristic features from the spatial or frequency domains rather than modeling general forgery features within backbones. To address this issue, we turn to backbone design with two intuitive priors from spatial and frequency detectors, i.e., learning robust spatial attributes and frequency distributions that are discriminative for real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions to model subtle facial differences between real and fake faces. For the frequency components, we propose a Multi-Frequency Aggregator that processes different frequency bands by adaptively reweighting high-frequency and low-frequency features. Comprehensive experiments on seven popular deepfake detection benchmarks demonstrate that our MkfaNet variants achieve superior performance in both within-domain and cross-domain evaluations with impressive parameter efficiency.
{"title":"Multiple Contexts and Frequencies Aggregation Network forDeepfake Detection","authors":"Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang","doi":"arxiv-2408.01668","DOIUrl":"https://doi.org/arxiv-2408.01668","url":null,"abstract":"Deepfake detection faces increasing challenges since the fast growth of\u0000generative models in developing massive and diverse Deepfake technologies.\u0000Recent advances rely on introducing heuristic features from spatial or\u0000frequency domains rather than modeling general forgery features within\u0000backbones. To address this issue, we turn to the backbone design with two\u0000intuitive priors from spatial and frequency detectors, textit{i.e.,} learning\u0000robust spatial attributes and frequency distributions that are discriminative\u0000for real and fake samples. To this end, we propose an efficient network for\u0000face forgery detection named MkfaNet, which consists of two core modules. For\u0000spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects\u0000organ features extracted by multiple convolutions for modeling subtle facial\u0000differences between real and fake faces. For the frequency components, we\u0000propose a Multi-Frequency Aggregator to process different bands of frequency\u0000components by adaptively reweighing high-frequency and low-frequency features.\u0000Comprehensive experiments on seven popular deepfake detection benchmarks\u0000demonstrate that our proposed MkfaNet variants achieve superior performances in\u0000both within-domain and across-domain evaluations with impressive efficiency of\u0000parameter usage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph (arXiv:2408.01679, 2024-08-03)
Xuan Yi, Yanzeng Li, Lei Zou
Multi-modal knowledge graphs have emerged as a powerful approach for information representation, combining data from different modalities such as text, images, and videos. While several such graphs have been constructed and have played important roles in applications like visual question answering and recommendation systems, challenges persist in their development. These include the scarcity of high-quality Chinese knowledge graphs and limited domain coverage in existing multi-modal knowledge graphs. This paper introduces MMPKUBase, a robust and extensive Chinese multi-modal knowledge graph that covers diverse domains, including birds, mammals, ferns, and more, comprising over 50,000 entities and over 1 million filtered images. To ensure data quality, we employ Prototypical Contrastive Learning and the Isolation Forest algorithm to refine the image data. Additionally, we have developed a user-friendly platform to facilitate image attribute exploration.
{"title":"MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph","authors":"Xuan Yi, Yanzeng Li, Lei Zou","doi":"arxiv-2408.01679","DOIUrl":"https://doi.org/arxiv-2408.01679","url":null,"abstract":"Multi-modal knowledge graphs have emerged as a powerful approach for\u0000information representation, combining data from different modalities such as\u0000text, images, and videos. While several such graphs have been constructed and\u0000have played important roles in applications like visual question answering and\u0000recommendation systems, challenges persist in their development. These include\u0000the scarcity of high-quality Chinese knowledge graphs and limited domain\u0000coverage in existing multi-modal knowledge graphs. This paper introduces\u0000MMPKUBase, a robust and extensive Chinese multi-modal knowledge graph that\u0000covers diverse domains, including birds, mammals, ferns, and more, comprising\u0000over 50,000 entities and over 1 million filtered images. To ensure data\u0000quality, we employ Prototypical Contrastive Learning and the Isolation Forest\u0000algorithm to refine the image data. Additionally, we have developed a\u0000user-friendly platform to facilitate image attribute exploration.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection (arXiv:2408.01690, 2024-08-03)
Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. Training accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields such as portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, we introduce a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from 10 U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitate the generation of camera and video captures of identity documents, and support testing of schema unification and other identity document management functionalities.
{"title":"IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection","authors":"Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou","doi":"arxiv-2408.01690","DOIUrl":"https://doi.org/arxiv-2408.01690","url":null,"abstract":"Effective fraud detection and analysis of government-issued identity\u0000documents, such as passports, driver's licenses, and identity cards, are\u0000essential in thwarting identity theft and bolstering security on online\u0000platforms. The training of accurate fraud detection and analysis tools depends\u0000on the availability of extensive identity document datasets. However, current\u0000publicly available benchmark datasets for identity document analysis, including\u0000MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a\u0000limited number of samples, cover insufficient varieties of fraud patterns, and\u0000seldom include alterations in critical personal identifying fields like\u0000portrait images, limiting their utility in training models capable of detecting\u0000realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark\u0000dataset, IDNet, designed to advance privacy-preserving fraud detection efforts.\u0000The IDNet dataset comprises 837,060 images of synthetically generated identity\u0000documents, totaling approximately 490 gigabytes, categorized into 20 types from\u0000$10$ U.S. states and 10 European countries. We evaluate the utility and present\u0000use cases of the dataset, illustrating how it can aid in training\u0000privacy-preserving fraud detection methods, facilitating the generation of\u0000camera and video capturing of identity documents, and testing schema\u0000unification and other identity document management functionalities.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music2P: A Multi-Modal AI-Driven Tool for Simplifying Album Cover Design (arXiv:2408.01651, 2024-08-03)
Joong Ho Choi, Geonyeong Choi, Ji-Eun Han, Wonjin Yang, Zhi-Qi Cheng
In today's music industry, album cover design is as crucial as the music itself, reflecting the artist's vision and brand. However, many AI-driven album cover services require subscriptions or technical expertise, limiting accessibility. To address these challenges, we developed Music2P, an open-source, multi-modal AI-driven tool that streamlines album cover creation, making it efficient, accessible, and cost-effective through Ngrok. Music2P automates the design process using techniques such as Bootstrapping Language Image Pre-training (BLIP), music-to-text conversion (LP-music-caps), image segmentation (LoRA), and album cover and QR code generation (ControlNet). This paper demonstrates the Music2P interface, details our application of these technologies, and outlines future improvements. Our ultimate goal is to provide a tool that empowers musicians and producers, especially those with limited resources or expertise, to create compelling album covers.
{"title":"Music2P: A Multi-Modal AI-Driven Tool for Simplifying Album Cover Design","authors":"Joong Ho Choi, Geonyeong Choi, Ji-Eun Han, Wonjin Yang, Zhi-Qi Cheng","doi":"arxiv-2408.01651","DOIUrl":"https://doi.org/arxiv-2408.01651","url":null,"abstract":"In today's music industry, album cover design is as crucial as the music\u0000itself, reflecting the artist's vision and brand. However, many AI-driven album\u0000cover services require subscriptions or technical expertise, limiting\u0000accessibility. To address these challenges, we developed Music2P, an\u0000open-source, multi-modal AI-driven tool that streamlines album cover creation,\u0000making it efficient, accessible, and cost-effective through Ngrok. Music2P\u0000automates the design process using techniques such as Bootstrapping Language\u0000Image Pre-training (BLIP), music-to-text conversion (LP-music-caps), image\u0000segmentation (LoRA), and album cover and QR code generation (ControlNet). This\u0000paper demonstrates the Music2P interface, details our application of these\u0000technologies, and outlines future improvements. Our ultimate goal is to provide\u0000a tool that empowers musicians and producers, especially those with limited\u0000resources or expertise, to create compelling album covers.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses (arXiv:2408.01669, 2024-08-03)
Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu
Video grounding is a fundamental problem in multimodal content understanding, aiming to localize specific natural language queries in an untrimmed video. However, current video grounding datasets focus on simple events and are limited to either shorter videos or brief sentences, which hinders models from evolving toward stronger multimodal understanding capabilities. To address these limitations, we present SynopGround, a large-scale video grounding dataset in which more than 2800 hours of videos sourced from popular TV dramas are paired with accurately localized, human-written synopses. Each paragraph in a synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated with each other and contain a wealth of abstract expressions summarizing video storylines as well as specific descriptions portraying event details, which enables models to learn multimodal perception of more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex video grounding setting dubbed Multi-Paragraph Video Grounding (MPVG), which takes multiple paragraphs and a long video as input and grounds each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority over prior state-of-the-art methods in long-term multi-paragraph video grounding. The dataset and code are publicly available. Project page: https://synopground.github.io/.
{"title":"SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses","authors":"Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu","doi":"arxiv-2408.01669","DOIUrl":"https://doi.org/arxiv-2408.01669","url":null,"abstract":"Video grounding is a fundamental problem in multimodal content understanding,\u0000aiming to localize specific natural language queries in an untrimmed video.\u0000However, current video grounding datasets merely focus on simple events and are\u0000either limited to shorter videos or brief sentences, which hinders the model\u0000from evolving toward stronger multimodal understanding capabilities. To address\u0000these limitations, we present a large-scale video grounding dataset named\u0000SynopGround, in which more than 2800 hours of videos are sourced from popular\u0000TV dramas and are paired with accurately localized human-written synopses. Each\u0000paragraph in the synopsis serves as a language query and is manually annotated\u0000with precise temporal boundaries in the long video. These paragraph queries are\u0000tightly correlated to each other and contain a wealth of abstract expressions\u0000summarizing video storylines and specific descriptions portraying event\u0000details, which enables the model to learn multimodal perception on more\u0000intricate concepts over longer context dependencies. Based on the dataset, we\u0000further introduce a more complex setting of video grounding dubbed\u0000Multi-Paragraph Video Grounding (MPVG), which takes as input multiple\u0000paragraphs and a long video for grounding each paragraph query to its temporal\u0000interval. In addition, we propose a novel Local-Global Multimodal Reasoner\u0000(LGMR) to explicitly model the local-global structures of long-term multimodal\u0000inputs for MPVG. Our method provides an effective baseline solution to the\u0000multi-paragraph video grounding problem. Extensive experiments verify the\u0000proposed model's effectiveness as well as its superiority in long-term\u0000multi-paragraph video grounding over prior state-of-the-arts. Dataset and code\u0000are publicly available. Project page: https://synopground.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}