Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection (arXiv:2408.02901, 2024-08-06)
Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu
We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers have proposed various MR-HD approaches, the research community faces two main issues. The first is a lack of comprehensive and reproducible experiments across methods, datasets, and video-text features, because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design: because previous works rely on different libraries, researchers must set up a separate environment for each, and most works release only training code, leaving users to implement the entire MR-HD inference pipeline themselves. Lighthouse addresses these issues with a unified, reproducible codebase that covers six models, three feature types, and five datasets. It also provides an inference API and a web demo that make these methods easily accessible to researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the scores reported in the reference papers. The code is available at https://github.com/line/lighthouse.
{"title":"Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection","authors":"Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu","doi":"arxiv-2408.02901","DOIUrl":"https://doi.org/arxiv-2408.02901","url":null,"abstract":"We propose Lighthouse, a user-friendly library for reproducible video moment\u0000retrieval and highlight detection (MR-HD). Although researchers proposed\u0000various MR-HD approaches, the research community holds two main issues. The\u0000first is a lack of comprehensive and reproducible experiments across various\u0000methods, datasets, and video-text features. This is because no unified training\u0000and evaluation codebase covers multiple settings. The second is user-unfriendly\u0000design. Because previous works use different libraries, researchers set up\u0000individual environments. In addition, most works release only the training\u0000codes, requiring users to implement the whole inference process of MR-HD.\u0000Lighthouse addresses these issues by implementing a unified reproducible\u0000codebase that includes six models, three features, and five datasets. In\u0000addition, it provides an inference API and web demo to make these methods\u0000easily accessible for researchers and developers. Our experiments demonstrate\u0000that Lighthouse generally reproduces the reported scores in the reference\u0000papers. The code is available at https://github.com/line/lighthouse.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving (arXiv:2408.03185, 2024-08-06)
Babajide Alamu Owoyele, Martin Schilling, Rohan Sawahn, Niklas Kaemer, Pavel Zherebenkov, Bhuvanesh Verma, Wim Pouw, Gerard de Melo
This paper introduces MaskAnyone, a novel toolkit designed to navigate some privacy and ethical concerns of sharing audio-visual data in research. MaskAnyone offers a scalable, user-friendly solution for de-identifying individuals in video and audio content through face-swapping and voice alteration, supporting multi-person masking and real-time bulk processing. By integrating this tool within research practices, we aim to enhance data reproducibility and utility in social science research. Our approach draws on Design Science Research, proposing that MaskAnyone can facilitate safer data sharing and potentially reduce the storage of fully identifiable data. We discuss the development and capabilities of MaskAnyone, explore its integration into ethical research practices, and consider the broader implications of audio-visual data masking, including issues of consent and the risk of misuse. The paper concludes with a preliminary evaluation framework for assessing the effectiveness and ethical integration of masking tools in such research settings.
{"title":"MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving","authors":"Babajide Alamu Owoyele, Martin Schilling, Rohan Sawahn, Niklas Kaemer, Pavel Zherebenkov, Bhuvanesh Verma, Wim Pouw, Gerard de Melo","doi":"arxiv-2408.03185","DOIUrl":"https://doi.org/arxiv-2408.03185","url":null,"abstract":"This paper introduces MaskAnyone, a novel toolkit designed to navigate some\u0000privacy and ethical concerns of sharing audio-visual data in research.\u0000MaskAnyone offers a scalable, user-friendly solution for de-identifying\u0000individuals in video and audio content through face-swapping and voice\u0000alteration, supporting multi-person masking and real-time bulk processing. By\u0000integrating this tool within research practices, we aim to enhance data\u0000reproducibility and utility in social science research. Our approach draws on\u0000Design Science Research, proposing that MaskAnyone can facilitate safer data\u0000sharing and potentially reduce the storage of fully identifiable data. We\u0000discuss the development and capabilities of MaskAnyone, explore its integration\u0000into ethical research practices, and consider the broader implications of\u0000audio-visual data masking, including issues of consent and the risk of misuse.\u0000The paper concludes with a preliminary evaluation framework for assessing the\u0000effectiveness and ethical integration of masking tools in such research\u0000settings.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer (arXiv:2408.03284, 2024-08-06)
Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models either require long videos for clip-specific training or retain visible artifacts. In this paper, we propose ReSyncer, a unified and effective framework that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply reconfiguring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.
{"title":"ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer","authors":"Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu","doi":"arxiv-2408.03284","DOIUrl":"https://doi.org/arxiv-2408.03284","url":null,"abstract":"Lip-syncing videos with given audio is the foundation for various\u0000applications including the creation of virtual presenters or performers. While\u0000recent studies explore high-fidelity lip-sync with different techniques, their\u0000task-orientated models either require long-term videos for clip-specific\u0000training or retain visible artifacts. In this paper, we propose a unified and\u0000effective framework ReSyncer, that synchronizes generalized audio-visual facial\u0000information. The key design is revisiting and rewiring the Style-based\u0000generator to efficiently adopt 3D facial dynamics predicted by a principled\u0000style-injected Transformer. By simply re-configuring the information insertion\u0000mechanisms within the noise and style space, our framework fuses motion and\u0000appearance with unified training. Extensive experiments demonstrate that\u0000ReSyncer not only produces high-fidelity lip-synced videos according to audio,\u0000but also supports multiple appealing properties that are suitable for creating\u0000virtual presenters and performers, including fast personalized fine-tuning,\u0000video-driven lip-syncing, the transfer of speaking styles, and even face\u0000swapping. Resources can be found at\u0000https://guanjz20.github.io/projects/ReSyncer.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multitask and Multimodal Neural Tuning for Large Models (arXiv:2408.03001, 2024-08-06)
Hao Sun, Yu Song, Jihong Hu, Yen-Wei Chen, Lanfen Lin
In recent years, large-scale multimodal models have demonstrated impressive capabilities across various domains. However, enabling these models to effectively perform multiple multimodal tasks simultaneously remains a significant challenge. To address this, we introduce a novel tuning method called neural tuning, designed to handle diverse multimodal tasks concurrently, including reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. Neural tuning emulates sparse distributed representation in the human brain, where only specific subsets of neurons are activated for each task. Additionally, we present a new benchmark, MMUD, where each sample is annotated with multiple task labels. By applying neural tuning to pretrained large models on the MMUD benchmark, we achieve simultaneous task handling in a streamlined and efficient manner. All models, code, and datasets will be made publicly available upon publication, facilitating further research and development in this field.
{"title":"Multitask and Multimodal Neural Tuning for Large Models","authors":"Hao Sun, Yu Song, Jihong Hu, Yen-Wei Chen, Lanfen Lin","doi":"arxiv-2408.03001","DOIUrl":"https://doi.org/arxiv-2408.03001","url":null,"abstract":"In recent years, large-scale multimodal models have demonstrated impressive\u0000capabilities across various domains. However, enabling these models to\u0000effectively perform multiple multimodal tasks simultaneously remains a\u0000significant challenge. To address this, we introduce a novel tuning method\u0000called neural tuning, designed to handle diverse multimodal tasks concurrently,\u0000including reasoning segmentation, referring segmentation, image captioning, and\u0000text-to-image generation. Neural tuning emulates sparse distributed\u0000representation in human brain, where only specific subsets of neurons are\u0000activated for each task. Additionally, we present a new benchmark, MMUD, where\u0000each sample is annotated with multiple task labels. By applying neural tuning\u0000to pretrained large models on the MMUD benchmark, we achieve simultaneous task\u0000handling in a streamlined and efficient manner. All models, code, and datasets\u0000will be publicly available after publication, facilitating further research and\u0000development in this field.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark (arXiv:2408.02272, 2024-08-05)
Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku
Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data; consequently, existing works often use web videos as training resources, making it challenging to query instructional content from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants prepared food based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. On top of this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verify the capabilities and limitations of current web-video-based SOTA methods on these tasks.
{"title":"COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark","authors":"Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku","doi":"arxiv-2408.02272","DOIUrl":"https://doi.org/arxiv-2408.02272","url":null,"abstract":"Procedural video understanding is gaining attention in the vision and\u0000language community. Deep learning-based video analysis requires extensive data.\u0000Consequently, existing works often use web videos as training resources, making\u0000it challenging to query instructional contents from raw video observations. To\u0000address this issue, we propose a new dataset, COM Kitchens. The dataset\u0000consists of unedited overhead-view videos captured by smartphones, in which\u0000participants performed food preparation based on given recipes. Fixed-viewpoint\u0000video datasets often lack environmental diversity due to high camera setup\u0000costs. We used modern wide-angle smartphone lenses to cover cooking counters\u0000from sink to cooktop in an overhead view, capturing activity without in-person\u0000assistance. With this setup, we collected a diverse dataset by distributing\u0000smartphones to participants. With this dataset, we propose the novel\u0000video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video\u0000captioning domain Dense Video Captioning on unedited Overhead-View videos\u0000(DVC-OV). Our experiments verified the capabilities and limitations of current\u0000web-video-based SOTA methods in handling these tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"467 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple Contexts and Frequencies Aggregation Network for Deepfake Detection (arXiv:2408.01668, 2024-08-03)
Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang
Deepfake detection faces increasing challenges as the rapid growth of generative models yields massive and diverse Deepfake technologies. Recent advances rely on introducing heuristic features from the spatial or frequency domains rather than modeling general forgery features within backbones. To address this issue, we turn to backbone design with two intuitive priors from spatial and frequency detectors, i.e., learning robust spatial attributes and frequency distributions that are discriminative for real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions to model subtle facial differences between real and fake faces. For the frequency components, we propose a Multi-Frequency Aggregator that processes different frequency bands by adaptively reweighting high-frequency and low-frequency features. Comprehensive experiments on seven popular deepfake detection benchmarks demonstrate that our MkfaNet variants achieve superior performance in both within-domain and cross-domain evaluations with impressive parameter efficiency.
{"title":"Multiple Contexts and Frequencies Aggregation Network forDeepfake Detection","authors":"Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang","doi":"arxiv-2408.01668","DOIUrl":"https://doi.org/arxiv-2408.01668","url":null,"abstract":"Deepfake detection faces increasing challenges since the fast growth of\u0000generative models in developing massive and diverse Deepfake technologies.\u0000Recent advances rely on introducing heuristic features from spatial or\u0000frequency domains rather than modeling general forgery features within\u0000backbones. To address this issue, we turn to the backbone design with two\u0000intuitive priors from spatial and frequency detectors, textit{i.e.,} learning\u0000robust spatial attributes and frequency distributions that are discriminative\u0000for real and fake samples. To this end, we propose an efficient network for\u0000face forgery detection named MkfaNet, which consists of two core modules. For\u0000spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects\u0000organ features extracted by multiple convolutions for modeling subtle facial\u0000differences between real and fake faces. For the frequency components, we\u0000propose a Multi-Frequency Aggregator to process different bands of frequency\u0000components by adaptively reweighing high-frequency and low-frequency features.\u0000Comprehensive experiments on seven popular deepfake detection benchmarks\u0000demonstrate that our proposed MkfaNet variants achieve superior performances in\u0000both within-domain and across-domain evaluations with impressive efficiency of\u0000parameter usage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph (arXiv:2408.01679, 2024-08-03)
Xuan Yi, Yanzeng Li, Lei Zou
Multi-modal knowledge graphs have emerged as a powerful approach for information representation, combining data from different modalities such as text, images, and videos. While several such graphs have been constructed and have played important roles in applications like visual question answering and recommendation systems, challenges persist in their development. These include the scarcity of high-quality Chinese knowledge graphs and limited domain coverage in existing multi-modal knowledge graphs. This paper introduces MMPKUBase, a robust and extensive Chinese multi-modal knowledge graph that covers diverse domains, including birds, mammals, ferns, and more, comprising over 50,000 entities and over 1 million filtered images. To ensure data quality, we employ Prototypical Contrastive Learning and the Isolation Forest algorithm to refine the image data. Additionally, we have developed a user-friendly platform to facilitate image attribute exploration.
{"title":"MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph","authors":"Xuan Yi, Yanzeng Li, Lei Zou","doi":"arxiv-2408.01679","DOIUrl":"https://doi.org/arxiv-2408.01679","url":null,"abstract":"Multi-modal knowledge graphs have emerged as a powerful approach for\u0000information representation, combining data from different modalities such as\u0000text, images, and videos. While several such graphs have been constructed and\u0000have played important roles in applications like visual question answering and\u0000recommendation systems, challenges persist in their development. These include\u0000the scarcity of high-quality Chinese knowledge graphs and limited domain\u0000coverage in existing multi-modal knowledge graphs. This paper introduces\u0000MMPKUBase, a robust and extensive Chinese multi-modal knowledge graph that\u0000covers diverse domains, including birds, mammals, ferns, and more, comprising\u0000over 50,000 entities and over 1 million filtered images. To ensure data\u0000quality, we employ Prototypical Contrastive Learning and the Isolation Forest\u0000algorithm to refine the image data. Additionally, we have developed a\u0000user-friendly platform to facilitate image attribute exploration.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection (arXiv:2408.01690, 2024-08-03)
Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. Training accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields such as portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, we introduce a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from 10 U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitate the generation of camera and video captures of identity documents, and support testing of schema unification and other identity document management functionalities.
{"title":"IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection","authors":"Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou","doi":"arxiv-2408.01690","DOIUrl":"https://doi.org/arxiv-2408.01690","url":null,"abstract":"Effective fraud detection and analysis of government-issued identity\u0000documents, such as passports, driver's licenses, and identity cards, are\u0000essential in thwarting identity theft and bolstering security on online\u0000platforms. The training of accurate fraud detection and analysis tools depends\u0000on the availability of extensive identity document datasets. However, current\u0000publicly available benchmark datasets for identity document analysis, including\u0000MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a\u0000limited number of samples, cover insufficient varieties of fraud patterns, and\u0000seldom include alterations in critical personal identifying fields like\u0000portrait images, limiting their utility in training models capable of detecting\u0000realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark\u0000dataset, IDNet, designed to advance privacy-preserving fraud detection efforts.\u0000The IDNet dataset comprises 837,060 images of synthetically generated identity\u0000documents, totaling approximately 490 gigabytes, categorized into 20 types from\u0000$10$ U.S. states and 10 European countries. We evaluate the utility and present\u0000use cases of the dataset, illustrating how it can aid in training\u0000privacy-preserving fraud detection methods, facilitating the generation of\u0000camera and video capturing of identity documents, and testing schema\u0000unification and other identity document management functionalities.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music2P: A Multi-Modal AI-Driven Tool for Simplifying Album Cover Design (arXiv:2408.01651, 2024-08-03)
Joong Ho Choi, Geonyeong Choi, Ji-Eun Han, Wonjin Yang, Zhi-Qi Cheng
In today's music industry, album cover design is as crucial as the music itself, reflecting the artist's vision and brand. However, many AI-driven album cover services require subscriptions or technical expertise, limiting accessibility. To address these challenges, we developed Music2P, an open-source, multi-modal AI-driven tool that streamlines album cover creation, making it efficient, accessible, and cost-effective through Ngrok. Music2P automates the design process using techniques such as Bootstrapping Language Image Pre-training (BLIP), music-to-text conversion (LP-music-caps), image segmentation (LoRA), and album cover and QR code generation (ControlNet). This paper demonstrates the Music2P interface, details our application of these technologies, and outlines future improvements. Our ultimate goal is to provide a tool that empowers musicians and producers, especially those with limited resources or expertise, to create compelling album covers.
{"title":"Music2P: A Multi-Modal AI-Driven Tool for Simplifying Album Cover Design","authors":"Joong Ho Choi, Geonyeong Choi, Ji-Eun Han, Wonjin Yang, Zhi-Qi Cheng","doi":"arxiv-2408.01651","DOIUrl":"https://doi.org/arxiv-2408.01651","url":null,"abstract":"In today's music industry, album cover design is as crucial as the music\u0000itself, reflecting the artist's vision and brand. However, many AI-driven album\u0000cover services require subscriptions or technical expertise, limiting\u0000accessibility. To address these challenges, we developed Music2P, an\u0000open-source, multi-modal AI-driven tool that streamlines album cover creation,\u0000making it efficient, accessible, and cost-effective through Ngrok. Music2P\u0000automates the design process using techniques such as Bootstrapping Language\u0000Image Pre-training (BLIP), music-to-text conversion (LP-music-caps), image\u0000segmentation (LoRA), and album cover and QR code generation (ControlNet). This\u0000paper demonstrates the Music2P interface, details our application of these\u0000technologies, and outlines future improvements. Our ultimate goal is to provide\u0000a tool that empowers musicians and producers, especially those with limited\u0000resources or expertise, to create compelling album covers.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses (arXiv:2408.01669, 2024-08-03)
Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu
Video grounding is a fundamental problem in multimodal content understanding, aiming to localize specific natural language queries in an untrimmed video. However, current video grounding datasets focus on simple events and are limited to either shorter videos or brief sentences, which hinders models from evolving toward stronger multimodal understanding capabilities. To address these limitations, we present SynopGround, a large-scale video grounding dataset in which more than 2800 hours of videos sourced from popular TV dramas are paired with accurately localized, human-written synopses. Each paragraph in a synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated with each other and contain a wealth of abstract expressions summarizing video storylines as well as specific descriptions portraying event details, which enables models to learn multimodal perception of more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex video grounding setting dubbed Multi-Paragraph Video Grounding (MPVG), which takes multiple paragraphs and a long video as input and grounds each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority over prior state-of-the-art methods in long-term multi-paragraph video grounding. The dataset and code are publicly available. Project page: https://synopground.github.io/.
{"title":"SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses","authors":"Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu","doi":"arxiv-2408.01669","DOIUrl":"https://doi.org/arxiv-2408.01669","url":null,"abstract":"Video grounding is a fundamental problem in multimodal content understanding,\u0000aiming to localize specific natural language queries in an untrimmed video.\u0000However, current video grounding datasets merely focus on simple events and are\u0000either limited to shorter videos or brief sentences, which hinders the model\u0000from evolving toward stronger multimodal understanding capabilities. To address\u0000these limitations, we present a large-scale video grounding dataset named\u0000SynopGround, in which more than 2800 hours of videos are sourced from popular\u0000TV dramas and are paired with accurately localized human-written synopses. Each\u0000paragraph in the synopsis serves as a language query and is manually annotated\u0000with precise temporal boundaries in the long video. These paragraph queries are\u0000tightly correlated to each other and contain a wealth of abstract expressions\u0000summarizing video storylines and specific descriptions portraying event\u0000details, which enables the model to learn multimodal perception on more\u0000intricate concepts over longer context dependencies. Based on the dataset, we\u0000further introduce a more complex setting of video grounding dubbed\u0000Multi-Paragraph Video Grounding (MPVG), which takes as input multiple\u0000paragraphs and a long video for grounding each paragraph query to its temporal\u0000interval. In addition, we propose a novel Local-Global Multimodal Reasoner\u0000(LGMR) to explicitly model the local-global structures of long-term multimodal\u0000inputs for MPVG. Our method provides an effective baseline solution to the\u0000multi-paragraph video grounding problem. Extensive experiments verify the\u0000proposed model's effectiveness as well as its superiority in long-term\u0000multi-paragraph video grounding over prior state-of-the-arts. Dataset and code\u0000are publicly available. Project page: https://synopground.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}