Xianzhi Zhang, Yipeng Zhou, Di Wu, Quan Z. Sheng, Miao Hu, Linchang Xiao
Online video streaming has evolved into an integral component of the contemporary Internet landscape. Yet, the disclosure of user requests presents formidable privacy challenges. As users stream their preferred online videos, their requests are automatically captured by video content providers, potentially leaking users' privacy. Unfortunately, current protection methods are not well-suited to preserving user request privacy from content providers while maintaining high-quality online video services. To tackle this challenge, we introduce a novel Privacy-Preserving Video Fetching (PPVF) framework, which utilizes trusted edge devices to pre-fetch and cache videos, ensuring the privacy of users' requests while optimizing the efficiency of edge caching. More specifically, we design PPVF with three core components: (1) an online privacy budget scheduler, which employs an online algorithm with theoretical guarantees to select non-requested videos as candidates and assign them privacy budgets, accounting for both video utilities and the available privacy budget; (2) a noisy video request generator, which generates redundant video requests (in addition to the original ones) using correlated differential privacy to obfuscate request privacy; and (3) an online video utility predictor, which leverages federated learning to collaboratively evaluate video utility in an online fashion, aiding video selection in (1) and noise generation in (2). Finally, we conduct extensive experiments using real-world video request traces from Tencent Video. The results demonstrate that PPVF effectively safeguards user request privacy while upholding high video caching performance.
{"title":"PPVF: An Efficient Privacy-Preserving Online Video Fetching Framework with Correlated Differential Privacy","authors":"Xianzhi Zhang, Yipeng Zhou, Di Wu, Quan Z. Sheng, Miao Hu, Linchang Xiao","doi":"arxiv-2408.14735","DOIUrl":"https://doi.org/arxiv-2408.14735","url":null,"abstract":"Online video streaming has evolved into an integral component of the\u0000contemporary Internet landscape. Yet, the disclosure of user requests presents\u0000formidable privacy challenges. As users stream their preferred online videos,\u0000their requests are automatically seized by video content providers, potentially\u0000leaking users' privacy. Unfortunately, current protection methods are not well-suited to preserving\u0000user request privacy from content providers while maintaining high-quality\u0000online video services. To tackle this challenge, we introduce a novel\u0000Privacy-Preserving Video Fetching (PPVF) framework, which utilizes trusted edge\u0000devices to pre-fetch and cache videos, ensuring the privacy of users' requests\u0000while optimizing the efficiency of edge caching. More specifically, we design\u0000PPVF with three core components: (1) textit{Online privacy budget scheduler},\u0000which employs a theoretically guaranteed online algorithm to select\u0000non-requested videos as candidates with assigned privacy budgets. Alternative\u0000videos are chosen by an online algorithm that is theoretically guaranteed to\u0000consider both video utilities and available privacy budgets. (2) textit{Noisy\u0000video request generator}, which generates redundant video requests (in addition\u0000to original ones) utilizing correlated differential privacy to obfuscate\u0000request privacy. (3) textit{Online video utility predictor}, which leverages\u0000federated learning to collaboratively evaluate video utility in an online\u0000fashion, aiding in video selection in (1) and noise generation in (2). Finally,\u0000we conduct extensive experiments using real-world video request traces from\u0000Tencent Video. The results demonstrate that PPVF effectively safeguards user\u0000request privacy while upholding high video caching performance.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
{"title":"Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization","authors":"Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara","doi":"arxiv-2408.14547","DOIUrl":"https://doi.org/arxiv-2408.14547","url":null,"abstract":"The conventional training approach for image captioning involves pre-training\u0000a network using teacher forcing and subsequent fine-tuning with Self-Critical\u0000Sequence Training to maximize hand-crafted captioning metrics. However, when\u0000attempting to optimize modern and higher-quality metrics like CLIP-Score and\u0000PAC-Score, this training method often encounters instability and fails to\u0000acquire the genuine descriptive capabilities needed to produce fluent and\u0000informative captions. In this paper, we propose a new training paradigm termed\u0000Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and\u0000optimizes a reward model that is distilled from a learnable captioning\u0000evaluator with high human correlation. This is done by solving a weighted\u0000classification problem directly inside the captioner. At the same time, DiCO\u0000prevents divergence from the original model, ensuring that fluency is\u0000maintained. DiCO not only exhibits improved stability and enhanced quality in\u0000the generated captions but also aligns more closely with human preferences\u0000compared to existing methods, especially in modern metrics. Additionally, it\u0000maintains competitive performance in traditional metrics. Our source code and\u0000trained models are publicly available at https://github.com/aimagelab/DiCO.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The explosive growth of multimedia content in the digital economy era has brought challenges in content recognition, copyright protection, and data management. As an emerging content management technology, perceptual hash-based digital fingerprints, serving as compact summaries of multimedia content, have been widely adopted for efficient multimedia content identification and retrieval across different modalities (e.g., text, image, video, audio), attracting significant attention from both academia and industry. Despite the increasing applications of digital fingerprints, there is a lack of a systematic and comprehensive literature review of multimedia digital fingerprints. This survey aims to fill this gap and provide an important resource for researchers studying the details and related advancements of multimedia digital fingerprints. The survey first introduces the definition, characteristics, and related concepts (including hash functions, granularity, similarity measures, etc.) of digital fingerprints. It then focuses on analyzing and summarizing the algorithms for extracting unimodal fingerprints from different types of digital content, including text, image, video, and audio fingerprints. In particular, it provides an in-depth review and summary of deep learning-based fingerprints. Additionally, the survey elaborates on the various practical applications of digital fingerprints and outlines the challenges and potential future research directions. The goal is to promote the continued development of multimedia digital fingerprint research.
{"title":"Digital Fingerprinting on Multimedia: A Survey","authors":"Wendi Chen, Wensheng Gan, Philip S. Yu","doi":"arxiv-2408.14155","DOIUrl":"https://doi.org/arxiv-2408.14155","url":null,"abstract":"The explosive growth of multimedia content in the digital economy era has\u0000brought challenges in content recognition, copyright protection, and data\u0000management. As an emerging content management technology, perceptual hash-based\u0000digital fingerprints, serving as compact summaries of multimedia content, have\u0000been widely adopted for efficient multimedia content identification and\u0000retrieval across different modalities (e.g., text, image, video, audio),\u0000attracting significant attention from both academia and industry. Despite the\u0000increasing applications of digital fingerprints, there is a lack of systematic\u0000and comprehensive literature review on multimedia digital fingerprints. This\u0000survey aims to fill this gap and provide an important resource for researchers\u0000studying the details and related advancements of multimedia digital\u0000fingerprints. The survey first introduces the definition, characteristics, and\u0000related concepts (including hash functions, granularity, similarity measures,\u0000etc.) of digital fingerprints. It then focuses on analyzing and summarizing the\u0000algorithms for extracting unimodal fingerprints of different types of digital\u0000content, including text fingerprints, image fingerprints, video fingerprints,\u0000and audio fingerprints. Particularly, it provides an in-depth review and\u0000summary of deep learning-based fingerprints. Additionally, the survey\u0000elaborates on the various practical applications of digital fingerprints and\u0000outlines the challenges and potential future research directions. The goal is\u0000to promote the continued development of multimedia digital fingerprint\u0000research.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Houma Alliance Book, one of history's earliest calligraphic examples, was unearthed in the 1970s. These artifacts were meticulously organized, reproduced, and copied by the Shanxi Provincial Institute of Cultural Relics. However, because of their ancient origins and severe ink erosion, identifying characters in the Houma Alliance Book is challenging, necessitating the use of digital technology. In this paper, we propose a new ancient handwritten character recognition database for the Houma Alliance Book, along with a novel benchmark based on deep learning architectures. More specifically, a collection of 26,732 character samples from the Houma Alliance Book was gathered through iterative annotation, encompassing 327 different types of ancient characters. Furthermore, benchmark algorithms were proposed by combining four deep neural network classifiers with two data augmentation methods. This research provides valuable resources and technical support for further studies on the Houma Alliance Book and other ancient characters. This contributes to our understanding of ancient culture and history, as well as the preservation and inheritance of humanity's cultural heritage.
{"title":"HABD: a houma alliance book ancient handwritten character recognition database","authors":"Xiaoyu Yuan, Xiaohua Huang, Zibo Zhang, Yabo Sun","doi":"arxiv-2408.14084","DOIUrl":"https://doi.org/arxiv-2408.14084","url":null,"abstract":"The Houma Alliance Book, one of history's earliest calligraphic examples, was\u0000unearthed in the 1970s. These artifacts were meticulously organized,\u0000reproduced, and copied by the Shanxi Provincial Institute of Cultural Relics.\u0000However, because of their ancient origins and severe ink erosion, identifying\u0000characters in the Houma Alliance Book is challenging, necessitating the use of\u0000digital technology. In this paper, we propose a new ancient handwritten\u0000character recognition database for the Houma alliance book, along with a novel\u0000benchmark based on deep learning architectures. More specifically, a collection\u0000of 26,732 characters samples from the Houma Alliance Book were gathered,\u0000encompassing 327 different types of ancient characters through iterative\u0000annotation. Furthermore, benchmark algorithms were proposed by combining four\u0000deep neural network classifiers with two data augmentation methods. This\u0000research provides valuable resources and technical support for further studies\u0000on the Houma Alliance Book and other ancient characters. This contributes to\u0000our understanding of ancient culture and history, as well as the preservation\u0000and inheritance of humanity's cultural heritage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anmol Manjunath, Viola Negroni, Sara Mandelli, Daniel Moreira, Paolo Bestagini
Recent breakthroughs in deep learning and generative systems have significantly fostered the creation of synthetic media, as well as the local alteration of real content via the insertion of highly realistic synthetic manipulations. Local image manipulation, in particular, poses serious challenges to the integrity of digital content and societal trust. This problem is not confined to multimedia data, but also extends to biological images included in scientific publications, such as images depicting Western blots. In this work, we address the task of localizing synthetic manipulations in Western blot images. To discriminate between pristine and synthetic pixels of an analyzed image, we propose a detector of synthetic content that operates on small patches extracted from the image. We aggregate patch contributions to estimate a tampering heatmap, highlighting synthetic pixels among pristine ones. Our methodology proves effective when tested on two manipulated Western blot image datasets, one altered automatically and the other manually by exploiting advanced AI-based image manipulation tools that are unknown at our training stage. We also explore the robustness of our method on an external dataset of other scientific images depicting different semantics, manipulated through unseen generation techniques.
{"title":"Localization of Synthetic Manipulations in Western Blot Images","authors":"Anmol Manjunath, Viola Negroni, Sara Mandelli, Daniel Moreira, Paolo Bestagini","doi":"arxiv-2408.13786","DOIUrl":"https://doi.org/arxiv-2408.13786","url":null,"abstract":"Recent breakthroughs in deep learning and generative systems have\u0000significantly fostered the creation of synthetic media, as well as the local\u0000alteration of real content via the insertion of highly realistic synthetic\u0000manipulations. Local image manipulation, in particular, poses serious\u0000challenges to the integrity of digital content and societal trust. This problem\u0000is not only confined to multimedia data, but also extends to biological images\u0000included in scientific publications, like images depicting Western blots. In\u0000this work, we address the task of localizing synthetic manipulations in Western\u0000blot images. To discriminate between pristine and synthetic pixels of an\u0000analyzed image, we propose a synthetic detector that operates on small patches\u0000extracted from the image. We aggregate patch contributions to estimate a\u0000tampering heatmap, highlighting synthetic pixels out of pristine ones. Our\u0000methodology proves effective when tested over two manipulated Western blot\u0000image datasets, one altered automatically and the other manually by exploiting\u0000advanced AI-based image manipulation tools that are unknown at our training\u0000stage. We also explore the robustness of our method over an external dataset of\u0000other scientific images depicting different semantics, manipulated through\u0000unseen generation techniques.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech-language multi-modal learning presents a significant challenge due to the finely nuanced information inherent in speech styles. Therefore, a large-scale dataset providing elaborate comprehension of speech style is urgently needed to facilitate insightful interplay between speech audio and natural language. However, constructing such datasets presents a major trade-off between large-scale data collection and high-quality annotation. To tackle this challenge, we propose an automatic speech annotation system for expressiveness interpretation that annotates in-the-wild speech clips with expressive and vivid human language descriptions. Initially, speech audio is processed by a series of expert classifiers and captioning models to capture diverse speech characteristics, followed by a fine-tuned LLaMA for customized annotation generation. Unlike previous tag/template-based annotation frameworks with limited information and diversity, our system provides in-depth understanding of speech style through tailored natural language descriptions, thereby enabling accurate and voluminous data generation for large model training. With this system, we create SpeechCraft, a fine-grained bilingual expressive speech dataset. It is distinguished by highly descriptive natural language style prompts, containing approximately 2,000 hours of audio data and encompassing over two million speech clips. Extensive experiments demonstrate that the proposed dataset significantly boosts speech-language task performance in stylistic speech synthesis and speech style understanding.
{"title":"SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description","authors":"Zeyu Jin, Jia Jia, Qixin Wang, Kehan Li, Shuoyi Zhou, Songtao Zhou, Xiaoyu Qin, Zhiyong Wu","doi":"arxiv-2408.13608","DOIUrl":"https://doi.org/arxiv-2408.13608","url":null,"abstract":"Speech-language multi-modal learning presents a significant challenge due to\u0000the fine nuanced information inherent in speech styles. Therefore, a\u0000large-scale dataset providing elaborate comprehension of speech style is\u0000urgently needed to facilitate insightful interplay between speech audio and\u0000natural language. However, constructing such datasets presents a major\u0000trade-off between large-scale data collection and high-quality annotation. To\u0000tackle this challenge, we propose an automatic speech annotation system for\u0000expressiveness interpretation that annotates in-the-wild speech clips with\u0000expressive and vivid human language descriptions. Initially, speech audios are\u0000processed by a series of expert classifiers and captioning models to capture\u0000diverse speech characteristics, followed by a fine-tuned LLaMA for customized\u0000annotation generation. Unlike previous tag/templet-based annotation frameworks\u0000with limited information and diversity, our system provides in-depth\u0000understandings of speech style through tailored natural language descriptions,\u0000thereby enabling accurate and voluminous data generation for large model\u0000training. With this system, we create SpeechCraft, a fine-grained bilingual\u0000expressive speech dataset. It is distinguished by highly descriptive natural\u0000language style prompts, containing approximately 2,000 hours of audio data and\u0000encompassing over two million speech clips. Extensive experiments demonstrate\u0000that the proposed dataset significantly boosts speech-language task performance\u0000in stylist speech synthesis and speech style understanding.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction has been completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input, but neglect the crucial role of the agent's spatial position information in the grounding process. In this work, we first explore the substantial effect of spatial position locating on the grounding of outdoor VLN, drawing inspiration from human navigation. In real-world navigation scenarios, before planning a path to the destination, humans typically need to figure out their current location. This observation underscores the pivotal role of spatial localization in the navigation process. We therefore introduce a novel framework, Locating before Planning (Loc4Plan), designed to incorporate spatial perception for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to perform spatial localization before planning a decision action based on the corresponding guidance; the framework comprises a block-aware spatial locating (BAL) module and a spatial-aware action planning (SAP) module. Specifically, to help the agent perceive its spatial location in the environment, we propose to learn a position predictor that measures how far the agent is from the next intersection to reflect its position, which is achieved by the BAL module. After the locating process, the SAP module incorporates spatial information to ground the corresponding guidance and enhance the precision of action planning. Extensive experiments on the Touchdown and map2seq datasets show that the proposed Loc4Plan outperforms the SOTA methods.
{"title":"Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation","authors":"Huilin Tian, Jingke Meng, Wei-Shi Zheng, Yuan-Ming Li, Junkai Yan, Yunong Zhang","doi":"arxiv-2408.05090","DOIUrl":"https://doi.org/arxiv-2408.05090","url":null,"abstract":"Vision and Language Navigation (VLN) is a challenging task that requires\u0000agents to understand instructions and navigate to the destination in a visual\u0000environment.One of the key challenges in outdoor VLN is keeping track of which\u0000part of the instruction was completed. To alleviate this problem, previous\u0000works mainly focus on grounding the natural language to the visual input, but\u0000neglecting the crucial role of the agent's spatial position information in the\u0000grounding process. In this work, we first explore the substantial effect of\u0000spatial position locating on the grounding of outdoor VLN, drawing inspiration\u0000from human navigation. In real-world navigation scenarios, before planning a\u0000path to the destination, humans typically need to figure out their current\u0000location. This observation underscores the pivotal role of spatial localization\u0000in the navigation process. In this work, we introduce a novel framework,\u0000Locating be for Planning (Loc4Plan), designed to incorporate spatial perception\u0000for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to\u0000perform the spatial localization before planning a decision action based on\u0000corresponding guidance, which comprises a block-aware spatial locating (BAL)\u0000module and a spatial-aware action planning (SAP) module. Specifically, to help\u0000the agent perceive its spatial location in the environment, we propose to learn\u0000a position predictor that measures how far the agent is from the next\u0000intersection for reflecting its position, which is achieved by the BAL module.\u0000After the locating process, we propose the SAP module to incorporate spatial\u0000information to ground the corresponding guidance and enhance the precision of\u0000action planning. Extensive experiments on the Touchdown and map2seq datasets\u0000show that the proposed Loc4Plan outperforms the SOTA methods.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing demand for high-quality point cloud transmission over wireless networks presents significant challenges, primarily due to the large data sizes and the need for efficient encoding techniques. In response to these challenges, we introduce a novel system named Deep Point Cloud Semantic Transmission (PCST), designed for end-to-end wireless point cloud transmission. Our approach employs a progressive resampling framework using sparse convolution to project point cloud data into a semantic latent space. These semantic features are subsequently encoded through a deep joint source-channel (JSCC) encoder, generating the channel-input sequence. To enhance transmission efficiency, we use an adaptive entropy-based approach to assess the importance of each semantic feature, allowing transmission lengths to vary according to their predicted entropy. PCST is robust across diverse Signal-to-Noise Ratio (SNR) levels and supports an adjustable rate-distortion (RD) trade-off, ensuring flexible and efficient transmission. Experimental results indicate that PCST significantly outperforms traditional separate source-channel coding (SSCC) schemes, delivering superior reconstruction quality while achieving over a 50% reduction in bandwidth usage.
{"title":"Deep joint source-channel coding for wireless point cloud transmission","authors":"Cixiao Zhang, Mufan Liu, Wenjie Huang, Yin Xu, Yiling Xu, Dazhi He","doi":"arxiv-2408.04889","DOIUrl":"https://doi.org/arxiv-2408.04889","url":null,"abstract":"The growing demand for high-quality point cloud transmission over wireless\u0000networks presents significant challenges, primarily due to the large data sizes\u0000and the need for efficient encoding techniques. In response to these\u0000challenges, we introduce a novel system named Deep Point Cloud Semantic\u0000Transmission (PCST), designed for end-to-end wireless point cloud transmission.\u0000Our approach employs a progressive resampling framework using sparse\u0000convolution to project point cloud data into a semantic latent space. These\u0000semantic features are subsequently encoded through a deep joint source-channel\u0000(JSCC) encoder, generating the channel-input sequence. To enhance transmission\u0000efficiency, we use an adaptive entropy-based approach to assess the importance\u0000of each semantic feature, allowing transmission lengths to vary according to\u0000their predicted entropy. PCST is robust across diverse Signal-to-Noise Ratio\u0000(SNR) levels and supports an adjustable rate-distortion (RD) trade-off,\u0000ensuring flexible and efficient transmission. Experimental results indicate\u0000that PCST significantly outperforms traditional separate source-channel coding\u0000(SSCC) schemes, delivering superior reconstruction quality while achieving over\u0000a 50% reduction in bandwidth usage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emotion Prediction in Conversation (EPC) aims to forecast the emotions of forthcoming utterances by utilizing preceding dialogues. Previous EPC approaches relied on simple context modeling for emotion extraction, overlooking fine-grained emotion cues at the word level. Additionally, prior works failed to account for the intrinsic differences between modalities, resulting in redundant information. To overcome these limitations, we propose an emotional cues extraction and fusion network, which consists of two stages: a modality-specific learning stage that utilizes word-level labels and prosody learning to construct emotion embedding spaces for each modality, and a two-step fusion stage for integrating multi-modal features. Moreover, the emotion features extracted by our model are also applicable to the Emotion Recognition in Conversation (ERC) task. Experimental results validate the efficacy of the proposed method, demonstrating superior performance on both IEMOCAP and MELD datasets.
{"title":"Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation","authors":"Haoxiang Shi, Ziqi Liang, Jun Yu","doi":"arxiv-2408.04547","DOIUrl":"https://doi.org/arxiv-2408.04547","url":null,"abstract":"Emotion Prediction in Conversation (EPC) aims to forecast the emotions of\u0000forthcoming utterances by utilizing preceding dialogues. Previous EPC\u0000approaches relied on simple context modeling for emotion extraction,\u0000overlooking fine-grained emotion cues at the word level. Additionally, prior\u0000works failed to account for the intrinsic differences between modalities,\u0000resulting in redundant information. To overcome these limitations, we propose\u0000an emotional cues extraction and fusion network, which consists of two stages:\u0000a modality-specific learning stage that utilizes word-level labels and prosody\u0000learning to construct emotion embedding spaces for each modality, and a\u0000two-step fusion stage for integrating multi-modal features. Moreover, the\u0000emotion features extracted by our model are also applicable to the Emotion\u0000Recognition in Conversation (ERC) task. Experimental results validate the\u0000efficacy of the proposed method, demonstrating superior performance on both\u0000IEMOCAP and MELD datasets.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiangcheng Du, Zhao Zhou, Yanlong Wang, Zhuoyao Wang, Yingbin Zheng, Cheng Jin
Deep networks have shown impressive performance on image restoration tasks such as image colorization. However, we find that previous approaches rely on the digital representation of a single color model with a specific mapping function, i.e., a single color space, during the colorization pipeline. In this paper, we first investigate the modeling of different color spaces and find that each exhibits distinctive characteristics with a unique distribution of colors. The complementarity among multiple color spaces benefits the image colorization task. We present MultiColor, a new learning-based approach to automatically colorize grayscale images that combines clues from multiple color spaces. Specifically, we employ a set of dedicated colorization modules, one for each color space. Within each module, a transformer decoder is first employed to refine color query embeddings, and then a color mapper produces color channel predictions from the embeddings and semantic features. With these predicted color channels representing various color spaces, a complementary network is designed to exploit the complementarity and generate pleasing and reasonable colorized images. We conduct extensive experiments on real-world datasets, and the results demonstrate superior performance over the state of the art.
{"title":"MultiColor: Image Colorization by Learning from Multiple Color Spaces","authors":"Xiangcheng Du, Zhao Zhou, Yanlong Wang, Zhuoyao Wang, Yingbin Zheng, Cheng Jin","doi":"arxiv-2408.04172","DOIUrl":"https://doi.org/arxiv-2408.04172","url":null,"abstract":"Deep networks have shown impressive performance in the image restoration\u0000tasks, such as image colorization. However, we find that previous approaches\u0000rely on the digital representation from single color model with a specific\u0000mapping function, a.k.a., color space, during the colorization pipeline. In\u0000this paper, we first investigate the modeling of different color spaces, and\u0000find each of them exhibiting distinctive characteristics with unique\u0000distribution of colors. The complementarity among multiple color spaces leads\u0000to benefits for the image colorization task. We present MultiColor, a new learning-based approach to automatically\u0000colorize grayscale images that combines clues from multiple color spaces.\u0000Specifically, we employ a set of dedicated colorization modules for individual\u0000color space. Within each module, a transformer decoder is first employed to\u0000refine color query embeddings and then a color mapper produces color channel\u0000prediction using the embeddings and semantic features. With these predicted\u0000color channels representing various color spaces, a complementary network is\u0000designed to exploit the complementarity and generate pleasing and reasonable\u0000colorized images. We conduct extensive experiments on real-world datasets, and\u0000the results demonstrate superior performance over the state-of-the-arts.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}