Pub Date: 2025-09-01 | DOI: 10.1109/TMM.2025.3604967
Junlin Liu;Xinchen Lyu;Chenshan Ren;Qimei Cui
Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use a single input transformation, limiting targeted transferability and defense robustness. Combining different transformation types is challenging, as continually adding transformation types degrades semantic information and targeted transferability. This paper proposes a quality-aware transformation combination attack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables the expansion of transformation types, enhances input diversity, and hence improves targeted transferability and defense robustness. We first design a quality-evaluation framework that quantifies the effectiveness of transformation combinations by jointly considering convergence, transferability, and robustness. Only a small group of images (up to 10) is required for computation-efficient quality evaluation. Experiments validate TCA's superiority over state-of-the-art baselines in adversarial transferability and robustness. When the target models are protected by defenses, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26%–42% on ImageNet.
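As background for how input-diversity attacks operate, the sketch below shows a generic iterative targeted attack that applies a randomly sampled combination of input transformations before each gradient step. It is a minimal illustration under assumed settings (the specific transformations, probabilities, step size, and L-infinity budget are placeholders), and it does not model TCA's quality-aware selection of combinations.

```python
import random
import torch
import torch.nn.functional as F

def random_transform_combo(x):
    """Apply a randomly sampled combination of simple input transformations.
    The particular transformations and parameters here are illustrative only."""
    if random.random() < 0.5:                      # random resize + pad (DI-style)
        s = random.uniform(0.85, 1.0)
        h, w = x.shape[-2:]
        nh, nw = int(h * s), int(w * s)
        y = F.interpolate(x, size=(nh, nw), mode='bilinear', align_corners=False)
        x = F.pad(y, (0, w - nw, 0, h - nh))       # pad back to original size
    if random.random() < 0.5:                      # horizontal flip
        x = torch.flip(x, dims=[-1])
    if random.random() < 0.5:                      # small additive noise
        x = x + 0.01 * torch.randn_like(x)
    return x

def targeted_attack(model, x, target, eps=16/255, alpha=2/255, steps=100):
    """Iterative targeted attack with transformation-combination input diversity."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(random_transform_combo(x_adv))
        loss = F.cross_entropy(logits, target)      # descend toward the target class
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv - alpha * grad.sign()).detach()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv
```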
{"title":"Crafting More Transferable Adversarial Examples via Quality-Aware Transformation Combination","authors":"Junlin Liu;Xinchen Lyu;Chenshan Ren;Qimei Cui","doi":"10.1109/TMM.2025.3604967","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604967","url":null,"abstract":"Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use single input transformation, limiting targeted transferability and defense robustness. Combining different transformation types is challenging, as keeping increasing types would degrade semantic information and targeted transferability. This paper proposes a quality-aware <underline>t</u>ransformation <underline>c</u>ombination <underline>a</u>ttack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables expansion of transformation types, enhances input diversity, and hence improves targeted transferability and defense robustness. We first design a quality-evaluation framework to quantify the effectiveness of transformation combinations, which jointly considers convergence, transferability, and robustness. Only a small group (up to 10) of images are required for computation-efficient quality evaluation. Experiments validate TCA’s superiority over state-of-the-art baselines in adversarial transferability and robustness. When defenses are secured, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26%<inline-formula><tex-math>$sim$</tex-math></inline-formula>42% on ImageNet.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7917-7929"},"PeriodicalIF":9.7,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtual reality technology is advancing rapidly and becoming increasingly mature, and omnidirectional images have become part of many people's daily lives. However, these images are susceptible to irreversible distortion during encoding and transmission. Given the unique deformation and distortion characteristics of omnidirectional images, developing a dedicated quality assessment method is crucial. To ensure that our network delivers efficient and stable performance while maintaining a minimal parameter count, we integrate the concept of knowledge distillation into our network: a full-reference (FR) teacher network guides the training of a no-reference (NR) student network through cross-projection knowledge distillation. To implement this method, a Dual Projection Format Fusion (DPFF) module is designed to fuse and mutually complement the two projection formats of omnidirectional images. In the design of the knowledge distillation process and loss function, we introduce a review mechanism to enhance the performance and efficiency of response-based knowledge, and utilize intermediate fusion features to improve the effectiveness of feature-based knowledge; these components are combined to form the final loss function. Experimental results on four omnidirectional image databases validate the superiority of our proposed model over existing FR and NR methods, highlighting its effectiveness in elevating the quality assessment of omnidirectional images.
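As a rough sketch of the teacher-student setup described here (not the authors' exact formulation, and omitting the DPFF module and review mechanism), a full-reference teacher's predicted score and intermediate features could supervise a no-reference student with a combined response-based and feature-based loss; the loss weights and distance functions below are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, student_feat, teacher_feat,
                      mos, w_resp=1.0, w_feat=0.5):
    """Combine response-based and feature-based distillation for quality regression.

    student_out / teacher_out: predicted quality scores, shape (B,)
    student_feat / teacher_feat: intermediate features of matching shape
    mos: ground-truth mean opinion scores, shape (B,)
    """
    task = F.l1_loss(student_out, mos)                         # supervised quality loss
    response = F.mse_loss(student_out, teacher_out.detach())   # response-based knowledge
    feature = F.mse_loss(student_feat, teacher_feat.detach())  # feature-based knowledge
    return task + w_resp * response + w_feat * feature
```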
{"title":"Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment","authors":"Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang","doi":"10.1109/TMM.2025.3590920","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590920","url":null,"abstract":"Nowadays, virtual reality technology is advancing rapidly and becoming increasingly matured. Omnidirectional images have integrated into the daily lives of many individuals. However, these images are susceptible to irreversible distortion during the encoding and transmission processes. Given the unique characteristics of deformation and distortion in omnidirectional images, the development of a quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we have integrated the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network by cross-projection distilling knowledge. To specifically implement this method, a Dual Projection Format Fusion (DPFF) module is specifically designed to complement and integrate the mutual fusion of the two projection formats of omnidirectional images. In the design of our knowledge distillation process and loss function, we have introduced a review mechanism to enhance the performance and efficiency of response-based knowledge, as well as utilized intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results validate the superiority of our proposed model over existing FR and NR methods when evaluated on four omnidirectional image databases. This highlights the effectiveness of our proposed model in elevating the quality assessment of omnidirectional images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6752-6765"},"PeriodicalIF":9.7,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing interest in learning from paired medical images and textual reports highlights the need for methods that can achieve multi-grained alignment between these two modalities. However, most existing approaches overlook fine-grained semantic alignment, which can constrain the quality of the generated representations. To tackle this problem, we propose the Multi-Grained Vision-and-Language Alignment (MGVLA) model, which effectively leverages multi-grained correspondences between medical images and texts at the disease, instance, and token levels. For disease-level alignment, our approach adopts the concept of contrastive learning and uses medical terminologies detected from textual reports as soft labels to guide the alignment process. At the instance level, we propose a hard-negative sampling strategy in which images and texts with the same disease type but differing in details such as disease locations and severity are treated as hard negatives. This strategy helps our approach better distinguish between positive and negative image-text pairs, ultimately enhancing the quality of the learned representations. For token-level alignment, we employ a masking and recovery technique to achieve fine-grained semantic alignment between patches and sub-words, effectively aligning the different levels of granularity between the image and language modalities. To assess the efficacy of MGVLA, we conduct comprehensive experiments on image-text retrieval and phrase grounding tasks.
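The disease-level alignment with terminology-derived soft labels can be pictured as a contrastive loss whose targets are label-similarity distributions rather than the usual one-hot identity. The generic sketch below assumes L2-normalized embeddings and a (B, B) label-similarity matrix with positive row sums, and requires PyTorch 1.10+ for probabilistic targets in cross_entropy; it is not the MGVLA implementation.

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive(img_emb, txt_emb, label_sim, tau=0.07):
    """Image-text contrastive loss with soft targets from label similarity.

    img_emb, txt_emb: (B, d) L2-normalized embeddings
    label_sim: (B, B) non-negative similarity between samples' terminology labels
    """
    logits = img_emb @ txt_emb.t() / tau
    # Normalize each row of the label-similarity matrix into a target distribution.
    targets_i2t = label_sim / label_sim.sum(dim=1, keepdim=True)
    targets_t2i = label_sim.t() / label_sim.t().sum(dim=1, keepdim=True)
    loss_i2t = F.cross_entropy(logits, targets_i2t)      # soft-target cross-entropy
    loss_t2i = F.cross_entropy(logits.t(), targets_t2i)
    return 0.5 * (loss_i2t + loss_t2i)
```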
{"title":"Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment","authors":"Huimin Yan;Xian Yang;Liang Bai;Jiamin Li;Jiye Liang","doi":"10.1109/TMM.2025.3590930","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590930","url":null,"abstract":"The increasing interest in learning from paired medical images and textual reports highlights the need for methods that can achieve multi-grained alignment between these two modalities. However, most existing approaches overlook fine-grained semantic alignment, which can constrain the quality of the generated representations. To tackle this problem, we propose the Multi-Grained Vision-and-Language Alignment (MGVLA) model, which effectively leverages multi-grained correspondences between medical images and texts at different levels, including disease, instance, and token levels. For disease-level alignment, our approach adopts the concept of contrastive learning and uses medical terminologies detected from textual reports as soft labels to guide the alignment process. At the instance level, we propose a strategy for sampling hard negatives, where images and texts with the same disease type but differing in details such as disease locations and severity are considered as hard negatives. This strategy helps our approach to better distinguish between positive and negative image-text pairs, ultimately enhancing the quality of our learned representations. For token-level alignment, we employ a masking and recovery technique to achieve fine-grained semantic alignment between patches and sub-words. This approach effectively aligns the different levels of granularity between the image and language modalities. To assess the efficacy of our MGVLA model, we conduct comprehensive experiments on the image-text retrieval and phrase grounding tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6780-6792"},"PeriodicalIF":9.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-23 | DOI: 10.1109/TMM.2025.3590912
Sida Tian;Can Zhang;Wei Yuan;Wei Tan;Wenjie Zhu
In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in image synthesis and text generation, producing content comparable to that created by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents XMusic, a generalized symbolic music generation framework that supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable, high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms, and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator produces emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music through a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset containing 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms current state-of-the-art methods with impressive music quality. XMusic was selected as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage is https://xmusic-project.github.io.
{"title":"XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework","authors":"Sida Tian;Can Zhang;Wei Yuan;Wei Tan;Wenjie Zhu","doi":"10.1109/TMM.2025.3590912","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590912","url":null,"abstract":"In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine <italic>Highlights of Collectibles at WAIC 2023</i>. The project homepage of XMusic is: <uri>https://xmusic-project.github.io</uri>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6857-6871"},"PeriodicalIF":9.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blind image inpainting is a challenging task aimed at reconstructing corrupted regions without relying on mask information. Due to the lack of mask priors, previous methods usually integrate a mask prediction network in the initial phase, followed by an inpainting backbone. However, this multi-stage generation process may result in feature misalignment. While recent end-to-end generative methods bypass the mask prediction step, they typically struggle with weak perception of contaminated regions and introduce structural distortions. This study presents a novel mask region perception strategy for blind image inpainting by combining adversarial training with forgery detection. To implement this strategy, we propose an attention-driven forgery adversarial network (AFAN), which leverages adaptive contextual attention (ACA) blocks for effective feature modulation. Specifically, within the generator, ACA employs self-attention to enhance content reconstruction by utilizing the rich contextual information of adjacent tokens. In the discriminator, ACA utilizes cross-attention with noise priors to guide adversarial learning for forgery detection. Moreover, we design a high-frequency omni-dimensional dynamic convolution (HODC) based on edge feature enhancement to improve detail representation. Extensive evaluations across multiple datasets demonstrate that the proposed AFAN model outperforms existing generative methods in blind image inpainting, particularly in terms of quality and texture fidelity.
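To illustrate the idea of guiding the discriminator's attention with a noise prior, here is a minimal cross-attention block in which image-feature tokens act as queries and tokens derived from a noise prior act as keys and values; the shapes, the use of nn.MultiheadAttention, and the residual/LayerNorm arrangement are assumptions, not a reproduction of AFAN's ACA block.

```python
import torch
import torch.nn as nn

class NoisePriorCrossAttention(nn.Module):
    """Sketch: cross-attention guided by a noise prior (illustrative only)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_tokens, noise_tokens):
        # feat_tokens: (B, N, dim) image-feature tokens used as queries
        # noise_tokens: (B, M, dim) tokens from a noise prior, used as keys/values
        out, _ = self.attn(feat_tokens, noise_tokens, noise_tokens)
        return self.norm(feat_tokens + out)   # residual connection + normalization
```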
{"title":"AFAN: An Attention-Driven Forgery Adversarial Network for Blind Image Inpainting","authors":"Jiahao Wang;Gang Pan;Di Sun;Jinyuan Li;Jiawan Zhang","doi":"10.1109/TMM.2025.3590914","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590914","url":null,"abstract":"Blind image inpainting is a challenging task aimed at reconstructing corrupted regions without relying on mask information. Due to the lack of mask priors, previous methods usually integrate a mask prediction network in the initial phase, followed by an inpainting backbone. However, this multi-stage generation process may result in feature misalignment. While recent end-to-end generative methods bypass the mask prediction step, they typically struggle with weak perception of contaminated regions and introduce structural distortions. This study presents a novel mask region perception strategy for blind image inpainting by combining adversarial training with forgery detection. To implement this strategy, we propose an attention-driven forgery adversarial network (AFAN), which leverages adaptive contextual attention (ACA) blocks for effective feature modulation. Specifically, within the generator, ACA employs self-attention to enhance content reconstruction by utilizing the rich contextual information of adjacent tokens. In the discriminator, ACA utilizes cross-attention with noise priors to guide adversarial learning for forgery detection. Moreover, we design a high-frequency omni-dimensional dynamic convolution (HODC) based on edge feature enhancement to improve detail representation. Extensive evaluations across multiple datasets demonstrate that the proposed AFAN model outperforms existing generative methods in blind image inpainting, particularly in terms of quality and texture fidelity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6845-6856"},"PeriodicalIF":9.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-22 | DOI: 10.1109/TMM.2025.3590929
Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu
The rapid development of online media has heightened the importance of multimodal emotion recognition (MER) in video analysis. However, practical applications often face missing modalities caused by various interferences, and the specific missing situations, such as the number and types of missing modalities, are difficult to predict. Current approaches to missing modalities typically apply a uniform method to all missing cases, which is insufficiently adaptive to dynamic conditions. For example, translation-based methods can efficiently complete missing text from audio, but generating audio or video features that retain the original emotional information from other modalities is challenging and may introduce additional noise. In this paper, we introduce ROSA, a novel robust self-adaptive model designed to address various missing cases with tailored approaches, leveraging available modalities effectively and reducing the introduction of additional noise. Specifically, the A-T Completion module, based on an encoder-decoder architecture, enables ROSA to generate missing raw text from audio rather than mere embedding representations, capturing more nuanced modal features. Additionally, we design the T-V Fusion module, based on a vision-language large model, for deep extraction and fusion of textual and visual features. Comprehensive experiments on three widely used public datasets demonstrate the superiority and effectiveness of our model: ROSA outperforms other models under both fixed missing rates and fixed missing modalities. The ablation studies further highlight the contribution of each designed module.
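The "tailored approaches per missing case" idea can be read as a routing step: complete text from audio when text is absent, and fuse with video only when video is present. The sketch below is a hypothetical wrapper with placeholder submodules (a_t_completion, t_v_fusion, classifier); it shows only the dispatch logic, not ROSA's actual modules.

```python
import torch
import torch.nn as nn

class MissingModalityRouter(nn.Module):
    """Hypothetical routing sketch: tailor the pipeline to the missing-modality case."""
    def __init__(self, a_t_completion, t_v_fusion, classifier):
        super().__init__()
        self.a_t_completion = a_t_completion   # placeholder audio-to-text module
        self.t_v_fusion = t_v_fusion           # placeholder text-video fusion module
        self.classifier = classifier           # placeholder emotion classifier

    def forward(self, text=None, audio=None, video=None):
        if text is None and audio is not None:
            # Recover raw text from audio instead of synthesizing audio/video features.
            text = self.a_t_completion(audio)
        fused = self.t_v_fusion(text, video) if video is not None else text
        return self.classifier(fused)
```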
{"title":"ROSA: A Robust Self-Adaptive Model for Multimodal Emotion Recognition With Uncertain Missing Modalities","authors":"Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu","doi":"10.1109/TMM.2025.3590929","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590929","url":null,"abstract":"The rapid development of online media has heightened the importance of multimodal emotion recognition (MER) in video analysis. However, practical applications often encounter challenges due to missing modalities caused by various interferences. It is difficult to predict the specific missing situations, such as the number and types of missing modalities. Current approaches to modality missing typically apply a uniform method to address various missing cases, which are insufficiently adaptive to dynamic conditions. For example, translation-based methods can efficiently complete missing text from audio, but generating audio or video features that retain the original emotional information from other modalities is challenging and may introduce additional noise. In this paper, we introduce ROSA, a novel <bold>ro</b>bust <bold>s</b>elf-<bold>a</b>daptive model designed to address various missing cases with tailored approaches, leveraging available modalities effectively and reducing the introduction of additional noise. Specifically, the A-T Completion module based on the encoder-decoder architecture enables ROSA to generate missing raw text from audio rather than mere embedding representations, capturing more nuanced modal features. Additionally, we design the T-V Fusion module based on a vision-language large model for deep extraction and fusion of textual and visual features. Comprehensive experiments conducted on three widely used public datasets demonstrate the superiority and effectiveness of our model. ROSA outperforms other models in both fixed missing rate and fixed missing modality cases. The ablation studies further highlights the contribution of each designed module.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6766-6779"},"PeriodicalIF":9.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-similarity techniques are booming in no-reference super-resolution (SR) due to their accurate estimation of the degradation types involved in low-resolution images. However, the high-dimensional matrix multiplication within self-similarity computation incurs prohibitive computational costs. We find that the high-dimensional attention map is derived from the matrix multiplication between query and key, followed by a softmax function. This softmax makes the matrix multiplication inseparable, posing a great challenge in simplifying computational complexity. To address this issue, we first propose a second-order Taylor expansion approximation (STEA) to separate the matrix multiplication of query and key, reducing the complexity from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N)$. Then, we design a multi-scale large field reception (MLFR) module to compensate for the performance degradation caused by STEA. Finally, we apply these two core designs to laboratory and real-world scenarios by constructing LabNet and RealNet, respectively. Extensive experiments on five synthetic datasets demonstrate that LabNet sets a new benchmark in qualitative and quantitative evaluations, and on a real-world dataset RealNet achieves superior visual quality over existing methods. Ablation studies further verify the contributions of STEA and MLFR to both the LabNet and RealNet frameworks.
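To see why a second-order Taylor expansion removes the quadratic cost, note that replacing exp(q·k) with 1 + q·k + (q·k)^2/2 lets all sums over keys be precomputed once and reused for every query. The sketch below shows that rearrangement for a single attention head; it is a generic illustration of the idea rather than the paper's STEA module, and it omits numerical safeguards (the truncated expansion can yield negative weights for strongly negative scores).

```python
import torch

def taylor_linear_attention(q, k, v):
    """Linear-complexity attention via a second-order Taylor expansion of exp(q.k).

    q, k: (B, N, d) queries and keys; v: (B, N, dv) values.
    All key-side sums are accumulated once, giving O(N) cost in sequence length.
    """
    B, N, d = q.shape
    # Zeroth- and first-order key statistics.
    v_sum = v.sum(dim=1)                                  # (B, dv)
    k_sum = k.sum(dim=1)                                  # (B, d)
    kv = torch.einsum('bnd,bne->bde', k, v)               # (B, d, dv)
    # Second-order key statistics (still linear in N, though O(d^2) in memory).
    kk = torch.einsum('bnd,bne->bde', k, k)               # (B, d, d)
    kkv = torch.einsum('bnd,bne,bnf->bdef', k, k, v)      # (B, d, d, dv)

    # Numerator: sum_j (1 + q.k_j + 0.5*(q.k_j)^2) * v_j
    num = (v_sum.unsqueeze(1)
           + torch.einsum('bnd,bde->bne', q, kv)
           + 0.5 * torch.einsum('bnd,bne,bdef->bnf', q, q, kkv))
    # Denominator: sum_j (1 + q.k_j + 0.5*(q.k_j)^2)
    den = (N
           + torch.einsum('bnd,bd->bn', q, k_sum)
           + 0.5 * torch.einsum('bnd,bne,bde->bn', q, q, kk))
    return num / den.unsqueeze(-1)
```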
{"title":"Image Super-Resolution With Taylor Expansion Approximation and Large Field Reception","authors":"Jiancong Feng;Yuan-Gen Wang;Mingjie Li;Fengchuang Xing","doi":"10.1109/TMM.2025.3590917","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590917","url":null,"abstract":"Self-similarity techniques are booming in no-reference super-resolution (SR) due to accurate estimation of the degradation types involved in low-resolution images. However, high-dimensional matrix multiplication within self-similarity computation prohibitively consumes massive computational costs. We find that the high-dimensional attention map is derived from the matrix multiplication between query and key, followed by a softmax function. This softmax makes the matrix multiplication inseparable, posing a great challenge in simplifying computational complexity. To address this issue, we first propose a second-order Taylor expansion approximation (STEA) to separate the matrix multiplication of query and key, resulting in the complexity reduction from <inline-formula><tex-math>$mathcal {O}(N^{2})$</tex-math></inline-formula> to <inline-formula><tex-math>$mathcal {O}(N)$</tex-math></inline-formula>. Then, we design a multi-scale large field reception (MLFR) to compensate for the performance degradation caused by STEA. Finally, we apply these two core designs to laboratory and real-world scenarios by constructing LabNet and RealNet, respectively. Extensive experimental results tested on five synthetic datasets demonstrate that our LabNet sets a new benchmark in qualitative and quantitative evaluations. Tested on the real-world dataset, our RealNet achieves superior visual quality over existing methods. Ablation studies further verify the contributions of STEA and MLFR towards both LabNet and RealNet frameworks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6819-6830"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label Student Action Video (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges to advance action detection methods. To benchmark this, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method demonstrates excellent performance with a mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes.
{"title":"Towards Student Actions in Classroom Scenes: New Dataset and Baseline","authors":"Zhuolin Tan;Chenqiang Gao;Anyong Qin;Ruixin Chen;Tiecheng Song;Feng Yang;Deyu Meng","doi":"10.1109/TMM.2025.3590899","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590899","url":null,"abstract":"Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label <italic>Student Action Video</i> (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges to advance action detection methods. To benchmark this, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method demonstrates excellent performance with a mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6831-6844"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single-Domain Generalization Object Detection (Single-DGOD) refers to training a model with only one source domain, enabling the model to generalize to any unseen domain. For instance, a detector trained on a sunny daytime dataset should also perform well in scenarios such as rainy nighttime. The main challenge is to improve the detector's ability to learn a domain-invariant representation (DIR) while removing domain-specific information. Recent progress in Single-DGOD has demonstrated the efficacy of removing domain-specific information by adjusting feature distributions. Nonetheless, simply adjusting the global feature distribution is insufficient to learn the potential relationship from sunny to adverse weather, as it ignores the significant domain gaps between instances across different weather conditions. In this paper, we propose a novel object detection method for more robust single-domain generalization. It mainly consists of a frequency-aware selective whitening (FSW) module for removing redundant domain-specific information and a contrastive feature alignment (CFA) module for enhancing domain-invariant information among instances. Specifically, FSW extracts the magnitude spectrum of the feature and uses a group whitening loss to selectively eliminate redundant domain-specific information in the magnitude. To further eliminate domain differences among instances, we apply a style transfer method for data augmentation and use the augmented data in the CFA module. CFA organizes both the original and the augmented RoI features into a series of category-specific groups and applies contrastive learning across them to facilitate the learning of DIR for various categories. Experiments show that our method achieves favorable performance on existing standard benchmarks.
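As an illustration of operating on the magnitude spectrum with a whitening-style penalty, the sketch below takes the FFT amplitude of a feature map and penalizes selected off-diagonal entries of its channel covariance; the optional selection mask and this exact formulation are assumptions and do not reproduce the paper's FSW module or its group whitening loss.

```python
import torch

def magnitude_whitening_loss(feat, mask=None, eps=1e-5):
    """Illustrative sketch: suppress channel correlations in the amplitude spectrum.

    feat: (B, C, H, W) feature map.
    mask: optional (C, C) 0/1 matrix choosing which off-diagonal covariance entries
          to suppress (the 'selective' part); hypothetical, not the paper's scheme.
    """
    amp = torch.fft.fft2(feat, norm='ortho').abs()         # magnitude spectrum
    x = amp.flatten(2)                                     # (B, C, H*W)
    x = (x - x.mean(dim=2, keepdim=True)) / (x.std(dim=2, keepdim=True) + eps)
    cov = torch.bmm(x, x.transpose(1, 2)) / x.shape[-1]    # (B, C, C) channel covariance
    off_diag = cov - torch.diag_embed(torch.diagonal(cov, dim1=1, dim2=2))
    if mask is not None:
        off_diag = off_diag * mask                         # keep only selected entries
    return off_diag.pow(2).mean()
```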
{"title":"Single-Domain Generalized Object Detection With Frequency Whitening and Contrastive Learning","authors":"Xiaolong Guo;Chengxu Liu;Xueming Qian;Zhixiao Wang;Xubin Feng;Yao Xue","doi":"10.1109/TMM.2025.3590915","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590915","url":null,"abstract":"Single-Domain Generalization Object Detection (Single-DGOD) refers to training a model with only one source domain, enabling the model to generalize to any unseen domain. For instance, a detector trained on a sunny daytime dataset should also perform well in scenarios such as rainy nighttime. The main challenge is to improve the detector’s ability to learn the domain-invariant representation (DIR) while removing domain-specific information. Recent progress in Single-DGOD has demonstrated the efficacy of removing domain-specific information by adjusting feature distributions. Nonetheless, simply adjusting the global feature distribution in Single-DGOD task is insufficient to learn the potential relationship from sunny to adverse weather, as these ignore the significant domain gaps between instances across different weathers. In this paper, we propose a novel object detection method for more robust single-domain generalization. In particular, it mainly consists of a frequency-aware selective whitening module (FSW) for removing redundant domain-specific information and a contrastive feature alignment module (CFA) for enhancing domain-invariant information among instances. Specially, FSW extracts the magnitude spectrum of the feature and uses a group whitening loss to selectively eliminate redundant domain-specific information in the magnitude. To further eliminate domain differences among instances, we apply the style transfer method for data augmentation and use the augmented data in the CFA module. CFA formulates both the original and the augmentd RoI features into a series of groups with different categories, and utilizes contrastive learning across them to facilitate the learning of DIR in various categories. Experiments show that our method achieves favorable performance on existing standard benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6805-6818"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal sentiment analysis aims to exploit complementary information from multiple modalities or data sources to enhance the understanding and interpretation of sentiment. While existing multimodal fusion techniques offer significant improvements in sentiment analysis, real-world scenarios often involve missing modalities, introducing complexity due to uncertainty about which modalities may be absent. To tackle the challenge of incomplete modality-specific feature extraction caused by missing modalities, this paper proposes a Cosine Margin-Aware Network (CMANet) centered on the Cosine Margin-Aware Distillation (CMAD) module. This core module measures the distance between samples and the classification boundary, enabling CMANet to focus on samples near the boundary and thus effectively capture the unique features of different modality combinations. To address modality imbalance during modality-specific feature extraction, this paper proposes a Weak Modality Regularization (WMR) strategy, which aligns the feature distributions between strong and weak modalities at the dataset level while also enhancing the prediction loss of individual samples at the sample level. This dual mechanism improves recognition robustness for weak modality combinations. Extensive experiments demonstrate that the proposed method outperforms the previous best model, MMIN, with a 3.82% improvement in unweighted accuracy. These results underscore the robustness of the approach under conditions of uncertain and missing modalities.
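One way to picture "distance to the classification boundary" under a cosine classifier is the gap between the top-1 and top-2 cosine similarities of a sample to the class weights; a small gap means the sample sits near the boundary. The sketch below turns that gap into per-sample weights, as an illustrative assumption rather than the paper's CMAD formulation.

```python
import torch
import torch.nn.functional as F

def boundary_proximity_weights(features, class_weights, gamma=5.0):
    """Weight samples by how close they sit to the cosine decision boundary.

    features: (B, d) sample embeddings; class_weights: (K, d) classifier weights.
    The top-1 vs. top-2 cosine-similarity gap serves as a margin proxy; small gaps
    (near-boundary samples) receive larger weights. gamma is an assumed scale.
    """
    cos = F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).t()  # (B, K)
    top2 = cos.topk(2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]        # proxy for distance to the boundary
    return torch.exp(-gamma * margin)       # emphasize near-boundary samples
```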
{"title":"Boosting Modal-Specific Representations for Sentiment Analysis With Incomplete Modalities","authors":"Xin Jiang;Lihuo He;Fei Gao;Kaifan Zhang;Jie Li;Xinbo Gao","doi":"10.1109/TMM.2025.3590909","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590909","url":null,"abstract":"Multimodal sentiment analysis aims at exploiting complementary information from multiple modalities or data sources to enhance the understanding and interpretation of sentiment. While existing multi-modal fusion techniques offer significant improvements in sentiment analysis, real-world scenarios often involve missing modalities, introducing complexity due to uncertainty of which modalities may be absent. To tackle the challenge of incomplete modality-specific feature extraction caused by missing modalities, this paper proposes a Cosine Margin-Aware Network (CMANet) which centers on the Cosine Margin-Aware Distillation (CMAD) module. The core module measures distance between samples and the classification boundary, enabling CMANet to focus on samples near the boundary. So, it effectively captures the unique features of different modal combinations. To address the issue of modality imbalance during modality-specific feature extraction, this paper proposes a Weak Modality Regularization (WMR) strategy, which aligns the feature distributions between strong and weak modalities at the dataset-level, while also enhancing the prediction loss of samples at the sample-level. This dual mechanism improves the recognition robustness of weak modality combination. Extensive experiments demonstrate that the proposed method outperforms the previous best model, MMIN, with a 3.82% improvement in unweighted accuracy. These results underscore the robustness of the approach under conditions of uncertain and missing modalities.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6793-6804"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}