Pub Date: 2026-02-24 | DOI: 10.1109/JSTSP.2026.3661776
{"title":"IEEE Signal Processing Society Publication Information","authors":"","doi":"10.1109/JSTSP.2026.3661776","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3661776","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"20 1","pages":"C2-C2"},"PeriodicalIF":13.7,"publicationDate":"2026-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11409417","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147274993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-24 | DOI: 10.1109/JSTSP.2026.3661774
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2026.3661774","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3661774","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"20 1","pages":"C3-C3"},"PeriodicalIF":13.7,"publicationDate":"2026-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11409416","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147274999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-20 | DOI: 10.1109/JSTSP.2026.3665308
{"title":"2025 Index IEEE Journal of Selected Topics in Signal Processing Vol. 19","authors":"","doi":"10.1109/JSTSP.2026.3665308","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3665308","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 8","pages":"2028-2064"},"PeriodicalIF":13.7,"publicationDate":"2026-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11404272","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146223769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-06 | DOI: 10.1109/JSTSP.2026.3658790
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2026.3658790","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3658790","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 8","pages":"C3-C3"},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11373666","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146122792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-06 | DOI: 10.1109/JSTSP.2025.3633987
{"title":"List of Reviewers 2025","authors":"","doi":"10.1109/JSTSP.2025.3633987","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3633987","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 8","pages":"2025-2027"},"PeriodicalIF":13.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11373660","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146122760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As recent advances in multilingual large language models (LLMs) demonstrate powerful performance across numerous tasks, various studies have attempted to analyze their intrinsic behavior across different languages in order to improve these models. Such analyses have since expanded to the modality level and been used to detect modality-specific components (often called neurons) in the vision domain. However, it remains unclear whether such methods are also applicable to speech, another key modality used for everyday communication. In this work, we investigate whether current neuron detection methods can reliably identify neurons associated with speech processing in speech-capable LLMs. Specifically, we utilize two representative neuron detection techniques to identify candidate modality-specific neurons for speech and text, and evaluate their specialization through neuron deactivation experiments across diverse benchmarks and experimental setups. Our results show that, unlike in the text and visual modalities, existing methods do not reliably detect speech-specific neurons, highlighting the limitations of current diagnostic approaches and the need for more effective methods to better interpret and improve speech LLMs.
{"title":"Beyond Language-Specific Neurons: The Challenge of Identifying Speech-Specific Neurons in Multimodal LLMs","authors":"Nohil Park;Che Hyun Lee;Jiheum Yeom;Heeseung Kim;Sungroh Yoon","doi":"10.1109/JSTSP.2026.3657641","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3657641","url":null,"abstract":"As recent advances in multilingual large language models (LLMs) demonstrate powerful performance across numerous tasks, various studies attempt to analyze their intrinsic behavior across different languages to improve these models. Such works have expanded to the modality level, being used to detect modality-specific components (often called neurons) in the vision domain. However, it remains unclear whether such methods are also applicable to speech, another key modality used for everyday communication. In this work, we investigate whether current neuron detection methods can reliably identify neurons associated with speech processing in speech-capable LLMs. Specifically, we utilize two representative neuron detection techniques to identify candidate modality-specific neurons for speech and text, and evaluate their specialization through neuron deactivation experiments across diverse benchmarks and experimental setups. Our results show that, unlike in the text and visual modality, existing methods do not reliably detect speech-specific neurons, highlighting the limitations of current diagnostic approaches and the need for more effective methods to better interpret and improve speech LLMs.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"20 1","pages":"90-98"},"PeriodicalIF":13.7,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147274988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional recommender systems and Large Language Model (LLM)-based recommender systems each have their strengths and weaknesses. While conventional recommendation methods excel at mining collaborative information and modeling sequential behavior, they struggle with data sparsity and the long-tail problem. LLMs, on the other hand, are proficient at utilizing rich textual contexts but face challenges in mining collaborative or sequential information. Despite their individual successes, there is a significant gap in leveraging their ensemble potential to enhance recommendation performance. In this paper, we introduce a general and model-agnostic framework known as Large language models with mutual augmentation and adaptive aggregation for Recommendation (Llama4Rec), which aims to bridge this gap by explicitly ensembling an LLM with a conventional recommendation model for more effective recommendation. We propose data augmentation and prompt augmentation strategies tailored to enhance the conventional recommendation model and the LLM, respectively. An adaptive aggregation module combines the predictions of both kinds of models to refine the final recommendation results. Empirical studies on three datasets validate the superiority of Llama4Rec, demonstrating significant improvements in recommendation performance.
{"title":"Integrating Large Language Models Into Recommendation via Mutual Augmentation and Adaptive Aggregation","authors":"Sichun Luo;Yuxuan Yao;Bowei He;Wei Shao;Jian Xu;Yinya Huang;Aojun Zhou;Xinyi Zhang;Yuanzhang Xiao;Hanxu Hou;Mingjie Zhan;Linqi Song","doi":"10.1109/JSTSP.2026.3653160","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3653160","url":null,"abstract":"Conventional recommender systems and Large Language Model (LLM)-based recommender systems each have their strengths and weaknesses. While conventional recommendation methods excel at mining collaborative information and modeling sequential behavior, they struggle with data sparsity and the long-tail problem. LLM, on the other hand, is proficient at utilizing rich textual contexts but faces challenges in mining collaborative or sequential information. Despite their individual successes, there is a significant gap in leveraging their ensemble potential to enhance recommendation performance. In this paper, we introduce a general and model-agnostic framework known as <underline>L</u>arge <underline>la</u>nguage models with <underline>m</u>utual augmentation and <underline>a</u>daptive aggregation for <underline>Rec</u>ommendation (<bold>Llama4Rec</b>), aiming to bridge this gap via <italic>explicitly ensemble LLM and conventional recommendation model</i> for more effective recommendation. We propose data augmentation and prompt augmentation strategies tailored to enhance the conventional recommendation model and LLM respectively. An adaptive aggregation module is adopted to combine the predictions of both kinds of models to refine the final recommendation results. Empirical studies on three datasets validate the superiority of Llama4Rec, demonstrating significant improvements in recommendation performance.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"20 1","pages":"77-89"},"PeriodicalIF":13.7,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147275001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-12 | DOI: 10.1109/JSTSP.2026.3652299
Qifei Li;Yingming Gao;Yuhua Wen;Yingying Zhou;Zheng Lian;Bin Liu;Zhengqi Wen;Jianhua Tao;Ya Li
Speech emotion recognition (SER) has made significant advancements recently due to its critical role in human-computer interaction. However, current studies predominantly rely on discriminative recognition methods, which can classify emotions but fail to provide insights into the reasoning behind the classification. Recently, researchers have started using large language models (LLMs) for explainable SER. Existing studies follow two main approaches: the first relies on manually annotated information as the basis for the LLM to explain emotions, but such annotation is costly; the second converts speech information into textual descriptions as input to the LLM, but these descriptions often contain limited detail, which may lose emotion-related information and thereby degrade performance. To address these issues, we first propose an automated method for annotating explainable speech emotion datasets that reduces annotation costs by using interpretable speech features, instead of manually annotated subjective information, as the basis for the LLM to explain emotions. Second, we propose a generative explainable SER method based on LLMs, called SEmoLLM, which uses WavLM to encode raw speech signals as input to the LLM, avoiding the loss of emotion-related information. Finally, we evaluate the proposed method on four emotion datasets. The experimental results demonstrate that SEmoLLM performs comparably to discriminative emotion recognition while also enabling basic speech emotion explanation. The results further show that generating descriptions of gender, pitch, or volume can improve emotion recognition performance. The proposed method and findings provide a new perspective on explainability research in emotion-related tasks.
{"title":"Exploring the Use of Large Language Models and Interpretable Features for Explainable Speech Emotion Recognition","authors":"Qifei Li;Yingming Gao;Yuhua Wen;Yingying Zhou;Zheng Lian;Bin Liu;Zhengqi Wen;Jianhua Tao;Ya Li","doi":"10.1109/JSTSP.2026.3652299","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3652299","url":null,"abstract":"Speech emotion recognition (SER) has made significant advancements recently due to its critical role in human-computer interaction. However, current studies predominantly rely on discriminative recognition methods, which can classify emotions but fail to provide insights into the reasoning behind the classification. Recently, researchers have started using large language models (LLM) for explainable SER. Existing studies have two main approaches: one relies on manually annotated information as the basis for LLM to explain emotions, but this annotation is costly. The second converts speech information into textual descriptions as input to LLM, but these descriptions often contain limited details, which may lead to the loss of emotion-related information, thereby degrading performance. To address these issues, we first propose an automated method for annotating explainable speech emotion datasets to reduce annotation costs, using interpretable speech features instead of manually annotated subjective information as the basis for LLM to explain emotions. Second, we propose a generative explainable SER method based on LLM, called SEmoLLM, which uses WavLM to encode raw speech signals as input to the LLM, avoiding the issue of emotion-related information loss. Finally, we evaluate the proposed method on four emotion datasets. The experimental results demonstrate that the performance of SEmoLLM is comparable to that of discriminative emotion recognition, while also enabling basic speech emotion explanation. The results also show that generating descriptions of gender, pitch, or volume can improve emotion recognition performance. The proposed method and findings provide a new perspective on the explainability research in emotion-related tasks.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"20 1","pages":"32-46"},"PeriodicalIF":13.7,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147274984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend substantial effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and several of the underlying techniques have been published in academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
{"title":"SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing","authors":"Ziyang Ma;Guanrou Yang;Wenxi Chen;Zhifu Gao;Yexing Du;Xiquan Li;Zhisheng Zheng;Haina Zhu;Jianheng Zhuo;Zheshu Song;Ruiyang Xu;Tiranrui Wang;Yifan Yang;Yanqiao Zhu;Zhikang Niu;Liumeng Xue;Yinghao Ma;Ruibin Yuan;Shiliang Zhang;Kai Yu;Eng Siong Chng;Xie Chen","doi":"10.1109/JSTSP.2026.3653157","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3653157","url":null,"abstract":"The recent surge in open-source Multimodal Large Language Models (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most of the MLLM frameworks take vision as the main input modality, and provide limited in-depth support for the modality of speech, audio, and music. This situation hinders the development of audio-language models, and forces researchers to spend a lot of effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints like LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been accepted by academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to the LLM-based speech, audio and music processing.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"20 1","pages":"63-76"},"PeriodicalIF":13.7,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11346946","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147274986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-12 | DOI: 10.1109/JSTSP.2026.3653259
Ivan Pereira-Sánchez;Julia Navarro;Ana Belén Petro;Joan Duran
This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively unexplored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution, and fusion tasks. Each sub-task is addressed by unfolding a variational approach, where the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from the spatial and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach.
{"title":"Model-Guided Network With Cluster-Based Operators for Spatio-Spectral Super-Resolution","authors":"Ivan Pereira-Sánchez;Julia Navarro;Ana Belén Petro;Joan Duran","doi":"10.1109/JSTSP.2026.3653259","DOIUrl":"https://doi.org/10.1109/JSTSP.2026.3653259","url":null,"abstract":"This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively explored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution and fusion tasks. Each sub-task is addressed by unfolding a variational-based approach, where the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from spatial super-resolution and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 8","pages":"2010-2024"},"PeriodicalIF":13.7,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146122780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}