
Latest Articles from IEEE Journal of Selected Topics in Signal Processing

IEEE Signal Processing Society Publication Information
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-02-24 · DOI: 10.1109/JSTSP.2026.3661776
Vol. 20, no. 1, pp. C2–C2. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11409417
Citations: 0
IEEE Signal Processing Society Information
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-02-24 · DOI: 10.1109/JSTSP.2026.3661774
Vol. 20, no. 1, pp. C3–C3. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11409416
Citations: 0
2025 Index, IEEE Journal of Selected Topics in Signal Processing, Vol. 19
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-02-20 · DOI: 10.1109/JSTSP.2026.3665308
Vol. 19, no. 8, pp. 2028–2064. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11404272
Citations: 0
IEEE Signal Processing Society Information
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-02-06 · DOI: 10.1109/JSTSP.2026.3658790
Vol. 19, no. 8, pp. C3–C3. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11373666
Citations: 0
List of Reviewers 2025
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-02-06 · DOI: 10.1109/JSTSP.2025.3633987
Vol. 19, no. 8, pp. 2025–2027. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11373660
Citations: 0
Beyond Language-Specific Neurons: The Challenge of Identifying Speech-Specific Neurons in Multimodal LLMs
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-01-28 · DOI: 10.1109/JSTSP.2026.3657641
Nohil Park; Che Hyun Lee; Jiheum Yeom; Heeseung Kim; Sungroh Yoon
As recent advances in multilingual large language models (LLMs) demonstrate powerful performance across numerous tasks, various studies attempt to analyze their intrinsic behavior across different languages to improve these models. Such works have expanded to the modality level, being used to detect modality-specific components (often called neurons) in the vision domain. However, it remains unclear whether such methods are also applicable to speech, another key modality used for everyday communication. In this work, we investigate whether current neuron detection methods can reliably identify neurons associated with speech processing in speech-capable LLMs. Specifically, we utilize two representative neuron detection techniques to identify candidate modality-specific neurons for speech and text, and evaluate their specialization through neuron deactivation experiments across diverse benchmarks and experimental setups. Our results show that, unlike in the text and visual modalities, existing methods do not reliably detect speech-specific neurons, highlighting the limitations of current diagnostic approaches and the need for more effective methods to better interpret and improve speech LLMs.
Vol. 20, no. 1, pp. 90–98.
Citations: 0
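The detect-then-deactivate protocol the abstract describes can be illustrated with a toy sketch. This is my own minimal illustration under stated assumptions, not the paper's actual method: candidate modality-specific neurons are flagged by comparing mean activation magnitudes across two modalities, and "deactivation" simply zeroes those units. All data below is synthetic.

```python
# Toy sketch of neuron detection + deactivation (illustrative assumptions,
# not the paper's code). A "neuron" is one coordinate of an activation vector.

def candidate_neurons(acts_a, acts_b, margin=0.5):
    """Indices whose mean |activation| on modality A exceeds modality B by `margin`."""
    n = len(acts_a[0])
    mean = lambda acts, j: sum(abs(v[j]) for v in acts) / len(acts)
    return [j for j in range(n) if mean(acts_a, j) - mean(acts_b, j) > margin]

def deactivate(vec, neuron_ids):
    """Zero the selected units of one activation vector."""
    return [0.0 if j in neuron_ids else v for j, v in enumerate(vec)]

# Synthetic activations: unit 2 fires strongly only for the speech inputs.
speech_acts = [[0.1, 0.2, 2.0, 0.1], [0.0, 0.3, 1.8, 0.2]]
text_acts   = [[0.1, 0.2, 0.1, 0.1], [0.2, 0.1, 0.0, 0.3]]

ids = candidate_neurons(speech_acts, text_acts)
print(ids)                               # -> [2]
print(deactivate(speech_acts[0], ids))   # -> [0.1, 0.2, 0.0, 0.1]
```

The paper's point is then measured downstream: if benchmark performance barely changes after deactivating the flagged units, the "speech-specific" label was not reliable.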
Integrating Large Language Models Into Recommendation via Mutual Augmentation and Adaptive Aggregation
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-01-16 · DOI: 10.1109/JSTSP.2026.3653160
Sichun Luo; Yuxuan Yao; Bowei He; Wei Shao; Jian Xu; Yinya Huang; Aojun Zhou; Xinyi Zhang; Yuanzhang Xiao; Hanxu Hou; Mingjie Zhan; Linqi Song
Conventional recommender systems and Large Language Model (LLM)-based recommender systems each have their strengths and weaknesses. While conventional recommendation methods excel at mining collaborative information and modeling sequential behavior, they struggle with data sparsity and the long-tail problem. LLMs, on the other hand, are proficient at utilizing rich textual contexts but face challenges in mining collaborative or sequential information. Despite their individual successes, there is a significant gap in leveraging their ensemble potential to enhance recommendation performance. In this paper, we introduce a general and model-agnostic framework known as Large language models with mutual augmentation and adaptive aggregation for Recommendation (Llama4Rec), aiming to bridge this gap by explicitly ensembling the LLM and the conventional recommendation model for more effective recommendation. We propose data augmentation and prompt augmentation strategies tailored to enhance the conventional recommendation model and the LLM, respectively. An adaptive aggregation module is adopted to combine the predictions of both kinds of models to refine the final recommendation results. Empirical studies on three datasets validate the superiority of Llama4Rec, demonstrating significant improvements in recommendation performance.
Vol. 20, no. 1, pp. 77–89.
Citations: 0
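To make the idea of adaptive aggregation concrete, here is a hedged sketch of one plausible gating rule — my assumption, not Llama4Rec's actual module: blend the conventional model's score with the LLM's score, trusting collaborative filtering more for items with many interactions and the LLM more in the sparse long tail. The function name, `tau`, and the gate form are all illustrative.

```python
# Illustrative sketch of score-level adaptive aggregation (assumed gating
# rule, not the paper's implementation).

def adaptive_score(cf_score, llm_score, interaction_count, tau=5.0):
    """Confidence gate: more interactions -> weight the CF model more."""
    w_cf = interaction_count / (interaction_count + tau)   # in [0, 1)
    return w_cf * cf_score + (1.0 - w_cf) * llm_score

# Head item (dense collaborative signal) leans on CF; tail item on the LLM.
print(adaptive_score(0.9, 0.4, interaction_count=95))   # -> 0.875
print(adaptive_score(0.2, 0.8, interaction_count=1))    # -> ~0.7
```

In the paper the aggregation weights are learned rather than fixed by a formula; the sketch only shows why a per-item adaptive weight can outperform a global average.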
Exploring the Use of Large Language Models and Interpretable Features for Explainable Speech Emotion Recognition
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-01-12 · DOI: 10.1109/JSTSP.2026.3652299
Qifei Li; Yingming Gao; Yuhua Wen; Yingying Zhou; Zheng Lian; Bin Liu; Zhengqi Wen; Jianhua Tao; Ya Li
Speech emotion recognition (SER) has made significant advancements recently due to its critical role in human-computer interaction. However, current studies predominantly rely on discriminative recognition methods, which can classify emotions but fail to provide insights into the reasoning behind the classification. Recently, researchers have started using large language models (LLMs) for explainable SER. Existing studies take two main approaches: the first relies on manually annotated information as the basis for the LLM to explain emotions, but this annotation is costly. The second converts speech information into textual descriptions as input to the LLM, but these descriptions often contain limited details, which may lead to the loss of emotion-related information and thereby degrade performance. To address these issues, we first propose an automated method for annotating explainable speech emotion datasets to reduce annotation costs, using interpretable speech features instead of manually annotated subjective information as the basis for the LLM to explain emotions. Second, we propose a generative explainable SER method based on LLMs, called SEmoLLM, which uses WavLM to encode raw speech signals as input to the LLM, avoiding the issue of emotion-related information loss. Finally, we evaluate the proposed method on four emotion datasets. The experimental results demonstrate that the performance of SEmoLLM is comparable to that of discriminative emotion recognition, while also enabling basic speech emotion explanation. The results also show that generating descriptions of gender, pitch, or volume can improve emotion recognition performance. The proposed method and findings provide a new perspective on explainability research in emotion-related tasks.
Vol. 20, no. 1, pp. 32–46.
Citations: 0
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-01-12 · DOI: 10.1109/JSTSP.2026.3653157
Ziyang Ma; Guanrou Yang; Wenxi Chen; Zhifu Gao; Yexing Du; Xiquan Li; Zhisheng Zheng; Haina Zhu; Jianheng Zhuo; Zheshu Song; Ruiyang Xu; Tiranrui Wang; Yifan Yang; Yanqiao Zhu; Zhikang Niu; Liumeng Xue; Yinghao Ma; Ruibin Yuan; Shiliang Zhang; Kai Yu; Eng Siong Chng; Xie Chen
The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend substantial effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been accepted by academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.
Vol. 20, no. 1, pp. 63–76. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11346946
Citations: 0
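The encoder → projector → LLM composition that the abstract describes can be sketched as a tiny pipeline. The class name, component signatures, and stub components below are illustrative placeholders of mine, not SLAM-LLM's real API; they only show how keeping the three stages as independently swappable modules enables the framework's mix-and-match configuration.

```python
# Hedged sketch of a modular audio-LLM pipeline (assumed structure, not
# SLAM-LLM's actual interfaces): each stage is an independent callable.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AudioLLMPipeline:
    encoder: Callable[[List[float]], List[float]]    # waveform -> features (e.g. a WavLM-like model)
    projector: Callable[[List[float]], List[float]]  # features -> LLM embedding space
    llm: Callable[[List[float]], str]                # embeddings -> text

    def transcribe(self, waveform: List[float]) -> str:
        return self.llm(self.projector(self.encoder(waveform)))

# Stub components; any stage can be swapped without touching the others.
pipe = AudioLLMPipeline(
    encoder=lambda w: [sum(w) / len(w)],          # dummy one-dim feature
    projector=lambda f: [2.0 * v for v in f],     # dummy linear projection
    llm=lambda e: f"decoded({e[0]:.1f})",         # dummy decoder
)
print(pipe.transcribe([0.1, 0.3]))                # -> decoded(0.4)
```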
Model-Guided Network With Cluster-Based Operators for Spatio-Spectral Super-Resolution
IF 13.7 · CAS Tier 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC · Pub Date: 2026-01-12 · DOI: 10.1109/JSTSP.2026.3653259
Ivan Pereira-Sánchez; Julia Navarro; Ana Belén Petro; Joan Duran
This paper addresses the problem of reconstructing a high-resolution hyperspectral image from a low-resolution multispectral observation. While spatial super-resolution and spectral super-resolution have been extensively studied, joint spatio-spectral super-resolution remains relatively underexplored. We propose an end-to-end model-driven framework that explicitly decomposes the joint spatio-spectral super-resolution problem into spatial super-resolution, spectral super-resolution, and fusion tasks. Each sub-task is addressed by unfolding a variational approach, in which the operators involved in the proximal gradient iterative scheme are replaced with tailored learnable modules. In particular, we design an upsampling operator for spatial super-resolution based on classical back-projection algorithms, adapted to handle arbitrary scaling factors. Spectral reconstruction is performed using learnable cluster-based upsampling and downsampling operators. For image fusion, we integrate low-frequency estimation and high-frequency injection modules to combine the spatial and spectral information from the spatial super-resolution and spectral super-resolution outputs. Additionally, we introduce an efficient nonlocal post-processing step that leverages image self-similarity by combining a multi-head attention mechanism with residual connections. Extensive evaluations on several datasets and sampling factors demonstrate the effectiveness of our approach.
Vol. 19, no. 8, pp. 2010–2024.
Citations: 0
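The proximal gradient scheme that such unfolding networks build on iterates x ← prox(x − step · Aᵀ(Ax − y)). Below is a minimal self-contained sketch with a toy 1-D degradation operator (pairwise averaging) and a soft-thresholding prox; the paper replaces exactly these hand-crafted operators with learnable modules. The operator choice, step size, and threshold here are my own toy assumptions.

```python
# Minimal ISTA-style proximal gradient sketch (toy operators, not the
# paper's network): reconstruct a length-n signal from its pairwise means.

def A(x):                          # toy degradation: average adjacent pairs
    return [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]

def At(r, n):                      # its adjoint: spread residual back
    out = [0.0] * n
    for i, v in enumerate(r):
        out[2 * i] += v / 2
        out[2 * i + 1] += v / 2
    return out

def soft(v, t):                    # proximal operator of t * ||.||_1
    return [max(abs(u) - t, 0.0) * (1.0 if u >= 0 else -1.0) for u in v]

def ista(y, n, step=1.0, t=0.01, iters=200):
    x = [0.0] * n
    for _ in range(iters):
        r = [a - b for a, b in zip(A(x), y)]                    # data-fit residual
        g = At(r, n)                                            # gradient of 0.5*||Ax - y||^2
        x = soft([xi - step * gi for xi, gi in zip(x, g)], step * t)
    return x

y = [1.0, 2.0]                     # low-resolution observation
x = ista(y, n=4)
print([round(v, 2) for v in x])    # -> approximately [0.98, 0.98, 1.98, 1.98]
```

An unfolded network runs a fixed, small number of such iterations and learns `A`-like, `At`-like, and prox-like modules end to end, which is what lets the paper handle arbitrary scaling factors with data-driven operators.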