Pub Date: 2024-12-12. DOI: 10.1109/JSTSP.2024.3516374
Le Dong;Mengzu Liu;Tengteng Tang;Tao Huang;Jie Lin;Weisheng Dong;Guangming Shi
The snapshot multispectral imaging system using a multispectral filter array (MSFA) efficiently samples the multispectral image (MSI) information of a scene and produces spectral mosaic images. To obtain the complete MSI information from these spectral mosaics, effective demosaicing methods are essential. Traditional MSI demosaicing techniques depend on pixel correlation and various hand-crafted priors, while deep learning-based approaches learn the mapping between spectral mosaic images and MSI directly. However, current methods often fall short in recovery performance, leaving significant room for improvement in the MSI demosaicing field. In this paper, we propose a novel MSI demosaicing method based on the spatial-spectral mixing transformer with hybrid image prior, named SSMT-HIP, to enhance image reconstruction and detail recovery. Our framework, the spatial-spectral mixing transformer (SSMT), is designed to comprehensively learn the spatial-spectral correlations of the data, addressing the limitations of current CNN-based methods in capturing both spatial and spectral characteristics of MSI. Furthermore, we introduce the deep hybrid image prior (HIP), which combines the deep Gaussian scale mixture (GSM) prior and the deep nonlocal auto-regressive (NAR) prior. This hybrid prior is learned in an end-to-end manner through a deep unfolding network. The GSM prior excels at recovering image textures and details, while the NAR prior enhances long-range dependencies in MSI. Extensive experiments on both synthetic and real-world data demonstrate that our proposed method outperforms existing state-of-the-art MSI demosaicing methods.
{"title":"Spatial-Spectral Mixing Transformer With Hybrid Image Prior for Multispectral Image Demosaicing","authors":"Le Dong;Mengzu Liu;Tengteng Tang;Tao Huang;Jie Lin;Weisheng Dong;Guangming Shi","doi":"10.1109/JSTSP.2024.3516374","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3516374","url":null,"abstract":"The snapshot multispectral imaging system using a multispectral filter array (MSFA) efficiently captures sample the multispectral image (MSI) information of scenes and obtain spectral mosaic images. To obtain the complete MSI information from these spectral mosaics, effective demosaicing methods are essential. Traditional MSI demosaicing techniques depend on pixel correlation and various hand-crafted priors, while deep learning-based approaches learn the mapping between spectral mosaic images and MSI directly. However, current methods often fall short in recovery performance, leaving significant room for improvement in the MSI demosaicing field. In this paper, we propose a novel MSI demosaicing method based on the spatial-spectral mixing transformer with hybrid image prior, named SSMT-HIP, to enhance image reconstruction and detail recovery. Our framework, the spatial-spectral mixing transformer (SSMT), is designed to comprehensively learn the spatial-spectral correlations of the data, addressing the limitations of current CNN-based methods in capturing both spatial and spectral characteristics of MSI. Furthermore, we introduce the deep hybrid image prior (HIP), which combines the deep Gaussian scale mixture (GSM) prior and the deep nonlocal auto-regressive (NAR) prior. This hybrid prior is learned in an end-to-end manner through the deep unfolding network. The GSM prior excels at recovering image textures and details, while the NAR prior enhances long-range dependencies in MSI. Extensive experiments on both synthetic and real-world data demonstrate that our proposed method outperforms existing state-of-the-art MSI demosaicing methods.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"221-233"},"PeriodicalIF":8.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143512870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-12. DOI: 10.1109/JSTSP.2024.3516392
Fei Zhou;Tianhao Gu;Zhicong Huang;Guoping Qiu
The visual quality of an image is confounded by a number of intertwined factors, including its semantic content, distortion characteristics, and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high-level knowledge about all these quality-bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality-related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high-level knowledge about images' semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first-of-its-kind large image database annotated with all three categories of quality-relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE delivers superior performance over the state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.
{"title":"Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment","authors":"Fei Zhou;Tianhao Gu;Zhicong Huang;Guoping Qiu","doi":"10.1109/JSTSP.2024.3516392","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3516392","url":null,"abstract":"The visual quality of an image is confounded by a number of intertwined factors including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high level knowledge about all these quality bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high level knowledge about the images semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first of its kind large image database annotated with all three categories of quality relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC<sup>1</sup>) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE has superior performances over state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"234-247"},"PeriodicalIF":8.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143512839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-09. DOI: 10.1109/JSTSP.2024.3511403
Wei Chen;Yuanwei Liu;Hamid Jafarkhani;Yonina C. Eldar;Peiying Zhu;Khaled B. Letaief
Wireless communication systems to date primarily rely on the orthogonality of resources to facilitate design and implementation, from user access to data transmission. Emerging applications and scenarios in sixth generation (6G) wireless systems will require massive connectivity and the transmission of a deluge of data, which calls for more flexibility in design concepts that go beyond orthogonality. Furthermore, recent advances in signal processing and learning, e.g., deep learning, provide promising approaches to deal with complex and previously intractable problems. This article provides an overview of research efforts to date in the field of signal processing and learning for next-generation multiple access (NGMA), with an emphasis on massive random access and non-orthogonal multiple access. The promising interplay with new technologies and the challenges in learning-based NGMA are discussed.
{"title":"Signal Processing and Learning for Next Generation Multiple Access in 6G","authors":"Wei Chen;Yuanwei Liu;Hamid Jafarkhani;Yonina C. Eldar;Peiying Zhu;Khaled B. Letaief","doi":"10.1109/JSTSP.2024.3511403","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3511403","url":null,"abstract":"Wireless communication systems to date primarily rely on the orthogonality of resources to facilitate the design and implementation, from user access to data transmission. Emerging applications and scenarios in the sixth generation (6G) wireless systems will require massive connectivity and transmission of a deluge of data, which calls for more flexibility in the design concept that goes beyond orthogonality. Furthermore, recent advances in signal processing and learning, e.g., deep learning, provide promising approaches to deal with complex and previously intractable problems. This article provides an overview of research efforts to date in the field of signal processing and learning for next-generation multiple access, with an emphasis on massive random access and non-orthogonal multiple access. The promising interplay with new technologies and the challenges in learning-based NGMA are discussed.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 7","pages":"1146-1177"},"PeriodicalIF":8.7,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-04. DOI: 10.1109/JSTSP.2024.3467914
Debarpan Bhattacharya;Amir H. Poorjam;Deepak Mittal;Sriram Ganapathy
The recent advancements in artificial intelligence (AI), with the release of several large models offering only query access, make a strong case for explainability of deep models in a post-hoc, gradient-free manner. In this paper, we propose a framework, named distillation aided explainability (DAX), that generates saliency-based explanations in a model-agnostic, gradient-free setting. The DAX approach poses the problem of explanation in a learnable setting with a mask generation network and a distillation network. The mask generation network learns to generate the multiplier mask that finds the salient regions of the input, while the student distillation network aims to approximate the local behavior of the black-box model. We propose a joint optimization of the two networks in the DAX framework using locally perturbed input samples, with the targets derived from input-output access to the black-box model. We extensively evaluate DAX across different modalities (image and audio) in a classification setting, using a diverse set of evaluations (intersection over union with ground truth, deletion-based measures, and subjective human evaluation) and benchmark it against 9 different methods. In these evaluations, DAX significantly outperforms the existing approaches on all modalities and evaluation metrics.
{"title":"Gradient-Free Post-Hoc Explainability Using Distillation Aided Learnable Approach","authors":"Debarpan Bhattacharya;Amir H. Poorjam;Deepak Mittal;Sriram Ganapathy","doi":"10.1109/JSTSP.2024.3467914","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3467914","url":null,"abstract":"The recent advancements in artificial intelligence (AI), with the release of several large models having only query access, make a strong case for explainability of deep models in a post-hoc gradient free manner. In this paper, we propose a framework, named distillation aided explainability (DAX), that attempts to generate a saliency-based explanation in a model agnostic gradient free application. The DAX approach poses the problem of explanation in a learnable setting with a mask generation network and a distillation network. The mask generation network learns to generate the multiplier mask that finds the salient regions of the input, while the student distillation network aims to approximate the local behavior of the black-box model. We propose a joint optimization of the two networks in the DAX framework using the locally perturbed input samples, with the targets derived from input-output access to the black-box model. We extensively evaluate DAX across different modalities (image and audio), in a classification setting, using a diverse set of evaluations (intersection over union with ground truth, deletion based and subjective human evaluation based measures) and benchmark it with respect to 9 different methods. In these evaluations, the DAX significantly outperforms the existing approaches on all modalities and evaluation metrics.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"169-180"},"PeriodicalIF":8.7,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143512982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-27. DOI: 10.1109/JSTSP.2024.3507371
Navjot Singh;Suhas Diggavi
In this paper, we consider the problem of learning a linear regression model on a data domain of interest (target) given few samples. To aid learning, we are provided with a set of pre-trained regression models trained on potentially different data domains (sources). Assuming a representation structure for the data-generating linear models at the source and target domains, we propose a representation transfer based learning method for constructing the target model. The proposed scheme comprises two phases: (i) utilizing the different source representations to construct a representation that is adapted to the target data, and (ii) using the obtained model as an initialization for a fine-tuning procedure that re-trains the entire (over-parameterized) regression model on the target data. For each phase of the training method, we provide excess risk bounds for the learned model compared to the true data-generating target model. The derived bounds show a gain in sample complexity for our proposed method compared to the baseline of not leveraging source representations when achieving the same excess risk, thereby theoretically demonstrating the effectiveness of transfer learning for linear regression.
{"title":"Representation Transfer Learning via Multiple Pre-Trained Models for Linear Regression","authors":"Navjot Singh;Suhas Diggavi","doi":"10.1109/JSTSP.2024.3507371","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3507371","url":null,"abstract":"In this paper, we consider the problem of learning a linear regression model on a data domain of interest (target) given few samples. To aid learning, we are provided with a set of pre-trained regression models that are trained on potentially different data domains (sources). Assuming a representation structure for the data generating linear models at the sources and the target domains, we propose a representation transfer based learning method for constructing the target model. The proposed scheme is comprised of two phases: (i) utilizing the different source representations to construct a representation that is adapted to the target data, and (ii) using the obtained model as an initialization to a fine-tuning procedure that re-trains the entire (over-parameterized) regression model on the target data. For each phase of the training method, we provide excess risk bounds for the learned model compared to the true data generating target model. The derived bounds show a gain in sample complexity for our proposed method compared to the baseline method of not leveraging source representations when achieving the same excess risk, therefore, theoretically demonstrating the effectiveness of transfer learning for linear regression.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"208-220"},"PeriodicalIF":8.7,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143512857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-26. DOI: 10.1109/JSTSP.2024.3506286
Haohe Liu;Xuenan Xu;Yi Yuan;Mengyue Wu;Wenwu Wang;Mark D. Plumbley
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bitrates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec in reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates.
{"title":"SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound","authors":"Haohe Liu;Xuenan Xu;Yi Yuan;Mengyue Wu;Wenwu Wang;Mark D. Plumbley","doi":"10.1109/JSTSP.2024.3506286","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3506286","url":null,"abstract":"Large languagemodels (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1448-1461"},"PeriodicalIF":8.7,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-20. DOI: 10.1109/JSTSP.2024.3497655
Cheol Jun Cho;Peter Wu;Tejas S. Prabhune;Dhruv Agarwal;Gopala K. Anumanchipalli
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators, combined with the vocal source, shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework for neural encoding-decoding of speech – Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system for speech.
{"title":"Coding Speech Through Vocal Tract Kinematics","authors":"Cheol Jun Cho;Peter Wu;Tejas S. Prabhune;Dhruv Agarwal;Gopala K. Anumanchipalli","doi":"10.1109/JSTSP.2024.3497655","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3497655","url":null,"abstract":"Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech – Speech Articulatory Coding (SPARC). SPARC comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-perserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1427-1440"},"PeriodicalIF":8.7,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-18. DOI: 10.1109/JSTSP.2024.3501685
Haozhe Sun;Alexandre Heuillet;Felix Mohr;Hedi Tabia
Transformer models have gained popularity for their exceptional performance. However, these models still face the challenge of high inference latency. To improve the computational efficiency of such models, we propose a novel differentiable pruning method called DARIO (DifferentiAble vision transformer pRunIng with low-cost prOxies). Our approach involves optimizing a set of gating parameters using differentiable, data-agnostic, scale-invariant, and low-cost performance proxies. DARIO is a data-agnostic pruning method: it does not need any classification heads during pruning. We evaluated DARIO on two popular state-of-the-art pre-trained ViT models, including both large (MAE-ViT) and small (MobileViT) sizes. Extensive experiments conducted across 40 diverse datasets demonstrated the effectiveness and efficiency of our DARIO method. DARIO not only significantly improves inference efficiency on modern hardware but also excels in preserving accuracy. Notably, DARIO has even achieved an increase in accuracy on MobileViT, despite only fine-tuning the last block and the classification head.
{"title":"DARIO: Differentiable Vision Transformer Pruning With Low-Cost Proxies","authors":"Haozhe Sun;Alexandre Heuillet;Felix Mohr;Hedi Tabia","doi":"10.1109/JSTSP.2024.3501685","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3501685","url":null,"abstract":"Transformer models have gained popularity for their exceptional performance. However, these models still face the challenge of high inference latency. To improve the computational efficiency of such models, we propose a novel differentiable pruning method called DARIO (<bold>D</b>ifferenti<bold>A</b>ble vision transformer p<bold>R</b>un<bold>I</b>ng with low-cost pr<bold>O</b>xies). Our approach involves optimizing a set of gating parameters using differentiable, data-agnostic, scale-invariant, and low-cost performance proxies. DARIO is a data-agnostic pruning method, it does not need any classification heads during pruning. We evaluated DARIO on two popular state-of-the-art pre-trained ViT models, including both large (MAE-ViT) and small (MobileViT) sizes. Extensive experiments conducted across 40 diverse datasets demonstrated the effectiveness and efficiency of our DARIO method. DARIO not only significantly improves inference efficiency on modern hardware but also excels in preserving accuracy. Notably, DARIO has even achieved an increase in accuracy on MobileViT, despite only fine-tuning the last block and the classification head.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 6","pages":"997-1009"},"PeriodicalIF":8.7,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-18. DOI: 10.1109/JSTSP.2024.3501681
Eunkyun Lee;Seungkwon Beack;Jong Won Shin
The Alias-and-Separate (AaS) speech coding framework has shown that it is possible to encode wideband (WB) speech with a narrowband (NB) speech codec and reconstruct it using speech separation. WB speech is first decimated, incurring aliasing, and then coded, transmitted, and decoded with a NB codec. The decoded signal is then separated into a lower band and a spectrally-flipped high band using a speech separation module; these are expanded, lowpass/highpass filtered, and added together to reconstruct the WB speech. The original AaS system, however, has algorithmic delay originating from the overlap-add operation for consecutive segments. This algorithmic delay can be reduced by omitting the overlap-add procedure, but the quality of the reconstructed speech is then degraded by artifacts at the segment boundaries. In this work, we propose an improved AaS framework with minimal algorithmic delay. The decoded signal is first expanded by inserting zeros between samples before being processed by the source separation module. As the expanded signal can be viewed as a summation of frequency-shifted versions of the original signal, the decoded-and-expanded signal is then separated into the frequency-shifted signals, which are multiplied by complex exponentials and summed to reconstruct the original signal. With a carefully designed transposed convolution operation in the separation module, the proposed system requires minimal algorithmic delay while preventing discontinuity at the segment boundaries. Additionally, we propose to employ a generative vocoder and a modified multi-resolution short-time Fourier transform (MR-STFT) loss to further improve the perceived quality. Experimental results on WB speech coding with a NB codec demonstrate that the proposed system outperforms the original AaS system and an existing WB speech codec in subjective listening tests. We also show that the proposed method can be applied when the decimation factor is not 2, in an experiment on fullband speech coding with a WB codec.
{"title":"Improved Alias-and-Separate Speech Coding Framework With Minimal Algorithmic Delay","authors":"Eunkyun Lee;Seungkwon Beack;Jong Won Shin","doi":"10.1109/JSTSP.2024.3501681","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3501681","url":null,"abstract":"Alias-and-Separate (AaS) speech coding framework has shown the possibility to encode wideband (WB) speech with a narrowband (NB) speech codec and reconstruct it using speech separation. WB speech is first decimated incurring aliasing and then coded, transmitted, and decoded with a NB codec. The decoded signal is then separated into lower band and spectrally-flipped high band using a speech separation module, which are expanded, lowpass/highpass filtered, and added together to reconstruct the WB speech. The original AaS system, however, has algorithmic delay originated from the overlap-add operation for consecutive segments. This algorithmic delay can be reduced by omitting the overlap-add procedure, but the quality of the reconstructed speech is also degraded due to artifacts on the segment boundaries. In this work, we propose an improved AaS framework with minimum algorithmic delay. The decoded signal is first expanded by inserting zeros in-between samples before being processed by source separation module. As the expanded signal can be viewed as a summation of the frequency-shifted versions of the original signal, the decoded-and-expanded signal is then separated into the frequency-shifted signals, which are multiplied by complex exponentials and summed up to reconstruct the original signal. With carefully designed transposed convolution operation in the separation module, the proposed system requires minimal algorithmic delay while preventing discontinuity at the segment boundaries. Additionally, we propose to employ a generative vocoder to further improve the perceived quality and a modified multi-resolution short-time Fourier transform (MR-STFT) loss. Experimental results on the WB speech coding with a NB codec demonstrated that the proposed system outperformed the original AaS system and the existing WB speech codec in the subjective listening test. We have also shown that the proposed method can be applied when the decimation factor is not 2 in the experiment on the fullband speech coding with a WB codec.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"18 8","pages":"1414-1426"},"PeriodicalIF":8.7,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143184471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-13. DOI: 10.1109/JSTSP.2024.3497660
Jian Xu;Shuo Wan;Yinchuan Li;Sichun Luo;Zhilin Chen;Yunfeng Shao;Zhitang Chen;Shao-Lun Huang;Linqi Song
Federated learning (FL) is an increasingly popular paradigm for protecting data privacy in machine learning systems. However, data heterogeneity and high computation cost/latency are challenging barriers to employing FL in real-world applications with heterogeneous devices. In this paper, we propose a novel personalized FL framework named CompFL that allows cooperative training of models with varied structures to mitigate these issues. First, CompFL initializes a set of expert models of varied sizes and allows each client to choose one or multiple expert models for training according to its capacity. Second, CompFL combines the model decoupling strategy and local-global feature alignment to mitigate the adverse impact of label heterogeneity, where clients only share the feature extractor part of each model architecture. Third, to encourage mutual enhancement of the various models, knowledge distillation in local training is further applied to improve overall performance. To make our framework workable in real systems, we implement it in both centralized settings with server-coordinated parallel training and decentralized settings with newly developed device-to-device training-forwarding schemes. Extensive experiments on benchmark datasets verify the potential of our framework for personalized FL over heterogeneous devices.
{"title":"Cooperative Multi-Model Training for Personalized Federated Learning Over Heterogeneous Devices","authors":"Jian Xu;Shuo Wan;Yinchuan Li;Sichun Luo;Zhilin Chen;Yunfeng Shao;Zhitang Chen;Shao-Lun Huang;Linqi Song","doi":"10.1109/JSTSP.2024.3497660","DOIUrl":"https://doi.org/10.1109/JSTSP.2024.3497660","url":null,"abstract":"Federated learning (FL) is an increasingly popular paradigm for protecting data privacy in machine learning systems. However, the data heterogeneity and high computation cost/latency are challenging barriers for employing FL in real-world applications with heterogeneous devices. In this paper, we propose a novel personalized FL framework named <inline-formula><tex-math>$mathtt {CompFL}$</tex-math></inline-formula> allowing cooperative training of models with varied structures to mitigate those issues. First, <inline-formula><tex-math>$mathtt {CompFL}$</tex-math></inline-formula> initializes a set of expert models in varied sizes and allows each client to choose one or multiple expert models for training according to their capacities. Second, <inline-formula><tex-math>$mathtt {CompFL}$</tex-math></inline-formula> combines the model decoupling strategy and local-global feature alignment to mitigate the adverse impact of label heterogeneity, where clients only share the feature extractor part for each model architecture. Third, to encourage mutual enhancement of various models, knowledge distillation in local training is further applied to improve the overall performance. To make our framework workable in real systems, we implement it in both centralized settings with server-coordinated parallel training, and decentralized settings with newly developed device-to-device training-forwarding schemes. Extensive experiments on benchmark datasets are conducted to verify the potential of our framework for personalized FL over heterogeneous devices.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"195-207"},"PeriodicalIF":8.7,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143512858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}