
Latest publications in IEEE Open Journal of Signal Processing

Non-Gaussian Process Dynamical Models
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-27 DOI: 10.1109/OJSP.2025.3534690
Yaman Kındap;Simon Godsill
Probabilistic dynamical models used in tracking and prediction applications are typically assumed to be Gaussian-noise-driven motions, since well-known inference algorithms can be applied to such models. However, in many real-world examples deviations from Gaussianity are expected to appear, e.g., rapid changes in speed or direction, which cannot be reflected by processes with a smooth mean response. In this work, we introduce the non-Gaussian process (NGP) dynamical model, which allows for straightforward modelling of heavy-tailed, non-Gaussian behaviours while retaining a tractable conditional Gaussian process (GP) structure through a representation as an infinite mixture of non-homogeneous GPs. We present two novel inference methodologies for these new models, based on the conditionally Gaussian formulation of NGPs, which are suitable for both MCMC and marginalised particle filtering algorithms. The results are demonstrated on synthetically generated data sets.
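To make the conditionally Gaussian structure concrete, here is a minimal numpy sketch (not the authors' implementation) of a random-walk state driven by Student-t noise written as a Gaussian scale mixture: conditioned on the sampled mixing variances, the model is an ordinary linear-Gaussian state-space model, so a standard Kalman filter applies. The state dimension, parameter values, and the inverse-gamma mixing choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Student-t driving noise as a Gaussian scale mixture:
# v_t | s_t ~ N(0, s_t),  s_t ~ Inverse-Gamma(nu/2, nu/2)  =>  v_t ~ Student-t(nu)
nu, T = 3.0, 200
scales = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=T)   # inverse-gamma draws
x = np.cumsum(rng.normal(0.0, np.sqrt(scales)))                    # heavy-tailed random walk
y = x + rng.normal(0.0, 0.5, size=T)                               # noisy observations

# Conditioned on the mixing variances `scales`, the model is linear-Gaussian,
# so a standard Kalman filter gives the exact conditional posterior of the state.
def kalman_filter(y, scales, obs_var=0.25):
    m, P = 0.0, 10.0            # prior mean / variance of the state
    means = np.empty(len(y))
    for t, (obs, q) in enumerate(zip(y, scales)):
        P = P + q               # predict: random walk with time-varying process variance q
        K = P / (P + obs_var)   # update with the scalar observation
        m = m + K * (obs - m)
        P = (1.0 - K) * P
        means[t] = m
    return means

posterior_mean = kalman_filter(y, scales)
print("RMSE of conditional Kalman estimate:", np.sqrt(np.mean((posterior_mean - x) ** 2)))
```

In a marginalised particle filter the mixing variances would be sampled per particle rather than assumed known, with the Kalman recursion run conditionally for each particle.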
{"title":"Non-Gaussian Process Dynamical Models","authors":"Yaman Kındap;Simon Godsill","doi":"10.1109/OJSP.2025.3534690","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3534690","url":null,"abstract":"Probabilistic dynamical models used in applications in tracking and prediction are typically assumed to be Gaussian noise driven motions since well-known inference algorithms can be applied to these models. However, in many real world examples deviations from Gaussianity are expected to appear, e.g., rapid changes in speed or direction, which cannot be reflected using processes with a smooth mean response. In this work, we introduce the non-Gaussian process (NGP) dynamical model which allow for straightforward modelling of heavy-tailed, non-Gaussian behaviours while retaining a tractable conditional Gaussian process (GP) structure through an infinite mixture of non-homogeneous GPs representation. We present two novel inference methodologies for these new models based on the conditionally Gaussian formulation of NGPs which are suitable for both MCMC and marginalised particle filtering algorithms. The results are demonstrated on synthetically generated data sets.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"213-221"},"PeriodicalIF":2.9,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10854574","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Moving Object Segmentation in LiDAR Point Clouds Using Minimal Number of Sweeps
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-20 DOI: 10.1109/OJSP.2025.3532199
Zoltan Rozsa;Akos Madaras;Tamas Sziranyi
LiDAR point clouds are a rich source of information for autonomous vehicles and ADAS systems. However, they can be challenging to segment for moving objects because, among other things, finding correspondences between the sparse point clouds of consecutive frames is difficult. Traditional methods rely on a (global or local) map of the environment, which can be demanding to acquire and maintain in real-world conditions and in the presence of the moving objects themselves. This paper proposes a novel approach that uses as few sweeps as possible to decrease the computational burden and achieve mapless moving object segmentation (MOS) in LiDAR point clouds. Our approach is based on a multimodal learning model with single-modal inference. The model is trained on a dataset of LiDAR point clouds and related camera images. The model learns to associate features from the two modalities, allowing it to predict dynamic objects even in the absence of a map and of the camera modality. We propose using semantic information for multi-frame instance segmentation in order to enhance performance measures. We evaluate our approach on the SemanticKITTI and Apollo real-world autonomous driving datasets. Our results show that our approach can achieve state-of-the-art performance on moving object segmentation while utilizing only a few (even one) LiDAR frames.
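One common way to realise multimodal training with single-modal inference — an assumption here, not necessarily the mechanism used in this paper — is to teach the LiDAR branch to predict camera-derived features during training, so that at test time LiDAR input alone can stand in for both modalities. A toy numpy sketch with random stand-in features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for per-point/per-cell descriptors (real systems would use learned encoders).
n_train, d_lidar, d_cam = 500, 16, 8
lidar_feats = rng.normal(size=(n_train, d_lidar))
cam_feats = lidar_feats @ rng.normal(size=(d_lidar, d_cam)) + 0.1 * rng.normal(size=(n_train, d_cam))

# Training (both modalities available): fit a linear "hallucination" map lidar -> camera features.
W, *_ = np.linalg.lstsq(lidar_feats, cam_feats, rcond=None)

# Inference (LiDAR only): predict the missing camera features and fuse with the LiDAR ones.
lidar_only = rng.normal(size=(10, d_lidar))
fused = np.concatenate([lidar_only, lidar_only @ W], axis=1)   # shape (10, d_lidar + d_cam)
print(fused.shape)
```

A downstream MOS classifier would then consume `fused` without ever seeing camera data at inference time.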
{"title":"Efficient Moving Object Segmentation in LiDAR Point Clouds Using Minimal Number of Sweeps","authors":"Zoltan Rozsa;Akos Madaras;Tamas Sziranyi","doi":"10.1109/OJSP.2025.3532199","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3532199","url":null,"abstract":"LiDAR point clouds are a rich source of information for autonomous vehicles and ADAS systems. However, they can be challenging to segment for moving objects as - among other things - finding correspondences between sparse point clouds of consecutive frames is difficult. Traditional methods rely on a (global or local) map of the environment, which can be demanding to acquire and maintain in real-world conditions and the presence of the moving objects themselves. This paper proposes a novel approach using as minimal sweeps as possible to decrease the computational burden and achieve mapless moving object segmentation (MOS) in LiDAR point clouds. Our approach is based on a multimodal learning model with single-modal inference. The model is trained on a dataset of LiDAR point clouds and related camera images. The model learns to associate features from the two modalities, allowing it to predict dynamic objects even in the absence of a map and the camera modality. We propose semantic information usage for multi-frame instance segmentation in order to enhance performance measures. We evaluate our approach to the SemanticKITTI and Apollo real-world autonomous driving datasets. Our results show that our approach can achieve state-of-the-art performance on moving object segmentation and utilize only a few (even one) LiDAR frames.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"118-128"},"PeriodicalIF":2.9,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10848132","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143379492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LIMMITS'24: Multi-Speaker, Multi-Lingual INDIC TTS With Voice Cloning
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-20 DOI: 10.1109/OJSP.2025.3531782
Sathvik Udupa;Jesuraja Bandekar;Abhayjeet Singh;Deekshitha G;Saurabh Kumar;Sandhya Badiger;Amala Nagireddi;Roopa R;Prasanta Kumar Ghosh;Hema A. Murthy;Pranaw Kumar;Keiichi Tokuda;Mark Hasegawa-Johnson;Philipp Olbrich
The Multi-speaker, Multi-lingual Indic Text to Speech (TTS) with voice cloning (LIMMITS'24) challenge is organized as part of the ICASSP 2024 signal processing grand challenge. LIMMITS'24 aims at the development of voice cloning for the multi-speaker, multi-lingual Text-to-Speech (TTS) model. Towards this, 80 hours of TTS data has been released in each of Bengali, Chhattisgarhi, English (Indian), and Kannada languages. This is in addition to Telugu, Hindi, and Marathi data released during the LIMMITS'23 challenge. The challenge encourages the advancement of TTS in Indian Languages as well as the development of multi-speaker voice cloning techniques for TTS. The three tracks of LIMMITS'24 have provided an opportunity for various researchers and practitioners around the world to explore the state of the art in research for voice cloning with TTS.
{"title":"LIMMITS'24: Multi-Speaker, Multi-Lingual INDIC TTS With Voice Cloning","authors":"Sathvik Udupa;Jesuraja Bandekar;Abhayjeet Singh;Deekshitha G;Saurabh Kumar;Sandhya Badiger;Amala Nagireddi;Roopa R;Prasanta Kumar Ghosh;Hema A. Murthy;Pranaw Kumar;Keiichi Tokuda;Mark Hasegawa-Johnson;Philipp Olbrich","doi":"10.1109/OJSP.2025.3531782","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3531782","url":null,"abstract":"The Multi-speaker, Multi-lingual Indic Text to Speech (TTS) with voice cloning (LIMMITS'24) challenge is organized as part of the ICASSP 2024 signal processing grand challenge. LIMMITS'24 aims at the development of voice cloning for the multi-speaker, multi-lingual Text-to-Speech (TTS) model. Towards this, 80 hours of TTS data has been released in each of Bengali, Chhattisgarhi, English (Indian), and Kannada languages. This is in addition to Telugu, Hindi, and Marathi data released during the LIMMITS'23 challenge. The challenge encourages the advancement of TTS in Indian Languages as well as the development of multi-speaker voice cloning techniques for TTS. The three tracks of LIMMITS'24 have provided an opportunity for various researchers and practitioners around the world to explore the state of the art in research for voice cloning with TTS.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"293-302"},"PeriodicalIF":2.9,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10845816","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143489278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Posterior-Based Analysis of Spatio-Temporal Features for Sign Language Assessment
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-17 DOI: 10.1109/OJSP.2025.3531781
Neha Tarigopula;Sandrine Tornay;Ozge Mercanoglu Sincan;Richard Bowden;Mathew Magimai.-Doss
Sign Language conveys information through multiple channels composed of manual (handshape, hand movement) and non-manual (facial expression, mouthing, body posture) components. Sign language assessment involves giving granular feedback to a learner, in terms of the correctness of the manual and non-manual components, aiding the learner's progress. Existing methods rely on handcrafted skeleton-based features for hand movement within a KL-HMM framework to identify errors in manual components. However, modern deep learning models offer powerful spatio-temporal representations of videos that can represent hand movement and facial expressions. Despite their success in classification tasks, these representations often struggle to attribute errors to specific sources, such as incorrect handshape, improper movement, or incorrect facial expressions. To address this limitation, we leverage and analyze the spatio-temporal representations from Inflated 3D Convolutional Networks (I3D) and integrate them into the KL-HMM framework to assess sign language videos on both manual and non-manual components. By applying masking and cropping techniques, we isolate and evaluate distinct channels: hand movement and facial expressions using the I3D model, and handshape using the CNN-based model. Our approach outperforms traditional methods based on handcrafted features, as validated through experiments on the SMILE-DSGS dataset, and therefore demonstrates that it can enhance the effectiveness of sign language learning tools.
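The posterior-based assessment idea can be sketched compactly: per-frame class posteriors from a channel-specific model (handshape, hand movement, or facial expression) are compared against the reference categorical distributions of the expected sign, and the average KL divergence per channel flags which component is off. The distributions, frame counts, and decision threshold below are invented for illustration; the paper's KL-HMM additionally aligns frames to HMM states.

```python
import numpy as np

def mean_kl(reference, posteriors, eps=1e-12):
    """Average KL(reference || posterior) over frames; both are (T, K) categorical rows."""
    ref = np.clip(reference, eps, 1.0)
    post = np.clip(posteriors, eps, 1.0)
    return float(np.mean(np.sum(ref * np.log(ref / post), axis=1)))

rng = np.random.default_rng(2)
T, K = 40, 5
reference = np.eye(K)[rng.integers(0, K, size=T)]   # expected per-frame classes (one-hot rows)
good = 0.9 * reference + 0.1 / K                    # posteriors close to the reference
bad = rng.dirichlet(np.ones(K), size=T)             # posteriors from a wrong production

scores = {"handshape": mean_kl(reference, good),
          "hand movement": mean_kl(reference, good),
          "facial expression": mean_kl(reference, bad)}

threshold = 0.5   # illustrative decision threshold
for channel, score in scores.items():
    verdict = "ok" if score < threshold else "check this component"
    print(f"{channel}: KL={score:.2f} -> {verdict}")
```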
{"title":"Posterior-Based Analysis of Spatio-Temporal Features for Sign Language Assessment","authors":"Neha Tarigopula;Sandrine Tornay;Ozge Mercanoglu Sincan;Richard Bowden;Mathew Magimai.-Doss","doi":"10.1109/OJSP.2025.3531781","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3531781","url":null,"abstract":"Sign Language conveys information through multiple channels composed of manual (handshape, hand movement) and non-manual (facial expression, mouthing, body posture) components. Sign language assessment involves giving granular feedback to a learner, in terms of correctness of the manual and non-manual components, aiding the learner's progress. Existing methods rely on handcrafted skeleton-based features for hand movement within a KL-HMM framework to identify errors in manual components. However, modern deep learning models offer powerful spatio-temporal representations for videos to represent hand movement and facial expressions. Despite their success in classification tasks, these representations often struggle to attribute errors to specific sources, such as incorrect handshape, improper movement, or incorrect facial expressions. To address this limitation, we leverage and analyze the spatio-temporal representations from Inflated 3D Convolutional Networks (I3D) and integrate them into the KL-HMM framework to assess sign language videos on both manual and non-manual components. By applying masking and cropping techniques, we isolate and evaluate distinct channels of hand movement, and facial expressions using the I3D model and handshape using the CNN-based model. Our approach outperforms traditional methods based on handcrafted features, as validated through experiments on the SMILE-DSGS dataset, and therefore demonstrates that it can enhance the effectiveness of sign language learning tools.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"284-292"},"PeriodicalIF":2.9,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10845152","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Correction to “Energy Efficient Signal Detection Using SPRT and Ordered Transmissions in Wireless Sensor Networks”
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-17 DOI: 10.1109/OJSP.2024.3519916
Shailee Yagnik;Ramanarayanan Viswanathan;Lei Cao
In [1, p. 1124], a footnote is needed on (13) as shown below:
\begin{equation*}
\alpha^{\#} < \left(1 - c_1\right)\alpha + \left(1 - \left(1 - c_1\right)\alpha\right)\alpha \qquad \text{(13)}^{1}
\end{equation*}
{"title":"Correction to “Energy Efficient Signal Detection Using SPRT and Ordered Transmissions in Wireless Sensor Networks”","authors":"Shailee Yagnik;Ramanarayanan Viswanathan;Lei Cao","doi":"10.1109/OJSP.2024.3519916","DOIUrl":"https://doi.org/10.1109/OJSP.2024.3519916","url":null,"abstract":"In [1, p. 1124], a footnote is needed on (13) as shown below: begin{equation*}qquadqquadquad{{alpha }^# } < left( {1 - {{c}_1}} right)alpha + left( {1 - left( {1 - {{c}_1}} right)alpha } right)alphaqquadqquadquad hbox{(13)$^{1}$} end{equation*}","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"16-16"},"PeriodicalIF":2.9,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10845022","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Formant Tracking by Combining Deep Neural Network and Linear Prediction
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-16 DOI: 10.1109/OJSP.2025.3530876
Sudarsana Reddy Kadiri;Kevin Huang;Christina Hagedorn;Dani Byrd;Paavo Alku;Shrikanth Narayanan
Formant tracking is an area of speech science that has recently undergone a technology shift from classical model-driven signal processing methods to modern data-driven deep learning methods. In this study, these two domains are combined in formant tracking by refining the formants estimated by a data-driven deep neural network (DNN) with formant estimates given by a model-driven linear prediction (LP) method. In the refinement process, the three lowest formants, initially estimated by the DNN-based method, are frame-wise replaced with local spectral peaks identified by the LP method. The LP-based refinement stage can be seamlessly integrated into the DNN without any training. As an LP method, the study advocates the use of quasi-closed phase forward-backward (QCP-FB) analysis. Three spectral representations are compared as DNN inputs: mel-frequency cepstral coefficients (MFCCs), the spectrogram, and the complex spectrogram. Formant tracking performance was evaluated by comparing the proposed refined DNN tracker with seven reference trackers, which included both signal processing and deep learning based methods. As evaluation data, ground truth formants of the Vocal Tract Resonance (VTR) corpus were used. The results demonstrate that the refined DNN trackers outperformed all conventional trackers, with the best results obtained by using the MFCC input for the DNN. The proposed MFCC refinement (MFCC-DNN-QCP-FB) reduced estimation errors by 0.8 Hz, 12.9 Hz, and 11.7 Hz for the first (F1), second (F2), and third (F3) formants, respectively, compared to the Deep Formants refinement (DeepF-QCP-FB). When compared to the model-driven KARMA tracking method, the proposed refinement reduced estimation errors by 2.3 Hz, 55.5 Hz, and 143.4 Hz for F1, F2, and F3, respectively. A detailed evaluation across various phonetic categories and gender groups showed that the proposed hybrid refinement approach improves formant tracking performance across most test conditions.
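The refinement step itself is simple to sketch: per frame, each DNN formant estimate is snapped to the nearest spectral peak of an LP envelope. The sketch below uses plain autocorrelation LP as a stand-in for the QCP-FB analysis advocated in the paper, together with a synthetic frame and made-up DNN estimates, purely to show the frame-wise replacement logic.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_formant_candidates(frame, fs, order=12):
    """Peak frequencies of an autocorrelation-LP envelope (stand-in for QCP-FB analysis)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])   # LP coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))                # roots of A(z)
    roots = roots[np.imag(roots) > 0]                            # keep upper-half-plane roots
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs[(freqs > 90) & (freqs < fs / 2 - 90)])

def refine(dnn_formants, frame, fs):
    """Replace each DNN-estimated formant by the nearest LP spectral peak (frame-wise)."""
    cands = lp_formant_candidates(frame, fs)
    if len(cands) == 0:
        return dnn_formants
    return np.array([cands[np.argmin(np.abs(cands - f))] for f in dnn_formants])

# Synthetic vowel-like frame: three damped resonances near 700, 1220, 2600 Hz.
fs, t = 16000, np.arange(400) / 16000
frame = sum(np.exp(-60 * np.pi * t) * np.cos(2 * np.pi * f * t) for f in (700, 1220, 2600))
dnn_estimate = np.array([650.0, 1300.0, 2500.0])   # pretend DNN output for this frame
print(refine(dnn_estimate, frame, fs))
```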
{"title":"Formant Tracking by Combining Deep Neural Network and Linear Prediction","authors":"Sudarsana Reddy Kadiri;Kevin Huang;Christina Hagedorn;Dani Byrd;Paavo Alku;Shrikanth Narayanan","doi":"10.1109/OJSP.2025.3530876","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530876","url":null,"abstract":"Formant tracking is an area of speech science that has recently undergone a technology shift from classical model-driven signal processing methods to modern data-driven deep learning methods. In this study, these two domains are combined in formant tracking by refining the formants estimated by a data-driven deep neural network (DNN) with formant estimates given by a model-driven linear prediction (LP) method. In the refinement process, the three lowest formants, initially estimated by the DNN-based method, are frame-wise replaced with local spectral peaks identified by the LP method. The LP-based refinement stage can be seamlessly integrated into the DNN without any training. As an LP method, the study advocates the use of quasiclosed phase forward-backward (QCP-FB) analysis. Three spectral representations are compared as DNN inputs: mel-frequency cepstral coefficients (MFCCs), the spectrogram, and the complex spectrogram. Formant tracking performance was evaluated by comparing the proposed refined DNN tracker with seven reference trackers, which included both signal processing and deep learning based methods. As evaluation data, ground truth formants of the Vocal Tract Resonance (VTR) corpus were used. The results demonstrate that the refined DNN trackers outperformed all conventional trackers. The best results were obtained by using the MFCC input for the DNN. The proposed MFCC refinement (MFCC-DNN<sub>QCP-FB</sub>) reduced estimation errors by 0.8 Hz, 12.9 Hz, and 11.7 Hz for the first (F1), second (F2), and third (F3) formants, respectively, compared to the Deep Formants refinement (DeepF<sub>QCP-FB</sub>). When compared to the model-driven KARMA tracking method, the proposed refinement reduced estimation errors by 2.3 Hz, 55.5 Hz, and 143.4 Hz for F1, F2, and F3, respectively. A detailed evaluation across various phonetic categories and gender groups showed that the proposed hybrid refinement approach improves formanttracking performance across most test conditions.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"222-230"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843356","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143430569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-16 DOI: 10.1109/OJSP.2025.3530793
Ali N. Salman;Karen Rosero;Lucas Goncalves;Carlos Busso
Recent advancements in dynamic facial expression recognition (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from over-fitting due to the limited size and diversity of the training data for fully supervised learning (SL) models. A significant challenge with existing models based on static features in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards more prevalent or basic emotional features present in the static domain, especially with posed expressions. Therefore, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as mixture of emotion-dependent experts (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse emotional static features to train DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as binary classifiers. Our DFER model combines these static representations with recurrent models, modeling their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, marking a significant improvement over the baseline, which presented a macro F1-score of 70.9%. The DFER baseline is similar to MoEDE, but it uses a single static feature extractor rather than stacked extractors. Additionally, our proposed approach shows consistent improvements compared to four other popular baselines.
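A minimal sketch of the stacking idea: one binary expert per emotion scores each frame, the per-frame expert outputs are stacked into a feature sequence, and a temporal stage (here just a causal exponential moving average and an argmax, standing in for the paper's recurrent model) produces the clip-level decision. The expert weights, emotion set, and feature dimensions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]
feat_dim, T = 32, 60
frame_feats = rng.normal(size=(T, feat_dim))          # stand-in for static per-frame features

# One binary expert per emotion (random logistic weights here; trained models in practice).
experts = {e: (rng.normal(size=feat_dim), rng.normal()) for e in EMOTIONS}

def expert_scores(frames):
    """Stack per-frame sigmoid outputs of all emotion-dependent experts -> (T, n_emotions)."""
    cols = []
    for e in EMOTIONS:
        w, b = experts[e]
        cols.append(1.0 / (1.0 + np.exp(-(frames @ w + b))))
    return np.stack(cols, axis=1)

def temporal_aggregate(scores, alpha=0.1):
    """Causal exponential moving average over time, then a clip-level decision."""
    smoothed = np.zeros_like(scores)
    state = scores[0]
    for t in range(len(scores)):
        state = (1 - alpha) * state + alpha * scores[t]
        smoothed[t] = state
    return EMOTIONS[int(np.argmax(smoothed[-1]))]

print(temporal_aggregate(expert_scores(frame_feats)))
```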
{"title":"Mixture of Emotion Dependent Experts: Facial Expressions Recognition in Videos Through Stacked Expert Models","authors":"Ali N. Salman;Karen Rosero;Lucas Goncalves;Carlos Busso","doi":"10.1109/OJSP.2025.3530793","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530793","url":null,"abstract":"Recent advancements in <italic>dynamic facial expression recognition</i> (DFER) have predominantly utilized static features, which are theoretically inferior to dynamic features. However, models fully trained with dynamic features often suffer from over-fitting due to the limited size and diversity of the training data for fully <italic>supervised learning</i> (SL) models. A significant challenge with existing models based on static features in recognizing emotions from videos is their tendency to form biased representations, often unbalanced or skewed towards more prevalent or basic emotional features present in the static domain, especially with posed expression. Therefore, this approach under-represents the nuances present in the dynamic domain. To address this issue, our study introduces a novel approach that we refer to as <italic>mixture of emotion-dependent experts</i> (MoEDE). This strategy relies on emotion-specific feature extractors to produce more diverse emotional static features to train DFER systems. Each emotion-dependent expert focuses exclusively on one emotional category, formulating the problem as binary classifiers. Our DFER model combines these static representations with recurrent models, modeling their temporal relationships. The proposed MoEDE DFER approach achieves a macro F1-score of 74.5%, marking a significant improvement over the baseline, which presented a macro F1-score of 70.9% . The DFER baseline is similar to MoEDE, but it uses a single static feature extractor rather than stacked extractors. Additionally, our proposed approach shows consistent improvements compared to other four popular baselines.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"323-332"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843404","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing Classification Models With Sophisticated Counterfactual Images
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-16 DOI: 10.1109/OJSP.2025.3530843
Xiang Li;Ren Togo;Keisuke Maeda;Takahiro Ogawa;Miki Haseyama
In deep learning, training data, which are mainly from realistic scenarios, often carry certain biases. This causes deep learning models to learn incorrect relationships between features when trained on these data. However, because these models are black boxes, these problems cannot be solved effectively. In this paper, we aimed to 1) improve existing processes for generating language-guided counterfactual images and 2) employ counterfactual images to efficiently and directly identify model weaknesses in learning incorrect feature relationships. Furthermore, 3) we combined counterfactual images into datasets to fine-tune the model, thus correcting the model's perception of feature relationships. Through extensive experimentation, we confirmed the improvement in the quality of the generated counterfactual images, which can more effectively enhance the classification ability of various models.
{"title":"Enhancing Classification Models With Sophisticated Counterfactual Images","authors":"Xiang Li;Ren Togo;Keisuke Maeda;Takahiro Ogawa;Miki Haseyama","doi":"10.1109/OJSP.2025.3530843","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530843","url":null,"abstract":"In deep learning, training data, which are mainly from realistic scenarios, often carry certain biases. This causes deep learning models to learn incorrect relationships between features when using these training data. However, because these models have <italic>black boxes</i>, these problems cannot be solved effectively. In this paper, we aimed to 1) improve existing processes for generating language-guided counterfactual images and 2) employ counterfactual images to efficiently and directly identify model weaknesses in learning incorrect feature relationships. Furthermore, 3) we combined counterfactual images into datasets to fine-tune the model, thus correcting the model's perception of feature relationships. Through extensive experimentation, we confirmed the improvement in the quality of the generated counterfactual images, which can more effectively enhance the classification ability of various models.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"89-98"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843353","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143379496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Feasibility Study of Location Authentication for IoT Data Using Power Grid Signatures
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-16 DOI: 10.1109/OJSP.2025.3530847
Mudi Zhang;Charana Sonnadara;Sahil Shah;Min Wu
Ambient signatures related to the power grid offer an under-utilized opportunity to verify the time and location of sensing data collected by the Internet-of-Things (IoT). Such power signatures as the Electrical Network Frequency (ENF) have been used in multimedia forensics to answer questions about the time and location of audio-visual recordings. Going beyond multimedia data, this paper investigates a refined power signature of Electrical Network Voltage (ENV) for IoT sensing data and carries out a feasibility study of location verification for IoT data. ENV reflects the variations of the power system's supply voltage over time and is also present in the optical sensing data, akin to ENF. A physical model showing the presence of ENV in the optical sensing data is presented along with the corresponding signal processing mechanisms to estimate and utilize ENV signals from the power and optical sensing data as location stamps. Experiments are conducted in the State of Maryland of the United States to demonstrate the feasibility of using ENV signals for location authentication of IoT data.
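As a rough illustration of reading a grid-related signature out of a sensor stream, the sketch below tracks the dominant narrowband frequency around a nominal 120 Hz flicker component (the optical counterpart of a 60 Hz grid) in short windows, using a windowed FFT with quadratic peak interpolation. The synthetic signal, sampling rate, and band are assumptions; the ENV signature studied in the paper is based on voltage variations rather than frequency alone, so this is only ENF-style frequency tracking.

```python
import numpy as np

def track_narrowband_frequency(x, fs, nominal=120.0, half_band=1.0, win_s=2.0, hop_s=1.0):
    """Dominant frequency near `nominal` Hz per analysis window (quadratic peak interpolation)."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    estimates = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win] * np.hanning(win)
        spec = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(win, d=1.0 / fs)
        band = np.where((freqs >= nominal - half_band) & (freqs <= nominal + half_band))[0]
        k = band[np.argmax(spec[band])]
        # quadratic interpolation around the peak bin for sub-bin resolution
        a, b, c = spec[k - 1], spec[k], spec[k + 1]
        delta = 0.5 * (a - c) / (a - 2 * b + c)
        estimates.append(freqs[k] + delta * fs / win)
    return np.array(estimates)

# Synthetic optical-sensor stream: a 120 Hz component with a slow frequency wobble, plus noise.
fs, dur = 1000, 20
t = np.arange(int(fs * dur)) / fs
drift = 0.03 * np.sin(2 * np.pi * 0.05 * t)
x = np.cos(2 * np.pi * np.cumsum(120.0 + drift) / fs) + 0.5 * np.random.default_rng(4).normal(size=len(t))

print(track_narrowband_frequency(x, fs)[:5])
```

Comparing such a per-window frequency track against a reference recorded at a claimed location is the basic matching step behind this kind of time/location verification.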
{"title":"Feasibility Study of Location Authentication for IoT Data Using Power Grid Signatures","authors":"Mudi Zhang;Charana Sonnadara;Sahil Shah;Min Wu","doi":"10.1109/OJSP.2025.3530847","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530847","url":null,"abstract":"Ambient signatures related to the power grid offer an under-utilized opportunity to verify the time and location of sensing data collected by the Internet-of-Things (IoT). Such power signatures as the Electrical Network Frequency (ENF) have been used in multimedia forensics to answer questions about the time and location of audio-visual recordings. Going beyond multimedia data, this paper investigates a refined power signature of Electrical Network Voltage (ENV) for IoT sensing data and carries out a feasibility study of location verification for IoT data. ENV reflects the variations of the power system's supply voltage over time and is also present in the optical sensing data, akin to ENF. A physical model showing the presence of ENV in the optical sensing data is presented along with the corresponding signal processing mechanisms to estimate and utilize ENV signals from the power and optical sensing data as location stamps. Experiments are conducted in the State of Maryland of the United States to demonstrate the feasibility of using ENV signals for location authentication of IoT data.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"405-416"},"PeriodicalIF":2.9,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10843385","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143740238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition
IF 2.9 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date : 2025-01-15 DOI: 10.1109/OJSP.2025.3530274
Lucas Goncalves;Huang-Cheng Chou;Ali N. Salman;Chi-Chun Lee;Carlos Busso
Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies when annotations are made under different emotional stimuli conditions—whether through unimodal or multimodal stimuli. This study investigates the potential for enhanced AVER system performance by integrating diverse levels of annotation stimuli, reflective of varying perceptual evaluations. We propose a two-stage training method to train models with the labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach utilizes different levels of annotation stimuli according to which modality is present within different layers of the model, effectively modeling annotation at the unimodal and multi-modal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct the experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieved the best performances in macro-/weighted-F1 scores. Additionally, we measure the model calibration, performance bias, and fairness metrics considering the age, gender, and race of the AVER systems.
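A compact sketch of the two-stage idea: stage one fits each unimodal branch to the labels elicited by the matching stimulus (audio-only or face-only), and stage two freezes those branches and fits a fusion head on the audio-visual labels. Plain softmax-regression branches trained with a few gradient steps stand in for the real encoders; all data, dimensions, and learning rates are illustrative, and with random data the final accuracy is near chance — the point is the training flow.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_a, d_v, n_cls = 300, 20, 24, 4

audio = rng.normal(size=(n, d_a))
video = rng.normal(size=(n, d_v))
y_audio_rated = rng.integers(0, n_cls, size=n)    # labels from audio-only annotation
y_video_rated = rng.integers(0, n_cls, size=n)    # labels from face-only annotation
y_av_rated = rng.integers(0, n_cls, size=n)       # labels from audio-visual annotation

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, y, steps=200, lr=0.1):
    """Plain softmax regression by gradient descent (stand-in for a trained branch)."""
    W = np.zeros((X.shape[1], n_cls))
    Y = np.eye(n_cls)[y]
    for _ in range(steps):
        W -= lr * X.T @ (softmax(X @ W) - Y) / len(X)
    return W

# Stage 1: unimodal branches supervised with unimodal-rated labels.
W_a = fit_softmax(audio, y_audio_rated)
W_v = fit_softmax(video, y_video_rated)

# Stage 2: freeze the branches and train a fusion head on the stacked branch posteriors,
# supervised with the multimodal-rated labels.
fusion_in = np.concatenate([softmax(audio @ W_a), softmax(video @ W_v)], axis=1)
W_f = fit_softmax(fusion_in, y_av_rated)

pred = softmax(fusion_in @ W_f).argmax(axis=1)
print("training accuracy on audio-visual labels:", (pred == y_av_rated).mean())
```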
{"title":"Jointly Learning From Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition","authors":"Lucas Goncalves;Huang-Cheng Chou;Ali N. Salman;Chi-Chun Lee;Carlos Busso","doi":"10.1109/OJSP.2025.3530274","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3530274","url":null,"abstract":"<italic>Audio-visual emotion recognition</i> (AVER) has been an important research area in <italic>human-computer interaction</i> (HCI). Traditionally, audio-visual emotional datasets and corresponding models derive their ground truths from annotations obtained by raters after watching the audio-visual stimuli. This conventional method, however, neglects the nuanced human perception of emotional states, which varies when annotations are made under different emotional stimuli conditions—whether through unimodal or multimodal stimuli. This study investigates the potential for enhanced AVER system performance by integrating diverse levels of annotation stimuli, reflective of varying perceptual evaluations. We propose a two-stage training method to train models with the labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach utilizes different levels of annotation stimuli according to which modality is present within different layers of the model, effectively modeling annotation at the unimodal and multi-modal levels to capture the full scope of emotion perception across unimodal and multimodal contexts. We conduct the experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieved the best performances in macro-/weighted-F1 scores. Additionally, we measure the model calibration, performance bias, and fairness metrics considering the age, gender, and race of the AVER systems.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"165-174"},"PeriodicalIF":2.9,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10842047","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143404005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0