Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.10.018
Nan Yang, Cheuk Hang Leung, Xing Yan
In this paper, we introduce a novel distance measure, conforming to the definition of a semi-distance, for quantifying the similarity between Hidden Markov Models (HMMs). This distance measure is not only easier to implement than existing measures, but also accounts for state alignment before the distance calculation, ensuring correctness and accuracy. It represents a significant advance in HMM comparison, offering a more practical and accurate solution. Numerical examples demonstrating the utility of the proposed distance measure are given for HMMs with continuous state probability densities. In real-world data experiments, we employ HMMs to represent the evolution of financial time series or music. Leveraging the proposed distance measure, we then conduct HMM-based unsupervised clustering, with promising results. Our approach proves effective in capturing the inherent differences in the dynamics of financial time series, showcasing the practicality and success of the proposed distance measure.
{"title":"A novel HMM distance measure with state alignment","authors":"Nan Yang , Cheuk Hang Leung , Xing Yan","doi":"10.1016/j.patrec.2024.10.018","DOIUrl":"10.1016/j.patrec.2024.10.018","url":null,"abstract":"<div><div>In this paper, we introduce a novel distance measure that conforms to the definition of a semi-distance, for quantifying the similarity between Hidden Markov Models (HMMs). This distance measure is not only easier to implement, but also accounts for state alignment before distance calculation, ensuring correctness and accuracy. Our proposed distance measure presents a significant advancement in HMM comparison, offering a more practical and accurate solution compared to existing measures. Numerical examples that demonstrate the utility of the proposed distance measure are given for HMMs with continuous state probability densities. In real-world data experiments, we employ HMM to represent the evolution of financial time series or music. Subsequently, leveraging the proposed distance measure, we conduct HMM-based unsupervised clustering, demonstrating promising results. Our approach proves effective in capturing the inherent difference in dynamics of financial time series, showcasing the practicality and success of the proposed distance measure.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 314-321"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.10.014
Junghyun Seo, Sungjun Wang, Hyeonjae Jeon, Taesoo Kim, Yongsik Jin, Soon Kwon, Jeseok Kim, Yongseob Lim
There are diverse datasets available for training deep learning models used in autonomous driving. However, most of these datasets consist of images captured in daytime conditions, leading to a data imbalance when dealing with night-condition images. Several day-to-night image translation models have been proposed to remedy the shortage of night-condition data, but these models often generate artifacts and cannot control the brightness of the generated image. In this study, we propose LuminanceGAN, which controls the degree of brightness under night conditions to generate realistic night-image outputs. The proposed Y-control loss drives the brightness of the output image toward a specified luminance value. Furthermore, a self-attention module effectively reduces artifacts in the generated images. Consequently, in qualitative comparisons, our model demonstrates superior performance in day-to-night image translation. Additionally, a quantitative evaluation using lane detection models shows that our proposed method improves performance on night lane detection tasks. Moreover, the quality of the generated indoor dark images was assessed with an evaluation metric, showing that our model generates images more similar to real dark images than other image translation models do.
{"title":"LuminanceGAN: Controlling the brightness of generated images for various night conditions","authors":"Junghyun Seo , Sungjun Wang , Hyeonjae Jeon , Taesoo Kim , Yongsik Jin , Soon Kwon , Jeseok Kim , Yongseob Lim","doi":"10.1016/j.patrec.2024.10.014","DOIUrl":"10.1016/j.patrec.2024.10.014","url":null,"abstract":"<div><div>There are diverse datasets available for training deep learning models utilized in autonomous driving. However, most of these datasets are composed of images obtained in day conditions, leading to a data imbalance issue when dealing with night condition images. Several day-to-night image translation models have been proposed to resolve the insufficiency of the night condition dataset, but these models often generate artifacts and cannot control the brightness of the generated image. In this study, we propose a LuminanceGAN, for controlling the brightness degree in night conditions to generate realistic night image outputs. The proposed novel Y-control loss converges the brightness degree of the output image to a specific luminance value. Furthermore, the implementation of the self-attention module effectively reduces artifacts in the generated images. Consequently, in qualitative comparisons, our model demonstrates superior performance in day-to-night image translation. Additionally, a quantitative evaluation was conducted using lane detection models, showing that our proposed method improves performance in night lane detection tasks. Moreover, the quality of the generated indoor dark images was assessed using an evaluation metric. It can be proven that our model generates images most similar to real dark images compared to other image translation models.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 292-299"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.10.007
Jongmin Yu, Hyeontaek Oh, Younkwan Lee, Jinhong Yang
This paper proposes the Adversarial Denoising Diffusion Model (ADDM). Diffusion models excel at generating high-quality samples, outperforming other generative models, and their strong sampling ability also yields outstanding results in medical image anomaly detection (AD). However, the performance of diffusion model-based methods varies considerably with the sampling frequency, and the time cost of generating good-quality samples is significantly higher than that of other generative models. We propose ADDM, a diffusion model-based AD method trained with adversarial learning, which maintains high-quality sample generation while significantly reducing the number of sampling steps. The adversarial learning is achieved by classifying model-denoised samples against samples to which random Gaussian noise is added at a specific sampling step. Unlike the standard diffusion loss, which is defined in the noise space to match the predicted noise to the scheduled noise, the adversarial objective is defined in the sample space, so the model can explicitly learn semantic information about the samples. Our experiments demonstrate that adversarial learning achieves data sampling performance similar to the DDPM with far fewer sampling steps. Experimental results show that the proposed ADDM outperforms existing unsupervised AD methods on brain MRI images. In particular, on 22 T1-weighted MRI scans provided by the Centre for Clinical Brain Sciences at the University of Edinburgh, ADDM achieves similar performance with 50% fewer sampling steps than other DDPM-based AD methods, and 6.2% better performance on the Dice metric with the same number of sampling steps.
{"title":"Denoising diffusion model with adversarial learning for unsupervised anomaly detection on brain MRI images","authors":"Jongmin Yu , Hyeontaek Oh , Younkwan Lee , Jinhong Yang","doi":"10.1016/j.patrec.2024.10.007","DOIUrl":"10.1016/j.patrec.2024.10.007","url":null,"abstract":"<div><div>This paper proposes the Adversarial Denoising Diffusion Model (ADDM). Diffusion models excel at generating high-quality samples, outperforming other generative models. These models also achieve outstanding medical image anomaly detection (AD) results due to their strong sampling ability. However, the performance of the diffusion model-based methods is highly varied depending on the sampling frequency, and the time cost to generate good-quality samples is significantly higher than that of other generative models. We propose the ADDM, a diffusion model-based AD method trained with adversarial learning that can maintain high-quality sample generation ability and significantly reduce the number of sampling steps. The proposed adversarial learning is achieved by classifying model-based denoised samples and samples to which random Gaussian noise is added to a specific sampling step. Compared with the loss function of diffusion models, defined under the noise space to minimise the predicted noise and scheduled noise, the diffusion model can explicitly learn semantic information about the sample space since adversarial learning is defined based on the sample space. Our experiment demonstrated that adversarial learning helps achieve a data sampling performance similar to the DDPM with much fewer sampling steps. Experimental results show that the proposed ADDM outperformed existing unsupervised AD methods on Brain MRI images. In particular, in the comparison using 22 T1-weighted MRI scans provided by the Centre for Clinical Brain Sciences from the University of Edinburgh, the ADDM achieves similar performance with 50% fewer sampling steps than other DDPM-based AD methods, and it shows 6.2% better performance about the Dice metric with the same number of sampling steps.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 229-235"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142535004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.09.023
Ting-Ruen Wei, Yuan Wang, Yoshitaka Inoue, Hsin-Tai Wu, Yi Fang
Missing data in tabular datasets is a common issue, as the performance of downstream tasks usually depends on the completeness of the training data. Previous missing-data imputation methods focus on numeric and categorical columns; we propose a novel end-to-end approach, Table Transformers for Imputing Textual Attributes (TTITA), which uses a transformer to impute unstructured textual columns from the other columns of the table. We conduct extensive experiments on three datasets, and our approach shows competitive performance, outperforming baseline models such as recurrent neural networks and Llama2. The performance improvement is more significant when the target sequence is longer. Additionally, we incorporate multi-task learning to impute heterogeneous columns simultaneously, boosting performance on text imputation. We also qualitatively compare with ChatGPT for realistic applications.
{"title":"Table Transformers for imputing textual attributes","authors":"Ting-Ruen Wei , Yuan Wang , Yoshitaka Inoue , Hsin-Tai Wu , Yi Fang","doi":"10.1016/j.patrec.2024.09.023","DOIUrl":"10.1016/j.patrec.2024.09.023","url":null,"abstract":"<div><div>Missing data in tabular dataset is a common issue as the performance of downstream tasks usually depends on the completeness of the training dataset. Previous missing data imputation methods focus on numeric and categorical columns, but we propose a novel end-to-end approach called Table Transformers for Imputing Textual Attributes (TTITA) based on the transformer to impute unstructured textual columns using other columns in the table. We conduct extensive experiments on three datasets, and our approach shows competitive performance outperforming baseline models such as recurrent neural networks and Llama2. The performance improvement is more significant when the target sequence has a longer length. Additionally, we incorporate multi-task learning to simultaneously impute for heterogeneous columns, boosting the performance for text imputation. We also qualitatively compare with ChatGPT for realistic applications.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 258-264"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142551873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.10.008
Jinfeng Li, Meiling Feng, Chengyi Xia
Convolutional Neural Networks (CNNs) are extensively utilized in medical disease diagnosis, demonstrating strong performance in most cases. However, medical image processing based on deep learning faces several challenges: the limited availability and time-consuming annotation of medical image data restrict the scale and accuracy of model training, and data diversity and complexity compound these difficulties. To address these issues, we introduce the Double Branch Convolutional Transformer (DBCvT), a hybrid CNN-Transformer feature extractor that better captures diverse fine-grained features and remains suitable for small datasets. In this model, separable downsampling convolution (SDConv) is used to mitigate the excessive information loss of downsampling with standard strided convolutions. Additionally, we propose the Dual branch Channel Efficient multi-head Self-Attention (DCESA) mechanism to improve self-attention efficiency, thereby elevating network performance and effectiveness. Moreover, we introduce a novel convolutional channel-enhanced attention mechanism to strengthen inter-channel relationships within feature maps after self-attention. Experiments with DBCvT on various medical image datasets demonstrate the outstanding classification performance and generalization capability of the proposed model.
{"title":"DBCvT: Double Branch Convolutional Transformer for Medical Image Classification","authors":"Jinfeng Li , Meiling Feng , Chengyi Xia","doi":"10.1016/j.patrec.2024.10.008","DOIUrl":"10.1016/j.patrec.2024.10.008","url":null,"abstract":"<div><div>Convolutional Neural Networks (CNNs) are extensively utilized in medical disease diagnosis, demonstrating the prominent performance in most cases. However, medical image processing based on deep learning faces some challenges. The limited availability and time-consuming annotations of medical image data restrict the scale and accuracy of model training. Data diversity and complexity further complicate these challenges. In order to address these issues, we introduce the Double Branch Convolutional Transformer (DBCvT), a hybrid CNN-Transformer feature extractor, which can better capture diverse fine-grained features and remain suitable for small datasets. In this model, separable downsampling convolution (SDConv) is used to mitigate excessive information loss during downsampling in standard convolutions. Additionally, we propose the Dual branch Channel Efficient multi-head Self-Attention (DCESA) mechanism to enhance the self-attention efficiency, consequently elevating network performance and effectiveness. Moreover, we introduce a novel convolutional channel-enhanced Attention mechanism to strengthen inter-channel relationships within feature maps post self-attention. The experiments of DBCvT on various medical image datasets have demonstrated the outstanding classification performance and generalization capability of the proposed model.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 250-257"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142535360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Point cloud completion networks often encode points into a global feature vector and then predict the complete point cloud from that vector. However, this approach may not accurately capture complex shapes, as global feature vectors struggle to recover detailed structure. In this paper, we present a novel shape completion network, RD-Net, which focuses on the interaction of information between points to provide both local and global information for generating a fine-grained complete shape. Specifically, we propose a stored-iteration-based method for point cloud sampling that quickly captures representative points within the point cloud. Subsequently, to better predict the shape and structure of the missing part, we design an iterative edge-convolution module that uses a CNN-like hierarchy for feature extraction and learning of context information. Moreover, we design a two-stage reconstruction process for latent vector decoding: we first employ a feature-points-based multi-scale generating decoder to estimate the missing point cloud hierarchically, followed by a self-attention mechanism that refines the generated shape and effectively produces structural details. By combining these innovations, RD-Net achieves a 2% reduction in CD error compared to the state-of-the-art method on the ShapeNet-part dataset.
{"title":"Regional dynamic point cloud completion network","authors":"Liping Zhu, Yixuan Yang, Kai Liu, Silin Wu, Bingyao Wang, Xianxiang Chang","doi":"10.1016/j.patrec.2024.10.017","DOIUrl":"10.1016/j.patrec.2024.10.017","url":null,"abstract":"<div><div>Point cloud completion network often encodes points into a global feature vector, then predicts the complete point cloud through the vector generation process. However, this method may not accurately capture complex shapes, as global feature vectors struggle to recover their detailed structure. In this paper, we present a novel shape completion network, namely RD-Net, that innovatively focuses on the interaction of information between points to provide both local and global information for generating fine-grained complete shape. Specifically, we propose a stored iteration-based method for point cloud sampling that quickly captures representative points within the point cloud. Subsequently, in order to better predict the shape and structure of the missing part, we design an iterative edge-convolution module. It uses a CNN-like hierarchy for feature extraction and learning context information. Moreover, we design a two-stage reconstruction process for latent vector decoding. We first employ a feature-points-based multi-scale generating decoder to estimate the missing point cloud hierarchically. This is followed by a self-attention mechanism that refines the generated shape and effectively generates structural details. By combining these innovations, RD-Net achieves a 2% reduction in CD error compared to the state-of-the-art method on the ShapeNet-part dataset.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 322-329"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.10.009
Beomjo Kim, Kyung-Ah Sohn
This paper presents a novel approach to subject-driven image generation that addresses the limitations of traditional text-to-image diffusion models. Our method generates images using reference images without relying on language-based prompts. We introduce a visual detail preserving module that captures intricate details and textures, addressing overfitting issues associated with limited training samples. The model's performance is further enhanced through a modified classifier-free guidance technique and feature concatenation, enabling the natural positioning and harmonization of subjects within diverse scenes. Quantitative assessments using CLIP, DINO and Quality scores (QS), along with a user study, demonstrate the superior quality of our generated images. Our work highlights the potential of pre-trained models and visual patch embeddings in subject-driven editing, balancing diversity and fidelity in image generation tasks. Our implementation is available at https://github.com/8eomio/Subject-Inpainting.
{"title":"Text-free diffusion inpainting using reference images for enhanced visual fidelity","authors":"Beomjo Kim, Kyung-Ah Sohn","doi":"10.1016/j.patrec.2024.10.009","DOIUrl":"10.1016/j.patrec.2024.10.009","url":null,"abstract":"<div><div>This paper presents a novel approach to subject-driven image generation that addresses the limitations of traditional text-to-image diffusion models. Our method generates images using reference images without relying on language-based prompts. We introduce a visual detail preserving module that captures intricate details and textures, addressing overfitting issues associated with limited training samples. The model's performance is further enhanced through a modified classifier-free guidance technique and feature concatenation, enabling the natural positioning and harmonization of subjects within diverse scenes. Quantitative assessments using CLIP, DINO and Quality scores (QS), along with a user study, demonstrate the superior quality of our generated images. Our work highlights the potential of pre-trained models and visual patch embeddings in subject-driven editing, balancing diversity and fidelity in image generation tasks. Our implementation is available at <span><span>https://github.com/8eomio/Subject-Inpainting</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 221-228"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142535003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.11.001
Mingxuan Chen, Shiqi Li, Xujun Wei, Jiacheng Song
In the rapidly advancing field of industrial automation, the availability of robust and diverse datasets is crucial for developing and evaluating machine learning models. The data repository consists of four distinct dataset versions: MMIFR-D, MMIFR-FS, MMIFR-OD, and MMIFR-P. The MMIFR-D dataset comprises 5907 images with corresponding textual descriptions, notably facilitating industrial equipment classification. The MMIFR-FS dataset is an alternative variant containing 129 distinct classes and 5907 images, specifically catering to few-shot learning within the industrial domain. MMIFR-OD, another variant comprising 8,839 annotation instances across 128 distinct categories, is predominantly used for object detection tasks. Additionally, the MMIFR-P dataset consists of 142 textual-visual information pairs, making it suitable for detecting pairs of industrial equipment. Furthermore, we conduct a comprehensive comparative analysis of our dataset against other datasets used in industrial settings, and provide benchmark performances for different industrial tasks on our data repository. The proposed multimodal dataset, MMIFR, can be used for research in industrial automation, quality control, safety monitoring, and other relevant domains.
{"title":"MMIFR: Multi-modal industry focused data repository","authors":"Mingxuan Chen , Shiqi Li , Xujun Wei , Jiacheng Song","doi":"10.1016/j.patrec.2024.11.001","DOIUrl":"10.1016/j.patrec.2024.11.001","url":null,"abstract":"<div><div>In the rapidly advancing field of industrial automation, the availability of robust and diverse datasets is crucial for the development and evaluation of machine learning models. The data repository consists of four distinct versions of datasets: MMIFR-D, MMIFR-FS, MMIFR-OD and MMIFR-P. The MMIFR-D dataset comprises a comprehensive assemblage of 5907 images accompanied by corresponding textual descriptions, notably facilitating the application of industrial equipment classification. In contrast, the MMIFR-FS dataset serves as an alternative variant characterized by the inclusion of 129 distinct classes and 5907 images, specifically catering to the task of few-shot learning within the industrial domain. MMIFR-OD is another alternative variant, comprising 8,839 annotation instances across 128 distinct categories, is predominantly utilized for object detection tasks. Additionally, the MMIFR-P dataset consists of 142 textual–visual information pairs, making it suitable for detecting pairs of industrial equipment. Furthermore, we conduct a comprehensive comparative analysis of our dataset in relation to other datasets used in industrial settings. Benchmark performances for different industrial tasks on our data repository are provided. The proposed multimodal dataset, MMIFR, can be utilized for research in industrial automation, quality control, safety monitoring, and other relevant domains.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 306-313"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.patrec.2024.09.017
Yibo Lou, Wenjie Zhang, Xiaoning Song, Yang Hua, Xiao-Jun Wu
Efficiently leveraging semantic information has been crucial to recent advances in video captioning. However, prevailing approaches that design various Part-of-Speech (POS) tags as prior information lack essential linguistic knowledge to guide the training procedure, particularly for POS and initial description generation. Furthermore, restricting the model to a single source of semantic information ignores the varied interpretations inherent in each video. To solve these problems, we propose the Exploring Deeper into Semantics (EDS) method for video captioning. EDS comprises three modules that focus on semantic information. Specifically, we propose the Semantic Supervised Generation (SSG) module, which integrates semantic information as a prior and facilitates enriched interrelations among words for POS supervision. A novel Similarity Semantic Extension (SSE) module employs a query-based semantic expansion to collaboratively generate fine-grained content. Additionally, the proposed Input Semantic Enhancement (ISE) module provides a strategy for mitigating the information constraints faced during the initial phase of word generation. Experiments show that, by exploiting semantic information through supervision, extension, and enhancement, EDS yields promising results and demonstrates its effectiveness. Code will be available at https://github.com/BradenJoson/EDS.
{"title":"EDS: Exploring deeper into semantics for video captioning","authors":"Yibo Lou , Wenjie Zhang , Xiaoning Song , Yang Hua , Xiao-Jun Wu","doi":"10.1016/j.patrec.2024.09.017","DOIUrl":"10.1016/j.patrec.2024.09.017","url":null,"abstract":"<div><div>Efficiently leveraging semantic information is crucial for advancing video captioning in recent years. But, prevailing approaches that involve designing various Part-of-Speech (POS) tags as prior information lack essential linguistic knowledge guidance throughout the training procedure, particularly in the context of POS and initial description generation. Furthermore, the restriction to a single source of semantic information ignores the potential for varied interpretations inherent in each video. To solve these problems, we propose the Exploring Deeper into Semantics (EDS) method for video captioning. EDS comprises three feasible modules that focus on semantic information. Specifically, we propose the Semantic Supervised Generation (SSG) module. It integrates semantic information as a prior, and facilitates enriched interrelations among words for POS supervision. A novel Similarity Semantic Extension (SSE) module is proposed to employ a query-based semantic expansion for collaboratively generating fine-grained content. Additionally, the proposed Input Semantic Enhancement (ISE) module provides a strategy for mitigating the information constraints faced during the initial phase of word generation. The experiments conducted show that, by exploiting semantic information through supervision, extension, and enhancement, EDS not only yields promising results but also underlines the effectiveness. Code will be available at <span><span>https://github.com/BradenJoson/EDS</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 133-140"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142421584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federated learning enables multiple clients to collaborate on training a model without sharing data; clients with insufficient data or data diversity participate in federated learning to obtain a model with superior performance. MRI data suffers from small sample sizes and differing data distributions caused by differences in MRI scanners and client characteristics, while privacy concerns preclude data sharing. In this work, we propose a novel adaptive federated meta-learning (FAM) mechanism for collaboratively learning a single global model, which is then personalized locally on individual clients. The learnt sparse global model captures the common features in the MRI data across clients. This model is grown on each client to learn a personalized model by capturing additional client-specific parameters from local data. Experimental results on multiple data sets show that the personalization process at each client converges quickly, within a limited number of epochs. The personalized client models outperformed the locally trained models, demonstrating the efficacy of the FAM mechanism. Additionally, the FAM-based sparse global model has fewer parameters and requires less communication overhead during federated learning, making it viable for networks with limited resources.
{"title":"FAM: Adaptive federated meta-learning for MRI data","authors":"Indrajeet Kumar Sinha, Shekhar Verma, Krishna Pratap Singh","doi":"10.1016/j.patrec.2024.09.018","DOIUrl":"10.1016/j.patrec.2024.09.018","url":null,"abstract":"<div><div>Federated learning enables multiple clients to collaborate to train a model without sharing data. Clients with insufficient data or data diversity participate in federated learning to learn a model with superior performance. MRI data suffers from inadequate data and different data distribution due to differences in MRI scanners and client characteristics. Also, privacy concerns preclude data sharing. In this work, we propose a novel adaptive federated meta-learning (FAM) mechanism for collaboratively learning a single global model, which is personalized locally on individual clients. The learnt sparse global model captures the common features in the MRI data across clients. This model is grown on each client to learn a personalized model by capturing additional client-specific parameters from local data. Experimental results on multiple data sets show that the personalization process at each client quickly converges using a limited number of epochs. The personalized client models outperformed the locally trained models, demonstrating the efficacy of the FAM mechanism. Additionally, the FAM-based sparse global model has fewer parameters that require less communication overhead during federated learning. This makes the model viable for networks with limited resources.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 205-212"},"PeriodicalIF":3.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142445296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}