Pub Date: 2024-08-28 | DOI: 10.1016/j.patrec.2024.08.019
Chan Sim, Gyeonghwan Kim
Few-shot classification is the challenging task of recognizing unseen classes from limited data. Following the success of the Vision Transformer in image recognition on various large-scale datasets, recent few-shot classification methods have adopted transformer-style architectures. However, most of them focus only on cross-attention between support and query sets, mainly considering channel-similarity. To address this issue, we introduce the dual-similarity network (DSN), in which attention maps for the same target within a class are made identical. With this network, we seek an effective way of training that integrates channel-similarity and map-similarity. Our method, while focused on N-way K-shot scenarios, also demonstrates strong performance in 1-shot settings through augmentation. The experimental results verify the effectiveness of DSN on widely used benchmark datasets.
Title: Cross-attention based dual-similarity network for few-shot learning. Pattern Recognition Letters, vol. 186, pp. 1-6.
Pub Date: 2024-08-23 | DOI: 10.1016/j.patrec.2024.08.006
Aecheon Jung, Sungeun Hong, Yoonsuk Hyun
Owing to advances in deep learning, object detection has made significant progress in estimating the positions and classes of multiple objects within an image. However, detecting objects of various scales within a single image remains a challenging problem. In this study, we propose scale-aware token matching to predict the positions and classes of objects for transformer-based object detection. We train a model by matching detection tokens with ground-truth objects according to their size, unlike previous methods, which perform matching without considering scale during training. We divide one detection token set into multiple sets based on scale and match each token set differently with the ground truth, thereby training the model without additional computation costs. The experimental results demonstrate that scale information can be assigned to tokens. Scale-aware tokens can independently learn scale-specific information by using a novel loss function, which improves the detection performance on small objects.
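The partitioning idea described in the abstract (one token set split into several scale-specific sets, each matched against ground truth of the corresponding size) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the three scale names, the COCO-style area thresholds (32² and 96² pixels), the equal-sized contiguous token groups, and all function names are hypothetical. The matching step itself and the loss function are not shown.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical area ranges per scale (COCO-style small/medium/large split).
SCALE_BOUNDS = {
    "small": (0.0, 32.0 ** 2),
    "medium": (32.0 ** 2, 96.0 ** 2),
    "large": (96.0 ** 2, float("inf")),
}


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def split_ground_truth_by_scale(boxes: List[Box]) -> Dict[str, List[int]]:
    """For each scale name, list the indices of the ground-truth boxes in that area range."""
    buckets: Dict[str, List[int]] = {name: [] for name in SCALE_BOUNDS}
    for index, box in enumerate(boxes):
        area = box_area(box)
        for name, (low, high) in SCALE_BOUNDS.items():
            if low <= area < high:
                buckets[name].append(index)
                break
    return buckets


def split_tokens_by_scale(num_tokens: int) -> Dict[str, List[int]]:
    """Partition token indices into contiguous, near-equal groups, one per scale."""
    names = list(SCALE_BOUNDS)
    group_size, remainder = divmod(num_tokens, len(names))
    groups: Dict[str, List[int]] = {}
    start = 0
    for position, name in enumerate(names):
        size = group_size + (1 if position < remainder else 0)
        groups[name] = list(range(start, start + size))
        start += size
    return groups


gt_boxes = [(0, 0, 10, 10), (0, 0, 50, 50), (0, 0, 200, 200)]
gt_buckets = split_ground_truth_by_scale(gt_boxes)
token_groups = split_tokens_by_scale(100)
print(gt_buckets)  # {'small': [0], 'medium': [1], 'large': [2]}
```

In a full detector, each token group would then be matched only against the ground-truth boxes in its own bucket, so the split adds no extra tokens or forward passes.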
Title: Scale-aware token-matching for transformer-based object detector. Pattern Recognition Letters, vol. 185, pp. 197-202. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167865524002381/pdfft?md5=455cf43c88bbb69d1fdd489f7d4c3fe2&pid=1-s2.0-S0167865524002381-main.pdf
Pub Date: 2024-08-23 | DOI: 10.1016/j.patrec.2024.08.011
Lin Jiang, Jigang Wu, Shuping Zhao, Jiaxing Li
In cross-modal retrieval, most existing hashing-based methods consider only the relationship between feature representations to reduce the heterogeneity gap between data from different modalities, and neglect the correlation between feature representations and the corresponding labels. This leads to the loss of significant semantic information and degrades the class discriminability of the model. To tackle these issues, this paper presents a novel cross-modal retrieval method called coding self-representative and label-relaxed hashing (CSLRH). Specifically, we propose a self-representation learning term to enhance the class-specific feature representations and reduce noise interference. Additionally, we introduce a label-relaxed regression to establish semantic relations between the hash codes and the label information, aiming to enhance semantic discriminability. Moreover, we incorporate a non-linear regression to capture the correlation of non-linear features in hash codes. Experimental results on three widely used datasets verify the effectiveness of our proposed method, which generates more discriminative hash codes and thereby improves the precision of cross-modal retrieval.
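For readers unfamiliar with hashing-based retrieval, the step that consumes the learned hash codes can be sketched as below. This is a generic Hamming-distance ranking, not the CSLRH learning procedure, which the abstract does not specify in enough detail to reproduce. The function names and the toy 4-bit codes are hypothetical.

```python
from typing import List, Sequence


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def rank_by_hamming(query_code: Sequence[int],
                    database_codes: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted by increasing Hamming distance to the query.

    Ties are broken by database index so the ranking is deterministic.
    """
    distances = [hamming_distance(query_code, code) for code in database_codes]
    return sorted(range(len(database_codes)), key=lambda i: (distances[i], i))


# Toy example: an image query code ranked against three text codes.
image_query = [1, 0, 1, 1]
text_codes = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 1, 1]]
ranking = rank_by_hamming(image_query, text_codes)
print(ranking)  # [2, 0, 1]
```

More discriminative codes place same-class items at smaller Hamming distances, which is what raises retrieval precision in this kind of ranking.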
Title: Coding self-representative and label-relaxed hashing for cross-modal retrieval. Pattern Recognition Letters, vol. 185, pp. 1-7.
Pub Date: 2024-08-23 | DOI: 10.1016/j.patrec.2024.08.016
Weihua Liu, Xiabi Liu, Huiyu Li, Chaochao Lin
The popular softmax loss and its recent extensions have achieved great success in deep learning-based image classification. However, the data used to train image classifiers often exhibit a highly skewed distribution in quality, i.e., high-quality samples far outnumber low-quality ones. If this problem is ignored, low-quality data are hard to classify correctly. In this paper, we discover a positive correlation between the quality of an image and its feature norm (L2-norm) learned from softmax loss through careful experiments on various applications with different deep neural networks. Based on this finding, we propose a contraction mapping function to compress the range of feature norms of training images according to their quality and embed this contraction mapping function into softmax loss and its extensions to produce novel learning objectives. Experiments on various applications, including handwritten digit recognition, lung nodule classification, and face recognition, demonstrate that the proposed approach deals effectively with the problem of learning from quality-imbalanced data and leads to significant and stable improvements in classification accuracy. The code is available at https://github.com/Huiyu-Li/CM-M-Softmax-Loss.
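To make "compressing the range of feature norms" concrete, the sketch below applies a simple affine contraction to L2-norms and rescales a feature vector accordingly. The affine form, the `center` and `rate` parameters, and the function names are assumptions chosen for illustration; the abstract does not give the authors' actual mapping, which may differ.

```python
import math
from typing import List


def contract_norm(norm: float, center: float, rate: float) -> float:
    """Affine contraction g(x) = center + rate * (x - center).

    For 0 <= rate < 1, |g(a) - g(b)| = rate * |a - b|, so distances between
    norms shrink by the factor `rate` and `center` is the fixed point.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1) for the map to be a contraction")
    return center + rate * (norm - center)


def rescale_feature(feature: List[float], center: float, rate: float) -> List[float]:
    """Rescale a feature vector so that its L2-norm equals the contracted norm."""
    norm = math.sqrt(sum(value * value for value in feature))
    if norm == 0.0:
        return list(feature)  # direction undefined; leave the zero vector unchanged
    scale = contract_norm(norm, center, rate) / norm
    return [value * scale for value in feature]


# Norms spread over [2, 18] are pulled into [6, 14] around the center 10.
norms = [2.0, 10.0, 18.0]
contracted = [contract_norm(n, center=10.0, rate=0.5) for n in norms]
print(contracted)  # [6.0, 10.0, 14.0]
```

Under this hypothetical mapping, a low-norm (low-quality) feature is scaled up and a high-norm feature is scaled down, which narrows the gap between the two groups while keeping each feature's direction.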
Title: Contraction mapping of feature norms for data quality imbalance learning. Pattern Recognition Letters, vol. 185, pp. 232-238.
Pub Date: 2024-08-22 | DOI: 10.1016/j.patrec.2024.08.010
Manh Hung Nguyen, Lisheng Sun Hosoya, Isabelle Guyon
Training a large set of machine learning algorithms to convergence in order to select the best-performing algorithm for a dataset is computationally wasteful. Moreover, in a budget-limited scenario, it is crucial to carefully select an algorithm candidate and allocate a budget for training it, ensuring that the limited budget is optimally distributed to favor the most promising candidates. Casting this problem as a Markov Decision Process, we propose a novel framework in which an agent must select the most promising algorithm during the learning process, without waiting until it is fully trained. At each time step, given an observation of partial learning curves of algorithms, the agent must decide whether to allocate resources to further train the most promising algorithm (exploitation), to wake up another algorithm previously put to sleep, or to start training a new algorithm (exploration). In addition, our framework allows the agent to meta-learn from learning curves on past datasets along with dataset meta-features and algorithm hyperparameters. By incorporating meta-learning, we aim to avoid myopic decisions based solely on premature learning curves on the dataset at hand. We introduce two benchmarks of learning curves that were used in international competitions at WCCI’22 and AutoML-conf’22, and we analyze the results of these competitions. Our findings show that both meta-learning and the progression of learning curves enhance the algorithm selection process, as evidenced by the methods of the winning teams and our DDQN baseline, compared to heuristic baselines or a random search. Interestingly, our cost-effective baseline, which selects the best-performing algorithm w.r.t. a small budget, can perform decently when learning curves do not intersect frequently.
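The cost-effective baseline mentioned at the end of the abstract (pick the algorithm that performs best under a small budget) can be sketched as follows. This is a reading of that one sentence under stated assumptions, not the authors' code: learning curves are modeled as hypothetical functions from cumulative budget to score, every algorithm is probed with the same small budget, and the remaining budget goes to the winner. All names are illustrative.

```python
from typing import Callable, Dict, Tuple

# Hypothetical model of a learning curve: cumulative training budget -> score.
LearningCurve = Callable[[float], float]


def select_with_small_budget(curves: Dict[str, LearningCurve],
                             total_budget: float,
                             probe_budget: float) -> Tuple[str, float]:
    """Probe every algorithm with `probe_budget`, then spend the rest on the best one.

    Returns the chosen algorithm's name and its score after the full allocation.
    """
    probe_cost = probe_budget * len(curves)
    if probe_cost > total_budget:
        raise ValueError("not enough budget to probe every algorithm")
    probe_scores = {name: curve(probe_budget) for name, curve in curves.items()}
    best = max(probe_scores, key=probe_scores.get)
    remaining = total_budget - probe_cost
    return best, curves[best](probe_budget + remaining)


# Two toy curves that do not intersect within the available budget.
toy_curves = {
    "fast": lambda budget: min(0.8, 0.1 * budget),
    "slow": lambda budget: min(0.95, 0.02 * budget),
}
choice, final_score = select_with_small_budget(toy_curves, total_budget=20.0, probe_budget=2.0)
print(choice, final_score)  # fast 0.8
```

If the toy curves crossed later on (for example with a much larger total budget, where "slow" eventually overtakes "fast"), the early probe would pick the wrong algorithm, which is consistent with the abstract's remark that this baseline works decently when learning curves do not intersect frequently.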
Title: Meta-learning from learning curves for budget-limited algorithm selection. Pattern Recognition Letters, vol. 185, pp. 225-231.
Pub Date: 2024-08-22 | DOI: 10.1016/j.patrec.2024.08.015
Áron Fóthi, Joul Skaf, Fengjiao Lu, Kristian Fenech
This paper addresses the challenging task of unsupervised relative human pose estimation. Our solution exploits the potential offered by multiple uncalibrated cameras. It is assumed that spatial human pose and camera parameter estimation can be solved as a block sparse dictionary learning problem with zero supervision. The resulting structures and camera parameters can fit individual skeletons into a common space. To do so, we use the fact that all individuals in an image are viewed from the same camera viewpoint, which lets us exploit the information provided by multiple camera views and overcome the lack of information on camera parameters. To the best of our knowledge, this is the first solution that requires neither 3D ground truth nor knowledge of the intrinsic or extrinsic camera parameters. Our approach demonstrates the potential of using multiple viewpoints to solve challenging computer vision problems. To encourage further development and experimentation, the code is available at https://github.com/Jeryoss/MVMB-NRSFM.
Title: Deep NRSFM for multi-view multi-body pose estimation. Pattern Recognition Letters, vol. 185, pp. 218-224. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167865524002472/pdfft?md5=c7f415f86c9c99693c29d66ef080962f&pid=1-s2.0-S0167865524002472-main.pdf
Pub Date: 2024-08-22 | DOI: 10.1016/j.patrec.2024.08.008
Usman Muhammad, Mourad Oussalah, Jorma Laaksonen
With the growing availability of databases for face presentation attack detection, researchers are increasingly focusing on video-based face anti-spoofing methods that involve hundreds to thousands of images for training the models. However, there is currently no clear consensus on the optimal number of frames in a video to improve face spoofing detection. Inspired by the visual saliency theory, we present a video summarization method for face anti-spoofing detection that aims to enhance the performance and efficiency of deep learning models by leveraging visual saliency. In particular, saliency information is extracted from the differences between the Laplacian and Wiener filter outputs of the source images, enabling the identification of the most visually salient regions within each frame. Subsequently, the source images are decomposed into base and detail images, enhancing the representation of the most important information. Weighting maps are then computed based on the saliency information, indicating the importance of each pixel in the image. By linearly combining the base and detail images using the weighting maps, the method fuses the source images to create a single representative image that summarizes the entire video. The key contribution of the proposed method lies in demonstrating how visual saliency can be used as a data-centric approach to improve the performance and efficiency for face presentation attack detection. By focusing on the most salient images or regions within the images, a more representative and diverse training set can be created, potentially leading to more effective models. To validate the method’s effectiveness, a simple CNN–RNN deep learning architecture was used, and the experimental results showcased state-of-the-art performance on four challenging face anti-spoofing datasets.
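The final fusion step described above (linearly combining images with per-pixel weighting maps derived from saliency) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it fuses whole frames rather than separate base and detail layers, it takes the saliency maps as given instead of computing them from Laplacian and Wiener filter outputs, and the function name and `eps` parameter are hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def fuse_frames(frames: List[Image], saliency: List[Image], eps: float = 1e-8) -> Image:
    """Fuse frames into one image using saliency maps as per-pixel weights.

    At each pixel the weights are the saliency values normalized to sum to 1
    across frames; `eps` avoids division by zero where all saliency is 0.
    """
    height, width = len(frames[0]), len(frames[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            total = sum(s[row][col] for s in saliency) + eps
            weighted = sum(f[row][col] * s[row][col] for f, s in zip(frames, saliency))
            fused[row][col] = weighted / total
    return fused


# Two 1x2 frames; the second frame is three times as salient at the first pixel.
toy_frames = [[[0.0, 1.0]], [[1.0, 1.0]]]
toy_saliency = [[[1.0, 1.0]], [[3.0, 1.0]]]
summary = fuse_frames(toy_frames, toy_saliency)
print([round(value, 4) for value in summary[0]])  # [0.75, 1.0]
```

The result is a single image per video, so a classifier trained on such summaries sees one input per clip instead of hundreds of frames.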
Title: Saliency-based video summarization for face anti-spoofing. Pattern Recognition Letters, vol. 185, pp. 190-196.
Pub Date: 2024-08-22 | DOI: 10.1016/j.patrec.2024.08.014
Xin Min, Wei Li, Ruiqi Han, Tianlong Ji, Weidong Xie
With the advancement of information technology in healthcare, electronic medical records (EMRs) have become the repository of patients’ treatment processes in hospitals, including the patient’s treatment pattern (standard treatment process), medical history, admission diagnosis, etc. In particular, EMR-based treatment recommendation systems have become critical for optimizing clinical decision-making. EMRs contain complex relationships between patients and treatment patterns. Recent studies have shown that graph neural collaborative filtering can effectively capture the complex relationships in EMRs. However, none of the existing methods take into account the impact of medical content, such as the patient’s admission diagnosis and medical history, on treatment recommendations. In this work, we propose a graph neural collaborative filtering model with medical content-aware pre-training (CAPRec) for learning initial embeddings with medical content to improve recommendation performance. First, the model constructs a patient-treatment pattern interaction graph from EMR data. Then, we use the medical content for pre-training and transfer the learned embeddings to a graph neural collaborative filtering model. Finally, the learned initial embeddings support the downstream task of graph collaborative filtering. Extensive experiments on real-world datasets have consistently demonstrated the effectiveness of the medical content-aware training framework in improving treatment recommendations.
Title: Graph neural collaborative filtering with medical content-aware pre-training for treatment pattern recommendation. Pattern Recognition Letters, vol. 185, pp. 210-217.
Pub Date: 2024-08-22 | DOI: 10.1016/j.patrec.2024.08.012
Anurag Dhote, Mohammed Javed, David S. Doermann
Charts are a visualization tool used in scientific documents to make complex relationships in data and experiments easy to comprehend. Because researchers use many chart types to convey scientific information, data extraction and subsequent chart understanding are very challenging. Many studies in the literature address chart mining, motivated by the need to edit existing charts, carry out extrapolative studies, and gain a deeper understanding of the underlying data. The first step towards chart understanding is chart classification, for which traditional ML and CNN-based deep learning models have been used in the literature. In this paper, we propose Swin-Chart, a Swin transformer-based deep learning approach for chart classification, which generalizes well across multiple datasets with a wide range of chart categories. Swin-Chart comprises a pre-trained Swin Transformer, a finetuning component, and a weight averaging component. The proposed approach is tested on five chart image benchmark datasets. We observed that the Swin-Chart model outperforms existing state-of-the-art models on all the datasets. Furthermore, we provide an ablation study of the Swin-Chart model on all five datasets to understand the importance of its sub-parts: the backbone Swin transformer model, the number of best weights selected for the weight averaging component, and the presence of the weight averaging component itself.
The Swin-Chart model also took first place in the chart classification task on the latest dataset in the CHART Infographics competition at ICDAR 2023 (chartinfo.github.io).
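The abstract names a weight averaging component but does not say how the weights are selected or combined. As a rough illustration only, the sketch below assumes a uniform average over the k checkpoints with the highest validation score. The function name `average_top_k`, the score-based selection, and the flattened dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def average_top_k(checkpoints: List[Tuple[float, Weights]], k: int) -> Weights:
    """Uniformly average the parameters of the k best-scoring checkpoints.

    Each checkpoint is a (validation_score, weights) pair. Selecting by
    validation score is an assumption made for this sketch.
    """
    if not 1 <= k <= len(checkpoints):
        raise ValueError("k must be between 1 and the number of checkpoints")
    # Sort by validation score, best first, and keep the top k.
    best = sorted(checkpoints, key=lambda c: c[0], reverse=True)[:k]
    averaged: Weights = {}
    for name in best[0][1]:
        # Group the i-th value of this parameter across the selected checkpoints.
        columns = zip(*(weights[name] for _, weights in best))
        averaged[name] = [sum(col) / k for col in columns]
    return averaged


checkpoints = [
    (0.90, {"head.weight": [1.0, 2.0], "head.bias": [0.0]}),
    (0.95, {"head.weight": [3.0, 4.0], "head.bias": [1.0]}),
    (0.80, {"head.weight": [9.0, 9.0], "head.bias": [9.0]}),
]
merged = average_top_k(checkpoints, k=2)
print(merged)  # {'head.weight': [2.0, 3.0], 'head.bias': [0.5]}
```

In this toy run the lowest-scoring checkpoint is ignored, so its outlying values do not affect the merged weights. Varying `k` corresponds loosely to the ablation over the number of best weights mentioned in the abstract.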
{"title":"Swin-chart: An efficient approach for chart classification","authors":"Anurag Dhote, Mohammed Javed, David S. Doermann","doi":"10.1016/j.patrec.2024.08.012","DOIUrl":"10.1016/j.patrec.2024.08.012","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"185","pages":"Pages 203-209"},"PeriodicalIF":3.9,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142083775","RegionNum":3,"RegionCategory":"Computer Science"}
Pub Date : 2024-08-20 DOI: 10.1016/j.patrec.2024.08.007
Ali Zoljodi, Sadegh Abadijou, Mina Alibeigi, Masoud Daneshtalab
Detecting lane markings in road scenes is challenging because the markings are intricate and easily degraded by unfavorable conditions. Although lane markings have strong shape priors, their visibility is easily compromised by varying lighting, adverse weather, occlusion by other vehicles or pedestrians, changes in the road plane, and fading of colors over time. The variety of lane shapes and natural variations complicates detection further, so training a robust lane detection model that handles diverse real-world scenarios requires large amounts of high-quality, diverse data.
In this paper, we present a novel self-supervised learning method termed Contrastive Learning for Lane Detection via Cross-Similarity (CLLD) to enhance the resilience and effectiveness of lane detection models in real-world scenarios, particularly when the visibility of lane markings is compromised. CLLD introduces a contrastive learning (CL) method that assesses the similarity of local features within the global context of the input image and uses the surrounding information to predict lane markings. This is achieved by integrating local feature contrastive learning with our newly proposed operation, dubbed cross-similarity.
The local feature CL concentrates on extracting features from small patches, which is necessary for accurately localizing lane segments. Meanwhile, cross-similarity captures global features, enabling the detection of obscured lane segments based on their surroundings. We enhance cross-similarity by randomly masking portions of the input images during augmentation. Extensive experiments on the TuSimple and CULane benchmark datasets demonstrate that CLLD consistently outperforms state-of-the-art contrastive learning methods, particularly in visibility-impairing conditions such as shadows, while delivering comparable results under normal conditions. Compared to supervised learning, CLLD still excels in challenging scenarios such as shadows and crowded scenes, which are common in real-world driving.
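The abstract states that input images are randomly masked during augmentation but gives no details of the scheme. The sketch below is one simple way such masking could look, assuming square patches that are zeroed independently with a fixed probability. The function name `mask_random_patches` and the parameters `patch`, `ratio`, and `seed` are hypothetical. It does not implement the cross-similarity operation, which the abstract does not define.

```python
import random
from typing import List

# Hypothetical representation: a grayscale image as rows of pixel values.
Image = List[List[float]]


def mask_random_patches(image: Image, patch: int, ratio: float, seed: int = 0) -> Image:
    """Return a copy of `image` in which each patch x patch block is zeroed
    independently with probability `ratio`. The input is left unchanged."""
    if patch <= 0 or not 0.0 <= ratio <= 1.0:
        raise ValueError("patch must be positive and ratio must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    out = [row[:] for row in image]
    height, width = len(image), len(image[0])
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < ratio:
                # Zero the whole block, clipping at the image border.
                for y in range(top, min(top + patch, height)):
                    for x in range(left, min(left + patch, width)):
                        out[y][x] = 0.0
    return out


image = [[1.0] * 4 for _ in range(4)]
masked = mask_random_patches(image, patch=2, ratio=0.5, seed=1)
for row in masked:
    print(row)
```

Masking whole blocks rather than single pixels hides contiguous regions, which loosely mimics an occluded lane segment that must be inferred from its surroundings.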
{"title":"Contrastive Learning for Lane Detection via cross-similarity","authors":"Ali Zoljodi, Sadegh Abadijou, Mina Alibeigi, Masoud Daneshtalab","doi":"10.1016/j.patrec.2024.08.007","DOIUrl":"10.1016/j.patrec.2024.08.007","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"185","pages":"Pages 175-183"},"PeriodicalIF":3.9,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167865524002393/pdfft?md5=216ead31bb4d56cfb720a21ce2d4db87&pid=1-s2.0-S0167865524002393-main.pdf","platform":"Semanticscholar","paperid":"142021151","RegionNum":3,"RegionCategory":"Computer Science","PMCID":"OA"}