Pub Date : 2024-05-13 | DOI: 10.1007/s10044-024-01269-w
Umang Patel, Shruti Bhilare, Avik Hati
A speaker recognition system (SRS) serves as a gatekeeper for secure access, using the unique vocal characteristics of individuals for identification and verification. SRSs can be found in several biometric security applications, such as banks, autonomous cars, the military, and smart devices. However, as technology advances, so do the threats to these models. With the rise of adversarial attacks, these models have been put to the test: adversarial machine learning (AML) techniques have been used to exploit vulnerabilities in SRSs, threatening their reliability and security. In this study, we concentrate on transferability in AML within the realm of SRSs. Transferability refers to the ability of adversarial examples generated for one model to fool another model. Our research centers on enhancing the transferability of adversarial attacks in SRSs; our approach is to strategically skip non-linear activation functions during backpropagation. The proposed method yields promising results in enhancing the transferability of adversarial examples across diverse SRS architectures, parameters, features, and datasets. To validate its effectiveness, we conduct an evaluation using the state-of-the-art FoolHD attack, an attack designed specifically for exploiting SRSs. By applying our method in cross-architecture, cross-parameter, cross-feature, and cross-dataset settings, we demonstrate its resilience and versatility. To evaluate the performance of the proposed method in improving transferability, we introduce three novel metrics: enhanced transferability, relative transferability, and effort in enhancing transferability. Our experiments demonstrate a significant boost in the transferability of adversarial examples in SRSs. This research contributes to the growing body of knowledge on AML for SRSs and emphasizes the urgency of developing robust defenses to safeguard these critical biometric systems.
Title: Enhancing cross-domain transferability of black-box adversarial attacks on speaker recognition systems using linearized backpropagation (Pattern Analysis and Applications)
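The activation-skipping idea in the abstract above can be illustrated on a toy network. This is not the authors' code: it is a minimal sketch, with made-up weights, of "linearized" backpropagation through a one-hidden-layer ReLU network, where the ReLU derivative is replaced by 1 on the backward pass so the gradient flows as if the network were linear.

```python
# Illustrative sketch (not the paper's implementation): linearized backprop
# skips the ReLU derivative, so gradients also flow through "dead" units.

def backprop_input_grad(x, W1, w2, linearize=False):
    """Gradient of the scalar output w.r.t. the input x.

    Forward pass: h = relu(W1 @ x), y = w2 . h
    With linearize=True, relu' is treated as identity during backprop.
    """
    pre = [sum(wi * xi for wi, xi in zip(row, x)) for row in W1]  # W1 @ x
    if linearize:
        dpre = list(w2)                                   # skip the activation
    else:
        dpre = [w * (1.0 if p > 0 else 0.0) for w, p in zip(w2, pre)]
    # dy/dx = W1^T @ dpre
    return [sum(W1[j][i] * dpre[j] for j in range(len(W1)))
            for i in range(len(x))]

x  = [1.0, -2.0]
W1 = [[0.5, -1.0], [1.5, 2.0]]   # hidden pre-activations: 2.5 and -2.5
w2 = [1.0, 1.0]

g_standard   = backprop_input_grad(x, W1, w2, linearize=False)
g_linearized = backprop_input_grad(x, W1, w2, linearize=True)
print(g_standard)    # [0.5, -1.0]: the second (dead) unit contributes nothing
print(g_linearized)  # [2.0, 1.0]: both rows contribute, as in a linear net
```

The linearized gradient is smoother across model-specific activation patterns, which is the intuition behind why such gradients transfer better between surrogate and victim models.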
Pub Date : 2024-05-09 | DOI: 10.1007/s10044-024-01281-0
Shuming Cui, Hongwei Deng
The recently proposed DETR successfully applied the Transformer to object detection and achieved impressive results. However, the learned object queries often explore the entire image to match the corresponding regions, resulting in slow convergence of DETR. Additionally, DETR only uses single-scale features from the final stage of the backbone network, leading to poor performance in small object detection. To address these issues, we propose an effective training strategy for improving the DETR framework, named PMG-DETR, built on position-sensitive multi-scale attention and grouped queries. First, to better fuse multi-scale features, we propose a position-sensitive multi-scale attention module; by incorporating a spatial sampling strategy into deformable attention, we further improve small object detection. Second, we extend the attention mechanism by introducing a novel positional encoding scheme. Finally, we propose a grouping strategy for object queries, where queries are grouped at the decoder side for a more precise inclusion of regions of interest and to accelerate DETR convergence. Extensive experiments on the COCO dataset show that PMG-DETR achieves better performance than DETR, e.g., an AP of 47.8% with a ResNet50 backbone trained for 50 epochs. We perform ablation studies on the COCO dataset to validate the effectiveness of the proposed PMG-DETR.
Title: PMG-DETR: fast convergence of DETR with position-sensitive multi-scale attention and grouped queries
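The query-grouping strategy above can be sketched in miniature. This is an illustrative toy, not PMG-DETR: queries, keys, and values are scalars with invented numbers, and each group of decoder queries attends only to its own slice of the image features, narrowing the matching region.

```python
# Toy sketch of grouped queries: each query group attends to a restricted
# subset of keys/values instead of the whole feature map.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def grouped_attention(queries, keys, values, n_groups):
    """Each query group attends only to the matching slice of keys/values."""
    out = []
    qs_per_group = len(queries) // n_groups
    ks_per_group = len(keys) // n_groups
    for gi in range(n_groups):
        ks = keys[gi * ks_per_group:(gi + 1) * ks_per_group]
        vs = values[gi * ks_per_group:(gi + 1) * ks_per_group]
        for q in queries[gi * qs_per_group:(gi + 1) * qs_per_group]:
            weights = softmax([q * k for k in ks])   # scalar "features"
            out.append(sum(w * v for w, v in zip(weights, vs)))
    return out

queries = [1.0, 1.0, 1.0, 1.0]     # 4 queries split into 2 groups of 2
keys    = [0.0, 0.0, 10.0, 10.0]
values  = [1.0, 2.0, 3.0, 4.0]
res = grouped_attention(queries, keys, values, n_groups=2)
print(res)   # group 0 mixes only values 1,2; group 1 mixes only values 3,4
```

Restricting each group's attention span is what lets the decoder lock onto regions of interest earlier, which is the convergence argument the abstract makes.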
Pub Date : 2024-05-09 | DOI: 10.1007/s10044-024-01280-1
Ambily Francis, S. Immanuel Alex Pandian, K. Martin Sagayam, Lam Dang, J. Anitha, Linh Dinh, Marc Pomplun, Hien Dang
Alzheimer’s disease is a degenerative brain disease that impairs memory, thinking skills, and the ability to perform even the most basic tasks. The primary challenge in this domain is accurate early-stage disease detection. When the disease is detected at an early stage, medical professionals can prescribe medications to reduce brain shrinkage. Although the disease may not be curable, these interventions can extend the patient’s life by slowing the rate of shrinkage. The four cognitive states of the human brain are cognitive normal (CN), mild cognitive impairment convertible (MCIc), mild cognitive impairment non-convertible (MCInc), and Alzheimer’s disease (AD). MCIc is the early stage of Alzheimer’s disease: individuals with MCIc will develop Alzheimer’s disease within a few years. However, it is difficult to detect this state through medical investigations. MCInc is the state immediately before MCIc; it is a common condition in people of all ages, where minor memory issues arise as a result of normal aging. Early detection of AD can be claimed if and only if the transition from MCInc to MCIc is identified. Deep learning algorithms are promising techniques for identifying the progression stage of the disease using magnetic resonance imaging. In this study, a novel deep learning algorithm was proposed to improve the classification accuracy of MCIc vs. MCInc, combining local binary patterns with squeeze-and-excitation networks (SENet). Without the squeeze-and-excitation network, the classification accuracy of MCIc versus MCInc was 82%; with SENet, it improved to 86%. The experimental results show that the proposed model achieves better performance for MCInc vs. MCIc classification in terms of accuracy, precision, recall, F1 score, and ROC.
Title: Early detection of Alzheimer’s disease using squeeze and excitation network with local binary pattern descriptor
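The local binary pattern descriptor mentioned above is a standard texture operator and easy to sketch. This is textbook 3x3 LBP on a made-up grayscale patch, not the authors' pipeline: each of the eight neighbours is compared with the centre pixel, and the comparisons form one 8-bit code.

```python
# Classic 3x3 local binary pattern (LBP): one bit per neighbour comparison.

def lbp_code(patch):
    """LBP code of the centre pixel of a 3x3 grayscale patch.

    Neighbours are read clockwise from the top-left; each neighbour >= centre
    contributes a 1 bit, most significant bit first.
    """
    c = patch[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2),
             (2, 2), (2, 1), (2, 0), (1, 0)]   # TL, T, TR, R, BR, B, BL, L
    code = 0
    for bit, (r, col) in enumerate(order):
        if patch[r][col] >= c:
            code |= 1 << (7 - bit)
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))   # 10001111 in binary = 143
```

A histogram of such codes over an MRI slice gives the texture features that, per the abstract, SENet then reweights channel-wise.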
Pub Date : 2024-05-06 | DOI: 10.1007/s10044-024-01283-y
Joannes Falade, Sandra Cremer, Christophe Rosenberger
Fingerprint identification is an important issue in person recognition when using Automatic Fingerprint Identification Systems (AFIS). The size of fingerprint databases has increased with the growing use of AFIS for identification at border control, visa issuance, and other procedures around the world. Fingerprint indexing algorithms are used to reduce the fingerprint search space, speed up identification, and improve the accuracy of the identification result. In this paper, we propose a new binary fingerprint indexing method based on synthetic indexes to address this problem on large databases. Two fundamental properties are considered for these synthetic indexes: discriminancy and representativeness. A biometric database is then structured using synthetic indexes for each fingerprint template, which guarantees a fixed number of indexes for the database during the enrollment and identification processes. We compare the proposed algorithm with the classical Minutiae Cylinder Code (MCC) indexing method, one of the best-performing methods in the state of the art. To evaluate the proposed method, we use all Fingerprint Verification Competition (FVC) datasets from 2000 to 2006, both separately and combined, to confirm the accuracy of our algorithm for real applications. The proposed method achieves a high hit rate (more than 98%) at a low penetration rate (less than 5%) compared to existing methods in the literature.
Title: Digital fingerprint indexing using synthetic binary indexes
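The search side of binary indexing can be sketched simply. This is a hedged illustration, not the paper's method: the invented codes below stand in for synthetic indexes, lookup is nearest-Hamming-distance, and the fraction of the database returned is the penetration rate the abstract reports.

```python
# Sketch of fixed-length binary index lookup: retrieve enrolled templates
# whose index is within a Hamming-distance budget of the probe's index.

def hamming(a, b):
    return bin(a ^ b).count("1")

def search(db, probe_code, max_dist):
    """Candidate ids within max_dist of the probe, closest first.

    Full minutiae matching then runs on this short list only, which is how
    indexing shrinks the identification search space.
    """
    hits = [(hamming(code, probe_code), tid) for tid, code in db.items()]
    return [tid for d, tid in sorted(hits) if d <= max_dist]

db = {"t1": 0b10110010, "t2": 0b10110011, "t3": 0b01001100}  # toy 8-bit codes
candidates = search(db, probe_code=0b10110010, max_dist=1)
print(candidates)                    # ['t1', 't2']
print(len(candidates) / len(db))     # penetration rate for this probe
```

The hit rate is then the fraction of probes whose true mate appears in the candidate list, and the trade-off between the two is tuned via the distance budget.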
Pub Date : 2024-05-05 | DOI: 10.1007/s10044-024-01272-1
Preethi Sambandam Raju, Revathi Arumugam Rajendran, Murugan Mahalingam
With the growing use of wireless surveillance cameras in Internet of Things (IoT) applications, the need to address storage capacity and transmission bandwidth challenges becomes crucial. The majority of successive frames from surveillance cameras contain redundant and irrelevant information, increasing the transmission burden. Existing video pre-processing techniques often focus on reducing the number of frames without considering accuracy, and fail to handle spatial and temporal redundancies simultaneously. To address these issues, an anchor-free key action point network (AKA-Net) is proposed for video pre-processing in the IoT-edge computing environment. The oriented FAST and rotated BRIEF (ORB) feature descriptor, built on Features from Accelerated Segment Test (FAST) keypoints and Binary Robust Independent Elementary Features (BRIEF) descriptors, is employed to remove duplicate frames, leading to a more compact and efficient video representation. AKA-Net's major contributions include its powerful representation capabilities, achieved through the bottleneck module in the information-transferring backbone network, which effectively captures multi-scale features. The information-transferring module improves the performance of the object detection algorithm by fusing complementary information from different scales, allowing objects of different sizes to be detected more accurately and making the method highly effective for real-time video pre-processing. A key action point selection module based on a self-attention mechanism is then introduced to accurately select informative key action points, enabling efficient network transmission with lower bandwidth requirements while maintaining high accuracy and low latency. It treats every pixel within the feature map as a temporal-spatial point and leverages self-attention to identify and select the most relevant keypoints. Experiments show that the proposed AKA-Net outperforms existing methods, achieving a compression ratio of 54.2% and an accuracy of 96.7%. By addressing spatial and temporal redundancies and optimizing key action point selection, AKA-Net offers a significant advancement in video pre-processing for smart surveillance systems, benefiting various IoT applications.
Title: Aka-Net: anchor free-based object detection network for surveillance video transmission in the IOT edge computing environment
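The duplicate-frame removal step above can be sketched without any vision library. AKA-Net uses ORB descriptors; as a dependency-free stand-in, this sketch uses a tiny row-wise difference hash over made-up intensity grids, which captures the same idea of dropping near-identical successive frames before transmission.

```python
# Sketch of duplicate-frame filtering: hash each frame, keep a frame only if
# its hash differs enough from the last frame that was kept.

def dhash(frame):
    """Row-wise difference hash: one bit per horizontal neighbour pair."""
    bits = 0
    for row in frame:
        for a, b in zip(row, row[1:]):
            bits = (bits << 1) | (1 if a < b else 0)
    return bits

def keep_frames(frames, min_bits_changed):
    """Indices of frames to transmit; near-duplicates are skipped."""
    kept, last = [], None
    for i, f in enumerate(frames):
        h = dhash(f)
        if last is None or bin(h ^ last).count("1") >= min_bits_changed:
            kept.append(i)
            last = h
    return kept

f1 = [[10, 20, 30], [30, 20, 10]]   # toy 2x3 intensity grids
f2 = [[11, 21, 31], [31, 21, 11]]   # same gradient structure as f1
f3 = [[30, 20, 10], [10, 20, 30]]   # reversed gradients: a real scene change
print(keep_frames([f1, f2, f3], min_bits_changed=1))   # [0, 2]
```

Only frames that survive this filter would then be passed to the detection and key-action-point stages, which is where the bandwidth saving comes from.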
Pub Date : 2024-05-03 | DOI: 10.1007/s10044-024-01279-8
Atefeh Ghorbanpour, Manoochehr Nahvi
Analysis of video sequences of public places is an important topic in video surveillance. Due to the high probability of abnormal behavior occurring in crowded scenes, the main purpose of many surveillance systems is to monitor crowd movement and detect abnormalities. To speed up this process and reduce errors, it is important to use automated, intelligent tools in surveillance systems as an alternative to a human operator. This study presents an unsupervised, online algorithm for analyzing dynamic crowd behavior that uses the proposed features to analyze crowds over time and reveal the different behaviors of crowd groups. In the proposed algorithm, prominent points are first tracked. These key points are then processed by the proposed system, which removes fixed points, computes the proposed features of the moving points, automatically determines neighborhoods, and measures the similarity of invariant neighbors. Group clustering is performed automatically, and the classification stage is conducted without a training phase. The dynamic behavior of the crowd is examined using the features and the extracted group properties, and different states in the scene are diagnosed by dynamic thresholding. Experimental evaluation of the proposed method on several databases shows that it performs well on video sequences and is able to detect various abnormal behaviors in crowded scenes.
Title: Unsupervised group-based crowd dynamic behavior detection and tracking in online video sequences
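The dynamic-thresholding idea above can be sketched on a one-dimensional stream. This is an illustration under stated assumptions, not the paper's detector: the motion statistic, window size, and multiplier k are placeholders, and a sample is flagged when it leaves a running mean-plus-k-standard-deviations band.

```python
# Sketch of online dynamic thresholding: flag a sample as abnormal when it
# exceeds the mean + k*std of the preceding window.

def detect_anomalies(signal, window, k):
    """Indices i where signal[i] > mean + k*std of signal[i-window:i]."""
    flagged = []
    for i in range(window, len(signal)):
        hist = signal[i - window:i]
        mean = sum(hist) / window
        std = (sum((v - mean) ** 2 for v in hist) / window) ** 0.5
        if signal[i] > mean + k * std:
            flagged.append(i)
    return flagged

# steady crowd-motion energy, then a sudden spike (e.g. a crowd dispersing)
speeds = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 5.0, 1.0]
print(detect_anomalies(speeds, window=5, k=3.0))   # [7]: only the spike
```

Because the threshold is recomputed from recent history, the same rule adapts across scenes with different baseline activity, which is the point of making it dynamic rather than fixed.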
Pub Date : 2024-05-02 | DOI: 10.1007/s10044-024-01273-0
D. Anil Kumar, P. V. V. Kishore, K. Sravani
Human pose identification from 2D video sequences is extremely challenging under recording artifacts such as lighting, sensor motion, and unpredictable subject movements. In this work, the objective is to recognize rhythmic human poses from independently sourced online videos of an Indian classical dance form, Bharatanatyam. The dataset (BOICDVD22) consists of internet-sourced video frames of 5 different songs from 10 dancers, labelled into the corresponding lyrical classes. Achieving decent inference accuracy with models trained on this multi-sourced online data is challenging. Past work focused on creating miniature, offline, non-shareable Indian classical dance (ICD) datasets with standard deep learning models, which resulted in unsatisfactory performance. Recently, attention-based feature learning has been driving the performance of deep learning models, and wavelet-based attention is the most suitable attention mechanism for online data. Though successful, wavelet-based feature learning is applied at only one layer and depends on global average pooling (GAP) in both the channel and spatial dimensions, and the current generation of wavelet attention yields unbalanced spatial attention across video frames. To overcome this unbalanced attention and induce human-like attention, this work proposes replacing the GAP-based wavelet channel or spatial attention at a particular layer of the backbone architecture with wavelet multi-head progressive attention (WMHPA). This enhances the attention mechanism and decreases information loss, since no GAP is used. Progressiveness in attention enables the WMHPA to distribute attention features evenly across all video frames. The results show the highest accuracy on the dance dataset, attributable to multi-resolution attention across the entire network. WMHPA is validated against the state of the art on our ICD dataset as well as on benchmark person re-identification and action datasets.
Title: Deep Bharatanatyam pose recognition: a wavelet multi head progressive attention
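For background, the multi-resolution decomposition that wavelet-attention modules such as WMHPA build on can be sketched with a single-level 1-D Haar transform. This is standard Haar, not the authors' attention module: the low-pass half keeps coarse structure, the high-pass half keeps detail, and the split is exactly invertible.

```python
# One level of the 1-D Haar wavelet transform and its inverse.

def haar_step(x):
    """Split x (even length) into (approximation, detail) half-signals."""
    approx = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]  # pairwise means
    detail = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]  # pairwise diffs
    return approx, detail

def haar_inverse(approx, detail):
    """Exactly reconstruct the original signal from the two halves."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

x = [4.0, 2.0, 5.0, 7.0]
approx, detail = haar_step(x)
print(approx, detail)                # [3.0, 6.0] [1.0, -1.0]
print(haar_inverse(approx, detail))  # [4.0, 2.0, 5.0, 7.0]
```

An attention module can weight the approximation and detail bands separately, which is what gives wavelet attention its multi-resolution character; the lossless inverse is why no pooling-style information loss is incurred.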
Pub Date : 2024-04-30DOI: 10.1007/s10044-024-01275-y
Min-Chang Liu, Fang-Rong Hsu, Chua-Huang Huang
Complex event processing refers to tracking and analyzing a set of related events and drawing conclusions from them. For such systems, complex event recognition is essential. The objective of complex event recognition is to identify meaningful events or patterns and to construct processing rules that respond to them. Researchers have conducted numerous studies on the recognition of complex event patterns using recognition languages or models. However, the completeness of the process in complex event recognition has rarely been discussed. Although real-world events are inherently uncertain, the structure for modeling and explaining complex event interactions over contingent information remains unclear. In this study, we develop a general framework to address these problems and demonstrate the applicability of model-based approaches for representing spatio-temporal dimensions and causality in complex event recognition. In this paper, we propose an event behavior model for complex event recognition from a process perspective. The developed model can detect and explain anomalies associated with complex events. An experiment was conducted to evaluate the model's performance. The results revealed that temporal operations within overlapping events were crucial to event pattern recognition.
{"title":"Complex event recognition and anomaly detection with event behavior model","authors":"Min-Chang Liu, Fang-Rong Hsu, Chua-Huang Huang","doi":"10.1007/s10044-024-01275-y","DOIUrl":"https://doi.org/10.1007/s10044-024-01275-y","url":null,"abstract":"<p>Complex event processing refers to tracking and analyzing a set of related events and drawing conclusions from them. For such systems, complex event recognition is essential. The objective of complex event recognition is to identify meaningful events or patterns and to construct processing rules that respond to them. Researchers have conducted numerous studies on the recognition of complex event patterns using recognition languages or models. However, the completeness of the process in complex event recognition has rarely been discussed. Although real-world events are inherently uncertain, the structure for modeling and explaining complex event interactions over contingent information remains unclear. In this study, we develop a general framework to address these problems and demonstrate the applicability of model-based approaches for representing spatio-temporal dimensions and causality in complex event recognition. In this paper, we propose an event behavior model for complex event recognition from a process perspective. The developed model can detect and explain anomalies associated with complex events. An experiment was conducted to evaluate the model's performance. 
The results revealed that temporal operations within overlapping events were crucial to event pattern recognition.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140839623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
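The role of temporal operations over overlapping events can be illustrated with a minimal interval-based sketch. The event representation, predicate names, and the composite rule below are hypothetical illustrations, not the paper's event behavior model:

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    start: float
    end: float

def overlaps(a: Event, b: Event) -> bool:
    # Two events overlap if each starts before the other ends.
    return a.start < b.end and b.start < a.end

def during(a: Event, b: Event) -> bool:
    # Event a happens entirely within event b.
    return b.start <= a.start and a.end <= b.end

def followed_by(a: Event, b: Event, max_gap: float) -> bool:
    # b starts after a ends, within a bounded gap (a sequencing rule).
    return 0.0 <= b.start - a.end <= max_gap

# A composite pattern over overlapping intervals: flag an alarm as
# anomalous when it falls entirely inside a maintenance window.
login = Event("login", 0.0, 2.0)
alarm = Event("alarm", 1.5, 4.0)
maintenance = Event("maintenance", 1.0, 5.0)
anomaly = overlaps(alarm, maintenance) and during(alarm, maintenance)
```

Recognition rules in such systems compose these interval predicates; the finding above suggests the overlap-sensitive ones matter most.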
Pub Date : 2024-04-29DOI: 10.1007/s10044-024-01270-3
Isabel Jiménez-Velasco, Jorge Zafra-Palma, Rafael Muñoz-Salinas, Manuel J. Marín-Jiménez
Human interaction recognition (HIR) is a significant challenge in computer vision that focuses on identifying human interactions in images and videos. HIR is highly complex due to factors such as pose diversity, varying scene conditions, and the presence of multiple individuals. Recent research has explored different approaches to address it, with an increasing emphasis on human pose estimation. In this work, we propose Proxemics-Net++, an extension of the Proxemics-Net model, capable of addressing the problem of recognizing human interactions in images through two different tasks: the identification of the types of “touch codes” or proxemics and the identification of the type of social relationship between pairs. To achieve this, we use RGB and body pose information together with the state-of-the-art deep learning architecture ConvNeXt as the backbone. We performed an ablative analysis to understand how the combination of RGB and body pose information affects these two tasks. Experimental results show that body pose information contributes significantly to proxemic recognition (the first task), as it improves on the existing state of the art, while its contribution to the classification of social relations (the second task) is limited due to labelling ambiguity in this problem, making RGB information more influential in that task.
{"title":"Proxemics-net++: classification of human interactions in still images","authors":"Isabel Jiménez-Velasco, Jorge Zafra-Palma, Rafael Muñoz-Salinas, Manuel J. Marín-Jiménez","doi":"10.1007/s10044-024-01270-3","DOIUrl":"https://doi.org/10.1007/s10044-024-01270-3","url":null,"abstract":"<p>Human interaction recognition (HIR) is a significant challenge in computer vision that focuses on identifying human interactions in images and videos. HIR is highly complex due to factors such as pose diversity, varying scene conditions, and the presence of multiple individuals. Recent research has explored different approaches to address it, with an increasing emphasis on human pose estimation. In this work, we propose Proxemics-Net++, an extension of the Proxemics-Net model, capable of addressing the problem of recognizing human interactions in images through two different tasks: the identification of the types of “touch codes” or proxemics and the identification of the type of social relationship between pairs. To achieve this, we use RGB and body pose information together with the state-of-the-art deep learning architecture ConvNeXt as the backbone. We performed an ablative analysis to understand how the combination of RGB and body pose information affects these two tasks. 
Experimental results show that body pose information contributes significantly to proxemic recognition (the first task), as it improves on the existing state of the art, while its contribution to the classification of social relations (the second task) is limited due to labelling ambiguity in this problem, making RGB information more influential in that task.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140811204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
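The two-branch idea — an RGB branch and a body-pose branch whose embeddings are fused before classification — can be sketched in miniature as follows. The embedding sizes, the `backbone` stand-in, and the six-class head are hypothetical placeholders, not the actual Proxemics-Net++ / ConvNeXt dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(x, w):
    # Stand-in for a ConvNeXt branch: any map from an input to an embedding.
    return np.tanh(x @ w)

# Hypothetical input/embedding sizes (e.g. 17 keypoints x 2 coords = 34).
rgb_input = rng.standard_normal(512)
pose_input = rng.standard_normal(34)
w_rgb = rng.standard_normal((512, 128))
w_pose = rng.standard_normal((34, 128))

z_rgb = backbone(rgb_input, w_rgb)    # RGB branch embedding
z_pose = backbone(pose_input, w_pose)  # body-pose branch embedding

# Late fusion: concatenate the branch embeddings, then a linear head
# over the proxemics ("touch code") classes.
z = np.concatenate([z_rgb, z_pose])    # (256,)
w_cls = rng.standard_normal((256, 6))  # e.g. 6 hypothetical touch-code classes
logits = z @ w_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # class probabilities
```

The ablation in the paper amounts to training with one branch zeroed out or removed and comparing the resulting accuracy per task.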
Pub Date : 2024-04-29DOI: 10.1007/s10044-024-01276-x
Cui Li, Jiao Wang
Target detection, as a core problem in computer vision, is widely applied in many key areas such as face recognition, license plate recognition, security protection, and autonomous driving. Although detection speed and accuracy continue to break records, many challenges and difficulties remain in target detection for remote sensing images, which require further in-depth research and exploration. Remote sensing images can be regarded as a "three-dimensional data cube", with more complex background information, dense and small targets, and more severe weather interference; these factors lead to large positioning errors and low detection accuracy when detecting targets in remote sensing images. An improved YOLOv7 object detection model is proposed to address the high false negative rate for dense and small objects in remote sensing images. Firstly, the GAM attention mechanism is introduced: a global scheduling mechanism that improves the performance of deep neural networks by reducing information loss and amplifying global interaction representations, thus enhancing the network's sensitivity to targets. Secondly, the CIoU loss function in the original YOLOv7 network model is replaced by SIoU, aiming to optimize the loss function, reduce losses, and improve the generalization of the network. Finally, the model is tested on the publicly available RSOD remote sensing dataset, and its generalization is verified on the Okahublot FloW-Img sub-dataset. The results show that detection accuracy (mAP@0.5) improves by 1.7 and 1.5 percentage points, respectively, for the improved YOLOv7 model compared with the original, which effectively improves the accuracy of detecting small targets in remote sensing images and addresses the problem of missed detections of small targets.
{"title":"Remote sensing image location based on improved Yolov7 target detection","authors":"Cui Li, Jiao Wang","doi":"10.1007/s10044-024-01276-x","DOIUrl":"https://doi.org/10.1007/s10044-024-01276-x","url":null,"abstract":"<p>Target detection, as a core problem in computer vision, is widely applied in many key areas such as face recognition, license plate recognition, security protection, and autonomous driving. Although detection speed and accuracy continue to break records, many challenges and difficulties remain in target detection for remote sensing images, which require further in-depth research and exploration. Remote sensing images can be regarded as a \"three-dimensional data cube\", with more complex background information, dense and small targets, and more severe weather interference; these factors lead to large positioning errors and low detection accuracy when detecting targets in remote sensing images. An improved YOLOv7 object detection model is proposed to address the high false negative rate for dense and small objects in remote sensing images. Firstly, the GAM attention mechanism is introduced: a global scheduling mechanism that improves the performance of deep neural networks by reducing information loss and amplifying global interaction representations, thus enhancing the network's sensitivity to targets. Secondly, the CIoU loss function in the original YOLOv7 network model is replaced by SIoU, aiming to optimize the loss function, reduce losses, and improve the generalization of the network. Finally, the model is tested on the publicly available RSOD remote sensing dataset, and its generalization is verified on the Okahublot FloW-Img sub-dataset. 
The results show that detection accuracy (mAP@0.5) improves by 1.7 and 1.5 percentage points, respectively, for the improved YOLOv7 model compared with the original, which effectively improves the accuracy of detecting small targets in remote sensing images and addresses the problem of missed detections of small targets.</p>","PeriodicalId":54639,"journal":{"name":"Pattern Analysis and Applications","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140839319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
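Both CIoU and SIoU extend the plain intersection-over-union term with extra penalty costs (SIoU adds angle, distance, and shape terms, which are omitted in this sketch). The base quantity, which also underlies the mAP@0.5 metric quoted above, can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# mAP@0.5 counts a detection as a true positive when its IoU with a
# ground-truth box is at least 0.5; missed small boxes lower recall,
# which is the failure mode the improved model targets.
```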