Skeleton-based action recognition has received considerable attention and achieved remarkable progress in the field of human action recognition. For action prediction over time series of different scales, existing methods mainly rely on attention mechanisms to enhance modelling capability in the spatial dimension. However, this approach depends strongly on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors design a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales while avoiding the neglect of cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn topological graph features for time series of different lengths. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU RGB+D 60 and 120, as well as UAV-Human, demonstrate that TRMGCN achieves advanced performance. Furthermore, experiments on the smaller dataset NW-UCLA indicate that the authors' model possesses strong generalisation ability.
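As an illustration only, the following minimal PyTorch sketch shows a generic multi-scale temporal convolution with a channel gate, the broad pattern behind fusing temporal information at several scales in skeleton data; the class name, kernel sizes and gating design are assumptions, not the authors' TCFG module.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class MultiScaleTemporalFusion(nn.Module):
        """Fuses temporal features at several kernel scales, then re-weights channels."""
        def __init__(self, channels, scales=(3, 5, 7)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(channels, channels, kernel_size=(k, 1),
                          padding=(k // 2, 0), groups=channels)
                for k in scales
            ])
            # Squeeze-and-excitation style gate re-weighting channels of the fused map.
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
            )

        def forward(self, x):            # x: (N, C, T, V) = batch, channels, frames, joints
            fused = sum(b(x) for b in self.branches) / len(self.branches)
            return fused * self.gate(fused)

    x = torch.randn(2, 64, 30, 25)       # 25 skeleton joints, 30 frames
    print(MultiScaleTemporalFusion(64)(x).shape)   # torch.Size([2, 64, 30, 25])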
Instance segmentation remains challenging when target objects are overlapping, dense and numerous, making it difficult to distinguish individual instances correctly. To address this, the authors recast the instance segmentation problem as an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm, CotuNet. Firstly, the algorithm combines convolutional neural networks (CNNs), Outlooker and Transformer to design a new hybrid encoder (COT) for richer feature extraction: a CNN extracts low-level image features, which are passed through the Outlooker to obtain more refined local representations, and global contextual information is then generated by aggregating these local representations with the Transformer. Finally, a combination of cascaded upsampling and skip-connection modules is used as the decoder (C-UP), blending high-resolution information at multiple scales to generate accurate masks. Validation on the CVPPP 2017 dataset and comparison with previous state-of-the-art methods show that CotuNet achieves superior competitiveness and segmentation performance.
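For illustration, a minimal sketch of one cascaded-upsampling decoder stage with a skip connection is given below; the block name, channel widths and use of a transposed convolution are assumptions about a generic decoder of this kind, not the authors' C-UP design.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        def __init__(self, in_ch, skip_ch, out_ch):
            super().__init__()
            self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
            self.fuse = nn.Sequential(
                nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            )

        def forward(self, x, skip):
            x = self.up(x)                        # double the spatial resolution
            x = torch.cat([x, skip], dim=1)       # blend higher-resolution encoder features
            return self.fuse(x)

    low = torch.randn(1, 256, 16, 16)             # deep encoder features
    skip = torch.randn(1, 128, 32, 32)            # earlier, higher-resolution features
    print(UpBlock(256, 128, 128)(low, skip).shape)   # torch.Size([1, 128, 32, 32])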
Person re-identification aims to search for specific target pedestrians across non-intersecting cameras. However, in real complex scenes, pedestrians are easily occluded, which makes the target pedestrian search task time-consuming and challenging. To address the susceptibility of pedestrians to occlusion, a person re-identification method via a deep compound eye network (CEN) and a pose repair module is proposed, which includes: (1) a deep CEN based on multi-camera logical topology, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out global pedestrian matching through a Siamese network; (2) an integrated spatial-temporal information aggregation network designed to facilitate pose repair, in which the target pedestrian features under the multi-level logical topology cameras are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; (3) a joint optimisation mechanism of the CEN and the pose repair network, in which multi-camera logical topology inference provides auxiliary information and the retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the authors' method achieves strong performance across these datasets. Specifically, on the CUHK-SYSU dataset, the model achieves a top-1 accuracy of 89.1% and a mean Average Precision (mAP) of 83.1% in the recognition of occluded individuals.
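The sketch below illustrates, under assumed names and feature sizes, the generic pattern of summarising pedestrian features along a camera-topology path with a GRU and then matching by cosine similarity, Siamese-style; it is not the authors' CEN architecture.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopologySequenceEncoder(nn.Module):
        def __init__(self, feat_dim=512, hidden=256):
            super().__init__()
            self.gru = nn.GRU(feat_dim, hidden, batch_first=True)

        def forward(self, cam_feats):              # (N, num_cameras, feat_dim)
            _, h = self.gru(cam_feats)             # last hidden state summarises the walk
            return F.normalize(h.squeeze(0), dim=-1)

    enc = TopologySequenceEncoder()
    query = enc(torch.randn(4, 5, 512))            # 4 queries observed by 5 cameras
    gallery = enc(torch.randn(10, 5, 512))
    scores = query @ gallery.t()                   # cosine similarity for global matching
    print(scores.shape)                            # torch.Size([4, 10])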
Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a Local three-scale encoding (LTSE) structure. In particular, the authors introduce an LTSE structure with two-level attention cascades, a design tailored to efficiently capture details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCLs) and residual operations, designing a recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named the separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain three-scale decoding features from the decoder and warp them into a set of three-scale pre-warped maps, which are fused in the synthesis network to generate high-quality interpolated frames. Experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.
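As a rough illustration of the recurrent-plus-residual idea, the sketch below applies the same convolution for a few steps, re-injecting the input each step, then adds a residual connection; the step count, class name and channel sizes are assumptions, not the authors' exact unit.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class RecurrentResidualConv(nn.Module):
        def __init__(self, channels, steps=2):
            super().__init__()
            self.steps = steps
            self.conv = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            )

        def forward(self, x):
            h = self.conv(x)
            for _ in range(self.steps - 1):
                h = self.conv(x + h)     # recurrent refinement of the same feature map
            return x + h                 # residual connection

    x = torch.randn(1, 32, 64, 64)
    print(RecurrentResidualConv(32)(x).shape)   # torch.Size([1, 32, 64, 64])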
The authors present global-interval and local-continuous feature extraction networks for gait recognition. Unlike conventional gait recognition methods that focus on the full gait cycle, the authors introduce a novel global-continuous-dilated temporal feature extraction (TFE) module to extract continuous and interval motion features globally from the silhouette frames. Simultaneously, an inter-frame motion excitation (IME) module is proposed to enhance the unique motion expression of an individual, which remains unchanged regardless of clothing variations. The spatio-temporal features extracted from the TFE and IME modules are then weighted and concatenated by an adaptive aggregator network for recognition. In experiments on the CASIA-B and mini-OUMVLP datasets, the proposed method shows performance comparable to other state-of-the-art approaches (98%, 95%, and 84.9% in the normal walking, bag- or backpack-carrying, and coat- or jacket-wearing categories of CASIA-B, and 89% on mini-OUMVLP).
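The sketch below shows a common motion-excitation pattern in which frame-to-frame differences drive a sigmoid gate over the original features; it is only indicative of the general idea, and the class name, gating form and tensor layout are assumptions rather than the authors' IME module.

    # Minimal sketch (assumed names and layout), not the authors' implementation.
    import torch
    import torch.nn as nn

    class MotionExcitation(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.reduce = nn.Conv2d(channels, channels, 1)

        def forward(self, x):                      # x: (N, C, T, spatial)
            diff = x[:, :, 1:] - x[:, :, :-1]      # frame-to-frame motion
            diff = torch.cat([diff, diff[:, :, -1:]], dim=2)   # keep temporal length
            gate = torch.sigmoid(self.reduce(diff))
            return x + x * gate                    # excite motion-salient responses

    x = torch.randn(2, 64, 30, 16)
    print(MotionExcitation(64)(x).shape)           # torch.Size([2, 64, 30, 16])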
The authors propose a compression strategy for a transformer-based 3D human pose estimation model, which yields high accuracy but at the cost of a large model size. The approach involves a pruning-guided determination of the search range to achieve lightweight pose estimation under limited training time and to identify the optimal model size. In addition, the authors propose a transformer-based feature distillation (TFD) method, which efficiently exploits the pose estimation model in terms of both model size and accuracy by leveraging the characteristics of the transformer architecture. Pruning-guided TFD is the first compression approach for 3D human pose estimation that exploits the transformer architecture. The proposed approach was tested on various large-scale datasets, and the results show that it can reduce the model size by 30% compared with the state of the art while maintaining high accuracy.
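For orientation only, the sketch below shows a generic feature-distillation loss between a large teacher and a smaller (for example, pruned) student, with a linear projection aligning feature widths and an MSE objective; the projection and loss choice are assumptions, not the authors' TFD formulation.

    # Minimal sketch (assumed projection and loss), not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureDistillLoss(nn.Module):
        def __init__(self, student_dim, teacher_dim):
            super().__init__()
            self.proj = nn.Linear(student_dim, teacher_dim)   # align feature widths

        def forward(self, student_feat, teacher_feat):
            # student_feat: (B, tokens, student_dim), teacher_feat: (B, tokens, teacher_dim)
            return F.mse_loss(self.proj(student_feat), teacher_feat.detach())

    loss_fn = FeatureDistillLoss(student_dim=128, teacher_dim=256)
    loss = loss_fn(torch.randn(2, 17, 128), torch.randn(2, 17, 256))
    print(loss.item())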
Human–object interaction (HOI) detection, which localises and recognises interactions between humans and objects, requires high-level image and scene understanding. Recent methods for HOI detection typically utilise transformer-based architectures to build unified feature representations. However, these methods use randomly initialised queries to predict interactive human–object pairs, leading to a lack of prior knowledge. Furthermore, most methods forecast interactions from unified features using conventional decoder structures, and so lack the ability to build efficient multi-task representations. To address these problems, the authors propose a novel two-stage HOI detector called PGCD, mainly consisting of a prompt guidance query and cascaded constraint decoders. Firstly, the authors propose a novel prompt guidance query generation module (PGQ) to introduce guidance-semantic features; in PGQ, the authors build a visual-semantic transfer to obtain fuller semantic representations. In addition, a cascaded constraint decoder architecture (CD) with random masks is designed to build fine-grained interaction features and improve the model's generalisation performance. Experimental results demonstrate that the authors' proposed approach achieves significant performance on the two widely used benchmarks, HICO-DET and V-COCO.
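The sketch below illustrates the general contrast with randomly initialised queries: decoder queries are derived from semantic (prompt) embeddings and refined by a standard transformer decoder over image tokens; the embedding source, layer counts and names are assumptions, not the authors' PGQ or CD modules.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class PromptGuidedDecoder(nn.Module):
        def __init__(self, d_model=256, num_layers=3):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers)
            self.to_query = nn.Linear(d_model, d_model)   # map prompt features to queries

        def forward(self, prompt_emb, image_tokens):
            # prompt_emb: (B, num_queries, d_model) semantic prior
            # image_tokens: (B, HW, d_model) encoded image features
            queries = self.to_query(prompt_emb)
            return self.decoder(queries, image_tokens)

    dec = PromptGuidedDecoder()
    out = dec(torch.randn(1, 64, 256), torch.randn(1, 400, 256))
    print(out.shape)                                      # torch.Size([1, 64, 256])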
Although existing object detectors achieve encouraging detection and localisation performance under ideal conditions, their performance in adverse weather, such as snow, is far poorer and insufficient for detection tasks under such conditions. Existing methods neither handle well the effect of snow on the identity of object features nor exploit, and often even discard, potential information that could help improve detection performance. To this end, the authors propose a novel, improved end-to-end object detection network coupled with image restoration. Specifically, to address the degradation of object identity caused by snow, an ingenious restoration-detection dual-branch network structure combined with a Multi-Integrated Attention module is proposed, which mitigates the effect of snow on the identity of object features and thus improves the detector's performance. To make more effective use of features that benefit the detection task, a Self-Adaptive Feature Fusion module is introduced; it helps the network better learn potential features beneficial to detection and, through a dedicated feature fusion, eliminates the effect of heavy or large local snow in the object area, thereby improving detection capability in snowy conditions. In addition, the authors construct a large-scale, multi-size snowy dataset called the Synthetic and Real Snowy Dataset (SRSD), which is a necessary complement to existing snow-related tasks. Extensive experiments on a public snowy dataset (Snowy-weather Datasets) and SRSD indicate that the authors' method outperforms existing state-of-the-art object detectors.
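As a generic illustration of fusing features from a restoration branch and a detection branch, the sketch below uses a learned per-pixel gate to weight the two contributions; the gating form, names and sizes are assumptions, not the authors' Self-Adaptive Feature Fusion module.

    # Minimal sketch (assumed gating form), not the authors' implementation.
    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid()
            )

        def forward(self, det_feat, restore_feat):
            g = self.gate(torch.cat([det_feat, restore_feat], dim=1))
            return g * det_feat + (1 - g) * restore_feat   # per-pixel weighted blend

    f1, f2 = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
    print(AdaptiveFusion(64)(f1, f2).shape)                # torch.Size([1, 64, 40, 40])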
Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods ignore the problem that simple semantic tags struggle to express the detailed semantics of different image contents. Therefore, the authors propose a tag-inferring and tag-guided Transformer for image captioning to generate fine-grained captions. First, a tag-inferring encoder is proposed, which uses the tags extracted by a scene graph model to infer tags with deeper semantic information. Then, with the obtained deep tag information, a tag-guided decoder is proposed that includes short-term attention, to improve the features of the words in the sentence, and gated cross-modal attention, to combine image features, tag features and language features into informative semantic features. Finally, the word probability distribution at every position in the sequence is calculated to generate descriptions of the image. Experiments demonstrate that the authors' method can combine tags to obtain precise captions, achieving competitive performance with a 40.6% BLEU-4 score and a 135.3% CIDEr score on the MSCOCO dataset.
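The sketch below shows one common way to realise gated cross-modal attention: word features attend separately to image features and tag features, and a learned gate mixes the two attended contexts; head counts, dimensions and the gating form are assumptions, not necessarily the authors' design.

    # Minimal sketch (assumed names and sizes), not the authors' implementation.
    import torch
    import torch.nn as nn

    class GatedCrossModalAttention(nn.Module):
        def __init__(self, d_model=512, nhead=8):
            super().__init__()
            self.img_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.tag_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

        def forward(self, words, img_feats, tag_feats):
            img_ctx, _ = self.img_attn(words, img_feats, img_feats)
            tag_ctx, _ = self.tag_attn(words, tag_feats, tag_feats)
            g = self.gate(torch.cat([img_ctx, tag_ctx], dim=-1))
            return g * img_ctx + (1 - g) * tag_ctx         # gated mix of the two contexts

    attn = GatedCrossModalAttention()
    out = attn(torch.randn(1, 12, 512), torch.randn(1, 49, 512), torch.randn(1, 5, 512))
    print(out.shape)                                        # torch.Size([1, 12, 512])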