3D Single Object Tracking (SOT) plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly focused on pure LiDAR-based point cloud data, often neglecting the benefits of integrating image modalities. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused in a bird's-eye-view (BEV) perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, together with the BEV-Fusion module, which enables interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, showcasing the advantages of our multi-modal fusion strategy. This work advances the state of the art in SOT by integrating complementary information from RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.
{"title":"Unlocking the power of multi-modal fusion in 3D object tracking","authors":"Yue Hu","doi":"10.1049/cvi2.12335","DOIUrl":"https://doi.org/10.1049/cvi2.12335","url":null,"abstract":"<p>3D Single Object Tracking plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly focused on using pure LiDAR-based point cloud data, often neglecting the benefits of integrating image modalities. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused in a BEV perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state-of-the-art in SOT by integrating complementary information from RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12335","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging role as the backbone of visual tasks necessitates studying class incremental learning (CIL) within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems such as class imbalance, the selection of data for replay and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free setting. In this paper, the authors study class-incremental tasks based on large pre-trained vision-language models such as CLIP. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both the pre-trained model and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating the issue of catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationship between instances, which allows fine-grained instance-relational information to be gleaned from the prior knowledge acquired during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The authors' experimental results validate that their method, within the framework of VLMs, outperforms traditional CIL methods.
{"title":"Category-instance distillation based on visual-language models for rehearsal-free class incremental learning","authors":"Weilong Jin, Zilei Wang, Yixin Zhang","doi":"10.1049/cvi2.12327","DOIUrl":"https://doi.org/10.1049/cvi2.12327","url":null,"abstract":"<p>Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging trend as the backbone of visual tasks necessitates studying class incremental learning (CIL) issues within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems like class imbalance, the selection of data for replay and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free settings. In this paper, the authors study class-incremental tasks based on the large pre-trained vision-language models like CLIP model. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both pre-trained models and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating the issue of catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationship between instances. This allows us to glean fine-grained instance relational information from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The authors’ experimental results validate that their method, within the framework of VLMs, outperforms traditional CIL methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12327","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camera pose estimation plays a crucial role in computer vision and is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so even a small percentage of outliers will significantly degrade precision. To deal with outliers, this paper proposes a graduated non-convexity (GNC) method to suppress them in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost, which is less affected by outliers. Then, to apply a non-minimal solver to the reformulated problem, the authors use Black-Rangarajan duality theory to transform it. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method is customised according to the truncated least squares cost. The results of simulation and real experiments show that GNCPnP can effectively handle the interference of outliers and achieve higher accuracy than existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP with a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm with no outliers.
{"title":"Outliers rejection for robust camera pose estimation using graduated non-convexity","authors":"Hao Yi, Bo Liu, Bin Zhao, Enhai Liu","doi":"10.1049/cvi2.12330","DOIUrl":"https://doi.org/10.1049/cvi2.12330","url":null,"abstract":"<p>Camera pose estimation plays a crucial role in computer vision, which is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so that even a small percentage of outliers will significantly degrade precision. In order to deal with outliers, this paper proposes using a graduated non-convexity (GNC) method to suppress outliers in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost, which is less affected by outliers. Then, to apply a non-minimum solver to solve the reformulated problem, the authors use the Black-Rangarajan duality theory to transform it. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method was customised according to the truncated least squares cost. The results of simulation and real experiments show that GNCPnP can effectively handle the interference of outliers and achieve higher accuracy compared to existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP in the case of a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm in the case of no outliers.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12330","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In ecology, deep learning is improving the performance of camera-trap image based wild animal analysis. However, the high labelling cost becomes a big challenge, as it requires a huge amount of human annotation. For example, the Snapshot Serengeti (SS) dataset contains over 900,000 images, of which only 322,653 contain valid animals; 68,000 volunteers were recruited to provide image-level labels such as species, the number of animals and five behaviour attributes such as standing, resting and moving. In contrast, the Gold Standard SS Bounding-Box Coordinates (GSBBC) dataset contains only 4011 images for training object detection algorithms, as annotating bounding boxes for animals in images is much more costly. Such a number of training images is clearly insufficient. To address this, the authors propose a method to generate bounding boxes for a larger dataset using limited manually labelled images. To achieve this, the authors first train a wild animal detector using a small dataset (e.g. GSBBC) that is manually labelled to locate animals in images; then apply this detector to a bigger dataset (e.g. SS) for bounding-box generation; finally, false detections are removed according to the existing label information of the images. Experiments show that detectors trained with images whose bounding boxes are generated by the proposed method outperform existing camera-trap image based animal detection in terms of mean average precision (mAP). Compared with the traditional data augmentation method, our method improved the mAP by 21.3% and 44.9% for rare species, also alleviating the long-tail issue in data distribution. In addition, detectors trained with the proposed method also achieve promising results when applied to classification and counting tasks, which are commonly required in wildlife research.
{"title":"Weakly supervised bounding-box generation for camera-trap image based animal detection","authors":"Puxuan Xie, Renwu Gao, Weizeng Lu, Linlin Shen","doi":"10.1049/cvi2.12332","DOIUrl":"https://doi.org/10.1049/cvi2.12332","url":null,"abstract":"<p>In ecology, deep learning is improving the performance of camera-trap image based wild animal analysis. However, high labelling cost becomes a big challenge, as it requires involvement of huge human annotation. For example, the Snapshot Serengeti (SS) dataset contains over 900,000 images, while only 322,653 contains valid animals, 68,000 volunteers were recruited to provide image level labels such as species, the no. of animals and five behaviour attributes such as standing, resting and moving etc. In contrast, the Gold Standard SS Bounding-Box Coordinates (GSBBC for short) contains only 4011 images for training of object detection algorithms, as the annotation of bounding-box for animals in the image, is much more costive. Such a no. of training images, is obviously insufficient. To address this, the authors propose a method to generate bounding-boxes for a larger dataset using limited manually labelled images. To achieve this, the authors first train a wild animal detector using a small dataset (e.g. GSBBC) that is manually labelled to locate animals in images; then apply this detector to a bigger dataset (e.g. SS) for bounding-box generation; finally, we remove false detections according to the existing label information of the images. Experiments show that detector trained with images whose bounding-boxes are generated using the proposal, outperformed the existing camera-trap image based animal detection, in terms of mean average precision (mAP). Compared with the traditional data augmentation method, our method improved the mAP by 21.3% and 44.9% for rare species, also alleviating the long-tail issue in data distribution. In addition, detectors trained with the proposed method also achieve promising results when applied to classification and counting tasks, which are commonly required in wildlife research.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12332","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomaly detection is a method employed to identify data points or patterns that significantly deviate from expected or normal behaviour within a dataset. This approach aims to detect observations regarded as unusual, erroneous, anomalous, rare, or potentially indicative of fraudulent or malicious activity. Open-set recognition, also referred to as open-set identification or open-set classification, is a pattern recognition task that extends traditional classification by addressing the presence of unknown or novel classes during the testing phase. This approach highlights a strong connection between anomaly detection and open-set recognition, as both seek to identify samples originating from unknown classes or distributions. Open-set recognition methods frequently involve modelling both known and unknown classes during training, allowing for the capture of the distribution of known classes while explicitly addressing the space of unknown classes. Techniques in open-set recognition may include outlier detection, density estimation, or configuring decision boundaries to better differentiate between known and unknown classes. This special issue calls for original contributions introducing novel datasets, innovative architectures, and advanced training methods for tasks related to visual anomaly detection and open-set recognition.
{"title":"Guest Editorial: Anomaly detection and open-set recognition applications for computer vision","authors":"Hakan Cevikalp, Robi Polikar, Ömer Nezih Gerek, Songcan Chen, Chuanxing Geng","doi":"10.1049/cvi2.12329","DOIUrl":"https://doi.org/10.1049/cvi2.12329","url":null,"abstract":"<p>Anomaly detection is a method employed to identify data points or patterns that significantly deviate from expected or normal behaviour within a dataset. This approach aims to detect observations regarded as unusual, erroneous, anomalous, rare, or potentially indicative of fraudulent or malicious activity. Open-set recognition, also referred to as open-set identification or open-set classification, is a pattern recognition task that extends traditional classification by addressing the presence of unknown or novel classes during the testing phase. This approach highlights a strong connection between anomaly detection and open-set recognition, as both seek to identify samples originating from unknown classes or distributions. Open-set recognition methods frequently involve modelling both known and unknown classes during training, allowing for the capture of the distribution of known classes while explicitly addressing the space of unknown classes. Techniques in open-set recognition may include outlier detection, density estimation, or configuring decision boundaries to better differentiate between known and unknown classes. This special issue calls for original contributions introducing novel datasets, innovative architectures, and advanced training methods for tasks related to visual anomaly detection and open-set recognition.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1069-1071"},"PeriodicalIF":1.5,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12329","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, abnormal human activity detection has become an important research topic. However, most existing methods focus on detecting abnormal activities of pedestrians in surveillance videos; even those methods using egocentric videos deal with the activities of pedestrians around the camera wearer. In this paper, the authors present an unsupervised auto-encoder-based network, trained by one-class learning, that takes RGB image sequences recorded by egocentric cameras as input to detect abnormal activities of the camera wearers themselves. To improve the performance of the network, the authors introduce a 're-encoding' architecture and a regularisation loss term that minimises the KL divergence between the distributions of features extracted by the first and second encoders. Unlike the common use of a KL divergence loss to push a feature distribution towards an already-known distribution, the aim here is to encourage the features extracted by the second encoder to follow a distribution close to that of the features extracted by the first encoder. The authors evaluate the proposed method on the Epic-Kitchens-55 dataset and conduct an ablation study to analyse the functions of different components. Experimental results demonstrate that the method outperforms the comparison methods in all cases and confirm the effectiveness of the proposed re-encoding architecture and the regularisation term.
{"title":"Autoencoder-based unsupervised one-class learning for abnormal activity detection in egocentric videos","authors":"Haowen Hu, Ryo Hachiuma, Hideo Saito","doi":"10.1049/cvi2.12333","DOIUrl":"https://doi.org/10.1049/cvi2.12333","url":null,"abstract":"<p>In recent years, abnormal human activity detection has become an important research topic. However, most existing methods focus on detecting abnormal activities of pedestrians in surveillance videos; even those methods using egocentric videos deal with the activities of pedestrians around the camera wearer. In this paper, the authors present an unsupervised auto-encoder-based network trained by one-class learning that inputs RGB image sequences recorded by egocentric cameras to detect abnormal activities of the camera wearers themselves. To improve the performance of network, the authors introduce a ‘re-encoding’ architecture and a regularisation loss function term, minimising the KL divergence between the distributions of features extracted by the first and second encoders. Unlike the common use of KL divergence loss to obtain a feature distribution close to an already-known distribution, the aim is to encourage the features extracted by the second encoder to have a close distribution to those extracted from the first encoder. The authors evaluate the proposed method on the Epic-Kitchens-55 dataset and conduct an ablation study to analyse the functions of different components. Experimental results demonstrate that the method outperforms the comparison methods in all cases and demonstrate the effectiveness of the proposed re-encoding architecture and the regularisation term.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12333","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The utilisation of domain adaptation methods facilitates the resolution of classification challenges in an unlabelled target domain by capitalising on the labelled information from source domains. Unfortunately, previous domain adaptation methods have focused mostly on global domain adaptation and have not taken class-specific data into account, which leads to poor knowledge transfer performance. The study of class-level domain adaptation, which aims to precisely match the distributions of different domains, has recently garnered attention. However, existing investigations into class-level alignment frequently align domain features either directly on or in close proximity to classification boundaries, creating uncertain samples that could impair classification accuracy. To address this problem, we propose a new approach called metric-guided class-level alignment (MCA). Specifically, we employ different metrics to enable the network to acquire supplementary information, thereby enhancing class-level alignment. Moreover, MCA can be effectively combined with existing domain-level alignment methods to successfully mitigate the challenges posed by domain shift. Extensive testing on commonly used public datasets shows that our method outperforms many other cutting-edge domain adaptation methods, with significant gains over baseline performance.
{"title":"Metric-guided class-level alignment for domain adaptation","authors":"Xiaoshun Wang, Yunhan Li","doi":"10.1049/cvi2.12322","DOIUrl":"https://doi.org/10.1049/cvi2.12322","url":null,"abstract":"<p>The utilisation of domain adaptation methods facilitates the resolution of classification challenges in an unlabelled target domain by capitalising on the labelled information from source domains. Unfortunately, previous domain adaptation methods have focused mostly on global domain adaptation and have not taken into account class-specific data, which leads to poor knowledge transfer performance. The study of class-level domain adaptation, which aims to precisely match the distributions of different domains, has garnered attention in recent times. However, existing investigations into class-level alignment frequently align domain features either directly on or in close proximity to classification boundaries, resulting in the creation of uncertain samples that could potentially impair classification accuracy. To address the aforementioned problem, we propose a new approach called metric-guided class-level alignment (MCA) as a solution to this problem. Specifically, we employ different metrics to enable the network to acquire supplementary information, thereby enhancing class-level alignment. Moreover, MCA can be effectively combined with existing domain-level alignment methods to successfully mitigate the challenges posed by domain shift. Extensive testing on commonly-used public datasets shows that our method outperforms many other cutting-edge domain adaptation methods, showing significant gains over baseline performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12322","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143423759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shujie Chen, Zhonglin Liu, Jianfeng Dong, Xun Wang, Di Zhou
Achieving high performance in multi-object tracking algorithms heavily relies on modelling spatial-temporal relationships during the data association stage. Mainstream approaches encompass rule-based and deep learning-based methods for spatial-temporal relationship modelling. While the former relies on physical motion laws, offering wider applicability but yielding suboptimal results for complex object movements, the latter, though achieving high performance, lacks interpretability and involves complex module designs. This work aims to simplify deep learning-based spatial-temporal relationship models and introduce interpretability into the features used for data association. Specifically, a lightweight single-layer transformer encoder is utilised to model spatial-temporal relationships. To make the features more interpretable, two contrastive regularisation losses based on representation alignment are proposed, derived from spatial-temporal consistency rules. By applying weighted summation to affinity matrices, the aligned features can seamlessly integrate into the data association stage of the original tracking workflow. Experimental results show that our model enhances the performance of most existing tracking networks without excessive complexity, with a minimal increase in training overhead and nearly negligible computational and storage costs.
{"title":"Representation alignment contrastive regularisation for multi-object tracking","authors":"Shujie Chen, Zhonglin Liu, Jianfeng Dong, Xun Wang, Di Zhou","doi":"10.1049/cvi2.12331","DOIUrl":"https://doi.org/10.1049/cvi2.12331","url":null,"abstract":"<p>Achieving high-performance in multi-object tracking algorithms heavily relies on modelling spatial-temporal relationships during the data association stage. Mainstream approaches encompass rule-based and deep learning-based methods for spatial-temporal relationship modelling. While the former relies on physical motion laws, offering wider applicability but yielding suboptimal results for complex object movements, the latter, though achieving high-performance, lacks interpretability and involves complex module designs. This work aims to simplify deep learning-based spatial-temporal relationship models and introduce interpretability into features for data association. Specifically, a lightweight single-layer transformer encoder is utilised to model spatial-temporal relationships. To make features more interpretative, two contrastive regularisation losses based on representation alignment are proposed, derived from spatial-temporal consistency rules. By applying weighted summation to affinity matrices, the aligned features can seamlessly integrate into the data association stage of the original tracking workflow. Experimental results showcase that our model enhances the majority of existing tracking networks' performance without excessive complexity, with minimal increase in training overhead and nearly negligible computational and storage costs.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12331","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiangyan Dai, Huihui Zhang, Jin Gao, Chunlei Chen, Yugen Yi
The accurate detection of moving objects is essential in various applications of artificial intelligence, particularly in the field of intelligent surveillance systems. However, moving cast shadows significantly decrease the precision of moving object detection because they share motion characteristics similar to those of the moving objects. To address this issue, the authors propose an innovative approach to detect moving cast shadows by combining a hybrid feature with a broad learning system (BLS). The approach involves extracting low-level features from the input and background images based on colour constancy and texture consistency principles, which are shown to be highly effective in moving cast shadow detection. The authors then utilise the BLS to create a hybrid feature, with the BLS taking the extracted low-level features as input instead of the original data. BLS is an innovative form of deep learning that maps the input to feature nodes and further enhances them with enhancement nodes, resulting in more compact features for classification. Finally, the authors develop an efficient and straightforward post-processing technique to improve the accuracy of moving object detection. To evaluate the effectiveness and generalisation ability, the authors conduct extensive experiments on the public ATON-CVRR and CDnet datasets, verifying the superior performance of the method by comparison with representative approaches.
{"title":"Hybrid feature-based moving cast shadow detection","authors":"Jiangyan Dai, Huihui Zhang, Jin Gao, Chunlei Chen, Yugen Yi","doi":"10.1049/cvi2.12328","DOIUrl":"https://doi.org/10.1049/cvi2.12328","url":null,"abstract":"<p>The accurate detection of moving objects is essential in various applications of artificial intelligence, particularly in the field of intelligent surveillance systems. However, the moving cast shadow detection significantly decreases the precision of moving object detection because they share similar motion characteristics. To address the issue, the authors propose an innovative approach to detect moving cast shadows by combining the hybrid feature with a broad learning system (BLS). The approach involves extracting low-level features from the input and background images based on colour constancy and texture consistency principles that are shown to be highly effective in moving cast shadow detection. The authors then utilise the BLS to create a hybrid feature and BLS uses the extracted low-level features as input instead of the original data. BLS is an innovative form of deep learning that can map input to feature nodes and further enhance them by enhancement nodes, resulting in more compact features for classification. Finally, the authors develop an efficient and straightforward post-processing technique to improve the accuracy of moving object detection. To evaluate the effectiveness and generalisation ability, the authors conduct extensive experiments on public ATON-CVRR and CDnet datasets to verify the superior performance of our method by comparing with representative approaches.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12328","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143423582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jie Li, Wenxuan Yang, Chuanlun Zhang, Heng Li, Xinjia Li, Lin Wang, Yanling Wang, Xiaoyan Wang
Light field (LF) depth estimation is a key task with numerous practical applications. However, achieving high-precision depth estimation in challenging scenarios, such as occlusions and detailed regions (e.g. fine structures and edges), remains a significant challenge. To address this problem, the authors propose a LF depth estimation network based on multi-region selection and guided optimisation. Firstly, the authors construct a multi-region disparity selection module based on angular patches, which selects specific regions for generating angular patches and obtains representative sub-angular patches by balancing different regions. Secondly, different from traditional guided deformable convolution, the guided optimisation leverages colour prior information to learn the aggregation of sampling points, which enhances the deformable convolution by learning deformation parameters and fitting irregular windows. Finally, to achieve high-precision LF depth estimation, the authors develop a network architecture based on the proposed multi-region disparity selection and guided optimisation modules. Experiments demonstrate the effectiveness of the network on the HCInew dataset, especially in handling occlusions and detailed regions.
{"title":"High precision light field image depth estimation via multi-region attention enhanced network","authors":"Jie Li, Wenxuan Yang, Chuanlun Zhang, Heng Li, Xinjia Li, Lin Wang, Yanling Wang, Xiaoyan Wang","doi":"10.1049/cvi2.12326","DOIUrl":"https://doi.org/10.1049/cvi2.12326","url":null,"abstract":"<p>Light field (LF) depth estimation is a key task with numerous practical applications. However, achieving high-precision depth estimation in challenging scenarios, such as occlusions and detailed regions (e.g. fine structures and edges), remains a significant challenge. To address this problem, the authors propose a LF depth estimation network based on multi-region selection and guided optimisation. Firstly, we construct a multi-region disparity selection module based on angular patch, which selects specific regions for generating angular patch, achieving representative sub-angular patch by balancing different regions. Secondly, different from traditional guided deformable convolution, the guided optimisation leverages colour prior information to learn the aggregation of sampling points, which enhances the deformable convolution ability by learning deformation parameters and fitting irregular windows. Finally, to achieve high-precision LF depth estimation, the authors have developed a network architecture based on the proposed multi-region disparity selection and guided optimisation module. Experiments demonstrate the effectiveness of network on the HCInew dataset, especially in handling occlusions and detailed regions.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1390-1406"},"PeriodicalIF":1.5,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12326","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}