Hongjian Gu, Wenxuan Zou, Keyang Cheng, Bin Wu, Humaira Abdul Ghafoor, Yongzhao Zhan
Person re-identification aims to retrieve specific target pedestrians across non-overlapping cameras. However, in real, complex scenes pedestrians are easily occluded, which makes the target pedestrian search task time-consuming and challenging. To address pedestrians' susceptibility to occlusion, a person re-identification method based on a deep compound eye network (CEN) and a pose repair module is proposed, which includes: (1) a deep CEN based on multi-camera logical topology, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally performs global pedestrian matching through a Siamese network; (2) an integrated spatial-temporal information aggregation network designed to facilitate pose repair, in which target pedestrian features from the multi-level logical topology cameras are used as auxiliary information to repair occluded target pedestrian images, reducing mismatches caused by pose changes; and (3) a joint optimisation mechanism for the CEN and the pose repair network, where multi-camera logical topology inference provides auxiliary information and a retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the method performs strongly across these datasets; on CUHK-SYSU, the model achieved a top-1 accuracy of 89.1% and a mean Average Precision (mAP) of 83.1% in the recognition of occluded individuals.
Person re-identification via deep compound eye network and pose repair module. IET Computer Vision, 18(6), 826–841 (published 2024-04-04). DOI: 10.1049/cvi2.12282. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12282
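The abstract above describes the compound eye network only at a high level. Purely as an illustration of how graph convolution over a camera-topology graph can be combined with a Gated Recurrent Unit and a Siamese matching head, a minimal PyTorch-style sketch follows; the module names, dimensions, mean-pooling over cameras, and cosine-similarity matching are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """One graph-convolution layer over a camera-topology graph.
    adj: (N, N) normalised adjacency; x: (N, D) per-camera features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        return F.relu(self.lin(adj @ x))

class TopologyGRUEncoder(nn.Module):
    """Aggregates per-camera pedestrian features over time: graph convolution
    captures the spatial (topology) structure, a GRU captures the temporal
    order of observations."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.gc = GraphConv(feat_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, feats, adj):
        # feats: (T, N, D) features from N cameras over T time steps
        spatial = torch.stack([self.gc(f, adj) for f in feats])  # (T, N, H)
        seq = spatial.mean(dim=1).unsqueeze(0)                   # (1, T, H)
        _, h = self.gru(seq)                                     # (1, 1, H)
        return h.squeeze(0).squeeze(0)                           # (H,)

def siamese_match(query, gallery, encoder, adj):
    """Cosine similarity between a query track and gallery tracks, both
    embedded by the same (shared-weight) encoder."""
    q = encoder(query, adj)
    sims = [F.cosine_similarity(q, encoder(g, adj), dim=0) for g in gallery]
    return torch.stack(sims)
```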
Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of a video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a local three-scale encoding (LTSE) structure. First, the authors introduce an LTSE structure with two-level attention cascades, a design tailored to efficiently capture details and contextual information across diverse scales in images. Second, the authors introduce recurrent convolutional layers and residual operations, designing a recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named the separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain the three-scale decoding features from the decoder and warp them to produce a set of three-scale pre-warped maps, which are fused by the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.
Zhe Qu, Weijing Liu, Lizhen Cui, Xiaohui Yang. Video frame interpolation via spatial multi-scale modelling. IET Computer Vision, 18(4), 458–472 (published 2024-04-03). DOI: 10.1049/cvi2.12281. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12281
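As a rough illustration of the kind of lightweight unit the abstract names, the sketch below combines a depthwise-separable convolution with recurrent refinement and a residual skip. The channel count, number of recurrent steps, and activation choices are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableRecurrentResidualUnit(nn.Module):
    """Recurrent residual convolution with a depthwise-separable kernel: the
    same cheap convolution is applied `steps` times, each time on the input
    plus the previous response, and a residual skip is added at the end."""
    def __init__(self, channels, steps=2):
        super().__init__()
        self.steps = steps
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def conv(self, x):
        return self.pointwise(self.depthwise(x))

    def forward(self, x):
        h = F.relu(self.conv(x))
        for _ in range(self.steps - 1):
            h = F.relu(self.conv(x + h))   # recurrent refinement of the response
        return x + h                       # residual connection

# Example: a 64-channel feature map passes through the unit unchanged in shape.
if __name__ == "__main__":
    unit = SeparableRecurrentResidualUnit(64)
    out = unit(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```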
The authors present global-interval and local-continuous feature extraction networks for gait recognition. Unlike conventional gait recognition methods that focus on the full gait cycle, the authors introduce a novel global continuous-dilated temporal feature extraction (TFE) module to extract continuous and interval motion features from the silhouette frames globally. Simultaneously, an inter-frame motion excitation (IME) module is proposed to enhance the unique motion expression of an individual, which remains unchanged regardless of clothing variations. The spatio-temporal features extracted by the TFE and IME modules are then weighted and concatenated by an adaptive aggregator network for recognition. Extensive experiments on the CASIA-B and mini-OUMVLP datasets show that the proposed method achieves performance comparable to other state-of-the-art approaches (98%, 95%, and 84.9% for the normal walking, bag/backpack carrying, and coat/jacket wearing categories on CASIA-B, and 89% on mini-OUMVLP).
Chunsheng Hua, Hao Zhang, Jia Li, Yingjie Pan. Continuous-dilated temporal and inter-frame motion excitation feature learning for gait recognition. IET Computer Vision, 18(6), 788–800 (published 2024-04-01). DOI: 10.1049/cvi2.12278. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12278
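The inter-frame motion excitation idea can be illustrated with a small, hypothetical PyTorch module that pools frame-to-frame feature differences into channel weights; the reduction ratio and pooling scheme below are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class InterFrameMotionExcitation(nn.Module):
    """Channel excitation driven by inter-frame differences: temporal
    differences of silhouette features are pooled into channel weights that
    re-scale the original sequence, emphasising motion cues."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, T, C, H, W) silhouette feature sequence
        diff = x[:, 1:] - x[:, :-1]                 # (B, T-1, C, H, W) motion
        motion = diff.abs().mean(dim=(1, 3, 4))     # (B, C) pooled motion energy
        weights = self.fc(motion).unsqueeze(1).unsqueeze(-1).unsqueeze(-1)
        return x * weights                          # excite motion-sensitive channels
```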
Dong-hwi Kim, Dong-hun Lee, Aro Kim, Jinwoo Jeong, Jong Taek Lee, Sungjei Kim, Sang-hyo Park
The authors propose a compression strategy for a transformer-based 3D human pose estimation model, which yields high accuracy but increases the model size. The approach involves a pruning-guided determination of the search range to achieve lightweight pose estimation under limited training time and to identify the optimal model size. In addition, the authors propose a transformer-based feature distillation (TFD) method, which efficiently exploits the pose estimation model in terms of both model size and accuracy by leveraging characteristics of the transformer architecture. Pruning-guided TFD is the first such approach for 3D human pose estimation that employs a transformer architecture. The proposed approach was tested on extensive datasets, and the results show that it can reduce the model size by 30% compared to the state of the art while maintaining high accuracy.
Pruning-guided feature distillation for an efficient transformer-based pose estimation model. IET Computer Vision, 18(6), 745–758 (published 2024-03-31). DOI: 10.1049/cvi2.12277. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12277
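Feature distillation between a teacher transformer and a pruned student can be sketched as an MSE penalty on intermediate token features, with a projection to bridge differing widths. This is a generic illustration under assumed shapes, not the authors' TFD formulation.

```python
import torch
import torch.nn.functional as F

def transformer_feature_distillation_loss(student_tokens, teacher_tokens, proj=None):
    """Feature-level distillation between transformer blocks: the student's
    token features are (optionally) projected to the teacher's width and
    pulled towards the teacher's features with an MSE penalty.
    student_tokens: (B, N, D_s); teacher_tokens: (B, N, D_t)."""
    if proj is not None:                    # handle a narrower, pruned student
        student_tokens = proj(student_tokens)
    return F.mse_loss(student_tokens, teacher_tokens)

# Minimal usage sketch with made-up dimensions:
if __name__ == "__main__":
    B, N, d_s, d_t = 2, 17, 128, 256
    proj = torch.nn.Linear(d_s, d_t)
    s = torch.randn(B, N, d_s)
    t = torch.randn(B, N, d_t)
    loss = transformer_feature_distillation_loss(s, t, proj)
    loss.backward()
```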
Human–object interaction (HOI) detection, which localises and recognises interactions between humans and objects, requires high-level image and scene understanding. Recent methods for HOI detection typically utilise transformer-based architectures to build unified feature representations. However, these methods use randomly initialised queries to predict interactive human–object pairs, leading to a lack of prior knowledge. Furthermore, most methods provide unified features to forecast interactions using conventional decoder structures, but they lack the ability to build efficient multi-task representations. To address these problems, the authors propose a novel two-stage HOI detector called PGCD, consisting mainly of a prompt guidance query and cascaded constraint decoders. First, the authors propose a novel prompt guidance query generation module (PGQ) to introduce guidance-semantic features; in PGQ, the authors build a visual-semantic transfer to obtain fuller semantic representations. In addition, a cascaded constraint decoder architecture (CD) with random masks is designed to build fine-grained interaction features and improve the model's generalisation performance. Experimental results demonstrate that the proposed approach obtains significant performance on the two widely used benchmarks, HICO-DET and V-COCO.
Sheng Liu, Bingnan Guo, Feng Zhang, Junhao Chen, Ruixiang Chen. Prompt guidance query with cascaded constraint decoders for human–object interaction detection. IET Computer Vision, 18(6), 772–787 (published 2024-03-29). DOI: 10.1049/cvi2.12276. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12276
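A hypothetical sketch of prompt-guided query generation follows: candidate prompts are embedded, mixed into a fixed number of query slots, and offset by a pooled image feature, so that the decoder starts from guidance semantics rather than random initialisation. All layer names and shapes are assumptions, not the PGQ module itself.

```python
import torch
import torch.nn as nn

class PromptGuidanceQuery(nn.Module):
    """Builds decoder queries from semantic prompt embeddings instead of
    random initialisation: each candidate prompt is projected, mixed into a
    fixed set of query slots, and fused with a pooled image feature."""
    def __init__(self, prompt_dim, visual_dim, query_dim, num_queries):
        super().__init__()
        self.prompt_proj = nn.Linear(prompt_dim, query_dim)
        self.visual_proj = nn.Linear(visual_dim, query_dim)
        self.select = nn.Linear(query_dim, num_queries)  # mix prompts into query slots

    def forward(self, prompt_embeds, image_feat):
        # prompt_embeds: (P, prompt_dim) prompt/text features; image_feat: (B, visual_dim)
        p = self.prompt_proj(prompt_embeds)              # (P, Q)
        mix = self.select(p).transpose(0, 1)             # (num_queries, P)
        queries = mix.softmax(dim=-1) @ p                # (num_queries, Q)
        v = self.visual_proj(image_feat).unsqueeze(1)    # (B, 1, Q)
        return queries.unsqueeze(0) + v                  # (B, num_queries, Q)
```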
Jing Wang, Meimei Xu, Huazhu Xue, Zhanqiang Huo, Fen Luo
Although existing object detectors achieve encouraging detection and localisation performance under ideal conditions, their performance in adverse weather such as snow is very poor and insufficient for detection tasks in such conditions. Existing methods do not deal well with the effect of snow on the identity of object features, and they usually ignore or even discard potential information that could help improve detection performance. To this end, the authors propose a novel, improved end-to-end object detection network with joint image restoration. Specifically, to address the degradation of object identity caused by snow, a restoration-detection dual-branch network structure combined with a Multi-Integrated Attention module is proposed, which mitigates the effect of snow on the identity of object features and thus improves the detector's performance. To make more effective use of features that benefit the detection task, a Self-Adaptive Feature Fusion module is introduced; it helps the network learn potential features that benefit detection and, through a dedicated feature fusion, eliminates the effect of heavy or large local snow in the object area, thereby improving the network's detection capability in snowy scenes. In addition, the authors construct a large-scale, multi-size snowy dataset called the Synthetic and Real Snowy Dataset (SRSD), which is a necessary complement to existing snow-related benchmarks. Extensive experiments on a public snowy dataset (Snowy-weather Datasets) and on SRSD indicate that the method outperforms existing state-of-the-art object detectors.
Joint image restoration for object detection in snowy weather. IET Computer Vision, 18(6), 759–771 (published 2024-03-27). DOI: 10.1049/cvi2.12274. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12274
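One plausible form of self-adaptive fusion between the restoration and detection branches is a learned, spatially varying gate, sketched below in PyTorch; the 1x1-convolution gate and the refinement layer are assumptions, not the module described in the paper.

```python
import torch
import torch.nn as nn

class SelfAdaptiveFeatureFusion(nn.Module):
    """Gated fusion of restoration-branch and detection-branch features: a
    learned, spatially varying gate decides how much restored (de-snowed)
    information to inject into the detection feature map."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, det_feat, restore_feat):
        # det_feat, restore_feat: (B, C, H, W) features from the two branches
        g = self.gate(torch.cat([det_feat, restore_feat], dim=1))  # gate in [0, 1]
        fused = g * restore_feat + (1.0 - g) * det_feat
        return self.refine(fused)
```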
Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods ignore the problem that simple semantic tags have difficulty expressing the detailed semantics of different image contents. Therefore, the authors propose a tag-inferring and tag-guided Transformer for image captioning to generate fine-grained captions. First, a tag-inferring encoder is proposed, which uses the tags extracted by a scene graph model to infer tags with deeper semantic information. Then, with the obtained deep tag information, a tag-guided decoder is proposed that includes short-term attention to improve the features of words in the sentence and gated cross-modal attention to combine image features, tag features and language features into informative semantic features. Finally, the word probability distribution at all positions in the sequence is calculated to generate descriptions of the image. The experiments demonstrate that the method can exploit tags to obtain precise captions and achieves competitive performance, with a 40.6% BLEU-4 score and a 135.3% CIDEr score on the MSCOCO dataset.
Yaohua Yi, Yinkai Liang, Dezhu Kong, Ziwei Tang, Jibing Peng. Tag-inferring and tag-guided Transformer for image captioning. IET Computer Vision, 18(6), 801–812 (published 2024-03-22). DOI: 10.1049/cvi2.12280. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12280
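The gated cross-modal attention described above can be illustrated with a hypothetical decoder sub-layer in which word features attend separately to image and tag features and a learned gate mixes the two contexts; the head count and the gating form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    """Language features attend separately to image features and tag
    features; a learned gate then decides, per position, how to mix the two
    attended contexts before they are added back to the word features."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_tag = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, words, image_feats, tag_feats):
        # words: (B, L, D), image_feats: (B, R, D), tag_feats: (B, K, D)
        img_ctx, _ = self.attn_img(words, image_feats, image_feats)
        tag_ctx, _ = self.attn_tag(words, tag_feats, tag_feats)
        g = self.gate(torch.cat([img_ctx, tag_ctx], dim=-1))   # (B, L, D) gate
        return words + g * img_ctx + (1.0 - g) * tag_ctx
```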
Perception systems in autonomous vehicles need to accurately detect and classify objects within their surrounding environments. Numerous types of sensors are deployed on these vehicles, and combining such multimodal data streams can significantly boost performance. The authors introduce a novel sensor fusion framework using deep convolutional neural networks. The framework employs both camera and LiDAR sensors in a multimodal, multiview configuration. The authors leverage both data types by introducing two novel fusion mechanisms: element-wise multiplication and multimodal factorised bilinear pooling. These methods improve the bird's-eye-view moderate average precision score by +4.97% and +8.35% on the KITTI dataset compared to traditional fusion operators such as element-wise addition and feature map concatenation. An in-depth analysis of key design choices impacting performance, such as data augmentation, multi-task learning, and convolutional architecture design, is offered. The study aims to pave the way for the development of more robust multimodal machine vision systems. The authors conclude the paper with qualitative results, discussing both successful and problematic cases, along with potential ways to mitigate the latter.
Yahya Massoud, Robert Laganiere. Learnable fusion mechanisms for multimodal object detection in autonomous vehicles. IET Computer Vision, 18(4), 499–511 (published 2024-03-15). DOI: 10.1049/cvi2.12259. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12259
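Multimodal factorised bilinear pooling is a known fusion operator, and a minimal sketch of its usual form (projection, element-wise product, sum-pooling over factors, power and L2 normalisation) follows; the factor size and feature dimensions are illustrative, and the exact variant used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalFactorisedBilinearPooling(nn.Module):
    """MFB-style fusion: camera and LiDAR feature vectors are projected into
    a factorised space, multiplied element-wise, sum-pooled over the factor
    dimension, then power- and L2-normalised."""
    def __init__(self, cam_dim, lidar_dim, out_dim, factor=5):
        super().__init__()
        self.factor = factor
        self.proj_cam = nn.Linear(cam_dim, out_dim * factor)
        self.proj_lidar = nn.Linear(lidar_dim, out_dim * factor)

    def forward(self, cam_feat, lidar_feat):
        # cam_feat: (B, cam_dim), lidar_feat: (B, lidar_dim)
        joint = self.proj_cam(cam_feat) * self.proj_lidar(lidar_feat)  # element-wise
        joint = joint.view(joint.size(0), -1, self.factor).sum(dim=2)  # sum-pool factors
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)     # power normalisation
        return F.normalize(joint, dim=1)                               # L2 normalisation

# The simpler fusion the paper also evaluates amounts to a plain element-wise
# product of same-sized feature maps: fused = cam_map * lidar_map
```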
Xiaohan Ma, Rize Jin, Jianming Wang, Tae-Sun Chung
Sign Language Production (SLP) refers to the task of translating textual forms of spoken language into corresponding sign language expressions. Sign languages convey meaning by means of multiple asynchronous articulators, including manual and non-manual information channels. Recent deep learning-based SLP models directly generate the full-articulatory sign sequence from the text input in an end-to-end manner. However, these models largely down-weight the importance of subtle differences in manual articulation owing to the effect of regression to the mean. To address these neglected aspects, an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP is proposed that successively learns two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, utilising an attention-based alignment module that fuses the hand and sign features from previous time steps to predict more expressive sign poses at the current time step. In addition, to provide more efficacious guidance, a novel spatio-temporal loss is introduced to penalise shape dissimilarity and temporal distortions of the produced sequences. Experimental studies are performed on two benchmark sign language datasets from distinct cultures to verify the performance of the proposed model. Both quantitative and qualitative results show that the model demonstrates competitive performance compared to state-of-the-art models and, in some cases, achieves considerable improvements over them.
Attentional bias for hands: Cascade dual-decoder transformer for sign language production. IET Computer Vision, 18(5), 696–708 (published 2024-03-08). DOI: 10.1049/cvi2.12273. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12273
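The spatio-temporal loss can be sketched as a shape term on per-frame poses plus a temporal term on first-order differences of the pose sequence; the L1 form and the weighting below are assumptions, not the authors' exact definition.

```python
import torch
import torch.nn.functional as F

def spatio_temporal_loss(pred, target, temporal_weight=1.0):
    """Spatio-temporal penalty of the kind the abstract describes: a shape
    term compares joint positions frame by frame, and a temporal term
    compares first-order differences so that motion dynamics, not just
    static poses, are matched.
    pred, target: (T, J, D) produced and reference pose sequences."""
    shape_term = F.l1_loss(pred, target)
    pred_motion = pred[1:] - pred[:-1]          # per-frame displacement
    target_motion = target[1:] - target[:-1]
    temporal_term = F.l1_loss(pred_motion, target_motion)
    return shape_term + temporal_weight * temporal_term
```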
Nasirul Mumenin, Mohammad Abu Yousuf, Md Asif Nashiry, A. K. M. Azad, Salem A. Alyami, Pietro Lio', Mohammad Ali Moni
Autism Spectrum Disorder (ASD) is a chronic condition characterised by impairments in social interaction and communication. Early detection of ASD is desirable, and there is a demand for diagnostic aids that facilitate it. A lightweight Involutional Neural Network (INN) architecture has been developed to diagnose ASD. The model follows a simpler architectural design and has fewer parameters than state-of-the-art (SOTA) image classification models, requiring lower computational resources. The proposed model is trained to detect ASD from eye-tracking scanpath (SP), heatmap (HM), and fixation map (FM) images. Monte Carlo Dropout has been applied to the model to perform an uncertainty analysis and ensure the effectiveness of the output provided by the proposed INN model. The model has been trained and evaluated using two publicly accessible datasets. The experiments show that the model achieved 98.12%, 96.83%, and 97.61% accuracy on SP, FM, and HM images, respectively, outperforming current SOTA image classification models and other existing work on this topic.
ASDNet: A robust involution-based architecture for diagnosis of autism spectrum disorder utilising eye-tracking technology. IET Computer Vision, 18(5), 666–681 (published 2024-02-12). DOI: 10.1049/cvi2.12271. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12271
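Monte Carlo Dropout, as applied here for uncertainty analysis, has a standard recipe: keep dropout active at inference, run several stochastic forward passes, and report the mean and variance of the softmax outputs. A minimal sketch follows; the number of passes is arbitrary and the classifier is any dropout-equipped model.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 30):
    """Monte Carlo Dropout at inference time: dropout layers are kept active,
    the input is passed through the network several times, and the mean and
    variance of the softmax outputs give a prediction plus an uncertainty
    estimate."""
    model.eval()
    for m in model.modules():                 # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    return probs.mean(dim=0), probs.var(dim=0)   # predictive mean and variance
```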