Pub Date: 2024-04-04; DOI: 10.1109/TCDS.2024.3373155
Title: IEEE Transactions on Cognitive and Developmental Systems Information for Authors. IEEE Transactions on Cognitive and Developmental Systems, vol. 16, no. 2, pp. C4-C4.
Fully supervised semantic segmentation requires detailed, pixel-by-pixel annotation, which is time-consuming and laborious. To address this problem, this article performs the semantic segmentation task using only image-level category annotations. Existing methods based on image-level annotation usually use class activation maps (CAMs) as a first step to locate the target objects: by training a classifier, the presence of objects in an image can be detected effectively. However, CAMs exhibit two problems: 1) they focus excessively on specific regions of objects, capturing only the most prominent and discriminative areas; and 2) they easily misinterpret frequently co-occurring background regions, confusing foreground with background. This article introduces cross language image matching based on out-of-distribution data and a convolutional block attention module (CLODA), which adopts a two-branch design within the cross language image matching framework and adds a convolutional block attention module to the attention branch to mitigate the CAMs' excessive focus on parts of objects. Feeding out-of-distribution data into the out-of-distribution branch helps the classification network reduce its misinterpretation of background regions, and cross pseudosupervision between the two branches optimizes the regions of interest learned by the attention branch. Experimental results show that the pseudomasks generated by the proposed network achieve 75.3% mean intersection over union (mIoU) on the pattern analysis, statistical modeling, and computational learning visual object classes (PASCAL VOC) 2012 training set. A segmentation network trained with these pseudomasks reaches up to 72.3% and 72.1% mIoU on the PASCAL VOC 2012 validation and test sets, respectively.
{"title":"Attention Mechanism and Out-of-Distribution Data on Cross Language Image Matching for Weakly Supervised Semantic Segmentation","authors":"Chi-Chia Sun;Jing-Ming Guo;Chen-Hung Chung;Bo-Yu Chen","doi":"10.1109/TCDS.2024.3382914","DOIUrl":"10.1109/TCDS.2024.3382914","url":null,"abstract":"The fully supervised semantic segmentation requires detailed annotation of each pixel, which is time-consuming and laborious at the pixel-by-pixel level. To solve this problem, the direction of this article is to perform the semantic segmentation task by using image-level categorical annotation. Existing methods using image level annotation usually use class activation maps (CAMs) to find the location of the target object as the first step. By training a classifier, the presence of objects in the image can be searched effectively. However, CAMs appear that as follows: 1) objects are excessively focused on specific regions, capturing only the most prominent and critical areas and 2) it is easy to misinterpret the frequently occurring background regions, the foreground and background are confused. This article introduces cross language image matching based on out-of-distribution data and convolutional block attention module (CLODA), the concept of double branching in the cross language image matching framework, and adds a convolutional attention module to the attention branch to solve the problem of excess focus on objects in the CAMs. Importing out-of-distribution data on out of distribution branches helps classification networks improve misinterpretation of areas of focus. Optimizing regions of interest for attentional branch learning using cross pseudosupervision on two branches. Experimental results show that the pseudomasks generated by the proposed network can achieve 75.3% in mean Intersection over Union (mIoU) with the pattern analysis, statistical modeling and computational learning visual object classes (PASCAL VOC) 2012 training set. The performance of the segmentation network trained with the pseudomasks is up to 72.3% and 72.1% in mIoU on the validation and testing set of PASCAL VOC 2012.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 4","pages":"1604-1610"},"PeriodicalIF":5.0,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-02; DOI: 10.1109/TCDS.2024.3383952
Sonal Kumar; Arijit Sur; Rashmi Dutta Baruah
Self-supervised training schemes (STSs) continue to emerge in rapid succession, each taking a step closer to a universal foundation model. In this process, unsupervised downstream tasks are recognized as one way to validate the quality of visual features learned with self-supervised training. However, unsupervised dense semantic segmentation has yet to be explored as a downstream task that can utilize and evaluate the quality of the semantic information embedded in patch-level feature representations during self-supervised training of vision transformers. Therefore, we propose a novel data-driven framework, DatUS, to perform unsupervised dense semantic segmentation (DSS) as a downstream task. DatUS generates semantically consistent pseudosegmentation masks for an unlabeled image dataset without using any visual prior or synchronized data. Experiments show that the proposed framework achieves the highest MIoU (24.90) and average F1 score (36.3) with DINOv2 as the STS, and the highest pixel accuracy (62.18) with DINO, on the training set of the SUIM dataset. It also outperforms state-of-the-art methods for the unsupervised DSS task with 15.02% MIoU, 21.47% pixel accuracy, and 16.06% average F1 score on the validation set of the SUIM dataset, and it achieves a competitive level of accuracy on the large-scale COCO dataset.
{"title":"DatUS: Data-Driven Unsupervised Semantic Segmentation With Pretrained Self-Supervised Vision Transformer","authors":"Sonal Kumar;Arijit Sur;Rashmi Dutta Baruah","doi":"10.1109/TCDS.2024.3383952","DOIUrl":"10.1109/TCDS.2024.3383952","url":null,"abstract":"Successive proposals of several self-supervised training schemes (STSs) continue to emerge, taking one step closer to developing a universal foundation model. In this process, unsupervised downstream tasks are recognized as one of the evaluation methods to validate the quality of visual features learned with self-supervised training. However, unsupervised dense semantic segmentation has yet to be explored as a downstream task, which can utilize and evaluate the quality of semantic information introduced in patch-level feature representations during self-supervised training of vision transformers. Therefore, we propose a novel data-driven framework, DatUS, to perform unsupervised dense semantic segmentation (DSS) as a downstream task. DatUS generates semantically consistent pseudosegmentation masks for an unlabeled image dataset without using visual prior or synchronized data. The experiment shows that the proposed framework achieves the highest MIoU (24.90) and average F1 score (36.3) by choosing DINOv2 and the highest pixel accuracy (62.18) by choosing DINO as the STS on the training set of SUIM dataset. It also outperforms state-of-the-art methods for the unsupervised DSS task with 15.02% MIoU, 21.47% pixel accuracy, and 16.06% average F1 score on the validation set of SUIM dataset. It achieves a competitive level of accuracy for a large-scale COCO dataset.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 5","pages":"1775-1788"},"PeriodicalIF":5.0,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-02; DOI: 10.1109/TCDS.2024.3384269
Yuqi Liu; Qichao Zhang; Yinfeng Gao; Dongbin Zhao
Learning an efficient and safe driving strategy in traffic-heavy intersection scenarios and generalizing it to different intersections remains a challenging task for autonomous driving. This is because road structures differ across intersections, and autonomous vehicles need to generalize the strategies they have learned in the training environments. This requires the autonomous vehicle to effectively capture not only the interactions between agents but also the relationships between agents and the map. To address this challenge, we present a technique that integrates the information of high-definition (HD) maps and traffic participants into vector representations, called lane graph vectorization (LGV). To construct a driving policy for intersection navigation, we incorporate LGV into the twin-delayed deep deterministic policy gradient (TD3) algorithm with prioritized experience replay (PER). To train and validate the proposed algorithm, we construct a gym environment for intersection navigation within the high-fidelity CARLA simulator, integrating dense interactive traffic flow and various generalization test intersection scenarios. Experimental results demonstrate the effectiveness of LGV for intersection navigation tasks, outperforming the state of the art in our proposed scenarios.
{"title":"Deep-Reinforcement-Learning-Based Driving Policy at Intersections Utilizing Lane Graph Networks","authors":"Yuqi Liu;Qichao Zhang;Yinfeng Gao;Dongbin Zhao","doi":"10.1109/TCDS.2024.3384269","DOIUrl":"10.1109/TCDS.2024.3384269","url":null,"abstract":"Learning an efficient and safe driving strategy in a traffic-heavy intersection scenario and generalizing it to different intersections remains a challenging task for autonomous driving. This is because there are differences in the structure of roads at different intersections, and autonomous vehicles need to generalize the strategies they have learned in the training environments. This requires the autonomous vehicle to capture not only the interactions between agents but also the relationships between agents and the map effectively. To address this challenge, we present a technique that integrates the information of high-definition (HD) maps and traffic participants into vector representations, called lane graph vectorization (LGV). In order to construct a driving policy for intersection navigation, we incorporate LGV into the twin-delayed deep deterministic policy gradient (TD3) algorithm with prioritized experience replay (PER). To train and validate the proposed algorithm, we construct a gym environment for intersection navigation within the high-fidelity CARLA simulator, integrating dense interactive traffic flow and various generalization test intersection scenarios. Experimental results demonstrate the effectiveness of LGV for intersection navigation tasks and outperform the state-of-the-art in our proposed scenarios.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 5","pages":"1759-1774"},"PeriodicalIF":5.0,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140594174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-01; DOI: 10.1109/TCDS.2024.3383428
Yangfan Hu; Qian Zheng; Gang Pan
To address the energy bottleneck in deep neural networks (DNNs), the research community has developed binary neural networks (BNNs) and spiking neural networks (SNNs) from different perspectives. To combine the advantages of BNNs and SNNs for better energy efficiency, this article proposes BitSNNs, which leverage binary weights, single-step inference, and activation sparsity. During the development of BitSNNs, we observed performance degradation in deep ResNets caused by gradient approximation error. To mitigate this issue, we delve into the learning process and propose applying a hardtanh function before activation binarization. Additionally, this article investigates the critical role of activation sparsity in the energy efficiency of BitSNNs, a topic often overlooked in the existing literature. Our study reveals strategies for striking a balance between accuracy and energy consumption during the training/testing stages, potentially benefiting applications in edge computing. Notably, our proposed method achieves state-of-the-art performance while significantly reducing energy consumption.
{"title":"BitSNNs: Revisiting Energy-Efficient Spiking Neural Networks","authors":"Yangfan Hu;Qian Zheng;Gang Pan","doi":"10.1109/TCDS.2024.3383428","DOIUrl":"10.1109/TCDS.2024.3383428","url":null,"abstract":"To address the energy bottleneck in deep neural networks (DNNs), the research community has developed binary neural networks (BNNs) and spiking neural networks (SNNs) from different perspectives. To combine the advantages of both BNNs and SNNs for better energy efficiency, this article proposes BitSNNs, which leverage binary weights, single-step inference, and activation sparsity. During the development of BitSNNs, we observed performance degradation in deep ResNets due to the gradient approximation error. To mitigate this issue, we delve into the learning process and propose the utilization of a hardtanh function before activation binarization. Additionally, this article investigates the critical role of activation sparsity in BitSNNs for energy efficiency, a topic often overlooked in the existing literature. Our study reveals strategies to strike a balance between accuracy and energy consumption during the training/testing stage, potentially benefiting applications in edge computing. Notably, our proposed method achieves state-of-the-art performance while significantly reducing energy consumption.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 5","pages":"1736-1747"},"PeriodicalIF":5.0,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-01; DOI: 10.1109/TCDS.2024.3383158
Boyu Li; Haoran Li; Yuanheng Zhu; Dongbin Zhao
Agent-agnostic reinforcement learning aims to learn a universal control policy that can simultaneously control a set of robots with different morphologies. Recent studies have suggested that the transformer model can handle the variations in state and action spaces caused by different morphologies, and that morphology information is necessary to improve policy performance. However, existing methods have limitations in exploiting morphological information, and the soundness of their observation integration cannot be guaranteed. We propose the morphological adaptive transformer (MAT), a transformer-based universal control algorithm that can adapt to various morphologies without any modifications. MAT includes two essential components: functional position encoding (FPE) and a morphological attention mechanism (MAM). The FPE provides robust and consistent positional prior information for limb observations, avoiding limb confusion and implicitly obtaining functional descriptions of limbs. The MAM enhances the attribute prior information of limbs, improves the correlation between observations, and makes the policy attend to more limbs. We combine observations with prior information to help the policy adapt to a robot's morphology, thereby optimizing its performance on unknown morphologies. Experiments on agent-agnostic tasks in the Gym MuJoCo environment demonstrate that our algorithm assigns more reasonable morphological prior information to each limb, and its performance is comparable to that of the prior state-of-the-art algorithm while generalizing better.
{"title":"MAT: Morphological Adaptive Transformer for Universal Morphology Policy Learning","authors":"Boyu Li;Haoran Li;Yuanheng Zhu;Dongbin Zhao","doi":"10.1109/TCDS.2024.3383158","DOIUrl":"10.1109/TCDS.2024.3383158","url":null,"abstract":"Agent-agnostic reinforcement learning aims to learn a universal control policy that can simultaneously control a set of robots with different morphologies. Recent studies have suggested that using the transformer model can address variations in state and action spaces caused by different morphologies, and morphology information is necessary to improve policy performance. However, existing methods have limitations in exploiting morphological information, where the rationality of observation integration cannot be guaranteed. We propose morphological adaptive transformer (MAT), a transformer-based universal control algorithm that can adapt to various morphologies without any modifications. MAT includes two essential components: functional position encoding (FPE) and morphological attention mechanism (MAM). The FPE provides robust and consistent positional prior information for limb observation to avoid limb confusion and implicitly obtain functional descriptions of limbs. The MAM enhances the attribute prior information of limbs, improves the correlation between observations, and makes the policy pay attention to more limbs. We combine observation with prior information to help policy adapt to the morphology of robots, thereby optimizing its performance with unknown morphologies. Experiments on agent-agnostic tasks in Gym MuJoCo environment demonstrate that our algorithm can assign more reasonable morphological prior information to each limb, and the performance of our algorithm is comparable to the prior state-of-the-art algorithm with better generalization.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 4","pages":"1611-1621"},"PeriodicalIF":5.0,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140593972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-30; DOI: 10.1109/TCDS.2024.3405573
Dhruv Sharma; Chhavi Dhiman; Dinesh Kumar
Automatic image captioning is a computationally intensive and structurally complicated task that describes the contents of an image in the form of a natural-language sentence. Methods developed in the recent past focused mainly on describing the factual content of images, thereby ignoring the different emotions and styles (romantic, humorous, angry, etc.) associated with the image. To overcome this, a few works incorporated style-based caption generation, which captures the variability in the generated descriptions. This article presents a style embedding-based variational autoencoder for controlled stylized caption generation (RFCG+SE-VAE-CSCG) framework. It generates controlled, text-based stylized descriptions of images. It works in two phases, i.e., 1)