Pub Date: 2024-12-13 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1513488
Lv Yongyin, Yu Caixia
Introduction: Segmentation tasks in computer vision play a crucial role in various applications, ranging from object detection to medical imaging and cultural heritage preservation. Traditional approaches, including convolutional neural networks (CNNs) and standard transformer-based models, have achieved significant success; however, they often face challenges in capturing fine-grained details and maintaining efficiency across diverse datasets. These methods struggle with balancing precision and computational efficiency, especially when dealing with complex patterns and high-resolution images.
Methods: To address these limitations, we propose a novel segmentation model that integrates a hierarchical vision transformer backbone with multi-scale self-attention, cascaded attention decoding, and diffusion-based robustness enhancement. Our approach aims to capture both local details and global contexts effectively while maintaining lower computational overhead.
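For intuition, a generic multi-scale self-attention block over a hierarchical feature map might be sketched as below in PyTorch; the module name, pooling scales, and residual fusion are assumptions for illustration, not the authors' implementation.

```python
# Minimal, illustrative sketch (not the paper's code) of multi-scale self-attention
# over one stage of a hierarchical vision-transformer feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSelfAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(channels, num_heads, batch_first=True) for _ in scales]
        )
        self.proj = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from one backbone stage
        b, c, h, w = x.shape
        outputs = []
        for scale, attn in zip(self.scales, self.attn):
            # Pool to a coarser grid so attention covers a wider effective context.
            xs = F.adaptive_avg_pool2d(x, (max(h // scale, 1), max(w // scale, 1)))
            hs, ws = xs.shape[-2:]
            tokens = xs.flatten(2).transpose(1, 2)             # (B, hs*ws, C)
            out, _ = attn(tokens, tokens, tokens)              # self-attention at this scale
            out = out.transpose(1, 2).reshape(b, c, hs, ws)
            outputs.append(F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False))
        return self.proj(torch.cat(outputs, dim=1)) + x        # residual fusion of all scales

# Usage: fused = MultiScaleSelfAttention(256)(torch.randn(2, 256, 64, 64))
```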
Results and discussion: Experiments conducted on four diverse datasets, including Ancient Architecture, MS COCO, Cityscapes, and ScanNet, demonstrate that our model outperforms state-of-the-art methods in accuracy, recall, and computational efficiency. The results highlight the model's ability to generalize well across different tasks and provide robust segmentation, even in challenging scenarios. Our work paves the way for more efficient and precise segmentation techniques, making it valuable for applications where both detail and speed are critical.
{"title":"Cross-attention swin-transformer for detailed segmentation of ancient architectural color patterns.","authors":"Lv Yongyin, Yu Caixia","doi":"10.3389/fnbot.2024.1513488","DOIUrl":"10.3389/fnbot.2024.1513488","url":null,"abstract":"<p><strong>Introduction: </strong>Segmentation tasks in computer vision play a crucial role in various applications, ranging from object detection to medical imaging and cultural heritage preservation. Traditional approaches, including convolutional neural networks (CNNs) and standard transformer-based models, have achieved significant success; however, they often face challenges in capturing fine-grained details and maintaining efficiency across diverse datasets. These methods struggle with balancing precision and computational efficiency, especially when dealing with complex patterns and high-resolution images.</p><p><strong>Methods: </strong>To address these limitations, we propose a novel segmentation model that integrates a hierarchical vision transformer backbone with multi-scale self-attention, cascaded attention decoding, and diffusion-based robustness enhancement. Our approach aims to capture both local details and global contexts effectively while maintaining lower computational overhead.</p><p><strong>Results and discussion: </strong>Experiments conducted on four diverse datasets, including Ancient Architecture, MS COCO, Cityscapes, and ScanNet, demonstrate that our model outperforms state-of-the-art methods in accuracy, recall, and computational efficiency. The results highlight the model's ability to generalize well across different tasks and provide robust segmentation, even in challenging scenarios. Our work paves the way for more efficient and precise segmentation techniques, making it valuable for applications where both detail and speed are critical.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1513488"},"PeriodicalIF":2.6,"publicationDate":"2024-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707421/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142947470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-10 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1485640
Xiaoguang Li, Yaqi Chu, Xuejian Wu
Non-invasive brain-computer interfaces (BCIs) hold great promise in the field of neurorehabilitation: they are easy to use and do not require surgery, particularly those based on motor imagery electroencephalography (EEG). However, motor imagery EEG signals often have a low signal-to-noise ratio and limited spatial and temporal resolution. Traditional deep neural networks typically focus only on the spatial and temporal features of EEG, resulting in relatively low decoding accuracy for motor imagery tasks. To address these challenges, this paper proposes a 3D convolutional neural network (P-3DCNN) decoding method that jointly learns spatial-frequency feature maps from the frequency and spatial domains of the EEG signals. First, the Welch method is used to calculate the frequency-band power spectrum of the EEG, and a 2D matrix representing the spatial topology of the electrodes is constructed; the spatial-frequency representations are then generated through cubic interpolation. Next, a 3D CNN with 1D and 2D convolutional layers in series is designed to optimize the convolutional kernel parameters and effectively learn the spatial-frequency features of the EEG. Batch normalization and dropout are also applied to improve the training speed and classification performance of the network. Finally, the proposed method is compared experimentally with a range of classic machine learning and deep learning techniques. The results show an average decoding accuracy of 86.69%, surpassing other advanced networks, which demonstrates the effectiveness of our approach in decoding motor imagery EEG and offers valuable insights for the development of BCIs.
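The spatial-frequency "pictures" described above can be illustrated with a rough sketch (not the paper's code): Welch band power is computed per channel, scattered onto assumed 2D electrode positions, and cubically interpolated onto a dense grid. The toy electrode layout and the mu/beta band edges below are assumptions.

```python
# Illustrative sketch of building spatial-frequency maps from EEG epochs.
import numpy as np
from scipy.signal import welch
from scipy.interpolate import griddata

def band_power(eeg, fs, band):
    """eeg: (n_channels, n_samples); returns mean Welch power in `band` per channel."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs)            # psd: (n_channels, n_freqs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[:, mask].mean(axis=1)

def topo_map(values, positions, grid_size=32):
    """Scatter per-channel values at 2D electrode positions and interpolate (cubic)."""
    xi = np.linspace(0, 1, grid_size)
    gx, gy = np.meshgrid(xi, xi)
    return griddata(positions, values, (gx, gy), method="cubic", fill_value=0.0)

if __name__ == "__main__":
    fs, n_ch = 250, 22
    rng = np.random.default_rng(0)
    eeg = rng.standard_normal((n_ch, 4 * fs))             # 4 s of toy EEG
    positions = rng.uniform(0.1, 0.9, size=(n_ch, 2))     # assumed electrode (x, y) layout
    mu = topo_map(band_power(eeg, fs, (8, 13)), positions)     # mu-band image
    beta = topo_map(band_power(eeg, fs, (13, 30)), positions)  # beta-band image
    pictures = np.stack([mu, beta])                        # stacked inputs for a 3D CNN
    print(pictures.shape)                                  # (2, 32, 32)
```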
{"title":"3D convolutional neural network based on spatial-spectral feature pictures learning for decoding motor imagery EEG signal.","authors":"Xiaoguang Li, Yaqi Chu, Xuejian Wu","doi":"10.3389/fnbot.2024.1485640","DOIUrl":"10.3389/fnbot.2024.1485640","url":null,"abstract":"<p><p>Non-invasive brain-computer interfaces (BCI) hold great promise in the field of neurorehabilitation. They are easy to use and do not require surgery, particularly in the area of motor imagery electroencephalography (EEG). However, motor imagery EEG signals often have a low signal-to-noise ratio and limited spatial and temporal resolution. Traditional deep neural networks typically only focus on the spatial and temporal features of EEG, resulting in relatively low decoding and accuracy rates for motor imagery tasks. To address these challenges, this paper proposes a 3D Convolutional Neural Network (P-3DCNN) decoding method that jointly learns spatial-frequency feature maps from the frequency and spatial domains of the EEG signals. First, the Welch method is used to calculate the frequency band power spectrum of the EEG, and a 2D matrix representing the spatial topology distribution of the electrodes is constructed. These spatial-frequency representations are then generated through cubic interpolation of the temporal EEG data. Next, the paper designs a 3DCNN network with 1D and 2D convolutional layers in series to optimize the convolutional kernel parameters and effectively learn the spatial-frequency features of the EEG. Batch normalization and dropout are also applied to improve the training speed and classification performance of the network. Finally, through experiments, the proposed method is compared to various classic machine learning and deep learning techniques. The results show an average decoding accuracy rate of 86.69%, surpassing other advanced networks. This demonstrates the effectiveness of our approach in decoding motor imagery EEG and offers valuable insights for the development of BCI.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1485640"},"PeriodicalIF":2.6,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11667157/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142885708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1481297
Xiaoxia Xie, Yuan Jia, Tiande Ma
User perception of mobile games is crucial for improving the user experience and thus enhancing game profitability. However, the sparse data captured in games can lead to unstable model performance. This paper proposes a new method, the balanced graph factorization machine (BGFM), which builds on existing algorithms while accounting for data imbalance and important high-dimensional features. The data categories are first balanced by Borderline-SMOTE oversampling, and the features are then represented naturally in a graph-structured way. A highlight is that BGFM contains interaction mechanisms for aggregating beneficial features, with the results represented as edges in the graph. Next, BGFM combines factorization machine (FM) and graph neural network strategies to concatenate sequential feature interactions in the graph, using an attention mechanism that assigns inter-feature weights. Experiments were conducted on the collected game perception dataset, and the performance of the proposed BGFM was compared with that of eight state-of-the-art models; BGFM significantly surpassed all of them in AUC, precision, recall, and F-measure.
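The class-balancing step can be illustrated with imbalanced-learn's Borderline-SMOTE, as in the sketch below; the toy dataset and parameters are assumptions, and the graph construction and attention components of BGFM are not shown.

```python
# Minimal sketch of Borderline-SMOTE oversampling for an imbalanced dataset.
import numpy as np
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE

rng = np.random.default_rng(42)
# Toy imbalanced data: 900 negative vs. 100 positive samples with 16 features.
X = rng.standard_normal((1000, 16))
y = np.array([0] * 900 + [1] * 100)

sampler = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_res, y_res = sampler.fit_resample(X, y)

print(Counter(y))      # before: Counter({0: 900, 1: 100})
print(Counter(y_res))  # after: approximately balanced classes
```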
{"title":"An improved graph factorization machine based on solving unbalanced game perception.","authors":"Xiaoxia Xie, Yuan Jia, Tiande Ma","doi":"10.3389/fnbot.2024.1481297","DOIUrl":"10.3389/fnbot.2024.1481297","url":null,"abstract":"<p><p>The user perception of mobile game is crucial for improving user experience and thus enhancing game profitability. The sparse data captured in the game can lead to sporadic performance of the model. This paper proposes a new method, the balanced graph factorization machine (BGFM), based on existing algorithms, considering the data imbalance and important high-dimensional features. The data categories are first balanced by Borderline-SMOTE oversampling, and then features are represented naturally in a graph-structured way. The highlight is that the BGFM contains interaction mechanisms for aggregating beneficial features. The results are represented as edges in the graph. Next, BGFM combines factorization machine (FM) and graph neural network strategies to concatenate any sequential feature interactions of features in the graph with an attention mechanism that assigns inter-feature weights. Experiments were conducted on the collected game perception dataset. The performance of proposed BGFM was compared with eight state-of-the-art models, significantly surpassing all of them by AUC, precision, recall, and F-measure indices.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1481297"},"PeriodicalIF":2.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11652536/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142853907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1443678
Yawar Abbas, Naif Al Mudawi, Bayan Alabdullah, Touseef Sadiq, Asaad Algarni, Hameedur Rahman, Ahmad Jalal
Introduction: Recognizing human actions is crucial for allowing machines to understand and interpret human behavior, with applications spanning video-based surveillance systems, human-robot collaboration, sports analysis systems, and entertainment. The immense diversity in human movement and appearance poses a significant challenge in this field, especially when dealing with drone-recorded (RGB) videos. Factors such as dynamic backgrounds, motion blur, occlusions, varying video capture angles, and exposure issues greatly complicate recognition tasks.
Methods: In this study, we suggest a method that addresses these challenges in RGB videos captured by drones. Our approach begins by segmenting the video into individual frames, followed by preprocessing steps applied to these RGB frames. The preprocessing aims to reduce computational costs, optimize image quality, and enhance foreground objects while removing the background.
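As a rough illustration of this preprocessing stage (not the authors' exact pipeline), the sketch below reads drone video frames with OpenCV and suppresses the background with a MOG2 subtractor; the video path and morphology kernel are assumptions.

```python
# Sketch: extract frames from a drone video and keep background-suppressed foreground.
import cv2

def extract_foreground_frames(video_path="drone_clip.mp4", max_frames=100):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    frames = []
    while cap.isOpened() and len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)                        # foreground mask
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle noise
        foreground = cv2.bitwise_and(frame, frame, mask=mask)  # background removed
        frames.append(foreground)
    cap.release()
    return frames
```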
Result: This results in improved visibility of foreground objects while eliminating background noise. Next, we employ the YOLOv9 detection algorithm to identify human bodies within the images. From the grayscale silhouette, we extract the human skeleton and identify 15 key landmarks: the head, neck, belly button, and the left and right shoulders, elbows, wrists, hips, knees, and ankles. From these points we extract specific positions, angular and distance relationships between them, as well as 3D point clouds and fiducial points. Subsequently, we optimize these features using the kernel discriminant analysis (KDA) optimizer, followed by classification with a convolutional neural network (CNN). To validate our system, we conducted experiments on three benchmark datasets: UAV-Human, UCF, and Drone-Action.
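The distance and angle descriptors can be sketched as follows; the keypoint indices and the chosen angle triplet are assumptions rather than the authors' exact feature set.

```python
# Illustrative geometric features from 15 skeleton keypoints: pairwise distances and joint angles.
import numpy as np

def pairwise_distances(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (15, 2) pixel coordinates; returns the 105 unique pairwise distances."""
    diffs = keypoints[:, None, :] - keypoints[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(keypoints), k=1)
    return dists[iu]

def joint_angle(a, b, c) -> float:
    """Angle (radians) at point b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))

kp = np.random.rand(15, 2)               # toy skeleton (head, neck, shoulders, ...)
features = np.concatenate([
    pairwise_distances(kp),
    [joint_angle(kp[2], kp[4], kp[6])],  # e.g. shoulder-elbow-wrist angle (assumed indices)
])
print(features.shape)                    # (106,)
```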
Discussion: On these datasets, the proposed model achieved action recognition accuracies of 0.68, 0.75, and 0.83, respectively.
{"title":"Unmanned aerial vehicles for human detection and recognition using neural-network model.","authors":"Yawar Abbas, Naif Al Mudawi, Bayan Alabdullah, Touseef Sadiq, Asaad Algarni, Hameedur Rahman, Ahmad Jalal","doi":"10.3389/fnbot.2024.1443678","DOIUrl":"10.3389/fnbot.2024.1443678","url":null,"abstract":"<p><strong>Introduction: </strong>Recognizing human actions is crucial for allowing machines to understand and recognize human behavior, with applications spanning video based surveillance systems, human-robot collaboration, sports analysis systems, and entertainment. The immense diversity in human movement and appearance poses a significant challenge in this field, especially when dealing with drone-recorded (RGB) videos. Factors such as dynamic backgrounds, motion blur, occlusions, varying video capture angles, and exposure issues greatly complicate recognition tasks.</p><p><strong>Methods: </strong>In this study, we suggest a method that addresses these challenges in RGB videos captured by drones. Our approach begins by segmenting the video into individual frames, followed by preprocessing steps applied to these RGB frames. The preprocessing aims to reduce computational costs, optimize image quality, and enhance foreground objects while removing the background.</p><p><strong>Result: </strong>This results in improved visibility of foreground objects while eliminating background noise. Next, we employ the YOLOv9 detection algorithm to identify human bodies within the images. From the grayscale silhouette, we extract the human skeleton and identify 15 important locations, such as the head, neck, shoulders (left and right), elbows, wrists, hips, knees, ankles, and hips (left and right), and belly button. By using all these points, we extract specific positions, angular and distance relationships between them, as well as 3D point clouds and fiducial points. Subsequently, we optimize this data using the kernel discriminant analysis (KDA) optimizer, followed by classification using a deep neural network (CNN). To validate our system, we conducted experiments on three benchmark datasets: UAV-Human, UCF, and Drone-Action.</p><p><strong>Discussion: </strong>On these datasets, our suggested model produced corresponding action recognition accuracies of 0.68, 0.75, and 0.83.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1443678"},"PeriodicalIF":2.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11652500/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142853909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1462023
Xinyu Jiang, Chenfei Ma, Kianoush Nazarpour
Introduction: Myoelectric control systems translate different patterns of electromyographic (EMG) signals into control commands for diverse human-machine interfaces via hand gesture recognition, enabling intuitive control of prostheses and immersive interactions in the metaverse. Arm position is a confounding factor that leads to variability in EMG characteristics. Developing a model whose characteristics and performance are invariant across postures could greatly promote the translation of myoelectric control into real-world practice.
Methods: Here we propose a self-calibrating random forest (RF) model which can (1) be pre-trained on data from many users, then one-shot calibrated on a new user and (2) self-calibrate in an unsupervised and autonomous way to adapt to varying arm positions.
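One way to approximate the pre-train-then-calibrate idea (this is not the paper's self-calibrating update rule) is to grow extra trees on a new user's single calibration batch using scikit-learn's warm_start, as in the sketch below; the toy feature dimensions and tree counts are assumptions.

```python
# Rough approximation of "pre-train on many users, one-shot calibrate on a new user".
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy EMG features: pooled pre-training users vs. one new user's calibration batch.
X_pool, y_pool = rng.standard_normal((5000, 32)), rng.integers(0, 5, 5000)
X_new, y_new = rng.standard_normal((50, 32)), rng.integers(0, 5, 50)

rf = RandomForestClassifier(n_estimators=200, warm_start=True, random_state=0)
rf.fit(X_pool, y_pool)                       # pre-train on pooled multi-user data

# One-shot calibration: warm_start keeps the existing trees and fits only the
# additional ones on data that now includes the new user's calibration batch.
rf.n_estimators += 50
rf.fit(np.vstack([X_pool, X_new]), np.concatenate([y_pool, y_new]))

print(rf.score(X_new, y_new))
```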
Results: Analyses on data from 86 participants (66 for pre-training and 20 in real-time evaluation experiments) demonstrate the high generalisability of the proposed RF architecture to varying arm positions.
Discussion: Our work promotes the use of simple, explainable, efficient, and parallelisable models for posture-invariant myoelectric control.
{"title":"Posture-invariant myoelectric control with self-calibrating random forests.","authors":"Xinyu Jiang, Chenfei Ma, Kianoush Nazarpour","doi":"10.3389/fnbot.2024.1462023","DOIUrl":"10.3389/fnbot.2024.1462023","url":null,"abstract":"<p><strong>Introduction: </strong>Myoelectric control systems translate different patterns of electromyographic (EMG) signals into the control commands of diverse human-machine interfaces via hand gesture recognition, enabling intuitive control of prosthesis and immersive interactions in the metaverse. The effect of arm position is a confounding factor leading to the variability of EMG characteristics. Developing a model with its characteristics and performance invariant across postures, could largely promote the translation of myoelectric control into real world practice.</p><p><strong>Methods: </strong>Here we propose a self-calibrating random forest (RF) model which can (1) be pre-trained on data from many users, then one-shot calibrated on a new user and (2) self-calibrate in an unsupervised and autonomous way to adapt to varying arm positions.</p><p><strong>Results: </strong>Analyses on data from 86 participants (66 for pre-training and 20 in real-time evaluation experiments) demonstrate the high generalisability of the proposed RF architecture to varying arm positions.</p><p><strong>Discussion: </strong>Our work promotes the use of simple, explainable, efficient and parallelisable model for posture-invariant myoelectric control.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1462023"},"PeriodicalIF":2.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11652494/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142853908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-03 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1491721
Rodrigo Vieira, Plinio Moreno, Athanasios Vourvopoulos
As robots become integral to various sectors, improving human-robot collaboration is crucial, particularly in anticipating human actions to enhance safety and efficiency. Electroencephalographic (EEG) signals offer a promising solution, as they can detect brain activity preceding movement by over a second, enabling predictive capabilities in robots. This study explores how EEG can be used for action anticipation in human-robot interaction (HRI), leveraging its high temporal resolution and modern deep learning techniques. We evaluated multiple deep learning classification models on a motor imagery (MI) dataset, achieving up to 80.90% accuracy. These results were further validated in a pilot experiment, where actions were accurately predicted several hundred milliseconds before execution. This research demonstrates the potential of combining EEG with deep learning to enhance real-time collaborative tasks, paving the way for safer and more efficient human-robot interactions.
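For illustration only, a minimal motor-imagery CNN of the kind such studies evaluate might look like the sketch below; the shapes are toy values and this is not one of the models compared in the paper.

```python
# Minimal, assumed CNN baseline for motor-imagery EEG epochs (channels x time) in PyTorch.
import torch
import torch.nn as nn

class TinyMICNN(nn.Module):
    def __init__(self, n_channels=22, n_samples=500, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 25), padding=(0, 12)),  # temporal filters
            nn.Conv2d(16, 32, kernel_size=(n_channels, 1)),          # spatial filters
            nn.BatchNorm2d(32),
            nn.ELU(),
            nn.AvgPool2d(kernel_size=(1, 4)),
            nn.Dropout(0.5),
        )
        self.classifier = nn.Linear(32 * (n_samples // 4), n_classes)

    def forward(self, x):                  # x: (batch, 1, n_channels, n_samples)
        z = self.features(x)
        return self.classifier(z.flatten(1))

logits = TinyMICNN()(torch.randn(8, 1, 22, 500))
print(logits.shape)                        # torch.Size([8, 2])
```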
{"title":"EEG-based action anticipation in human-robot interaction: a comparative pilot study.","authors":"Rodrigo Vieira, Plinio Moreno, Athanasios Vourvopoulos","doi":"10.3389/fnbot.2024.1491721","DOIUrl":"10.3389/fnbot.2024.1491721","url":null,"abstract":"<p><p>As robots become integral to various sectors, improving human-robot collaboration is crucial, particularly in anticipating human actions to enhance safety and efficiency. Electroencephalographic (EEG) signals offer a promising solution, as they can detect brain activity preceding movement by over a second, enabling predictive capabilities in robots. This study explores how EEG can be used for action anticipation in human-robot interaction (HRI), leveraging its high temporal resolution and modern deep learning techniques. We evaluated multiple Deep Learning classification models on a motor imagery (MI) dataset, achieving up to 80.90% accuracy. These results were further validated in a pilot experiment, where actions were accurately predicted several hundred milliseconds before execution. This research demonstrates the potential of combining EEG with deep learning to enhance real-time collaborative tasks, paving the way for safer and more efficient human-robot interactions.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1491721"},"PeriodicalIF":2.6,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11649676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142845975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-27 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1362444
Naïg Chenais, Arno Görgen
Digital immersive technologies have become increasingly prominent in clinical research and practice, including medical communication and technical education, serious games for health, psychotherapy, and interfaces for neurorehabilitation. The worldwide enthusiasm for digital health and digital therapeutics has prompted the development and testing of numerous applications and interaction methods. Nevertheless, the lack of consistency in the approaches and the peculiarity of the constructed environments contribute to an increasing disparity between the eagerness for new immersive designs and the long-term clinical adoption of these technologies. Several challenges emerge in aligning the different priorities of virtual environment designers and clinicians. This article seeks to examine the utilization and mechanics of medical immersive interfaces based on extended reality and highlight specific design challenges. The transfer of skills from virtual to clinical environments is often confounded by perceptual and attractiveness factors. We argue that a multidisciplinary approach to development and testing, along with a comprehensive acknowledgement of the shared mechanisms that underlie immersive training, are essential for the sustainable integration of extended reality into clinical settings. The present review discusses the application of a multilevel sensory framework to extended reality design, with the aim of developing brain-centered immersive interfaces tailored for therapeutic and educational purposes. Such a framework must include broader design questions, such as the integration of digital technologies into psychosocial care models, clinical validation, and related ethical concerns. We propose that efforts to bridge the virtual gap should include mixed methodologies and neurodesign approaches, integrating user behavioral and physiological feedback into iterative design phases.
{"title":"Immersive interfaces for clinical applications: current status and future perspective.","authors":"Naïg Chenais, Arno Görgen","doi":"10.3389/fnbot.2024.1362444","DOIUrl":"10.3389/fnbot.2024.1362444","url":null,"abstract":"<p><p>Digital immersive technologies have become increasingly prominent in clinical research and practice, including medical communication and technical education, serious games for health, psychotherapy, and interfaces for neurorehabilitation. The worldwide enthusiasm for digital health and digital therapeutics has prompted the development and testing of numerous applications and interaction methods. Nevertheless, the lack of consistency in the approaches and the peculiarity of the constructed environments contribute to an increasing disparity between the eagerness for new immersive designs and the long-term clinical adoption of these technologies. Several challenges emerge in aligning the different priorities of virtual environment designers and clinicians. This article seeks to examine the utilization and mechanics of medical immersive interfaces based on extended reality and highlight specific design challenges. The transfer of skills from virtual to clinical environments is often confounded by perceptual and attractiveness factors. We argue that a multidisciplinary approach to development and testing, along with a comprehensive acknowledgement of the shared mechanisms that underlie immersive training, are essential for the sustainable integration of extended reality into clinical settings. The present review discusses the application of a multilevel sensory framework to extended reality design, with the aim of developing brain-centered immersive interfaces tailored for therapeutic and educational purposes. Such a framework must include broader design questions, such as the integration of digital technologies into psychosocial care models, clinical validation, and related ethical concerns. We propose that efforts to bridge the virtual gap should include mixed methodologies and neurodesign approaches, integrating user behavioral and physiological feedback into iterative design phases.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1362444"},"PeriodicalIF":2.6,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631914/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142812874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-26 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1439195
Zhang Juan, Jing Zhang, Ming Gao
Introduction: With the rapid development of the tourism industry, the demand for accurate and personalized travel route recommendations has significantly increased. However, traditional methods often fail to effectively integrate visual and sequential information, leading to recommendations that are both less accurate and less personalized.
Methods: This paper introduces SelfAM-Vtrans, a novel algorithm that leverages multimodal data, combining visual Transformers, LSTMs, and self-attention mechanisms, to enhance the accuracy and personalization of travel route recommendations. SelfAM-Vtrans integrates visual and sequential information by employing a visual Transformer to extract features from travel images, thereby capturing spatial relationships within them. Concurrently, a Long Short-Term Memory (LSTM) network encodes sequential data to capture the temporal dependencies within travel sequences. To effectively merge these two modalities, a self-attention mechanism fuses the visual features and sequential encodings, thoroughly accounting for their interdependencies. Based on this fused representation, a classification or regression model is trained using real travel datasets to recommend optimal travel routes.
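A minimal sketch of the fusion step, with assumed dimensions and module names: visual-Transformer patch tokens and LSTM-encoded travel sequences are concatenated and fused with self-attention before a recommendation head. This is an illustration of the general pattern, not the SelfAM-Vtrans implementation.

```python
# Sketch: fusing image tokens and sequence encodings with self-attention in PyTorch.
import torch
import torch.nn as nn

class VisualSequenceFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8, n_routes=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, n_routes)                 # route classification head

    def forward(self, image_tokens, sequence_embeddings):
        # image_tokens: (B, n_patches, dim) from a visual Transformer
        # sequence_embeddings: (B, seq_len, dim) embedded travel history
        seq_enc, _ = self.lstm(sequence_embeddings)
        joint = torch.cat([image_tokens, seq_enc], dim=1)     # joint token sequence
        fused, _ = self.fuse(joint, joint, joint)             # self-attention over both modalities
        return self.head(fused.mean(dim=1))                   # (B, n_routes)

scores = VisualSequenceFusion()(torch.randn(4, 49, 256), torch.randn(4, 10, 256))
print(scores.shape)                                           # torch.Size([4, 20])
```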
Results and discussion: The algorithm was rigorously evaluated through experiments conducted on real-world travel datasets, and its performance was benchmarked against other route recommendation methods. The results demonstrate that SelfAM-Vtrans significantly outperforms traditional approaches in terms of both recommendation accuracy and personalization. By comprehensively incorporating both visual and sequential data, this method offers travelers more tailored and precise route suggestions, thereby enriching the overall travel experience.
{"title":"A multimodal travel route recommendation system leveraging visual Transformers and self-attention mechanisms.","authors":"Zhang Juan, Jing Zhang, Ming Gao","doi":"10.3389/fnbot.2024.1439195","DOIUrl":"10.3389/fnbot.2024.1439195","url":null,"abstract":"<p><strong>Introduction: </strong>With the rapid development of the tourism industry, the demand for accurate and personalized travel route recommendations has significantly increased. However, traditional methods often fail to effectively integrate visual and sequential information, leading to recommendations that are both less accurate and less personalized.</p><p><strong>Methods: </strong>This paper introduces SelfAM-Vtrans, a novel algorithm that leverages multimodal data-combining visual Transformers, LSTMs, and self-attention mechanisms-to enhance the accuracy and personalization of travel route recommendations. SelfAM-Vtrans integrates visual and sequential information by employing a visual Transformer to extract features from travel images, thereby capturing spatial relationships within them. Concurrently, a Long Short-Term Memory (LSTM) network encodes sequential data to capture the temporal dependencies within travel sequences. To effectively merge these two modalities, a self-attention mechanism fuses the visual features and sequential encodings, thoroughly accounting for their interdependencies. Based on this fused representation, a classification or regression model is trained using real travel datasets to recommend optimal travel routes.</p><p><strong>Results and discussion: </strong>The algorithm was rigorously evaluated through experiments conducted on real-world travel datasets, and its performance was benchmarked against other route recommendation methods. The results demonstrate that SelfAM-Vtrans significantly outperforms traditional approaches in terms of both recommendation accuracy and personalization. By comprehensively incorporating both visual and sequential data, this method offers travelers more tailored and precise route suggestions, thereby enriching the overall travel experience.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1439195"},"PeriodicalIF":2.6,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11628496/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142806763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-21 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1479694
Jie Chang, Zhenmeng Wang, Chao Yan
Introduction: In recent years, with the rapid development of artificial intelligence technology, the field of music education has begun to explore new teaching models. Traditional music education research has focused primarily on single-modal tasks such as note recognition and instrument performance techniques, often overlooking the importance of multimodal data integration and interactive teaching. Existing methods often struggle to handle multimodal data effectively and cannot fully utilize visual, auditory, and textual information for comprehensive analysis, which limits teaching effectiveness.
Methods: To address these challenges, this project introduces MusicARLtrans Net, a multimodal interactive music education agent system driven by reinforcement learning. The system integrates Speech-to-Text (STT) technology to achieve accurate transcription of user voice commands, utilizes the ALBEF (Align Before Fuse) model for aligning and integrating multimodal data, and applies reinforcement learning to optimize teaching strategies.
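As a heavily simplified illustration of reinforcement-learning-driven teaching strategy selection (not the system's actual RL formulation), an epsilon-greedy bandit over an assumed set of teaching actions could look like the sketch below; the action names and reward signal are assumptions.

```python
# Toy epsilon-greedy bandit for choosing a teaching action from learner feedback.
import random

ACTIONS = ["repeat_exercise", "slow_tempo_demo", "new_piece", "theory_explanation"]

class TeachingBandit:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in ACTIONS}
        self.values = {a: 0.0 for a in ACTIONS}

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)                   # explore
        return max(ACTIONS, key=lambda a: self.values[a])   # exploit best estimate

    def update(self, action, reward):
        # Incremental mean update of the action's estimated value.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

bandit = TeachingBandit()
for _ in range(100):
    action = bandit.select()
    reward = random.random()          # stand-in for learner performance/satisfaction
    bandit.update(action, reward)
print(max(bandit.values, key=bandit.values.get))
```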
Results and discussion: By effectively combining auditory, visual, and textual information, this approach provides a personalized, real-time, feedback-driven interactive learning experience. The system collects and annotates multimodal data related to music education, trains and integrates the various modules, and ultimately delivers an efficient and intelligent music education agent. Experimental results demonstrate that MusicARLtrans Net significantly outperforms traditional methods, achieving an accuracy of 96.77% on the LibriSpeech dataset and 97.55% on the MS COCO dataset, with marked improvements in recall, F1 score, and AUC metrics. These results highlight the system's superiority in speech recognition accuracy, multimodal data understanding, and teaching strategy optimization, which together lead to enhanced learning outcomes and user satisfaction. The findings hold substantial academic and practical significance, demonstrating the potential of advanced AI-driven systems in revolutionizing music education.
{"title":"MusicARLtrans Net: a multimodal agent interactive music education system driven via reinforcement learning.","authors":"Jie Chang, Zhenmeng Wang, Chao Yan","doi":"10.3389/fnbot.2024.1479694","DOIUrl":"10.3389/fnbot.2024.1479694","url":null,"abstract":"<p><strong>Introduction: </strong>In recent years, with the rapid development of artificial intelligence technology, the field of music education has begun to explore new teaching models. Traditional music education research methods have primarily focused on single-modal studies such as note recognition and instrument performance techniques, often overlooking the importance of multimodal data integration and interactive teaching. Existing methods often struggle with handling multimodal data effectively, unable to fully utilize visual, auditory, and textual information for comprehensive analysis, which limits the effectiveness of teaching.</p><p><strong>Methods: </strong>To address these challenges, this project introduces MusicARLtrans Net, a multimodal interactive music education agent system driven by reinforcement learning. The system integrates Speech-to-Text (STT) technology to achieve accurate transcription of user voice commands, utilizes the ALBEF (Align Before Fuse) model for aligning and integrating multimodal data, and applies reinforcement learning to optimize teaching strategies.</p><p><strong>Results and discussion: </strong>This approach provides a personalized and real-time feedback interactive learning experience by effectively combining auditory, visual, and textual information. The system collects and annotates multimodal data related to music education, trains and integrates various modules, and ultimately delivers an efficient and intelligent music education agent. Experimental results demonstrate that MusicARLtrans Net significantly outperforms traditional methods, achieving an accuracy of <b>96.77%</b> on the LibriSpeech dataset and <b>97.55%</b> on the MS COCO dataset, with marked improvements in recall, F1 score, and AUC metrics. These results highlight the system's superiority in speech recognition accuracy, multimodal data understanding, and teaching strategy optimization, which together lead to enhanced learning outcomes and user satisfaction. The findings hold substantial academic and practical significance, demonstrating the potential of advanced AI-driven systems in revolutionizing music education.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1479694"},"PeriodicalIF":2.6,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11617572/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142785067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-20 | eCollection Date: 2024-01-01 | DOI: 10.3389/fnbot.2024.1483131
Ni Wang
Introduction: With the development of globalization and the increasing importance of English in international communication, effectively improving English writing skills has become a key focus in language learning. Traditional methods for English writing guidance and error correction have predominantly relied on rule-based approaches or statistical models, such as conventional language models and basic machine learning algorithms. While these methods can aid learners in improving their writing quality to some extent, they often suffer from limitations such as inflexibility, insufficient contextual understanding, and an inability to handle multimodal information. These shortcomings restrict their effectiveness in more complex linguistic environments.
Methods: To address these challenges, this study introduces ETG-ALtrans, a multimodal robot-assisted English writing guidance and error correction technology based on an improved ALBEF model and VGG19 architecture, enhanced by reinforcement learning. The approach leverages VGG19 to extract visual features and integrates them with the ALBEF model, achieving precise alignment and fusion of images and text. This enhances the model's ability to comprehend context. Furthermore, by incorporating reinforcement learning, the model can adaptively refine its correction strategies, thereby optimizing the effectiveness of writing guidance.
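A minimal sketch of the visual-feature step, assuming torchvision's pretrained VGG19 and an arbitrary 512-dimensional projection; the ALBEF-style image-text alignment and the reinforcement-learning component are not shown.

```python
# Sketch: extracting convolutional features from a pretrained VGG19 for later fusion.
import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
backbone = vgg.features.eval()                 # convolutional part only, frozen here
project = nn.Linear(512, 512)                  # map pooled features to a shared space

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed input image
    fmap = backbone(image)                     # (1, 512, 7, 7)
    pooled = fmap.mean(dim=(2, 3))             # global average pooling -> (1, 512)

visual_embedding = project(pooled)             # ready for fusion with text embeddings
print(visual_embedding.shape)                  # torch.Size([1, 512])
```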
Results and discussion: Experimental results demonstrate that the proposed ETG-ALtrans method significantly improves the accuracy of English writing error correction and the intelligence level of writing guidance in multimodal data scenarios. Compared to traditional methods, this approach not only enhances the precision of writing suggestions but also better caters to the personalized needs of learners, thereby effectively improving their writing skills. This research is of significant importance in the field of language learning technology and offers new perspectives and methodologies for the development of future English writing assistance tools.
{"title":"Multimodal robot-assisted English writing guidance and error correction with reinforcement learning.","authors":"Ni Wang","doi":"10.3389/fnbot.2024.1483131","DOIUrl":"10.3389/fnbot.2024.1483131","url":null,"abstract":"<p><strong>Introduction: </strong>With the development of globalization and the increasing importance of English in international communication, effectively improving English writing skills has become a key focus in language learning. Traditional methods for English writing guidance and error correction have predominantly relied on rule-based approaches or statistical models, such as conventional language models and basic machine learning algorithms. While these methods can aid learners in improving their writing quality to some extent, they often suffer from limitations such as inflexibility, insufficient contextual understanding, and an inability to handle multimodal information. These shortcomings restrict their effectiveness in more complex linguistic environments.</p><p><strong>Methods: </strong>To address these challenges, this study introduces ETG-ALtrans, a multimodal robot-assisted English writing guidance and error correction technology based on an improved ALBEF model and VGG19 architecture, enhanced by reinforcement learning. The approach leverages VGG19 to extract visual features and integrates them with the ALBEF model, achieving precise alignment and fusion of images and text. This enhances the model's ability to comprehend context. Furthermore, by incorporating reinforcement learning, the model can adaptively refine its correction strategies, thereby optimizing the effectiveness of writing guidance.</p><p><strong>Results and discussion: </strong>Experimental results demonstrate that the proposed ETG-ALtrans method significantly improves the accuracy of English writing error correction and the intelligence level of writing guidance in multimodal data scenarios. Compared to traditional methods, this approach not only enhances the precision of writing suggestions but also better caters to the personalized needs of learners, thereby effectively improving their writing skills. This research is of significant importance in the field of language learning technology and offers new perspectives and methodologies for the development of future English writing assistance tools.</p>","PeriodicalId":12628,"journal":{"name":"Frontiers in Neurorobotics","volume":"18 ","pages":"1483131"},"PeriodicalIF":2.6,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11614782/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142779207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}