Orthogonality and graph divergence losses promote disentanglement in generative models
Ankita Shukla, Rishi Dadhich, Rajhans Singh, Anirudh Rayas, Pouria Saidi, Gautam Dasarathy, Visar Berisha, Pavan Turaga
Pub Date: 2024-05-22 | DOI: 10.3389/fcomp.2024.1274779
Over the last decade, deep generative models have evolved to generate realistic, sharp images. Their success is often attributed to an extremely large number of trainable parameters and an abundance of training data, with limited or no understanding of the underlying data manifold. In this article, we explore learning a deep generative model that is structured to better capture the underlying manifold's geometry, improving image generation while providing implicitly controlled generation by design. Our approach structures the latent space into multiple disjoint representations that capture different attribute manifolds. The global representations are guided by a disentangling loss for effective attribute representation learning and a differential manifold divergence loss for learning an effective implicit generative model. Experimental results on a 3D shapes dataset demonstrate the model's ability to disentangle attributes without direct supervision and its controllable generative capabilities. These findings underscore the potential of structuring deep generative models to enhance image generation and attribute control without ground-truth attribute supervision, signaling progress toward more sophisticated deep generative models.
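The disentangling loss is not specified in the abstract; a minimal sketch of one plausible form, a soft orthogonality penalty between two disjoint latent partitions, is shown below (PyTorch; the partition sizes and penalty form are assumptions for illustration, not the authors' implementation).

```python
import torch

def orthogonality_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Penalize correlation between two latent partitions of shape (batch, dim).

    A soft constraint: the cross-covariance between the two attribute codes is
    pushed toward zero so each partition captures a distinct factor of variation.
    """
    z_a = z_a - z_a.mean(dim=0, keepdim=True)
    z_b = z_b - z_b.mean(dim=0, keepdim=True)
    cross_cov = z_a.T @ z_b / (z_a.shape[0] - 1)  # (dim_a, dim_b)
    return (cross_cov ** 2).sum()

# Illustrative usage: split a 16-d latent code into two 8-d attribute partitions.
latents = torch.randn(32, 16)
loss = orthogonality_loss(latents[:, :8], latents[:, 8:])
```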
{"title":"Orthogonality and graph divergence losses promote disentanglement in generative models","authors":"Ankita Shukla, Rishi Dadhich, Rajhans Singh, Anirudh Rayas, Pouria Saidi, Gautam Dasarathy, Visar Berisha, Pavan Turaga","doi":"10.3389/fcomp.2024.1274779","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1274779","url":null,"abstract":"Over the last decade, deep generative models have evolved to generate realistic and sharp images. The success of these models is often attributed to an extremely large number of trainable parameters and an abundance of training data, with limited or no understanding of the underlying data manifold. In this article, we explore the possibility of learning a deep generative model that is structured to better capture the underlying manifold's geometry, to effectively improve image generation while providing implicit controlled generation by design. Our approach structures the latent space into multiple disjoint representations capturing different attribute manifolds. The global representations are guided by a disentangling loss for effective attribute representation learning and a differential manifold divergence loss to learn an effective implicit generative model. Experimental results on a 3D shapes dataset demonstrate the model's ability to disentangle attributes without direct supervision and its controllable generative capabilities. These findings underscore the potential of structuring deep generative models to enhance image generation and attribute control without direct supervision with ground truth attributes signaling progress toward more sophisticated deep generative models.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141108829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguistic analysis of human-computer interaction
Georgia Zellou, Nicole Holliday
Pub Date: 2024-05-21 | DOI: 10.3389/fcomp.2024.1384252
This article reviews recent literature investigating speech variation in production and comprehension during spoken language communication between humans and devices. Human speech patterns toward voice-AI present a test of our scientific understanding of speech communication and language use. First, we review work exploring how human-AI interactions are similar to, or different from, human-human interactions in the realm of speech variation. In particular, we focus on studies examining how users adapt their speech when resolving linguistic misunderstandings by computers and when accommodating their speech toward devices. Next, we consider work investigating how top-down factors in the interaction can influence users’ linguistic interpretations of speech produced by technological agents, and how the ways in which speech is generated (via text-to-speech synthesis, TTS) and recognized (using automatic speech recognition technology, ASR) affect communication. Throughout this review, we aim to bridge HCI frameworks and theoretical linguistic models that account for variation in human speech. We also highlight findings in this growing area that can provide insight into the cognitive and social representations underlying linguistic communication more broadly. Additionally, we touch on the implications of this line of work for addressing major societal issues in speech technology.
{"title":"Linguistic analysis of human-computer interaction","authors":"Georgia Zellou, Nicole Holliday","doi":"10.3389/fcomp.2024.1384252","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1384252","url":null,"abstract":"This article reviews recent literature investigating speech variation in production and comprehension during spoken language communication between humans and devices. Human speech patterns toward voice-AI presents a test to our scientific understanding about speech communication and language use. First, work exploring how human-AI interactions are similar to, or different from, human-human interactions in the realm of speech variation is reviewed. In particular, we focus on studies examining how users adapt their speech when resolving linguistic misunderstandings by computers and when accommodating their speech toward devices. Next, we consider work that investigates how top-down factors in the interaction can influence users’ linguistic interpretations of speech produced by technological agents and how the ways in which speech is generated (via text-to-speech synthesis, TTS) and recognized (using automatic speech recognition technology, ASR) has an effect on communication. Throughout this review, we aim to bridge both HCI frameworks and theoretical linguistic models accounting for variation in human speech. We also highlight findings in this growing area that can provide insight to the cognitive and social representations underlying linguistic communication more broadly. Additionally, we touch on the implications of this line of work for addressing major societal issues in speech technology.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141117480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shape from dots: a window into abstraction processes in visual perception
Nicholas Baker, P. Kellman
Pub Date: 2024-05-16 | DOI: 10.3389/fcomp.2024.1367534
A remarkable phenomenon in perception is that the visual system spontaneously organizes sets of discrete elements into abstract shape representations. We studied perceptual performance with dot displays to discover what spatial relationships support shape perception.

In Experiment 1, we tested conditions that lead dot arrays to be perceived as smooth contours vs. having vertices. We found that the perception of a smooth contour vs. a vertex was influenced by spatial relations between dots beyond the three points that define the angle of the point in question. However, there appeared to be a hard boundary around 90°, such that any angle of 90° or less was perceived as a vertex regardless of the spatial relations of ancillary dots. We hypothesized that dot arrays whose triplets were perceived as smooth curves would be more readily perceived as a unitary object because they can be encoded more economically. In Experiment 2, we generated dot arrays with and without such “vertex triplets” and compared participants’ phenomenological reports of a unified shape with smooth curves vs. shapes with angular corners. Observers gave higher shape ratings for dot arrays from curvilinear shapes. In Experiment 3, we tested shape encoding using a mental rotation task. Participants judged whether two dot arrays were the same or different at five angular differences. Subjects responded reliably faster for displays without vertex triplets, suggesting economical encoding of smooth displays. We followed this up in Experiment 4 using a visual search task. Shapes with and without vertex triplets were embedded in arrays with 25 distractor dots. In a 2IFC paradigm, participants were asked to detect which display contained a shape, with the alternative display containing only random dots. Performance was better when the dots were sampled from a smooth shape than when they were sampled from a shape with vertex triplets.

These results suggest that the visual system processes dot arrangements as coherent shapes automatically using precise smoothness constraints. This ability may be a consequence of processes that extract curvature in defining object shape and is consistent with recent theory and evidence suggesting that 2D contour representations are composed of constant curvature primitives.
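The 90° boundary reported in Experiment 1 suggests a simple geometric test; the following hypothetical sketch computes the angle at the middle dot of each consecutive triplet and flags vertex triplets (the threshold and coordinates are illustrative, not the stimuli used in the study).

```python
import numpy as np

def interior_angle(p_prev, p, p_next) -> float:
    """Angle (degrees) at dot p formed by its two neighbors."""
    v1 = np.asarray(p_prev, dtype=float) - np.asarray(p, dtype=float)
    v2 = np.asarray(p_next, dtype=float) - np.asarray(p, dtype=float)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

def has_vertex_triplet(dots, threshold_deg=90.0) -> bool:
    """True if any consecutive triplet turns sharply enough to read as a corner."""
    return any(
        interior_angle(dots[i - 1], dots[i], dots[i + 1]) <= threshold_deg
        for i in range(1, len(dots) - 1)
    )

# Example: a right-angle corner is flagged, a shallow bend is not.
print(has_vertex_triplet([(0, 0), (1, 0), (1, 1)]))       # True
print(has_vertex_triplet([(0, 0), (1, 0.1), (2, 0.3)]))   # False
```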
{"title":"Shape from dots: a window into abstraction processes in visual perception","authors":"Nicholas Baker, P. Kellman","doi":"10.3389/fcomp.2024.1367534","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1367534","url":null,"abstract":"A remarkable phenomenon in perception is that the visual system spontaneously organizes sets of discrete elements into abstract shape representations. We studied perceptual performance with dot displays to discover what spatial relationships support shape perception.In Experiment 1, we tested conditions that lead dot arrays to be perceived as smooth contours vs. having vertices. We found that the perception of a smooth contour vs. a vertex was influenced by spatial relations between dots beyond the three points that define the angle of the point in question. However, there appeared to be a hard boundary around 90° such that any angle 90° or less was perceived as a vertex regardless of the spatial relations of ancillary dots. We hypothesized that dot arrays whose triplets were perceived as smooth curves would be more readily perceived as a unitary object because they can be encoded more economically. In Experiment 2, we generated dot arrays with and without such “vertex triplets” and compared participants’ phenomenological reports of a unified shape with smooth curves vs. shapes with angular corners. Observers gave higher shape ratings for dot arrays from curvilinear shapes. In Experiment 3, we tested shape encoding using a mental rotation task. Participants judged whether two dot arrays were the same or different at five angular differences. Subjects responded reliably faster for displays without vertex triplets, suggesting economical encoding of smooth displays. We followed this up in Experiment 4 using a visual search task. Shapes with and without vertex triplets were embedded in arrays with 25 distractor dots. Participants were asked to detect which display in a 2IFC paradigm contained a shape against a distractor with random dots. Performance was better when the dots were sampled from a smooth shape than when they were sampled from a shape with vertex triplets.These results suggest that the visual system processes dot arrangements as coherent shapes automatically using precise smoothness constraints. This ability may be a consequence of processes that extract curvature in defining object shape and is consistent with recent theory and evidence suggesting that 2D contour representations are composed of constant curvature primitives.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141127464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A magnetometer-based method for in-situ syncing of wearable inertial measurement units
T. Gilbert, Zexiao Lin, Sally Day, Antonia Hamilton, Jamie A. Ward
Pub Date: 2024-04-19 | DOI: 10.3389/fcomp.2024.1385392
This paper presents a novel method to synchronize multiple wireless inertial measurement unit (IMU) sensors using their onboard magnetometers. The basic method uses an external electromagnetic pulse to create a known event that is measured by the magnetometers of multiple IMUs and, in turn, is used to synchronize the devices. An initial evaluation using four commercial IMUs reveals a maximum error of 40 ms per hour, as limited by a 25 Hz sample rate. Building on this, we introduce a novel method to improve synchronization beyond the limitations imposed by the sample rate and evaluate it in a further study using eight IMUs. We show that a sequence of electromagnetic pulses, in total lasting under 3 s, can reduce the maximum synchronization error to 8 ms (at a 25 Hz sample rate, accounting for the transient response time of the magnetic field generator). An advantage of this method is that it can be applied to several devices, either simultaneously or individually, without the need to remove them from the context in which they are being used. This makes the approach particularly suited to synchronizing multi-person on-body sensors while they are being worn.
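A minimal sketch of the basic alignment idea, detecting the pulse onset in each IMU's magnetometer magnitude and converting the index difference into a time offset, is given below (the threshold, sample rate, and array layout are assumptions, not the authors' code).

```python
import numpy as np

def pulse_onset_index(mag_xyz: np.ndarray, threshold_factor: float = 5.0) -> int:
    """Index of the first magnetometer sample that jumps well above baseline.

    mag_xyz: (n_samples, 3) magnetometer readings from one IMU. The external
    electromagnetic pulse appears as a large deviation of the field magnitude
    from its resting median.
    """
    magnitude = np.linalg.norm(mag_xyz, axis=1)
    baseline = np.median(magnitude)
    spread = np.median(np.abs(magnitude - baseline)) + 1e-9
    above = np.nonzero(np.abs(magnitude - baseline) > threshold_factor * spread)[0]
    return int(above[0]) if above.size else -1

def offset_seconds(mag_a: np.ndarray, mag_b: np.ndarray, fs_hz: float = 25.0) -> float:
    """Time offset of IMU B relative to IMU A implied by the shared pulse event."""
    return (pulse_onset_index(mag_b) - pulse_onset_index(mag_a)) / fs_hz
```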
{"title":"A magnetometer-based method for in-situ syncing of wearable inertial measurement units","authors":"T. Gilbert, Zexiao Lin, Sally Day, Antonia Hamilton, Jamie A. Ward","doi":"10.3389/fcomp.2024.1385392","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1385392","url":null,"abstract":"This paper presents a novel method to synchronize multiple wireless inertial measurement unit sensors (IMU) using their onboard magnetometers. The basic method uses an external electromagnetic pulse to create a known event measured by the magnetometer of multiple IMUs and in turn uses this to synchronize the devices. An initial evaluation using four commercial IMUs reveals a maximum error of 40 ms per hour as limited by a 25 Hz sample rate. Building on this we introduce a novel method to improve synchronization beyond the limitations imposed by the sample rate and evaluate this in a further study using 8 IMUs. We show that a sequence of electromagnetic pulses, in total lasting <3-s, can reduce the maximum synchronization error to 8 ms (for 25 Hz sample rate, and accounting for the transient response time of the magnetic field generator). An advantage of this method is that it can be applied to several devices, either simultaneously or individually, without the need to remove them from the context in which they are being used. This makes the approach particularly suited to synchronizing multi-person on-body sensors while they are being worn.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140684614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Top-down and bottom-up approaches to video quality of experience studies; overview and proposal of a new model
Kamil Koniuch, Sabina Baraković, J. Husić, Sruti Subramanian, Katrien De Moor, Lucjan Janowski, Michał Wierzchoń
Pub Date: 2024-04-15 | DOI: 10.3389/fcomp.2024.1305670
Modern video streaming services require quality assurance of the presented audiovisual material. Quality assurance mechanisms allow streaming platforms to provide quality levels that are considered sufficient to yield user satisfaction with the least possible amount of data transferred. A variety of measures and approaches have been developed to control video quality, e.g., by adapting it to network conditions. These include objective quality metrics and thresholds identified by means of subjective perceptual judgments. The former group of metrics has recently gained the attention of (multi)media researchers, who call this area of study “Quality of Experience” (QoE). In this paper, we present a theoretical model based on a review of previous QoE models. We argue that most of them represent a bottom-up approach to modeling. Such models focus on describing as many variables as possible but have a limited ability to investigate the causal relationships between them; therefore, the applicability of their findings in practice is limited. To advance the field, we therefore propose a structural, top-down model of video QoE that describes causal relationships among variables. This novel top-down model serves as a practical guide for structuring QoE experiments, ensuring the incorporation of influential factors in a confirmatory manner.
{"title":"Top-down and bottom-up approaches to video quality of experience studies; overview and proposal of a new model","authors":"Kamil Koniuch, Sabina Baraković, J. Husić, Sruti Subramanian, Katrien De Moor, Lucjan Janowski, Michał Wierzchoń","doi":"10.3389/fcomp.2024.1305670","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1305670","url":null,"abstract":"Modern video streaming services require quality assurance of the presented audiovisual material. Quality assurance mechanisms allow streaming platforms to provide quality levels that are considered sufficient to yield user satisfaction, with the least possible amount of data transferred. A variety of measures and approaches have been developed to control video quality, e.g., by adapting it to network conditions. These include objective matrices of the quality and thresholds identified by means of subjective perceptual judgments. The former group of matrices has recently gained the attention of (multi) media researchers. They call this area of study “Quality of Experience” (QoE). In this paper, we present a theoretical model based on review of previous QoE’s models. We argue that most of them represent the bottom-up approach to modeling. Such models focus on describing as many variables as possible, but with a limited ability to investigate the causal relationship between them; therefore, the applicability of the findings in practice is limited. To advance the field, we therefore propose a structural, top-down model of video QoE that describes causal relationships among variables. This novel top-down model serves as a practical guide for structuring QoE experiments, ensuring the incorporation of influential factors in a confirmatory manner.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140702611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inclusive gaming through AI: a perspective for identifying opportunities and obstacles through co-design with people living with MND
Natasha Dwyer, Matthew Harrison, Ben O’Mara, Kirsten Harley
Pub Date: 2024-04-10 | DOI: 10.3389/fcomp.2024.1379559
This interdisciplinary research initiative seeks to enhance the accessibility of video gaming for individuals living with Motor Neurone Disease (MND), a condition characterized by progressive muscle weakness. Gaming serves as a social and recreational outlet for many, connecting friends, family, and even strangers through collaboration and competition. However, MND’s disease progression, including muscle weakness and paralysis, severely limits the ability to engage in gaming. In this paper, we describe our exploration of AI solutions to improve the accessibility of gaming. We argue that any application of accessible AI must be led by lived experience. Notably, our previous scoping review found that existing academic research into video games for those living with MND largely neglects the experiences of MND patients in the context of video games and AI, which prompted us to address this critical gap.
{"title":"Inclusive gaming through AI: a perspective for identifying opportunities and obstacles through co-design with people living with MND","authors":"Natasha Dwyer, Matthew Harrison, Ben O’Mara, Kirsten Harley","doi":"10.3389/fcomp.2024.1379559","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1379559","url":null,"abstract":"This interdisciplinary research initiative seeks to enhance the accessibility of video gaming for individuals living with Motor Neurone Disease (MND), a condition characterized by progressive muscle weakness. Gaming serves as a social and recreational outlet for many, connecting friends, family, and even strangers through collaboration and competition. However, MND’s disease progression, including muscle weakness and paralysis, severely limit the ability to engage in gaming. In this paper, we desscribe our exploration of AI solutions to improve accessibility to gaming. We argue that any application of accessible AI must be led by lived experience. Notably, we found in our previous scoping review, existing academic research into video games for those living with MND largely neglects the experiences of MND patients in the context of video games and AI, which was a prompt for us to address this critical gap.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140718517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the robustness of multimodal task load estimation models
Andreas Foltyn, J. Deuschel, Nadine R. Lang-Richter, Nina Holzer, Maximilian P. Oppelt
Pub Date: 2024-04-10 | DOI: 10.3389/fcomp.2024.1371181
Numerous studies have focused on constructing multimodal machine learning models for estimating a person's cognitive load. However, a prevalent limitation is that these models are typically evaluated on data from the same scenario they were trained on. Little attention has been given to their robustness against data distribution shifts, which may occur during deployment. The aim of this paper is to investigate the performance of these models when confronted with a scenario different from the one on which they were trained. For this evaluation, we utilized a dataset encompassing two distinct scenarios: an n-Back test and a driving simulation. We selected a variety of classic machine learning and deep learning architectures, which were further complemented by various fusion techniques. The models were trained on data from the n-Back task and tested on both scenarios to evaluate their predictive performance. However, predictive performance alone does not make a model trustworthy. Therefore, we also examined the models' uncertainty estimates; by leveraging these estimates, misclassification can be reduced by resorting to alternative measures in situations of high uncertainty. The findings indicate that late fusion produces stable classification results across the examined models for both scenarios, enhancing robustness compared to feature-based fusion methods. Although a simple logistic regression tends to provide the best predictive performance for n-Back, this is not always the case if the data distribution is shifted. Finally, the predictive performance of individual modalities differs significantly between the two scenarios. This research provides insights into the capabilities and limitations of multimodal machine learning models in handling distribution shifts and identifies which approaches may be suitable for achieving robust results.
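A minimal sketch of late fusion with an entropy-based abstention rule of the kind described above (the modalities, equal weighting, and threshold are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def late_fusion(prob_per_modality: list) -> np.ndarray:
    """Average per-modality class probabilities (each of shape (n_classes,))."""
    return np.mean(np.stack(prob_per_modality), axis=0)

def predict_or_abstain(prob_per_modality, entropy_threshold: float = 0.6):
    """Return the fused class index, or None when the fused prediction is too uncertain."""
    fused = late_fusion(prob_per_modality)
    entropy = -np.sum(fused * np.log(fused + 1e-12))
    return None if entropy > entropy_threshold else int(np.argmax(fused))

# Example: two modality classifiers disagree -> abstain; agree -> predict class 0.
print(predict_or_abstain([np.array([0.9, 0.1]), np.array([0.2, 0.8])]))  # None
print(predict_or_abstain([np.array([0.9, 0.1]), np.array([0.8, 0.2])]))  # 0
```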
{"title":"Evaluating the robustness of multimodal task load estimation models","authors":"Andreas Foltyn, J. Deuschel, Nadine R. Lang-Richter, Nina Holzer, Maximilian P. Oppelt","doi":"10.3389/fcomp.2024.1371181","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1371181","url":null,"abstract":"Numerous studies have focused on constructing multimodal machine learning models for estimating a person's cognitive load. However, a prevalent limitation is that these models are typically evaluated on data from the same scenario they were trained on. Little attention has been given to their robustness against data distribution shifts, which may occur during deployment. The aim of this paper is to investigate the performance of these models when confronted with a scenario different from the one on which they were trained. For this evaluation, we utilized a dataset encompassing two distinct scenarios: an n-Back test and a driving simulation. We selected a variety of classic machine learning and deep learning architectures, which were further complemented by various fusion techniques. The models were trained on the data from the n-Back task and tested on both scenarios to evaluate their predictive performance. However, the predictive performance alone may not lead to a trustworthy model. Therefore, we looked at the uncertainty estimates of these models. By leveraging these estimates, we can reduce misclassification by resorting to alternative measures in situations of high uncertainty. The findings indicate that late fusion produces stable classification results across the examined models for both scenarios, enhancing robustness compared to feature-based fusion methods. Although a simple logistic regression tends to provide the best predictive performance for n-Back, this is not always the case if the data distribution is shifted. Finally, the predictive performance of individual modalities differs significantly between the two scenarios. This research provides insights into the capabilities and limitations of multimodal machine learning models in handling distribution shifts and identifies which approaches may potentially be suitable for achieving robust results.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140718341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models
Minxiao Wang, Ning Yang
Pub Date: 2024-04-09 | DOI: 10.3389/fcomp.2024.1304687
Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, DL-based emotion recognition is challenged by the cost of collecting and annotating large-scale, high-quality human emotion data and by unbalanced performance across data modalities. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models such as Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text-based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, in addition to leveraging CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances accuracy and offers great adaptability for real-world applications. The evaluation results on the MELD dataset highlight the outstanding performance of our methods, surpassing the majority of existing video- and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video- and audio-based Emotion Recognition and Classification (ERC).
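The joint few-shot classifier is not detailed in the abstract; one common realization is a nearest-prototype classifier over fused embeddings, sketched below under that assumption (the embedding dimensions, labels, and cosine metric are illustrative, not the EmoAsst design).

```python
import numpy as np

def build_prototypes(support_embeddings: np.ndarray, support_labels: np.ndarray) -> dict:
    """Mean embedding per emotion class from a handful of labeled examples."""
    return {
        label: support_embeddings[support_labels == label].mean(axis=0)
        for label in np.unique(support_labels)
    }

def classify(query: np.ndarray, prototypes: dict):
    """Assign the query embedding to the nearest class prototype (cosine similarity)."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))

# Illustrative 2-d "fused" embeddings, two support examples per class.
support = np.array([[1.0, 0.1], [0.9, 0.0],    # joy-like region
                    [0.0, 1.0], [0.1, 0.9]])   # anger-like region
labels = np.array(["joy", "joy", "anger", "anger"])
protos = build_prototypes(support, labels)
print(classify(np.array([0.8, 0.2]), protos))  # "joy"
```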
{"title":"EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models","authors":"Minxiao Wang, Ning Yang","doi":"10.3389/fcomp.2024.1304687","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1304687","url":null,"abstract":"Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, the cost of collecting and annotating large-scale high-quality human emotion data and the phenomenon of unbalanced performance on different modalities of data challenge DL-based emotion recognition. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models like Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text- based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, in addition to leveraging CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances the accuracy and offers great adaptability for real-world applications. The evaluation results on the MELD dataset highlight the outstanding performance of our methods, surpassing the majority of existing video and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video and audio-based Emotion Recognition and Classification (ERC).","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140727072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Psychological profiling of hackers via machine learning toward sustainable cybersecurity
Umema Hani, Osama Sohaib, Khalid Khan, Asma Aleidi, Noman Islam
Pub Date: 2024-04-08 | DOI: 10.3389/fcomp.2024.1381351
This research addresses the challenge of building a hacker classification framework based on the “big five personality traits” (OCEAN) model and explores associations between personality traits and hacker types. The method's predictive performance was evaluated in two groups: students with hacking experience who intend to pursue information security and ethical hacking, and industry professionals who work as White Hat hackers. These professionals were further categorized based on their behavioral tendencies, incorporating Gray Hat traits. The k-means algorithm was used to analyze intra-cluster dependencies, elucidating variations within different clusters and their correlation with Hat types. The study achieved 88% accuracy in mapping clusters to Hat types, effectively identifying cyber-criminal behaviors. Ethical considerations regarding privacy and bias in personality profiling methodologies within cybersecurity are discussed, emphasizing the importance of informed consent, transparency, and accountability in data management practices. Furthermore, the research underscores the need for sustainable cybersecurity practices, integrating environmental and societal impacts into security frameworks. This study aims to advance responsible cybersecurity practices by promoting awareness and ethical considerations and by prioritizing privacy, equity, and sustainability principles.
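A hypothetical sketch of the clustering step, grouping five-dimensional OCEAN score vectors with k-means and cross-tabulating clusters against reported hat types (the scores, number of clusters, and labels are invented for illustration, not the study's data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: one respondent's OCEAN scores
# (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism), scaled 0-1.
ocean_scores = np.array([
    [0.9, 0.4, 0.3, 0.8, 0.2],
    [0.8, 0.5, 0.2, 0.7, 0.3],
    [0.3, 0.9, 0.6, 0.4, 0.1],
    [0.2, 0.8, 0.7, 0.3, 0.2],
    [0.7, 0.2, 0.9, 0.2, 0.8],
    [0.6, 0.3, 0.8, 0.1, 0.9],
])
reported_hats = ["white", "white", "white", "white", "gray", "gray"]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ocean_scores)

# Cross-tabulate cluster assignments against self-reported hat type to see
# which personality clusters align with which behavioral tendencies.
for cluster_id in range(kmeans.n_clusters):
    members = [hat for hat, c in zip(reported_hats, kmeans.labels_) if c == cluster_id]
    print(cluster_id, members)
```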
{"title":"Psychological profiling of hackers via machine learning toward sustainable cybersecurity","authors":"Umema Hani, Osama Sohaib, Khalid Khan, Asma Aleidi, Noman Islam","doi":"10.3389/fcomp.2024.1381351","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1381351","url":null,"abstract":"This research addresses a challenge of the hacker classification framework based on the “big five personality traits” model (OCEAN) and explores associations between personality traits and hacker types. The method's application prediction performance was evaluated in two groups: Students with hacking experience who intend to pursue information security and ethical hacking and industry professionals who work as White Hat hackers. These professionals were further categorized based on their behavioral tendencies, incorporating Gray Hat traits. The k-means algorithm analyzed intra-cluster dependencies, elucidating variations within different clusters and their correlation with Hat types. The study achieved an 88% accuracy in mapping clusters with Hat types, effectively identifying cyber-criminal behaviors. Ethical considerations regarding privacy and bias in personality profiling methodologies within cybersecurity are discussed, emphasizing the importance of informed consent, transparency, and accountability in data management practices. Furthermore, the research underscores the need for sustainable cybersecurity practices, integrating environmental and societal impacts into security frameworks. This study aims to advance responsible cybersecurity practices by promoting awareness and ethical considerations and prioritizing privacy, equity, and sustainability principles.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140731330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A comprehensive evaluation of marker-based, markerless methods for loose garment scenarios in varying camera configurations
Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, P. Lukowicz
Pub Date: 2024-04-05 | DOI: 10.3389/fcomp.2024.1379925
In support of smart wearable researchers striving to select optimal ground truth methods for motion capture across a spectrum of loose garment types, we present an extended benchmark named DrapeMoCapBench (DMCB+). This augmented benchmark incorporates a more intricate limb-wise Motion Capture (MoCap) accuracy analysis and an enhanced drape calculation, and introduces a novel benchmarking tool that encompasses multicamera deep learning MoCap methods. DMCB+ is specifically designed to evaluate the performance of both optical marker-based and markerless MoCap techniques, taking into account the challenges posed by various loose garment types. While high-cost marker-based systems are acknowledged for their precision, they often require skin-tight markers on bony areas, which can be impractical with loose garments. On the other hand, markerless MoCap methods driven by computer vision models have evolved to be more cost-effective, utilizing smartphone cameras and exhibiting promising results. Utilizing real-world MoCap datasets, DMCB+ conducts 3D physics simulations with a comprehensive set of variables, including six drape levels, three motion intensities, and six body-gender combinations. The extended benchmark provides a nuanced analysis of advanced marker-based and markerless MoCap techniques, highlighting their strengths and weaknesses across distinct scenarios. In particular, DMCB+ reveals that when evaluating casual loose garments, both marker-based and markerless methods exhibit notable performance degradation (>10 cm). However, in scenarios involving everyday activities with basic and swift motions, markerless MoCap outperforms marker-based alternatives. This positions markerless MoCap as an advantageous and economical choice for wearable studies. The inclusion of a multicamera deep learning MoCap method in the benchmarking tool further expands the scope, allowing researchers to assess the capabilities of cutting-edge technologies in diverse motion capture scenarios.
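The limb-wise accuracy analysis amounts to comparing estimated and reference joint trajectories per body segment; a small sketch of a per-joint mean position error under that assumption (the joint layout and data are illustrative, not the benchmark's evaluation code):

```python
import numpy as np

def per_joint_error_cm(estimated: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Mean Euclidean error per joint in centimeters.

    estimated, reference: (n_frames, n_joints, 3) joint positions in meters.
    """
    return np.linalg.norm(estimated - reference, axis=-1).mean(axis=0) * 100.0

# Illustrative: 2 frames, 3 joints; a constant 5 cm offset on the third joint only.
reference = np.zeros((2, 3, 3))
estimated = reference.copy()
estimated[:, 2, 0] += 0.05                       # third joint shifted 5 cm along x
print(per_joint_error_cm(estimated, reference))  # [0. 0. 5.]
```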
{"title":"A comprehensive evaluation of marker-based, markerless methods for loose garment scenarios in varying camera configurations","authors":"Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, P. Lukowicz","doi":"10.3389/fcomp.2024.1379925","DOIUrl":"https://doi.org/10.3389/fcomp.2024.1379925","url":null,"abstract":"In support of smart wearable researchers striving to select optimal ground truth methods for motion capture across a spectrum of loose garment types, we present an extended benchmark named DrapeMoCapBench (DMCB+). This augmented benchmark incorporates a more intricate limb-wise Motion Capture (MoCap) accuracy analysis, and enhanced drape calculation, and introduces a novel benchmarking tool that encompasses multicamera deep learning MoCap methods. DMCB+ is specifically designed to evaluate the performance of both optical marker-based and markerless MoCap techniques, taking into account the challenges posed by various loose garment types. While high-cost marker-based systems are acknowledged for their precision, they often require skin-tight markers on bony areas, which can be impractical with loose garments. On the other hand, markerless MoCap methods driven by computer vision models have evolved to be more cost-effective, utilizing smartphone cameras and exhibiting promising results. Utilizing real-world MoCap datasets, DMCB+ conducts 3D physics simulations with a comprehensive set of variables, including six drape levels, three motion intensities, and six body-gender combinations. The extended benchmark provides a nuanced analysis of advanced marker-based and markerless MoCap techniques, highlighting their strengths and weaknesses across distinct scenarios. In particular, DMCB+ reveals that when evaluating casual loose garments, both marker-based and markerless methods exhibit notable performance degradation (>10 cm). However, in scenarios involving everyday activities with basic and swift motions, markerless MoCap outperforms marker-based alternatives. This positions markerless MoCap as an advantageous and economical choice for wearable studies. The inclusion of a multicamera deep learning MoCap method in the benchmarking tool further expands the scope, allowing researchers to assess the capabilities of cutting-edge technologies in diverse motion capture scenarios.","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140736197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}