Pub Date : 2024-05-16 DOI: 10.3389/fdata.2024.1353469
Development and application of a machine learning-based predictive model for obstructive sleep apnea screening
Kang Liu, Shi Geng, Ping Shen, Lei Zhao, Peng Zhou, Wen Liu
To develop a robust machine learning prediction model for the automatic screening and diagnosis of obstructive sleep apnea (OSA) using five advanced algorithms, namely Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF), to provide substantial support for early clinical diagnosis and intervention. We conducted a retrospective analysis of clinical data from 439 patients who underwent polysomnography at the Affiliated Hospital of Xuzhou Medical University between October 2019 and October 2022. Predictor variables included demographic information [age, sex, height, weight, body mass index (BMI)], medical history, and the Epworth Sleepiness Scale (ESS). Univariate analysis was used to identify variables with significant differences, and the dataset was then divided into training and validation sets in a 4:1 ratio. The training set was used to build a model predicting OSA severity grading, and the validation set was used to assess model performance by the area under the curve (AUC). Additionally, a separate analysis was conducted that compared the normal population as one group against patients with moderate-to-severe OSA as another. The same univariate analysis was applied, the dataset was again divided into training and validation sets in a 4:1 ratio, the training set was used to build a prediction model for screening moderate-to-severe OSA, and the validation set was used to verify the model's performance. In the four-group severity model, the LightGBM model outperformed the others; its top five features by importance were ESS total score, BMI, sex, hypertension, and gastroesophageal reflux disease (GERD), with age, ESS total score, and BMI playing the most significant roles. In the dichotomous model, RF was the best performer of the five models; its top five features by importance were ESS total score, BMI, GERD, age, and dry mouth, with ESS total score and BMI being particularly pivotal. Machine learning-based prediction models for OSA grading and screening prove instrumental in the early identification of patients with moderate-to-severe OSA, revealing pertinent risk factors and facilitating timely interventions to counter the pathological changes induced by OSA. Notably, ESS total score and BMI emerge as the most critical features for predicting OSA, underscoring their significance in clinical assessments. The dataset will be made publicly available on GitHub.
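The following sketch illustrates the kind of pipeline the abstract describes: a stratified 4:1 split, the five candidate classifiers, AUC comparison on the validation set, and a feature-importance readout. It is not the authors' code; the file name, column names, and the binary screening label are hypothetical placeholders, and categorical fields are assumed to be already numerically encoded. It assumes scikit-learn, xgboost, and lightgbm are installed.

```python
# Minimal sketch of the screening pipeline described in the abstract:
# 4:1 train/validation split, five candidate classifiers, AUC comparison.
# CSV path, column names, and the target label are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

df = pd.read_csv("osa_clinical_data.csv")  # hypothetical file
features = ["age", "sex", "height", "weight", "bmi", "ess_total",
            "hypertension", "gerd", "dry_mouth"]
X, y = df[features], df["moderate_to_severe_osa"]  # binary screening target

# 4:1 split, as in the study
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "LightGBM": LGBMClassifier(),
    "RF": RandomForestClassifier(n_estimators=300),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")

# Top-five feature importances from the RF model, as reported in the study
rf = models["RF"]
print(sorted(zip(features, rf.feature_importances_),
             key=lambda t: t[1], reverse=True)[:5])
```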
{"title":"Development and application of a machine learning-based predictive model for obstructive sleep apnea screening","authors":"Kang Liu, Shi Geng, Ping Shen, Lei Zhao, Peng Zhou, Wen Liu","doi":"10.3389/fdata.2024.1353469","DOIUrl":"https://doi.org/10.3389/fdata.2024.1353469","url":null,"abstract":"To develop a robust machine learning prediction model for the automatic screening and diagnosis of obstructive sleep apnea (OSA) using five advanced algorithms, namely Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF) to provide substantial support for early clinical diagnosis and intervention.We conducted a retrospective analysis of clinical data from 439 patients who underwent polysomnography at the Affiliated Hospital of Xuzhou Medical University between October 2019 and October 2022. Predictor variables such as demographic information [age, sex, height, weight, body mass index (BMI)], medical history, and Epworth Sleepiness Scale (ESS) were used. Univariate analysis was used to identify variables with significant differences, and the dataset was then divided into training and validation sets in a 4:1 ratio. The training set was established to predict OSA severity grading. The validation set was used to assess model performance using the area under the curve (AUC). Additionally, a separate analysis was conducted, categorizing the normal population as one group and patients with moderate-to-severe OSA as another. The same univariate analysis was applied, and the dataset was divided into training and validation sets in a 4:1 ratio. The training set was used to build a prediction model for screening moderate-to-severe OSA, while the validation set was used to verify the model's performance.Among the four groups, the LightGBM model outperformed others, with the top five feature importance rankings of ESS total score, BMI, sex, hypertension, and gastroesophageal reflux (GERD), where Age, ESS total score and BMI played the most significant roles. In the dichotomous model, RF is the best performer of the five models respectively. The top five ranked feature importance of the best-performing RF models were ESS total score, BMI, GERD, age and Dry mouth, with ESS total score and BMI being particularly pivotal.Machine learning-based prediction models for OSA disease grading and screening prove instrumental in the early identification of patients with moderate-to-severe OSA, revealing pertinent risk factors and facilitating timely interventions to counter pathological changes induced by OSA. Notably, ESS total score and BMI emerge as the most critical features for predicting OSA, emphasizing their significance in clinical assessments. The dataset will be publicly available on my Github.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141127186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-15 DOI: 10.3389/fdata.2024.1384240
Tradescantia response to air and soil pollution, stamen hair cells dataset and ANN color classification
Leatrice Talita Rodrigues, Barbara Sanches Antunes Goeldner, Emílio Graciliano Ferreira Mercuri, S. M. Noe
The Tradescantia plant is a complex system that is sensitive to environmental factors such as water supply, pH, temperature, light, radiation, impurities, and nutrient availability. It can be used as a biomonitor for environmental changes; however, the bioassays are time-consuming and involve a strong human-interference factor, so results may vary depending on who performs the analysis. We have developed computer vision models to study color variations in stamen hair cells of Tradescantia clone 4430, which can be stressed by air pollution and soil contamination. The study introduces a novel dataset, Trad-204, comprising single-cell images from Tradescantia clone 4430 captured during the Tradescantia stamen-hair mutation bioassay (Trad-SHM). The dataset contains images from two experiments, one focusing on air pollution by particulate matter and another on soil contaminated by diesel oil. Both experiments were carried out in Curitiba, Brazil, between 2020 and 2023. The images represent single cells with different shapes, sizes, and colors, reflecting the plant's responses to environmental stressors. An automatic classification task was developed to distinguish between blue and pink cells, and the study explores both a baseline model and three artificial neural network (ANN) architectures, namely TinyVGG, VGG-16, and ResNet34. Tradescantia revealed sensitivity to both airborne particulate matter concentration and diesel oil in soil. The results indicate that the Residual Network architecture outperforms the other models in terms of accuracy on both the training and testing sets. The dataset and findings contribute to the understanding of plant cell responses to environmental stress and provide valuable resources for further research in automated image analysis of plant cells. The discussion highlights the impact of turgor pressure on cell shape and the potential implications for plant physiology. The comparison between ANN architectures aligns with previous research, emphasizing the superior performance of ResNet models in image classification tasks. Artificial intelligence identification of pink cells improves counting accuracy, avoiding human errors due to different color perceptions, fatigue, or inattention, in addition to facilitating and speeding up the analysis process. Overall, the study offers insights into plant cell dynamics and provides a foundation for future investigations such as changes in cell morphology.
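As a rough illustration of the strongest of the three ANN architectures compared above, the sketch below fine-tunes a ResNet34 for the blue-vs-pink cell task. The directory layout is a hypothetical stand-in for the Trad-204 dataset and the hyperparameters are arbitrary; this is a generic PyTorch/torchvision recipe, not the authors' training code.

```python
# Sketch: binary blue-vs-pink cell classifier on a ResNet34 backbone.
# "cells/train/blue" and "cells/train/pink" are placeholder folders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("cells/train", transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

model = models.resnet34(weights="IMAGENET1K_V1")   # pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # blue vs. pink head

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):                             # arbitrary epoch count
    for images, labels in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```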
{"title":"Tradescantia response to air and soil pollution, stamen hair cells dataset and ANN color classification","authors":"Leatrice Talita Rodrigues, Barbara Sanches Antunes Goeldner, Emílio Graciliano Ferreira Mercuri, S. M. Noe","doi":"10.3389/fdata.2024.1384240","DOIUrl":"https://doi.org/10.3389/fdata.2024.1384240","url":null,"abstract":"Tradescantia plant is a complex system that is sensible to environmental factors such as water supply, pH, temperature, light, radiation, impurities, and nutrient availability. It can be used as a biomonitor for environmental changes; however, the bioassays are time-consuming and have a strong human interference factor that might change the result depending on who is performing the analysis. We have developed computer vision models to study color variations from Tradescantia clone 4430 plant stamen hair cells, which can be stressed due to air pollution and soil contamination. The study introduces a novel dataset, Trad-204, comprising single-cell images from Tradescantia clone 4430, captured during the Tradescantia stamen-hair mutation bioassay (Trad-SHM). The dataset contain images from two experiments, one focusing on air pollution by particulate matter and another based on soil contaminated by diesel oil. Both experiments were carried out in Curitiba, Brazil, between 2020 and 2023. The images represent single cells with different shapes, sizes, and colors, reflecting the plant's responses to environmental stressors. An automatic classification task was developed to distinguishing between blue and pink cells, and the study explores both a baseline model and three artificial neural network (ANN) architectures, namely, TinyVGG, VGG-16, and ResNet34. Tradescantia revealed sensibility to both air particulate matter concentration and diesel oil in soil. The results indicate that Residual Network architecture outperforms the other models in terms of accuracy on both training and testing sets. The dataset and findings contribute to the understanding of plant cell responses to environmental stress and provide valuable resources for further research in automated image analysis of plant cells. Discussion highlights the impact of turgor pressure on cell shape and the potential implications for plant physiology. The comparison between ANN architectures aligns with previous research, emphasizing the superior performance of ResNet models in image classification tasks. Artificial intelligence identification of pink cells improves the counting accuracy, thus avoiding human errors due to different color perceptions, fatigue, or inattention, in addition to facilitating and speeding up the analysis process. Overall, the study offers insights into plant cell dynamics and provides a foundation for future investigations like cells morphology change. 
This research corroborates that biomonitoring should be considered as an important tool for political actions, being a relevant issue in risk assessment and the development of new public policies relating to the environment.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140971653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-13 DOI: 10.3389/fdata.2024.1386720
A systematic literature review on the impact of AI models on the security of code generation
Claudia Negri-Ribalta, Rémi Geraud-Stewart, Anastasia Sergeeva, Gabriele Lenzini
Artificial Intelligence (AI) is increasingly used as a helper to develop computing programs. While it can boost software development and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software with vulnerabilities. This situation leads to the question: how serious and widespread are the security flaws in code generated using AI models? Through a systematic literature review, this work reviews the state of the art on how AI models impact software security. It systematizes the knowledge about the risks of using AI in coding security-critical software. It reviews which security flaws from well-known weakness lists (e.g., the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works that discuss how vulnerabilities in AI-generated code can be exploited to compromise security, and it lists attempts to improve the security of such AI-generated code. Overall, this work provides a comprehensive and systematic overview of the impact of AI on secure coding. This topic has sparked interest and concern within the software security engineering community. The work highlights the importance of setting up security measures and processes, such as code verification, and notes that such practices could be customized for AI-aided code production.
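To make the kind of weakness at issue concrete, the snippet below shows a generic instance of CWE-89 (SQL injection), one of the MITRE CWE Top 25 entries that reviews of this kind report recurring in AI-generated code, alongside the parameterized fix that code verification would enforce. Both variants are illustrative and not drawn from any particular model's output.

```python
# CWE-89 illustration: string-built SQL vs. a parameterized query.
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # String interpolation lets input like "alice' OR '1'='1" rewrite
    # the query and bypass the filter.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver treats the input strictly as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()
```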
{"title":"A systematic literature review on the impact of AI models on the security of code generation","authors":"Claudia Negri-Ribalta, Rémi Geraud-Stewart, Anastasia Sergeeva, Gabriele Lenzini","doi":"10.3389/fdata.2024.1386720","DOIUrl":"https://doi.org/10.3389/fdata.2024.1386720","url":null,"abstract":"Artificial Intelligence (AI) is increasingly used as a helper to develop computing programs. While it can boost software development and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software with vulnerabilities. This situation leads to the question: How serious and widespread are the security flaws in code generated using AI models?Through a systematic literature review, this work reviews the state of the art on how AI models impact software security. It systematizes the knowledge about the risks of using AI in coding security-critical software.It reviews what security flaws of well-known vulnerabilities (e.g., the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works that discuss how vulnerabilities in AI-generated code can be exploited to compromise security and lists the attempts to improve the security of such AI-generated code.Overall, this work provides a comprehensive and systematic overview of the impact of AI in secure coding. This topic has sparked interest and concern within the software security engineering community. It highlights the importance of setting up security measures and processes, such as code verification, and that such practices could be customized for AI-aided code production.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140985358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-10 DOI: 10.3389/fdata.2024.1188620
Visualization as irritation: producing knowledge about medieval courts through uncertainty
Silke Schwandt, Christian Wachter
Visualizations are ubiquitous in data-driven research, serving as both tools for knowledge production and genuine means of knowledge communication. Despite criticisms targeting the alleged objectivity of visualizations in the digital humanities (DH) and reflections on how they may serve as representations of both scholarly perspective and uncertainty within the data analysis pipeline, there remains a notable scarcity of in-depth theoretical grounding for these assumptions in DH discussions. It is our understanding that only through theoretical foundations such as basic semiotic principles and perspectives on media modality can one fully assess the use and potential of visualizations for innovation in scholarly interpretation. We argue that visualizations have the capacity to “productively irritate” existing scholarly knowledge in a given research field. This does not just mean that visualizations depict patterns in datasets that seem out of line with prior research and thus stimulate deeper examination. Complementarily, “irritation” here consists of visualizations producing uncertainty about their own meaning, yet it is precisely in this uncertainty that the potential for greater insight lies. It stimulates questions about what is depicted and what is not. This turns out to be a valuable resource for scholarly interpretation, and one could argue that visualizing big data is particularly fruitful in this sense because, owing to the data's complexity, researchers cannot interpret it without visual representations. However, we argue that “productive irritation” can also happen below the level of big data. We see this potential rooted in the genuinely semiotic and semantic properties of visual media, which studies in multimodality, and specifically in the field of Bildlinguistik, have carved out: a visualization's holistic overview of data patterns is juxtaposed with its semantic vagueness, which makes way for deep interpretations and multiple perspectives on the data. We elucidate this potential using examples from medieval English legal history. Visualizations of data relating to legal functions and social constellations of various people in court offer surprising insights that can lead to new knowledge through “productive irritation.”
{"title":"Visualization as irritation: producing knowledge about medieval courts through uncertainty","authors":"Silke Schwandt, Christian Wachter","doi":"10.3389/fdata.2024.1188620","DOIUrl":"https://doi.org/10.3389/fdata.2024.1188620","url":null,"abstract":"Visualizations are ubiquitous in data-driven research, serving as both tools for knowledge production and genuine means of knowledge communication. Despite criticisms targeting the alleged objectivity of visualizations in the digital humanities (DH) and reflections on how they may serve as representations of both scholarly perspective and uncertainty within the data analysis pipeline, there remains a notable scarcity of in-depth theoretical grounding for these assumptions in DH discussions. It is our understanding that only through theoretical foundations such as basic semiotic principles and perspectives on media modality one can fully assess the use and potential of visualizations for innovation in scholarly interpretation. We argue that visualizations have the capacity to “productively irritate” existing scholarly knowledge in a given research field. This does not just mean that visualizations depict patterns in datasets that seem not in line with prior research and thus stimulate deeper examination. Complementarily, “irritation” here consists of visualizations producing uncertainty about their own meaning—yet it is precisely this uncertainty in which the potential for greater insight lies. It stimulates questions about what is depicted and what is not. This turns out to be a valuable resource for scholarly interpretation, and one could argue that visualizing big data is particularly prolific in this sense, because due to their complexity researchers cannot interpret the data without visual representations. However, we argue that “productive irritation” can also happen below the level of big data. We see this potential rooted in the genuinely semiotic and semantic properties of visual media, which studies in multimodality and specifically in the field of Bildlinguistik have carved out: a visualization's holistic overview of data patterns is juxtaposed to its semantic vagueness, which gives way to deep interpretations and multiple perspectives on that data. We elucidate this potential using examples from medieval English legal history. Visualizations of data relating to legal functions and social constellations of various people in court offer surprising insights that can lead to new knowledge through “productive irritation.”","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1369895
Forecasting cryptocurrency's buy signal with a bagged tree learning approach to enhance purchase decisions
Raed Alsini, Q. Abu Al-haija, Abdulaziz A. Alsulami, Badraddin Alturki, Abdulaziz A. Alqurashi, M. D. Mashat, Ali Alqahtani, Nawaf Alhebaishi
The cryptocurrency market is captivating the attention of both retail and institutional investors. While this highly volatile market offers investors substantial profit opportunities, it also entails risks due to its sensitivity to speculative news and the erratic behavior of major investors, both of which can provoke unexpected price fluctuations. In this study, we contend that extreme and sudden price changes and atypical patterns might compromise the performance of the technical signals used as the basis for feature extraction in a machine learning-based trading system, by either augmenting or diminishing the model's generalization capability. To address this issue, this research uses a bagged tree (BT) model to forecast the buy signal for the cryptocurrency market; to benefit from it, traders must acquire knowledge about the cryptocurrency market and adjust their strategies accordingly. To support informed decisions, we derived the buy signal from the most prevalently used oscillators, namely the Relative Strength Index (RSI), Bollinger Bands (BB), and the Moving Average Convergence/Divergence (MACD) indicator. The research also evaluates how accurately the model can predict the performance of different cryptocurrencies such as Bitcoin (BTC), Ethereum (ETH), Cardano (ADA), and Binance Coin (BNB), and examines the efficacy of the most popular machine learning models in precisely forecasting outcomes within the cryptocurrency market. Notably, predicting buy-signal values using a BT model yields promising results.
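A minimal sketch of the implied feature pipeline follows: derive RSI, Bollinger Bands, and MACD from a close-price series, then fit a bagged tree on them. The input file and the forward-return label rule are hypothetical stand-ins (the paper's actual buy-signal definition is not reproduced here), and the code assumes pandas and scikit-learn 1.2 or later (for the `estimator` parameter).

```python
# Sketch: compute RSI, Bollinger Bands, and MACD features, then fit a
# bagged tree. CSV path and the "buy" label rule are hypothetical.
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("btc_daily.csv")          # hypothetical OHLCV file
close = df["close"]

# RSI (14-period, simple-average variant)
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
df["rsi"] = 100 - 100 / (1 + gain / loss)

# Bollinger Bands (20-period, 2 standard deviations)
mid = close.rolling(20).mean()
std = close.rolling(20).std()
df["bb_upper"], df["bb_lower"] = mid + 2 * std, mid - 2 * std

# MACD (12/26 EMAs) and its 9-period signal line
ema12 = close.ewm(span=12, adjust=False).mean()
ema26 = close.ewm(span=26, adjust=False).mean()
df["macd"] = ema12 - ema26
df["macd_signal"] = df["macd"].ewm(span=9, adjust=False).mean()

# Hypothetical label: price higher five periods ahead.
df["buy"] = (close.shift(-5) > close).astype(int)
df = df.dropna().iloc[:-5]   # drop rolling warm-up rows and unlabeled tail

X = df[["rsi", "bb_upper", "bb_lower", "macd", "macd_signal"]]
bt = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100)
bt.fit(X, df["buy"])
```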
{"title":"Forecasting cryptocurrency's buy signal with a bagged tree learning approach to enhance purchase decisions","authors":"Raed Alsini, Q. Abu Al-haija, Abdulaziz A. Alsulami, Badraddin Alturki, Abdulaziz A. Alqurashi, M. D. Mashat, Ali Alqahtani, Nawaf Alhebaishi","doi":"10.3389/fdata.2024.1369895","DOIUrl":"https://doi.org/10.3389/fdata.2024.1369895","url":null,"abstract":"The cryptocurrency market is captivating the attention of both retail and institutional investors. While this highly volatile market offers investors substantial profit opportunities, it also entails risks due to its sensitivity to speculative news and the erratic behavior of major investors, both of which can provoke unexpected price fluctuations.In this study, we contend that extreme and sudden price changes and atypical patterns might compromise the performance of technical signals utilized as the basis for feature extraction in a machine learning-based trading system by either augmenting or diminishing the model's generalization capability. To address this issue, this research uses a bagged tree (BT) model to forecast the buy signal for the cryptocurrency market. To achieve this, traders must acquire knowledge about the cryptocurrency market and modify their strategies accordingly.To make an informed decision, we depended on the most prevalently utilized oscillators, namely, the buy signal in the cryptocurrency market, comprising the Relative Strength Index (RSI), Bollinger Bands (BB), and the Moving Average Convergence/Divergence (MACD) indicator. Also, the research evaluates how accurately a model can predict the performance of different cryptocurrencies such as Bitcoin (BTC), Ethereum (ETH), Cardano (ADA), and Binance Coin (BNB). Furthermore, the efficacy of the most popular machine learning model in precisely forecasting outcomes within the cryptocurrency market is examined. Notably, predicting buy signal values using a BT model provides promising results.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140997120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1295009
Multi-modal recommender system for predicting project manager performance within a competency-based framework
Imene Jemal, Wilfried Armand Naoussi Sijou, Belkacem Chikhaoui
The evaluation of performance using competencies within a structured framework holds significant importance across various professional domains, particularly in roles like project manager. Typically, this assessment process, overseen by senior evaluators, involves scoring competencies based on data gathered from interviews, completed forms, and evaluation programs. However, this task is tedious and time-consuming, and it requires the expertise of qualified professionals. Moreover, it is compounded by the inconsistent scoring biases introduced by different evaluators. In this paper, we propose a novel approach to automatically predict competency scores, thereby facilitating the assessment of project managers' performance. Initially, we performed data fusion to compile a comprehensive dataset from various sources and modalities, including demographic data, profile-related data, and historical competency assessments. Subsequently, NLP techniques were used to pre-process the text data. Finally, recommender systems were explored to predict competency scores. We compared four different recommender system approaches: content-based filtering, demographic filtering, collaborative filtering, and hybrid filtering. Using assessment data collected from 38 project managers, encompassing scores across 67 different competencies, we evaluated the performance of each approach. Notably, the content-based approach yielded promising results, achieving a precision rate of 81.03%. Furthermore, we addressed the challenge of cold-starting, which in our context involves predicting scores either for a new project manager lacking competency data or for a newly introduced competency without historical records. Our analysis revealed that demographic filtering achieved an average precision of 54.05% when dealing with new project managers. In contrast, content-based filtering exhibited remarkable performance, achieving a precision of 85.79% in predicting scores for new competencies. These findings underscore the potential of recommender systems in competency assessment, thereby facilitating a more effective performance evaluation process.
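As a hedged illustration of the content-based approach that performed best, the sketch below represents each competency by a TF-IDF vector of a short description and predicts a manager's score on an unseen competency as the similarity-weighted mean of their observed scores. The competency names, descriptions, and scores are invented toy inputs; the authors' actual feature set and pipeline are richer than this.

```python
# Toy content-based filtering for competency scores: predict an unseen
# competency's score from similar, already-scored competencies.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical competency descriptions (the "content" side)
competencies = {
    "risk_management": "identify assess and mitigate project risks",
    "stakeholder_comm": "communicate with stakeholders and report project status",
    "budget_control": "plan track and control the project budget",
}
observed = {"risk_management": 4.0, "budget_control": 3.0}  # one manager

names = list(competencies)
tfidf = TfidfVectorizer().fit_transform(competencies.values())
sim = cosine_similarity(tfidf)                  # competency-competency similarity

target = names.index("stakeholder_comm")        # cold-start competency
idx = [names.index(c) for c in observed]
weights = sim[target, idx]
scores = np.array(list(observed.values()))
pred = (float(weights @ scores / weights.sum())
        if weights.sum() > 0 else float(scores.mean()))
print(f"predicted score: {pred:.2f}")
```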
{"title":"Multi-modal recommender system for predicting project manager performance within a competency-based framework","authors":"Imene Jemal, Wilfried Armand Naoussi Sijou, Belkacem Chikhaoui","doi":"10.3389/fdata.2024.1295009","DOIUrl":"https://doi.org/10.3389/fdata.2024.1295009","url":null,"abstract":"The evaluation of performance using competencies within a structured framework holds significant importance across various professional domains, particularly in roles like project manager. Typically, this assessment process, overseen by senior evaluators, involves scoring competencies based on data gathered from interviews, completed forms, and evaluation programs. However, this task is tedious and time-consuming, and requires the expertise of qualified professionals. Moreover, it is compounded by the inconsistent scoring biases introduced by different evaluators. In this paper, we propose a novel approach to automatically predict competency scores, thereby facilitating the assessment of project managers' performance. Initially, we performed data fusion to compile a comprehensive dataset from various sources and modalities, including demographic data, profile-related data, and historical competency assessments. Subsequently, NLP techniques were used to pre-process text data. Finally, recommender systems were explored to predict competency scores. We compared four different recommender system approaches: content-based filtering, demographic filtering, collaborative filtering, and hybrid filtering. Using assessment data collected from 38 project managers, encompassing scores across 67 different competencies, we evaluated the performance of each approach. Notably, the content-based approach yielded promising results, achieving a precision rate of 81.03%. Furthermore, we addressed the challenge of cold-starting, which in our context involves predicting scores for either a new project manager lacking competency data or a newly introduced competency without historical records. Our analysis revealed that demographic filtering achieved an average precision of 54.05% when dealing with new project managers. In contrast, content-based filtering exhibited remarkable performance, achieving a precision of 85.79% in predicting scores for new competencies. These findings underscore the potential of recommender systems in competency assessment, thereby facilitating more effective performance evaluation process.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140997081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1381163
Challenges and efforts in managing AI trustworthiness risks: a state of knowledge
Nineta Polemi, Isabel Praça, K. Kioskli, Adrien Bécue
This paper addresses critical gaps in existing AI risk management frameworks, emphasizing the neglect of human factors and the absence of metrics for socially related or human threats. Drawing on insights from the NIST AI RMF and ENISA, the research underscores the need to understand the limitations of human-AI interaction and to develop ethical and social measurements. The paper explores various dimensions of trustworthiness, covering legislation, AI cyber threat intelligence, and the characteristics of AI adversaries. It delves into technical threats and vulnerabilities, including data access, poisoning, and backdoors, highlighting the importance of collaboration between cybersecurity engineers, AI experts, and social-psychology-behavior-ethics professionals. Furthermore, the socio-psychological threats associated with AI integration into society are examined, addressing issues such as bias, misinformation, and privacy erosion. The manuscript proposes a comprehensive approach to AI trustworthiness, combining technical and social mitigation measures, standards, and ongoing research initiatives. Additionally, it introduces innovative defense strategies, such as cyber-social exercises, digital clones, and conversational agents, to enhance understanding of adversary profiles and fortify AI security. The paper concludes with a call for interdisciplinary collaboration, awareness campaigns, and continuous research efforts to create a robust and resilient AI ecosystem aligned with ethical standards and societal expectations.
{"title":"Challenges and efforts in managing AI trustworthiness risks: a state of knowledge","authors":"Nineta Polemi, Isabel Praça, K. Kioskli, Adrien Bécue","doi":"10.3389/fdata.2024.1381163","DOIUrl":"https://doi.org/10.3389/fdata.2024.1381163","url":null,"abstract":"This paper addresses the critical gaps in existing AI risk management frameworks, emphasizing the neglect of human factors and the absence of metrics for socially related or human threats. Drawing from insights provided by NIST AI RFM and ENISA, the research underscores the need for understanding the limitations of human-AI interaction and the development of ethical and social measurements. The paper explores various dimensions of trustworthiness, covering legislation, AI cyber threat intelligence, and characteristics of AI adversaries. It delves into technical threats and vulnerabilities, including data access, poisoning, and backdoors, highlighting the importance of collaboration between cybersecurity engineers, AI experts, and social-psychology-behavior-ethics professionals. Furthermore, the socio-psychological threats associated with AI integration into society are examined, addressing issues such as bias, misinformation, and privacy erosion. The manuscript proposes a comprehensive approach to AI trustworthiness, combining technical and social mitigation measures, standards, and ongoing research initiatives. Additionally, it introduces innovative defense strategies, such as cyber-social exercises, digital clones, and conversational agents, to enhance understanding of adversary profiles and fortify AI security. The paper concludes with a call for interdisciplinary collaboration, awareness campaigns, and continuous research efforts to create a robust and resilient AI ecosystem aligned with ethical standards and societal expectations.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140996466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1392662
Quantifying uncertainty in graph neural network explanations
Junji Jiang, Chen Ling, Hongyi Li, Guangji Bai, Xujiang Zhao, Liang Zhao
In recent years, analyzing the explanations for the predictions of Graph Neural Networks (GNNs) has attracted increasing attention. Despite this progress, most existing methods do not adequately consider the inherent uncertainties stemming from the randomness of model parameters and graph data, which may lead to overconfident and misleading explanations. It is challenging for most GNN explanation methods to quantify these uncertainties, since they obtain the prediction explanation in a post-hoc and model-agnostic manner without considering the randomness of graph data and model parameters. To address these problems, this paper proposes a novel uncertainty quantification framework for GNN explanations. To mitigate the randomness of graph data in the explanation, our framework accounts for two distinct data uncertainties, allowing for a direct assessment of the uncertainty in GNN explanations. To mitigate the randomness of learned model parameters, our method learns the parameter distribution directly from the data, obviating the need for assumptions about specific distributions. Moreover, the explanation uncertainty within model parameters is also quantified based on the learned parameter distributions. This holistic approach can integrate with any post-hoc GNN explanation method. Empirical results from our study show that our proposed method sets a new standard for GNN explanation performance across diverse real-world graph benchmarks.
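One generic way to expose the parameter-side uncertainty the abstract targets is Monte Carlo dropout: keep dropout active at inference, recompute a gradient-based node-importance score over many stochastic passes, and report its spread. The sketch below does this on a toy graph with an untrained two-layer GCN; it is a simplified stand-in for the paper's framework, which learns a parameter distribution from data rather than relying on dropout.

```python
# MC-dropout sketch: spread of a gradient-based node-importance score
# across stochastic forward passes approximates explanation uncertainty.
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    def __init__(self, in_dim, hid, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid)
        self.w2 = nn.Linear(hid, n_classes)
        self.drop = nn.Dropout(0.5)

    def forward(self, A, X):              # A: normalized adjacency
        H = torch.relu(A @ self.w1(X))
        return A @ self.w2(self.drop(H))

torch.manual_seed(0)
A = torch.eye(5) + torch.rand(5, 5).round()   # toy 5-node graph, self-loops
A = A / A.sum(1, keepdim=True)                # row-normalize
X = torch.rand(5, 8)                          # toy node features
model = TinyGCN(8, 16, 2)
model.train()                                 # keep dropout active

samples = []
for _ in range(50):                           # 50 stochastic passes
    x = X.clone().requires_grad_(True)
    model(A, x)[0, 1].backward()              # class-1 logit of node 0
    samples.append(x.grad.abs().sum(1))       # per-node saliency
S = torch.stack(samples)
print("mean node importance:", S.mean(0))
print("std (uncertainty):   ", S.std(0))
```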
{"title":"Quantifying uncertainty in graph neural network explanations","authors":"Junji Jiang, Chen Ling, Hongyi Li, Guangji Bai, Xujiang Zhao, Liang Zhao","doi":"10.3389/fdata.2024.1392662","DOIUrl":"https://doi.org/10.3389/fdata.2024.1392662","url":null,"abstract":"In recent years, analyzing the explanation for the prediction of Graph Neural Networks (GNNs) has attracted increasing attention. Despite this progress, most existing methods do not adequately consider the inherent uncertainties stemming from the randomness of model parameters and graph data, which may lead to overconfidence and misguiding explanations. However, it is challenging for most of GNN explanation methods to quantify these uncertainties since they obtain the prediction explanation in a post-hoc and model-agnostic manner without considering the randomness of graph data and model parameters. To address the above problems, this paper proposes a novel uncertainty quantification framework for GNN explanations. For mitigating the randomness of graph data in the explanation, our framework accounts for two distinct data uncertainties, allowing for a direct assessment of the uncertainty in GNN explanations. For mitigating the randomness of learned model parameters, our method learns the parameter distribution directly from the data, obviating the need for assumptions about specific distributions. Moreover, the explanation uncertainty within model parameters is also quantified based on the learned parameter distributions. This holistic approach can integrate with any post-hoc GNN explanation methods. Empirical results from our study show that our proposed method sets a new standard for GNN explanation performance across diverse real-world graph benchmarks.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140996593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1375818
A comprehensive investigation of clustering algorithms for User and Entity Behavior Analytics
Pierpaolo Artioli, Antonio Maci, Alessio Magrì
Government agencies are now encouraging industries to enhance their security systems to detect and respond proactively to cybersecurity incidents. Consequently, equipping organizations with a security operations center that combines the analytical capabilities of human experts with systems based on Machine Learning (ML) plays a critical role. In this setting, Security Information and Event Management (SIEM) platforms can effectively handle network-related events to trigger cybersecurity alerts. Furthermore, a SIEM may include a User and Entity Behavior Analytics (UEBA) engine that examines the behavior of both users and devices, or entities, within a corporate network. In recent literature, several contributions have employed ML algorithms for UEBA, especially those based on the unsupervised learning paradigm, because anomalous behaviors are usually not known in advance. However, to shorten the gap between research advances and practice, it is necessary to comprehensively analyze the effectiveness of these methodologies. This paper proposes a thorough investigation of traditional and emerging clustering algorithms for UEBA, considering multiple application contexts, i.e., different user-entity interaction scenarios. Our study involves three datasets sourced from the existing literature and fifteen clustering algorithms. Among the compared techniques, HDBSCAN and DenMune showed promising performance on the state-of-the-art CERT behavior-related dataset, producing groups with a density very close to the number of users.
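For orientation, here is a minimal sketch of the clustering step: standardize per-user behavioral features and group them with HDBSCAN, one of the two best-performing algorithms above. The random features are a hypothetical stand-in for CERT-style behavior logs, and the code assumes scikit-learn 1.3 or later, where `sklearn.cluster.HDBSCAN` is available.

```python
# Sketch: density-based clustering of user behavior features with HDBSCAN.
# Feature columns (e.g., logons/day, after-hours ratio, USB events,
# distinct hosts) are placeholders for real behavior-log aggregates.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))        # 200 users, 4 toy features

X = StandardScaler().fit_transform(features)
labels = HDBSCAN(min_cluster_size=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}),
      "| noise points:", int((labels == -1).sum()))
```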
{"title":"A comprehensive investigation of clustering algorithms for User and Entity Behavior Analytics","authors":"Pierpaolo Artioli, Antonio Maci, Alessio Magrì","doi":"10.3389/fdata.2024.1375818","DOIUrl":"https://doi.org/10.3389/fdata.2024.1375818","url":null,"abstract":"Government agencies are now encouraging industries to enhance their security systems to detect and respond proactively to cybersecurity incidents. Consequently, equipping with a security operation center that combines the analytical capabilities of human experts with systems based on Machine Learning (ML) plays a critical role. In this setting, Security Information and Event Management (SIEM) platforms can effectively handle network-related events to trigger cybersecurity alerts. Furthermore, a SIEM may include a User and Entity Behavior Analytics (UEBA) engine that examines the behavior of both users and devices, or entities, within a corporate network.In recent literature, several contributions have employed ML algorithms for UEBA, especially those based on the unsupervised learning paradigm, because anomalous behaviors are usually not known in advance. However, to shorten the gap between research advances and practice, it is necessary to comprehensively analyze the effectiveness of these methodologies. This paper proposes a thorough investigation of traditional and emerging clustering algorithms for UEBA, considering multiple application contexts, i.e., different user-entity interaction scenarios.Our study involves three datasets sourced from the existing literature and fifteen clustering algorithms. Among the compared techniques, HDBSCAN and DenMune showed promising performance on the state-of-the-art CERT behavior-related dataset, producing groups with a density very close to the number of users.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140994869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-07 DOI: 10.3389/fdata.2024.1184444
The role of big data in financial technology toward financial inclusion
David Mhlanga
In the rapidly evolving landscape of financial technology (FinTech), big data stands as a cornerstone, driving significant transformations. This study delves into the pivotal role of big data in FinTech and its implications for financial inclusion. Employing a comprehensive literature review methodology, we analyze diverse sources, including academic journals, industry reports, and online articles. Our findings illuminate how big data catalyzes the development of novel financial products and services, enhances risk management, and boosts operational efficiency, thereby fostering financial inclusion. In particular, big data's capability to offer insightful customer behavior analytics is highlighted as a key driver for creating inclusive financial services. However, challenges such as data privacy and security, and the need for ethical algorithmic practices, are also identified. This research contributes valuable insights for policymakers, regulators, and industry practitioners, suggesting a need for balanced regulatory frameworks to harness big data's potential ethically and responsibly. The outcomes of this study underscore the transformative power of big data in FinTech, indicating a pathway toward a more inclusive financial ecosystem.
{"title":"The role of big data in financial technology toward financial inclusion","authors":"David Mhlanga","doi":"10.3389/fdata.2024.1184444","DOIUrl":"https://doi.org/10.3389/fdata.2024.1184444","url":null,"abstract":"In the rapidly evolving landscape of financial technology (FinTech), big data stands as a cornerstone, driving significant transformations. This study delves into the pivotal role of big data in FinTech and its implications for financial inclusion. Employing a comprehensive literature review methodology, we analyze diverse sources including academic journals, industry reports, and online articles. Our findings illuminate how big data catalyzes the development of novel financial products and services, enhances risk management, and boosts operational efficiency, thereby fostering financial inclusion. Particularly, big data's capability to offer insightful customer behavior analytics is highlighted as a key driver for creating inclusive financial services. However, challenges such as data privacy and security, and the need for ethical algorithmic practices are also identified. This research contributes valuable insights for policymakers, regulators, and industry practitioners, suggesting a need for balanced regulatory frameworks to harness big data's potential ethically and responsibly. The outcomes of this study underscore the transformative power of big data in FinTech, indicating a pathway toward a more inclusive financial ecosystem.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141004699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}