Pub Date: 2024-10-03 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1406365
Machine learning-based remission prediction in rheumatoid arthritis patients treated with biologic disease-modifying anti-rheumatic drugs: findings from the Kuwait rheumatic disease registry
Ahmad R Alsaber, Adeeba Al-Herz, Balqees Alawadhi, Iyad Abu Doush, Parul Setiya, Ahmad T Al-Sultan, Khulood Saleh, Adel Al-Awadhi, Eman Hasan, Waleed Al-Kandari, Khalid Mokaddem, Aqeel A Ghanem, Yousef Attia, Mohammed Hussain, Naser AlHadhood, Yaser Ali, Hoda Tarakmeh, Ghaydaa Aldabie, Amjad AlKadi, Hebah Alhajeri
Background: Rheumatoid arthritis (RA) is a common condition treated with biologic disease-modifying anti-rheumatic drugs (bDMARDs). However, many patients respond poorly to these agents, motivating machine learning models that predict which patients will achieve remission on bDMARDs, with the potential to reduce healthcare costs and avoid unnecessary adverse effects.
Objective: The study aims to develop machine learning models using data from the Kuwait Registry for Rheumatic Diseases (KRRD) to identify clinical characteristics predictive of remission in RA patients treated with biologics.
Methods: The study collected follow-up data from 1,968 patients treated with bDMARDs at four public hospitals in Kuwait from 2013 to 2022. Machine learning techniques, including lasso, ridge, support vector machine, random forest, and XGBoost, were used to predict remission at one-year follow-up, with Shapley additive explanations (SHAP) applied to interpret the fitted models.
Results: SHAP plots, an explainable artificial intelligence (XAI) technique, were used to analyze how predictors affect remission prognosis across the different types of bDMARDs. The top clinical features were identified for patients treated with each bDMARD, with each feature associated with a specific mean SHAP value. The findings highlight the importance of clinical assessments and specific treatments in shaping treatment outcomes.
Conclusion: The proposed machine learning models effectively identify clinical features that predict remission in patients treated with bDMARDs, potentially improving treatment efficacy in rheumatoid arthritis.
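As a rough illustration of the SHAP workflow this abstract describes, the sketch below trains an XGBoost classifier on synthetic stand-in data and ranks features by mean absolute SHAP value. The feature names and data are invented placeholders, not the KRRD variables or the authors' pipeline.

```python
# Illustrative sketch only: synthetic stand-in data, not the KRRD registry.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n = 500
# Hypothetical clinical features; the real study's predictors differ.
X = pd.DataFrame({
    "das28": rng.normal(4.5, 1.2, n),          # disease activity score
    "crp": rng.gamma(2.0, 5.0, n),             # C-reactive protein
    "disease_duration": rng.uniform(0, 20, n),
    "age": rng.normal(50, 12, n),
})
# Synthetic remission label loosely tied to disease activity.
y = (X["das28"] + rng.normal(0, 1, n) < 4.0).astype(int)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X, y)

# TreeExplainer gives per-patient, per-feature SHAP contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per feature: a global importance ranking,
# analogous to the "mean SHAP values" reported in the abstract.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))
```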
{"title":"Machine learning-based remission prediction in rheumatoid arthritis patients treated with biologic disease-modifying anti-rheumatic drugs: findings from the Kuwait rheumatic disease registry.","authors":"Ahmad R Alsaber, Adeeba Al-Herz, Balqees Alawadhi, Iyad Abu Doush, Parul Setiya, Ahmad T Al-Sultan, Khulood Saleh, Adel Al-Awadhi, Eman Hasan, Waleed Al-Kandari, Khalid Mokaddem, Aqeel A Ghanem, Yousef Attia, Mohammed Hussain, Naser AlHadhood, Yaser Ali, Hoda Tarakmeh, Ghaydaa Aldabie, Amjad AlKadi, Hebah Alhajeri","doi":"10.3389/fdata.2024.1406365","DOIUrl":"https://doi.org/10.3389/fdata.2024.1406365","url":null,"abstract":"<p><strong>Background: </strong>Rheumatoid arthritis (RA) is a common condition treated with biological disease-modifying anti-rheumatic medicines (bDMARDs). However, many patients exhibit resistance, necessitating the use of machine learning models to predict remissions in patients treated with bDMARDs, thereby reducing healthcare costs and minimizing negative effects.</p><p><strong>Objective: </strong>The study aims to develop machine learning models using data from the Kuwait Registry for Rheumatic Diseases (KRRD) to identify clinical characteristics predictive of remission in RA patients treated with biologics.</p><p><strong>Methods: </strong>The study collected follow-up data from 1,968 patients treated with bDMARDs from four public hospitals in Kuwait from 2013 to 2022. Machine learning techniques like lasso, ridge, support vector machine, random forest, XGBoost, and Shapley additive explanation were used to predict remission at a 1-year follow-up.</p><p><strong>Results: </strong>The study used the Shapley plot in explainable Artificial Intelligence (XAI) to analyze the effects of predictors on remission prognosis across different types of bDMARDs. Top clinical features were identified for patients treated with bDMARDs, each associated with specific mean SHAP values. The findings highlight the importance of clinical assessments and specific treatments in shaping treatment outcomes.</p><p><strong>Conclusion: </strong>The proposed machine learning model system effectively identifies clinical features predicting remission in bDMARDs, potentially improving treatment efficacy in rheumatoid arthritis patients.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1406365"},"PeriodicalIF":2.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142480535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-03 | DOI: 10.3389/fdata.2024.1462745
Unsupervised machine learning model for detecting anomalous volumetric modulated arc therapy plans for lung cancer patients
Peng Huang, Jiawen Shang, Yuhan Fan, Zhihui Hu, Jianrong Dai, Zhiqiang Liu, Hui Yan
Purpose: Volumetric modulated arc therapy (VMAT) is a new treatment modality in modern radiotherapy. To ensure the quality of the radiotherapy plan, a physics plan review is routinely conducted by senior clinicians; however, this manual process is inefficient and prone to error. In this study, a multi-task AutoEncoder (AE) is proposed to automate anomaly detection of VMAT plans for lung cancer patients.
Methods: Feature maps are first extracted from a VMAT plan. A multi-task AE is then trained with a feature map as input and two targets (beam aperture and prescribed dose) as outputs. A detection threshold is obtained from the distribution of reconstruction errors on the training set. For a test sample, the reconstruction error is calculated with the AE model and compared against the threshold to determine its class (anomalous or regular). The proposed multi-task AE model is compared with three existing AE models: Vanilla AE, Contractive AE, and Variational AE. The area under the receiver operating characteristic curve (AUC) and related statistics are used to evaluate the performance of these models.
Results: Among the four tested AE models, the proposed multi-task AE achieves the highest AUC (0.964), accuracy (0.821), precision (0.471), and F1 score (0.632), and the lowest false positive rate (FPR, 0.206).
Conclusion: The proposed multi-task AE model using two-dimensional (2D) feature maps can effectively detect anomalies in radiotherapy plans for lung cancer patients. Compared to the other existing AE models, the multi-task AE is more accurate and efficient. The proposed model provides a feasible way to carry out automated anomaly detection of VMAT plans in radiotherapy.
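A minimal sketch of the reconstruction-error test described above, assuming flattened feature maps, a shared encoder with two decoder heads, and a 95th-percentile threshold rule; the architecture, sizes, and data are placeholder assumptions, not the paper's actual model.

```python
# Toy multi-task autoencoder with threshold-based anomaly detection.
import torch
import torch.nn as nn

class MultiTaskAE(nn.Module):
    def __init__(self, in_dim=256, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent))
        # Two heads: one per target (e.g., beam aperture, prescribed dose).
        self.head_aperture = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                           nn.Linear(64, in_dim))
        self.head_dose = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                       nn.Linear(64, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.head_aperture(z), self.head_dose(z)

def recon_error(model, x, aperture, dose):
    # Per-sample reconstruction error summed across the two tasks.
    a_hat, d_hat = model(x)
    return ((a_hat - aperture) ** 2).mean(dim=1) + ((d_hat - dose) ** 2).mean(dim=1)

model = MultiTaskAE()            # training loop on regular plans omitted
x_train = torch.randn(100, 256)  # stand-in flattened plan feature maps
a_train = torch.randn(100, 256)  # stand-in beam aperture target
d_train = torch.randn(100, 256)  # stand-in prescribed dose target

# Threshold from the training error distribution (95th percentile assumed).
errs = recon_error(model, x_train, a_train, d_train).detach()
threshold = torch.quantile(errs, 0.95)

x_test = torch.randn(5, 256)
a_test, d_test = torch.randn(5, 256), torch.randn(5, 256)
is_anomaly = recon_error(model, x_test, a_test, d_test) > threshold
print(is_anomaly)
```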
{"title":"Unsupervised machine learning model for detecting anomalous volumetric modulated arc therapy plans for lung cancer patients.","authors":"Peng Huang, Jiawen Shang, Yuhan Fan, Zhihui Hu, Jianrong Dai, Zhiqiang Liu, Hui Yan","doi":"10.3389/fdata.2024.1462745","DOIUrl":"https://doi.org/10.3389/fdata.2024.1462745","url":null,"abstract":"<p><strong>Purpose: </strong>Volumetric modulated arc therapy (VMAT) is a new treatment modality in modern radiotherapy. To ensure the quality of the radiotherapy plan, a physics plan review is routinely conducted by senior clinicians; however, this process is less efficient and less accurate. In this study, a multi-task AutoEncoder (AE) is proposed to automate anomaly detection of VMAT plans for lung cancer patients.</p><p><strong>Methods: </strong>The feature maps are first extracted from a VMAT plan. Then, a multi-task AE is trained based on the input of a feature map, and its output is the two targets (beam aperture and prescribed dose). Based on the distribution of reconstruction errors on the training set, a detection threshold value is obtained. For a testing sample, its reconstruction error is calculated using the AE model and compared with the threshold value to determine its classes (anomaly or regular). The proposed multi-task AE model is compared to the other existing AE models, including Vanilla AE, Contractive AE, and Variational AE. The area under the receiver operating characteristic curve (AUC) and the other statistics are used to evaluate the performance of these models.</p><p><strong>Results: </strong>Among the four tested AE models, the proposed multi-task AE model achieves the highest values in AUC (0.964), accuracy (0.821), precision (0.471), and <i>F</i>1 score (0.632), and the lowest value in FPR (0.206).</p><p><strong>Conclusion: </strong>The proposed multi-task AE model using two-dimensional (2D) feature maps can effectively detect anomalies in radiotherapy plans for lung cancer patients. Compared to the other existing AE models, the multi-task AE is more accurate and efficient. The proposed model provides a feasible way to carry out automated anomaly detection of VMAT plans in radiotherapy.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1462745"},"PeriodicalIF":2.4,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484413/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142480538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-30 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1469981
Prediction and classification of obesity risk based on a hybrid metaheuristic machine learning approach
Zarindokht Helforoush, Hossein Sayyad
Introduction: As the global prevalence of obesity continues to rise, it has become a major public health concern requiring more accurate prediction methods. Traditional regression models often fail to capture the complex interactions between genetic, environmental, and behavioral factors contributing to obesity.
Methods: This study explores the potential of machine-learning techniques to improve obesity risk prediction. Several supervised learning algorithms, including a novel hybrid of an artificial neural network and particle swarm optimization (ANN-PSO), were applied after comprehensive data preprocessing, and their performance was evaluated.
Results: The proposed ANN-PSO model achieved an accuracy of 92%, outperforming traditional regression methods. SHAP (Shapley additive explanations) was employed to analyze feature importance, offering deeper insight into how various factors influence obesity risk.
Discussion: The findings highlight the transformative role of advanced machine-learning models in public health research, offering a pathway for personalized healthcare interventions. By providing detailed obesity risk profiles, these models enable healthcare providers to tailor prevention and treatment strategies to individual needs. The results underscore the need to integrate innovative machine-learning approaches into global public health efforts to combat the growing obesity epidemic.
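To make the ANN-PSO idea concrete, here is a toy sketch in which a particle swarm searches the weight vector of a small neural network classifier. The network size, PSO hyperparameters, and synthetic data are illustrative assumptions, not the authors' configuration.

```python
# Toy ANN-PSO: global-best PSO optimizing the weights of a tiny classifier.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))              # stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # stand-in labels

H = 8                                      # hidden units
dim = 4 * H + H + H + 1                    # total number of weights

def unpack(w):
    i = 0
    W1 = w[i:i + 4 * H].reshape(4, H); i += 4 * H
    b1 = w[i:i + H]; i += H
    W2 = w[i:i + H]; i += H
    return W1, b1, W2, w[i]

def loss(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Standard update: v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
n_particles, iters, w_in, c1, c2 = 30, 200, 0.7, 1.5, 1.5
pos = rng.normal(size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([loss(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, 1))
    vel = w_in * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos += vel
    vals = np.array([loss(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best training loss:", pbest_val.min())
```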
{"title":"Prediction and classification of obesity risk based on a hybrid metaheuristic machine learning approach.","authors":"Zarindokht Helforoush, Hossein Sayyad","doi":"10.3389/fdata.2024.1469981","DOIUrl":"https://doi.org/10.3389/fdata.2024.1469981","url":null,"abstract":"<p><strong>Introduction: </strong>As the global prevalence of obesity continues to rise, it has become a major public health concern requiring more accurate prediction methods. Traditional regression models often fail to capture the complex interactions between genetic, environmental, and behavioral factors contributing to obesity.</p><p><strong>Methods: </strong>This study explores the potential of machine-learning techniques to improve obesity risk prediction. Various supervised learning algorithms, including the novel ANN-PSO hybrid model, were applied following comprehensive data preprocessing and evaluation.</p><p><strong>Results: </strong>The proposed ANN-PSO model achieved a remarkable accuracy rate of 92%, outperforming traditional regression methods. SHAP was employed to analyze feature importance, offering deeper insights into the influence of various factors on obesity risk.</p><p><strong>Discussion: </strong>The findings highlight the transformative role of advanced machine-learning models in public health research, offering a pathway for personalized healthcare interventions. By providing detailed obesity risk profiles, these models enable healthcare providers to tailor prevention and treatment strategies to individual needs. The results underscore the need to integrate innovative machine-learning approaches into global public health efforts to combat the growing obesity epidemic.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1469981"},"PeriodicalIF":2.4,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471553/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142480537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-25 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1455399
Making the most of big qualitative datasets: a living systematic review of analysis methods
Abinaya Chandrasekar, Sigrún Eyrúnardóttir Clark, Sam Martin, Samantha Vanderslott, Elaine C Flores, David Aceituno, Phoebe Barnett, Cecilia Vindrola-Padros, Norha Vera San Juan
Introduction: Qualitative data provides deep insights into an individual's behaviors and beliefs, and the contextual factors that may shape these. Big qualitative data analysis is an emerging field that aims to identify trends and patterns in large qualitative datasets. The purpose of this review was to identify the methods used to analyse large bodies of qualitative data, their cited strengths and limitations, and to compare manual and digital analysis approaches.
Methods: A multifaceted approach was taken to develop the review, drawing on academic, gray, and media-based literature and using methods such as iterative analysis, frequency analysis, text network analysis, and team discussion.
Results: The review identified 520 articles that detailed analysis approaches for big qualitative data. From these publications, a diverse range of methods and software used for analysis was identified, with thematic analysis and basic software packages being the most common. Studies were most commonly conducted in high-income countries, and the most common data sources were open-ended survey responses, interview transcripts, and first-person narratives.
Discussion: We identified an emerging trend to expand the sources of qualitative data (e.g., using social media data, images, or videos) and to develop new methods and software for analysis. As the qualitative analysis field continues to evolve, further research will be needed to compare the utility of different big qualitative analysis methods and to develop standardized guidelines that raise awareness and support researchers in using more novel approaches for big qualitative analysis.
{"title":"Making the most of big qualitative datasets: a living systematic review of analysis methods.","authors":"Abinaya Chandrasekar, Sigrún Eyrúnardóttir Clark, Sam Martin, Samantha Vanderslott, Elaine C Flores, David Aceituno, Phoebe Barnett, Cecilia Vindrola-Padros, Norha Vera San Juan","doi":"10.3389/fdata.2024.1455399","DOIUrl":"10.3389/fdata.2024.1455399","url":null,"abstract":"<p><strong>Introduction: </strong>Qualitative data provides deep insights into an individual's behaviors and beliefs, and the contextual factors that may shape these. Big qualitative data analysis is an emerging field that aims to identify trends and patterns in large qualitative datasets. The purpose of this review was to identify the methods used to analyse large bodies of qualitative data, their cited strengths and limitations and comparisons between manual and digital analysis approaches.</p><p><strong>Methods: </strong>A multifaceted approach has been taken to develop the review relying on academic, gray and media-based literature, using approaches such as iterative analysis, frequency analysis, text network analysis and team discussion.</p><p><strong>Results: </strong>The review identified 520 articles that detailed analysis approaches of big qualitative data. From these publications a diverse range of methods and software used for analysis were identified, with thematic analysis and basic software being most common. Studies were most commonly conducted in high-income countries, and the most common data sources were open-ended survey responses, interview transcripts, and first-person narratives.</p><p><strong>Discussion: </strong>We identified an emerging trend to expand the sources of qualitative data (e.g., using social media data, images, or videos), and develop new methods and software for analysis. As the qualitative analysis field may continue to change, it will be necessary to conduct further research to compare the utility of different big qualitative analysis methods and to develop standardized guidelines to raise awareness and support researchers in the use of more novel approaches for big qualitative analysis.</p><p><strong>Systematic review registration: </strong>https://osf.io/hbvsy/?view_only=.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1455399"},"PeriodicalIF":2.4,"publicationDate":"2024-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11461344/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-19 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1393758
Data-driven classification and explainable-AI in the field of lung imaging
Syed Taimoor Hussain Shah, Syed Adil Hussain Shah, Iqra Iqbal Khan, Atif Imran, Syed Baqir Hussain Shah, Atif Mehmood, Shahzad Ahmad Qureshi, Mudassar Raza, Angelo Di Terlizzi, Marco Cavaglià, Marco Agostino Deriu
Detecting lung diseases in medical images can be quite challenging for radiologists. In some cases, even experienced experts may struggle to accurately diagnose chest diseases, leading to potential inaccuracies due to complex or unseen biomarkers. This review delves into the datasets and machine learning techniques employed in recent research on lung disease classification, focusing on pneumonia analysis using chest X-ray images. We explore conventional machine learning methods, pretrained deep learning models, customized convolutional neural networks (CNNs), and ensemble methods. A comprehensive comparison of different classification approaches is presented, encompassing data acquisition, preprocessing, feature extraction, and classification using machine vision, machine and deep learning, and explainable AI (XAI). Our analysis highlights the superior performance of transfer learning-based methods using CNNs and ensemble models/features for lung disease classification. The review also offers insights for researchers in other medical domains who utilize radiological images: by providing a thorough overview of various techniques, it supports the development of effective strategies and the identification of suitable methods for a wide range of challenges. Beyond traditional evaluation metrics, researchers increasingly emphasize XAI techniques in machine and deep learning models and their applications in classification tasks. Incorporating XAI offers a deeper understanding of a model's decision-making process, improving trust, transparency, and ultimately clinical decision-making. This review therefore serves as a valuable resource for researchers and practitioners seeking to advance lung disease detection using machine learning and XAI, as well as for those working in other domains.
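A minimal example of the transfer-learning pattern the review highlights: reuse a CNN pretrained on ImageNet and retrain only a new classification head for a two-class chest X-ray task. The model choice (ResNet-18) and the random stand-in batch are assumptions for illustration, not a method from any specific surveyed paper.

```python
# Transfer learning sketch: frozen pretrained backbone, new two-class head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():        # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g., normal vs. pneumonia

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random stand-in images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
logits = model(images)
loss = criterion(logits, labels)
loss.backward()                     # gradients flow only into the new head
optimizer.step()
print(float(loss))
```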
{"title":"Data-driven classification and explainable-AI in the field of lung imaging.","authors":"Syed Taimoor Hussain Shah, Syed Adil Hussain Shah, Iqra Iqbal Khan, Atif Imran, Syed Baqir Hussain Shah, Atif Mehmood, Shahzad Ahmad Qureshi, Mudassar Raza, Angelo Di Terlizzi, Marco Cavaglià, Marco Agostino Deriu","doi":"10.3389/fdata.2024.1393758","DOIUrl":"10.3389/fdata.2024.1393758","url":null,"abstract":"<p><p>Detecting lung diseases in medical images can be quite challenging for radiologists. In some cases, even experienced experts may struggle with accurately diagnosing chest diseases, leading to potential inaccuracies due to complex or unseen biomarkers. This review paper delves into various datasets and machine learning techniques employed in recent research for lung disease classification, focusing on pneumonia analysis using chest X-ray images. We explore conventional machine learning methods, pretrained deep learning models, customized convolutional neural networks (CNNs), and ensemble methods. A comprehensive comparison of different classification approaches is presented, encompassing data acquisition, preprocessing, feature extraction, and classification using machine vision, machine and deep learning, and explainable-AI (XAI). Our analysis highlights the superior performance of transfer learning-based methods using CNNs and ensemble models/features for lung disease classification. In addition, our comprehensive review offers insights for researchers in other medical domains too who utilize radiological images. By providing a thorough overview of various techniques, our work enables the establishment of effective strategies and identification of suitable methods for a wide range of challenges. Currently, beyond traditional evaluation metrics, researchers emphasize the importance of XAI techniques in machine and deep learning models and their applications in classification tasks. This incorporation helps in gaining a deeper understanding of their decision-making processes, leading to improved trust, transparency, and overall clinical decision-making. Our comprehensive review serves as a valuable resource for researchers and practitioners seeking not only to advance the field of lung disease detection using machine learning and XAI but also from other diverse domains.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1393758"},"PeriodicalIF":2.4,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446784/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-16 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1428568
Current state of data stewardship tools in life science
Anna Aksenova, Anoop Johny, Tim Adams, Phil Gribbon, Marc Jacobs, Martin Hofmann-Apitius
In today's data-centric landscape, effective data stewardship is critical for facilitating scientific research and innovation. This article provides an overview of essential tools and frameworks for modern data stewardship practices. The study analyzed over 300 tools, assessing their utility, relevance to data stewardship, and applicability within the life sciences domain.
{"title":"Current state of data stewardship tools in life science.","authors":"Anna Aksenova, Anoop Johny, Tim Adams, Phil Gribbon, Marc Jacobs, Martin Hofmann-Apitius","doi":"10.3389/fdata.2024.1428568","DOIUrl":"10.3389/fdata.2024.1428568","url":null,"abstract":"<p><p>In today's data-centric landscape, effective data stewardship is critical for facilitating scientific research and innovation. This article provides an overview of essential tools and frameworks for modern data stewardship practices. Over 300 tools were analyzed in this study, assessing their utility, relevance to data stewardship, and applicability within the life sciences domain.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1428568"},"PeriodicalIF":2.4,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11439729/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142332186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-13 | DOI: 10.3389/fdata.2024.1387325
Navigating pathways to automated personality prediction: a comparative study of small and medium language models
Fatima Habib, Zeeshan Ali, Akbar Azam, Komal Kamran, Fahad Mansoor Pasha
Introduction: Recent advancements in Natural Language Processing (NLP) and widely available social media data have made it possible to predict human personality in various computational applications. In this context, pre-trained Large Language Models (LLMs) have gained recognition for their exceptional performance on NLP benchmarks. However, these models require substantial computational resources, escalating their carbon and water footprint. Consequently, a shift toward smaller, more computationally efficient models is underway.
Methods: This study compares a small model, ALBERT (11.8M parameters), with a larger model, RoBERTa (125M parameters), in predicting Big Five personality traits. It uses the PANDORA dataset of Reddit comments, processed on a Tesla P100-PCIE-16GB GPU. Both models were customized to support multi-output regression, with two added linear layers for fine-grained regression analysis.
Results: Results were evaluated using Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), alongside the computational resources consumed during training. ALBERT consumed less system memory and emitted less heat, but required more computation time than RoBERTa. The two models produced comparable MSE, RMSE, and training loss reduction.
Discussion: These findings suggest that training data quality can outweigh model size in determining performance. Theoretical and practical implications are also discussed.
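A rough sketch of the setup described above: a pretrained encoder with two added linear layers producing five regression outputs (the Big Five traits), trained with MSE. The checkpoint name, head sizes, and example data are assumptions for illustration, not the authors' exact configuration.

```python
# Multi-output regression head on a pretrained ALBERT encoder.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PersonalityRegressor(nn.Module):
    def __init__(self, name="albert-base-v2"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        # Two linear layers on top of the [CLS] representation.
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(),
                                  nn.Linear(128, 5))   # 5 trait scores

    def forward(self, **enc):
        out = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] token
        return self.head(out)

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = PersonalityRegressor()

enc = tokenizer(["an example reddit comment"], return_tensors="pt",
                truncation=True, padding=True)
target = torch.tensor([[0.2, 0.5, 0.1, 0.7, 0.4]])  # stand-in trait scores
pred = model(**enc)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
print(float(loss), torch.sqrt(loss).item())  # MSE and RMSE
```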
{"title":"Navigating pathways to automated personality prediction: a comparative study of small and medium language models.","authors":"Fatima Habib, Zeeshan Ali, Akbar Azam, Komal Kamran, Fahad Mansoor Pasha","doi":"10.3389/fdata.2024.1387325","DOIUrl":"https://doi.org/10.3389/fdata.2024.1387325","url":null,"abstract":"<p><strong>Introduction: </strong>Recent advancements in Natural Language Processing (NLP) and widely available social media data have made it possible to predict human personalities in various computational applications. In this context, pre-trained Large Language Models (LLMs) have gained recognition for their exceptional performance in NLP benchmarks. However, these models require substantial computational resources, escalating their carbon and water footprint. Consequently, a shift toward more computationally efficient smaller models is observed.</p><p><strong>Methods: </strong>This study compares a small model ALBERT (11.8M parameters) with a larger model, RoBERTa (125M parameters) in predicting big five personality traits. It utilizes the PANDORA dataset comprising Reddit comments, processing them on a Tesla P100-PCIE-16GB GPU. The study customized both models to support multi-output regression and added two linear layers for fine-grained regression analysis.</p><p><strong>Results: </strong>Results are evaluated on Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), considering the computational resources consumed during training. While ALBERT consumed lower levels of system memory with lower heat emission, it took higher computation time compared to RoBERTa. The study produced comparable levels of MSE, RMSE, and training loss reduction.</p><p><strong>Discussion: </strong>This highlights the influence of training data quality on the model's performance, outweighing the significance of model size. Theoretical and practical implications are also discussed.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1387325"},"PeriodicalIF":2.4,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427259/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142332187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-10 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1441869
When we talk about Big Data, what do we really mean? Toward a more precise definition of Big Data
Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos
Despite the lack of consensus on an official definition of Big Data, research has continued to progress on this "no consensus" stance over the years. However, without a clear definition and scope for Big Data, scientific research and communication lack common ground. Even with the popular "V" characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to build a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) of secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate discrepancies between the theoretical understanding and practical usage of the term. We found that various Big Data technologies are used across scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. The study also revealed that despite general agreement on the "V" characteristics, researchers in different scientific fields hold varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.
{"title":"When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data.","authors":"Xiaoyao Han, Oskar Josef Gstrein, Vasilios Andrikopoulos","doi":"10.3389/fdata.2024.1441869","DOIUrl":"https://doi.org/10.3389/fdata.2024.1441869","url":null,"abstract":"<p><p>Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this \"no consensus\" stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular \"V\" characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the \"V\" characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1441869"},"PeriodicalIF":2.4,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11420115/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142332189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-09 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1446071
SparkDWM: a scalable design of a Data Washing Machine using Apache Spark
Nicholas Kofi Akortia Hagan, John R Talburt
Data volume has been one of the fastest-growing aspects of most real-world applications. This growth increases the rate of human errors such as record duplication, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution (ER) is an ETL process that aims to resolve data inconsistencies by determining whether references refer to the same real-world objects. One of the main challenges for most traditional Entity Resolution systems is scaling to meet rising data needs. This research refactors a working proof-of-concept entity resolution system, the Data Washing Machine, to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset (RDD) and improve the design to use intrinsic metadata information from references. We show that our system achieves the same results as the legacy Data Washing Machine on 18 synthetically generated datasets. We also test its scalability using a variety of real-world benchmark ER datasets ranging from a few thousand to millions of records. Our experimental results show that the proposed system outperforms a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that ours can find more clusters when given optimal starting parameters for clustering.
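For a flavor of the PySpark RDD style the abstract describes, the sketch below performs a generic token-blocking step: group records that share a token, then emit candidate duplicate pairs. This is a common entity-resolution building block, not SparkDWM's actual algorithm, and the records are made up.

```python
# Token blocking with PySpark RDDs: records sharing a token become
# candidate duplicate pairs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("er-sketch").getOrCreate()
sc = spark.sparkContext

records = sc.parallelize([
    (1, "john a smith 123 oak st"),
    (2, "jon smith 123 oak street"),
    (3, "mary jones 9 elm ave"),
])

# (token, record_id) pairs -> blocks of ids sharing a token.
blocks = (records
          .flatMap(lambda r: [(tok, r[0]) for tok in set(r[1].split())])
          .groupByKey()
          .mapValues(list)
          .filter(lambda kv: len(kv[1]) > 1))

# Candidate pairs from each block; pairs repeated across blocks deduped.
pairs = (blocks
         .flatMap(lambda kv: [(a, b) for a in kv[1] for b in kv[1] if a < b])
         .distinct())

print(pairs.collect())   # e.g. [(1, 2)] for records sharing tokens
spark.stop()
```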
{"title":"SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.","authors":"Nicholas Kofi Akortia Hagan, John R Talburt","doi":"10.3389/fdata.2024.1446071","DOIUrl":"10.3389/fdata.2024.1446071","url":null,"abstract":"<p><p>Data volume has been one of the fast-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring entities are referring to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet the rising data needs. This research aims to refactor a working proof-of-concept entity resolution system called the Data Washing Machine to be highly scalable using Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We prove that our systems achieve the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets from a few thousand to millions. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1446071"},"PeriodicalIF":2.4,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11416992/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142309124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-04 | eCollection Date: 2024-01-01 | DOI: 10.3389/fdata.2024.1400024
Deepfake: definitions, performance metrics and standards, datasets, and a meta-review
Enes Altuncu, Virginia N L Franqueira, Shujun Li
Recent advancements in AI, especially deep learning, have driven a significant increase in the creation of realistic-looking synthetic media (video, image, and audio) and in the manipulation of existing media, giving rise to the term "deepfake." Based on both the research literature and resources in English, this paper gives a comprehensive overview of deepfakes, covering multiple important aspects of this emerging concept, including (1) different definitions, (2) commonly used performance metrics and standards, and (3) deepfake-related datasets. In addition, the paper reports a meta-review of 15 selected deepfake-related survey papers published since 2020, focusing not only on the aspects above but also on an analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfakes in terms of the aspects covered.
{"title":"Deepfake: definitions, performance metrics and standards, datasets, and a meta-review.","authors":"Enes Altuncu, Virginia N L Franqueira, Shujun Li","doi":"10.3389/fdata.2024.1400024","DOIUrl":"https://doi.org/10.3389/fdata.2024.1400024","url":null,"abstract":"<p><p>Recent advancements in AI, especially deep learning, have contributed to a significant increase in the creation of new realistic-looking synthetic media (video, image, and audio) and manipulation of existing media, which has led to the creation of the new term \"deepfake.\" Based on both the research literature and resources in English, this paper gives a comprehensive overview of deepfake, covering multiple important aspects of this emerging concept, including (1) different definitions, (2) commonly used performance metrics and standards, and (3) deepfake-related datasets. In addition, the paper also reports a meta-review of 15 selected deepfake-related survey papers published since 2020, focusing not only on the mentioned aspects but also on the analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfake in terms of the aspects covered.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1400024"},"PeriodicalIF":2.4,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11408348/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142300698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}