Pub Date : 2024-05-16 DOI: 10.3389/fdata.2024.1353469
Development and application of a machine learning-based predictive model for obstructive sleep apnea screening
Kang Liu, Shi Geng, Ping Shen, Lei Zhao, Peng Zhou, Wen Liu
To develop a robust machine learning prediction model for the automatic screening and diagnosis of obstructive sleep apnea (OSA) using five advanced algorithms, namely Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF), to provide substantial support for early clinical diagnosis and intervention. We conducted a retrospective analysis of clinical data from 439 patients who underwent polysomnography at the Affiliated Hospital of Xuzhou Medical University between October 2019 and October 2022. Predictor variables included demographic information [age, sex, height, weight, body mass index (BMI)], medical history, and the Epworth Sleepiness Scale (ESS). Univariate analysis was used to identify variables with significant differences, and the dataset was then divided into training and validation sets in a 4:1 ratio. The training set was used to build a model predicting OSA severity grading, and the validation set was used to assess model performance by the area under the curve (AUC). Additionally, a separate analysis was conducted that compared the normal population as one group against patients with moderate-to-severe OSA as another. The same univariate analysis was applied, the dataset was again divided into training and validation sets in a 4:1 ratio, the training set was used to build a prediction model for screening moderate-to-severe OSA, and the validation set was used to verify the model's performance. In the four-group severity model, the LightGBM model outperformed the others; its top five features by importance were ESS total score, BMI, sex, hypertension, and gastroesophageal reflux disease (GERD), with age, ESS total score, and BMI playing the most significant roles. In the dichotomous model, RF was the best performer of the five models; its top five features by importance were ESS total score, BMI, GERD, age, and dry mouth, with ESS total score and BMI being particularly pivotal. Machine learning-based prediction models for OSA grading and screening prove instrumental in the early identification of patients with moderate-to-severe OSA, revealing pertinent risk factors and facilitating timely interventions to counter the pathological changes induced by OSA. Notably, ESS total score and BMI emerge as the most critical features for predicting OSA, underscoring their significance in clinical assessments. The dataset will be made publicly available on GitHub.
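The following sketch illustrates the kind of pipeline the abstract describes: a stratified 4:1 split, the five candidate classifiers, AUC comparison on the validation set, and a feature-importance readout. It is not the authors' code; the file name, column names, and the binary screening label are hypothetical placeholders, and categorical fields are assumed to be already numerically encoded. It assumes scikit-learn, xgboost, and lightgbm are installed.

```python
# Minimal sketch of the screening pipeline described in the abstract:
# 4:1 train/validation split, five candidate classifiers, AUC comparison.
# CSV path, column names, and the target label are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

df = pd.read_csv("osa_clinical_data.csv")  # hypothetical file
features = ["age", "sex", "height", "weight", "bmi", "ess_total",
            "hypertension", "gerd", "dry_mouth"]
X, y = df[features], df["moderate_to_severe_osa"]  # binary screening target

# 4:1 split, as in the study
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "LightGBM": LGBMClassifier(),
    "RF": RandomForestClassifier(n_estimators=300),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")

# Top-five feature importances from the RF model, as reported in the study
rf = models["RF"]
print(sorted(zip(features, rf.feature_importances_),
             key=lambda t: t[1], reverse=True)[:5])
```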
{"title":"Development and application of a machine learning-based predictive model for obstructive sleep apnea screening","authors":"Kang Liu, Shi Geng, Ping Shen, Lei Zhao, Peng Zhou, Wen Liu","doi":"10.3389/fdata.2024.1353469","DOIUrl":"https://doi.org/10.3389/fdata.2024.1353469","url":null,"abstract":"To develop a robust machine learning prediction model for the automatic screening and diagnosis of obstructive sleep apnea (OSA) using five advanced algorithms, namely Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF) to provide substantial support for early clinical diagnosis and intervention.We conducted a retrospective analysis of clinical data from 439 patients who underwent polysomnography at the Affiliated Hospital of Xuzhou Medical University between October 2019 and October 2022. Predictor variables such as demographic information [age, sex, height, weight, body mass index (BMI)], medical history, and Epworth Sleepiness Scale (ESS) were used. Univariate analysis was used to identify variables with significant differences, and the dataset was then divided into training and validation sets in a 4:1 ratio. The training set was established to predict OSA severity grading. The validation set was used to assess model performance using the area under the curve (AUC). Additionally, a separate analysis was conducted, categorizing the normal population as one group and patients with moderate-to-severe OSA as another. The same univariate analysis was applied, and the dataset was divided into training and validation sets in a 4:1 ratio. The training set was used to build a prediction model for screening moderate-to-severe OSA, while the validation set was used to verify the model's performance.Among the four groups, the LightGBM model outperformed others, with the top five feature importance rankings of ESS total score, BMI, sex, hypertension, and gastroesophageal reflux (GERD), where Age, ESS total score and BMI played the most significant roles. In the dichotomous model, RF is the best performer of the five models respectively. The top five ranked feature importance of the best-performing RF models were ESS total score, BMI, GERD, age and Dry mouth, with ESS total score and BMI being particularly pivotal.Machine learning-based prediction models for OSA disease grading and screening prove instrumental in the early identification of patients with moderate-to-severe OSA, revealing pertinent risk factors and facilitating timely interventions to counter pathological changes induced by OSA. Notably, ESS total score and BMI emerge as the most critical features for predicting OSA, emphasizing their significance in clinical assessments. The dataset will be publicly available on my Github.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141127186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-15 DOI: 10.3389/fdata.2024.1384240
Tradescantia response to air and soil pollution, stamen hair cells dataset and ANN color classification
Leatrice Talita Rodrigues, Barbara Sanches Antunes Goeldner, Emílio Graciliano Ferreira Mercuri, S. M. Noe
The Tradescantia plant is a complex system that is sensitive to environmental factors such as water supply, pH, temperature, light, radiation, impurities, and nutrient availability. It can be used as a biomonitor for environmental changes; however, the bioassays are time-consuming and involve a strong human-interference factor, so results may vary depending on who performs the analysis. We have developed computer vision models to study color variations in stamen hair cells of Tradescantia clone 4430, which can be stressed by air pollution and soil contamination. The study introduces a novel dataset, Trad-204, comprising single-cell images from Tradescantia clone 4430 captured during the Tradescantia stamen-hair mutation bioassay (Trad-SHM). The dataset contains images from two experiments, one focusing on air pollution by particulate matter and another on soil contaminated by diesel oil. Both experiments were carried out in Curitiba, Brazil, between 2020 and 2023. The images represent single cells with different shapes, sizes, and colors, reflecting the plant's responses to environmental stressors. An automatic classification task was developed to distinguish between blue and pink cells, and the study explores both a baseline model and three artificial neural network (ANN) architectures, namely TinyVGG, VGG-16, and ResNet34. Tradescantia revealed sensitivity to both airborne particulate matter concentration and diesel oil in soil. The results indicate that the Residual Network architecture outperforms the other models in terms of accuracy on both the training and testing sets. The dataset and findings contribute to the understanding of plant cell responses to environmental stress and provide valuable resources for further research in automated image analysis of plant cells. The discussion highlights the impact of turgor pressure on cell shape and the potential implications for plant physiology. The comparison between ANN architectures aligns with previous research, emphasizing the superior performance of ResNet models in image classification tasks. Artificial intelligence identification of pink cells improves counting accuracy, avoiding human errors due to different color perceptions, fatigue, or inattention, in addition to facilitating and speeding up the analysis process. Overall, the study offers insights into plant cell dynamics and provides a foundation for future investigations such as changes in cell morphology.
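As a rough illustration of the strongest of the three ANN architectures compared above, the sketch below fine-tunes a ResNet34 for the blue-vs-pink cell task. The directory layout is a hypothetical stand-in for the Trad-204 dataset and the hyperparameters are arbitrary; this is a generic PyTorch/torchvision recipe, not the authors' training code.

```python
# Sketch: binary blue-vs-pink cell classifier on a ResNet34 backbone.
# "cells/train/blue" and "cells/train/pink" are placeholder folders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("cells/train", transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

model = models.resnet34(weights="IMAGENET1K_V1")   # pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # blue vs. pink head

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):                             # arbitrary epoch count
    for images, labels in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```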
{"title":"Tradescantia response to air and soil pollution, stamen hair cells dataset and ANN color classification","authors":"Leatrice Talita Rodrigues, Barbara Sanches Antunes Goeldner, Emílio Graciliano Ferreira Mercuri, S. M. Noe","doi":"10.3389/fdata.2024.1384240","DOIUrl":"https://doi.org/10.3389/fdata.2024.1384240","url":null,"abstract":"Tradescantia plant is a complex system that is sensible to environmental factors such as water supply, pH, temperature, light, radiation, impurities, and nutrient availability. It can be used as a biomonitor for environmental changes; however, the bioassays are time-consuming and have a strong human interference factor that might change the result depending on who is performing the analysis. We have developed computer vision models to study color variations from Tradescantia clone 4430 plant stamen hair cells, which can be stressed due to air pollution and soil contamination. The study introduces a novel dataset, Trad-204, comprising single-cell images from Tradescantia clone 4430, captured during the Tradescantia stamen-hair mutation bioassay (Trad-SHM). The dataset contain images from two experiments, one focusing on air pollution by particulate matter and another based on soil contaminated by diesel oil. Both experiments were carried out in Curitiba, Brazil, between 2020 and 2023. The images represent single cells with different shapes, sizes, and colors, reflecting the plant's responses to environmental stressors. An automatic classification task was developed to distinguishing between blue and pink cells, and the study explores both a baseline model and three artificial neural network (ANN) architectures, namely, TinyVGG, VGG-16, and ResNet34. Tradescantia revealed sensibility to both air particulate matter concentration and diesel oil in soil. The results indicate that Residual Network architecture outperforms the other models in terms of accuracy on both training and testing sets. The dataset and findings contribute to the understanding of plant cell responses to environmental stress and provide valuable resources for further research in automated image analysis of plant cells. Discussion highlights the impact of turgor pressure on cell shape and the potential implications for plant physiology. The comparison between ANN architectures aligns with previous research, emphasizing the superior performance of ResNet models in image classification tasks. Artificial intelligence identification of pink cells improves the counting accuracy, thus avoiding human errors due to different color perceptions, fatigue, or inattention, in addition to facilitating and speeding up the analysis process. Overall, the study offers insights into plant cell dynamics and provides a foundation for future investigations like cells morphology change. 
This research corroborates that biomonitoring should be considered as an important tool for political actions, being a relevant issue in risk assessment and the development of new public policies relating to the environment.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140971653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-13 DOI: 10.3389/fdata.2024.1386720
A systematic literature review on the impact of AI models on the security of code generation
Claudia Negri-Ribalta, Rémi Geraud-Stewart, Anastasia Sergeeva, Gabriele Lenzini
Artificial Intelligence (AI) is increasingly used as a helper to develop computing programs. While it can boost software development and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software with vulnerabilities. This situation leads to the question: how serious and widespread are the security flaws in code generated using AI models? Through a systematic literature review, this work reviews the state of the art on how AI models impact software security. It systematizes the knowledge about the risks of using AI in coding security-critical software. It reviews which security flaws from well-known weakness lists (e.g., the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works that discuss how vulnerabilities in AI-generated code can be exploited to compromise security, and it lists attempts to improve the security of such AI-generated code. Overall, this work provides a comprehensive and systematic overview of the impact of AI on secure coding. This topic has sparked interest and concern within the software security engineering community. The work highlights the importance of setting up security measures and processes, such as code verification, and notes that such practices could be customized for AI-aided code production.
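To make the kind of weakness at issue concrete, the snippet below shows a generic instance of CWE-89 (SQL injection), one of the MITRE CWE Top 25 entries that reviews of this kind report recurring in AI-generated code, alongside the parameterized fix that code verification would enforce. Both variants are illustrative and not drawn from any particular model's output.

```python
# CWE-89 illustration: string-built SQL vs. a parameterized query.
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # String interpolation lets input like "alice' OR '1'='1" rewrite
    # the query and bypass the filter.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver treats the input strictly as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()
```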
{"title":"A systematic literature review on the impact of AI models on the security of code generation","authors":"Claudia Negri-Ribalta, Rémi Geraud-Stewart, Anastasia Sergeeva, Gabriele Lenzini","doi":"10.3389/fdata.2024.1386720","DOIUrl":"https://doi.org/10.3389/fdata.2024.1386720","url":null,"abstract":"Artificial Intelligence (AI) is increasingly used as a helper to develop computing programs. While it can boost software development and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software with vulnerabilities. This situation leads to the question: How serious and widespread are the security flaws in code generated using AI models?Through a systematic literature review, this work reviews the state of the art on how AI models impact software security. It systematizes the knowledge about the risks of using AI in coding security-critical software.It reviews what security flaws of well-known vulnerabilities (e.g., the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works that discuss how vulnerabilities in AI-generated code can be exploited to compromise security and lists the attempts to improve the security of such AI-generated code.Overall, this work provides a comprehensive and systematic overview of the impact of AI in secure coding. This topic has sparked interest and concern within the software security engineering community. It highlights the importance of setting up security measures and processes, such as code verification, and that such practices could be customized for AI-aided code production.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140985358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-10 DOI: 10.3389/fdata.2024.1188620
Visualization as irritation: producing knowledge about medieval courts through uncertainty
Silke Schwandt, Christian Wachter
Visualizations are ubiquitous in data-driven research, serving as both tools for knowledge production and genuine means of knowledge communication. Despite criticisms targeting the alleged objectivity of visualizations in the digital humanities (DH) and reflections on how they may serve as representations of both scholarly perspective and uncertainty within the data analysis pipeline, there remains a notable scarcity of in-depth theoretical grounding for these assumptions in DH discussions. It is our understanding that only through theoretical foundations such as basic semiotic principles and perspectives on media modality can one fully assess the use and potential of visualizations for innovation in scholarly interpretation. We argue that visualizations have the capacity to “productively irritate” existing scholarly knowledge in a given research field. This does not just mean that visualizations depict patterns in datasets that seem out of line with prior research and thus stimulate deeper examination. Complementarily, “irritation” here consists of visualizations producing uncertainty about their own meaning, yet it is precisely in this uncertainty that the potential for greater insight lies. It stimulates questions about what is depicted and what is not. This turns out to be a valuable resource for scholarly interpretation, and one could argue that visualizing big data is particularly fruitful in this sense because, owing to the data's complexity, researchers cannot interpret it without visual representations. However, we argue that “productive irritation” can also happen below the level of big data. We see this potential rooted in the genuinely semiotic and semantic properties of visual media, which studies in multimodality, and specifically in the field of Bildlinguistik, have carved out: a visualization's holistic overview of data patterns is juxtaposed with its semantic vagueness, which makes way for deep interpretations and multiple perspectives on the data. We elucidate this potential using examples from medieval English legal history. Visualizations of data relating to legal functions and social constellations of various people in court offer surprising insights that can lead to new knowledge through “productive irritation.”
{"title":"Visualization as irritation: producing knowledge about medieval courts through uncertainty","authors":"Silke Schwandt, Christian Wachter","doi":"10.3389/fdata.2024.1188620","DOIUrl":"https://doi.org/10.3389/fdata.2024.1188620","url":null,"abstract":"Visualizations are ubiquitous in data-driven research, serving as both tools for knowledge production and genuine means of knowledge communication. Despite criticisms targeting the alleged objectivity of visualizations in the digital humanities (DH) and reflections on how they may serve as representations of both scholarly perspective and uncertainty within the data analysis pipeline, there remains a notable scarcity of in-depth theoretical grounding for these assumptions in DH discussions. It is our understanding that only through theoretical foundations such as basic semiotic principles and perspectives on media modality one can fully assess the use and potential of visualizations for innovation in scholarly interpretation. We argue that visualizations have the capacity to “productively irritate” existing scholarly knowledge in a given research field. This does not just mean that visualizations depict patterns in datasets that seem not in line with prior research and thus stimulate deeper examination. Complementarily, “irritation” here consists of visualizations producing uncertainty about their own meaning—yet it is precisely this uncertainty in which the potential for greater insight lies. It stimulates questions about what is depicted and what is not. This turns out to be a valuable resource for scholarly interpretation, and one could argue that visualizing big data is particularly prolific in this sense, because due to their complexity researchers cannot interpret the data without visual representations. However, we argue that “productive irritation” can also happen below the level of big data. We see this potential rooted in the genuinely semiotic and semantic properties of visual media, which studies in multimodality and specifically in the field of Bildlinguistik have carved out: a visualization's holistic overview of data patterns is juxtaposed to its semantic vagueness, which gives way to deep interpretations and multiple perspectives on that data. We elucidate this potential using examples from medieval English legal history. Visualizations of data relating to legal functions and social constellations of various people in court offer surprising insights that can lead to new knowledge through “productive irritation.”","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140993695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1369895
Forecasting cryptocurrency's buy signal with a bagged tree learning approach to enhance purchase decisions
Raed Alsini, Q. Abu Al-haija, Abdulaziz A. Alsulami, Badraddin Alturki, Abdulaziz A. Alqurashi, M. D. Mashat, Ali Alqahtani, Nawaf Alhebaishi
The cryptocurrency market is captivating the attention of both retail and institutional investors. While this highly volatile market offers investors substantial profit opportunities, it also entails risks due to its sensitivity to speculative news and the erratic behavior of major investors, both of which can provoke unexpected price fluctuations. In this study, we contend that extreme and sudden price changes and atypical patterns might compromise the performance of the technical signals used as the basis for feature extraction in a machine learning-based trading system, by either augmenting or diminishing the model's generalization capability. To address this issue, this research uses a bagged tree (BT) model to forecast the buy signal for the cryptocurrency market; to benefit from it, traders must acquire knowledge about the cryptocurrency market and adjust their strategies accordingly. To support informed decisions, we derived the buy signal from the most prevalently used oscillators, namely the Relative Strength Index (RSI), Bollinger Bands (BB), and the Moving Average Convergence/Divergence (MACD) indicator. The research also evaluates how accurately the model can predict the performance of different cryptocurrencies such as Bitcoin (BTC), Ethereum (ETH), Cardano (ADA), and Binance Coin (BNB), and examines the efficacy of the most popular machine learning models in precisely forecasting outcomes within the cryptocurrency market. Notably, predicting buy-signal values using a BT model yields promising results.
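A minimal sketch of the implied feature pipeline follows: derive RSI, Bollinger Bands, and MACD from a close-price series, then fit a bagged tree on them. The input file and the forward-return label rule are hypothetical stand-ins (the paper's actual buy-signal definition is not reproduced here), and the code assumes pandas and scikit-learn 1.2 or later (for the `estimator` parameter).

```python
# Sketch: compute RSI, Bollinger Bands, and MACD features, then fit a
# bagged tree. CSV path and the "buy" label rule are hypothetical.
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("btc_daily.csv")          # hypothetical OHLCV file
close = df["close"]

# RSI (14-period, simple-average variant)
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
df["rsi"] = 100 - 100 / (1 + gain / loss)

# Bollinger Bands (20-period, 2 standard deviations)
mid = close.rolling(20).mean()
std = close.rolling(20).std()
df["bb_upper"], df["bb_lower"] = mid + 2 * std, mid - 2 * std

# MACD (12/26 EMAs) and its 9-period signal line
ema12 = close.ewm(span=12, adjust=False).mean()
ema26 = close.ewm(span=26, adjust=False).mean()
df["macd"] = ema12 - ema26
df["macd_signal"] = df["macd"].ewm(span=9, adjust=False).mean()

# Hypothetical label: price higher five periods ahead.
df["buy"] = (close.shift(-5) > close).astype(int)
df = df.dropna().iloc[:-5]   # drop rolling warm-up rows and unlabeled tail

X = df[["rsi", "bb_upper", "bb_lower", "macd", "macd_signal"]]
bt = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100)
bt.fit(X, df["buy"])
```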
{"title":"Forecasting cryptocurrency's buy signal with a bagged tree learning approach to enhance purchase decisions","authors":"Raed Alsini, Q. Abu Al-haija, Abdulaziz A. Alsulami, Badraddin Alturki, Abdulaziz A. Alqurashi, M. D. Mashat, Ali Alqahtani, Nawaf Alhebaishi","doi":"10.3389/fdata.2024.1369895","DOIUrl":"https://doi.org/10.3389/fdata.2024.1369895","url":null,"abstract":"The cryptocurrency market is captivating the attention of both retail and institutional investors. While this highly volatile market offers investors substantial profit opportunities, it also entails risks due to its sensitivity to speculative news and the erratic behavior of major investors, both of which can provoke unexpected price fluctuations.In this study, we contend that extreme and sudden price changes and atypical patterns might compromise the performance of technical signals utilized as the basis for feature extraction in a machine learning-based trading system by either augmenting or diminishing the model's generalization capability. To address this issue, this research uses a bagged tree (BT) model to forecast the buy signal for the cryptocurrency market. To achieve this, traders must acquire knowledge about the cryptocurrency market and modify their strategies accordingly.To make an informed decision, we depended on the most prevalently utilized oscillators, namely, the buy signal in the cryptocurrency market, comprising the Relative Strength Index (RSI), Bollinger Bands (BB), and the Moving Average Convergence/Divergence (MACD) indicator. Also, the research evaluates how accurately a model can predict the performance of different cryptocurrencies such as Bitcoin (BTC), Ethereum (ETH), Cardano (ADA), and Binance Coin (BNB). Furthermore, the efficacy of the most popular machine learning model in precisely forecasting outcomes within the cryptocurrency market is examined. Notably, predicting buy signal values using a BT model provides promising results.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140997120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1295009
Multi-modal recommender system for predicting project manager performance within a competency-based framework
Imene Jemal, Wilfried Armand Naoussi Sijou, Belkacem Chikhaoui
The evaluation of performance using competencies within a structured framework holds significant importance across various professional domains, particularly in roles like project manager. Typically, this assessment process, overseen by senior evaluators, involves scoring competencies based on data gathered from interviews, completed forms, and evaluation programs. However, this task is tedious and time-consuming, and it requires the expertise of qualified professionals. Moreover, it is compounded by the inconsistent scoring biases introduced by different evaluators. In this paper, we propose a novel approach to automatically predict competency scores, thereby facilitating the assessment of project managers' performance. Initially, we performed data fusion to compile a comprehensive dataset from various sources and modalities, including demographic data, profile-related data, and historical competency assessments. Subsequently, NLP techniques were used to pre-process the text data. Finally, recommender systems were explored to predict competency scores. We compared four different recommender system approaches: content-based filtering, demographic filtering, collaborative filtering, and hybrid filtering. Using assessment data collected from 38 project managers, encompassing scores across 67 different competencies, we evaluated the performance of each approach. Notably, the content-based approach yielded promising results, achieving a precision rate of 81.03%. Furthermore, we addressed the challenge of cold-starting, which in our context involves predicting scores either for a new project manager lacking competency data or for a newly introduced competency without historical records. Our analysis revealed that demographic filtering achieved an average precision of 54.05% when dealing with new project managers. In contrast, content-based filtering exhibited remarkable performance, achieving a precision of 85.79% in predicting scores for new competencies. These findings underscore the potential of recommender systems in competency assessment, thereby facilitating a more effective performance evaluation process.
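As a hedged illustration of the content-based approach that performed best, the sketch below represents each competency by a TF-IDF vector of a short description and predicts a manager's score on an unseen competency as the similarity-weighted mean of their observed scores. The competency names, descriptions, and scores are invented toy inputs; the authors' actual feature set and pipeline are richer than this.

```python
# Toy content-based filtering for competency scores: predict an unseen
# competency's score from similar, already-scored competencies.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical competency descriptions (the "content" side)
competencies = {
    "risk_management": "identify assess and mitigate project risks",
    "stakeholder_comm": "communicate with stakeholders and report project status",
    "budget_control": "plan track and control the project budget",
}
observed = {"risk_management": 4.0, "budget_control": 3.0}  # one manager

names = list(competencies)
tfidf = TfidfVectorizer().fit_transform(competencies.values())
sim = cosine_similarity(tfidf)                  # competency-competency similarity

target = names.index("stakeholder_comm")        # cold-start competency
idx = [names.index(c) for c in observed]
weights = sim[target, idx]
scores = np.array(list(observed.values()))
pred = (float(weights @ scores / weights.sum())
        if weights.sum() > 0 else float(scores.mean()))
print(f"predicted score: {pred:.2f}")
```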
{"title":"Multi-modal recommender system for predicting project manager performance within a competency-based framework","authors":"Imene Jemal, Wilfried Armand Naoussi Sijou, Belkacem Chikhaoui","doi":"10.3389/fdata.2024.1295009","DOIUrl":"https://doi.org/10.3389/fdata.2024.1295009","url":null,"abstract":"The evaluation of performance using competencies within a structured framework holds significant importance across various professional domains, particularly in roles like project manager. Typically, this assessment process, overseen by senior evaluators, involves scoring competencies based on data gathered from interviews, completed forms, and evaluation programs. However, this task is tedious and time-consuming, and requires the expertise of qualified professionals. Moreover, it is compounded by the inconsistent scoring biases introduced by different evaluators. In this paper, we propose a novel approach to automatically predict competency scores, thereby facilitating the assessment of project managers' performance. Initially, we performed data fusion to compile a comprehensive dataset from various sources and modalities, including demographic data, profile-related data, and historical competency assessments. Subsequently, NLP techniques were used to pre-process text data. Finally, recommender systems were explored to predict competency scores. We compared four different recommender system approaches: content-based filtering, demographic filtering, collaborative filtering, and hybrid filtering. Using assessment data collected from 38 project managers, encompassing scores across 67 different competencies, we evaluated the performance of each approach. Notably, the content-based approach yielded promising results, achieving a precision rate of 81.03%. Furthermore, we addressed the challenge of cold-starting, which in our context involves predicting scores for either a new project manager lacking competency data or a newly introduced competency without historical records. Our analysis revealed that demographic filtering achieved an average precision of 54.05% when dealing with new project managers. In contrast, content-based filtering exhibited remarkable performance, achieving a precision of 85.79% in predicting scores for new competencies. These findings underscore the potential of recommender systems in competency assessment, thereby facilitating more effective performance evaluation process.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140997081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1381163
Challenges and efforts in managing AI trustworthiness risks: a state of knowledge
Nineta Polemi, Isabel Praça, K. Kioskli, Adrien Bécue
This paper addresses critical gaps in existing AI risk management frameworks, emphasizing the neglect of human factors and the absence of metrics for socially related or human threats. Drawing on insights from the NIST AI RMF and ENISA, the research underscores the need to understand the limitations of human-AI interaction and to develop ethical and social measurements. The paper explores various dimensions of trustworthiness, covering legislation, AI cyber threat intelligence, and the characteristics of AI adversaries. It delves into technical threats and vulnerabilities, including data access, poisoning, and backdoors, highlighting the importance of collaboration between cybersecurity engineers, AI experts, and social-psychology-behavior-ethics professionals. Furthermore, the socio-psychological threats associated with AI integration into society are examined, addressing issues such as bias, misinformation, and privacy erosion. The manuscript proposes a comprehensive approach to AI trustworthiness, combining technical and social mitigation measures, standards, and ongoing research initiatives. Additionally, it introduces innovative defense strategies, such as cyber-social exercises, digital clones, and conversational agents, to enhance understanding of adversary profiles and fortify AI security. The paper concludes with a call for interdisciplinary collaboration, awareness campaigns, and continuous research efforts to create a robust and resilient AI ecosystem aligned with ethical standards and societal expectations.
{"title":"Challenges and efforts in managing AI trustworthiness risks: a state of knowledge","authors":"Nineta Polemi, Isabel Praça, K. Kioskli, Adrien Bécue","doi":"10.3389/fdata.2024.1381163","DOIUrl":"https://doi.org/10.3389/fdata.2024.1381163","url":null,"abstract":"This paper addresses the critical gaps in existing AI risk management frameworks, emphasizing the neglect of human factors and the absence of metrics for socially related or human threats. Drawing from insights provided by NIST AI RFM and ENISA, the research underscores the need for understanding the limitations of human-AI interaction and the development of ethical and social measurements. The paper explores various dimensions of trustworthiness, covering legislation, AI cyber threat intelligence, and characteristics of AI adversaries. It delves into technical threats and vulnerabilities, including data access, poisoning, and backdoors, highlighting the importance of collaboration between cybersecurity engineers, AI experts, and social-psychology-behavior-ethics professionals. Furthermore, the socio-psychological threats associated with AI integration into society are examined, addressing issues such as bias, misinformation, and privacy erosion. The manuscript proposes a comprehensive approach to AI trustworthiness, combining technical and social mitigation measures, standards, and ongoing research initiatives. Additionally, it introduces innovative defense strategies, such as cyber-social exercises, digital clones, and conversational agents, to enhance understanding of adversary profiles and fortify AI security. The paper concludes with a call for interdisciplinary collaboration, awareness campaigns, and continuous research efforts to create a robust and resilient AI ecosystem aligned with ethical standards and societal expectations.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140996466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1392662
Quantifying uncertainty in graph neural network explanations
Junji Jiang, Chen Ling, Hongyi Li, Guangji Bai, Xujiang Zhao, Liang Zhao
In recent years, analyzing the explanations for the predictions of Graph Neural Networks (GNNs) has attracted increasing attention. Despite this progress, most existing methods do not adequately consider the inherent uncertainties stemming from the randomness of model parameters and graph data, which may lead to overconfident and misleading explanations. It is challenging for most GNN explanation methods to quantify these uncertainties, since they obtain the prediction explanation in a post-hoc and model-agnostic manner without considering the randomness of graph data and model parameters. To address these problems, this paper proposes a novel uncertainty quantification framework for GNN explanations. To mitigate the randomness of graph data in the explanation, our framework accounts for two distinct data uncertainties, allowing for a direct assessment of the uncertainty in GNN explanations. To mitigate the randomness of learned model parameters, our method learns the parameter distribution directly from the data, obviating the need for assumptions about specific distributions. Moreover, the explanation uncertainty within model parameters is also quantified based on the learned parameter distributions. This holistic approach can integrate with any post-hoc GNN explanation method. Empirical results from our study show that our proposed method sets a new standard for GNN explanation performance across diverse real-world graph benchmarks.
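One generic way to expose the parameter-side uncertainty the abstract targets is Monte Carlo dropout: keep dropout active at inference, recompute a gradient-based node-importance score over many stochastic passes, and report its spread. The sketch below does this on a toy graph with an untrained two-layer GCN; it is a simplified stand-in for the paper's framework, which learns a parameter distribution from data rather than relying on dropout.

```python
# MC-dropout sketch: spread of a gradient-based node-importance score
# across stochastic forward passes approximates explanation uncertainty.
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    def __init__(self, in_dim, hid, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid)
        self.w2 = nn.Linear(hid, n_classes)
        self.drop = nn.Dropout(0.5)

    def forward(self, A, X):              # A: normalized adjacency
        H = torch.relu(A @ self.w1(X))
        return A @ self.w2(self.drop(H))

torch.manual_seed(0)
A = torch.eye(5) + torch.rand(5, 5).round()   # toy 5-node graph, self-loops
A = A / A.sum(1, keepdim=True)                # row-normalize
X = torch.rand(5, 8)                          # toy node features
model = TinyGCN(8, 16, 2)
model.train()                                 # keep dropout active

samples = []
for _ in range(50):                           # 50 stochastic passes
    x = X.clone().requires_grad_(True)
    model(A, x)[0, 1].backward()              # class-1 logit of node 0
    samples.append(x.grad.abs().sum(1))       # per-node saliency
S = torch.stack(samples)
print("mean node importance:", S.mean(0))
print("std (uncertainty):   ", S.std(0))
```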
{"title":"Quantifying uncertainty in graph neural network explanations","authors":"Junji Jiang, Chen Ling, Hongyi Li, Guangji Bai, Xujiang Zhao, Liang Zhao","doi":"10.3389/fdata.2024.1392662","DOIUrl":"https://doi.org/10.3389/fdata.2024.1392662","url":null,"abstract":"In recent years, analyzing the explanation for the prediction of Graph Neural Networks (GNNs) has attracted increasing attention. Despite this progress, most existing methods do not adequately consider the inherent uncertainties stemming from the randomness of model parameters and graph data, which may lead to overconfidence and misguiding explanations. However, it is challenging for most of GNN explanation methods to quantify these uncertainties since they obtain the prediction explanation in a post-hoc and model-agnostic manner without considering the randomness of graph data and model parameters. To address the above problems, this paper proposes a novel uncertainty quantification framework for GNN explanations. For mitigating the randomness of graph data in the explanation, our framework accounts for two distinct data uncertainties, allowing for a direct assessment of the uncertainty in GNN explanations. For mitigating the randomness of learned model parameters, our method learns the parameter distribution directly from the data, obviating the need for assumptions about specific distributions. Moreover, the explanation uncertainty within model parameters is also quantified based on the learned parameter distributions. This holistic approach can integrate with any post-hoc GNN explanation methods. Empirical results from our study show that our proposed method sets a new standard for GNN explanation performance across diverse real-world graph benchmarks.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140996593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-09 DOI: 10.3389/fdata.2024.1375818
A comprehensive investigation of clustering algorithms for User and Entity Behavior Analytics
Pierpaolo Artioli, Antonio Maci, Alessio Magrì
Government agencies are now encouraging industries to enhance their security systems to detect and respond proactively to cybersecurity incidents. Consequently, equipping organizations with a security operations center that combines the analytical capabilities of human experts with systems based on Machine Learning (ML) plays a critical role. In this setting, Security Information and Event Management (SIEM) platforms can effectively handle network-related events to trigger cybersecurity alerts. Furthermore, a SIEM may include a User and Entity Behavior Analytics (UEBA) engine that examines the behavior of both users and devices, or entities, within a corporate network. In recent literature, several contributions have employed ML algorithms for UEBA, especially those based on the unsupervised learning paradigm, because anomalous behaviors are usually not known in advance. However, to shorten the gap between research advances and practice, it is necessary to comprehensively analyze the effectiveness of these methodologies. This paper proposes a thorough investigation of traditional and emerging clustering algorithms for UEBA, considering multiple application contexts, i.e., different user-entity interaction scenarios. Our study involves three datasets sourced from the existing literature and fifteen clustering algorithms. Among the compared techniques, HDBSCAN and DenMune showed promising performance on the state-of-the-art CERT behavior-related dataset, producing groups with a density very close to the number of users.
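For orientation, here is a minimal sketch of the clustering step: standardize per-user behavioral features and group them with HDBSCAN, one of the two best-performing algorithms above. The random features are a hypothetical stand-in for CERT-style behavior logs, and the code assumes scikit-learn 1.3 or later, where `sklearn.cluster.HDBSCAN` is available.

```python
# Sketch: density-based clustering of user behavior features with HDBSCAN.
# Feature columns (e.g., logons/day, after-hours ratio, USB events,
# distinct hosts) are placeholders for real behavior-log aggregates.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))        # 200 users, 4 toy features

X = StandardScaler().fit_transform(features)
labels = HDBSCAN(min_cluster_size=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}),
      "| noise points:", int((labels == -1).sum()))
```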
{"title":"A comprehensive investigation of clustering algorithms for User and Entity Behavior Analytics","authors":"Pierpaolo Artioli, Antonio Maci, Alessio Magrì","doi":"10.3389/fdata.2024.1375818","DOIUrl":"https://doi.org/10.3389/fdata.2024.1375818","url":null,"abstract":"Government agencies are now encouraging industries to enhance their security systems to detect and respond proactively to cybersecurity incidents. Consequently, equipping with a security operation center that combines the analytical capabilities of human experts with systems based on Machine Learning (ML) plays a critical role. In this setting, Security Information and Event Management (SIEM) platforms can effectively handle network-related events to trigger cybersecurity alerts. Furthermore, a SIEM may include a User and Entity Behavior Analytics (UEBA) engine that examines the behavior of both users and devices, or entities, within a corporate network.In recent literature, several contributions have employed ML algorithms for UEBA, especially those based on the unsupervised learning paradigm, because anomalous behaviors are usually not known in advance. However, to shorten the gap between research advances and practice, it is necessary to comprehensively analyze the effectiveness of these methodologies. This paper proposes a thorough investigation of traditional and emerging clustering algorithms for UEBA, considering multiple application contexts, i.e., different user-entity interaction scenarios.Our study involves three datasets sourced from the existing literature and fifteen clustering algorithms. Among the compared techniques, HDBSCAN and DenMune showed promising performance on the state-of-the-art CERT behavior-related dataset, producing groups with a density very close to the number of users.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140994869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-07 DOI: 10.3389/fdata.2024.1184444
The role of big data in financial technology toward financial inclusion
David Mhlanga
In the rapidly evolving landscape of financial technology (FinTech), big data stands as a cornerstone, driving significant transformations. This study delves into the pivotal role of big data in FinTech and its implications for financial inclusion. Employing a comprehensive literature review methodology, we analyze diverse sources, including academic journals, industry reports, and online articles. Our findings illuminate how big data catalyzes the development of novel financial products and services, enhances risk management, and boosts operational efficiency, thereby fostering financial inclusion. In particular, big data's capability to offer insightful customer behavior analytics is highlighted as a key driver for creating inclusive financial services. However, challenges such as data privacy and security, and the need for ethical algorithmic practices, are also identified. This research contributes valuable insights for policymakers, regulators, and industry practitioners, suggesting a need for balanced regulatory frameworks to harness big data's potential ethically and responsibly. The outcomes of this study underscore the transformative power of big data in FinTech, indicating a pathway toward a more inclusive financial ecosystem.
{"title":"The role of big data in financial technology toward financial inclusion","authors":"David Mhlanga","doi":"10.3389/fdata.2024.1184444","DOIUrl":"https://doi.org/10.3389/fdata.2024.1184444","url":null,"abstract":"In the rapidly evolving landscape of financial technology (FinTech), big data stands as a cornerstone, driving significant transformations. This study delves into the pivotal role of big data in FinTech and its implications for financial inclusion. Employing a comprehensive literature review methodology, we analyze diverse sources including academic journals, industry reports, and online articles. Our findings illuminate how big data catalyzes the development of novel financial products and services, enhances risk management, and boosts operational efficiency, thereby fostering financial inclusion. Particularly, big data's capability to offer insightful customer behavior analytics is highlighted as a key driver for creating inclusive financial services. However, challenges such as data privacy and security, and the need for ethical algorithmic practices are also identified. This research contributes valuable insights for policymakers, regulators, and industry practitioners, suggesting a need for balanced regulatory frameworks to harness big data's potential ethically and responsibly. The outcomes of this study underscore the transformative power of big data in FinTech, indicating a pathway toward a more inclusive financial ecosystem.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":null,"pages":null},"PeriodicalIF":3.1,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141004699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}