
Journal of Big Data: Latest Publications

DAPS diagrams for defining Data Science projects
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-12 | DOI: 10.1186/s40537-024-00916-7
Jeroen de Mast, Joran Lokkerbol

Background

Models for structuring big-data and data-analytics projects typically start with a definition of the project’s goals and the business value they are expected to create. The literature identifies proper project definition as crucial for a project’s success, and also recognizes that the translation of business objectives into data-analytic problems is a difficult task. Unfortunately, common project structures, such as CRISP-DM, provide little guidance for this crucial stage when compared to subsequent project stages such as data preparation and modeling.

Contribution

This paper contributes structure to the project-definition stage of data-analytic projects by proposing the Data-Analytic Problem Structure (DAPS). The diagrammatic technique facilitates the collaborative development of a consistent and precise definition of a data-analytic problem, and the articulation of how it contributes to the organization’s goals. In addition, the technique helps to identify important assumptions and to break down large ambitions into manageable subprojects.

Methods

The semi-formal specification technique took other models for problem structuring — common in fields such as operations research and business analytics — as a point of departure. The proposed technique was applied in 47 real data-analytic projects and refined based on the results, following a design-science approach.
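The abstract does not spell out the exact elements of a DAPS diagram, but the underlying idea of a precise, decomposable problem definition can be sketched as a small data structure. The field names below are illustrative assumptions for a hypothetical problem record, not DAPS's actual notation:

```python
from dataclasses import dataclass, field

@dataclass
class ProblemDefinition:
    """Hypothetical record capturing the kind of elements a DAPS-style
    problem definition collects (field names are illustrative)."""
    business_goal: str       # the organizational objective served
    analytic_question: str   # the data-analytic problem derived from it
    unit_of_analysis: str    # e.g. customer, transaction, machine
    target_variable: str     # what the analysis predicts or estimates
    assumptions: list = field(default_factory=list)
    subproblems: list = field(default_factory=list)  # nested ProblemDefinition

    def flatten(self):
        """Depth-first list of this problem and all of its subproblems,
        mirroring the breakdown of a large ambition into subprojects."""
        out = [self]
        for sub in self.subproblems:
            out.extend(sub.flatten())
        return out
```

A nested definition then makes the decomposition explicit: a top-level churn-reduction goal, say, can carry two subprojects that are reviewed and scoped independently.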

Citations: 0
B-CAT: a model for detecting botnet attacks using deep attack behavior analysis on network traffic flows
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-10 | DOI: 10.1186/s40537-024-00900-1
Muhammad Aidiel Rachman Putra, Tohari Ahmad, Dandy Pramana Hostiadi

Threats on computer networks have been increasing rapidly, and irresponsible parties constantly try to exploit network vulnerabilities for various dangerous purposes. One way to exploit vulnerabilities in a computer network is by employing malware. Botnets are a type of malware that infects and attacks targets in groups. Botnets develop quickly; attacks that were initially sporadic have become periodic and simultaneous. This rapid development shows that botnets are sophisticated and require more attention and proper handling. Many studies have introduced detection models for botnet attack activity on computer networks. Apart from detecting the presence of botnet attacks, those studies have attempted to explore botnet characteristics such as attack intensity, relationships between activities, and time segments. However, no research has explicitly detected those characteristics. Meanwhile, each botnet characteristic requires different handling, and recognizing the characteristics of a botnet can help network administrators make appropriate decisions. For these reasons, this research builds a detection model that recognizes botnet characteristics using sequential traffic mining and similarity analysis. The proposed method consists of two main processes: training, which builds a knowledge base, and testing, which detects botnet activity and attack characteristics. It involves dynamic thresholds to improve the model's sensitivity in recognizing attack characteristics through similarity analysis. The novelty lies in developing and combining the analytical techniques of sequential traffic mining, similarity analysis, and dynamic thresholds to detect and recognize the characteristics of botnet attacks explicitly from actual behavior in network traffic. Extensive experiments have been conducted on three different datasets; the results show better performance than existing methods.
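The abstract does not give B-CAT's formulas. As a rough, hypothetical illustration of similarity analysis with a dynamic threshold, one can derive the threshold from pairwise similarities within the knowledge base of attack profiles and flag a new traffic flow whose best match clears it (the cosine measure and the mean-minus-k-sigma rule are assumptions, not the paper's method):

```python
import math

def cosine(a, b):
    """Cosine similarity between two flow feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_threshold(profiles, k=1.0):
    """Threshold adapts to how tightly the known attack profiles cluster,
    instead of being a fixed constant."""
    sims = [cosine(p, q)
            for i, p in enumerate(profiles) for q in profiles[i + 1:]]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    return mean - k * std

def is_botnet_like(flow, profiles, k=1.0):
    """Flag the flow if its best match against any known attack profile
    exceeds the data-driven threshold."""
    return max(cosine(flow, p) for p in profiles) >= dynamic_threshold(profiles, k)
```

With two tightly clustered attack profiles, a flow resembling them is flagged while an orthogonal flow is not.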

Citations: 0
Computer aided technology based on graph sample and aggregate attention network optimized for soccer teaching and training
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-05 | DOI: 10.1186/s40537-024-00893-x
Guanghui Yang, Xinyuan Feng

Football is the most popular game in the world and has significant influence on various aspects of society, including politics, economy, and culture. The experience of football-developed nations has shown that the steady growth of youth football is crucial for elevating a nation's overall football proficiency. It is essential to develop techniques and strategies adapted to players' individual physical features in order to address their lack of practice in various areas. In this manuscript, computer-aided technology based on a Graph Sample and Aggregate Attention Network Optimized for Soccer Teaching and Training (CAT-GSAAN-STT) is proposed to improve the efficiency of soccer teaching and training. The proposed method contains four stages: data collection, data preprocessing, prediction, and optimization. Initially, the input data are collected by a Microsoft Kinect V2 smart camera. The collected data are then preprocessed using improved graph collaborative filtering. After preprocessing, the data are passed to a motion recognition layer, where prediction is performed using a Graph Sample and Aggregate Attention Network (GSAAN) to improve the effectiveness of soccer teaching and training. To enhance the accuracy of the system, the GSAAN is optimized using Artificial Rabbits Optimization. The proposed CAT-GSAAN-STT method is implemented in Python, and its efficiency is examined with different metrics, such as accuracy, computation time, learning activity analysis, student performance ratio, and teaching evaluation analysis. The simulation outcomes prove that the proposed technique attains 28.33%, 31.60%, and 25.63% higher recognition accuracy and 33.67%, 38.12%, and 27.34% lower evaluation time compared with existing techniques, namely the computer-aided teaching system based upon artificial intelligence in football teaching with training (STT-IOT-CATS), the Computer Aided Teaching System for Football Teaching and Training Based on Video Image (CAT-STT-VI), and the method for enhancing football coaching quality using artificial intelligence and metaverse empowerment in a mobile internet environment (SI-STQ-AI-MIE), respectively.
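The GSAAN architecture itself is not specified in the abstract, but the "graph sample and aggregate" idea it builds on (GraphSAGE-style neighborhood aggregation) can be sketched without learned weights or attention. This is a simplified mean aggregator over a fixed-size neighbor sample, not the paper's model:

```python
import random

def sample_and_aggregate(node, adj, feats, num_samples=2, rng=None):
    """One GraphSAGE-style step: sample up to num_samples neighbors of the
    node and mean-aggregate their feature vectors with the node's own.
    (GSAAN would additionally weight neighbors via attention.)"""
    rng = rng or random.Random(0)
    neigh = adj.get(node, [])
    sampled = neigh if len(neigh) <= num_samples else rng.sample(neigh, num_samples)
    vecs = [feats[node]] + [feats[n] for n in sampled]
    dim = len(feats[node])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```

Fixed-size sampling is what makes this family of models scale: the cost per node is bounded regardless of the true neighborhood size.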

Citations: 0
Adapting transformer-based language models for heart disease detection and risk factors extraction
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-04 | DOI: 10.1186/s40537-024-00903-y
Essam H. Houssein, Rehab E. Mohamed, Gang Hu, Abdelmgeid A. Ali

Efficiently treating cardiac patients before the onset of a heart attack relies on the precise prediction of heart disease. Identifying and detecting the risk factors for heart disease such as diabetes mellitus, Coronary Artery Disease (CAD), hyperlipidemia, hypertension, smoking, familial CAD history, obesity, and medications is critical for developing effective preventative and management measures. Although Electronic Health Records (EHRs) have emerged as valuable resources for identifying these risk factors, their unstructured format poses challenges for cardiologists in retrieving relevant information. This research proposes employing transfer learning techniques to automatically extract heart disease risk factors from EHRs. Transfer learning, a deep learning technique, has demonstrated significant performance in various clinical natural language processing (NLP) applications, particularly in heart disease risk prediction. This study explored the application of transformer-based language models, specifically pre-trained architectures such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, BioClinicalBERT, XLNet, and BioBERT, for heart disease detection and the extraction of related risk factors from clinical notes, using the i2b2 dataset. These transformer models are pre-trained on an extensive corpus of medical literature and clinical records to gain a deep understanding of contextualized language representations. The adapted models are then fine-tuned using annotated datasets specific to heart disease, such as the i2b2 dataset, enabling them to learn patterns and relationships within the domain. These models have demonstrated superior performance in extracting semantic information from EHRs, automating high-performance heart disease risk factor identification, and performing downstream NLP tasks within the clinical domain. This study fine-tuned five widely used transformer-based models, namely BERT, RoBERTa, BioClinicalBERT, XLNet, and BioBERT, using the 2014 i2b2 clinical NLP challenge dataset. The fine-tuned models surpass conventional approaches in predicting the presence of heart disease risk factors with impressive accuracy. The RoBERTa model achieved the highest performance, with a micro F1-score of 94.27%, while the BERT, BioClinicalBERT, XLNet, and BioBERT models provided competitive performances with micro F1-scores of 93.73%, 94.03%, 93.97%, and 93.99%, respectively. Finally, a simple ensemble of the five transformer-based models is proposed, which outperformed most existing methods in heart disease risk factor identification, achieving a micro F1-score of 94.26%. This study demonstrates the efficacy of transfer learning using transformer-based models in enhancing risk prediction and facilitating early intervention for heart disease prevention.
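The micro F1-scores reported above pool true positives, false positives, and false negatives over all risk-factor labels before computing precision and recall, which weights frequent labels more heavily than macro averaging. A minimal sketch of the metric (the label names in the usage are hypothetical):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label annotations: counts are pooled
    across all documents and labels before precision/recall are computed."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels correctly extracted
        fp += len(p - g)   # spurious labels
        fn += len(g - p)   # missed labels
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, with gold annotations [{"diabetes", "smoking"}, {"obesity"}] and predictions [{"diabetes"}, {"obesity", "CAD"}], the pooled counts are tp=2, fp=1, fn=1, giving micro F1 of 2/3.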

Citations: 0
Gene selection via improved nuclear reaction optimization algorithm for cancer classification in high-dimensional data
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-04-03 | DOI: 10.1186/s40537-024-00902-z

Abstract

RNA Sequencing (RNA-Seq) has been considered a revolutionary technique in gene profiling and quantification. It offers a comprehensive view of the transcriptome, making it a more expansive technique than microarrays. Genes that discriminate between malignant and normal samples can be deduced using quantitative gene expression. However, this data is a high-dimensional dense matrix; each sample has a dimension of more than 20,000 genes, which poses challenges. This paper proposes RBNRO-DE (Relief Binary NRO based on Differential Evolution) to handle the gene selection strategy on (rnaseqv2 illuminahiseq rnaseqv2 un edu Level 3 RSEM genes normalized) data with more than 20,000 genes, picking the most informative genes and assessing them across 22 cancer datasets. The k-nearest Neighbor (k-NN) and Support Vector Machine (SVM) classifiers are applied to assess the quality of the selected genes. Binary versions of the most common meta-heuristic algorithms were compared with the proposed RBNRO-DE algorithm. In most of the 22 cancer datasets, the RBNRO-DE algorithm based on k-NN and SVM classifiers achieved optimal convergence and classification accuracy up to 100%, combined with a feature reduction size down to 98%, a clear advantage over its counterparts according to Wilcoxon’s rank-sum test (5% significance level).
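RBNRO-DE itself is not reproduced here, but the wrapper evaluation that binary gene-selection meta-heuristics rely on, scoring a candidate 0/1 gene mask by classifier accuracy plus feature reduction, can be sketched with a leave-one-out 1-NN. The 1-NN choice and the weighting factor alpha are simplifying assumptions, not the paper's exact setup:

```python
def one_nn_accuracy(X, y, mask):
    """Leave-one-out 1-NN accuracy using only the genes where mask == 1."""
    idx = [j for j, m in enumerate(mask) if m]
    correct = 0
    for i in range(len(X)):
        best, best_d = None, float("inf")
        for k in range(len(X)):
            if k == i:
                continue
            d = sum((X[i][j] - X[k][j]) ** 2 for j in idx)  # squared Euclidean
            if d < best_d:
                best, best_d = k, d
        correct += y[best] == y[i]
    return correct / len(X)

def fitness(X, y, mask, alpha=0.99):
    """Wrapper objective: trade accuracy against the fraction of genes kept,
    so masks that drop uninformative genes score higher."""
    reduction = 1 - sum(mask) / len(mask)
    return alpha * one_nn_accuracy(X, y, mask) + (1 - alpha) * reduction
```

On a toy dataset where the first gene separates the classes and the second is noise, the mask keeping only the informative gene outscores the mask keeping both, which is exactly the pressure a binary meta-heuristic like RBNRO-DE exploits.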

Citations: 0
The role of classifiers and data complexity in learned Bloom filters: insights and recommendations
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-03-27 | DOI: 10.1186/s40537-024-00906-9
Dario Malchiodi, Davide Raimondi, Giacomo Fumagalli, Raffaele Giancarlo, Marco Frasca

Bloom filters, since their introduction over 50 years ago, have become a pillar for handling membership queries in small space, with relevant applications in Big Data Mining and Stream Processing. Further improvements have recently been proposed with the use of Machine Learning techniques: learned Bloom filters. The latter considerably complicate the proper parameter setting of this multi-criteria data structure, in particular the choice of one of its key components (the classifier) and the accounting for the classification complexity of the input dataset. Given this state of the art, our contributions are as follows. (1) A novel methodology, supported by software, for designing, analyzing and implementing learned Bloom filters that accounts for their multi-criteria nature, in particular concerning classifier type choice and data classification complexity. Extensive experiments show the validity of the proposed methodology and, since our software is public, we offer a valid tool to practitioners interested in using learned Bloom filters. (2) Further contributions of great practical relevance to the advancement of the state of the art are the following: (a) the classifier inference time should not be taken as a proxy for the filter reject time; (b) of the many classifiers we have considered, only two offer good performance; this result agrees with and further strengthens early findings in the literature; (c) the sandwiched Bloom filter, already known as one of the references of this area, is further shown here to have the remarkable property of robustness to data complexity and classifier performance variability.
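The basic learned Bloom filter design the paper analyzes pairs a classifier with a backup Bloom filter: keys the classifier scores below a threshold go into the backup, so no key is ever falsely rejected. A minimal sketch (the toy scoring function, filter size, and hash count below are assumptions for illustration):

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: k hash positions per item over an m-bit array."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._hashes(item))

class LearnedBloomFilter:
    """Classifier + backup filter: only keys scored below the threshold are
    stored in the backup, so membership queries have no false negatives."""
    def __init__(self, score, threshold, keys):
        self.score, self.threshold = score, threshold
        self.backup = BloomFilter()
        for key in keys:
            if score(key) < threshold:
                self.backup.add(key)

    def __contains__(self, key):
        return self.score(key) >= self.threshold or key in self.backup
```

The paper's point (a) is visible in this structure: a query rejected by the classifier still pays for the backup-filter lookup, so classifier inference time alone does not determine the reject time.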

引用次数: 0
Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-26 DOI: 10.1186/s40537-024-00905-w
Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar

In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison of model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.
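The "rank features, keep the top k" step both compared methods share can be sketched as below. To keep the sketch dependency-free, absolute Pearson correlation stands in for the importance score — an assumption for illustration only; the study itself ranks by SHAP values and by the model's built-in importances.

```python
from statistics import mean

def abs_corr(xs, ys):
    """Absolute Pearson correlation, a stand-in importance score."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def select_top_k(X, y, k):
    """Score every feature column, rank them, keep the top-k indices."""
    n_features = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Feature 1 tracks the label exactly; feature 0 is noise; feature 2 is constant.
X = [[5, 0, 1], [1, 1, 1], [1, 0, 1], [5, 1, 1]]
y = [0, 1, 0, 1]
print(select_top_k(X, y, 2))  # feature 1 is ranked first
```

In practice the scoring line is the only part that changes between the two strategies: swap in `model.feature_importances_` or mean absolute SHAP values, retrain on the selected subset, and compare AUPRC as the study does.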

Multi-sample ζ-mixup: richer, more realistic synthetic samples from a p-series interpolant
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-23 DOI: 10.1186/s40537-024-00898-6

Abstract

Modern deep learning training procedures rely on model regularization techniques such as data augmentation methods, which generate training samples that increase the diversity of data and richness of label information. A popular recent method, mixup, uses convex combinations of pairs of original samples to generate new samples. However, as we show in our experiments, mixup can produce undesirable synthetic samples, where the data is sampled off the manifold and can contain incorrect labels. We propose ζ-mixup, a generalization of mixup with provably and demonstrably desirable properties that allows convex combinations of T ≥ 2 samples, leading to more realistic and diverse outputs that incorporate information from T original samples by using a p-series interpolant. We show that, compared to mixup, ζ-mixup better preserves the intrinsic dimensionality of the original datasets, which is a desirable property for training generalizable models. Furthermore, we show that our implementation of ζ-mixup is faster than mixup, and extensive evaluation on controlled synthetic and 26 diverse real-world natural and medical image classification datasets shows that ζ-mixup outperforms mixup, CutMix, and traditional data augmentation techniques. The code will be released at https://github.com/kakumarabhishek/zeta-mixup.
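The p-series weighting the abstract describes can be sketched as follows: each of the T samples receives a rank via a random permutation, is weighted proportionally to rank^(-γ), and the weights are normalized to a convex combination. The γ value here is an illustrative assumption; the paper derives the admissible range itself.

```python
import random

def zeta_mixup_weights(T, gamma=2.8, rng=random):
    """p-series weights: assign each of the T samples a rank via a
    random permutation, weight it by rank**(-gamma), then normalize."""
    ranks = list(range(1, T + 1))
    rng.shuffle(ranks)
    norm = sum(r ** -gamma for r in range(1, T + 1))
    return [r ** -gamma / norm for r in ranks]

def zeta_mixup(samples, gamma=2.8, rng=random):
    """Convex combination of T >= 2 samples (each a flat feature list)."""
    w = zeta_mixup_weights(len(samples), gamma, rng)
    dim = len(samples[0])
    return [sum(wi * s[d] for wi, s in zip(w, samples)) for d in range(dim)]

random.seed(0)
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = zeta_mixup(samples)
print(mixed)
```

With a sufficiently large γ the top-ranked sample keeps more than half the total weight, which is what keeps the synthetic point close to one original sample — and hence near the data manifold — while still blending in information from the other T − 1 samples.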

Learning manifolds from non-stationary streams
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-23 DOI: 10.1186/s40537-023-00872-8
Suchismit Mahapatra, Varun Chandola

Streaming adaptations of manifold-learning-based dimensionality reduction methods, such as Isomap, are based on the assumption that a small initial batch of observations is enough for exact learning of the manifold, while remaining streaming data instances can be cheaply mapped to this manifold. However, there are no theoretical results to show that this core assumption is valid. Moreover, such methods typically assume that the underlying data distribution is stationary and are not equipped to detect, or handle, sudden changes or gradual drifts in the distribution that may occur when the data is streaming. We present theoretical results to show that the quality of a manifold asymptotically converges as the size of data increases. We then show that a Gaussian Process Regression (GPR) model, that uses a manifold-specific kernel function and is trained on an initial batch of sufficient size, can closely approximate the state-of-the-art streaming Isomap algorithms, and the predictive variance obtained from the GPR prediction can be employed as an effective detector of changes in the underlying data distribution. Results on several synthetic and real data sets show that the resulting algorithm can effectively learn lower dimensional representation of high dimensional data in a streaming setting, while identifying shifts in the generative distribution. For instance, key findings on a Gas sensor array data set show that our method can detect changes in the underlying data stream, triggered due to real-world factors, such as introduction of a new gas in the system, while efficiently mapping data on a low-dimensional manifold.
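The change-detection idea — GPR predictive variance grows for queries far from the training data — can be illustrated with a tiny 1-D sketch. This uses a plain RBF kernel rather than the paper's manifold-specific kernel, and the lengthscale and noise values are assumed toy settings.

```python
import math

def rbf(a, b, ls=1.0):
    return math.exp(-((a - b) ** 2) / (2 * ls ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (A: n x n, b: length n)."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gpr_variance(train_x, query_x, noise=1e-6, ls=1.0):
    """Posterior predictive variance at query_x; a large value flags a
    query far from the training data, i.e. a possible distribution shift."""
    K = [[rbf(a, b, ls) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    k_star = [rbf(a, query_x, ls) for a in train_x]
    v = solve(K, k_star)                      # K^{-1} k_*
    return rbf(query_x, query_x, ls) - sum(ks * vi for ks, vi in zip(k_star, v))

train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
print(gpr_variance(train_x, 1.1), gpr_variance(train_x, 8.0))
```

A query inside the span of the initial batch yields near-zero variance, while one far outside yields variance close to the kernel's prior value — thresholding this quantity is what turns the GPR into a drift detector for the stream.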

An adaptive hybrid african vultures-aquila optimizer with Xgb-Tree algorithm for fake news detection
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-19 DOI: 10.1186/s40537-024-00895-9
Amr A. Abd El-Mageed, Amr A. Abohany, Asmaa H. Ali, Khalid M. Hosny

Online platforms and social networks have proliferated in recent years. They are now a major news source worldwide, leading to the online proliferation of Fake News (FNs). These FNs are alarming because they fundamentally reshape public opinion, which may cause users to leave these online platforms, threatening the reputations of several organizations and industries. This rapid dissemination of FNs makes it imperative for automated systems to detect them, encouraging many researchers to propose various systems to classify news articles and detect FNs automatically. In this paper, a Fake News Detection (FND) methodology is presented based on an effective IBAVO-AO algorithm, a hybridization of the African Vultures Optimization (AVO) and Aquila Optimization (AO) algorithms, with an extreme gradient boosting Tree (Xgb-Tree) classifier. The suggested methodology involves three main phases: initially, the unstructured FNs dataset is analyzed, and the essential features are extracted by tokenizing, encoding, and padding the input news words into a sequence of integers utilizing the GloVe approach. Then, the extracted features are filtered using the effective Relief algorithm to select only the appropriate ones. Finally, the recovered features are used to classify the news items using the suggested IBAVO-AO algorithm based on the Xgb-Tree classifier. Hence, the suggested methodology is distinguished from prior models in that it performs automatic data pre-processing, optimization, and classification tasks. The proposed methodology is carried out on the ISOT-FNs dataset, containing more than 44 thousand news articles divided into truthful and fake. We validated the proposed methodology’s reliability by examining numerous evaluation metrics involving accuracy, fitness values, the number of selected features, Kappa, Precision, Recall, F1-score, Specificity, Sensitivity, ROC_AUC, and MCC. 
Then, the proposed methodology is compared against the most common meta-heuristic optimization algorithms utilizing the ISOT-FNs. The experimental results reveal that the suggested methodology achieved optimal classification accuracy and F1-score and successfully categorized more than 92.5% of news articles compared to its peers. This study will assist researchers in expanding their understanding of meta-heuristic optimization algorithms applications for FND.
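The tokenize → encode → pad preprocessing step of the first phase can be sketched as follows. The regex tokenizer and the toy vocabulary are illustrative assumptions; the methodology itself maps the resulting integer sequences through GloVe embeddings.

```python
import re

def build_vocab(texts, pad="<pad>", unk="<unk>"):
    """Map each word to an integer id; 0 = padding, 1 = out-of-vocabulary."""
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_len):
    """Tokenize, map words to ids, then pad/truncate to a fixed length."""
    ids = [vocab.get(w, vocab["<unk>"])
           for w in re.findall(r"[a-z']+", text.lower())]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

news = ["Breaking: markets rally today", "Markets fall amid fears"]
vocab = build_vocab(news)
seqs = [encode(t, vocab, max_len=6) for t in news]
print(seqs)
```

Fixed-length integer sequences like these are what the later phases consume: the Relief filter prunes the feature set, and the IBAVO-AO-tuned Xgb-Tree classifier makes the final truthful/fake decision.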

Graphical Abstract

Journal of Big Data