Biodata Mining: Latest Articles

Deep joint learning diagnosis of Alzheimer's disease based on multimodal feature fusion.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-05 | DOI: 10.1186/s13040-024-00395-9
Jingru Wang, Shipeng Wen, Wenjie Liu, Xianglian Meng, Zhuqing Jiao

Alzheimer's disease (AD) is an advanced and incurable neurodegenerative disease. Genetic variations are intrinsic etiological factors contributing to the abnormal expression of brain function and structure in AD patients. A new multimodal feature fusion method called "magnetic resonance imaging (MRI)-p value" was proposed to construct 3D fusion images by introducing genes as a priori knowledge. Moreover, a new deep joint learning diagnostic model was constructed to fully learn image features. One branch trained a residual network (ResNet) to learn the features of local pathological regions. The other branch introduced attention convolution to learn the positions of brain regions that change differently across subject categories, and then obtained discriminative probability information for those locations via convolution and global average pooling. The feature and position information of the two branches were combined through linear interaction to form the diagnostic basis for classifying the different categories of subjects. Classification of AD versus healthy control (HC), AD versus mild cognitive impairment (MCI), and HC versus MCI was performed with data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The results showed that the proposed method achieved optimal results in AD-related diagnosis. The classification accuracy (ACC) and area under the curve (AUC) of the three experimental groups were 93.44% and 96.67% (AD vs. HC), 89.06% and 92% (AD vs. MCI), and 84% and 81.84% (HC vs. MCI), respectively. Moreover, a total of six novel genes were found to be significantly associated with AD, namely NTM, MAML2, NAALADL2, FHIT, TMEM132D and PCSK5, which provide new targets for the potential treatment of neurodegenerative diseases.

Citations: 0
Modeling heterogeneity of Sudanese hospital stay in neonatal and maternal unit: non-parametric random effect models with Gamma distribution.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-01 | DOI: 10.1186/s13040-024-00403-y
Amani Almohaimeed, Ishag Adam

Objective: The increased focus on severe maternal morbidity and mortality has prompted studies of the patient and institutional variables linked to extended hospital stays. Understanding how long patients are hospitalized after delivery is important for anticipating when hospitals will reach capacity and for predicting the corresponding staffing or equipment requirements. In Sudan, the distribution of the length of stay during delivery hospitalizations is heavily skewed, with an average length of stay of 2 to 3 days. This study aimed to investigate the use of a non-parametric random effect model with a Gamma-distributed response for analyzing skewed hospital length of stay data from a neonatal and maternal unit in Sudan.

Methods: We applied Gamma regression models with unknown random effects, estimated using the non-parametric maximum likelihood (NPML) technique [5]. NPML reduces the heterogeneity in the distribution of the response and produces a robust estimate, since it does not require any assumptions about the random-effect distribution. The same applies to the log-Gamma link, which does not require any transformation of the data and can handle outliers in the data points. In this study, the models were fitted with and without covariates and compared using AIC and BIC values.
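
The AIC/BIC comparison mentioned in the Methods can be illustrated with the standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The following is a minimal sketch only: the model names, log-likelihoods, parameter counts, and sample size are hypothetical placeholders and are not taken from the study.

```python
import math


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


# Hypothetical fits: name -> (maximized log-likelihood, number of parameters).
# These numbers are invented for illustration, not results from the paper.
CANDIDATES = {
    "gamma_glm_no_covariates": (-520.0, 2),
    "gamma_glm_with_covariates": (-495.0, 8),
    "gamma_npml_k4_with_covariates": (-470.0, 14),
}
N_OBS = 300  # hypothetical sample size


def rank_models(candidates, n_obs):
    """Return (name, AIC, BIC) tuples sorted by AIC, best model first."""
    rows = [
        (name, aic(log_lik, k), bic(log_lik, k, n_obs))
        for name, (log_lik, k) in candidates.items()
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, a, b in rank_models(CANDIDATES, N_OBS):
        print(f"{name}: AIC={a:.1f}, BIC={b:.1f}")
```

BIC penalizes extra parameters more heavily than AIC once n exceeds about 8, so the two criteria can disagree when a richer random-effect structure yields only a small likelihood gain.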

Results: The findings imply that in the context of health care database investigations, Gamma regression models with non-parametric random effect consistently reduce heterogeneity and improve model accuracy. The generalized linear model with covariates and random effect (k = 4) had the best fit, indicating that Sudanese hospital length of stay data could be classified into four groups with varying average stays influenced by maternal, neonatal, and obstetrics data.

Conclusion: Identifying factors contributing to longer stays allows hospitals to implement strategies for improvement. A non-parametric random effect model with a Gamma-distributed response effectively accounts for unobserved heterogeneity and individual-level variability, leading to more accurate inferences and improved patient care. Including random effects can significantly affect variable significance in statistical models, emphasizing the need to consider unobserved heterogeneity when analyzing data containing potential individual-level variability. The findings emphasise the importance of making robust methodological choices in healthcare research in order to inform accurate policy decisions.

Citations: 0
Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-10-30 | DOI: 10.1186/s13040-024-00397-7
Vanesa Gómez-Martínez, David Chushig-Muzo, Marit B Veierød, Conceição Granja, Cristina Soguero-Ruiz

Background: Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets present the class imbalance problem, where a few classes have numerous samples, whereas others are under-represented.

Methods: In this paper, we propose to combine ensemble feature selection (FS) methods and data augmentation with conditional tabular generative adversarial networks (CTGAN) to enhance melanoma identification in imbalanced datasets. We employed dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we used two feature extraction (FE) approaches: handcrafted features and embedding features. For the former, color, geometric, and first-, second-, and higher-order texture features were extracted, whereas for the latter, embeddings were obtained using ResNet-based models. To alleviate the high dimensionality of the extracted features, ensemble FS with filter methods was used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), which is related to the number of synthetic samples created, and evaluated the impact on the predictive results. To gain interpretability of the predictive models, we used SHAP, bootstrap resampling statistical tests, and UMAP visualizations.
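
The link between the imbalance ratio and the number of synthetic samples can be made concrete with a small calculation. This sketch assumes IR is defined as minority count divided by majority count, so that IR = 1.0 means fully balanced classes; the abstract does not state the definition, and the class counts below are hypothetical. The function name `synthetic_needed` is introduced here for illustration only.

```python
import math


def synthetic_needed(n_minority: int, n_majority: int, target_ir: float) -> int:
    """Synthetic minority samples needed to reach a target imbalance ratio.

    Assumes IR = n_minority / n_majority (an assumption, not taken from the
    paper). Returns 0 if the data already meet or exceed the target.
    """
    if not 0.0 < target_ir <= 1.0:
        raise ValueError("target_ir must lie in (0, 1]")
    # round() guards against floating-point noise before taking the ceiling.
    required_minority = math.ceil(round(target_ir * n_majority, 9))
    return max(0, required_minority - n_minority)


if __name__ == "__main__":
    # Hypothetical class counts, chosen only to illustrate the arithmetic.
    n_melanoma, n_not_melanoma = 40, 160
    for ir in (0.25, 0.5, 0.75, 0.9, 1.0):
        extra = synthetic_needed(n_melanoma, n_not_melanoma, ir)
        print(f"target IR={ir:.2f}: generate {extra} synthetic melanoma samples")
```

In a progressive analysis of this kind, each target IR would correspond to one augmented training set, with the synthetic rows drawn from a generator fitted on the minority class only.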

Results: The combination of ensemble FS, CTGAN, and linear models achieved the best predictive results, with AUCROC values of 87% (with support vector machine and IR=0.9) and 76% (with LASSO and IR=1.0) for PH2 and Derm7pt, respectively. We also found that melanoma lesions were mainly characterized by features related to color, while not-melanoma lesions were characterized by texture features.

Conclusions: Our results demonstrate the effectiveness of ensemble FS and synthetic data in the development of models that accurately identify melanoma. This research advances skin lesion analysis, contributing to both melanoma detection and the interpretation of main features for its identification.

Citations: 0
Priority-Elastic net for binary disease outcome prediction based on multi-omics data.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-10-29 | DOI: 10.1186/s13040-024-00401-0
Laila Musib, Roberta Coletti, Marta B Lopes, Helena Mouriño, Eunice Carrasquinha

Background: High-dimensional omics data integration has emerged as a prominent avenue within the healthcare industry, presenting substantial potential to improve predictive models. However, the data integration process faces several challenges, including data heterogeneity, choosing the order in which data blocks are prioritized when predictive information is spread across multiple blocks, assessing the flow of information from one omics level to another, and multicollinearity.

Methods: We propose the Priority-Elastic net algorithm, a hierarchical regression method extending Priority-Lasso for the binary logistic regression model by incorporating a priority order for blocks of variables while fitting Elastic-net models sequentially for each block. The fitted values from each step are then used as an offset in the subsequent step. Additionally, we considered the adaptive elastic-net penalty within our priority framework to compare the results.
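
The sequential fitting described above can be written out as a worked objective. This is a sketch under assumptions: the abstract only says that Elastic-net models are fitted block by block and that fitted values are carried forward as an offset, so the glmnet-style penalty parameterization, the symbols $\lambda_m$ and $\alpha$, and the omission of the intercept are choices made here for illustration.

```latex
% Blocks are ordered by priority: \pi_1, \pi_2, \dots, \pi_M.
% x_{i,\pi_m} holds the features of subject i in block \pi_m; y_i \in \{0,1\}.
\begin{align*}
  o_i^{(0)} &= 0, \\
  \eta_i^{(m)}(\beta) &= o_i^{(m-1)} + x_{i,\pi_m}^{\top}\beta, \\
  \hat{\beta}^{(m)} &= \arg\min_{\beta}\;
      -\frac{1}{n}\sum_{i=1}^{n}
        \Bigl[\, y_i\,\eta_i^{(m)}(\beta)
          - \log\bigl(1 + e^{\eta_i^{(m)}(\beta)}\bigr) \Bigr]
      + \lambda_m \Bigl[\, \alpha\,\lVert\beta\rVert_1
          + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \Bigr], \\
  o_i^{(m)} &= o_i^{(m-1)} + x_{i,\pi_m}^{\top}\hat{\beta}^{(m)},
  \qquad m = 1, \dots, M.
\end{align*}
```

Because each block only explains what the higher-priority blocks left unexplained, the ordering $\pi_1, \dots, \pi_M$ directly encodes prior knowledge about which data source should be trusted first.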

Results: The Priority-Elastic net and Priority-Adaptive Elastic net algorithms were evaluated on a brain tumor dataset available from The Cancer Genome Atlas (TCGA), accounting for transcriptomics, proteomics, and clinical information measured over two glioma types: Lower-grade glioma (LGG) and glioblastoma (GBM).

Conclusion: Our findings suggest that the Priority-Elastic net is a highly advantageous choice for a wide range of applications. It offers moderate computational complexity, flexibility in integrating prior knowledge while introducing a hierarchical modeling perspective, and, importantly, improved stability and accuracy in predictions, making it superior to the other methods discussed. This evolution marks a significant step forward in predictive modeling, offering a sophisticated tool for navigating the complexities of multi-omics datasets in pursuit of precision medicine's ultimate goal: personalized treatment optimization based on a comprehensive array of patient-specific data. This framework can be generalized to time-to-event, Cox proportional hazards regression and multicategorical outcomes. A practical implementation of this method is available upon request in R script, complete with an example to facilitate its application.

Citations: 0
A regularized Cox hierarchical model for incorporating annotation information in predictive omic studies.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-10-24 | DOI: 10.1186/s13040-024-00398-6
Dixin Shen, Juan Pablo Lewinger, Eric Kawaguchi

Background: High-dimensional omics data are often accompanied by "meta-features", such as biological pathways, functional annotations, and summary statistics from similar studies, that can be informative for predicting an outcome of interest. We introduce a regularized hierarchical framework for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes.

Methods: A hierarchical framework is deployed to incorporate meta-features. Regularization is applied to the omic features as well as the meta-features so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be efficiently fitted by a combination of iterative reweighted least squares and cyclic coordinate descent.
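
One way to write down a two-level regularized Cox model of this kind is shown below. It is a sketch under assumptions, not the authors' stated formulation: the abstract says only that regularization is applied to both the omic features and the meta-features, so the specific choice of a ridge penalty at the feature level and a lasso penalty at the meta-feature level, and the symbols $Z$, $\alpha$, $\lambda_1$, $\lambda_2$, are illustrative.

```latex
% X: n x p omic features; Z: p x q meta-features (one row per omic feature).
% (t_i, \delta_i): observed time and event indicator; R(t_i): risk set at t_i.
\begin{align*}
  \ell(\beta) &= \sum_{i:\,\delta_i = 1}
      \Bigl[\, x_i^{\top}\beta
        - \log \sum_{j \in R(t_i)} \exp\bigl(x_j^{\top}\beta\bigr) \Bigr], \\
  (\hat{\beta}, \hat{\alpha}) &= \arg\min_{\beta,\,\alpha}\;
      -\ell(\beta)
      + \frac{\lambda_1}{2}\,\lVert \beta - Z\alpha \rVert_2^2
      + \lambda_2\,\lVert \alpha \rVert_1 .
\end{align*}
```

In this form the feature effects $\beta$ are shrunk toward $Z\alpha$ rather than toward zero. If the meta-features are uninformative, $\hat{\alpha}$ can shrink to zero and the first penalty reduces to an ordinary ridge penalty on $\beta$, which is consistent with the robustness behaviour reported in the Conclusions.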

Results: In a simulation study we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. We illustrate the proposed model with applications to breast cancer and melanoma survival based on gene expression profiles, which show the improvement in prediction performance by applying meta-features, as well as the discovery of important omic feature sets with sparse regularization at meta-feature level.

Conclusions: The proposed hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process for time-to-event outcomes and improves prediction performance when the external meta-feature data is informative. Importantly, when the external meta-features are uninformative, the prediction performance based on the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework. In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model.

Citations: 0
G4 & the balanced metric family - a novel approach to solving binary classification problems in medical device validation & verification studies.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-10-23 | DOI: 10.1186/s13040-024-00402-z
Andrew Marra

Background: In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. A new metric called G4 is presented, which is the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 is part of a balanced metric family which includes the Unified Performance Measure (also known as P4) and the Matthews' Correlation Coefficient (MCC). The purpose of this manuscript is to unveil the benefits of using G4 together with the balanced metric family when analyzing the overall performance of binary classifiers.
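
The three balanced metrics can be computed directly from a 2x2 confusion matrix. The sketch below follows the definition of G4 given above (geometric mean of sensitivity, specificity, PPV, and NPV). The definitions of P4 as the harmonic mean of the same four rates and of MCC are the standard ones and are not spelled out in the abstract; the class and function names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from a binary confusion matrix."""
    tp: int
    fp: int
    tn: int
    fn: int


def rates(c: Confusion):
    """Return (sensitivity, specificity, PPV, NPV).

    Assumes every marginal total is non-zero; otherwise a rate is undefined.
    """
    sens = c.tp / (c.tp + c.fn)
    spec = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)
    npv = c.tn / (c.tn + c.fn)
    return sens, spec, ppv, npv


def g4(c: Confusion) -> float:
    """Geometric mean of sensitivity, specificity, PPV, and NPV."""
    sens, spec, ppv, npv = rates(c)
    return (sens * spec * ppv * npv) ** 0.25


def p4(c: Confusion) -> float:
    """Harmonic mean of the same four rates (0.0 if any rate is zero)."""
    values = rates(c)
    if min(values) == 0.0:
        return 0.0
    return 4.0 / sum(1.0 / v for v in values)


def mcc(c: Confusion) -> float:
    """Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt(
        (c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn)
    )
    return (c.tp * c.tn - c.fp * c.fn) / denom


if __name__ == "__main__":
    example = Confusion(tp=90, fp=10, tn=90, fn=10)  # made-up counts
    print(f"G4={g4(example):.3f}  P4={p4(example):.3f}  MCC={mcc(example):.3f}")
```

G4 and P4 range over [0, 1] while MCC ranges over [-1, 1], so the three values are not directly comparable in magnitude even when they agree on which classifier is better.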

Results: Simulated datasets encompassing different prevalence rates of the minority class were analyzed under a multi-reader multi-case study design. In addition, data from an independently published study that tested the performance of a unique ultrasound artificial intelligence algorithm in the context of breast cancer detection were also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI's performance. As the prevalence rate increased or decreased and the data became more imbalanced, AUROC tended to overvalue or undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Under certain circumstances where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments while P4 provided a stronger effect size when evaluating between-groups analyses. G4 acted as a middle ground for maximizing both standalone assessments and between-groups analyses.

Conclusions: Use of AUROC as the primary endpoint in binary classification problems provides misleading results as the dataset becomes more imbalanced. This is explicitly noticed when incorporating AUROC in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Therefore, researchers are encouraged to explore the balanced metric family when evaluating binary classification problems.
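
The definitions in this abstract can be made concrete with a small sketch. The abstract states that G4 is the geometric mean of sensitivity, specificity, PPV, and NPV; the P4 and MCC formulas below are the standard textbook definitions rather than anything quoted from the paper. The function name `balanced_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
import math


def balanced_metrics(tp, fp, tn, fn):
    """Compute G4, P4 and MCC from the four cells of a 2x2 confusion matrix.

    Assumption: every row and column total is non-zero, so no rate is undefined.
    """
    sens = tp / (tp + fn)  # sensitivity (recall)
    spec = tn / (tn + fp)  # specificity
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    rates = [sens, spec, ppv, npv]

    # G4: geometric mean of the four rates, as described in the abstract.
    g4 = math.prod(rates) ** 0.25

    # P4: harmonic mean of the same four rates (taken as zero if any rate is zero).
    p4 = 0.0 if min(rates) == 0 else 4 / sum(1 / r for r in rates)

    # MCC: correlation between predicted and true binary labels.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"sens": sens, "spec": spec, "ppv": ppv, "npv": npv,
            "g4": g4, "p4": p4, "mcc": mcc}


# Hypothetical counts, for illustration only.
metrics = balanced_metrics(tp=90, fp=20, tn=80, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

Because all three summary metrics use every cell of the confusion matrix, a weak predictive value in the minority class pulls them down, which is the imbalance sensitivity the abstract contrasts with AUROC.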

Citations: 0
From COVID-19 to monkeypox: a novel predictive model for emerging infectious diseases.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-10-22 DOI: 10.1186/s13040-024-00396-8
Deren Xu, Weng Howe Chan, Habibollah Haron, Hui Wen Nies, Kohbalan Moorthy

The outbreak of emerging infectious diseases poses significant challenges to global public health. Accurate early forecasting is crucial for effective resource allocation and emergency response planning. This study aims to develop a comprehensive predictive model for emerging infectious diseases, integrating the blending framework, transfer learning, incremental learning, and the biological feature Rt to increase prediction accuracy and practicality. By transferring features from a COVID-19 dataset to a monkeypox dataset and introducing dynamically updated incremental learning techniques, the model's predictive capability in data-scarce scenarios was significantly improved. The research findings demonstrate that the blending framework performs exceptionally well in short-term (7-day) predictions. Furthermore, the combination of transfer learning and incremental learning techniques significantly enhanced the adaptability and precision, with a 91.41% improvement in the RMSE and an 89.13% improvement in the MAE. In particular, the inclusion of the Rt feature enabled the model to more accurately reflect the dynamics of disease spread, further improving the RMSE by 1.91% and the MAE by 2.17%. This study underscores the significant application potential of multimodel fusion and real-time data updates in infectious disease prediction, offering new theoretical perspectives and technical support. This research not only enriches the theoretical foundation of infectious disease prediction models but also provides reliable technical support for public health emergency responses. Future research should continue to explore integrating data from multiple sources and enhancing model generalization capabilities to further enhance the practicality and reliability of predictive tools.
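
As a reading aid for the error figures quoted above, the sketch below shows how RMSE and MAE are conventionally computed and one common way to express a percentage improvement over a baseline. The abstract does not say how its percentages were derived, so the relative-reduction formula in `percent_improvement` is an assumption, and the case counts and forecasts are made-up illustrative values.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


def percent_improvement(baseline_error, new_error):
    """Relative error reduction in percent (assumed definition, not from the paper)."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical 7-day case counts and two hypothetical forecasts.
observed = [12, 15, 14, 18, 21, 19, 23]
baseline_forecast = [20, 9, 22, 10, 30, 11, 31]
improved_forecast = [13, 14, 15, 17, 20, 20, 22]

print(percent_improvement(rmse(observed, baseline_forecast),
                          rmse(observed, improved_forecast)))
print(percent_improvement(mae(observed, baseline_forecast),
                          mae(observed, improved_forecast)))
```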

Citations: 0
PAGER: A novel genotype encoding strategy for modeling deviations from additivity in complex trait association studies.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-10-11 DOI: 10.1186/s13040-024-00393-x
Philip J Freda, Attri Ghosh, Priyanka Bhandary, Nicholas Matsumoto, Apurva S Chitre, Jiayan Zhou, Molly A Hall, Abraham A Palmer, Tayo Obafemi-Ajayi, Jason H Moore

Background: The additive model of inheritance assumes that heterozygotes (Aa) are exactly intermediate with respect to homozygotes (AA and aa). While this model is commonly used in single-locus genetic association studies, significant deviations from additivity are well documented and contribute to phenotypic variance across many traits and systems. This assumption can introduce type I and type II errors by overestimating or underestimating the effects of variants that deviate from additivity. Alternative genotype encoding strategies have been explored to account for different inheritance patterns, but they often incur significant computational or methodological costs. To address these challenges, we introduce PAGER (Phenotype Adjusted Genotype Encoding and Ranking), an efficient pre-processing method that encodes each genetic variant based on normalized mean phenotypic differences between diallelic genotype classes (AA, Aa, and aa). This approach more accurately reflects each variant's true inheritance model, improving model precision while minimizing the costs associated with alternative encoding strategies.

Results: Through extensive benchmarking on SNPs simulated with both binary and continuous phenotypes, we demonstrate that PAGER accurately represents various inheritance patterns (including additive, dominant, recessive, and heterosis), achieves levels of statistical power that meet or exceed other encoding strategies, and attains computation speeds up to 55 times faster than a similar method, EDGE. We also apply PAGER to publicly available real-world data and identify a novel, relevant putative QTL associated with body mass index in rats (Rattus norvegicus) that is not detected with the additive model.

Conclusions: Overall, we show that PAGER is an efficient genotype encoding approach that can uncover sources of missing heritability and reveal novel insights in the study of complex traits while incurring minimal costs.
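
The encoding idea described in this abstract, per-variant values derived from normalized mean phenotypic differences between genotype classes, can be illustrated with a rough sketch. This is not the authors' implementation: the choice of AA as the reference class, the min-max normalization, the 0/1/2 genotype coding, and the function name `pager_like_encoding` are all assumptions made for illustration.

```python
from statistics import mean


def pager_like_encoding(genotypes, phenotypes):
    """Map each genotype class (0 = AA, 1 = Aa, 2 = aa) to a value in [0, 1].

    Assumed procedure: take the mean phenotype per class, express it as a
    difference from the AA class, then min-max scale the differences.
    Requires at least one AA (0) individual.
    """
    by_class = {}
    for g, y in zip(genotypes, phenotypes):
        by_class.setdefault(g, []).append(y)

    class_means = {g: mean(values) for g, values in by_class.items()}
    reference = class_means[0]  # AA serves as the anchor class (assumption)
    diffs = {g: m - reference for g, m in class_means.items()}

    low, high = min(diffs.values()), max(diffs.values())
    if high == low:  # no phenotypic difference between classes
        return {g: 0.0 for g in diffs}
    return {g: (d - low) / (high - low) for g, d in diffs.items()}


# Hypothetical toy data: an additive-looking and a dominant-looking variant.
genotypes = [0, 0, 1, 1, 2, 2]
print(pager_like_encoding(genotypes, [0, 0, 1, 1, 2, 2]))  # heterozygote lands midway
print(pager_like_encoding(genotypes, [0, 0, 2, 2, 2, 2]))  # heterozygote matches aa
```

Under this sketch the heterozygote's encoded value is learned from the data instead of being fixed at the midpoint, which is how a phenotype-driven encoding can represent dominant, recessive, or heterotic patterns that the additive 0/1/2 coding cannot.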

Citations: 0
Decoding the genetic comorbidity network of Alzheimer's disease.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-10-09 DOI: 10.1186/s13040-024-00394-w
Xueli Zhang, Dantong Li, Siting Ye, Shunming Liu, Shuo Ma, Min Li, Qiliang Peng, Lianting Hu, Xianwen Shang, Mingguang He, Lei Zhang

Alzheimer's disease (AD) has emerged as the most prevalent and complex neurodegenerative disorder among the elderly population. However, the etiology of AD's genetic comorbidities remains poorly understood. In this study, we conducted pleiotropic analysis for 41 AD phenotypic comorbidities, identifying ten genetic comorbidities with 16 pleiotropic genes associated with AD. Through biological functional and network analysis, we elucidated the molecular and functional landscape of AD genetic comorbidities. Furthermore, leveraging the pleiotropic genes and reported biomarkers for AD genetic comorbidities, we identified 50 potential biomarkers for AD diagnosis. Our findings deepen the understanding of the occurrence of AD genetic comorbidities and provide new insights for the search for AD diagnostic markers.

Citations: 0
MDVarP: modifier ~ disease-causing variant pairs predictor.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-10-08 DOI: 10.1186/s13040-024-00392-y
Hong Sun, Yunqin Chen, Liangxiao Ma

Background: Modifiers significantly impact disease phenotypes by modulating the effects of disease-causing variants, resulting in varying disease manifestations among individuals. However, identifying genetic interactions between modifier and disease-causing variants is challenging.

Results: We developed MDVarP, an ensemble model comprising 1000 random forest predictors, to identify modifier ~ disease-causing variant combinations. MDVarP achieves high accuracy and precision, as verified using an independent dataset with published evidence of genetic interactions. We identified 25 novel modifier ~ disease-causing variant combinations and obtained supporting evidence for these associations. MDVarP outputs a class label ("Associated-pair" or "Nonrelevant-pair") and two prediction scores indicating the probability of a true association.

Conclusions: MDVarP prioritizes variant pairs associated with phenotypic modulations, enabling more effective mapping of functional contributions from disease-causing and modifier variants. This framework interprets genetic interactions underlying phenotypic variations in human diseases, with potential applications in personalized medicine and disease prevention.

Citations: 0