
Biodata Mining: Latest Articles

Transcriptome-based network analysis related to regulatory T cells infiltration identified RCN1 as a potential biomarker for prognosis in clear cell renal cell carcinoma.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-14 | DOI: 10.1186/s13040-024-00404-x
Yang Qixin, Huang Jing, He Jiang, Liu Xueyang, Yu Lu, Li Yuehua

Background: Regulatory T cells (Tregs) play a critical role in shaping the immunosuppressive microenvironment within tumors. Investigating the role of Tregs in clear cell renal cell carcinoma (ccRCC) is crucial for identifying prognostic markers and therapeutic targets for the disease.

Methods: Weighted gene co-expression network analysis (WGCNA) was utilized to pinpoint modules related to Treg infiltration in TCGA-KIRC samples. Following this, consensus clustering was employed to derive two clusters associated with Treg infiltration in ccRCC. A prognostic model was then developed using the gene module associated with Treg infiltration. We then evaluated the ability of the prognostic model to predict ccRCC overall survival and demonstrated that RCN1 can be used as a target to predict ccRCC prognosis.

Results: The two clusters associated with Treg infiltration differed in immune microenvironment composition, pathway activation, prognosis, and sensitivity to drugs commonly used in ccRCC treatment. Furthermore, a 7-gene model risk score, developed based on ccRCC Treg infiltration, proved to be a reliable prognostic marker in both training and validation cohorts. Additionally, survival analysis indicated that RCN1 serves as a reliable prognostic factor for ccRCC. Single-cell sequencing analysis revealed that RCN1 is predominantly expressed in tumor cells. A pan-cancer analysis highlighted that RCN1 is linked with poor prognosis and the activation of inflammatory response pathways across various cancers.
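
The abstract does not say how the 7-gene risk score is computed. A common construction for gene signatures of this kind, assumed here, is a linear combination of per-gene coefficients and expression values, with samples split into high- and low-risk groups at the median score. The gene names, coefficients, and expression values below are hypothetical placeholders, not values from the paper.

```python
from statistics import median


def risk_scores(expression, coefficients):
    """Linear risk score per sample: sum over genes of coefficient * expression."""
    return {
        sample: sum(coefficients[gene] * values[gene] for gene in coefficients)
        for sample, values in expression.items()
    }


def split_by_median(scores):
    """Label a sample 'high' when its score is strictly above the median, else 'low'."""
    cutoff = median(scores.values())
    return {s: ("high" if v > cutoff else "low") for s, v in scores.items()}


# Hypothetical two-gene signature and three samples (illustration only).
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
expression = {
    "s1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "s2": {"GENE_A": 4.0, "GENE_B": 0.0},
    "s3": {"GENE_A": 0.0, "GENE_B": 4.0},
}

scores = risk_scores(expression, coefficients)
groups = split_by_median(scores)
```

In practice the coefficients would come from a fitted survival model, and the cutoff would be fixed on the training cohort before being applied to a validation cohort.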

Conclusion: We developed a prognostic model associated with Treg infiltration, which facilitates the clinical categorization of ccRCC progression. Moreover, our findings underscore the significant potential of RCN1 as a ccRCC biomarker.

Biodata Mining 17(1):51, Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11566375/pdf/
Citations: 0
Deciphering the tissue-specific functional effect of Alzheimer risk SNPs with deep genome annotation.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-13 | DOI: 10.1186/s13040-024-00400-1
Pradeep Varathan Pugalenthi, Bing He, Linhui Xie, Kwangsik Nho, Andrew J Saykin, Jingwen Yan

Alzheimer's disease (AD) is a highly heritable form of dementia marked by substantial loss of cognitive function. Large-scale genome-wide association studies (GWASs) have identified a set of SNPs significantly associated with AD and related traits. GWAS hits usually emerge as clusters in which a lead SNP with the highest significance is surrounded by less significant neighboring SNPs. Although functionality is not guaranteed even for the strongest associations in GWASs, lead SNPs have historically been the focus of the field, with the remaining associations inferred to be redundant. Recent deep genome annotation tools predict function from a segment of DNA sequence with significantly improved precision, which allows in-silico mutagenesis to interrogate the functional effect of SNP alleles. In this project, we explored the impact of top AD GWAS hits around the APOE region on chromatin functions and whether that impact is altered by the genetic context (i.e., alleles of neighboring SNPs). Our results showed that highly correlated SNPs in the same LD block could have distinct impacts on downstream functions. Although some GWAS lead SNPs showed dominant functional effects regardless of the neighboring SNP alleles, several other SNPs did exhibit enhanced loss or gain of function under certain genetic contexts, suggesting potential additional information hidden in the LD blocks.
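
The context-dependent in-silico mutagenesis described above can be sketched as follows: substitute the target SNP's reference and alternate alleles into a sequence under every combination of neighboring alleles, and record the change in a model's score each time. This is a minimal sketch, not the authors' pipeline. `toy_score` is a hypothetical stand-in for a deep genome annotation model, and the sequence, positions, and alleles are made up.

```python
from itertools import product


def apply_alleles(sequence, alleles):
    """Return a copy of `sequence` with {position: base} substitutions applied."""
    bases = list(sequence)
    for pos, base in alleles.items():
        bases[pos] = base
    return "".join(bases)


def toy_score(sequence):
    """Hypothetical placeholder for a deep annotation model: fraction of G/C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def context_effects(sequence, target, neighbors, score=toy_score):
    """Effect of the target SNP (alt score minus ref score) per neighbor-allele context.

    target:    (position, ref_allele, alt_allele)
    neighbors: {position: (ref_allele, alt_allele)}
    Returns {tuple of neighbor alleles, ordered by position: effect}.
    """
    pos, ref, alt = target
    positions = sorted(neighbors)
    effects = {}
    for combo in product(*(neighbors[p] for p in positions)):
        context = dict(zip(positions, combo))
        ref_seq = apply_alleles(sequence, {**context, pos: ref})
        alt_seq = apply_alleles(sequence, {**context, pos: alt})
        effects[combo] = score(alt_seq) - score(ref_seq)
    return effects


# Made-up 8-base sequence with one target SNP and two neighboring SNPs.
effects = context_effects(
    "AATTAATT",
    target=(3, "T", "G"),
    neighbors={1: ("A", "C"), 6: ("T", "G")},
)
```

Because the placeholder score is additive over bases, the effect here is identical in all four contexts. A sequence model that captures interactions between positions could return different effects per context, which is the situation the abstract reports for some SNPs.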

Biodata Mining 17(1):50, Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11558841/pdf/
Citations: 0
Investigating potential drug targets for IgA nephropathy and membranous nephropathy through multi-queue plasma protein analysis: a Mendelian randomization study based on SMR and co-localization analysis.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-08 | DOI: 10.1186/s13040-024-00405-w
Xinyi Xu, Changhong Miao, Shirui Yang, Lu Xiao, Ying Gao, Fangying Wu, Jianbo Xu

Background: Membranous nephropathy (MN) and IgA nephropathy (IgAN) pose challenges in clinical treatment with existing therapies primarily focusing on symptom relief and often yielding unsatisfactory outcomes. The search for novel drug targets remains crucial to address the shortcomings in managing both kidney diseases.

Methods: Utilizing GWAS data for MN (ncase = 2150, ncontrol = 5829) and IgAN (ncase = 15587, ncontrol = 462197), instrumental variables for plasma proteins were derived from recent GWAS. Sensitivity analysis involved bidirectional Mendelian randomization analysis, MR Steiger, Bayesian co-localization, and Phenotype scanning. The SMR analysis using eQTL data from the eQTLGen Consortium was conducted to assess the availability of selected protein targets. The PPI network was constructed to reveal potential associations with existing drug treatment targets.

Results: The study, subjected to the stringent Bonferroni correction, revealed significant associations: four proteins with MN and three proteins with IgAN. In plasma protein cis-pQTL data from two cohorts, an increase in one standard deviation in PLA2R1 (OR = 2.01, 95%CI = 1.83-2.21), AIF1 (OR = 9.04, 95%CI = 4.69-17.41), MLN (OR = 3.79, 95%CI = 2.12-6.78), and NFKB1 (OR = 29.43, 95%CI = 7.73-112.0) was associated with an increased risk of MN. Additionally, in plasma protein cis-pQTL data, a standard deviation increase in FCGR3B (OR = 1.15, 95%CI = 1.09-1.22) and BTN3A1 (OR = 4.05, 95%CI = 2.65-6.19) correlated with elevated IgAN risk, while AIF1 (OR = 0.58, 95%CI = 0.46-0.73) exhibited IgAN protection. Bayesian co-localization indicated that PLA2R1 (coloc.abf-PPH4 = 0.695), NFKB1 (coloc.abf-PPH4 = 0.949), FCGR3B (coloc.abf-PPH4 = 0.909), and BTN3A1 (coloc.abf-PPH4 = 0.685) share the same variants associated with MN and IgAN. The SMR analysis indicated a causal link between NFKB1 and BTN3A1 plasma protein eQTL in both conditions, and BTN3A1 was validated externally.
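
As a reading aid for the odds ratios above, the log-odds estimate and its standard error can be approximately recovered from a reported OR and 95% CI, and the resulting p-value compared with a Bonferroni threshold. This is a sketch under two assumptions: the interval is a Wald interval symmetric on the log scale, and `n_tests` is a hypothetical count, since the abstract does not state how many proteins were tested. The recovered values are approximate because the published figures are rounded.

```python
import math


def log_or_and_se(odds_ratio, ci_lower, ci_upper, z_crit=1.959964):
    """Recover the log odds ratio and its standard error from an OR and 95% CI.

    Assumes a Wald interval: exp(beta +/- z_crit * se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z_crit)
    return beta, se


def two_sided_p(beta, se):
    """Two-sided p-value for beta / se under a standard normal reference."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# FCGR3B and IgAN, as reported in the abstract: OR = 1.15, 95% CI = 1.09-1.22.
beta, se = log_or_and_se(1.15, 1.09, 1.22)
p_value = two_sided_p(beta, se)

# Hypothetical number of tested proteins (not given in the abstract).
n_tests = 2000
passes = p_value < bonferroni_threshold(0.05, n_tests)
```

The same helper applies to any of the reported OR and CI pairs. Very wide intervals, such as the one for NFKB1, translate into large standard errors on the log scale.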

Conclusion: Genetically influenced plasma levels of PLA2R1 and NFKB1 impact MN risk, while FCGR3B and BTN3A1 levels are causally linked to IgAN risk, suggesting potential drug targets for further clinical exploration, notably BTN3A1 for IgAN.

Biodata Mining 17(1):49, Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11545554/pdf/
Citations: 0
Deep joint learning diagnosis of Alzheimer's disease based on multimodal feature fusion.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-05 | DOI: 10.1186/s13040-024-00395-9
Jingru Wang, Shipeng Wen, Wenjie Liu, Xianglian Meng, Zhuqing Jiao

Alzheimer's disease (AD) is an advanced and incurable neurodegenerative disease. Genetic variations are intrinsic etiological factors contributing to the abnormal expression of brain function and structure in AD patients. A new multimodal feature fusion called "magnetic resonance imaging (MRI)-p value" was proposed to construct 3D fusion images by introducing genes as a priori knowledge. Moreover, a new deep joint learning diagnostic model was constructed to fully learn image features. One branch trained a residual network (ResNet) to learn the features of local pathological regions. The other branch used attention convolution to learn the positions of brain regions that change differently across the categories of subjects, and then obtained discriminative probability information from those locations via convolution and global average pooling. The feature and position information of the two branches were combined linearly to form the diagnostic basis for classifying the different categories of subjects. Three diagnostic tasks, AD versus healthy control (HC), AD versus mild cognitive impairment (MCI), and HC versus MCI, were performed with data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The results showed that the proposed method achieved optimal results in AD-related diagnosis. The classification accuracy (ACC) and area under the curve (AUC) of the three experimental groups were 93.44% and 96.67%, 89.06% and 92%, and 84% and 81.84%, respectively. Moreover, a total of six novel genes were found to be significantly associated with AD, namely NTM, MAML2, NAALADL2, FHIT, TMEM132D and PCSK5, which provided new targets for the potential treatment of neurodegenerative diseases.

Biodata Mining 17(1):48, Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536794/pdf/
Citations: 0
Modeling heterogeneity of Sudanese hospital stay in neonatal and maternal unit: non-parametric random effect models with Gamma distribution.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-11-01 | DOI: 10.1186/s13040-024-00403-y
Amani Almohaimeed, Ishag Adam

Objective: Studies looking into patient and institutional variables linked to extended hospital stays have arisen as a result of the increased focus on severe maternal morbidity and mortality. Understanding the length of hospitalization of patients after delivery is important to gain insights into when hospitals will reach capacity and to predict corresponding staffing or equipment requirements. In Sudan, the distribution of the length of stay during delivery hospitalizations is heavily skewed, with the average length of stay of 2 to 3 days. This study aimed to investigate the use of non-parametric random effect model with Gamma distributed response for analyzing skewed hospital length of stay data in Sudan in neonatal and maternal unit.

Methods: We applied Gamma regression models with unknown random effects, estimated using the non-parametric maximum likelihood (NPML) technique [5]. NPML reduces the heterogeneity in the distribution of the response and produces robust estimates because it requires no assumptions about the distribution. Likewise, the log-Gamma link requires no transformation of the data and can accommodate outliers. In this study, the models were fitted with and without covariates and compared using AIC and BIC values.
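
The model comparison mentioned above relies on the standard definitions AIC = 2p - 2 ln L and BIC = p ln(n) - 2 ln L, where p is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The sketch below applies them to two candidate fits. The log-likelihoods, parameter counts, and sample size are hypothetical and are not taken from the study.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2p - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: p ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


# Hypothetical fits: model name -> (maximized log-likelihood, parameter count).
fits = {
    "no random effect": (-520.0, 3),
    "random effect, k = 4": (-480.0, 9),
}
n_obs = 200  # hypothetical sample size

table = {
    name: {"AIC": aic(ll, p), "BIC": bic(ll, p, n_obs)}
    for name, (ll, p) in fits.items()
}
best_by_aic = min(table, key=lambda name: table[name]["AIC"])
best_by_bic = min(table, key=lambda name: table[name]["BIC"])
```

BIC penalizes extra parameters more heavily than AIC once n exceeds about 8, so the two criteria can disagree when an additional mass point improves the likelihood only slightly.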

Results: The findings imply that in the context of health care database investigations, Gamma regression models with non-parametric random effect consistently reduce heterogeneity and improve model accuracy. The generalized linear model with covariates and random effect (k = 4) had the best fit, indicating that Sudanese hospital length of stay data could be classified into four groups with varying average stays influenced by maternal, neonatal, and obstetrics data.

Conclusion: Identifying factors contributing to longer stays allows hospitals to implement strategies for improvement. Non-parametric random effect model with Gamma distributed response effectively accounts for unobserved heterogeneity and individual-level variability, leading to more accurate inferences and improved patient care. Including random effects can significantly affect variable significance in statistical models, emphasizing the need to consider unobserved heterogeneity when analyzing data containing potential individual-level variability. The findings emphasise the importance of making robust methodological choices in healthcare research in order to inform accurate policy decisions.

Biodata Mining 17(1):47, Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529257/pdf/
Citations: 0
Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability.
IF 4 | Zone 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2024-10-30 | DOI: 10.1186/s13040-024-00397-7
Vanesa Gómez-Martínez, David Chushig-Muzo, Marit B Veierød, Conceição Granja, Cristina Soguero-Ruiz

Background: Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets present the class imbalance problem, where a few classes have numerous samples, whereas others are under-represented.

Methods: In this paper, we propose to combine ensemble feature selection (FS) methods and data augmentation with the conditional tabular generative adversarial networks (CTGAN) to enhance melanoma identification in imbalanced datasets. We employed dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we conduct two feature extraction (FE) approaches, including handcrafted and embedding features. For the former, color, geometric and first-, second-, and higher-order texture features were extracted, whereas for the latter, embeddings were obtained using ResNet-based models. To alleviate the high-dimensionality in the FE, ensemble FS with filter methods were used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), related to the amount of synthetic samples created, and evaluated the impact on the predictive results. To gain interpretability on predictive models, we used SHAP, bootstrap resampling statistical tests and UMAP visualizations.
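
The progressive analysis of the imbalance ratio can be illustrated with a small calculation of how many synthetic minority-class samples are needed to reach a target IR. The sketch assumes IR is defined as minority count divided by majority count, so that IR = 1.0 means balanced classes. The abstract does not spell out the definition, and the class counts below are hypothetical.

```python
import math


def imbalance_ratio(n_minority, n_majority):
    """Imbalance ratio, assumed here to be minority count / majority count."""
    return n_minority / n_majority


def synthetic_needed(n_minority, n_majority, target_ir):
    """Synthetic minority samples required to reach at least `target_ir`.

    A small epsilon guards against floating-point error before rounding up.
    """
    required_minority = math.ceil(target_ir * n_majority - 1e-9)
    return max(required_minority - n_minority, 0)


# Hypothetical dataset: 40 melanoma (minority) and 160 non-melanoma (majority) lesions.
n_minority, n_majority = 40, 160
start_ir = imbalance_ratio(n_minority, n_majority)

# Number of synthetic samples to generate for a progression of target IRs.
schedule = {
    target: synthetic_needed(n_minority, n_majority, target)
    for target in (0.25, 0.5, 0.9, 1.0)
}
```

Each entry in `schedule` is the number of rows a tabular generator such as CTGAN would be asked to sample for the minority class before the classifier is retrained at that IR.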

Results: The combination of ensemble FS, CTGAN, and linear models achieved the best predictive results, achieving AUCROC values of 87% (with support vector machine and IR=0.9) and 76% (with LASSO and IR=1.0) for the PH2 and Derm7pt, respectively. We also identified that melanoma lesions were mainly characterized by features related to color, while not-melanoma lesions were characterized by texture features.

Conclusions: Our results demonstrate the effectiveness of ensemble FS and synthetic data in the development of models that accurately identify melanoma. This research advances skin lesion analysis, contributing to both melanoma detection and the interpretation of main features for its identification.

Biodata Mining 17(1):46. Published 2024-10-30. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526724/pdf/
Citations: 0
Priority-Elastic net for binary disease outcome prediction based on multi-omics data.
IF 4; Zone 3, Biology; Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-10-29; DOI: 10.1186/s13040-024-00401-0
Laila Musib, Roberta Coletti, Marta B Lopes, Helena Mouriño, Eunice Carrasquinha

Background: High-dimensional omics data integration has emerged as a prominent avenue within the healthcare industry, with substantial potential to improve predictive models. However, the integration process faces several challenges, including data heterogeneity, choosing the order in which data blocks are prioritized when predictive information is spread across multiple blocks, assessing the flow of information from one omics level to another, and multicollinearity.

Methods: We propose the Priority-Elastic net algorithm, a hierarchical regression method extending Priority-Lasso for the binary logistic regression model by incorporating a priority order for blocks of variables while fitting Elastic-net models sequentially for each block. The fitted values from each step are then used as an offset in the subsequent step. Additionally, we considered the adaptive elastic-net penalty within our priority framework to compare the results.
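The sequential, offset-based fitting described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' R implementation: the solver is a plain proximal-gradient elastic-net logistic regression written here for self-containment, the offset passed to each block is the cumulative linear predictor of the earlier blocks, and the names `fit_enet_logistic`, `priority_elastic_net`, `lam`, and `l1_ratio` are hypothetical.

```python
import numpy as np


def fit_enet_logistic(X, y, offset, lam=0.05, l1_ratio=0.5, n_iter=2000):
    """Elastic-net logistic regression with a fixed offset (proximal gradient).

    Minimises mean log-loss
      + lam * (l1_ratio * ||b||_1 + (1 - l1_ratio) / 2 * ||b||_2^2).
    The intercept is not penalised.
    """
    n, p = X.shape
    A = np.hstack([np.ones((n, 1)), X])   # intercept column + features
    w = np.zeros(p + 1)                   # w[0] is the intercept
    ridge = lam * (1.0 - l1_ratio)
    # 1 / (Lipschitz bound of the smooth part) is a safe step size.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 / (4.0 * n) + ridge)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(A @ w + offset)))
        grad = A.T @ (prob - y) / n
        grad[1:] += ridge * w[1:]
        w = w - step * grad
        # Soft-thresholding applies the L1 part to the non-intercept terms.
        w[1:] = np.sign(w[1:]) * np.maximum(
            np.abs(w[1:]) - step * lam * l1_ratio, 0.0
        )
    return w[0], w[1:]


def priority_elastic_net(blocks, y, **kwargs):
    """Fit blocks in priority order; each fit is the offset for the next."""
    offset = np.zeros(len(y))
    fits = []
    for X in blocks:                      # highest-priority block first
        b0, b = fit_enet_logistic(X, y, offset, **kwargs)
        offset = offset + b0 + X @ b      # carry the linear predictor forward
        fits.append((b0, b))
    return fits, offset
```

Calling `priority_elastic_net([X_clinical, X_protein, X_rna], y)` with hypothetical block matrices returns the per-block coefficients and the final linear predictor. Because later blocks only explain what earlier blocks left unexplained, the ordering of `blocks` encodes the analyst's prior knowledge.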

Results: The Priority-Elastic net and Priority-Adaptive Elastic net algorithms were evaluated on a brain tumor dataset available from The Cancer Genome Atlas (TCGA), accounting for transcriptomics, proteomics, and clinical information measured over two glioma types: Lower-grade glioma (LGG) and glioblastoma (GBM).

Conclusion: Our findings suggest that the Priority-Elastic net is a highly advantageous choice for a wide range of applications. It offers moderate computational complexity, flexibility in integrating prior knowledge while introducing a hierarchical modeling perspective, and, importantly, improved stability and accuracy in predictions, making it superior to the other methods discussed. This evolution marks a significant step forward in predictive modeling, offering a sophisticated tool for navigating the complexities of multi-omics datasets in pursuit of precision medicine's ultimate goal: personalized treatment optimization based on a comprehensive array of patient-specific data. This framework can be generalized to time-to-event, Cox proportional hazards regression and multicategorical outcomes. A practical implementation of this method is available upon request in R script, complete with an example to facilitate its application.

Biodata Mining 17(1):45. Published 2024-10-29. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11523883/pdf/
Citations: 0
A regularized Cox hierarchical model for incorporating annotation information in predictive omic studies.
IF 4; Zone 3, Biology; Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-10-24; DOI: 10.1186/s13040-024-00398-6
Dixin Shen, Juan Pablo Lewinger, Eric Kawaguchi

Background: High-dimensional omics data are often accompanied by "meta-features", such as biological pathways, functional annotations, and summary statistics from similar studies, that can be informative for predicting an outcome of interest. We introduce a regularized hierarchical framework for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes.

Methods: A hierarchical framework is deployed to incorporate meta-features. Regularization is applied to the omic features as well as the meta-features so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be efficiently fitted by a combination of iterative reweighted least squares and cyclic coordinate descent.
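The abstract does not give the objective function, so the following is only one plausible way to write a two-level regularized Cox model of this kind. All symbols are assumptions introduced for illustration: $\beta$ holds the omic-feature coefficients, $Z$ is the feature-by-meta-feature matrix, $\alpha$ holds the meta-feature coefficients, $\delta_i$ is the event indicator, $R(t_i)$ is the risk set at time $t_i$, and $\lambda_1, \lambda_2$ are tuning parameters.

```latex
% Cox log partial likelihood for the omic features (standard form)
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \Big[ x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\big(x_j^{\top}\beta\big) \Big]

% One possible hierarchical objective: beta is shrunk toward Z*alpha,
% and a sparse penalty selects informative meta-features
\min_{\beta,\,\alpha}\;
  -\ell(\beta)
  + \frac{\lambda_1}{2}\,\lVert \beta - Z\alpha \rVert_2^2
  + \lambda_2\,\lVert \alpha \rVert_1
```

Under this reading, uninformative meta-features drive $\alpha$ toward zero and the first penalty reduces to an ordinary ridge penalty on $\beta$, which would be consistent with the reported robustness when meta-features carry no signal.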

Results: In a simulation study we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. We illustrate the proposed model with applications to breast cancer and melanoma survival based on gene expression profiles, which show the improvement in prediction performance by applying meta-features, as well as the discovery of important omic feature sets with sparse regularization at meta-feature level.

Conclusions: The proposed hierarchical regularized regression model integrates external meta-feature information directly into the modeling process for time-to-event outcomes and improves prediction performance when the external meta-feature data are informative. Importantly, when the external meta-features are uninformative, the prediction performance based on the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework. In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than to develop a predictive model.

Biodata Mining 17(1):44. Published 2024-10-24. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515443/pdf/
Citations: 0
G4 & the balanced metric family - a novel approach to solving binary classification problems in medical device validation & verification studies.
IF 4; Zone 3, Biology; Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-10-23; DOI: 10.1186/s13040-024-00402-z
Andrew Marra

Background: In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. A new metric called G4 is presented, which is the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 is part of a balanced metric family which includes the Unified Performance Measure (also known as P4) and the Matthews' Correlation Coefficient (MCC). The purpose of this manuscript is to unveil the benefits of using G4 together with the balanced metric family when analyzing the overall performance of binary classifiers.
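The three balanced metrics named above can all be computed from a single confusion matrix. The sketch below follows the definition given in the abstract for G4 (geometric mean of sensitivity, specificity, PPV, and NPV) and the standard published definitions of P4 (harmonic mean of the same four rates) and MCC. The function name `balanced_metrics` is hypothetical, and the code assumes every row and column total of the confusion matrix is non-zero.

```python
import math


def balanced_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return G4, P4 and MCC for a binary confusion matrix.

    Assumes tp + fn, tn + fp, tp + fp and tn + fn are all non-zero.
    """
    sens = tp / (tp + fn)   # sensitivity (recall)
    spec = tn / (tn + fp)   # specificity
    ppv = tp / (tp + fp)    # positive predictive value (precision)
    npv = tn / (tn + fn)    # negative predictive value
    rates = [sens, spec, ppv, npv]

    g4 = math.prod(rates) ** 0.25                      # geometric mean
    # Harmonic mean; defined as 0 when any of the four rates is 0.
    p4 = 0.0 if min(rates) == 0 else 4.0 / sum(1.0 / r for r in rates)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"G4": g4, "P4": p4, "MCC": mcc}
```

Because all three metrics use every cell of the confusion matrix, a classifier that scores well on the majority class but produces many false positives relative to a rare minority class is penalized through a low PPV, which sensitivity and specificity alone would not reveal.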

Results: Simulated datasets encompassing different prevalence rates of the minority class were analyzed under a multi-reader-multi-case study design. In addition, data from an independently published study that tested the performance of a unique ultrasound artificial intelligence algorithm in the context of breast cancer detection was also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI's performance. As the prevalence rate increased / decreased and the data became more imbalanced, AUROC tended to overvalue / undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Under certain circumstances where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments while P4 provided a stronger effect size when evaluating between-groups analyses. G4 acted as a middle ground for maximizing both standalone assessments and between-groups analyses.

Conclusions: Use of AUROC as the primary endpoint in binary classification problems provides misleading results as the dataset becomes more imbalanced. This is explicitly noticed when incorporating AUROC in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Therefore, researchers are encouraged to explore the balanced metric family when evaluating binary classification problems.

Biodata Mining 17(1):43. Published 2024-10-23. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515465/pdf/
Citations: 0
From COVID-19 to monkeypox: a novel predictive model for emerging infectious diseases.
IF 4; Zone 3, Biology; Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-10-22; DOI: 10.1186/s13040-024-00396-8
Deren Xu, Weng Howe Chan, Habibollah Haron, Hui Wen Nies, Kohbalan Moorthy

The outbreak of emerging infectious diseases poses significant challenges to global public health. Accurate early forecasting is crucial for effective resource allocation and emergency response planning. This study aims to develop a comprehensive predictive model for emerging infectious diseases, integrating the blending framework, transfer learning, incremental learning, and the biological feature Rt to increase prediction accuracy and practicality. By transferring features from a COVID-19 dataset to a monkeypox dataset and introducing dynamically updated incremental learning techniques, the model's predictive capability in data-scarce scenarios was significantly improved. The research findings demonstrate that the blending framework performs exceptionally well in short-term (7-day) predictions. Furthermore, the combination of transfer learning and incremental learning techniques significantly enhanced the adaptability and precision, with a 91.41% improvement in the RMSE and an 89.13% improvement in the MAE. In particular, the inclusion of the Rt feature enabled the model to more accurately reflect the dynamics of disease spread, further improving the RMSE by 1.91% and the MAE by 2.17%. This study underscores the significant application potential of multimodel fusion and real-time data updates in infectious disease prediction, offering new theoretical perspectives and technical support. This research not only enriches the theoretical foundation of infectious disease prediction models but also provides reliable technical support for public health emergency responses. Future research should continue to explore integrating data from multiple sources and enhancing model generalization capabilities to further enhance the practicality and reliability of predictive tools.
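The percentage improvements in RMSE and MAE quoted above can be read as relative reductions in error against a baseline model. The sketch below shows that calculation under the assumption that improvement is computed as (baseline error - new error) / baseline error; the abstract does not state the formula, and the function names are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


def pct_improvement(baseline_error: float, new_error: float) -> float:
    """Relative error reduction in percent (assumed definition)."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For a 7-day forecast, `pct_improvement(rmse(y, baseline_pred), rmse(y, new_pred))` would give the kind of figure reported in the abstract, where `y`, `baseline_pred`, and `new_pred` are hypothetical arrays of observed and predicted case counts.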

Biodata Mining 17(1):42. Published 2024-10-22. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11494870/pdf/
Citations: 0