Biodata Mining: Latest Articles

Disclosing transcriptomics network-based signatures of glioma heterogeneity using sparse methods.
IF 4.5 | Region 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2023-09-26 | DOI: 10.1186/s13040-023-00341-1
Sofia Martins, Roberta Coletti, Marta B Lopes

Gliomas are primary malignant brain tumors with poor survival and high resistance to available treatments. Improving the molecular understanding of glioma and disclosing novel biomarkers of tumor development and progression could help to find novel targeted therapies for this type of cancer. Public databases such as The Cancer Genome Atlas (TCGA) provide an invaluable source of molecular information on cancer tissues. Machine learning tools show promise in dealing with the high dimension of omics data and extracting relevant information from it. In this work, network inference and clustering methods, namely Joint Graphical Lasso and Robust Sparse K-means Clustering, were applied to RNA-sequencing data from TCGA glioma patients to identify shared and distinct gene networks among different types of glioma (glioblastoma, astrocytoma, and oligodendroglioma) and to disclose new patient groups and the relevant genes behind the groups' separation. The results suggest that astrocytoma and oligodendroglioma have more similarities with each other than with glioblastoma, highlighting the molecular differences between glioblastoma and the other glioma subtypes. After a comprehensive literature search on the relevant genes pointed out by our analysis, we identified potential candidates for biomarkers of glioma. Further molecular validation of these genes is encouraged to understand their potential role in diagnosis and in the design of novel therapies.
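
As a rough illustration of what "shared and distinct gene networks" means, the sketch below compares the edge sets of three networks that are assumed to have been inferred already. It does not implement Joint Graphical Lasso; the gene labels and the toy sparse precision matrices are hypothetical and are not taken from the paper. In a Gaussian graphical model, a nonzero off-diagonal precision entry is read as an edge between two genes.

```python
from itertools import combinations


def edges(precision, genes, tol=1e-8):
    """Return the gene pairs whose off-diagonal precision entry is nonzero."""
    return {
        (genes[i], genes[j])
        for i, j in combinations(range(len(genes)), 2)
        if abs(precision[i][j]) > tol
    }


# Hypothetical gene labels and toy precision matrices (illustration only).
genes = ["G1", "G2", "G3", "G4"]
precision_by_subtype = {
    "glioblastoma": [
        [1.0, 0.4, 0.0, 0.0],
        [0.4, 1.0, 0.0, 0.3],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.3, 0.0, 1.0],
    ],
    "astrocytoma": [
        [1.0, 0.4, 0.2, 0.0],
        [0.4, 1.0, 0.0, 0.0],
        [0.2, 0.0, 1.0, 0.5],
        [0.0, 0.0, 0.5, 1.0],
    ],
    "oligodendroglioma": [
        [1.0, 0.3, 0.2, 0.0],
        [0.3, 1.0, 0.0, 0.0],
        [0.2, 0.0, 1.0, 0.5],
        [0.0, 0.0, 0.5, 1.0],
    ],
}

edge_sets = {name: edges(p, genes) for name, p in precision_by_subtype.items()}

# Edges present in every subtype network.
shared = set.intersection(*edge_sets.values())

# Edges present in one subtype network and in no other.
unique = {
    name: e - set.union(*(o for other, o in edge_sets.items() if other != name))
    for name, e in edge_sets.items()
}

print("shared:", shared)
print("unique:", unique)
```

In this toy input the two lower-grade networks have identical edges, so neither has unique edges, while the glioblastoma network keeps one edge of its own.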

Biodata Mining 16(1):26 (2023). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10523751/pdf/
Citations: 0
STAR_outliers: a python package that separates univariate outliers from non-normal distributions.
IF 4.5 | Region 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2023-09-04 | DOI: 10.1186/s13040-023-00342-0
John T Gregg, Jason H Moore

There are currently no univariate outlier detection algorithms that transform and model arbitrarily shaped distributions to remove univariate outliers. Some algorithms model skew, even fewer model kurtosis, and none of them model bimodality and monotonicity. To overcome these challenges, we have implemented an algorithm for Skew and Tail-heaviness Adjusted Removal of Outliers (STAR_outliers) that robustly removes univariate outliers from distributions with many different shape profiles, including extreme skew, extreme kurtosis, bimodality, and monotonicity. We show that STAR_outliers removes simulated outliers with greater recall and precision than several general algorithms, and it also models the outlier bounds of real data distributions with greater accuracy.

Background: Reliably removing univariate outliers from arbitrarily shaped distributions is a difficult task. Incorrectly assuming unimodality or overestimating tail heaviness fails to remove outliers, while underestimating tail heaviness incorrectly removes regular data from the tails. Skew often produces one heavy tail and one light tail, and we show that several sophisticated outlier removal algorithms often fail to remove outliers from the light tail. Multivariate outlier detection algorithms have recently become popular, but having tested PyOD's multivariate outlier removal algorithms, we found them to be inadequate for univariate outlier removal. They usually do not allow for univariate input, and they do not fit their distributions of outliership scores with a model on which an outlier threshold can be accurately established. Thus, there is a need for a flexible outlier removal algorithm that can model arbitrarily shaped univariate distributions.

Results: In order to effectively model arbitrarily shaped univariate distributions, we have combined several well-established algorithms into a new algorithm called STAR_outliers. STAR_outliers removes more simulated true outliers and fewer non-outliers than several other univariate algorithms. These include several normality-assuming outlier removal methods, PyOD's isolation forest (IF) outlier removal algorithm (ACM Transactions on Knowledge Discovery from Data (TKDD) 6:3, 2012) with default settings, and an IQR-based algorithm by Verardi and Vermandele that removes outliers while accounting for skew and kurtosis (Verardi and Vermandele, Journal de la Société Française de Statistique 157:90-114, 2016). Since the IF algorithm's default model fit the outliership scores poorly, we also compared the isolation forest algorithm with a model that entails removing as many datapoints as STAR_outliers does in order of decreasing outliership scores. We also compared these algorithms on the publicly available 2018 National Health and Nutrition Examination Survey (NHANES) data by setting the outlier threshold to keep values falling within the main 99.3 percent of the fitted model's domain. We show that our STAR_outliers algorithm removes significantly closer to 0.7 percent of values from these features than other outlier removal methods on average. STAR_outliers is an easily implemented python package for removing outliers that outperforms multiple commonly used methods of univariate outlier removal.
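
For orientation, here is a minimal sketch of the kind of plain IQR rule (Tukey fences) that skew- and kurtosis-aware methods try to improve on. It is not the STAR_outliers algorithm and does not use the STAR_outliers package; the function names and the sample data are hypothetical. For normally distributed data, fences at 1.5 times the IQR keep roughly 99.3 percent of values, which matches the coverage mentioned in the abstract.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (low, high) bounds at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using symmetric Tukey fences."""
    low, high = tukey_fences(values, k)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Hypothetical sample with one obvious high outlier.
data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 95]
inliers, outliers = split_outliers(data)
print(outliers)
```

Because the fences are symmetric around the quartiles, a rule like this treats both tails the same way, which is the limitation on skewed data that the abstract describes.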
Biodata Mining 16(1):25 (2023). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10476292/pdf/
Citations: 0
Machine learning based study for the classification of Type 2 diabetes mellitus subtypes.
IF 4.5 | Region 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2023-08-22 | DOI: 10.1186/s13040-023-00340-2
Nelson E Ordoñez-Guillen, Jose Luis Gonzalez-Compean, Ivan Lopez-Arevalo, Miguel Contreras-Murillo, Edwin Aldana-Bobadilla

Purpose: Data-driven diabetes research has become increasingly interested in exploring the heterogeneity of the disease, aiming to support the development of more specific prognoses and treatments within so-called precision medicine. Recently, one of these studies found five diabetes subgroups with varying risks of complications and treatment responses. Here, we develop and assess different models for classifying Type 2 Diabetes (T2DM) subtypes through machine learning approaches, with the aim of providing a performance comparison and new insights on the matter.

Methods: We developed a three-stage methodology starting with the preprocessing of the public databases NHANES (USA) and ENSANUT (Mexico) to construct a dataset with N = 10,077 adult diabetes patient records. We used N = 2,768 records for training/validation of models and left the remaining records (N = 7,309) for testing. In the second stage, groups of observations, each one representing a T2DM subtype, were identified. We tested different clustering techniques and strategies and validated them using internal and external clustering indices, obtaining two annotated datasets, Dset A and Dset B. In the third stage, we developed different classification models, testing four algorithms, seven input-data schemes, and two validation settings on each annotated dataset. We also tested the obtained models using a majority-vote approach for classifying unseen patient records in the hold-out dataset.

Results: From the independently obtained bootstrap validation for Dset A and Dset B, mean accuracies across all seven data schemes were [Formula: see text] ([Formula: see text]) and [Formula: see text] ([Formula: see text]), respectively. Best accuracies were [Formula: see text] and [Formula: see text]. Results from both validation settings were consistent. For the hold-out dataset, results were consistent with most of those reported in the literature in terms of class proportions.

Conclusion: The development of machine learning systems for the classification of diabetes subtypes is an important task for supporting physicians in fast and timely decision-making. We expect to deploy this methodology in a data analysis platform to conduct studies for identifying T2DM subtypes in patient records from hospitals.
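
The majority-vote step mentioned in the Methods can be sketched as follows. This is only an illustration under assumptions: the model names, the subtype labels "A", "B", "C", and the tie-breaking rule are hypothetical, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties go to the label that appears first."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from three classifiers for three unseen records.
model_predictions = {
    "model_1": ["A", "B", "C"],
    "model_2": ["A", "C", "C"],
    "model_3": ["B", "B", "C"],
}

# Regroup the predictions record by record, then vote on each record.
votes_per_record = list(zip(*model_predictions.values()))
final_labels = [majority_vote(votes) for votes in votes_per_record]
print(final_labels)
```

Voting across models gives one label per record, so the class proportions in the hold-out set can then be tallied from `final_labels`.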

Biodata Mining 16(1):24 (2023). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10463725/pdf/
Citations: 0
Assessment of emerging pretraining strategies in interpretable multimodal deep learning for cancer prognostication.
IF 4.5 | Region 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2023-07-22 | DOI: 10.1186/s13040-023-00338-w
Zarif L Azher, Anish Suvarna, Ji-Qing Chen, Ze Zhang, Brock C Christensen, Lucas A Salas, Louis J Vaickus, Joshua J Levy

Background: Deep learning models can infer cancer patient prognosis from molecular and anatomic pathology information. Recent studies that leveraged information from complementary multimodal data improved prognostication, further illustrating the potential utility of such methods. However, current approaches do not 1) comprehensively leverage biological and histomorphological relationships or 2) make use of emerging strategies to "pretrain" models (i.e., train models on a slightly orthogonal dataset/modeling objective), which may aid prognostication by reducing the amount of information required for achieving optimal performance. In addition, model interpretation is crucial for facilitating the clinical adoption of deep learning methods by fostering practitioner understanding of and trust in the technology.

Methods: Here, we develop an interpretable multimodal modeling framework that combines DNA methylation, gene expression, and histopathology (i.e., tissue slides) data, and we compare performance of crossmodal pretraining, contrastive learning, and transfer learning versus the standard procedure.

Results: Our models outperform both the existing state-of-the-art method (average 11.54% C-index increase) and baseline clinically driven models (average 11.7% C-index increase). Model interpretations show that the models consider biologically meaningful factors when making prognosis predictions.

Discussion: Our results demonstrate that the selection of pretraining strategies is crucial for obtaining highly accurate prognostication models, even more so than devising an innovative model architecture, and further emphasize the all-important role of the tumor microenvironment in disease progression.
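
Since the Results are reported as C-index changes, a minimal sketch of Harrell's concordance index may help. It is a generic textbook formulation, not the authors' evaluation code; the survival times, event flags, and risk scores below are hypothetical. A pair of patients is comparable when the one with the shorter time had an observed event, and the pair is concordant when that patient also has the higher predicted risk.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: subject i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # higher risk, shorter survival
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: follow-up time, event flag (1 = event, 0 = censored), risk.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risks)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of comparable pairs.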

Biodata Mining 16(1):23 (2023). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10363299/pdf/
Citations: 2
Neural network-based prognostic predictive tool for gastric cardiac cancer: the worldwide retrospective study.
IF 4.5 | Region 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2023-07-18 | DOI: 10.1186/s13040-023-00335-z
Wei Li, Minghang Zhang, Siyu Cai, Liangliang Wu, Chao Li, Yuqi He, Guibin Yang, Jinghui Wang, Yuanming Pan

Background: The incidence of gastric cardiac cancer (GCC) has increased markedly in recent years, and its prognosis is poor. It is necessary to compare the prognosis of GCC with that of carcinomas at other gastric sites and to build an effective neural network-based prognostic model to predict the survival of GCC patients.

Methods: In this population-based cohort study, we first collected clinical features from the Surveillance, Epidemiology and End Results (SEER) data (n = 31,397) and from public Chinese data from different hospitals (n = 1049). The SEER data were then divided by time of diagnosis into two cohorts: the train cohort (patients diagnosed with GCC in 2010-2014, n = 4414) and the test cohort (diagnosed in 2015, n = 957). Age, sex, pathology, tumor, node, and metastasis (TNM) stage, tumor size, surgery (yes/no), radiotherapy (yes/no), chemotherapy (yes/no), and history of malignancy were chosen as the predictive clinical features. The train cohort was used to build the neural network-based prognostic predictive model, which was validated on the train cohort itself and on the test cohort. Area under the receiver operating characteristic curve (AUC) was used to evaluate model performance.

Results: The prognosis of GCC patients in the SEER database was worse than that of non-GCC (NGCC) patients, whereas it was not worse in the Chinese data. After applying the inclusion and exclusion criteria, a total of 5371 patients were used to build the model. The neural network-based prognostic predictive model performed satisfactorily in predicting GCC overall survival (OS), with an AUC of 0.7431 in the train cohort (95% confidence interval, CI, 0.7423-0.7439) and 0.7419 in the test cohort (95% CI, 0.7411-0.7428).

Conclusions: GCC patients do have different survival times from non-GCC patients. The neural network-based prognostic predictive tool developed in this study is a novel and promising software tool for analyzing the clinical outcomes of GCC patients.
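
Because model performance here is summarized by AUC, the following sketch shows one standard way to compute it: as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. This is a generic illustration, not the authors' pipeline; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative case."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = event, 0 = no event) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of patients, which is fine for a small illustration; rank-based formulas give the same value more efficiently on large cohorts.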

Biodata Mining 16(1):21 (2023). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10353146/pdf/
Citations: 0
Inverse problem for parameters identification in a modified SIRD epidemic model using ensemble neural networks.
IF 4.5 | Region 3, Biology | Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY | Pub Date: 2023-07-18 | DOI: 10.1186/s13040-023-00337-x
Marian Petrica, Ionel Popescu

In this paper, we propose a parameter identification methodology for the SIRD model, an extension of the classical SIR model that considers the deceased as a separate category. In addition, our model includes one parameter which is the ratio between the real total number of infected and the number of infected documented in the official statistics. Due to many factors, such as governmental decisions, several variants circulating, and the opening and closing of schools, the typical assumption that the parameters of the model stay constant for long periods of time is not realistic. Thus our objective is to create a method which works for short periods of time. To this end, we base the estimation on the previous 7 days of data and then use the identified parameters to make predictions. To estimate the parameters, we propose using the average of an ensemble of neural networks. Each neural network is constructed based on a database built by solving the SIRD model for 7 days with random parameters. In this way, the networks learn the parameters from the solution of the SIRD model. Lastly, we use the ensemble to estimate the parameters from the real COVID-19 data in Romania, and then we illustrate the predictions of the number of deaths for different periods of time, from 10 up to 45 days. The main goal was to apply this approach to the analysis of COVID-19 evolution in Romania, but it was also applied to other countries such as Hungary, the Czech Republic and Poland, with similar results. The results are backed by a theorem which guarantees that we can recover the parameters of the model from the reported data. We believe this methodology can be used as a general tool for short-term predictions of infectious diseases or in other compartmental models.
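
To make the "solve the SIRD model for 7 days" step concrete, here is a minimal forward simulation of the standard SIRD equations with explicit Euler steps. It is a sketch under assumptions, not the authors' modified model or their neural network estimator: the parameter names `beta`, `gamma`, `mu`, the parameter values, the population size, and the way the reporting ratio is applied to the initial infected count are all hypothetical.

```python
def simulate_sird(s0, i0, r0, d0, beta, gamma, mu, days, steps_per_day=100):
    """Integrate the standard SIRD equations and return one state per day.

    dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - (gamma + mu)*I,
    dR/dt = gamma*I,     dD/dt = mu*I, with N = S + I + R + D held constant.
    """
    n = s0 + i0 + r0 + d0
    dt = 1.0 / steps_per_day
    s, i, r, d = float(s0), float(i0), float(r0), float(d0)
    trajectory = [(s, i, r, d)]
    for _day in range(days):
        for _step in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            new_deaths = mu * i * dt
            s -= new_infections
            i += new_infections - new_recoveries - new_deaths
            r += new_recoveries
            d += new_deaths
        trajectory.append((s, i, r, d))
    return trajectory


# Hypothetical inputs: the ratio scales reported cases up to assumed real cases.
population = 1_000_000
reported_infected = 1_000
ratio = 3.0  # assumed real-to-reported ratio, for illustration only
true_infected = ratio * reported_infected

trajectory = simulate_sird(
    population - true_infected, true_infected, 0, 0,
    beta=0.3, gamma=0.1, mu=0.01, days=7,
)
for day, (s, i, r, d) in enumerate(trajectory):
    print(day, round(i), round(d))
```

Running such a simulation many times with randomly drawn parameters would produce pairs of 7-day trajectories and parameters, which is the kind of database the abstract describes for training each network.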

Citations: 0
ChatGPT and large language models in academia: opportunities and challenges.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-07-13 DOI: 10.1186/s13040-023-00339-9
Jesse G Meyer, Ryan J Urbanowicz, Patrick C N Martin, Karen O'Connor, Ruowang Li, Pei-Chen Peng, Tiffani J Bright, Nicholas Tatonetti, Kyoung Jae Won, Graciela Gonzalez-Hernandez, Jason H Moore

The introduction of large language models (LLMs) that allow iterative "chat" in late 2022 is a paradigm shift that enables generation of text often indistinguishable from that written by humans. LLM-based chatbots have immense potential to improve academic work efficiency, but the ethical implications of their fair use and inherent bias must be considered. In this editorial, we discuss this technology from the academic's perspective with regard to its limitations and utility for academic writing, education, and programming. We end with our stance with regard to using LLMs and chatbots in academia, which is summarized as (1) we must find ways to effectively use them, (2) their use does not constitute plagiarism (although they may produce plagiarized text), (3) we must quantify their bias, (4) users must be cautious of their poor accuracy, and (5) the future is bright for their application to research and as an academic tool.

Citations: 0
Overlapping filter bank convolutional neural network for multisubject multicategory motor imagery brain-computer interface.
IF 4.5 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-07-11 DOI: 10.1186/s13040-023-00336-y
Jing Luo, Jundong Li, Qi Mao, Zhenghao Shi, Haiqin Liu, Xiaoyong Ren, Xinhong Hei

Background: The motor imagery brain-computer interface (BCI) is a classic and promising BCI technology for achieving brain-computer integration. In motor imagery BCI, the operational frequency band of the EEG greatly affects the performance of the motor imagery EEG recognition model. However, as most algorithms use a broad frequency band, the discriminative information in multiple sub-bands is not fully utilized. Thus, using convolutional neural networks (CNNs) to extract discriminative features from EEG signals of different frequency components is a promising method in multisubject EEG recognition.

Methods: This paper presents a novel overlapping filter bank CNN to incorporate discriminative information from multiple frequency components in multisubject motor imagery recognition. Specifically, two overlapping filter banks with fixed low-cut frequency or sliding low-cut frequency are employed to obtain multiple frequency component representations of EEG signals. Then, multiple CNN models are trained separately. Finally, the output probabilities of multiple CNN models are integrated to determine the predicted EEG label.

Results: Experiments were conducted based on four popular CNN backbone models and three public datasets. The results showed that the overlapping filter bank CNN was efficient and universal in improving multisubject motor imagery BCI performance. Specifically, compared with the original backbone model, the proposed method improved the average accuracy by 3.69 percentage points, the F1 score by 0.04, and the AUC by 0.03. In addition, the proposed method performed best in comparison with state-of-the-art methods.

Conclusion: The proposed overlapping filter bank CNN framework with fixed low-cut frequency is an efficient and universal method to improve the performance of multisubject motor imagery BCI.
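
The two filter-bank layouts and the final integration step from the Methods paragraph can be sketched as follows. This is only an illustration: the band edges in the usage note are hypothetical, and averaging the class probabilities is one plausible reading of "integrated", not necessarily the rule used in the paper. Band-pass filtering and CNN training are omitted.

```python
def fixed_low_cut_bank(low, first_high, last_high, step):
    """Overlapping bands that share one low-cut frequency (in Hz)."""
    return [(low, high) for high in range(first_high, last_high + 1, step)]


def sliding_low_cut_bank(first_low, last_low, high, step):
    """Overlapping bands whose low-cut frequency slides upward (in Hz)."""
    return [(low, high) for low in range(first_low, last_low + 1, step)]


def integrate_probabilities(per_model_probs):
    """Average class probabilities over models (assumed rule) and pick a label.

    per_model_probs holds one probability vector per band-specific model.
    Returns (predicted_label, averaged_probabilities).
    """
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    mean = [sum(p[c] for p in per_model_probs) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean
```

For example, `fixed_low_cut_bank(4, 16, 40, 8)` yields the bands (4, 16), (4, 24), (4, 32), and (4, 40). One CNN would be trained per band, and `integrate_probabilities` would combine their outputs for each EEG trial.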

Citations: 0
Comparison of cancer subtype identification methods combined with feature selection methods in omics data analysis.
IF 4.5 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-07-07 DOI: 10.1186/s13040-023-00334-0
JiYoon Park, Jae Won Lee, Mira Park

Background: Cancer subtype identification is important for the early diagnosis of cancer and the provision of adequate treatment. Prior to identifying the subtype of cancer in a patient, feature selection is also crucial for reducing the dimensionality of the data by detecting genes that contain important information about the cancer subtype. Numerous cancer subtyping methods have been developed, and their performance has been compared. However, combinations of feature selection and subtype identification methods have rarely been considered. This study aimed to identify the best combination of variable selection and subtype identification methods in single omics data analysis.

Results: Combinations of six filter-based methods and six unsupervised subtype identification methods were investigated using The Cancer Genome Atlas (TCGA) datasets for four cancers. The number of features selected varied, and several evaluation metrics were used. Although no single combination was found to have a distinctively good performance, Consensus Clustering (CC) and Neighborhood-Based Multi-omics Clustering (NEMO) used with variance-based feature selection had a tendency to show lower p-values, and nonnegative matrix factorization (NMF) stably showed good performance in many cases unless the Dip test was used for feature selection. In terms of accuracy, the combination of NMF and similarity network fusion (SNF) with Monte Carlo Feature Selection (MCFS) and Minimum-Redundancy Maximum Relevance (mRMR) showed good overall performance. NMF always showed among the worst performances without feature selection in all datasets, but performed much better when used with various feature selection methods. iClusterBayes (ICB) had decent performance when used without feature selection.

Conclusions: Rather than a single method clearly emerging as optimal, the best methodology was different depending on the data used, the number of features selected, and the evaluation method. A guideline for choosing the best combination method under various situations is provided.
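
Variance-based feature selection, one of the filter methods mentioned in the Results, can be sketched in a few lines. This is a generic illustration with hypothetical function and gene names, not the study's pipeline. It keeps the `k` features with the largest population variance across samples before any clustering method is applied.

```python
from statistics import pvariance


def top_variance_features(matrix, feature_names, k):
    """Keep the k highest-variance columns of a samples-by-features matrix.

    Returns (kept_feature_names, reduced_matrix), with the kept columns in
    their original order.
    """
    columns = list(zip(*matrix))
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(feature_names)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in matrix]
    return [feature_names[j] for j in keep], reduced
```

The reduced matrix would then be passed to a subtype identification method such as NMF or consensus clustering. The number of retained features `k` is itself a tuning choice, which matches the abstract's observation that results depend on the number of features selected.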

Citations: 0
ScInfoVAE: interpretable dimensional reduction of single cell transcription data with variational autoencoders and extended mutual information regularization.
IF 4.5 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-06-10 DOI: 10.1186/s13040-023-00333-1
Weiquan Pan, Faning Long, Jian Pan

Single-cell RNA-sequencing (scRNA-seq) data can serve as a good indicator of cell-to-cell heterogeneity and can aid in the study of cell growth by identifying cell types. Recent advances in variational autoencoders (VAEs) have demonstrated their ability to learn robust feature representations for scRNA-seq data. However, it has been observed that VAEs tend to ignore the latent variables when combined with a decoding distribution that is too flexible. In this paper, we introduce ScInfoVAE, a dimensional reduction method based on the mutual information variational autoencoder (InfoVAE), which can more effectively identify various cell types in scRNA-seq data of complex tissues. ScInfoVAE joins an InfoVAE deep model with a zero-inflated negative binomial distribution model and reconstructs the objective function to denoise scRNA-seq data and learn an efficient low-dimensional representation of them. We use ScInfoVAE to analyze the clustering performance on 15 real scRNA-seq datasets and demonstrate that our method provides high clustering performance. In addition, we use simulated data to investigate the interpretability of feature extraction, and visualization results show that the low-dimensional representation learned by ScInfoVAE retains the local and global neighborhood structure of the data well. Moreover, our model can significantly improve the quality of the variational posterior.
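
The zero-inflated negative binomial (ZINB) distribution mentioned above can be made concrete with a small sketch of its log probability mass function. It uses the standard mean/dispersion parameterization, P(x) = π·1[x=0] + (1-π)·NB(x; μ, θ). The function name and this parameterization are assumptions for illustration. How ScInfoVAE wires this likelihood into its decoder and objective is not stated in the abstract and is not reproduced here.

```python
import math


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    mu > 0 is the NB mean, theta > 0 the NB dispersion, and 0 <= pi < 1 the
    probability of an extra (dropout) zero.
    """
    # Negative binomial log-pmf in the mean/dispersion parameterization.
    log_nb = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(log_nb))
    return math.log(1.0 - pi) + log_nb
```

Summing the negative of this quantity over genes and cells gives a reconstruction loss of the kind commonly used for count-valued scRNA-seq data.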

Citations: 0