
Biodata Mining: latest articles

The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification.
IF 4 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-02-17 DOI: 10.1186/s13040-023-00322-4
Davide Chicco, Giuseppe Jurman

Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic curve (ROC AUC) has become the common standard metric to evaluate binary classifications in most scientific fields. The ROC curve has true positive rate (also called sensitivity or recall) on the y axis and false positive rate on the x axis, and the ROC AUC can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity, and moreover it does not say anything about positive predictive value (also known as precision) nor negative predictive value (NPV) obtained by the classifier, therefore potentially generating inflated overoptimistic results. Since it is common to include ROC AUC alone without precision and negative predictive value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in the ROC space does not identify a single confusion matrix nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubts on the reliability of ROC AUC as a performance measure. In contrast, the Matthews correlation coefficient (MCC) generates a high score in its $[-1, +1]$ interval only if the classifier scored a high value for all the four basic rates of the confusion matrix: sensitivity, specificity, precision, and negative predictive value. A high MCC (for example, MCC $= 0.9$), moreover, always corresponds to a high ROC AUC, and not vice versa. In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as standard statistic in all the scientific studies involving a binary classification, in all scientific fields.
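
The claim that one (sensitivity, specificity) pair can cover a broad MCC range can be illustrated with a minimal sketch. The two confusion matrices below are hypothetical numbers chosen for illustration, not data from the paper; the function names are likewise assumptions. MCC is computed with its standard definition, and returning 0 for an empty row or column is one common convention rather than something the abstract specifies.

```python
import math

def basic_rates(tp, fn, tn, fp):
    """Return the four basic rates of a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient, a value in [-1, +1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention used here when a row or column is empty
    return (tp * tn - fp * fn) / denom

# Hypothetical test sets: both have sensitivity 0.90 and specificity 0.90,
# so they occupy the same point in ROC space.
balanced = dict(tp=90, fn=10, tn=90, fp=10)
imbalanced = dict(tp=9, fn=1, tn=900, fp=100)

for name, cm in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name, basic_rates(**cm), round(mcc(**cm), 3))
```

Under these assumed counts the balanced matrix gives an MCC of 0.8, while the imbalanced one gives roughly 0.26 because its precision is only about 0.08, even though both share the same sensitivity and specificity.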

Citations: 0
LoFTK: a framework for fully automated calculation of predicted Loss-of-Function variants and genes.
IF 4 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-02-02 DOI: 10.1186/s13040-023-00321-5
Abdulrahman Alasiri, Konrad J Karczewski, Brian Cole, Bao-Li Loza, Jason H Moore, Sander W van der Laan, Folkert W Asselbergs, Brendan J Keating, Jessica van Setten

Background: Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for detecting knockout genes caused by compound heterozygous (CH) LoF variants.

Results: We have developed the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from genotyped, imputed and sequenced genomes. LoFTK enables the identification of genes that are inactive in one or two copies and provides summary statistics for downstream analyses. LoFTK can identify CH LoF variants, which result in LoF genes with two copies lost. Using data from parents and offspring we show that 96% of CH LoF genes predicted by LoFTK in the offspring have the respective alleles donated by each parent.

Conclusions: LoFTK is a command-line based tool that provides a reliable computational workflow for predicting LoF variants from genotyped and sequenced genomes, identifying genes that are inactive in 1 or 2 copies. LoFTK is open software and is freely available to non-commercial users at https://github.com/CirculatoryHealth/LoFTK.
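
The distinction between genes inactive in one copy and in two copies, including the compound heterozygous case, can be sketched as follows. This is not LoFTK's code or interface: the function names and the input layout (one phased `(maternal, paternal)` pair per LoF variant in a gene, with 1 marking the LoF allele) are assumptions made purely for illustration.

```python
def lof_copies_lost(phased_genotypes):
    """Count how many copies of one gene carry at least one LoF allele.

    phased_genotypes: list of (maternal, paternal) tuples, one per LoF
    variant in the gene; 1 marks the LoF allele and 0 the reference allele.
    """
    maternal_hit = any(m == 1 for m, _ in phased_genotypes)
    paternal_hit = any(p == 1 for _, p in phased_genotypes)
    return int(maternal_hit) + int(paternal_hit)

def is_compound_heterozygous(phased_genotypes):
    """True when both copies are hit, but only by heterozygous variants."""
    has_homozygous = any(m == 1 and p == 1 for m, p in phased_genotypes)
    return lof_copies_lost(phased_genotypes) == 2 and not has_homozygous

# Hypothetical genes in one individual
print(lof_copies_lost([(1, 0), (0, 1)]), is_compound_heterozygous([(1, 0), (0, 1)]))
print(lof_copies_lost([(1, 0), (1, 0)]), is_compound_heterozygous([(1, 0), (1, 0)]))
print(lof_copies_lost([(1, 1)]), is_compound_heterozygous([(1, 1)]))
```

In this toy representation, a compound heterozygous gene is one where each parental haplotype carries a different LoF variant, which mirrors the parent-offspring check described in the Results: each parent should have donated one of the two alleles.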

Citations: 0
Detection of iron deficiency anemia by medical images: a comparative study of machine learning algorithms.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-01-24 DOI: 10.1186/s13040-023-00319-z
Peter Appiahene, Justice Williams Asare, Emmanuel Timmy Donkoh, Giovanni Dimauro, Rosalia Maglietta

Background: Anemia is one of the global public health problems that affect children and pregnant women. It occurs when the number of red blood cells in the body decreases, when the structure of the red blood cells is destroyed, or when the hemoglobin (Hb) level in the red blood cells falls below the normal threshold. It results from one or more of the following: increased red cell destruction, blood loss, defective cell production, or a depleted total of red blood cells.

Methods: The study proceeded in three phases. First, a dataset of palm images was gathered. Second, the images were pre-processed: they were extracted and augmented, the Region of Interest was segmented, and the components of the CIE L*a*b* colour space (also referred to as CIELAB) were acquired. Finally, the proposed models for the detection of anemia were developed using several algorithms: CNN, k-NN, Naïve Bayes, SVM, and Decision Tree. The experiment started from 527 initial samples; rotation, flipping and translation augmented the dataset to 2635 samples. The augmented dataset was randomly divided into 70%, 10%, and 20% portions, used to train, validate and test the models, respectively.

Results: The results indicate that the models performed well when images of the palm were used to detect anemia, with Naïve Bayes achieving an accuracy of 99.96%, the CNN an accuracy of 99.92%, and the SVM the lowest accuracy, 96.34%.

Conclusions: Invasive methods of detecting anemia are expensive and time-consuming; however, anemia can also be detected with non-invasive methods such as machine learning algorithms, which are efficient, cost-effective and take less time. In this work, we compared machine learning models (CNN, k-NN, Decision Tree, Naïve Bayes, and SVM) for detecting anemia from images of the palm. Finally, the study supports other similar studies on the potential of machine learning algorithms as a non-invasive method for detecting iron deficiency anemia.
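
The 70%/10%/20% random division described in the Methods can be sketched as below. The abstract does not say how fractional counts were rounded or which random seed was used, so the truncation rule, the seed, and the name `split_indices` are all assumptions for illustration only.

```python
import random

def split_indices(n, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts.

    Train and validation sizes are truncated to whole samples (an assumed
    rule); the test part receives whatever remains.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# 2635 is the augmented dataset size reported in the abstract
train, val, test = split_indices(2635)
print(len(train), len(val), len(test))
```

Note that 2635 is exactly 5 × 527, which is consistent with each initial sample contributing itself plus four augmented variants, although the abstract does not state this breakdown.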

Citations: 11
Bacteria spatial tracking in Urban Park soils with MALDI-TOF Mass Spectrometry and Specific PCR.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2023-01-14 DOI: 10.1186/s13040-022-00318-6
Diego Arnal, Celeste Moya, Luigi Filippelli, Jaume Segura-Garcia, Sergi Maicas

Urban parks constitute one of the main leisure areas, especially for the most vulnerable people in our society, children and the elderly. Contact with soils can pose a health risk. Microbiological testing is a key aspect in determining whether they are suitable for public use. The aim of this work is to map the spatial distribution of potentially dangerous Enterobacteria, as well as isolates useful for bioremediation (lipase producers), from soils in an urban park in the area of Valencia (Spain). To this end, our team collected 25 soil samples and isolated 500 microorganisms, using a mobile application to record geolocated information on the soil samples (i.e. soil features, temperature, humidity, etc.). A combined protocol including matrix-assisted laser desorption/ionization time of flight mass spectrometry (MALDI-TOF MS) and 16S rDNA sequencing PCR was established to characterize the isolates. The results were processed using spatial statistical techniques (the Kriging method), taking into account the number of isolated strains and also testing the reactivity against standard pathogenic bacterial strains (Escherichia coli, Bacillus cereus, Salmonella, Pseudomonas and Staphylococcus aureus); spatially interpolating each parameter with this statistical method increased the number of samples to 896. The combined use of methods from biology and computer science allows the quality of the soil in urban parks to be predicted in an agile way, which can generate confidence in its use by citizens.

Citations: 1
Robust and rigorous identification of tissue-specific genes by statistically extending tau score.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-12-09 DOI: 10.1186/s13040-022-00315-9
Hatice Büşra Lüleci, Alper Yılmaz

Objectives: In this study, we aimed to identify tissue-specific genes for various human tissues/organs more robustly and rigorously by extending the tau score algorithm.

Introduction: Tissue-specific genes are a class of genes whose function and expression are restricted to, or preferential in, one or a few tissues. Identification of tissue-specific genes is essential for discovering multi-cellular biological processes such as tissue-specific molecular regulations, tissue development, physiology, and the pathogenesis of tissue-associated diseases.

Materials and methods: Gene expression data derived from five large RNA sequencing (RNA-seq) projects, spanning 96 different human tissues, were retrieved from ArrayExpress and ExpressionAtlas. The first step is categorizing genes using significant filters and the tau score as a specificity index. After calculating tau for each gene in each dataset separately, the statistical distance from the maximum expression level was estimated using a new procedure. Specific expression of a gene in one or several tissues was then calculated by integrating tau with the statistical distance estimate, which is called the extended tau approach. The tissue-specific genes obtained for the 96 human tissues were functionally annotated, and several comparisons were carried out to show the effectiveness of the extended tau method.

Results and discussion: Genes were categorized by expression level, and tissue-specific genes were identified for a large number of tissues/organs. With the extended tau approach, genes were successfully assigned to multiple tissues, as opposed to the original tau score, which can assign tissue specificity to a single tissue only.
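
For orientation, the original tau score that the study extends can be sketched in a few lines. The abstract does not give the formula, so the standard definition of the tau specificity index is assumed here: each tissue's expression is divided by the gene's maximum expression, and tau is the sum of (1 minus that ratio) over all tissues, divided by the number of tissues minus one. The statistical-distance extension is not reproduced because the abstract does not specify it; the function name is hypothetical.

```python
def tau_score(expression):
    """Tissue-specificity index tau for one gene (standard definition assumed).

    expression: non-negative expression values, one per tissue.
    Returns 0 for uniform expression and 1 for expression in a single tissue.
    """
    n = len(expression)
    x_max = max(expression)
    if n < 2 or x_max == 0:
        raise ValueError("need at least two tissues and non-zero expression")
    # Normalize each tissue by the maximum, then average the shortfall
    return sum(1 - x / x_max for x in expression) / (n - 1)

# Hypothetical expression profiles across four tissues
print(tau_score([5, 5, 5, 5]))    # uniformly expressed gene
print(tau_score([0, 0, 10, 0]))   # gene expressed in one tissue only
print(tau_score([10, 10, 0, 0]))  # gene expressed equally in two tissues
```

The third profile shows the limitation the abstract refers to: a gene expressed equally in two of four tissues scores only about 0.67, and tau by itself does not say which tissues the gene is specific to.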

Citations: 2
Classification of breast cancer recurrence based on imputed data: a simulation study.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-12-07 DOI: 10.1186/s13040-022-00316-8
Rahibu A Abassi, Amina S Msengwa

Several studies have been conducted to classify various real-life events, but few are in medical fields, particularly on breast cancer recurrence using statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers' discriminative ability on breast cancer recurrence in the presence of imputed missing data. Therefore, this article aims to fill this gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) on several datasets resulting from imputation under various simulation conditions. Our study adds to the knowledge of how classifiers' accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputation by chained equations, expectation-maximisation, and predictive mean matching were used to impute the incomplete datasets. For each classifier, classification accuracy and the area under the Receiver Operating Characteristic (ROC) curve under the MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) on mean-imputed data at 45% missingness under the MCAR mechanism. Logistic regression on data imputed by predictive mean matching yields the greatest area under the ROC curve (0.6418) at 30% missingness, while k-nearest neighbour attains the highest value (0.6428) at 60% missingness under the MCAR mechanism.
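
The simplest combination in the simulation design, MCAR missingness followed by mean imputation, can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names, the seed, and the toy data are assumptions, and only a single numeric variable is handled.

```python
import random

def make_mcar(values, missing_rate, seed=0):
    """Return a copy of values with a fixed fraction replaced by None.

    Positions are drawn uniformly at random, independently of the values,
    which is what Missing Completely At Random means.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * missing_rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Toy numeric variable with 100 records (hypothetical data)
data = [float(i) for i in range(100)]
for rate in (0.15, 0.30, 0.45, 0.60):  # missingness levels from the abstract
    incomplete = make_mcar(data, rate)
    completed = mean_impute(incomplete)
    print(rate, incomplete.count(None), len(completed))
```

Mean imputation preserves the observed mean but shrinks the variance as missingness grows, which is one reason the study compares it against hot deck, k-nearest neighbour and model-based alternatives.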

Citations: 0
Detecting diseases in medical prescriptions using data mining methods.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-11-24 DOI: 10.1186/s13040-022-00314-w
Sana Nazari Nezhad, Mohammad H Zahedi, Elham Farahani

Every year, the health of millions of people around the world is compromised by misdiagnosis, which can sometimes even lead to death. In addition, it entails huge financial costs for patients, insurance companies, and governments. Furthermore, many physicians' professional lives are adversely affected by unintended errors in prescribing medication or misdiagnosing a disease. Our aim in this paper is to use data mining methods to find knowledge in a dataset of medical prescriptions that can be effective in improving the diagnostic process. In this study, four single classification algorithms (decision tree, random forest, simple Bayes, and K-nearest neighbors) were used to predict the disease and its category. Then, in order to improve the performance of these algorithms, we used an Ensemble Learning methodology to present our proposed model. In the final step, a number of experiments were performed to compare the performance of different data mining techniques. The final model proposed in this study has an accuracy and kappa score of 62.86% and 0.620 for disease prediction and 74.39% and 0.720 for prediction of the disease category, respectively, which is better than the performance reported by other studies in this field. In general, the results of this study can be used to help maintain the health of patients and prevent the waste of the financial resources of patients, insurance companies, and governments. In addition, it can aid physicians and help their careers by providing timely information on diagnostic errors. Finally, these results can be used as a basis for future research in this field.
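
The abstract reports both accuracy and a kappa score. As a rough illustration of how the two differ, the sketch below computes accuracy and Cohen's kappa for a pair of label sequences. It assumes the reported kappa is Cohen's kappa, which the abstract does not state explicitly; the function name `accuracy_and_kappa` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: fraction of samples where prediction equals truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement: for each label, the product of its marginal
    # frequency in the true labels and in the predicted labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if p_e == 1.0:
        # Both sequences contain one identical label; kappa is undefined.
        return p_o, float("nan")
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Toy example (made-up labels, not data from the study).
y_true = ["flu", "flu", "cold", "cold", "asthma", "asthma"]
y_pred = ["flu", "cold", "cold", "cold", "asthma", "flu"]
acc, kappa = accuracy_and_kappa(y_true, y_pred)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Kappa discounts the agreement expected by chance, so on a multi-class problem it is typically lower than raw accuracy for the same predictions.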

Citations: 2
Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts.
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-10-18 DOI: 10.1186/s13040-022-00311-z
David N Nicholson, Daniel S Himmelstein, Casey S Greene

Background: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types.

Results: We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1.

Conclusions: Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks for populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
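
To make the idea of a label function concrete, here is a minimal sketch of three label functions for a Compound-binds-Gene edge, combined by a simple majority vote. Everything in it is an assumption made for illustration: the keyword lists, the `KNOWN_PAIRS` set standing in for a curated database, and the function names are hypothetical, and majority voting is swapped in as a simplification rather than presented as the label-combination model the authors used.

```python
# Vote values a label function may return.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical stand-in for a curated database of known compound-gene pairs.
KNOWN_PAIRS = {("imatinib", "ABL1")}


def lf_in_database(sentence, compound, gene):
    """Database-style label function: vote positive for known pairs."""
    return POSITIVE if (compound, gene) in KNOWN_PAIRS else ABSTAIN


def lf_binding_keyword(sentence, compound, gene):
    """Text heuristic: vote positive when a binding phrase appears."""
    text = sentence.lower()
    keywords = ("binds", "bound to", "inhibitor of")
    return POSITIVE if any(k in text for k in keywords) else ABSTAIN


def lf_negation(sentence, compound, gene):
    """Text heuristic: vote negative when binding is explicitly denied."""
    text = sentence.lower()
    keywords = ("does not bind", "no interaction")
    return NEGATIVE if any(k in text for k in keywords) else ABSTAIN


LABEL_FUNCTIONS = [lf_in_database, lf_binding_keyword, lf_negation]


def majority_vote(sentence, compound, gene, label_functions=LABEL_FUNCTIONS):
    """Combine label function votes; ties and all-abstain give ABSTAIN."""
    votes = [lf(sentence, compound, gene) for lf in label_functions]
    positives = votes.count(POSITIVE)
    negatives = votes.count(NEGATIVE)
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE


print(majority_vote("Imatinib binds ABL1 with high affinity.", "imatinib", "ABL1"))
print(majority_vote("Aspirin does not bind TP53.", "aspirin", "TP53"))
```

In this framing, re-using label functions across edge types amounts to passing a different `label_functions` list to the same voting step.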

Citations: 8
Single_cell_GRN: gene regulatory network identification based on supervised learning method and Single-cell RNA-seq data
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-06-11 DOI: 10.1186/s13040-022-00297-8
Bin Yang, Wenzheng Bao, Baitong Chen, Dan Song
Citations: 1
Colorectal cancer subtype identification from differential gene expression levels using minimalist deep learning
IF 4.5 Zone 3 (Biology) Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2022-04-23 DOI: 10.1186/s13040-022-00295-w
Shaochuan Li, Yuning Yang, Xin Wang, Jun Li, Jun Yu, Xiangtao Li, Ka-Chun Wong
Cancer molecular subtyping plays a critical role in individualized patient treatment. In previous studies, high-throughput gene expression signature-based methods have been proposed to identify cancer subtypes. Unfortunately, the existing methods suffer from the curse of dimensionality, data sparsity, and computational deficiency. To address these problems, we propose a computational framework for colorectal cancer subtyping without any exploitation in model complexity and generality. A supervised learning framework based on deep learning (DeepCSD) is proposed to identify cancer subtypes. Specifically, based on the differentially expressed genes under cancer consensus molecular subtyping, we design a minimalist feed-forward neural network to capture the distinct molecular features in different cancer subtypes. To mitigate the overfitting of deep learning as much as possible, L1 and L2 regularization and dropout layers are added. To demonstrate the effectiveness of DeepCSD, we compared it with other methods including Random Forest (RF), Deep forest (gcForest), support vector machine (SVM), XGBoost, and DeepCC on eight independent colorectal cancer datasets. The results show that DeepCSD can achieve superior performance over the other algorithms. In addition, gene ontology enrichment and pathology analysis are conducted to reveal novel insights into the cancer subtype identification and characterization mechanisms. DeepCSD considers all subtype-specific genes as input, which is pathologically necessary for its completeness. At the same time, DeepCSD shows remarkable robustness in handling cross-platform gene expression data, achieving similar performance on both training and test data without significant model overfitting or exploitation of model complexity.
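
The abstract mentions L1 and L2 regularization and dropout as the measures against overfitting. The sketch below shows, in plain Python, what those two ingredients compute in general. It is not the DeepCSD implementation: the function names, the coefficient values, and the dropout rate are hypothetical, since the abstract does not give them.

```python
import random


def elastic_net_penalty(weights, l1=1e-4, l2=1e-4):
    """Combined L1 + L2 penalty that is added to the training loss."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale survivors so the expected activation is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time dropout is a no-op.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded so the sketch is reproducible
hidden = [0.5, 1.2, 0.0, 2.0]
print(dropout(hidden, rate=0.5, rng=rng))
print(elastic_net_penalty([1.0, -2.0], l1=0.1, l2=0.01))
```

The L1 term pushes individual weights toward zero, the L2 term discourages large weights overall, and dropout prevents units from co-adapting; a network would add the penalty to its classification loss at each training step.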
Citations: 3