
Biodata Mining: latest articles

LoFTK: a framework for fully automated calculation of predicted Loss-of-Function variants and genes.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2023-02-02 DOI: 10.1186/s13040-023-00321-5
Abdulrahman Alasiri, Konrad J Karczewski, Brian Cole, Bao-Li Loza, Jason H Moore, Sander W van der Laan, Folkert W Asselbergs, Brendan J Keating, Jessica van Setten

Background: Loss-of-Function (LoF) variants in human genes are important due to their impact on clinical phenotypes and frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for detecting knockout genes caused by compound heterozygous (CH) LoF variants.

Results: We have developed the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from genotyped, imputed and sequenced genomes. LoFTK enables the identification of genes that are inactive in one or two copies and provides summary statistics for downstream analyses. LoFTK can identify CH LoF variants, which result in LoF genes with two copies lost. Using data from parents and offspring we show that 96% of CH LoF genes predicted by LoFTK in the offspring have the respective alleles donated by each parent.

Conclusions: LoFTK is a command-line tool that provides a reliable computational workflow for predicting LoF variants from genotyped and sequenced genomes and for identifying genes that are inactive in one or two copies. LoFTK is open software and is freely available to non-commercial users at https://github.com/CirculatoryHealth/LoFTK.
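
The idea of counting how many copies of a gene are lost, including the compound heterozygous case, can be illustrated with a minimal Python sketch. This is not LoFTK's implementation: it assumes phased genotypes are already available for the predicted LoF variants of one gene in one individual, and the function name `copies_lost` and the tuple encoding are hypothetical.

```python
from typing import Iterable, Tuple


def copies_lost(phased_genotypes: Iterable[Tuple[int, int]]) -> int:
    """Count inactivated copies (0, 1 or 2) of one gene in one individual.

    Each tuple is a phased genotype for one predicted LoF variant:
    (allele on haplotype 1, allele on haplotype 2), where 1 is the LoF
    allele and 0 is the reference allele. This encoding is an assumption
    made for illustration only.
    """
    hap1_hit = False
    hap2_hit = False
    for allele1, allele2 in phased_genotypes:
        hap1_hit = hap1_hit or allele1 == 1
        hap2_hit = hap2_hit or allele2 == 1
    # A haplotype counts as lost if it carries at least one LoF allele.
    return int(hap1_hit) + int(hap2_hit)


# Two heterozygous variants on different haplotypes: compound heterozygous.
compound_het = [(1, 0), (0, 1)]
# Two heterozygous variants on the same haplotype: only one copy is lost.
same_haplotype = [(1, 0), (1, 0)]
print(copies_lost(compound_het), copies_lost(same_haplotype))
```

Without phase information the two cases above are indistinguishable, which is why a sketch like this has to assume phased input.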

Citations: 0
Detection of iron deficiency anemia by medical images: a comparative study of machine learning algorithms.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2023-01-24 DOI: 10.1186/s13040-023-00319-z
Peter Appiahene, Justice Williams Asare, Emmanuel Timmy Donkoh, Giovanni Dimauro, Rosalia Maglietta

Background: Anemia is one of the global public health problems that affect children and pregnant women. Anemia occurs when the level of red blood cells within the body decreases, when the structure of the red blood cells is destroyed, or when the hemoglobin (Hb) level in the red blood cells falls below the normal threshold. It results from one or more of the following: increased red cell destruction, blood loss, defective cell production, or a depleted number of red blood cells.

Methods: The study proceeded in three phases. First, a dataset of palm images was gathered. Second, the images were pre-processed: images were extracted and augmented, the Region of Interest of each image was segmented, and the components of the CIE L*a*b* colour space (also referred to as CIELAB) were acquired. Finally, the proposed models for the detection of anemia were developed using CNN, k-NN, Naïve Bayes, SVM, and Decision Tree algorithms. The experiment started from 527 initial images; rotation, flipping and translation augmented the dataset to 2635 images. We randomly divided the augmented dataset into 70%, 10%, and 20% portions to train, validate and test the models, respectively.

Results: The results show that the models performed well when the palm is used to detect anemia, with Naïve Bayes achieving an accuracy of 99.96% and the CNN an accuracy of 99.92%, while the SVM achieved the lowest accuracy, 96.34%.

Conclusions: The invasive method of detecting anemia is expensive and time-consuming; however, anemia can be detected with non-invasive methods such as machine learning algorithms, which are efficient, cost-effective and take less time. In this work, we compared machine learning models (CNN, k-NN, Decision Tree, Naïve Bayes, and SVM) for detecting anemia from images of the palm. Finally, the study supports other similar studies on the potency of machine learning algorithms as a non-invasive method for detecting iron deficiency anemia.
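
The 70%/10%/20% random split described in the Methods can be sketched in a few lines of Python. The function name `split_indices`, the fixed seed, and the use of integer rounding are assumptions for illustration; the abstract does not say how the authors rounded the partition sizes or which tooling they used.

```python
import random
from typing import List, Tuple


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split n sample indices into train/validation/test (70/10/20).

    Partition sizes use integer floor division; the remainder goes to the
    test set. This rounding rule is an assumption, not taken from the paper.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 70 // 100
    n_val = n * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# 2635 is the augmented dataset size reported in the abstract.
train, val, test = split_indices(2635)
print(len(train), len(val), len(test))
```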

Citations: 11
Bacteria spatial tracking in Urban Park soils with MALDI-TOF Mass Spectrometry and Specific PCR.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2023-01-14 DOI: 10.1186/s13040-022-00318-6
Diego Arnal, Celeste Moya, Luigi Filippelli, Jaume Segura-Garcia, Sergi Maicas

Urban parks constitute one of the main leisure areas, especially for the most vulnerable people in our society: children and the elderly. Contact with soils can pose a health risk, so microbiological testing is a key aspect in determining whether the soils are suitable for public use. The aim of this work is to map the spatial distribution of potentially dangerous Enterobacteria, and also of isolates useful for bioremediation (lipase producers), from soils in an urban park in the area of Valencia (Spain). To this end, our team has collected 25 soil samples and isolated 500 microorganisms, using a mobile application to record information about the soil samples (soil features, temperature, humidity, etc.) with geolocation. A combined protocol including matrix-assisted laser desorption/ionization time of flight mass spectrometry (MALDI-TOF MS) and 16S rDNA sequencing PCR has been established to characterize the isolates. The results have been processed using spatial statistical techniques (the Kriging method), taking into account the number of isolated strains and also testing reactivity against standard pathogenic bacterial strains (Escherichia coli, Bacillus cereus, Salmonella, Pseudomonas and Staphylococcus aureus). Spatially interpolating each parameter with this statistical method increased the number of samples to 896. The combined use of methods from biology and computer science allows the quality of the soil in urban parks to be predicted in an agile way, which can generate confidence in its use by citizens.

Citations: 1
Robust and rigorous identification of tissue-specific genes by statistically extending tau score.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2022-12-09 DOI: 10.1186/s13040-022-00315-9
Hatice Büşra Lüleci, Alper Yılmaz

Objectives: In this study, we aimed to identify tissue-specific genes for various human tissues/organs more robustly and rigorously by extending the tau score algorithm.

Introduction: Tissue-specific genes are a class of genes whose function and expression are restricted to, or strongly preferred in, one or several tissues. Identification of tissue-specific genes is essential for discovering multi-cellular biological processes such as tissue-specific molecular regulation, tissue development, physiology, and the pathogenesis of tissue-associated diseases.

Materials and methods: Gene expression data derived from five large RNA sequencing (RNA-seq) projects, spanning 96 different human tissues, were retrieved from ArrayExpress and ExpressionAtlas. The first step is categorizing genes using significance filters and the tau score as a specificity index. After calculating tau for each gene in each dataset separately, the statistical distance from the maximum expression level was estimated using a new procedure. Specific expression of a gene in one or several tissues was then determined by integrating tau with the statistical distance estimate; we call this the extended tau approach. The tissue-specific genes obtained for 96 different human tissues were functionally annotated, and comparisons were carried out to show the effectiveness of the extended tau method.

Results and discussion: Genes were categorized by expression level, and tissue-specific genes were identified for a large number of tissues/organs. With the extended tau approach, genes were successfully assigned to multiple tissues, whereas the original tau score can assign tissue specificity to a single tissue only.
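
For orientation, the original tau specificity index that the study extends is commonly defined as tau = sum_i (1 - x_i / x_max) / (n - 1), where x_i is a gene's expression in tissue i and n is the number of tissues. The Python sketch below computes only this classic index; the statistical-distance extension is not specified in the abstract and is not reproduced here. The function name `tau_score` and the choice to return 0.0 for a gene with no expression are assumptions.

```python
from typing import Sequence


def tau_score(expression: Sequence[float]) -> float:
    """Classic tau tissue-specificity index for one gene.

    `expression` holds non-negative expression values, one per tissue.
    Returns a value in [0, 1]: 0 for uniform expression across tissues,
    1 for expression confined to a single tissue.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs expression values from at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        # Assumed convention: a gene expressed nowhere is not tissue-specific.
        return 0.0
    return sum(1 - x / x_max for x in expression) / (n - 1)


print(tau_score([10.0, 0.0, 0.0, 0.0]))  # expressed in one tissue only
print(tau_score([5.0, 5.0, 5.0, 5.0]))   # uniformly expressed
```

Because tau is a single number per gene, it says how specific a gene is but not to which tissues, which is the limitation the extended approach addresses.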

Citations: 2
Classification of breast cancer recurrence based on imputed data: a simulation study.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2022-12-07 DOI: 10.1186/s13040-022-00316-8
Rahibu A Abassi, Amina S Msengwa

Several studies have classified various real-life events, but few are in medical fields, particularly on breast cancer recurrence using statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers' discriminative ability on breast cancer recurrence in the presence of imputed missing data. Therefore, this article aims to fill this gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) on several datasets resulting from imputation under various simulation conditions. Our study adds to the knowledge of how classifiers' accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputation by chained equations, expectation-maximisation, and predictive mean matching were used to impute the incomplete datasets. For each classifier, classification accuracy and area under the Receiver Operating Characteristic (ROC) curve under the MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) on mean-imputed data at 45% missingness under the MCAR mechanism. Logistic regression on data imputed by predictive mean matching yielded the greatest area under the ROC curve (0.6418) at 30% missingness, while k-nearest neighbour topped that value (0.6428) at 60% missingness under the MCAR mechanism.
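
The simplest combination in this design, MCAR missingness followed by mean imputation, can be sketched as follows. Under MCAR every value is equally likely to be missing regardless of the data. The function names `make_mcar` and `mean_impute`, the use of `None` as the missing marker, and the toy input are assumptions for illustration; the study's own simulation code is not shown in the abstract.

```python
import random
from statistics import fmean
from typing import List, Optional, Sequence


def make_mcar(values: Sequence[float], percent: int, seed: int = 0) -> List[Optional[float]]:
    """Blank out `percent` % of the values completely at random (MCAR)."""
    rng = random.Random(seed)
    n_missing = len(values) * percent // 100
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]


def mean_impute(values: Sequence[Optional[float]]) -> List[float]:
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


complete = [float(i) for i in range(20)]    # toy numeric variable
incomplete = make_mcar(complete, percent=30)  # 30% missingness, as in the study
imputed = mean_impute(incomplete)
print(incomplete.count(None), len(imputed))
```

Mean imputation keeps the variable's mean but shrinks its variance, which is one reason classifier accuracy and ROC area can respond differently to the imputation method.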

Citations: 0
Detecting diseases in medical prescriptions using data mining methods.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2022-11-24 DOI: 10.1186/s13040-022-00314-w
Sana Nazari Nezhad, Mohammad H Zahedi, Elham Farahani

Every year, the health of millions of people around the world is compromised by misdiagnosis, which can sometimes even lead to death. In addition, it entails huge financial costs for patients, insurance companies, and governments. Furthermore, many physicians' professional lives are adversely affected by unintended errors in prescribing medication or misdiagnosing a disease. Our aim in this paper is to use data mining methods to find knowledge in a dataset of medical prescriptions that can be effective in improving the diagnostic process. In this study, the disease and its category were predicted using four single classification algorithms: decision tree, random forest, simple Bayes, and K-nearest neighbors. Then, in order to improve the performance of these algorithms, we used an Ensemble Learning methodology to build our proposed model. In the final step, a number of experiments were performed to compare the performance of different data mining techniques. The final model proposed in this study has an accuracy and kappa score of 62.86% and 0.620 for disease prediction and 74.39% and 0.720 for prediction of the disease category, respectively, which is better than other studies in this field. In general, the results of this study can be used to help maintain the health of patients, and prevent the wastage of the financial resources of patients, insurance companies, and governments. In addition, it can aid physicians and help their careers by providing timely information on diagnostic errors. Finally, these results can be used as a basis for future research in this field.
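
The kappa score reported alongside accuracy is presumably Cohen's kappa, which corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal Python sketch under that assumption follows; the function name `cohen_kappa`, the toy labels, and the handling of the degenerate single-class case are illustrative choices, not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Cohen's kappa between true labels and predicted labels."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of samples where prediction matches truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[label] * pred_counts[label] for label in true_counts) / (n * n)
    if p_e == 1.0:
        # Assumed convention: both sequences contain one identical label only.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


y_true = ["flu", "flu", "cold", "cold"]  # hypothetical disease labels
y_pred = ["flu", "flu", "cold", "flu"]
print(cohen_kappa(y_true, y_pred))
```

With many disease classes, chance agreement is small, so kappa stays close to accuracy; that is consistent with the reported pairs (62.86% with 0.620, 74.39% with 0.720).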

Citations: 2
Towards a potential pan-cancer prognostic signature for gene expression based on probesets and ensemble machine learning.
IF 4.5 Zone 3 (Biology) Q1 Mathematics Pub Date: 2022-11-03 DOI: 10.1186/s13040-022-00312-y
Davide Chicco, Abbas Alameer, Sara Rahmati, Giuseppe Jurman

Cancer is one of the leading causes of death worldwide and can be caused by environmental aspects (for example, exposure to asbestos), by human behavior (such as smoking), or by genetic factors. To understand which genes might be involved in patients' survival, researchers have invented prognostic genetic signatures: lists of genes that can be used in scientific analyses to predict if a patient will survive or not. In this study, we joined together five different prognostic signatures, each of them related to a specific cancer type, to generate a unique pan-cancer prognostic signature, that contains 207 unique probesets related to 187 unique gene symbols, with one particular probeset present in two cancer type-specific signatures (203072_at related to the MYO1E gene). We applied our proposed pan-cancer signature with the Random Forests machine learning method to 57 microarray gene expression datasets of 12 different cancer types, and analyzed the results. We also compared the performance of our pan-cancer signature with the performances of two alternative prognostic signatures, and with the performances of each cancer type-specific signature on their corresponding cancer type-specific datasets. Our results confirmed the effectiveness of our prognostic pan-cancer signature. Moreover, we performed a pathway enrichment analysis, which indicated an association between the signature genes and a protein-protein interaction analysis, that highlighted PIK3R2 and FN1 as key genes having a fundamental relevance in our signature, suggesting an important role in pan-cancer prognosis for both of them.

Citations: 1
An unsupervised image segmentation algorithm for coronary angiography.
IF 4.5, Zone 3 (Biology), Q1 Mathematics. Pub Date: 2022-10-21. DOI: 10.1186/s13040-022-00313-x
Zong-Xian Yin, Hong-Ming Xu

Computer visual systems can rapidly obtain a large amount of data and automatically process them with ease. These characteristics constitute advantages for the application of such systems in the automatic analysis of medical images, as well as in processing technology. The precision of image segmentation, which plays a critical role in computer visual systems, directly affects the quality of processing results. Coronary angiographs feature various background colors, complex patterns, and blurry edges. The image areas containing blood vessels cannot be precisely segmented through regular methods. Therefore, this study proposed an unsupervised learning algorithm that uses regional parameter expansion (RPE). This method was derived from the flood fill algorithm, which can effectively segment image areas containing blood vessels despite a complex background or uneven light and shadow. An optimal cover tree (OCT) algorithm was proposed for the establishment of coronary arteries and the estimation of vessel diameter. Through the region growing method, spanning trees were used to record the cover length of adjacent connections, thereby establishing vessel paths, and the length can be used to track changes in vessel diameter.
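
The abstract above says the proposed method was derived from the flood fill algorithm. As a rough illustration of that starting point only, here is a minimal sketch of a tolerance-based flood fill on a toy grayscale grid. It is not the authors' RPE or OCT algorithm; the function name `flood_fill_region`, the `tolerance` parameter, and the toy image are all assumptions made for this example.

```python
from collections import deque


def flood_fill_region(image, seed, tolerance):
    """Return the set of (row, col) pixels that are 4-connected to `seed`
    and whose intensity is within `tolerance` of the seed's intensity.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours (up, down, left, right).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Toy 4x5 image: low values stand in for a dark vessel on a bright background.
toy_image = [
    [200, 200, 40, 200, 200],
    [200, 45, 42, 200, 200],
    [200, 50, 200, 200, 41],
    [200, 48, 47, 200, 200],
]
vessel = flood_fill_region(toy_image, seed=(0, 2), tolerance=15)
print(sorted(vessel))
```

In this toy case the dark pixel at (2, 4) is left out because it is not connected to the seed, which shows why a plain flood fill depends on both the seed position and the tolerance. A fixed global tolerance is the simplest possible choice and would likely struggle with the uneven lighting the abstract mentions.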

Citations: 2
Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts.
IF 4.5, Zone 3 (Biology), Q1 Mathematics. Pub Date: 2022-10-18. DOI: 10.1186/s13040-022-00311-z
David N Nicholson, Daniel S Himmelstein, Casey S Greene

Background: Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types.

Results: We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1.

Conclusions: Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
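
The Background above describes label functions as small programs that annotate text automatically. A minimal sketch of that idea follows, assuming two hypothetical keyword rules and a plain majority vote to combine them. The rule names, the regular expressions, and the vote-based combination are illustrative assumptions and are not the label functions or the model used in the paper; data programming frameworks typically replace the vote with a learned model.

```python
import re

# Label values are assumptions for this sketch: a function may vote or abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def lf_treats_keyword(sentence):
    """Vote POSITIVE when a treatment-related keyword appears (hypothetical rule)."""
    if re.search(r"\b(treats?|treated|therapy for)\b", sentence, re.IGNORECASE):
        return POSITIVE
    return ABSTAIN


def lf_negation(sentence):
    """Vote NEGATIVE when a negation cue appears (hypothetical rule)."""
    if re.search(r"\b(no effect|did not|does not|failed to)\b", sentence, re.IGNORECASE):
        return NEGATIVE
    return ABSTAIN


def majority_vote(sentence, label_functions):
    """Combine non-abstaining votes; ties and all-abstain cases return ABSTAIN."""
    votes = [lf(sentence) for lf in label_functions]
    votes = [v for v in votes if v != ABSTAIN]
    positives = sum(v == POSITIVE for v in votes)
    negatives = len(votes) - positives
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE


LABEL_FUNCTIONS = [lf_treats_keyword, lf_negation]

for text in (
    "Metformin treats type 2 diabetes.",
    "Aspirin did not improve outcomes in this cohort.",
    "The gene is located on chromosome 7.",
):
    print(majority_vote(text, LABEL_FUNCTIONS), text)
```

Each rule is cheap to write but noisy, which matches the abstract's point that useful label functions need substantial error analysis before their output can be trusted.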

Citations: 8
EZCancerTarget: an open-access drug repurposing and data-collection tool to enhance target validation and optimize international research efforts against highly progressive cancers.
IF 4.5, Zone 3 (Biology), Q1 Mathematics. Pub Date: 2022-10-01. DOI: 10.1186/s13040-022-00307-9
David Dora, Timea Dora, Gabor Szegvari, Csongor Gerdán, Zoltan Lohinai

The expanding body of potential therapeutic targets requires easily accessible, structured, and transparent real-time interpretation of molecular data. Open-access genomic, proteomic and drug-repurposing databases transformed the landscape of cancer research, but most of them are difficult and time-consuming for casual users. Furthermore, to conduct systematic searches and data retrieval on multiple targets, researchers need the help of an expert bioinformatician, who is not always readily available for smaller research teams. We invite research teams to join and aim to enhance the cooperative work of more experienced groups to harmonize international efforts to overcome devastating malignancies. Here, we integrate available fundamental data and present a novel, open access, data-aggregating, drug repurposing platform, deriving our searches from the entries of Clue.io. We show how we integrated our previous expertise in small-cell lung cancer (SCLC) to initiate a new platform to overcome highly progressive cancers such as triple-negative breast and pancreatic cancer with data-aggregating approaches. Through the front end, the current content of the platform can be further expanded or replaced and users can create their drug-target list to select the clinically most relevant targets for further functional validation assays or drug trials. EZCancerTarget integrates searches from publicly available databases, such as PubChem, DrugBank, PubMed, and EMA, citing up-to-date and relevant literature of every target. Moreover, information on compounds is complemented with biological background information on eligible targets using entities like UniProt, String, and GeneCards, presenting relevant pathways, molecular- and biological function and subcellular localizations of these molecules. Cancer drug discovery requires a convergence of complex, often disparate fields. We present a simple, transparent, and user-friendly drug repurposing software to facilitate the efforts of research groups in the field of cancer research.

Citations: 0