Proceedings. IEEE International Conference on Bioinformatics and Biomedicine: Latest Articles, Page 10

Mining FDA resources to compute population-specific frequencies of adverse drug reactions.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2017-11-01 Epub Date: 2017-12-18 DOI: 10.1109/BIBM.2017.8217935

Aleksandar Poleksic, Carson Turner, Rishabh Dalal, Paul Gray, Lei Xie

Adverse drug reactions (ADRs) are among the world's main health and economic problems. As data on ADRs accumulate, there is a growing need for software tools that organize and store information on drug-ADR associations in a form that is easy to use and understand. Here we present a step-by-step computational procedure that extracts drug-ADR frequency data from the large collection of patient safety reports stored in the U.S. Food and Drug Administration (FDA) database. Our procedure is the first of its type capable of generating population-specific drug-ADR frequencies. The drug-ADR data generated by our method can be made specific to a single patient population group (such as gender or age), a single therapy characteristic (such as drug dosage or duration of therapy), or any combination of these.
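The abstract does not describe the procedure's internals, so the following is only a minimal sketch of what a population-specific drug-ADR frequency could look like in code. The report fields (`sex`, `age`, `drugs`, `reactions`), the drug names, and the `adr_frequencies` helper are all hypothetical and much simpler than real FDA safety reports.

```python
from collections import Counter

# Hypothetical, simplified safety reports; real FDA reports carry many more fields.
REPORTS = [
    {"sex": "F", "age": 34, "drugs": {"drugA"}, "reactions": {"nausea"}},
    {"sex": "F", "age": 61, "drugs": {"drugA"}, "reactions": {"rash"}},
    {"sex": "M", "age": 47, "drugs": {"drugA", "drugB"}, "reactions": {"nausea", "headache"}},
    {"sex": "F", "age": 29, "drugs": {"drugA"}, "reactions": {"nausea"}},
]


def adr_frequencies(reports, drug, predicate=lambda report: True):
    """Return, for each reaction, the fraction of selected reports that mention it.

    A report is selected when it lists `drug` and satisfies `predicate`,
    which encodes the population or therapy filter (sex, age band, ...).
    """
    selected = [r for r in reports if drug in r["drugs"] and predicate(r)]
    if not selected:
        return {}
    counts = Counter(rx for r in selected for rx in r["reactions"])
    return {rx: n / len(selected) for rx, n in counts.items()}


# Frequencies over all reports versus reports from female patients only.
overall = adr_frequencies(REPORTS, "drugA")
female = adr_frequencies(REPORTS, "drugA", lambda r: r["sex"] == "F")
```

Filters combine naturally by composing predicates, for example `lambda r: r["sex"] == "F" and r["age"] >= 60`, which mirrors the "any combination" of population and therapy characteristics mentioned in the abstract.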


Citations: 1

Automatic Methods to Extract New York Heart Association Classification from Clinical Notes.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2017-11-01 Epub Date: 2017-12-18 DOI: 10.1109/BIBM.2017.8217848

Rui Zhang, Sisi Ma, Liesa Shanahan, Jessica Munroe, Sarah Horn, Stuart Speedie

Cardiac Resynchronization Therapy (CRT) is an established pacing therapy for heart failure patients. The New York Heart Association (NYHA) classification is often used as a measure of a patient's response to CRT. Identifying the NYHA class of heart failure patients in an electronic health record (EHR) consistently over time can provide a better understanding of the progression of heart failure and a better assessment of CRT response and effectiveness. However, NYHA class is rarely stored in structured EHR data; such information is usually documented in unstructured clinical notes. In this study, we therefore investigated the use of natural language processing (NLP) methods to identify NYHA classification from clinical notes. We collected 6,174 clinical notes that were matched with hospital-specific custom NYHA class diagnosis codes. Machine-learning-based methods performed similarly to a rule-based method. The best machine-learning method, a support vector machine with n-gram features, achieved a 93% F-measure. Further validation of the findings is required.
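The abstract does not give the rules used by the rule-based method. As a rough illustration of the idea only, the sketch below pulls an NYHA class out of free text with a single regular expression; the pattern and the `extract_nyha_class` helper are assumptions, and a production rule set would also need to handle ranges such as "class II-III", negation, and historical mentions.

```python
import re

# Roman numerals that can follow "NYHA class"; digits 1-4 are handled separately.
ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4}

# Illustrative pattern: "NYHA", an optional "functional" and "class", then the class itself.
# Longer numerals come first so that "III" is not read as "I".
NYHA_PATTERN = re.compile(
    r"\bNYHA\s+(?:functional\s+)?(?:class\s+)?(IV|III|II|I|[1-4])\b",
    re.IGNORECASE,
)


def extract_nyha_class(note):
    """Return the first NYHA class (1-4) mentioned in `note`, or None if absent."""
    match = NYHA_PATTERN.search(note)
    if match is None:
        return None
    token = match.group(1).lower()
    return ROMAN.get(token) or int(token)


examples = [
    "Patient is NYHA class III today.",
    "NYHA functional class 2, stable on current regimen.",
    "No heart failure symptoms reported.",
]
classes = [extract_nyha_class(text) for text in examples]
```

A machine-learning alternative, such as the n-gram SVM mentioned above, would replace the hand-written pattern with features learned from labeled notes.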


Citations: 17

Towards an Obesity-Cancer Knowledge Base: Biomedical Entity Identification and Relation Detection.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/BIBM.2016.7822672

Juan Antonio Lossio-Ventura, William Hogan, François Modave, Amanda Hicks, Josh Hanna, Yi Guo, Zhe He, Jiang Bian

Obesity is associated with increased risks of various types of cancer, as well as a wide range of other chronic diseases. At the same time, access to health information encourages patient participation and improves health outcomes. However, existing online information on obesity and its relationship to cancer is heterogeneous, ranging from pre-clinical models and case studies to mere hypothesis-based scientific arguments. A formal knowledge representation (i.e., a semantic knowledge base) would help to better organize and deliver the quality health information on obesity and cancer that consumers need. Nevertheless, current ontologies describing obesity, cancer, and related entities are not designed to guide automatic knowledge base construction from heterogeneous information sources. Thus, in this paper, we present methods for named-entity recognition (NER) to extract biomedical entities from scholarly articles and for detecting whether two biomedical entities are related, with the long-term goal of building an obesity-cancer knowledge base. We leverage both linguistic and statistical approaches in the NER task, and the results surpass the state of the art. Further, based on statistical features extracted from the sentences, our method for relation detection obtains an accuracy of 99.3% and an F-measure of 0.993.


Citations: 14

Uncertainty Quantified Computational Analysis of the Energetics of Virus Capsid Assembly.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/BIBM.2016.7822775

N Clement, M Rasheed, C Bajaj

Most existing research on assembly pathway prediction and analysis for viral capsids makes the simplifying assumption that the configuration of the intermediate states can be extracted directly from the final configuration of the entire capsid. This assumption does not take into account the conformational changes of the constituent proteins, or the minor changes to the binding interfaces that continue throughout the assembly process until stabilization. This paper presents a statistical-ensemble-based approach that samples the configurational space of each monomer, along with the relative local orientation between monomers, to capture the uncertainties in binding and conformations. Furthermore, instead of using larger capsomers (trimers, pentamers) as building blocks, we allow all possible subassemblies to bind in all possible combinations. We represent the resulting assembly graph in two different ways. First, we use the Wilcoxon signed-rank measure to compare the distributions of binding free energy computed on the sampled conformations and so predict likely pathways. Second, we represent chemical equilibrium aspects of the transitions as a Bayesian factor graph in which both associations and dissociations are modeled based on concentrations and binding free energies. We applied these protocols to the feline panleukopenia virus and the Nudaurelia capensis virus. Results from these experiments showed a significant departure from those one would obtain if only the static configurations of the proteins were considered. Hence, we establish the importance of an uncertainty-aware protocol for pathway analysis, and provide a statistical framework as an important first step towards assembly pathway prediction with high statistical confidence.
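The abstract does not state how concentrations and binding free energies enter the model. For orientation only, the standard textbook relation for a single association step $A + B \rightleftharpoons AB$ is written out below; the symbols $\Delta G^{\circ}$ (standard binding free energy), $K_a$ (association constant), $R$ (gas constant), $T$ (temperature), and $c^{\circ}$ (standard concentration) are introduced here and are not taken from the paper.

```latex
\Delta G^{\circ} = -RT \ln K_a ,
\qquad
K_a = \frac{[AB]\, c^{\circ}}{[A]\,[B]}
\quad\Longrightarrow\quad
\frac{[AB]}{[A]\,[B]} = \frac{1}{c^{\circ}} \exp\!\left(-\frac{\Delta G^{\circ}}{RT}\right)
```

Under this relation, a more negative binding free energy shifts the equilibrium towards the bound subassembly, and a spread in sampled free energies translates into a spread in the predicted equilibrium concentrations.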


Citations: 0

Transcriptional Responses to Ultraviolet and Ionizing Radiation: An Approach Based on Graph Curvature.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/BIBM.2016.7822706

Yongxin Chen, Jung Hun Oh, Romeil Sandhu, Sangkyu Lee, Joseph O Deasy, Allen Tannenbaum

More than half of all cancer patients receive radiotherapy during their treatment. However, our understanding of abnormal transcriptional responses to radiation remains poor. In this study, we employ an extended definition of Ollivier-Ricci curvature based on the L1-Wasserstein distance to investigate genes and biological processes associated with ionizing radiation (IR) and ultraviolet radiation (UV) exposure using a microarray dataset. Gene expression levels were modeled on a gene interaction topology downloaded from the Human Protein Reference Database (HPRD). This was done separately for the IR, UV, and mock datasets. For each gene, the difference in curvature between the IR and mock graphs (and likewise between the UV and mock graphs) was used as a metric to estimate the extent to which the gene responds to radiation. Comparing the top 200 genes identified from the IR and UV graphs, we found that about 20-30% of the genes overlapped. Through gene ontology enrichment analysis, we found that metabolism-related biological processes were highly associated with both IR and UV radiation exposure.


Citations: 0

Classification of Use Status for Dietary Supplements in Clinical Notes.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/BIBM.2016.7822668

Yadan Fan, Lu He, Rui Zhang

Clinical notes contain rich information about dietary supplements, which is critical for detecting signals of dietary supplement side effects and interactions between drugs and supplements. One important element of supplement documentation is usage status, such as started or discontinued. Such information is usually stored in unstructured clinical notes. We developed a rule-based classifier to identify supplement usage status in clinical notes. The patient's status of supplement use was classified into four classes: Continuing (C), Discontinued (D), Started (S), and Unclassified (U). Clinical notes containing 10 of the most commonly consumed supplements (i.e., alfalfa, echinacea, fish oil, garlic, ginger, ginkgo, ginseng, melatonin, St. John's Wort, and Vitamin E) were retrieved from the University of Minnesota Clinical Data Repository. The gold standard was defined by manually annotating 1000 randomly selected sentences or statements mentioning at least one of these 10 supplements. The rules in the classifier were initially developed on two-thirds of the data for a set of 7 supplements (i.e., alfalfa, garlic, ginger, ginkgo, ginseng, St. John's Wort, and Vitamin E); performance was evaluated on the remaining one-third of this set. To evaluate the generalizability of the rules, we further validated them on a second test set covering the other 3 supplements (i.e., echinacea, fish oil, and melatonin). The classifier achieved F-measures of 0.95, 0.97, 0.96, and 0.96 for statuses C, D, S, and U, respectively, on the 7 supplements. The classifier also showed good generalizability when applied to the other 3 supplements, with F-measures of 0.96 for C, 0.96 for D, 0.95 for S, and 0.89 for U. This study demonstrated that the classifier can accurately classify supplement usage status and can be further integrated as a module into an existing natural language processing pipeline to support dietary supplement knowledge discovery.
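The classifier's actual rules are not listed in the abstract. The sketch below only illustrates the general shape of such a system: ordered trigger patterns map a sentence to C, D, S, or U, and a per-class F-measure compares predictions with gold labels. The trigger words, their ordering, and both helper functions are assumptions made for illustration.

```python
import re

# Illustrative trigger patterns only; they are checked in order and the first match wins.
RULES = [
    ("D", re.compile(r"\b(stop(ped)?|discontinu(e|ed|ing)|no longer (takes|taking)|quit)\b", re.I)),
    ("S", re.compile(r"\b(start(ed|ing)?|began|begin|initiat(e|ed))\b", re.I)),
    ("C", re.compile(r"\b(continu(e|es|ed|ing)|still tak(es|ing)|takes|taking)\b", re.I)),
]


def classify_status(sentence):
    """Return 'C', 'D', 'S', or 'U' (unclassified) for one sentence."""
    for label, pattern in RULES:
        if pattern.search(sentence):
            return label
    return "U"


def f_measure(gold, predicted, label):
    """Harmonic mean of precision and recall for a single class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


sentences = [
    "Patient stopped taking fish oil last month.",
    "She started melatonin for sleep.",
    "Continues ginkgo daily.",
    "Discussed garlic supplements.",
]
predicted = [classify_status(s) for s in sentences]
```

Checking the discontinuation patterns first is a deliberate choice in this sketch: a sentence such as "stopped taking fish oil" contains "taking", which would otherwise be read as continuing use.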


Citations: 7

DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/bibm.2016.7822515

Hamid Reza Hassanzadeh, May D Wang

Transcription factors (TFs) are macromolecules that bind to specific cis-regulatory sub-regions of DNA promoters and initiate transcription. Finding the exact location of these binding sites (also known as motifs) is important in a variety of domains, such as drug design and development. To address this need, several in vivo and in vitro techniques have been developed that try to characterize and predict the binding specificity of a protein to different DNA loci. The major problem with these techniques is that they are not accurate enough in predicting binding affinity or in characterizing the corresponding motifs. As a result, downstream analysis is required to uncover the locations where proteins of interest bind. Here, we propose DeeperBind, a long short-term recurrent convolutional network for predicting protein binding specificities with respect to DNA probes. DeeperBind can model the positional dynamics of probe sequences and hence effectively accounts for the contributions made by individual sub-regions of DNA sequences. Moreover, it can be trained and tested on datasets containing varying-length sequences. We apply our pipeline to datasets derived from protein binding microarrays (PBMs), an in vitro high-throughput technology for quantifying protein-DNA binding preferences, and present promising results. To the best of our knowledge, this is the most accurate pipeline for predicting binding specificities of DNA sequences from data produced by high-throughput technologies; it does so by using deep learning for feature generation and positional dynamics modeling.
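DeeperBind's architecture is not detailed in the abstract, so the following is not its implementation. It is a small sketch of the generic first step of a convolutional sequence model: one-hot encoding a DNA probe and sliding a filter along it to get one score per position. The `TGA_FILTER` weights are hand-set for illustration, whereas a trained network would learn them.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan(seq, filt):
    """Slide a (width x 4) filter along the sequence and return one score per window."""
    x = one_hot(seq)
    width = len(filt)
    return [
        sum(x[i + k][j] * filt[k][j] for k in range(width) for j in range(4))
        for i in range(len(x) - width + 1)
    ]


# Hypothetical filter that rewards the 3-mer "TGA" (columns are A, C, G, T).
TGA_FILTER = [
    [0, 0, 0, 1],  # position 1 prefers T
    [0, 0, 1, 0],  # position 2 prefers G
    [1, 0, 0, 0],  # position 3 prefers A
]

scores = scan("ACTGAC", TGA_FILTER)
```

The output has one score per window, so its length follows the probe length. A recurrent layer placed on top of such position-wise scores can consume sequences of different lengths, which is one plausible reading of how varying-length probes and positional dynamics are handled.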


Citations: 0

Deep Convolutional Neural Networks for Detecting Secondary Structures in Protein Density Maps from Cryo-Electron Microscopy.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/BIBM.2016.7822490

Rongjian Li, Dong Si, Tao Zeng, Shuiwang Ji, Jing He

Detecting the secondary structure of proteins from three-dimensional (3D) cryo-electron microscopy (cryo-EM) images is still a challenging task when the spatial resolution of the images is at a medium level (5-10 Å). Prior research focused on local features that may not capture the global information of image objects. In this study, we propose to use deep learning methods to extract highly representative global features and then automatically detect secondary structures of proteins. In particular, we build a convolutional neural network (CNN) classifier that predicts, for every individual voxel in a 3D cryo-EM image, the probability of each secondary structure label, such as α-helix, β-sheet, or background. To effectively incorporate the 3D spatial information in protein structures, we propose to perform 3D convolutions in the convolutional layers of the CNN. We show that the proposed CNN classifier can outperform the existing SVM method at identifying the secondary structure elements of proteins from medium-resolution 3D cryo-EM images.
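To make the 3D convolution step concrete, here is a deliberately naive sketch in plain Python. It is not the authors' network: it shows a single filter applied without padding, stride, bias, or nonlinearity, and the `conv3d` name and toy inputs are illustrative only. Deep learning frameworks implement the same operation far more efficiently.

```python
def conv3d(volume, kernel):
    """Valid (no padding) 3D cross-correlation on nested lists indexed [z][y][x]."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(depth - kd + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                # Weighted sum over the kd x kh x kw neighborhood anchored at (z, y, x).
                acc = 0.0
                for dz in range(kd):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy density map (3 x 3 x 3, all ones) and an averaging-style 2 x 2 x 2 filter of ones.
volume = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
feature_map = conv3d(volume, kernel)
```

Because each output value depends on a full 3D neighborhood of voxels, stacking such layers lets a voxel-wise classifier draw on spatial context in all three directions rather than slice by slice.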


Citations: 0
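
The per-voxel classification idea in the cryo-EM abstract above can be illustrated with a very small sketch. This is not the authors' network: it is a single 3D convolution layer with one filter per class followed by a per-voxel softmax, written in plain NumPy. The function names, the 3x3x3 filter size, the random (untrained) weights, and the 8x8x8 stand-in density map are all assumptions made for illustration.

```python
import numpy as np

def conv3d_same(volume, kernel):
    """Naive 3D cross-correlation with zero padding; output shape equals input shape.

    Assumes an odd-sized kernel so the padding is symmetric.
    """
    kd, kh, kw = kernel.shape
    pad = ((kd // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2)
    padded = np.pad(volume, pad, mode="constant")
    out = np.zeros_like(volume, dtype=float)
    depth, height, width = volume.shape
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                window = padded[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(window * kernel)
    return out

def voxel_class_probabilities(volume, kernels, biases):
    """One 3D conv filter per class, then a softmax over classes at every voxel."""
    logits = np.stack(
        [conv3d_same(volume, k) + b for k, b in zip(kernels, biases)], axis=-1
    )
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical class order, used only to interpret the last axis.
CLASSES = ("alpha-helix", "beta-sheet", "background")

rng = np.random.default_rng(0)
density = rng.random((8, 8, 8))          # stand-in for a cryo-EM density map
kernels = rng.normal(size=(3, 3, 3, 3))  # random, untrained filters (one per class)
biases = np.zeros(3)

probs = voxel_class_probabilities(density, kernels, biases)
labels = probs.argmax(axis=-1)           # index into CLASSES for each voxel
```

Because the filter slides along all three axes, each voxel's score depends on its full 3D neighbourhood rather than on a single 2D slice, which is the point the abstract makes about 3D convolutions. A trained model would stack several such layers and learn the filters from labeled maps.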

A Multi-Modal Graph-Based Semi-Supervised Pipeline for Predicting Cancer Survival.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/bibm.2016.7822516

Hamid Reza Hassanzadeh, John H Phan, May D Wang

Cancer survival prediction is an active area of research that can help prevent unnecessary therapies and improve patients' quality of life. Gene expression profiling is widely used in cancer studies to discover informative biomarkers that aid in predicting different clinical endpoints. We use multiple modalities of data derived from RNA deep-sequencing (RNA-seq) to predict the survival of cancer patients. Despite the wealth of information available in the expression profiles of tumors, achieving this objective remains a major challenge, largely because data samples are scarce relative to the high dimensionality of the expression profiles. As such, the analysis of transcriptomic data modalities calls for state-of-the-art big-data analytics techniques that can make maximal use of all the available data to discover the relevant information hidden within a significant amount of noise. In this paper, we propose a pipeline that predicts cancer patients' survival by exploiting the structure of the input (manifold learning) and by leveraging the unlabeled samples using Laplacian support vector machines, a graph-based semi-supervised learning (GSSL) paradigm. We show that, under certain circumstances, no single modality per se yields the best accuracy, and that fusing different models via a stacked generalization strategy may boost the accuracy synergistically. We apply our approach to two cancer datasets and present promising results. We maintain that a similar pipeline can be used for predictive tasks where labeled samples are expensive to acquire.

Citations: 5
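
The graph-based semi-supervised idea in the cancer survival abstract above, using unlabeled samples through a similarity graph, can be sketched as follows. Note the substitution: this is not a Laplacian support vector machine and not the authors' pipeline. It uses the simpler harmonic label propagation on a graph Laplacian, and it leaves out the multi-modal stacking step entirely. The RBF affinity, the gamma value, the function names, and the synthetic two-cluster data are assumptions chosen for illustration.

```python
import numpy as np

def rbf_affinity(X, gamma=1.0):
    """Dense RBF similarity graph with no self-loops."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dist)
    np.fill_diagonal(W, 0.0)
    return W

def harmonic_label_propagation(X, y, labeled_idx, gamma=1.0):
    """Propagate +1/-1 labels to unlabeled samples via the graph Laplacian.

    Only y[labeled_idx] is read. Returns a real-valued score per sample;
    the sign of the score is the predicted class.
    """
    n = X.shape[0]
    W = rbf_affinity(X, gamma)
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    labeled = np.asarray(labeled_idx)
    unlabeled = np.setdiff1d(np.arange(n), labeled)

    f = np.zeros(n)
    f[labeled] = y[labeled]
    # Harmonic solution: L_uu f_u = -L_ul f_l
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled)]
    f[unlabeled] = np.linalg.solve(L_uu, -L_ul @ f[labeled])
    return f

# Synthetic stand-in for expression features: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, size=(20, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(20, 2)),
])
y_true = np.array([1] * 20 + [-1] * 20)

labeled_idx = [0, 20]                 # one labeled sample per class
y = np.zeros(40)
y[labeled_idx] = y_true[labeled_idx]  # all other labels stay hidden

scores = harmonic_label_propagation(X, y, labeled_idx)
y_pred = np.where(scores >= 0, 1, -1)
```

With only two labels, the remaining samples are classified by how they connect to the labeled ones through the graph, which is the general mechanism that lets such methods benefit from unlabeled data when labels are expensive to acquire.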

Analysis of Temporal Constraints in Qualitative Eligibility Criteria of Cancer Clinical Studies.

Proceedings. IEEE International Conference on Bioinformatics and Biomedicine

Pub Date : 2016-12-01 Epub Date: 2017-01-19 DOI: 10.1109/BIBM.2016.7822607

Zhe He, Zhiwei Chen, Jiang Bian

Clinical studies, especially randomized controlled trials, generate gold-standard medical evidence. However, the lack of population representativeness of clinical studies has hampered their generalizability to the real-world population. Overly restrictive qualitative criteria are often applied to exclude patients. In this work, we developed a lexical-pattern-based tool to structure qualitative eligibility criteria with temporal constraints, and used it to analyze over 10,800 cancer clinical studies. Our results showed that restrictive temporal constraints are often applied to qualitative criteria in cancer studies, limiting the generalizability of their results.

Citations: 2
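
To make the notion of a lexical pattern for temporal constraints in the abstract above concrete, here is a minimal sketch using a single regular expression. The pattern, the list of relation phrases, the output fields, and the example criterion are all hypothetical; the abstract does not describe the authors' actual patterns, and a real tool would need many more of them.

```python
import re

# Hypothetical pattern: a relation phrase, an optional "the past/last",
# a number, and a time unit.
TEMPORAL_PATTERN = re.compile(
    r"\b(?P<relation>within|in the (?:past|last)|for at least|at least|"
    r"more than|less than)\s+"
    r"(?:the\s+(?:past|last)\s+)?"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>day|week|month|year)s?\b",
    re.IGNORECASE,
)

def extract_temporal_constraints(criterion):
    """Return every temporal constraint found in one eligibility criterion."""
    return [
        {
            "relation": match.group("relation").lower(),
            "value": float(match.group("value")),
            "unit": match.group("unit").lower(),
        }
        for match in TEMPORAL_PATTERN.finditer(criterion)
    ]

example = (
    "No prior chemotherapy within the last 6 months; "
    "no major surgery within 4 weeks."
)
constraints = extract_temporal_constraints(example)
# constraints holds two entries: (within, 6.0, month) and (within, 4.0, week)
```

Structuring the constraint as relation, value, and unit makes it possible to compare time windows across studies, for example by converting every window to days before aggregating.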