Data and Text Mining in Bioinformatics: Latest Publications

Grounded Feature Selection for Biomedical Relation Extraction by the Combinative Approach
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665975
S. Song, G. Heo, Ha Jin Kim, H. Jung, Yonghwan Kim, Min Song
Relation extraction is an important task in biomedical areas such as protein-protein interactions, gene-disease interactions, and drug-disease interactions. In recent years, the automatic extraction of biomedical relations from the vast amount of biomedical text data has been widely researched. In this paper, we propose a hybrid approach to extracting relations based on a feature set derived with a rule-based approach. We then use different classification algorithms, such as SVM, Naïve Bayes, and Decision Tree classifiers, for relation classification. The rationale for adopting shallow parsing and other NLP techniques to extract relations is twofold: simplicity and robustness. We select seven features with the rule-based shallow parsing technique and evaluate the performance on four different public PPI corpora. Our experimental results show stable performance in F-measure even with relatively few features.
Citations: 10
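The relation extraction abstract above reports its results as F-measure. As a reminder of what that metric computes, here is a minimal sketch; the label names and the toy predictions are illustrative assumptions, not data from the paper.

```python
def precision_recall_f1(gold, predicted, positive="interaction"):
    """Compute precision, recall and F1 for a single positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels for five candidate protein pairs (hypothetical, for illustration only)
gold = ["interaction", "interaction", "none", "none", "interaction"]
pred = ["interaction", "none", "none", "interaction", "interaction"]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either one is low.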
Integrative Database for Exploring Compound Combinations of Natural Products for Medical Effects
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665986
Suhyun Ha, Sunyong Yoo, Moonshik Shin, J. Kwak, O. Kwon, M. Choi, K. Kang, Hojung Nam, Doheon Lee
Natural products used in dietary supplements, complementary and alternative medicine (CAM), and conventional medicine are composites of multiple chemical compounds. These chemical compounds potentially offer an extensive source for drug discovery, backed by accumulated knowledge of efficacy and safety. However, existing natural-product-related databases have drawbacks in both the standardization and the structuring of information. Therefore, in this work, we construct an integrated database of natural products by mapping prescription, herb, compound, and phenotype information to international identifiers and by structuring the efficacy information through database integration and text-mining methods. We expect that the constructed database can serve as a fundamental resource for natural products research.
Citations: 0
Mining Context-Specific Rules from the Literature for Virtual Human Model Simulation
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665987
Kwangmin Kim, Sejoon Lee, Kyunghyun Park, Dongjin Jang, Doheon Lee
A computer-based virtual human model is believed to be a promising solution for drug response identification. Literature mining is a competitive method for extracting the biological rules needed for human model simulation, since existing public databases provide only a limited amount of information applicable to the simulation. Here we propose a method for mining context-specific rules from the literature, for future application to virtual human model simulation. By integrating existing biological databases, we constructed a formalized ontology. In the PubMed literature, we tagged 11 distinct types of biological entities using both conditional random field (CRF) and dictionary-based Named Entity Recognition (NER). Recognized named entities were normalized and mapped to the formalized ontology. Context-specific biological rules between named entities, characterized by increase/decrease features, were extracted by a pattern-based method using regular expressions. As a result, we obtained organ-context-specific biological rules. Further research on enhanced rule and context extraction will follow.
Citations: 0
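The rule-mining abstract above mentions a pattern-based method that uses regular expressions to extract increase/decrease rules together with a context. The sketch below shows the general idea on plain sentences. The pattern, the field names, and the example sentences are assumptions made for illustration; the paper's actual patterns are not given in the abstract, and its pipeline runs NER and ontology mapping first.

```python
import re

# Hypothetical pattern: "<entity> increases|decreases <entity> [in (the) <organ>]"
RULE_PATTERN = re.compile(
    r"(?P<source>[A-Za-z0-9-]+)\s+"
    r"(?P<relation>increases|decreases)\s+"
    r"(?P<target>[A-Za-z0-9-]+)"
    r"(?:\s+in\s+(?:the\s+)?(?P<context>[A-Za-z]+))?"
)


def extract_rules(sentence):
    """Return one dict per increase/decrease statement found in the sentence."""
    rules = []
    for match in RULE_PATTERN.finditer(sentence):
        rules.append({
            "source": match.group("source"),
            "relation": "increase" if match.group("relation") == "increases" else "decrease",
            "target": match.group("target"),
            "context": match.group("context"),  # None when no organ is stated
        })
    return rules


with_context = extract_rules("Insulin decreases glucose in the liver.")
no_context = extract_rules("Leptin increases lipolysis.")
```

A real system would match over recognized entity spans rather than raw words, which this sketch does not attempt.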
A Display of Conceptual Structures in the Epidemiologic Literature
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665983
E. H. Kim, S. Song, Yonghwan Kim, Min Song
Biomedical literature from PubMed contains various types of entities such as diseases or organisms. The rapid growth in its size makes it harder to conceptualize; however, displaying the natural terms that occur in the text is more effective for understanding the target corpus to be searched than suggesting a concept related to a user query. Thus, we consider the natural, common words that biomedical information users actually write and speak. We extract bio-related terms by mapping the corpus to the UMLS. We show entity-based networks with natural language terms as they appear in the text. In this paper, we present simple and precise associative networks of natural terms in the biomedical literature. The entity-based networks and entity relations can make understanding the biomedical literature corpus more effective and easier by detecting related terms and their hidden relations in the documents. We considered bio-entities and their relations in the biomedical literature and focused on a graphic display that can improve users' perception of a large corpus. To this end, we chose epidemiology as the experimental domain, extracted entities by mapping the corpus to the UMLS, and drew their relations as inferred from the Semantic Network of the UMLS. We then calculated term frequencies, co-occurrences, and term pair similarities (see Figure 1). As a result, we demonstrate distinct networks that display conceptual structures in the biomedical literature using natural language terms rather than concepts (see Figure 2). The networks we present support a better understanding of the biomedical collection.
Citations: 2
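The epidemiology abstract above computes term frequencies, co-occurrences, and term pair similarities before drawing its networks. A minimal sketch of those three quantities follows. The toy documents are made up, and the Jaccard similarity over document sets is an assumed choice, since the abstract does not say which similarity measure was used.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each document is the list of entity terms found in it (hypothetical)
docs = [
    ["influenza", "vaccine", "children"],
    ["influenza", "outbreak"],
    ["vaccine", "children", "influenza"],
]

# Term frequency across the whole corpus
term_freq = Counter(term for doc in docs for term in doc)

# Document-level co-occurrence: count each unordered term pair once per document
cooccurrence = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        cooccurrence[(a, b)] += 1


def jaccard_similarity(a, b):
    """Share of documents containing both terms among documents containing either."""
    docs_a = {i for i, doc in enumerate(docs) if a in doc}
    docs_b = {i for i, doc in enumerate(docs) if b in doc}
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


similarity = jaccard_similarity("influenza", "vaccine")
```

Pairs with a high co-occurrence count or similarity would become the edges of an associative term network.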
Identifying Cancer Subtypes based on Somatic Mutation Profile
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665980
Sungchul Kim, Lee Sael, Hwanjo Yu
Tumor stratification is one of the basic tasks in cancer genomics, enabling a better understanding of tumor heterogeneity and better-targeted treatments. Various kinds of biological data can be used to stratify tumors, including gene expression and sequencing data. In this work, we use somatic mutation data. Two types of somatic mutation profiles, a binary somatic mutation profile and a weighted somatic mutation profile, are generated and clustered using k-means clustering with appropriate distance measures to obtain cancer subtypes for each cancer type. According to the predictive power of clinical features and survival time of the identified subtypes, the binary somatic mutation profile with Jaccard distance (B-Jac) performed best and the weighted somatic mutation profile with Euclidean distance (W-Euc) performed comparably. Both approaches performed significantly better than the typical usage of somatic mutation data, i.e., the binary somatic mutation profile with Euclidean distance (B-Euc).
Citations: 10
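The cancer subtype abstract above compares Jaccard and Euclidean distance on binary somatic mutation profiles. The sketch below only illustrates how the two distances can rank the same pairs of profiles differently. The profiles are invented, and this is not a claim about why B-Jac performed best in the paper.

```python
import math


def jaccard_distance(a, b):
    """1 - |genes mutated in both| / |genes mutated in either|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy binary profiles over 14 hypothetical genes (1 = mutated)
heavy_a = [1] * 10 + [0] * 4            # genes 0-9 mutated
heavy_b = [0, 0] + [1] * 10 + [0, 0]    # genes 2-11 mutated, 8 shared with heavy_a
light_a = [0] * 12 + [1, 0]             # one mutation
light_b = [0] * 13 + [1]                # one different mutation, nothing shared

heavy_jaccard = jaccard_distance(heavy_a, heavy_b)
heavy_euclid = euclidean_distance(heavy_a, heavy_b)
light_jaccard = jaccard_distance(light_a, light_b)
light_euclid = euclidean_distance(light_a, light_b)
```

In this toy case Euclidean distance places the two tumors that share no mutation closer together than the two that share eight, while Jaccard distance, which normalizes by the number of genes mutated in either tumor, ranks them the other way round.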
Systematic Identification of Context-dependent Conflicting Information in Biological Pathways
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665973
Seyeol Yoon, J. Jung, Hasun Yu, Mijin Kwon, Sungji Choo, Kyunghyun Park, Dongjin Jang, Sangwoo Kim, Doheon Lee
Interactions between biological entities such as genes, proteins, and metabolites, so-called pathways, are key to understanding the molecular mechanisms of life. As pathway information accumulates rapidly across various knowledge resources, there is growing interest in maintaining the integrity of heterogeneous databases. Here, we define a conflict as a state in which two contradictory pieces of evidence (e.g., 'A increases B' and 'A decreases B') coexist in the same pathway. Such conflicts damage consistency, so simulation-based inference on the integrated pathway network might be unreliable. We defined rules and rule groups. A rule consists of an interaction between two entities, a meta-relation (increase or decrease), and context terms describing tissue specificity or environmental conditions. Rules that share the same interaction are grouped into a rule group. If the rules in a group do not have a unanimous meta-relation, the rule group and its rules are judged to be conflicting. This analysis revealed that almost 20% of known interactions suffer from conflicting information, and that conflicting information occurred much more frequently in the literature than in public databases. Because entities can have dual functions depending on context, we reasoned that taking context into account might resolve conflicts. We therefore grouped rules that share the same context terms as well as the same interaction. This revealed that up to 86% of the conflicts could be resolved by considering context. Subsequent analysis also showed that contradictory records generally compete closely with each other, but some information may be suspicious when the evidence levels are seriously imbalanced. By identifying and resolving the conflicts, we expect that pathway databases can be cleaned and used for better secondary analyses such as gene/protein annotation, network dynamics, and qualitative/quantitative simulation.
Citations: 2
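The pathway-conflict abstract above defines a rule group as the rules sharing one interaction, and calls the group conflicting when its meta-relations are not unanimous; grouping by context as well can resolve some conflicts. A minimal sketch of that definition follows. The tuple layout, the function name, and the toy rules are assumptions for illustration, not the authors' data model.

```python
from collections import defaultdict

# Each rule: (source, target, meta_relation, context). Toy data, not from the paper.
rules = [
    ("A", "B", "increase", "liver"),
    ("A", "B", "decrease", "muscle"),
    ("C", "D", "increase", "liver"),
    ("C", "D", "decrease", "liver"),
    ("E", "F", "increase", "brain"),
]


def conflicting_groups(rules, use_context=False):
    """Return the keys of rule groups whose meta-relations are not unanimous."""
    groups = defaultdict(set)
    for source, target, relation, context in rules:
        key = (source, target, context) if use_context else (source, target)
        groups[key].add(relation)
    return {key for key, relations in groups.items() if len(relations) > 1}


without_context = conflicting_groups(rules)
with_context = conflicting_groups(rules, use_context=True)
```

Here the A-B conflict disappears once rules are grouped by context, because the two contradictory statements refer to different tissues, while the C-D conflict remains because both statements concern the liver.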
Injury Narrative Text Classification: A Preliminary Study
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665976
Lin Chen, K. Vallmuur, R. Nayak
Descriptions of patients' injuries are recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for the mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families, such as decision tree, probabilistic, neural network, instance-based, ensemble-based, and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a similar-sounding word. Meaningful phrases are identified and kept, instead of removing part of a phrase as a stop word. Abbreviations that appear in many forms are manually identified, and only one form of each abbreviation is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically, from about 28,000 to 5,000. The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers have been built on these reduced feature spaces. In the experiments, a set of tests is conducted to determine which classification method is best for medical text classification. Non-Negative Matrix Factorization combined with a Support Vector Machine achieves 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short document classification. Another finding is that the top-n terms should be removed in consultation with medical experts, as their removal affects the classification performance.
Citations: 0
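The injury-narrative abstract above contrasts binary weighting with TF/IDF weighting for short documents. The sketch below builds both kinds of vectors for a toy corpus. The narratives are invented, and the TF/IDF variant shown (relative term frequency times log of inverse document frequency) is one common formulation chosen as an assumption; the abstract does not state which variant was used.

```python
import math
from collections import Counter

# Toy injury narratives (hypothetical), already tokenized
docs = [
    "fell from ladder fractured wrist".split(),
    "burn to hand from hot oil".split(),
    "fell off bike cut to knee".split(),
]
vocab = sorted({term for doc in docs for term in doc})


def binary_vector(doc):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def tfidf_vector(doc):
    """Relative term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for term in vocab:
        doc_freq = sum(1 for d in docs if term in d)  # >= 1, vocab comes from docs
        idf = math.log(n_docs / doc_freq)
        vector.append(counts[term] / len(doc) * idf)
    return vector


binary = binary_vector(docs[0])
tfidf = tfidf_vector(docs[0])
```

In very short documents most terms occur only once, so the term-frequency part carries little information and the weights are driven almost entirely by the IDF factor; this may be one reason binary weighting can compete, though the abstract itself gives no explanation.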
Construction of Multi-level Networks Incorporating Molecule, Cell, Organ and Phenotype Properties for Drug-induced Phenotype Prediction
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665989
J. Jung, Hasun Yu, Seyeol Yoon, Mijin Kwon, Sungji Choo, Sangwoo Kim, Doheon Lee
Inferring drug-induced phenotypes via computational approaches can give substantial support to the drug discovery process. However, existing computational models, which are mainly based on a single-cell or single-organ model, are thought to be limited because phenotypes are consequences of stochastic biochemical processes among distant cells and organs as well as among molecules confined to one cell. Therefore, there is an urgent demand for a new computational model that represents heterogeneous biochemical interactions spanning the entire human body. To meet this demand, we constructed multi-level networks that incorporate previously uncovered high-level properties such as molecules, cells, organs, and phenotypes. Currently, the networks consist of 1,776,506 edges, including molecular networks within 76 pre-defined cell types, inter-cell interactions among the cell types, and gene (protein) relations to 429 phenotypes. We also plan to verify whether known drug-induced phenotypes are reproducible in the networks using a Petri-net-based simulation.
Citations: 0
Biomedical Named Entity Recognition Based on the Combination of Regional and Global Text Features
Pub Date : 2014-11-07 DOI: 10.1145/2665970.2665990
Y. Jeong, Dahee Lee, Namgi Han, Won Chul Kim, Min Song
Biomedical information extraction, especially Named Entity Recognition (NER), is a primary task in biomedical text mining because of the rapid growth of large-scale literature. Biomedical entity extraction aims to identify specific entities (words or phrases) in unstructured text data. In this work, we introduce a novel biomedical NER system that uses a combination of regional and global text features: linguistic, lexical, contextual, and syntactic features. Our system adopts Conditional Random Fields (CRFs) [1] as its machine learning algorithm and consists of two major pipelines (see Figure 1). We focus in particular on constructing the first pipeline, for text processing, in a modular manner and on discovering rich feature sets that capture comprehensive linguistic and contextual information. To implement the CRF framework in the second pipeline, our system uses a modified version of Mallet [2] to take advantage of feature induction. In 10-fold cross-validation on the GENETAG corpus [3], our system improves F-measure by 0.99% to 18.47% and achieves the highest precision compared with existing open-source biomedical NER systems. We find that several components, such as abundant key features, external resources, and feature induction, contribute to the performance of the proposed system.
Citations: 0
A prototype application for real-time recognition and disambiguation of clinical abbreviations
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512096
Yonghui Wu, J. Denny, S. Rosenbloom, R. Miller, D. Giuse, Min Song, Hua Xu
To save time, healthcare providers frequently use abbreviations while authoring clinical documents. Nevertheless, abbreviations that authors deem unambiguous often confuse other readers, including clinicians, patients, and natural language processing (NLP) systems. Most current clinical NLP systems "post-process" notes long after clinicians enter them into electronic health record systems (EHRs). Such post-processing cannot guarantee 100% accuracy in abbreviation identification and disambiguation, since multiple alternative interpretations exist. In this paper, the authors describe a prototype system for real-time Clinical Abbreviation Recognition and Disambiguation (CARD), i.e., a system that interacts with authors during note generation to verify correct abbreviation senses. The CARD system design anticipates future integration with web-based clinical documentation systems to improve the quality of healthcare records. The prototype application embodies three word-sense disambiguation (WSD) methods. We evaluated the accuracy and response times of the prototype CARD system in a simulated study. Using an existing test data set of 25 commonly observed, highly ambiguous clinical abbreviations, the evaluation demonstrated that the best WSD method had an accuracy of 88.8% and a reasonable average response time of 1.6 milliseconds per abbreviation. The study indicates the potential feasibility of real-time NLP-enabled abbreviation disambiguation within clinical documentation systems.
Citations: 12