
Latest Publications in Data and Text Mining in Bioinformatics

Predicting baby feeding method from unstructured electronic health record data
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390075
A. Rao, K. Maiden, Ben Carterette, Deborah B. Ehrenthal
Obesity is one of the most important health concerns in the United States and plays a significant role in the rising rates of chronic health conditions and health care costs. The percentage of the US population affected by childhood and adult obesity has been on a steady upward trend for the past few decades. According to the Centers for Disease Control and Prevention, 35.7% of US adults and 17% of children aged 2-19 years are obese. Researchers and health care providers studying obesity in the US and the rest of the world are interested in the factors affecting it. One such factor potentially related to the development of obesity is the type of feeding provided to babies. In this work we describe an electronic health record (EHR) data set of babies in which the feeding method is contained in the narrative portion of the record. We compare five supervised machine learning algorithms for predicting the feeding method as a discrete value based on the text in this field. We also compare these algorithms in terms of the classification error and the prediction probability estimates they generate.
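The two comparison axes mentioned above — classification error and the quality of prediction probability estimates — can be computed as in the following minimal sketch (plain Python; the toy labels and probabilities are illustrative, not from the paper's data):

```python
import math

def classification_error(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the truth."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_log_loss(y_true, probs):
    """Average negative log-probability assigned to the true class; a
    standard way to compare classifiers' probability estimates."""
    return -sum(math.log(p[t]) for t, p in zip(y_true, probs)) / len(y_true)
```

Lower is better for both; a classifier can have low error yet poorly calibrated probabilities, which is why the two are compared separately.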
Citations: 7
Session details: Mining clinical data and text
Pub Date : 2012-10-29 DOI: 10.1145/3260181
Hua Xu
Citations: 0
Indexing methods for efficient protein 3D surface search
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390078
Sungchul Kim, Lee Sael, Hwanjo Yu
This paper exploits efficient indexing techniques for protein structure search, where protein structures are represented as vectors by the 3D-Zernike Descriptor (3DZD). The 3DZD compactly represents the surface shape of a protein tertiary structure as a vector, and this simplified representation accelerates structural search. However, a further speed-up is needed for scenarios where multiple users access the database simultaneously. We address this need by applying two indexing techniques, iDistance and iKernel, to the 3DZDs. The results show that both iDistance and iKernel significantly improve search speed. In addition, we introduce an extended approach to protein structure search that exploits a characteristic of the 3DZD: the index structure is constructed using only the first few numbers of each 3DZD. To find the top-k similar structures, the top-10 x k similar structures are first selected using the reduced index structure, and the top-k structures are then selected from these candidates using a similarity measure over their full 3DZDs. With these indexing techniques, search time was reduced by 69.6% using iDistance, 77% using iKernel, 77.4% using extended iDistance, and 87.9% using extended iKernel.
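The filter-and-refine scheme of the extended approach can be sketched as below (a hypothetical pure-Python illustration: plain Euclidean distance over toy vectors stands in for the 3DZD similarity measure and the iDistance/iKernel index structures):

```python
import math

def dist(a, b):
    # Euclidean distance between two descriptor vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_stage_search(query, database, k, prefix_len=3):
    # Stage 1: coarse filter using only the first few components of each
    # descriptor, keeping 10*k candidates (as in the extended approach).
    candidates = sorted(
        database, key=lambda v: dist(query[:prefix_len], v[:prefix_len])
    )[:10 * k]
    # Stage 2: rerank the surviving candidates with the full descriptors.
    return sorted(candidates, key=lambda v: dist(query, v))[:k]
```

The coarse stage touches far fewer vector components per comparison, which is where the reported speed-up comes from; the exact reranking over the shortlist restores accuracy.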
Citations: 7
Inferring appropriate eligibility criteria in clinical trial protocols without labeled data
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390074
Angelo C. Restificar, S. Ananiadou
We consider the user task of designing clinical trial protocols and propose a method that outputs the most appropriate eligibility criteria from a potentially huge set of candidates. Each document d in our collection D is a clinical trial protocol, which itself contains a set of eligibility criteria. Given a small set of sample documents D', |D'| << |D|, that a user has initially identified as relevant (e.g., via a user query interface), our scoring method automatically suggests eligibility criteria from D by ranking them according to how appropriate they are to the clinical trial protocol currently being designed. We view a document as a mixture of latent topics and exploit this through a three-step procedure. First, we infer the latent topics in the sample documents using Latent Dirichlet Allocation (LDA) [3]. Next, we use logistic regression models to compute the probability that a given candidate criterion belongs to a particular topic. Lastly, we score each criterion by computing its expected value: the probability-weighted sum of the topic proportions inferred from the set of sample documents. Intuitively, the greater the probability that a candidate criterion belongs to the topics that are dominant in the samples, the higher its expected value or score. Results from our experiments indicate that our proposed method is 8 and 9 times better (for inclusion and exclusion criteria, respectively) than randomly choosing from a set of candidates obtained from relevant documents. In user simulation experiments, we were able to automatically construct eligibility criteria that are on average 75% and 70% similar (for inclusion and exclusion criteria, respectively) to the correct eligibility criteria.
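The expected-value scoring in the third step — a probability-weighted sum of the sample documents' topic proportions — reduces to a dot product, sketched here (the toy criteria and topic distributions are invented; in the paper they come from LDA and the logistic regression models):

```python
def score_criterion(topic_probs, sample_topic_proportions):
    # Expected value: P(criterion belongs to topic t), weighted by how
    # dominant topic t is in the user's sample documents D'.
    return sum(p * w for p, w in zip(topic_probs, sample_topic_proportions))

def rank_criteria(candidates, sample_topic_proportions):
    # candidates maps criterion text -> per-topic membership probabilities.
    return sorted(candidates,
                  key=lambda c: score_criterion(candidates[c],
                                                sample_topic_proportions),
                  reverse=True)
```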
Citations: 13
Lexicon-free and context-free drug names identification methods using hidden markov models and pointwise mutual information
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390072
Jacek Małyszko, A. Filipowska
This paper concerns the extraction of medicine names from free-text documents written in Polish. With lexicon-based approaches it is impossible to identify unknown or misspelled medicine names. In this paper, we present experimental results for two methods: a Hidden Markov Model (HMM) and a Pointwise Mutual Information (PMI)-based approach. The experiment was to identify medicine names without the use of a lexicon or contextual information. The results show that the HMM may be used as one of several steps in drug name identification (with an F-score slightly below 70% on the test set), while PMI can help increase the precision of the results achieved with the HMM, but at a significant loss in recall.
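Pointwise mutual information itself is a simple corpus statistic; a minimal sketch (the counts below are made up for illustration):

```python
import math

def pmi(pair_count, x_count, y_count, total):
    """PMI of an observed pair (x, y): log2 of how much more often the pair
    occurs than it would if x and y were independent."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))
```

A PMI near 0 means the pair co-occurs about as often as chance predicts; large positive values mark strongly associated token pairs, which is the signal a PMI-based identification method relies on.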
Citations: 3
Session details: Mining biological data and text
Pub Date : 2012-10-29 DOI: 10.1145/3260182
Min Song
Citations: 0
Extracting structured information from free-text medication prescriptions using dependencies
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390076
Andrew D. MacKinlay, Karin M. Verspoor
We explore an information extraction task whose goal is to determine the correct values for fields relevant to prescription drug administration, such as dosage amount, frequency, and route. The data set is a collection of prescriptions from a long-term health-care facility, a small subset of which we have manually annotated with values for these fields. We first examine a rule-based approach to the task that uses a dependency parse of the prescription, achieving accuracies of 60-95% across the individual fields, and 67.5% when all fields of the prescription are considered together. The outputs of such a system have potential applications in detecting irregularities in dosage delivery.
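A rule over dependency-parse output might look like the following toy sketch (the (head, relation, dependent) triples, vocabulary sets, and field names are all invented for illustration — the paper's actual rules operate over a full dependency parse):

```python
ROUTES = {"orally", "oral", "iv", "topically"}
FREQUENCIES = {"daily", "bid", "tid", "qid"}

def extract_fields(triples):
    """Fill prescription fields from (head, relation, dependent) triples."""
    fields = {"dose": None, "frequency": None, "route": None}
    for head, rel, dep in triples:
        word = dep.lower()
        if word in ROUTES:
            fields["route"] = word
        elif word in FREQUENCIES:
            fields["frequency"] = word
        elif rel == "nummod":
            # A numeric modifier attached to a unit word, e.g. "500" -> "mg".
            fields["dose"] = dep + " " + head
    return fields
```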
Citations: 8
Rule-based whole body modeling for analyzing multi-compound effects
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390083
W. Hwang, Y. Hwang, Sunjae Lee, Doheon Lee
Fundamental properties of biological systems, including robustness, redundancy, and crosstalk, have been reported to explain the limited efficacy and unexpected side effects of drugs. Many pharmaceutical laboratories have begun to develop multi-compound drugs to remedy this situation, and some have shown successful clinical results. Simultaneous application of multiple compounds can increase efficacy as well as reduce side effects through pharmacodynamic and pharmacokinetic interactions. However, such an approach incurs overwhelming costs for preclinical experiments and tests, as the number of possible combinations of compound dosages increases exponentially. Computer model-based experiments have emerged as one of the most promising ways to cope with this complexity. Though there have been many efforts to model specific molecular pathways using qualitative and quantitative formalisms, these suffer from unexpected results caused by distant interactions beyond their localized models. Here we propose a rule-based whole-body modeling platform. We have tested this platform with a Type 2 diabetes (T2D) model, which involves the malfunction of numerous organs such as the pancreas, circulatory system, liver, and muscle. We extracted 117 T2D-related rules by manual curation from the literature and from different types of existing models. The results of our simulation show the drug-effect pathways of T2D drugs and how combinations of drugs could work at the whole-body scale. We expect that this will provide insight for identifying effective combinations of drugs and their mechanisms in drug development.
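A rule-based simulation of this kind can be pictured as repeated forward application of condition-effect rules over a shared physiological state. The sketch below is a deliberately tiny stand-in (the two glucose/insulin rules and the state variables are invented, not among the paper's 117 curated rules):

```python
def simulate(state, rules, steps):
    """Apply every rule whose condition holds, merge the effects, repeat."""
    for _ in range(steps):
        updates = {}
        for condition, effect in rules:
            if condition(state):
                updates.update(effect(state))
        state = {**state, **updates}
    return state

# Two toy rules: high glucose triggers insulin; insulin lowers glucose.
toy_rules = [
    (lambda s: s["glucose"] > 5, lambda s: {"insulin": 1}),
    (lambda s: s["insulin"] > 0, lambda s: {"glucose": s["glucose"] - 3}),
]
```

Because every rule reads the same state, effects curated for one organ can propagate to others on later steps — the whole-body coupling the platform is built to capture.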
Citations: 3
Protein complex prediction via bottleneck-based graph partitioning
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390079
Jaegyoon Ahn, D. Lee, Youngmi Yoon, Yunku Yeu, Sanghyun Park
Detecting protein complexes is one of the essential and fundamental tasks in understanding various biological functions and processes, so precise identification of protein complexes is indispensable. For more precise detection, we propose a novel data structure that employs bottleneck proteins as partitioning points for detecting protein complexes. The partitioning process allows overlap between the resulting protein complexes. We applied our algorithm to several PPI (protein-protein interaction) networks of Saccharomyces cerevisiae and Homo sapiens, and validated our results against public databases of protein complexes. Our algorithm produced overlapping protein complexes with a significantly improved F1 score, which comes from higher precision.
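As a rough stand-in for "bottleneck" proteins, graph articulation points (nodes whose removal disconnects the network) can be found with the classic DFS low-point algorithm. The sketch below is illustrative only — the paper's bottleneck measure and partitioning procedure are its own:

```python
def articulation_points(graph):
    """graph: dict mapping node -> list of neighbors (undirected)."""
    disc, low, result = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in graph[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # u separates v's subtree from the rest of the graph.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        if parent is None and children > 1:
            result.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return result
```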
Citations: 3
Clinical entity recognition using structural support vector machines with rich features
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390073
Buzhou Tang, Yonghui Wu, Min Jiang, Hua Xu
Named entity recognition (NER) is an important task in natural language processing (NLP) of clinical text. Conditional Random Fields (CRFs), a sequential labeling algorithm, and Support Vector Machines (SVMs), based on large-margin theory, are two typical machine learning algorithms that have been widely applied to NER tasks, including clinical entity recognition. However, Structural Support Vector Machines (SSVMs), an algorithm that combines the advantages of both CRFs and SVMs, have not been investigated for clinical text processing. In this study, we applied the SSVM algorithm to the Concept Extraction task of the 2010 i2b2 clinical NLP challenge, which was to recognize entities of medical problems, treatments, and tests in hospital discharge summaries. Using the same training (N = 27,837) and test (N = 45,009) sets as the challenge, our evaluation showed that the SSVM-based NER system required less training time while achieving better performance than the CRF-based system for clinical entity recognition, when the same features were used. Our study also demonstrated that rich features such as unsupervised word representations improved the performance of clinical entity recognition. When rich features were integrated with SSVMs, our system achieved a highest F-measure of 85.74% on the test set of the 2010 i2b2 NLP challenge, outperforming the best system reported in the challenge by 0.5%.
Citations: 58