{"title":"Session details: Session 20: Big Data in Bioinformatics II","authors":"T. Pollard","doi":"10.1145/3254563","DOIUrl":"https://doi.org/10.1145/3254563","url":null,"abstract":"","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123840558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Deep Representations from Heterogeneous Patient Data for Predictive Diagnosis","authors":"Chongyu Zhou, Yao Jia, M. Motani, J. Chew","doi":"10.1145/3107411.3107433","DOIUrl":"https://doi.org/10.1145/3107411.3107433","url":null,"abstract":"Predictive diagnosis benefits both patients and hospitals. Major challenges limiting the effectiveness of machine learning based predictive diagnosis include the lack of efficient feature selection methods and the heterogeneity of measured patient data (e.g., vital signs). In this paper, we propose DLFS, an efficient feature selection scheme based on deep learning that is applicable for heterogeneous data. DLFS is unsupervised in nature and can learn compact representations from patient data automatically for efficient prediction. In this paper, the specific problem of predicting the patients' length of stay in the hospital is investigated in a predictive diagnosis framework which uses DLFS for feature selection. Real patient data from the pneumonia database of the National University Health System (NUHS) in Singapore are collected to verify the effectiveness of DLFS. By running experiments on real-world patient data and comparing with several other commonly used feature selection methods, we demonstrate the advantage of the proposed DLFS scheme.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129569652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 11: Applications to Microbes and Imaging Genetics","authors":"A. Wright","doi":"10.1145/3254554","DOIUrl":"https://doi.org/10.1145/3254554","url":null,"abstract":"","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127665004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Unsupervised Named Entity Recognition Used in Text Annotation Tool (OntoMate) At Rat Genome Database","authors":"O. Ghiasvand, M. Shimoyama","doi":"10.1145/3107411.3108198","DOIUrl":"https://doi.org/10.1145/3107411.3108198","url":null,"abstract":"In model organism databases, one of the important tasks is to convert free text in biomedical literature to a structured data format. Curators in the Rat Genome Database (RGD), the primary source of rat genomic, genetic, and physiological data, spend considerable time and effort curating functional information for genes, QTLs, and strains from the literature. To increase curation efficiency and prioritize literature for data extraction, OntoMate was developed at RGD. This tool tags PubMed abstracts with genes, gene names, gene mutations, organism names, and terms from 16 ontologies/vocabularies, including synonyms and aliases, used to represent functional information. In this project, we have used an unsupervised tagging method to reduce human effort for creating training data. In this approach, a machine learning tool based on decision tree classification techniques has been developed. Mentions that uniquely belong to a semantic type serve as positive samples, and those with semantic types other than the desired group are assumed to be negative samples. An interface allows the user to create a complex query incorporating terms from any of the ontologies, gene symbols, organisms, dates and other parameters. The results return abstracts along with all tagged parameters indicated in the query, along with children of the ontology terms chosen. Results can be further filtered by the user through a panel that lists organisms, genes and diseases with the number of papers returned. Abstracts and papers are provided in rank order by relevance to the query. The tool is fully integrated into curation software so citations and abstracts can be automatically entered into the RGD database and given an ID, and genes and ontology terms in the tags can be checked to create annotations linked to the paper. The system was built with a scalable and open architecture, and literature is updated daily. This tool uses Solr indexing technology and categorizes papers based on a relevance score. It indexes and tags more than 27 million abstracts. With the use of bioNLP tools, RGD has added more automation to its curation workflow.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121202923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 15: Sequence Analysis and Genome Assembly","authors":"C. Boucher","doi":"10.1145/3254558","DOIUrl":"https://doi.org/10.1145/3254558","url":null,"abstract":"","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116608260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 13: Knowledge Representation Applications","authors":"P. Veltri","doi":"10.1145/3254556","DOIUrl":"https://doi.org/10.1145/3254556","url":null,"abstract":"","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116639147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification and Prediction of Intrinsically Disordered Regions in Proteins Using n-grams","authors":"Mauricio Oberti, I. Vaisman","doi":"10.1145/3107411.3107480","DOIUrl":"https://doi.org/10.1145/3107411.3107480","url":null,"abstract":"Intrinsically disordered proteins (IDPs) play an important role in many biological processes and are closely related to human diseases. They also have the potential to serve as targets for drug discovery, especially in disordered binding regions. Accurate prediction of IDPs is challenging; most methods rely on sequence profiles to improve accuracy, making them computationally expensive. This paper describes a method based on n-gram frequencies using reduced amino acid alphabets, which tries to overcome this challenge by utilizing only sequence information. Our results show that the described IDP prediction approach performs at the same level as some of the other state-of-the-art ab initio methods. However, the simplicity of n-grams allows the construction of decision trees which can provide important insights into common patterns and properties associated with disordered regions.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1997 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131165385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of a Polymer-Theoretic Approach to Describing Constraints on Reactions Between Antipeptide Antibodies and Intrinsically Disordered Peptide Antigens: Implications for B-Cell Epitope Prediction","authors":"S. Caoili","doi":"10.1145/3107411.3108190","DOIUrl":"https://doi.org/10.1145/3107411.3108190","url":null,"abstract":"B-cell epitope prediction aims to support translational applications as exemplified by peptide-based vaccine design. This entails selection of immunizing peptide sequences that tend to be intrinsically disordered and thus appropriately described within the framework of polymer theory. A fully extended hexapeptide sequence spans a typical antibody footprint; but disordered peptides are flexible rather than rigid, such that their B-cell epitopes may vary in length according to the diversity of conformations assumed upon binding by antibodies. Hence, peptides were modeled herein as worm-like chains, using an interpolated approximation of the radial probability density distribution function to estimate the probability that the ends of a peptidic sequence are separated by a distance less than or equal to a typical antibody footprint diameter. The results suggest that the epitopes are likely to be no more than 17 residues long, which is consistent with available structural data on immune complexes consisting of antipeptide antibodies bound to cognate peptide antigens. For such antigens, B-cell epitope prediction thus could proceed with initial scanning for intrinsically disordered sequences of length up to a physicochemically plausible maximum value (e.g., 17 residues), with analysis of progressively longer subsequences to identify nonredundant sets of putative epitopes (e.g., based on predicted affinity).","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132167852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instrumenting the Health Care Enterprise for Discovery in the Course of Clinical Care","authors":"S. Murphy","doi":"10.1145/3107411.3121000","DOIUrl":"https://doi.org/10.1145/3107411.3121000","url":null,"abstract":"Objectives: Although patients may have a wealth of imaging, genomic, monitoring, and personal device data, it has yet to be fully integrated into clinical care. Methods: We identify three reasons for the lack of integration. The first is that \"Big Data\" is poorly managed by most Electronic Medical Record Systems (EMRS). The data is mostly available on \"cloud-native\" platforms that are outside the scope of most EMRS, and even checking if such data is available on a patient often must be done outside the EMRS. The second reason is that extracting features from the Big Data that are relevant to healthcare often requires complex machine learning algorithms, such as determining if a genomic variant is protein-altering. The third reason is that applications that present the big data need to be modified constantly to reflect the current state of knowledge, such as instructing when to order a new set of genomic tests. In some cases, the applications need to be updated nightly. Results: A new architecture for the EMRS is evolving which could unite Big Data, machine learning, and clinical care through a microservice-based architecture which can host applications focused on quite specific aspects of clinical care, such as managing cancer immunotherapy. Conclusion: Informatics innovation, medical research, and clinical care go hand in hand as we look to infuse science-based practice into healthcare. Innovative methods will lead to a new ecosystem of Apps interacting with healthcare providers to fulfill a promise that is still to be determined.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123030129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Determining Optimal Features for Predicting Type IV Secretion System Effector Proteins for Coxiella burnetii","authors":"Zhila Esna Ashari Esfahani, K. Brayton, S. Broschat","doi":"10.1145/3107411.3107416","DOIUrl":"https://doi.org/10.1145/3107411.3107416","url":null,"abstract":"Type IV secretion systems (T4SS) are constructed from multiple protein complexes that exist in some types of bacterial pathogens and are responsible for delivering type IV effector proteins into host cells. Effectors target eukaryotic cells and try to manipulate host cell processes and the immune system of the host. Some work has been done to validate effectors experimentally, and recently a few scoring and machine learning-based methods have been developed to predict effectors from whole genome sequences. However, different types of features have been suggested to be effective. In this work, we gathered the features proposed in previous reports and calculated their values for a dataset of effectors and non-effectors of Coxiella burnetii. Then we ranked the features based on their importance in classifying effectors and non-effectors to determine the set of optimal features. Finally, a Support Vector Machine model was developed to test the optimal features by comparing them to a set of features proposed in a previous study. The outcome of the comparison supports the effectiveness of our optimal features.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"226 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116839044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}