Transformer-based Multi-target Regression on Electronic Health Records for Primordial Prevention of Cardiovascular Disease (pp. 726-731)
Pub Date: 2021-12-01 | DOI: 10.1109/bibm52615.2021.9669441
Raphael Poulain, Mehak Gupta, Randi Foraker, Rahmatollah Beheshti
Machine learning algorithms have been widely used to capture the static and temporal patterns within electronic health records (EHRs). While many studies focus on the (primary) prevention of diseases, primordial prevention (preventing the factors that are known to increase the risk of a disease occurring) remains widely under-investigated. In this study, we propose a multi-target regression model that leverages transformers to learn bidirectional representations of EHR data and predict the future values of 11 major modifiable risk factors of cardiovascular disease (CVD). Inspired by the proven results of pre-training in natural language processing, we apply the same principles to EHR data, dividing the training of our model into two phases: pre-training and fine-tuning. We use the fine-tuned transformer model in a multi-target regression setting: we combine the 11 disjoint prediction tasks by adding shared and target-specific layers to the model and jointly train the entire model. We evaluate the performance of our proposed method on a large, publicly available EHR dataset. Through various experiments, we demonstrate that the proposed method obtains a significant improvement over the baselines (12.6% in MAE, on average across all 11 outputs).
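A minimal PyTorch sketch of the shared-plus-target-specific head described above; the encoder output dimension, the shared layer width, and the L1 objective are illustrative assumptions, and a random tensor stands in for the pre-trained EHR transformer's output.

```python
import torch
import torch.nn as nn

class MultiTargetHead(nn.Module):
    """A shared layer followed by one regression head per risk factor."""
    def __init__(self, hidden_dim=256, shared_dim=128, n_targets=11):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, shared_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(shared_dim, 1) for _ in range(n_targets))

    def forward(self, pooled):  # pooled: (batch, hidden_dim) from the transformer
        z = self.shared(pooled)
        return torch.cat([head(z) for head in self.heads], dim=1)  # (batch, n_targets)

# Joint training: one loss summed over all 11 targets.
pooled = torch.randn(8, 256)   # stand-in for the fine-tuned encoder output
targets = torch.randn(8, 11)   # future values of the 11 risk factors
loss = nn.L1Loss()(MultiTargetHead()(pooled), targets)
loss.backward()
```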
{"title":"Transformer-based Multi-target Regression on Electronic Health Records for Primordial Prevention of Cardiovascular Disease.","authors":"Raphael Poulain, Mehak Gupta, Randi Foraker, Rahmatollah Beheshti","doi":"10.1109/bibm52615.2021.9669441","DOIUrl":"https://doi.org/10.1109/bibm52615.2021.9669441","url":null,"abstract":"<p><p>Machine learning algorithms have been widely used to capture the static and temporal patterns within electronic health records (EHRs). While many studies focus on the (primary) prevention of diseases, primordial prevention (preventing the factors that are known to increase the risk of a disease occurring) is still widely under-investigated. In this study, we propose a multi-target regression model leveraging transformers to learn the bidirectional representations of EHR data and predict the future values of 11 major modifiable risk factors of cardiovascular disease (CVD). Inspired by the proven results of pre-training in natural language processing studies, we apply the same principles on EHR data, dividing the training of our model into two phases: pre-training and fine-tuning. We use the fine-tuned transformer model in a \"multi-target regression\" theme. Following this theme, we combine the 11 disjoint prediction tasks by adding shared and target-specific layers to the model and jointly train the entire model. We evaluate the performance of our proposed method on a large publicly available EHR dataset. Through various experiments, we demonstrate that the proposed method obtains a significant improvement (12.6% MAE on average across all 11 different outputs) over the baselines.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2021 ","pages":"726-731"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9859711/pdf/nihms-1865432.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9166302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tracing Filaments in Simulated 3D Cryo-Electron Tomography Maps Using a Fast Dynamic Programming Algorithm (pp. 2553-2559)
Pub Date: 2021-12-01 | DOI: 10.1109/bibm52615.2021.9669318
Salim Sazzed, Peter Scheible, Jing He, Willy Wriggers
We propose a fast, dynamic programming-based framework for tracing actin filaments in 3D maps of subcellular components in cryo-electron tomography. The approach can identify high-density filament segments in various orientations, but it takes advantage of the arrangement of actin filaments within cells into more or less tightly aligned bundles. Assuming that the tomogram can be rotated such that the filaments are oriented along a dominant direction (i.e., the X, Y, or Z axis), the proposed framework first identifies local seed points that form the origin of candidate filament segments (CFSs), which are then grown from the seeds using a fast dynamic programming algorithm. The CFS length l can be tuned to the nominal resolution of the tomogram or the separation of desired features, or it can be used to restrict the curvature of filaments that deviate from the overall bundle direction. In subsequent steps, the CFSs are filtered based on backward tracing and path density analysis. Finally, neighboring CFSs are fused based on a collinearity criterion to bridge any noise artifacts in the 3D map that would otherwise fractionalize the tracing. We validate our proposed framework on simulated tomograms that closely mimic the features and appearance of experimental maps.
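To make the dynamic programming idea concrete, here is a simplified NumPy sketch that grows a single maximum-density path through a 3D map, stepping one voxel at a time along the dominant (here Z) axis with a bounded lateral shift. The seed selection, filtering, and fusion stages of the actual framework are omitted, and all parameters are illustrative.

```python
import numpy as np

def trace_best_path(density, max_shift=1):
    """Dynamic program: maximize cumulative density along the Z axis,
    shifting at most max_shift voxels laterally per step."""
    Z, X, Y = density.shape
    score = np.full((Z, X, Y), -np.inf)
    prev = np.zeros((Z, X, Y, 2), dtype=int)
    score[0] = density[0]
    for z in range(1, Z):
        for x in range(X):
            for y in range(Y):
                x0, x1 = max(0, x - max_shift), min(X, x + max_shift + 1)
                y0, y1 = max(0, y - max_shift), min(Y, y + max_shift + 1)
                window = score[z - 1, x0:x1, y0:y1]
                i, j = np.unravel_index(np.argmax(window), window.shape)
                score[z, x, y] = density[z, x, y] + window[i, j]
                prev[z, x, y] = (x0 + i, y0 + j)
    # Backtrace from the best endpoint on the final slice.
    x, y = np.unravel_index(np.argmax(score[-1]), (X, Y))
    path = [(Z - 1, x, y)]
    for z in range(Z - 1, 0, -1):
        x, y = prev[z, x, y]
        path.append((z - 1, x, y))
    return path[::-1]

path = trace_best_path(np.random.rand(10, 16, 16))
```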
{"title":"Tracing Filaments in Simulated 3D Cryo-Electron Tomography Maps Using a Fast Dynamic Programming Algorithm.","authors":"Salim Sazzed, Peter Scheible, Jing He, Willy Wriggers","doi":"10.1109/bibm52615.2021.9669318","DOIUrl":"https://doi.org/10.1109/bibm52615.2021.9669318","url":null,"abstract":"<p><p>We propose a fast, dynamic programming-based framework for tracing actin filaments in 3D maps of subcellular components in cryo-electron tomography. The approach can identify high-density filament segments in various orientations, but it takes advantage of the arrangement of actin filaments within cells into more or less tightly aligned bundles. Assuming that the tomogram can be rotated such that the filaments can be oriented to be directed in a dominant direction (i.e., the <math><mi>X</mi></math>, <math><mi>Y</mi></math>, or <math><mi>Z</mi></math> axis), the proposed framework first identifies local seed points that form the origin of candidate filament segments (CFSs), which are then grown from the seeds using a fast dynamic programming algorithm. The CFS length <math><mrow><mi>l</mi></mrow></math> can be tuned to the nominal resolution of the tomogram or the separation of desired features, or it can be used to restrict the curvature of filaments that deviate from the overall bundle direction. In subsequent steps, the CFSs are filtered based on backward tracing and path density analysis. Finally, neighboring CFSs are fused based on a collinearity criterion to bridge any noise artifacts in the 3D map that would otherwise fractionalize the tracing. We validate our proposed framework on simulated tomograms that closely mimic the features and appearance of experimental maps.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2021 ","pages":"2553-2559"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10353374/pdf/nihms-1823578.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9852614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TomoSim: Simulation of Filamentous Cryo-Electron Tomograms (pp. 2560-2565)
Pub Date: 2021-12-01 | DOI: 10.1109/bibm52615.2021.9669370
Peter Scheible, Salim Sazzed, Jing He, Willy Wriggers
As automated filament tracing algorithms in cryo-electron tomography (cryo-ET) continue to improve, validating these approaches has become increasingly important. Having a known ground truth on which to base predictions is crucial for reliably testing predicted cytoskeletal filaments, because the detailed structure of the filaments in experimental tomograms is obscured by low resolution, noise, and missing-wedge artifacts in Fourier space. We present a software tool for the realistic simulation of tomographic maps (TomoSim) based on a known filament trace. The parameters of the simulated map are automatically matched to those of a corresponding experimental map. We describe the computational details of the first prototype of our approach, which includes wedge masking in Fourier space, noise color, and signal-to-noise matching. We also discuss current and potential future applications of the approach in the validation of concurrent filament tracing methods in cryo-ET.
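Of the three simulation ingredients mentioned (wedge masking, noise color, signal-to-noise matching), wedge masking is the easiest to sketch. The NumPy fragment below zeroes a missing wedge in Fourier space; the 30-degree half-angle (corresponding to a +/-60-degree tilt range), the choice of tilt and beam axes, and the function name are assumptions, not TomoSim's actual implementation.

```python
import numpy as np

def apply_missing_wedge(volume, half_angle_deg=30.0):
    """Zero Fourier components within half_angle_deg of the kz (beam) axis,
    mimicking the wedge left unmeasured by a limited tilt range about y."""
    kz = np.fft.fftfreq(volume.shape[0])[:, None, None]
    kx = np.fft.fftfreq(volume.shape[2])[None, None, :]
    wedge = np.arctan2(np.abs(kx), np.abs(kz)) < np.deg2rad(half_angle_deg)
    wedge &= ~((kx == 0) & (kz == 0))   # keep the measured tilt-axis line
    ft = np.fft.fftn(volume) * ~wedge   # mask broadcasts over the y axis
    return np.real(np.fft.ifftn(ft))

wedged = apply_missing_wedge(np.random.rand(32, 32, 32))
```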
{"title":"<i>TomoSim</i>: Simulation of Filamentous Cryo-Electron Tomograms.","authors":"Peter Scheible, Salim Sazzed, Jing He, Willy Wriggers","doi":"10.1109/bibm52615.2021.9669370","DOIUrl":"10.1109/bibm52615.2021.9669370","url":null,"abstract":"<p><p>As automated filament tracing algorithms in cryo-electron tomography (cryo-ET) continue to improve, the validation of these approaches has become more incumbent. Having a known ground truth on which to base predictions is crucial to reliably test predicted cytoskeletal filaments because the detailed structure of the filaments in experimental tomograms is obscured by a low resolution, as well as by noise and missing Fourier space wedge artifacts. We present a software tool for the realistic simulation of tomographic maps (<i>TomoSim</i>) based on a known filament trace. The parameters of the simulated map are automatically matched to those of a corresponding experimental map. We describe the computational details of the first prototype of our approach, which includes wedge masking in Fourier space, noise color, and signal-to-noise matching. We also discuss current and potential future applications of the approach in the validation of concurrent filament tracing methods in cryo-ET.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2021 ","pages":"2560-2565"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10338425/pdf/nihms-1823577.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10199020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extracting Disease-Relevant Features with Adversarial Regularization (pp. 3464-3471)
Pub Date: 2021-12-01 | DOI: 10.1109/bibm52615.2021.9669878
Junxiang Chen, Li Sun, Ke Yu, Kayhan Batmanghelich
Extracting hidden phenotypes is essential in medical data analysis because it facilitates disease subtyping, diagnosis, and understanding of disease etiology. Since the hidden phenotype is usually a low-dimensional representation that comprehensively describes the disease, we require a dimensionality-reduction method that captures as much disease-relevant information as possible. However, most unsupervised or self-supervised methods cannot achieve this goal because they learn a holistic representation containing both disease-relevant and disease-irrelevant information. Supervised methods can capture information that is predictive of only the target clinical variable, but the learned representation is usually not generalizable to the various aspects of the disease. Hence, we develop a dimensionality-reduction approach to extract Disease-Relevant Features (DRFs) based on information theory. We propose to use clinical variables that weakly define the disease as so-called anchors. We derive a formulation that makes the DRFs predictive of the anchors while forcing the remaining representation to be irrelevant to the anchors via adversarial regularization. We apply our method to a large-scale study of Chronic Obstructive Pulmonary Disease (COPD). Our experiments show: (1) The learned DRFs are as predictive of the anchors as the original representation, despite having a significantly lower dimension. (2) Compared to a supervised representation, the learned DRFs are more predictive of other relevant disease metrics that are not used during training. (3) The learned DRFs are related to non-imaging biological measurements such as gene expression, suggesting that the DRFs capture information related to the underlying biology of the disease.
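A toy PyTorch sketch of the adversarial objective: the representation is split into a DRF part trained to predict the anchors and a remainder that is adversarially discouraged from carrying anchor information. The gradient-reversal trick, the linear encoder, and all dimensions are simplifying assumptions; the paper derives its formulation from information theory rather than from this exact setup.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

encoder = nn.Linear(64, 32)     # toy encoder: first 16 dims = DRF, rest = remainder
anchor_head = nn.Linear(16, 3)  # the DRF should predict the anchors well...
adversary = nn.Linear(16, 3)    # ...the remainder should not

x, anchors = torch.randn(8, 64), torch.randn(8, 3)
h = encoder(x)
drf, rest = h[:, :16], h[:, 16:]
mse = nn.MSELoss()
loss = mse(anchor_head(drf), anchors) \
     + mse(adversary(GradReverse.apply(rest)), anchors)
loss.backward()  # adversary minimizes its error; the encoder maximizes it for `rest`
```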
{"title":"Extracting Disease-Relevant Features with Adversarial Regularization.","authors":"Junxiang Chen, Li Sun, Ke Yu, Kayhan Batmanghelich","doi":"10.1109/bibm52615.2021.9669878","DOIUrl":"https://doi.org/10.1109/bibm52615.2021.9669878","url":null,"abstract":"<p><p>Extracting hidden phenotypes is essential in medical data analysis because it facilitates disease subtyping, diagnosis, and understanding of disease etiology. Since the hidden phenotype is usually a low-dimensional representation that comprehensively describes the disease, we require a dimensionality-reduction method that captures as much disease-relevant information as possible. However, most unsupervised or self-supervised methods cannot achieve the goal because they learn a holistic representation containing both disease-relevant and disease-irrelevant information. Supervised methods can capture information that is predictive to the target clinical variable only, but the learned representation is usually not generalizable for the various aspects of the disease. Hence, we develop a dimensionality-reduction approach to extract Disease Relevant Features (DRFs) based on information theory. We propose to use clinical variables that weakly define the disease as so-called <i>anchors</i>. We derive a formulation that makes the DRF predictive of the anchors while forcing the remaining representation to be irrelevant to the anchors via adversarial regularization. We apply our method to a large-scale study of Chronic Obstructive Pulmonary Disease (COPD). Our experiment shows: (1) Learned DRFs are as predictive as the original representation in predicting the anchors, although it is in a significantly lower dimension. (2) Compared to supervised representation, the learned DRFs are more predictive to other relevant disease metrics that are <i>not</i> used during the training. (3) The learned DRFs are related to non-imaging biological measurements such as gene expressions, suggesting the DRFs include information related to the underlying biology of the disease.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":" ","pages":"3464-3471"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8863436/pdf/nihms-1778852.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39659267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Lexical-based Formal Concept Analysis Method to Identify Missing Concepts in the NCI Thesaurus
Pub Date: 2020-12-01 | Epub Date: 2021-01-13 | DOI: 10.1109/bibm49941.2020.9313186
Fengbo Zheng, Licong Cui
Biomedical terminologies have been increasingly used in modern biomedical research and applications to facilitate data management and ensure semantic interoperability. As part of the evolution process, new concepts are regularly added to biomedical terminologies in response to evolving domain knowledge and emerging applications. Most existing concept enrichment methods suggest new concepts by directly importing knowledge from external sources. In this paper, we introduce a lexical method based on formal concept analysis (FCA) to identify potentially missing concepts in a given terminology by leveraging its intrinsic knowledge: concept names. We first construct the FCA formal context based on the lexical features of concepts. Then we perform multistage intersection to formalize new concepts and detect potentially missing ones. We applied our method to the Disease or Disorder sub-hierarchy in the National Cancer Institute (NCI) Thesaurus (version 19.08d) and identified a total of 8,983 potentially missing concepts. As a preliminary evaluation to validate the potentially missing concepts, we checked whether they were included in any external source terminology in the Unified Medical Language System (UMLS); 592 of the 8,937 potentially missing concepts were found in the UMLS.
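A minimal sketch of one intersection stage of such a lexical FCA approach: concepts become sets of name tokens, and a shared token set that matches no existing concept suggests a candidate. The concept names here are illustrative stand-ins rather than actual NCI Thesaurus content, and the real method performs multiple intersection stages over a full formal context.

```python
from itertools import combinations

# Lexical formal context: each concept name maps to its set of tokens.
concepts = {
    "Recurrent Childhood Brain Neoplasm": {"recurrent", "childhood", "brain", "neoplasm"},
    "Recurrent Adult Brain Neoplasm": {"recurrent", "adult", "brain", "neoplasm"},
    "Childhood Brain Neoplasm": {"childhood", "brain", "neoplasm"},
    "Brain Neoplasm": {"brain", "neoplasm"},
}

existing = set(map(frozenset, concepts.values()))
candidates = {
    frozenset(a & b)
    for a, b in combinations(concepts.values(), 2)
    if (a & b) and frozenset(a & b) not in existing
}
print(candidates)  # {frozenset({'recurrent', 'brain', 'neoplasm'})} -> "Recurrent Brain Neoplasm"
```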
{"title":"A Lexical-based Formal Concept Analysis Method to Identify Missing Concepts in the NCI Thesaurus.","authors":"Fengbo Zheng, Licong Cui","doi":"10.1109/bibm49941.2020.9313186","DOIUrl":"https://doi.org/10.1109/bibm49941.2020.9313186","url":null,"abstract":"<p><p>Biomedical terminologies have been increasingly used in modern biomedical research and applications to facilitate data management and ensure semantic interoperability. As part of the evolution process, new concepts are regularly added to biomedical terminologies in response to the evolving domain knowledge and emerging applications. Most existing concept enrichment methods suggest new concepts via directly importing knowledge from external sources. In this paper, we introduced a lexical method based on formal concept analysis (FCA) to identify potentially missing concepts in a given terminology by leveraging its intrinsic knowledge - concept names. We first construct the FCA formal context based on the lexical features of concepts. Then we perform multistage intersection to formalize new concepts and detect potentially missing concepts. We applied our method to the <i>Disease or Disorder</i> sub-hierarchy in the National Cancer Institute (NCI) Thesaurus (19.08d version) and identified a total of 8,983 potentially missing concepts. As a preliminary evaluation of our method to validate the potentially missing concepts, we further checked whether they were included in any external source terminology in the Unified Medical Language System (UMLS). The result showed that 592 out of 8,937 potentially missing concepts were found in the UMLS.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2020 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/bibm49941.2020.9313186","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39579552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of Convolutional Neural Network Architectures and their Influence on Patient Classification Tasks Relating to Altered Mental Status (pp. 2752-2756)
Pub Date: 2020-12-01 | Epub Date: 2021-01-13 | DOI: 10.1109/bibm49941.2020.9313156
Kevin Gagnon, Tami L Crawford, Jihad Obeid
With the pervasiveness of Electronic Health Records in many hospital systems, the application of machine learning techniques to the field of health informatics has become much more feasible as large amounts of data become more accessible. In our experiment, we evaluated several convolutional neural network architectures that are typically used in text classification tasks and tested those models on 1,113 history of present illness (HPI) notes. The data were run through both sequential and multi-channel architectures, as well as an architecture that implemented attention methods meant to focus the model on learning the influential data points within the text. We found that the multi-channel model performed best with an accuracy of 92%, while the attention and sequential models performed worse, with accuracies of 90% and 89%, respectively.
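For reference, one common reading of a multi-channel text CNN applies parallel convolution branches with different kernel widths to the same embedded note and concatenates their pooled outputs. The sketch below follows that reading; the vocabulary size, embedding width, filter counts, and two-class output are chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class MultiChannelTextCNN(nn.Module):
    """Parallel convolution branches over token embeddings, max-pooled and concatenated."""
    def __init__(self, vocab=5000, emb=100, n_filters=64,
                 kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList(nn.Conv1d(emb, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, emb, seq_len)
        pooled = [torch.relu(conv(x)).amax(dim=2) for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

logits = MultiChannelTextCNN()(torch.randint(0, 5000, (4, 120)))
```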
{"title":"Comparison of Convolutional Neural Network Architectures and their Influence on Patient Classification Tasks Relating to Altered Mental Status.","authors":"Kevin Gagnon, Tami L Crawford, Jihad Obeid","doi":"10.1109/bibm49941.2020.9313156","DOIUrl":"https://doi.org/10.1109/bibm49941.2020.9313156","url":null,"abstract":"<p><p>With the pervasiveness of Electronic Health Records in many hospital systems, the application of machine learning techniques to the field of health informatics has become much more feasible as large amounts of data become more accessible. In our experiment, we evaluated several different convolutional neural network architectures that are typically used in text classification tasks. We then tested those models based on 1,113 histories of present illness. (HPI) notes. This data was run over both sequential and multi-channel architectures, as well as a structure that implemented attention methods meant to focus the model on learning the influential data points within the text. We found that the multi-channel model performed the best with an accuracy of 92%, while the attention and sequential models performed worse with an accuracy of 90% and 89% respectively.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":" ","pages":"2752-2756"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/bibm49941.2020.9313156","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40376872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explanatory Analysis of a Machine Learning Model to Identify Hypertrophic Cardiomyopathy Patients from EHR Using Diagnostic Codes (pp. 1932-1937)
Pub Date: 2020-12-01 | DOI: 10.1109/bibm49941.2020.9313231
Nasibeh Zanjirani Farahani, Shivaram Poigai Arunachalam, Divaakar Siva Baala Sundaram, Kalyan Pasupathy, Moein Enayati, Adelaide M Arruda-Olson
Hypertrophic cardiomyopathy (HCM) is a genetic heart disease that is the leading cause of sudden cardiac death (SCD) in young adults. Despite well-known risk factors and existing clinical practice guidelines, HCM patients are underdiagnosed and sub-optimally managed. Developing machine learning models on electronic health record (EHR) data can help in the better diagnosis of HCM and thus improve the lives of hundreds of patients. Automated phenotyping using HCM billing codes has received limited attention in the literature, with a small number of prior publications. In this paper, we propose a novel predictive model that helps physicians make diagnostic decisions by means of information learned from the historical data of similar patients. We assembled a cohort of 11,562 patients with known or suspected HCM who visited Mayo Clinic between 1995 and 2019. All existing billing codes for these patients were extracted from the EHR data warehouse. Ground-truth labels for training the machine learning model were provided by HCM diagnoses confirmed with the gold-standard imaging tests for HCM: echocardiography (echo) or cardiac magnetic resonance (CMR) imaging. As a result, patients were labeled into the three categories "yes definite HCM", "no HCM phenotype", and "possible HCM" after a manual review of medical records and imaging tests. In this study, a random forest was adopted to investigate the predictive performance of billing codes for the identification of HCM patients, owing to its practical applicability and expected accuracy in a wide range of use cases. Our model performed well in finding patients with "yes definite", "possible", and "no" HCM, with an accuracy of 71%, a weighted recall of 70%, a precision of 75%, and a weighted F1 score of 72%. Furthermore, we provided visualizations based on multidimensional scaling and principal component analysis to aid clinicians' interpretation. This model can be used to identify HCM patients from their EHR data and to help clinicians in their diagnostic decision making.
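A compact sklearn sketch of the modeling setup: a random forest over a patient-by-billing-code matrix with three outcome classes. The data here are synthetic and every hyperparameter is illustrative; only the overall recipe (binary code features, three labels, weighted metrics) mirrors the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(600, 200))   # synthetic patient x billing-code matrix
y = rng.integers(0, 3, size=600)          # 0 = no HCM, 1 = possible, 2 = definite

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["no HCM", "possible", "definite"]))
```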
{"title":"Explanatory Analysis of a Machine Learning Model to Identify Hypertrophic Cardiomyopathy Patients from EHR Using Diagnostic Codes.","authors":"Nasibeh Zanjirani Farahani, Shivaram Poigai Arunachalam, Divaakar Siva Baala Sundaram, Kalyan Pasupathy, Moein Enayati, Adelaide M Arruda-Olson","doi":"10.1109/bibm49941.2020.9313231","DOIUrl":"https://doi.org/10.1109/bibm49941.2020.9313231","url":null,"abstract":"<p><p>Hypertrophic cardiomyopathy (HCM) is a genetic heart disease that is the leading cause of sudden cardiac death (SCD) in young adults. Despite the well-known risk factors and existing clinical practice guidelines, HCM patients are underdiagnosed and sub-optimally managed. Developing machine learning models on electronic health record (EHR) data can help in better diagnosis of HCM and thus improve hundreds of patient lives. Automated phenotyping using HCM billing codes has received limited attention in the literature with a small number of prior publications. In this paper, we propose a novel predictive model that helps physicians in making diagnostic decisions, by means of information learned from historical data of similar patients. We assembled a cohort of 11,562 patients with known or suspected HCM who have visited Mayo Clinic between the years 1995 to 2019. All existing billing codes of these patients were extracted from the EHR data warehouse. Target ground truth labeling for training the machine learning model was provided by confirmed HCM diagnosis using the gold standard imaging tests for HCM diagnosis echocardiography (echo), or cardiac magnetic resonance (CMR) imaging. As the result, patients were labeled into three categories of \"yes definite HCM\", \"no HCM phenotype\", and \"possible HCM\" after a manual review of medical records and imaging tests. In this study, a random forest was adopted to investigate the predictive performance of billing codes for the identification of HCM patients due to its practical application and expected accuracy in a wide range of use cases. Our model performed well in finding patients with \"yes definite\", \"possible\" and \"no\" HCM with an accuracy of 71%, weighted recall of 70%, the precision of 75%, and weighted F1 score of 72%. Furthermore, we provided visualizations based on multidimensional scaling and the principal component analysis to provide insights for clinicians' interpretation. This model can be used for the identification of HCM patients using their EHR data, and help clinicians in their diagnosis decision making.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2020 ","pages":"1932-1937"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/bibm49941.2020.9313231","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39227791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attention Mechanism with BERT for Content Annotation and Categorization of Pregnancy-Related Questions on a Community Q&A Site (pp. 1077-1081)
Pub Date: 2020-12-01 | Epub Date: 2021-01-13 | DOI: 10.1109/bibm49941.2020.9313379
Xiao Luo, Haoran Ding, Matthew Tang, Priyanka Gandhi, Zhan Zhang, Zhe He
In recent years, the social web has been increasingly used for health information seeking, sharing, and subsequent health-related research. Women often use the Internet or social networking sites to seek information related to different stages of pregnancy. They may ask questions about birth control, trying to conceive, labor, or caring for a newborn or baby. Classifying different types of questions about pregnancy (e.g., before, during, and after pregnancy) can inform the design of social media and professional websites for pregnancy education and support. This research investigates attention mechanisms, either built into BERT or added on top of it, for classifying and annotating pregnancy-related questions posted on a community Q&A site. We evaluated two BERT-based models and compared them against traditional machine learning models for question classification. Most importantly, we investigated two attention mechanisms: the built-in self-attention mechanism of BERT and an additional attention layer on top of BERT for relevant term annotation. The BERT-based models classified questions better than the traditional models, and BERT with an additional attention layer achieved higher overall precision than the basic BERT model. The results also showed that the two attention mechanisms behave differently when annotating relevant content and could serve as feature selection methods for text mining in general.
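The "additional attention layer on top of BERT" can be sketched as additive attention pooling over the token-level hidden states, where the learned weights double as term-level annotations. The layer below is a generic version of that idea: the 768-dimensional hidden size matches BERT-base, and the random tensor stands in for BertModel(...).last_hidden_state; none of it is the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores each token, returns the weighted sum plus the weights themselves."""
    def __init__(self, hidden=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(hidden, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, hidden_states):  # (batch, seq_len, hidden)
        weights = torch.softmax(self.score(hidden_states).squeeze(-1), dim=1)
        context = torch.einsum("bs,bsh->bh", weights, hidden_states)
        return context, weights        # high weights mark influential terms

context, weights = AttentionPooling()(torch.randn(2, 32, 768))
```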
{"title":"Attention Mechanism with BERT for Content Annotation and Categorization of Pregnancy-Related Questions on a Community Q&A Site.","authors":"Xiao Luo, Haoran Ding, Matthew Tang, Priyanka Gandhi, Zhan Zhang, Zhe He","doi":"10.1109/bibm49941.2020.9313379","DOIUrl":"https://doi.org/10.1109/bibm49941.2020.9313379","url":null,"abstract":"<p><p>In recent years, the social web has been increasingly used for health information seeking, sharing, and subsequent health-related research. Women often use the Internet or social networking sites to seek information related to pregnancy in different stages. They may ask questions about birth control, trying to conceive, labor, or taking care of a newborn or baby. Classifying different types of questions about pregnancy information (e.g., before, during, and after pregnancy) can inform the design of social media and professional websites for pregnancy education and support. This research aims to investigate the attention mechanism built-in or added on top of the BERT model in classifying and annotating the pregnancy-related questions posted on a community Q&A site. We evaluated two BERT-based models and compared them against the traditional machine learning models for question classification. Most importantly, we investigated two attention mechanisms: the built-in self-attention mechanism of BERT and the additional attention layer on top of BERT for relevant term annotation. The classification performance showed that the BERT-based models worked better than the traditional models, and BERT with an additional attention layer can achieve higher overall precision than the basic BERT model. The results also showed that both attention mechanisms work differently on annotating relevant content, and they could serve as feature selection methods for text mining in general.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2020 ","pages":"1077-1081"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/bibm49941.2020.9313379","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25431135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A health consumer ontology of fast food information (pp. 1714-1719)
Pub Date: 2020-12-01 | Epub Date: 2021-01-13 | DOI: 10.1109/bibm49941.2020.9313375
Muhammad Amith, Jing Wang, Grace Xiong, Kirk Roberts, Cui Tao
A variety of severe health issues can be attributed to poor nutrition and poor eating behaviors. Research has explored the impact of nutritional knowledge on an individual's inclination to purchase and consume certain foods. This paper introduces the Ontology of Fast Food Facts, a knowledge base that models consumer nutritional data from major fast food establishments. This artifact serves as an aggregate knowledge base that centralizes nutritional information for consumers. As a semantically-linked data source, the Ontology of Fast Food Facts could enable methods and tools that further research and positively influence consumers' diet and eating behavior, factors in many severe health outcomes. We describe the initial development of this ontology and the future directions we plan for this knowledge base.
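As a flavor of what semantically-linked nutritional data looks like in practice, the rdflib snippet below models one menu item as RDF triples. The namespace IRI, class and property names, and nutrient values are all hypothetical; the abstract does not specify the ontology's actual vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

OFFF = Namespace("http://example.org/offf#")  # hypothetical namespace

g = Graph()
g.add((OFFF.MenuItem, RDF.type, RDFS.Class))
g.add((OFFF.cheeseburger1, RDF.type, OFFF.MenuItem))
g.add((OFFF.cheeseburger1, RDFS.label, Literal("Cheeseburger")))
g.add((OFFF.cheeseburger1, OFFF.calories, Literal(300, datatype=XSD.integer)))  # illustrative values
g.add((OFFF.cheeseburger1, OFFF.sodiumMg, Literal(720, datatype=XSD.integer)))
print(g.serialize(format="turtle"))
```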
{"title":"A health consumer ontology of fast food information.","authors":"Muhammad Amith, Jing Wang, Grace Xiong, Kirk Roberts, Cui Tao","doi":"10.1109/bibm49941.2020.9313375","DOIUrl":"https://doi.org/10.1109/bibm49941.2020.9313375","url":null,"abstract":"<p><p>A variety of severe health issues can be attributed to poor nutrition and poor eating behaviors. Research has explored the impact of nutritional knowledge on an individual's inclination to purchase and consume certain foods. This paper introduces the Ontology of Fast Food Facts, a knowledge base that models consumer nutritional data from major fast food establishments. This artifact serves as an aggregate knowledge base to centralize nutritional information for consumers. As a semantically-linked data source, the Ontology of Fast Food Facts could engender methods and tools to further the research and impact the health consumers' diet and behavior, which is a factor in many severe health outcomes. We describe the initial development of this ontology and future directions we plan with this knowledge base.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2020 ","pages":"1714-1719"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/bibm49941.2020.9313375","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39364566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NECo: A node embedding algorithm for multiplex heterogeneous networks (pp. 146-149)
Pub Date: 2020-12-01 | Epub Date: 2021-01-13 | DOI: 10.1109/bibm49941.2020.9313595
Cagatay Dursun, Jennifer R Smith, G Thomas Hayman, Anne E Kwitek, Serdar Bozdag
Complex diseases such as hypertension, cancer, and diabetes cause nearly 70% of the deaths in the U.S. and involve multiple genes and their interactions with environmental factors. Therefore, identifying genetic factors to understand and decrease the morbidity and mortality of complex diseases is an important and challenging task. With the generation of an unprecedented amount of multi-omics data, network-based methods have become popular for representing multilayered complex molecular interactions. In particular, node embeddings, low-dimensional representations of the nodes in a network, are utilized for gene function prediction. Integrated network analysis of multi-omics data alleviates the issues related to missing data and the lack of context-specific datasets. Most node embedding methods, however, are unable to integrate multiple types of datasets from genes and phenotypes. To address this limitation, we developed a node embedding algorithm called Node Embeddings of Complex networks (NECo) that can utilize multilayered heterogeneous networks of genes and phenotypes. We evaluated the performance of NECo using genotypic and phenotypic datasets from rat (Rattus norvegicus) disease models to classify hypertension disease-related genes. Our method significantly outperformed state-of-the-art node embedding methods, with an AUC of 94.97% compared to 85.98% for the second-best performer, and predicted genes not previously implicated in hypertension.
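To make the node-embedding idea concrete, here is a generic random-walk embedding over a tiny two-layer gene/phenotype graph using networkx and gensim. This illustrates the common walk-then-Word2Vec recipe behind many node-embedding methods, not NECo's specific algorithm; the graph, walk parameters, and library choices are all illustrative.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

g = nx.Graph()
g.add_edges_from([("geneA", "geneB"), ("geneB", "geneC")])                 # gene layer
g.add_edges_from([("hypertension", "geneB"), ("hypertension", "geneC")])   # phenotype links

def random_walk(graph, start, length=10, rng=random.Random(0)):
    path = [start]
    for _ in range(length):
        path.append(rng.choice(list(graph[path[-1]])))  # uniform neighbor hop
    return path

walks = [random_walk(g, node) for node in g.nodes for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=0, sg=1, seed=0)
print(model.wv["hypertension"][:4])   # learned embedding (first 4 dims)
```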
{"title":"NECo: A node embedding algorithm for multiplex heterogeneous networks.","authors":"Cagatay Dursun, Jennifer R Smith, G Thomas Hayman, Anne E Kwitek, Serdar Bozdag","doi":"10.1109/bibm49941.2020.9313595","DOIUrl":"10.1109/bibm49941.2020.9313595","url":null,"abstract":"<p><p>Complex diseases such as hypertension, cancer, and diabetes cause nearly 70% of the deaths in the U.S. and involve multiple genes and their interactions with environmental factors. Therefore, identification of genetic factors to understand and decrease the morbidity and mortality from complex diseases is an important and challenging task. With the generation of an unprecedented amount of multi-omics datasets, network-based methods have become popular to represent the multilayered complex molecular interactions. Particularly node embeddings, the low-dimensional representations of nodes in a network are utilized for gene function prediction. Integrated network analysis of multi-omics data alleviates the issues related to missing data and lack of context-specific datasets. Most of the node embedding methods, however, are unable to integrate multiple types of datasets from genes and phenotypes. To address this limitation, we developed a node embedding algorithm called Node Embeddings of Complex networks (NECo) that can utilize multilayered heterogeneous networks of genes and phenotypes. We evaluated the performance of NECo using genotypic and phenotypic datasets from rat (<i>Rattus norvegicus</i>) disease models to classify hypertension disease-related genes. Our method significantly outperformed the state-of-the-art node embedding methods, with AUC of 94.97% compared 85.98% in the second-best performer, and predicted genes not previously implicated in hypertension.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2020 ","pages":"146-149"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8466723/pdf/nihms-1741786.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39468722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}