Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics: Latest Articles, Page 2

HMMeta

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414702

Sola Gbenro, Kyle Hippe, Renzhi Cao

As genomic product data accumulate much faster than they can be annotated, computational analysis of protein function has never been more important. In this research, we introduce HMMeta, a novel protein function prediction method based on Hidden Markov Models (HMMs), a prominent technique in natural language prediction. With a new representation of protein sequence as a language, we trained a unique HMM for each Gene Ontology (GO) term taken from the UniProt database; the database contains 27,451 unique GO IDs, which led to the creation of 27,451 Hidden Markov Models. We employed data augmentation to artificially inflate the number of protein sequences associated with GO terms that have few sequences in the database, which helped to balance the number of protein sequences associated with each GO term. Predictions are made by running the sequence against each model created. The models scoring within eighty percent of the top-scoring model, or the 75 models with the highest scores, whichever set is smaller, represent the functions most associated with the given sequence. We benchmarked our method in the latest Critical Assessment of protein Function Annotation (CAFA 4) experiment as CaoLab2, and we also evaluated HMMeta against several other protein function prediction methods on a subset of the UniProt database. HMMeta achieved favorable results as a sequence-based method and outperformed a few notable methods in some categories in our evaluation, which shows great potential for automated protein function prediction. The tool is available at https://github.com/KPHippe/HMM-For-Protein-Prediction.
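The selection rule described in the abstract can be sketched as follows. This is only an illustration, not the authors' implementation: the function name `select_go_terms` is hypothetical, scores are assumed to be positive with higher meaning a better match, and "within eighty percent of the top scoring model" is read as a score of at least 0.8 times the top score.

```python
def select_go_terms(scores, fraction=0.8, max_models=75):
    """Pick GO terms whose model score is close to the best score.

    scores: dict mapping a GO ID to the score of its model for one sequence
            (assumed positive, higher is better).
    Returns GO IDs ordered from best to worst score.
    """
    if not scores:
        return []
    # Rank all models from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_score = ranked[0][1]
    # Keep only models scoring at least `fraction` of the top score.
    within = [go_id for go_id, score in ranked if score >= fraction * top_score]
    # "Whichever is less": never return more than `max_models` terms.
    return within[:max_models]
```

With many near-tied models the cap of 75 applies; with a clear winner only the few models near the top score are returned.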

Citations: 2

Refinement of G protein-coupled receptor structure models: Improving the prediction of loop conformations and the virtual ligand screening performances

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414920

Bhumika Arora

G protein-coupled receptors (GPCRs) constitute the largest superfamily of membrane proteins. They mediate most of the physiological processes of the human body and form the largest group of potential drug targets. Knowledge of their three-dimensional structure is therefore important for structure-based drug design. Because few experimental GPCR structures are available, computational methods are often used to derive structural information. GPCRs share a common structural topology composed of seven transmembrane helices interconnected by intra- and extracellular loops. Homology modeling is the computational approach commonly used for modeling the transmembrane helical domains of GPCRs. Depending on the quality of the template used, these homology models exhibit varying degrees of inaccuracy. We have previously explored the extent to which inaccuracies in homology models of the transmembrane helical domains of GPCRs can affect loop prediction [1]. We have also investigated the effect of the presence and absence of other extracellular loops on individual loop modeling. We found that loop prediction in GPCR models is much more difficult than loop reconstruction in crystal structures because of the imprecise positioning of loop anchors in the models, although modeling an extracellular loop in the presence of the other extracellular loops helps to improve the accuracy of its prediction. Reducing the errors in loop anchors is therefore crucial for GPCR structure prediction. To address this and to improve the usability of GPCR homology models for structure-based drug design, we have developed a Ligand Directed Modeling (LDM) method that involves geometric protein sampling and ligand docking. The method was evaluated for its capacity to refine GPCR models built across a range of templates with varying degrees of sequence similarity to the target. LDM reduced the errors in loop anchor positions and improved the performance of these models in virtual ligand screening. Thus, the Ligand Directed Modeling method is effective in improving the quality of GPCR structure models.

Citations: 0

Using Patient Information for the Prediction of Caregiver Burden in Amyotrophic Lateral Sclerosis

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414908

A. Antoniadi, M. Galvin, M. Heverin, O. Hardiman, C. Mooney

The aim of this study is to create a Clinical Decision Support System (CDSS) to assist in the early identification and support of caregivers at risk of experiencing burden while caring for a person with Amyotrophic Lateral Sclerosis. We work towards a system that uses a minimal amount of data that could be routinely collected. We investigated whether the impairment of patients alone provides sufficient information for the prediction of caregiver burden. Results reveal a better performance of our system in identifying those at risk of high burden, but more information is needed for an accurate CDSS.

Citations: 2

ProLanGO2

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414701

Kyle Hippe, Sola Gbenro, Renzhi Cao

Predicting protein function from protein sequence is a major challenge in computational biology. Traditional methods that search protein sequences against existing databases may not work well in practice, particularly when little or no homology exists in the database. We introduce the ProLanGO2 method, which uses natural language processing and machine learning techniques to tackle the protein function prediction problem with protein sequence as input. Our method has been benchmarked blindly in the latest Critical Assessment of protein Function Annotation algorithms (CAFA 4) experiment. There are several changes compared to the previous version of ProLanGO. First, the latest version of the UniProt database is used. Second, the UniProt database is filtered by the newly created fragment sequence database (FSD) to prepare the protein sequence language. Third, the Encoder-Decoder network, a model consisting of two RNNs (encoder and decoder), is used to train models on the dataset. Fourth, if no k-mers of a protein sequence exist in the FSD, we select the ten GO terms with the highest probability across all UniProt sequences that contain no k-mers in the FSD, and use those ten GO terms as a backup prediction for the new protein sequence. Finally, we selected the 100 best-performing models and explored all combinations of those models to select the best-performing ensemble model. We benchmarked those combinations of models on the CAFA 3 dataset and selected the three top-performing ensemble models for prediction in the latest CAFA 4 experiment as CaoLab. We have also evaluated the performance of ProLanGO2 on 253 unseen sequences taken from the UniProt database and compared it with several other protein function prediction methods. The results show that our method achieves great performance among sequence-based protein function prediction methods. Our method is available on GitHub: https://github.com/caorenzhi/ProLanGO2.git.
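The backup step for sequences with no k-mers in the FSD can be sketched roughly as below. This is a hedged illustration rather than the ProLanGO2 code: the function names, the `k_values` default, and the `model_predict` callable are all assumptions introduced here, and the FSD is modeled as a plain set of fragment strings.

```python
def kmers(sequence, k):
    """Return every contiguous substring of length k (empty if too short)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def has_known_kmer(sequence, fsd, k_values=(3, 4, 5)):
    """True if any k-mer of the sequence appears in the fragment database."""
    return any(kmer in fsd for k in k_values for kmer in kmers(sequence, k))


def predict_go_terms(sequence, fsd, model_predict, backup_terms,
                     k_values=(3, 4, 5)):
    """Use the trained model when possible, otherwise fall back.

    model_predict: hypothetical callable mapping a sequence to GO terms.
    backup_terms:  GO terms precomputed as most probable among sequences
                   with no k-mers in the FSD, ordered best first.
    """
    if has_known_kmer(sequence, fsd, k_values):
        return model_predict(sequence)
    # No fragment is known, so return the ten backup GO terms.
    return list(backup_terms[:10])
```

Keeping the fallback separate from the model call means a sequence with no recognizable fragments still receives a prediction instead of an empty result.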

Citations: 5

A Divide and Conquer Algorithm for Electron Microscopy Segmentation

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414700

Ruba Jebril, Yingde Zhu, Wei Chen, K. Al Nasr

Cryo-electron microscopy is a biophysical technique that visualizes macromolecular complexes by producing 3-dimensional images. It has advanced to become the second most popular technique for constructing protein molecules in terms of the number of structures released annually. The main advantage of cryo-electron microscopy is its ability to visualize large molecules and molecules that are hard to crystallize in their native environment. One critical step in constructing the structure of a molecule from cryo-electron microscopy is to divide the image into regions for the chains/subunits that make up the molecule/complex. If the image is accurately segmented into the correct regions, modeling with existing tools becomes easier and faster. In this paper, we developed a divide-and-conquer algorithm to segment a given cryo-electron microscopy image efficiently. Our approach is based on the popular watershed algorithm. We tested our method on 10 authentic images and compared it with Segger. Although it is difficult to conduct an accurate comparison, the results show that the performance of our algorithm is competitive with that of Segger.
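For readers unfamiliar with the watershed idea that the approach builds on, here is a minimal sketch on a 2D density grid. It is not the authors' divide-and-conquer algorithm and not Segger's implementation: it is a plain, assumed variant in which cells are visited from highest to lowest density and each cell joins the region of its densest already-labeled neighbor, so every local maximum seeds one region. The function name and the `threshold` parameter are hypothetical.

```python
def watershed_labels(density, threshold=0.0):
    """Label a 2D density grid by growing regions from local maxima.

    density: list of rows (lists of floats). Cells at or below `threshold`
             keep label 0 (background).
    Returns a grid of integer labels with the same shape.
    """
    rows, cols = len(density), len(density[0])
    labels = [[0] * cols for _ in range(rows)]
    # Visit cells from the highest density to the lowest.
    cells = sorted(
        ((density[r][c], r, c)
         for r in range(rows) for c in range(cols)
         if density[r][c] > threshold),
        reverse=True,
    )
    next_label = 1
    for _, r, c in cells:
        best = None
        # Look at the four direct neighbors that already carry a label.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc]:
                if best is None or density[nr][nc] > density[best[0]][best[1]]:
                    best = (nr, nc)
        if best is None:
            # No labeled neighbor: this cell is a local maximum, new region.
            labels[r][c] = next_label
            next_label += 1
        else:
            # Join the region of the densest labeled neighbor.
            labels[r][c] = labels[best[0]][best[1]]
    return labels
```

On a single row with two peaks separated by a low-density valley, this yields two regions, one per peak, with the valley cell assigned to one of them.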

Citations: 0

How to build regulatory networks from single-cell gene expression data

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414213

Aditya Pratapa, A. Jalihal, Jeffrey N. Law, Aditya Bharadwaj, T. Murali

Over a dozen methods have been developed to infer gene regulatory networks (GRNs) from single-cell RNA-seq data. An experimentalist seeking to analyze a new dataset faces a daunting task in selecting an appropriate inference method since there are no widely accepted ground-truth datasets for assessing algorithm accuracy and the criteria for evaluation and comparison of methods are varied. We have developed BEELINE, a comprehensive evaluation of state-of-the-art algorithms for inferring GRNs from single-cell transcriptomic data [1]. BEELINE incorporates 12 diverse algorithms for GRN inference. It provides an easy-to-use and uniform interface to each method in the form of a Docker image. BEELINE implements several measures for estimating and comparing the accuracy, stability, and efficiency of these algorithms. Thus, BEELINE facilitates reproducible, rigorous, and extensible evaluations of GRN inference methods. We selected (a) synthetic networks with predictable cellular trajectories, (b) literature-curated Boolean models, and (c) diverse transcriptional regulatory and functional interaction networks to serve as the ground truth for evaluating the accuracy of GRN inference algorithms. We developed a strategy to simulate single-cell gene expression data from the first two types of networks. We used multiple experimental single-cell RNA-seq datasets in conjunction with the third type of network. Our evaluations suggest that the area under the precision-recall curve and early precision of these algorithms are moderate. Techniques that do not require pseudotime-ordered cells are generally more accurate. Based on these results, we present recommendations to end users of GRN inference methods. Finally, we discuss the potential of supervised algorithms for GRN inference.
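The two accuracy measures named in the abstract can be illustrated with a short sketch. The definitions used here are assumptions, not taken from the abstract: early precision is computed as the fraction of true edges among the top-k ranked predictions, with k equal to the number of edges in the ground-truth network, and the area under the precision-recall curve is approximated by average precision. Function names are hypothetical.

```python
def early_precision(ranked_edges, true_edges):
    """Fraction of true edges among the top-k predictions, k = len(true_edges).

    ranked_edges: list of (regulator, target) pairs, most confident first.
    true_edges:   set of (regulator, target) pairs in the reference network.
    """
    k = len(true_edges)
    if k == 0:
        return 0.0
    return sum(1 for edge in ranked_edges[:k] if edge in true_edges) / k


def average_precision(ranked_edges, true_edges):
    """Mean of the precision values at each rank where a true edge is found."""
    if not true_edges:
        return 0.0
    hits = 0
    total = 0.0
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / len(true_edges)
```

Both functions take the same inputs, so a ranked edge list produced by any inference method can be scored against a reference network in two lines.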

Citations: 0

TLSurv

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412422

Yixing Jiang, Kristen Alford, Frank Ketchum, L. Tong, May D. Wang

Lung cancer is one of the leading cancers, but survival models for it have not been explored to the same extent as for other cancers such as breast cancer. In this study, we develop a super-hybrid network called TLSurv to integrate Copy Number Variation, DNA methylation, mRNA expression, and miRNA expression data for TCGA-LUAD datasets. The modularity of this super-hybrid network allows the integration of multiple -omics modalities with tremendous dimensional differences. Additionally, a novel training scheme called multi-stage transfer learning is used to train this super-hybrid network incrementally. This allows a large network with many subnetworks to be trained on a relatively small data set. At each stage, a shallow subnetwork is trained, and these networks are combined to form a powerful prediction network. The results show that combining DNA methylation data with either mRNA or miRNA expression data produces promising performance, with C-indexes of around 0.7. This performance is better than that reported in previous studies. Interpretability analysis confirms the clinical significance of some of the biomarkers identified. In addition, some novel biomarkers are suggested for future medical research. These findings reveal the potential of super-hybrid networks for integrating multiple data modalities and the potential of multi-stage transfer learning for addressing the "curse of dimensionality."
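Since the results are reported as C-indexes, a brief sketch of how a concordance index is commonly computed may help. The abstract does not say which variant was used, so this assumes Harrell's C-index with right-censored data, where a higher predicted risk should correspond to a shorter survival time; the function name is hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up times.
    events: 1 if the event (death) was observed, 0 if censored.
    risks:  predicted risk scores, higher meaning shorter expected survival.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the earlier time is an observed event.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # ranking agrees with outcome
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties count as half
    return concordant / comparable if comparable else float("nan")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported value of around 0.7 in context.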

Citations: 4

EXAM: An Explainable Attention-based Model for COVID-19 Automatic Diagnosis

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412455

Wenqi Shi, L. Tong, Yuchen Zhuang, Yuanda Zhu, May D. Wang

The ongoing coronavirus disease 2019 (COVID-19) pandemic is still spreading rapidly and has caused over 7,000,000 infections and 400,000 deaths around the world. To build a fast and reliable COVID-19 diagnosis system, researchers have turned to machine learning to establish computer-aided diagnosis systems based on radiological imaging techniques such as X-ray imaging and computed tomography imaging. Although artificial-intelligence-based architectures have achieved great improvements in performance, most of the models are still seen as black boxes by researchers. In this paper, we propose an Explainable Attention-based Model (EXAM) for COVID-19 automatic diagnosis with convincing visual interpretation. We transform the diagnosis process with radiological images into an image classification problem that differentiates COVID-19, normal, and community-acquired pneumonia (CAP) cases. By combining channel-wise and spatial-wise attention mechanisms, the proposed approach can effectively extract key features and suppress irrelevant information. Experimental results and visualizations indicate that EXAM outperforms recent state-of-the-art models and demonstrate its interpretability.
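To make the idea of channel-wise and spatial-wise attention concrete, here is a deliberately simplified, parameter-free sketch. It is not the EXAM architecture: real attention modules use learned layers to produce the weights, whereas this assumed version derives them directly from average pooling followed by a sigmoid. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap):
    """Scale each channel by a weight derived from its global average.

    fmap: feature map as a list of channels, each an H x W list of lists.
    """
    out = []
    for channel in fmap:
        count = len(channel) * len(channel[0])
        mean = sum(sum(row) for row in channel) / count
        weight = sigmoid(mean)  # one scalar gate per channel
        out.append([[value * weight for value in row] for row in channel])
    return out


def spatial_attention(fmap):
    """Scale each spatial position by a weight derived from its channel mean."""
    n, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    # One gate per (row, column) position, shared by all channels.
    mask = [[sigmoid(sum(fmap[c][i][j] for c in range(n)) / n)
             for j in range(w)] for i in range(h)]
    return [[[fmap[c][i][j] * mask[i][j] for j in range(w)]
             for i in range(h)] for c in range(n)]
```

Applying `spatial_attention(channel_attention(fmap))` first reweights whole channels and then individual positions; the spatial mask is also the kind of map that can be visualized to show which image regions the model emphasizes.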


Citations: 16

Hadoop-CNV-RF: A Scalable Copy Number Variation Detection Tool for Next-Generation Sequencing Data

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414861

Getiria Onsongo, H. Lam, Matthew Bower, B. Thyagarajan

Detection of small copy number variations (CNVs) in clinically relevant genes is routinely used to aid diagnosis. We recently developed a tool, CNV-RF, capable of detecting clinically relevant CNVs with a high degree of sensitivity. The CNV-RF implementation was designed for small gene panels and did not scale to large ones. Analyses of large gene panels with several hundred genes routinely failed due to memory limitations on a single computer, and, when successful, took on average over 24 hours, making the tool impractical for routine clinical use. A reliable tool is needed that can accurately identify clinically relevant CNVs on large gene panels within a more practical time frame. We have developed Hadoop-CNV-RF, a freely available, scalable, and more user-friendly implementation of CNV-RF capable of rapidly analyzing large datasets. Hadoop-CNV-RF takes advantage of Hadoop, a framework developed to analyze large amounts of data. In its implementation, we demonstrate the feasibility of developing scalable pipelines on Hadoop that integrate popular bioinformatics software developed for traditional single-user computers, without the need for special-purpose routines developed for Hadoop. Results show that Hadoop-CNV-RF reduces analysis time on large gene panels from over 24 hours to about 4 hours on a 20-node Hadoop cluster. Additionally, we demonstrate its ability to scale by analyzing a whole-exome dataset with close to a billion reads. Hadoop-CNV-RF has been clinically validated for large gene panels (up to 4800 genes) and is currently being used in the clinic. It is publicly available at: https://github.com/getiria-onsongo/hadoopcnvrf-public.
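
As a back-of-the-envelope reading of the reported timings, the sketch below computes the speedup and per-node efficiency implied by the abstract's rounded figures. It assumes the single-computer baseline is comparable to one cluster node, which the abstract does not state. Because the baseline is given as "over 24 hours", the speedup is a lower bound on those figures. The function and variable names are illustrative only.

```python
def speedup(baseline_hours, parallel_hours):
    """Ratio of baseline runtime to cluster runtime."""
    return baseline_hours / parallel_hours


def parallel_efficiency(baseline_hours, parallel_hours, nodes):
    """Speedup divided by node count (1.0 would be perfect linear scaling)."""
    return speedup(baseline_hours, parallel_hours) / nodes


# Rounded figures from the abstract; the baseline is "over 24 hours",
# so the true speedup is at least what is computed here.
baseline_hours = 24.0
cluster_hours = 4.0
nodes = 20

s = speedup(baseline_hours, cluster_hours)
e = parallel_efficiency(baseline_hours, cluster_hours, nodes)
print(f"speedup of at least about {s:.1f}x, efficiency of roughly {e:.0%}")
```

On these numbers the cluster is about 6 times faster at roughly 30% per-node efficiency. The more important gain reported is that analyses which previously failed from memory limits now complete.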



Citations: 0

A Protein-Protein Interactome for an African Cichlid

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414916

Gabriel A. Preising, J. Faber-Hammond, S. Renn, Anna M. Ritz

ACM Reference Format: Gabriel A. Preising, Joshua J. Faber-Hammond, Suzy C. P. Renn, and Anna Ritz. 2020. A Protein-Protein Interactome for an African Cichlid. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’20), September 21–24, 2020, Virtual Event, USA. ACM, New York, NY, USA, 1 page. https://doi.org/10.1145/3388440.3414916


Citations: 0