STARR-Seq is a high-throughput technique for directly identifying genomic regions with enhancer activity [1]. Genomic DNA is sheared, inserted into artificial plasmids designed so that DNA with enhancer activity triggers its own transcription, and transfected into cultured cells. The resulting RNA is reverse-transcribed into cDNA, sequenced, and aligned to a reference genome. "Peaks" are called by comparing the observed read depth at each position to the read depth expected from control DNA using a statistical test. Examples of read-depth-based peak callers include MACS2 [4], basicSTARRSeq, and STARRPeaker [3]. It is challenging to accurately distinguish real peaks from artifacts in regions where the mean read depth is low but the variance is high. Fortunately, enhancer activity is strongly correlated with sequence content. We propose using sequence-based machine learning models in a semi-supervised framework to filter peaks.

501-bp sequences centered on the ≈11k STARR peaks from [1] were extracted from the Drosophila melanogaster dm3 genome. Randomly sampled 501-bp sequences were used as a negative set. Peaks were filtered using a Bonferroni-corrected significance threshold (α = 0.05) to create a "high-confidence" subset of ≈2.2k peaks. A logistic regression model with k-mer count features was trained on the high-confidence peak sequences and their negatives and used to classify the remaining ≈8.8k peak sequences. The self-trained, sequence-based model identified an additional ≈3.7k candidate enhancers ("medium confidence"); the remaining ≈5k STARR peaks were considered "low confidence". We plotted histograms of the read-depth log-fold change for the three sets of peaks (see Figure 1). The distributions for the medium- and low-confidence peaks overlapped substantially, so the sequence-based model identified enhancer candidates that would otherwise be filtered out using read depth alone.

We called peaks for the four D. melanogaster FAIRE-Seq data sets from [2]. Sequencing data were cleaned with Trimmomatic, aligned to the dm3 genome with bwa backtrack, and filtered with samtools to remove reads with mapping quality below 10. MACS2 called ≈61k FAIRE peaks. The STARR peaks overlapped the FAIRE peaks with precisions of 52.7% (high-confidence peaks), 40.6% (medium-confidence peaks), and 22.5% (low-confidence peaks).
Title: Filtering STARR-Seq Peaks for Enhancers with Sequence Models
Authors: R. J. Nowling, Rafael Reple Geromel, B. Halligan
DOI: https://doi.org/10.1145/3388440.3414905
Published: 2020-09-21, in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
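The peak-filtering approach above combines k-mer count features with a self-trained classifier. As a minimal illustration in pure Python (the sequences and helper names here are made up for demonstration), the k-mer featurization step can be sketched as:

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_featurize(seq, k=2):
    """Map a DNA sequence to a fixed-order vector of overlapping k-mer
    counts, the feature representation for the logistic regression."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product(ALPHABET, repeat=k)]

# Toy 2-mer features for a short sequence (real inputs are 501 bp).
features = kmer_featurize("ACGTACGT", k=2)
```

A logistic regression trained on such vectors for the high-confidence peaks and random negatives can then score the remaining peaks; peaks scoring above a chosen threshold would form the medium-confidence set.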
Arrays of repeat domains are critical to the proper function of a significant fraction of protein families. These repeats are easily identified in sequence, and are thought to have arisen primarily through the simultaneous duplication of multiple domains. However, for most repeat domain protein families, very little is typically known about the specific domain duplication events that occurred in their evolutionary histories. Here we extend existing reconciliation formulations that use domain trees and sequence trees to infer domain duplication and loss events to additionally consider simultaneous domain duplications under arbitrary cost models. We develop a novel integer linear programming (ILP) solution to this reconciliation problem, and demonstrate the accuracy and robustness of our approach on simulated datasets. Finally, as proof of principle, we apply our approach to an orthogroup containing the C2H2 zinc finger repeat domain, and identify simultaneous domain duplications that occurred at the onset of the primate lineage. Simulation and ILP code is available at https://github.com/Singh-Lab/treeSim.
Title: Identifying Evolutionary Origins of Repeat Domains in Protein Families
Authors: Chaitanya Aluru, Mona Singh
DOI: https://doi.org/10.1145/3388440.3412416
Published: 2020-09-21
Skin cancer is one of the most common forms of cancer and is widespread around the world. With early, accurate diagnosis, the chances of successfully treating skin cancer are high. This has inspired us to design a deep learning model that uses a convolutional neural network to automatically classify and detect different types of skin cancer from images. In this way, the system supports prevention and early detection of skin cancer, potentially leading to the best approach for treatment. The goal of this research is to apply systematic metaheuristic optimization and image detection techniques based on a convolutional neural network to efficiently and accurately detect and classify different types of skin lesions.
Title: Convolutional Neural Network Strategy for Skin Cancer Lesions Classifications and Detections
Authors: Abdala Nour, B. Boufama
DOI: https://doi.org/10.1145/3388440.3415988
Published: 2020-09-21
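The core operation of the convolutional network named above is a learned sliding-window filter. A minimal, dependency-free sketch of a single valid-mode 2D convolution on a grayscale image given as nested lists (illustrative only, not the paper's architecture):

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    and sum the element-wise products at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(iw - kw + 1)
        ]
        for r in range(ih - kh + 1)
    ]
```

A CNN stacks many such filters (with learned kernels), nonlinearities, and pooling, ending in a classifier head over the lesion categories.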
Gene expression data for multiple biological and environmental conditions are being collected for multiple species. Discovery of functional modules and subnetwork biomarkers has traditionally been based on analyzing a single gene expression dataset. Research has since focused on discovering modules from multiple gene expression datasets. Gene coexpression network mining methods have been proposed for mining frequent functional modules, and biclustering algorithms have been proposed to allow for missing coexpression links. Existing approaches report a large number of edgesets with high overlap. In this work, we propose an algorithm to mine frequent dense modules from multiple coexpression networks using a post-processing data summarization method. Our algorithm mines a succinct set of representative subgraphs with little overlap, which reduces the downstream analysis of the reported modules. Experiments on human gene expression data show that the reported modules are biologically significant, as evidenced by enrichment in Gene Ontology molecular functions and KEGG pathways.
Title: Post-Processing Summarization for Mining Frequent Dense Subnetworks
Authors: Sangmin Seo, Saeed Salem
DOI: https://doi.org/10.1145/3388440.3415989
Published: 2020-09-21
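The post-processing summarization described above can be pictured as a greedy filter over the mined edgesets: keep a module only if it overlaps every already-kept representative by at most a threshold. The paper's actual selection criterion and threshold are not given here, so the following is an illustrative sketch only:

```python
def jaccard(a, b):
    """Jaccard overlap between two edge sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def summarize(edgesets, max_overlap=0.5):
    """Greedy post-processing: visit modules largest-first and keep one
    only if its Jaccard overlap with every kept representative is small."""
    reps = []
    for es in sorted(edgesets, key=len, reverse=True):
        if all(jaccard(es, r) <= max_overlap for r in reps):
            reps.append(es)
    return reps
```

On a toy input like `[{1,2,3,4}, {1,2,3}, {5,6}]`, the near-duplicate `{1,2,3}` is absorbed by its larger representative while the disjoint `{5,6}` survives.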
Cryo-electron microscopy is a biophysics technique that produces volume images of a given molecule and can visualize large molecules and protein complexes. At high resolution (<5Å), the structure can be modeled directly. When the resolution drops below 5Å, computational techniques are used to overcome the inaccuracy inherent in volume images. In this paper, we propose a segmentation-based approach that extracts important features to overcome this inherent inaccuracy in medium-resolution volume images. The features are volume components that represent local peak regions of the image. The volume components are then classified into one of the main secondary structure elements found in protein molecules. Specifically, we built four models to classify volume components: Helix-Sheet-Loop, Helix-Binary, Sheet-Binary, and Loop-Binary. We used machine learning-based classifiers; seven classification models were used to classify the volume components. The work in this paper is a preliminary approach to detecting secondary structure elements in medium-resolution volume images. The four machine-learning models were trained using authentic volume images from the Electron Microscopy Data Bank; no simulated or synthesized images were used for either training or testing. This is important because all existing methods train on simulated images, and due to the noise inherent in authentic images, simulated images are not the best representatives. The procedure includes feature extraction, model selection, fine-tuning, and model ensembling. We tested our four models on a held-out 20% of a dataset of 3,400 volume components. The methods achieved accuracies of 80% for the Sheet-Binary model, 77% for Helix-Binary, 71% for Loop-Binary, and 67% for Helix-Sheet-Loop.
Title: Segmentation-based Feature Extraction for Cryo-Electron Microscopy at Medium Resolution
Authors: Lin Chen, Ruba Jebril, K. Al Nasr
DOI: https://doi.org/10.1145/3388440.3414711
Published: 2020-09-21
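The model-ensembling step mentioned above can be as simple as majority voting over the individual classifiers' predicted labels. A minimal sketch (the label names and tie-breaking rule are illustrative assumptions, not the paper's):

```python
from collections import Counter

def ensemble_vote(model_outputs):
    """Majority vote over per-model predicted labels for one volume
    component; ties go to the label predicted by the earliest model."""
    counts = Counter(model_outputs)
    best = max(counts.values())
    for label in model_outputs:  # preserves model order on ties
        if counts[label] == best:
            return label
```

For example, if two of three classifiers label a component "helix", the ensemble reports "helix".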
The DNA regulatory code of gene expression is encoded in the gene regulatory structure spanning the coding and adjacent non-coding regulatory DNA regions. Deciphering this regulatory code, and how the whole gene structure interacts to produce mRNA transcripts and regulate mRNA abundance, can greatly improve our capabilities for controlling gene expression. Here, we consider that natural systems offer the most accurate information on gene expression regulation and apply deep learning to over 20,000 mRNA datasets to learn the DNA-encoded regulatory code across a variety of model organisms, from bacteria to human [1]. We find that up to 82% of the variation in gene expression is encoded in the gene regulatory structure across all model organisms. Coding and regulatory regions carry both overlapping and new, orthogonal information and contribute additively to gene expression prediction. By mining the gene expression models for the relevant DNA regulatory motifs, we uncover that motif interactions across the whole gene regulatory structure define over three orders of magnitude of gene expression levels. Finally, we experimentally verify the usefulness of our AI-guided approach for protein expression engineering. Our results suggest that single motifs or regulatory regions might not be solely responsible for regulating gene expression levels. Instead, the whole gene regulatory structure, which contains the DNA regulatory grammar of interacting DNA motifs across the protein-coding and non-coding regulatory regions, forms a coevolved transcriptional regulatory unit. This provides a route by which whole gene systems with pre-specified expression patterns can be designed.
Title: Learning the regulatory grammar of DNA for gene expression engineering
Authors: Jan Zrimec, Aleksej Zelezniak
DOI: https://doi.org/10.1145/3388440.3414922
Published: 2020-09-21
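Deep models of the kind described above typically consume DNA as a one-hot matrix, one 4-channel row per base. A minimal encoder in pure Python (the A, C, G, T channel ordering is an assumption; unknown bases map to all zeros):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence: each base becomes a 4-channel
    row in channel order A, C, G, T; other characters become zeros."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

Convolutional layers over such a matrix can then learn motif detectors across the coding and non-coding regulatory regions.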
The ability to estimate protein-protein binding free energy in a computationally efficient manner via a physics-based approach is beneficial to research focused on the mechanism of viruses binding to their target proteins. Implicit solvation methodology may be particularly useful in the early stages of such research, as it can quickly offer valuable insights into the binding process. Here we evaluate the potential of the molecular mechanics generalized Born surface area (MMGB/SA) approach to estimate the binding free energy between the SARS-CoV-2 spike receptor-binding domain and the human ACE2 receptor. The calculations are based on a recent flavor of the generalized Born model, GBNSR6, shown to be effective in protein-ligand binding estimates but never before used in the MMGB/SA context. Two options for representing the dielectric boundary of the molecule are evaluated: one based on standard Bondi radii, and the other based on a newly developed set of atomic radii (OPT1) optimized specifically for protein-ligand binding. We first test the entire computational pipeline on the well-studied Ras-Raf protein-protein complex, which has a binding free energy similar to that of the SARS-CoV-2/ACE2 complex. Predictions based on both radii sets are closer to experiment than a previously published MMGB/SA estimate. The two estimates for SARS-CoV-2/ACE2 also provide a "bound" on the experimental ΔGbind: -14.7 (Bondi) < -10.6 (Exp.) < -4.1 (OPT1) kcal/mol. Both estimates point to the expected near cancellation of the relatively large enthalpy and entropy contributions, suggesting that the proposed MMGB/SA protocol may be trustworthy, at least qualitatively, for analysis of SARS-CoV-2/ACE2 in light of the need to move forward fast.
Title: Binding Free Energy of the Novel Coronavirus Spike Protein and the Human ACE2 Receptor: An MMGB/SA Computational Study
Authors: Negin Forouzesh
DOI: https://doi.org/10.1145/3388440.3414712
Published: 2020-09-21
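In the MMGB/SA scheme described above, the binding free energy is the free energy of the complex minus those of the unbound receptor and ligand, with each free energy a sum of molecular-mechanics, generalized Born, and surface-area terms. A sketch of the bookkeeping with toy numbers (not values from the study; the entropy term is omitted, as it is estimated separately):

```python
def mmgbsa_delta_g(complex_e, receptor_e, ligand_e):
    """dG_bind ~= G(complex) - G(receptor) - G(ligand), where each
    G = E_MM + G_GB + G_SA (kcal/mol); the -T*S term is handled separately."""
    g = lambda t: t["E_MM"] + t["G_GB"] + t["G_SA"]
    return g(complex_e) - g(receptor_e) - g(ligand_e)

# Toy per-species energies only, to show how the terms combine.
dg = mmgbsa_delta_g(
    {"E_MM": -120.0, "G_GB": 40.0, "G_SA": -5.0},
    {"E_MM": -70.0, "G_GB": 25.0, "G_SA": -3.0},
    {"E_MM": -30.0, "G_GB": 12.0, "G_SA": -2.0},
)
```

In practice each term is averaged over many MD snapshots, and the choice of atomic radii (Bondi vs. OPT1) enters through the G_GB dielectric boundary.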
Diane Uwacu, Abigail Ren, Shawna L. Thomas, N. Amato
Computational methods are commonly used to predict protein-ligand interactions. These methods typically search for regions with favorable energy that geometrically fit the ligand and then rank them as potential binding sites. While this general strategy can provide good predictions in some cases, it does not do well when the binding site is not accessible to the ligand. In addition, recent research has shown that in some cases protein access tunnels play a major role in the activity and stability of the protein's binding interactions. Hence, to fully understand the binding behavior of such proteins, it is imperative to identify and study their access tunnels. In this work, we present a motion planning algorithm that scores protein binding site accessibility for a particular ligand. This method can be used to screen ligand candidates for a protein by eliminating those that cannot access the binding site. The method was tested on two case studies: analyzing the effects of modifying a protein's access tunnels to increase activity and/or stability, and studying how a ligand inhibitor blocks access to the protein binding site.
Title: Using Guided Motion Planning to Study Binding Site Accessibility
DOI: https://doi.org/10.1145/3388440.3414707
Published: 2020-09-21
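As a toy stand-in for the accessibility question the planner above answers (can the ligand reach the binding site at all?), consider breadth-first search on a 2D occupancy grid. The real method plans in the protein's 3D space with a sampling-based motion planner, so everything below is illustrative only:

```python
from collections import deque

def accessible(grid, start, goal):
    """BFS on a 2D occupancy grid: True if a probe can reach `goal`
    from `start` through free cells (0 = free, 1 = blocked)."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False
```

A ligand candidate whose probe cannot reach the site under any tested orientation would be screened out, mirroring the elimination step described above.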
The universe of protein structures contains many dark regions beyond the reach of experimental techniques. Yet knowledge of the tertiary structure(s) that a protein employs to interact with partners in the cell is critical to understanding its biological function(s) and dysfunction(s). Great progress has been made in silico by methods that generate structures as part of an optimization. Recently, generative models based on neural networks have debuted for generating protein structures, but such work is typically limited to showing that some generated structures are credible. In this paper, we go beyond this objective. We design variational autoencoders and evaluate whether they can replace existing, established methods, comparing various architectures via rigorous metrics against the popular Rosetta framework. The presented results are promising and show that, once seeded with sufficient physically-realistic structures, variational autoencoders are efficient models for generating realistic tertiary structures.
Title: Variational Autoencoders for Protein Structure Prediction
Authors: F. Alam, Amarda Shehu
DOI: https://doi.org/10.1145/3388440.3412471
Published: 2020-09-21
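Two ingredients distinguish the variational autoencoders above from plain autoencoders: sampling the latent code via the reparameterization trick, and a KL term that pulls the latent distribution toward a standard normal. A per-dimension sketch in pure Python (encoder/decoder networks are omitted; this is generic VAE machinery, not the paper's specific architecture):

```python
import math
import random

def reparameterize(mu, log_var):
    """z = mu + sigma * eps: the trick that lets a VAE backpropagate
    through the sampling of its latent code."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0,1)) for one diagonal-Gaussian latent dimension;
    summed over dimensions, this regularizes the VAE latent space."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)
```

Training minimizes reconstruction error plus the summed KL term; sampling the trained decoder at random latent codes then yields new candidate structures.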