Probabilistic topic modeling for genomic data interpretation

2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Pub Date : 2010-12-01 DOI:10.1109/BIBM.2010.5706554

Xin Chen, Xiaohua Hu, Xiajiong Shen, G. Rosen


Citations: 31

Abstract

Recently, the concept of a species containing both core and distributed genes, known as the supra- or pangenome theory, has been introduced. In this paper, we aim to develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences into sub-reads called ‘N-mers’ and represents the sequences by their N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. With the help of the BioJava toolkit, we access the gene region information of reference sequences from the NCBI database. We use our data mining framework to investigate two questions: 1) do strains within a species share similar core and distributed topics? and 2) do genes with similar functional roles contain similar latent topics? After studying the mutual information between latent topics and gene regions, we provide examples of each, where the BioCyc database is used to correlate pathway and reaction information with the genes. The examples demonstrate the effectiveness of the proposed method.
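The two computational ingredients named in the abstract, N-mer frequency representation and mutual information between topic labels and gene regions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`nmer_counts`, `nmer_frequency_vector`, `mutual_information`), the choice of N = 3, the use of overlapping windows, and the handling of non-ACGT symbols are all assumptions made for illustration, and the LDA fitting step itself is omitted.

```python
from collections import Counter
from itertools import product
from math import log2


def nmer_counts(sequence, n=3):
    """Count overlapping N-mers, skipping windows with non-ACGT symbols (an assumption)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return counts


def nmer_frequency_vector(sequence, n=3):
    """Return (vocabulary, frequencies) over all 4**n possible N-mers."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = nmer_counts(sequence, n)
    total = sum(counts.values())
    freqs = [counts[w] / total if total else 0.0 for w in vocabulary]
    return vocabulary, freqs


def mutual_information(pairs):
    """Empirical mutual information (in bits) between two discrete labels.

    `pairs` is an iterable of (x, y) observations, e.g. a hypothetical
    (topic label, gene region label) pair for each N-mer occurrence.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi
```

In a pipeline of the kind the abstract describes, the frequency vectors (or the raw counts) would serve as the "documents" given to a topic model, and the mutual information score would then indicate how strongly an estimated topic is associated with a particular gene region.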


