
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining: Latest Articles

Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-Term Interactions.
Shuiwang Ji, Lei Yuan, Ying-Xin Li, Zhi-Hua Zhou, Sudhir Kumar, Jieping Ye

The Drosophila gene expression pattern images document the spatial and temporal dynamics of gene expression and they are valuable tools for explicating the gene functions, interaction, and networks during Drosophila embryogenesis. To provide text-based pattern searching, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated with ontology terms manually by human curators. We present a systematic approach for automating this task, because the number of images needing text descriptions is now rapidly increasing. We consider both improved feature representation and novel learning formulation to boost the annotation performance. For feature representation, we adapt the bag-of-words scheme commonly used in visual recognition problems so that the image group information in the BDGP study is retained. Moreover, images from multiple views can be integrated naturally in this representation. To reduce the quantization error caused by the bag-of-words representation, we propose an improved feature representation scheme based on the sparse learning technique. In the design of learning formulation, we propose a local regularization framework that can incorporate the correlations among terms explicitly. We further show that the resulting optimization problem admits an analytical solution. Experimental results show that the representation based on sparse learning outperforms the bag-of-words representation significantly. Results also show that incorporation of the term-term correlations improves the annotation performance consistently.
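
The contrast the abstract draws between bag-of-words quantization and a sparse-learning representation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `lam` penalty, the ISTA solver, and the max-pooling step are all assumptions chosen for illustration, and the codebook is assumed to be given (for example, from clustering local image descriptors).

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and count the assignments."""
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return np.bincount(nearest, minlength=codebook.shape[0]).astype(float)

def sparse_codes(descriptors, codebook, lam=0.1, n_iter=200):
    """Per descriptor x, minimise 0.5*||x - a @ codebook||^2 + lam*||a||_1 with ISTA."""
    # Step size 1/L, where L is the squared largest singular value of the codebook.
    step = 1.0 / np.linalg.norm(codebook, ord=2) ** 2
    codes = np.zeros((descriptors.shape[0], codebook.shape[0]))
    for _ in range(n_iter):
        grad = (codes @ codebook - descriptors) @ codebook.T
        z = codes - step * grad
        # Soft-thresholding is the proximal step for the L1 penalty.
        codes = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return codes

def sparse_feature(descriptors, codebook, lam=0.1):
    """Pool the sparse codes of all descriptors in an image group (max pooling is an assumption)."""
    return np.abs(sparse_codes(descriptors, codebook, lam)).max(axis=0)
```

In this sketch each descriptor may load on several codewords at once, so less information is discarded than with a single nearest-codeword assignment, which is the quantization error the abstract refers to.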

Published: 2009-06-28. DOI: https://doi.org/10.1145/1557019.1557068
Citations: 37
Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature.
Amr Ahmed, Eric P Xing, William W Cohen, Robert F Murphy

Figures are a major source of information (often the most crucial and informative part) in scholarly articles from scientific journals, proceedings and books: they directly provide images and other graphical illustrations of key experimental results and other scientific content. In biological articles, a typical figure often comprises multiple panels, accompanied by either scoped or global caption text. Moreover, the caption text contains important semantic entities, such as protein names, gene ontology terms, and tissue labels, that are relevant to the images in the figure. Given the avalanche of biological literature in recent years and the increasing popularity of various bio-imaging techniques, automatic retrieval and summarization of biological information from literature figures has emerged as a major unsolved challenge in computational knowledge extraction and management in the life sciences. We present a new structured probabilistic topic model, built on a realistic figure generation scheme, to model structurally annotated biological figures, and we derive an efficient inference algorithm based on collapsed Gibbs sampling for information retrieval and visualization. The resulting program constitutes one of the key IR engines in our SLIF system, which recently entered the final round (4 out of 70 competing systems) of the Elsevier Grand Challenge on Knowledge Enhancement in the Life Sciences. Here we present various evaluations on a number of data mining tasks to illustrate our method.
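
To give a feel for the inference technique named in the abstract, here is a minimal sketch of collapsed Gibbs sampling for plain latent Dirichlet allocation. It is a deliberately simpler stand-in, not the structured correspondence model of the paper: there are no panels, captions, or image features, and the function name, the hyperparameters `alpha` and `beta`, and the integer word-id input format are all assumptions.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_sweeps=50, seed=0):
    """docs is a list of documents, each a list of integer word ids in [0, vocab_size)."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # tokens in document d assigned to topic k
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(n_topics)               # total tokens assigned to topic k
    # Start from a random topic assignment for every token.
    assign = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Take the token out of the counts before resampling its topic.
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional over topics with the multinomial parameters integrated out.
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                    / (topic_total + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                assign[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word
```

"Collapsed" here means the sampler only tracks count tables and token assignments; the topic and document distributions are never sampled explicitly and can be estimated from the returned counts afterwards.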

Published: 2009-01-01. DOI: https://doi.org/10.1145/1557019.1557031
Citations: 38
FastANOVA: an efficient algorithm for genome-wide association study
Xiang Zhang, F. Zou, Wei Wang
Studying the association between a quantitative phenotype (such as height or weight) and single nucleotide polymorphisms (SNPs) is an important problem in biology. To understand the underlying mechanisms of complex phenotypes, it is often necessary to consider joint genetic effects across multiple SNPs. The ANOVA (analysis of variance) test is routinely used in association studies, and important findings from studying gene-gene (SNP-pair) interactions are appearing in the literature. However, the number of SNPs can run into the millions, so evaluating their joint effects is challenging even for SNP-pairs. Moreover, when a large number of SNPs are correlated, a permutation procedure is preferred over a simple Bonferroni correction for properly controlling the family-wise error rate and retaining mapping power, and this dramatically increases the computational cost of an association study. In this paper, we study the problem of finding SNP-pairs that have significant associations with a given quantitative phenotype. We propose an efficient algorithm, FastANOVA, for performing ANOVA tests on SNP-pairs in batch mode, which also supports large permutation tests. We derive an upper bound on the SNP-pair ANOVA test that can be expressed as the sum of two terms. The first term is based on the single-SNP ANOVA test. The second term is based on the SNPs and is independent of any phenotype permutation. Furthermore, SNP-pairs can be organized into groups, each of which shares a common upper bound. This allows for maximum reuse of intermediate computation, efficient upper bound estimation, and effective SNP-pair pruning. Consequently, FastANOVA only needs to perform the ANOVA test on a small number of candidate SNP-pairs, without the risk of missing any significant ones. Extensive experiments demonstrate that FastANOVA is orders of magnitude faster than the brute-force implementation of ANOVA tests on all SNP pairs.
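
For orientation, the brute-force baseline that the abstract compares against can be sketched as follows. This is not FastANOVA itself: the upper bound and the pruning are not reproduced, because the abstract does not give their form. The sketch assumes genotypes coded 0/1/2, treats the joint genotype of a SNP pair as the grouping factor of a one-way ANOVA, and uses the maximum statistic over permuted phenotypes as the family-wise threshold; the function names and these modelling choices are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def pair_anova_f(snp_a, snp_b, phenotype):
    """One-way ANOVA F-statistic, grouping individuals by the joint genotype of two SNPs."""
    groups = snp_a * 3 + snp_b  # assumes 0/1/2 coding, giving at most 9 groups
    labels = np.unique(groups)
    n, g = len(phenotype), len(labels)
    if g < 2 or n <= g:
        return 0.0
    grand_mean = phenotype.mean()
    ss_between = sum(
        (groups == lab).sum() * (phenotype[groups == lab].mean() - grand_mean) ** 2
        for lab in labels
    )
    ss_within = ((phenotype - grand_mean) ** 2).sum() - ss_between
    if ss_within <= 0:
        return np.inf
    return (ss_between / (g - 1)) / (ss_within / (n - g))

def brute_force_scan(genotypes, phenotype):
    """F-statistic for every SNP pair; genotypes has shape (n_individuals, n_snps)."""
    n_snps = genotypes.shape[1]
    return {
        (i, j): pair_anova_f(genotypes[:, i], genotypes[:, j], phenotype)
        for i, j in combinations(range(n_snps), 2)
    }

def permutation_threshold(genotypes, phenotype, n_perm=100, level=0.05, seed=0):
    """(1 - level) quantile of the maximum pair statistic over permuted phenotypes."""
    rng = np.random.default_rng(seed)
    max_stats = [
        max(brute_force_scan(genotypes, rng.permutation(phenotype)).values())
        for _ in range(n_perm)
    ]
    return float(np.quantile(max_stats, 1.0 - level))
```

The cost is visible in the structure: every permutation repeats the full scan over all pairs, which is quadratic in the number of SNPs. That repeated work is what the abstract says the upper bound avoids, since its second term does not depend on the phenotype permutation.
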
Published: 2008-08-24. DOI: https://doi.org/10.1145/1401890.1401988
Citations: 50
FastANOVA: an Efficient Algorithm for Genome-Wide Association Study.
Xiang Zhang, Fei Zou, Wei Wang


Published: 2008-01-01. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951741/pdf/nihms-131999.pdf
Citations: 0