Representing Documents via Latent Keyphrase Inference
Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare R Voss, Jiawei Han
Proceedings of the ... International World-Wide Web Conference (WWW), vol. 2016, pp. 1057-1067. Published 2016-04-01. DOI: 10.1145/2872427.2883088
Many text mining approaches adopt bag-of-words or n-gram models to represent documents. Looking beyond the words themselves, i.e., the explicit surface forms, in a document can improve a computer's understanding of text. Aware of this, researchers have proposed concept-based models that rely on a human-curated knowledge base to incorporate other related concepts into the document representation. But these methods are not desirable when applied to vertical domains (e.g., literature, enterprise, etc.) due to low coverage of in-domain concepts in the general knowledge base and interference from out-of-domain concepts. In this paper, we propose a data-driven model named Latent Keyphrase Inference (LAKI) that represents documents with a vector of closely related domain keyphrases instead of single words or existing concepts in the knowledge base. We show that, given a corpus of in-domain documents, topical content units can be learned for each domain keyphrase, which enables a computer to perform smart inference to discover latent document keyphrases, going beyond explicit mentions alone. Compared with state-of-the-art document representation approaches, LAKI fills the gap between bag-of-words and concept-based models by using domain keyphrases as the basic representation unit. It removes the dependency on a knowledge base while providing, through keyphrases, readily interpretable representations. When evaluated against 8 other methods on two text mining tasks over two corpora, LAKI outperformed them all.
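To make the abstract's central idea concrete, below is a minimal, purely illustrative sketch of latent keyphrase scoring. It is not the authors' implementation: LAKI learns its topical content units from an in-domain corpus, whereas the keyphrases, terms, and weights here are invented by hand for illustration. The sketch shows only the inference intuition, namely that a document can be assigned a domain keyphrase whose content units it matches even when the phrase itself never appears in the text.

```python
from collections import Counter

# Hypothetical toy "content units". In LAKI these per-keyphrase term
# weights are learned from an in-domain corpus; everything below is
# made up for illustration only.
CONTENT_UNITS = {
    "support vector machine": {"svm": 3.0, "margin": 2.0, "kernel": 2.0, "classifier": 1.0},
    "topic model":            {"lda": 3.0, "topic": 2.5, "dirichlet": 2.0, "corpus": 1.0},
    "database system":        {"sql": 3.0, "query": 2.0, "index": 1.5, "transaction": 1.5},
}

def infer_keyphrases(text, top_k=2):
    """Score each domain keyphrase by how strongly its content units
    appear in the document; keep the top-k as the representation."""
    tokens = Counter(text.lower().split())
    scores = {}
    for phrase, units in CONTENT_UNITS.items():
        score = sum(weight * tokens[term] for term, weight in units.items())
        if score > 0:
            scores[phrase] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

doc = "We train a classifier with a large margin and a kernel over the corpus"
print(infer_keyphrases(doc))
# -> [('support vector machine', 5.0), ('topic model', 1.0)]
# "support vector machine" scores highly even though the phrase never
# occurs in the document: that is the "latent" part of the model.
```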