Doc2Cube: Allocating Documents to Text Cube Without Labeled Data

2018 IEEE International Conference on Data Mining (ICDM) Pub Date : 2018-11-01 DOI:10.1109/ICDM.2018.00169

Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, T. Hanratty, Lance M. Kaplan, Jiawei Han


Citations: 19

Abstract

The data cube is a cornerstone architecture for multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck in constructing a text cube is automatically placing millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards; it is too expensive to allocate documents manually or to rely on massive amounts of labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided, with no labeled documents. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating information from label names to other terms, thereby enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.
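The idea of focusing on label-indicative terms within one dimension can be illustrated with a toy sketch. This is not the authors' implementation: it skips the joint embedding learning and the label expansion step entirely, and it assumes that term and label vectors already live in a shared space. The hand-made 2-D vectors, the entropy-based `focus` score, the `TEMPERATURE` constant, and the function names are all hypothetical choices made for illustration.

```python
import math

# Assumed sharpening factor for the term-to-label distribution (hypothetical).
TEMPERATURE = 5.0

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def focus(term_vec, label_vecs):
    """Score in [0, 1]: high when the term points clearly at one label.

    Computed as 1 minus the normalized entropy of the term's softmax
    distribution over the labels of a single dimension.
    """
    probs = softmax([TEMPERATURE * cosine(term_vec, lv) for lv in label_vecs])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(label_vecs))

def allocate(doc_terms, term_vecs, labels):
    """Assign a document to the closest label of one dimension.

    The document vector is a focus-weighted sum of its term vectors,
    so generic terms contribute little to the decision.
    """
    names = list(labels)
    label_vecs = [labels[n] for n in names]
    doc_vec = [0.0] * len(label_vecs[0])
    for term in doc_terms:
        if term not in term_vecs:
            continue  # skip out-of-vocabulary terms
        weight = focus(term_vecs[term], label_vecs)
        for i, x in enumerate(term_vecs[term]):
            doc_vec[i] += weight * x
    scores = [cosine(doc_vec, lv) for lv in label_vecs]
    return names[scores.index(max(scores))]

# Toy "location" dimension with hand-made embeddings (illustrative only).
LOCATION_LABELS = {"USA": [1.0, 0.0], "China": [0.0, 1.0]}
TERM_VECS = {
    "washington": [0.9, 0.1],
    "beijing": [0.1, 0.9],
    "economy": [0.7, 0.7],  # equally close to both labels
}

if __name__ == "__main__":
    print(allocate(["economy", "economy", "beijing"], TERM_VECS, LOCATION_LABELS))
```

In this toy setup the term "economy" is equally similar to both labels, so its focus score is near zero and the single occurrence of "beijing" decides the allocation even though "economy" appears twice.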


