Doc2Cube: Allocating Documents to Text Cube Without Labeled Data

Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, T. Hanratty, Lance M. Kaplan, Jiawei Han
Published in: 2018 IEEE International Conference on Data Mining (ICDM), 2018-11-01
DOI: 10.1109/ICDM.2018.00169 (https://doi.org/10.1109/ICDM.2018.00169)
Citations: 19

Abstract

The data cube is a cornerstone architecture for multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck in constructing a text cube is automatically placing millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards: it is too expensive to allocate documents manually or to rely on massive amounts of labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided, with no labeled documents. Doc2Cube uses the label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating information from label names to other terms, thereby enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.
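As a rough illustration of the allocation idea only, the sketch below assigns a document to a label in one dimension by weighting its terms by how label-indicative they are and then comparing the result to the label embeddings. It is not the Doc2Cube algorithm: the embeddings are hand-written toy vectors rather than learned jointly, the entropy-based `focal_weight` score is an assumed stand-in for the paper's notion of label-indicative terms, and all function and variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def focal_weight(term_vec, label_vecs, temperature=0.1):
    """Assumed label-indicativeness score (not from the paper):
    1 minus the normalized entropy of a softmax over label similarities.
    Terms equally close to every label score near 0."""
    sims = [cosine(term_vec, lv) / temperature for lv in label_vecs.values()]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))

def doc_vector(terms, term_vecs, label_vecs):
    """Dimension-tailored document vector: weighted average of term vectors."""
    weights = [focal_weight(term_vecs[t], label_vecs) for t in terms]
    total = sum(weights) or 1.0
    dim = len(next(iter(term_vecs.values())))
    return [
        sum(w * term_vecs[t][i] for w, t in zip(weights, terms)) / total
        for i in range(dim)
    ]

def allocate(terms, term_vecs, label_vecs):
    """Pick the label whose embedding is closest to the document vector."""
    dv = doc_vector(terms, term_vecs, label_vecs)
    return max(label_vecs, key=lambda name: cosine(dv, label_vecs[name]))

# Toy 2-D embeddings for a "location" dimension (illustrative values only).
term_vecs = {
    "washington": [0.9, 0.1],
    "beijing": [0.1, 0.9],
    "economy": [0.5, 0.5],  # equally close to both labels
}
label_vecs = {"USA": [1.0, 0.0], "China": [0.0, 1.0]}

doc = ["beijing", "economy", "economy"]
label = allocate(doc, term_vecs, label_vecs)
print(label)
```

In this toy setup the generic term "economy" receives almost no weight because it is equally similar to both labels, so the document vector is dominated by "beijing". The iterative embedding updates and the expansion of the labeled term set described in the abstract are not modeled here.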