大文档流趋势分析

Chengliang Zhang, Shenghuo Zhu, Yihong Gong
{"title":"大文档流趋势分析","authors":"Chengliang Zhang, Shenghuo Zhu, Yihong Gong","doi":"10.1109/ICMLA.2006.51","DOIUrl":null,"url":null,"abstract":"More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Trend Analysis for Large Document Streams\",\"authors\":\"Chengliang Zhang, Shenghuo Zhu, Yihong Gong\",\"doi\":\"10.1109/ICMLA.2006.51\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility\",\"PeriodicalId\":297071,\"journal\":{\"name\":\"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-12-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2006.51\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2006.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

摘要

越来越强大的计算机技术激发人们去调查隐藏在海量文件下的信息。在本报告中,我们对具有相对时间顺序的文档特别感兴趣,我们也将其称为文档流。例子包括电视新闻、论坛、公司项目的电子邮件、呼叫中心的电话记录等。为了深入了解这些文档流,首先我们需要检测文档流中的事件。我们使用时间敏感的Dirichlet过程混合模型来查找文档流中的事件。时间敏感Dirichlet过程混合模型是一种生成模型,它允许潜在无限数量的混合成分,并使用Dirichlet复合多项模型来模拟文档中单词的分布。在本文中,我们考虑了三种不同的时间敏感狄利克雷过程混合模型:指数衰减核模型、多项式衰减函数核狄利克雷过程模型和滑动窗口核模型。在TDT2数据集上的实验表明,时间敏感模型的精度比Dirichlet过程混合模型高18-20%。滑动窗口核和多项式核在检测事件方面更有前景。我们使用themerriver沿着时间轴提供事件的可视化。在themerriver的帮助下,人们可以很容易地获得不同事件发展的整体图景。除了themerriver,我们还研究了使用热门词汇作为每个事件的高级摘要。在TDT2数据集上的实验结果表明,滑动窗口核在捕获事件趋势和可表达性方面都是更好的选择
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Trend Analysis for Large Document Streams
More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
An Efficient Heuristic for Discovering Multiple Ill-Defined Attributes in Datasets Robust Model Selection Using Cross Validation: A Simple Iterative Technique for Developing Robust Gene Signatures in Biomedical Genomics Applications Detecting Web Content Function Using Generalized Hidden Markov Model Naive Bayes Classification Given Probability Estimation Trees A New Machine Learning Technique Based on Straight Line Segments
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1