大文档流趋势分析

2006 5th International Conference on Machine Learning and Applications (ICMLA'06) Pub Date : 2006-12-14 DOI:10.1109/ICMLA.2006.51

Chengliang Zhang, Shenghuo Zhu, Yihong Gong

{"title":"大文档流趋势分析","authors":"Chengliang Zhang, Shenghuo Zhu, Yihong Gong","doi":"10.1109/ICMLA.2006.51","DOIUrl":null,"url":null,"abstract":"More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Trend Analysis for Large Document Streams\",\"authors\":\"Chengliang Zhang, Shenghuo Zhu, Yihong Gong\",\"doi\":\"10.1109/ICMLA.2006.51\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility\",\"PeriodicalId\":297071,\"journal\":{\"name\":\"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-12-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2006.51\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2006.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

越来越强大的计算机技术激发人们去调查隐藏在海量文件下的信息。在本报告中，我们对具有相对时间顺序的文档特别感兴趣，我们也将其称为文档流。例子包括电视新闻、论坛、公司项目的电子邮件、呼叫中心的电话记录等。为了深入了解这些文档流，首先我们需要检测文档流中的事件。我们使用时间敏感的Dirichlet过程混合模型来查找文档流中的事件。时间敏感Dirichlet过程混合模型是一种生成模型，它允许潜在无限数量的混合成分，并使用Dirichlet复合多项模型来模拟文档中单词的分布。在本文中，我们考虑了三种不同的时间敏感狄利克雷过程混合模型:指数衰减核模型、多项式衰减函数核狄利克雷过程模型和滑动窗口核模型。在TDT2数据集上的实验表明，时间敏感模型的精度比Dirichlet过程混合模型高18-20%。滑动窗口核和多项式核在检测事件方面更有前景。我们使用themerriver沿着时间轴提供事件的可视化。在themerriver的帮助下，人们可以很容易地获得不同事件发展的整体图景。除了themerriver，我们还研究了使用热门词汇作为每个事件的高级摘要。在TDT2数据集上的实验结果表明，滑动窗口核在捕获事件趋势和可表达性方面都是更好的选择

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Trend Analysis for Large Document Streams

More and more powerful computer technology inspires people to investigate information hidden under huge amounts of documents. In this report, we are especially interested in documents with relative time order, which we also call document streams. Examples include TV news, forums, emails of company projects, call center telephone logs, etc. To get an insight into these document streams, first we need to detect the events among the document streams. We use a time-sensitive Dirichlet process mixture model to find the events in the document streams. A time sensitive Dirichlet process mixture model is a generative model, which allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model to model the distribution of words in documents. In this report, we consider three different time sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay function kernel Dirichlet process model and a sliding window kernel model. Experiments on the TDT2 dataset have shown that the time sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding windows kernel and the polynomial kernel are more promising in detecting events. We use ThemeRiver to provide a visualization of the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experiment results on TDT2 dataset suggests that the sliding window kernel is a better choice both in terms of capturing the trend of the events and expressibility

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2006 5th International Conference on Machine Learning and Applications (ICMLA'06)

自引率

0.00%

发文量