2010 IEEE International Conference on Data Mining Workshops: Latest Papers

Learning Document Labels from Enriched Click Graphs
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.190
Lan Nie, Zhigang Hua, Xiaofeng He, S. Gaffney
Document classification plays an increasingly important role in extracting and organizing knowledge. However, Web document classification is hindered by the huge number of Web documents and the limited human judgment available for labeling training data. To obtain sufficient training data in a cost-efficient way, we propose a semi-supervised learning approach that predicts a document’s class label by mining the click graph. To overcome the sparseness of the click graph, we enrich it with hyperlinks between Web documents. Content-based constraints are further added to regularize the graph. The resulting graph unifies three data sources: click-through data, hyperlinks, and content relevance. Starting from a very small seed set of manually labeled documents, we automatically discover a large number of relevant documents by applying a Markov random walk model to the enriched click graph. The top pages with high confidence scores are added to the current training data for classifier training. We investigate various combinations of the three sources and conduct extensive experiments on six typical Web classification tasks. The experimental results show that the click graph enriched with hyperlink and content information can significantly improve classification quality across multiple tasks with only minimal human labeling cost.
Citations: 2
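The random-walk step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the function names, the mixing weights `w_click`, `w_link`, `w_content`, the restart probability, and the toy graph are all assumptions made for the example. It merges the three edge sources into one weighted graph and runs a random walk with restart from a labeled seed document, so that documents near the seed receive higher scores.

```python
from collections import defaultdict


def build_graph(clicks, hyperlinks, content_sim,
                w_click=1.0, w_link=0.5, w_content=0.5):
    """Merge three edge sources into one undirected weighted graph.

    The weights are hypothetical knobs, not values from the paper.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for source, weight in ((clicks, w_click), (hyperlinks, w_link),
                           (content_sim, w_content)):
        for (u, v), strength in source.items():
            graph[u][v] += weight * strength
            graph[v][u] += weight * strength
    return graph


def random_walk_scores(graph, seeds, restart=0.15, steps=50):
    """Random walk with restart; probability mass starts on the seed nodes."""
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    scores = dict(start)
    for _ in range(steps):
        nxt = {n: restart * start[n] for n in graph}
        for u, neighbours in graph.items():
            total = sum(neighbours.values())
            if total == 0:
                continue  # isolated node: nothing to propagate
            for v, w in neighbours.items():
                nxt[v] += (1.0 - restart) * scores[u] * w / total
        scores = nxt
    return scores


# Toy data (invented): one query node, three documents, one labeled seed.
clicks = {("q:cheap flights", "docA"): 3, ("q:cheap flights", "docB"): 1}
hyperlinks = {("docB", "docC"): 1}
content_sim = {("docA", "docC"): 0.8}

graph = build_graph(clicks, hyperlinks, content_sim)
scores = random_walk_scores(graph, seeds={"docA"})
ranked_docs = sorted((n for n in scores if n.startswith("doc")),
                     key=scores.get, reverse=True)
```

In a pipeline like the one the abstract describes, the highest-scoring unlabeled documents in `ranked_docs` would be the candidates to add to the training set; the cut-off for "high confidence" is left open here because the abstract does not state it.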
Learning Restricted Bayesian Network Classifiers with Mixed Non-i.i.d. Sampling
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.199
Zhongfeng Wang, Zhihai Wang, Bin Fu
Generally, more data may increase statistical power. However, many algorithms in the data mining community focus only on small samples. This is because, as the sample size increases, the data set is not necessarily identically distributed, even though it is generated by a common data-generating mechanism. In this paper, we find that restricted Bayesian network classifiers are robust even when the training data set is sampled in a non-i.i.d. manner. Empirical studies show that these algorithms perform as well as others that combine independent experimental results using statistical methods.
Citations: 0
Towards a Reliable Framework of Uncertainty-Based Group Decision Support System
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.80
J. Chai, J. Liu
This study proposes a framework for an Uncertainty-based Group Decision Support System (UGDSS). It provides a platform for multiple criteria decision analysis covering six aspects: (1) decision environment, (2) decision problem, (3) decision group, (4) decision conflict, (5) decision schemes, and (6) group negotiation. Based on multiple artificial intelligence technologies, this framework provides reliable support for the comprehensive manipulation of applications and advanced decision approaches through the design of an integrated multi-agent architecture.
Citations: 4
Clustering Performance on Evolving Data Streams: Assessing Algorithms and Evaluation Measures within MOA
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.17
P. Kranen, Hardy Kremer, Timm Jansen, T. Seidl, A. Bifet, G. Holmes, B. Pfahringer
In today's applications, evolving data streams are ubiquitous. Stream clustering algorithms were introduced to gain useful knowledge from these streams in real time. The quality of the obtained clusterings, i.e., how well they reflect the data, can be assessed by evaluation measures. A multitude of stream clustering algorithms and evaluation measures for clusterings have been introduced in the literature; however, until now there has been no general tool for directly comparing the different algorithms or evaluation measures. In our demo, we present a novel experimental framework for both tasks. It offers the means for extensive evaluation and visualization and is an extension of the Massive Online Analysis (MOA) software environment, released under the GNU GPL License.
Citations: 26
A Convex Combination of Models for Predicting Road Traffic
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.23
Carlos J. Gil Bellosta
This paper describes an approach to the road traffic prediction problem in Warsaw in the context of a data mining competition that was part of IEEE ICDM 2010. It describes a solution based on a convex combination of models that mine different sources of information within the data. Such a convex combination allows the final model to compensate for highly uncorrelated errors from the different underlying models and to achieve higher prediction accuracy.
Citations: 2
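A convex combination of models, as named in the abstract above, blends predictions with non-negative weights that sum to 1. The sketch below shows the idea for two models and picks the weight by a simple grid search on held-out data. The grid search, the function names, and the numbers are assumptions made for illustration; the abstract does not say how the competition entry chose its weights.

```python
def convex_combination(predictions, weights):
    """Blend per-model prediction lists with non-negative weights summing to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return [sum(w * p for w, p in zip(weights, row))
            for row in zip(*predictions)]


def mse(pred, truth):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def best_weight(pred_a, pred_b, truth, grid=101):
    """Grid-search the weight on model A over [0, 1] (hypothetical tuning step)."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return min(candidates,
               key=lambda w: mse(convex_combination([pred_a, pred_b],
                                                    [w, 1.0 - w]), truth))


# Invented held-out data: the two models err in different directions.
truth = [10, 20, 30, 40]
model_a = [12, 18, 33, 37]
model_b = [9, 23, 29, 44]

w = best_weight(model_a, model_b, truth)
blended = convex_combination([model_a, model_b], [w, 1.0 - w])
blended_error = mse(blended, truth)
```

Because the grid includes the endpoints 0 and 1, the blended error on the tuning data can never be worse than the better single model; the gain is largest when the models' errors are weakly correlated, which is the situation the abstract describes.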
Distributed Flow Algorithms for Scalable Similarity Visualization
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.120
Novi Quadrianto, Dale Schuurmans, Alex Smola
We describe simple yet scalable and distributed algorithms for solving the maximum flow problem and its minimum cost flow variant, motivated by problems in object similarity visualization. We formulate the fundamental problem as a convex-concave saddle point problem. We then show that this problem can be solved efficiently by a first-order method or by exploiting faster quasi-Newton steps. Our proposed approach costs at most O(|E|) per iteration for a graph with |E| edges. Further, the number of required iterations can be shown to be independent of the number of edges for the first-order approximation method. We present experimental results in two applications: mosaic generation and color-similarity-based image layout.
Citations: 3
Less Effort, More Outcomes: Optimising Debt Recovery with Decision Trees
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.114
Yanchang Zhao, H. Bohlscheid, Shanshan Wu, Longbing Cao
This paper presents a real-world application of data mining techniques to optimise debt recovery in social security. Traditionally, a customer is contacted to put a debt recovery schedule in place through an outbound phone call, and by and large customers are chosen at random. This obsolete and inefficient method of selecting customers for debt recovery has existed for years. To improve the process, decision trees were built to model debt recovery and predict customers' responses if contacted by phone. Test results on historical data show that the model is effective at ranking customers by their likelihood of entering into a successful debt recovery repayment schedule. If only the top-ranked 20 per cent of customers in debt were contacted, instead of all of them, approximately 50 per cent of repayments would still be received.
Citations: 0
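The "top 20 per cent yields about 50 per cent" claim in the abstract above is a cumulative-gain measurement: rank customers by model score, contact only the top fraction, and see what share of total repayments that fraction accounts for. The sketch below shows the calculation. The function name and every number in the toy data are invented for illustration and are not the paper's data; the scores stand in for whatever likelihood a decision tree would output.

```python
def captured_share(scores, repayments, top_fraction):
    """Share of total repayments held by the top-scoring fraction of customers."""
    ranked = sorted(zip(scores, repayments),
                    key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    captured = sum(amount for _, amount in ranked[:cutoff])
    total = sum(repayments)
    return captured / total if total else 0.0


# Ten hypothetical customers: model score and the amount each later repaid.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
repayments = [300, 200, 100, 150, 50, 100, 0, 50, 50, 0]

share_top_20 = captured_share(scores, repayments, 0.2)
share_all = captured_share(scores, repayments, 1.0)
```

With random selection, contacting 20 per cent of customers would capture roughly 20 per cent of repayments on average, so the gap between that baseline and `share_top_20` is what measures the value of the ranking.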
Challenges in Scheduling Aggregation in Cyberphysical Information Processing Systems
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.96
James L. Horey
Data aggregation is an important element in information processing systems, including MapReduce clusters and cyber-physical networks. Unlike in simple sensor networks, all the data in information processing systems must eventually be aggregated. Our goal is to lower overall latency in these systems by intelligently scheduling aggregation on intermediate routing nodes. To understand the potential challenges of constructing a distributed scheduler that minimizes latency, we developed a simple model of wireless information processing systems and a simulation of that model. Unlike previous models, our model explicitly takes into account link latency and computation time. It also considers heterogeneous computing capabilities. We tested the latency while randomly assigning aggregation computation to nodes in the network. Preliminary results indicate that when the computation time is greater than the transmission time, in-network aggregation can have a large effect (reducing latency by 50% or more). However, naive scheduling can have a detrimental effect. Specifically, when the root node (i.e., the base station) is faster than the other nodes, the latency can increase with increased coverage, and these effects vary with the number of nodes present.
Citations: 3
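The trade-off described in the abstract above (in-network aggregation helps when computation dominates, but can hurt when the root is faster than the relays) can be reproduced with a very small latency model. Everything below is an assumed toy model, not the paper's simulator: each leaf holds one data unit, a link carries units one after another at `link_latency` per unit, and an aggregating node spends `compute_time[node]` per unit it receives and forwards a single unit.

```python
def ready_time(node, tree, compute_time, link_latency, aggregators):
    """Return (time the node's output is ready, units it sends upstream)."""
    children = tree.get(node, [])
    if not children:
        return 0.0, 1  # a leaf holds one unit, available at time 0
    ready, units = 0.0, 0
    for child in children:
        child_ready, child_units = ready_time(
            child, tree, compute_time, link_latency, aggregators)
        # Children transmit in parallel; each sends its units back to back.
        ready = max(ready, child_ready + child_units * link_latency)
        units += child_units
    if node in aggregators:
        return ready + units * compute_time[node], 1
    return ready, units  # plain relay: forward everything unchanged


def total_latency(tree, root, compute_time, link_latency, in_network):
    """Latency until the root has aggregated all data."""
    aggregators = {root} | set(in_network)
    return ready_time(root, tree, compute_time, link_latency, aggregators)[0]


# Hypothetical topology: a root, two relays, three leaves per relay.
tree = {"root": ["r1", "r2"], "r1": ["a", "b", "c"], "r2": ["d", "e", "f"]}
relays = ["r1", "r2"]

slow_root = {"root": 2.0, "r1": 2.0, "r2": 2.0}
fast_root = {"root": 0.5, "r1": 2.0, "r2": 2.0}

slow_root_only = total_latency(tree, "root", slow_root, 1.0, [])
slow_in_network = total_latency(tree, "root", slow_root, 1.0, relays)
fast_root_only = total_latency(tree, "root", fast_root, 1.0, [])
fast_in_network = total_latency(tree, "root", fast_root, 1.0, relays)
```

In this toy setting, aggregating at the relays lowers latency when all nodes compute at the same speed, and raises it when the root is four times faster than the relays, which is the qualitative effect the abstract reports for naive scheduling. The magnitudes depend entirely on the assumed parameters.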
Geospatial Schema Matching with High-Quality Cluster Assurance and Location Mining from Social Network
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.204
L. Khan, J. Partyka, Satyen Abrol, B. Thuraisingham
In this talk, we will present how semantics can improve the quality of the data mining process. First, we will focus on geospatial schema matching with high-quality cluster assurance. Next, we will focus on location mining from social networks. With regard to the first problem, resolving semantic heterogeneity across distinct data sources remains a highly relevant problem in the GIS domain that requires innovative solutions. Our approach, called GSim, semantically aligns tables from the respective GIS databases by first choosing attributes for comparison. We then examine their instances and calculate a similarity value between them, called Entropy-Based Distribution (EBD), by combining two separate methods. Our primary method discerns the geographic types from instances of the compared attributes. If geographic type matching is not possible, we then apply a generic schema matching method that employs normalized Google distance together with a clustering process. GSim proceeds by deriving clusters from attribute instances based on content and their geographic types (if possible), gleaned from a gazetteer. However, clustering algorithms may produce inconsistent results because cluster quality varies. We apply novel metrics measuring cluster distance and purity to guarantee high-quality homogeneous clusters. The end result is a wholly geospatial similarity value, expressed as EBD. We show the effectiveness of our approach over the traditional N-gram approach across multi-jurisdictional datasets. With regard to the second problem, we will predict the location of a user on the basis of his or her social network (e.g., Twitter) using the strong theoretical framework of semi-supervised learning; in particular, we employ a label propagation algorithm. For privacy and security reasons, most people on social networking sites like Twitter are unwilling to specify their locations explicitly. On the city locations returned by the algorithm, the system performs agglomerative clustering based on geospatial proximity and their individual scores to return clusters of locations with higher confidence. We perform extensive experiments to show the validity of our system in terms of both accuracy and running time. Experimental results show that our approach outperforms the content-based geo-tagging approach in both accuracy and running time.
Citations: 0
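The normalized Google distance mentioned in the abstract above has a standard closed form based on term frequencies: with f(x) and f(y) the number of documents containing each term, f(x, y) the number containing both, and N the corpus size, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The sketch below computes it from raw counts. How GSim obtains the counts is not stated in the abstract, so the counts are plain function arguments here, and the function name and example numbers are assumptions.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from occurrence counts.

    fx, fy: documents containing each term; fxy: documents containing both;
    n: total number of documents (assumed larger than fx and fy).
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Invented counts: two attribute values that often appear together
# versus two that rarely do, in a corpus of one million documents.
close_pair = normalized_google_distance(1000, 1000, 500, 10**6)
distant_pair = normalized_google_distance(1000, 1000, 5, 10**6)
```

Smaller values mean the two terms tend to occur together, so a matcher could cluster attribute instances whose pairwise distance falls below a threshold; the threshold and the clustering method would have to come from the full paper.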
Insights from Applying Sequential Pattern Mining to E-commerce Click Stream Data
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.31
Arthur Pitman, M. Zanker
Previous sequential pattern mining algorithms have focused on improving runtime and memory consumption without considering the specifics of different data sources or application scenarios. In this paper, we focus on mining closed sequential patterns from website click streams by extending the state-of-the-art Bi-Directional Extension (BIDE) algorithm in order to identify domain-specific rule sets. In particular, we focus on exploiting sequential patterns for landing page personalization and product recommendation in the e-commerce domain. Our contribution is therefore both algorithmic and empirical in nature. Based on a dataset that we derived from an online store for nutritional supplements, we evaluate the effectiveness of using different sources of domain knowledge, such as product hierarchies and search word categorizations, to enhance predictions about the conversion actions of users. Furthermore, we examine the performance of the recommender for two important user subgroups, namely those that use search functionality and those that do not. Our findings indicate, for instance, that search terms alone are already quite effective for predicting users' add-to-basket actions and that using additional domain knowledge to generate multi-dimensional rules does not always lead to improved accuracy.
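To make "closed sequential pattern" concrete: a pattern's support is the number of sessions that contain it as an ordered (not necessarily contiguous) subsequence, and a pattern is closed if no longer pattern containing it has the same support. The sketch below is a brute-force check over a given candidate list, written only to illustrate the definitions; it is not the BIDE algorithm or the authors' extension of it, and the click sessions are invented.

```python
def contains(session, pattern):
    """True if pattern occurs in session as an ordered subsequence."""
    remaining = iter(session)
    # 'item in remaining' advances the iterator, which enforces the order.
    return all(item in remaining for item in pattern)


def support(sessions, pattern):
    """Number of sessions that contain the pattern."""
    return sum(contains(session, pattern) for session in sessions)


def is_closed(sessions, pattern, candidates):
    """Closed = no longer candidate containing the pattern has equal support."""
    base = support(sessions, pattern)
    return not any(
        len(c) > len(pattern)
        and contains(c, pattern)
        and support(sessions, c) == base
        for c in candidates
    )


# Hypothetical click sessions from an online store.
sessions = [
    ["search", "product", "basket"],
    ["search", "product", "basket", "checkout"],
    ["home", "product"],
    ["search", "category", "product", "basket"],
]
candidates = [
    ["search", "basket"],
    ["search", "product", "basket"],
    ["search", "product", "basket", "checkout"],
]

short_support = support(sessions, ["search", "basket"])
short_closed = is_closed(sessions, ["search", "basket"], candidates)
long_closed = is_closed(sessions, ["search", "product", "basket"], candidates)
```

Here the shorter pattern is absorbed by the longer one because every session that goes from search to basket also visits a product page in between, so only the longer pattern needs to be kept as a rule.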
Citations: 14
Journal
2010 IEEE International Conference on Data Mining Workshops