Sixth International Conference on Data Mining (ICDM'06): Latest Papers

Rule-Based Platform for Web User Profiling
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.137
Jianping Zhang, Manu Shukla
This paper discusses a research project: a rule-based Web user profiling platform. In this platform, usage data are encoded as a sequence of events, each of which represents an action performed by a user on a Web service at a given time. An event template is proposed to define event models for different Web services. The platform is rule-based: rules define profile metrics and determine how to compute them from usage events. A prototype of the platform was implemented and applied to generate profiles from page view events. The major contribution of the work is the rule-based approach to user profiling. The rules and the event template together provide the flexibility that allows the platform to be configured for different Web services.
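The event-and-rule idea in this abstract can be illustrated with a minimal sketch. It is not the authors' platform: the `Event` fields, the `Rule` structure, and the two example metrics are all hypothetical, chosen only to show how rules might turn page view events into per-user profile metrics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One usage event: an action by a user on a Web service at a given time.
    The field names are assumptions, not the paper's event template."""
    user: str
    service: str
    action: str
    timestamp: float
    target: str = ""


@dataclass(frozen=True)
class Rule:
    """A hypothetical rule: names a profile metric, says which events it
    applies to, and says how each matching event updates the metric."""
    metric: str
    matches: Callable[[Event], bool]
    update: Callable[[float, Event], float]


def build_profiles(events: Iterable[Event], rules: List[Rule]) -> Dict[str, Dict[str, float]]:
    """Apply every rule to every event in time order and collect per-user metrics."""
    profiles = defaultdict(lambda: defaultdict(float))
    for event in sorted(events, key=lambda e: e.timestamp):
        for rule in rules:
            if rule.matches(event):
                current = profiles[event.user][rule.metric]
                profiles[event.user][rule.metric] = rule.update(current, event)
    return {user: dict(metrics) for user, metrics in profiles.items()}


# Two illustrative rules over page view events.
RULES = [
    Rule("page_views",
         lambda e: e.action == "page_view",
         lambda value, e: value + 1),
    Rule("sports_page_views",
         lambda e: e.action == "page_view" and e.target.startswith("/sports"),
         lambda value, e: value + 1),
]

EVENTS = [
    Event("alice", "news", "page_view", 1.0, "/sports/scores"),
    Event("alice", "news", "page_view", 2.0, "/politics"),
    Event("bob", "news", "page_view", 3.0, "/sports/teams"),
]

PROFILES = build_profiles(EVENTS, RULES)
```

Under these assumptions, supporting a different Web service would mean supplying a different list of rules rather than changing `build_profiles`, which is the kind of configurability the abstract attributes to the rule-based design.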
Citations: 10
Improving Grouped-Entity Resolution Using Quasi-Cliques
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.85
Byung-Won On, Ergin Elmacioglu, Dongwon Lee, Jaewoo Kang, J. Pei
The entity resolution (ER) problem, which identifies duplicate entities that refer to the same real-world entity, is essential in many applications. In this paper, we focus on resolving entities that contain a group of related elements (e.g., an author entity with a list of citations, a singer entity with a song list, or the intermediate result of a SQL GROUP BY query). Such entities, called grouped-entities, occur frequently in many applications. Previous approaches to grouped-entity resolution often rely on textual similarity and produce a large number of false positives. As a complementary technique, we present our experience of applying a recently proposed graph mining technique, Quasi-Clique, atop conventional ER solutions. Our approach exploits contextual information mined from the group of elements per entity in addition to syntactic similarity. Extensive experiments verify that our proposal improves precision and recall by up to 83% when used together with a variety of existing ER solutions, and never worsens them.
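The abstract does not define a quasi-clique. The sketch below assumes one commonly used degree-based definition: a vertex set is a gamma-quasi-clique if every member is adjacent to at least ceil(gamma * (n - 1)) of the other n - 1 members. The function name and the graph representation are illustrative, and this only checks the definition; it does not mine quasi-cliques or reproduce the paper's ER pipeline.

```python
import math
from typing import Dict, Iterable, Set


def is_quasi_clique(adjacency: Dict[str, Set[str]],
                    vertices: Iterable[str],
                    gamma: float) -> bool:
    """Check the assumed degree-based gamma-quasi-clique condition.

    adjacency maps each vertex to its neighbour set in an undirected
    simple graph (no self-loops); gamma lies in (0, 1].
    """
    members = set(vertices)
    if not members:
        return False
    # Small tolerance so that floating-point noise such as
    # 7.000000000000001 is not rounded up to 8.
    required = math.ceil(gamma * (len(members) - 1) - 1e-9)
    return all(len(adjacency.get(v, set()) & members) >= required
               for v in members)


# Four vertices with every edge present except c-d.
GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
```

With gamma = 1.0 the condition reduces to an ordinary clique, so lowering gamma is what lets densely but not fully connected groups of elements count as shared context.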
Citations: 60
Regularized Least Absolute Deviations Regression and an Efficient Algorithm for Parameter Tuning
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.134
Li Wang, Michael D. Gordon, Ji Zhu
Linear regression is one of the most important and widely used techniques for data analysis. However, sometimes people are not satisfied with it because of the following two limitations: 1) its results are sensitive to outliers, so when the error terms are not normally distributed, especially when they have heavy-tailed distributions, linear regression often works badly; 2) its estimated coefficients tend to have high variance, although their bias is low. To reduce the influence of outliers, robust regression models were developed. Least absolute deviation (LAD) regression is one of them. LAD minimizes the mean absolute errors, instead of mean squared errors, so its results are more robust. To address the second limitation, shrinkage methods were proposed, which add a penalty on the size of the coefficients. The LASSO is one of these methods and it uses the L1-norm penalty, which not only reduces the prediction error and the variance of estimated coefficients, but also provides an automatic feature selection function. In this paper, we propose the regularized least absolute deviation (RLAD) regression model, which combines the nice features of the LAD and the LASSO together. The RLAD is a regularization method, whose objective function has the form of "loss + penalty." The "loss" is the sum of the absolute deviations and the "penalty" is the L1-norm of the coefficient vector. Furthermore, to facilitate parameter tuning, we develop an efficient algorithm which can solve the entire regularization path in one pass. Simulations with various settings are performed to demonstrate its performance. Finally, we apply the algorithm to solve the image reconstruction problem and find interesting results.
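The "loss + penalty" form described in this abstract can be written out as follows. The symbols are assumptions made for illustration: n observations (x_i, y_i), coefficient vector beta with p entries, intercept beta_0, and regularization parameter lambda >= 0. Whether the intercept is left unpenalized, as written here, is also an assumption that the abstract does not settle.

```latex
\min_{\beta_0,\,\beta}\;
\underbrace{\sum_{i=1}^{n} \bigl|\, y_i - \beta_0 - x_i^{\top}\beta \,\bigr|}_{\text{loss: sum of absolute deviations}}
\;+\;
\lambda \underbrace{\sum_{j=1}^{p} |\beta_j|}_{\text{penalty: } \|\beta\|_1}
```

Setting lambda = 0 recovers plain LAD regression, while replacing the absolute-deviation loss with squared error gives the LASSO. The regularization path mentioned in the abstract is the set of solutions obtained as lambda varies.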
Citations: 95
Data Mining Approaches to Criminal Career Analysis
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.47
J. D. Bruin, Tim K. Cocx, W. Kosters, J. Laros, J. Kok
Narrative reports and criminal records are stored digitally across individual police departments, enabling the collection of this data to compile a nation-wide database of criminals and the crimes they committed. The compilation of this data through the last years presents new possibilities of analyzing criminal activity through time. Augmenting the traditional, more socially oriented, approach of behavioral study of these criminals and traditional statistics, data mining methods like clustering and prediction enable police forces to get a clearer picture of criminal careers. This allows officers to recognize crucial spots in changing criminal behaviour and deploy resources to prevent these careers from unfolding. Four important factors play a role in the analysis of criminal careers: crime nature, frequency, duration and severity. We describe a tool that extracts these from the database and creates digital profiles for all offenders. It compares all individuals on these profiles by a new distance measure and clusters them accordingly. This method yields a visual clustering of these criminal careers and enables the identification of classes of criminals. The proposed method allows for several user-defined parameters.
Citations: 110
Automatic Single-Organ Segmentation in Computed Tomography Images
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.24
Ruchaneewan Susomboon, D. Raicu, J. Furst, D. Channin
In this paper, we propose a hybrid approach for automatic single-organ segmentation in computed tomography (CT) data. The approach consists of three stages: first, a probability image of the organ of interest is obtained by applying a binary classification model obtained using pixel-based texture features; second, an adaptive split-and-merge segmentation algorithm is applied on the organ probability image to remove the noise introduced by the misclassified pixels; and third, the segmented organ's boundaries from the previous stage are iteratively refined using a region growing algorithm. While we applied our approach for liver segmentation in 2-D CT images, a challenging and important task in many medical applications, the proposed approach can be applied for the segmentation of any other organ in CT images. Moreover, the proposed approach can be extended to perform automatic multiple organ segmentation and to build context-sensitive reporting tools for computer-aided diagnosis applications.
Citations: 20
The Relationships Among Various Nonnegative Matrix Factorization Methods for Clustering
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.160
Tao Li, C. Ding
Nonnegative matrix factorization (NMF) has recently been shown to be useful for clustering, and various extensions and variations of NMF have been proposed. Despite significant research progress in this area, few attempts have been made to establish the connections between various factorization methods while highlighting their differences. In this paper we aim to provide a comprehensive study of matrix factorization for clustering. In particular, we present an overview and summary of various matrix factorization algorithms and theoretically analyze the relationships among them. Experiments are also conducted to empirically evaluate and compare various factorization methods. In addition, our study answers several previously unaddressed yet important questions for matrix factorizations, including the interpretation and normalization of the cluster posterior and the benefits and evaluation of simultaneous clustering. We expect our study to provide good insights into matrix factorization research for clustering.
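As background for readers unfamiliar with NMF-based clustering, here is a minimal pure-Python sketch of one standard variant: the multiplicative updates for the squared Frobenius error, followed by assigning each row to its largest factor. This is a generic textbook formulation and not necessarily any of the specific methods the paper compares. The function names, the random initialization, and the iteration count are illustrative choices.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def frobenius_error(x: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x: Matrix, k: int, iterations: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Approximate a nonnegative X (n x m) as W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def cluster_labels(w: Matrix) -> List[int]:
    """Assign each row of X to the factor with the largest weight in W."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]
```

The updates only reach a local optimum, so the labels can depend on the seed. Reading cluster membership from the rows of W is also a simple heuristic, which relates to the interpretation and normalization questions the abstract raises.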
Citations: 308
Adaptive Parallel Graph Mining for CMP Architectures
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.15
G. Buehrer, S. Parthasarathy, Yen-kuang Chen
Mining graph data is an increasingly popular challenge, which has practical applications in many areas, including molecular substructure discovery, Web link analysis, fraud detection, and social network analysis. The problem statement is to enumerate all subgraphs occurring in at least sigma graphs of a database, where sigma is a user specified parameter. Chip multiprocessors (CMPs) provide true parallel processing, and are expected to become the de facto standard for commodity computing. In this work, building on the state-of-the-art, we propose an efficient approach to parallelize such algorithms for CMPs. We show that an algorithm which adapts its behavior based on the runtime state of the system can improve system utilization and lower execution times. Most notably, we incorporate dynamic state management to allow memory consumption to vary based on availability. We evaluate our techniques on current day shared memory systems (SMPs) and expect similar performance for CMPs. We demonstrate excellent speedup, 27-fold on 32 processors for several real world datasets. Additionally, we show our dynamic techniques afford this scalability while consuming up to 35% less memory than static techniques.
Citations: 59
Detecting Link Spam Using Temporal Information
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.51
Guoyang Shen, Bin Gao, Tie-Yan Liu, Guang Feng, Shiji Song, Hang Li
How to effectively protect against spam on search ranking results is an important issue for contemporary web search engines. This paper addresses the problem of combating one major type of web spam: 'link spam.' Most of the previous work on anti link spam managed to make use of one snapshot of web data to detect spam, and thus it did not take advantage of the fact that link spam tends to result in drastic changes of links in a short time period. To overcome the shortcoming, this paper proposes using temporal information on links in detection of link spam, as well as other information. Specifically, it defines temporal features such as in-link growth rate (IGR) and in-link death rate (IDR) in a spam classification model (i.e., SVM). Experimental results on web domain graph data show that link spam can be successfully detected with the proposed method.
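The abstract names two temporal features, in-link growth rate (IGR) and in-link death rate (IDR), without defining them. The sketch below uses assumed definitions based on two snapshots of a site's in-link set: IGR is the relative change in the number of in-links, and IDR is the fraction of earlier in-links that have disappeared. The paper's exact formulas may differ, and the function names and example data are hypothetical.

```python
from typing import Set


def in_link_growth_rate(before: Set[str], after: Set[str]) -> float:
    """Assumed IGR: relative change in in-link count between two snapshots."""
    if not before:
        return 0.0  # convention for this sketch: undefined ratio reported as 0
    return (len(after) - len(before)) / len(before)


def in_link_death_rate(before: Set[str], after: Set[str]) -> float:
    """Assumed IDR: fraction of earlier in-links missing from the later snapshot."""
    if not before:
        return 0.0
    return len(before - after) / len(before)


# Sources linking to one site at two points in time (made-up example data).
SNAPSHOT_T0 = {"a.example", "b.example", "c.example", "d.example"}
SNAPSHOT_T1 = {"c.example", "d.example", "e.example", "f.example",
               "g.example", "h.example", "i.example", "j.example"}

FEATURES = [in_link_growth_rate(SNAPSHOT_T0, SNAPSHOT_T1),
            in_link_death_rate(SNAPSHOT_T0, SNAPSHOT_T1)]
```

Feature vectors of this kind, possibly combined with non-temporal features, would then be passed to a classifier such as the SVM mentioned in the abstract.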
Citations: 62
Boosting for Learning Multiple Classes with Imbalanced Class Distribution
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.29
Yanmin Sun, M. Kamel, Yang Wang
Data with an imbalanced class distribution significantly degrade the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This learning difficulty has attracted considerable research interest. Most efforts concentrate on bi-class problems. However, the bi-class case is not the only scenario where the class imbalance problem prevails, and reported solutions for bi-class applications are not applicable to multi-class problems. In this paper, we develop a cost-sensitive boosting algorithm to improve the classification performance on imbalanced data involving multiple classes. One barrier to applying the cost-sensitive boosting algorithm to imbalanced data is that the cost matrix is often unavailable for a problem domain. To solve this problem, we apply a genetic algorithm to search for the optimum cost setup of each class. Empirical tests show that the proposed cost-sensitive boosting algorithm significantly improves the classification performance on imbalanced data sets.
Citations: 292
Personalization in Context: Does Context Matter When Building Personalized Customer Models?
Pub Date : 2006-12-18 DOI: 10.1109/ICDM.2006.125
M. Gorgoglione, C. Palmisano, A. Tuzhilin
Scholars in marketing and data mining have long maintained that context is important when predicting customer behavior. However, no systematic study has previously measured how much contextual information really matters when building customer models in personalization applications. In this paper, we address this problem. To this end, we collected data containing rich contextual information by developing a special-purpose browser that helps users navigate a well-known e-commerce retail portal and purchase products on its site. The experimental results show that context does matter when modeling the behavior of individual customers. The granularity of contextual information also matters, and the effect of contextual information is diluted as customers' data are aggregated.
Citations: 35