2010 IEEE International Conference on Data Mining Workshops: Latest Articles

Find Intelligent Knowledge by Second-Order Mining: Three Cases from China
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.115
G. Nie, Lingling Zhang, Yuejin Zhang, Wei Deng, Yong Shi
This paper discusses second-order mining of the results of data mining. There is still a gap between the knowledge that can direct a company's operations and the knowledge obtained from data mining. We treat the knowledge from data mining as primary knowledge and the knowledge from second-order data mining as intelligent knowledge. We discuss the importance of intelligent knowledge and how to find it from primary knowledge via a second-order mining process. Three cases from China demonstrate methods for carrying out second-order mining. These cases relate to central bank credit scoring, credit card churn prediction, and email services, respectively.
Citations: 6
Parallel User Profiling Based on Folksonomy for Large Scaled Recommender Systems: An Implimentation of Cascading MapReduce
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.161
Huizhi Liang, Jim Hogan, Yue Xu
Large-scale user-created information emerging in Web 2.0, such as tags, reviews, comments and blogs, can be used to profile users' interests and preferences to make personalized recommendations. To solve the scalability problem of current user profiling and recommender systems, this paper proposes a parallel user profiling approach and a scalable recommender system. Current cloud computing techniques including Hadoop, MapReduce and Cascading are employed to implement the proposed approaches. The experiments were conducted on Amazon EC2 Elastic MapReduce and S3 with a real-world large-scale dataset from the Delicious website.
Citations: 27
Augmenting Chinese Online Video Recommendations by Using Virtual Ratings Predicted by Review Sentiment Classification
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.27
Weishi Zhang, Guiguang Ding, Li Chen, Chunping Li
In this paper we aim to resolve the recommendation problem by using virtual ratings in online environments where user rating information is not available. In most current websites, especially Chinese video-sharing ones, traditional purely rating-based collaborative filtering methods are not fully adequate because rating data are sparse. Motivated by our prior work investigating the user reviews that appear widely on such sites, we propose a new recommender algorithm that incorporates a self-supervised, emoticon-integrated sentiment classification approach, by which the missing User-Item Rating Matrix can be substituted with virtual ratings predicted from the reviews users give to items. To test the algorithm's practical value, we first establish the self-supervised sentiment classifier's higher performance by comparing it with a supervised approach. Moreover, we conducted a statistical evaluation to show the effectiveness of our recommender system in improving the accuracy of Chinese online video recommendations.
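The idea of substituting missing ratings with sentiment-derived virtual ratings can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the mapping from sentiment to rating, so the linear scaling to a 1-5 range, the dictionary layout, and the names `virtual_rating` and `fill_matrix` are all hypothetical.

```python
def virtual_rating(p_positive, low=1.0, high=5.0):
    """Map a positive-sentiment probability in [0, 1] to a rating scale.

    The linear mapping and the 1-5 scale are assumptions for illustration.
    """
    return low + (high - low) * p_positive


def fill_matrix(ratings, sentiments):
    """Fill missing user-item ratings with virtual ratings.

    ratings:    {(user, item): explicit rating}
    sentiments: {(user, item): positive-sentiment probability of the review}
    Explicit ratings are kept; only missing cells receive a virtual rating.
    """
    filled = dict(ratings)
    for key, p in sentiments.items():
        filled.setdefault(key, virtual_rating(p))
    return filled
```

The filled matrix could then be passed to any standard collaborative filtering routine in place of the sparse explicit-rating matrix.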
Citations: 21
A New Data Mining Model for Hurricane Intensity Prediction
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.158
Y. Su, Sudheer Chelluboina, Michael Hahsler, M. Dunham
This paper proposes a new hurricane intensity prediction model, WFL-EMM, which is based on the data mining techniques of feature weight learning (WFL) and the Extensible Markov Model (EMM). The data features used are those employed by one of the most popular intensity prediction models, SHIPS. In our algorithm, the weights of the features are learned by a genetic algorithm (GA) using historical hurricane data. As the GA's fitness function we use the error of the intensity prediction by an EMM learned using the given feature weights. For the fitness calculation we use a technique similar to k-fold cross-validation on the training data. The best weights obtained by the genetic algorithm are used to build an EMM with all training data. This EMM is then applied to predict the hurricane intensities and compute prediction errors for the test data. Using historical data for the named Atlantic tropical cyclones from 1982 to 2003, experiments demonstrate that WFL-EMM provides significantly more accurate intensity predictions than SHIPS within 72 hours. Since these are first results, we indicate how WFL-EMM can be improved in the future.
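A fitness function of the kind described, prediction error under a given set of feature weights estimated with a k-fold-style split, might look like the sketch below. It is not the authors' implementation: a weighted nearest-neighbour predictor stands in for the EMM, and the names `predict` and `fitness`, the interleaved fold split, and the mean-absolute-error measure are assumptions made for illustration.

```python
def predict(train, weights, query):
    """Weighted 1-nearest-neighbour predictor.

    Stand-in for the EMM (assumption): returns the target of the training
    row closest to `query` under a feature-weighted squared distance.
    """
    def dist(a, b):
        return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

    _, target = min(train, key=lambda row: dist(row[0], query))
    return target


def fitness(data, weights, k=3):
    """Mean absolute prediction error over k held-out folds (lower is better).

    data: list of (feature_tuple, target) rows; folds are interleaved slices.
    """
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        errors += [abs(predict(train, weights, x) - y) for x, y in held_out]
    return sum(errors) / len(errors)
```

A genetic algorithm would call `fitness` once per candidate weight vector and keep the vectors with the lowest error.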
Citations: 16
MCExplorer: Interactive Exploration of Multiple (Subspace) Clustering Solutions
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.29
Stephan Günnemann, Hardy Kremer, Ines Färber, T. Seidl
Large amounts of data are ubiquitous today. Data mining methods like clustering were introduced to gain knowledge from these data. Recently, detection of multiple clusterings has become an active research area, where several alternative clustering solutions are generated for a single dataset. Each of the obtained clustering solutions is valid, of importance, and provides a different interpretation of the data. The key for knowledge extraction, however, is to learn how the different solutions are related to each other. This can be achieved by a comparison and analysis of the obtained clustering solutions. We introduce our demo MCExplorer, the first tool that allows for interactive exploration, browsing, and visualization of multiple clustering solutions on several granularities. MCExplorer is applicable to the output of both fullspace and subspace clustering approaches.
Citations: 9
Towards Community Discovery in Signed Collaborative Interaction Networks
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.174
Petko Bogdanov, Nicholas D. Larusso, Ambuj K. Singh
We propose a framework for discovery of collaborative community structure in Wiki-based knowledge repositories based on raw-content generation analysis. We leverage topic modelling in order to capture agreement and opposition of contributors and analyze these multi-modal relations to map communities in the contributor base. The key steps of our approach include (i) modeling of pairwise variable-strength contributor interactions that can be both positive and negative, (ii) synthesis of a global network incorporating all pairwise interactions, and (iii) detection and analysis of community structure encoded in such networks. The global community discovery algorithm we propose outperforms existing alternatives in identifying coherent clusters according to objective optimality criteria. Analysis of the discovered community structure reveals coalitions of common interest editors who back each other in promoting some topics and collectively oppose other coalitions or single authors. We couple contributor interactions with content evolution and reveal the global picture of opposing themes within the self-regulated community base for both controversial and featured articles in Wikipedia.
Citations: 17
Using Taxonomies to Perform Aggregated Querying over Imprecise Data
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.173
Atanu Roy, Chandrima Sarkar, R. Angryk
In this paper, we put forward our approach for answering aggregated queries over imprecise data using domain-specific taxonomies. We introduce a new concept, the weighted hierarchical hypergraph, which helps in answering aggregated queries when dealing with imprecise databases. We assume that the existence of a knowledge base is permanent and independent of the imprecision in the database. We use this concept to build a temporary database known as the extended database, and use the extended database to build the marginal database, which efficiently answers aggregated queries over imprecise data.
Citations: 2
Detecting and Tracking Community Dynamics in Evolutionary Networks
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.32
Zhengzhang Chen, Kevin A. Wilson, Ye Jin, W. Hendrix, N. Samatova
Community structure or clustering is ubiquitous in many evolutionary networks including social networks, biological networks and financial market networks. Detecting and tracking community deviations in evolutionary networks can uncover important and interesting behaviors that are latent if we ignore the dynamic information. In biological networks, for example, a small variation in a gene community may indicate an event, such as gene fusion, gene fission, or gene decay. In contrast to the previous work on detecting communities in static graphs or tracking conserved communities in time-varying graphs, this paper first introduces the concept of community dynamics, and then shows that the baseline approach by enumerating all communities in each graph and comparing all pairs of communities between consecutive graphs is infeasible and impractical. We propose an efficient method for detecting and tracking community dynamics in evolutionary networks by introducing graph representatives and community representatives to avoid generating redundant communities and limit the search space. We measure the performance of the representative-based algorithm by comparison to the baseline algorithm on synthetic networks, and our experiments show that our algorithm achieves a runtime speedup of 11–46. The method has also been applied to two real-world evolutionary networks including Food Web and Enron Email. Significant and informative community dynamics have been detected in both cases.
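The baseline mentioned in the abstract, comparing all pairs of communities between consecutive graphs, can be illustrated with a small sketch. The details here are assumptions, not taken from the paper: communities are given as non-empty sets of node ids, Jaccard overlap is used as the similarity, and the event labels and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two non-empty node sets."""
    return len(a & b) / len(a | b)


def match_communities(prev, curr, threshold=0.5):
    """Compare every community in `prev` with every community in `curr`.

    Returns one (prev_index, matched_curr_index_or_None, label) tuple per
    previous community. The quadratic pairwise scan is what makes this
    baseline expensive on large networks.
    """
    events = []
    for i, p in enumerate(prev):
        best_j, best_sim = None, 0.0
        for j, c in enumerate(curr):
            sim = jaccard(p, c)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_sim == 1.0:
            label = "conserved"
        elif best_sim >= threshold:
            label = "changed"
        else:
            label = "dissolved"
        events.append((i, best_j if best_sim >= threshold else None, label))
    return events
```

For example, with `prev = [{1, 2, 3}, {4, 5}]` and `curr = [{1, 2, 3}, {4, 5, 8}]`, the first community is reported as conserved and the second as changed.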
Citations: 46
Application of Data Mining for Anti-money Laundering Detection: A Case Study
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.66
Nhien-An Le-Khac, Mohand Tahar Kechadi
Money laundering is becoming more and more sophisticated; it seems to have moved from personal gain to the cliché of drug trafficking and financing terrorism. This criminal activity poses a serious threat not only to financial institutions but also to the nation. Today, most international financial institutions have implemented anti-money laundering solutions, but traditional investigative techniques consume numerous man-hours. Besides, most existing commercial solutions are based on statistics such as means and standard deviations and therefore are not efficient enough, especially for detecting suspicious cases in investment activities. In this paper, we present a case study of applying a knowledge-based solution that combines data mining and natural computing techniques to detect money laundering patterns. This solution is part of a collaboration project between our research group and an international investment bank.
Citations: 76
A Sequence Data Mining Protocol to Identify Best Representative Sequence for Protein Domain Families
Pub Date : 2010-12-13 DOI: 10.1109/ICDMW.2010.153
V. S. Gowri, K. Shameer, C. C. S. Reddy, P. Shingate, R. Sowdhamini
Protein domains are the compact, evolutionarily conserved units of proteins that can be utilized for function association of the large number of gene products realised from whole genome sequencing projects. Homology, inferred by sequence similarity, is usually a reason for transfer of function annotation from pre-existing domain families to gene products. Sequence analysis protocols are directed by the reference sequence of families used for homology searches to reduce computational time in such large-scale data mining processes. As protein domain families are diverse in nature, it is an important task to identify a single best representative sequence member from a protein domain family using a well-defined, reproducible bioinformatics protocol. We report a new bioinformatics protocol that can be used to identify best representative sequence (BRS) from protein domain families. The method is based on “coverage analysis” score implemented using three different sequence search programs and the trends obtained in reporting best representative sequence are assessed. The highest average coverage for BRPs was 66% when searched using Hidden Markov Models. Further, it is crucial to select BRS specific for a sequence search method when searching in large sequence databases.
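One plausible reading of a coverage-based selection is sketched below. The abstract does not define the coverage score, so this rests on an assumption: coverage of a candidate is taken as the fraction of family members returned as hits when that candidate is used as the search query, and the best representative is the member with the highest coverage. The `hits` mapping is hypothetical and would in practice come from a sequence search program.

```python
def coverage(hits, family):
    """Fraction of family members recovered by each member used as query.

    hits:   {query_id: iterable of hit ids returned by a search program}
    family: list of member ids of one protein domain family
    Hits outside the family are ignored; a self-hit counts (assumption).
    """
    members = set(family)
    return {
        q: len(set(hits.get(q, ())) & members) / len(members)
        for q in family
    }


def best_representative(hits, family):
    """Return the family member with the highest coverage score."""
    scores = coverage(hits, family)
    return max(family, key=lambda q: scores[q])
```

Running this once per search program would give one candidate per method, in line with the abstract's remark that the representative should be chosen specifically for the search method in use.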
Citations: 2
Journal
2010 IEEE International Conference on Data Mining Workshops