2015 IEEE International Conference on Data Mining Workshop (ICDMW): Latest Papers

The Hierarchical Model to Ali Mobile Recommendation Competition
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.75
Suchi Qian, Furong Peng, Xiang Li, Jianfeng Lu
Recommendation engines have gained a great deal of attention in the big data world. To promote the application of big data, Alibaba Group organized a big data recommendation competition, which provided participants with a big data processing platform and one billion behavior records. The competition requires participants to learn a model from one month of user behavior and then predict purchases on the following day. Four kinds of behavior are included: browse, add-to-cart, collect, and purchase. The F1-score is the evaluation metric. Our team achieved the top score of 8.78%, which we attribute to three aspects. First, we model the recommendation problem as a binary classification problem and design a hierarchical model. Second, to improve the performance of a single classifier, we adopt a sample filtering strategy that selects valuable samples for training, which both boosts performance and speeds up training. Third, a classifier fusion strategy is used to improve the final performance. This paper details our hierarchical model and the key techniques adopted for this competition. The hierarchical model also serves as the data processing framework and is composed of four layers: 1) a sample filtering layer, which removes a large number of low-value samples and reduces computational complexity; 2) a feature extraction layer, which extracts extensive features to characterize the samples from as many views as possible; 3) a classifying layer, which trains several classifiers using different sampling strategies and feature groups; 4) a fusion layer, which fuses the results of the different classifiers to obtain a better one. Our competition score demonstrates the soundness and feasibility of the model.
Citations: 0
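The abstract above names the F1-score as the competition metric. As a minimal sketch of how such a score could be computed over predicted purchases, the Python below treats each prediction as a (user, item) pair and compares the predicted set with the set of actual purchases. The pair representation, the function name `f1_score`, and the sample identifiers are assumptions made for illustration; the abstract does not describe the official scoring code.

```python
def f1_score(predicted, actual):
    """F1 between two collections of (user_id, item_id) pairs (assumed format)."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # correctly predicted purchases
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 4 predicted pairs, 3 actual purchases, 2 in common.
predicted = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i4")}
actual = {("u1", "i1"), ("u2", "i3"), ("u4", "i5")}
score = f1_score(predicted, actual)  # precision 1/2, recall 2/3
```

Because the score balances precision against recall, submitting many low-confidence pairs lowers it, which is consistent with the emphasis the abstract places on filtering samples.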
Event Detection from Millions of Tweets Related to the Great East Japan Earthquake Using Feature Selection Technique
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.248
T. Hashimoto, D. Shepard, T. Kuboyama, Kilho Shin
Social media offers a wealth of insight into how significant events -- such as the Great East Japan Earthquake, the Arab Spring, and the Boston Bombing -- affect individuals. The scale of available data, however, can be intimidating: during the Great East Japan Earthquake, over 8 million tweets were sent each day from Japan alone. Conventional word vector-based event-detection techniques for social media that use Latent Semantic Analysis, Latent Dirichlet Allocation, or graph community detection often cannot scale to such a large volume of data due to their space and time complexity. To alleviate this problem, we propose an efficient method for event detection by leveraging a fast feature selection algorithm called CWC. While we begin with word count vectors of authors and words for each time slot (in our case, every hour), we extract discriminative words from each slot using CWC, which vastly reduces the number of features to track. We then convert these word vectors into a time series of vector distances from the initial point. The distance between each time slot and the initial point remains high while an event is happening, yet declines sharply when the event ends, offering an accurate portrait of the span of an event. This method makes it possible to detect events from vast datasets. To demonstrate our method's effectiveness, we extract events from a dataset of over two hundred million tweets sent in the 21 days following the Great East Japan Earthquake. With CWC, we can identify events from this dataset with great speed and accuracy.
Citations: 8
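The abstract above describes converting per-slot word vectors into a time series of distances from the initial slot. The following Python is only a sketch of that last step under stated assumptions: it uses cosine distance (the abstract does not name the distance measure), it assumes the words in each slot have already been reduced by some feature selection step (CWC itself is not implemented here), and the hourly word lists are invented.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_series(slots):
    """Distance of every time slot's word-count vector from the first slot."""
    vectors = [Counter(words) for words in slots]
    base = vectors[0]
    return [cosine_distance(base, v) for v in vectors]


# Hypothetical hourly slots: ordinary chatter, an event, then chatter again.
hourly_words = [
    ["train", "lunch", "work"],
    ["train", "work", "lunch"],
    ["quake", "tsunami", "quake"],
    ["quake", "shelter", "tsunami"],
    ["work", "train", "lunch"],
]
series = distance_series(hourly_words)
```

In this toy series the distance stays near zero for ordinary slots, jumps to 1.0 for the two event slots, and drops back afterwards, which mirrors the rise-and-fall pattern the abstract uses to delimit an event.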
Shikake Data Market for Collaborative Shikake Creation
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.130
N. Matsumura, Hideaki Takeda
A shikake is a trigger for behavioral change to solve a problem. We propose a Shikake Data Market (SDM) platform that gives everyone an opportunity to implement a shikake with restricted resources, such as ideas, expert knowledge and skill, practitioners, negotiators, and budget. As a preliminary case, we analyzed the collaborative creation at a shikake hackathon and found that collaboration among people with diverse expert backgrounds improves the quality of the output. Based on this result, we discuss collaborative shikake creation.
Citations: 0
Lifting the Predictability of Human Mobility on Activity Trajectories
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.164
Xianming Li, Defu Lian, Xing Xie, Guangzhong Sun
Mobility prediction has recently attracted plenty of attention since it plays an important part in many applications, ranging from urban planning and traffic forecasting to location-based services such as mobile recommendation and mobile advertisement. However, few studies exploit the activity information that is often associated with the trajectories on which prediction is based to assist location prediction. To this end, we propose a Time-stamped Activity INference Enhanced Predictor (TAINEP) for forecasting the next location on activity trajectories. In TAINEP, we leverage topic models for dimension reduction so as to capture co-occurrences of different time-stamped activities. The model is then extended to incorporate temporal dependence between the topics of consecutive time-stamped activities, in order to infer the activity that may be conducted at the next location and the time when it will happen. Based on the inferred time-stamped activities, a probabilistic mixture model is further put forward to integrate them with commonly used Markov predictors for forecasting the next location. We finally evaluate the proposed model on two real-world datasets. The results show that the proposed method outperforms competing predictors that do not infer time-stamped activities. In other words, it lifts the predictability of human mobility.
Citations: 8
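The abstract above mentions "commonly used Markov predictors" as the baseline that TAINEP is combined with. To make that baseline concrete, here is a minimal sketch of a first-order Markov next-location predictor in Python. It is not TAINEP: it ignores activities, timestamps, and the mixture model entirely, and the function names and trajectories are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov(trajectories):
    """Count location-to-location transitions over all trajectories."""
    transitions = defaultdict(Counter)
    for path in trajectories:
        for current, nxt in zip(path, path[1:]):
            transitions[current][nxt] += 1
    return transitions


def predict_next(transitions, current):
    """Most frequent successor of `current`, or None if it was never seen."""
    counts = transitions.get(current)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical check-in trajectories for one user.
trajectories = [
    ["home", "office", "gym", "home"],
    ["home", "office", "cafe", "home"],
    ["home", "office", "gym", "home"],
]
model = fit_markov(trajectories)
next_place = predict_next(model, "office")
```

A predictor like this can only repeat the most frequent past transition. The abstract's argument is that knowing the likely next activity and its time gives additional evidence for cases where transition counts alone are ambiguous.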
Selecting Machine Learning Algorithms Using Regression Models
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.43
Tri Doan, J. Kalita
In performing data mining, a common task is to search for the most appropriate algorithm(s) to retrieve important information from data. With an increasing number of available data mining techniques, it may be impractical to experiment with many techniques on a specific dataset of interest to find the best algorithm(s). In this paper, we demonstrate the suitability of tree-based multi-variable linear regression in predicting algorithm performance. We take into account prior machine learning experience to construct meta-knowledge for supervised learning. The idea is to use summary knowledge about datasets along with past performance of algorithms on these datasets to build this meta-knowledge. We augment pure statistical summaries with descriptive features and a misclassification cost, and discover that transformed datasets obtained by reducing a high dimensional feature space to a smaller dimension still retain significant characteristic knowledge necessary to predict algorithm performance. Our approach works well for both numerical and nominal data obtained from real world environments.
Citations: 32
Examining Botnet Behaviors for Propaganda Dissemination: A Case Study of ISIL's Beheading Videos-Based Propaganda
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.41
Samer Al-khateeb, Nitin Agarwal
Since the Islamic State in Iraq and Levant (ISIL) disseminated its first beheading video, showing its hostage James Foley (an American journalist), this practice has become increasingly common. Videos of ISIL beheading hostages in orange jumpsuits swarmed social media as the group swept across Iraq. By showing such shocking videos and images, ISIL is able to spread its opinions and shape the emotional attitudes of its followers. Through a sophisticated social media strategy and strategic use of botnets, ISIL is succeeding in disseminating its propaganda. ISIL is using social media as a tool to conduct recruitment and radicalization campaigns and to raise funds. In this study, we examine the reasons for creating such videos, grounded in the literature from cultural anthropology, transnationalism and religious identity, and media & communication. To this end, we collect data from Twitter on the beheadings carried out by ISIL, especially those of the Egyptian Copts, the Arab-Israeli "Spy", and the Ethiopian Christians. The study provides insights into the way ISIL uses social media (especially Twitter) to disseminate propaganda and develops a framework to identify sociotechnical behavioral patterns from social and computational science perspectives.
Citations: 27
An Enumerative Biclustering Algorithm for DNA Microarray Data
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.168
Haifa Ben Saber, M. Elloumi
In a number of domains, such as DNA microarray data analysis, we need to cluster the rows (genes) and columns (conditions) of a data matrix simultaneously to identify groups of rows that are constant over a group of columns. This kind of clustering is called biclustering. Biclustering algorithms are extensively used in DNA microarray data analysis, and more effective ones are highly desirable. We introduce a new algorithm called Enumerative Lattice (EnumLat) for biclustering binary microarray data. EnumLat adopts the approach of enumerating biclusters and extracts all biclusters of consistently good quality. Its main idea is the construction of a new tree structure that adequately represents the different biclusters discovered during enumeration. The algorithm adopts the all-biclusters-at-a-time strategy. We assess the performance of the proposed algorithm on both synthetic and real DNA microarray data, where it outperforms other biclustering algorithms for binary microarray data. Moreover, we test biological significance using a gene annotation web tool to show that the proposed method is able to produce biologically relevant biclusters.
Citations: 1
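To illustrate what a bicluster of binary microarray data looks like, the Python sketch below enumerates maximal all-ones submatrices of a small binary matrix by brute force over column subsets. This is explicitly not EnumLat: the abstract does not give the algorithm's tree structure, so the closure-based enumeration, the function name `maximal_biclusters`, the size thresholds, and the toy matrix are all assumptions chosen for clarity. The approach is exponential in the number of columns and only suitable for tiny examples.

```python
from itertools import combinations


def maximal_biclusters(matrix, min_rows=2, min_cols=2):
    """Return sorted (rows, cols) index tuples of maximal all-ones biclusters."""
    n_cols = len(matrix[0])
    found = set()
    for size in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # Rows (genes) that are 1 in every chosen column (condition).
            rows = tuple(i for i, row in enumerate(matrix)
                         if all(row[j] for j in cols))
            if len(rows) < min_rows:
                continue
            # Extend to every column shared by those rows, making it maximal.
            closed_cols = tuple(j for j in range(n_cols)
                                if all(matrix[i][j] for i in rows))
            found.add((rows, closed_cols))
    return sorted(found)


# Hypothetical binarized expression matrix: 4 genes x 4 conditions.
matrix = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
biclusters = maximal_biclusters(matrix)
```

Each result pairs a set of gene indices with the full set of conditions under which all of those genes are 1. An enumerative algorithm aims to list all such groups at once rather than finding them one at a time.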
Mining Unstable Communities from Network Ensembles
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.87
Ahsanur Rahman, Steve T. K. Jan, Hyunju Kim, B. Prakash, T. Murali
Ensembles of graphs arise in several natural applications, such as mobility tracking, computational biology, social networks, and epidemiology. A common problem addressed by many existing mining techniques is to identify subgraphs of interest in these ensembles. In contrast, in this paper, we propose to quickly discover maximally variable regions of the graphs, i.e., sets of nodes that induce very different subgraphs across the ensemble. We first develop two intuitive and novel definitions of such node sets, which we then show can be efficiently enumerated using a level-wise algorithm. Finally, using extensive experiments on multiple real datasets, we show how these sets capture the main structural variations of the given set of networks and also provide us with interesting and relevant insights about these datasets.
Citations: 1
Defending Suspected Users by Exploiting Specific Distance Metric in Collaborative Filtering Recommender Systems
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.89
Zhihai Yang, Zhongmin Cai
Collaborative filtering recommender systems (CFRSs) are critical components of popular e-commerce websites, where they make personalized recommendations. In practice, CFRSs are highly vulnerable to "shilling" attacks or "profile injection" attacks due to their openness. A number of detection methods have been proposed to make CFRSs resistant to such attacks. However, some of them distinguish attackers using typical similarity metrics; although these can capture the attackers of concern to some extent, they struggle to defend against all attackers and incur high computation time. In this paper, we propose an unsupervised method to detect such attacks. First, we use suspected target items to filter out as many genuine users as possible in order to reduce time consumption. Then, based on the remaining users from the first stage, we employ a new similarity metric to further filter out the remaining genuine users; it combines a traditional similarity metric with the linkage information between users to improve the accuracy of user similarity. Experimental results show that our proposed detection method is superior to the benchmark method.
Citations: 1
Terra Populus: Integrated Data on Population and Environment
Pub Date : 2015-11-14 DOI: 10.1109/ICDMW.2015.204
S. Ruggles, T. Kugler, Catherine A. Fitch, D. V. Riper
Terra Populus, part of the National Science Foundation's DataNet initiative, is developing organizational and technical infrastructure to integrate, preserve, and disseminate data describing changes in the human population and environment over time. A large number of high-quality environmental and population datasets are available, but they are widely dispersed, have incompatible or inadequate metadata, and have incompatible geographic identifiers. The new Terra Populus infrastructure enables researchers to identify and merge data from heterogeneous sources to study the relationships between human behavior and the natural world.
Citations: 6