Connecting users across social media sites: a behavioral-modeling approach
R. Zafarani, Huan Liu. DOI: 10.1145/2487575.2487648

People use various social media for different purposes. The information on an individual site is often incomplete. When sources of complementary information are integrated, a better profile of a user can be built to improve online services such as verifying online information. To integrate these sources of information, it is necessary to identify individuals across social media sites. This paper aims to address the cross-media user identification problem. We introduce a methodology (MOBIUS) for finding a mapping among identities of individuals across social media sites. It consists of three key components: the first component identifies users' unique behavioral patterns that lead to information redundancies across sites; the second component constructs features that exploit information redundancies due to these behavioral patterns; and the third component employs machine learning for effective user identification. We formally define the cross-media user identification problem and show that MOBIUS is effective in identifying users across social media sites. This study paves the way for analysis and mining across social media sites, and facilitates the creation of novel online services across sites.
KeySee: supporting keyword search on evolving events in social streams
Pei Lee, L. Lakshmanan, E. Milios. DOI: 10.1145/2487575.2487711

Online social streams such as Twitter/Facebook timelines and forum discussions have emerged as prevalent channels for information dissemination. As these social streams surge quickly, information overload has become a huge problem. Existing keyword search engines on social streams, such as Twitter Search, do not overcome the problem, because they merely return an overwhelming list of posts with little aggregation or semantics. In this demo, we present a new solution, KeySee, which groups posts into events and tracks the evolution patterns of events as new posts stream in and old posts fade out. Noise and redundancy problems are effectively addressed in our system. Our demo supports refined keyword queries on evolving events by allowing users to specify the time span and a designated evolution pattern. For each event result, we provide various analytic views such as frequency curves, word clouds, and GPS distributions. We deploy KeySee on real Twitter streams, and the results show that our demo outperforms existing keyword search engines in both quality and usability.
TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC
Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, Hwanjo Yu. DOI: 10.1145/2487575.2487581

Graphs are used to model many real-world objects such as social networks and web graphs. Many applications in various fields require efficient and effective management of large-scale graph-structured data. Although distributed graph engines such as GBase and Pregel handle billion-scale graphs, the user needs to be skilled at managing and tuning a distributed system in a cluster, which is a nontrivial job for the ordinary user. Furthermore, these distributed systems need many machines in a cluster in order to provide reasonable performance. To address this problem, a disk-based parallel graph engine called GraphChi was recently proposed. Although GraphChi significantly outperforms all representative (disk-based) distributed graph engines, we observe that it still has serious performance problems for many important types of graph queries, due to 1) limited parallelism and 2) separate steps for I/O processing and CPU processing. In this paper, we propose a general, disk-based graph engine called TurboGraph that processes billion-scale graphs very efficiently using modern hardware on a single PC. TurboGraph is the first truly parallel graph engine that exploits 1) full parallelism, including multi-core parallelism and FlashSSD I/O parallelism, and 2) full overlap of CPU processing and I/O processing, as much as possible. Specifically, we propose a novel parallel execution model called pin-and-slide. TurboGraph also provides engine-level operators such as BFS, which are implemented under the pin-and-slide model. Extensive experimental results with large real datasets show that TurboGraph consistently and significantly outperforms GraphChi, by up to four orders of magnitude. Our implementation of TurboGraph is available at http://wshan.net/turbograph as executable files.
Cost-sensitive online active learning with application to malicious URL detection
P. Zhao, S. Hoi. DOI: 10.1145/2487575.2487647

Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining, and plays a critical role in internet security. In the literature, many existing studies formulate the problem as a regular supervised binary classification task that aims to optimize prediction accuracy. However, in a real-world malicious URL detection task, the ratio of malicious to legitimate URLs is highly imbalanced, so simply optimizing prediction accuracy is inappropriate. Another key limitation of existing work is the assumption that a large amount of training data is available, which is impractical given the potentially high cost of human labeling. To address these issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments on a large-scale, challenging malicious URL detection task. The encouraging results show that, by querying an extremely small amount of labeled data (about 0.5% of one million instances), the proposed technique achieves classification performance better than or highly comparable to state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data.
Amplifying the voice of youth in Africa via text analytics
Prem Melville, Vijil Chenthamarakshan, Richard D. Lawrence, J. Powell, Moses Mugisha, Sharad Sapra, R. Anandan, Solomon Assefa. DOI: 10.1145/2487575.2488216

U-report is an open-source SMS platform operated by UNICEF Uganda, designed to give community members a voice on issues that impact them. Data received by the system are either SMS responses to a poll conducted by UNICEF or unsolicited reports of a problem occurring within the community. There are currently 200,000 U-report participants, and they send up to 10,000 unsolicited text messages a week. The objective of the program in Uganda is to understand the data in real time and have issues addressed by the appropriate department in UNICEF in a timely manner. Given the high volume and velocity of the data streams, manual inspection of all messages is no longer sustainable. This paper describes an automated message-understanding and routing system deployed by IBM at UNICEF. We employ recent advances in data mining to get the most out of labeled training data while incorporating domain knowledge from experts. We discuss the trade-offs, design choices, and challenges in applying such techniques in a real-world deployment.
Comparing apples to oranges: a scalable solution with heterogeneous hashing
Mingdong Ou, Peng Cui, Fei Wang, Jun Wang, Wenwu Zhu, Shiqiang Yang. DOI: 10.1145/2487575.2487668

Although hashing techniques have been popular for large-scale similarity search, most existing methods for designing optimal hash functions focus on homogeneous similarity assessment, i.e., the data entities to be indexed are of the same type. Since heterogeneous entities and relationships are also ubiquitous in real-world applications, there is an emerging need to retrieve and search similar or relevant data entities from multiple heterogeneous domains, e.g., recommending relevant posts and images to a certain Facebook user. In this paper, we address the problem of "comparing apples to oranges" at large scale. Specifically, we propose a novel Relation-aware Heterogeneous Hashing (RaHH) method, which provides a general framework for generating hash codes of data entities sitting in multiple heterogeneous domains. Unlike some existing hashing methods that map heterogeneous data into a common Hamming space, the RaHH approach constructs a Hamming space for each type of data entity and learns optimal mappings between them simultaneously. This allows the learned hash codes to flexibly cope with the characteristics of different data domains. Moreover, the RaHH framework encodes both homogeneous and heterogeneous relationships between the data entities to design hash functions with improved accuracy. To validate the proposed RaHH method, we conduct extensive evaluations on two large datasets: one crawled from a popular social media site, Tencent Weibo, and the other an open Flickr dataset (NUS-WIDE). The experimental results clearly demonstrate that RaHH outperforms several state-of-the-art hashing methods with significant performance gains.
Empirical Bayes model to combine signals of adverse drug reactions
R. Harpaz, W. DuMouchel, P. LePendu, N. Shah. DOI: 10.1145/2487575.2488214

Data mining is a crucial tool for identifying risk signals of potential adverse drug reactions (ADRs). However, mining of ADR signals is currently limited to leveraging a single data source at a time. It is widely believed that combining ADR evidence from multiple data sources will result in a more accurate risk identification system. We present a methodology based on empirical Bayes modeling to combine ADR signals mined from ~5 million adverse event reports collected by the FDA and from healthcare data corresponding to 46 million patients, the two main types of information sources currently employed for signal detection. Based on four sets of test cases (gold standards), we demonstrate that our method leads to a statistically significant and substantial improvement in signal detection accuracy, averaging 40% over the use of each source independently, and an area under the ROC curve of 0.87. We also compare the method with alternative supervised learning approaches, and argue that our approach is preferable as it does not require labeled (training) samples, whose availability is currently limited. To our knowledge, this is the first effort to combine signals from these two complementary data sources, and to demonstrate the benefits of a computationally integrative strategy for drug safety surveillance.
A new collaborative filtering approach for increasing the aggregate diversity of recommender systems
K. Niemann, M. Wolpers. DOI: 10.1145/2487575.2487656

To satisfy and positively surprise its users, a recommender system needs to recommend items the users will like and most probably would not have found on their own. This requires the recommender system to recommend a broader range of items, including niche items. Such an approach also supports online stores, which often offer more items than traditional stores and need recommender systems to help users find the less popular items as well. However, popular items with plenty of usage data are easier to recommend, so niche items are often excluded from the recommendations. In this paper, we propose a new collaborative filtering approach based on the items' usage contexts. The approach increases the rating predictions for niche items with less usage data available and improves the aggregate diversity of the recommendations.
Location-aware publish/subscribe
Guoliang Li, Yang Wang, Ting Wang, Jianhua Feng. DOI: 10.1145/2487575.2487617

Location-based services have become widely available on mobile devices. Existing methods employ a pull (user-initiated) model, where a user issues a query to a server, which replies with location-aware answers. To provide users with instant replies, a push (server-initiated) model is becoming an inevitable computing model for next-generation location-based services. In the push model, subscribers register spatio-textual subscriptions to capture their interests, and publishers post spatio-textual messages. This calls for a high-performance location-aware publish/subscribe system to deliver publishers' messages to relevant subscribers. In this paper, we address the research challenges that arise in designing a location-aware publish/subscribe system. We propose an R-tree-based index structure that integrates textual descriptions into R-tree nodes. We devise efficient filtering algorithms and develop effective pruning techniques to improve filtering efficiency. Experimental results show that our method achieves high performance: for example, it can filter 500 tweets per second against 10 million registered subscriptions on a commodity computer.
Robust sparse estimation of multiresponse regression and inverse covariance matrix via the L2 distance
A. Lozano, Huijing Jiang, Xinwei Deng. DOI: 10.1145/2487575.2487667

We propose a robust framework to jointly perform two key modeling tasks involving high dimensional data: (i) learning a sparse functional mapping from multiple predictors to multiple responses while taking advantage of the coupling among responses, and (ii) estimating the conditional dependency structure among responses while adjusting for their predictors. The traditional likelihood-based estimators lack resilience with respect to outliers and model misspecification. This issue is exacerbated when dealing with high dimensional noisy data. In this work, we propose instead to minimize a regularized distance criterion, which is motivated by the minimum distance functionals used in nonparametric methods for their excellent robustness properties. The proposed estimates can be obtained efficiently by leveraging a sequential quadratic programming algorithm. We provide theoretical justification, such as estimation consistency, for the proposed estimator. Additionally, we shed light on the robustness of our estimator through its linearization, which yields a combination of weighted lasso and graphical lasso, with the sample weights providing an intuitive explanation of the robustness. We demonstrate the merits of our framework through a simulation study and the analysis of real financial and genetics data.