Connecting users across social media sites: a behavioral-modeling approach
R. Zafarani, Huan Liu. DOI: 10.1145/2487575.2487648

People use various social media for different purposes. The information on an individual site is often incomplete. When sources of complementary information are integrated, a better profile of a user can be built to improve online services such as verifying online information. To integrate these sources of information, it is necessary to identify individuals across social media sites. This paper aims to address the cross-media user identification problem. We introduce a methodology (MOBIUS) for finding a mapping among identities of individuals across social media sites. It consists of three key components: the first component identifies users' unique behavioral patterns that lead to information redundancies across sites; the second component constructs features that exploit information redundancies due to these behavioral patterns; and the third component employs machine learning for effective user identification. We formally define the cross-media user identification problem and show that MOBIUS is effective in identifying users across social media sites. This study paves the way for analysis and mining across social media sites, and facilitates the creation of novel online services across sites.
KeySee: supporting keyword search on evolving events in social streams
Pei Lee, L. Lakshmanan, E. Milios. DOI: 10.1145/2487575.2487711

Online social streams such as Twitter/Facebook timelines and forum discussions have emerged as prevalent channels for information dissemination. As these social streams surge quickly, information overload has become a huge problem. Existing keyword search engines on social streams, such as Twitter Search, do not overcome the problem, because they merely return an overwhelming list of posts with little aggregation or semantics. In this demo, we present a new solution, KeySee, which groups posts into events and tracks the evolution patterns of events as new posts stream in and old posts fade out. Noise and redundancy problems are effectively addressed in our system. Our demo supports refined keyword queries on evolving events by allowing users to specify the time span and a designated evolution pattern. For each event result, we provide various analytic views such as frequency curves, word clouds, and GPS distributions. We deploy KeySee on real Twitter streams, and the results show that our demo outperforms existing keyword search engines in both quality and usability.
TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC
Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, Hwanjo Yu. DOI: 10.1145/2487575.2487581

Graphs are used to model many real-world objects such as social networks and web graphs. Many applications in various fields require efficient and effective management of large-scale graph-structured data. Although distributed graph engines such as GBase and Pregel handle billion-scale graphs, the user needs to be skilled at managing and tuning a distributed system in a cluster, which is a nontrivial job for the ordinary user. Furthermore, these distributed systems need many machines in a cluster in order to provide reasonable performance. To address this problem, a disk-based parallel graph engine called GraphChi was recently proposed. Although GraphChi significantly outperforms all representative (disk-based) distributed graph engines, we observe that it still has serious performance problems for many important types of graph queries, due to 1) limited parallelism and 2) separate steps for I/O processing and CPU processing. In this paper, we propose a general, disk-based graph engine called TurboGraph that processes billion-scale graphs very efficiently using modern hardware on a single PC. TurboGraph is the first truly parallel graph engine that exploits 1) full parallelism, including multi-core parallelism and FlashSSD I/O parallelism, and 2) full overlap of CPU processing and I/O processing, as much as possible. Specifically, we propose a novel parallel execution model called pin-and-slide. TurboGraph also provides engine-level operators such as BFS, which are implemented under the pin-and-slide model. Extensive experimental results with large real datasets show that TurboGraph consistently and significantly outperforms GraphChi, by up to four orders of magnitude. Our implementation of TurboGraph is available at http://wshan.net/turbograph as executable files.
Cost-sensitive online active learning with application to malicious URL detection
P. Zhao, S. Hoi. DOI: 10.1145/2487575.2487647

Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining, and plays a critical role in internet security. In the literature, many existing studies formulate the problem as a regular supervised binary classification task that aims to optimize prediction accuracy. However, in a real-world malicious URL detection task, the ratio of malicious to legitimate URLs is highly imbalanced, so simply optimizing prediction accuracy is inappropriate. Another key limitation of existing work is the assumption that a large amount of training data is available, which is impractical given the potentially high cost of human labeling. To address these issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments on a large-scale, challenging malicious URL detection task. The encouraging results show that, by querying an extremely small amount of labeled data (about 0.5% of one million instances), the proposed technique achieves classification performance better than or highly comparable to state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data.
Amplifying the voice of youth in Africa via text analytics
Prem Melville, Vijil Chenthamarakshan, Richard D. Lawrence, J. Powell, Moses Mugisha, Sharad Sapra, R. Anandan, Solomon Assefa. DOI: 10.1145/2487575.2488216

U-report is an open-source SMS platform operated by UNICEF Uganda, designed to give community members a voice on issues that impact them. Data received by the system are either SMS responses to a poll conducted by UNICEF or unsolicited reports of a problem occurring within the community. There are currently 200,000 U-report participants, and they send up to 10,000 unsolicited text messages a week. The objective of the program in Uganda is to understand the data in real time and have issues addressed by the appropriate department in UNICEF in a timely manner. Given the high volume and velocity of the data streams, manual inspection of all messages is no longer sustainable. This paper describes an automated message-understanding and routing system deployed by IBM at UNICEF. We employ recent advances in data mining to get the most out of labeled training data while incorporating domain knowledge from experts. We discuss the trade-offs, design choices, and challenges in applying such techniques in a real-world deployment.
Comparing apples to oranges: a scalable solution with heterogeneous hashing
Mingdong Ou, Peng Cui, Fei Wang, Jun Wang, Wenwu Zhu, Shiqiang Yang. DOI: 10.1145/2487575.2487668

Although hashing techniques have been popular for large-scale similarity search, most existing methods for designing optimal hash functions focus on homogeneous similarity assessment, i.e., the data entities to be indexed are of the same type. Since heterogeneous entities and relationships are also ubiquitous in real-world applications, there is an emerging need to retrieve and search similar or relevant data entities from multiple heterogeneous domains, e.g., recommending relevant posts and images to a certain Facebook user. In this paper, we address the problem of "comparing apples to oranges" at large scale. Specifically, we propose a novel Relation-aware Heterogeneous Hashing (RaHH) method, which provides a general framework for generating hash codes of data entities sitting in multiple heterogeneous domains. Unlike some existing hashing methods that map heterogeneous data into a common Hamming space, the RaHH approach constructs a Hamming space for each type of data entity and learns optimal mappings between them simultaneously. This allows the learned hash codes to flexibly cope with the characteristics of different data domains. Moreover, the RaHH framework encodes both homogeneous and heterogeneous relationships between the data entities to design hash functions with improved accuracy. To validate the proposed RaHH method, we conduct extensive evaluations on two large datasets: one crawled from a popular social media site, Tencent Weibo, and the other an open Flickr dataset (NUS-WIDE). The experimental results clearly demonstrate that RaHH outperforms several state-of-the-art hashing methods with significant performance gains.
Empirical Bayes model to combine signals of adverse drug reactions
R. Harpaz, W. DuMouchel, P. LePendu, N. Shah. DOI: 10.1145/2487575.2488214

Data mining is a crucial tool for identifying risk signals of potential adverse drug reactions (ADRs). However, mining of ADR signals is currently limited to leveraging a single data source at a time. It is widely believed that combining ADR evidence from multiple data sources will result in a more accurate risk identification system. We present a methodology based on empirical Bayes modeling to combine ADR signals mined from ~5 million adverse event reports collected by the FDA and from healthcare data corresponding to 46 million patients, the two main types of information sources currently employed for signal detection. Based on four sets of test cases (gold standards), we demonstrate that our method leads to a statistically significant and substantial improvement in signal detection accuracy, averaging 40% over the use of each source independently, and an area under the ROC curve of 0.87. We also compare the method with alternative supervised learning approaches, and argue that our approach is preferable as it does not require labeled (training) samples, whose availability is currently limited. To our knowledge, this is the first effort to combine signals from these two complementary data sources, and to demonstrate the benefits of a computationally integrative strategy for drug safety surveillance.
A new collaborative filtering approach for increasing the aggregate diversity of recommender systems
K. Niemann, M. Wolpers. DOI: 10.1145/2487575.2487656

To satisfy and positively surprise its users, a recommender system needs to recommend items the users will like and most probably would not have found on their own. This requires the recommender system to recommend a broader range of items, including niche items. Such an approach also supports online stores, which often offer more items than traditional stores and need recommender systems to help users find the less popular items as well. However, popular items with plenty of usage data are easier to recommend, so niche items are often excluded from the recommendations. In this paper, we propose a new collaborative filtering approach based on the items' usage contexts. The approach increases the rating predictions for niche items with less usage data available and improves the aggregate diversity of the recommendations.
Location-aware publish/subscribe
Guoliang Li, Yang Wang, Ting Wang, Jianhua Feng. DOI: 10.1145/2487575.2487617

Location-based services have become widely available on mobile devices. Existing methods employ a pull (user-initiated) model, where a user issues a query to a server, which replies with location-aware answers. To provide users with instant replies, a push (server-initiated) model is becoming an inevitable computing model for next-generation location-based services. In the push model, subscribers register spatio-textual subscriptions to capture their interests, and publishers post spatio-textual messages. This calls for a high-performance location-aware publish/subscribe system to deliver publishers' messages to relevant subscribers. In this paper, we address the research challenges that arise in designing a location-aware publish/subscribe system. We propose an R-tree-based index structure that integrates textual descriptions into R-tree nodes. We devise efficient filtering algorithms and develop effective pruning techniques to improve filtering efficiency. Experimental results show that our method achieves high performance: for example, it can filter 500 tweets per second against 10 million registered subscriptions on a commodity computer.
Robust sparse estimation of multiresponse regression and inverse covariance matrix via the L2 distance
A. Lozano, Huijing Jiang, Xinwei Deng. DOI: 10.1145/2487575.2487667

We propose a robust framework to jointly perform two key modeling tasks involving high dimensional data: (i) learning a sparse functional mapping from multiple predictors to multiple responses while taking advantage of the coupling among responses, and (ii) estimating the conditional dependency structure among responses while adjusting for their predictors. The traditional likelihood-based estimators lack resilience with respect to outliers and model misspecification. This issue is exacerbated when dealing with high dimensional noisy data. In this work, we propose instead to minimize a regularized distance criterion, which is motivated by the minimum distance functionals used in nonparametric methods for their excellent robustness properties. The proposed estimates can be obtained efficiently by leveraging a sequential quadratic programming algorithm. We provide theoretical justification, such as estimation consistency, for the proposed estimator. Additionally, we shed light on the robustness of our estimator through its linearization, which yields a combination of weighted lasso and graphical lasso, with the sample weights providing an intuitive explanation of the robustness. We demonstrate the merits of our framework through a simulation study and the analysis of real financial and genetics data.