Reserve Price Optimization at Scale
Daniel Austin, Samuel S. Seljan, Julius Monello, Stephanie Tzeng
Online advertising is a multi-billion dollar industry largely responsible for keeping most online content free and content creators ("publishers") in business. In one aspect of advertising sales, impressions are auctioned off in second-price auctions on an auction-by-auction basis through what is known as real-time bidding (RTB). An important mechanism through which publishers can influence how much revenue they earn is reserve pricing in RTB auctions. The optimal reserve price problem is well studied in both the applied and academic literatures. However, few solutions are suited to RTB, where billions of auctions for ad space on millions of different sites and Internet users are conducted each day among bidders with heterogeneous valuations. In particular, existing solutions are not robust to violations of assumptions common in auction theory and do not scale to processing terabytes of data each hour, a high-dimensional feature space, and a fast-changing demand landscape. In this paper, we describe a scalable, online, real-time, incrementally updated reserve price optimizer for RTB that is currently implemented as part of the AppNexus Publisher Suite. Our solution applies an online learning approach, maximizing a custom cost function suited to reserve price optimization. We demonstrate scalability and feasibility with results from the reserve price optimizer deployed in a production environment. In production, the average revenue lift was 34.4%, with a 95% confidence interval of (33.2%, 35.6%), over more than 8 billion auctions across 46 days, a substantial increase over non-optimized and often manually set rule-based reserve prices.
{"title":"Reserve Price Optimization at Scale","authors":"Daniel Austin, Samuel S. Seljan, Julius Monello, Stephanie Tzeng","doi":"10.1109/DSAA.2016.32","DOIUrl":"https://doi.org/10.1109/DSAA.2016.32","url":null,"abstract":"Online advertising is a multi-billion dollar industry largely responsible for keeping most online content free and content creators (\"publishers\") in business. In one aspect of advertising sales, impressions are auctioned off in second price auctions on an auction-by-auction basis through what is known as real-time bidding (RTB). An important mechanism through which publishers can influence how much revenue they earn is reserve pricing in RTB auctions. The optimal reserve price problem is well studied in both applied and academic literatures. However, few solutions are suited to RTB, where billions of auctions for ad space on millions of different sites and Internet users are conducted each day among bidders with heterogenous valuations. In particular, existing solutions are not robust to violations of assumptions common in auction theory and do not scale to processing terabytes of data each hour, a high dimensional feature space, and a fast changing demand landscape. In this paper, we describe a scalable, online, real-time, incrementally updated reserve price optimizer for RTB that is currently implemented as part of the AppNexus Publisher Suite. Our solution applies an online learning approach, maximizing a custom cost function suited to reserve price optimization. We demonstrate the scalability and feasibility with the results from the reserve price optimizer deployed in a production environment. In the production deployed optimizer, the average revenue lift was 34.4% with 95% confidence intervals (33.2%, 35.6%) from more than 8 billion auctions over 46 days, a substantial increase over non-optimized and often manually set rule based reserve prices.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115839311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Overlapping Target Event and Story Line Detection of Online Newspaper Articles
Yifang Wei, L. Singh, Brian Gallagher, David J. Buttler
Event detection from text data is an active area of research. While the emphasis has been on event identification and labeling using a single data source, this work considers event and story line detection across a large number of data sources. In this setting, it is natural for different events in the same domain (e.g., violence, sports, politics) to occur at the same time and for different story lines about the same event to emerge. To capture events in this setting, we propose an algorithm that detects events and the story lines within them for a target domain. Our algorithm leverages a multi-relational, sentence-level semantic graph and well-known graph properties to identify overlapping events and the story lines within the events. We evaluate our approach on two large data sets containing millions of news articles from a large number of sources. Our empirical analysis shows that our approach improves detection precision and recall by 10% to 25% while providing complete event summaries.
{"title":"Overlapping Target Event and Story Line Detection of Online Newspaper Articles","authors":"Yifang Wei, L. Singh, Brian Gallagher, David J. Buttler","doi":"10.1109/DSAA.2016.30","DOIUrl":"https://doi.org/10.1109/DSAA.2016.30","url":null,"abstract":"Event detection from text data is an active area of research. While the emphasis has been on event identification and labeling using a single data source, this work considers event and story line detection when using a large number of data sources. In this setting, it is natural for different events in the same domain, e.g. violence, sports, politics, to occur at the same time and for different story lines about the same event to emerge. To capture events in this setting, we propose an algorithm that detects events and story lines about events for a target domain. Our algorithm leverages a multi-relational sentence level semantic graph and well known graph properties to identify overlapping events and story lines within the events. We evaluate our approach on two large data sets containing millions of news articles from a large number of sources. Our empirical analysis shows that our approach improves the detection precision and recall by 10% to 25%, while providing complete event summaries.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116078791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysing the History of Autism Spectrum Disorder Using Topic Models
Adham Beykikhoshk, Dinh Q. Phung, Ognjen Arandjelovic, S. Venkatesh
We describe a novel framework for discovering the underlying topics of a longitudinal collection of scholarly data and tracking their lifetime and popularity over time. Unlike social media or news data, where the underlying topics themselves evolve over time, in science new topic nuances give rise to new research directions. We therefore model longitudinal literature data with a new approach that uses topics which remain identifiable over the course of time. Current studies either disregard the time dimension, treating it as an exchangeable covariate and fixing the topics over time, or model time naturally but do not share topics across epochs. We address these issues by adopting a non-parametric Bayesian approach. We assume the data is partially exchangeable and divide it into consecutive epochs. Then, by fixing the topics in a recurrent Chinese restaurant franchise, we impose a static topical structure on the corpus such that topics are shared across epochs and across the documents within each epoch. We demonstrate the effectiveness of the proposed framework on a collection of medical literature related to autism spectrum disorder. We collect a large corpus of publications and carefully examine two important research issues of the domain as case studies. Moreover, we make our experimental results and the model's source code freely available to the public, so that other researchers can analyse our results or apply the model to their own data collections.
{"title":"Analysing the History of Autism Spectrum Disorder Using Topic Models","authors":"Adham Beykikhoshk, Dinh Q. Phung, Ognjen Arandjelovic, S. Venkatesh","doi":"10.1109/DSAA.2016.65","DOIUrl":"https://doi.org/10.1109/DSAA.2016.65","url":null,"abstract":"We describe a novel framework for the discovery of underlying topics of a longitudinal collection of scholarly data, and the tracking of their lifetime and popularity over time. Unlike the social media or news data where the underlying topics evolve over time, the topic nuances in science result in new scientific directions to emerge. Therefore, we model the longitudinal literature data with a new approach that uses topics which remain identifiable over the course of time. Current studies either disregard the time dimension or treat it as an exchangeable covariate when they fix the topics over time or do not share the topics over epochs when they model the time naturally. We address these issues by adopting a non-parametric Bayesian approach. We assume the data is partially exchangeable and divide it into consecutive epochs. Then, by fixing the topics in a recurrent Chinese restaurant franchise, we impose a static topical structure on the corpus such that the topics are shared across epochs and the documents within epochs. We demonstrate the effectiveness of the proposed framework on a collection of medical literature related to autism spectrum disorder. We collect a large corpus of publications and carefully examine two important research issues of the domain as case studies. Moreover, we make the results of our experiment and the source code of the model, freely available to the public. This aids other researchers to analyse our results or apply the model to their data collections.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127321198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harvester: Influence Optimization in Symmetric Interaction Networks
S. Ivanov, Panagiotis Karras
The problem of optimizing influence diffusion in a network has applications in areas such as marketing, disease control, social media analytics, and more. In all cases, an initial set of influencers is chosen so as to optimize influence propagation. While a lot of research has been devoted to the influence maximization problem, most solutions proposed to date apply to directed networks, considering the undirected case to be solvable as a special case. In this paper, we propose a novel algorithm, Harvester, that achieves results of higher quality than the state of the art on symmetric interaction networks, leveraging the particular characteristics of such networks. Harvester is based on the aggregation of instances of live-edge graphs, from which we compute the influence potential of each node. We show that this technique can be applied both for influence maximization under a known seed size and for the dual problem of seed minimization under a target influence spread. Our experimental study with real data sets demonstrates that: (a) Harvester outperforms the state-of-the-art method, IMM, in terms of both influence spread and seed size; (b) its variant for the seed minimization problem yields good seed size estimates, reducing the number of required trial influence spread estimations by a factor of two; and (c) it is scalable with growing graph size and robust to varying edge influence probabilities.
{"title":"Harvester: Influence Optimization in Symmetric Interaction Networks","authors":"S. Ivanov, Panagiotis Karras","doi":"10.1109/DSAA.2016.95","DOIUrl":"https://doi.org/10.1109/DSAA.2016.95","url":null,"abstract":"The problem of optimizing influence diffusion ina network has applications in areas such as marketing, diseasecontrol, social media analytics, and more. In all cases, an initial setof influencers are chosen so as to optimize influence propagation.While a lot of research has been devoted to the influencemaximization problem, most solutions proposed to date applyon directed networks, considering the undirected case to besolvable as a special case. In this paper, we propose a novelalgorithm, Harvester, that achieves results of higher quality thanthe state of the art on symmetric interaction networks, leveragingthe particular characteristics of such networks. Harvester isbased on the aggregation of instances of live-edge graphs, fromwhich we compute the influence potential of each node. Weshow that this technique can be applied for both influencemaximization under a known seed size and also for the dualproblem of seed minimization under a target influence spread.Our experimental study with real data sets demonstrates that:(a) Harvester outperforms the state-of-the-art method, IMM,in terms of both influence spread and seed size; and (b) itsvariant for the seed minimization problem yields good seed sizeestimates, reducing the number of required trial influence spreadestimations by a factor of two; and (c) it is scalable with growinggraph size and robust to variant edge influence probabilities.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125720303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traffic Risk Mining Using Partially Ordered Non-Negative Matrix Factorization
Taito Lee, Shin Matsushima, K. Yamanishi
A large amount of traffic-related data, including traffic statistics, accident statistics, road information, and drivers' and pedestrians' comments, is being collected through sensors and social media networks. We focus on the problem of extracting traffic risk factors from such heterogeneous data and ranking locations according to the extracted factors. In general, it is difficult to define traffic risk. We may adopt a clustering approach to identify groups of risky locations, where the risk factor is extracted by comparing the groups. Furthermore, we may utilize prior knowledge in the form of partially ordered relations, such as that a specific location should be riskier than others. In this paper, we propose a novel method for traffic risk mining that unifies the clustering approach with prior knowledge about order relations. Specifically, we propose the partially ordered non-negative matrix factorization (PONMF) algorithm, which is capable of clustering locations under partially ordered relations among them. The key idea is to employ the multiplicative update rule as well as the gradient descent rule for parameter estimation. Through experiments conducted using synthetic and real data sets, we show that PONMF can identify clusters that include high-risk roads and extract their risk factors.
{"title":"Traffic Risk Mining Using Partially Ordered Non-Negative Matrix Factorization","authors":"Taito Lee, Shin Matsushima, K. Yamanishi","doi":"10.1109/DSAA.2016.71","DOIUrl":"https://doi.org/10.1109/DSAA.2016.71","url":null,"abstract":"A large amount of traffic-related data, including traffic statistics, accident statistics, road information, and drivers' and pedestrians' comments, is being collected through sensors and social media networks. We focus on the issue of extracting traffic risk factors from such heterogeneous data and ranking locations according to the extracted factors. In general, it is difficult to define traffic risk. We may adopt a clustering approach to identify groups of risky locations, where the risk factor is extracted by comparing the groups. Furthermore, we may utilize prior knowledge about partially ordered relations such that a specific location should be more risky than others. In this paper, we propose a novel method for traffic risk mining by unifying the clustering approach with prior knowledge with respect to order relations. Specifically, we propose the partially ordered non-negative matrix factorization (PONMF) algorithm, which is capable of clustering locations under partially ordered relations among them. The key idea is to employ the multiplicative update rule as well as the gradient descent rule for parameter estimation. Through experiments conducted using synthetic and real data sets, we show that PONMF can identify clusters that include high-risk roads and extract their risk factors.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126688003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Synthetic Data Vault
Neha Patki, Roy Wedge, K. Veeramachaneni
The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.
{"title":"The Synthetic Data Vault","authors":"Neha Patki, Roy Wedge, K. Veeramachaneni","doi":"10.1109/DSAA.2016.49","DOIUrl":"https://doi.org/10.1109/DSAA.2016.49","url":null,"abstract":"The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115120772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Semi-Supervised Classification Based on Multiple Clustering Hierarchies
Antonio J. L. Batista, R. Campello, J. Sander
Active semi-supervised learning can play an important role in classification scenarios in which labeled data are difficult to obtain while unlabeled data can be easily acquired. This paper focuses on an active semi-supervised algorithm that can be driven by multiple clustering hierarchies. If one or more hierarchies can reasonably align clusters with class labels, then only a few queries are needed to label all the unlabeled data with high quality. We take as a starting point the well-known Hierarchical Sampling (HS) algorithm and change different aspects of the original algorithm in order to tackle its main drawbacks, including its sensitivity to the choice of a single particular hierarchy. Experimental results over many real datasets show that the proposed algorithm performs better than, or competitively with, a number of state-of-the-art algorithms for active semi-supervised classification.
{"title":"Active Semi-Supervised Classification Based on Multiple Clustering Hierarchies","authors":"Antonio J. L. Batista, R. Campello, J. Sander","doi":"10.1109/DSAA.2016.9","DOIUrl":"https://doi.org/10.1109/DSAA.2016.9","url":null,"abstract":"Active semi-supervised learning can play an important role in classification scenarios in which labeled data are difficult to obtain, while unlabeled data can be easily acquired. This paper focuses on an active semi-supervised algorithm that can be driven by multiple clustering hierarchies. If there is one or more hierarchies that can reasonably align clusters with class labels, then a few queries are needed to label with high quality all the unlabeled data. We take as a starting point the well-known Hierarchical Sampling (HS) algorithm and perform changes in different aspects of the original algorithm in order to tackle its main drawbacks, including its sensitivity to the choice of a single particular hierarchy. Experimental results over many real datasets show that the proposed algorithm performs superior or competitive when compared to a number of state-of-the-art algorithms for active semi-supervised classification.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134576604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Least-Squares Policy Iteration
Jun-Kun Wang, Shou-de Lin
Inspired by recent progress in parallel and distributed optimization, we propose parallel least-squares policy iteration (parallel LSPI) in this paper. LSPI is a policy iteration method for finding an optimal policy for MDPs. As solving MDPs with large state spaces is challenging and time demanding, we propose a parallel variant of LSPI that is capable of leveraging multiple computational resources. Preliminary analysis of the proposed method shows that the sample complexity improves from O(1/√n) to O(1/√(Mn)) for each worker, where n is the number of samples and M is the number of workers. Experiments show the advantages of parallel LSPI compared to the standard non-parallel one.
{"title":"Parallel Least-Squares Policy Iteration","authors":"Jun-Kun Wang, Shou-de Lin","doi":"10.1109/DSAA.2016.24","DOIUrl":"https://doi.org/10.1109/DSAA.2016.24","url":null,"abstract":"Inspired by recent progress in parallel and distributed optimization, we propose parallel least-squares policy iteration (parallel LSPI) in this paper. LSPI is a policy iteration method to find an optimal policy for MDPs. As solving MDPs with large state space is challenging and time demanding, we propose a parallel variant of LSPI which is capable of leveraging multiple computational resources. Preliminary analysis of our proposed method shows that the sample complexity improved from O(1/√n) towards O(1/√Mn) for each worker, where n is the number of samples and M is the number of workers. Experiments show the advantages of parallel LSPI comparing to the standard non-parallel one.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"48 59","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133323286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Large Scale Clustering Based on Data Partitioning
Malika Bendechache, Mohand Tahar Kechadi, Nhien-An Le-Khac
Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges, such as high-dimensional data, heterogeneity, and the high complexity of some algorithms. For instance, some algorithms may have linear complexity but require domain knowledge to determine their input parameters. Distributed clustering techniques constitute a very good alternative for addressing the big data challenges (e.g., volume, variety, veracity, and velocity). Usually these techniques consist of two phases: the first generates local models or patterns, and the second aggregates the local results to obtain global models. While the first phase can be executed in parallel on each site and is therefore efficient, the aggregation phase is complex, time consuming, and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach that deals efficiently with both phases: the generation of local results and the generation of global models by aggregation. For the first phase, our approach is capable of analysing the dataset located at each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms, K-Means and DBSCAN. One key property of this distributed clustering technique is that the number of global clusters is dynamic and need not be fixed in advance. Experimental results show that the approach is scalable and produces high-quality results.
{"title":"Efficient Large Scale Clustering Based on Data Partitioning","authors":"Malika Bendechache, Mohand Tahar Kechadi, Nhien-An Le-Khac","doi":"10.1109/DSAA.2016.70","DOIUrl":"https://doi.org/10.1109/DSAA.2016.70","url":null,"abstract":"Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges such as high-dimensionality data, heterogeneity, and high complexity of some algorithms. For instance, some algorithms may have linear complexity but they require the domain knowledge in order to determine their input parameters. Distributed clustering techniques constitute a very good alternative to the big data challenges (e.g.,Volume, Variety, Veracity, and Velocity). Usually these techniques consist of two phases. The first phase generates local models or patterns and the second one tends to aggregate the local results to obtain global models. While the first phase can be executed in parallel on each site and, therefore, efficient, the aggregation phase is complex, time consuming and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach to deal efficiently with both phases, generation of local results and generation of global models by aggregation. For the first phase, our approach is capable of analysing the datasets located in each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms, K-Means and DBSCAN. One of the key outputs of this distributed clustering technique is that the number of global clusters is dynamic, no need to be fixed in advance. Experimental results show that the approach is scalable and produces high quality results.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132029083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Inaccurate Predictions of Pediatric Surgical Durations
Zhengyuan Zhou, Daniel Miller, Neal Master, D. Scheinker, N. Bambos, P. Glynn
Accurate predictions of surgical case lengths are useful for patient scheduling in hospitals. In pediatric hospitals, this prediction problem is particularly difficult. Predictions are typically provided by highly trained medical staff, but these predictions are not necessarily accurate. We present a novel decision support tool that detects when expert predictions are inaccurate so that these predictions can be re-evaluated. We explore several different algorithms. We provide methodological insights and suggest directions for future work.
{"title":"Detecting Inaccurate Predictions of Pediatric Surgical Durations","authors":"Zhengyuan Zhou, Daniel Miller, Neal Master, D. Scheinker, N. Bambos, P. Glynn","doi":"10.1109/DSAA.2016.56","DOIUrl":"https://doi.org/10.1109/DSAA.2016.56","url":null,"abstract":"Accurate predictions of surgical case lengths areuseful for patient scheduling in hospitals. In pediatric hospitals, this prediction problem is particularly difficult. Predictions aretypically provided by highly trained medical staff, but thesepredictions are not necessarily accurate. We present a noveldecision support tool that detects when expert predictions areinaccurate so that these predictions can be re-evaluated. We explore several different algorithms. We provide methodologicalinsights and suggest directions of future work.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132225053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}