Reserve Price Optimization at Scale
Daniel Austin, Samuel S. Seljan, Julius Monello, Stephanie Tzeng
Online advertising is a multi-billion dollar industry largely responsible for keeping most online content free and content creators ("publishers") in business. In one aspect of advertising sales, impressions are auctioned off in second-price auctions on an auction-by-auction basis through what is known as real-time bidding (RTB). An important mechanism through which publishers can influence how much revenue they earn is reserve pricing in RTB auctions. The optimal reserve price problem is well studied in both the applied and academic literatures. However, few solutions are suited to RTB, where billions of auctions for ad space on millions of different sites and Internet users are conducted each day among bidders with heterogeneous valuations. In particular, existing solutions are not robust to violations of assumptions common in auction theory and do not scale to processing terabytes of data each hour, a high-dimensional feature space, and a fast-changing demand landscape. In this paper, we describe a scalable, online, real-time, incrementally updated reserve price optimizer for RTB that is currently implemented as part of the AppNexus Publisher Suite. Our solution applies an online learning approach, maximizing a custom cost function suited to reserve price optimization. We demonstrate scalability and feasibility with results from the reserve price optimizer deployed in a production environment. In production, the average revenue lift was 34.4%, with a 95% confidence interval of (33.2%, 35.6%), over more than 8 billion auctions across 46 days, a substantial increase over non-optimized and often manually set rule-based reserve prices.
{"title":"Reserve Price Optimization at Scale","authors":"Daniel Austin, Samuel S. Seljan, Julius Monello, Stephanie Tzeng","doi":"10.1109/DSAA.2016.32","DOIUrl":"https://doi.org/10.1109/DSAA.2016.32","url":null,"abstract":"Online advertising is a multi-billion dollar industry largely responsible for keeping most online content free and content creators (\"publishers\") in business. In one aspect of advertising sales, impressions are auctioned off in second price auctions on an auction-by-auction basis through what is known as real-time bidding (RTB). An important mechanism through which publishers can influence how much revenue they earn is reserve pricing in RTB auctions. The optimal reserve price problem is well studied in both applied and academic literatures. However, few solutions are suited to RTB, where billions of auctions for ad space on millions of different sites and Internet users are conducted each day among bidders with heterogenous valuations. In particular, existing solutions are not robust to violations of assumptions common in auction theory and do not scale to processing terabytes of data each hour, a high dimensional feature space, and a fast changing demand landscape. In this paper, we describe a scalable, online, real-time, incrementally updated reserve price optimizer for RTB that is currently implemented as part of the AppNexus Publisher Suite. Our solution applies an online learning approach, maximizing a custom cost function suited to reserve price optimization. We demonstrate the scalability and feasibility with the results from the reserve price optimizer deployed in a production environment. In the production deployed optimizer, the average revenue lift was 34.4% with 95% confidence intervals (33.2%, 35.6%) from more than 8 billion auctions over 46 days, a substantial increase over non-optimized and often manually set rule based reserve prices.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115839311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Overlapping Target Event and Story Line Detection of Online Newspaper Articles
Yifang Wei, L. Singh, Brian Gallagher, David J. Buttler
Event detection from text data is an active area of research. While the emphasis has been on event identification and labeling using a single data source, this work considers event and story line detection across a large number of data sources. In this setting, it is natural for different events in the same domain (e.g., violence, sports, politics) to occur at the same time and for different story lines about the same event to emerge. To capture events in this setting, we propose an algorithm that detects events and the story lines within them for a target domain. Our algorithm leverages a multi-relational, sentence-level semantic graph and well-known graph properties to identify overlapping events and the story lines within the events. We evaluate our approach on two large data sets containing millions of news articles from a large number of sources. Our empirical analysis shows that our approach improves detection precision and recall by 10% to 25% while providing complete event summaries.
{"title":"Overlapping Target Event and Story Line Detection of Online Newspaper Articles","authors":"Yifang Wei, L. Singh, Brian Gallagher, David J. Buttler","doi":"10.1109/DSAA.2016.30","DOIUrl":"https://doi.org/10.1109/DSAA.2016.30","url":null,"abstract":"Event detection from text data is an active area of research. While the emphasis has been on event identification and labeling using a single data source, this work considers event and story line detection when using a large number of data sources. In this setting, it is natural for different events in the same domain, e.g. violence, sports, politics, to occur at the same time and for different story lines about the same event to emerge. To capture events in this setting, we propose an algorithm that detects events and story lines about events for a target domain. Our algorithm leverages a multi-relational sentence level semantic graph and well known graph properties to identify overlapping events and story lines within the events. We evaluate our approach on two large data sets containing millions of news articles from a large number of sources. Our empirical analysis shows that our approach improves the detection precision and recall by 10% to 25%, while providing complete event summaries.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116078791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysing the History of Autism Spectrum Disorder Using Topic Models
Adham Beykikhoshk, Dinh Q. Phung, Ognjen Arandjelovic, S. Venkatesh
We describe a novel framework for discovering the underlying topics of a longitudinal collection of scholarly data and tracking their lifetime and popularity over time. Unlike social media or news data, where the underlying topics themselves evolve over time, in science new topic nuances give rise to new research directions. We therefore model longitudinal literature data with a new approach that uses topics which remain identifiable over the course of time. Current studies either disregard the time dimension, treating it as an exchangeable covariate and fixing the topics over time, or model time naturally but do not share topics across epochs. We address these issues by adopting a non-parametric Bayesian approach. We assume the data is partially exchangeable and divide it into consecutive epochs. Then, by fixing the topics in a recurrent Chinese restaurant franchise, we impose a static topical structure on the corpus such that topics are shared across epochs and across the documents within each epoch. We demonstrate the effectiveness of the proposed framework on a collection of medical literature related to autism spectrum disorder. We collect a large corpus of publications and carefully examine two important research issues of the domain as case studies. Moreover, we make our experimental results and the model's source code freely available to the public, so that other researchers can analyse our results or apply the model to their own data collections.
{"title":"Analysing the History of Autism Spectrum Disorder Using Topic Models","authors":"Adham Beykikhoshk, Dinh Q. Phung, Ognjen Arandjelovic, S. Venkatesh","doi":"10.1109/DSAA.2016.65","DOIUrl":"https://doi.org/10.1109/DSAA.2016.65","url":null,"abstract":"We describe a novel framework for the discovery of underlying topics of a longitudinal collection of scholarly data, and the tracking of their lifetime and popularity over time. Unlike the social media or news data where the underlying topics evolve over time, the topic nuances in science result in new scientific directions to emerge. Therefore, we model the longitudinal literature data with a new approach that uses topics which remain identifiable over the course of time. Current studies either disregard the time dimension or treat it as an exchangeable covariate when they fix the topics over time or do not share the topics over epochs when they model the time naturally. We address these issues by adopting a non-parametric Bayesian approach. We assume the data is partially exchangeable and divide it into consecutive epochs. Then, by fixing the topics in a recurrent Chinese restaurant franchise, we impose a static topical structure on the corpus such that the topics are shared across epochs and the documents within epochs. We demonstrate the effectiveness of the proposed framework on a collection of medical literature related to autism spectrum disorder. We collect a large corpus of publications and carefully examine two important research issues of the domain as case studies. Moreover, we make the results of our experiment and the source code of the model, freely available to the public. This aids other researchers to analyse our results or apply the model to their data collections.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127321198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harvester: Influence Optimization in Symmetric Interaction Networks
S. Ivanov, Panagiotis Karras
The problem of optimizing influence diffusion in a network has applications in areas such as marketing, disease control, social media analytics, and more. In all cases, an initial set of influencers is chosen so as to optimize influence propagation. While a lot of research has been devoted to the influence maximization problem, most solutions proposed to date apply to directed networks, considering the undirected case to be solvable as a special case. In this paper, we propose a novel algorithm, Harvester, that achieves results of higher quality than the state of the art on symmetric interaction networks, leveraging the particular characteristics of such networks. Harvester is based on the aggregation of instances of live-edge graphs, from which we compute the influence potential of each node. We show that this technique can be applied both for influence maximization under a known seed size and for the dual problem of seed minimization under a target influence spread. Our experimental study with real data sets demonstrates that: (a) Harvester outperforms the state-of-the-art method, IMM, in terms of both influence spread and seed size; (b) its variant for the seed minimization problem yields good seed size estimates, reducing the number of required trial influence spread estimations by a factor of two; and (c) it is scalable with growing graph size and robust to varying edge influence probabilities.
{"title":"Harvester: Influence Optimization in Symmetric Interaction Networks","authors":"S. Ivanov, Panagiotis Karras","doi":"10.1109/DSAA.2016.95","DOIUrl":"https://doi.org/10.1109/DSAA.2016.95","url":null,"abstract":"The problem of optimizing influence diffusion ina network has applications in areas such as marketing, diseasecontrol, social media analytics, and more. In all cases, an initial setof influencers are chosen so as to optimize influence propagation.While a lot of research has been devoted to the influencemaximization problem, most solutions proposed to date applyon directed networks, considering the undirected case to besolvable as a special case. In this paper, we propose a novelalgorithm, Harvester, that achieves results of higher quality thanthe state of the art on symmetric interaction networks, leveragingthe particular characteristics of such networks. Harvester isbased on the aggregation of instances of live-edge graphs, fromwhich we compute the influence potential of each node. Weshow that this technique can be applied for both influencemaximization under a known seed size and also for the dualproblem of seed minimization under a target influence spread.Our experimental study with real data sets demonstrates that:(a) Harvester outperforms the state-of-the-art method, IMM,in terms of both influence spread and seed size; and (b) itsvariant for the seed minimization problem yields good seed sizeestimates, reducing the number of required trial influence spreadestimations by a factor of two; and (c) it is scalable with growinggraph size and robust to variant edge influence probabilities.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125720303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traffic Risk Mining Using Partially Ordered Non-Negative Matrix Factorization
Taito Lee, Shin Matsushima, K. Yamanishi
A large amount of traffic-related data, including traffic statistics, accident statistics, road information, and drivers' and pedestrians' comments, is being collected through sensors and social media networks. We focus on the problem of extracting traffic risk factors from such heterogeneous data and ranking locations according to the extracted factors. In general, it is difficult to define traffic risk. We may adopt a clustering approach to identify groups of risky locations, where the risk factor is extracted by comparing the groups. Furthermore, we may utilize prior knowledge in the form of partially ordered relations, such as that a specific location should be riskier than others. In this paper, we propose a novel method for traffic risk mining that unifies the clustering approach with prior knowledge about order relations. Specifically, we propose the partially ordered non-negative matrix factorization (PONMF) algorithm, which is capable of clustering locations under partially ordered relations among them. The key idea is to employ the multiplicative update rule as well as the gradient descent rule for parameter estimation. Through experiments conducted using synthetic and real data sets, we show that PONMF can identify clusters that include high-risk roads and extract their risk factors.
{"title":"Traffic Risk Mining Using Partially Ordered Non-Negative Matrix Factorization","authors":"Taito Lee, Shin Matsushima, K. Yamanishi","doi":"10.1109/DSAA.2016.71","DOIUrl":"https://doi.org/10.1109/DSAA.2016.71","url":null,"abstract":"A large amount of traffic-related data, including traffic statistics, accident statistics, road information, and drivers' and pedestrians' comments, is being collected through sensors and social media networks. We focus on the issue of extracting traffic risk factors from such heterogeneous data and ranking locations according to the extracted factors. In general, it is difficult to define traffic risk. We may adopt a clustering approach to identify groups of risky locations, where the risk factor is extracted by comparing the groups. Furthermore, we may utilize prior knowledge about partially ordered relations such that a specific location should be more risky than others. In this paper, we propose a novel method for traffic risk mining by unifying the clustering approach with prior knowledge with respect to order relations. Specifically, we propose the partially ordered non-negative matrix factorization (PONMF) algorithm, which is capable of clustering locations under partially ordered relations among them. The key idea is to employ the multiplicative update rule as well as the gradient descent rule for parameter estimation. Through experiments conducted using synthetic and real data sets, we show that PONMF can identify clusters that include high-risk roads and extract their risk factors.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126688003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Synthetic Data Vault
Neha Patki, Roy Wedge, K. Veeramachaneni
The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.
{"title":"The Synthetic Data Vault","authors":"Neha Patki, Roy Wedge, K. Veeramachaneni","doi":"10.1109/DSAA.2016.49","DOIUrl":"https://doi.org/10.1109/DSAA.2016.49","url":null,"abstract":"The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115120772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Active Semi-Supervised Classification Based on Multiple Clustering Hierarchies
Antonio J. L. Batista, R. Campello, J. Sander
Active semi-supervised learning can play an important role in classification scenarios in which labeled data are difficult to obtain while unlabeled data can be easily acquired. This paper focuses on an active semi-supervised algorithm that can be driven by multiple clustering hierarchies. If one or more hierarchies can reasonably align clusters with class labels, then only a few queries are needed to label all the unlabeled data with high quality. We take as a starting point the well-known Hierarchical Sampling (HS) algorithm and change different aspects of the original algorithm in order to tackle its main drawbacks, including its sensitivity to the choice of a single particular hierarchy. Experimental results over many real datasets show that the proposed algorithm performs better than, or competitively with, a number of state-of-the-art algorithms for active semi-supervised classification.
{"title":"Active Semi-Supervised Classification Based on Multiple Clustering Hierarchies","authors":"Antonio J. L. Batista, R. Campello, J. Sander","doi":"10.1109/DSAA.2016.9","DOIUrl":"https://doi.org/10.1109/DSAA.2016.9","url":null,"abstract":"Active semi-supervised learning can play an important role in classification scenarios in which labeled data are difficult to obtain, while unlabeled data can be easily acquired. This paper focuses on an active semi-supervised algorithm that can be driven by multiple clustering hierarchies. If there is one or more hierarchies that can reasonably align clusters with class labels, then a few queries are needed to label with high quality all the unlabeled data. We take as a starting point the well-known Hierarchical Sampling (HS) algorithm and perform changes in different aspects of the original algorithm in order to tackle its main drawbacks, including its sensitivity to the choice of a single particular hierarchy. Experimental results over many real datasets show that the proposed algorithm performs superior or competitive when compared to a number of state-of-the-art algorithms for active semi-supervised classification.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134576604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Least-Squares Policy Iteration
Jun-Kun Wang, Shou-de Lin
Inspired by recent progress in parallel and distributed optimization, we propose parallel least-squares policy iteration (parallel LSPI) in this paper. LSPI is a policy iteration method for finding an optimal policy for MDPs. As solving MDPs with large state spaces is challenging and time demanding, we propose a parallel variant of LSPI that is capable of leveraging multiple computational resources. Preliminary analysis of the proposed method shows that the sample complexity improves from O(1/√n) to O(1/√(Mn)) for each worker, where n is the number of samples and M is the number of workers. Experiments show the advantages of parallel LSPI compared to the standard non-parallel one.
{"title":"Parallel Least-Squares Policy Iteration","authors":"Jun-Kun Wang, Shou-de Lin","doi":"10.1109/DSAA.2016.24","DOIUrl":"https://doi.org/10.1109/DSAA.2016.24","url":null,"abstract":"Inspired by recent progress in parallel and distributed optimization, we propose parallel least-squares policy iteration (parallel LSPI) in this paper. LSPI is a policy iteration method to find an optimal policy for MDPs. As solving MDPs with large state space is challenging and time demanding, we propose a parallel variant of LSPI which is capable of leveraging multiple computational resources. Preliminary analysis of our proposed method shows that the sample complexity improved from O(1/√n) towards O(1/√Mn) for each worker, where n is the number of samples and M is the number of workers. Experiments show the advantages of parallel LSPI comparing to the standard non-parallel one.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"48 59","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133323286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Large Scale Clustering Based on Data Partitioning
Malika Bendechache, Mohand Tahar Kechadi, Nhien-An Le-Khac
Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges, such as high-dimensional data, heterogeneity, and the high complexity of some algorithms. For instance, some algorithms may have linear complexity but require domain knowledge to determine their input parameters. Distributed clustering techniques constitute a very good alternative for addressing the big data challenges (e.g., volume, variety, veracity, and velocity). Usually these techniques consist of two phases: the first generates local models or patterns, and the second aggregates the local results to obtain global models. While the first phase can be executed in parallel on each site and is therefore efficient, the aggregation phase is complex, time consuming, and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach that deals efficiently with both phases: the generation of local results and the generation of global models by aggregation. For the first phase, our approach is capable of analysing the dataset located at each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms, K-Means and DBSCAN. One key property of this distributed clustering technique is that the number of global clusters is dynamic and need not be fixed in advance. Experimental results show that the approach is scalable and produces high-quality results.
{"title":"Efficient Large Scale Clustering Based on Data Partitioning","authors":"Malika Bendechache, Mohand Tahar Kechadi, Nhien-An Le-Khac","doi":"10.1109/DSAA.2016.70","DOIUrl":"https://doi.org/10.1109/DSAA.2016.70","url":null,"abstract":"Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges such as high-dimensionality data, heterogeneity, and high complexity of some algorithms. For instance, some algorithms may have linear complexity but they require the domain knowledge in order to determine their input parameters. Distributed clustering techniques constitute a very good alternative to the big data challenges (e.g.,Volume, Variety, Veracity, and Velocity). Usually these techniques consist of two phases. The first phase generates local models or patterns and the second one tends to aggregate the local results to obtain global models. While the first phase can be executed in parallel on each site and, therefore, efficient, the aggregation phase is complex, time consuming and may produce incorrect and ambiguous global clusters and therefore incorrect models. In this paper we propose a new distributed clustering approach to deal efficiently with both phases, generation of local results and generation of global models by aggregation. For the first phase, our approach is capable of analysing the datasets located in each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate while the overall process is efficient in time and memory allocation. For the evaluation, we use two well-known clustering algorithms, K-Means and DBSCAN. One of the key outputs of this distributed clustering technique is that the number of global clusters is dynamic, no need to be fixed in advance. Experimental results show that the approach is scalable and produces high quality results.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132029083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Inaccurate Predictions of Pediatric Surgical Durations
Zhengyuan Zhou, Daniel Miller, Neal Master, D. Scheinker, N. Bambos, P. Glynn
Accurate predictions of surgical case lengths are useful for patient scheduling in hospitals. In pediatric hospitals, this prediction problem is particularly difficult. Predictions are typically provided by highly trained medical staff, but these predictions are not necessarily accurate. We present a novel decision support tool that detects when expert predictions are inaccurate so that these predictions can be re-evaluated. We explore several different algorithms. We provide methodological insights and suggest directions for future work.
{"title":"Detecting Inaccurate Predictions of Pediatric Surgical Durations","authors":"Zhengyuan Zhou, Daniel Miller, Neal Master, D. Scheinker, N. Bambos, P. Glynn","doi":"10.1109/DSAA.2016.56","DOIUrl":"https://doi.org/10.1109/DSAA.2016.56","url":null,"abstract":"Accurate predictions of surgical case lengths areuseful for patient scheduling in hospitals. In pediatric hospitals, this prediction problem is particularly difficult. Predictions aretypically provided by highly trained medical staff, but thesepredictions are not necessarily accurate. We present a noveldecision support tool that detects when expert predictions areinaccurate so that these predictions can be re-evaluated. We explore several different algorithms. We provide methodologicalinsights and suggest directions of future work.","PeriodicalId":193885,"journal":{"name":"2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132225053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}