Latest publications: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
New algorithms for parking demand management and a city-scale deployment
O. Zoeter, C. Dance, S. Clinchant, J. Andreoli
On-street parking, just as any publicly owned utility, is used inefficiently if access is free or priced very far from market rates. This paper introduces a novel demand management solution: using data from dedicated occupancy sensors, an iteration scheme updates parking rates to better match demand. The new rates encourage parkers to avoid peak hours and peak locations, reducing congestion and underuse. The solution is deliberately simple so that it is easy to understand, easily seen to be fair, and leads to parking policies that are easy to remember and act upon. We study the convergence properties of the iteration scheme and prove that it converges to a reasonable distribution for a very large class of models. The algorithm has been in use to adjust parking rates in over 6,000 spaces in downtown Los Angeles since June 2012 as part of the LA Express Park project. Initial results are encouraging, showing a reduction of congestion and underuse, with rates decreased in more locations than increased.
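The iteration scheme can be sketched as a simple occupancy-driven rate update: raise rates where sensors report congestion, lower them where spaces sit underused. The target occupancy band, step size, and rate floor below are hypothetical illustrations, not values from the LA Express Park deployment:

```python
# Minimal sketch of a demand-responsive parking rate iteration.
# The occupancy band (60-80%), step size, and rate floor are
# hypothetical; the abstract does not specify the actual update rule.

def update_rate(rate, occupancy, low=0.6, high=0.8, step=0.5, min_rate=0.5):
    """Raise the hourly rate where occupancy exceeds the target band
    (congestion), lower it where occupancy falls below it (underuse)."""
    if occupancy > high:
        return rate + step
    if occupancy < low:
        return max(min_rate, rate - step)
    return rate  # within the target band: leave the rate unchanged

def iterate_rates(rates, occupancies):
    """One iteration over all block faces / time bands."""
    return {k: update_rate(rates[k], occupancies[k]) for k in rates}

rates = {"block_a_peak": 3.0, "block_b_offpeak": 2.0, "block_c": 1.0}
occ = {"block_a_peak": 0.95, "block_b_offpeak": 0.30, "block_c": 0.70}
new_rates = iterate_rates(rates, occ)
```

A rule this simple is easy for parkers to remember and easy to audit for fairness, which is the design goal stated in the abstract.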
DOI: 10.1145/2623330.2623359 · Published 2014-08-24 · Citations: 20
Data science through the lens of social science
D. Conway
In this talk, Drew will examine data science through the lens of the social scientist. He will discuss how the various skills and disciplines combine into data science. Drew will also present a motivating example directly from his work as a senior advisor to NYC's Mayor's Office of Analytics.
DOI: 10.1145/2623330.2630824 · Published 2014-08-24 · Citations: 3
Personalized search result diversification via structured learning
Shangsong Liang, Z. Ren, M. de Rijke
This paper is concerned with the problem of personalized diversification of search results, with the goal of enhancing the performance of both plain diversification and plain personalization algorithms. In previous work, the problem has mainly been tackled by means of unsupervised learning. To further enhance the performance, we propose a supervised learning strategy. Specifically, we set up a structured learning framework for conducting supervised personalized diversification, in which we add features extracted directly from the tokens of documents, features utilized by unsupervised personalized diversification algorithms, and, importantly, features generated from our proposed user-interest latent Dirichlet topic model. Based on this topic model, our learning strategy can estimate whether a document caters to a user's interest. We also define two constraints in our structured learning framework to ensure that search results are both diversified and consistent with a user's interest. We conduct experiments on an open personalized diversification dataset and find that our supervised learning strategy outperforms unsupervised personalized diversification methods as well as other plain personalization and plain diversification methods.
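For reference, the unsupervised baselines this work builds on can be approximated by a greedy re-ranker that trades off relevance, a user-interest score, and redundancy. The sketch below is a simple MMR-style baseline with arbitrary weights, not the paper's structured-learning method:

```python
# Greedy personalized-diversification baseline (MMR-style sketch).
# The weights a, b, c are arbitrary illustrations, not learned values.

def rerank(docs, rel, interest, sim, k=3, a=0.5, b=0.3, c=0.2):
    """docs: list of ids; rel/interest: id -> score; sim: (id, id) -> [0, 1]."""
    selected = []
    candidates = set(docs)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return a * rel[d] + b * interest[d] - c * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = {"d1": 1.0, "d2": 0.9, "d3": 0.6}
interest = {"d1": 0.5, "d2": 0.5, "d3": 0.5}
def sim(x, y):  # d1 and d2 are near-duplicates; d3 is distinct
    return 1.0 if {x, y} == {"d1", "d2"} else 0.0
order = rerank(["d1", "d2", "d3"], rel, interest, sim, k=2)
```

The redundancy penalty demotes d2 (a near-duplicate of d1) in favor of the less relevant but novel d3, which is the diversification effect in miniature.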
DOI: 10.1145/2623330.2623650 · Published 2014-08-24 · Citations: 43
Gradient boosted feature selection
Z. Xu, Gao Huang, Kilian Q. Weinberger, A. Zheng
A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify non-linear feature interactions; scale linearly with the number of features and dimensions; allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straightforward to implement as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real world data sets and show that it matches or outperforms other state-of-the-art feature selection algorithms. Yet it scales to larger data set sizes and naturally allows for domain-specific side information.
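The cost-aware boosting idea can be roughly sketched as gradient boosting of regression stumps where a split on a not-yet-used feature pays an extra penalty, so the ensemble concentrates on few features. This is a simplified stand-in for GBFS, not the authors' implementation; the penalty and learning rate are illustrative:

```python
import numpy as np

def fit_stump(X, r, used, penalty):
    """Best least-squares threshold stump on residual r; a split on a
    not-yet-used feature pays an extra penalty (the GBFS-style cost)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:   # largest value would put all left
            left = X[:, j] <= t
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).mean() + (0.0 if j in used else penalty)
            if best is None or err < best[0]:
                best = (err, j, t, r[left].mean(), r[~left].mean())
    return best[1:]

def gbfs_sketch(X, y, rounds=20, lr=0.5, penalty=0.05):
    """Boost stumps on squared loss; the selected features are those used."""
    pred = np.zeros(len(y))
    used = set()
    for _ in range(rounds):
        j, t, lv, rv = fit_stump(X, y - pred, used, penalty)
        used.add(j)
        pred += lr * np.where(X[:, j] <= t, lv, rv)
    return used, pred

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0]            # only feature 0 is informative
used, pred = gbfs_sketch(X, y)
```

Because reopening an already-used feature is free while a new one costs `penalty`, the booster tends to reuse informative features rather than spreading splits across noise features.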
DOI: 10.1145/2623330.2623635 · Published 2014-08-24 · Citations: 150
Efficient mini-batch training for stochastic optimization
Mu Li, T. Zhang, Yuqiang Chen, Alex Smola
Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
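The conservatively regularized minibatch step can be sketched on a least-squares objective: within each minibatch, approximately minimize f_batch(w) + (γ/2)·||w − w_t||² with a few inner gradient steps instead of taking a single SGD step. The values of γ, the inner-step count, and the step size below are illustrative, not from the paper:

```python
import numpy as np

def conservative_minibatch_step(Xb, yb, w, gamma=1.0, inner_steps=10, lr=0.05):
    """Approximately minimize the minibatch least-squares loss plus a
    proximal term (gamma/2)*||w - w_t||^2 that keeps the iterate close
    to its value at the start of the minibatch."""
    w_t = w.copy()
    for _ in range(inner_steps):
        grad = Xb.T @ (Xb @ w - yb) / len(yb) + gamma * (w - w_t)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true
w = np.zeros(5)
for i in range(0, 512, 64):            # one epoch of size-64 minibatches
    w = conservative_minibatch_step(X[i:i+64], y[i:i+64], w)
```

The proximal term is what makes larger minibatches safe: each batch is exploited more thoroughly, but the iterate cannot jump arbitrarily far based on one batch's noise.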
DOI: 10.1145/2623330.2623612 · Published 2014-08-24 · Citations: 700
LaSEWeb
Oleksandr Polozov, Sumit Gulwani
We show how to programmatically model processes that humans use when extracting answers to queries (e.g., "Who invented typewriter?", "List of Washington national parks") from semi-structured Web pages returned by a search engine. This modeling enables various applications including automating repetitive search tasks, and helping search engine developers design micro-segments of factoid questions. We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns. We also describe an algorithm to rank multiple answers extracted from multiple webpages. On 100,000+ queries (across 7 micro-segments) obtained from Bing logs, our system LaSEWeb answered queries with an average recall of 71%. Also, the desired answer(s) were present in top-3 suggestions for 95%+ cases.
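The cross-page answer-ranking step can be illustrated with a toy aggregator that discounts a candidate's votes by the rank of the page it was extracted from; the 1/rank discount scheme is a hypothetical stand-in, not the paper's ranking algorithm:

```python
# Toy ranking of answers extracted from multiple search-result pages:
# each occurrence of a candidate contributes a vote discounted by the
# rank of the page it came from. The discount scheme is hypothetical.

def rank_answers(extractions):
    """extractions: list of (page_rank, [answers]) with 1-based page_rank."""
    scores = {}
    for page_rank, answers in extractions:
        for a in answers:
            scores[a] = scores.get(a, 0.0) + 1.0 / page_rank
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidates for the abstract's "Who invented typewriter?" query.
pages = [
    (1, ["Christopher Sholes"]),
    (2, ["Christopher Sholes", "Henry Mill"]),
    (3, ["Henry Mill"]),
]
top = rank_answers(pages)
```

Aggregating across pages is what lets the system keep the desired answer in its top-3 suggestions even when individual pages disagree.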
DOI: 10.1145/2623330.2623761 · Published 2014-08-24 · Citations: 8
Open question answering over curated and extracted knowledge bases
Anthony Fader, Luke Zettlemoyer, Oren Etzioni
We consider the problem of open-domain question answering (Open QA) over massive knowledge bases (KBs). Existing approaches use either manually curated KBs like Freebase or KBs automatically extracted from unstructured text. In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. A key technical challenge is designing systems that are robust to the high variability in both natural language questions and massive KBs. OQA achieves robustness by decomposing the full Open QA problem into smaller sub-problems including question paraphrasing and query reformulation. OQA solves these sub-problems by mining millions of rules from an unlabeled question corpus and across multiple KBs. OQA then learns to integrate these rules by performing discriminative training on question-answer pairs using a latent-variable structured perceptron algorithm. We evaluate OQA on three benchmark question sets and demonstrate that it achieves up to twice the precision and recall of a state-of-the-art Open QA system.
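The paraphrasing and query-reformulation sub-problems can be illustrated with hand-written rewrite rules of the kind OQA mines automatically from its question corpus; the two rules below are invented examples, not mined rules from the paper:

```python
import re

# Invented examples of the two rule types: a question paraphrase rule
# and a query-reformulation rule mapping onto a KB triple pattern.
PARAPHRASE_RULES = [
    (r"^who invented (.+)\?$", r"what is the inventor of \1?"),
]
QUERY_RULES = [
    (r"^what is the (\w+) of (.+)\?$", r"(\2, \1, ?x)"),
]

def reformulate(question, rules):
    """Apply the first matching rewrite rule; pass through otherwise."""
    for pattern, template in rules:
        if re.match(pattern, question, re.IGNORECASE):
            return re.sub(pattern, template, question, flags=re.IGNORECASE)
    return question

q = reformulate("Who invented typewriter?", PARAPHRASE_RULES)
query = reformulate(q, QUERY_RULES)
```

Chaining paraphrase rules before query rules is the decomposition described in the abstract: normalize the natural-language variability first, then map to a KB query.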
DOI: 10.1145/2623330.2623677 · Published 2014-08-24 · Citations: 397
Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization
Joyce Ho, Joydeep Ghosh, Jimeng Sun
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to phenotypes, or medical concepts, that clinical researchers need or use. Existing phenotyping approaches typically require labor intensive supervision from medical experts. We propose Marble, a novel sparse non-negative tensor factorization method to derive phenotype candidates with virtually no human supervision. Marble decomposes the observed tensor into two terms, a bias tensor and an interaction tensor. The bias tensor represents the baseline characteristics common amongst the overall population and the interaction tensor defines the phenotypes. We demonstrate the capability of our proposed model on both simulated and patient data from a publicly available clinical database. Our results show that Marble derived phenotypes provide at least a 42.8% reduction in the number of non-zero elements and also retain predictive power for classification purposes. Furthermore, the resulting phenotypes and baseline characteristics from real EHR data are consistent with known characteristics of the patient population. Thus it can potentially be used to rapidly characterize, predict, and manage a large number of diseases, thereby promising a novel, data-driven solution that can benefit very large segments of the population.
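As background for the decomposition, a generic nonnegative CP (PARAFAC) factorization via multiplicative updates can be sketched as follows. Note this omits Marble's bias/interaction split and its sparsity constraints; it only shows the underlying nonnegative tensor factorization:

```python
import numpy as np

def kr(U, V):
    """Khatri-Rao product (column-wise Kronecker)."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def nncp(X, rank, iters=200, eps=1e-9, seed=0):
    """Nonnegative CP factorization of a 3-way tensor via Lee-Seung-style
    multiplicative updates. Returns factor matrices A, B, C."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.random((n, rank)) for n in (I, J, K))
    X1 = X.reshape(I, -1)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, -1)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, -1)   # mode-3 unfolding
    def update(M, Xn, U, V):
        # Multiplicative update keeps factors nonnegative.
        return M * (Xn @ kr(U, V)) / (M @ ((U.T @ U) * (V.T @ V)) + eps)
    for _ in range(iters):
        A = update(A, X1, B, C)
        B = update(B, X2, A, C)
        C = update(C, X3, A, B)
    return A, B, C

# Recover a synthetic rank-2 nonnegative tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = nncp(X, rank=2)
Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
```

In the phenotyping setting, each rank-one component (a column triple of A, B, C) is a candidate phenotype: a weighted combination of, say, patients, diagnoses, and medications.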
DOI: 10.1145/2623330.2623658 · Published 2014-08-24 · Citations: 220
Correlating events with time series for incident diagnosis
Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, D. Zhang, Zhe Wang
As online services have become more and more popular, incident diagnosis has emerged as a critical task in minimizing service downtime and ensuring high quality of the services provided. For most online services, incident diagnosis is mainly conducted by analyzing a large amount of telemetry data collected from the services at runtime. Time series data and event sequence data are two major types of telemetry data. Techniques of correlation analysis are important tools that are widely used by engineers for data-driven incident diagnosis. Despite their importance, there has been little previous work addressing the correlation between the two types of heterogeneous data relevant to incident diagnosis: continuous time series data and temporal event data. In this paper, we propose an approach to evaluate the correlation between time series data and event data. Our approach is capable of discovering three important aspects of event-timeseries correlation in the context of incident diagnosis: existence of correlation, temporal order, and monotonic effect. Our experimental results on simulation data sets and two real data sets demonstrate the effectiveness of the algorithm.
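The existence-of-correlation question can be illustrated with a simple permutation-style check: compare the series level just after the event times against windows drawn at random times. This is a simplified stand-in for the paper's statistical tests, with illustrative window and trial counts:

```python
import numpy as np

def event_series_correlation(series, events, window=5, trials=500, seed=0):
    """Permutation-style check: is the series near events more extreme
    than near randomly placed pseudo-events?"""
    rng = np.random.default_rng(seed)
    def mean_after(ts):
        return np.mean([series[t:t + window].mean() for t in ts])
    observed = mean_after(events)
    n = len(series) - window
    null = [mean_after(rng.integers(0, n, size=len(events)))
            for _ in range(trials)]
    # p-value: fraction of random placements at least as extreme.
    p = np.mean([abs(m - series.mean()) >= abs(observed - series.mean())
                 for m in null])
    return observed, p

# Synthetic example: each event is followed by a spike in the series.
rng = np.random.default_rng(1)
series = rng.normal(0, 0.1, size=1000)
events = list(range(100, 900, 100))
for t in events:
    series[t:t + 5] += 2.0
obs, p = event_series_correlation(series, events)
```

Checking windows *after* versus *before* the events with the same machinery would address the temporal-order aspect; comparing window means against event magnitudes would probe the monotonic effect.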
DOI: 10.1145/2623330.2623374 · Published 2014-08-24 · Citations: 91
Balanced graph edge partition
F. Bourse, M. Lelarge, M. Vojnović
Balanced edge partition has emerged as a new approach to partition an input graph data for the purpose of scaling out parallel computations, which is of interest for several modern data analytics computation platforms, including platforms for iterative computations, machine learning problems, and graph databases. This new approach stands in a stark contrast to the traditional approach of balanced vertex partition, where, for a given number of partitions, the problem is to minimize the number of edges cut subject to balancing the vertex cardinality of partitions. In this paper, we first characterize the expected costs of vertex and edge partitions with and without aggregation of messages, for the commonly deployed policy of placing a vertex or an edge uniformly at random to one of the partitions. We then obtain the first approximation algorithms for the balanced edge-partition problem which for the case of no aggregation matches the best known approximation ratio for the balanced vertex-partition problem, and show that this remains to hold for the case with aggregation up to a factor that is equal to the maximum in-degree of a vertex. We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition, and demonstrates efficiency of natural greedy online assignments for the balanced edge-partition problem with and without aggregation.
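A natural greedy online assignment of the kind evaluated empirically can be sketched as: place each arriving edge on the partition that already hosts its endpoints (to limit vertex replication), subject to a hard balance cap on partition load. The exact tie-breaking and cap below are one simple choice, not necessarily the paper's:

```python
# Greedy online balanced edge partition (sketch). Each edge goes to the
# partition with the most of its endpoints already present, preferring
# the least-loaded partition on ties, under a hard balance cap.

def greedy_edge_partition(edges, k):
    cap = -(-len(edges) // k)              # ceil(m / k): perfect balance cap
    loads = [0] * k
    replicas = [set() for _ in range(k)]   # vertices replicated per partition
    assignment = {}
    for u, v in edges:
        candidates = [p for p in range(k) if loads[p] < cap]
        def score(p):
            hits = (u in replicas[p]) + (v in replicas[p])
            return (-hits, loads[p])       # co-location first, then balance
        p = min(candidates, key=score)
        assignment[(u, v)] = p
        loads[p] += 1
        replicas[p].update((u, v))
    return assignment, loads, replicas

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
assignment, loads, replicas = greedy_edge_partition(edges, 2)
```

The quantity to minimize here is total vertex replication (the sum of `len(r)` over `replicas`), the edge-partition analogue of the edge-cut objective in vertex partition.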
DOI: 10.1145/2623330.2623660 · Published 2014-08-24 · Citations: 137