
Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management: latest articles

Multi-query scheduling for time-critical data stream applications
Yongluan Zhou, Ji Wu, A. K. Leghari
Many data stream applications, such as network intrusion detection, on-line financial tickers and environmental monitoring, typically exhibit certain "real-time" traits. In such applications, people are interested in strategies that ensure on-time delivery of query results. In this paper, we point out that traditional operator-based query scheduling strategies are insufficient to handle this class of problem. Therefore we choose to approach the issue from a new angle by modeling multi-query scheduling as a job-scheduling problem, a classical problem in real-time computing. By taking advantage of the wisdom in the real-time computing community, we propose several new scheduling strategies and algorithms to enhance the overall data stream query scheduling performance. Through extensive experiments over both real and synthetic data, we identify the important factors for scheduling performance and verify the effectiveness of our approaches.
DOI: https://doi.org/10.1145/2484838.2484864. Pages: 15:1-15:12. Published: 2013-07-29.
Citations: 5
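The abstract above models multi-query scheduling as a job-scheduling problem from real-time computing but does not spell out the proposed strategies. As a rough illustration of that framing only, the following sketch applies the classical earliest-deadline-first (EDF) rule to queries treated as jobs. The `QueryJob` fields, the single-processor non-preemptive setting, and the sample numbers are assumptions made for this example and are not taken from the paper.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueryJob:
    # Hypothetical job record: only the deadline takes part in ordering.
    deadline: float                                # time by which the result is due
    name: str = field(compare=False)
    cost: float = field(compare=False)             # estimated processing time
    release: float = field(compare=False, default=0.0)  # time the job arrives


def edf_schedule(jobs):
    """Non-preemptive earliest-deadline-first on a single processor.

    Returns a list of (name, finish_time, met_deadline) in execution order.
    """
    pending = sorted(jobs, key=lambda j: j.release)
    ready, out, clock, i = [], [], 0.0, 0
    while i < len(pending) or ready:
        # Admit every job that has been released by the current time.
        while i < len(pending) and pending[i].release <= clock:
            heapq.heappush(ready, pending[i])
            i += 1
        if not ready:
            # Nothing to run yet: jump ahead to the next release time.
            clock = pending[i].release
            continue
        job = heapq.heappop(ready)                 # job with the earliest deadline
        clock += job.cost
        out.append((job.name, clock, clock <= job.deadline))
    return out


if __name__ == "__main__":
    jobs = [
        QueryJob(deadline=10, name="a", cost=4),
        QueryJob(deadline=5, name="b", cost=2),
        QueryJob(deadline=7, name="c", cost=3, release=1),
    ]
    for row in edf_schedule(jobs):
        print(row)
```

A per-query deadline and cost estimate are the only inputs here; an operator-level scheduler would need considerably more state than this sketch carries.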
Towards a universal tracking database
Gereon Schüller, Andreas Behrend
In moving object databases, authors usually assume that the number and positions of the objects to be processed are always known in advance. Detecting an unknown moving object and following its movement, however, is usually left to tracking algorithms residing outside the database. Trackers are complex software systems that process sensor data and application-specific context information in order to detect, classify, monitor and predict the course of moving objects. As there are no universal software tools for realizing a tracker, such systems are usually hand-coded from scratch for each tracking application. In this paper we present a framework for implementing universal trackers inside a database. As a use case, we consider the well-known probabilistic multiple hypothesis tracking approach (PMHT) and the interacting multiple model filter (IMM) for realizing typical tracking tasks. We show that incremental view maintenance techniques and Bregman ball trees are well-suited for efficiently implementing state-of-the-art trackers for processing streams of radar data.
DOI: https://doi.org/10.1145/2484838.2484845. Pages: 10:1-10:12. Published: 2013-07-29.
Citations: 1
Parameter-free and domain-independent similarity search with diversity
Lúcio F. D. Santos, Willian D. Oliveira, Mônica Ribeiro Porto Ferreira, A. Traina, C. Traina
New operators to execute similarity-based queries over multimedia data stored in Database Management Systems are increasingly demanded. However, when searching very large datasets, the basic operators often return elements that are too similar both to the query center and to each other, reducing the answer's utility. In this paper, we tackle the problem of providing diversity to similarity query results, and define techniques to assure that each element in the result set is different enough from the others. Existing techniques compel the user to define either a parameter to trade off between similarity and diversity or a minimum similarity between result elements. In contrast, our approach provides similarity queries with diversification using the influence concept, which automatically estimates the inherent diversity between the result set elements and requires no user-defined parameters. Furthermore, our technique can be applied over any data represented in a metric space, so it is both parameter and application-domain independent. The "Better Results with Influence Diversification" (BRID) technique is the basis of the k-Diverse Nearest Neighbor (BRIDk) and the Range Diverse (BRIDr) algorithms, which execute k-nearest neighbor and range queries with diversification, showing that the technique can be applied to diversify any type of similarity query. We also define a way to measure the diversification degree in a result set. Through a detailed experimental evaluation using our approach, we show that BRID outperforms the existing methods regarding both query diversification quality and execution times, being at least two orders of magnitude faster than the best existing approaches.
DOI: https://doi.org/10.1145/2484838.2484854. Pages: 5:1-5:12. Published: 2013-07-29.
Citations: 18
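The abstract above contrasts BRID with earlier techniques that need a user-supplied parameter, such as a minimum separation between result elements. The sketch below illustrates only that earlier, parameter-dependent style of diversified k-nearest-neighbor search; it is not BRID, whose influence-based rule is not described in the abstract. The function name `diverse_knn`, the `min_sep` parameter, and the Euclidean default are assumptions for this example.

```python
import math


def diverse_knn(query, points, k, min_sep, dist=math.dist):
    """Greedy k-NN that skips candidates too close to an already chosen result.

    `min_sep` is the user-defined separation threshold (the kind of
    parameter the abstract says BRID avoids). `dist` may be any metric.
    """
    result = []
    # Visit candidates from nearest to farthest from the query.
    for p in sorted(points, key=lambda p: dist(query, p)):
        # Keep p only if it is far enough from every element kept so far.
        if all(dist(p, r) >= min_sep for r in result):
            result.append(p)
            if len(result) == k:
                break
    return result


if __name__ == "__main__":
    pts = [(1, 0), (1.1, 0), (0, 2), (5, 5)]
    print(diverse_knn((0, 0), pts, k=2, min_sep=1.0))
```

Because `dist` is passed in, the same loop works for any data in a metric space, but the quality of the answer depends entirely on choosing `min_sep` well, which is the burden the abstract describes.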
Research lattices: towards a scientific hypothesis data model
Bernardo Gonçalves, F. Porto
As the problems of scientific interest grow in scale and complexity, scientists have to tacitly manage too many analytic elements. Hypotheses are worked out to drive research towards successful explanation and prediction, which characterizes science as a dynamic activity that is partially ordered towards progress. This paper motivates and introduces research lattices, carrying out a lattice-theoretic approach for hypothesis representation and management in large-scale science and engineering. The goal of this work is to equip scientists with tools to manipulate and query hypotheses while keeping track of research progress. We refer to SciDB's array data model and discuss how data and theories could be managed in a unified model management framework.
DOI: https://doi.org/10.1145/2484838.2484861. Pages: 41:1-41:4. Published: 2013-07-29.
Citations: 2
Publishing trajectories with differential privacy guarantees
Kaifeng Jiang, Dongxu Shao, S. Bressan, Thomas Kister, K. Tan
The pervasiveness of location-acquisition technologies has made it possible to collect the movement data of individuals or vehicles. However, it has to be carefully managed to ensure that there is no privacy breach. In this paper, we investigate the problem of publishing trajectory data under the differential privacy model. A straightforward solution is to add noise to a trajectory - this can be done either by adding noise to each coordinate of the position, to each position of the trajectory, or to the whole trajectory. However, such naive approaches result in trajectories with zigzag shapes and many crossings, making the published trajectories of little practical use. We introduce a mechanism called SDD (Sampling Distance and Direction), which is ε-differentially private. SDD samples a suitable direction and distance at each position to publish the next possible position. Numerical experiments conducted on real ship trajectories demonstrate that our proposed mechanism can deliver ship trajectories that are of good practical utility.
DOI: https://doi.org/10.1145/2484838.2484846. Pages: 12:1-12:12. Published: 2013-07-29.
Citations: 90
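The abstract above describes a straightforward baseline in which noise is added to each coordinate of each position, and notes that this yields zigzag trajectories. The sketch below shows what such a baseline might look like with Laplace noise; it is not the SDD mechanism, which the abstract does not specify in detail. The names `laplace` and `perturb_trajectory`, the `sensitivity` value, and the sample trajectory are assumptions for this example, and `epsilon` here is the budget spent on each coordinate rather than on the whole trajectory.

```python
import random


def laplace(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_trajectory(traj, epsilon, sensitivity, rng=None):
    """Naive baseline: add independent Laplace noise to every coordinate.

    `sensitivity` is an assumed bound on how much one coordinate can change;
    the noise scale is sensitivity / epsilon, as in the Laplace mechanism.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [(x + laplace(scale, rng), y + laplace(scale, rng)) for x, y in traj]


if __name__ == "__main__":
    # A straight-line trajectory: the noisy version will usually zigzag.
    straight = [(float(i), 0.0) for i in range(5)]
    noisy = perturb_trajectory(straight, epsilon=1.0, sensitivity=1.0,
                               rng=random.Random(7))
    for point in noisy:
        print(point)
```

Since every coordinate is perturbed independently, consecutive published points have no reason to stay on a smooth path, which is the practical weakness the abstract points out.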
Adaptive exploration for large-scale protein analysis in the molecular dynamics database
Sarana Nutanong, N. Carey, Yanif Ahmad, A. Szalay, T. Woolf
Molecular dynamics (MD) simulations generate detailed time-series data of all-atom motions. These simulations are leading users of the world's most powerful supercomputers, and are standard-bearers for a wide range of high-performance computing (HPC) methods. However, MD data exploration and analysis is in its infancy in terms of scalability, ease-of-use, and ultimately its ability to answer 'grand challenge' science questions. This demonstration introduces the Molecular Dynamics Database (MDDB) project at Johns Hopkins, to study the co-design of database methods for deep on-the-fly exploratory MD analyses with HPC simulations. Data exploration in MD suffers from a "human bottleneck", where the laborious administration of simulations leaves little room for domain experts to focus on tackling science questions. MDDB exploits the data-rich nature of MD simulations to provide adaptive control of the exploration process with machine learning techniques, specifically reinforcement learning (RL). We present MDDB's data and queries, architecture, and its use of RL methods. Our audience will co-operate with our steering algorithm and science partners, and witness MDDB's abilities to significantly reduce exploration times and direct computation resources to where they best address science questions.
DOI: https://doi.org/10.1145/2484838.2484872. Pages: 45:1-45:4. Published: 2013-07-29.
Citations: 4
Education and career paths for data scientists
M. Balazinska, S. Davidson, Bill Howe, Alexandros Labrinidis
MOTIVATION: As industry and science are increasingly data-driven, the need for skilled data scientists is exceeding what our universities are producing. According to a McKinsey report: "By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills". Similarly, the ability to extract knowledge from scientific data is accelerating discovery and we need the next generation of domain scientists to be experts not only in their domain but also in data management. At the same time, however, researchers in academia who focus on building instruments or data management tools are often less recognized for their contributions than researchers focusing purely on the actual science.

OVERVIEW: The goal of this panel will be to discuss all these challenges. We will discuss various aspects of how we should be educating both the emerging "data science" experts and the next generation of database and domain science experts. The panel will also discuss career paths for researchers who choose to specialize in developing new methods and tools for Big Data management in domain sciences, with recommendations for how we should better support these less traditional career paths.
DOI: https://doi.org/10.1145/2484838.2484886. Pages: 3:1-3:2. Published: 2013-07-29.
Citations: 1
Parallel online aggregation in action
Chengjie Qin, Florin Rusu
Online aggregation provides continuous estimates of the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution, or can let the processing terminate and obtain the exact result. In this demonstration, we introduce a general framework for parallel online aggregation in which estimation does not incur overhead on top of the actual processing. We define a generic interface to express any estimation model that completely abstracts away the execution details. We design multiple sampling-based estimators suited for parallel online aggregation and implement them inside the framework. Demonstration participants are shown how estimates for general SQL aggregation queries over terabytes of TPC-H data are generated during the entire processing. Due to parallel execution, the estimate converges to the correct result in a matter of seconds even for the most difficult queries. The behavior of the estimators is evaluated under different operating regimes of the distributed cluster used in the demonstration.
DOI: https://doi.org/10.1145/2484838.2484874. Pages: 46:1-46:4. Published: 2013-07-29.
Citations: 19
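The abstract above describes online aggregation as producing a running estimate that the user can stop on once it is accurate enough, or let run to the exact result. A minimal single-machine sketch of that idea for a SUM query follows. It is not the parallel framework or any of the estimators from the demonstration; the function name `online_sum`, the normal-approximation interval, and the parameters `z`, `report_every` and `seed` are assumptions for this example.

```python
import math
import random


def online_sum(values, z=1.96, report_every=1000, seed=0):
    """Yield (rows_seen, estimate, half_width) while scanning in random order.

    The estimate of the total is N * running_mean. The half-width is a
    normal-approximation confidence bound with a finite-population
    correction, so it shrinks to 0 once every row has been read.
    """
    order = list(values)
    random.Random(seed).shuffle(order)     # random scan order acts as the sample
    N = len(order)
    n, mean, m2 = 0, 0.0, 0.0
    for v in order:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)           # Welford's running sum of squares
        if n % report_every == 0 or n == N:
            var = m2 / (n - 1) if n > 1 else 0.0
            fpc = (N - n) / N              # finite-population correction
            half = z * N * math.sqrt(var / n * fpc)
            yield n, N * mean, half


if __name__ == "__main__":
    data = range(1, 10001)
    for seen, est, half in online_sum(data, report_every=2500):
        print(f"{seen:>6} rows: {est:,.0f} +/- {half:,.0f}")
```

A caller can break out of the loop as soon as `half` falls below a tolerance, which mirrors the early-stop behavior the abstract describes; reading to the end returns the exact sum up to floating-point rounding.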
Bulk sorted access for efficient top-k retrieval
Dustin Lange, Felix Naumann
Efficient top-k retrieval of records from a database has been an active research field for many years. We approach the problem from a real-world application point of view, in which the order of records according to some similarity function on an attribute is not unique: many records have the same values in several attributes and thus their ranking in those attributes is arbitrary. For instance, in large person databases many individuals have the same first name, the same date of birth, or live in the same city. Existing algorithms, such as the Threshold Algorithm (TA), are ill-equipped to handle such cases efficiently.

We introduce a variation of TA, the Bulk Sorted Access Algorithm (BSA), which retrieves larger chunks of records from the sorted lists using fixed thresholds, and which focuses its efforts on records that are ranked high in more than one ordering and are thus more promising candidates. We experimentally show that our method outperforms TA and another previous method for top-k retrieval in those very common cases.
DOI: https://doi.org/10.1145/2484838.2484852. Pages: 39:1-39:4. Published: 2013-07-29.
Citations: 0
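For readers unfamiliar with the baseline named in the abstract above, the following is a minimal sketch of the textbook Threshold Algorithm (TA), not of the authors' BSA. It assumes a sum aggregation, integer similarity scores, and that every record appears once in every sorted list; the record ids and the `by_name` / `by_city` lists are hypothetical. The example data contains runs of tied scores, so the position of the best record inside a tie run is arbitrary and TA has to read through the run before it can stop.

```python
import heapq


def threshold_algorithm(sorted_lists, k):
    """Textbook TA with sum aggregation (illustrative sketch only).

    sorted_lists: one list per attribute, each a list of (record_id, score)
    pairs sorted by score in descending order and covering every record.
    Returns (top_k, sorted_accesses), where top_k is a list of
    (total_score, record_id) pairs in descending order.
    """
    # Random-access index per attribute: record_id -> score.
    lookup = [dict(lst) for lst in sorted_lists]
    seen = set()
    top = []  # min-heap holding the best k (total, record_id) pairs so far
    accesses = 0

    for pos in range(len(sorted_lists[0])):
        last_scores = []
        for lst in sorted_lists:
            rid, score = lst[pos]  # sorted access
            accesses += 1
            last_scores.append(score)
            if rid in seen:
                continue
            seen.add(rid)
            # Random access: fetch the record's score in every list.
            total = sum(index[rid] for index in lookup)
            if len(top) < k:
                heapq.heappush(top, (total, rid))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, rid))
        # No unseen record can score higher than this threshold.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break

    return sorted(top, reverse=True), accesses


# Hypothetical data: many records tie on each attribute.
by_name = [("r1", 10), ("r2", 10), ("r3", 10), ("r4", 10), ("r5", 2), ("r6", 1)]
by_city = [("r5", 10), ("r6", 10), ("r4", 10), ("r3", 3), ("r2", 2), ("r1", 1)]

best, cost = threshold_algorithm([by_name, by_city], k=1)
print(best, cost)
```

In this toy input the only record that is top-ranked in both lists (`r4`) happens to sit at the end of the tie runs, so the sketch performs sorted accesses on every tied entry before the threshold test succeeds. That is the kind of case the abstract says TA handles poorly.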
Sharing confidential data for algorithm development by multiple imputation
S. Verwer, S. V. D. Braak, Sunil Choenni
The availability of real-life data sets is of crucial importance for algorithm and application development, as these often require insight into the specific properties of the data. Often, however, such data are not released because of their proprietary and confidential nature. We propose to solve this problem using the statistical technique of multiple imputation, which is used as a powerful method for generating realistic synthetic data sets. Additionally, it is shown how the generated records can be combined into networked data using clustering techniques.
DOI: 10.1145/2484838.2484865 (https://doi.org/10.1145/2484838.2484865). Published 2013-07-29 in Scientific and statistical database management : International Conference, SSDBM ... : proceedings, pp. 42:1-42:4.
Citations: 2