
Proceedings of the 2016 International Conference on Management of Data: Latest Articles

Efficient Query Processing on Many-core Architectures: A Case Study with Intel Xeon Phi Processor
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899407
Xuntao Cheng, Bingsheng He, Mian Lu, C. Lau, Huynh Phung Huynh, R. Goh
Intel Xeon Phi has recently emerged as a many-core processor with up to 61 x86 cores. In this demonstration, we present PhiDB, an OLAP query processor with simultaneous multi-threading (SMT) capabilities on Xeon Phi, as a case study for parallel database performance on future many-core processors. With the trend towards many-core architectures, query operator optimization and efficient query scheduling on such architectures remain challenging issues. This motivates us to redesign and evaluate query processors. In PhiDB, we apply Xeon Phi aware optimizations to query operators to exploit the hardware features of Xeon Phi, and design a heuristic algorithm that schedules the concurrent execution of query operators for better performance. Together, these demonstrate the performance impact of Xeon Phi aware optimizations. We have also developed a user interface for users to explore the underlying performance impacts of hardware-conscious optimizations and scheduling plans.
Citations: 10
Interactive Search and Exploration of Waveform Data with Searchlight
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899404
A. Kalinin, U. Çetintemel, S. Zdonik
Searchlight enables interactive search and exploration of large, multi-dimensional data sets. It allows users to explore by specifying rich constraints for the "objects" they are interested in identifying. Constraints can express a variety of properties, including the shape of the object (e.g., a waveform interval of length 10-100ms), its aggregate properties (e.g., the average amplitude of the signal over the interval is greater than 10), and similarity to another object (e.g., the distance between the interval's waveform and the query waveform is less than 5). Searchlight allows users to specify an arbitrary number of such constraints and to mix different types of constraints in the same query. Searchlight enhances the query execution engine of an array DBMS (currently SciDB) with the ability to perform sophisticated search using the power of Constraint Programming (CP). This allows an existing CP solver from Or-Tools (an open-source suite of operations research tools from Google) to directly access data inside the DBMS without the need to extract and transform it. This demo will illustrate the rich search and exploration capabilities of Searchlight, and its innovative technical features, by using the real-world MIMIC II data set, which contains waveform data for multi-parameter recordings of ICU patients, such as ABP (Arterial Blood Pressure) and ECG (electrocardiogram). Users will be able to search for interesting waveform intervals by specifying aggregate properties of the corresponding signals. In addition, they will be able to search for intervals similar to those already found, where similarity is defined as a distance between the signal sequences.
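The three kinds of constraints named in this abstract (interval length, an aggregate over the interval, and distance to a query waveform) can be illustrated with a small brute-force sketch. This is not Searchlight's interface or its CP-based search: the function name `find_intervals`, its parameters, and the choice of Euclidean distance over the overlapping prefix are all assumptions made for illustration.

```python
import math


def find_intervals(signal, min_len, max_len, min_avg, query, max_dist):
    """Return (start, length) pairs that satisfy all three constraints.

    Hypothetical brute-force illustration only; Searchlight itself
    delegates the search to a CP solver running inside the DBMS.
    """
    matches = []
    # Shape constraint: only interval lengths in [min_len, max_len].
    for length in range(min_len, max_len + 1):
        for start in range(0, len(signal) - length + 1):
            window = signal[start:start + length]
            # Aggregate constraint: average amplitude must exceed min_avg.
            if sum(window) / length <= min_avg:
                continue
            # Similarity constraint: Euclidean distance to the query,
            # compared over the overlapping prefix (an assumption).
            n = min(length, len(query))
            dist = math.sqrt(sum((window[i] - query[i]) ** 2 for i in range(n)))
            if dist < max_dist:
                matches.append((start, length))
    return matches
```

Calling `find_intervals(signal, 3, 3, 10, query, 5)` enumerates every length-3 window, so the cost grows with the number of candidate intervals; avoiding that exhaustive scan is the kind of work a constraint solver is meant to take over.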
Citations: 3
Range-based Obstructed Nearest Neighbor Queries
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2915234
Huaijie Zhu, Xiaochun Yang, Bin Wang, Wang-Chien Lee
In this paper, we study a novel variant of obstructed nearest neighbor queries, namely, range-based obstructed nearest neighbor (RONN) search. A natural generalization of continuous obstructed nearest-neighbor (CONN), an RONN query retrieves the obstructed nearest neighbor for every point in a specified range. To process RONN, we first propose a CONN-Based (CONNB) algorithm as our baseline, which reduces the RONN query to a range query and four CONN queries processed using an R-tree. To address the shortcomings of the CONNB algorithm, we then propose a new RONN by R-tree Filtering (RONN-RF) algorithm, which explores effective filtering, also using an R-tree. Next, we propose a new index, called O-tree, dedicated to indexing objects in the obstructed space. The novelty of O-tree lies in the idea of dividing the obstructed space into non-obstructed subspaces, aiming to efficiently retrieve highly qualified candidates for RONN processing. We develop an O-tree construction algorithm and propose a space division scheme, called the optimal obstacle balance (OOB) scheme, to address the tree balance problem. Accordingly, we propose an efficient algorithm, called RONN by O-tree Acceleration (RONN-OA), which exploits O-tree to accelerate query processing of RONN. In addition, we extend O-tree for indexing polygons. Finally, we conduct a comprehensive performance evaluation using both real and synthetic datasets to validate our ideas and the proposed algorithms. The experimental results show that the RONN-OA algorithm significantly outperforms the two R-tree based algorithms. Moreover, we show that the OOB scheme achieves the best tree balance in O-tree and outperforms two baseline schemes.
Citations: 16
Ontology-Based Integration of Streaming and Static Relational Data with Optique
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899385
E. Kharlamov, S. Brandt, Ernesto Jiménez-Ruiz, Y. Kotidis, S. Lamparter, T. Mailis, C. Neuenstadt, Ö. Özçep, C. Pinkel, C. Svingos, D. Zheleznyakov, Ian Horrocks, Y. Ioannidis, R. Möller
Real-time processing of data coming from multiple heterogeneous data streams and static databases is a typical task in many industrial scenarios such as diagnostics of large machines. A complex diagnostic task may require a collection of up to hundreds of queries over such data. Although many of these queries retrieve data of the same kind, such as temperature measurements, they access structurally different data sources. In this work we show how Semantic Technologies implemented in our system Optique can simplify such complex diagnostics by providing an abstraction layer (an ontology) that integrates heterogeneous data. In a nutshell, Optique allows complex diagnostic tasks to be expressed with just a few high-level semantic queries. The system can then automatically enrich these queries, translate them into a collection with a large number of low-level data queries, and finally optimise and efficiently execute the collection in a heavily distributed environment. We will demo the benefits of Optique on a real-world scenario from Siemens.
Citations: 56
Constructing Join Histograms from Histograms with q-error Guarantees
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2914828
Kaleb Alway, A. Nica
Histograms are implemented and used in any database system, usually defined on a single column of a database table. However, some of the most desired statistics in such systems are those on the correlation among columns. In this paper we present a novel construction algorithm for building a join histogram that accepts two single-column histograms over different attributes, each with q-error guarantees, and produces a histogram over the result of the join operation on these attributes. The join histogram is built only from the input histograms without accessing the base data or computing the join relation. Under certain restrictions, a q-error guarantee can be placed on the produced join histogram. It is possible to construct adversarial input histograms that produce arbitrarily large q-error in the resulting join histogram, but across several experiments, this type of input does not occur in either randomly generated data or real-world data. Our construction algorithm runs in linear time with respect to the size of the input histograms, and produces a join histogram that is at most as large as the sum of the sizes of the input histograms. These join histograms can be used to efficiently and accurately estimate the cardinality of join queries.
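For readers unfamiliar with the terms, the sketch below shows the usual definition of q-error (the larger of estimate/truth and truth/estimate) and one way a per-bucket join size could be estimated from two histograms alone. The histogram layout, the function names, and the uniformity-based estimate are assumptions for illustration; this is a textbook-style estimate and not the construction algorithm the paper proposes.

```python
def q_error(estimate, truth):
    """q-error of a positive estimate against a positive true value."""
    if estimate <= 0 or truth <= 0:
        raise ValueError("q-error is defined for positive values only")
    return max(estimate / truth, truth / estimate)


def join_histogram(hist_r, hist_s):
    """Estimate per-bucket join sizes from two aligned histograms.

    Each histogram maps a bucket (lo, hi) to (row_count, distinct_count),
    with distinct_count assumed positive. The estimate
    rows_r * rows_s / max(distinct_r, distinct_s) is the textbook
    uniformity/containment assumption, used here for illustration only.
    """
    joined = {}
    for bucket, (rows_r, distinct_r) in hist_r.items():
        if bucket not in hist_s:
            continue  # no matching bucket, so no estimated join rows
        rows_s, distinct_s = hist_s[bucket]
        joined[bucket] = rows_r * rows_s / max(distinct_r, distinct_s)
    return joined
```

Note that the sketch touches only the two input histograms and never the base tables, which mirrors the setting described in the abstract; whether any q-error bound carries over to such an estimate is exactly the question the paper addresses and is not claimed here.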
Citations: 6
BART in Action: Error Generation and Empirical Evaluations of Data-Cleaning Systems
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899397
Donatello Santoro, Patricia C. Arocena, Boris Glavic, G. Mecca, Renée J. Miller, Paolo Papotti
Repairing erroneous or conflicting data that violate a set of constraints is an important problem in data management. Many automatic or semi-automatic data-repairing algorithms have been proposed in the last few years, each with its own strengths and weaknesses. Bart is an open-source error-generation system conceived to support thorough experimental evaluations of these data-repairing systems. The demo is centered around three main lessons. To start, we discuss how generating errors in data is a complex problem with several facets. We introduce the important notions of detectability and repairability of an error, which stand at the core of Bart. Then, we show how, by changing the features of errors, it is possible to influence the performance of the tools quite significantly. Finally, we concretely put to work five data-repairing algorithms on dirty data of various kinds generated using Bart, and discuss their performance.
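To make the idea of a detectable error concrete, the sketch below injects a single dirty cell into a clean table and checks whether a functional dependency is violated afterwards. It reads "detectable" as "the change triggers a constraint violation", which is an assumption about the abstract's terminology; the helper names `violates_fd` and `inject_error` are hypothetical and are not part of Bart.

```python
def violates_fd(rows, lhs, rhs):
    """Return True if two rows agree on column `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return True
        seen.setdefault(key, row[rhs])
    return False


def inject_error(rows, index, column, dirty_value):
    """Return a copy of `rows` with one cell overwritten by `dirty_value`."""
    dirty = [dict(row) for row in rows]  # copy so the clean data is untouched
    dirty[index][column] = dirty_value
    return dirty
```

With a dependency from `zip` to `city`, changing the city of a row whose zip code also appears in another row produces a violation, while changing the city of a row with a unique zip code does not. The second kind of change is still an error, but no constraint can flag it.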
Citations: 2
Automatic Entity Recognition and Typing in Massive Text Data
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2912567
Xiang Ren, Ahmed El-Kishky, Heng Ji, Jiawei Han
In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schemas or hand-crafted features, they can be quickly adapted to a new domain, genre and language. We demonstrate on real datasets including various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. bio-medical domains) and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.
Citations: 13
Operator and Query Progress Estimation in Microsoft SQL Server Live Query Statistics
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2903728
Kukjin Lee, A. König, Vivek R. Narasayya, Bolin Ding, S. Chaudhuri, Brent Ellwein, Alexey Eksarevskiy, Manbeen Kohli, Jacob Wyant, Praneeta Prakash, Rimma V. Nehme, Jiexing Li, J. Naughton
We describe the design and implementation of the new Live Query Statistics (LQS) feature in Microsoft SQL Server 2016. The functionality includes the display of overall query progress as well as progress of individual operators in the query execution plan. We describe the overall functionality of LQS, give usage examples and detail all areas where we had to extend the current state-of-the-art to build the complete LQS feature. Finally, we evaluate the effect these extensions have on progress estimation accuracy with a series of experiments using a large set of synthetic and real workloads.
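As a rough illustration of what per-operator and overall progress can mean, the sketch below computes progress as rows produced so far divided by rows the optimizer expected, capped at 100%. This simple ratio is an assumption chosen for illustration; the abstract does not state the estimator LQS uses, and the function names here are hypothetical.

```python
def operator_progress(rows_so_far, estimated_rows):
    """Fraction of an operator's estimated output produced so far, capped at 1."""
    if estimated_rows <= 0:
        return 0.0  # no usable estimate, so report no progress
    return min(rows_so_far / estimated_rows, 1.0)


def query_progress(operators):
    """Overall progress as total rows produced over total rows expected.

    `operators` is a list of (rows_so_far, estimated_rows) pairs. Rows
    beyond an operator's estimate are not counted, so a single
    underestimated operator cannot push the total above 100%.
    """
    done = sum(min(rows, est) for rows, est in operators)
    total = sum(est for _, est in operators)
    return done / total if total > 0 else 0.0
```

A ratio like this is only as good as the cardinality estimates behind it, which suggests why improving estimation accuracy is a central concern for a feature of this kind.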
Citations: 23
Extracting Equivalent SQL from Imperative Code in Database Applications
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2882926
K. V. Emani, Karthik Ramachandra, S. Bhattacharya, S. Sudarshan
Optimizing the performance of database applications is an area of practical importance, and has received significant attention in recent years. In this paper we present an approach to this problem which is based on extracting a concise algebraic representation of (parts of) an application, which may include imperative code as well as SQL queries. The algebraic representation can then be translated into SQL to improve application performance, by reducing the volume of data transferred, as well as reducing latency by minimizing the number of network round trips. Our techniques can be used for performing optimizations of database applications that techniques proposed earlier cannot perform. The algebraic representations can also be used for other purposes such as extracting equivalent queries for keyword search on form results. Our experiments indicate that the techniques we present are widely applicable to real-world database applications, in terms of successfully extracting algebraic representations of application behavior, as well as in terms of providing performance benefits when used for optimization.
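The kind of rewrite this abstract describes can be pictured with a hand-written before-and-after pair. In the sketch below, the first function filters and sums rows in application code, and the second expresses the same computation as a single SQL query. The `orders` table, the function names, and the use of an in-memory SQLite database are assumptions for illustration; the paper's technique derives such queries automatically, which this example does not attempt.

```python
import sqlite3


def total_imperative(conn, customer_id):
    """Fetch every order row, then filter and sum in application code."""
    total = 0
    for cust, amount in conn.execute("SELECT customer_id, amount FROM orders"):
        if cust == customer_id:
            total += amount
    return total


def total_sql(conn, customer_id):
    """Equivalent query: the filter and the sum are pushed into SQL."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]


def demo():
    """Build a tiny table and run both versions for customer 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 5), (1, 7)]
    )
    return total_imperative(conn, 1), total_sql(conn, 1)
```

The imperative version transfers the whole table to the application, while the SQL version returns a single value, which is the reduction in data volume the abstract refers to. The `COALESCE` keeps the two versions in agreement when no rows match, since `SUM` over an empty set yields NULL.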
Citations: 36
CoDAR: Revealing the Generalized Procedure & Recommending Algorithms of Community Detection
Pub Date : 2016-06-26 DOI: 10.1145/2882903.2899386
Xiang Ying, Chaokun Wang, M. Wang, J. Yu, Jun Zhang
Community detection has attracted great interest in graph analysis and mining during the past decade, and a great number of approaches have been developed to address this problem. However, the lack of a uniform framework and a reasonable evaluation method makes it difficult to analyze, compare, and evaluate this extensive body of work, let alone pick out the best approach when necessary. In this paper, we design a tool called CoDAR, which reveals the generalized procedure of community detection and monitors the real-time structural changes of the network during the detection process. Moreover, CoDAR adopts 12 recognized metrics and builds a rating model for evaluating the quality of detected communities in order to recommend the best-performing algorithm. Finally, the tool also provides interactive windows for display.
Citations: 2