We present a novel adaptive log compression scheme. Results show 30% improvement on compression ratios over existing approaches.
提出了一种新的自适应日志压缩方案。结果表明,与现有方法相比,压缩比提高了30%。
{"title":"Adaptive log compression for massive log data","authors":"Robert Christensen, Feifei Li","doi":"10.1145/2463676.2465341","DOIUrl":"https://doi.org/10.1145/2463676.2465341","url":null,"abstract":"We present a novel adaptive log compression scheme. Results show 30% improvement on compression ratios over existing approaches.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77933595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, Mark D. Callaghan
Database benchmarks are an important tool for database researchers and practitioners that ease the process of making informed comparisons between different database hardware, software and configurations. Large scale web services such as social networks are a major and growing database application area, but currently there are few benchmarks that accurately model web service workloads. In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.
{"title":"LinkBench: a database benchmark based on the Facebook social graph","authors":"Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, Mark D. Callaghan","doi":"10.1145/2463676.2465296","DOIUrl":"https://doi.org/10.1145/2463676.2465296","url":null,"abstract":"Database benchmarks are an important tool for database researchers and practitioners that ease the process of making informed comparisons between different database hardware, software and configurations. Large scale web services such as social networks are a major and growing database application area, but currently there are few benchmarks that accurately model web service workloads.\u0000 In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store \"social graph\" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76763559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Whang, Tae-Seob Yun, Yeon-Mi Yeo, I. Song, Hyuk-Yoon Kwon, In-Joong Kim
Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional O/S file system vs. DBMS. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach with advantages of supporting the high-level (i.e., DBMS-level), SQL-like programming interface.
{"title":"ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality","authors":"K. Whang, Tae-Seob Yun, Yeon-Mi Yeo, I. Song, Hyuk-Yoon Kwon, In-Joong Kim","doi":"10.1145/2463676.2465316","DOIUrl":"https://doi.org/10.1145/2463676.2465316","url":null,"abstract":"Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional O/S file system vs. DBMS. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach with advantages of supporting the high-level (i.e., DBMS-level), SQL-like programming interface.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82488125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, Chen Li
The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.
{"title":"Improving regular-expression matching on strings using negative factors","authors":"Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, Chen Li","doi":"10.1145/2463676.2465289","DOIUrl":"https://doi.org/10.1145/2463676.2465289","url":null,"abstract":"The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87676261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this demonstration, we present Ibex, a novel storage engine featuring hybrid, FPGA-accelerated query processing. In Ibex, an FPGA is inserted along the path between the storage devices and the database engine. The FPGA acts as an intelligent storage engine supporting query off-loading from the query engine. Apart from significant performance improvements for many common SQL queries, the demo will show how Ibex reduces data movement, CPU usage, and overall energy consumption in database appliances.
{"title":"Less watts, more performance: an intelligent storage engine for data appliances","authors":"L. Woods, J. Teubner, G. Alonso","doi":"10.1145/2463676.2463685","DOIUrl":"https://doi.org/10.1145/2463676.2463685","url":null,"abstract":"In this demonstration, we present Ibex, a novel storage engine featuring hybrid, FPGA-accelerated query processing. In Ibex, an FPGA is inserted along the path between the storage devices and the database engine. The FPGA acts as an intelligent storage engine supporting query off-loading from the query engine. Apart from significant performance improvements for many common SQL queries, the demo will show how Ibex reduces data movement, CPU usage, and overall energy consumption in database appliances.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83314891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The fact that multi-core CPUs have become so common and that the number of CPU cores in one chip has continued to rise means that a server machine can easily contain an extremely high number of CPU cores. The CPU scalability of IT systems is thus attracting a considerable amount of research attention. Some systems, such as ACID-compliant DBMSs, are said to be difficult to scale, probably due to the mutual exclusion required to ensure data consistency. Possible countermeasures include latch-free (LF) data structures, an elemental technology to improve the CPU scalability by eliminating the need for mutual exclusion. This paper investigates these LF data structures with a particular focus on their applicability and effectiveness. Some existing LF data structures (such as LF hash tables) have been adapted to PostgreSQL, one of the most popular open-source DBMSs. The performance improvement was evaluated with a benchmark program simulating real-world transactions. Measurement results obtained from state-of-the-art 80-core machines demonstrated that the LF data structures were effective for performance improvement in a many-core situation in which DBT-1 throughput increased by about 2.5 times. Although the poor performance of the original DBMS was due to a severe latch-related bottleneck and can be improved by parameter tuning, it is of practical importance that LF data structures provided performance improvement without deep understanding of the target system behavior that is necessary for the parameter tuning.
{"title":"Latch-free data structures for DBMS: design, implementation, and evaluation","authors":"T. Horikawa","doi":"10.1145/2463676.2463720","DOIUrl":"https://doi.org/10.1145/2463676.2463720","url":null,"abstract":"The fact that multi-core CPUs have become so common and that the number of CPU cores in one chip has continued to rise means that a server machine can easily contain an extremely high number of CPU cores. The CPU scalability of IT systems is thus attracting a considerable amount of research attention. Some systems, such as ACID-compliant DBMSs, are said to be difficult to scale, probably due to the mutual exclusion required to ensure data consistency. Possible countermeasures include latch-free (LF) data structures, an elemental technology to improve the CPU scalability by eliminating the need for mutual exclusion. This paper investigates these LF data structures with a particular focus on their applicability and effectiveness. Some existing LF data structures (such as LF hash tables) have been adapted to PostgreSQL, one of the most popular open-source DBMSs. The performance improvement was evaluated with a benchmark program simulating real-world transactions. Measurement results obtained from state-of-the-art 80-core machines demonstrated that the LF data structures were effective for performance improvement in a many-core situation in which DBT-1 throughput increased by about 2.5 times. Although the poor performance of the original DBMS was due to a severe latch-related bottleneck and can be improved by parameter tuning, it is of practical importance that LF data structures provided performance improvement without deep understanding of the target system behavior that is necessary for the parameter tuning.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78533702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan Finis, Robert Brunel, A. Kemper, Thomas Neumann, Franz Färber, Norman May
Main-memory database systems are emerging as the new backbone of business applications. Besides flat relational data representations also hierarchical ones are essential for these modern applications; therefore we devise a new indexing and versioning approach for hierarchies that is deeply integrated into the relational kernel. We propose the DeltaNI index as a versioned pendant of the nested intervals (NI) labeling scheme. The index is space- and time-efficient and yields a gapless, fixed-size integer NI labeling for each version while also supporting branching histories. In contrast to a naive NI labeling, it facilitates even complex updates of the tree structure. As many query processing techniques that work on top of the NI labeling have already been proposed, our index can be used as a building block for processing various kinds of queries. We evaluate the performance of the index on large inputs consisting of millions of nodes and thousands of versions. Thereby we show that DeltaNI scales well and can deliver satisfying performance for large business scenarios.
{"title":"DeltaNI: an efficient labeling scheme for versioned hierarchical data","authors":"Jan Finis, Robert Brunel, A. Kemper, Thomas Neumann, Franz Färber, Norman May","doi":"10.1145/2463676.2465329","DOIUrl":"https://doi.org/10.1145/2463676.2465329","url":null,"abstract":"Main-memory database systems are emerging as the new backbone of business applications. Besides flat relational data representations also hierarchical ones are essential for these modern applications; therefore we devise a new indexing and versioning approach for hierarchies that is deeply integrated into the relational kernel.\u0000 We propose the DeltaNI index as a versioned pendant of the nested intervals (NI) labeling scheme. The index is space- and time-efficient and yields a gapless, fixed-size integer NI labeling for each version while also supporting branching histories. In contrast to a naive NI labeling, it facilitates even complex updates of the tree structure. As many query processing techniques that work on top of the NI labeling have already been proposed, our index can be used as a building block for processing various kinds of queries. We evaluate the performance of the index on large inputs consisting of millions of nodes and thousands of versions. Thereby we show that DeltaNI scales well and can deliver satisfying performance for large business scenarios.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78234839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the data collected by enterprises grows in scale, there is a growing trend of performing data analytics on large datasets. Batch processing systems that can handle petabyte scale of data, such as Hadoop, have flourished and gained traction in the industry. As the results of batch analytics have been used to continuously improve front-facing user experience, there is a growing interest in pushing the processing latency down. This trend has fueled a resurgence in the development and usage of execution engines that can process continuous queries. An important class of continuous queries is windowed aggregation queries. Such queries arise in a wide range of applications such as generating personalized content and results. Today, considerable manual effort goes into finding the most suitable execution engine for these queries and on tuning query performance on these engines. An ecosystem composed of multiple execution engines may be needed in order to run the overall query workload efficiently given the diverse set of requirements that arise in practice. Cyclops is a continuous query processing platform that manages and orchestrates windowed aggregation queries in an ecosystem composed of multiple continuous query execution engines. Cyclops employs a cost-based approach for picking the most suitable engine and plan for executing a given query. This demonstration first presents an interactive visualization of the rich execution plan space of windowed aggregation queries, which allows users to analyze and understand the differences among plans. The next part of the demonstration will drill down into the design of Cyclops. For a given query, we show the cost spectrum of query execution plans across three different execution engines---Esper, Storm, and Hadoop---as estimated by Cyclops.
{"title":"Execution and optimization of continuous queries with cyclops","authors":"Harold Lim, S. Babu","doi":"10.1145/2463676.2465248","DOIUrl":"https://doi.org/10.1145/2463676.2465248","url":null,"abstract":"As the data collected by enterprises grows in scale, there is a growing trend of performing data analytics on large datasets. Batch processing systems that can handle petabyte scale of data, such as Hadoop, have flourished and gained traction in the industry. As the results of batch analytics have been used to continuously improve front-facing user experience, there is a growing interest in pushing the processing latency down. This trend has fueled a resurgence in the development and usage of execution engines that can process continuous queries.\u0000 An important class of continuous queries is windowed aggregation queries. Such queries arise in a wide range of applications such as generating personalized content and results. Today, considerable manual effort goes into finding the most suitable execution engine for these queries and on tuning query performance on these engines. An ecosystem composed of multiple execution engines may be needed in order to run the overall query workload efficiently given the diverse set of requirements that arise in practice.\u0000 Cyclops is a continuous query processing platform that manages and orchestrates windowed aggregation queries in an ecosystem composed of multiple continuous query execution engines. Cyclops employs a cost-based approach for picking the most suitable engine and plan for executing a given query. This demonstration first presents an interactive visualization of the rich execution plan space of windowed aggregation queries, which allows users to analyze and understand the differences among plans. The next part of the demonstration will drill down into the design of Cyclops. For a given query, we show the cost spectrum of query execution plans across three different execution engines---Esper, Storm, and Hadoop---as estimated by Cyclops.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73053967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large scale data warehouses rely heavily on secondary indexes, such as bitmaps and b-trees, to limit access to slow IO devices. However, with the advent of large main memory systems, cache conscious secondary indexes are needed to improve also the transfer bandwidth between memory and cpu. In this paper, we introduce column imprint, a simple but efficient cache conscious secondary index. A column imprint is a collection of many small bit vectors, each indexing the data points of a single cacheline. An imprint is used during query evaluation to limit data access and thus minimize memory traffic. The compression for imprints is cpu friendly and exploits the empirical observation that data often exhibits local clustering or partial ordering as a side-effect of the construction process. Most importantly, column imprint compression remains effective and robust even in the case of unclustered data, while other state-of-the-art solutions fail. We conducted an extensive experimental evaluation to assess the applicability and the performance impact of the column imprints. The storage overhead, when experimenting with real world datasets, is just a few percent over the size of the columns being indexed. The evaluation time for over 40000 range queries of varying selectivity revealed the efficiency of the proposed index compared to zonemaps and bitmaps with WAH compression.
{"title":"Column imprints: a secondary index structure","authors":"Lefteris Sidirourgos, M. Kersten","doi":"10.1145/2463676.2465306","DOIUrl":"https://doi.org/10.1145/2463676.2465306","url":null,"abstract":"Large scale data warehouses rely heavily on secondary indexes, such as bitmaps and b-trees, to limit access to slow IO devices. However, with the advent of large main memory systems, cache conscious secondary indexes are needed to improve also the transfer bandwidth between memory and cpu. In this paper, we introduce column imprint, a simple but efficient cache conscious secondary index. A column imprint is a collection of many small bit vectors, each indexing the data points of a single cacheline. An imprint is used during query evaluation to limit data access and thus minimize memory traffic. The compression for imprints is cpu friendly and exploits the empirical observation that data often exhibits local clustering or partial ordering as a side-effect of the construction process. Most importantly, column imprint compression remains effective and robust even in the case of unclustered data, while other state-of-the-art solutions fail. We conducted an extensive experimental evaluation to assess the applicability and the performance impact of the column imprints. The storage overhead, when experimenting with real world datasets, is just a few percent over the size of the columns being indexed. The evaluation time for over 40000 range queries of varying selectivity revealed the efficiency of the proposed index compared to zonemaps and bitmaps with WAH compression.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74575245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mihaela A. Bornea, Julian T Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, O. Udrea, Bishwaranjan Bhattacharjee
Efficient storage and querying of RDF data is of increasing importance, due to the increased popularity and widespread acceptance of RDF on the web and in the enterprise. In this paper, we describe a novel storage and query mechanism for RDF which works on top of existing relational representations. Reliance on relational representations of RDF means that one can take advantage of 35+ years of research on efficient storage and querying, industrial-strength transaction support, locking, security, etc. However, there are significant challenges in storing RDF in relational, which include data sparsity and schema variability. We describe novel mechanisms to shred RDF into relational, and novel query translation techniques to maximize the advantages of this shredded representation. We show that these mechanisms result in consistently good performance across multiple RDF benchmarks, even when compared with current state-of-the-art stores. This work provides the basis for RDF support in DB2 v.10.1.
{"title":"Building an efficient RDF store over a relational database","authors":"Mihaela A. Bornea, Julian T Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, O. Udrea, Bishwaranjan Bhattacharjee","doi":"10.1145/2463676.2463718","DOIUrl":"https://doi.org/10.1145/2463676.2463718","url":null,"abstract":"Efficient storage and querying of RDF data is of increasing importance, due to the increased popularity and widespread acceptance of RDF on the web and in the enterprise. In this paper, we describe a novel storage and query mechanism for RDF which works on top of existing relational representations. Reliance on relational representations of RDF means that one can take advantage of 35+ years of research on efficient storage and querying, industrial-strength transaction support, locking, security, etc. However, there are significant challenges in storing RDF in relational, which include data sparsity and schema variability. We describe novel mechanisms to shred RDF into relational, and novel query translation techniques to maximize the advantages of this shredded representation. We show that these mechanisms result in consistently good performance across multiple RDF benchmarks, even when compared with current state-of-the-art stores. This work provides the basis for RDF support in DB2 v.10.1.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72595669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}