Graph-based Strategy for Establishing Morphology Similarity
Namit Juneja, J. Zola, V. Chandola, O. Wodo
Analysis of morphological data is central to a broad class of scientific problems in materials science, astronomy, bio-medicine, and many other fields. Understanding relationships between morphologies is a core analytical task in such settings. In this paper, we propose a graph-based framework for measuring similarity between morphologies. Our framework delivers a novel representation of a morphology as an augmented graph that encodes application-specific knowledge through configurable signature functions. It also provides an algorithm to compute the similarity between a pair of morphology graphs. We present experimental results in which the framework is applied to morphology data from high-fidelity numerical simulations arising in materials science. The results demonstrate that our proposed measure is superior to state-of-the-art methods, such as FFT-based measures, in capturing the semantic similarity between morphologies.
DOI: https://doi.org/10.1145/3468791.3468819
Missing Data Patterns: From Theory to an Application in the Steel Industry
Michal Bechny, F. Sobieczky, Jürgen Zeindl, Lisa Ehrlinger
Missing data (MD) is a prevalent problem that can negatively affect the trustworthiness of data analysis. In industrial use cases, faulty sensors and errors during data integration are common causes of systematically missing values. The majority of MD research deals with imputation, i.e., the replacement of missing values with “best guesses”. Most imputation methods require missing values to occur independently, which is rarely the case in industry. Thus, it is necessary to identify missing data patterns (i.e., systematically missing values) prior to imputation (1) to understand the cause of the missingness, (2) to gain deeper insight into the data, and (3) to choose the proper imputation technique. However, the literature describes a wide variety of MD patterns without a common formalization. In this paper, we introduce the first formal definition of MD patterns. Building on this theory, we develop a systematic approach for automatically detecting MD patterns in industrial data. The approach was developed in cooperation with voestalpine Stahl GmbH, where we applied it to real-world data from the steel industry and demonstrated its efficacy with a simulation study.
DOI: https://doi.org/10.1145/3468791.3468841
SDTA: An Algebra for Statistical Data Transformation
Jie Song, H. Jagadish, George Alter
Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, and Python (Pandas). The disparate data models, language representations, and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. To tackle these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called the Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, the Structural Data Transformation Algebra (SDTA), which can transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, the Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance, which maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% of 4,185 SAS commands and 91.6% of 9,087 SPSS commands obtained from a repository. We illustrate with examples how SDTA/SDTL can assist with the documentation of statistical data transformation, an important aspect often neglected in dataset metadata. We propose a system, called C2Metadata, that automatically captures transformation and provenance information in SDTL as part of the metadata. Moreover, given a conversion mechanism from a source statistical language to SDTA/SDTL, we show how transformation programs can be converted into functionally equivalent programs in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations through rule-based rewrites similar to SQL optimizations.
DOI: https://doi.org/10.1145/3468791.3468811
Truss Decomposition on Large Probabilistic Networks using H-Index
F. Esfahani, M. Daneshmand, Venkatesh Srinivasan, Alex Thomo, Kui Wu
Truss decomposition is a popular approach for discovering cohesive subgraphs. However, truss decomposition on probabilistic graphs is challenging. State-of-the-art methods either do not scale to large graphs or use approximation techniques to achieve scalability. We present an exact and scalable algorithm for truss decomposition of probabilistic graphs. The algorithm is based on progressively tightening the estimated truss value of each edge using h-index computation and a novel use of dynamic programming. Our proposed algorithm (1) is significantly faster than the state of the art and scales to much larger graphs, (2) is progressive, allowing the user to see near-final results along the way, (3) does not sacrifice the exactness of the final result, and (4) achieves all this while processing only an edge and its immediate neighbors at a time, resulting in a small memory footprint. Our extensive experimental results confirm the scalability and efficiency of our algorithm.
DOI: https://doi.org/10.1145/3468791.3468817
NIR-Tree: A Non-Intersecting R-Tree
K. Langendoen, Brad Glasbergen, Khuzaima S. Daudjee
Indexes for multidimensional data based on the R-Tree are widely used by databases for a broad range of applications. Such index trees support point and range queries but are costly to construct over datasets of millions of points. We present the Non-Intersecting R-Tree (NIR-Tree), a novel insert-efficient, in-memory, multidimensional index that uses bounding polygons to provide efficient point and range query performance while indexing data at least an order of magnitude faster. The NIR-Tree leverages non-intersecting bounding polygons to reduce the number of nodes accessed during queries compared to existing R-family indexes. Our experiments demonstrate that inserting into a NIR-Tree is 27× faster than into the ubiquitous R*-Tree, with point queries completing 2× faster and range queries executing just as quickly.
DOI: https://doi.org/10.1145/3468791.3468818
Scalable Query Processing and Query Engines over Cloud Databases: Models, Paradigms, Techniques, Future Challenges
A. Cuzzocrea
Scalable query processing and scalable query engines over Cloud databases form a vibrant area of research that has recently emerged within both the academic and industrial research communities. The area has been further stirred up by the current explosion of big data management and analytics models and techniques, which, usually executed within the internal layer of public as well as private Clouds, pose severe new challenges to the long-standing distributed query processing optimization problem in (distributed) database systems. Among other issues, taming the complexity of query execution plays a leading role, especially in typical Cloud environments that include many data processing tasks of different granularity and scale running over large clusters. Motivated by these considerations, this paper focuses on models, paradigms, techniques, and future challenges of scalable query processing and query engines over Cloud databases, reporting on state-of-the-art results and emerging trends, along with a critical discussion of the future work we should expect from the community.
DOI: https://doi.org/10.1145/3468791.3472264
WBSum: Workload-based Summaries for RDF/S KBs
Giannis Vassiliou, Georgia Troullinou, N. Papadakis, H. Kondylakis
Semantic summaries aim to extract compact information from the original RDF graph while reducing its size. State-of-the-art structural semantic summaries focus primarily on the graph structure of the data, trying to maximize the summary's utility for a specific purpose, such as indexing, query answering, or source selection. In this paper, we present an approach that constructs high-quality summaries by exploiting a small part of the query workload, maximizing their utility for query answering, i.e., the query coverage. We demonstrate our approach using two real-world datasets and the corresponding query workloads, and we show that it strictly dominates the current state of the art in terms of query coverage.
DOI: https://doi.org/10.1145/3468791.3468815
On Lowering Merge Costs of an LSM Tree
Dai Hai Ton That, Mohammad Gharehdaghi, A. Rasin, T. Malik
In column stores, which ingest large amounts of data into multiple column groups, query performance deteriorates. Commercial column stores use a log-structured merge (LSM) tree on projections to ingest data rapidly. The LSM tree improves ingestion performance, but for column stores the sort-merge maintenance phase of an LSM tree is I/O-intensive, which slows concurrent queries and reduces overall throughput. In this paper, we present a simple heuristic approach to reduce the sorting and merging costs that arise when data is ingested in column stores. We demonstrate how a Min-Max heuristic can construct buckets and identify the level of sortedness in each range of data. Filled and relatively sorted buckets are written out to disk; unfilled buckets are retained to achieve a better level of sortedness, thus avoiding the expensive sort-merge phase. We compare our Min-Max approach with the LSM tree and production column stores using real and synthetic datasets.
DOI: https://doi.org/10.1145/3468791.3468820
HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources
Nikolaos Kardoulakis, Kenza Kellou-Menouer, Georgia Troullinou, Zoubida Kedad, D. Plexousakis, H. Kondylakis
The rapid explosion of linked data has resulted in many weakly structured and incomplete data sources in which typing information might be missing. At the same time, type information is essential for a number of tasks such as query answering, integration, summarization, and partitioning. Existing approaches for type discovery either completely ignore the type declarations available in the dataset (implicit type discovery approaches) or rely only on existing types in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances; this process is expensive and not incremental. Explicit type enrichment approaches, on the other hand, are not able to identify new types and cannot process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, index them, and then group them to identify the types. While processing an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance into the correct group and by refining the groups already built. We show analytically and experimentally that our approach dominates competitors from both worlds, implicit type discovery and explicit type enrichment, in terms of efficiency, while outperforming them in most cases in terms of quality.
DOI: https://doi.org/10.1145/3468791.3468808
Subarray Skyline Query Processing in Array Databases
Dalsu Choi, Hyunsik Yoon, Y. Chung
With the generation of large-scale spatial data in various fields, array databases, which represent space as an array, have become one of the means of managing spatial data. Cells in an array tend to interact with one another; therefore, some applications need to consider subarrays rather than single cells. In addition, each cell has several attribute values that describe its features. Based on these two observations, we propose a new type of query, the subarray skyline, which provides a way to find meaningful subarrays or to filter out less meaningful ones based on their attributes. We also introduce an efficient query processing method, ReSKY, in both centralized and distributed settings. Through extensive experiments using an array database and real datasets, we show that ReSKY outperforms existing techniques.
DOI: https://doi.org/10.1145/3468791.3468799