
Latest Publications: 33rd International Conference on Scientific and Statistical Database Management

Graph-based Strategy for Establishing Morphology Similarity
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468819
Namit Juneja, J. Zola, V. Chandola, O. Wodo
Analysis of morphological data is central to a broad class of scientific problems in materials science, astronomy, bio-medicine, and many others. Understanding relationships between morphologies is a core analytical task in such settings. In this paper, we propose a graph-based framework for measuring similarity between morphologies. Our framework delivers a novel representation of a morphology as an augmented graph that encodes application-specific knowledge through configurable signature functions. It also provides an algorithm to compute the similarity between a pair of morphology graphs. We present experimental results in which the framework is applied to morphology data from high-fidelity numerical simulations in materials science. The results demonstrate that our proposed measure is superior at capturing the semantic similarity between morphologies compared to state-of-the-art methods such as FFT-based measures.
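To make the idea of signature-augmented graphs concrete, here is a minimal Python sketch. The signature functions, the region encoding, and the greedy cosine-matching similarity below are illustrative assumptions for the sake of the example, not the paper's actual algorithm.

```python
# Sketch: morphologies as graphs whose nodes carry signature vectors
# produced by configurable functions; similarity via greedy best-match.
import networkx as nx
import numpy as np

def build_morphology_graph(regions, adjacency, signature_fns):
    """Each region becomes a node annotated with a signature vector."""
    g = nx.Graph()
    for rid, region in regions.items():
        g.add_node(rid, sig=np.array([fn(region) for fn in signature_fns]))
    g.add_edges_from(adjacency)
    return g

def graph_similarity(g1, g2):
    """Average, over g1's nodes, of the best cosine match in g2."""
    sigs1 = [d["sig"] for _, d in g1.nodes(data=True)]
    sigs2 = [d["sig"] for _, d in g2.nodes(data=True)]
    scores = []
    for s1 in sigs1:
        best = max(
            float(s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2) + 1e-12))
            for s2 in sigs2
        )
        scores.append(best)
    return sum(scores) / len(scores)

# Toy usage: two regions described by (volume, interface length).
regions = {0: {"volume": 4.0, "interface": 2.0},
           1: {"volume": 6.0, "interface": 2.0}}
sig_fns = [lambda r: r["volume"], lambda r: r["interface"]]
g = build_morphology_graph(regions, [(0, 1)], sig_fns)
print(graph_similarity(g, g))  # identical graphs -> ~1.0
```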
Citations: 0
Missing Data Patterns: From Theory to an Application in the Steel Industry
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468841
Michal Bechny, F. Sobieczky, Jürgen Zeindl, Lisa Ehrlinger
Missing data (MD) is a prevalent problem that can negatively affect the trustworthiness of data analysis. In industrial use cases, faulty sensors or errors during data integration are common causes of systematically missing values. The majority of MD research deals with imputation, i.e., the replacement of missing values with “best guesses”. Most imputation methods require missing values to occur independently, which is rarely the case in industry. Thus, it is necessary to identify missing data patterns (i.e., systematically missing values) prior to imputation: (1) to understand the cause of the missingness, (2) to gain deeper insight into the data, and (3) to choose the proper imputation technique. However, the literature describes a wide variety of MD patterns without a common formalization. In this paper, we introduce the first formal definition of MD patterns. Building on this theory, we developed a systematic approach to automatically detect MD patterns in industrial data. The approach was developed in cooperation with voestalpine Stahl GmbH, where we applied it to real-world data from the steel industry and demonstrated its efficacy with a simulation study.
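A simple way to see what "systematically missing" means in practice is to group rows by their missingness indicator vectors: unusually frequent vectors hint at a common cause, such as one faulty sensor knocking out a fixed set of columns. The sketch below is a hedged simplification, not the paper's formal MD-pattern definition.

```python
# Sketch: hunt for candidate missing-data patterns before imputation by
# counting how often each per-row missingness vector occurs.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "temp":  [410.2, np.nan, 415.8, np.nan, 409.9],
    "press": [1.02,  np.nan, 1.05,  np.nan, 1.01],
    "speed": [3.2,   3.4,    np.nan, 3.3,   3.1],
})

# Encode each row's missingness as a tuple of booleans per column.
patterns = df.isna().apply(tuple, axis=1)
print(patterns.value_counts())
# The vector (True, True, False) occurring repeatedly suggests that
# "temp" and "press" go missing together -- a candidate MD pattern.
```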
Citations: 3
SDTA: An Algebra for Statistical Data Transformation
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468811
Jie Song, H. Jagadish, George Alter
Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, and Python (Pandas). The disparate data models, language representations, and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called the Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, the Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called the Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance, which maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% of 4,185 SAS commands and 91.6% of 9,087 SPSS commands obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in dataset metadata. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functionally equivalent transformation programs can be converted to other functionally equivalent programs, in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations with rule-based rewrites similar to SQL optimizations.
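The central design point, operations that act on data and variable-level metadata together, can be illustrated with a small Python sketch. The operation names and signatures below are assumptions made for illustration; they are not SDTA's actual operators.

```python
# Sketch: language-neutral transformation steps carrying (data, metadata)
# pairs through a pipeline, so provenance survives each operation.
from dataclasses import dataclass
import pandas as pd

@dataclass
class Rename:
    old: str
    new: str
    def apply(self, data, meta):
        data = data.rename(columns={self.old: self.new})
        meta = {self.new if k == self.old else k: v for k, v in meta.items()}
        return data, meta

@dataclass
class Compute:
    name: str
    expr: str          # a pandas-eval expression, e.g. "income * 12"
    label: str
    def apply(self, data, meta):
        data = data.assign(**{self.name: data.eval(self.expr)})
        meta = {**meta, self.name: {"label": self.label,
                                    "derived_from": self.expr}}
        return data, meta

def run(pipeline, data, meta):
    for op in pipeline:
        data, meta = op.apply(data, meta)
    return data, meta

df = pd.DataFrame({"inc_m": [1000, 2500]})
meta = {"inc_m": {"label": "monthly income"}}
df, meta = run([Rename("inc_m", "income"),
                Compute("income_yr", "income * 12", "yearly income")],
               df, meta)
print(df)
print(meta)  # metadata records the derivation alongside the data
```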
Citations: 0
Truss Decomposition on Large Probabilistic Networks using H-Index
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468817
F. Esfahani, M. Daneshmand, Venkatesh Srinivasan, Alex Thomo, Kui Wu
Truss decomposition is a popular approach for discovering cohesive subgraphs. However, truss decomposition on probabilistic graphs is challenging. State-of-the-art approaches either do not scale to large graphs or use approximation techniques to achieve scalability. We present an exact and scalable algorithm for truss decomposition of probabilistic graphs. The algorithm progressively tightens the estimate of each edge's truss value using h-index computation and a novel use of dynamic programming. Our proposed algorithm (1) is significantly faster than the state-of-the-art and scales to much larger graphs, (2) is progressive, allowing the user to see near-results along the way, (3) does not sacrifice the exactness of the final result, and (4) achieves all this while processing only an edge and its immediate neighbors at a time, resulting in a smaller memory footprint. Our extensive experimental results confirm the scalability and efficiency of our algorithm.
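The h-index tightening scheme can be shown on a deterministic graph: every edge starts from its triangle support, and each round replaces an edge's estimate with the h-index of the min-estimates over its triangles; estimates only decrease and converge. This is a hedged toy that ignores edge probabilities entirely, which are the paper's actual subject.

```python
# Sketch: local h-index iteration for support-based truss estimates.
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i
    return h

def truss_estimates(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    est, tri = {}, {}
    for u, v in edges:
        e = frozenset((u, v))
        common = adj[u] & adj[v]          # triangle-forming neighbors
        tri[e] = [(frozenset((u, w)), frozenset((v, w))) for w in common]
        est[e] = len(common)              # initial estimate = support
    changed = True
    while changed:                        # tighten until fixpoint
        changed = False
        for e, triangles in tri.items():
            new = h_index([min(est[e1], est[e2]) for e1, e2 in triangles])
            if new < est[e]:
                est[e], changed = new, True
    return est

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
print(truss_estimates(edges))
# Triangle edges converge to 1, edge (3,4) to 0; adding 2 to an
# estimate gives the usual k-truss number under this convention.
```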
Citations: 4
NIR-Tree: A Non-Intersecting R-Tree
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468818
K. Langendoen, Brad Glasbergen, Khuzaima S. Daudjee
Indexes for multidimensional data based on the R-Tree are widely used by databases for a broad range of applications. Such index trees support point and range queries but are costly to construct over datasets of millions of points. We present the Non-Intersecting R-Tree (NIR-Tree), a novel insert-efficient, in-memory, multidimensional index that uses bounding polygons to provide efficient point and range query performance while indexing data at least an order of magnitude faster. The NIR-Tree leverages non-intersecting bounding polygons to reduce the number of nodes accessed during queries, compared to existing R-family indexes. Our experiments demonstrate that inserting into a NIR-Tree is 27× faster than the ubiquitous R*-Tree, with point queries completing 2× faster and range queries executing just as quickly.
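The benefit of non-intersecting regions for point queries is that siblings never overlap, so at most one child can contain the query point and the search follows a single root-to-leaf path instead of backtracking into multiple subtrees as an R-Tree may. In the hedged sketch below, axis-aligned rectangles stand in for the NIR-Tree's bounding polygons; this is not the actual structure.

```python
# Sketch: point query over a tree with disjoint sibling regions.
from dataclasses import dataclass, field

@dataclass
class Node:
    rect: tuple                                   # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)    # leaves only

def contains(rect, p):
    xmin, ymin, xmax, ymax = rect
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax

def point_query(node, p):
    if not contains(node.rect, p):
        return None
    while node.children:
        # Siblings are disjoint, so at most one child can contain p.
        node = next((c for c in node.children if contains(c.rect, p)), None)
        if node is None:
            return None
    return p if p in node.points else None

leaf_a = Node((0, 0, 4, 4), points=[(1, 1), (3, 2)])
leaf_b = Node((5, 0, 9, 4), points=[(6, 3)])
root = Node((0, 0, 9, 4), children=[leaf_a, leaf_b])
print(point_query(root, (6, 3)))   # -> (6, 3), single path, no backtracking
```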
Citations: 1
Scalable Query Processing and Query Engines over Cloud Databases: Models, Paradigms, Techniques, Future Challenges
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3472264
A. Cuzzocrea
Scalable query processing and scalable query engines over Cloud databases form a vibrant area of research that has recently emerged within both the academic and industrial research communities. This area has been further stirred up by the current explosion of big data management and analytics models and techniques that, usually executed within the internal layer of public as well as private Clouds, pose severe (and new!) challenges to the difficult distributed query processing optimization problem in (distributed) database systems. Among others, taming the complexity of query execution plays a leading role, especially considering that a typical Cloud environment includes tens upon tens of data processing tasks of differing granularity (and scale) over large-scale clusters. Inspired by these considerations, this paper focuses on models, paradigms, techniques, and future challenges of scalable query processing and query engines over Cloud databases, reporting on state-of-the-art results as well as emerging trends, together with critical observations on the future work we should expect from the community.
Citations: 0
WBSum: Workload-based Summaries for RDF/S KBs
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468815
Giannis Vassiliou, Georgia Troullinou, N. Papadakis, H. Kondylakis
Semantic summaries try to extract compact information from the original RDF graph while reducing its size. State-of-the-art structural semantic summaries focus primarily on the graph structure of the data, trying to maximize the summary's utility for a specific purpose, such as indexing, query answering, or source selection. In this paper, we present an approach that constructs high-quality summaries by exploiting a small part of the query workload, maximizing their utility for query answering, i.e., the query coverage. We demonstrate our approach using two real-world datasets and the corresponding query workloads, and we show that it strictly dominates the current state of the art in terms of query coverage.
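The query-coverage objective itself is easy to state: the fraction of workload queries whose required schema elements all appear in the summary. The set-based model below is a hedged simplification of WBSum, chosen only to make the metric concrete.

```python
# Sketch: query coverage of a summary over a workload, with each query
# modeled as the set of schema nodes/properties it touches.
def coverage(summary, workload):
    covered = sum(1 for q in workload if q <= summary)  # subset test
    return covered / len(workload)

workload = [
    {"Person", "worksFor", "Organization"},
    {"Person", "name"},
    {"Paper", "citedBy"},
]
summary = {"Person", "name", "worksFor", "Organization"}
print(coverage(summary, workload))  # 2 of 3 queries covered -> 0.666...
```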
Citations: 5
On Lowering Merge Costs of an LSM Tree
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468820
Dai Hai Ton That, Mohammad Gharehdaghi, A. Rasin, T. Malik
In column stores, which ingest large amounts of data into multiple column groups, query performance deteriorates. Commercial column stores use a log-structured merge (LSM) tree on projections to ingest data rapidly. The LSM tree improves ingestion performance, but for column stores the sort-merge maintenance phase in an LSM tree is I/O-intensive, which slows concurrent queries and reduces overall throughput. In this paper, we present a simple heuristic approach to reduce the sorting and merging costs that arise when data is ingested into column stores. We demonstrate how a Min-Max heuristic can construct buckets and identify the level of sortedness in each range of data. Filled and relatively sorted buckets are written out to disk; unfilled buckets are retained to achieve a better level of sortedness, thus avoiding the expensive sort-merge phase. We compare our Min-Max approach with an LSM tree and production columnar stores using real and synthetic datasets.
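A toy rendering of the bucketing idea: each incoming key is routed to the in-memory bucket whose [min, max] range it disturbs least, full buckets are flushed as sorted runs, and unfilled buckets stay in memory to accumulate nearby keys. The capacity, the routing rule, and the range threshold below are assumptions for illustration, not the paper's tuned heuristic.

```python
# Sketch: Min-Max bucketing of an ingest stream into well-sorted runs.
CAPACITY = 4

def expansion(bucket, key):
    """How much the bucket's [min, max] range must grow to admit key."""
    lo, hi = min(bucket), max(bucket)
    return max(lo - key, 0) + max(key - hi, 0)

def ingest(stream):
    buckets, runs = [], []          # in-memory buckets, flushed sorted runs
    for key in stream:
        target = min(buckets, key=lambda b: expansion(b, key), default=None)
        if target is None or expansion(target, key) > 10:  # assumed threshold
            target = []
            buckets.append(target)
        target.append(key)
        if len(target) == CAPACITY:              # filled: flush a sorted run
            runs.append(sorted(target))
            buckets.remove(target)
    return runs, buckets

runs, leftover = ingest([5, 7, 6, 8, 42, 40, 41, 43, 9])
print(runs)      # [[5, 6, 7, 8], [40, 41, 42, 43]] -- two well-sorted runs
print(leftover)  # [[9]] retained in memory, awaiting more nearby keys
```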
Citations: 1
HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468808
Nikolaos Kardoulakis, Kenza Kellou-Menouer, Georgia Troullinou, Zoubida Kedad, D. Plexousakis, H. Kondylakis
The rapid explosion of linked data has resulted in many weakly structured and incomplete data sources in which typing information may be missing. Type information, however, is essential for a number of tasks such as query answering, integration, summarization, and partitioning. Existing approaches for type discovery either completely ignore the type declarations available in the dataset (implicit type discovery approaches) or rely only on existing types in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances; this process is expensive and not incremental. Explicit type enrichment approaches, on the other hand, are not able to identify new types and cannot process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, index them, and then group them to identify the types. While processing an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance into the correct group and by refining the groups already built. We show analytically and experimentally that our approach dominates competitors from both worlds, implicit type discovery and explicit type enrichment, in terms of efficiency, while outperforming them in quality in most cases.
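A hedged miniature of the incremental grouping idea: each instance is reduced to its property-set "pattern", compared against the patterns of the groups discovered so far, and either joins the closest group or starts a new one, with any declared type pinning the instance directly. The Jaccard similarity and threshold are illustrative assumptions, not HInT's actual indexing scheme.

```python
# Sketch: incremental, hybrid grouping of instances into candidate types.
THRESHOLD = 0.6

def jaccard(a, b):
    return len(a & b) / len(a | b)

def assign(instance_props, declared_type, groups):
    """groups: dict name -> representative property set (updated in place)."""
    if declared_type and declared_type in groups:        # explicit type wins
        groups[declared_type] |= instance_props
        return declared_type
    best, best_sim = None, 0.0
    for name, pattern in groups.items():
        sim = jaccard(instance_props, pattern)
        if sim > best_sim:
            best, best_sim = name, sim
    if best_sim >= THRESHOLD:
        groups[best] |= instance_props                   # refine the group
        return best
    name = declared_type or f"type_{len(groups)}"        # discover a new type
    groups[name] = set(instance_props)
    return name

groups = {}
print(assign({"name", "birthDate", "worksFor"}, None, groups))     # type_0
print(assign({"name", "birthDate"}, None, groups))                 # type_0
print(assign({"title", "author", "year"}, "Publication", groups))  # Publication
```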
Citations: 7
Subarray Skyline Query Processing in Array Databases
Pub Date: 2021-07-06 DOI: 10.1145/3468791.3468799
Dalsu Choi, Hyunsik Yoon, Y. Chung
With the generation of large-scale spatial data in various fields, array databases, which represent space as an array, have become one of the principal means of managing spatial data. Cells in an array tend to interact with one another; therefore, some applications must consider subarrays rather than single cells. In addition, each cell carries several attribute values describing its features. Based on these two observations, we propose a new type of query, the subarray skyline, that provides a way to find meaningful subarrays, or to filter out less meaningful ones, based on their attributes. We also introduce an efficient query processing method, ReSKY, in both centralized and distributed settings. Through extensive experiments using an array database and real datasets, we show that ReSKY outperforms the existing techniques.
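The query semantics can be illustrated by brute force: slide a fixed-size window over a 2-D array whose cells carry attribute vectors, summarize each window, and keep the windows not dominated on every attribute by another window. The per-attribute mean summary and the exhaustive comparison below are assumptions for the example; ReSKY's actual processing is far more efficient.

```python
# Sketch: brute-force subarray skyline over a 2-D array of attribute vectors.
import numpy as np

def dominates(a, b):
    """a dominates b if a >= b everywhere and a > b somewhere (maximizing)."""
    return np.all(a >= b) and np.any(a > b)

def subarray_skyline(arr, w):
    """arr: H x W x d attribute array; w: window side length."""
    h_, w_, _ = arr.shape
    cands = {
        (i, j): arr[i:i + w, j:j + w].mean(axis=(0, 1))
        for i in range(h_ - w + 1) for j in range(w_ - w + 1)
    }
    return [
        pos for pos, vec in cands.items()
        if not any(dominates(other, vec)
                   for p2, other in cands.items() if p2 != pos)
    ]

rng = np.random.default_rng(0)
arr = rng.random((4, 4, 2))        # 4x4 grid, 2 attributes per cell
print(subarray_skyline(arr, 2))    # top-left corners of skyline 2x2 windows
```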
Citations: 3