
Latest Publications in Proc. VLDB Endow.

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583163
Derek Paulsen, Yash Govind, A. Doan
Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
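The core idea, top-k tf/idf blocking, can be pictured in a few lines: score every candidate record by tf/idf cosine similarity and keep only each record's k best matches as candidate pairs. The sketch below is a pure-Python illustration of that idea, not Sparkly's Lucene-on-Spark implementation; all names are ours.

```python
import math
from collections import Counter

def tfidf_topk_block(table_a, table_b, k=2):
    # Build an idf-weighted vector index over table_b.
    docs = [rec.lower().split() for rec in table_b]
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n / c) + 1.0 for tok, c in df.items()}  # smoothed idf

    def vectorize(tokens):
        tf = Counter(tokens)
        vec = {t: f * idf.get(t, 0.0) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    index = [vectorize(doc) for doc in docs]
    pairs = []
    for i, rec in enumerate(table_a):
        q = vectorize(rec.lower().split())
        # Cosine similarity against every indexed record; keep the top k.
        scored = sorted(((sum(q.get(t, 0.0) * w for t, w in v.items()), j)
                         for j, v in enumerate(index)), reverse=True)
        pairs.extend((i, j) for _, j in scored[:k])
    return pairs
```

The brute-force probe loop is where Lucene's inverted index and Spark's share-nothing partitioning would come in for large tables.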
Citations: 3
Bringing Compiling Databases to RISC Architectures
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583142
F. Gruber, Maximilian Bandle, A. Engelke, Thomas Neumann, Jana Giceva
Current hardware development greatly influences the design decisions of modern database systems. For many modern performance-focused database systems, query compilation has emerged as an integral part, and different approaches for code generation have evolved, making use of standard compilers, general-purpose compiler libraries, or domain-specific code generators. However, development has primarily focused on the dominant x86-64 server architecture, neglecting current hardware trends toward other CPU architectures such as ARM and other RISC architectures. Therefore, we explore the design space of code generation in database systems, considering a variety of state-of-the-art compilation approaches with a set of qualitative and quantitative metrics. Based on our findings, we have developed a new code generator called FireARM for AArch64-based systems in our database system, Umbra. We identify general as well as architecture-specific challenges for custom code generation in databases and provide potential solutions to abstract or handle them. Furthermore, we present an extensive evaluation of different compilation approaches in Umbra on a wide variety of x86-64 and ARM machines. In particular, we compare quantitative performance characteristics such as compilation latency and query throughput. Our results show that using standard languages and compiler infrastructures reduces the barrier to employing query compilation and allows for high performance on big data sets, while domain-specific code generators can achieve a significantly lower compilation overhead and allow for better targeting of new architectures.
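As a toy illustration of the generate-then-execute idea behind query compilation (not Umbra's or FireARM's actual pipeline, which emits native code), a predicate can be compiled once into specialized source and then run per tuple, instead of being re-interpreted for every row:

```python
def compile_filter(column, op, value):
    # Emit specialized Python source for one predicate and compile it once.
    # Real systems emit LLVM IR or machine code; this only shows the shape
    # of generate-then-execute versus interpret-per-tuple.
    assert op in {"<", "<=", "==", "!=", ">=", ">"}  # keep codegen safe
    src = f"def pred(row):\n    return row[{column!r}] {op} {value!r}"
    ns = {}
    exec(compile(src, "<generated>", "exec"), ns)
    return ns["pred"]
```

Usage: `pred = compile_filter("price", ">", 100)` yields a specialized function, so a scan becomes `[r for r in rows if pred(r)]` with no per-row predicate interpretation.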
Citations: 3
Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data (Extended version) 不确定数据上排序和窗口查询的确定和可能答案的有效逼近(扩展版)
Pub Date: 2023-02-01 DOI: 10.48550/arXiv.2302.08676
Su Feng, Boris Glavic, Oliver Kennedy
Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is to the best of our knowledge the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors, and often produces more accurate results.
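One way to picture under- and over-approximation (our simplified sketch, not the paper's algorithms): if each uncertain score is only known to lie in an interval, a row is *certainly* in the top-k when fewer than k rows can possibly beat it, and *possibly* in the top-k when fewer than k rows certainly beat it:

```python
def topk_bounds(intervals, k):
    # intervals: list of (lo, hi) score bounds, one per row.
    # Returns (certain, possible): an under- and an over-approximation
    # of the set of row indices that belong to the top-k.
    certain, possible = [], []
    for i, (lo, hi) in enumerate(intervals):
        # j can possibly beat i if some world gives j a higher score.
        beat_possible = sum(1 for j, (lo2, hi2) in enumerate(intervals)
                            if j != i and hi2 > lo)
        if beat_possible < k:
            certain.append(i)
        # j certainly beats i if j outranks i in every world.
        beat_certain = sum(1 for j, (lo2, hi2) in enumerate(intervals)
                           if j != i and lo2 > hi)
        if beat_certain < k:
            possible.append(i)
    return certain, possible
```

The certain set is safe to report as a guaranteed answer; the gap between the two sets quantifies the uncertainty of the ranking.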
Citations: 0
VersaMatch: Ontology Matching with Weak Supervision
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583148
Jonathan Fürst, Mauricio Fadel Argerich, Bin Cheng
Ontology matching is crucial to data integration for cross-silo data sharing and has mainly been addressed with heuristic and machine learning (ML) methods. While heuristic methods are often inflexible and hard to extend to new domains, ML methods rely on substantial and hard-to-obtain amounts of labeled training data. To overcome these limitations, we propose VersaMatch, a flexible, weakly-supervised ontology matching system. VersaMatch employs various weak supervision sources, such as heuristic rules, pattern matching, and external knowledge bases, to produce labels from a large amount of unlabeled data for training a discriminative ML model. For prediction, VersaMatch develops a novel ensemble model combining the weak supervision sources with the discriminative model to support generalization while retaining a high precision. Our ensemble method boosts end model performance by 4 points compared to a traditional weak-supervision baseline. In addition, compared to state-of-the-art ontology matchers, VersaMatch achieves an overall 4-point performance improvement in F1 score across 26 ontology combinations from different domains. For recently released, in-the-wild datasets, VersaMatch beats the next best matchers by 9 points in F1. Furthermore, its core weak-supervision logic can easily be improved by adding more knowledge sources and collecting more unlabeled data for training.
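The weak-supervision step can be pictured as a set of labeling functions that vote (or abstain) on each unlabeled pair; the aggregated labels then train a discriminative model. Below is a minimal abstain-aware majority-vote sketch; VersaMatch's actual combination is a learned ensemble, and both rules here are invented for illustration:

```python
def weak_label(pair, labeling_functions):
    # Each labeling function returns 1 (match), 0 (non-match), or None (abstain).
    votes = [lf(pair) for lf in labeling_functions]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None  # every source abstained; leave the pair unlabeled
    return max(set(votes), key=votes.count)

# Two toy weak sources over (name_a, name_b) string pairs.
def lf_exact(pair):
    return 1 if pair[0].lower() == pair[1].lower() else None

def lf_token_overlap(pair):
    a, b = (set(s.lower().split()) for s in pair)
    return 1 if len(a & b) / max(len(a | b), 1) > 0.5 else 0
```

Labels produced this way over a large unlabeled corpus become the training set for the discriminative model, which is what generalizes beyond the hand-written rules.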
Citations: 2
Robust Query Driven Cardinality Estimation under Changing Workloads
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583164
Parimarjan Negi, Ziniu Wu, Andreas Kipf, Nesime Tatbul, Ryan Marcus, S. Madden, Tim Kraska, Mohammad Alizadeh
Query driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage requirements, fast inference and training, and are easily adaptable for any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy. We analyze the reasons why models become unpredictable due to workload drift, and introduce modifications to the query representation and neural network training techniques to make query-driven models robust to the effects of workload drift. First, we emulate workload drift in queries involving some unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that are useful even when query or data drift happens. Second, we introduce join bitmaps, which extend sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, we show how both of these ideas can be adapted to handle data updates. We show significantly greater generalization than past works across different workloads and databases. For instance, a model trained with our techniques on a simple workload (JOBLight-train), with 40k synthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which includes queries with up to 16 tables, and improve query runtimes by 2× over PostgreSQL. We show similar robustness results with data updates, and across other workloads. We discuss the situations where we expect, and see, improvements, as well as more challenging workload drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, our models never perform worse than PostgreSQL, while standard query driven models can get much worse than PostgreSQL.
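The masking trick from the first contribution can be sketched directly: during training, each table's or column's feature group is zeroed out with some probability, so the model learns to fall back on the remaining (e.g. statistics-based) features when a query touches unseen tables or columns. Group names and shapes here are illustrative, not the paper's featurization:

```python
import random

def mask_features(feature_groups, mask_prob=0.3, rng=None):
    # feature_groups: {"tbl:title": [...], "col:title.year": [...], ...}
    # Each group is independently replaced by zeros with probability
    # mask_prob, emulating an unseen table/column at training time.
    rng = rng or random.Random()
    return {name: ([0.0] * len(vec) if rng.random() < mask_prob else list(vec))
            for name, vec in feature_groups.items()}
```

Applying this per training batch exposes the model to many "partially blind" variants of each query, which is what buys robustness under drift.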
Citations: 5
Elpis: Graph-Based Similarity Search for Scalable Data Science
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583166
Ilias Azizi, Karima Echihabi, Themis Palpanas
The recent popularity of learned embeddings has fueled the growth of massive collections of high-dimensional (high-d) vectors that model complex data. Finding similar vectors in these collections is at the core of many important and practical data science applications. The data series community has developed tree-based similarity search techniques that outperform state-of-the-art methods on large collections of both data series and generic high-d vectors, on all scenarios except for no-guarantees ng -approximate search, where graph-based approaches designed by the high-d vector community achieve the best performance. However, building graph-based indexes is extremely expensive both in time and space. In this paper, we bring these two worlds together, study the corresponding solutions and their performance behavior, and propose ELPIS, a new strong baseline that takes advantage of the best features of both to achieve a superior performance in terms of indexing and ng-approximate search in-memory. ELPIS builds the index 3x-8x faster than competitors, using 40% less memory. It also achieves a high recall of 0.99, up to 2x faster than the state-of-the-art methods, and answers 1-NN queries up to one order of magnitude faster.
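The search primitive that makes graph-based indexes fast is a greedy walk: start from an entry point and repeatedly hop to whichever neighbor is closest to the query, stopping at a local minimum. A minimal sketch over a hand-built proximity graph (ELPIS's actual index construction and beam search are far more elaborate):

```python
def greedy_search(neighbors, dist, query, entry):
    # neighbors: adjacency lists of a proximity graph;
    # dist(node, query) -> float distance to the query.
    cur, d_cur = entry, dist(entry, query)
    while True:
        best = min(neighbors[cur], key=lambda v: dist(v, query), default=None)
        if best is None or dist(best, query) >= d_cur:
            return cur  # local minimum: no neighbor is closer
        cur, d_cur = best, dist(best, query)
```

Because each step only examines one node's adjacency list, queries are fast, but the index (every node's neighbor list) is exactly the part that is expensive to build, which is the cost ELPIS attacks.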
Citations: 10
Dotori: A Key-Value SSD Based KV Store
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583167
Carl Duffy, Jaehoon Shim, Sang-Hoon Kim, Jin-Soo Kim
Key-value SSDs (KVSSDs) represent a major shift in the storage stack design, with numerous potential benefits. Despite this, their lack of native features critical to operation in real world scenarios hinders their adoption, and these benefits go unrealized. Moreover, simply adapting existing key-value stores to run on KVSSDs proves underwhelming, as KVSSDs operate at lower raw device performance when compared to modern block SSDs. This paper introduces Dotori. Dotori is a KVSSD based key-value store that provides much needed functionality in a KVSSD through an upper layer in the host, and takes advantage of the unique KVSSD interface to enable further gains in functionality and performance. At the core of Dotori is a novel B+tree design that is only practical when the underlying storage device is a KVSSD. We test Dotori with an enterprise grade KVSSD against state-of-the-art block SSD based key-value stores through a range of micro-benchmarks and real world workloads. Despite low KVSSD raw device performance, Dotori achieves superior performance to these block-device based key-value stores while also showing significant gains in other important metrics.
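The key structural idea, a tree whose nodes are device key-value objects rather than fixed-size pages, can be sketched with a dict standing in for the KVSSD. This is our illustration of the concept, not Dotori's on-device layout:

```python
class KVNodeStore:
    """Stores each tree node as one key-value object, as a KVSSD allows;
    a dict stands in for the device's get/put interface."""
    def __init__(self):
        self.dev, self.next_id = {}, 0

    def put(self, node):
        nid = f"node:{self.next_id}"  # node id doubles as the device key
        self.next_id += 1
        self.dev[nid] = node
        return nid

    def get(self, nid):
        return self.dev[nid]

def search(store, root_id, key):
    node = store.get(root_id)
    while "children" in node:          # internal node: pick the right child
        idx = sum(1 for sep in node["keys"] if key >= sep)
        node = store.get(node["children"][idx])
    return node["entries"].get(key)    # leaf lookup
```

With a block SSD, each `get`/`put` would have to go through a page-mapping layer; on a KVSSD the node id can address the device object directly, which is what makes a KVSSD-native B+tree design attractive.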
Citations: 0
A Design Space Exploration and Evaluation for Main-Memory Hash Joins in Storage Class Memory
Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583144
Wentao Huang, Yunhong Ji, Xuan Zhou, Bin He, K. Tan
In this paper, we seek to perform a rigorous experimental study of main-memory hash joins in storage class memory (SCM). In particular, we perform a design space exploration in real SCM for two state-of-the-art join algorithms: partitioned hash join (PHJ) and non-partitioned hash join (NPHJ), and identify the most crucial factors for implementing an SCM-friendly join. Moreover, we present a rigorous evaluation with a broad spectrum of workloads for both joins and provide an in-depth analysis for choosing the most suitable algorithm in a real SCM environment. With the most extensive experimental analysis to date, we maintain that although there is no universal winner in all scenarios, PHJ is generally superior to NPHJ in real SCM.
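The two algorithms compared differ in whether the build side is split before hash tables are built. A minimal partitioned hash join (PHJ) sketch, with illustrative names; the paper's implementations are tuned to SCM's latency and bandwidth characteristics, which this toy version ignores:

```python
from collections import defaultdict

def partitioned_hash_join(r, s, parts=4):
    # r, s: lists of (key, payload). Split both inputs by hash of the join
    # key so each partition's build table stays small and cache-friendly,
    # then build and probe partition by partition.
    build = [defaultdict(list) for _ in range(parts)]
    for key, payload in r:
        build[hash(key) % parts][key].append(payload)
    out = []
    for key, payload in s:
        for r_payload in build[hash(key) % parts].get(key, ()):
            out.append((key, r_payload, payload))
    return out
```

The non-partitioned variant (NPHJ) would build one large shared hash table over all of `r` instead; the partitioning pass is extra work that pays off when it keeps probes out of slow memory.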
Cited: 1
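The PHJ/NPHJ distinction in the abstract above can be illustrated with a minimal in-memory sketch (generic Python, not the paper's SCM implementation): NPHJ builds one global hash table over the smaller relation and probes it with the other, while PHJ first hash-partitions both relations so that each partition pair can be joined with a small, cache-friendly local table.

```python
from collections import defaultdict

def nphj(build, probe):
    """Non-partitioned hash join: one global hash table over `build`."""
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]

def phj(build, probe, n_partitions=4):
    """Partitioned hash join: hash-partition both inputs first, then
    join each partition pair with a small local hash table."""
    bparts = [[] for _ in range(n_partitions)]
    pparts = [[] for _ in range(n_partitions)]
    for row in build:
        bparts[hash(row[0]) % n_partitions].append(row)
    for row in probe:
        pparts[hash(row[0]) % n_partitions].append(row)
    out = []
    for bp, pp in zip(bparts, pparts):
        out.extend(nphj(bp, pp))  # matching keys land in the same partition
    return out

R = [(1, "a"), (2, "b"), (2, "c")]  # build side
S = [(2, "x"), (3, "y"), (1, "z")]  # probe side
assert sorted(phj(R, S)) == sorted(nphj(R, S))
```

Both variants produce the same join result; the paper's question is which memory-access pattern survives SCM's read/write asymmetry better, which this toy sketch does not model.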
Efficient Distributed Transaction Processing in Heterogeneous Networks
Pub Date : 2023-02-01 DOI: 10.14778/3583140.3583153
Qian Zhang, Jingyao Li, Hong-wei Zhao, Quanqing Xu, Wei Lu, Jinliang Xiao, Fusheng Han, Chuanhui Yang, Xiaoyong Du
Nationwide and worldwide businesses, such as gaming and social networks, drive the popularity of inter-data-center transactions. To support inter-data-center transaction processing and data-center fault tolerance simultaneously, existing protocols suffer significant performance degradation over high-latency, unstable networks. In this paper, we propose RedT, a novel distributed transaction processing protocol that works in heterogeneous networks. Specifically, nodes within a data center are interconnected via an RDMA-capable network, while nodes across data centers are interconnected via TCP/IP networks. RedT extends two-phase commit (2PC) by decomposing transactions into sub-transactions at data-center granularity, and by proposing a pre-write-log mechanism that reduces the number of inter-data-center round-trips from a maximum of 6 to 2. Extensive evaluation against state-of-the-art protocols shows that RedT achieves up to 1.57× higher throughput and 0.56× lower latency.
Qian Zhang, Jingyao Li, Hong-wei Zhao, Quanqing Xu, Wei Lu, Jinliang Xiao, Fusheng Han, Chuanhui Yang, Xiaoyong Du. "Efficient Distributed Transaction Processing in Heterogeneous Networks." Proc. VLDB Endow., pp. 1372-1385, Feb. 2023. DOI: 10.14778/3583140.3583153
Cited: 1
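The decomposition step described in the abstract, grouping a transaction's operations by the data center that owns each key so that cross-data-center coordination happens once per data center rather than once per node, can be sketched as follows (generic Python; the `PLACEMENT` map, key names, and `decompose` helper are hypothetical illustrations, not RedT's actual protocol):

```python
from collections import defaultdict

# Hypothetical key -> (data_center, node) placement map.
PLACEMENT = {
    "alice": ("dc-east", "n1"),
    "bob":   ("dc-east", "n2"),
    "carol": ("dc-west", "n3"),
}

def decompose(ops):
    """Split a transaction's write set into per-data-center
    sub-transactions: the granularity RedT coordinates 2PC at."""
    subtxns = defaultdict(list)
    for key, value in ops:
        dc, node = PLACEMENT[key]
        subtxns[dc].append((node, key, value))
    return dict(subtxns)

txn = [("alice", 10), ("carol", 20), ("bob", 30)]
subs = decompose(txn)
# Two cross-DC destinations instead of three per-node ones; a
# pre-write log then lets prepare/commit messages be collapsed,
# cutting wide-area round-trips.
assert set(subs) == {"dc-east", "dc-west"}
assert len(subs["dc-east"]) == 2
```

The intra-data-center fan-out to individual nodes can then ride the low-latency RDMA network, so only the per-data-center messages pay the TCP/IP wide-area cost.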
Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)
Pub Date : 2023-01-30 DOI: 10.48550/arXiv.2301.13095
Roee Shraga, Renée J. Miller
In multi-user environments where data science and analysis are collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
Roee Shraga, Renée J. Miller. "Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)." Proc. VLDB Endow., pp. 1587-1600, Jan. 2023. DOI: 10.48550/arXiv.2301.13095
Cited: 0
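The idea of explaining a change between two dataset versions as a data transformation can be illustrated with a toy search over candidate column transformations (generic Python; the candidate set here is illustrative and far smaller than Explain-Da-V's actual search space):

```python
# Candidate transformations a version change might be explained by.
CANDIDATES = {
    "identity":   lambda x: x,
    "to_upper":   lambda x: x.upper(),
    "strip":      lambda x: x.strip(),
    "add_prefix": lambda x: "id_" + x,
}

def explain_column(v1, v2):
    """Return the names of candidate transformations that map every
    value of version 1 onto version 2 (a valid 'explanation')."""
    return [name for name, f in CANDIDATES.items()
            if [f(x) for x in v1] == v2]

v1 = ["ann", "bob"]
v2 = ["ANN", "BOB"]
assert explain_column(v1, v2) == ["to_upper"]
```

A returned transformation is valid on the observed versions; the paper's measures additionally ask whether it generalizes to unseen rows and is human-explainable, which a brute-force filter like this does not capture.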