
Advances in database technology : proceedings. International Conference on Extending Database Technology: latest articles

UniCache: Efficient Log Replication through Learning Workload Patterns
Harald Ng, Kun Wu, Paris Carbone
Most of the world’s cloud data service workloads are currently backed by replicated state machines. The production-grade log replication protocols used for the job impose heavy data transfer duties on the primary server, which needs to disseminate the log commands to all the replica servers. UniCache proposes a principal solution to this problem using a learned replicated cache, which enables commands to be sent over the network as compressed encodings. UniCache takes advantage of the fact that each replica has access to a consistent prefix of the replicated log, which allows the replicas to build a uniform lookup cache used for compressing and decompressing commands consistently. UniCache achieves effective speedups, lowering the primary's load in application workloads with a skewed data distribution. Our experimental studies show a low pre-processing overhead and the highest performance gains in cross-data-center deployments over wide area networks.
DOI: https://doi.org/10.48786/edbt.2023.39 (pages 471-477)
Citations: 0
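The UniCache abstract describes replicas building the same lookup cache from the same log prefix, so that repeated commands can travel as short encodings. The following is only a minimal sketch of that general idea, not the authors' method: the class name `UniformCache`, the FIFO eviction policy, and the `("raw", ...)` / `("ref", ...)` message format are assumptions made for illustration, and nothing here is learned from the workload.

```python
from collections import OrderedDict

class UniformCache:
    """Hypothetical fixed-capacity cache; every replica updates it identically."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.ids = OrderedDict()   # command -> small integer id
        self.values = {}           # id -> command
        self.next_id = 0

    def _insert(self, value):
        # Deterministic FIFO eviction keeps all replicas in the same state.
        if len(self.ids) >= self.capacity:
            _, old_id = self.ids.popitem(last=False)
            del self.values[old_id]
        self.ids[value] = self.next_id
        self.values[self.next_id] = value
        self.next_id += 1

    def encode(self, value):
        """Primary side: send a short reference on a hit, the raw command on a miss."""
        if value in self.ids:
            return ("ref", self.ids[value])
        self._insert(value)
        return ("raw", value)

    def decode(self, msg):
        """Replica side: mirror the primary's cache update for every message."""
        kind, payload = msg
        if kind == "ref":
            return self.values[payload]
        self._insert(payload)
        return payload

primary, replica = UniformCache(), UniformCache()
log = ["SET user:1", "SET user:2", "SET user:1", "SET user:1"]
wire = [primary.encode(cmd) for cmd in log]
decoded = [replica.decode(msg) for msg in wire]
```

The two caches stay in sync only because both sides process the same command sequence in the same order, which is the property a consistent log prefix provides. With a skewed workload, most messages become short references.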
SonicJoin: Fast, Robust and Worst-case Optimal
Ahmad Khazaie, H. Pirk
The establishment of the AGM bound on the size of intermediate results of natural join queries has led to the development of several so-called worst-case optimal join algorithms. These algorithms provably produce intermediate results that are (asymptotically) no larger than the final result of the join. The most notable ones are the Recursive Join, its successor, the Generic Join, and the Leapfrog-Trie-Join. While algorithmically efficient, all of these algorithms require index structures that allow tuple lookups using the prefix of a key. Key-prefix lookups in relational database systems are commonly supported by tree-based index structures, since hash-based indices only support full-key lookups. In this paper, we study a wide variety of main-memory-oriented index structures that support key-prefix lookups, with a specific focus on supporting the Generic Join. Based on that study, we develop a novel, best-of-breed index structure called Sonic that combines the fast build and point lookup properties of hash tables with the prefix-lookup capabilities of trees and tries. To evaluate the performance of a variety of indices for worst-case optimal joins in a modern code-generating DBMS, we leveraged flexible, compile-time metaprogramming features to build a framework that creates highly efficient code, interweaving (at a microarchitectural level) a generic join implementation with any appropriate index structure. We demonstrate experimentally that, in that framework, Sonic outperforms the fastest existing approaches by up to 2.5 times when supporting the Generic Join algorithm.
DOI: https://doi.org/10.48786/edbt.2023.46 (pages 540-551)
Citations: 0
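To illustrate why the SonicJoin abstract stresses key-prefix lookups, here is a rough sketch of a Generic-Join-style evaluation of the triangle query R(a,b), S(b,c), T(a,c). It is not the Sonic index: the nested-dictionary index and the function names `build_index` and `triangle_join` are assumptions chosen for brevity. Looking up `index[x]` plays the role of a prefix lookup on a two-attribute key.

```python
def build_index(tuples):
    """Nested index: index[x] is the set of y such that (x, y) is in the relation."""
    index = {}
    for x, y in tuples:
        index.setdefault(x, set()).add(y)
    return index

def triangle_join(R, S, T):
    """Bind one attribute at a time, intersecting the candidates from each relation."""
    r, s, t = build_index(R), build_index(S), build_index(T)
    out = []
    for a in r.keys() & t.keys():        # a must appear in both R and T
        for b in s.keys() & r[a]:        # prefix lookup r[a], intersected with S
            for c in s[b] & t[a]:        # prefix lookups s[b] and t[a]
                out.append((a, b, c))
    return sorted(out)

R = [(1, 2), (1, 3)]
S = [(2, 3), (3, 4)]
T = [(1, 3), (1, 4)]
triangles = triangle_join(R, S, T)
```

Because every loop intersects candidate sets before descending, no intermediate binding is produced that cannot extend to at least one further attribute, which is the intuition behind attribute-at-a-time joins.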
Reasoning over Financial Scenarios with the Vadalog System
Teodoro Baldazzi, Luigi Bellomarini, Emanuel Sallinger
DOI: https://doi.org/10.48786/edbt.2023.66 (pages 782-791)
Citations: 0
Tuning the Utility-Privacy Trade-Off in Trajectory Data
Maja Schneider, P. Christen, E. Rahm, Jonathan Schneider, Lea Löffelmann
Trajectory data, often collected on a large scale with mobile sensors in smartphones and vehicles, are a valuable source for realizing smart city applications or for improving the user experience in mobile apps. But such data can also leak private information, such as a person’s whereabouts and their points of interest (POI). These in turn can reveal sensitive information, for example a person’s age, gender, religion, or home and work address. Location privacy preserving mechanisms (LPPM) can mitigate this issue by transforming data so that private details are protected. But privacy preservation typically comes at the cost of a loss of utility. It can be challenging to find a suitable mechanism and the right settings to satisfy privacy as well as utility. In this work, we present Privacy Tuna, an interactive open-source framework to visualize trajectory data and intuitively estimate data utility and privacy while applying various LPPMs. Our tool makes it easy for data owners to investigate the value of their data, choose a suitable privacy-preserving mechanism, and tune its parameters to achieve a good utility-privacy trade-off.
DOI: https://doi.org/10.48786/edbt.2023.78 (pages 839-842)
Citations: 0
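The utility-privacy trade-off in the Privacy Tuna abstract can be illustrated with a very simple, generic location privacy preserving mechanism. The sketch below is an assumption for illustration only and is not taken from the tool: `snap_to_grid` coarsens points to grid-cell centres, `cell` is the tunable parameter, and `mean_displacement` is a crude utility-loss proxy.

```python
import math

def snap_to_grid(trajectory, cell=0.01):
    """Replace each (lat, lon) point by the centre of its grid cell (size in degrees)."""
    return [((math.floor(lat / cell) + 0.5) * cell,
             (math.floor(lon / cell) + 0.5) * cell)
            for lat, lon in trajectory]

def mean_displacement(original, protected):
    """Utility-loss proxy: average Euclidean shift per point, in degrees."""
    return sum(math.dist(p, q) for p, q in zip(original, protected)) / len(original)

traj = [(51.3397, 12.3731), (51.3402, 12.3745), (51.3450, 12.3800)]
fine = snap_to_grid(traj, cell=0.01)     # small cells: little distortion
coarse = snap_to_grid(traj, cell=0.1)    # large cells: points become indistinguishable
```

A larger `cell` maps more points to the same location, which hides detail but increases the displacement; tuning that single parameter is a miniature version of the trade-off the abstract describes.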
In-Network Approximate and Efficient Spatiotemporal Range Queries on Moving Objects
Guang Yang, Liang Liang
Data aggregations enable privacy-aware data analytics for moving objects. A spatiotemporal range count query is a fundamental query that aggregates the count of objects in a given spatial region and time interval. Existing works are designed for centralized systems, leading to issues with extensive communication and the potential for data leaks. Current in-network systems suffer from the distinct count problem (counting the same objects multiple times) and the dead space problem (excessive intra-communication from ill-suited spatial subdivisions). We propose a novel framework based on a planar graph representation for efficient privacy-aware in-network aggregate queries. Unlike conventional spatial decomposition methods, our framework uses sensor placement techniques to select sensors to reduce dead space. A submodular maximization-based method is introduced for when the query distribution is known, and a host of sampling methods are used when the query distribution is unknown or dynamic. We avoid double counting by tracking movements along the graph edges using discrete differential forms. We support queries with arbitrary temporal intervals with a constant-sized regression model that accelerates query performance and reduces the storage size. We evaluate our method on real-world mobility data, which yields a relative error of at most 13.8% with 25.6% of sensors while achieving a speedup of 3.5×, a 69.81% reduction in sensors accessed, and a storage reduction of 99.96% compared to finding the exact count.
DOI: https://doi.org/10.48786/edbt.2024.04 (pages 34-46)
Citations: 0
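The abstract above says double counting is avoided by tracking movements along graph edges. A simplified way to see why this works: if sensors only record anonymous, signed crossings of edges, the number of objects inside a region is the net inflow across its boundary, and movements inside the region cancel out. The sketch below is an assumption-laden illustration (the event format, the function name `region_count`, and the "region starts empty" condition are all hypothetical), not the paper's framework, and it answers a snapshot count rather than a full interval query.

```python
def region_count(crossings, region, t):
    """Objects inside `region` at time t, from anonymous edge-crossing events.

    crossings: iterable of (time, src_cell, dst_cell) events.
    region: set of cells. Assumes the region is empty before the first event.
    """
    count = 0
    for time, src, dst in crossings:
        if time > t:
            continue
        if src not in region and dst in region:
            count += 1      # crossed a boundary edge inward
        elif src in region and dst not in region:
            count -= 1      # crossed a boundary edge outward
        # moves between two cells of the region leave the count unchanged
    return count

events = [(1, "out", "A"), (2, "A", "B"), (3, "out", "A"), (4, "B", "out")]
inside_at_3 = region_count(events, {"A", "B"}, t=3)
inside_at_4 = region_count(events, {"A", "B"}, t=4)
```

No object identifiers are needed, so the same object moving from A to B is never counted twice, and only sensors on the region boundary contribute to the answer.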
Incremental Stream Query Merging
Ankit Chaudhary, Steffen Zeuch, V. Markl, Jeyhun Karimov
DOI: https://doi.org/10.48786/edbt.2023.51 (pages 604-617)
Citations: 1
TempoGRAPHer: A Tool for Aggregating and Exploring Evolving Graphs
Evangelia Tsoukanara, Georgia Koloniari, E. Pitoura
Graphs offer a generic abstraction for modeling entities as nodes and their interactions and relationships as edges. Since most graphs evolve over time, it is important to study their evolution. To this end, we propose demonstrating TempoGRAPHer, a tool that provides an overview of the evolution of an attributed graph offering aggregation at both the time and the attribute dimensions. The tool also supports a novel exploration strategy that helps in identifying time intervals of significant growth, shrinkage, or stability. Finally, we describe a scenario that showcases the usefulness of the TempoGRAPHer tool in understanding the evolution of contacts between primary school students.
DOI: https://doi.org/10.48786/edbt.2023.79 (pages 843-846)
Citations: 1
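Aggregating an attributed graph along both the time and the attribute dimension, as the TempoGRAPHer abstract describes, can be sketched in a few lines. The function name `aggregate`, the edge format `(time, u, v)`, and the sample school data are assumptions for illustration; the tool's own operators may differ.

```python
from collections import Counter

def aggregate(edges, attr, interval):
    """Count edges between attribute groups for timestamps within [start, end]."""
    start, end = interval
    agg = Counter()
    for t, u, v in edges:
        if start <= t <= end:
            # Group each endpoint by its attribute value; order-insensitive pair.
            agg[tuple(sorted((attr[u], attr[v])))] += 1
    return agg

attr = {"ann": "class1", "bob": "class1", "cy": "class2"}
edges = [(1, "ann", "bob"), (1, "ann", "cy"), (2, "bob", "cy"), (3, "ann", "bob")]
early = aggregate(edges, attr, (1, 2))
late = aggregate(edges, attr, (3, 3))
```

Comparing the aggregates of two intervals (here `early` and `late`) gives a simple signal of growth, shrinkage, or stability for each pair of attribute groups.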
RDF-Analytics: Interactive Analytics over RDF Knowledge Graphs
Maria-Evangelia Papadaki, Yannis Tzitzikas
The formulation of structured queries in knowledge graphs is a challenging task that presupposes familiarity with the syntax of the query language and the contents of the knowledge graph. To alleviate this difficulty, in this paper we introduce RDF-ANALYTICS, a novel system that enables plain users to formulate analytic queries over complex, i.e. not necessarily star-schema-based, RDF knowledge graphs. To come up with an intuitive interface, we leverage the familiarity of users with Faceted Search (FS) systems, i.e. we extend FS with actions that also enable users to formulate analytic queries. Distinctive characteristics of the approach are the ability to include arbitrarily long paths in the analytic query (accompanied by count information), the interactive formulation of HAVING restrictions, the support of both Faceted Search (i.e. locating the desired resources in a faceted search manner) and analytic queries, and the ability to formulate nested analytic queries. Finally, we present the results of a preliminary task-based evaluation with users, which are very promising.
DOI: https://doi.org/10.48786/edbt.2023.70 (pages 807-810)
Citations: 0
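The kind of analytic query the RDF-Analytics abstract mentions, grouping by the end of a property path and filtering groups with a HAVING restriction, can be mimicked over a handful of triples. This is only a sketch in plain Python rather than SPARQL; the property names `madeBy` and `locatedIn`, the sample data, and the helper names are all hypothetical.

```python
from collections import Counter

def follow(triples, start, path):
    """Return all nodes reachable from `start` along the given property path."""
    frontier = {start}
    for prop in path:
        frontier = {o for s, p, o in triples if p == prop and s in frontier}
    return frontier

def count_by_path(triples, subjects, path, having_min=1):
    """Group subjects by the value at the end of `path`; keep groups with enough members."""
    counts = Counter(v for s in subjects for v in follow(triples, s, path))
    return {k: n for k, n in counts.items() if n >= having_min}

triples = [
    ("p1", "madeBy", "c1"), ("p2", "madeBy", "c1"), ("p3", "madeBy", "c2"),
    ("c1", "locatedIn", "US"), ("c2", "locatedIn", "JP"),
]
result = count_by_path(triples, ["p1", "p2", "p3"],
                       ["madeBy", "locatedIn"], having_min=2)
```

Here the path has length two and the `having_min` threshold drops the group with a single product, which corresponds to what a faceted interface would need to express through a few clicks.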
Mining Structures from Massive Texts by Exploring the Power of Pre-trained Language Models
Yu Zhang, Yunyi Zhang, Jiawei Han
Technologies for handling massive structured or semi-structured data have been researched extensively in database communities. However, the real-world data are largely in the form of unstructured text, posing a great challenge to their management and analysis as well as their integration with semi-structured databases. Recent developments of deep learning methods and large pre-trained language models (PLMs) have revolutionized text mining and processing and shed new light on structuring massive text data and building a framework for integrated (i.e., structured and unstructured) data management and analysis. In this tutorial, we will focus on the recently developed text mining approaches empowered by PLMs that can work without relying on heavy human annotations. We will present an organized picture of how a set of weakly supervised methods explore the power of PLMs to structure text data, with the following outline: (1) an introduction to pre-trained languagemodels that serve as new tools for our tasks, (2) mining topic structures: unsupervised and seed-guided methods for topic discovery from massive text corpora, (3) mining document structures: weakly supervised methods for text classification, (4) mining entity structures: distantly supervised and weakly supervised methods for phrase mining, named entity recognition, taxonomy construction, and structured knowledge graph construction, and (5) towards an integrated information processing paradigm. 1 BACKGROUND, GOALS, AND DURATION The massive text data available on the Web, social media, news, scientific literature, government reports, and other information sources contain rich knowledge that can potentially benefit a wide variety of information processing tasks, and they can be potentially structured and analyzed by extended database technologies. 
For example, one can conduct entity recognition and concept ontology construction on a large collection of scientific papers and extract the factual knowledge for knowledge base construction and subsequent analysis. How to effectively leverage the unstructured massive text data for downstream applications has remained an important and active research question for the past few decades. Recently, pre-trained language models (PLMs) such as BERT [6] have revolutionized the text mining field and brought new inspirations to structuring text data. To be specific, the following paradigm is usually adopted: pre-training neural architectures on large-scale text corpora obtained from the world knowledge (e.g., a combination of Wikipedia, books, scientific corpora, and web content), and then transferring their representations to task-specific data. By doing so, the knowledge encoded in the world corpora can be effectively leveraged to enhance © 2023 Copyright held by the owner/author(s). Published in Proceedings of the 26th International Conference on Extending Database Technology (EDBT), 28th March-31st March, 2023, ISBN 978-3-89318-092-9 on OpenProceedings.org.
数据库社区对处理大量结构化或半结构化数据的技术进行了广泛的研究。然而,现实世界的数据大多是非结构化文本,这给数据的管理和分析以及与半结构化数据库的集成带来了巨大的挑战。深度学习方法和大型预训练语言模型(plm)的最新发展彻底改变了文本挖掘和处理,并为构建大量文本数据和构建集成(即结构化和非结构化)数据管理和分析框架提供了新的思路。在本教程中,我们将重点介绍最近开发的由plm支持的文本挖掘方法,这些方法可以在不依赖大量人工注释的情况下工作。我们将有组织地展示一组弱监督方法如何探索plm构建文本数据的能力,并给出以下概述:(1)介绍作为我们任务新工具的预训练语言模型,(2)挖掘主题结构:从大量文本语料库中发现主题的无监督和种子引导方法,(3)挖掘文档结构:挖掘文本分类的弱监督方法,(4)挖掘实体结构。远距离监督和弱监督方法用于短语挖掘、命名实体识别、分类构建和结构化知识图构建,以及(5)迈向集成信息处理范式。背景、目标和持续时间网络、社交媒体、新闻、科学文献、政府报告和其他信息源上的大量文本数据包含丰富的知识,可以潜在地有益于各种信息处理任务,并且可以通过扩展数据库技术对它们进行结构化和分析。例如,可以对大量的科学论文进行实体识别和概念本体构建,提取事实知识,用于知识库的构建和后续分析。如何有效地利用海量非结构化文本数据进行下游应用,是过去几十年一个重要而活跃的研究课题。最近,BERT[6]等预训练语言模型(plm)彻底改变了文本挖掘领域,并为结构化文本数据带来了新的灵感。具体而言,通常采用以下范式:在从世界知识中获得的大规模文本语料库(例如维基百科、书籍、科学语料库和web内容的组合)上预训练神经架构,然后将其表示转换为特定任务的数据。通过这样做,可以有效地利用世界语料库中编码的知识来增强©2023所有者/作者持有的版权。发表于第26届国际扩展数据库技术会议论文集(EDBT), 2023年3月28日-31日,ISBN 978-3-89318-092-9, OpenProceedings.org。本论文的发布遵循知识共享许可协议cc -by-nc和4.0的条款。下游任务表现显著。然而,这种范例的主要挑战是,完全监督的plm微调通常需要大量的人工注释,这可能需要领域的专业知识,并且在实践中获得这些注释既昂贵又耗时。在本教程中,我们的目标是介绍以下方面的最新进展:(1)语言模型预训练,将大量文本转化为上下文化的文本表示;(2)弱监督方法,将预训练的表示转移到各种任务中,从大量文本中挖掘主题、文档和实体的结构。本教程中介绍的材料将极大地有利于从事文本挖掘/自然语言处理、数据挖掘和数据库系统工作的研究人员,以及旨在为目标应用程序获取结构化和可操作知识而不需要访问大量注释数据的实践者。本教程将在3小时内呈现。
DOI: https://doi.org/10.48786/edbt.2023.81. Pages 851-854. Published in Proceedings of the 26th International Conference on Extending Database Technology (EDBT), 28-31 March 2023, ISBN 978-3-89318-092-9, on OpenProceedings.org. © 2023 Copyright held by the owner/author(s).
Citations: 0
A Formal Design Framework for Practical Property Graph Schema Languages
Nimo Beeren, G. Fletcher
Graph databases are increasingly receiving attention from industry and academia, due in part to their flexibility; a schema is often not required. However, schemas can significantly benefit query optimization, data integrity, and documentation. No formal framework currently captures the design space of state-of-the-art schema solutions. We present a formal design framework for property graph schema languages based on first-order logic rules, which balances expressivity and practicality. We show how this framework can be adapted to integrate a core set of constraints common in conceptual data modeling methods. To demonstrate practical feasibility, this model is implemented using graph queries for modern graph database systems, which we evaluate through a controlled experiment. We find that validation time scales linearly with the size of the data, even though only unoptimized, straightforward implementations are used.
DOI: https://doi.org/10.48786/edbt.2023.40. Pages 478-484. Published 2023-01-01.
Citations: 0