
Latest Articles from Proceedings of the VLDB Endowment

Breathing New Life into an Old Tree: Resolving Logging Dilemma of B+-tree on Modern Computational Storage Drives
IF 2.5 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-10-01 | DOI: 10.14778/3626292.3626297
Kecheng Huang, Zhaoyan Shen, Zili Shao, Tong Zhang, Feng Chen
Having dominated databases and various data management systems for decades, the B+-tree is infamously subject to a logging dilemma: one can improve B+-tree speed by equipping it with a larger log, which nevertheless degrades its crash recovery speed. This logging dilemma is particularly prominent under modern workloads that involve intensive small writes. In this paper, we propose a novel solution, called per-page logging based B+-tree, which leverages the emerging computational storage drive (CSD) with built-in transparent compression to fundamentally resolve the logging dilemma. Our key idea is to divide the large single log into many small (e.g., 4KB), highly compressible per-page logs, each statically bound to a B+-tree page. Together, all per-page logs form a very large over-provisioned log space that improves the B+-tree's operational speed. Meanwhile, during crash recovery, the B+-tree does not need to scan any per-page logs, leading to a recovery latency independent of the total log size. We have developed and open-sourced a fully functional prototype. Our evaluation results show that, under small-write-intensive workloads, our design can improve B+-tree operational throughput by up to 625.6% and maintain a crash recovery time as low as 19.2 ms, while incurring a minimal storage overhead of only 0.5-1.6%.
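The per-page idea can be sketched in a few lines. The following is an illustrative in-memory model only, with a made-up PerPageLoggedStore class and log capacity, not the authors' open-sourced prototype: each update appends to a small log bound to one page, and log entries are merged into the page only when that log fills.

```python
# Illustrative sketch of per-page logging (assumed semantics, not the paper's
# actual on-CSD implementation).

PAGE_LOG_CAPACITY = 4  # stands in for one small (e.g., 4KB) log block per page


class PerPageLoggedStore:
    def __init__(self):
        self.pages = {}      # page_id -> dict of key -> value (the "page")
        self.page_logs = {}  # page_id -> list of pending (key, value) entries

    def put(self, page_id, key, value):
        log = self.page_logs.setdefault(page_id, [])
        log.append((key, value))           # a small log append, not a page rewrite
        if len(log) >= PAGE_LOG_CAPACITY:  # log full: merge entries into the page
            page = self.pages.setdefault(page_id, {})
            for k, v in log:
                page[k] = v
            log.clear()

    def get(self, page_id, key):
        # The most recent value may still live in the per-page log.
        for k, v in reversed(self.page_logs.get(page_id, [])):
            if k == key:
                return v
        return self.pages.get(page_id, {}).get(key)
```

Because a crashed instance can rebuild any page from that page's own bounded log alone, recovery work does not grow with the total log space, which is the property the paper exploits.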
{"title":"Breathing New Life into an Old Tree: Resolving Logging Dilemma of B + -tree on Modern Computational Storage Drives","authors":"Kecheng Huang, Zhaoyan Shen, Zili Shao, Tong Zhang, Feng Chen","doi":"10.14778/3626292.3626297","DOIUrl":"https://doi.org/10.14778/3626292.3626297","url":null,"abstract":"Having dominated databases and various data management systems for decades, B + -tree is infamously subject to a logging dilemma: One could improve B + -tree speed performance by equipping it with a larger log, which nevertheless will degrade its crash recovery speed. Such a logging dilemma is particularly prominent in the presence of modern workloads that involve intensive small writes. In this paper, we propose a novel solution, called per-page logging based B + -tree, which leverages the emerging computational storage drive (CSD) with built-in transparent compression to fundamentally resolve the logging dilemma. Our key idea is to divide the large single log into many small (e.g., 4KB), highly compressible per-page logs , each being statically bounded with a B + -tree page. All per-page logs together form a very large over-provisioned log space for B + -tree to improve its operational speed performance. Meanwhile, during crash recovery, B + -tree does not need to scan any per-page logs, leading to a recovery latency independent from the total log size. We have developed and open-sourced a fully functional prototype. 
Our evaluation results show that, under small-write intensive workloads, our design solution can improve B + -tree operational throughput by up to 625.6% and maintain a crash recovery time of as low as 19.2 ms, while incurring a minimal storage overhead of only 0.5-1.6%.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139330821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
On the Cusp: Computing Thrills and Perils and Professional Awakening
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611640
Natasa Milic-Frayling
Over the past eight decades, computer science has advanced as a field, and the computing profession has matured by establishing professional codes of conduct, fostering best practices, and establishing industry standards to support the proliferation of technologies and services. Research and applications of digital computation continue to change all aspects of human endeavor through new waves of innovation. While it is clear that different research advances fuel innovation, the ways they come together to make an impact vary. In contrast to highly regulated sectors such as pharma, medicine, and law, the process of transforming research into widely deployed technologies is not regulated. We reflect on collective practices, from discovery by scientists and engineers to market delivery by entrepreneurs, industry leaders, and practitioners. We consider ecosystem changes that are required to sustain the transformational effects of new technologies and enable new practices to take root. Every such transformation ruptures the existing socio-technical fabric and requires a concerted effort to remedy this through effective policies and regulations. Computing experts are involved in all phases and must match the transformational power of their innovation with the highest standard of professional conduct. We highlight the principles of responsible innovation and discuss three waves of digital innovation. We use widespread and uncontrolled generative AI deployments to illustrate risks of an implosion of digital media due to contamination of digital records, removal of human agency, and threats to an individual's personhood.
{"title":"On the Cusp: Computing Thrills and Perils and Professional Awakening","authors":"Natasa Milic-Frayling","doi":"10.14778/3611540.3611640","DOIUrl":"https://doi.org/10.14778/3611540.3611640","url":null,"abstract":"Over the past eight decades, computer science has advanced as a field, and the computing profession has matured by establishing professional codes of conduct, fostering best practices, and establishing industry standards to support the proliferation of technologies and services. Research and applications of digital computation continue to change all aspects of human endeavor through new waves of innovation. While it is clear that different research advances fuel innovation, the ways they come together to make an impact vary. In contrast to highly regulated sectors such as pharma, medicine and law, the process of transforming research into widely deployed technologies is not regulated. We reflect on collective practices, from discovery by scientists and engineers to market delivery by entrepreneurs, industry leaders, and practitioners. We consider ecosystem changes that are required to sustain the transformational effects of new technologies and enable new practices to take root. Every such transformation ruptures in the existing socio-technical fabric and requires a concerted effort to remedy this through effective policies and regulations. Computing experts are involved in all phases and must match the transformational power of their innovation with the highest standard of professional conduct. We highlight the principles of responsible innovation and discuss three waves of digital innovation. 
We use wide and uncontrolled generative AI deployments to illustrate risks from the implosion of digital media due to contamination of digital records, removal of human agency, and risk to an individual's personhood.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134996880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Lingua Manga: A Generic Large Language Model Centric System for Data Curation
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611624
Zui Chen, Lei Cao, Sam Madden
Data curation is a wide-ranging area which contains many critical but time-consuming data processing tasks. However, the diversity of such tasks makes it challenging to develop a general-purpose data curation system. To address this issue, we present Lingua Manga, a user-friendly and versatile system that utilizes pre-trained large language models. Lingua Manga offers automatic optimization for achieving high performance and label efficiency while facilitating flexible and rapid development. Through three example applications with distinct objectives and users of varying levels of technical proficiency, we demonstrate that Lingua Manga can effectively assist both skilled programmers and low-code or even no-code users in addressing data curation challenges.
{"title":"Lingua Manga : A Generic Large Language Model Centric System for Data Curation","authors":"Zui Chen, Lei Cao, Sam Madden","doi":"10.14778/3611540.3611624","DOIUrl":"https://doi.org/10.14778/3611540.3611624","url":null,"abstract":"Data curation is a wide-ranging area which contains many critical but time-consuming data processing tasks. However, the diversity of such tasks makes it challenging to develop a general-purpose data curation system. To address this issue, we present Lingua Manga, a user-friendly and versatile system that utilizes pre-trained large language models. Lingua Manga offers automatic optimization for achieving high performance and label efficiency while facilitating flexible and rapid development. Through three example applications with distinct objectives and users of varying levels of technical proficiency, we demonstrate that Lingua Manga can effectively assist both skilled programmers and low-code or even no-code users in addressing data curation challenges.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134997925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Story of GraphLab - From Scaling Machine Learning to Shaping Graph Systems Research (VLDB 2023 Test-of-Time Award Talk)
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611637
Joseph E. Gonzalez, Yucheng Low
The GraphLab project spanned almost a decade and had profound academic and industrial impact on large-scale machine learning and graph processing systems. Numerous papers described the innovations in GraphLab, including the original vertex-centric [8] and edge-centric [3] programming abstractions, high-performance asynchronous execution engines [9], out-of-core graph computation [6], tabular graph systems [4], and even new statistical inference algorithms [2] enabled by the GraphLab project. This work became the basis of multiple PhD theses [1, 5, 7]. The GraphLab open-source project had broad academic and industrial adoption and ultimately led to the launch of Turi. In this talk, we tell the story of GraphLab, how it began and the key ideas behind it. We will focus on the approach to achieving scalable asynchronous systems in machine learning. During our talk, we will explore the impact that GraphLab has had on the development of graph processing systems, graph databases, and AI/ML; additionally, we will share our insights and opinions into where we see the future of these fields heading. In the process, we highlight some of the lessons we learned and provide guidance for future students.
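The vertex-centric ("think like a vertex") abstraction mentioned above can be illustrated with a tiny synchronous PageRank. This is a generic gather-apply sketch, not GraphLab's actual API; the function name and graph representation are assumptions for illustration.

```python
def vertex_centric_pagerank(out_edges, num_iters=20, d=0.85):
    """out_edges: dict vertex -> list of out-neighbors; every vertex must
    appear as a key, even if it has no out-edges."""
    n = len(out_edges)
    # Build the reverse adjacency so each vertex can "gather" from in-neighbors.
    in_edges = {v: [] for v in out_edges}
    for v, nbrs in out_edges.items():
        for u in nbrs:
            in_edges[u].append(v)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(num_iters):
        # Synchronous schedule for simplicity; GraphLab's engines also support
        # asynchronous, dynamically scheduled execution of vertex programs.
        rank = {v: (1 - d) / n
                   + d * sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
                for v in out_edges}
    return rank
```

The point of the abstraction is that the per-vertex update reads only a vertex's neighborhood, which is what lets the engine distribute and schedule the computation.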
{"title":"The Story of GraphLab - From Scaling Machine Learning to Shaping Graph Systems Research (VLDB 2023 Test-of-Time Award Talk)","authors":"Joseph E. Gonzalez, Yucheng Low","doi":"10.14778/3611540.3611637","DOIUrl":"https://doi.org/10.14778/3611540.3611637","url":null,"abstract":"The GraphLab project spanned almost a decade and had profound academic and industrial impact on large-scale machine learning and graph processing systems. There were numerous papers written describing the innovations in GraphLab including the original vertex-centric [8] and edge-centric [3] programming abstractions, high-performance asynchronous execution engines [9], out-of-core graph computation [6], tabular graph-systems [4], and even new statistical inference algorithms [2] enabled by the GraphLab project. This work became the basis of multiple PhD theses [1, 5, 7]. The GraphLab open-source project had broad academic and industrial adoption and ultimately lead to the launch of Turi. In this talk, we tell the story of GraphLab, how it began and the key ideas behind it. We will focus on the approach to achieving scalable asynchronous systems in machine learning. During our talk, we will explore the impact that GraphLab has had on the development of graph processing systems, graph databases, and AI/ML; Additionally, we will share our insights and opinions into where we see the future of these fields heading. 
In the process, we highlight some of the lessons we learned and provide guidance for future students.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134998137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CM-Explorer: Dissecting Data Ingestion Problems
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611595
Niels Bylois, Frank Neven, Stijn Vansummeren
Data ingestion validation, the task of certifying the quality of continuously collected data, is crucial to ensure trustworthiness of analytics insights. A widely used approach for validating data quality is to specify, either manually or automatically, so-called data unit tests that check whether data quality metrics lie within expected bounds. We employ conditional unit tests based on conditional metrics (CMs) that compute data quality signals over specific parts of the ingestion data and therefore allow for a fine-grained detection of errors. A violated conditional unit test specifies a set of erroneous tuples in a natural way: the subrelation that its CM refers to. Unfortunately, the downside of their fine-grained nature is that violating unit tests are often correlated: a single error in an ingestion batch may cause multiple tests (each referring to different parts of the batch) to fail. The key challenge is therefore to untangle this correlation and filter out the most relevant violated conditional unit tests, i.e., tests that identify a core set of erroneous tuples and act as an explanation for the errors. We present CM-Explorer, a system that supports data stewards in quickly finding the most relevant violated conditional unit tests. The system consists of three components: (1) a graph explorer for visualizing the correlation structure of the violated unit tests; (2) a relation explorer for browsing the tuples selected by conditional unit tests; and (3) a history explorer to get insight into why conditional unit tests are violated. In this paper, we discuss these components and present the different scenarios that we make available for the demonstration.
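A conditional unit test as described, a metric computed over the subrelation matching a condition and checked against expected bounds, might look like the following sketch. The function names, the example batch, and the bounds are assumptions for illustration, not CM-Explorer's API.

```python
def conditional_unit_test(rows, condition, metric, low, high):
    """Evaluate `metric` over the subrelation matching `condition`; the test
    passes iff the value lies within [low, high]. The subrelation is returned
    too, since on failure it pinpoints the suspect tuples."""
    subrelation = [r for r in rows if condition(r)]
    value = metric(subrelation)
    return low <= value <= high, subrelation


batch = [
    {"store": "A", "price": 10.0},
    {"store": "A", "price": 12.0},
    {"store": "B", "price": -3.0},  # erroneous tuple in this ingestion batch
]


def avg_price(part):
    return sum(r["price"] for r in part) / max(len(part), 1)


# Conditional unit test: the average price in store B must lie in [0, 100].
ok, suspects = conditional_unit_test(
    batch, lambda r: r["store"] == "B", avg_price, 0.0, 100.0)
```

Because the metric is scoped to one part of the batch, a violation immediately narrows the error down to that subrelation, which is the fine-grained behavior the abstract describes.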
{"title":"CM-Explorer: Dissecting Data Ingestion Problems","authors":"Niels Bylois, Frank Neven, Stijn Vansummeren","doi":"10.14778/3611540.3611595","DOIUrl":"https://doi.org/10.14778/3611540.3611595","url":null,"abstract":"Data ingestion validation, the task of certifying the quality of continuously collected data, is crucial to ensure trustworthiness of analytics insights. A widely used approach for validating data quality is to specify, either manually or automatically, so-called data unit tests that check whether data quality metrics lie within expected bounds. We employ conditional unit tests based on conditional metrics (CMs) that compute data quality signals over specific parts of the ingestion data and therefore allow for a fine-grained detection of errors. A violated conditional unit test specifies a set of erroneous tuples in a natural way: the subrelation that its CM refers to. Unfortunately, the downside of their fine-grained nature is that violating unit tests are often correlated: a single error in an ingestion batch may cause multiple tests (each referring to different parts of the batch) to fail. The key challenge is therefore to untangle this correlation and filter out the most relevant violated conditional unit tests, i.e., tests that identify a core set of erroneous tuples and act as an explanation for the errors. We present CM-Explorer, a system that supports data stewards in quickly finding the most relevant violated conditional unit tests. The system consists of three components: (1) a graph explorer for visualizing the correlation structure of the violated unit tests; (2) a relation explorer for browsing the tuples selected by conditional unit tests; and, (3) a history explorer to get insight why conditional unit tests are violated. 
In this paper, we discuss these components and present the different scenarios that we make available for the demonstration.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134998138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Databases on Modern Networks: A Decade of Research That Now Comes into Practice
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611579
Alberto Lerner, Carsten Binnig, Philippe Cudré-Mauroux, Rana Hussein, Matthias Jasny, Theo Jepsen, Dan R. K. Ports, Lasse Thostrup, Tobias Ziegler
Modern cloud networks are a fundamental pillar of data-intensive applications. They provide high-speed transaction (packet) rates and low overhead, enabling, for instance, truly scalable database designs. These networks, however, are fundamentally different from conventional ones. Arguably, the two key discerning technologies are RDMA and programmable network devices. Today, these technologies are not niche technologies anymore and are widely deployed across all major cloud vendors. The question is thus not if but how a new breed of data-intensive applications can benefit from modern networks, given the perceived difficulty in using and programming them. This tutorial addresses these challenges by exposing how the underlying principles changed as the network evolved and by presenting the new system design opportunities they opened. In the process, we also discuss several hard-earned lessons accumulated by making the transition first-hand.
{"title":"Databases on Modern Networks: A Decade of Research That Now Comes into Practice","authors":"Alberto Lerner, Carsten Binnig, Philippe Cudré-Mauroux, Rana Hussein, Matthias Jasny, Theo Jepsen, Dan R. K. Ports, Lasse Thostrup, Tobias Ziegler","doi":"10.14778/3611540.3611579","DOIUrl":"https://doi.org/10.14778/3611540.3611579","url":null,"abstract":"Modern cloud networks are a fundamental pillar of data-intensive applications. They provide high-speed transaction (packet) rates and low overhead, enabling, for instance, truly scalable database designs. These networks, however, are fundamentally different from conventional ones. Arguably, the two key discerning technologies are RDMA and programmable network devices. Today, these technologies are not niche technologies anymore and are widely deployed across all major cloud vendors. The question is thus not if but how a new breed of data-intensive applications can benefit from modern networks, given the perceived difficulty in using and programming them. This tutorial addresses these challenges by exposing how the underlying principles changed as the network evolved and by presenting the new system design opportunities they opened. In the process, we also discuss several hard-earned lessons accumulated by making the transition first-hand.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134998305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Techniques and Efficiencies from Building a Real-Time DBMS
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611556
V. Srinivasan, Andrew Gooding, Sunil Sayyaparaju, Thomas Lopatic, Kevin Porter, Ashish Shinde, B. Narendran
This paper describes a variety of techniques from over a decade of developing Aerospike (formerly Citrusleaf), a real-time DBMS that is being used in some of the world's largest mission-critical systems that require the highest levels of performance and availability. Such mission-critical systems have many requirements, including the ability to make decisions within a strict real-time SLA (milliseconds) with no downtime, predictable performance so that the first and billionth customer gets the same experience, the ability to scale up 10X (or even 100X) with no downtime, strong consistency for applications that need it, synchronous and asynchronous replication with global transactional capabilities, and the ability to deploy in any public and private cloud environment. We describe how using efficient algorithms to optimize every area of the DBMS helps the system achieve these stringent requirements. Specifically, we describe effective ways to shard, place, and locate data across a set of nodes; efficient identification of cluster membership and cluster changes; efficiencies generated by using a 'smart' client; how to effectively use two-copy rather than three-copy replication; how to reduce the cost of the real-time data footprint by combining the use of memory with flash storage; self-managing clusters for ease of operation, including elastic scaling; and networking and CPU optimizations, including NUMA pinning with multi-threading. The techniques and efficiencies described here have enabled hundreds of deployments to grow by many orders of magnitude with near complete uptime.
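The sharding-and-routing idea, a fixed set of partitions, a client-side partition map, and a "smart" client that contacts the owning node directly, can be sketched as follows. The digest function, partition count, and round-robin placement here are illustrative choices, not Aerospike's exact algorithm.

```python
import hashlib

# Fixed partition count in the spirit of Aerospike-style designs; the hash and
# placement policy below are simplifications for illustration.
NUM_PARTITIONS = 4096


def partition_of(key):
    """Map a record key to a partition via a stable hash of its digest."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "little") % NUM_PARTITIONS


def build_partition_map(nodes):
    """Trivial round-robin placement for the sketch; a real system balances
    replicas and revises the map on cluster membership changes."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def route(key, partition_map):
    """A 'smart' client resolves the owning node itself, avoiding a proxy hop."""
    return partition_map[partition_of(key)]
```

Because every client holds the same partition map, any key resolves to the same node in one local computation, which is what makes single-record latency predictable.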
{"title":"Techniques and Efficiencies from Building a Real-Time DBMS","authors":"V. Srinivasan, Andrew Gooding, Sunil Sayyaparaju, Thomas Lopatic, Kevin Porter, Ashish Shinde, B. Narendran","doi":"10.14778/3611540.3611556","DOIUrl":"https://doi.org/10.14778/3611540.3611556","url":null,"abstract":"This paper describes a variety of techniques from over a decade of developing Aerospike (formerly Citrusleaf), a real-time DBMS that is being used in some of the world's largest mission-critical systems that require the highest levels of performance and availability. Such mission-critical systems have many requirements including the ability to make decisions within a strict real-time SLA (milliseconds) with no downtime, predictable performance so that the first and billionth customer gets the same experience, ability to scale up 10X (or even 100X) with no downtime, support strong consistency for applications that need it, synchronous and asynchronous replication with global transactional capabilities, and the ability to deploy in any public and private cloud environments. We describe how using efficient algorithms to optimize every area of the DBMS helps the system achieve these stringent requirements. Specifically, we describe, effective ways to shard, place and locate data across a set of nodes, efficient identification of cluster membership and cluster changes, efficiencies generated by using a 'smart' client, how to effectively use replications with two copies replication instead of three-copy, how to reduce the cost of the realtime data footprint by combining the use of memory with flash storage, self-managing clusters for ease of operation including elastic scaling, networking and CPU optimizations including NUMA pinning with multi-threading. 
The techniques and efficiencies described here have enabled hundreds of deployments to grow by many orders of magnitude with near complete uptime.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135002981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
On-the-Fly Data Transformation in Action
CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611593
Ju Hyoung Mun, Konstantinos Karatsenidis, Tarikul Islam Papon, Shahin Roozkhosh, Denis Hoornaert, Ulrich Drepper, Ahmed Sanaullah, Renato Mancuso, Manos Athanassoulis
Transactional and analytical database management systems (DBMS) typically employ different data layouts: row-stores for the former and column-stores for the latter. In order to bridge the requirements of the two without maintaining two systems and two (or more) copies of the data, our proposed system Relational Memory employs specialized hardware that transforms the base row table into arbitrary column groups at query execution time. This approach maximizes cache locality and is easy to use via a simple abstraction that allows transparent on-the-fly data transformation. Here, we demonstrate how to deploy and use Relational Memory via four representative scenarios. The demonstration uses the full-stack implementation of Relational Memory on the Xilinx Zynq UltraScale+ MPSoC platform. Conference participants will interact with Relational Memory deployed in the actual platform.
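The on-the-fly materialization of column groups can be mimicked in software to show the interface. Relational Memory performs this transformation in specialized hardware; `project_column_group` below is a hypothetical helper for illustration only.

```python
def project_column_group(rows, columns):
    """Materialize an arbitrary column group from a row-store on demand.
    rows: list of dicts (row-major base table).
    Returns dict column -> list of values (columnar layout)."""
    return {c: [r[c] for r in rows] for c in columns}


# Row-major base table, as a transactional workload would store it.
base_table = [
    {"id": 1, "qty": 3, "price": 9.5},
    {"id": 2, "qty": 1, "price": 4.0},
]

# An analytical query touching only (qty, price) gets a cache-friendly
# columnar layout at query time, without keeping a second copy of the table.
cols = project_column_group(base_table, ["qty", "price"])
```

The key point is that the columnar view is derived per query rather than maintained as a redundant copy, which is what lets one base layout serve both workloads.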
{"title":"On-the-Fly Data Transformation in Action","authors":"Ju Hyoung Mun, Konstantinos Karatsenidis, Tarikul Islam Papon, Shahin Roozkhosh, Denis Hoornaert, Ulrich Drepper, Ahmed Sanaullah, Renato Mancuso, Manos Athanassoulis","doi":"10.14778/3611540.3611593","DOIUrl":"https://doi.org/10.14778/3611540.3611593","url":null,"abstract":"Transactional and analytical database management systems (DBMS) typically employ different data layouts: row-stores for the first and column-stores for the latter. In order to bridge the requirements of the two without maintaining two systems and two (or more) copies of the data, our proposed system Relational Memory employs specialized hardware that transforms the base row table into arbitrary column groups at query execution time. This approach maximizes the cache locality and is easy to use via a simple abstraction that allows transparent on-the-fly data transformation. Here, we demonstrate how to deploy and use Relational Memory via four representative scenarios. The demonstration uses the full-stack implementation of Relational Memory on the Xilinx Zynq UltraScale+ MPSoC platform. Conference participants will interact with Relational Memory deployed in the actual platform.","PeriodicalId":54220,"journal":{"name":"Proceedings of the Vldb Endowment","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135003649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Progressive Partitioning for Parallelized Query Execution in Google's Napa
Tier 3 Computer Science Q1 Computer Science Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611541
Junichi Tatemura, Tao Zou, Jagan Sankaranarayanan, Yanlai Huang, Jim Chen, Yupu Zhang, Kevin Lai, Hao Zhang, Gokul Nath Babu Manoharan, Goetz Graefe, Divyakant Agrawal, Brad Adelberg, Shilpa Kolhar, Indrajit Roy
Napa holds Google's critical data warehouses in log-structured merge trees for real-time data ingestion and sub-second response for billions of queries per day. These queries are often multi-key look-ups in highly skewed tables and indexes. In our production experience, only progressive query-specific partitioning can achieve Napa's strict query latency SLOs. Here we advocate good-enough partitioning that keeps the per-query partitioning time low without risking uneven work distribution. Our design combines pragmatic system choices and algorithmic innovations. For instance, B-trees are augmented with statistics of key distributions, thus serving the dual purpose of aiding lookups and partitioning. Furthermore, progressive partitioning is designed to be "good enough" thereby balancing partitioning time with performance. The resulting system is robust and successfully serves day-in-day-out billions of queries with very high quality of service forming a core infrastructure at Google.
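The abstract's key mechanism — an index augmented with key-distribution statistics that serve the dual purpose of aiding lookups and partitioning — can be sketched as follows (a simplified illustration of statistics-driven, query-specific partitioning, not Napa's actual algorithm or code):

```python
import bisect

# Simplified sketch of query-specific partitioning driven by key-distribution
# statistics kept alongside the index (illustrative only).
# range_bounds are the upper bounds of key ranges; range_counts[i] is the row
# count of the i-th range, so len(range_counts) == len(range_bounds) + 1.

def partition_lookups(keys, range_bounds, range_counts, n_workers):
    """Cost each lookup key from the range statistics, then greedily cut the
    sorted key list into n_workers chunks of roughly equal total cost."""
    keys = sorted(keys)
    costs = [range_counts[bisect.bisect_right(range_bounds, k)] for k in keys]
    target = sum(costs) / n_workers
    partitions, current, acc = [], [], 0
    for k, c in zip(keys, costs):
        current.append(k)
        acc += c
        if acc >= target and len(partitions) < n_workers - 1:
            partitions.append(current)
            current, acc = [], 0
    if current:
        partitions.append(current)
    return partitions

# A skewed table: the key ranges below 100 and above 200 are hot.
parts = partition_lookups([50, 150, 160, 170, 250],
                          range_bounds=[100, 200],
                          range_counts=[1000, 10, 1000],
                          n_workers=2)
print(parts)  # [[50, 150, 160], [170, 250]] -- near-equal work despite skew
```

The point of the sketch is the "good enough" trade-off the abstract describes: a single pass over cheap, pre-maintained statistics yields balanced partitions without an expensive exact split.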
Citations: 0
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Tier 3 Computer Science Q1 Computer Science Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611569
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, Shen Li
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
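The core idea behind fully sharded data parallelism — each of N ranks stores only a 1/N shard of the flattened parameters and reassembles the full set just in time for compute — can be sketched in plain Python (a conceptual illustration only; the real PyTorch FSDP API wraps `nn.Module` and uses collective communication primitives):

```python
# Conceptual sketch of FSDP-style parameter sharding (illustrative only).

def shard(flat_params, n_ranks):
    """Evenly split a flattened parameter list across ranks, zero-padding
    the tail so every rank holds the same number of elements."""
    per_rank = -(-len(flat_params) // n_ranks)  # ceiling division
    padded = flat_params + [0.0] * (per_rank * n_ranks - len(flat_params))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(n_ranks)]

def all_gather(shards, orig_len):
    """Reassemble the full parameter list from the per-rank shards, as done
    just-in-time before a layer's forward/backward compute."""
    full = [x for s in shards for x in s]
    return full[:orig_len]

params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = shard(params, 2)  # rank 0: [0.1, 0.2, 0.3]; rank 1: [0.4, 0.5, 0.0]
assert all_gather(shards, len(params)) == params
```

Because each rank persistently holds only its shard and frees the gathered copy after use, peak memory per device scales down roughly with the number of ranks, which is what lets FSDP fit models far larger than Distributed Data Parallel can.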
Citations: 43