Proceedings of the VLDB Endowment: Latest Articles, Page 7

Angel-PTM: A Scalable and Economical Large-Scale Pre-Training System in Tencent

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611564

Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui

Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services at Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have adopted pre-trained models. In this work, we present Angel-PTM, a productive deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can efficiently train extremely large-scale models using hierarchical memory. The key designs of Angel-PTM are fine-grained memory management via the Page abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address SSD I/O bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale and by up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models using hundreds of GPUs verify its strong scalability.

Citations: 0

FS-Real: A Real-World Cross-Device Federated Learning Platform

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611617

Dawei Gao, Daoyuan Chen, Zitao Li, Yuexiang Xie, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou

Federated learning (FL) is a general distributed machine learning paradigm that provides solutions for tasks where data cannot be shared directly. Due to the difficulties in communication management and the heterogeneity of distributed data and devices, initiating and using an FL algorithm in real-world cross-device scenarios requires significant repetitive effort that may not be transferable to similar projects. To reduce the effort required for developing and deploying FL algorithms, we present FS-Real, an open-source FL platform designed to address the need for a general and efficient infrastructure for real-world cross-device FL. In this paper, we introduce the key components of FS-Real and demonstrate that FS-Real has the following capabilities: 1) reducing the programming burden of FL algorithm development with plug-and-play and adaptable runtimes on Android and other Internet of Things (IoT) devices; 2) handling a large number of heterogeneous devices efficiently and robustly with our communication management components; 3) supporting a wide range of advanced FL algorithms with flexible configuration and extension; 4) reducing the cost and effort of deployment, evaluation, simulation, and performance optimization of FL algorithms with automated toolkits.

Citations: 1

RESCU-SQL: Oblivious Querying for the Zero Trust Cloud

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611627

Xiling Li, Gefei Tan, Xiao Wang, Jennie Rogers, Soamar Homsi

Cloud service providers offer robust infrastructure for rent to organizations of all kinds. High stakes applications, such as the ones in defense and healthcare, are turning to the public cloud for a cost-effective, geographically distributed, always available solution to their hosting needs. Many such users are unwilling or unable to delegate their data to this third-party infrastructure. In this demonstration, we introduce RESCU-SQL, a zero-trust platform for resilient and secure SQL querying outsourced to one or more cloud service providers. RESCU-SQL users can query their DBMS using cloud infrastructure alone without revealing their private records to anyone. It does so by executing the query over secure multiparty computation. We call this system zero trust because it can tolerate any number of malicious servers provided one of them remains honest. Our demo will offer an interactive dashboard with which attendees can observe the performance of RESCU-SQL deployed on several in-cloud nodes for the TPC-H benchmark. Attendees can select a computing party and inject messages from it to explore how quickly it detects and reacts to a malicious party. This is the first SQL system to support all-but-one maliciously secure querying over a semi-honest coordinator for efficiency.
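
The idea of querying data that no single server can read can be illustrated with additive secret sharing, a common building block of secure multiparty computation. The sketch below is not RESCU-SQL's protocol: the modulus, the function names, and the single `SUM`-style aggregate are illustrative assumptions, and it omits the integrity checks that malicious security requires. It only shows why the private values stay hidden unless every party colludes.

```python
import secrets

# Illustrative prime modulus; an assumption, not taken from RESCU-SQL.
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any strict subset is uniformly random."""
    return sum(shares) % MODULUS


def secure_sum(column, n_parties):
    """Each party adds up its own shares locally; only the total is opened."""
    per_party_totals = [0] * n_parties
    for value in column:
        for party, s in enumerate(share(value, n_parties)):
            per_party_totals[party] = (per_party_totals[party] + s) % MODULUS
    return reconstruct(per_party_totals)


# Hypothetical private column held by the data owner.
total = secure_sum([10, 20, 12], n_parties=3)
```

Because the last share is fixed by the others, the secret is revealed only if all parties pool their shares, which mirrors the "one honest server" intuition in the abstract.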

Citations: 0

Odyssey: An Engine Enabling the Time-Series Clustering Journey

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611622

John Paparrizos, Sai Prasanna Teja Reddy

Clustering is one of the most popular time-series tasks because it enables unsupervised data exploration and often serves as a subroutine or preprocessing step for other tasks. Despite being the subject of active research across disciplines for decades, only limited efforts have focused on benchmarking clustering methods for time series. Unfortunately, these studies have (i) omitted popular methods and entire classes of methods; (ii) considered limited choices for underlying distance measures; (iii) performed evaluations on a small number of datasets; or (iv) avoided rigorous statistical validation of the findings. In addition, the sudden enthusiasm for deep learning and the recent slew of proposed deep learning methods underscore the vital need for a comprehensive study. Motivated by the aforementioned limitations, we present Odyssey, a modular and extensible web engine to comprehensively evaluate 80 time-series clustering methods spanning 9 different classes from the data mining, machine learning, and deep learning literature. Odyssey enables rigorous statistical analysis across 128 diverse time-series datasets. Through its interactive interface, Odyssey (i) reveals the best-performing method per class; (ii) identifies classes performing exceptionally well that were previously omitted; (iii) challenges claims about the use of elastic measures in clustering; (iv) highlights the effects of parameter tuning; and (v) debunks claims of superiority of deep learning methods. Odyssey not only facilitates the most extensive study ever performed in this area but, importantly, also reveals an illusion of progress: in reality, none of the evaluated methods could outperform a traditional method, namely k-Shape, with a statistically significant difference. Overall, Odyssey lays the foundations for advancing the state of the art in time-series clustering.

Citations: 1

mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses over and over?

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611606

Stefan Grafberger, Shubha Guha, Paul Groth, Sebastian Schelter

Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this variant to see how the change impacts the pipeline's output score. We recently proposed mlwhatif, a library that enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. We demonstrate how data scientists can leverage mlwhatif for a variety of pipelines and three different what-if analyses focusing on the robustness of a pipeline against data errors, the impact of data cleaning operations, and the impact of data preprocessing operations on fairness. In particular, we demonstrate step-by-step how mlwhatif generates and optimizes the required execution plans for the pipeline analyses. Our library is publicly available at https://github.com/stefan-grafberger/mlwhatif.
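
The common pattern behind these what-if analyses (take a pipeline, create a variant with one small change, re-run it, compare scores) can be sketched in plain Python. This is not the mlwhatif API: `run_pipeline`, `what_if`, the toy threshold "model", and the two variants are hypothetical names and data chosen only to make the pattern concrete, and the sketch re-executes every variant from scratch rather than optimizing shared work as the abstract describes.

```python
import random


def run_pipeline(rows):
    """Toy pipeline: drop missing values, fit a mean threshold, return accuracy."""
    clean = [(x, y) for x, y in rows if x is not None]      # data preparation
    threshold = sum(x for x, _ in clean) / len(clean)       # "training"
    correct = sum((x > threshold) == y for x, y in clean)   # evaluation
    return correct / len(clean)


def inject_missing(rows, fraction, seed=0):
    """Variant: blank out a random fraction of feature values."""
    rng = random.Random(seed)
    return [(None if rng.random() < fraction else x, y) for x, y in rows]


def flip_labels(rows, up_to=10):
    """Variant: corrupt the labels of rows whose feature is <= up_to."""
    return [(x, (not y) if x <= up_to else y) for x, y in rows]


def what_if(rows, variants):
    """Run the baseline and each single-change variant; collect the scores."""
    report = {"baseline": run_pipeline(rows)}
    for name, change in variants.items():
        report[name] = run_pipeline(change(rows))
    return report


# Hypothetical dataset: the label is True exactly when the feature exceeds 50.
rows = [(i, i > 50) for i in range(1, 101)]
report = what_if(rows, {
    "label_errors": flip_labels,
    "missing_30pct": lambda r: inject_missing(r, 0.3),
})
```

Each entry in `report` answers one question of the form "what happens to the score if the input changes in this way", which is the declarative unit the abstract refers to.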

Citations: 0

TsQuality: Measuring Time Series Data Quality in Apache IoTDB

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611601

Yuanhui Qiu, Chenguang Fang, Shaoxu Song, Xiangdong Huang, Chen Wang, Jianmin Wang

Time series data often exhibit various data quality issues, e.g., owing to sensor failures or network transmission errors in the Internet of Things (IoT). There is strong demand for an overview of the data quality issues across the millions of time series stored in a database. In this demo, we design and implement TsQuality, a system for measuring data quality in Apache IoTDB. Four time series data quality measures (completeness, consistency, timeliness, and validity) are implemented as functions in Apache IoTDB or operators in Apache Spark. These data quality measures can also be interpreted by navigating dirty points at different granularities. TsQuality is also well integrated with the big data ecosystem, connecting to Apache Zeppelin for SQL queries and to Apache Superset for an overview of data quality.
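
To make two of the four measures concrete, the sketch below computes a simple completeness and validity score for one series. The definitions are assumptions made for illustration (completeness as the share of expected points under a fixed sampling interval, validity as the share of values inside a plausible range); they are not taken from TsQuality or from the Apache IoTDB function library, and the function names are hypothetical.

```python
def completeness(timestamps, interval):
    """Share of expected points that are present, assuming fixed-interval sampling."""
    if len(timestamps) < 2:
        return 1.0
    ts = sorted(set(timestamps))
    expected = (ts[-1] - ts[0]) // interval + 1
    return min(1.0, len(ts) / expected)


def validity(values, low, high):
    """Share of values that fall inside the accepted range [low, high]."""
    if not values:
        return 1.0
    return sum(low <= v <= high for v in values) / len(values)


# Hypothetical sensor readings: one sample is missing at t=30,
# and -999.0 is a typical sentinel written on sensor failure.
series_completeness = completeness([0, 10, 20, 40, 50], interval=10)
series_validity = validity([20.5, 21.0, -999.0, 22.1], low=-40.0, high=85.0)
```

Scores like these, computed per series or per time window, are the kind of values a dashboard could aggregate into an overview and then drill into to locate individual dirty points.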

Citations: 0

VisualNeo: Bridging the Gap between Visual Query Interfaces and Graph Query Engines

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611608

Kai Huang, Houdong Liang, Chongchong Yao, Xi Zhao, Yue Cui, Yao Tian, Ruiyuan Zhang, Xiaofang Zhou

Visual Graph Query Interfaces (VQIs) empower non-programmers to query graph data by constructing visual queries intuitively. Efficient technologies in Graph Query Engines (GQEs) for interactive search and exploration have also been studied for years. However, these two vibrant scientific fields are traditionally independent of each other, creating a substantial barrier for users who wish to explore the full-stack operations of graph querying. In this demonstration, we propose a novel VQI system built upon Neo4j, called VisualNeo, that facilitates efficient subgraph queries in large graph databases. VisualNeo inherits several features from recent advanced VQIs, including data-driven GUI design and canned pattern generation. Additionally, it includes a database manager module so that users can connect to generic Neo4j databases. It performs query processing through the Neo4j driver and provides an aesthetically pleasing exploration of query results.

Citations: 0

Web Connector: A Unified API Wrapper to Simplify Web Data Collection

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611616

Weiyuan Wu, Pei Wang, Yi Xie, Yejia Liu, George Chow, Jiannan Wang

Collecting structured data from Web APIs, such as the Twitter API, Yelp Fusion API, Spotify API, and DBLP API, is a common task in the data science lifecycle, but it demands advanced programming skills from data scientists. To simplify web data collection and lower the barrier to entry, API wrappers have been developed to wrap API calls into easy-to-use functions. However, existing API wrappers are not standardized, which means that users must download and maintain multiple API wrappers and learn how to use each of them, while developers must spend considerable time creating an API wrapper for any new website. In this demo, we present the Web Connector, which unifies API wrappers to overcome these limitations. First, the Web Connector has an easy-to-use programming interface, designed to provide a user experience similar to that of reading data from relational databases. Second, the Web Connector's novel system architecture requires minimal effort to fetch data for end-users with an existing API description file. Third, the Web Connector includes a semi-automatic API description file generator that leverages the concept of generation by example to create new API wrappers without writing code.

Citations: 0

Private Information Retrieval in Large Scale Public Data Repositories

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611572

Ishtiyaque Ahmad, Divyakant Agrawal, Amr El Abbadi, Trinabh Gupta

The tutorial focuses on Private Information Retrieval (PIR), which allows clients to privately query public or server-owned databases without disclosing their queries. The tutorial covers the basic concepts of PIR such as its types, construction, and critical building blocks, including homomorphic encryption. It also discusses the performance of PIR, existing optimizations for scalability, real-life applications of PIR, and ways to extend its functionalities.
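
The core promise of PIR can be illustrated with the classic two-server XOR construction, sketched below. Note the substitution: this is the information-theoretic multi-server variant, not the homomorphic-encryption-based single-server constructions the abstract mentions, and it assumes the two servers hold identical copies of the database and do not collude. Function names and the integer records are illustrative.

```python
import secrets


def make_queries(n, index):
    """Build two bit vectors that differ only at `index`; each alone looks random."""
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = q1.copy()
    q2[index] ^= 1
    return q1, q2


def answer(database, query):
    """Server side: XOR together every record selected by the query bits."""
    acc = 0
    for record, bit in zip(database, query):
        if bit:
            acc ^= record
    return acc


def retrieve(database, index):
    """Client side: all records cancel out except the one at `index`."""
    q1, q2 = make_queries(len(database), index)
    # In a real deployment, q1 and q2 would go to two different servers.
    return answer(database, q1) ^ answer(database, q2)


# Hypothetical public database of integer records.
db = [11, 22, 33, 44, 55]
fetched = [retrieve(db, i) for i in range(len(db))]
```

Each server must touch every record to answer a query, which hints at why the tutorial devotes attention to performance and scalability optimizations.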

Citations: 0

Visualizing Spreadsheet Formula Graphs Compactly

Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment

Pub Date : 2023-08-01 DOI: 10.14778/3611540.3611613

Fanchao Chen, Dixin Tang, Haotian Li, Aditya G. Parameswaran

Spreadsheets are a ubiquitous data analysis tool, empowering non-programmers and programmers alike to easily express their computations by writing formulae alongside data. The dependencies created by formulae are tracked as formula graphs, which play a central role in many spreadsheet applications and are critical to the interactivity and usability of spreadsheet systems. Unfortunately, as formula graphs become large and complex, it becomes harder for end-users to make sense of them and to trace the dependents or precedents of cells in order to check the accuracy of individual formulae and identify sources of errors. In this paper, we demonstrate a spreadsheet formula graph visualization tool, TACO-Lens, developed as a plugin for Microsoft Excel. Our plugin leverages TACO, our framework for compactly and efficiently representing formula graphs. TACO compresses formula graphs using a key spreadsheet property: tabular locality, which means that cells close to each other are likely to have similar formula structures. This compact representation enables end-users to more easily consume complex dependencies and reduces the response time for tracing dependents and precedents. TACO-Lens, our visualization plugin, depicts the compact representation of TACO and supports users in visually tracing dependents and precedents. In this demonstration, attendees can compare the visualizations of different formula graphs using TACO, Excel's built-in dependency tracing tool, and an approach that does not compress formula graphs, and quantitatively compare the response times of the different approaches.
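
Tabular locality can be illustrated with a small run-length style compression of dependency edges. The sketch below is not TACO's algorithm or data structure: the `(row, col)` cell encoding, the one-reference-per-formula simplification, and the `compress` and `precedent` helpers are assumptions made for illustration. It merges vertically adjacent formula cells that reference a cell at the same relative offset into a single range edge.

```python
def compress(deps):
    """Merge adjacent cells in a column that share the same relative reference.

    deps maps a formula cell (row, col) to the cell (row, col) it references.
    Returns a list of runs: (col, start_row, end_row, (d_row, d_col)).
    """
    runs = []
    for row, col in sorted(deps, key=lambda cell: (cell[1], cell[0])):
        ref_row, ref_col = deps[(row, col)]
        offset = (ref_row - row, ref_col - col)
        if runs:
            r_col, r_start, r_end, r_offset = runs[-1]
            if r_col == col and r_end + 1 == row and r_offset == offset:
                runs[-1] = (col, r_start, row, offset)  # extend the current run
                continue
        runs.append((col, row, row, offset))
    return runs


def precedent(runs, cell):
    """Look up the cell that `cell` depends on, using the compressed runs."""
    row, col = cell
    for r_col, start, end, (d_row, d_col) in runs:
        if r_col == col and start <= row <= end:
            return (row + d_row, col + d_col)
    return None


# Hypothetical sheet: B1..B1000 each reference the A cell in the same row,
# and C1 references B1000.
deps = {(r, 2): (r, 1) for r in range(1, 1001)}
deps[(1, 3)] = (1000, 2)
runs = compress(deps)
```

Here 1001 individual edges collapse into two runs, so both a visualization and a precedent lookup operate on far fewer elements than the uncompressed graph.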

Citations: 0