
Latest Publications: 2020 IEEE 36th International Conference on Data Engineering (ICDE)

A Unified Framework for Multi-view Spectral Clustering
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00187 · Pages: 1854-1857
Guo Zhong, Chi-Man Pun
In the era of big data, multi-view clustering has drawn considerable attention in the machine learning and data mining communities due to the abundance of unlabeled multi-view data in real applications. Traditional spectral graph theoretic methods have recently been extended to multi-view clustering and have shown outstanding performance. However, most of them still consist of two separate stages: learning a fixed common real matrix (i.e., continuous labels) of all the views from the original data, and then applying K-means to the resulting common label matrix to obtain the final clustering results. To address this, we design a unified multi-view spectral clustering scheme that learns the discrete cluster indicator matrix in a single stage. Specifically, the proposed framework obtains clustering results directly, without performing K-means clustering. Experimental results on several well-known benchmark datasets verify the effectiveness and superiority of the proposed method compared to state-of-the-art approaches.
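For context, here is a minimal sketch of the conventional two-stage pipeline the abstract argues against — fuse per-view affinities, take a spectral embedding as continuous labels, then run K-means as a separate stage. This is illustrative Python with hypothetical parameter names, not the authors' one-stage method:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def two_stage_multiview_spectral(views, n_clusters, gamma=1.0):
    """Baseline pipeline: average per-view RBF affinities, spectrally
    embed (continuous labels), then K-means -- the separate second
    stage that a unified one-stage framework avoids."""
    n = views[0].shape[0]
    W = np.zeros((n, n))
    for X in views:                                   # one affinity graph per view
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W += np.exp(-gamma * d2)
    W /= len(views)                                   # naive fusion of the views
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, U = eigh(L, subset_by_index=[0, n_clusters - 1])   # continuous label matrix
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)  # stage two
```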
Citations: 7
StructSim: Querying Structural Node Similarity at Billion Scale
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00211 · Pages: 1950-1953
Xiaoshuang Chen, Longbin Lai, Lu Qin, Xuemin Lin
Structural node similarity is widely used in analyzing complex networks. As one of the structural node similarity metrics, role similarity has the merit of indicating automorphism (isomorphism). Existing algorithms to compute role similarity (e.g., RoleSim and NED) suffer from severe performance bottlenecks and thus cannot handle large real-world graphs. In this paper, we propose a new framework, StructSim, to compute nodes' role similarity. Under this framework, we prove that StructSim is guaranteed to be an admissible role similarity metric based on maximum matching. Because maximum matching itself is too costly to scale, we further devise BinCount matching to speed up the computation. BinCount-based StructSim admits a precomputed index that answers a single-pair query in O(k log D) time, where k is a small user-defined parameter and D is the maximum node degree. Extensive empirical studies show that StructSim is significantly faster than existing works for computing structural node similarities on real-world graphs, with comparable effectiveness.
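To convey the flavor of degree-bin signatures — a toy reading of the BinCount idea, not the paper's exact construction or its O(k log D) index — one can bucket each node's neighbor degrees into logarithmic bins and compare two nodes by the overlap of their bin counts:

```python
import math
from collections import Counter

def bincount_signature(adj, v):
    """Toy degree-bin signature: count v's neighbors per log2 degree bin."""
    sig = Counter()
    for u in adj[v]:
        sig[int(math.log2(len(adj[u]) + 1))] += 1
    return sig

def bincount_similarity(adj, u, v):
    """Jaccard-style overlap of two nodes' bin-count signatures."""
    su, sv = bincount_signature(adj, u), bincount_signature(adj, v)
    inter = sum(min(su[b], sv[b]) for b in su.keys() & sv.keys())
    union = sum((su | sv).values())          # Counter union takes max counts
    return inter / union if union else 1.0

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(bincount_similarity(adj, 1, 2))        # structurally identical -> 1.0
```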
Citations: 15
Being Happy with the Least: Achieving α-happiness with Minimum Number of Tuples
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00092 · Pages: 1009-1020
Min Xie, R. C. Wong, Peng Peng, V. Tsotras
When faced with a database containing millions of products, a user may be interested only in a (typically much) smaller representative subset. Various approaches have been proposed to create a good representative subset that fits the user's needs, which are expressed in the form of a utility function (e.g., the top-k and diversification queries). Recently, the regret minimization query was proposed: it does not require users to provide their utility functions and returns a small set of tuples such that any user's favorite tuple in this subset is guaranteed to be not much worse than his/her favorite tuple in the whole database. In a sense, this query finds a small set of tuples that makes the user happy (i.e., not regretful) even if s/he gets the best tuple in the selected set rather than the best tuple among all tuples in the database. In this paper, we study the min-size version of the regret minimization query; that is, we want to determine the minimum number of tuples needed to keep users happy at a given level. We term this problem the α-happiness query: we quantify the user's happiness level by a criterion called the happiness ratio, and guarantee that each user is at least α happy with the returned set (i.e., the happiness ratio is at least α), where α is a real number from 0 to 1. As this is an NP-hard problem, we derive an approximate solution with a theoretical guarantee by considering the problem from a geometric perspective. Since in practical scenarios users are interested in achieving higher happiness levels (i.e., α closer to 1), we performed extensive experiments for these scenarios, using both real and synthetic datasets. Our evaluations show that our algorithm outperforms the best-known previous approaches in two ways: (i) it answers the α-happiness query by returning fewer tuples to users, and (ii) it answers much faster (up to two orders of magnitude improvement for large α).
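As a rough illustration of what "keeping every user at least α happy with few tuples" means — a naive sampling heuristic, not the paper's geometric algorithm with guarantees:

```python
import numpy as np

def min_size_alpha_happy(tuples, alpha=0.95, n_utils=1000, seed=0):
    """Naive heuristic: sample linear utility functions, then repeatedly
    add the favorite tuple of the currently least-happy user until every
    sampled user's happiness ratio reaches alpha."""
    rng = np.random.default_rng(seed)
    utils = rng.random((n_utils, tuples.shape[1]))
    utils /= utils.sum(axis=1, keepdims=True)       # normalized linear utilities
    scores = utils @ tuples.T                       # (n_utils, n_tuples)
    best = scores.max(axis=1)                       # each user's global optimum
    chosen, cur = [], np.zeros(n_utils)             # best value inside the subset
    while True:
        ratios = cur / best
        b = int(ratios.argmin())                    # bottleneck user
        if ratios[b] >= alpha:
            return chosen                           # everyone is alpha-happy
        j = int(scores[b].argmax())                 # bottleneck user's favorite
        chosen.append(j)
        cur = np.maximum(cur, scores[:, j])         # it may help other users too

pts = np.random.default_rng(1).random((500, 4))     # 500 hypothetical products
print(len(min_size_alpha_happy(pts)))               # typically a handful of tuples
```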
Citations: 15
FlashSchema: Achieving High Quality XML Schemas with Powerful Inference Algorithms and Large-scale Schema Data
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00214 · Pages: 1962-1965
Yeting Li, Jialun Cao, H. Chen, Tingjian Ge, Zhiwu Xu, Qiancheng Peng
Obtaining high quality XML schemas to avoid or reduce application risks is an important problem in practice, for which some important aspects have yet to be addressed satisfactorily in existing work. In this paper, we propose FlashSchema, a tool for high quality XML schema design that supports both one-pass and interactive schema design as well as schema recommendation. To the best of our knowledge, no other existing tool supports interactive schema design and schema recommendation. One salient feature of our work is the design of algorithms to infer k-occurrence interleaving regular expressions, which are not only more powerful in model capacity but also more efficient. Such algorithms also form the basis of our interactive schema design. The other feature is that, starting from large-scale schema data harvested from the Web, we devise a new solution for type inference and propose schema recommendation for schema design. Finally, we conduct a series of experiments on two XML datasets, comparing against 9 state-of-the-art algorithms and open-source tools in terms of running time, preciseness, and conciseness. Experimental results show that our work achieves the highest level of preciseness and conciseness within only a few seconds. Experimental results and examples also demonstrate the effectiveness of our type inference and schema recommendation methods.
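A drastically simplified sketch of what inferring a k-occurrence interleaving expression from sample data could look like — for each child element, take the maximum occurrence count seen in any sample and emit an order-free interleaving with counters (hypothetical notation; the paper's inference algorithms are far more powerful):

```python
from collections import Counter

def infer_koccurrence_interleaving(samples):
    """Toy inference: for each child element, find the max number of times
    it occurs in any sampled sequence, then emit an interleaving expression
    e1{0,k1} & e2{0,k2} & ... over the observed elements."""
    max_occ = Counter()
    for seq in samples:
        for name, c in Counter(seq).items():
            max_occ[name] = max(max_occ[name], c)
    return " & ".join(f"{name}{{0,{k}}}" for name, k in sorted(max_occ.items()))

samples = [["title", "author", "author"], ["author", "title"], ["title"]]
print(infer_koccurrence_interleaving(samples))
# author{0,2} & title{0,1}
```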
Citations: 2
Indoor Mobility Semantics Annotation Using Coupled Conditional Markov Networks
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00128 · Pages: 1441-1452
Huan Li, Hua Lu, M. A. Cheema, L. Shou, Gang Chen
Indoor mobility semantics analytics can greatly benefit many pertinent applications. Existing semantic annotation methods mainly focus on outdoor space and require extra knowledge such as POI categories or the regularity of human activity. However, these conditions are difficult to meet in indoor venues, which have relatively small extents but complex topology. This work studies the annotation of indoor mobility semantics that describe an object's mobility event (what) at a semantic indoor region (where) during a time period (when). A coupled conditional Markov network (C2MN) is proposed, with a set of feature functions carefully designed to incorporate indoor topology and mobility behaviors. C2MN is able to capture probabilistic dependencies among positioning records, semantic regions, and mobility events jointly. Nevertheless, the correlation of regions and events hinders parameter learning. We therefore devise an alternating learning algorithm to enable parameter learning over the correlated variables. Extensive experiments demonstrate that our C2MN-based semantic annotation is efficient and effective on both real and synthetic indoor mobility data.
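To illustrate the coordinate-ascent flavor of alternating over two coupled label sequences — fix the event labels and pick the best region label per time step, then the reverse — here is a toy sketch. The score matrices and coupling matrix are made up for illustration; C2MN itself learns such dependencies probabilistically:

```python
import numpy as np

def alternating_labeling(score_r, score_e, couple, iters=10, seed=0):
    """Toy alternating optimization over coupled region/event labels:
    score_r (T x R) and score_e (T x E) are per-step unary scores,
    couple (R x E) scores region-event pairs."""
    rng = np.random.default_rng(seed)
    T, R = score_r.shape
    _, E = score_e.shape
    regions = rng.integers(0, R, T)                 # random initial labels
    events = rng.integers(0, E, T)
    for _ in range(iters):
        # Regions given events: unary score + coupling with current event.
        regions = (score_r + couple[:, events].T).argmax(axis=1)
        # Events given regions: unary score + coupling with current region.
        events = (score_e + couple[regions, :]).argmax(axis=1)
    return regions, events

rng = np.random.default_rng(1)
r, e = alternating_labeling(rng.random((6, 3)), rng.random((6, 4)), rng.random((3, 4)))
print(r, e)
```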
Citations: 7
Contribution Maximization in Probabilistic Datalog
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00076 · Pages: 817-828
T. Milo, Y. Moskovitch, Brit Youngmann
The use of probabilistic datalog programs has recently been advocated for applications that involve recursive computation and uncertainty. While using such programs allows for flexible knowledge derivation, it makes the analysis of query results a challenging task. Particularly, given a set O of output tuples and a number k, one would like to understand which k-size subset of the input tuples has contributed the most to the derivation of O. This is useful for multiple tasks, such as identifying the critical sources of errors and understanding surprising results. Previous works have mainly focused on quantifying the contribution of tuples to a query result in non-recursive SQL queries, very often disregarding probabilistic inference. To quantify the contribution in probabilistic datalog programs, one must account for the recursive relations between input and output data, and for the uncertainty. To this end, we formalize the Contribution Maximization (CM) problem. We then reduce CM to the well-studied Influence Maximization (IM) problem, showing that we can harness techniques developed for IM in our setting. However, we show that such naïve adoption results in poor performance. To overcome this, we propose an optimized algorithm that injects a refined variant of the classic Magic Sets technique, integrated with a sampling method, into IM algorithms, achieving a significant saving of space and execution time. Our experiments demonstrate the effectiveness of our algorithm, even where the naïve approach is infeasible.
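Since the reduction targets the classic Influence Maximization machinery, the textbook greedy IM baseline with Monte-Carlo spread estimation is shown below; this is the generic algorithm, not the paper's Magic-Sets-optimized variant:

```python
import random

def spread(graph, seeds, p=0.1, trials=200, seed=0):
    """Monte-Carlo estimate of expected influence (independent cascade)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            v = frontier.pop()
            for u in graph.get(v, []):
                if u not in active and rng.random() < p:
                    active.add(u)      # u is influenced with probability p
                    frontier.append(u)
        total += len(active)
    return total / trials

def greedy_im(graph, k):
    """Greedy IM: add the node with the largest marginal spread gain, k times."""
    seeds = []
    for _ in range(k):
        best = max((v for v in graph if v not in seeds),
                   key=lambda v: spread(graph, seeds + [v]))
        seeds.append(best)
    return seeds

toy = {1: [2, 3], 2: [3, 4], 3: [4], 4: []}
print(greedy_im(toy, 2))
```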
Citations: 1
Design of Database Systems with DRAM-only Heterogeneous Memory Architecture
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00243 · Pages: 2054-2058
Yifan Qiao
This thesis is the first to advocate a DRAM-only strategy for reducing computing-system memory cost, and investigates its applications to database systems. It envisions a low-cost DRAM module called block-protected DRAM, which reduces bit cost by significantly relaxing the raw reliability of DRAM while employing a long error correction code (ECC) to ensure data integrity at small coding redundancy. Built upon exactly the same DRAM technology, today's byte-accessible DRAM and the envisioned block-protected DRAM strike different trade-offs between memory bit cost and native data access granularity, and naturally form a heterogeneous DRAM-only memory system. The practical feasibility of such heterogeneous memory systems is further strengthened by new media-agnostic and latency-oblivious CPU-memory interfaces such as IBM's OpenCAPI/OMI and Intel's CXL. This DRAM-only design approach leverages the existing DRAM manufacturing infrastructure and is not subject to any fundamental technology risk or uncertainty. Hence, before NVM technologies can eventually fulfill their long-awaited promise (i.e., DRAM-grade speed at flash-grade cost), this DRAM-only design framework can fill the gap to empower continuous progress in computing systems. This thesis aims to develop techniques that enable relational and NoSQL databases to take full advantage of the envisioned low-cost heterogeneous DRAM system. As a first step, we studied how one could employ heterogeneous DRAM to implement a low-cost tiered caching solution for a relational database, and obtained encouraging results using MySQL as a test vehicle.
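A toy model of the tiered-caching idea in the last paragraph — a small byte-accessible DRAM cache holding whole blocks fetched from the block-protected tier, with LRU eviction. The class and names are hypothetical; the thesis prototypes this inside MySQL:

```python
from collections import OrderedDict

class TieredMemory:
    """Toy model: a small byte-accessible DRAM cache (fast, expensive)
    in front of block-protected DRAM that is served in whole ECC blocks."""
    def __init__(self, cache_blocks, block_size, backing):
        self.cache = OrderedDict()      # block_id -> bytes, kept in LRU order
        self.cap = cache_blocks
        self.bs = block_size
        self.backing = backing          # block_id -> bytes (block-protected tier)

    def read(self, addr):
        blk, off = divmod(addr, self.bs)
        if blk in self.cache:
            self.cache.move_to_end(blk)           # cache hit: refresh LRU rank
        else:
            self.cache[blk] = self.backing[blk]   # miss: fetch whole block
            if len(self.cache) > self.cap:
                self.cache.popitem(last=False)    # evict least recently used
        return self.cache[blk][off]

backing = {i: bytes(range(i, i + 8)) for i in range(4)}
mem = TieredMemory(cache_blocks=2, block_size=8, backing=backing)
print(mem.read(3), mem.read(17))                  # byte reads via block cache
```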
Citations: 2
Online Trichromatic Pickup and Delivery Scheduling in Spatial Crowdsourcing
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00089 · Pages: 973-984
Bolong Zheng, Chenze Huang, Christian S. Jensen, Lu Chen, Nguyen Quoc Viet Hung, Guanfeng Liu, Guohui Li, Kai Zheng
In Pickup-and-Delivery problems (PDP), mobile workers are employed to pick up and deliver items with the goal of reducing travel and fuel consumption. Unlike most existing efforts, which focus on finding a schedule that delivers as many items as possible at the lowest cost, we consider a trichromatic (worker-item-task) utility that encompasses worker reliability, item quality, and task profitability. Moreover, we allow customers to specify keywords for desired items when they submit tasks, which may result in multiple pickup options and thus further increases the difficulty of the problem. Specifically, we formulate the problem of Online Trichromatic Pickup and Delivery Scheduling (OTPD), which aims to find optimal delivery schedules with the highest overall utility. To respond quickly to submitted tasks, we propose a greedy solution that finds the schedule with the highest utility-cost ratio. Next, we introduce a skyline kinetic tree-based solution that materializes intermediate results to improve result quality. Finally, we propose a density-based grouping solution that partitions streaming tasks and efficiently assigns them to workers with high overall utility. Extensive experiments with real and synthetic data offer evidence that the proposed solutions outperform baselines with respect to both effectiveness and efficiency.
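A minimal sketch of the greedy utility-cost step described above, with a hypothetical candidate format; the actual system additionally handles keywords, multiple pickup options, and streaming task arrival:

```python
def greedy_assign(candidates):
    """Toy greedy step: candidates are (worker, task, utility, cost) options;
    repeatedly take the feasible option with the highest utility/cost ratio."""
    chosen, used_workers, done_tasks = [], set(), set()
    for w, t, u, c in sorted(candidates, key=lambda x: x[2] / x[3], reverse=True):
        if w not in used_workers and t not in done_tasks:
            chosen.append((w, t))
            used_workers.add(w)
            done_tasks.add(t)
    return chosen

options = [("w1", "t1", 9.0, 3.0), ("w1", "t2", 8.0, 4.0), ("w2", "t2", 5.0, 5.0)]
print(greedy_assign(options))  # [('w1', 't1'), ('w2', 't2')]
```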
Citations: 15
BFT-Store: Storage Partition for Permissioned Blockchain via Erasure Coding
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00205 · Pages: 1926-1929
Xiaodong Qi, Zhao Zhang, Cheqing Jin, Aoying Zhou
The full-replication data storage mechanism commonly used in existing blockchain systems lacks storage scalability: every node keeps a copy of the entire block data, so the overall storage consumption per block is O(n) for n nodes. Moreover, due to the existence of Byzantine nodes, existing partitioning methods, though widely adopted in distributed systems for decades, cannot be applied to blockchain systems directly, so it is critical to devise a new storage mechanism. This paper proposes a novel storage engine, called BFT-Store, that enhances storage scalability by integrating erasure coding with a Byzantine Fault Tolerance (BFT) consensus protocol. First, the storage consumption per block is reduced to O(1), which enlarges overall storage capability as more nodes join the blockchain. Second, an efficient online re-encoding protocol is designed for storage scale-out, and a hybrid replication scheme is employed to improve read performance. Finally, extensive experimental results illustrate the scalability, availability, and efficiency of BFT-Store, which is implemented on the open-source permissioned blockchain Tendermint.
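To see why erasure coding cuts per-node storage to O(1) per block, here is a single-parity toy: a block is split across n nodes as n-1 data chunks plus one XOR parity chunk, and any one lost chunk is recoverable. This is only a sketch — BFT-Store itself needs a stronger code (e.g., Reed-Solomon) to tolerate f Byzantine nodes:

```python
def encode_block(block: bytes, n: int):
    """Split one block into n-1 data chunks plus 1 XOR parity chunk,
    one chunk per node, so each node stores ~1/(n-1) of the block."""
    k = n - 1
    size = -(-len(block) // k)                            # ceil(len/k)
    chunks = [block[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
    parity = bytearray(size)
    for c in chunks:
        for i, byte in enumerate(c):
            parity[i] ^= byte
    return chunks + [bytes(parity)]

def recover(chunks, lost: int):
    """Rebuild one missing chunk by XOR-ing all surviving chunks."""
    size = len(next(c for c in chunks if c is not None))
    out = bytearray(size)
    for j, c in enumerate(chunks):
        if j != lost:
            for i, byte in enumerate(c):
                out[i] ^= byte
    return bytes(out)

parts = encode_block(b"one full block of ledger data", n=4)
parts[1] = None                                           # a node goes down
parts[1] = recover(parts, lost=1)
print(b"".join(parts[:-1]).rstrip(b"\0"))                 # original block restored
```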
Citations: 26
vCBIR: A Verifiable Search Engine for Content-Based Image Retrieval
Pub Date: 2020-04-01 · DOI: 10.1109/ICDE48307.2020.00156 · Pages: 1730-1733
Shangwei Guo, Yang Ji, Ce Zhang, Cheng Xu, Jianliang Xu
We demonstrate vCBIR, a verifiable search engine for Content-Based Image Retrieval. vCBIR allows a small or medium-sized enterprise to outsource its image database to a cloud-based service provider while ensuring the integrity of query processing. Like other common data-as-a-service (DaaS) systems, vCBIR involves three parties: (i) the image owner, who outsources its database; (ii) the service provider, who executes authenticated query processing; and (iii) the client, who issues search queries. By employing a novel query authentication scheme proposed in our prior work [4], the system not only supports cloud-based image retrieval but also generates a cryptographic proof for each query, with which the client can verify the integrity of query results. During the demonstration, we will showcase the usage of vCBIR and provide attendees an interactive experience of verifying query results against an untrusted service provider through a graphical user interface (GUI).
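The proof machinery behind such verifiable DaaS systems typically builds on authenticated data structures. Below is a generic Merkle-tree sketch — not the specific scheme of [4] — of how a client checks that a returned image belongs to the owner's signed database:

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Hash the leaves pairwise up to a single root the owner can sign."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                 # duplicate last on odd levels
        level = [h(level[i] + level[i+1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Sibling hashes from leaf to root -- the proof the provider returns."""
    level, proof = [h(x) for x in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[idx ^ 1], idx % 2))     # (sibling, is_right_child)
        level = [h(level[i] + level[i+1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(leaf, proof, root) -> bool:
    """Client-side check: fold the proof back up and compare to the root."""
    node = h(leaf)
    for sib, is_right in proof:
        node = h(sib + node) if is_right else h(node + sib)
    return node == root

imgs = [b"img0", b"img1", b"img2", b"img3"]
root = merkle_root(imgs)                            # signed by the image owner
proof = merkle_proof(imgs, 2)                       # provider proves img2 included
print(verify(b"img2", proof, root))                 # client checks -> True
```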
Citations: 1