2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW): Latest Articles, Page 5

Improving Distributed Subgraph Matching Algorithm on Timely Dataflow

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.000-2

Zhengmin Lai, Zhengyi Yang, Longbin Lai

The subgraph matching problem is to find all subgraphs of a data graph that are isomorphic to a given query graph. Subgraph matching plays a vital role in the fields of e-commerce, social media and biological science. CliqueJoin is a distributed subgraph matching algorithm that is designed to be efficient and scalable. However, CliqueJoin was originally developed on MapReduce, so its performance can be affected by the notorious I/O overhead of MapReduce when processing multi-round join tasks. Moreover, CliqueJoin does not propose a cost evaluation strategy for labelled graphs, which limits its application in practice, where most real-world graphs are labelled. Targeting these limitations, we propose CliqueJoin++ to improve CliqueJoin in two aspects. Firstly, we implement CliqueJoin++ on the Timely dataflow system instead of MapReduce to avoid considerable I/O cost. Secondly, we extend the cost evaluation function in CliqueJoin to compute optimal join plans for labelled graphs in the distributed context. Extensive experiments show that the proposed method is up to 10 times faster than the MapReduce version for unlabelled matching, and that it achieves good performance and scalability for labelled matching.
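
To make the problem statement concrete, the following is a minimal single-machine sketch of unlabelled subgraph matching by brute force. It is not CliqueJoin or CliqueJoin++; the function name `match_subgraphs` and the edge-list representation are assumptions made for illustration, and a match is taken here to be an injective vertex mapping that preserves every query edge.

```python
from itertools import permutations


def match_subgraphs(data_edges, query_edges):
    """Brute-force matcher (illustrative only, exponential in query size).

    Returns every injective mapping from query vertices to data vertices
    such that each query edge is mapped onto an existing data edge.
    """
    # Store the undirected data graph as a set of ordered pairs.
    data_adj = set()
    for u, v in data_edges:
        data_adj.add((u, v))
        data_adj.add((v, u))

    data_vertices = sorted({x for edge in data_edges for x in edge})
    query_vertices = sorted({x for edge in query_edges for x in edge})

    matches = []
    # Try every ordered choice of distinct data vertices as the image.
    for image in permutations(data_vertices, len(query_vertices)):
        mapping = dict(zip(query_vertices, image))
        if all((mapping[a], mapping[b]) in data_adj for a, b in query_edges):
            matches.append(mapping)
    return matches


# Hypothetical toy input: a triangle 1-2-3 with a pendant vertex 4.
data = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
result = match_subgraphs(data, triangle)
```

Each automorphism of the query yields a separate mapping, so the single triangle in the toy graph is reported once per vertex ordering. Distributed algorithms such as the one described in the abstract avoid this exhaustive enumeration by joining partial matches.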

Citations: 0

Context-Aware Attention-Based Data Augmentation for POI Recommendation

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.00-14

Yang Li, Yadan Luo, Zheng Zhang, S. Sadiq, Peng Cui

With the rapid growth of location-based social networks (LBSNs), Point-Of-Interest (POI) recommendation has been broadly studied in this decade. Recently, next POI recommendation, a natural extension of POI recommendation, has attracted much attention. It aims at suggesting the next POI to a user in a spatial and temporal context, which is a practical yet challenging task in various applications. Existing approaches mainly model the spatial and temporal information, and memorise historical patterns through the user's trajectories for the recommendation. However, they suffer from the negative impact of missing and irregular check-in data, which significantly degrades model performance. In this paper, we propose an attention-based sequence-to-sequence generative model, namely POI-Augmentation Seq2Seq (PA-Seq2Seq), to address the sparsity of the training set by making check-in records evenly spaced. Specifically, the encoder summarises each check-in sequence and the decoder predicts the possible missing check-ins based on the encoded information. In order to learn time-aware correlations among user history, we employ a local attention mechanism to help the decoder focus on a specific range of context information when predicting a certain missing check-in point. Extensive experiments have been conducted on two real-world check-in datasets, Gowalla and Brightkite, for performance and effectiveness evaluation.
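
The local attention idea mentioned above can be sketched as follows. This is a generic illustration, not the PA-Seq2Seq implementation: the function name `local_attention`, the dot-product scoring, and the fixed window given by `center` and `half_width` are all assumptions, since the abstract does not specify them.

```python
import math


def local_attention(query, keys, center, half_width):
    """Softmax attention restricted to a window around `center`.

    Positions outside [center - half_width, center + half_width]
    receive zero weight (assumed windowing scheme).
    """
    lo = max(0, center - half_width)
    hi = min(len(keys), center + half_width + 1)

    # Dot-product score between the query and each key in the window.
    scores = [sum(q * k for q, k in zip(query, keys[i])) for i in range(lo, hi)]

    # Numerically stable softmax over the window only.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)

    weights = [0.0] * len(keys)
    for offset, value in enumerate(exps):
        weights[lo + offset] = value / total
    return weights


# Hypothetical encoder states for a sequence of five check-ins.
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
weights = local_attention([0.0, 1.0], keys, center=2, half_width=1)
```

Restricting the softmax to a window means that distant check-ins cannot influence the prediction for a given missing slot, which is the behaviour the abstract attributes to local attention.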

Citations: 20

Context-Aware Co-attention Neural Network for Service Recommendations

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.00-11

Lei Li, Ruihai Dong, Li Chen

Context-aware recommender systems are able to produce more accurate recommendations by harnessing contextual information, such as the time and location of consumption. Further, user reviews, an important information resource providing valuable information about users' preferences, items' aspects, and implicit contextual features, could be used to enhance the embeddings of users, items, and contexts. However, few works attempt to incorporate these two types of information, i.e., contexts and reviews, into their models. Recent state-of-the-art context-aware methods only characterize relations between two types of entities among users, items and contexts, which may be insufficient, as the final prediction is closely related to all three types of entities. In this paper, we propose a novel model, named Context-aware Co-Attention Neural Network (CCANN), to dynamically infer relations between contexts and users/items, and subsequently to model the degree of matching between users' contextual preferences and items' context-aware aspects via a co-attention mechanism. To better leverage the information from reviews, we propose an embedding method, named Entity2Vec, to jointly learn embeddings of different entities (users, items and contexts) with words in a textual review. Experimental results on three datasets composed of millions of review records crawled from TripAdvisor demonstrate that our CCANN significantly outperforms state-of-the-art recommendation methods, and that Entity2Vec can further boost the model's performance.

Citations: 5

Skyline Nearest Neighbor Search on Multi-layer Graphs

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.000-3

Wanqi Liu, Dong Wen, Hanchen Wang, Fan Zhang, Xubo Wang

Nearest neighbor search is a fundamental problem in graph theory. In real-world applications, the multi-layer graph model is extensively studied to reveal the multi-dimensional relations between the graph entities. In this paper, we formulate a new problem named skyline nearest neighbor search on multi-layer graphs. Given a query vertex u, we aim to compute a set of skyline vertices that are not dominated by other vertices in terms of the shortest distance on all graph layers. We propose an early-termination algorithm instead of naively adopting the traditional skyline procedure as a subroutine. We also investigate the rule to optimize search order in the algorithm and further improve the algorithmic efficiency. The experimental results demonstrate that the optimization strategies work well on different graphs and can speed up the algorithm significantly.
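
The dominance relation described above can be illustrated with a naive baseline: compute shortest distances from the query vertex on every layer, then keep the vertices whose distance vectors are not dominated. This is the straightforward approach the abstract contrasts itself with, not the early-termination algorithm; the names `bfs_distances`, `dominates`, and `skyline_neighbors`, and the unweighted adjacency-dict representation (every vertex present as a key), are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in one unweighted layer."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def dominates(a, b):
    """True if vector `a` is no worse than `b` everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_neighbors(layers, query):
    """Vertices whose per-layer distance vectors are not dominated."""
    per_layer = [bfs_distances(adj, query) for adj in layers]
    vertices = set().union(*(adj.keys() for adj in layers)) - {query}
    inf = float("inf")  # unreachable on a layer
    vectors = {v: tuple(d.get(v, inf) for d in per_layer) for v in vertices}
    return sorted(
        v for v, vec in vectors.items()
        if not any(dominates(other, vec) for w, other in vectors.items() if w != v)
    )


# Hypothetical two-layer graph over the same vertex set.
layer1 = {"u": ["a"], "a": ["u", "b"], "b": ["a", "c"], "c": ["b"]}
layer2 = {"u": ["b"], "b": ["u", "a", "c"], "a": ["b"], "c": ["b"]}
skyline = skyline_neighbors([layer1, layer2], "u")
```

In the toy input, `a` has distance vector (1, 2), `b` has (2, 1), and `c` has (3, 2); `c` is dominated by `a`, so only `a` and `b` remain. The baseline explores every layer completely, which is the cost an early-termination strategy aims to avoid.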

Citations: 2

Learning to Select User-Specific Features for Top-N Recommendation of New Items

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.00-19

Yifan Chen, Xiang Zhao, Jin-Yuan Liu, Bin Ge, Weiming Zhang

Recommending new items to users remains a challenge due to the absence of users' past preferences for these items. Item features from side information are typically leveraged to tackle the problem. Existing methods formulate regression models that take item features as input and produce user ratings as output. Because they rely on high-dimensional item features, these methods are prone to overfitting, which greatly degrades the recommendation experience. In this work, we opt for feature selection to solve the problem of recommending top-N new items with high-dimensional side information. Existing feature selection methods find a common set of features for all users, which fails to differentiate user preferences over item features. To achieve personalization for feature selection, we propose to select item features specifically for each user. The refined features filter out the dimensions that are irrelevant to recommendations or unappealing to users. Experimental results on real-life datasets with high-dimensional side information reveal that the proposed method is effective in singling out features crucial to top-N recommendations, and hence in boosting performance.

Citations: 1

Implementing Big Data Lake for Heterogeneous Data Sources

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.00-37

Hassan Mehmood, Ekaterina Gilman, Marta Cortés, Panos Kostakos, A. Byrne, K. Valta, Stavros Tekes, J. Riekki

Modern connected cities are more and more leveraging advances in ICT to improve their services and the quality of life of their inhabitants. The data generated from different sources, such as environmental sensors, social networking platforms, traffic counters, are harnessed to achieve these end goals. However, collecting, integrating, and analyzing all the heterogeneous data sources available from the cities is a challenge. This article suggests a data lake approach built on Big Data technologies, to gather all the data together for further analysis. The platform, described here, enables data collection, storage, integration, and further analysis and visualization of the results. This solution is the first attempt to integrate a diverse set of data sources from four pilot cities as part of the CUTLER project (Coastal urban development through the lenses of resiliency). The design and implementation details, as well as usage scenarios are presented in this paper.

Citations: 38

A Data-Driven Approach for Tracking Human Litter in Modern Cities

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-04-01 DOI: 10.1109/ICDEW.2019.00-33

Ziang Zhao, Yunfan Kang, A. Magdy, Win Colton Cowger, A. Gray

In recent years, human litter, such as food waste, diapers, construction materials, used motor oil, and hypodermic needles, has been causing growing problems for the environment and quality of life in modern cities. Data about this waste is of significant importance in the field of environmental sciences because its use cases span saving marine life, reducing the risk from natural hazards, community cleaning efforts, and more. In addition, such litter spreads several diseases in densely populated urban areas, such as undeveloped neighborhoods in large modern cities. In this paper, we introduce a data-driven approach that enables environmental scientists and organizations to track, manage, and model human litter data at a large scale through smart technologies. We make a major ongoing effort to collect and maintain this data worldwide from different sources through a community of environmental scientists and partner organizations. With the increasing volume of collected datasets, existing software packages, such as GIS software, do not scale to process, query, and visualize such data. To overcome this, we provide a scalable data management and visualization framework that digests datasets from different sources, with different formats, in a scalable backend that cleans, integrates, and unifies them in a structured form. On top of this backend, frontend applications are built to visualize litter data at multiple spatial levels, from continents and oceans to street level, to enable new opportunities for both environmental scientists and organizations to track, model, and clean up litter. The framework currently manages thirty real datasets and provides different interfaces for different kinds of users.

Citations: 1

Scalable and Privacy-Preserving Design of On/Off-Chain Smart Contracts

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2019-02-18 DOI: 10.1109/ICDEW.2019.00-43

Chao Li, Balaji Palanisamy, Runhua Xu

The rise of smart contract systems such as Ethereum has resulted in a proliferation of blockchain-based decentralized applications, including applications that store and manage a wide range of data. Current smart contracts are designed to be executed solely by miners and are revealed entirely on-chain, resulting in reduced scalability and privacy. In this paper, we discuss how the scalability and privacy of smart contracts can be enhanced by splitting a given contract into an off-chain contract and an on-chain contract. Specifically, functions of the contract that involve high-cost computation or sensitive information can be split off and included in the off-chain contract, which is signed and executed only by the interested participants. The proposed approach allows the participants to reach unanimous agreement off-chain when all of them are honest, saving miners' computing resources and hiding the content of the off-chain contract from the public. In case of a dispute caused by any dishonest participant, a signed copy of the off-chain contract can be revealed so that a verified instance can be created to make miners enforce the true execution result. Thus, honest participants have the ability to redress and penalize any fraudulent or dishonest behavior, which incentivizes all participants to honestly follow the agreed off-chain contract. We discuss techniques for splitting a contract into a pair of on/off-chain contracts and propose a mechanism to address the challenges of handling dishonest participants in the system. Our implementation and evaluation using an example smart contract demonstrate the effectiveness of the proposed approach in Ethereum.

Citations: 28

Elites Tweet? Characterizing the Twitter Verified User Network

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2018-12-23 DOI: 10.1109/ICDEW.2019.00006

Indraneil Paul, Abhinav Khattar, P. Kumaraguru, Manish Gupta, Shaan Chopra

Social networking and publishing platforms, such as Twitter, support the concept of verification. Verified accounts are deemed worthy of platform-wide public interest and are separately authenticated by the platform itself. These platforms have repeatedly asserted that verification is not tantamount to endorsement. However, a significant body of prior work suggests that possessing a verified status symbolizes enhanced credibility in the eyes of the platform audience. As a result, such a status is highly coveted among public figures and influencers. Hence, we attempt to characterize the network of verified users on Twitter and compare the results to similar analyses performed for the entire Twitter network. We extracted the entire network of verified users on Twitter (as of July 2018) and obtained 231,246 English user profiles and 79,213,811 connections. Subsequently, in the network analysis, we found that the sub-graph of verified users mirrors the full Twitter user graph in some aspects, such as possessing a short diameter. However, our findings contrast with earlier findings on multiple aspects, such as the possession of a power-law out-degree distribution, slight disassortativity, and a significantly higher reciprocity rate, as elucidated in the paper. Moreover, we attempt to gauge the presence of salient components within this sub-graph and detect the absence of homophily with respect to popularity, which again is in stark contrast to the full Twitter graph. Finally, we demonstrate stationarity in the time series of verified user activity levels. To the best of our knowledge, this work represents the first quantitative attempt at characterizing verified users on Twitter.
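
One of the metrics named above, the reciprocity rate, can be sketched in a few lines. This uses a common definition (the fraction of directed edges whose reverse edge also exists); the abstract does not state which definition the authors use, so both the definition and the function name `reciprocity` are assumptions.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) for which (v, u) also exists.

    Self-loops and duplicate edges are ignored (assumed convention).
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Hypothetical follow graph: a and b follow each other; the rest are one-way.
follows = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
rate = reciprocity(follows)  # 2 mutual edges out of 4
```

A higher value means more follow relationships are returned, which is the sense in which the abstract reports a higher reciprocity rate among verified users.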

Citations: 20

Cost/Performance in Modern Data Stores: How Data Caching Systems Succeed

2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW)

Pub Date : 2018-06-11 DOI: 10.1145/3211922.3211927

D. Lomet

Summary form only given, as follows. The complete presentation was not made available for publication as part of the conference proceedings. Data in traditional "caching" data systems resides on secondary storage, and is read into main memory only when operated on. This limits system performance. Main memory data stores with data always in main memory are much faster. But this performance comes at a cost. In this paper, we analyze the costs of both in-memory operations and secondary storage operations where data is not "in cache". We study the performance impact of cache misses on caching system performance. The analysis considers both execution and storage costs. Based on our analysis, we derive cost/performance results for a data caching system [Deuteronomy and its Bw-tree] and a main memory system [MassTree] to understand where each demonstrates the best cost per operation, what is driving the cost differences, and the scale of the differences. This analysis (1) provides insight into why data caching systems continue to dominate the market; (2) points to higher performance that does not rely on simply increasing main memory cache size; and (3) suggests a path to lower costs and hence better cost/performance.

Citations: 28