Big Data Research: Latest Articles


A Multi-View Filter for Relation-Free Knowledge Graph Completion

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-05-01 DOI: 10.1016/j.bdr.2023.100397

Juan Li, Wen Zhang, Hongtao Yu

Citations: 1

Task-Oriented Collaborative Graph Embedding Using Explicit High-Order Proximity for Recommendation

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-04-01 DOI: 10.1016/j.bdr.2023.100382

Mintae Kim, Wooju Kim

Citations: 0

What Is a Multi-Modal Knowledge Graph: A Survey

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-03-01 DOI: 10.2139/ssrn.4229435

Jing-hui Peng, Xinyu Hu, Wenbo Huang, Jian Yang

Citations: 2

Predicting Household Electric Power Consumption Using Multi-step Time Series with Convolutional LSTM

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-02-28 DOI: 10.1016/j.bdr.2022.100360

Lucia Cascone , Saima Sadiq , Saleem Ullah , Seyedali Mirjalili , Hafeez Ur Rehman Siddiqui , Muhammad Umer

Energy consumption prediction has become an integral part of a smart and sustainable environment. With future demand forecasts, energy production and distribution can be optimized to meet the needs of the growing population. However, forecasting the demand of individual households is a challenging task due to the diversity of energy consumption patterns. Recently, such forecasting has become popular in artificial intelligence-based smart energy-saving designs, smart grid planning, and social Internet of Things (IoT) based smart homes. Existing approaches to energy demand forecasting are predominantly based on one-step forecasting and have a short forecasting period. To resolve this issue and obtain high prediction accuracy, this study predicts household appliances' power in two phases. In the first phase, a long short-term memory (LSTM) based model is used to predict total generative active power for the coming 500 hours. The second phase employs a hybrid deep learning model that combines the convolutional characteristics of a neural network with LSTM to forecast household electrical energy consumption for the week ahead using Social IoT-based smart meter readings. Experimental results reveal that the proposed convolutional LSTM (ConvLSTM) architecture outperforms other models, with the lowest root mean square error value of 367 kilowatts for weekly household power consumption.
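The abstract contrasts one-step with multi-step forecasting and reports results as a root mean square error. As a rough illustration only, the sketch below shows how a series might be cut into multi-step (input, target) windows and how RMSE could be computed over a forecast horizon. The function names and the naive persistence baseline are assumptions for illustration; they are not the authors' ConvLSTM model or data pipeline.

```python
import math


def make_windows(series, n_in, n_out):
    """Split a series into (input, target) pairs for multi-step forecasting.

    Each pair uses n_in past values as input and the next n_out values as target.
    """
    pairs = []
    for start in range(len(series) - n_in - n_out + 1):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        pairs.append((x, y))
    return pairs


def rmse(actual, predicted):
    """Root mean square error across all forecast steps."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def persistence_forecast(x, n_out):
    """Hypothetical naive baseline: repeat the last observed value n_out times."""
    return [x[-1]] * n_out


if __name__ == "__main__":
    readings = [1, 2, 3, 4, 5, 6]  # made-up values, not from the paper
    for x, y in make_windows(readings, n_in=3, n_out=2):
        print(x, y, rmse(y, persistence_forecast(x, len(y))))
```

A learned model would replace persistence_forecast; the windowing and the error metric stay the same, which is what makes a baseline comparison meaningful.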



Citations: 6

GeoYCSB: A Benchmark Framework for the Performance and Scalability Evaluation of Geospatial NoSQL Databases

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-02-28 DOI: 10.1016/j.bdr.2023.100368

Suneuy Kim, Yvonne Hoang, Tsz Ting Yu, Yuvraj Singh Kanwar

The proliferation of geospatial applications has tremendously increased the variety, velocity, and volume of spatial data that data stores have to manage. Traditional relational databases reveal limitations in handling such big geospatial data, mainly due to their rigid schema requirements and limited scalability. Numerous NoSQL databases have emerged and actively serve as alternative data stores for big spatial data.

This study presents a framework, called GeoYCSB, developed for benchmarking NoSQL databases with geospatial workloads. To develop GeoYCSB, we extend YCSB, a de facto benchmark framework for NoSQL systems, by integrating into its design architecture the new components necessary to support geospatial workloads. GeoYCSB supports both microbenchmarks and macrobenchmarks and facilitates the use of real datasets in both. It can be extended to evaluate any NoSQL database that supports spatial queries, using geospatial workloads performed on datasets of any geometric complexity. We use GeoYCSB to benchmark two leading document stores, MongoDB and Couchbase, and present the experimental results and analysis. Finally, we demonstrate the extensibility of GeoYCSB by including a new dataset consisting of complex geometries and using it to benchmark a system with a wide variety of geospatial queries: Apache Accumulo, a wide-column store, with the GeoMesa framework applied on top.
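To make the idea of a geospatial microbenchmark concrete, here is a minimal sketch that times bounding-box queries against an in-memory list of points. It is a stand-in only: it does not use YCSB, GeoYCSB, or any database client, and every name in it (make_points, box_query, run_workload) is hypothetical.

```python
import random
import time


def make_points(n, seed=0):
    """Generate n pseudo-random (lon, lat) points; the seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [(rng.uniform(-180, 180), rng.uniform(-90, 90)) for _ in range(n)]


def box_query(points, min_lon, min_lat, max_lon, max_lat):
    """Return the points inside an axis-aligned bounding box (full scan, no index)."""
    return [
        p for p in points
        if min_lon <= p[0] <= max_lon and min_lat <= p[1] <= max_lat
    ]


def run_workload(points, boxes):
    """Run every box query once and return (total hits, queries per second)."""
    start = time.perf_counter()
    hits = sum(len(box_query(points, *box)) for box in boxes)
    elapsed = time.perf_counter() - start
    throughput = len(boxes) / elapsed if elapsed > 0 else float("inf")
    return hits, throughput


if __name__ == "__main__":
    pts = make_points(10_000)
    workload = [(-10, -10, 10, 10), (100, 20, 120, 40)]
    print(run_workload(pts, workload))
```

In a real benchmark the full scan in box_query would be replaced by a spatial query issued to the system under test, and throughput would be measured over many operations and client threads.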



Citations: 2

Efficiently Mining Colocation Patterns for Range Query

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-02-28 DOI: 10.1016/j.bdr.2023.100369

Srikanth Baride , Anuj S. Saxena , Vikram Goyal

Colocation pattern mining finds a set of features whose instances frequently appear near one another in the same geographical space. Most existing algorithms for colocation patterns find nearby objects using a single user-provided distance threshold. The value of the distance threshold is data specific, and choosing a suitable distance is not easy for a user. In most real-world scenarios, it is more meaningful to define spatial proximity by a distance range. A range also provides the flexibility to observe how the colocation patterns change with distance and to interpret the result better. Algorithms for mining colocations with a single distance threshold cannot be applied directly to a range of distances due to the computational overhead. We identify several structural properties of the colocation patterns and use them to propose an efficient single-pass colocation mining algorithm for distance range queries, namely Range-CoMine. We compare the performance of Range-CoMine with adapted versions of the well-known Join-less colocation mining approach using both real-world and synthetic data sets and show that Range-CoMine outperforms the other algorithms.
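To illustrate why a distance range need not mean rerunning a single-threshold miner once per distance, the sketch below computes the participation index of one feature pair {A, B} at several thresholds from a single pass over the pairwise distances. This is a simplified illustration restricted to size-2 patterns and brute-force distance computation. It is not the Range-CoMine algorithm, and the function names are assumptions.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def participation_index_over_range(a_points, b_points, thresholds):
    """Participation index of the pair {A, B} at each distance threshold.

    The nearest-neighbour distances are computed once and reused for every
    threshold, so the cost of the distance pass does not grow with the range.
    """
    a_near = nearest_distances(a_points, b_points)
    b_near = nearest_distances(b_points, a_points)
    result = {}
    for d in thresholds:
        ratio_a = sum(x <= d for x in a_near) / len(a_near)  # share of A near some B
        ratio_b = sum(x <= d for x in b_near) / len(b_near)  # share of B near some A
        result[d] = min(ratio_a, ratio_b)
    return result


if __name__ == "__main__":
    a = [(0, 0), (10, 0)]  # made-up coordinates
    b = [(1, 0), (13, 0)]
    print(participation_index_over_range(a, b, [0.5, 1, 3]))
```

Because a larger threshold can only add neighbour relations, the index never decreases as the distance grows; this kind of monotonicity is one example of a structural property that a range-aware miner can exploit.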



Citations: 2

Accelerating Columnar Storage Based on Asynchronous Skipping Strategy

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2023-02-28 DOI: 10.1016/j.bdr.2022.100352

Wenhai Li , Zheng Yang , Lingfeng Deng , Zhiling Cheng , Weidong Wen , Yanxiang He

Many database applications, such as OnLine Analytical Processing (OLAP), web-based information extraction or scientific computation, need to select a subset of fields based on several user-defined filters. Developers of these applications require effective assembly methods for on-demand filtering and aggregation, which raises new challenges in deploying parallel computing components on top of columnar storage.

To efficiently generate qualified records, an asynchronous skipping strategy is presented to speed up filtering and decoding in the column-based storage. Concentrating on filtering-pushdown in parallel analytical workloads, we offer in-depth analysis on record assembly. We highlight the bottleneck of traditional record-wise assembling methods in the cases of evaluating analytical tasks on a nested schema. With a concurrent queue structure, an asynchronous skipping strategy is presented to evaluate column scan separately by a software pipeline involving an optionally different set of threads. We show how to intensively read the sequential blocks of each column, and how to effectively eliminate invalid payloads by integrating filtering-pushdown in an asynchronous I/O stack.

We implement a columnar storage supporting filtering-pushdown in nested schemas. Our experiments are conducted on a de facto standard benchmark using both variant-selectivity scans and ad hoc queries. The results reveal that in parallel I/O-intensive workloads, our implementation improves I/O performance over the state of the art by 1.3x to 2.7x. Coupling the asynchronous strategy with filtering-pushdown, our implementation remarkably outperforms its competitors on workloads with heavyweight coding, on both HDD and SSD.
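The pipeline described above, in which column scans run on separate threads connected by a concurrent queue, can be pictured with a toy example. In the sketch below, one thread evaluates a filter on one column block by block and passes matching row ids through a queue, while a second thread reads the payload column only for those rows. The block layout, function names, and two-thread setup are illustrative assumptions, not the paper's storage format or implementation.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream of row-id batches


def scan_filter_column(filter_blocks, predicate, out_queue):
    """Producer: scan the filter column block by block and emit matching row ids."""
    row_id = 0
    for block in filter_blocks:
        matches = [row_id + i for i, v in enumerate(block) if predicate(v)]
        row_id += len(block)
        if matches:  # blocks with no match produce no downstream work
            out_queue.put(matches)
    out_queue.put(SENTINEL)


def assemble_payload(payload_column, in_queue, results):
    """Consumer: read payload values only for rows that passed the filter."""
    while True:
        matches = in_queue.get()
        if matches is SENTINEL:
            break
        results.extend(payload_column[r] for r in matches)


def filtered_scan(filter_blocks, payload_column, predicate):
    """Run the two stages concurrently, connected by a bounded queue."""
    q = queue.Queue(maxsize=4)
    results = []
    producer = threading.Thread(
        target=scan_filter_column, args=(filter_blocks, predicate, q))
    consumer = threading.Thread(
        target=assemble_payload, args=(payload_column, q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    blocks = [[1, 5, 2], [7, 0, 9]]  # filter column, split into blocks
    payload = ["a", "b", "c", "d", "e", "f"]  # payload column, one value per row
    print(filtered_scan(blocks, payload, lambda v: v > 4))
```

The bounded queue lets the filter stage run ahead of payload assembly without unbounded buffering, and rows that fail the filter are never touched in the payload column.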



Citations: 0

Properties and Performance of the ABCDe Random Graph Model with Community Structure

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100348

Bogumił Kamiński , Tomasz Olczak , Bartosz Pankratz , Paweł Prałat , François Théberge

In this paper, we investigate properties and performance of synthetic random graph models with a built-in community structure. Such models are important for evaluating and tuning community detection algorithms that are unsupervised by nature. We propose ABCDe—a multi-threaded implementation of the ABCD (Artificial Benchmark for Community Detection) graph generator. We discuss the implementation details of the algorithm and compare it with both the previously available sequential version of the ABCD model and with the parallel implementation of the standard and extensively used LFR (Lancichinetti–Fortunato–Radicchi) generator. We show that ABCDe is more than ten times faster and scales better than the parallel implementation of LFR provided in NetworKit. Moreover, the algorithm is not only faster but random graphs generated by ABCD have similar properties to the ones generated by the original LFR algorithm, while the parallelized NetworKit implementation of LFR produces graphs that have noticeably different characteristics.



Citations: 0

A Twig-Based Algorithm for Top-k Subgraph Matching in Large-Scale Graph Data

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100350

Haiwei Zhang , Qijie Bai , Yining Lian , Yanlong Wen

Subgraph matching aims to find similar substructures in a single graph according to a given query graph and is known as a basic query for graph data management. There exist many categories of subgraph matching solutions. Subgraph isomorphism, which is known to be an NP-complete problem, is an initial solution for the subgraph matching task. To speed up the procedure, graph simulation has been presented to match subgraphs in polynomial time. Unfortunately, graph simulation usually loses the topologies of matched subgraphs because of its loose restrictions. In this paper, we propose an approximation approach named kSGM (top-k Subgraph Matching) for subgraph matching based on twig patterns. First, we transform query graphs into twig patterns and match candidate substructures in graph data. Second, we present an optimized join strategy along with a top-k mechanism, including join order selection based on cost evaluation and optimized pruning based on the maximum/minimum possible score. Finally, we design experiments on real-life and synthetic graph data to evaluate the performance of our work. The results show that, compared to existing algorithms, our proposed kSGM clearly reduces the time complexity and guarantees correctness when answering subgraph matching queries.
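The pruning "based on maximum/minimum possible score" follows a common top-k pattern: a candidate is skipped once its best possible score cannot beat the current k-th best result. The sketch below shows that generic pattern with a min-heap. It assumes each candidate comes with a valid upper bound on its score, and it is not the kSGM join strategy itself; all names are hypothetical.

```python
import heapq


def top_k_with_pruning(candidates, k, full_score):
    """Return the k best (candidate, score) pairs and the number of full evaluations.

    candidates: iterable of (candidate, upper_bound), where upper_bound must be
                greater than or equal to full_score(candidate).
    full_score: the (assumed expensive) exact scoring function.
    """
    heap = []  # min-heap of (score, tie_breaker, candidate); heap[0] is the k-th best
    evaluated = 0
    # Visit the most promising bounds first so the pruning threshold rises quickly.
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    for idx, (cand, upper) in enumerate(ordered):
        if len(heap) == k and upper <= heap[0][0]:
            break  # no remaining candidate can enter the top k
        score = full_score(cand)
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, idx, cand))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, idx, cand))
    ranked = sorted(heap, reverse=True)
    return [(cand, score) for score, _, cand in ranked], evaluated


if __name__ == "__main__":
    exact = {"a": 8, "b": 9, "c": 5, "d": 1}  # made-up scores
    bounds = [("a", 10), ("b", 9), ("c", 5), ("d", 4)]
    print(top_k_with_pruning(bounds, 2, exact.get))
```

In this toy run only two of the four candidates are scored exactly, because the bounds of the remaining two cannot exceed the current second-best score.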



Citations: 0

An Intelligent Government Complaint Prediction Approach

IF 3.3 | Zone 3 (Computer Science) | Q1 Business, Management and Accounting

Big Data Research

Pub Date : 2022-11-28 DOI: 10.1016/j.bdr.2022.100336

Siqi Chen , Yanling Zhang , Bin Song , Xiaojiang Du , Mohsen Guizani

Recent advances in machine learning (ML) bring more opportunities for greater implementation of smart government construction. However, there are many challenges in applying government data, due to earlier nonstandard records and man-made errors. In this paper, we propose a practical intelligent government complaint prediction (IGCP) framework that helps governments quickly respond to citizens' consultations and complaints via ML technologies. In addition, we put forward an automatic label correction method and demonstrate its effectiveness in improving the performance of the intelligent government complaint prediction task. Specifically, the central server collects the interaction records from users and departments and automatically integrates them using the label correction approach, which is designed to evaluate the similarity between different labels in the data and to merge highly similar labels and their corresponding samples into their most similar category. Based on the refined data, the central server quickly generates accurate solutions to complaints through text classification algorithms. The main innovation of our approach is that we turn the task of government complaint distribution into a text classification problem that is uniformly coordinated by the central server, and employ the label correction approach to correct redundant labels so that better models can be trained on limited complaint records. To explore the influence of our approach, we evaluate its performance on real-world government service records provided by our collaborator. The experimental results demonstrate that the prediction task using the label correction algorithm achieves significant improvements on almost all metrics of the classifier.
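The label correction step, which merges highly similar labels and their samples into the most similar category, could look roughly like the sketch below. The similarity measure here (cosine similarity between bag-of-words profiles of each label's samples), the 0.8 threshold, and the rule of merging smaller labels into larger ones are all assumptions made for illustration; the abstract does not specify the authors' actual measure.

```python
import math
from collections import Counter


def label_profiles(samples):
    """Build a token-count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def cosine(a, b):
    """Cosine similarity between two token-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def merge_similar_labels(samples, threshold=0.8):
    """Relabel samples so that near-duplicate labels collapse into one category.

    Labels are visited from largest to smallest; a label is merged into the
    most similar already-kept label if their similarity reaches the threshold.
    """
    profiles = label_profiles(samples)
    sizes = Counter(label for _, label in samples)
    ordered = sorted(profiles, key=lambda name: (-sizes[name], name))
    kept, mapping = [], {}
    for label in ordered:
        best, best_sim = None, threshold
        for target in kept:
            sim = cosine(profiles[label], profiles[target])
            if sim >= best_sim:
                best, best_sim = target, sim
        if best is None:
            kept.append(label)
            mapping[label] = label
        else:
            mapping[label] = best
    return [(text, mapping[label]) for text, label in samples], mapping


if __name__ == "__main__":
    records = [  # made-up complaint records
        ("road repair pothole", "Roads"),
        ("pothole road damage", "Roads"),
        ("road pothole repair", "Road Maintenance"),
        ("noise at night", "Noise"),
    ]
    print(merge_similar_labels(records)[1])
```

The relabelled samples can then be passed to any text classifier; with fewer redundant categories, each remaining label has more training examples.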



Citations: 2
