Information Systems: Latest Articles (Page 6)

GAMA: A multi-graph-based anomaly detection framework for business processes via graph neural networks

IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems

Pub Date: 2024-05-19 Volume: 124 Article: 102405 DOI: 10.1016/j.is.2024.102405

Wei Guan, Jian Cao, Yang Gu, Shiyou Qian

Anomalies in business processes are inevitable for various reasons such as system failures and operator errors. Detecting anomalies is important for the management and optimization of business processes. However, prevailing anomaly detection approaches often fail to capture crucial structural information about the underlying process. To address this, we propose a multi-Graph based Anomaly detection fraMework for business processes via grAph neural networks, named GAMA. GAMA makes use of structural process information and attribute information in a more integrated way. In GAMA, multiple graphs are applied to model a trace in which each attribute is modeled as a separate graph. In particular, the graph constructed for the special attribute activity reflects the control flow. Then GAMA employs a multi-graph encoder and a multi-sequence decoder on multiple graphs to detect anomalies in terms of the reconstruction errors. Moreover, three teacher forcing styles are designed to enhance GAMA’s ability to reconstruct normal behaviors and thus improve detection performance. We conduct extensive experiments on both synthetic logs and real-life logs. The experiment results demonstrate that GAMA outperforms state-of-the-art methods for both trace-level and attribute-level anomaly detection.
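
As a rough illustration of the "one graph per attribute" idea, the sketch below builds a directed graph for every attribute of a single trace. The construction rule (an edge between the values of two consecutive events) and all names, including `build_attribute_graphs` and the sample trace, are assumptions made for illustration; the abstract does not say how GAMA actually constructs its graphs.

```python
from collections import defaultdict


def build_attribute_graphs(trace):
    """Build one directed graph per attribute from a single trace.

    trace: list of events, each a dict mapping attribute name -> value.
    Returns {attribute: {(source_value, target_value): edge_count}}.
    Assumption: an edge links the values of two consecutive events.
    """
    graphs = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(trace, trace[1:]):
        # Only attributes present in both events can form an edge.
        for attr in prev.keys() & curr.keys():
            graphs[attr][(prev[attr], curr[attr])] += 1
    return {attr: dict(edges) for attr, edges in graphs.items()}


# Hypothetical trace with two attributes per event.
trace = [
    {"activity": "Create order", "user": "alice"},
    {"activity": "Approve order", "user": "bob"},
    {"activity": "Create order", "user": "alice"},
    {"activity": "Approve order", "user": "bob"},
]
graphs = build_attribute_graphs(trace)
```

Under this assumed rule, the graph for the `activity` attribute is a directly-follows graph, which is one common way to expose control flow.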

Citations: 0

TRGST: An enhanced generalized suffix tree for topological relations between paths

Pub Date: 2024-05-18 Volume: 125 Article: 102406 DOI: 10.1016/j.is.2024.102406

Carlos Quijada-Fuentes, M. Andrea Rodríguez, Diego Seco

This paper introduces the TRGST data structure, which is designed to handle queries related to topological relations between paths represented as sequences of stops in a network. As an example, these paths could correspond to stops on a public transport network, and a query of interest is to retrieve paths that share at least $k$ consecutive stops. While topological relations among spatial objects have received extensive attention, the efficient processing of these relations in the context of trajectory paths, considering both time and space efficiency, remains a relatively less explored domain. Taking inspiration from pattern matching implementations, the TRGST data structure is constructed on the foundation of the Generalized Suffix Tree. Its purpose is to provide a compact representation of a set of paths and to efficiently handle topological relation queries by leveraging the pattern search capabilities inherent in this structure. The paper provides a detailed account of the structure and algorithms of TRGST, followed by a performance analysis utilizing both real and synthetic data. The results underscore the remarkable scalability of the TRGST in terms of both query time and space utilization.
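
To make the example query concrete, the sketch below retrieves the paths that share at least $k$ consecutive stops with a given path. It deliberately uses a plain inverted index over runs of $k$ stops, not a generalized suffix tree, so it shows the query semantics only and none of TRGST's compactness; the function names and the sample network are hypothetical.

```python
from collections import defaultdict


def build_kgram_index(paths, k):
    """Map every run of k consecutive stops to the ids of the paths containing it."""
    index = defaultdict(set)
    for path_id, stops in paths.items():
        for i in range(len(stops) - k + 1):
            index[tuple(stops[i:i + k])].add(path_id)
    return index


def paths_sharing_k_stops(paths, query_id, k):
    """Return ids of other paths sharing at least k consecutive stops with the query path.

    Two paths share at least k consecutive stops exactly when they have
    a common run of length k, so looking up the query's k-runs suffices.
    Stop order matters: a reversed run does not count as shared.
    """
    index = build_kgram_index(paths, k)
    query = paths[query_id]
    result = set()
    for i in range(len(query) - k + 1):
        result |= index.get(tuple(query[i:i + k]), set())
    result.discard(query_id)
    return result


# Hypothetical public transport lines, each a sequence of stops.
paths = {
    "line1": ["A", "B", "C", "D"],
    "line2": ["X", "B", "C", "D", "Y"],
    "line3": ["D", "C", "B"],
    "line4": ["A", "B", "Z"],
}
share3 = paths_sharing_k_stops(paths, "line1", 3)
share2 = paths_sharing_k_stops(paths, "line1", 2)
```

This index has to be rebuilt for every value of $k$, which is the kind of cost a suffix-tree-based structure is meant to avoid.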

Citations: 0

MBDL: Exploring dynamic dependency among various types of behaviors for recommendation

Pub Date: 2024-05-18 Volume: 124 Article: 102407 DOI: 10.1016/j.is.2024.102407

Hang Zhang, Mingxin Gan

Users have various behaviors on items, including page view, tag-as-favorite, add-to-cart, and purchase in online shopping platforms. These various types of behaviors reflect users’ different intentions, which also help learn their preferences on items in a recommender system. Although some multi-behavior recommendation methods have been proposed, two significant challenges have not been widely noticed: (i) capturing heterogeneous and dynamic preferences of users simultaneously from different types of behaviors; (ii) modeling the dynamic dependency among various types of behaviors. To overcome the above challenges, we propose a novel multi-behavior dynamic dependency learning method (MBDL) to explore the heterogeneity and dependency among various types of behavior sequences for recommendation. In brief, MBDL first uses a dual-channel interest encoder to learn the long-term interest representations and the evolution of short-term interests from the behavior-aware item sequences. Then, MBDL adopts a contrastive learning method to preserve the consistency of user’s long-term behavioral patterns, and a multi-head attention network to capture the dynamic dependency among short-term interactive behaviors. Finally, MBDL adaptively integrates the influence of long- and short-term interests to predict future user–item interactions. Experiments on two real-world datasets show that the proposed MBDL method outperforms state-of-the-art methods significantly on recommendation accuracy. Further ablation studies demonstrate the effectiveness of our model and the benefits of learning dynamic dependency among types of behaviors.
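
One simple way to read "adaptively integrates the influence of long- and short-term interests" is a learned gate that mixes the two interest vectors. The sketch below shows such a gate; it is an assumption chosen for illustration, and the abstract does not state which fusion mechanism MBDL uses. The name `gated_fusion` and the weights are hypothetical.

```python
import math


def gated_fusion(long_term, short_term, gate_weights, gate_bias=0.0):
    """Mix two interest vectors with a scalar sigmoid gate.

    alpha = sigmoid(w . [long_term; short_term] + b)
    fused = alpha * long_term + (1 - alpha) * short_term
    Assumption: a single scalar gate; real models often gate per dimension.
    """
    features = list(long_term) + list(short_term)
    z = sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
    alpha = 1.0 / (1.0 + math.exp(-z))
    fused = [alpha * l + (1.0 - alpha) * s for l, s in zip(long_term, short_term)]
    return alpha, fused


# With all-zero gate weights the gate is neutral and the result is a plain average.
alpha, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_weights=[0.0, 0.0, 0.0, 0.0])
```

In a trained model the gate weights would be learned, so the balance between long- and short-term interest could differ per user.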

Citations: 0

Storage Management with Multi-Version Partitioned BTrees

Pub Date: 2024-05-15 Volume: 125 Article: 102403 DOI: 10.1016/j.is.2024.102403

Christian Riegger, Ilia Petrov

Modern persistent Key/Value-Stores operate on updatable datasets that massively exceed the size of available main memory. Tree-based key/value storage management structures became particularly popular in storage engines. B$^{+}$-Trees allow constant search performance; however, write-heavy workloads yield inefficient write patterns to secondary storage devices and poor performance characteristics. LSM-Trees overcome this issue by horizontally partitioning data into fractions small enough to fully reside in main memory, but require frequent maintenance to sustain search performance.

To this end, firstly, we propose Multi-Version Partitioned BTrees (MV-PBT) as the sole storage and index management structure in key-sorted storage engines like Key/Value-Stores. Secondly, we compare MV-PBT against LSM-Trees. The logical horizontal partitioning in MV-PBT allows leveraging recent advances in modern B$^{+}$-Tree techniques in a small, transparent, memory-resident portion of the structure. Structural properties sustain steady read performance, even on historical data, and yield efficient write patterns as well as reduced write-amplification.

We integrate MV-PBT in the WiredTiger key/value storage engine. MV-PBT offers up to 2x higher steady throughput than LSM-Trees and several orders of magnitude higher than B$^{+}$-Trees in a YCSB workload. Moreover, MV-PBT exhibits robust time-travel query performance and outperforms LSM-Trees by 20% and B$^{+}$-Trees by an order of magnitude.
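
The combination of horizontal partitioning, versioned records and time-travel reads can be sketched with a toy store. This is not MV-PBT's actual design: the class name, the freeze threshold and the use of logical timestamps are all assumptions, and a real engine would keep the partitions inside one B$^{+}$-Tree rather than in Python lists.

```python
import bisect


class PartitionedVersionStore:
    """Toy multi-version store: one mutable partition plus immutable sorted partitions."""

    def __init__(self, partition_limit=2):
        self.partition_limit = partition_limit  # records buffered before freezing
        self.mutable = []   # in-memory records: (key, timestamp, value)
        self.frozen = []    # immutable partitions: (sorted [(key, ts)], [values])
        self.clock = 0      # logical timestamp, incremented on every write

    def put(self, key, value):
        """Append a new version of key; older versions are never overwritten."""
        self.clock += 1
        self.mutable.append((key, self.clock, value))
        if len(self.mutable) >= self.partition_limit:
            # Freeze the buffer: sort once by (key, timestamp) and append it whole.
            records = sorted(self.mutable, key=lambda r: (r[0], r[1]))
            index = [(k, ts) for k, ts, _ in records]
            values = [v for _, _, v in records]
            self.frozen.append((index, values))
            self.mutable = []
        return self.clock

    def get(self, key, as_of=None):
        """Return the newest value of key with timestamp <= as_of (default: now)."""
        as_of = self.clock if as_of is None else as_of
        best = None  # (timestamp, value) of the newest visible version so far
        for k, ts, value in self.mutable:
            if k == key and ts <= as_of and (best is None or ts > best[0]):
                best = (ts, value)
        for index, values in self.frozen:
            # Last entry that sorts at or before (key, as_of) in this partition.
            i = bisect.bisect_right(index, (key, as_of)) - 1
            if i >= 0 and index[i][0] == key:
                ts = index[i][1]
                if best is None or ts > best[0]:
                    best = (ts, values[i])
        return None if best is None else best[1]


store = PartitionedVersionStore(partition_limit=2)
t1 = store.put("a", 1)   # first version of "a"
store.put("b", 2)        # buffer is full, so it is frozen as a partition
store.put("a", 3)        # newer version of "a" lives in the mutable partition
current = store.get("a")
historical = store.get("a", as_of=t1)
```

Because frozen partitions are written once and never modified, every write reaches storage as a sequential append, which is the property that keeps write amplification low in partitioned designs.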

Citations: 0

Recognizing task-level events from user interaction data

Pub Date: 2024-05-15 Volume: 124 Article: 102404 DOI: 10.1016/j.is.2024.102404

Adrian Rebmann, Han van der Aa

User interaction data comprises events that capture individual actions that a user performs on their computer. Such events provide detailed records about how users carry out their tasks in a process, even when this involves different applications. Although the comprehensiveness of such data provides a promising basis for process mining, user interaction events cannot be used directly for this purpose, because they do not meet two essential requirements. In particular, they neither indicate their relation to a process-level activity nor their relation to a specific process execution. Therefore, user interaction data needs to be transformed so that it meets these requirements before process mining techniques can be applied. This transformation problem comprises identifying tasks and their types and determining the relation between tasks and process executions. While some existing approaches tackle parts of this problem, none address it comprehensively. Therefore, we propose an unsupervised approach for recognizing task-level events from user interaction data that addresses it in full. It segments user interaction data to identify tasks, categorizes these according to their type, and relates tasks to each other via object instances it extracts from the user interaction events. In this manner, our approach creates task-level events that meet the requirements of process mining settings. Our evaluation demonstrates the approach’s efficacy and shows that its combined consideration of control-flow, data, and semantic information allows it to outperform baseline approaches in both online and offline settings.
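
The two steps of segmenting events into tasks and relating tasks through shared objects can be sketched as follows. The segmentation rule used here (start a new segment after a long pause) and the event fields `ts` and `object` are assumptions for illustration only; the abstract does not describe how the proposed approach actually segments events or extracts object instances.

```python
def segment_events(events, max_gap_seconds=60):
    """Split time-ordered events into segments wherever the pause exceeds max_gap_seconds."""
    segments = []
    for event in events:
        if segments and event["ts"] - segments[-1][-1]["ts"] <= max_gap_seconds:
            segments[-1].append(event)
        else:
            segments.append([event])
    return segments


def segments_by_object(segments):
    """Map each object id to the indices of the segments whose events mention it."""
    by_object = {}
    for i, segment in enumerate(segments):
        for event in segment:
            obj = event.get("object")
            if obj is not None and i not in by_object.setdefault(obj, []):
                by_object[obj].append(i)
    return by_object


# Hypothetical low-level UI events: timestamp in seconds, action, optional object id.
events = [
    {"ts": 0, "action": "open form", "object": "order-17"},
    {"ts": 10, "action": "edit field", "object": "order-17"},
    {"ts": 200, "action": "open mail", "object": "order-17"},
    {"ts": 205, "action": "click send", "object": None},
    {"ts": 500, "action": "open form", "object": "order-42"},
]
segments = segment_events(events)
related = segments_by_object(segments)
```

Here the first two segments both mention `order-17`, so they would be candidates for belonging to the same process execution.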

Citations: 0

Learning complex predicates for cardinality estimation using recursive neural networks

Pub Date: 2024-05-08 Volume: 124 Article: 102402 DOI: 10.1016/j.is.2024.102402

Zhi Wang, Hancong Duan, Yamin Cheng, Geyong Min

Cardinality estimation is one of the most vital components of the query optimizer and has been studied extensively in recent years. On the one hand, traditional cardinality estimators, such as histograms and sampling methods, struggle to capture the correlations between multiple tables. On the other hand, current learning-based methods still struggle with feature extraction for complex predicates and join relations, which leads to inaccurate cost estimation and, eventually, a sub-optimal execution plan. To address these challenges, we present a novel end-to-end architecture leveraging deep learning to provide high-quality cardinality estimation. We exploit an effective feature extraction technique that makes full use of the structure of tables, join conditions and predicates. In addition, we use a sampling-based technique to construct sample bitmaps for the tables and join conditions, respectively. We also utilize the characteristics of the predicate tree, combined with a recursive neural network, to extract deep-level features of complex predicates. Finally, we embed these feature vectors into the model, which consists of three components: a recursive neural network, a graph convolutional neural network (GCN) and a multi-set convolutional neural network, to obtain the estimated cardinality. Extensive experiments on real-world workloads demonstrate that our approach achieves a significant improvement in accuracy and can be extended to queries with complex semantics.
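
The sample bitmaps and the recursive treatment of a predicate tree can be illustrated without any neural network. The sketch below walks a predicate tree bottom-up and marks which sampled rows satisfy it; the tuple encoding of the tree, the function name and the data are assumptions, and the final scaling step is the classic sampling estimate rather than the learned model the abstract describes.

```python
import operator

# Comparison operators allowed in leaf predicates.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


def sample_bitmap(node, sample_rows):
    """Recursively evaluate a predicate tree on sampled rows.

    Leaf:  (column, op, value)
    Inner: ("AND" | "OR", left_subtree, right_subtree)
    Returns one 0/1 flag per sampled row.
    """
    if node[0] in ("AND", "OR"):
        left = sample_bitmap(node[1], sample_rows)
        right = sample_bitmap(node[2], sample_rows)
        if node[0] == "AND":
            return [l & r for l, r in zip(left, right)]
        return [l | r for l, r in zip(left, right)]
    column, op, value = node
    return [int(OPS[op](row[column], value)) for row in sample_rows]


# Hypothetical sample of four rows drawn from a table of 1000 rows.
sample_rows = [
    {"age": 25, "city": "Paris"},
    {"age": 40, "city": "Rome"},
    {"age": 31, "city": "Paris"},
    {"age": 19, "city": "Oslo"},
]
predicate = ("AND", ("age", ">", 20),
             ("OR", ("city", "=", "Paris"), ("city", "=", "Oslo")))
bitmap = sample_bitmap(predicate, sample_rows)
estimate = sum(bitmap) / len(sample_rows) * 1000
```

In a learned estimator the bitmap would typically be one input feature among several, not the final answer.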

Citations: 0

BF-BigGraph: An efficient subgraph isomorphism approach using machine learning for big graph databases

Pub Date: 2024-05-06 Volume: 124 Article: 102401 DOI: 10.1016/j.is.2024.102401

Adnan Yazici, Ezgi Taşkomaz

Graph databases are flexible NoSQL databases used to efficiently store and query complex and big data. One of the most difficult problems in graph databases is the problem of subgraph isomorphism, which involves finding a matching pattern in a given graph. Subgraph isomorphism algorithms generally encounter problems in the efficient processing of complex queries owing to a lack of pruning methods and the use of a matching order. In this study, we present a new subgraph isomorphism approach based on the best-first search design strategy and name it BF-BigGraph. Our approach includes a machine learning technique to efficiently find the best matching order for various complex queries. The parameters we used in our approach as heuristics to improve the performance of complex queries on graph-based NoSQL databases are database volatility, database size, type of query, and the size of the query. We utilized the Random Forest machine learning method to narrow candidate nodes to a higher level of search and effectively reduce the search space for efficient querying and retrieval. We compared BF-BigGraph with state-of-the-art approaches, namely BB-Graph, Neo4j’s Cypher, DualIso, GraphQL, TurboIso, and VF3, using publicly available databases, including the undirected graphs WorldCup, Pokec, and Youtube, and a big graph database of a real demographic application (a population database), a big directed graph with approximately 70 million nodes. The performance results of our approach for different types of complex queries on all these databases are significantly better in terms of computation time and required memory than those of other competing approaches in the literature.
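
The role of pruning and of the matching order can be seen in a small backtracking matcher. The sketch below is a generic textbook-style procedure, not BF-BigGraph: it orders query nodes by a static "fewest candidates first" heuristic in place of the learned order and best-first strategy described above, and the graph encoding and names are assumptions.

```python
def find_subgraph_matches(query, data, query_labels, data_labels):
    """Enumerate embeddings of a query graph in a data graph.

    Graphs are undirected adjacency dicts: node -> set of neighbours.
    Every query edge must map to a data edge (non-induced matching).
    """
    # Pruning: keep data nodes with the same label and a large enough degree.
    candidates = {
        q: [d for d in data
            if data_labels[d] == query_labels[q] and len(data[d]) >= len(query[q])]
        for q in query
    }
    # Matching order: most selective query node first (a simple static heuristic).
    order = sorted(query, key=lambda q: len(candidates[q]))
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        q = order[len(mapping)]
        for d in candidates[q]:
            if d in mapping.values():
                continue  # each data node may be used only once
            # Every already-mapped neighbour of q must be adjacent to d.
            if all(mapping[n] in data[d] for n in query[q] if n in mapping):
                mapping[q] = d
                extend(mapping)
                del mapping[q]

    extend({})
    return matches


# Hypothetical query: two persons linked to the same company.
query = {"q1": {"q2"}, "q2": {"q1", "q3"}, "q3": {"q2"}}
query_labels = {"q1": "person", "q2": "company", "q3": "person"}
data = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
data_labels = {1: "person", 2: "company", 3: "person", 4: "city"}
matches = find_subgraph_matches(query, data, query_labels, data_labels)
```

Starting from the query node with the fewest candidates keeps the branching factor small near the root of the search, which is why the choice of matching order has such a large effect on running time.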

Citations: 0

Responsible composition and optimization of integration processes under correctness preserving guarantees

IF 3.7, Zone 2 (Computer Science), Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems

Pub Date : 2024-04-30 DOI: 10.1016/j.is.2024.102400

Daniel Ritter, Fredrik Nordvall Forsberg, Stefanie Rinderle-Ma

Enterprise Application Integration deals with the problem of connecting heterogeneous applications, and is the centerpiece of current on-premise, cloud and device integration scenarios. For integration scenarios, structurally correct composition of patterns into processes and improvements of integration processes are crucial. In order to achieve this, we formalize compositions of integration patterns based on their characteristics, and describe optimization strategies that help to reduce the model complexity, and improve the process execution efficiency using design time techniques. Using the formalism of timed DB-nets – a refinement of Petri nets – we model integration logic features such as control- and data flow, transactional data storage, compensation and exception handling, and time aspects that are present in reoccurring solutions as separate integration patterns. We then propose a realization of optimization strategies using graph rewriting, and prove that the optimizations we consider preserve both structural and functional correctness. We evaluate the improvements on a real-world catalog of pattern compositions, containing over 900 integration processes, and illustrate the correctness properties in case studies based on two of these processes.
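
The graph-rewriting idea in this abstract can be illustrated with a toy sketch. It is not the authors' timed DB-net formalism or their optimization catalog: the `Node` and `ProcessGraph` types and the "merge consecutive filters" rule are hypothetical, chosen only to show the match-and-replace shape of a rewrite step that shrinks a process model while keeping the combined filter conditions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                          # hypothetical pattern kind, e.g. "filter"
    conditions: tuple[str, ...] = ()   # hypothetical filter predicates


@dataclass
class ProcessGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def successors(self, node_id: str) -> list[str]:
        return sorted(dst for src, dst in self.edges if src == node_id)

    def predecessors(self, node_id: str) -> list[str]:
        return sorted(src for src, dst in self.edges if dst == node_id)


def merge_consecutive_filters(graph: ProcessGraph) -> bool:
    """Apply one rewrite step; return True if the graph changed.

    Match: filter `a` whose only successor `b` is a filter with `a` as its
    only predecessor. Replace: drop `b`, move its conditions onto `a`, and
    connect `a` to the former successors of `b`.
    """
    for a in sorted(graph.nodes):
        if graph.nodes[a].kind != "filter":
            continue
        succ = graph.successors(a)
        if len(succ) != 1:
            continue
        b = succ[0]
        if b == a or graph.nodes[b].kind != "filter":
            continue
        if graph.predecessors(b) != [a]:
            continue
        outgoing = graph.successors(b)
        # Keep the semantics: the merged filter checks both condition lists.
        graph.nodes[a].conditions += graph.nodes[b].conditions
        graph.edges = {(s, d) for s, d in graph.edges if b not in (s, d)}
        graph.edges |= {(a, d) for d in outgoing}
        del graph.nodes[b]
        return True
    return False


def optimize(graph: ProcessGraph) -> int:
    """Rewrite until the rule no longer matches; return the step count."""
    steps = 0
    while merge_consecutive_filters(graph):
        steps += 1
    return steps


# Usage: a linear process with three filters in a row.
example = ProcessGraph(
    nodes={
        "start": Node("start"),
        "f1": Node("filter", ("x>0",)),
        "f2": Node("filter", ("y<5",)),
        "f3": Node("filter", ("z==1",)),
        "end": Node("end"),
    },
    edges={("start", "f1"), ("f1", "f2"), ("f2", "f3"), ("f3", "end")},
)
steps_applied = optimize(example)  # two merges leave start -> f1 -> end
```

The rule only fires when the second filter has no other incoming edge, so no other path through the graph loses a step; that side condition is the toy counterpart of the correctness argument the abstract mentions.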

Citations: 0

Special Issue with Best Papers from ICPM 2022

IF 3.7, Zone 2 (Computer Science), Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems

Pub Date : 2024-04-21 DOI: 10.1016/j.is.2024.102389

Andrea Burattin, Artem Polyvyanyy, Barbara Weber

Citations: 0

TransLSTD: Augmenting hierarchical disease risk prediction model with time and context awareness via disease clustering

IF 3.7, Zone 2 (Computer Science), Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems

Pub Date : 2024-04-19 DOI: 10.1016/j.is.2024.102390

Tao You, Qiaodong Dang, Qing Li, Peng Zhang, Guanzhong Wu, Wei Huang

The use of electronic health records has become widespread, providing a valuable source of information for predicting disease risk. While deep neural network models have been proposed and shown to be effective in this task, supplemented with medical domain knowledge for interpretability, several limitations still exist. Firstly, there is often a lack of differentiation between chronic and acute diseases leading to biased modeling of diseases. Secondly, the extraction of patient single-layer temporal patterns is limited, which hinders comprehensive representation and predictive power. Thirdly, weak interpretability based on deep neural networks prevents the extraction of valuable medical knowledge, limiting practical applications. To overcome these challenges, we propose TransLSTD, a hierarchical model that incorporates time awareness and context awareness while distinguishing between long-term and short-term diseases. TransLSTD uses clustering algorithms to classify disease types based on the occurrence feature matrix of diseases from EHR dataset and updates disease representation at the code level while creating patient visit embeddings. The model utilizes query vectors to incorporate visit context information and combines time data to capture the patient’s overall health status. Finally, the prediction module generates outcomes and provides effective interpretations. We demonstrate the effectiveness of TransLSTD using two real-world datasets, outperforming state-of-the-art models in terms of both AUC and F1 values. The data and code are released at https://github.com/DangQD/TransLSTD-master.
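
The clustering step described in this abstract can be pictured with a minimal sketch. It is not the TransLSTD implementation (see the authors' repository for that): the single "recurrence ratio" feature, the plain 1-D 2-means routine, and the disease names below are all assumptions made for illustration. The idea shown is only that codes recurring across most of a patient's visits end up in a different cluster from codes that appear once.

```python
from collections import defaultdict


def occurrence_features(patients):
    """Map each disease code to its mean recurrence ratio.

    For every patient who has the code at least once, the ratio is
    (visits containing the code) / (total visits of that patient).
    """
    ratios = defaultdict(list)
    for visits in patients:
        counts = defaultdict(int)
        for visit in visits:
            for code in set(visit):
                counts[code] += 1
        for code, n in counts.items():
            ratios[code].append(n / len(visits))
    return {code: sum(r) / len(r) for code, r in ratios.items()}


def two_means_1d(values, iterations=20):
    """Plain 1-D k-means with k=2, initialised at the min and max value."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values coincide, nothing to split
            break
        new_low = sum(near_low) / len(near_low)
        new_high = sum(near_high) / len(near_high)
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return low, high


def split_long_short(patients):
    """Label each code 'long-term' or 'short-term' by nearest centroid."""
    feats = occurrence_features(patients)
    low, high = two_means_1d(list(feats.values()))
    return {
        code: "long-term" if abs(v - high) < abs(v - low) else "short-term"
        for code, v in feats.items()
    }


# Usage: each patient is a list of visits, each visit a set of codes
# (hypothetical toy records, not taken from any real EHR dataset).
toy_patients = [
    [{"diabetes", "flu"}, {"diabetes"}, {"diabetes", "fracture"}],
    [{"diabetes", "hypertension"}, {"hypertension", "flu"},
     {"hypertension", "diabetes"}, {"hypertension"}],
]
toy_labels = split_long_short(toy_patients)
```

On the toy records, the recurring codes (diabetes, hypertension) fall into the long-term cluster and the one-off codes (flu, fracture) into the short-term one. A real occurrence feature matrix would have several columns per disease, in which case a library k-means over vectors would replace the 1-D routine.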

Citations: 0