Data Petri nets (DPNs) have gained traction as a model for data-aware processes, thanks to their ability to balance simplicity with expressiveness, and because they can be automatically discovered from event logs. While model checking techniques for DPNs have been studied, more complex analysis tasks that are highly relevant for BPM are beyond the reach of methods known in the literature. We focus here on equivalence and inclusion of process behaviour with respect to language and configuration spaces, optionally taking data into account. Such comparisons are important in the context of key process mining tasks, namely process repair and discovery, and are related to conformance checking. To solve these tasks, we propose approaches for bounded DPNs based on constraint graphs, which are faithful abstractions of the reachable state space. Though the considered verification tasks are undecidable in general, we show that our method is a decision procedure for DPNs that admit a finite history set. This property guarantees that constraint graphs are finite and computable, and it was shown to hold for large classes of automatically mined DPNs as well as for DPNs presented in the literature. The new techniques are implemented in the tool ada, and an evaluation demonstrating their feasibility is provided.
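Since constraint graphs are the technical core of this approach, a minimal sketch may help fix intuitions. The following is a hypothetical, heavily simplified exploration loop, not the implementation in ada: markings and constraints are encoded as frozensets, and the transition interface, the guard atom "x>0", and the toy termination argument are all illustrative assumptions.

```python
# Hypothetical sketch (not the ada tool): breadth-first construction of a
# constraint-graph abstraction of a bounded DPN. Abstract states pair a
# marking with a set of data constraints; under a finite history set, the
# set of reachable abstract states is finite and the loop terminates.
from collections import deque

def build_constraint_graph(initial_marking, initial_constraint, transitions):
    """transitions: iterable of (name, fire), where fire(marking, constraint)
    yields successor (marking, constraint) pairs (nothing if disabled)."""
    start = (initial_marking, initial_constraint)
    nodes, edges = {start}, []
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for name, fire in transitions:
            for succ in fire(*state):
                edges.append((state, name, succ))
                if succ not in nodes:  # finite history set => finitely many nodes
                    nodes.add(succ)
                    queue.append(succ)
    return nodes, edges

# Toy DPN: transition t is enabled while place "p" is marked, and it
# strengthens the constraint with the illustrative guard atom "x>0".
def t_fire(marking, constraint):
    if "p" in marking and "x>0" not in constraint:
        yield (marking, constraint | frozenset({"x>0"}))

nodes, edges = build_constraint_graph(frozenset({"p"}), frozenset(), [("t", t_fire)])
print(len(nodes), len(edges))  # -> 2 1
```

On such a finite graph, language or configuration-space inclusion between two DPNs can then be phrased as a question about the two constraint graphs rather than the infinite underlying state spaces.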
Event abstraction is a crucial step in applying process mining in real-world scenarios. However, practitioners often face challenges in selecting relevant research for their specific needs. To address this, we present a comprehensive framework for understanding event abstraction, comprising four key components: event abstraction sub-problems, consideration of process properties, data types for event abstraction, and various approaches to event abstraction. By systematically examining these components, practitioners can efficiently identify research that aligns with their requirements. Additionally, we analyze existing studies using this framework to provide practitioners with a clearer view of current research and suggest expanded applications of existing methods.
The public sector, private firms, and civil society constantly create data of high volume, velocity, and veracity from diverse sources. This kind of data is known as big data. As in other industries, public administrations consider big data the “new oil” and employ data-centric policies to transform data into knowledge, stimulate good governance, foster innovative digital services and transparency, and engage citizens in public policy. More and more public organizations understand the value created by exploiting internal and external data sources, delivering new capabilities, and fostering collaboration inside and outside of public administrations. Despite the broad interest in this ecosystem, we still lack a detailed and systematic view of it. In this paper, we attempt to describe the emerging Government Big Data Ecosystem as a socio-technical network of people, organizations, processes, technology, infrastructure, standards & policies, procedures, and resources. This ecosystem supports data functions such as data collection, integration, analysis, storage, sharing, use, protection, and archiving. Through these functions, value is created by promoting evidence-based policymaking, modern public service delivery, data-driven administration, and open government, and by boosting the data economy. Following a Design Science Research methodology, we propose a conceptual framework, which we call ‘datagov.eco’. We believe our ‘datagov.eco’ framework will provide insights and support to a range of stakeholder profiles, including administrators, consultants, data engineers, and data scientists.
For a range of reasons, organisations collect data on the work behaviour of their employees. However, each data collection technique displays its own unique mix of intrusiveness, information richness, and risks. To understand the differences between data collection techniques, we conducted a multiple-case study in a multinational professional services organisation, tracking six participants throughout a workday using non-participant observation, screen recording, and timesheet techniques. This yielded 136 hours of data. Our findings show that no single data collection technique can provide a comprehensive and accurate account of activities that are screen-based, conducted offline, or performed as overtime. The collected data also provided an opportunity to investigate the use of process mining for analysing employee behaviour, specifically with respect to the completeness of the collected data. Our study underlines the importance of judiciously selecting data collection techniques, as well as using a sufficiently broad data set, to generate reliable insights into employee behaviour.
Data streams, continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. In various data stream applications, such as network security and credit card transaction monitoring, real-time detection of outliers is crucial, as these outliers often signify potential threats. Equally important is the real-time explanation of outliers, which enables users to glean insights and thereby shortens their investigation time. The investigation time for an outlier is closely tied to its number of attributes, making it essential to provide explanations that detail which attributes are responsible for the abnormality of a data point, referred to as outlying attributes. However, the unbounded volume of data and the concept drift of data streams pose challenges for discovering the outlying attributes of outliers in real time. In response, we propose EXOS, an algorithm for discovering the outlying attributes of multi-dimensional outliers in data streams. EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses the challenges of unbounded data volume and concept drift. The algorithm is model-agnostic with respect to point outlier detection and provides real-time explanations based on the local context of the outlier, derived from time-based tumbling windows. The paper provides a complexity analysis of EXOS and an experimental comparison with existing algorithms, assessing performance on both real-world and synthetic datasets in terms of average precision, recall, F1-score, and explanation time. The results show that, on average, EXOS achieves a 45.6% higher F1-score and a 7.3-times shorter explanation time than existing outlying-attribute algorithms.
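To make the notion of outlying attributes concrete, the sketch below shows a naive per-window baseline, not EXOS itself: it scores each attribute of a flagged outlier by its z-score against the inliers of the current tumbling window. The window shape, threshold, and function names are assumptions, and EXOS's use of cross-correlations among streams is deliberately not reproduced.

```python
# Naive baseline for outlying-attribute discovery (illustrative only, not
# EXOS): an attribute is deemed outlying if the flagged point deviates from
# the current tumbling window's inliers by more than z_threshold std devs.
import numpy as np

def outlying_attributes(window, outlier, z_threshold=3.0):
    """window: (n_points, n_attrs) array of inliers from one tumbling window;
    outlier: (n_attrs,) point flagged by any upstream detector (model-agnostic).
    Returns the indices of attributes responsible for the abnormality."""
    mu = window.mean(axis=0)
    sigma = window.std(axis=0) + 1e-9          # guard against zero variance
    z = np.abs((outlier - mu) / sigma)
    return [i for i, score in enumerate(z) if score > z_threshold]

# Example: attribute 1 drives the abnormality of the flagged point.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=(100, 3))
print(outlying_attributes(w, np.array([0.1, 9.0, -0.2])))  # -> [1]
```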
Clustering is a fundamental task in data mining, aiming to group similar objects together based on their features or attributes. With the rapid growth in data volumes and the increasing complexity of high-dimensional data distributions, clustering has become ever more important in numerous applications, including image analysis, text mining, and anomaly detection. DBSCAN is a powerful and widely used density-based clustering algorithm. However, DBSCAN and its variants struggle with datasets that exhibit clusters of varying densities in intricate high-dimensional spaces affected by significant disturbance factors. A typical example is multi-density clusters connected by a few data points with strong internal correlations, a scenario commonly encountered in the analysis of crowd mobility. To address these challenges, we propose a Self-adaptive Density-Based Clustering Algorithm for Varying Densities Datasets with Strong Disturbance Factor (SADBSCAN). The algorithm comprises a data block splitter, a local clustering module, a global clustering module, and a data block merger, which together yield adaptive clustering results. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of SADBSCAN. The experimental results indicate that SADBSCAN significantly outperforms several strong baselines across different metrics, demonstrating the high adaptability and scalability of our algorithm.
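As a rough illustration of the described split/local-cluster/merge pipeline, the sketch below partitions the data into blocks, runs plain DBSCAN per block, and then fuses clusters whose centroids lie close together. It is not SADBSCAN: the self-adaptive parameter selection and disturbance handling are omitted, and the splitter and merge heuristic are assumptions made for the example.

```python
# Illustrative split/local-cluster/merge skeleton, not the authors' SADBSCAN:
# the block splitter, merge rule, and all parameters are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def block_split(X, n_blocks=4):
    """Naive data block splitter: partition point indices along the first axis."""
    return np.array_split(np.argsort(X[:, 0]), n_blocks)

def sadbscan_like(X, eps=0.5, min_samples=5, merge_eps=1.0):
    labels = -np.ones(len(X), dtype=int)          # -1 marks noise
    next_id = 0
    for idx in block_split(X):                    # local clustering per block
        local = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lid in set(local) - {-1}:             # relabel local clusters globally
            labels[idx[local == lid]] = next_id
            next_id += 1
    # Data block merger: union clusters whose centroids lie within merge_eps.
    ids = sorted(i for i in set(labels) if i != -1)
    parent = {i: i for i in ids}
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    centroids = {i: X[labels == i].mean(axis=0) for i in ids}
    for a in ids:
        for b in ids:
            if a < b and np.linalg.norm(centroids[a] - centroids[b]) < merge_eps:
                parent[find(b)] = find(a)
    return np.array([find(l) if l != -1 else -1 for l in labels])
```

Merging by centroid distance is the simplest possible global clustering step; the point of the sketch is only the four-stage structure (split, cluster locally, cluster globally, merge), which the abstract attributes to SADBSCAN.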