
Data & Knowledge Engineering — Latest Publications

Reasoning on responsibilities for optimal process alignment computation
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-19 | DOI: 10.1016/j.datak.2024.102353

Process alignment aims to establish a matching between a process model run and a log trace. To improve such a matching, process alignment techniques often exploit contextual conditions to enable computations that are more informed than the simple edit distance between model runs and log traces. This paper introduces a novel approach to process alignment that relies on contextual information expressed as responsibilities. The notion of responsibility is fundamental in business and organization models, but it is often overlooked. We show that the computation of optimal alignments can take advantage of responsibilities, and we leverage them in two ways. First, responsibilities may sometimes justify deviations; in these cases, we treat the deviations as correct behaviors rather than errors. Second, responsibilities can either be met or neglected in the execution of a trace; thus, we prefer alignments in which neglected responsibilities are minimized.

The paper proposes a formal framework for responsibilities in a process model, including the definition of cost functions for computing optimal alignments. We also propose a branch-and-bound algorithm for computing optimal alignments and demonstrate its use on two event logs from real executions.
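To make the baseline concrete, the sketch below shows the plain edit-distance alignment the abstract refers to, extended with a hypothetical hook for responsibility-justified deviations: moves on activities in a `justified` set cost 0 instead of 1. This is an illustration only, not the paper's branch-and-bound algorithm or its formal cost functions.

```python
def alignment_cost(model_run, log_trace, justified=frozenset()):
    """Edit distance between a model run and a log trace.

    Alignments consist of synchronous moves (matching activities),
    model moves (skip a model activity), and log moves (skip a log
    event). Moves on activities in `justified` (an assumed set of
    deviations a responsibility can justify) are treated as correct
    behavior and cost 0 instead of 1.
    """
    m, n = len(model_run), len(log_trace)
    cost = lambda a: 0 if a in justified else 1
    # dp[i][j] = minimal cost of aligning model_run[:i] with log_trace[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + cost(model_run[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + cost(log_trace[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if model_run[i - 1] == log_trace[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]                 # synchronous move
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + cost(model_run[i - 1]),  # model move
                    dp[i][j - 1] + cost(log_trace[j - 1]),  # log move
                )
    return dp[m][n]
```

For example, aligning the run `a, b, c` with the trace `a, c` costs 1 (a model move on `b`), but costs 0 if `b` is a justified deviation.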

{"title":"Reasoning on responsibilities for optimal process alignment computation","authors":"","doi":"10.1016/j.datak.2024.102353","DOIUrl":"10.1016/j.datak.2024.102353","url":null,"abstract":"<div><p>Process alignment aims at establishing a matching between a process model run and a log trace. To improve such a matching, process alignment techniques often exploit contextual conditions to enable computations that are more informed than the simple edit distance between model runs and log traces. The paper introduces a novel approach to process alignment which relies on contextual information expressed as <em>responsibilities</em>. The notion of responsibility is fundamental in business and organization models, but it is often overlooked. We show the computation of optimal alignments can take advantage of responsibilities. We leverage on them in two ways. First, responsibilities may sometimes justify deviations. In these cases, we consider them as correct behaviors rather than errors. Second, responsibilities can either be met or neglected in the execution of a trace. Thus, we prefer alignments where neglected responsibilities are minimized.</p><p>The paper proposes a formal framework for responsibilities in a process model, including the definition of cost functions for computing optimal alignments. 
We also propose a branch-and-bound algorithm for optimal alignment computation and exemplify its usage by way of two event logs from real executions.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000776/pdfft?md5=df35ebc627d0abaf942b9666c2d2c159&pid=1-s2.0-S0169023X24000776-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142271815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SRank: Guiding schema selection in NoSQL document stores
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-14 | DOI: 10.1016/j.datak.2024.102360
The rise of big data has led to a greater need for applications to change their schema frequently. NoSQL databases provide flexibility in organizing data and offer multiple choices for structuring and storing similar information. While schema flexibility speeds up initial development, choosing schemas wisely is crucial, as they significantly impact performance, affecting data redundancy, navigation cost, data access cost, and maintainability. This paper emphasizes the importance of schema design in NoSQL document stores and proposes a model that analyzes and evaluates different schema alternatives and recommends the best one. The model, which takes an Entity-Relationship (ER) model and workload queries as input, is divided into four phases. In the Transformation phase, schema alternatives are first developed for each ER model, and a schema graph is then generated for each alternative; concurrently, workload queries are converted into query graphs. In the Schema Evaluation phase, a Schema Rank (SRank) is calculated for each schema alternative using query metrics derived from the query graphs and path coverage derived from the schema graphs. Finally, in the Output phase, the schema with the highest SRank is recommended as the most suitable choice for the application. A case study of a Hotel Reservation System (HRS) demonstrates the application of the proposed model, comprehensively evaluating schema alternatives in terms of query response time, storage efficiency, scalability, throughput, and latency. The paper validates the SRank computation for schema selection in NoSQL databases through an extensive experimental study; the alignment of SRank values with each schema's performance metrics underscores the ranking system's effectiveness. SRank simplifies the schema selection process, assisting users in making informed decisions by reducing the time, cost, and effort of identifying the optimal schema for NoSQL document stores.
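The abstract does not give the SRank formula, so the following is only a hypothetical sketch of the final ranking step's shape: each schema alternative gets a score from an assumed path-coverage fraction discounted by an assumed normalized navigation cost, and the top-ranked schema is recommended.

```python
def srank(path_coverage, nav_cost):
    """Rank schema alternatives (illustrative only; not the paper's formula).

    path_coverage: schema name -> fraction of workload query paths the
                   schema covers, in [0, 1] (assumed metric).
    nav_cost:      schema name -> normalized navigation cost in [0, 1]
                   (assumed metric; higher means costlier traversals).
    Returns the recommended schema and the full score table.
    """
    scores = {s: cov * (1.0 - nav_cost[s]) for s, cov in path_coverage.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With hypothetical inputs such as `path_coverage={"embedded": 0.9, "referenced": 0.6}` and `nav_cost={"embedded": 0.2, "referenced": 0.1}`, the embedded design would score 0.72 against 0.54 and be recommended.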
{"title":"SRank: Guiding schema selection in NoSQL document stores","authors":"","doi":"10.1016/j.datak.2024.102360","DOIUrl":"10.1016/j.datak.2024.102360","url":null,"abstract":"<div><div>The rise of big data has led to a greater need for applications to change their schema frequently. NoSQL databases provide flexibility in organizing data and offer multiple choices for structuring and storing similar information. While schema flexibility speeds up initial development, choosing schemas wisely is crucial, as they significantly impact performance, affecting data redundancy, navigation cost, data access cost, and maintainability. This paper emphasizes the importance of schema design in NoSQL document stores. It proposes a model to analyze and evaluate different schema alternatives and suggest the best schema out of various schema alternatives. The model is divided into four phases. The model inputs the Entity-Relationship (ER) model and workload queries. In the Transformation Phase, the schema alternatives are initially developed for each ER model, and subsequently, a schema graph is generated for each alternative. Concurrently, workload queries undergo conversion into query graphs. In the Schema Evaluation phase, the Schema Rank (SRank) is calculated for each schema alternative using query metrics derived from the query graphs and path coverage generated from the schema graphs. Finally, in the Output phase, the schema with the highest SRank is recommended as the most suitable choice for the application. The paper includes a case study of a Hotel Reservation System (HRS) to demonstrate the application of the proposed model. It comprehensively evaluates various schema alternatives based on query response time, storage efficiency, scalability, throughput, and latency. The paper validates the SRank computation for schema selection in NoSQL databases through an extensive experimental study. 
The alignment of SRank values with each schema's performance metrics underscores this ranking system's effectiveness. The SRank simplifies the schema selection process, assisting users in making informed decisions by reducing the time, cost, and effort of identifying the optimal schema for NoSQL document stores.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142311715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Relating behaviour of data-aware process models
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-12 | DOI: 10.1016/j.datak.2024.102363

Data Petri nets (DPNs) have gained traction as a model for data-aware processes, thanks to their ability to balance simplicity with expressiveness, and because they can be automatically discovered from event logs. While model checking techniques for DPNs have been studied, more complex analysis tasks that are highly relevant for BPM are beyond the methods known in the literature. We focus here on equivalence and inclusion of process behaviour with respect to language and configuration spaces, optionally taking data into account. Such comparisons are important in the context of key process mining tasks, namely process repair and discovery, and are related to conformance checking. To solve these tasks, we propose approaches for bounded DPNs based on constraint graphs, which are faithful abstractions of the reachable state space. Though the considered verification tasks are undecidable in general, we show that our method is a decision procedure for DPNs that admit a finite history set. This property guarantees that constraint graphs are finite and computable, and it was shown to hold for large classes of automatically mined DPNs as well as for DPNs presented in the literature. The new techniques are implemented in the tool ada, and an evaluation proving feasibility is provided.

{"title":"Relating behaviour of data-aware process models","authors":"","doi":"10.1016/j.datak.2024.102363","DOIUrl":"10.1016/j.datak.2024.102363","url":null,"abstract":"<div><p>Data Petri nets (DPNs) have gained traction as a model for data-aware processes, thanks to their ability to balance simplicity with expressiveness, and because they can be automatically discovered from event logs. While model checking techniques for DPNs have been studied, more complex analysis tasks that are highly relevant for BPM are beyond methods known in the literature. We focus here on equivalence and inclusion of process behaviour with respect to language and configuration spaces, optionally taking data into account. Such comparisons are important in the context of key process mining tasks, namely process repair and discovery, and related to conformance checking. To solve these tasks, we propose approaches for bounded DPNs based on <em>constraint graphs</em>, which are faithful abstractions of the reachable state space. Though the considered verification tasks are undecidable in general, we show that our method is a decision procedure DPNs that admit a <em>finite history set</em>. This property guarantees that constraint graphs are finite and computable, and was shown to hold for large classes of DPNs that are mined automatically, and DPNs presented in the literature. 
The new techniques are implemented in the tool <span>ada</span>, and an evaluation proving feasibility is provided.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000879/pdfft?md5=ee932b18bac18fd1e3c1e769269d7d67&pid=1-s2.0-S0169023X24000879-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142241277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A framework for understanding event abstraction problem solving: Current states of event abstraction studies
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.datak.2024.102352

Event abstraction is a crucial step in applying process mining in real-world scenarios. However, practitioners often face challenges in selecting relevant research for their specific needs. To address this, we present a comprehensive framework for understanding event abstraction, comprising four key components: event abstraction sub-problems, consideration of process properties, data types for event abstraction, and various approaches to event abstraction. By systematically examining these components, practitioners can efficiently identify research that aligns with their requirements. Additionally, we analyze existing studies using this framework to provide practitioners with a clearer view of current research and suggest expanded applications of existing methods.

{"title":"A framework for understanding event abstraction problem solving: Current states of event abstraction studies","authors":"","doi":"10.1016/j.datak.2024.102352","DOIUrl":"10.1016/j.datak.2024.102352","url":null,"abstract":"<div><p>Event abstraction is a crucial step in applying process mining in real-world scenarios. However, practitioners often face challenges in selecting relevant research for their specific needs. To address this, we present a comprehensive framework for understanding event abstraction, comprising four key components: event abstraction sub-problems, consideration of process properties, data types for event abstraction, and various approaches to event abstraction. By systematically examining these components, practitioners can efficiently identify research that aligns with their requirements. Additionally, we analyze existing studies using this framework to provide practitioners with a clearer view of current research and suggest expanded applications of existing methods.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142171787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A conceptual framework for the government big data ecosystem (‘datagov.eco’)
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-05 | DOI: 10.1016/j.datak.2024.102348

The public sector, private firms, and civil society constantly create data of high volume, velocity, and veracity from diverse sources. This kind of data is known as big data. As in other industries, public administrations consider big data as the “new oil” and employ data-centric policies to transform data into knowledge, stimulate good governance, innovative digital services, transparency, and citizens' engagement in public policy. More and more public organizations understand the value created by exploiting internal and external data sources, delivering new capabilities, and fostering collaboration inside and outside of public administrations. Despite the broad interest in this ecosystem, we still lack a detailed and systematic view of it. In this paper, we attempt to describe the emerging Government Big Data Ecosystem as a socio-technical network of people, organizations, processes, technology, infrastructure, standards & policies, procedures, and resources. This ecosystem supports data functions such as data collection, integration, analysis, storage, sharing, use, protection, and archiving. Through these functions, value is created by promoting evidence-based policymaking, modern public services delivery, data-driven administration and open government, and boosting the data economy. Through a Design Science Research methodology, we propose a conceptual framework, which we call ‘datagov.eco’. We believe our ‘datagov.eco’ framework will provide insights and support to different stakeholders’ profiles, including administrators, consultants, data engineers, and data scientists.

{"title":"A conceptual framework for the government big data ecosystem (‘datagov.eco’)","authors":"","doi":"10.1016/j.datak.2024.102348","DOIUrl":"10.1016/j.datak.2024.102348","url":null,"abstract":"<div><p>The public sector, private firms, and civil society constantly create data of high volume, velocity, and veracity from diverse sources. This kind of data is known as big data. As in other industries, public administrations consider big data as the “new oil\" and employ data-centric policies to transform data into knowledge, stimulate good governance, innovative digital services, transparency, and citizens' engagement in public policy. More and more public organizations understand the value created by exploiting internal and external data sources, delivering new capabilities, and fostering collaboration inside and outside of public administrations. Despite the broad interest in this ecosystem, we still lack a detailed and systematic view of it. In this paper, we attempt to describe the emerging Government Big Data Ecosystem as a <em>socio-technical network</em> of people, organizations, processes, technology, infrastructure, standards &amp; policies, procedures, and resources. This ecosystem supports <em>data functions</em> such as data collection, integration, analysis, storage, sharing, use, protection, and archiving. Through these functions, <em>value is created</em> by promoting evidence-based policymaking, modern public services delivery, data-driven administration and open government, and boosting the data economy. Through a Design Science Research methodology, we propose a conceptual framework, which we call ‘datagov.eco’. 
We believe our ‘datagov.eco’ framework will provide insights and support to different stakeholders’ profiles, including administrators, consultants, data engineers, and data scientists.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142271814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Data engineering and modeling for artificial intelligence
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-01 | DOI: 10.1016/j.datak.2024.102346
{"title":"Data engineering and modeling for artificial intelligence","authors":"","doi":"10.1016/j.datak.2024.102346","DOIUrl":"10.1016/j.datak.2024.102346","url":null,"abstract":"","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Capturing and Analysing Employee Behaviour: An Honest Day’s Work Record
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-31 | DOI: 10.1016/j.datak.2024.102350

For a range of reasons, organisations collect data on the work behaviour of their employees. However, each data collection technique displays its own unique mix of intrusiveness, information richness, and risks. For the sake of understanding the differences between data collection techniques, we conducted a multiple-case study in a multinational professional services organisation, tracking six participants throughout a workday using non-participant observation, screen recording, and timesheet techniques. This led to 136 hours of data. Our findings show that relying on one data collection technique alone cannot provide a comprehensive and accurate account of activities that are screen-based, offline, or overtime. The collected data also provided an opportunity to investigate the use of process mining for analysing employee behaviour, specifically with respect to the completeness of the collected data. Our study underlines the importance of judiciously selecting data collection techniques, as well as using a sufficiently broad data set to generate reliable insights into employee behaviour.

{"title":"Capturing and Analysing Employee Behaviour: An Honest Day’s Work Record","authors":"","doi":"10.1016/j.datak.2024.102350","DOIUrl":"10.1016/j.datak.2024.102350","url":null,"abstract":"<div><p>For a range of reasons, organisations collect data on the work behaviour of their employees. However, each data collection technique displays its own unique mix of intrusiveness, information richness, and risks. For the sake of understanding the differences between data collection techniques, we conducted a multiple-case study in a multinational professional services organisation, tracking six participants throughout a workday using non-participant observation, screen recording, and timesheet techniques. This led to 136 hours of data. Our findings show that relying on one data collection technique alone cannot provide a comprehensive and accurate account of activities that are screen-based, offline, or overtime. The collected data also provided an opportunity to investigate the use of <em>process mining</em> for analysing employee behaviour, specifically with respect to the completeness of the collected data. 
Our study underlines the importance of judiciously selecting data collection techniques, as well as using a sufficiently broad data set to generate reliable insights into employee behaviour.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000740/pdfft?md5=0803a6136e27919fd8c8a868fa63e889&pid=1-s2.0-S0169023X24000740-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142149252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Discovering outlying attributes of outliers in data streams
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-30 | DOI: 10.1016/j.datak.2024.102349

Data streams, continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. In various data stream applications, such as network security and credit card transaction monitoring, real-time detection of outliers is crucial, as these outliers often signify potential threats. Equally important is the real-time explanation of outliers, enabling users to glean insights and thereby shorten their investigation time. The investigation time for outliers is closely tied to their number of attributes, making it essential to provide explanations that detail which attributes are responsible for the abnormality of a data point, referred to as outlying attributes. However, the unbounded volume of data and the concept drift of data streams pose challenges for discovering the outlying attributes of outliers in real time. In response, this paper proposes EXOS, an algorithm designed for discovering the outlying attributes of multi-dimensional outliers in data streams. EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses challenges related to the unbounded volume of data and concept drift. The algorithm is model-agnostic for point outlier detection and provides real-time explanations based on the local context of the outlier, derived from time-based tumbling windows. The paper provides a complexity analysis of EXOS and an experimental analysis comparing EXOS with existing algorithms. The evaluation includes an assessment of performance on both real-world and synthetic datasets in terms of average precision, recall, F1-score, and explanation time. The evaluation results show that, on average, EXOS achieves an F1-score 45.6% higher and an explanation time 7.3 times lower than existing outlying-attribute algorithms.
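To illustrate what an outlying-attribute explanation looks like, here is a deliberately simplified, hypothetical sketch (not EXOS, which additionally exploits cross-correlations across streams): within one tumbling window, an attribute of a flagged point is reported as outlying when its value deviates from the window's distribution by more than a z-score threshold.

```python
from statistics import mean, stdev

def outlying_attributes(window, outlier, threshold=3.0):
    """Return the attributes of `outlier` that deviate from the window.

    window:   list of dicts, the recent points of one tumbling window
              (the outlier's local context).
    outlier:  dict with the same keys as the window points.
    An attribute is reported as outlying when its |z-score| against the
    window exceeds `threshold` (assumed criterion, for illustration only).
    """
    result = []
    for attr, value in outlier.items():
        values = [p[attr] for p in window]
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            if value != mu:          # constant attribute: any change is outlying
                result.append(attr)
        elif abs(value - mu) / sigma > threshold:
            result.append(attr)
    return result
```

For a window of network events with normal byte counts, a point with an extreme byte count but an ordinary duration would be explained by the single attribute carrying the anomaly, shortening the investigation to that attribute.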

{"title":"Discovering outlying attributes of outliers in data streams","authors":"","doi":"10.1016/j.datak.2024.102349","DOIUrl":"10.1016/j.datak.2024.102349","url":null,"abstract":"<div><p>Data streams, continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. In various data stream applications, such as network security and credit card transaction monitoring, real-time detection of outliers is crucial, as these outliers often signify potential threats. Equally important is the real-time explanation of outliers, enabling users to glean insights and thereby shorten their investigation time. The investigation time for outliers is closely tied to their number of attributes, making it essential to provide explanations that detail which attributes are responsible for the abnormality of a data point, referred to as outlying attributes. However, the unbounded volume of data and concept drift of data streams pose challenges for discovering the outlying attributes of outliers in real time. In response, in this paper we propose EXOS, an algorithm designed for discovering the outlying attributes of multi-dimensional outliers in data streams. EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses challenges related to the unbounded volume of data and concept drift. The algorithm is model-agnostic for point outlier detection and provides real-time explanations based on the local context of the outlier, derived from time-based tumbling windows. The paper provides a complexity analysis of EXOS and an experimental analysis comparing EXOS with existing algorithms. The evaluation includes an assessment of performance on both real-world and synthetic datasets in terms of average precision, recall, F1-score, and explanation time. 
The evaluation results show that, on average, EXOS achieves a 45.6% better F1 Score and is 7.3 times lower in explanation time compared to existing outlying attribute algorithms.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.7,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142121852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A self-adaptive density-based clustering algorithm for varying densities datasets with strong disturbance factor
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-07 | DOI: 10.1016/j.datak.2024.102345

Clustering is a fundamental task in data mining, aiming to group similar objects together based on their features or attributes. With the rapid increase in data analysis volume and the growing complexity of high-dimensional data distribution, clustering has become increasingly important in numerous applications, including image analysis, text mining, and anomaly detection. DBSCAN is a powerful tool for clustering analysis and is widely used in density-based clustering algorithms. However, DBSCAN and its variants encounter challenges when confronted with datasets exhibiting clusters of varying densities in intricate high-dimensional spaces affected by significant disturbance factors. A typical example is multi-density clustering connected by a few data points with strong internal correlations, a scenario commonly encountered in the analysis of crowd mobility. To address these challenges, we propose a Self-adaptive Density-Based Clustering Algorithm for Varying Densities Datasets with Strong Disturbance Factor (SADBSCAN). This algorithm comprises a data block splitter, a local clustering module, a global clustering module, and a data block merger to obtain adaptive clustering results. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of SADBSCAN. The experimental results indicate that SADBSCAN significantly outperforms several strong baselines across different metrics, demonstrating the high adaptability and scalability of our algorithm.
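The density-adaptation idea behind per-block clustering can be sketched as follows. This is an assumption-laden illustration, not SADBSCAN's splitter/merger pipeline: after a dataset has been split into blocks, each block gets its own neighborhood radius, estimated as the median distance to the k-th nearest neighbor, so dense blocks receive a tighter radius than sparse ones.

```python
import math

def local_eps(block, k=3):
    """Estimate a per-block DBSCAN eps (hypothetical heuristic).

    block: list of points (tuples of coordinates), len(block) > k.
    Returns the median of each point's distance to its k-th nearest
    neighbor within the block, so the radius adapts to local density.
    """
    kth_distances = []
    for p in block:
        # sorted distances from p to every other point in the block
        ds = sorted(math.dist(p, q) for q in block if q is not p)
        kth_distances.append(ds[k - 1])
    kth_distances.sort()
    return kth_distances[len(kth_distances) // 2]
```

A dense block (points on a 0.1-spaced grid) then yields a much smaller radius than a sparse block (the same grid spaced at 1.0), which is what lets a density-based algorithm recover clusters of varying densities instead of using one global eps.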

Citations: 0
Developing A Decision Support System for Healthcare Practices: A Design Science Research Approach
IF 2.7 · CAS Tier 3 (Computer Science) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-07-17 · DOI: 10.1016/j.datak.2024.102344

We propose a new approach for designing a decision support system (DSS) for the transformation of healthcare practices. Practice transformation helps practices transition from their current state to patient-centered medical home (PCMH) model of care. Our approach employs activity theory to derive the elements of practice transformation by designing and integrating two ontologies: a domain ontology and a task ontology. By incorporating both goal-oriented and task-oriented aspects of the practice transformation process and specifying how they interact, our integrated design model for the DSS provides prescriptive knowledge on assessing the current status of a practice with respect to PCMH recognition and navigating efficiently through a complex solution space. This knowledge, which is at a moderate level of abstraction and expressed in a language that practitioners understand, contributes to the literature by providing a formulation for a nascent design theory. We implement the integrated design model as a DSS prototype; results of validation tests conducted on the prototype indicate that it is superior to the existing PCMH readiness tracking tool with respect to effectiveness, usability, efficiency, and sustainability.
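The integration of a domain ontology (goal-oriented) with a task ontology (task-oriented) for readiness assessment, as described in the abstract above, can be illustrated with a toy sketch; every goal, sub-goal, and task name below is hypothetical and not taken from the paper's actual ontologies.

```python
# Goal-oriented domain ontology: a goal decomposed into sub-goals.
domain_ontology = {
    "PCMH": ["access", "care_coordination", "quality_improvement"],
}

# Task-oriented ontology: concrete tasks that realise each sub-goal.
task_ontology = {
    "access": ["offer_same_day_appointments", "provide_after_hours_contact"],
    "care_coordination": ["track_referrals"],
    "quality_improvement": ["report_clinical_metrics"],
}

def readiness(completed_tasks, goal="PCMH"):
    """Fraction of each sub-goal's tasks a practice has completed."""
    scores = {}
    for sub_goal in domain_ontology[goal]:
        tasks = task_ontology[sub_goal]
        done = sum(t in completed_tasks for t in tasks)
        scores[sub_goal] = done / len(tasks)
    return scores
```

A DSS built on such an integration can report per-sub-goal progress rather than a single pass/fail flag, which is roughly the kind of status assessment the abstract describes.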

Citations: 0