
Data & Knowledge Engineering: Latest Publications

Big data classification using SpinalNet-Fuzzy-ResNeXt based on spark architecture with data mining approach
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-17 | DOI: 10.1016/j.datak.2024.102364
M. Robinson Joel, K. Rajakumari, S. Anu Priya, M. Navaneethakrishnan
In modern networked environments, big data is essential to several domains such as e-commerce, healthcare, and finance. Big data classification has offered effective performance in several applications, yet it remains highly difficult: established classification approaches require long runtimes and substantial resources to process the available data. To resolve these issues, a Spark-based classification approach is required. In this work, the hybrid SpinalNet-Fuzzy-ResNeXt model, called SFResNeXt, is implemented to classify big data. Here, SpinalNet and ResNeXt are merged, and their layers are fused with the fuzzy concept. The initial step is outlier detection: the holoentropy method is used to detect outlier data, which is then removed. Next, duplicate detection is performed with a fingerprinting approach to identify repeated data. Then, the Association Rule Mining (ARM) method is employed for feature selection, and the big data is classified by SFResNeXt. The SFResNeXt-based big data classification achieved accuracy, sensitivity, and specificity of 0.905, 0.914, and 0.922 on the heart disease dataset.
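The preprocessing pipeline described in the abstract relies on fingerprint-based duplicate detection. As a hedged illustration (not the authors' implementation; the record fields and the SHA-256 choice are assumptions), deduplicating records by hashing their canonical string form might look like:

```python
import hashlib

def fingerprint(record: dict) -> str:
    """Hash a record's canonical string form to detect exact duplicates."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(records):
    """Keep the first occurrence of each fingerprint, drop repeats."""
    seen, unique = set(), []
    for r in records:
        fp = fingerprint(r)
        if fp not in seen:
            seen.add(fp)
            unique.append(r)
    return unique

# Hypothetical heart-disease-style records; the second is an exact duplicate.
records = [
    {"age": 63, "chol": 233, "thalach": 150},
    {"age": 63, "chol": 233, "thalach": 150},
    {"age": 41, "chol": 204, "thalach": 172},
]
print(len(deduplicate(records)))  # → 2
```

Fingerprinting scales to Spark-sized data because each record is reduced to a fixed-size hash before comparison.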
Cited by: 0
SRank: Guiding schema selection in NoSQL document stores
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-14 | DOI: 10.1016/j.datak.2024.102360
Shelly Sachdeva, Neha Bansal, Hardik Bansal
The rise of big data has led to a greater need for applications to change their schema frequently. NoSQL databases provide flexibility in organizing data and offer multiple choices for structuring and storing similar information. While schema flexibility speeds up initial development, choosing schemas wisely is crucial, as they significantly impact performance, affecting data redundancy, navigation cost, data access cost, and maintainability. This paper emphasizes the importance of schema design in NoSQL document stores. It proposes a model to analyze and evaluate different schema alternatives and recommend the best among them. The model is divided into four phases and takes as input an Entity-Relationship (ER) model and a set of workload queries. In the Transformation Phase, schema alternatives are first developed for each ER model, and a schema graph is then generated for each alternative. Concurrently, workload queries are converted into query graphs. In the Schema Evaluation phase, the Schema Rank (SRank) is calculated for each schema alternative using query metrics derived from the query graphs and path coverage generated from the schema graphs. Finally, in the Output phase, the schema with the highest SRank is recommended as the most suitable choice for the application. The paper includes a case study of a Hotel Reservation System (HRS) to demonstrate the application of the proposed model, comprehensively evaluating the schema alternatives on query response time, storage efficiency, scalability, throughput, and latency. The paper validates the SRank computation for schema selection in NoSQL databases through an extensive experimental study; the alignment of SRank values with each schema's performance metrics underscores the ranking system's effectiveness. The SRank simplifies the schema selection process, assisting users in making informed decisions by reducing the time, cost, and effort of identifying the optimal schema for NoSQL document stores.
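To make the ranking idea concrete, here is a minimal, hypothetical sketch of scoring schema alternatives against a query workload. The weighting (path coverage divided by average path length, standing in for navigation cost) and all names are invented stand-ins for the paper's actual SRank metrics:

```python
def srank(schema, queries):
    """Toy score: reward path coverage, penalise navigation cost.
    The weighting is a hypothetical stand-in for the paper's SRank."""
    covered = sum(1 for q in queries if q["path"] in schema["paths"])
    coverage = covered / len(queries)
    return coverage / (1 + schema["avg_path_length"])

# Two made-up alternatives for a hotel reservation schema.
embedded = {"paths": {"hotel.rooms", "hotel.rooms.bookings"}, "avg_path_length": 2.0}
referenced = {"paths": {"hotel.rooms"}, "avg_path_length": 1.0}
workload = [{"path": "hotel.rooms"}, {"path": "hotel.rooms.bookings"}]

ranked = sorted([("embedded", embedded), ("referenced", referenced)],
                key=lambda kv: srank(kv[1], workload), reverse=True)
print(ranked[0][0])  # → embedded
```

The point is the shape of the computation: each alternative gets a single comparable number derived from the workload, so the recommendation step is just a sort.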
Cited by: 0
Relating behaviour of data-aware process models
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-12 | DOI: 10.1016/j.datak.2024.102363
Marco Montali, Sarah Winkler

Data Petri nets (DPNs) have gained traction as a model for data-aware processes, thanks to their ability to balance simplicity with expressiveness, and because they can be automatically discovered from event logs. While model checking techniques for DPNs have been studied, more complex analysis tasks that are highly relevant for BPM are beyond methods known in the literature. We focus here on equivalence and inclusion of process behaviour with respect to language and configuration spaces, optionally taking data into account. Such comparisons are important in the context of key process mining tasks, namely process repair and discovery, and related to conformance checking. To solve these tasks, we propose approaches for bounded DPNs based on constraint graphs, which are faithful abstractions of the reachable state space. Though the considered verification tasks are undecidable in general, we show that our method is a decision procedure for DPNs that admit a finite history set. This property guarantees that constraint graphs are finite and computable, and has been shown to hold for large classes of automatically mined DPNs as well as DPNs presented in the literature. The new techniques are implemented in the tool ada, and an evaluation proving feasibility is provided.
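As a rough illustration of why finite constraint graphs make such analyses tractable, the sketch below enumerates the abstract states of a toy, hand-encoded "constraint graph" by breadth-first search. The encoding of places and variable constraints is entirely hypothetical and far simpler than the paper's DPN abstraction:

```python
from collections import deque

# Toy abstraction of a DPN with one variable x: nodes are (place, constraint)
# pairs, edges follow transition guards/updates. This hand-written graph is a
# hypothetical stand-in for a computed constraint graph.
transitions = {
    ("p0", "x<=0"): [("p1", "x>0")],  # t1 fires in p0, writes x > 0
    ("p1", "x>0"): [("p2", "x>0")],   # t2 moves the token, keeps the constraint
}

def reachable(start):
    """Enumerate abstract (place, constraint) states by BFS."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for succ in transitions.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

states = reachable(("p0", "x<=0"))
print(len(states))  # → 3 abstract states
```

Once the abstract state space is finite like this, language and configuration-space comparisons between two models reduce to graph explorations.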

Cited by: 0
A framework for understanding event abstraction problem solving: Current states of event abstraction studies
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.datak.2024.102352
Jungeun Lim, Minseok Song

Event abstraction is a crucial step in applying process mining in real-world scenarios. However, practitioners often face challenges in selecting relevant research for their specific needs. To address this, we present a comprehensive framework for understanding event abstraction, comprising four key components: event abstraction sub-problems, consideration of process properties, data types for event abstraction, and various approaches to event abstraction. By systematically examining these components, practitioners can efficiently identify research that aligns with their requirements. Additionally, we analyze existing studies using this framework to provide practitioners with a clearer view of current research and suggest expanded applications of existing methods.
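One of the sub-problems such a framework covers, mapping low-level events to higher-level activities, can be sketched as follows. The event-to-activity mapping is invented for illustration and is not taken from the paper:

```python
# Hypothetical mapping from low-level event names to high-level activities.
ABSTRACTION = {
    "open_form": "Register",
    "fill_form": "Register",
    "submit_form": "Register",
    "scan_item": "Checkout",
    "pay": "Checkout",
}

def abstract_trace(trace):
    """Collapse consecutive low-level events that map to the same activity."""
    out = []
    for event in trace:
        activity = ABSTRACTION.get(event, event)
        if not out or out[-1] != activity:
            out.append(activity)
    return out

print(abstract_trace(["open_form", "fill_form", "submit_form", "scan_item", "pay"]))
# → ['Register', 'Checkout']
```

Real event abstraction methods must also handle interleaved activities, process properties, and diverse data types, which is exactly what the framework's components classify.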

Cited by: 0
A conceptual framework for the government big data ecosystem (‘datagov.eco’)
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-05 | DOI: 10.1016/j.datak.2024.102348
Syed Iftikhar Hussain Shah, Vassilios Peristeras, Ioannis Magnisalis

The public sector, private firms, and civil society constantly create data of high volume, velocity, and veracity from diverse sources. This kind of data is known as big data. As in other industries, public administrations consider big data as the "new oil" and employ data-centric policies to transform data into knowledge, stimulate good governance, innovative digital services, transparency, and citizens' engagement in public policy. More and more public organizations understand the value created by exploiting internal and external data sources, delivering new capabilities, and fostering collaboration inside and outside of public administrations. Despite the broad interest in this ecosystem, we still lack a detailed and systematic view of it. In this paper, we attempt to describe the emerging Government Big Data Ecosystem as a socio-technical network of people, organizations, processes, technology, infrastructure, standards & policies, procedures, and resources. This ecosystem supports data functions such as data collection, integration, analysis, storage, sharing, use, protection, and archiving. Through these functions, value is created by promoting evidence-based policymaking, modern public services delivery, data-driven administration and open government, and boosting the data economy. Through a Design Science Research methodology, we propose a conceptual framework, which we call ‘datagov.eco’. We believe our ‘datagov.eco’ framework will provide insights and support to different stakeholders’ profiles, including administrators, consultants, data engineers, and data scientists.

Cited by: 0
Unraveling the foundations and the evolution of conceptual modeling—Intellectual structure, current themes, and trajectories
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-04 | DOI: 10.1016/j.datak.2024.102351
Jacky Akoka, Isabelle Comyn-Wattiau, Nicolas Prat, Veda C. Storey
The field of conceptual modeling has now been in existence for over five decades. To understand how this field has evolved and should continue to evolve, it is useful to examine the contributions made over time and the themes that have emerged. In this research, we apply bibliometric analysis to a corpus of over 4700 research papers spanning from 1976 to 2023. We successively apply co-citation, bibliographic coupling, and main path analysis. Co-citation and citation networks are produced that surface the intellectual structure of the field, the main themes, and the relationships among major and influential research papers over time. We identify four areas in the intellectual structure of the field: conceptual modeling and databases; grammars and guidelines for conceptual modeling; requirements engineering and information systems design methodologies; and ontology constructs for conceptual modeling. Between 2017 and 2023, we distinguish nine research themes, including domain-specific conceptual modeling and applications, ontologies and applications, genomics, and datastores and multi-model data. The main path analysis identifies several trajectories among the major and most influential papers. This leads to insights into the lineage of key, influential papers in conceptual modeling research. The primordial nature of the main paths identified encompasses two important aspects. The first revolves around refining and complementing the entity-relationship model. The second identifies the contribution of ontologies for conceptual modeling to make the models more robust. Based on the findings from this bibliometric analysis, we propose several directions for future conceptual modeling research.
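Of the bibliometric techniques applied here, co-citation analysis is the simplest to illustrate: two references are co-cited whenever the same paper cites both, and co-citation counts define the network edges. A minimal sketch with made-up paper and reference IDs:

```python
from itertools import combinations
from collections import Counter

# Toy citation lists: paper -> references it cites (IDs are invented).
citations = {
    "P1": ["Chen76", "Codd70"],
    "P2": ["Chen76", "Codd70", "Guarino98"],
    "P3": ["Chen76", "Guarino98"],
}

def cocitation_counts(citations):
    """Count how often each pair of references is cited together."""
    counts = Counter()
    for refs in citations.values():
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts

counts = cocitation_counts(citations)
print(counts[("Chen76", "Codd70")])  # → 2
```

Bibliographic coupling is the dual computation (papers sharing references), and main path analysis then traces weighted paths through the resulting citation network.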
Cited by: 0
Data engineering and modeling for artificial intelligence
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-01 | DOI: 10.1016/j.datak.2024.102346
Carlos Ordonez, Wojciech Macyna, Ladjel Bellatreche
Cited by: 0
Capturing and Analysing Employee Behaviour: An Honest Day’s Work Record
IF 2.7 | CAS Tier 3, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-31 | DOI: 10.1016/j.datak.2024.102350
Iris Beerepoot, Tea Šinik, Hajo A. Reijers

For a range of reasons, organisations collect data on the work behaviour of their employees. However, each data collection technique displays its own unique mix of intrusiveness, information richness, and risks. For the sake of understanding the differences between data collection techniques, we conducted a multiple-case study in a multinational professional services organisation, tracking six participants throughout a workday using non-participant observation, screen recording, and timesheet techniques. This led to 136 hours of data. Our findings show that relying on one data collection technique alone cannot provide a comprehensive and accurate account of activities that are screen-based, offline, or overtime. The collected data also provided an opportunity to investigate the use of process mining for analysing employee behaviour, specifically with respect to the completeness of the collected data. Our study underlines the importance of judiciously selecting data collection techniques, as well as using a sufficiently broad data set to generate reliable insights into employee behaviour.

Citations: 0
Discovering outlying attributes of outliers in data streams
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-08-30 · DOI: 10.1016/j.datak.2024.102349
Egawati Panjei, Le Gruenwald

Data streams, continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. In various data stream applications, such as network security and credit card transaction monitoring, real-time detection of outliers is crucial, as these outliers often signify potential threats. Equally important is the real-time explanation of outliers, which enables users to glean insights and thereby shorten their investigation time. The investigation time for an outlier is closely tied to its number of attributes, making it essential to provide explanations that detail which attributes are responsible for the abnormality of a data point, referred to as outlying attributes. However, the unbounded volume and concept drift of data streams pose challenges for discovering the outlying attributes of outliers in real time. In response, in this paper we propose EXOS, an algorithm designed for discovering the outlying attributes of multi-dimensional outliers in data streams. EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses challenges related to the unbounded volume of data and concept drift. The algorithm is model-agnostic for point outlier detection and provides real-time explanations based on the local context of the outlier, derived from time-based tumbling windows. The paper provides a complexity analysis of EXOS and an experimental analysis comparing EXOS with existing algorithms. The evaluation assesses performance on both real-world and synthetic datasets in terms of average precision, recall, F1-score, and explanation time. The results show that, on average, EXOS achieves a 45.6% higher F1-score and a 7.3 times shorter explanation time than existing outlying attribute algorithms.
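As a hedged illustration of the general idea of outlying attributes — not the EXOS algorithm itself — a flagged point's attributes can be ranked by their z-scores against the point's local context drawn from a tumbling window. The window contents, threshold, and function name below are assumptions for the sketch.

```python
import statistics

def outlying_attributes(window, point, threshold=3.0):
    """Rank a point's attributes by |z-score| against a tumbling window.

    window: list of equal-length numeric records (the outlier's local context)
    point:  the flagged outlier record
    Returns the indices of attributes whose |z| exceeds the threshold,
    most anomalous first.
    """
    scores = []
    for i in range(len(point)):
        col = [row[i] for row in window]
        mu, sigma = statistics.mean(col), statistics.pstdev(col)
        z = abs(point[i] - mu) / sigma if sigma > 0 else 0.0
        scores.append((i, z))
    # keep only attributes above the threshold, sorted by descending |z|
    return [i for i, z in sorted(scores, key=lambda t: -t[1]) if z > threshold]

# Hypothetical window of normal readings; attribute 1 of the outlier deviates
window = [(10.0, 0.5), (11.0, 0.6), (9.5, 0.4), (10.5, 0.5)]
print(outlying_attributes(window, (10.2, 9.0)))  # → [1]
```

A per-attribute score like this ignores the cross-stream correlations EXOS exploits; it only conveys why limiting the explanation to a few outlying attributes shortens an investigation.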

{"title":"Discovering outlying attributes of outliers in data streams","authors":"Egawati Panjei ,&nbsp;Le Gruenwald","doi":"10.1016/j.datak.2024.102349","DOIUrl":"10.1016/j.datak.2024.102349","url":null,"abstract":"<div><p>Data streams, continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. In various data stream applications, such as network security and credit card transaction monitoring, real-time detection of outliers is crucial, as these outliers often signify potential threats. Equally important is the real-time explanation of outliers, enabling users to glean insights and thereby shorten their investigation time. The investigation time for outliers is closely tied to their number of attributes, making it essential to provide explanations that detail which attributes are responsible for the abnormality of a data point, referred to as outlying attributes. However, the unbounded volume of data and concept drift of data streams pose challenges for discovering the outlying attributes of outliers in real time. In response, in this paper we propose EXOS, an algorithm designed for discovering the outlying attributes of multi-dimensional outliers in data streams. EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses challenges related to the unbounded volume of data and concept drift. The algorithm is model-agnostic for point outlier detection and provides real-time explanations based on the local context of the outlier, derived from time-based tumbling windows. The paper provides a complexity analysis of EXOS and an experimental analysis comparing EXOS with existing algorithms. The evaluation includes an assessment of performance on both real-world and synthetic datasets in terms of average precision, recall, F1-score, and explanation time. 
The evaluation results show that, on average, EXOS achieves a 45.6% better F1 Score and is 7.3 times lower in explanation time compared to existing outlying attribute algorithms.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"154 ","pages":"Article 102349"},"PeriodicalIF":2.7,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142121852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A self-adaptive density-based clustering algorithm for varying densities datasets with strong disturbance factor
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-08-07 · DOI: 10.1016/j.datak.2024.102345
Zihao Cai, Zhaodong Gu, Kejing He

Clustering is a fundamental task in data mining, aiming to group similar objects together based on their features or attributes. With the rapid increase in data analysis volume and the growing complexity of high-dimensional data distributions, clustering has become increasingly important in numerous applications, including image analysis, text mining, and anomaly detection. DBSCAN, one of the most widely used density-based clustering algorithms, is a powerful tool for cluster analysis. However, DBSCAN and its variants struggle with datasets whose clusters have varying densities in intricate high-dimensional spaces affected by significant disturbance factors. A typical example is multi-density clusters connected by a few data points with strong internal correlations, a scenario commonly encountered in the analysis of crowd mobility. To address these challenges, we propose a Self-adaptive Density-Based Clustering Algorithm for Varying Densities Datasets with Strong Disturbance Factor (SADBSCAN). The algorithm comprises a data block splitter, a local clustering module, a global clustering module, and a data block merger to obtain adaptive clustering results. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of SADBSCAN. The experimental results indicate that SADBSCAN significantly outperforms several strong baselines across different metrics, demonstrating the high adaptability and scalability of our algorithm.
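The varying-density problem can be made concrete with a minimal sketch, which is not the SADBSCAN implementation: a toy DBSCAN whose eps is chosen per data block from k-nearest-neighbor distances. A single global eps tuned to the dense block marks the sparse block entirely as noise, while a per-block eps recovers both clusters. The block data and the k-distance heuristic are assumptions for the sketch.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # tentatively noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:      # noise reached from a core point: border
                labels[j] = cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(j_nbrs)
    return labels

def adaptive_eps(points, k=3):
    """Pick eps for a data block as the median k-th nearest-neighbor distance."""
    kth = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        kth.append(dists[k - 1])
    kth.sort()
    return kth[len(kth) // 2]

dense = [(float(x), 0.0) for x in range(10)]         # spacing 1
sparse = [(float(10 * x), 50.0) for x in range(5)]   # spacing 10
print(dbscan(dense, adaptive_eps(dense), min_pts=3))    # → [0]*10, one cluster
print(dbscan(sparse, adaptive_eps(sparse), min_pts=3))  # → [0]*5, one cluster
print(dbscan(sparse, 2.0, min_pts=3))                   # global eps: all noise
```

SADBSCAN's splitter/merger pipeline is far more involved, but the sketch shows why per-block parameter selection matters when densities vary.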

{"title":"A self-adaptive density-based clustering algorithm for varying densities datasets with strong disturbance factor","authors":"Zihao Cai,&nbsp;Zhaodong Gu,&nbsp;Kejing He","doi":"10.1016/j.datak.2024.102345","DOIUrl":"10.1016/j.datak.2024.102345","url":null,"abstract":"<div><p>Clustering is a fundamental task in data mining, aiming to group similar objects together based on their features or attributes. With the rapid increase in data analysis volume and the growing complexity of high-dimensional data distribution, clustering has become increasingly important in numerous applications, including image analysis, text mining, and anomaly detection. DBSCAN is a powerful tool for clustering analysis and is widely used in density-based clustering algorithms. However, DBSCAN and its variants encounter challenges when confronted with datasets exhibiting clusters of varying densities in intricate high-dimensional spaces affected by significant disturbance factors. A typical example is multi-density clustering connected by a few data points with strong internal correlations, a scenario commonly encountered in the analysis of crowd mobility. To address these challenges, we propose a Self-adaptive Density-Based Clustering Algorithm for Varying Densities Datasets with Strong Disturbance Factor (SADBSCAN). This algorithm comprises a data block splitter, a local clustering module, a global clustering module, and a data block merger to obtain adaptive clustering results. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of SADBSCAN. 
The experimental results indicate that SADBSCAN significantly outperforms several strong baselines across different metrics, demonstrating the high adaptability and scalability of our algorithm.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102345"},"PeriodicalIF":2.7,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141979340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0