Data & Knowledge Engineering: Latest Articles

Time-aware structure matching for temporal knowledge graph alignment
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-03-11; DOI: 10.1016/j.datak.2024.102300
Wei Jia, Ruizhe Ma, Li Yan, Weinan Niu, Zongmin Ma

Entity alignment, aiming at identifying equivalent entity pairs across multiple knowledge graphs (KGs), serves as a vital step for knowledge fusion. As the majority of KGs undergo continuous evolution, existing solutions utilize graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, this prevailing method often overlooks the consequential impact of relation embedding generation on entity embeddings through inherent structures. In this paper, we propose a novel model named Time-aware Structure Matching based on GNNs (TSM-GNN) that encompasses the learning of both topological and inherent structures. Our key innovation lies in a unique method for generating relation embeddings, which can enhance entity embeddings via inherent structure. Specifically, we utilize the translation property of knowledge graphs to obtain the entity embedding that is mapped into a time-aware vector space. Subsequently, we employ GNNs to learn global entity representation. To better capture the useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.
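
The abstract above mentions the translation property of knowledge graphs and an attention mechanism over time-aware structures. As a rough illustration only, the sketch below uses a generic TransE-style rule (head + relation ≈ tail) with an added timestamp vector, plus a plain softmax as the attention normaliser. It is not the TSM-GNN model; all function names and the toy embedding values are assumptions made for this example.

```python
import math

def translate(head, relation, time):
    """Shift a head embedding by a relation vector and a timestamp vector."""
    return [h + r + t for h, r, t in zip(head, relation, time)]

def translation_distance(head, relation, time, tail):
    """L2 distance between the translated head and the tail (smaller = more plausible)."""
    moved = translate(head, relation, time)
    return math.sqrt(sum((m - x) ** 2 for m, x in zip(moved, tail)))

def softmax(scores):
    """Numerically stable softmax, used here as a generic attention normaliser."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-dimensional embeddings (made-up values, purely illustrative).
head = [1.0, 0.0]
relation = [0.0, 1.0]
time = [0.5, 0.5]
good_tail = [1.5, 1.5]    # exactly head + relation + time
bad_tail = [-1.0, -1.0]   # far from the translated head

d_good = translation_distance(head, relation, time, good_tail)
d_bad = translation_distance(head, relation, time, bad_tail)

# Attention over two neighbouring facts: the more plausible fact gets the larger weight.
weights = softmax([-d_good, -d_bad])
print(d_good, d_bad, weights)
```

Negating the distance before the softmax is one simple way to turn "closer is better" into "larger weight"; a trained model would typically learn its attention scores rather than derive them from a fixed distance.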

Data & Knowledge Engineering, Volume 151, Article 102300.
Citations: 0
A knowledge-sharing platform for space resources
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-02-29; DOI: 10.1016/j.datak.2024.102286
Marcos Da Silveira, Louis Deladiennee, Emmanuel Scolan, Cedric Pruski

The ever-increasing interest of academia, industry, and government institutions in space resource information highlights the difficulty of finding, accessing, integrating, and reusing this information. Although information is regularly published on the internet, it is disseminated on many different websites and in different formats, including scientific publications, patents, news, and reports. We are currently developing a knowledge management and sharing platform for space resources. This tool, which relies on the combined use of knowledge graphs and ontologies, formalises the domain knowledge contained in the above-mentioned documents and makes it more readily available to the community. In this article, we describe the concepts and techniques of knowledge extraction and management adopted during the design and implementation of the platform.

Data & Knowledge Engineering, Volume 151, Article 102286.
Citations: 0
Knowledge graph-based image classification
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-02-28; DOI: 10.1016/j.datak.2024.102285
Franck Anaël Mbiaya, Christel Vrain, Frédéric Ros, Thi-Bich-Hanh Dao, Yves Lucas

This paper introduces a deep learning method for image classification that leverages knowledge formalized as a graph built from information represented as attribute/value pairs. The proposed method investigates a loss function that adaptively combines the classical cross-entropy commonly used in deep learning with a novel penalty function. The novel loss function is derived from the node representations obtained by embedding the knowledge graph and incorporates the proximity between class nodes and image nodes. Its formulation enables the model to focus on identifying the boundaries between the classes that are hardest to distinguish. Experimental results on several image databases demonstrate improved performance compared to state-of-the-art methods, including classical deep learning algorithms and recent algorithms that incorporate knowledge represented by a graph.
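
To make the idea of combining cross-entropy with a proximity-based penalty more concrete, here is a minimal sketch. It is not the loss proposed in the paper: the hinge-style penalty, the fixed mixing weight `alpha` (the paper describes an adaptive combination, which is not reproduced here), the `margin` parameter, and the toy node embeddings are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over raw class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy for a single example with an integer class label."""
    return -math.log(softmax(logits)[target])

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proximity_penalty(image_vec, class_vecs, target, margin=1.0):
    """Hinge penalty (assumed form): the true class node should be closer to the
    image node than the nearest wrong class node by at least `margin`."""
    d_true = euclidean(image_vec, class_vecs[target])
    d_wrong = min(euclidean(image_vec, c)
                  for i, c in enumerate(class_vecs) if i != target)
    return max(0.0, margin + d_true - d_wrong)

def combined_loss(logits, image_vec, class_vecs, target, alpha=0.5):
    """Fixed-weight mix of cross-entropy and penalty (alpha is a hypothetical constant)."""
    ce = cross_entropy(logits, target)
    penalty = proximity_penalty(image_vec, class_vecs, target)
    return (1.0 - alpha) * ce + alpha * penalty

# Toy graph embeddings: two class nodes and one image node (made-up values).
class_vecs = [[0.0, 0.0], [3.0, 0.0]]
image_vec = [0.5, 0.0]
logits = [2.0, 0.0]

easy = combined_loss(logits, image_vec, class_vecs, target=0)  # image sits near class 0
hard = combined_loss(logits, image_vec, class_vecs, target=1)  # image sits far from class 1
print(easy, hard)
```

In this toy setting the penalty vanishes when the image node already lies well inside the neighbourhood of its class node, so the extra term only contributes for examples near a confusable class.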

Data & Knowledge Engineering, Volume 151, Article 102285. PDF: https://www.sciencedirect.com/science/article/pii/S0169023X24000090/pdfft?md5=197a1155c2e53ecde4dd061f7a501a91&pid=1-s2.0-S0169023X24000090-main.pdf
Citations: 0
Improving the identification of relevant variants in genome information systems: A methodological approach with a case study on early onset Alzheimer's disease
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-02-09; DOI: 10.1016/j.datak.2024.102284
Mireia Costa, Ana León, Óscar Pastor

Alzheimer's disease is the most common type of dementia in the elderly. Nevertheless, there is an early onset form that is difficult to diagnose precisely. As the genetic component is the most critical factor in developing this disease, identifying relevant genetic variants is key to obtaining a more reliable and straightforward diagnosis. The information about these variants is stored in a large number of data sources, which must be carefully analyzed to select only the information with sufficient quality to be used in a clinical setting. This selection has become complex due to the increasing amount of available genomic information. The SILE method was designed to systematize the identification of relevant variants for a disease in this challenging context. However, several problems with how SILE identifies relevant variants were discovered when applying the method to the early onset form of Alzheimer's disease. More specifically, the method failed to address specific features of this disease, such as its low incidence and familial component. This paper proposes an improvement of the identification process defined by the SILE method to make it applicable to a wider spectrum of diseases. Details of how the proposed solution has been applied are also reported. As a result of this improvement, a set of 29 variants has been identified (25 variants Accepted with Limited Evidence and 4 Accepted with Moderate Evidence). This constitutes a valuable result that facilitates and reinforces the genetic diagnosis of the disease.

Data & Knowledge Engineering, Volume 151, Article 102284. PDF: https://www.sciencedirect.com/science/article/pii/S0169023X24000089/pdfft?md5=571739f0b90877da191a9d94a852f178&pid=1-s2.0-S0169023X24000089-main.pdf
Citations: 0
Fuzzy-Ontology based knowledge driven disease risk level prediction with optimization assisted ensemble classifier
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-02-04; DOI: 10.1016/j.datak.2024.102278
Huma Parveen, Syed Wajahat Abbas Rizvi, Raja Sarath Kumar Boddu

Modern medical analysis is a complex procedure, requiring precise patient data, scientific knowledge accumulated over many years, and a theoretical understanding of the related medical literature. To improve accuracy and reduce the time needed for diagnosis, clinical decision support systems (DSS) were introduced, which incorporate data mining schemes to enhance diagnostic accuracy. This work proposes a new disease-prediction model that involves three stages. First, “improved stemming and tokenization” are carried out in the pre-processing stage. Next, the “Fuzzy ontology, improved mutual information (MI), and correlation features” are extracted. Prediction is then carried out via ensemble classifiers that include “improved Fuzzy logic, Long Short Term Memory (LSTM), Deep Convolution Neural Network (DCNN), and Bidirectional Gated Recurrent Unit (Bi-GRU)”. The outcomes from improved fuzzy logic, LSTM, and DCNN are further classified via Bi-GRU, which produces the final result. Specifically, the Bi-GRU weights are optimally tuned using Deer Hunting Update Explored Arithmetic Optimization (DHUEAO). Finally, the efficiency of the proposed work is evaluated against a variety of metrics.

Data & Knowledge Engineering, Volume 151, Article 102278.
Citations: 0
Fusion learning of preference and bias from ratings and reviews for item recommendation
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-02-03; DOI: 10.1016/j.datak.2024.102283
Junrui Liu, Tong Li, Zhen Yang, Di Wu, Huan Liu

Recommendation methods improve rating prediction performance by learning the selection bias phenomenon: users tend to rate items they like. These methods model selection bias by calculating the propensities of ratings, but inaccurate propensities can introduce more noise, fail to model selection bias, and reduce prediction performance. We argue that learning interaction features can effectively model selection bias and improve model performance, as interaction features explain the reason for this tendency. Reviews can be used to model interaction features because they have a strong intrinsic correlation with user interests and item interactions. In this study, we propose a preference- and bias-oriented fusion learning model (PBFL) that models the interaction features based on reviews and user preferences to make rating predictions. Our proposal both embeds traditional user preferences in reviews, interactions, and ratings and considers word distribution bias and review quoting to model interaction features. Six real-world datasets are used to demonstrate effectiveness and performance. PBFL achieves an average improvement of 4.46% in root-mean-square error (RMSE) and 3.86% in mean absolute error (MAE) over the best baseline.
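
For readers unfamiliar with the two error metrics quoted above, the following sketch computes RMSE and MAE on a handful of made-up ratings and reports the relative error reduction against a baseline. The ratings are invented, and reading "improvement" as a relative reduction in error is an assumption; the abstract does not state how the percentages were calculated.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def mae(predicted, actual):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def relative_improvement(baseline_error, model_error):
    """Percentage by which the error drops relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Invented ratings on a 1-5 scale, purely for illustration.
actual = [4.0, 3.0, 5.0, 2.0]
baseline = [3.0, 3.0, 4.0, 3.0]
model = [3.5, 3.0, 4.5, 2.5]

rmse_gain = relative_improvement(rmse(baseline, actual), rmse(model, actual))
mae_gain = relative_improvement(mae(baseline, actual), mae(model, actual))
print(f"RMSE improvement: {rmse_gain:.2f}%  MAE improvement: {mae_gain:.2f}%")
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE; reporting both, as the abstract does, shows whether a gain comes from fewer large errors or from uniformly smaller ones.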

Data & Knowledge Engineering, Volume 150, Article 102283.
Citations: 0
Multivariate hierarchical DBSCAN model for enhanced maritime data analytics
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-02-02; DOI: 10.1016/j.datak.2024.102282
Nitin Newaliya, Yudhvir Singh

Clustering is an important data analytics technique with numerous use cases. It yields insights and knowledge that would not be readily discernible from routine examination of the data. Enhancement of clustering techniques is an active field of research, with various optimisation models being proposed. Such enhancements are also undertaken to address particular issues faced in specific applications. This paper looks at a particular use case in the maritime domain and at how an enhancement of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) allows data analytics to be applied to a real-life problem. Passage of vessels over water is one of the significant uses of maritime regions. Trajectory analysis of these vessels provides valuable information. Maritime movement data, and the knowledge extracted from it, therefore play an essential role in various applications, such as assessing traffic densities, identifying traffic routes, and reducing collision risks. Optimised trajectory information would help enable safe and energy-efficient green operations at sea and assist autonomous operations of maritime systems and vehicles. Many studies focus on determining trajectory densities but miss individual trajectory granularities. Determining trajectories by using the unique identities of the vessels may also lead to errors. Using an unsupervised DBSCAN method of identifying trajectories could help overcome these limitations. Further, to enhance outcomes and insights, the inclusion of temporal information along with additional parameters of Automatic Identification System (AIS) data in DBSCAN is proposed. To this end, a new data analytics design and implementation called the Multivariate Hierarchical DBSCAN method has been developed for better clustering of maritime movement data, such as AIS; it helps determine granular information and individual trajectories in an unsupervised manner. The evaluation metrics show that this method performs better than other data clustering techniques.
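
The effect of adding temporal information to DBSCAN can be seen in a small sketch. The code below is plain textbook DBSCAN with a distance that mixes position and time; it is not the Multivariate Hierarchical DBSCAN method from the paper. The `time_scale` factor, the `(x, y, t)` point layout, and the toy track data are assumptions made for this example.

```python
import math

NOISE = -1

def st_distance(a, b, time_scale=60.0):
    """Distance between (x, y, t) points; time_scale (assumed) maps seconds to spatial units."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    dt = (a[2] - b[2]) / time_scale
    return math.sqrt(dx * dx + dy * dy + dt * dt)

def spatial_only(a, b):
    """Same points, but the timestamp is ignored."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def region_query(points, index, eps, dist):
    """Indices of all points within eps of points[index] (including itself)."""
    return [j for j, q in enumerate(points) if dist(points[index], q) <= eps]

def dbscan(points, eps, min_pts, dist=st_distance):
    """Textbook DBSCAN; returns one cluster id per point, with NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps, dist)
        if len(neighbours) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned, do not expand again
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps, dist)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels

# Two passes over the same lane at different times, plus one stray report (toy data).
points = [(0.0, 0.0, 0), (0.1, 0.0, 30), (0.2, 0.0, 60),
          (0.0, 0.0, 6000), (0.1, 0.0, 6030), (0.2, 0.0, 6060),
          (50.0, 50.0, 3000)]

with_time = dbscan(points, eps=0.6, min_pts=2)
without_time = dbscan(points, eps=0.6, min_pts=2, dist=spatial_only)
print(with_time)
print(without_time)
```

With the timestamp in the metric, the two passes fall into separate clusters; with position alone they merge into one, which is the kind of lost per-trajectory granularity the abstract describes.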

Data & Knowledge Engineering, Volume 150, Article 102282.
Citations: 0
AI system architecture design methodology based on IMO (Input-AI Model-Output) structure for successful AI adoption in organizations
IF 2.5; Zone 3, Computer Science; Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; Pub Date: 2024-01-28; DOI: 10.1016/j.datak.2023.102264
Seungkyu Park, Joong yoon Lee, Jooyeoun Lee

With the advancement of AI technology, successful AI adoption in organizations has become a top priority in modern society. However, many organizations still struggle to articulate the AI they need, and AI experts have difficulty understanding the problems these organizations face. This knowledge gap makes it difficult for organizations to identify the technical requirements, such as the necessary data and algorithms, for adopting AI. To overcome this problem, we propose a new AI system architecture design methodology based on the IMO (Input-AI Model-Output) structure. The IMO structure enables effective identification of the technical requirements necessary to develop real AI models. While previous research has identified the importance and challenges of technical requirements, such as data and AI algorithms, for AI adoption, there has been little research on methodologies to make them concrete. Our methodology is composed of three stages (problem definition, system AI solution, and AI technical solution) that design, at a system level, the AI technology and requirements that organizations need. The effectiveness of our methodology is demonstrated through a case study, a logical comparative analysis with other studies, and expert reviews, which show that our methodology can support successful AI adoption in organizations.

Data & Knowledge Engineering, Volume 150, Article 102264. PDF: https://www.sciencedirect.com/science/article/pii/S0169023X23001246/pdfft?md5=e0d3a91ff85a9662d7d0a2bed8c5acfd&pid=1-s2.0-S0169023X23001246-main.pdf
Citations: 0
A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification
IF 2.5 Zone 3 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-19 DOI: 10.1016/j.datak.2024.102281
Guillaume Lefebvre, Haytham Elghazel, Theodore Guillet, Alexandre Aussem, Matthieu Sonnati

In recent years, Natural Language Processing (NLP) has made significant advances through general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such settings, when applied to domain-specific texts such as those of the education and professional training domain, general language embedding models often represent the unique terminology and contextual nuances of the specialized domain inadequately. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This framework chains multiple classifiers, where each individual classifier is built using BERTEPro, a novel sentence-embedding method based on existing Transformer models whose pre-training has been extended on education and professional training texts before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates its local probability predictions, augmented with the input feature vectors, to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments on three domain-specific textual HMC datasets indicate that HMCCCProbT compares favorably with state-of-the-art HMC algorithms in terms of classification accuracy, and that BERTEPro yields better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
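The chaining idea described in this abstract (each level's classifier passes its probability predictions, together with the original input features, to the next level) can be illustrated with a minimal sketch. This is not the authors' implementation: `CentroidClassifier` and `ProbabilityChain` are hypothetical names, a simple nearest-centroid model stands in for the BERTEPro-based classifiers, the 2-D vectors stand in for sentence embeddings, and each level predicts a single label rather than multiple labels.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CentroidClassifier:
    """Hypothetical stand-in for a local classifier (one per hierarchy level)."""

    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = []
        for label in self.labels:
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, x):
        # Score each class by negative squared distance to its centroid.
        scores = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centroids]
        return softmax(scores)


class ProbabilityChain:
    """Chain of per-level classifiers; level k sees features + level k-1 probabilities."""

    def fit(self, X, labels_per_level):
        self.models = []
        inputs = [list(x) for x in X]
        for y in labels_per_level:
            model = CentroidClassifier().fit(inputs, y)
            self.models.append(model)
            # Next level's input: original features augmented with this level's probabilities.
            inputs = [list(x) + model.predict_proba(inp) for x, inp in zip(X, inputs)]
        return self

    def predict(self, x):
        inp, path = list(x), []
        for model in self.models:
            proba = model.predict_proba(inp)
            path.append(model.labels[proba.index(max(proba))])
            inp = list(x) + proba
        return path


# Toy data: four "embeddings" with a two-level label hierarchy (made up for illustration).
X = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
level_1 = ["science", "science", "arts", "arts"]
level_2 = ["physics", "biology", "music", "painting"]

chain = ProbabilityChain().fit(X, [level_1, level_2])
print(chain.predict([0.2, 0.9]))  # → ['science', 'biology']
```

In this sketch the training inputs for each level reuse the previous model's predictions on its own training data; a real system would likely need held-out predictions to avoid overconfident probabilities being propagated down the chain.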

Source: Data & Knowledge Engineering, vol. 150, Article 102281, published 2024-01-19, DOI 10.1016/j.datak.2024.102281.
Citations: 0
Issues in inter-organizational data sharing: Findings from practice and research challenges
IF 2.5 Zone 3 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-10 DOI: 10.1016/j.datak.2024.102280
Ilka Jussen, Frederik Möller, Julia Schweihoff, Anna Gieß, Giulia Giussani, Boris Otto

Sharing data can be highly effective in helping companies optimize internally and design new products and services. While the benefits seem obvious, sharing data is accompanied by a spectrum of concerns, ranging from fear of sharing something of value and uncertainty about what will happen to the data to a simple lack of understanding of the short- and mid-term benefits. The article analyzes data sharing in inter-organizational relationships by examining 13 cases in a qualitative interview study and through public data analysis. Given the importance of inter-organizational data sharing, as indicated by large research initiatives such as Gaia-X and Catena-X, we explore issues arising in this process and formulate research challenges. We use the theoretical lens of Actor-Network Theory to analyze our data and entangle its constructs with concepts in data sharing.

Source: Data & Knowledge Engineering, vol. 150, Article 102280, published 2024-01-10, DOI 10.1016/j.datak.2024.102280. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0169023X24000041/pdfft?md5=8cca34784bb0ed03de222b7dc6fbfc47&pid=1-s2.0-S0169023X24000041-main.pdf
Citations: 0