
Latest Articles in Data & Knowledge Engineering

Effective text classification using BERT, MTM LSTM, and DT
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-05-01 | DOI: 10.1016/j.datak.2024.102306
Saman Jamshidi, Mahin Mohammadi, Saeed Bagheri, Hamid Esmaeili Najafabadi, Alireza Rezvanian, Mehdi Gheisari, Mustafa Ghaderzadeh, Amir Shahab Shahabi, Zongda Wu

Text classification plays a critical role in managing large volumes of electronically produced texts. As the number of such texts increases, manual analysis becomes impractical, necessitating an intelligent approach for processing information. Deep learning models have witnessed widespread application in text classification, including the use of recurrent neural networks like Many to One Long Short-Term Memory (MTO LSTM). Nonetheless, this model is limited by its reliance on only the last token for text labelling. To overcome this limitation, this study introduces a novel hybrid model that combines Bidirectional Encoder Representations from Transformers (BERT), Many To Many Long Short-Term Memory (MTM LSTM), and Decision Templates (DT) for text classification. In this new model, the text is first embedded with BERT, and the resulting token embeddings are fed to an MTM LSTM trained to approximate the target at each token. Finally, the per-token approximations are fused using DT. The proposed model is evaluated on the well-known IMDB movie review dataset for binary classification and the Drug Review Dataset for multiclass classification. The results demonstrate superior performance in terms of accuracy, recall, precision, and F1 score compared to previous models. The hybrid model presented in this study holds significant potential for a wide range of text classification tasks and stands as a valuable contribution to the field.
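A minimal sketch of the described pipeline is given below, assuming PyTorch and the Hugging Face transformers library; the model name, the LSTM hidden size, and the masked averaging of per-token probabilities (a simplified stand-in for the paper's Decision Template fusion) are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: BERT embeddings -> many-to-many LSTM -> per-token predictions fused into
# a document label. The masked averaging is a simplified stand-in for Decision Template
# fusion; model name and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertMtmLstm(nn.Module):
    def __init__(self, n_classes=2, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)           # applied at every token (many-to-many)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                            # keep BERT frozen in this sketch
            emb = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state
        seq_out, _ = self.lstm(emb)                      # one hidden state per token
        token_probs = self.head(seq_out).softmax(dim=-1)
        mask = attention_mask.unsqueeze(-1).float()      # ignore padding when fusing
        return (token_probs * mask).sum(dim=1) / mask.sum(dim=1)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["The movie was surprisingly good."], return_tensors="pt",
            padding=True, truncation=True)
model = BertMtmLstm()
print(model(batch["input_ids"], batch["attention_mask"]))  # document-level class probabilities
```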

Citations: 0
Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-05-01 | DOI: 10.1016/j.datak.2024.102307
José Antonio García-Díaz, Ghassan Beydoun, Rafel Valencia-García

Author profiling consists of extracting authors' demographic and psychographic information by examining their writings. This information can then be used to improve the reader experience and to detect bots or propagators of hoaxes and/or hate speech. Therefore, author profiling can be applied to build more robust and efficient Knowledge-Based Systems for tasks such as content moderation, user profiling, and information retrieval. Author profiling is typically performed automatically as a document classification task. Recently, language models based on transformers have also proven to be quite effective in this task. However, the size and heterogeneity of novel language models make it necessary to evaluate them in context. The contributions we make in this paper are four-fold: First, we evaluate which language models are best suited to perform author profiling in Spanish. These experiments include basic, distilled, and multilingual models. Second, we evaluate how feature integration can improve performance for this task. We evaluate two distinct strategies: knowledge integration and ensemble learning. Third, we evaluate the ability of linguistic features to improve the interpretability of the results. Fourth, we evaluate the performance of each language model in terms of memory, training, and inference times. Our results indicate that the use of lightweight models can indeed achieve similar performance to heavy models and that multilingual models are actually less effective than models trained with one language. Finally, we confirm that the best models and strategies for integrating features ultimately depend on the context of the task.
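The two integration strategies named above can be illustrated with a small, hedged sketch: concatenating linguistic features with text embeddings before a single classifier (knowledge integration) versus averaging the probabilities of separate models (ensemble learning). The features, the toy Spanish texts and labels, and the random placeholder encoder are assumptions for illustration only.

```python
# Hedged sketch of the two strategies: (a) knowledge integration concatenates linguistic
# features with text embeddings before one classifier; (b) ensemble learning averages the
# probabilities of separate models. The random "embed" is a placeholder for a transformer
# encoder, and the toy texts/labels are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linguistic_features(text):
    words = text.split()
    return np.array([np.mean([len(w) for w in words]),   # mean word length
                     len(set(words)) / len(words),       # type-token ratio
                     len(words)], dtype=float)            # number of tokens

def embed(texts):                                         # placeholder transformer encoder
    return np.random.default_rng(0).normal(size=(len(texts), 32))

texts = ["me encanta este libro", "no me gusta nada", "qué maravilla de día", "todo sale mal hoy"]
y = np.array([1, 0, 1, 0])
E, L = embed(texts), np.vstack([linguistic_features(t) for t in texts])

joint = LogisticRegression(max_iter=1000).fit(np.hstack([E, L]), y)   # (a) knowledge integration
m_e = LogisticRegression(max_iter=1000).fit(E, y)                      # (b) ensemble learning
m_l = LogisticRegression(max_iter=1000).fit(L, y)
avg = (m_e.predict_proba(E) + m_l.predict_proba(L)) / 2
print(joint.predict(np.hstack([E, L])), avg.argmax(axis=1))
```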

Citations: 0
Managerial risk data analytics applications using grey influence analysis (GINA)
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-05-01 | DOI: 10.1016/j.datak.2024.102312
R. Rajesh

We observe and analyze the causal relations among risk factors in a system, focusing on manufacturing supply chains. Seven major categories of risks were identified and scrutinized, and a detailed analysis of their causal relations using the grey influence analysis (GINA) methodology is outlined. Using an expert-response-based survey, we conduct an initial analysis of the risks with risk matrix analysis (RMA) and identify the high-priority risks. GINA is then applied to understand the causal relations among the various categories of risks, which is particularly useful in group decision-making environments. The results from RMA conclude that the capacity risks (CR) and delays (DL) are in the category of very high priority risks. The GINA results also ratify the conclusions from RMA and show that managers need to control and manage capacity risks (CR) and delays (DL) with high priority. Additionally, from the results of GINA, the causal factors disruptions (DS) and forecast risks (FR) appear to be of primary importance and, if unattended, can lead to the initiation of several other risks in supply chains. Managers are advised to identify disruptions in supply chains at an early stage and to reduce forecast errors to avoid bullwhip effects.
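As a rough illustration of the risk matrix analysis (RMA) step described above, the sketch below buckets risk categories by the product of expert likelihood and impact ratings; the ratings and thresholds are invented for illustration and do not reproduce the paper's survey data or the GINA computation.

```python
# Hedged illustration of a risk matrix: each risk category gets likelihood and impact
# ratings (1-5) and is bucketed by their product. Scores and thresholds are invented.
def priority(likelihood: int, impact: int) -> str:
    score = likelihood * impact
    if score >= 20:
        return "very high"
    if score >= 12:
        return "high"
    if score >= 6:
        return "medium"
    return "low"

risks = {"capacity risks (CR)": (5, 5), "delays (DL)": (4, 5),
         "disruptions (DS)": (3, 4), "forecast risks (FR)": (3, 3)}
for name, (l, i) in risks.items():
    print(f"{name}: {priority(l, i)} priority")
```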

Citations: 0
A graph based named entity disambiguation using clique partitioning and semantic relatedness
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-04-30 | DOI: 10.1016/j.datak.2024.102308
Ramla Belalta, Mouhoub Belazzoug, Farid Meziane

Disambiguating name mentions in texts is a crucial task in Natural Language Processing, especially in entity linking. The credibility and efficiency of such systems depend largely on this task. For a given named entity mention in a text, the knowledge base may contain many candidate entities it could refer to. Therefore, it is very difficult to select the correct candidate from the whole set of candidate entities for this mention. To solve this problem, collective entity disambiguation is a prominent approach. In this paper, we present a novel algorithm called CPSR for collective entity disambiguation, which is based on a graph approach and semantic relatedness. A clique partitioning algorithm is used to find the best clique that contains a set of candidate entities. These candidate entities provide the answers to the corresponding mentions in the disambiguation process. To evaluate our algorithm, we carried out a series of experiments on seven well-known datasets, namely, AIDA/CoNLL2003-TestB, IITB, MSNBC, AQUAINT, ACE2004, Cweb, and Wiki. The Kensho Derived Wikimedia Dataset (KDWD) is used as the knowledge base for our system. From the experimental results, our CPSR algorithm outperforms both the baselines and other well-known state-of-the-art approaches.
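A brute-force sketch of the collective disambiguation idea follows: choose one candidate entity per mention so that the selected set, viewed as a clique in the candidate graph, maximizes total pairwise semantic relatedness. The mentions, candidates, and relatedness scores are toy values, and the exhaustive search is a stand-in for the CPSR clique partitioning algorithm, not its implementation.

```python
# Hedged, brute-force stand-in for collective entity disambiguation: pick one candidate
# per mention so that the chosen set (a clique over candidates) maximises total pairwise
# semantic relatedness. All scores below are invented toy values.
from itertools import product, combinations

candidates = {
    "Paris":  ["Paris_(France)", "Paris_(Texas)"],
    "Seine":  ["Seine_(river)"],
    "Louvre": ["Louvre_(museum)", "Louvre_(Lens)"],
}
relatedness = {  # symmetric semantic relatedness between candidate entities (toy values)
    ("Paris_(France)", "Seine_(river)"): 0.9, ("Paris_(France)", "Louvre_(museum)"): 0.8,
    ("Seine_(river)", "Louvre_(museum)"): 0.7, ("Paris_(Texas)", "Seine_(river)"): 0.1,
    ("Paris_(Texas)", "Louvre_(Lens)"): 0.05, ("Paris_(France)", "Louvre_(Lens)"): 0.2,
    ("Seine_(river)", "Louvre_(Lens)"): 0.1, ("Paris_(Texas)", "Louvre_(museum)"): 0.1,
}

def rel(a, b):
    return relatedness.get((a, b), relatedness.get((b, a), 0.0))

best = max(product(*candidates.values()),
           key=lambda clique: sum(rel(a, b) for a, b in combinations(clique, 2)))
print(dict(zip(candidates, best)))   # mention -> disambiguated entity
```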

Citations: 0
CQuAE: A new Contextualized QUestion Answering corpus on Education domain
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-04-15 | DOI: 10.1016/j.datak.2024.102305
Thomas Gerald, Louis Tamames, Sofiane Ettayeb, Ha-Quang Le, Patrick Paroubek, Anne Vilnat

Generating education-related questions and answers remains an open problem, despite being useful for students, teachers, and teaching aids. Given textual course material, we are interested in generating non-factual questions that require an elaborate answer (relying on analysis or reasoning). Despite the availability of annotated corpora of questions and answers, the effort to develop a generator using deep learning faces two main challenges. Firstly, freely accessible, high-quality data are insufficient to train generative approaches. Secondly, for a stand-alone application, we do not have explicit support to guide the generation toward complex questions. To tackle the first issue, we propose a new corpus based on education documents. For the second point, we propose to study several retargetable language algorithms that produce answers by extracting text spans from contextual documents to help the generation of questions. We particularly study the contribution of deep neural syntactic parsing and transformer-based semantic representation, taking into account the question type (according to our specific question typology) and the contextual support text span. Additionally, recent advances in generation models have proven the efficiency of the instruction-based approach for natural language generation. Consequently, we propose a first investigation of very large language models to generate questions related to the education domain.
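A hedged sketch of the instruction-based generation step could look as follows: an answer span extracted from the course material is embedded in an instruction prompt asking for a non-factual, reasoning-oriented question. The prompt wording, the example context, and the placeholder generate function are assumptions; they do not reflect the exact prompts or models used in the paper.

```python
# Hedged sketch of instruction-based question generation: an extracted answer span is
# wrapped in an instruction prompt; "generate" is a placeholder for whatever large
# language model is used.
def build_prompt(context: str, answer_span: str) -> str:
    return (
        "You are a teaching assistant. Given the course excerpt and the highlighted answer, "
        "write one open-ended question that requires analysis or reasoning to answer.\n\n"
        f"Course excerpt: {context}\n"
        f"Answer to target: {answer_span}\n"
        "Question:"
    )

def generate(prompt: str) -> str:              # placeholder for an LLM call
    return "<question produced by the language model>"

context = "Photosynthesis converts light energy into chemical energy stored in glucose."
answer_span = "light energy into chemical energy"
print(generate(build_prompt(context, answer_span)))
```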

Citations: 0
Artificial intelligence in digital twins—A systematic literature review
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-04-03 | DOI: 10.1016/j.datak.2024.102304
Tim Kreuzer, Panagiotis Papapetrou, Jelena Zdravkovic

Artificial intelligence and digital twins have become more popular in recent years and have seen usage across different application domains for various scenarios. This study reviews the literature at the intersection of the two fields, where digital twins integrate an artificial intelligence component. We follow a systematic literature review approach, analyzing a total of 149 related studies. In the assessed literature, a variety of problems are approached with an artificial intelligence-integrated digital twin, demonstrating its applicability across different fields. Our findings indicate that there is a lack of in-depth modeling approaches regarding the digital twin, while many articles focus on the implementation and testing of the artificial intelligence component. The majority of publications do not demonstrate a virtual-to-physical connection between the digital twin and the real-world system. Further, only a small portion of studies base their digital twin on real-time data from a physical system, implementing a physical-to-virtual connection.

Citations: 0
Leveraging an Isolation Forest to Anomaly Detection and Data Clustering
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-03-28 | DOI: 10.1016/j.datak.2024.102302
Véronne Yepmo, Grégory Smits, Marie-Jeanne Lesot, Olivier Pivert

Understanding why some points in a data set are considered anomalies cannot be done without taking into account the structure of the regular points. Whereas many machine learning methods are dedicated either to the identification of anomalies or to the identification of the data's inner structure, a solution is introduced here to address both tasks using the same data model, a variant of an isolation forest. The original algorithm for constructing an isolation forest is revisited to preserve the data's inner structure without affecting the efficiency of the outlier detection. Experiments conducted on both synthetic and real-world data sets show that, in addition to improving the detection of abnormal data points, the proposed variant of isolation forest allows for a reconstruction of the subspaces of high density. Therefore, the proposed variant can serve as a basis for a unified approach to detecting global and local anomalies, which is a necessary condition for providing users with informative descriptions of the data.
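For the anomaly-detection half of the task, a minimal sketch with the standard scikit-learn IsolationForest is shown below; the clustering and density-reconstruction behaviour is specific to the proposed variant and is not reproduced here, and the synthetic data and parameters are illustrative.

```python
# Hedged sketch: standard isolation forest for anomaly detection on synthetic data with
# two dense regions plus scattered outliers. The paper's variant additionally preserves
# the inner structure (clusters); that part is not shown.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
                     rng.normal(5, 0.5, size=(100, 2))])     # two dense regions
outliers = rng.uniform(-4, 9, size=(10, 2))                  # scattered anomalies
X = np.vstack([inliers, outliers])

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)
labels = forest.predict(X)                                    # +1 = regular point, -1 = anomaly
print("flagged anomalies:", int((labels == -1).sum()))
print("anomaly scores (lower = more abnormal):", forest.score_samples(X)[:3])
```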

Citations: 0
The unresolved need for dependable guarantees on security, sovereignty, and trust in data ecosystems
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-03-19 | DOI: 10.1016/j.datak.2024.102301
Johannes Lohmöller, Jan Pennekamp, Roman Matzutt, Carolin Victoria Schneider, Eduard Vlad, Christian Trautwein, Klaus Wehrle

Data ecosystems emerged as a new paradigm to facilitate the automated and massive exchange of data from heterogeneous information sources between different stakeholders. However, the corresponding benefits come with unforeseen risks as sensitive information is potentially exposed, questioning data ecosystem reliability. Consequently, data security is of utmost importance and, thus, a central requirement for successfully realizing data ecosystems. Academia has recognized this requirement, and current initiatives foster sovereign participation via a federated infrastructure where participants retain local control over what data they offer to whom. However, recent proposals place significant trust in remote infrastructure by implementing organizational security measures such as certification processes before the admission of a participant. At the same time, the data sensitivity incentivizes participants to bypass the organizational security measures to maximize their benefit. This issue significantly weakens security, sovereignty, and trust guarantees and highlights that organizational security measures are insufficient in this context. In this paper, we argue that data ecosystems must be extended with technical means to (re)establish dependable guarantees. We underpin this need with three representative use cases for data ecosystems, which cover personal, economic, and governmental data, and systematically map the lack of dependable guarantees in related work. To this end, we identify three enablers of dependable guarantees, namely trusted remote policy enforcement, verifiable data tracking, and integration of resource-constrained participants. These enablers are critical for securely implementing data ecosystems in data-sensitive contexts.

Citations: 0
Insights into commonalities of a sample: A visualization framework to explore unusual subset-dataset relationships
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-03-12 | DOI: 10.1016/j.datak.2024.102299
Nikolas Stege, Michael H. Breitner

Domain experts are driven by business needs, while data analysts develop and use various algorithms, methods, and tools, but often without domain knowledge. A major challenge for companies and organizations is to integrate data analytics in business processes and workflows. We deduce an interactive process and visualization framework to enable value-creating collaboration in inter- and cross-disciplinary teams. Domain experts and data analysts are both empowered to analyze and discuss results and come to well-founded insights and implications. Inspired by a typical auditing problem, we develop and apply a visualization framework to single out unusual data in general subsets for potential further investigation. Our framework is applicable to unusual data detected either manually by domain experts or by algorithms applied by data analysts. Application examples show typical interaction, collaboration, visualization, and decision support.

Citations: 0
Time-aware structure matching for temporal knowledge graph alignment
IF 2.5 | CAS Tier 3, Computer Science | Q2, Decision Sciences | Pub Date: 2024-03-11 | DOI: 10.1016/j.datak.2024.102300
Wei Jia, Ruizhe Ma, Li Yan, Weinan Niu, Zongmin Ma

Entity alignment, aiming at identifying equivalent entity pairs across multiple knowledge graphs (KGs), serves as a vital step for knowledge fusion. As the majority of KGs undergo continuous evolution, existing solutions utilize graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, this prevailing method often overlooks the consequential impact of relation embedding generation on entity embeddings through inherent structures. In this paper, we propose a novel model named Time-aware Structure Matching based on GNNs (TSM-GNN) that encompasses the learning of both topological and inherent structures. Our key innovation lies in a unique method for generating relation embeddings, which can enhance entity embeddings via inherent structure. Specifically, we utilize the translation property of knowledge graphs to obtain the entity embedding that is mapped into a time-aware vector space. Subsequently, we employ GNNs to learn global entity representation. To better capture the useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.
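A small sketch of a time-aware translational score in the spirit of this description is given below: the relation embedding is shifted by a timestamp embedding before applying the usual h + r ≈ t constraint. The embedding dimensions and the additive composition are assumptions for illustration, not the exact TSM-GNN formulation.

```python
# Hedged sketch of a time-aware translational (TransE-style) score: the relation embedding
# is translated by a timestamp embedding, mapping the triple into a time-aware space.
# Dimensions and the additive composition are illustrative assumptions.
import torch
import torch.nn as nn

class TimeAwareTransE(nn.Module):
    def __init__(self, n_ent, n_rel, n_time, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_ent, dim)
        self.rel = nn.Embedding(n_rel, dim)
        self.time = nn.Embedding(n_time, dim)

    def score(self, h, r, t, ts):
        # time-aware relation: relation embedding shifted by the timestamp embedding
        r_time = self.rel(r) + self.time(ts)
        return -torch.norm(self.ent(h) + r_time - self.ent(t), p=1, dim=-1)

model = TimeAwareTransE(n_ent=1000, n_rel=50, n_time=365)
h, r, t, ts = torch.tensor([0]), torch.tensor([3]), torch.tensor([7]), torch.tensor([120])
print(model.score(h, r, t, ts))   # higher (less negative) = more plausible quadruple
```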

Citations: 0