In general, supervised Machine Learning approaches using labeled training data currently promise the best results with respect to classification accuracy. Data annotation is therefore a key component of most Machine Learning projects. However, creating labels for a training data set is often an elaborate undertaking involving arduous and repetitive work, which is why data scientists frequently try to minimize the annotation effort by automating the annotation process itself. In this paper, we present a case study of two data annotation projects on the same data set of support tickets and compare them: one using human annotators and the other using algorithmic Learning Functions in a combination of Active Learning and Weak Supervision. We achieved a weighted confidence score of >94 % for the human-created labels, while also achieving up to 92 % agreement between the labels of our automated project and the human-created labels, requiring only 10 % human annotation as the starting input for the automated approach. Additionally, we were able to reproduce the value of 85 % for initial human classification accuracy in support ticket distribution reported in previous papers. We close with a reflection on the value of business understanding in data annotation projects and on the problem of ticket ambiguity together with proposed solutions.
