Pub Date: 2024-03-11 | DOI: 10.1016/j.datak.2024.102300
Wei Jia, Ruizhe Ma, Li Yan, Weinan Niu, Zongmin Ma
Entity alignment, which aims to identify equivalent entity pairs across multiple knowledge graphs (KGs), is a vital step for knowledge fusion. As the majority of KGs undergo continuous evolution, existing solutions utilize graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, these prevailing methods often overlook the impact of relation embedding generation on entity embeddings through inherent structures. In this paper, we propose a novel model named Time-aware Structure Matching based on GNNs (TSM-GNN) that learns both topological and inherent structures. Our key innovation is a unique method for generating relation embeddings, which can enhance entity embeddings via the inherent structure. Specifically, we utilize the translation property of knowledge graphs to obtain entity embeddings mapped into a time-aware vector space. Subsequently, we employ GNNs to learn global entity representations. To better capture useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.
Title: Time-aware structure matching for temporal knowledge graph alignment (Data & Knowledge Engineering, vol. 151, Article 102300).
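The translation property mentioned in the abstract can be illustrated with a small, hypothetical sketch: entities are projected into a time-aware vector space via a timestamp embedding before applying the familiar h + r ≈ t translation. This is not the authors' TSM-GNN model (which additionally uses GNN layers and time-aware attention); the elementwise projection below is one common choice from the TKG-embedding literature, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
N_ENT, N_REL, N_TIME = 5, 3, 4  # toy sizes; a real TKG has far more

entity_emb = rng.normal(size=(N_ENT, DIM))
relation_emb = rng.normal(size=(N_REL, DIM))
time_emb = rng.normal(size=(N_TIME, DIM))

def time_aware_score(head, rel, tail, ts):
    """Translation score in a time-aware space: project head and tail
    entities into the space of timestamp `ts` (elementwise product, one
    common choice), then measure how well head + relation ~= tail holds.
    Lower scores mean a more plausible (head, rel, tail, ts) fact."""
    h = entity_emb[head] * time_emb[ts]
    t = entity_emb[tail] * time_emb[ts]
    return float(np.linalg.norm(h + relation_emb[rel] - t))

score = time_aware_score(head=0, rel=1, tail=2, ts=3)
```

Because the timestamp enters through a product rather than an additive shift, the same (head, rel, tail) triple scores differently at different timestamps, which is the point of a time-aware space.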
Pub Date: 2024-02-29 | DOI: 10.1016/j.datak.2024.102286
Marcos Da Silveira, Louis Deladiennee, Emmanuel Scolan, Cedric Pruski
The ever-increasing interest of academia, industry, and government institutions in space resource information highlights the difficulty of finding, accessing, integrating, and reusing this information. Although information is regularly published on the internet, it is disseminated on many different websites and in different formats, including scientific publications, patents, news, and reports. We are currently developing a knowledge management and sharing platform for space resources. This tool, which relies on the combined use of knowledge graphs and ontologies, formalises the domain knowledge contained in the above-mentioned documents and makes it more readily available to the community. In this article, we describe the concepts and techniques of knowledge extraction and management adopted during the design and implementation of the platform.
Title: A knowledge-sharing platform for space resources (Data & Knowledge Engineering, vol. 151, Article 102286).
Pub Date: 2024-02-28 | DOI: 10.1016/j.datak.2024.102285
Franck Anaël Mbiaya , Christel Vrain , Frédéric Ros , Thi-Bich-Hanh Dao , Yves Lucas
This paper introduces a deep learning method for image classification that leverages knowledge formalized as a graph built from attribute/value pairs. The proposed method uses a loss function that adaptively combines the classical cross-entropy commonly used in deep learning with a novel penalty function. The penalty is derived from the node representations obtained by embedding the knowledge graph and incorporates the proximity between class and image nodes. Its formulation enables the model to focus on identifying the boundary between the classes that are most challenging to distinguish. Experimental results on several image databases demonstrate improved performance compared to state-of-the-art methods, including classical deep learning algorithms and recent algorithms that incorporate knowledge represented by a graph.
Title: Knowledge graph-based image classification (Data & Knowledge Engineering, vol. 151, Article 102285).
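The combined loss described above (cross-entropy plus a knowledge-graph proximity penalty) can be sketched as follows. The paper's exact penalty and adaptive weighting are not reproduced here; the squared distance between an image node's embedding and its ground-truth class node's embedding, weighted by a fixed `lam`, is an illustrative stand-in.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def graph_penalty_loss(logits, labels, image_node_emb, class_node_emb, lam=0.5):
    """Cross-entropy plus a proximity penalty in the knowledge-graph
    embedding space: each image node is pulled toward the node of its
    ground-truth class (illustrative form, not the paper's exact term)."""
    n = len(labels)
    probs = softmax(logits)
    ce = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
    penalty = np.mean(np.sum((image_node_emb - class_node_emb[labels]) ** 2, axis=1))
    return ce + lam * penalty

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))        # 4 images, 3 classes
labels = np.array([0, 1, 2, 1])
image_nodes = rng.normal(size=(4, 6))   # image-node embeddings
class_nodes = rng.normal(size=(3, 6))   # class-node embeddings
loss = graph_penalty_loss(logits, labels, image_nodes, class_nodes)
```

Setting `lam=0` recovers plain cross-entropy; increasing it trades classification fit against graph proximity.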
Pub Date: 2024-02-09 | DOI: 10.1016/j.datak.2024.102284
Mireia Costa, Ana León, Óscar Pastor
Alzheimer's disease is the most common type of dementia in the elderly. Nevertheless, there is an early onset form that is difficult to diagnose precisely. As the genetic component is the most critical factor in developing this disease, identifying relevant genetic variants is key to obtaining a more reliable and straightforward diagnosis. The information about these variants is stored in an extensive number of data sources, which must be carefully analyzed to select only the information with sufficient quality to be used in a clinical setting. This selection has become complex due to the increasing amount of available genomic information. The SILE method was designed to systematize the identification of relevant variants for a disease in this challenging context. However, several problems in how SILE identifies relevant variants were discovered when applying the method to the early onset form of Alzheimer's disease. More specifically, the method failed to address specific features of this disease, such as its low incidence and familial component. This paper proposes an improvement of the identification process defined by the SILE method to make it applicable to a broader spectrum of diseases. Details of how the proposed solution has been applied are also reported. As a result of this improvement, a set of 29 variants has been identified (25 variants Accepted with Limited Evidence and 4 Accepted with Moderate Evidence). This constitutes a valuable result that facilitates and reinforces the genetic diagnosis of the disease.
Title: Improving the identification of relevant variants in genome information systems: A methodological approach with a case study on early onset Alzheimer's disease (Data & Knowledge Engineering, vol. 151, Article 102284).
Pub Date: 2024-02-04 | DOI: 10.1016/j.datak.2024.102278
Huma Parveen, Syed Wajahat Abbas Rizvi, Raja Sarath Kumar Boddu
Modern medical analysis is a complex procedure, requiring precise patient data, scientific knowledge accumulated over many years, and a theoretical understanding of the related medical literature. To improve accuracy and reduce diagnosis time, clinical decision support systems (DSS) were introduced, which incorporate data mining schemes to enhance disease-diagnosis accuracy. This work proposes a new disease-prediction model that involves three stages. Initially, improved stemming and tokenization are carried out in the pre-processing stage. Then, fuzzy-ontology, improved mutual information (MI), and correlation features are extracted. Prediction is then carried out via an ensemble of classifiers comprising improved fuzzy logic, Long Short-Term Memory (LSTM), a Deep Convolutional Neural Network (DCNN), and a Bidirectional Gated Recurrent Unit (Bi-GRU). The outcomes from improved fuzzy logic, LSTM, and DCNN are further classified via the Bi-GRU, which produces the final prediction. Specifically, the Bi-GRU weights are optimally tuned using Deer Hunting Update Explored Arithmetic Optimization (DHUEAO). Finally, the efficiency of the proposed work is evaluated with respect to a variety of metrics.
Title: Fuzzy-Ontology based knowledge driven disease risk level prediction with optimization assisted ensemble classifier (Data & Knowledge Engineering, vol. 151, Article 102278).
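Structurally, the ensemble described above is a stacking arrangement: three base models each emit class probabilities, and a fourth model classifies over their concatenated outputs. The sketch below uses untrained linear scorers as stand-ins for the paper's improved fuzzy logic, LSTM, DCNN, and Bi-GRU components; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def linear_model(in_dim, n_classes):
    """Stand-in base learner: a random (untrained) linear scorer."""
    W = rng.normal(scale=0.1, size=(in_dim, n_classes))
    return lambda X: softmax(X @ W)

N_FEATURES, N_CLASSES = 12, 3
base_models = [linear_model(N_FEATURES, N_CLASSES) for _ in range(3)]
# The meta-classifier consumes the concatenated base-model probabilities,
# mirroring how the Bi-GRU re-classifies the three base outputs.
meta_model = linear_model(3 * N_CLASSES, N_CLASSES)

def ensemble_predict(X):
    base_out = np.hstack([m(X) for m in base_models])
    return meta_model(base_out)

X = rng.normal(size=(5, N_FEATURES))  # 5 patient feature vectors (toy data)
probs = ensemble_predict(X)
```

In the paper, the meta-model's weights are additionally tuned by the DHUEAO metaheuristic, which this sketch does not attempt.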
Pub Date: 2024-02-03 | DOI: 10.1016/j.datak.2024.102283
Junrui Liu, Tong Li, Zhen Yang, Di Wu, Huan Liu
Recommendation methods improve rating-prediction performance by learning the selection-bias phenomenon: users tend to rate items they like. These methods model selection bias by calculating the propensities of ratings, but inaccurate propensities can introduce more noise, fail to model selection bias, and reduce prediction performance. We argue that learning interaction features can effectively model selection bias and improve model performance, as interaction features explain the reason for this trend. Reviews can be used to model interaction features because they have a strong intrinsic correlation with user interests and item interactions. In this study, we propose a preference- and bias-oriented fusion learning model (PBFL) that models interaction features based on reviews and user preferences to make rating predictions. Our proposal both embeds traditional user preferences in reviews, interactions, and ratings and considers word distribution bias and review quoting to model interaction features. Six real-world datasets are used to demonstrate effectiveness and performance. PBFL achieves an average improvement of 4.46% in root-mean-square error (RMSE) and 3.86% in mean absolute error (MAE) over the best baseline.
Title: Fusion learning of preference and bias from ratings and reviews for item recommendation (Data & Knowledge Engineering, vol. 150, Article 102283).
Pub Date: 2024-02-02 | DOI: 10.1016/j.datak.2024.102282
Nitin Newaliya, Yudhvir Singh
Clustering is an important data analytics technique with numerous use cases. It leads to insights and knowledge that would not be readily discernible from routine examination of the data. Enhancement of clustering techniques is an active field of research, with various optimisation models being proposed. Such enhancements are also undertaken to address particular issues faced in specific applications. This paper looks at a particular use case in the maritime domain and how an enhancement of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) results in the apt use of data analytics to solve a real-life issue. The passage of vessels over water is one of the significant utilisations of maritime regions. Trajectory analysis of these vessels helps provide valuable information; thus, maritime movement data and the knowledge extracted from it play an essential role in various applications, viz., assessing traffic densities, identifying traffic routes, reducing collision risks, etc. Optimised trajectory information would help enable safe and energy-efficient green operations at sea and assist autonomous operations of maritime systems and vehicles. Many studies focus on determining trajectory densities but miss individual trajectory granularities, and determining trajectories from the unique identities of vessels may also lead to errors. Using an unsupervised DBSCAN method to identify trajectories could help overcome these limitations. Further, to enhance outcomes and insights, the inclusion of temporal information along with additional parameters of Automatic Identification System (AIS) data in DBSCAN is proposed. Towards this, a new data analytics design and implementation, the Multivariate Hierarchical DBSCAN method, has been developed for better clustering of maritime movement data such as AIS; it determines granular information and individual trajectories in an unsupervised manner. Evaluation metrics show that this method performs better than other data-clustering techniques.
Title: Multivariate hierarchical DBSCAN model for enhanced maritime data analytics (Data & Knowledge Engineering, vol. 150, Article 102282).
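The core idea, density clustering over multivariate AIS-style feature vectors with time included as an additional coordinate, can be sketched with a minimal flat DBSCAN. This is not the hierarchical multivariate variant the paper develops; the feature choice, scaling, and parameter values below are hypothetical toy values.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns labels: -1 = noise, 0..k = cluster ids."""
    n = len(points)
    labels = np.full(n, -1)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # not an unvisited core point
        stack = [i]   # grow a new cluster from core point i
        while stack:
            j = stack.pop()
            if visited[j]:
                continue
            visited[j] = True
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:   # only core points expand
                stack.extend(neighbors[j])
        cluster += 1
    return labels

# AIS-like feature vectors: (lat, lon, hours-since-start) as a toy stand-in
# for the multivariate AIS attributes (position, time, speed, course).
track_a = np.column_stack([np.linspace(0, 1, 10),
                           np.linspace(0, 1, 10),
                           np.linspace(0, 2, 10)])
track_b = track_a + np.array([5.0, 5.0, 0.0])   # a parallel, distant track
noise = np.array([[10.0, -3.0, 7.0]])           # an isolated stray report
pts = np.vstack([track_a, track_b, noise])
labels = dbscan(pts, eps=0.5, min_pts=3)
```

With these values the two tracks form two clusters and the stray report is flagged as noise; including the time coordinate is what keeps two vessels on the same route at different times from merging into one trajectory.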
Pub Date: 2024-01-28 | DOI: 10.1016/j.datak.2023.102264
Seungkyu Park, Joong yoon Lee, Jooyeoun Lee
With the advancement of AI technology, successful AI adoption in organizations has become a top priority in modern society. However, many organizations still struggle to articulate the AI capabilities they need, and AI experts have difficulty understanding the problems these organizations face. This knowledge gap makes it difficult for organizations to identify the technical requirements, such as the necessary data and algorithms, for adopting AI. To overcome this problem, we propose a new AI system architecture design methodology based on the IMO (Input-AI Model-Output) structure. The IMO structure enables effective identification of the technical requirements necessary to develop real AI models. While previous research has identified the importance and challenges of technical requirements, such as data and AI algorithms, for AI adoption, there has been little research on methodology to concretize them. Our methodology comprises three stages, problem definition, system AI solution, and AI technical solution, to design the AI technology and requirements that organizations need at a system level. The effectiveness of our methodology is demonstrated through a case study, logical comparative analysis with other studies, and expert reviews, which show that it can support successful AI adoption in organizations.
Title: AI system architecture design methodology based on IMO (Input-AI Model-Output) structure for successful AI adoption in organizations (Data & Knowledge Engineering, vol. 150, Article 102264).
In recent years, Natural Language Processing (NLP) has made significant advances through advanced general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex situations, applied on specific-domain texts, such as the Education and professional training domain, general language embedding models often inadequately represent the unique terminologies and contextual nuances of a specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This innovative framework chains multiple classifiers, where each individual classifier is built using a novel sentence-embedding method BERTEPro based on existing Transformer models, whose pre-training has been extended on education and professional training texts, before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates local probability predictions augmented with the input feature vectors to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments over three domain-specific textual HMC datasets indicate the effectiveness of HMCCCProbT to compare favorably to state-of-the-art HMC algorithms in terms of classification accuracy and also the ability of BERTEPro to obtain better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
{"title":"A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification","authors":"Guillaume Lefebvre , Haytham Elghazel , Theodore Guillet , Alexandre Aussem , Matthieu Sonnati","doi":"10.1016/j.datak.2024.102281","DOIUrl":"10.1016/j.datak.2024.102281","url":null,"abstract":"<div><p><span>In recent years, Natural Language Processing<span> (NLP) has made significant advances through advanced general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification<span>. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex situations, applied on specific-domain texts, such as the Education and professional training domain, general language embedding models often inadequately represent the unique terminologies and contextual nuances of a specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This innovative framework chains multiple classifiers<span>, where each individual classifier is built using a novel sentence-embedding method BERTEPro based on existing Transformer models, whose pre-training has been extended on education and professional training texts, before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates local probability predictions augmented with the input feature vectors to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. 
Experiments over three domain-specific textual HMC datasets indicate the effectiveness of </span></span></span></span><span>HMCCCProbT</span><span> to compare favorably to state-of-the-art HMC algorithms<span> in terms of classification accuracy and also the ability of </span></span><span>BERTEPro</span> to obtain better probability predictions, well suited to <span>HMCCCProbT</span><span>, than three other vector representation techniques.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102281"},"PeriodicalIF":2.5,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139500576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
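The per-level classifier chaining described in this abstract can be sketched briefly: each level's classifier receives the document representation concatenated with the probability predictions of the previous level. This is a minimal illustration only, assuming scikit-learn logistic-regression classifiers and random stand-in vectors in place of the paper's BERTEPro sentence embeddings; all names and data here are hypothetical.

```python
# Sketch of chained per-level classification: level 2 is trained on the
# input features augmented with level 1's local probability predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_docs, dim = 200, 32
X = rng.normal(size=(n_docs, dim))          # stand-in for sentence embeddings
y_level1 = rng.integers(0, 3, size=n_docs)  # coarse labels (hierarchy level 1)
y_level2 = rng.integers(0, 6, size=n_docs)  # finer labels (hierarchy level 2)

# Level-1 classifier: predicts from the embeddings alone.
clf1 = LogisticRegression(max_iter=1000).fit(X, y_level1)
p1 = clf1.predict_proba(X)                  # local probability predictions

# Level-2 classifier: embeddings augmented with level-1 probabilities.
X_aug = np.hstack([X, p1])
clf2 = LogisticRegression(max_iter=1000).fit(X_aug, y_level2)
p2 = clf2.predict_proba(X_aug)

print(p1.shape, p2.shape)  # (200, 3) (200, 6)
```

In the paper's setting the chain would continue one classifier per hierarchical level, each consuming the augmented feature vector produced by the level above.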
Pub Date: 2024-01-10 DOI: 10.1016/j.datak.2024.102280
Ilka Jussen , Frederik Möller , Julia Schweihoff , Anna Gieß , Giulia Giussani , Boris Otto
Sharing data can greatly assist companies in internal optimization and in designing new products and services. While the benefits seem obvious, data sharing is accompanied by a spectrum of concerns, ranging from fears of giving away something of value, to uncertainty about what will happen to the data, to a simple lack of understanding of the short- and mid-term benefits. This article analyzes data sharing in inter-organizational relationships by examining 13 cases in a qualitative interview study and through an analysis of public data. Given the importance of inter-organizational data sharing, as indicated by large research initiatives such as Gaia-X and Catena-X, we explore issues arising in this process and formulate research challenges. We use the theoretical lens of Actor-Network Theory to analyze our data and connect its constructs with concepts in data sharing.
{"title":"Issues in inter-organizational data sharing: Findings from practice and research challenges","authors":"Ilka Jussen , Frederik Möller , Julia Schweihoff , Anna Gieß , Giulia Giussani , Boris Otto","doi":"10.1016/j.datak.2024.102280","DOIUrl":"10.1016/j.datak.2024.102280","url":null,"abstract":"<div><p>Sharing data is highly potent in assisting companies in internal optimization and designing new products and services. While the benefits seem obvious, sharing data is accompanied by a spectrum of concerns ranging from fears of sharing something of value, unawareness of what will happen to the data, or simply a lack of understanding of the short- and mid-term benefits. The article analyzes data sharing in inter-organizational relationships by examining 13 cases in a qualitative interview study and through public data analysis. Given the importance of inter-organizational data sharing as indicated by large research initiatives such as Gaia-X and Catena-X, we explore issues arising in this process and formulate research challenges. We use the theoretical lens of Actor-Network Theory to analyze our data and entangle its constructs with concepts in data sharing.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102280"},"PeriodicalIF":2.5,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000041/pdfft?md5=8cca34784bb0ed03de222b7dc6fbfc47&pid=1-s2.0-S0169023X24000041-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139412627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}