Pub Date: 2024-05-01 · DOI: 10.1016/j.datak.2024.102306
Saman Jamshidi, Mahin Mohammadi, Saeed Bagheri, Hamid Esmaeili Najafabadi, Alireza Rezvanian, Mehdi Gheisari, Mustafa Ghaderzadeh, Amir Shahab Shahabi, Zongda Wu
Text classification plays a critical role in managing large volumes of electronically produced texts. As the number of such texts increases, manual analysis becomes impractical, necessitating an intelligent approach for processing information. Deep learning models have seen widespread application in text classification, including recurrent neural networks such as Many-to-One Long Short-Term Memory (MTO LSTM). Nonetheless, this model is limited by its reliance on only the last token for text labelling. To overcome this limitation, this study introduces a novel hybrid model that combines Bidirectional Encoder Representations from Transformers (BERT), Many-to-Many Long Short-Term Memory (MTM LSTM), and Decision Templates (DT) for text classification. In this new model, the text is first embedded using BERT, an MTM LSTM is then trained to approximate the target at each token, and finally the per-token approximations are fused using DT. The proposed model is evaluated on the well-known IMDB movie review dataset for binary classification and the Drug Review Dataset for multiclass classification. The results demonstrate superior performance in terms of accuracy, recall, precision, and F1 score compared to previous models. The hybrid model presented in this study holds significant potential for a wide range of text classification tasks and stands as a valuable contribution to the field.
Title: Effective text classification using BERT, MTM LSTM, and DT
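The Decision Template fusion step of the abstract above can be sketched in a few lines: each training text yields a decision profile (one soft prediction per token), the template for a class is the mean profile of that class, and a test text is assigned the class of the nearest template. This is a minimal illustration of the standard DT technique with invented toy numbers, not the authors' implementation; all names and shapes here are assumptions.

```python
# Decision Template (DT) fusion over per-token soft predictions, as used to
# fuse MTM LSTM token-level outputs. Toy data: 3 tokens (T=3), 2 classes (C=2).

def mean_profile(profiles):
    """Element-wise mean of a list of decision profiles (T x C matrices)."""
    T, C = len(profiles[0]), len(profiles[0][0])
    return [[sum(p[t][c] for p in profiles) / len(profiles) for c in range(C)]
            for t in range(T)]

def distance(p, q):
    """Squared Euclidean distance between two decision profiles."""
    return sum((a - b) ** 2 for row_p, row_q in zip(p, q)
               for a, b in zip(row_p, row_q))

def fit_templates(profiles, labels, n_classes):
    """One template per class: the mean profile of that class's samples."""
    return [mean_profile([p for p, y in zip(profiles, labels) if y == k])
            for k in range(n_classes)]

def predict(templates, profile):
    """Assign the class whose template is nearest to the test profile."""
    dists = [distance(t, profile) for t in templates]
    return dists.index(min(dists))

# Each sample: (per-token softmax outputs, class label).
train = [([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]], 0),
         ([[0.8, 0.2], [0.9, 0.1], [0.6, 0.4]], 0),
         ([[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]], 1),
         ([[0.3, 0.7], [0.2, 0.8], [0.4, 0.6]], 1)]
templates = fit_templates([p for p, _ in train], [y for _, y in train], 2)
print(predict(templates, [[0.85, 0.15], [0.75, 0.25], [0.65, 0.35]]))  # 0
```

In the paper the profiles would come from the trained MTM LSTM rather than being hand-written, but the fusion rule is the same nearest-template match.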
Pub Date: 2024-05-01 · DOI: 10.1016/j.datak.2024.102307
José Antonio García-Díaz, Ghassan Beydoun, Rafel Valencia-García
Author profiling extracts authors' demographic and psychographic information by examining their writings. This information can then be used to improve the reader experience and to detect bots or propagators of hoaxes and/or hate speech. Author profiling can therefore be applied to build more robust and efficient Knowledge-Based Systems for tasks such as content moderation, user profiling, and information retrieval. It is typically performed automatically as a document classification task, and language models based on transformers have recently proven quite effective at it. However, the size and heterogeneity of novel language models make it necessary to evaluate them in context. The contributions of this paper are four-fold. First, we evaluate which language models are best suited to author profiling in Spanish; these experiments include basic, distilled, and multilingual models. Second, we evaluate how feature integration can improve performance on this task, comparing two distinct strategies: knowledge integration and ensemble learning. Third, we evaluate the ability of linguistic features to improve the interpretability of the results. Fourth, we evaluate the performance of each language model in terms of memory, training time, and inference time. Our results indicate that lightweight models can achieve performance similar to heavy models and that multilingual models are less effective than models trained on a single language. Finally, we confirm that the best models and feature-integration strategies ultimately depend on the context of the task.
Title: Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish
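One of the two feature-integration strategies the abstract names, ensemble learning, can be illustrated with a simple soft-voting rule: average the class-probability vectors of a transformer-based classifier and a linguistic-feature classifier, then take the argmax. The probability vectors and weights below are invented for illustration; in the paper they would come from the trained models being compared.

```python
# Soft-voting ensemble sketch: fuse per-model class probabilities by a
# (weighted) average and predict the argmax class.

def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model class-probability vectors."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_lists[0])
    fused = [sum(w * p[c] for w, p in zip(weights, prob_lists))
             for c in range(n_classes)]
    return fused.index(max(fused)), fused

# Transformer model leans to class 1; linguistic-feature model is unsure.
transformer_probs = [0.30, 0.70]
linguistic_probs = [0.55, 0.45]
label, fused = soft_vote([transformer_probs, linguistic_probs])
print(label)  # 1
```

The alternative strategy, knowledge integration, would instead concatenate the linguistic feature vector with the transformer embedding before a single classifier; soft voting keeps the two models independent, which is often simpler to tune.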
Pub Date: 2024-05-01 · DOI: 10.1016/j.datak.2024.102312
R. Rajesh
We observe and analyze the causal relations among risk factors in a system, considering manufacturing supply chains. Seven major categories of risks were identified and scrutinized, and a detailed analysis of their causal relations using the grey influence analysis (GINA) methodology is outlined. Using an expert-response-based survey, we conduct an initial analysis of the risks with risk matrix analysis (RMA) and identify the high-priority risks. GINA is then applied to understand the causal relations among the risk categories, which is particularly useful in group decision-making environments. The RMA results place capacity risks (CR) and delays (DL) in the category of very high priority risks. The GINA results ratify the RMA conclusions and indicate that managers need to control and manage capacity risks (CR) and delays (DL) with high priority. Additionally, the GINA results show that the causal factors disruptions (DS) and forecast risks (FR) are of primary importance and, if unattended, can trigger several other risks in supply chains. Managers are recommended to identify disruptions at an early stage and to reduce forecast errors to avoid bullwhip effects in supply chains.
Title: Managerial risk data analytics applications using grey influence analysis (GINA)
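The RMA step described above can be sketched as a likelihood-times-impact scoring banded into priority levels. The 1-5 rating scale, the band thresholds, and the example scores below are all assumptions chosen for illustration (consistent with, but not taken from, the paper's survey data).

```python
# Minimal risk matrix analysis (RMA) sketch: score each risk category by
# likelihood x impact and band the product into priority levels.

def rma_priority(likelihood, impact):
    """Return a priority band for 1-5 likelihood and impact ratings."""
    score = likelihood * impact
    if score >= 20:
        return "very high"
    if score >= 12:
        return "high"
    if score >= 6:
        return "medium"
    return "low"

# Hypothetical (likelihood, impact) ratings for four of the risk categories.
risks = {"capacity risks (CR)": (5, 4), "delays (DL)": (4, 5),
         "disruptions (DS)": (3, 5), "forecast risks (FR)": (4, 3)}
for name, (l, i) in risks.items():
    print(f"{name}: {rma_priority(l, i)}")
```

GINA then goes further than this static matrix by modeling how one category's occurrence influences the others, which is what surfaces DS and FR as causal drivers.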
Pub Date: 2024-04-30 · DOI: 10.1016/j.datak.2024.102308
Ramla Belalta, Mouhoub Belazzoug, Farid Meziane
Disambiguating name mentions in texts is a crucial task in Natural Language Processing, especially in entity linking, and the credibility and efficiency of such systems depend largely on it. A given named entity mention in a text may correspond to many candidate entities in the knowledge base, so assigning the correct candidate from the whole set is very difficult. Collective entity disambiguation is a prominent approach to this problem. In this paper, we present a novel algorithm for collective entity disambiguation, called CPSR, based on a graph approach and semantic relatedness. A clique partitioning algorithm is used to find the best clique containing a set of candidate entities, and these candidates provide the answers to the corresponding mentions in the disambiguation process. To evaluate our algorithm, we carried out a series of experiments on seven well-known datasets, namely AIDA/CoNLL2003-TestB, IITB, MSNBC, AQUAINT, ACE2004, Cweb, and Wiki, with the Kensho Derived Wikimedia Dataset (KDWD) as the knowledge base. The experimental results show that our CPSR algorithm outperforms both the baselines and other well-known state-of-the-art approaches.
Title: A graph based named entity disambiguation using clique partitioning and semantic relatedness
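The collective idea behind a clique-based disambiguator can be sketched as follows: build a k-partite graph with one candidate set per mention, weight edges by semantic relatedness, and select one candidate per mention so total pairwise relatedness is maximal. The brute-force search below stands in for the paper's CPSR clique-partitioning algorithm, and the relatedness scores are invented for illustration.

```python
# Collective entity disambiguation as a best-clique search: choose one
# candidate entity per mention maximizing the sum of pairwise relatedness.
from itertools import combinations, product

def best_assignment(candidates, relatedness):
    """candidates: {mention: [entity, ...]};
    relatedness: {(e1, e2) sorted tuple: score}."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Every combination of candidates is a clique in the k-partite graph.
    for combo in product(*(candidates[m] for m in mentions)):
        score = sum(relatedness.get(tuple(sorted(pair)), 0.0)
                    for pair in combinations(combo, 2))
        if score > best_score:
            best, best_score = dict(zip(mentions, combo)), score
    return best

cands = {"Paris": ["Paris_France", "Paris_Texas"],
         "Seine": ["Seine_River"]}
rel = {("Paris_France", "Seine_River"): 0.9,
       ("Paris_Texas", "Seine_River"): 0.1}
print(best_assignment(cands, rel))  # Paris resolves to Paris_France
```

Real systems avoid the exponential product by partitioning or pruning the candidate graph, which is precisely where a clique partitioning algorithm earns its keep.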
Pub Date: 2024-04-15 · DOI: 10.1016/j.datak.2024.102305
Thomas Gerald, Louis Tamames, Sofiane Ettayeb, Ha-Quang Le, Patrick Paroubek, Anne Vilnat
Generating education-related questions and answers remains an open problem, despite its usefulness for students, teachers, and teaching aids. Given textual course material, we are interested in generating non-factual questions that require an elaborate answer relying on analysis or reasoning. Despite the availability of annotated corpora of questions and answers, developing a generator using deep learning faces two main challenges. First, freely accessible data of sufficient quality are too scarce to train generative approaches. Second, for a stand-alone application, there is no explicit support to guide the generation toward complex questions. To tackle the first issue, we propose a new corpus based on education documents. For the second, we study several retargetable language algorithms that produce answers by extracting text spans from contextual documents to help question generation. We particularly study the contribution of deep neural syntactic parsing and transformer-based semantic representation, taking into account the question type (according to our specific question typology) and the contextual support text span. Additionally, recent advances in generation models have proven the efficiency of the instruction-based approach for natural language generation. Consequently, we present a first investigation of very large language models for generating questions in the education domain.
Title: CQuAE: A new Contextualized QUestion Answering corpus on Education domain
Pub Date: 2024-04-03 · DOI: 10.1016/j.datak.2024.102304
Tim Kreuzer, Panagiotis Papapetrou, Jelena Zdravkovic
Artificial intelligence and digital twins have become more popular in recent years and have seen usage across different application domains for various scenarios. This study reviews the literature at the intersection of the two fields, where digital twins integrate an artificial intelligence component. We follow a systematic literature review approach, analyzing a total of 149 related studies. In the assessed literature, a variety of problems are approached with an artificial intelligence-integrated digital twin, demonstrating its applicability across different fields. Our findings indicate that there is a lack of in-depth modeling approaches regarding the digital twin, while many articles focus on the implementation and testing of the artificial intelligence component. The majority of publications do not demonstrate a virtual-to-physical connection between the digital twin and the real-world system. Further, only a small portion of studies base their digital twin on real-time data from a physical system, implementing a physical-to-virtual connection.
Title: Artificial intelligence in digital twins—A systematic literature review
Understanding why some points in a data set are considered anomalies cannot be done without taking into account the structure of the regular points. Whereas many machine learning methods are dedicated either to the identification of anomalies or to the identification of the data's inner structure, we introduce a solution that addresses both tasks with a single data model, a variant of an isolation forest. The initial algorithm for constructing an isolation forest is revisited to preserve the data's inner structure without affecting the efficiency of the outlier detection. Experiments conducted on both synthetic and real-world data sets show that, in addition to improving the detection of abnormal data points, the proposed variant of isolation forest allows for a reconstruction of the subspaces of high density. It can therefore serve as a basis for a unified approach to detecting global and local anomalies, which is a necessary condition for providing users with informative descriptions of the data.
Title: Leveraging an Isolation Forest to Anomaly Detection and Data Clustering
Véronne Yepmo, Grégory Smits, Marie-Jeanne Lesot, Olivier Pivert
Pub Date: 2024-03-28 · DOI: 10.1016/j.datak.2024.102302
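The baseline the paper builds on can be shown with a minimal, one-dimensional isolation forest: random splits isolate outliers in fewer steps, so a shorter average path length means a more anomalous point. This is the standard algorithm only, not the paper's structure-preserving variant, and all parameters below are illustrative assumptions.

```python
# Minimal 1-D isolation-forest sketch: anomalies isolate in fewer random
# splits, so a shorter average path length means a more anomalous point.
import random

def build_tree(points, depth=0, max_depth=8):
    """Recursively split points at a uniform random threshold."""
    if len(points) <= 1 or depth >= max_depth:
        return len(points)  # leaf marker
    lo, hi = min(points), max(points)
    if lo == hi:
        return len(points)
    split = random.uniform(lo, hi)
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    return (split, build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf of the tree."""
    if not isinstance(tree, tuple):
        return depth
    split, left, right = tree
    return path_length(left if x < split else right, x, depth + 1)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)]  # dense cluster near 0
forest = [build_tree(data) for _ in range(50)]

def score(x):
    """Average path length over the forest (lower = more anomalous)."""
    return sum(path_length(t, x) for t in forest) / len(forest)

print(score(10.0) < score(0.0))  # the far-out point isolates sooner: True
```

The paper's variant changes how the trees are grown so that dense subspaces survive in the leaves, enabling clustering from the same structure; the scoring idea above is unchanged.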
Pub Date: 2024-03-19 · DOI: 10.1016/j.datak.2024.102301
Johannes Lohmöller, Jan Pennekamp, Roman Matzutt, Carolin Victoria Schneider, Eduard Vlad, Christian Trautwein, Klaus Wehrle
Data ecosystems emerged as a new paradigm to facilitate the automated and massive exchange of data from heterogeneous information sources between different stakeholders. However, the corresponding benefits come with unforeseen risks as sensitive information is potentially exposed, questioning data ecosystem reliability. Consequently, data security is of utmost importance and, thus, a central requirement for successfully realizing data ecosystems. Academia has recognized this requirement, and current initiatives foster sovereign participation via a federated infrastructure where participants retain local control over what data they offer to whom. However, recent proposals place significant trust in remote infrastructure by implementing organizational security measures such as certification processes before the admission of a participant. At the same time, the data sensitivity incentivizes participants to bypass the organizational security measures to maximize their benefit. This issue significantly weakens security, sovereignty, and trust guarantees and highlights that organizational security measures are insufficient in this context. In this paper, we argue that data ecosystems must be extended with technical means to (re)establish dependable guarantees. We underpin this need with three representative use cases for data ecosystems, which cover personal, economic, and governmental data, and systematically map the lack of dependable guarantees in related work. To this end, we identify three enablers of dependable guarantees, namely trusted remote policy enforcement, verifiable data tracking, and integration of resource-constrained participants. These enablers are critical for securely implementing data ecosystems in data-sensitive contexts.
Title: The unresolved need for dependable guarantees on security, sovereignty, and trust in data ecosystems
Pub Date: 2024-03-12 · DOI: 10.1016/j.datak.2024.102299
Nikolas Stege, Michael H. Breitner
Domain experts are driven by business needs, while data analysts develop and use various algorithms, methods, and tools, often without domain knowledge. A major challenge for companies and organizations is to integrate data analytics into business processes and workflows. We deduce an interactive process and visualization framework that enables value-creating collaboration in inter- and cross-disciplinary teams: domain experts and data analysts are both empowered to analyze and discuss results and arrive at well-founded insights and implications. Inspired by a typical auditing problem, we develop and apply a visualization framework to single out unusual data in general subsets for potential further investigation. Our framework is applicable both to unusual data detected manually by domain experts and to data flagged by algorithms applied by data analysts. Application examples show typical interaction, collaboration, visualization, and decision support.
Title: Insights into commonalities of a sample: A visualization framework to explore unusual subset-dataset relationships
Pub Date : 2024-03-11 DOI: 10.1016/j.datak.2024.102300
Wei Jia , Ruizhe Ma , Li Yan , Weinan Niu , Zongmin Ma
Entity alignment, which aims to identify equivalent entity pairs across multiple knowledge graphs (KGs), is a vital step in knowledge fusion. Because most KGs evolve continuously, existing solutions use graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, these methods often overlook the impact that relation embedding generation has on entity embeddings through inherent structures. In this paper, we propose a novel model, Time-aware Structure Matching based on GNNs (TSM-GNN), that learns both topological and inherent structures. Our key innovation is a method for generating relation embeddings that enhances entity embeddings via the inherent structure. Specifically, we exploit the translation property of knowledge graphs (head + relation ≈ tail) to obtain entity embeddings mapped into a time-aware vector space, and then employ GNNs to learn global entity representations. To better capture useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.
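The time-aware attention idea above can be sketched minimally: each neighbor's embedding is combined with a temporal embedding, scored against the central entity, and the scores are softmax-normalized into importance weights. This is an illustrative NumPy sketch, not the paper's exact formulation; the function names, the additive time injection, and the bilinear scoring matrix `w` are all assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def time_aware_attention(entity_emb, neighbor_embs, time_embs, w):
    """Aggregate neighbor embeddings, weighting each neighbor by how well
    its time-aware feature matches the central entity (hypothetical layer,
    loosely following the time-aware attention described in the abstract)."""
    # Inject temporal information additively into each neighbor feature.
    feats = neighbor_embs + time_embs        # shape (n, d)
    # Bilinear attention scores against the central entity embedding.
    scores = feats @ w @ entity_emb          # shape (n,)
    alpha = softmax(scores)                  # importance weights, sum to 1
    # Weighted aggregation of the time-aware neighbor features.
    return alpha @ feats                     # shape (d,)

rng = np.random.default_rng(0)
d, n = 8, 5
out = time_aware_attention(rng.normal(size=d),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(d, d)))
```

In a full model this aggregation would run per GNN layer and the weights `w` would be learned; here everything is random purely to show the shapes involved.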
{"title":"Time-aware structure matching for temporal knowledge graph alignment","authors":"Wei Jia , Ruizhe Ma , Li Yan , Weinan Niu , Zongmin Ma","doi":"10.1016/j.datak.2024.102300","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102300","url":null,"abstract":"<div><p>Entity alignment, aiming at identifying equivalent entity pairs across multiple knowledge graphs (KGs), serves as a vital step for knowledge fusion. As the majority of KGs undergo continuous evolution, existing solutions utilize graph neural networks (GNNs) to tackle entity alignment within temporal knowledge graphs (TKGs). However, this prevailing method often overlooks the consequential impact of relation embedding generation on entity embeddings through inherent structures. In this paper, we propose a novel model named Time-aware Structure Matching based on GNNs (TSM-GNN) that encompasses the learning of both topological and inherent structures. Our key innovation lies in a unique method for generating relation embeddings, which can enhance entity embeddings via inherent structure. Specifically, we utilize the translation property of knowledge graphs to obtain the entity embedding that is mapped into a time-aware vector space. Subsequently, we employ GNNs to learn global entity representation. To better capture the useful information from neighboring relations and entities, we introduce a time-aware attention mechanism that assigns different importance weights to different time-aware inherent structures. Experimental results on three real-world datasets demonstrate that TSM-GNN outperforms several state-of-the-art approaches for entity alignment between TKGs.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2024-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140138228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}