Pub Date: 2024-06-28. DOI: 10.1016/j.is.2024.102421
Dehua Liu, Selasi Kwashie, Yidi Zhang, Guangtong Zhou, Michael Bewong, Xiaoying Wu, Xi Guo, Keqing He, Zaiwen Feng
Graph entity dependencies (GEDs) are novel graph constraints for property graphs that unify keys and functional dependencies. They have been found useful in many real-world data quality and data management tasks, including fact checking on social media networks and entity resolution. In this paper, we study the discovery problem of GEDs, that is, finding a minimal cover of the valid GEDs in a given graph. We formalise the problem and propose an effective and efficient approach to overcome major bottlenecks in GED discovery. In particular, we leverage existing graph partitioning algorithms to enable fast GED-scope discovery, and employ effective pruning strategies over the prohibitively large space of candidate dependencies. Furthermore, we define an interestingness measure for GEDs based on the minimum description length principle, to score and rank the mined cover set of GEDs. Finally, we demonstrate the scalability and effectiveness of our GED discovery approach through extensive experiments on real-world benchmark graph data sets, and show the usefulness of the discovered rules in different downstream data quality management applications.
Title: An efficient approach for discovering Graph Entity Dependencies (GEDs). Journal: Information Systems, vol. 125, Article 102421. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0306437924000796/pdfft?md5=8af2f9051185a5f57df5320cb4c1b7bd&pid=1-s2.0-S0306437924000796-main.pdf
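To make the notion of a dependency over a property graph concrete, the sketch below checks one functional-dependency-style rule on labelled nodes: nodes with the same label that agree on the attributes in `lhs` should also agree on those in `rhs`. This is only a simplified special case of a GED (no graph pattern, no id literals) and it is not the discovery algorithm of the paper. The function `violations`, the node encoding, and the sample data are all hypothetical, and skipping nodes that lack an attribute is a simplifying assumption.

```python
from collections import defaultdict

def violations(nodes, label, lhs, rhs):
    """Return groups of node ids that agree on `lhs` but disagree on `rhs`.

    nodes: dict mapping node id -> (label, attribute dict)  (assumed encoding)
    lhs, rhs: lists of attribute names
    """
    groups = defaultdict(list)
    for nid, (lab, attrs) in nodes.items():
        # Simplification: nodes missing any involved attribute are skipped.
        if lab == label and all(a in attrs for a in lhs + rhs):
            groups[tuple(attrs[a] for a in lhs)].append(nid)
    bad = []
    for ids in groups.values():
        # More than one distinct rhs value inside a group breaks the rule.
        if len({tuple(nodes[i][1][a] for a in rhs) for i in ids}) > 1:
            bad.append(sorted(ids))
    return bad

nodes = {
    1: ("person", {"email": "a@x.org", "name": "Ann"}),
    2: ("person", {"email": "a@x.org", "name": "Anne"}),
    3: ("person", {"email": "b@x.org", "name": "Bob"}),
    4: ("city", {"name": "Ann"}),
}
print(violations(nodes, "person", ["email"], ["name"]))
```

Here nodes 1 and 2 share an email but carry different names, so the rule "email determines name" is violated for that pair; a rule with no violations would be a candidate for the mined cover.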
Pub Date: 2024-06-10. DOI: 10.1016/j.is.2024.102420
Ahmed Al-Ghezi, Lena Wiese
The Resource Description Framework (RDF) is widely used to model web data. The scale and complexity of the modeled data pose performance challenges for RDF triple stores. Workload adaption is one important strategy for dealing with those challenges at the storage level. Current workload-adaption approaches lack the necessary generalization of the problem and optimize only part of the storage layer with the workload (mostly the replication). This leaves a large performance gap in other data structures (e.g., indexes and cache) that could benefit heavily from the same workload-adaption strategy. Moreover, most current approaches build the workload statistics collectively, so the analysis process is unaware of whether workload items are old or recent. As a result, it does not capture the temporal trends that exist naturally in user queries, which causes the analysis to lag behind rapid workload development. We present a novel universal adaption approach to the storage management of a distributed RDF store. The system aims to find optimal data assignments to the different indexes, replications, and join cache within the limited storage space. We present a cost model based on the workload, which often contains frequent patterns. The workload is dynamically and continuously analyzed to evaluate predefined rules considering the benefits and costs of all options for assigning data to the storage structures. The objective is to reduce query execution time by letting different data containers compete for the limited storage space. By modeling the workload statistics as time series, we can apply well-known smoothing techniques that allow the importance of the workload to decay over time. This keeps the universal adaption in tune with potential changes in the workload trends.
Title: Analyzing workload trends for boosting triple stores performance. Journal: Information Systems, vol. 125, Article 102420. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0306437924000784/pdfft?md5=4a9d8f0acac2d10b05565ee129773c94&pid=1-s2.0-S0306437924000784-main.pdf
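The time-decay idea in the abstract above can be illustrated with simple exponential smoothing over per-window pattern counts. This is a minimal sketch under stated assumptions, not the authors' cost model: the function `smooth_counts`, the smoothing factor `alpha`, and the sample windows are hypothetical.

```python
def smooth_counts(windows, alpha=0.5):
    """Exponentially smooth per-pattern frequencies across time windows.

    windows: list of dicts mapping a query pattern to its count in that
             window, ordered oldest to newest (assumed input format).
    alpha:   weight of the newest window, 0 < alpha <= 1 (assumed value).
    """
    score = {}
    for counts in windows:
        # Every pattern seen so far decays, even if absent from this window.
        for pattern in set(score) | set(counts):
            old = score.get(pattern, 0.0)
            score[pattern] = alpha * counts.get(pattern, 0) + (1 - alpha) * old
    return score

# Both patterns occur 12 times in total, but p2 is the recent trend.
windows = [{"p1": 8, "p2": 0}, {"p1": 4, "p2": 4}, {"p1": 0, "p2": 8}]
scores = smooth_counts(windows)
print(scores)
```

A collective count would rate `p1` and `p2` equally (12 occurrences each), whereas the smoothed scores favour `p2`, whose occurrences are more recent. That ordering is what a storage manager would need in order to follow the current workload trend.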
Pub Date: 2024-06-04. DOI: 10.1016/j.is.2024.102419
Yaojun Hao, Haotian Wang, Qingshan Zhao, Liping Feng, Jian Wang
Over the past two decades, many studies have devoted a good deal of attention to detecting injection attacks in recommender systems. However, most of them focus on heuristically-generated injection attacks, which are fabricated by hand-engineering. In practice, adversarially-learned injection attacks based on optimization methods have been proposed, with enhanced camouflage and threat. Under adversarially-learned injection attacks, traditional detection models are likely to be fooled. In this paper, a detection method for adversarially-learned injection attacks is proposed via knowledge graphs. Firstly, taking advantage of the wealth of information in knowledge graphs, item-pairs on the extension hops of knowledge graphs are regarded as implicit user preferences. Also, the item-pair popularity series and the user item-pair matrix are constructed to express the user's preferences. Secondly, a word embedding model and principal component analysis are used to extract the user's initial vector representations from the item-pair popularity series and the item-pair matrix, respectively. Moreover, Variational Autoencoders with improved R-drop regularization are used to reconstruct the embedding vectors and further identify the shilling profiles. Finally, experiments on three real-world datasets indicate that the proposed detector outperforms benchmark methods when detecting adversarially-learned injection attacks. In addition, the detector is evaluated under heuristically-generated injection attacks and demonstrates outstanding performance.
Title: Detecting the adversarially-learned injection attacks via knowledge graphs. Journal: Information Systems, vol. 125, Article 102419.
Pub Date: 2024-06-01. DOI: 10.1016/j.is.2024.102418
Xiaolin Han, Tobias Grubenmann, Chenhao Ma, Xiaodong Li, Wenya Sun, Sze Chun Wong, Xuequn Shang, Reynold Cheng
Incident detection (ID), or the automatic discovery of anomalies from road traffic data (e.g., road sensor and GPS data), enables emergency actions (e.g., rescuing injured people) to be carried out in a timely fashion. Existing ID solutions based on data mining or machine learning often rely on dense traffic data; for instance, sensors installed in highways provide frequent updates of road information. In this paper, we ask the question: can ID be performed on sparse traffic data (e.g., location data obtained from GPS devices equipped on vehicles)? As these data may not be enough to describe the state of the roads involved, they can undermine the effectiveness of existing ID solutions. To tackle this challenge, we borrow an important insight from the transportation area, which uses trajectories (i.e., moving histories of vehicles) to derive incident patterns. We study how to obtain incident patterns from trajectories and devise a new solution (called Filter-Discovery-Match (FDM)) to detect anomalies in sparse traffic data. We have also developed a fast algorithm to support FDM. Experiments on a taxi dataset in Hong Kong and a simulated dataset show that FDM is more effective than state-of-the-art ID solutions on sparse traffic data, and is also efficient.
Title: FDM: Effective and efficient incident detection on sparse trajectory data. Journal: Information Systems, vol. 125, Article 102418.
When solving the problem of identifying similar records in different datasets (known as Entity Resolution or ER), one big challenge is the lack of enough labeled data, which is crucial for building strong machine learning models but can be expensive and time-consuming to obtain. Active Machine Learning (ActiveML) is a helpful approach because it cleverly picks the most useful pieces of data to learn from. It uses two main ideas: informativeness and representativeness. Typical ActiveML methods used in ER usually depend too much on just one of these ideas, which can make them less effective, especially when starting with very little data. Our research introduces a new combined method that uses both ideas together. We created two versions of this method, called DPQ and STQ, and tested them on eleven different real-world datasets. The results showed that our new method improves ER by producing better scores, more stable models, and faster learning with less training data compared to existing methods.
Title: Enhancing Entity Resolution with a hybrid Active Machine Learning framework: Strategies for optimal learning in sparse datasets. Authors: Mourad Jabrane, Hiba Tabbaa, Aissam Hadri, Imad Hafidi. Pub Date: 2024-05-25. DOI: 10.1016/j.is.2024.102410. Journal: Information Systems, vol. 125, Article 102410.
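One common way to combine informativeness and representativeness in a single query strategy is to multiply an uncertainty term by a density term. The sketch below shows that generic combination. It is not the DPQ or STQ method from the paper, and the function `pick_query`, the weight `beta`, and the sample inputs are hypothetical.

```python
def pick_query(probs, similarity, beta=1.0):
    """Pick the index of the unlabeled record pair to send for labeling.

    probs:      predicted match probability for each candidate pair
    similarity: square matrix, similarity[i][j] in [0, 1] (assumed given)
    beta:       weight of the representativeness term (assumed value)
    """
    n = len(probs)
    best, best_score = None, -1.0
    for i in range(n):
        # Informativeness: 1 when the model is undecided (p = 0.5), 0 when sure.
        informativeness = 1.0 - abs(2.0 * probs[i] - 1.0)
        # Representativeness: average similarity to all candidates.
        representativeness = sum(similarity[i]) / n
        score = informativeness * representativeness ** beta
        if score > best_score:
            best, best_score = i, score
    return best

probs = [0.5, 0.55, 0.95]
similarity = [
    [1.0, 0.1, 0.1],
    [0.1, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]
print(pick_query(probs, similarity))            # combined criterion
print(pick_query(probs, similarity, beta=0.0))  # uncertainty only
```

With `beta=0.0` the strategy reduces to plain uncertainty sampling and selects candidate 0, an outlier that resembles nothing else. The combined score prefers candidate 1, which is almost as uncertain but sits in a denser region of the candidate set.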
Pub Date: 2024-05-21. DOI: 10.1016/j.is.2024.102409
Giovanni Di Gennaro, Claudia Greco, Amedeo Buonanno, Marialucia Cuciniello, Terry Amorese, Maria Santina Ler, Gennaro Cordasco, Francesco A.N. Palmieri, Anna Esposito
The growth of data-driven approaches typical of Machine Learning leads to an ever-increasing need for large quantities of labeled data. Unfortunately, these attributions are often made automatically and/or crudely, thus destroying the very concept of “ground truth” they are supposed to represent. To address this problem, we introduce HUM-CARD, a dataset of human trajectories in crowded contexts manually annotated by nine experts in engineering and psychology, totaling approximately 5000 hours. Our multidisciplinary labeling process has enabled the creation of a well-structured ontology, accounting for both individual and contextual factors influencing human movement dynamics in shared environments. Preliminary and descriptive analyses are presented, highlighting the potential benefits of this dataset and its methodology in various research challenges.
Title: HUM-CARD: A human crowded annotated real dataset. Journal: Information Systems, vol. 124, Article 102409. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S030643792400067X/pdfft?md5=e81bccaabf431209b490556bb4e67c4b&pid=1-s2.0-S030643792400067X-main.pdf
Pub Date: 2024-05-21. DOI: 10.1016/j.is.2024.102408
Huiting Ma, Dengao Li, Jian Fu, Guiji Zhao, Jumin Zhao
Heart failure, as a critical symptom or terminal stage of assorted heart diseases, is a global public health problem. Establishing a prognostic model can help identify high-risk patients, save their lives promptly, and reduce the medical burden. Although integrating structured indicators and unstructured text for complementary information has been proven effective in disease prediction tasks, there are still certain limitations. Firstly, the processing of the individual modality branches is easily overlooked, which can affect the final fusion result. Secondly, simple fusion loses complementary information between modalities, limiting the network’s learning ability. Thirdly, incomplete interpretability can hinder the practical application and development of the model. To overcome these challenges, this paper proposes the MDL-HFP multimodal model for predicting patient prognosis using the MIMIC-III public database. Firstly, the ADASYN algorithm is used to handle the imbalance of data categories. Then, the proposed improved Deep&Cross Network is used for automatic feature selection to encode structured sparse features, and implicit graph structure information is introduced to encode unstructured clinical notes based on the HR-BGCN model. Finally, the information of the two modalities is fused through a cross-modal dynamic interaction layer. A comparison with multiple advanced multimodal deep learning models verifies the model’s effectiveness, with an average F1 score of 90.42% and an average accuracy of 90.70%. The model proposed in this paper can accurately classify the readmission status of patients, thereby assisting doctors in making judgments and improving the patient’s prognosis. Further visual analysis demonstrates the usability of the model, providing a comprehensive explanation for clinical decision-making.
Title: Heart failure prognosis prediction: Let’s start with the MDL-HFP model. Journal: Information Systems, vol. 125, Article 102408.
Pub Date: 2024-05-19. DOI: 10.1016/j.is.2024.102405
Wei Guan, Jian Cao, Yang Gu, Shiyou Qian
Anomalies in business processes are inevitable for various reasons such as system failures and operator errors. Detecting anomalies is important for the management and optimization of business processes. However, prevailing anomaly detection approaches often fail to capture crucial structural information about the underlying process. To address this, we propose a multi-Graph based Anomaly detection fraMework for business processes via grAph neural networks, named GAMA. GAMA makes use of structural process information and attribute information in a more integrated way. In GAMA, multiple graphs are applied to model a trace in which each attribute is modeled as a separate graph. In particular, the graph constructed for the special attribute activity reflects the control flow. Then GAMA employs a multi-graph encoder and a multi-sequence decoder on multiple graphs to detect anomalies in terms of the reconstruction errors. Moreover, three teacher forcing styles are designed to enhance GAMA’s ability to reconstruct normal behaviors and thus improve detection performance. We conduct extensive experiments on both synthetic logs and real-life logs. The experiment results demonstrate that GAMA outperforms state-of-the-art methods for both trace-level and attribute-level anomaly detection.
Title: GAMA: A multi-graph-based anomaly detection framework for business processes via graph neural networks. Journal: Information Systems, vol. 124, Article 102405.
Pub Date: 2024-05-18. DOI: 10.1016/j.is.2024.102406
Carlos Quijada-Fuentes, M. Andrea Rodríguez, Diego Seco
This paper introduces the TRGST data structure, which is designed to handle queries related to topological relations between paths represented as sequences of stops in a network. As an example, these paths could correspond to stops on a public transport network, and a query of interest is to retrieve paths that share at least k consecutive stops. While topological relations among spatial objects have received extensive attention, the efficient processing of these relations in the context of trajectory paths, considering both time and space efficiency, remains relatively unexplored. Taking inspiration from pattern matching implementations, the TRGST data structure is constructed on the foundation of the Generalized Suffix Tree. Its purpose is to provide a compact representation of a set of paths and to efficiently handle topological relation queries by leveraging the pattern search capabilities inherent in this structure. The paper provides a detailed account of the structure and algorithms of TRGST, followed by a performance analysis utilizing both real and synthetic data. The results underscore the remarkable scalability of the TRGST in terms of both query time and space utilization.
Title: TRGST: An enhanced generalized suffix tree for topological relations between paths. Journal: Information Systems, volume 125, Article 102406. Publication date: 2024-05-18.
Pub Date: 2024-05-18 DOI: 10.1016/j.is.2024.102407
Hang Zhang, Mingxin Gan
Users have various behaviors on items, including page view, tag-as-favorite, add-to-cart, and purchase in online shopping platforms. These various types of behaviors reflect users’ different intentions, which also help learn their preferences on items in a recommender system. Although some multi-behavior recommendation methods have been proposed, two significant challenges have not been widely noticed: (i) capturing heterogeneous and dynamic preferences of users simultaneously from different types of behaviors; (ii) modeling the dynamic dependency among various types of behaviors. To overcome the above challenges, we propose a novel multi-behavior dynamic dependency learning method (MBDL) to explore the heterogeneity and dependency among various types of behavior sequences for recommendation. In brief, MBDL first uses a dual-channel interest encoder to learn the long-term interest representations and the evolution of short-term interests from the behavior-aware item sequences. Then, MBDL adopts a contrastive learning method to preserve the consistency of user’s long-term behavioral patterns, and a multi-head attention network to capture the dynamic dependency among short-term interactive behaviors. Finally, MBDL adaptively integrates the influence of long- and short-term interests to predict future user–item interactions. Experiments on two real-world datasets show that the proposed MBDL method outperforms state-of-the-art methods significantly on recommendation accuracy. Further ablation studies demonstrate the effectiveness of our model and the benefits of learning dynamic dependency among types of behaviors.
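The final step of the abstract, adaptively integrating long- and short-term interests to predict interactions, can be pictured with a generic gated fusion. This is a minimal sketch under stated assumptions, not the MBDL architecture: the abstract does not say how the integration is computed, so the sigmoid gate, the dot-product score, and the names `fuse_interests` and `score_item` are all hypothetical, and the vectors are toy values rather than learned representations.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_interests(
    long_term: Sequence[float],
    short_term: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[float]:
    """Blend long- and short-term interest vectors with a scalar gate.

    Assumption: gate = sigmoid(w . [long; short] + b), and the fused vector is
    gate * long + (1 - gate) * short. In a trained model w and b would be learned.
    """
    if len(long_term) != len(short_term):
        raise ValueError("interest vectors must have the same length")
    if len(gate_weights) != 2 * len(long_term):
        raise ValueError("gate_weights must cover the concatenated vector")
    concatenated = list(long_term) + list(short_term)
    gate = sigmoid(dot(gate_weights, concatenated) + gate_bias)
    return [gate * l + (1.0 - gate) * s for l, s in zip(long_term, short_term)]


def score_item(user_vector: Sequence[float], item_vector: Sequence[float]) -> float:
    """Score a candidate item as the dot product with the fused user vector."""
    return dot(user_vector, item_vector)


# Toy vectors: with all-zero gate weights the gate is 0.5, an even blend.
long_interest = [1.0, 0.0]
short_interest = [0.0, 1.0]
fused = fuse_interests(long_interest, short_interest, [0.0, 0.0, 0.0, 0.0])
item_score = score_item(fused, [2.0, 4.0])
```

A gate close to 1 makes the prediction follow the long-term representation, and a gate close to 0 makes it follow the short-term one, which is the sense in which such a fusion is adaptive.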
Title: MBDL: Exploring dynamic dependency among various types of behaviors for recommendation. Journal: Information Systems, volume 124, Article 102407. Publication date: 2024-05-18.