
Latest publications in Information Systems

Privacy-preserving record linkage using reference set based encoding: A single parameter method
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-15 | DOI: 10.1016/j.is.2025.102569
Sumayya Ziyad , Peter Christen , Anushka Vidanage , Charini Nanayakkara , Rainer Schnell
Record linkage is the process of matching records that refer to the same entity across two or more databases. In many application areas, ranging from healthcare to government services, the databases to be linked contain sensitive personal information and hence cannot be shared across organisations. Privacy-Preserving Record Linkage (PPRL) aims to overcome this challenge by facilitating the comparison of records that have been encoded or encrypted, thereby allowing linkage without the need to share any sensitive data. While various PPRL techniques have been developed, most of them do not properly address privacy concerns, such as the various vulnerabilities of encoded data with regard to cryptanalysis attacks. Furthermore, existing PPRL methods do not provide conceptual analyses of how a user should set the various parameters required, possibly leading to sub-optimal results with regard to both linkage quality and privacy protection. Here we present a novel encoding method for PPRL that employs reference q-gram sets to generate bit arrays that represent sensitive values. Our method requires a single user parameter that determines a trade-off between linkage quality, scalability, and privacy. All other parameters are either data-driven or have strong bounds based on the user-set parameter. Furthermore, our method addresses the length, frequency, and pattern-based PPRL vulnerabilities that are exploited by existing PPRL attacks. We conceptually analyse our method and experimentally evaluate it using multiple databases. Our results show that our method provides robust results with both high linkage quality and strong privacy protection.
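As a hedged illustration of the reference-set idea (the q-gram length, the Dice-similarity threshold, and all names below are assumptions for this sketch, not the paper's construction), a sensitive value can be mapped to a bit array in which each bit records whether the value resembles one reference q-gram set; only the bit arrays would then be compared:

```python
def qgrams(s, q=2):
    """Split a string into its set of overlapping q-grams (bigrams by default)."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def encode(value, reference_sets, threshold=0.2):
    """Illustrative reference-set encoding (not the published method):
    bit i is set iff the value's q-gram set has Dice similarity of at
    least `threshold` with reference q-gram set i."""
    v = qgrams(value)
    bits = 0
    for i, ref in enumerate(reference_sets):
        inter = len(v & ref)
        dice = 2 * inter / (len(v) + len(ref)) if (v or ref) else 0.0
        if dice >= threshold:
            bits |= 1 << i
    return bits
```

Two parties holding the same reference sets would produce comparable bit arrays without exchanging the underlying strings.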
Citations: 0
Training-free sparse representations of dense vectors for scalable information retrieval
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-13 | DOI: 10.1016/j.is.2025.102567
Fabio Carrara, Lucia Vadicamo, Giuseppe Amato, Claudio Gennaro
In this paper, we propose and analyze Vec2Doc, a novel training-free method to transform dense vectors into sparse integer vectors, facilitating the use of inverted indexes for information retrieval (IR). The exponential growth of deep learning and artificial intelligence has revolutionized scientific problem-solving in areas such as computer vision, natural language processing, and automatic content generation. These advances have also significantly impacted IR, with a better understanding of natural language and multimodal content analysis leading to more accurate information retrieval. Despite these developments, modern IR relies primarily on the similarity evaluation of dense vectors from the latent spaces of deep neural networks. This dependence introduces substantial challenges in performing similarity searches on large collections containing billions of vectors. Traditional IR methods, which employ inverted indexes and vector space models, are adept at handling sparse vectors but do not work well with dense ones. Vec2Doc attempts to fill this gap by converting dense vectors into a format compatible with conventional inverted index techniques. Our preliminary experimental evaluations show that Vec2Doc is a promising solution to overcome the scalability problems inherent in vector-based IR, offering an alternative method for efficient and accurate large-scale information retrieval.
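A minimal sketch of the general dense-to-sparse idea, under the assumption of a simple split-and-quantize scheme (this is an illustration, not the published Vec2Doc algorithm; the scale factor and term naming are made up for the example):

```python
def dense_to_sparse_terms(vector, scale=10):
    """Turn a dense float vector into integer term frequencies suitable
    for a classic inverted index: split each dimension into a positive
    and a negative pseudo-term, quantize the magnitude to an integer,
    and keep only nonzero entries (illustrative sketch only)."""
    terms = {}
    for i, v in enumerate(vector):
        q = round(abs(v) * scale)
        if q > 0:
            term = f"d{i}+" if v > 0 else f"d{i}-"
            terms[term] = q
    return terms
```

The resulting `{term: count}` dictionary plays the role of a bag-of-words document, so standard inverted-index machinery applies unchanged.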
Citations: 0
Finding HSP neighbors via an exact, hierarchical approach
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-12 | DOI: 10.1016/j.is.2025.102565
Cole Foster , Edgar Chávez , Benjamin Kimia
The Half Space Proximal (HSP) graph is a low out-degree monotonic graph with a wide range of applications in various domains, including combinatorial optimization on strings, enhancing kNN classification, simplifying chemical networks, estimating local intrinsic dimensionality, and generating uniform samples from skewed distributions, among others. However, the linear complexity of finding the HSP neighbors of a query limits its scalability, motivating approximate indexing, which sacrifices accuracy by restricting the test to a small local neighborhood. This compromise loses crucial long-range connections, thereby introducing false positives and false negatives and compromising some of the essential properties of the HSP. To overcome these limitations, this paper proposes a fast and exact algorithm for computing the HSP that enjoys sublinear complexity, as demonstrated by extensive experimentation. Our hierarchical approach leverages the triangle inequality applied to pivots to enable efficient HSP search in metric spaces with the Hilbert Exclusion property. A key component of our approach is the concept of the shifted generalized hyperplane between two points, which allows entire groups of points to be invalidated. Our approach ensures efficient computation of the exact HSP, even for datasets containing hundreds of millions of points.
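The classic linear-scan HSP test that the paper accelerates can be sketched as follows (a naive quadratic baseline for illustration, not the proposed hierarchical algorithm): take the closest remaining point as a neighbor, then discard every candidate lying closer to that neighbor than to the query, and repeat.

```python
import math

def hsp_neighbors(query, points):
    """Naive linear-scan computation of the HSP neighbors of `query`:
    repeatedly select the nearest remaining candidate, then remove all
    candidates that fall into that neighbor's half space (i.e. are
    closer to the neighbor than to the query)."""
    candidates = [p for p in points if p != query]
    neighbors = []
    while candidates:
        nearest = min(candidates, key=lambda p: math.dist(query, p))
        neighbors.append(nearest)
        # Keep only points still closer to the query than to the new neighbor.
        candidates = [p for p in candidates
                      if math.dist(query, p) < math.dist(nearest, p)]
    return neighbors
```

Each selected neighbor prunes a whole half space of candidates, which is why the resulting graph has low out-degree.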
Citations: 0
An Alternating Optimization Scheme for Binary Sketches
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-10 | DOI: 10.1016/j.is.2025.102563
Erik Thordsen, Erich Schubert
Searching for similar objects in intrinsically high-dimensional data sets is a challenging task. The use of compact sketches has been proposed for faster similarity search using linear scans. Binary sketches are one such approach for finding a good mapping from the original data space to bit strings of a fixed length. These bit strings can be compared efficiently using only a few XOR and bit-count operations, replacing costly similarity computations with an inexpensive approximation. We propose a new scheme to initialize and improve binary sketches for similarity search in Euclidean spaces. Our optimization iteratively improves the quality of the sketches with a form of orthogonalization. We provide empirical evidence that the quality of the sketches has a peak beyond which it is correlated with neither bit independence nor bit balance, which contradicts a previous hypothesis in the literature. Regularization in the form of noise added to the training data can turn the peak into a plateau, and applying the optimization in a stochastic fashion, i.e., training on smaller subsets of the data, allows for rapid initialization. We provide a loss function that allows the same objective to be approximated using neural network frameworks such as PyTorch, elevating the approach to GPU-based training.
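The cheap comparison described above reduces to one XOR plus a population count; a minimal sketch, using a generic random-hyperplane mapping as a stand-in for the paper's optimized scheme (the mapping and all names here are illustrative assumptions):

```python
def sketch(vector, hyperplanes):
    """Map a vector to a bit string stored as an int: bit i is 1 iff the
    vector lies on the non-negative side of hyperplane i. A generic
    random-projection baseline, not the paper's optimized scheme."""
    bits = 0
    for i, h in enumerate(hyperplanes):
        if sum(v * w for v, w in zip(vector, h)) >= 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Approximate a costly similarity computation with one XOR
    followed by a bit count over the two sketches."""
    return bin(a ^ b).count("1")
```

A linear scan then ranks candidates by `hamming` distance and re-checks only the best ones with the exact similarity.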
Citations: 0
Class Representatives Selection in non-metric spaces for nearest prototype classification
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-10 | DOI: 10.1016/j.is.2025.102564
Jaroslav Hlaváč , Martin Kopp , Tomáš Skopal
The nearest prototype classification is a less computationally intensive replacement for the k-NN method, especially when large datasets are considered. Centroids are often used as prototypes to represent whole classes in metric spaces. Selection of class prototypes in non-metric spaces is more challenging as the idea of computing centroids is not directly applicable. Instead, a set of representative objects can be used as the class prototype.
This paper presents the Class Representatives Selection (CRS) method, a novel, memory- and computationally efficient method that finds a small yet representative set of objects from each class to be used as a prototype. CRS leverages the similarity-graph representation of each class created by the NN-Descent algorithm to pick a small number of representatives that ensure sufficient class coverage. Thanks to the graph-based approach, CRS can be applied in any space where at least a pairwise similarity can be defined. In the experimental evaluation, we demonstrate that our method outperforms state-of-the-art techniques on multiple datasets from different domains.
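The coverage idea can be sketched as a greedy selection on a precomputed similarity graph (an illustrative stand-in, not the published CRS algorithm; the adjacency-list format and the toy graph in the test are hypothetical):

```python
def select_representatives(graph):
    """Greedy coverage sketch: given a class's similarity graph as an
    adjacency dict {node: [neighbors]}, repeatedly pick the node whose
    closed neighborhood covers the most still-uncovered members, until
    every member is covered by some representative."""
    uncovered = set(graph)
    reps = []
    while uncovered:
        best = max(graph,
                   key=lambda n: len(({n} | set(graph[n])) & uncovered))
        reps.append(best)
        uncovered -= {best} | set(graph[best])
    return reps
```

Because only pairwise similarities are needed to build the graph, the same selection works in non-metric spaces where centroids are undefined.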
Citations: 0
Back to the Order: Partial orders in streaming conformance checking
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-10 | DOI: 10.1016/j.is.2025.102566
Kristo Raun , Riccardo Tommasini , Ahmed Awad
Most organizations are built around their business processes. Commonly, these processes follow a predefined path. Deviations from the expected path can lead to lower quality products and services, reduced efficiencies, and compliance liabilities. Rapid identification of deviations helps mitigate such risks. For identifying deviations, the conformance checker would need to know the sequence in which events occurred. In this paper, we tackle two challenges associated with knowing the right sequence of events. First, we look at out-of-order event arrival, a common occurrence in modern information systems. Second, we extend the previous work by incorporating partial order handling. Partially ordered events are a well-studied problem in process mining, but to the best of our knowledge it has not been researched in terms of fast-paced streaming conformance checking. Real-life and semi-synthetic datasets are used for validating the proposed methods.
Citations: 0
Timeline-based process discovery
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-10 | DOI: 10.1016/j.is.2025.102568
Christoffer Rubensson , Harleen Kaur , Timotheus Kampik , Jan Mendling
A key concern of automatic process discovery is providing insights into business process performance. Process analysts are specifically interested in waiting times and delays for identifying opportunities to speed up processes. Against this backdrop, it is surprising that current techniques for automatic process discovery generate directly-follows graphs and comparable process models without representing the time axis explicitly. This paper presents four layout strategies for automatically constructing process models that explicitly align with a time axis. We exemplify our approaches for directly-follows graphs. We evaluate their effectiveness by applying them to real-world event logs with varying complexities. Our specific focus is on their ability to handle the trade-off between high control-flow abstraction and high consistency of temporal activity order. Our results show that timeline-based layouts provide benefits in terms of an explicit representation of temporal distances. They face challenges for logs with many repeating and concurrent activities.
Citations: 0
Stochastic conformance checking based on variable-length Markov chains
IF 3 | CAS Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-05-09 | DOI: 10.1016/j.is.2025.102561
Emilio Incerto , Andrea Vandin , Sima Sarv Ahrabi
Conformance checking is central in process mining (PM). It studies deviations of logs from reference processes. Originally, the proposed approaches did not focus on stochastic aspects of the underlying process and gave qualitative models as output. Recently, these have been extended into approaches for stochastic conformance checking (SCC), giving quantitative models as output. A different community, namely software performance engineering (PE), interested in the synthesis of stochastic processes for decades, has independently developed techniques to synthesize Markov Chains (MCs) that describe the stochastic process underlying program runs. However, these were never applied to SCC problems. We propose a novel approach to SCC based on PE results for the synthesis of stochastic processes. Thanks to a rich experimental evaluation, we show that it outperforms the state-of-the-art. In doing so, we further bridge PE and PM, fostering cross-fertilization. We use techniques for the synthesis of Variable-length MCs (VLMCs), higher-order MCs able to compactly encode complex path dependencies in the control-flow. VLMCs are equipped with a notion of the likelihood that a trace belongs to a model. We use it to perform SCC of a log against a model. We establish the degree of conformance by equipping VLMCs with uEMSC, a standard conformance measure in the SCC literature. We compare with 18 SCC techniques from the PM literature, using 11 benchmark datasets from the PM community. We outperform all approaches on 10 out of 11 datasets, i.e., we get uEMSC values closer to 1 for logs conforming to a model. Furthermore, we show that VLMCs are efficient, as they handled all considered datasets in a few seconds.
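A minimal sketch of a VLMC and of scoring a trace by longest-context matching (illustrative assumptions: plain frequency counts, no smoothing, and log-likelihood of negative infinity for events never seen in any matching context):

```python
import math
from collections import defaultdict

def train_vlmc(traces, max_order=3):
    """Count next-activity frequencies for every context (trace suffix)
    of length 0..max_order; a minimal variable-length Markov chain."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for i, act in enumerate(trace):
            for k in range(0, min(i, max_order) + 1):
                counts[tuple(trace[i - k:i])][act] += 1
    return counts

def trace_log_likelihood(trace, counts, max_order=3):
    """Score a trace: for each event, use the probability of the event
    given the longest training context that matches the preceding events."""
    ll = 0.0
    for i, act in enumerate(trace):
        for k in range(min(i, max_order), -1, -1):
            ctx = tuple(trace[i - k:i])
            if ctx in counts and act in counts[ctx]:
                ll += math.log(counts[ctx][act] / sum(counts[ctx].values()))
                break
        else:
            return float("-inf")  # event unseen in every context
    return ll
```

A trace's likelihood under the model can then be compared against the log's empirical frequencies to quantify conformance.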
Citations: 0
Towards Multi-Faceted Visual Process Analytics
IF 3 Zone 2, Computer Science Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-05-06 DOI: 10.1016/j.is.2025.102560
Stef van den Elzen , Mieke Jans , Niels Martin , Femke Pieters , Christian Tominski , Maria-Cruz Villa-Uriol , Sebastiaan J. van Zelst
Both the fields of Process Mining (PM) and Visual Analytics (VA) aim to make complex phenomena understandable. In PM, the goal is to gain insights into the execution of complex processes by analyzing the event data that is captured in event logs. This data is inherently multi-faceted, meaning that it covers various data facets, including spatial and temporal dependencies, relations between data entities (such as cases/events), and multivariate data attributes per entity. However, the multi-faceted nature of the data has not received much attention in PM. Conversely, VA research has investigated interactive visual methods for making multi-faceted data understandable for about two decades. In this study, we bring together PM and VA with the goal of advancing towards Visual Process Analytics (VPA) of multi-faceted processes. To this end, we present a systematic view of relevant (VA) data facets in the context of PM and assess to what extent existing PM visualizations address the data facets’ characteristics, making use of VA guidelines. In addition to visualizations, we look at how PM can benefit from analytical abstraction and interaction techniques known in the VA realm. Based on this, we discuss open challenges and opportunities for future research towards multi-faceted VPA.
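To make the facet terminology concrete, a minimal representation of a multi-faceted event log might look like the sketch below. The field names are our own illustration, not taken from the paper; the comments map each field to the facets the abstract names (temporal, spatial, relational, multivariate attributes).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Tuple

@dataclass
class Event:
    """One event carrying several data facets."""
    case_id: str                                     # relational facet: links the event to its case
    activity: str
    timestamp: datetime                              # temporal facet
    location: Optional[Tuple[float, float]] = None   # spatial facet, e.g. (lat, lon)
    attributes: dict = field(default_factory=dict)   # multivariate attribute facet

log = [
    Event("c1", "register", datetime(2025, 5, 6, 9, 0), (50.85, 4.35), {"cost": 12.0}),
    Event("c1", "approve", datetime(2025, 5, 6, 10, 30), attributes={"cost": 3.5}),
    Event("c2", "register", datetime(2025, 5, 6, 9, 15)),
]

# Projecting onto the classic control-flow (trace) view discards the other
# facets -- which is precisely the reduction the paper argues against.
traces = defaultdict(list)
for e in sorted(log, key=lambda e: e.timestamp):
    traces[e.case_id].append(e.activity)
```

A multi-faceted visualization would instead keep location, timing, and attributes available for coordinated views rather than reducing the log to the `traces` dictionary alone.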
{"title":"Towards Multi-Faceted Visual Process Analytics","authors":"Stef van den Elzen ,&nbsp;Mieke Jans ,&nbsp;Niels Martin ,&nbsp;Femke Pieters ,&nbsp;Christian Tominski ,&nbsp;Maria-Cruz Villa-Uriol ,&nbsp;Sebastiaan J. van Zelst","doi":"10.1016/j.is.2025.102560","DOIUrl":"10.1016/j.is.2025.102560","url":null,"abstract":"<div><div>Both the fields of Process Mining (PM) and Visual Analytics (VA) aim to make complex phenomena understandable. In PM, the goal is to gain insights into the execution of complex processes by analyzing the event data that is captured in event logs. This data is inherently multi-faceted, meaning that it covers various data facets, including spatial and temporal dependencies, relations between data entities (such as cases/events), and multivariate data attributes per entity. However, the multi-faceted nature of the data has not received much attention in PM. Conversely, VA research has investigated interactive visual methods for making multi-faceted data understandable for about two decades. In this study, we bring together PM and VA with the goal of advancing towards Visual Process Analytics (VPA) of multi-faceted processes. To this end, we present a systematic view of relevant (VA) data facets in the context of PM and assess to what extent existing PM visualizations address the data facets’ characteristics, making use of VA guidelines. In addition to visualizations, we look at how PM can benefit from analytical abstraction and interaction techniques known in the VA realm. 
Based on this, we discuss open challenges and opportunities for future research towards multi-faceted VPA.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102560"},"PeriodicalIF":3.0,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143934651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Support estimation in frequent itemsets mining on Enriched Two Level Tree
IF 3 Zone 2, Computer Science Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-05-06 DOI: 10.1016/j.is.2025.102559
Clémentin Tayou Djamegni , William Kery Branston Ndemaze , Edith Belise Kenmogne , Hervé Maradona Nana Kouassi , Arnauld Nzegha Fountsop , Idriss Tetakouchom , Laurent Cabrel Tabueu Fotso
Efficiently counting the support of candidate itemsets is a crucial aspect of extracting frequent itemsets because it directly impacts the overall performance of the mining process. Researchers have developed various techniques and data structures to overcome this challenge, but the problem is still open. In this paper, we investigate the two-level tree enrichment technique as a potential solution without adding significant computational overhead. In addition, we introduce ETL_Miner, a novel algorithm that provides an estimated bound for the support value of all candidate itemsets within the search space. The method presented in this article is flexible and can be used with various algorithms. To demonstrate this point, we introduce a modified version of Apriori that integrates ETL_Miner as an extra pruning phase. Preliminary empirical experimental results on both real and synthetic datasets confirm the accuracy of the proposed method and reduce the total extraction time.
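The "extra pruning phase" grafted onto Apriori can be sketched as follows. The bound used here (the minimum of the members' single-item supports, which is a valid upper bound by anti-monotonicity) is only a simple stand-in for ETL_Miner's tree-based estimate, whose internals the abstract does not give; the function and variable names are our own.

```python
def apriori_with_bound_pruning(transactions, min_support):
    """Apriori with an added pruning phase: a cheap support upper bound
    is checked before the costly exact database scan for each candidate."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):                      # exact: one full pass over the data
        return sum(itemset <= t for t in transactions) / n

    item_count = {}
    for t in transactions:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1

    def upper_bound(itemset):                  # cheap: no database pass
        return min(item_count[i] for i in itemset) / n

    frequent = {frozenset([i]): c / n for i, c in item_count.items()
                if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # candidate generation by joining frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # pruning phase: discard candidates whose estimated bound already fails
        candidates = {c for c in candidates if upper_bound(c) >= min_support}
        # exact support counting only for the survivors
        frequent = {c: s for c in candidates if (s := support(c)) >= min_support}
        result.update(frequent)
        k += 1
    return result
```

On four transactions `abc, ab, ac, bc` with a 0.5 threshold, this returns every singleton (support 0.75) and every pair (support 0.5) but not `abc` (support 0.25); a tighter estimate such as ETL_Miner's would prune more candidates before the exact scan than the single-item bound does.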
{"title":"Support estimation in frequent itemsets mining on Enriched Two Level Tree","authors":"Clémentin Tayou Djamegni ,&nbsp;William Kery Branston Ndemaze ,&nbsp;Edith Belise Kenmogne ,&nbsp;Hervé Maradona Nana Kouassi ,&nbsp;Arnauld Nzegha Fountsop ,&nbsp;Idriss Tetakouchom ,&nbsp;Laurent Cabrel Tabueu Fotso","doi":"10.1016/j.is.2025.102559","DOIUrl":"10.1016/j.is.2025.102559","url":null,"abstract":"<div><div>Efficiently counting the support of candidate itemsets is a crucial aspect of extracting frequent itemsets because it directly impacts the overall performance of the mining process. Researchers have developed various techniques and data structures to overcome this challenge, but the problem is still open. In this paper, we investigate the two-level tree enrichment technique as a potential solution without adding significant computational overhead. In addition, we introduce ETL_Miner, a novel algorithm that provides an estimated bound for the support value of all candidate itemsets within the search space. The method presented in this article is flexible and can be used with various algorithms. To demonstrate this point, we introduce a modified version of Apriori that integrates ETL_Miner as an extra pruning phase. 
Preliminary empirical experimental results on both real and synthetic datasets confirm the accuracy of the proposed method and reduce the total extraction time.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102559"},"PeriodicalIF":3.0,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143934650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Journal
Information Systems