
Latest publications in Information Systems

Explaining cube measures through Intentional Analytics
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-18 | DOI: 10.1016/j.is.2023.102338
Matteo Francia, Stefano Rizzi, Patrick Marcel

The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, describe and assess have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the explain operator, whose goal is to provide an answer to the user asking “why does measure m show these values?”; specifically, we consider models that explain m in terms of one or more other measures. We propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the relationship between m and the other cube measures via regression analysis and cross-correlation, and (ii) highlighting the most interesting one. Finally, we test the operator implementation in terms of efficiency and effectiveness.
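
The relationship mining described above can be illustrated with a small sketch: for a target measure m, fit a per-measure linear regression and compute a Pearson correlation, then rank candidate explanatory measures by fit quality. This is only a plausible reading of the operator's building blocks; the lag-based cross-correlation and the interestingness criterion used in the paper are not reproduced, and all names (explain_measure, the sample cube) are illustrative.

```python
import numpy as np

def explain_measure(cube, target):
    """Rank the other measures of a cube slice by how well they explain `target`.

    `cube` maps measure names to 1-D arrays of equal length (one value per cell
    of the queried cube); `target` is the measure m to explain. Returns a list
    of (measure, r2, slope, intercept, pearson_r) sorted by r2, best first.
    """
    y = np.asarray(cube[target], dtype=float)
    results = []
    for name, values in cube.items():
        if name == target:
            continue
        x = np.asarray(values, dtype=float)
        # Least-squares fit y ~ slope*x + intercept, one explanatory measure at a time.
        slope, intercept = np.polyfit(x, y, 1)
        y_hat = slope * x + intercept
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
        pearson_r = np.corrcoef(x, y)[0, 1]
        results.append((name, r2, slope, intercept, pearson_r))
    # The "most interesting" relationship here is simply the best-fitting one.
    return sorted(results, key=lambda t: t[1], reverse=True)

# Example: explain the measure "quantity" with the other measures of a cube slice.
cube = {
    "quantity": np.array([10.0, 12.0, 15.0, 18.0, 25.0]),
    "revenue":  np.array([100.0, 125.0, 149.0, 181.0, 251.0]),
    "discount": np.array([0.1, 0.3, 0.2, 0.1, 0.4]),
}
print(explain_measure(cube, "quantity")[0])  # best explanatory measure first
```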

Citations: 0
LSPC: Exploring contrastive clustering based on local semantic information and prototype
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-13 | DOI: 10.1016/j.is.2023.102336
Jun-Fen Chen, Lang Sun, Bo-Jun Xie

In recent years, several prominent contrastive learning algorithms, a family of self-supervised learning methods, have been studied extensively; they can efficiently extract useful feature representations from input images by means of data augmentation techniques. How to further partition these representations into meaningful clusters is the issue that deep clustering addresses. In this work, a deep clustering algorithm based on local semantic information and prototypes, referred to as LSPC, is proposed; it aims at learning a group of representative prototypes. Rather than learning the distinguishing characteristics between different images, more attention is given to the essential characteristics of images that may belong to the same latent category. In the training framework, contrastive learning is skillfully combined with the k-means clustering algorithm, and predictions are transformed into soft assignments for end-to-end training. To enable the model to accurately capture the semantic information shared between images, we mine samples similar to the training samples in the embedding space as local semantic information, which effectively increases the similarity between samples belonging to the same cluster. Experimental results show that our algorithm achieves state-of-the-art performance on several commonly used public datasets, and additional experiments prove that this superior clustering performance also extends to large datasets such as ImageNet.
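
Two of the ingredients mentioned above, soft assignments to k-means prototypes and mining of similar samples as local semantic information, can be sketched as follows. This is a minimal illustration under assumed shapes and names (soft_assignments, mine_local_neighbours), not the authors' LSPC implementation.

```python
import numpy as np

def soft_assignments(embeddings, prototypes, temperature=0.5):
    """Turn distances to cluster prototypes into soft (probabilistic) assignments."""
    # Squared Euclidean distance between every embedding and every prototype.
    d2 = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def mine_local_neighbours(embeddings, k=3):
    """Indices of the k most similar samples for each sample (cosine similarity)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)                        # exclude the sample itself
    return np.argsort(-sim, axis=1)[:, :k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))            # embeddings from the contrastive encoder
protos = rng.normal(size=(2, 16))         # prototypes, e.g. k-means centroids
print(soft_assignments(emb, protos).shape)    # (8, 2)
print(mine_local_neighbours(emb, k=3).shape)  # (8, 3)
```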

Citations: 0
Heterogeneous graph neural networks for fraud detection and explanation in supply chain finance
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-11 | DOI: 10.1016/j.is.2023.102335
Bin Wu, Kuo-Ming Chao, Yinsheng Li

Discovering fraudulent borrowers in a supply chain is a critical mission for financial service providers. The borrowers' transactions in an ongoing business are inspected to support the providers' decision on whether to lend the money. Because a supply chain business involves multiple participants, borrowers may use sophisticated tricks to cheat, making fraud detection challenging. In this work, we propose a multitask learning framework, MultiFraud, for complex fraud detection with reasonable explanations. Heterogeneous, multi-view information around the entities is leveraged in a detection framework based on heterogeneous graph neural networks. MultiFraud enables multiple domains to share embeddings and enhances the modeling capability for fraud detection. The developed explainer provides comprehensive explanations across multiple graphs. Experimental results on five datasets demonstrate the framework's effectiveness in fraud detection and explanation across domains.
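
For readers unfamiliar with heterogeneous graph neural networks, the sketch below shows a single relation-aware message-passing step in plain NumPy. The node types ("borrower", "invoice") and the per-relation projection matrices are invented for illustration; MultiFraud's actual architecture, multitask heads, and explainer are not reproduced here.

```python
import numpy as np

def hetero_gnn_layer(features, edges, weights):
    """One relation-aware aggregation step over a heterogeneous graph.

    features: {node_type: (n_type, d) array}
    edges:    {(src_type, relation, dst_type): list of (src_idx, dst_idx)}
    weights:  {(src_type, relation, dst_type): (d, d) array}, one projection per relation
    Returns updated node features (mean of projected neighbour messages, plus self).
    """
    out = {t: x.copy() for t, x in features.items()}
    counts = {t: np.ones((x.shape[0], 1)) for t, x in features.items()}
    for (src_t, rel, dst_t), pairs in edges.items():
        W = weights[(src_t, rel, dst_t)]
        for s, d in pairs:
            out[dst_t][d] += features[src_t][s] @ W   # relation-specific message
            counts[dst_t][d] += 1
    return {t: np.tanh(out[t] / counts[t]) for t in out}

rng = np.random.default_rng(1)
feats = {"borrower": rng.normal(size=(3, 4)), "invoice": rng.normal(size=(5, 4))}
edges = {("borrower", "issues", "invoice"): [(0, 0), (0, 1), (1, 2), (2, 4)]}
W = {("borrower", "issues", "invoice"): rng.normal(size=(4, 4))}
updated = hetero_gnn_layer(feats, edges, W)
print(updated["invoice"].shape)  # (5, 4)
```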

Citations: 0
Attention-based multi attribute matrix factorization for enhanced recommendation performance
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-09 | DOI: 10.1016/j.is.2023.102334
Dongsoo Jang, Qinglong Li, Chaeyoung Lee, Jaekyeong Kim

In e-commerce platforms, auxiliary information containing several attributes (e.g., price, quality, and brand) can improve recommendation performance. However, previous studies either used a simple combined-embedding approach that did not consider the importance of each attribute embedded in the auxiliary information, or used only some of the attributes, even though user purchasing behavior can vary significantly depending on these attributes. Thus, we propose multi attribute-based matrix factorization (MAMF), which considers the importance of each attribute embedded in various auxiliary information. MAMF obtains more representative and specific attention features of the user and item using a self-attention mechanism. By acquiring attentive representations, MAMF precisely learns high-level interactions between users and items. To evaluate the performance of the proposed MAMF, we conducted extensive experiments using three real-world datasets from amazon.com. The experimental results show that MAMF exhibits excellent recommendation performance compared with various baseline models.
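
A toy version of the attention idea: the item is represented by several attribute embeddings, the user factor attends over them, and the score is a dot product with the attention-weighted item representation. The single-step attention and all names below are illustrative assumptions, not MAMF's exact formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attentive_score(user_vec, item_attr_vecs):
    """Score one user-item pair: attend over the item's attribute embeddings
    (e.g. price, quality, brand), then take a dot product with the user factor."""
    attn_logits = item_attr_vecs @ user_vec      # relevance of each attribute to this user
    attn = softmax(attn_logits)
    item_vec = attn @ item_attr_vecs             # attention-weighted item representation
    return float(user_vec @ item_vec), attn

rng = np.random.default_rng(2)
user = rng.normal(size=8)                 # latent user factor
attrs = rng.normal(size=(3, 8))           # embeddings of 3 item attributes
score, attn = attentive_score(user, attrs)
print(round(score, 3), attn.round(3))     # predicted preference and attribute weights
```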

Citations: 0
An efficient visual exploration approach of geospatial vector big data on the web map
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-09 | DOI: 10.1016/j.is.2023.102333
Zebang Liu, Luo Chen, Mengyu Ma, Anran Yang, Zhinong Zhong, Ning Jing

The visual exploration of geospatial vector data has become an increasingly important part of the management and analysis of geospatial vector big data (GVBD). With the rapid growth of data scale, current visualization technologies struggle to realize efficient visual exploration of GVBD even when parallel distributed computing is adopted. To fill this gap, this paper proposes a visual exploration approach for GVBD on the web map. In this approach, we propose a display-driven computing model and combine it with the traditional data-driven computing method to design an adaptive real-time visualization algorithm. At the same time, we design a pixel-quad-R tree spatial index structure. Finally, we realize multilevel real-time interactive visual exploration of GVBD on a single machine by constructing the index offline to support online computation for visualization; all visualization results can be calculated in real time without occupying an external cache. The experimental results show that the approach outperforms current mainstream visualization methods and obtains visualization results at any zoom level within 0.5 s, so it can be well applied to multilevel real-time interactive visual exploration of billion-scale GVBD.
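
The display-driven idea, computing only what is visible so that per-tile cost is bounded by the screen resolution rather than by the data volume, can be sketched as a simple pixel-level aggregation. The pixel-quad-R tree index itself is not reproduced here; the function below is an illustrative assumption, not the paper's algorithm.

```python
import numpy as np

TILE_SIZE = 256  # pixels per web-map tile edge

def rasterize_points(points, bbox, tile_size=TILE_SIZE):
    """Display-driven aggregation: bin point features into the pixels of one tile.

    points: (n, 2) array of (x, y) coordinates; bbox: (xmin, ymin, xmax, ymax).
    Only one count per pixel is kept, so the output size is bounded by the
    screen resolution regardless of how many features fall inside the tile.
    """
    xmin, ymin, xmax, ymax = bbox
    grid = np.zeros((tile_size, tile_size), dtype=np.int64)
    inside = ((points[:, 0] >= xmin) & (points[:, 0] < xmax) &
              (points[:, 1] >= ymin) & (points[:, 1] < ymax))
    pts = points[inside]
    cols = ((pts[:, 0] - xmin) / (xmax - xmin) * tile_size).astype(int)
    rows = ((pts[:, 1] - ymin) / (ymax - ymin) * tile_size).astype(int)
    np.add.at(grid, (rows, cols), 1)      # pixel-level aggregation
    return grid

rng = np.random.default_rng(3)
pts = rng.uniform(0.0, 1.0, size=(100_000, 2))
tile = rasterize_points(pts, (0.0, 0.0, 1.0, 1.0))
print(tile.sum(), tile.shape)             # 100000 (256, 256)
```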

Citations: 0
Validation set sampling strategies for predictive process monitoring
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-07 | DOI: 10.1016/j.is.2023.102330
Jari Peeperkorn, Seppe vanden Broucke, Jochen De Weerdt

Previous studies investigating the efficacy of long short-term memory (LSTM) recurrent neural networks in predictive process monitoring and their ability to capture the underlying process structure have raised concerns about their limited ability to generalize to unseen behavior. Event logs often fail to capture the full spectrum of behavior permitted by the underlying processes. To overcome these challenges, this study introduces innovative validation set sampling strategies based on control-flow variant-based resampling. These strategies have undergone extensive evaluation to assess their impact on hyperparameter selection and early stopping, resulting in notable enhancements to the generalization capabilities of trained LSTM models. In addition, this study expands the experimental framework to enable accurate interpretation of underlying process models and provide valuable insights. By conducting experiments with event logs representing process models of varying complexities, this research elucidates the effectiveness of the proposed validation strategies. Furthermore, the extended framework facilitates investigations into the influence of event log completeness on the learning quality of predictive process models. The novel validation set sampling strategies proposed in this study facilitate the development of more effective and reliable predictive process models, ultimately bolstering generalization capabilities and improving the understanding of underlying process dynamics.
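
One plausible reading of a control-flow variant-based resampling strategy is sketched below: traces are grouped by their activity sequence (variant), and whole variants are assigned to either the training or the validation split, so the validation set contains behavior unseen during training. This is illustrative and not necessarily one of the exact strategies evaluated in the paper.

```python
import random
from collections import defaultdict

def variant_based_split(log, val_fraction=0.2, seed=42):
    """Split an event log so that whole control-flow variants (distinct activity
    sequences) go either to training or to validation, never to both.

    `log` is a list of traces; each trace is a list of activity labels.
    """
    variants = defaultdict(list)
    for trace in log:
        variants[tuple(trace)].append(trace)      # group traces by their variant
    keys = list(variants)
    random.Random(seed).shuffle(keys)
    n_val = max(1, int(len(keys) * val_fraction))
    val_keys = set(keys[:n_val])
    train = [t for k in keys[n_val:] for t in variants[k]]
    val = [t for k in val_keys for t in variants[k]]
    return train, val

log = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"], ["a", "d"], ["a", "d"]]
train, val = variant_based_split(log, val_fraction=0.34)
print(len(train), len(val))   # every variant ends up on exactly one side
```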

Citations: 0
On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records: Experience from an R&D project
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-04 | DOI: 10.1016/j.is.2023.102323
Witold Andrzejewski, Bartosz Bębel, Paweł Boiński, Robert Wrembel

Data stored in information systems are often erroneous, and duplicate data are one of the typical error types. To discover and handle duplicates, so-called deduplication methods are applied; they are complex and time-costly algorithms. In data deduplication, pairs of records are compared and their similarities are computed. For a given deduplication problem, the challenging tasks are: (1) deciding which similarity measures are the most adequate for the attributes being compared, (2) defining the importance of the attributes being compared, and (3) defining adequate similarity thresholds separating similar from dissimilar pairs of records. In this paper, we summarize our experience gained from a real R&D project run for a large financial institution. In particular, we answer the following three research questions: (1) what are adequate similarity measures for comparing attributes of text data types, (2) what are adequate weights of attributes in the procedure of comparing pairs of records, and (3) what are the similarity thresholds between the classes duplicates, probable duplicates, and non-duplicates? The answers are based on an experimental evaluation of 54 similarity measures for text values. The measures were compared on five different real data sets with different data characteristics and were assessed based on (1) the similarity values they produced for the values being compared and (2) their execution time. Furthermore, we present our method, based on mathematical programming, for computing the weights of attributes and the similarity thresholds for the records being compared. The experimental evaluation of the method and its assessment by experts from the financial institution proved that it is adequate for the deduplication problem at hand. The whole data deduplication pipeline that we have developed has been deployed in the financial institution and runs in their production system, processing batches of over 20 million customer records.
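
A minimal sketch of the scoring step such a pipeline relies on: per-attribute string similarities are combined with attribute weights, and the resulting score is mapped to one of the three classes via two thresholds. The measure (difflib's SequenceMatcher), the weights, and the thresholds below are placeholders; the paper derives weights and thresholds with mathematical programming and evaluates 54 similarity measures.

```python
from difflib import SequenceMatcher

def record_similarity(rec_a, rec_b, weights):
    """Weighted average of per-attribute string similarities for a pair of records."""
    total = sum(weights.values())
    score = 0.0
    for attr, w in weights.items():
        sim = SequenceMatcher(None, rec_a.get(attr, ""), rec_b.get(attr, "")).ratio()
        score += w * sim
    return score / total

def classify_pair(score, t_low=0.6, t_high=0.85):
    """Map a similarity score to one of the three classes used in deduplication."""
    if score >= t_high:
        return "duplicate"
    if score >= t_low:
        return "probable duplicate"
    return "non-duplicate"

a = {"name": "Jan Kowalski", "city": "Poznan", "phone": "600100200"}
b = {"name": "Jan Kowalsky", "city": "Poznań", "phone": "600100200"}
weights = {"name": 0.5, "city": 0.2, "phone": 0.3}   # illustrative, not the tuned weights
s = record_similarity(a, b, weights)
print(round(s, 3), classify_pair(s))
```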

Citations: 0
Foundations and practice of binary process discovery
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-12-01 | DOI: 10.1016/j.is.2023.102339
Tijs Slaats, S. Debois, Christoffer Olling Back, Axel Kjeld Fjelrad Christfort
Citations: 1
Worker similarity-based noise correction for crowdsourcing
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-30 | DOI: 10.1016/j.is.2023.102321
Yufei Hu, Liangxiao Jiang, Wenjun Zhang

Crowdsourcing offers a cost-effective way to obtain multiple noisy labels for each instance by employing multiple crowd workers; label integration is then used to infer an integrated label for the instance. Despite the effectiveness of label integration algorithms, a certain degree of noise always remains in the integrated labels, so noise correction algorithms have been proposed to reduce its impact. However, almost all existing noise correction algorithms focus only on individual workers and ignore the correlations among workers. In this paper, we argue that similar workers have similar annotating skills and tend to be consistent in annotating the same or similar instances. Based on this premise, we propose a novel noise correction algorithm called worker similarity-based noise correction (WSNC). First, WSNC exploits the annotating information of similar workers on similar instances to estimate the quality of each label annotated by each worker on each instance. Then, WSNC re-infers the integrated label of each instance based on the qualities of its multiple noisy labels. Finally, WSNC considers an instance whose re-inferred integrated label differs from its original integrated label to be a noise instance and corrects it. Extensive experiments on a large number of simulated datasets and three real-world crowdsourced datasets verify the effectiveness of WSNC.
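
WSNC estimates label quality from similar workers on similar instances; the sketch below uses a simpler stand-in, global agreement between workers, to show the overall shape of the correction step: estimate per-worker quality, re-infer integrated labels by a quality-weighted vote, and flag the instances whose label changes. All names and the quality estimate are illustrative assumptions, not the authors' algorithm.

```python
from collections import defaultdict

def worker_agreement(labels):
    """Estimate each worker's quality as the rate at which their labels agree
    with the other workers' labels on the same instances.

    `labels` is {instance: {worker: label}}.
    """
    agree, total = defaultdict(int), defaultdict(int)
    for ann in labels.values():
        for w, lab in ann.items():
            others = [l for v, l in ann.items() if v != w]
            if not others:
                continue
            agree[w] += sum(l == lab for l in others)
            total[w] += len(others)
    return {w: agree[w] / total[w] for w in total}

def reinfer(labels, quality):
    """Quality-weighted vote per instance; report instances whose label changed."""
    majority = {i: max(set(a.values()), key=list(a.values()).count)
                for i, a in labels.items()}
    corrected = {}
    for i, ann in labels.items():
        votes = defaultdict(float)
        for w, lab in ann.items():
            votes[lab] += quality.get(w, 0.5)
        corrected[i] = max(votes, key=votes.get)
    flips = [i for i in labels if corrected[i] != majority[i]]
    return corrected, flips

labels = {
    "x1": {"w1": 1, "w2": 1, "w3": 0},
    "x2": {"w1": 0, "w2": 1, "w3": 0},
    "x3": {"w1": 1, "w2": 0, "w3": 0},
}
q = worker_agreement(labels)
print(reinfer(labels, q))
```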

Citations: 0
A novel self-supervised graph model based on counterfactual learning for diversified recommendation
IF 3.7 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2023-11-29 | DOI: 10.1016/j.is.2023.102322
Pu Ji, Minghui Yang, Rui Sun

Consumers' needs are becoming increasingly diverse, which has driven the emergence of diversified recommendation systems. However, existing research on diversified recommendation mostly focuses on constructing objective functions rather than on the root cause that limits diversity, namely imbalanced data distribution. This study considers how to balance the data distribution to improve recommendation diversity. We propose a novel self-supervised graph model based on counterfactual learning (SSG-CL) for diversified recommendation. SSG-CL first distinguishes the dominant and disadvantaged categories for each user based on long-tail theory. It then introduces counterfactual learning to construct an auxiliary view with a relatively balanced distribution over the dominant and disadvantaged categories. Next, we conduct contrastive learning between the user-item interaction graph and the auxiliary view as a self-supervised auxiliary task that aims to improve recommendation diversity. SSG-CL leverages a multitask training strategy to jointly optimize the main accuracy-oriented recommendation task and the self-supervised auxiliary task. Finally, we conduct experimental studies on real-world datasets, and the results indicate good SSG-CL performance in terms of both accuracy and diversity.
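
A contrastive objective between the interaction-graph view and a rebalanced auxiliary view is typically an InfoNCE-style loss over matching node embeddings; a minimal NumPy sketch is given below. This is a generic formulation under assumed names and shapes, not necessarily SSG-CL's exact loss or training procedure.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.2):
    """Contrastive (InfoNCE) loss between two sets of node embeddings, where row i
    of `view_a` and row i of `view_b` are the same user/item seen in the original
    interaction graph and in the rebalanced auxiliary view."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # positives are the matching rows

rng = np.random.default_rng(4)
graph_view = rng.normal(size=(6, 32))                     # embeddings from the interaction graph
aux_view = graph_view + 0.1 * rng.normal(size=(6, 32))    # counterfactual auxiliary view
print(round(info_nce(graph_view, aux_view), 4))
# In a multitask setup, this loss would be added with a weight to the main
# accuracy-oriented recommendation loss.
```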

Citations: 0