
Latest publications in Information Systems

Attention-based multi attribute matrix factorization for enhanced recommendation performance
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-09 · DOI: 10.1016/j.is.2023.102334
Dongsoo Jang , Qinglong Li , Chaeyoung Lee , Jaekyeong Kim

In E-commerce platforms, auxiliary information containing several attributes (e.g., price, quality, and brand) can improve recommendation performance. However, previous studies used a simple combined embedding approach that did not consider the importance of each attribute embedded in the auxiliary information, or used only some of the attributes. Yet user purchasing behavior can vary significantly depending on these attributes. Thus, we propose multi attribute-based matrix factorization (MAMF), which considers the importance of each attribute embedded in various auxiliary information. MAMF obtains more representative and specific attention features of users and items using a self-attention mechanism. By acquiring these attentive representations, MAMF precisely learns high-level interactions between users and items. To evaluate the performance of the proposed MAMF, we conducted extensive experiments on three real-world datasets from amazon.com. The experimental results show that MAMF achieves excellent recommendation performance compared with various baseline models.
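The attention-weighted fusion of attribute embeddings described above can be sketched as follows. This is a toy illustration of the general idea, not the authors' MAMF implementation; the function names, dimensions, and the choice of a user vector as the attention query are all assumptions.

```python
# Illustrative sketch: score each attribute embedding (e.g., price, quality,
# brand) against a user vector, turn the scores into softmax attention
# weights, and fuse the attributes by weighted sum.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_fusion(attr_embeddings, user_vec):
    """attr_embeddings: list of per-attribute vectors; user_vec: same dimension."""
    # relevance of each attribute to this user (dot products)
    scores = [sum(a * u for a, u in zip(attr, user_vec)) for attr in attr_embeddings]
    weights = softmax(scores)
    dim = len(user_vec)
    # attention-weighted combination of the attribute embeddings
    fused = [sum(w * attr[d] for w, attr in zip(weights, attr_embeddings))
             for d in range(dim)]
    return fused, weights
```

Attributes whose embeddings align with the user vector receive larger weights, so the fused item representation emphasizes the attributes that matter most to that user.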

Citations: 0
An efficient visual exploration approach of geospatial vector big data on the web map
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-09 · DOI: 10.1016/j.is.2023.102333
Zebang Liu , Luo Chen , Mengyu Ma , Anran Yang , Zhinong Zhong , Ning Jing

The visual exploration of geospatial vector data has become an increasingly important part of the management and analysis of geospatial vector big data (GVBD). With the rapid growth of data scale, current visualization technologies struggle to support efficient visual exploration of GVBD, even when parallel distributed computing is adopted. To fill this gap, this paper proposes a visual exploration approach for GVBD on the web map. In this approach, we propose a display-driven computing model and combine it with the traditional data-driven computing method to design an adaptive real-time visualization algorithm. We also design a pixel-quad-R tree spatial index structure. Finally, we realize multilevel real-time interactive visual exploration of GVBD on a single machine by constructing the index offline to support online computation for visualization; all visualization results can be computed in real time without occupying an external cache. The experimental results show that the approach outperforms current mainstream visualization methods and obtains visualization results at any zoom level within 0.5 s, so it can be readily applied to multilevel real-time interactive visual exploration of billion-scale GVBD.
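The display-driven idea can be illustrated with a minimal sketch: aggregate arbitrarily many points into the fixed pixel grid of the current viewport, so the work per frame is bounded by screen resolution rather than by data volume. This is not the paper's pixel-quad-R tree; the function and parameter names are assumptions for illustration.

```python
# Illustrative sketch of display-driven aggregation: count points per pixel
# of the current viewport instead of drawing every geometry individually.
def rasterize_counts(points, bbox, width, height):
    """points: iterable of (x, y); bbox: (minx, miny, maxx, maxy);
    returns a height x width grid of point counts per pixel."""
    minx, miny, maxx, maxy = bbox
    grid = [[0] * width for _ in range(height)]
    sx = width / (maxx - minx)   # world-to-pixel scale in x
    sy = height / (maxy - miny)  # world-to-pixel scale in y
    for x, y in points:
        if minx <= x < maxx and miny <= y < maxy:
            grid[int((y - miny) * sy)][int((x - minx) * sx)] += 1
    return grid
```

Because the grid has a fixed size per zoom level, the cost of producing one view is independent of how many billions of records sit behind it; an index (such as the paper's pixel-quad-R tree) then serves the per-pixel aggregates quickly.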

Citations: 0
Validation set sampling strategies for predictive process monitoring
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-07 · DOI: 10.1016/j.is.2023.102330
Jari Peeperkorn , Seppe vanden Broucke , Jochen De Weerdt

Previous studies investigating the efficacy of long short-term memory (LSTM) recurrent neural networks in predictive process monitoring and their ability to capture the underlying process structure have raised concerns about their limited ability to generalize to unseen behavior. Event logs often fail to capture the full spectrum of behavior permitted by the underlying processes. To overcome these challenges, this study introduces innovative validation set sampling strategies based on control-flow variant-based resampling. These strategies have undergone extensive evaluation to assess their impact on hyperparameter selection and early stopping, resulting in notable enhancements to the generalization capabilities of trained LSTM models. In addition, this study expands the experimental framework to enable accurate interpretation of underlying process models and provide valuable insights. By conducting experiments with event logs representing process models of varying complexities, this research elucidates the effectiveness of the proposed validation strategies. Furthermore, the extended framework facilitates investigations into the influence of event log completeness on the learning quality of predictive process models. The novel validation set sampling strategies proposed in this study facilitate the development of more effective and reliable predictive process models, ultimately bolstering generalization capabilities and improving the understanding of underlying process dynamics.
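The grouping principle behind a control-flow variant-based validation split can be sketched as follows: traces are grouped by their activity sequence (variant) and whole variants are held out, so validation measures generalization to unseen behavior. The exact resampling strategies in the paper differ; the fraction, seed, and function name below are illustrative assumptions.

```python
# Sketch: hold out entire control-flow variants (unique activity sequences)
# for validation, rather than sampling individual traces.
import random
from collections import defaultdict

def variant_based_split(log, val_fraction=0.2, seed=42):
    """log: list of traces, each a list of activity labels."""
    by_variant = defaultdict(list)
    for trace in log:
        by_variant[tuple(trace)].append(trace)
    variants = sorted(by_variant)          # deterministic order before shuffling
    random.Random(seed).shuffle(variants)
    n_val = max(1, int(len(variants) * val_fraction))
    train = [t for v in variants[n_val:] for t in by_variant[v]]
    val = [t for v in variants[:n_val] for t in by_variant[v]]
    return train, val
```

Because no variant appears on both sides of the split, a model that merely memorizes training variants scores poorly on validation, which is exactly the signal needed for hyperparameter selection and early stopping aimed at generalization.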

Citations: 0
On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-04 · DOI: 10.1016/j.is.2023.102323
Witold Andrzejewski , Bartosz Bębel , Paweł Boiński , Robert Wrembel

Data stored in information systems are often erroneous, and duplicate data are one of the typical error types. To discover and handle duplicates, so-called deduplication methods are applied; these are complex and time-costly algorithms. In data deduplication, pairs of records are compared and their similarities are computed. For a given deduplication problem, the challenging tasks are: (1) deciding which similarity measures are the most adequate for the attributes being compared, (2) defining the importance of the attributes being compared, and (3) defining adequate similarity thresholds between similar and dissimilar pairs of records. In this paper, we summarize the experience gained from a real R&D project run for a large financial institution. In particular, we answer the following three research questions: (1) what are adequate similarity measures for comparing attributes of text data types, (2) what are adequate weights of attributes in the procedure of comparing pairs of records, and (3) what are the similarity thresholds between the classes duplicates, probable duplicates, and non-duplicates? The answers are based on an experimental evaluation of 54 similarity measures for text values, compared on five real data sets with different data characteristics. The similarity measures were assessed based on: (1) the similarity values they produced for the values being compared and (2) their execution time. Furthermore, we present our method, based on mathematical programming, for computing attribute weights and similarity thresholds for the records being compared. The experimental evaluation of the method and its assessment by experts from the financial institution proved that it is adequate to the deduplication problem at hand. The whole data deduplication pipeline that we have developed has been deployed in the financial institution and runs in their production system, processing batches of over 20 million customer records.
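The pair-comparison step the abstract describes can be sketched as follows: per-attribute string similarities (here the stdlib `difflib` ratio, one of many possible measures) are combined with attribute weights, and two thresholds separate duplicates, probable duplicates, and non-duplicates. The weights and thresholds below are illustrative placeholders, not the values tuned in the project.

```python
# Sketch: weighted record similarity plus threshold-based classification
# of a record pair into duplicate / probably duplicate / non-duplicate.
from difflib import SequenceMatcher

def attr_similarity(a, b):
    # One candidate text similarity measure among the 54 evaluated kinds.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights):
    """r1, r2: dicts of attribute -> string; weights: attribute -> weight."""
    total = sum(weights.values())
    return sum(w * attr_similarity(r1[k], r2[k]) for k, w in weights.items()) / total

def classify_pair(score, dup_threshold=0.9, probable_threshold=0.7):
    if score >= dup_threshold:
        return "duplicate"
    if score >= probable_threshold:
        return "probably duplicate"
    return "non-duplicate"
```

The paper's contribution is precisely choosing the measure per attribute, the weights, and the two thresholds; the sketch only fixes the shape of the computation they parameterize.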

Citations: 0
Foundations and practice of binary process discovery
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-12-01 · DOI: 10.1016/j.is.2023.102339
Tijs Slaats, S. Debois, Christoffer Olling Back, Axel Kjeld Fjelrad Christfort
Citations: 1
Worker similarity-based noise correction for crowdsourcing
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-11-30 · DOI: 10.1016/j.is.2023.102321
Yufei Hu , Liangxiao Jiang , Wenjun Zhang

Crowdsourcing offers a cost-effective way to obtain multiple noisy labels for each instance by employing multiple crowd workers; label integration is then used to infer an integrated label. Despite the effectiveness of label integration algorithms, a certain degree of noise always remains in the integrated labels, so noise correction algorithms have been proposed to reduce its impact. However, almost all existing noise correction algorithms focus only on individual workers and ignore the correlations among workers. In this paper, we argue that similar workers have similar annotating skills and tend to be consistent when annotating the same or similar instances. Based on this premise, we propose a novel noise correction algorithm called worker similarity-based noise correction (WSNC). First, WSNC exploits the annotating information of similar workers on similar instances to estimate the quality of each label annotated by each worker on each instance. Then, WSNC re-infers the integrated label of each instance based on the qualities of its multiple noisy labels. Finally, WSNC treats any instance whose re-inferred integrated label differs from its original integrated label as a noise instance and corrects it. Extensive experiments on a large number of simulated datasets and three real-world crowdsourced datasets verify the effectiveness of WSNC.
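The re-inference step can be sketched as a quality-weighted vote: each worker's noisy label is weighted by an estimated quality score, and the label with the highest weighted vote wins. WSNC estimates these qualities from similar workers on similar instances; here they are supplied directly for illustration, and all names are assumptions.

```python
# Sketch: re-infer an instance's integrated label from its noisy labels,
# weighting each worker's vote by an estimated quality score.
from collections import defaultdict

def reinfer_label(noisy_labels, worker_quality, default_quality=0.5):
    """noisy_labels: dict worker -> label; worker_quality: dict worker -> [0, 1]."""
    votes = defaultdict(float)
    for worker, label in noisy_labels.items():
        votes[label] += worker_quality.get(worker, default_quality)
    return max(votes, key=votes.get)
```

With uniform qualities this degenerates to majority voting; with informative qualities, a single high-quality worker can outvote several low-quality ones, which is what allows noise in the original integrated label to be detected and corrected.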

Citations: 0
A novel self-supervised graph model based on counterfactual learning for diversified recommendation
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-11-29 · DOI: 10.1016/j.is.2023.102322
Pu Ji, Minghui Yang, Rui Sun

Consumers’ needs show a trend toward diversification, which has driven the emergence of diversified recommendation systems. However, existing diversified recommendation research mostly focuses on objective function construction rather than on the root cause that limits diversity, namely imbalanced data distribution. This study considers how to balance data distribution to improve recommendation diversity. We propose a novel self-supervised graph model based on counterfactual learning (SSG-CL) for diversified recommendation. SSG-CL first distinguishes the dominant and disadvantaged categories for each user based on long-tail theory. It then introduces counterfactual learning to construct an auxiliary view with a relatively balanced distribution across the dominant and disadvantaged categories. Next, we conduct contrastive learning between the user–item interaction graph and the auxiliary view as a self-supervised auxiliary task that aims to improve recommendation diversity. SSG-CL then leverages a multitask training strategy to jointly optimize the main accuracy-oriented recommendation task and the self-supervised auxiliary task. Finally, we conduct experiments on real-world datasets, and the results indicate good SSG-CL performance in terms of both accuracy and diversity.
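One ingredient of the approach, splitting a user's interacted item categories into dominant (head) and disadvantaged (tail) groups by interaction counts, can be sketched as a simple long-tail cut. The head fraction and function name are illustrative assumptions, not values from the paper.

```python
# Sketch: rank a user's categories by interaction count and cut the ranking
# into a dominant head and a disadvantaged tail.
from collections import Counter

def split_head_tail(category_interactions, head_fraction=0.2):
    """category_interactions: list of category labels, one per interaction."""
    ranked = [c for c, _ in Counter(category_interactions).most_common()]
    n_head = max(1, int(len(ranked) * head_fraction))
    return ranked[:n_head], ranked[n_head:]
```

The counterfactual auxiliary view then rebalances exposure between the two groups before the contrastive objective is applied.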

Citations: 0
An empirical evaluation of unsupervised event log abstraction techniques in process mining
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-11-25 · DOI: 10.1016/j.is.2023.102320
Greg Van Houdt , Massimiliano de Leoni , Niels Martin , Benoît Depaire

These days, businesses keep track of more and more data in their information systems. Moreover, this data is more fine-grained than ever, tracking clicks and database mutations at the lowest level possible. Faced with such data, process discovery often struggles to produce comprehensible models, instead returning spaghetti-like models. Such finely granulated models do not fit the business user’s mental model of the process under investigation. To tackle this, event log abstraction (ELA) techniques can transform the underlying event log to a higher granularity level. However, insights into the performance of these techniques are lacking in the literature, as results are based only on small-scale experiments and are often inconclusive. Against this background, this paper evaluates state-of-the-art abstraction techniques on 400 event logs. Results show that ELA sacrifices fitness for precision, but complexity reductions depend heavily on the ELA technique used. This study also illustrates the importance of a larger-scale experiment, as sub-sampling of results leads to contradictory conclusions.
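What an abstraction technique does at trace level can be sketched as follows: map low-level events to higher-level activities and collapse consecutive repeats, raising the log's granularity. Real ELA techniques learn such a mapping unsupervised; here it is given explicitly, and the names are illustrative assumptions.

```python
# Sketch: abstract a low-level trace to a high-level one by applying an
# event-to-activity mapping and merging consecutive identical activities.
def abstract_trace(trace, mapping):
    """trace: list of low-level event labels; mapping: low-level -> high-level."""
    abstracted = []
    for event in trace:
        high = mapping.get(event, event)   # unmapped events pass through
        if not abstracted or abstracted[-1] != high:
            abstracted.append(high)
    return abstracted
```

Shorter, coarser traces like these are what make the subsequently discovered models readable, at the cost of the fitness-for-precision trade-off the evaluation quantifies.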

Citations: 0
Big data analytics deep learning techniques and applications: A survey
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-11-21 · DOI: 10.1016/j.is.2023.102318
Hend A. Selmy , Hoda K. Mohamed , Walaa Medhat

Deep learning (DL), one of the most active machine learning research fields, has achieved great success in numerous scientific and technological disciplines, including speech recognition, image classification, language processing, and big data analytics. Big data analytics (BDA), where raw data is often unlabeled or uncategorized, can greatly benefit from DL because of its ability to analyze and learn from enormous amounts of unstructured data. This survey paper provides a comprehensive overview of state-of-the-art DL techniques applied in BDA. It aims to illustrate the significance of DL and its taxonomy and to detail the basic techniques used in BDA. It also explains the DL techniques used in big IoT data applications as well as their various complexities and challenges. The survey presents various real-world data-intensive applications where DL techniques can be applied, concentrating on the DL techniques appropriate to the BDA type in each application domain. Additionally, the survey examines DL benchmarked frameworks used in BDA and reviews the available benchmark datasets, besides analyzing the strengths and limitations of each DL technique and their suitable applications. Further, a comparative analysis is presented by comparing existing approaches to the DL methods used in BDA. Finally, the challenges of DL modeling and future directions are discussed.

Document structure-driven investigative information retrieval
IF 3.7 · CAS Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2023-11-19 · DOI: 10.1016/j.is.2023.102315
Tuomas Ketola, Thomas Roelleke

Data-driven investigations are increasingly dealing with non-moderated, non-standard, and even manipulated information. Whether the field in question is journalism, law enforcement, or insurance fraud, it is becoming more and more difficult for investigators to verify the outcomes of various black-box systems. To address this need for discovery methods that can be used for verification, we introduce a methodology for document structure-driven investigative information retrieval (InvIR). InvIR is defined as a subtask of exploratory IR, where transparency and reasoning take centre stage. The aim of InvIR is to facilitate the verification and discovery of facts from data and the communication of those facts to others. From a technical perspective, the methodology applies recent work from structured document retrieval (SDR) concerned with formal retrieval constraints and information content-based field weighting (ICFW). Using ICFW, the paper establishes the concept of relevance structures to describe the document structure-based relevance of documents. These contexts are then used to help the user navigate during their discovery process and to rank entities of interest. The proposed methodology is evaluated using a prototype search system called Relevance Structure-based Entity Ranker (RSER) in order to demonstrate its feasibility. This methodology represents an interesting and important research direction in a world where transparency is becoming more vital than ever.

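The InvIR abstract above mentions information content-based field weighting (ICFW). As a rough illustration of the general idea — weighting document fields by how informative their term distributions are — the sketch below scores each field by the Shannon entropy of the terms it contains across a tiny corpus. The field names, toy corpus, and entropy-based score are all illustrative assumptions, not the paper's actual ICFW formulation.

```python
import math
from collections import Counter

def field_information_scores(docs):
    """Score each document field by the Shannon entropy (average
    self-information, in bits) of its term distribution over the corpus.
    A rough stand-in for information content-based field weighting."""
    # Collect term frequencies per field across all documents.
    field_counts = {}
    for doc in docs:
        for field, text in doc.items():
            field_counts.setdefault(field, Counter()).update(text.lower().split())
    scores = {}
    for field, counts in field_counts.items():
        total = sum(counts.values())
        scores[field] = sum(-(n / total) * math.log2(n / total)
                            for n in counts.values())
    return scores

docs = [
    {"title": "fraud detection", "body": "the the the insurance claim fraud"},
    {"title": "entity ranking", "body": "the the ranking of entities in documents"},
]
scores = field_information_scores(docs)
```

A real system would combine such per-field weights with a retrieval model's term scores; here the point is only that fields with flatter, more varied term distributions carry more information per term than fields dominated by a few stopwords.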