Pub Date: 2024-01-20, DOI: 10.1016/j.is.2024.102345
Vladimir Mic , Pavel Zezula
For decades, the success of the similarity search has been based on detailed quantifications of pairwise similarities of objects. Currently, the search features have become much more precise but also bulkier, and the similarity computations are more time-consuming. We show that nearly no precise similarity quantifications are needed to evaluate the k nearest neighbours (kNN) queries that dominate real-life applications. Based on the well-known fact that a selection of the most similar alternative out of several options is a much easier task than deciding the absolute similarity scores, we propose the search based on an epistemologically simpler concept of relational similarity. Having arbitrary objects q, o1, o2 from the search domain, the kNN search is solvable just by the ability to choose the more similar object to q out of o1, o2. To support the filtering efficiency, we also consider a neutral option, i.e., equal similarities of q, o1 and q, o2. We formalise such concept and discuss its advantages with respect to similarity quantifications, namely the efficiency, robustness and scalability with respect to the dataset size. Our pioneering implementation of the relational similarity search for the Euclidean and Cosine spaces demonstrates robust filtering power and efficiency compared to several contemporary techniques.
Title: "Filtering with relational similarity" (Information Systems, Volume 122, Article 102345)
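The core idea, answering a kNN query using only "which of o1, o2 is more similar to q" decisions plus a neutral tie option, can be sketched as below. This is not the paper's implementation: the three-way oracle here secretly computes Euclidean distances as a stand-in, whereas the paper's point is that the oracle need not quantify similarity at all.

```python
from functools import cmp_to_key
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_comparator(q, dist=euclidean):
    """Three-way relational oracle: -1 if o1 is more similar to q,
    1 if o2 is, 0 for the neutral option (equal similarity)."""
    def cmp(o1, o2):
        d1, d2 = dist(q, o1), dist(q, o2)
        return (d1 > d2) - (d1 < d2)
    return cmp

def knn_relational(q, data, k, dist=euclidean):
    """kNN resolved purely by pairwise 'more similar to q' answers."""
    return sorted(data, key=cmp_to_key(make_comparator(q, dist)))[:k]
```

For example, with `data = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]` and `q = (0.9, 1.0)`, the nearest neighbour returned is `(1.0, 1.0)`.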
Pub Date: 2024-01-05, DOI: 10.1016/j.is.2023.102337
Artem Polyvyanyy , Arthur H.M. ter Hofstede , Marcello La Rosa , Chun Ouyang , Anastasiia Pika
Organizations can benefit from the use of practices, techniques, and tools from the area of business process management. Through the focus on processes, they create process models that require management, including support for versioning, refactoring and querying. Querying thus far has primarily focused on structural properties of models rather than on exploiting behavioral properties capturing aspects of model execution. While the latter is more challenging, it is also more effective, especially when models are used for auditing or process automation. The focus of this paper is to overcome the challenges associated with behavioral querying of process models in order to unlock its benefits. The first challenge concerns determining decidability of the building blocks of the query language, which are the possible behavioral relations between process tasks. The second challenge concerns achieving acceptable performance of query evaluation. The evaluation of a query may require expensive checks in all process models, of which there may be thousands. In light of these challenges, this paper proposes a special-purpose programming language, namely Process Query Language (PQL) for behavioral querying of process model collections. The language relies on a set of behavioral predicates between process tasks, whose usefulness has been empirically evaluated with a pool of process model stakeholders. This study resulted in a selection of the predicates to be implemented in PQL, whose decidability has also been formally proven. The computational performance of the language has been extensively evaluated through a set of experiments against two large process model collections.
Title: "Process Query Language: Design, Implementation, and Evaluation" (Information Systems, Volume 122, Article 102337)
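PQL's behavioral predicates are defined over process models, with decidability formally proven in the paper. Purely as an illustration of what evaluating such a predicate over a model collection looks like, and not the paper's semantics, the sketch below abstracts each model as a finite set of traces and checks a hypothetical precedence-style predicate:

```python
def always_precedes(traces, a, b):
    """True iff in every trace containing task b, some a occurs
    before the first occurrence of b (an illustrative predicate)."""
    for trace in traces:
        if b in trace and a not in trace[:trace.index(b)]:
            return False
    return True

def query_collection(models, predicate):
    """models: dict mapping model id -> list of traces.
    Return the ids of models satisfying the predicate."""
    return [mid for mid, traces in models.items() if predicate(traces)]
```

Querying `{"m1": [["a","b"], ["a","c","b"]], "m2": [["b","a"]]}` with `lambda t: always_precedes(t, "a", "b")` selects only `"m1"`.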
Pub Date: 2023-12-23, DOI: 10.1016/j.is.2023.102342
Marco Siino, Ilenia Tinnirello, Marco La Cascia
With the advent of the modern pre-trained Transformers, the text preprocessing has started to be neglected and not specifically addressed in recent NLP literature. However, both from a linguistic and from a computer science point of view, we believe that even when using modern Transformers, text preprocessing can significantly impact on the performance of a classification model. We want to investigate and compare, through this study, how preprocessing impacts on the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to address TC tasks in different domains. In order to assess how much the preprocessing affects classification performance, we apply the three top referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Then, nine machine learning models – including modern Transformers – get the preprocessed text as input. The results presented show that an educated choice on the text preprocessing strategy to employ should be based on the task as well as on the model considered. Outcomes in this survey show that choosing the best preprocessing technique – in place of the worst – can significantly improve accuracy on the classification (up to 25%, as in the case of an XLNet on the IMDB dataset). In some cases, by means of a suitable preprocessing strategy, even a simple Naïve Bayes classifier proved to outperform (i.e., by 2% in accuracy) the best performing Transformer. We found that Transformers and traditional models exhibit a higher impact of the preprocessing on the TC performance. 
Our main findings are: (1) also on modern pre-trained language models, preprocessing can affect performance, depending on the datasets and on the preprocessing technique or combination of techniques used, (2) in some cases, using a proper preprocessing strategy, simple models can outperform Transformers on TC tasks, (3) similar classes of models exhibit similar level of sensitivity to text preprocessing.
Title: "Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers" (Information Systems, Volume 121, Article 102342)
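The abstract does not name the three top-referenced techniques, so the sketch below assumes three common ones (lowercasing, punctuation stripping, stopword removal) merely to illustrate how such techniques compose into a pipeline whose variants can then be compared across classifiers. The stopword list is a minimal placeholder, not a standard resource.

```python
import string

# Minimal illustrative stopword list (a real study would use a standard one).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "and", "or", "of", "to", "in"}

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def preprocess(text, steps):
    """Apply preprocessing steps in order; order matters
    (e.g. lowercase before case-sensitive stopword removal)."""
    for step in steps:
        text = step(text)
    return text
```

Running `preprocess("The movie was GREAT, truly!", [lowercase, strip_punctuation, remove_stopwords])` yields `"movie great truly"`; feeding each variant of the corpus to each classifier is how the per-technique impact is measured.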
Pub Date: 2023-12-22, DOI: 10.1016/j.is.2023.102341
Stephane Marchand-Maillet , Edgar Chávez
The computation of a continuous generative model to describe a finite sample of an infinite metric space can prove challenging and lead to erroneous hypotheses, particularly in high-dimensional spaces. In this paper, we follow a different route and define the Hubness Half Space Partitioning graph (HubHSP graph). By constructing this spanning graph over the dataset, we can capture both the geometrical and statistical properties of the data without resorting to any continuity assumption. Leveraging the classical graph-theoretic apparatus, the HubHSP graph facilitates critical operations, including the creation of a representative sample of the original dataset, without relying on density estimation. This representative subsample is essential for a range of operations, including indexing, visualization, and machine learning tasks such as clustering or inductive learning. With the HubHSP graph, we can bypass the limitations of traditional methods and obtain a holistic understanding of our dataset’s properties, enabling us to unlock its full potential.
Title: "HubHSP graph: Capturing local geometrical and statistical data properties via spanning graphs" (Information Systems, Volume 121, Article 102341)
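The HubHSP construction itself is a specific spanning graph defined in the paper. As a rough illustration of the underlying notion of hubness, and of how it can drive a representative subsample without density estimation, one can rank points by their in-degree in a directed kNN graph; this is a simplification, not the paper's construction.

```python
import math
from collections import Counter

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (excluding i)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def hub_scores(points, k):
    """In-degree of each point in the directed kNN graph: how often it
    appears among other points' k nearest neighbours (its 'hubness')."""
    indeg = Counter()
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            indeg[j] += 1
    return [indeg[i] for i in range(len(points))]

def representative_sample(points, k, m):
    """Pick the m highest-hubness points as a representative subsample."""
    scores = hub_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: -scores[i])
    return [points[i] for i in order[:m]]
```

On a small cluster with a central point, the centre dominates the in-degree counts and is selected first.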
Pub Date: 2023-12-21, DOI: 10.1016/j.is.2023.102340
A. Martínez-Rojas , A. Jiménez-Ramírez , J.G. Enríquez , H.A. Reijers
Robotic Process Automation (RPA) enables subject matter experts to use the graphical user interface as a means to automate and integrate systems. This is a fast method to automate repetitive, mundane tasks. To avoid constructing a software robot from scratch, Task Mining approaches can be used to monitor human behavior through a series of timestamped events, such as mouse clicks and keystrokes. From a so-called User Interface log (UI Log), it is possible to automatically discover the process model behind this behavior. However, when the discovered process model shows different process variants, it is hard to determine what drives a human’s decision to execute one variant over the other. Existing approaches do analyze the UI Log in search for the underlying rules, but neglect what can be seen on the screen. As a result, a major part of the human decision-making remains hidden. To address this gap, this paper describes a Task Mining framework that uses the screenshot of each event in the UI Log as an additional source of information. From such an enriched UI Log, by using image-processing techniques and Machine Learning algorithms, a decision tree is created, which offers a more complete explanation of the human decision-making process. The presented framework can express the decision tree graphically, explicitly identifying which elements in the screenshots are relevant to make the decision. The framework has been evaluated through a case study that involves a process with real-life screenshots. The results indicate a satisfactorily high accuracy of the overall approach, even if a small UI Log is used. The evaluation also identifies challenges for applying the framework in a real-life setting when a high density of interface elements is present.
Title: "A screenshot-based task mining framework for disclosing the drivers behind variable human actions" (Information Systems, Volume 121, Article 102340)
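The decision-tree step can be sketched as below, assuming each UI event has already been enriched with boolean screenshot features (the feature names such as `"checkbox"` are hypothetical) and labelled with the process variant it led to. The framework uses full image processing and ML; this one-level split merely illustrates how a screenshot feature can surface as the driver of a decision.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of variant labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(events, features):
    """events: list of (feature_dict, variant_label) pairs.
    Return the boolean screenshot feature whose yes/no split
    minimises the weighted Gini impurity of the variant labels."""
    best, best_imp = None, float("inf")
    n = len(events)
    for f in features:
        yes = [lab for feats, lab in events if feats.get(f)]
        no = [lab for feats, lab in events if not feats.get(f)]
        imp = (len(yes) * gini(yes) + len(no) * gini(no)) / n
        if imp < best_imp:
            best, best_imp = f, imp
    return best
```

If checking a checkbox perfectly separates variant A from variant B, `best_split` identifies `"checkbox"` as the explanatory screenshot element.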
Pub Date: 2023-12-20, DOI: 10.1016/j.is.2023.102339
Tijs Slaats , Søren Debois , Christoffer Olling Back , Axel Kjeld Fjelrad Christfort
Most contemporary process discovery methods take as inputs only positive examples of process executions, and so they are one-class classification algorithms. However, we have found negative examples to also be available in industry, hence we build on earlier work that treats process discovery as a binary classification problem. This approach opens the door to many well-established methods and metrics from machine learning, in particular to improve the distinction between what should and should not be allowed by the output model. Concretely, we (1) present a verified formalisation of process discovery as a binary classification problem; (2) provide cases with negative examples from industry, including real-life logs; (3) propose the Rejection Miner binary classification procedure, applicable to any process notation that has a suitable syntactic composition operator; (4) implement two concrete binary miners, one outputting Declare patterns, the other Dynamic Condition Response (DCR) graphs; and (5) apply these miners to real world and synthetic logs obtained from our industry partners and the process discovery contest, showing increased output model quality in terms of accuracy and model size.
Title: "Foundations and practice of binary process discovery" (Information Systems, Volume 121, Article 102339)
Pub Date : 2023-12-18DOI: 10.1016/j.is.2023.102338
Matteo Francia , Stefano Rizzi , Patrick Marcel
The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, describe and assess have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the explain operator, whose goal is to provide an answer to the user asking “why does measure m show these values?”; specifically, we consider models that explain m in terms of one or more other measures. We propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the relationship between m and the other cube measures via regression analysis and cross-correlation, and (ii) highlighting the most interesting one. Finally, we test the operator implementation in terms of efficiency and effectiveness.
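The core idea behind explain can be sketched in a few lines. This is a minimal illustration under assumed names, not the paper's implementation: regress the target measure m on each other measure and rank candidates by goodness of fit.

```python
# Minimal sketch of explaining a measure m through other cube measures:
# fit a simple least-squares line from each candidate measure to m and
# rank candidates by R^2. Illustrative only; not the IAM implementation.
import numpy as np

def explain(m, other_measures):
    """Return (measure name, R^2) pairs sorted by explanatory power."""
    y = np.asarray(m, dtype=float)
    scores = {}
    for name, values in other_measures.items():
        x = np.asarray(values, dtype=float)
        slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit
        residuals = y - (slope * x + intercept)
        scores[name] = 1 - residuals.var() / y.var()
    return sorted(scores.items(), key=lambda kv: -kv[1])

m = [2.0, 4.1, 6.2, 7.9]                         # target measure values
candidates = {"discount": [1, 2, 3, 4], "returns": [4, 1, 3, 2]}
print(explain(m, candidates)[0][0])              # 'discount' fits m best
```

The most interesting candidate here is simply the best-fitting one; a full system would also consider lagged cross-correlation, as the abstract mentions.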
{"title":"Explaining cube measures through Intentional Analytics","authors":"Matteo Francia , Stefano Rizzi , Patrick Marcel","doi":"10.1016/j.is.2023.102338","DOIUrl":"10.1016/j.is.2023.102338","url":null,"abstract":"<div><p>The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, <span>describe</span> and <span>assess</span> have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the <span>explain</span> operator, whose goal is to provide an answer to the user asking “why does measure <span><math><mi>m</mi></math></span> show these values?”; specifically, we consider models that explain <span><math><mi>m</mi></math></span> in terms of one or more other measures. We propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the relationship between <span><math><mi>m</mi></math></span> and the other cube measures via regression analysis and cross-correlation, and (ii) highlighting the most interesting one. 
Finally, we test the operator implementation in terms of efficiency and effectiveness.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102338"},"PeriodicalIF":3.7,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0306437923001746/pdfft?md5=23f8fab78fdd903fb8bd9c0b6f06f739&pid=1-s2.0-S0306437923001746-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138742073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-13DOI: 10.1016/j.is.2023.102336
Jun-Fen Chen, Lang Sun, Bo-Jun Xie
In recent years, several prominent contrastive learning algorithms, a family of self-supervised learning methods, have been studied extensively; they can efficiently extract useful feature representations from input images by means of data augmentation techniques. How to further partition these representations into meaningful clusters is the issue that deep clustering addresses. In this work, a deep clustering algorithm based on local semantic information and prototypes, referred to as LSPC, is proposed, which aims at learning a group of representative prototypes. Rather than learning the characteristics that distinguish different images, more attention is given to the essential characteristics shared by images that may belong to the same latent category. In the training framework, contrastive learning is combined with the k-means clustering algorithm, and the prediction is transformed into soft assignments for end-to-end training. To enable the model to accurately capture the semantic information between images, we mine similar samples of the training samples in the embedding space as local semantic information, effectively increasing the similarity between samples belonging to the same cluster. Experimental results show that our algorithm achieves state-of-the-art performance on several commonly used public datasets, and additional experiments show that this superior clustering performance also extends to large datasets such as ImageNet.
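One ingredient described above, turning cluster prediction into soft assignments so that k-means-style clustering becomes end-to-end trainable, can be sketched as a softmax over similarities to prototypes. This is an illustration of the general technique under assumed parameter names, not the LSPC code.

```python
# Sketch of soft cluster assignment: softmax over cosine similarities
# between embeddings and learned prototypes. Illustrative of the
# general technique only; not the LSPC implementation.
import numpy as np

def soft_assign(embeddings, prototypes, temperature=0.1):
    """Return an (n_samples, n_prototypes) matrix of soft assignments."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = e @ p.T / temperature               # cosine sims, sharpened
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy prototypes
x = np.array([[0.9, 0.1], [0.1, 0.8]])           # two toy embeddings
q = soft_assign(x, prototypes)
print(q.argmax(axis=1))                          # [0 1]
```

Because the assignments are differentiable in the embeddings and prototypes, a clustering loss on q can be backpropagated through the encoder, which is what makes the combination with contrastive training possible.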
{"title":"LSPC: Exploring contrastive clustering based on local semantic information and prototype","authors":"Jun-Fen Chen, Lang Sun, Bo-Jun Xie","doi":"10.1016/j.is.2023.102336","DOIUrl":"10.1016/j.is.2023.102336","url":null,"abstract":"<div><p>Recently years, several prominent contrastive learning<span><span> algorithms, a kind of self-supervised learning methods, have been extensively studied that can efficiently extract useful feature representations from input images by means of data augmentation techniques. How to further partition the representations into meaningful clusters is the issue that deep clustering is addressing. In this work, a deep </span>clustering algorithm based on local semantic information and prototype is proposed referring to LSPC that aims at learning a group of representative prototypes. Rather than learning the distinguishing characteristics between different images, more attention is given to the essential characteristics of images that are maybe from a potential category. On the training framework, contrastive learning is skillfully combined with k-means clustering algorithm. The prediction is transformed into soft assignments for end-to-end training. In order to enable the model to accurately capture the semantic information between images, we mine similar samples of training samples in the embedded space as local semantic information to effectively increase the similarity between samples belonging to the same cluster. 
Experimental results show that our algorithm achieves state-of-the-art performance on several commonly used public datasets, and additional experiments prove that this superior clustering performance can also be extended to large datasets such as ImageNet.</span></p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102336"},"PeriodicalIF":3.7,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138628611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}