Pub Date: 2024-01-20, DOI: 10.1016/j.is.2024.102345
Vladimir Mic , Pavel Zezula
For decades, the success of the similarity search has been based on detailed quantifications of pairwise similarities of objects. Currently, the search features have become much more precise but also bulkier, and the similarity computations are more time-consuming. We show that nearly no precise similarity quantifications are needed to evaluate the k nearest neighbours (kNN) queries that dominate real-life applications. Based on the well-known fact that a selection of the most similar alternative out of several options is a much easier task than deciding the absolute similarity scores, we propose the search based on an epistemologically simpler concept of relational similarity. Having arbitrary objects q, o1, o2 from the search domain, the kNN search is solvable just by the ability to choose the more similar object to q out of o1, o2. To support the filtering efficiency, we also consider a neutral option, i.e., equal similarities of q, o1 and q, o2. We formalise such concept and discuss its advantages with respect to similarity quantifications, namely the efficiency, robustness and scalability with respect to the dataset size. Our pioneering implementation of the relational similarity search for the Euclidean and Cosine spaces demonstrates robust filtering power and efficiency compared to several contemporary techniques.
Title: "Filtering with relational similarity" (Information Systems, Volume 122, Article 102345)
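The core idea, answering a kNN query using only "which of o1, o2 is more similar to q" decisions plus a neutral tie option, can be sketched as below. This is not the paper's implementation: the three-way oracle here secretly computes Euclidean distances as a stand-in, whereas the paper's point is that the oracle need not quantify similarity at all.

```python
from functools import cmp_to_key
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_comparator(q, dist=euclidean):
    """Three-way relational oracle: -1 if o1 is more similar to q,
    1 if o2 is, 0 for the neutral option (equal similarity)."""
    def cmp(o1, o2):
        d1, d2 = dist(q, o1), dist(q, o2)
        return (d1 > d2) - (d1 < d2)
    return cmp

def knn_relational(q, data, k, dist=euclidean):
    """kNN resolved purely by pairwise 'more similar to q' answers."""
    return sorted(data, key=cmp_to_key(make_comparator(q, dist)))[:k]
```

For example, with `data = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]` and `q = (0.9, 1.0)`, the nearest neighbour returned is `(1.0, 1.0)`.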
Pub Date: 2024-01-05, DOI: 10.1016/j.is.2023.102337
Artem Polyvyanyy , Arthur H.M. ter Hofstede , Marcello La Rosa , Chun Ouyang , Anastasiia Pika
Organizations can benefit from the use of practices, techniques, and tools from the area of business process management. Through the focus on processes, they create process models that require management, including support for versioning, refactoring and querying. Querying thus far has primarily focused on structural properties of models rather than on exploiting behavioral properties capturing aspects of model execution. While the latter is more challenging, it is also more effective, especially when models are used for auditing or process automation. The focus of this paper is to overcome the challenges associated with behavioral querying of process models in order to unlock its benefits. The first challenge concerns determining decidability of the building blocks of the query language, which are the possible behavioral relations between process tasks. The second challenge concerns achieving acceptable performance of query evaluation. The evaluation of a query may require expensive checks in all process models, of which there may be thousands. In light of these challenges, this paper proposes a special-purpose programming language, namely Process Query Language (PQL) for behavioral querying of process model collections. The language relies on a set of behavioral predicates between process tasks, whose usefulness has been empirically evaluated with a pool of process model stakeholders. This study resulted in a selection of the predicates to be implemented in PQL, whose decidability has also been formally proven. The computational performance of the language has been extensively evaluated through a set of experiments against two large process model collections.
Title: "Process Query Language: Design, Implementation, and Evaluation" (Information Systems, Volume 122, Article 102337)
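PQL's behavioral predicates are defined over process models, with decidability formally proven in the paper. Purely as an illustration of what evaluating such a predicate over a model collection looks like, and not the paper's semantics, the sketch below abstracts each model as a finite set of traces and checks a hypothetical precedence-style predicate:

```python
def always_precedes(traces, a, b):
    """True iff in every trace containing task b, some a occurs
    before the first occurrence of b (an illustrative predicate)."""
    for trace in traces:
        if b in trace and a not in trace[:trace.index(b)]:
            return False
    return True

def query_collection(models, predicate):
    """models: dict mapping model id -> list of traces.
    Return the ids of models satisfying the predicate."""
    return [mid for mid, traces in models.items() if predicate(traces)]
```

Querying `{"m1": [["a","b"], ["a","c","b"]], "m2": [["b","a"]]}` with `lambda t: always_precedes(t, "a", "b")` selects only `"m1"`.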
Pub Date: 2023-12-23, DOI: 10.1016/j.is.2023.102342
Marco Siino, Ilenia Tinnirello, Marco La Cascia
With the advent of the modern pre-trained Transformers, the text preprocessing has started to be neglected and not specifically addressed in recent NLP literature. However, both from a linguistic and from a computer science point of view, we believe that even when using modern Transformers, text preprocessing can significantly impact on the performance of a classification model. We want to investigate and compare, through this study, how preprocessing impacts on the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to address TC tasks in different domains. In order to assess how much the preprocessing affects classification performance, we apply the three top referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Then, nine machine learning models – including modern Transformers – get the preprocessed text as input. The results presented show that an educated choice on the text preprocessing strategy to employ should be based on the task as well as on the model considered. Outcomes in this survey show that choosing the best preprocessing technique – in place of the worst – can significantly improve accuracy on the classification (up to 25%, as in the case of an XLNet on the IMDB dataset). In some cases, by means of a suitable preprocessing strategy, even a simple Naïve Bayes classifier proved to outperform (i.e., by 2% in accuracy) the best performing Transformer. We found that Transformers and traditional models exhibit a higher impact of the preprocessing on the TC performance. 
Our main findings are: (1) also on modern pre-trained language models, preprocessing can affect performance, depending on the datasets and on the preprocessing technique or combination of techniques used, (2) in some cases, using a proper preprocessing strategy, simple models can outperform Transformers on TC tasks, (3) similar classes of models exhibit similar level of sensitivity to text preprocessing.
Title: "Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers" (Information Systems, Volume 121, Article 102342)
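The abstract does not name the three top-referenced techniques, so the sketch below assumes three common ones (lowercasing, punctuation stripping, stopword removal) merely to illustrate how such techniques compose into a pipeline whose variants can then be compared across classifiers. The stopword list is a minimal placeholder, not a standard resource.

```python
import string

# Minimal illustrative stopword list (a real study would use a standard one).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "and", "or", "of", "to", "in"}

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def preprocess(text, steps):
    """Apply preprocessing steps in order; order matters
    (e.g. lowercase before case-sensitive stopword removal)."""
    for step in steps:
        text = step(text)
    return text
```

Running `preprocess("The movie was GREAT, truly!", [lowercase, strip_punctuation, remove_stopwords])` yields `"movie great truly"`; feeding each variant of the corpus to each classifier is how the per-technique impact is measured.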
Pub Date: 2023-12-22, DOI: 10.1016/j.is.2023.102341
Stephane Marchand-Maillet , Edgar Chávez
The computation of a continuous generative model to describe a finite sample of an infinite metric space can prove challenging and lead to erroneous hypotheses, particularly in high-dimensional spaces. In this paper, we follow a different route and define the Hubness Half Space Partitioning graph (HubHSP graph). By constructing this spanning graph over the dataset, we can capture both the geometrical and statistical properties of the data without resorting to any continuity assumption. Leveraging the classical graph-theoretic apparatus, the HubHSP graph facilitates critical operations, including the creation of a representative sample of the original dataset, without relying on density estimation. This representative subsample is essential for a range of operations, including indexing, visualization, and machine learning tasks such as clustering or inductive learning. With the HubHSP graph, we can bypass the limitations of traditional methods and obtain a holistic understanding of our dataset’s properties, enabling us to unlock its full potential.
Title: "HubHSP graph: Capturing local geometrical and statistical data properties via spanning graphs" (Information Systems, Volume 121, Article 102341)
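The HubHSP construction itself is a specific spanning graph defined in the paper. As a rough illustration of the underlying notion of hubness, and of how it can drive a representative subsample without density estimation, one can rank points by their in-degree in a directed kNN graph; this is a simplification, not the paper's construction.

```python
import math
from collections import Counter

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (excluding i)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def hub_scores(points, k):
    """In-degree of each point in the directed kNN graph: how often it
    appears among other points' k nearest neighbours (its 'hubness')."""
    indeg = Counter()
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            indeg[j] += 1
    return [indeg[i] for i in range(len(points))]

def representative_sample(points, k, m):
    """Pick the m highest-hubness points as a representative subsample."""
    scores = hub_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: -scores[i])
    return [points[i] for i in order[:m]]
```

On a small cluster with a central point, the centre dominates the in-degree counts and is selected first.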
Pub Date: 2023-12-21, DOI: 10.1016/j.is.2023.102340
A. Martínez-Rojas , A. Jiménez-Ramírez , J.G. Enríquez , H.A. Reijers
Robotic Process Automation (RPA) enables subject matter experts to use the graphical user interface as a means to automate and integrate systems. This is a fast method to automate repetitive, mundane tasks. To avoid constructing a software robot from scratch, Task Mining approaches can be used to monitor human behavior through a series of timestamped events, such as mouse clicks and keystrokes. From a so-called User Interface log (UI Log), it is possible to automatically discover the process model behind this behavior. However, when the discovered process model shows different process variants, it is hard to determine what drives a human’s decision to execute one variant over the other. Existing approaches do analyze the UI Log in search for the underlying rules, but neglect what can be seen on the screen. As a result, a major part of the human decision-making remains hidden. To address this gap, this paper describes a Task Mining framework that uses the screenshot of each event in the UI Log as an additional source of information. From such an enriched UI Log, by using image-processing techniques and Machine Learning algorithms, a decision tree is created, which offers a more complete explanation of the human decision-making process. The presented framework can express the decision tree graphically, explicitly identifying which elements in the screenshots are relevant to make the decision. The framework has been evaluated through a case study that involves a process with real-life screenshots. The results indicate a satisfactorily high accuracy of the overall approach, even if a small UI Log is used. The evaluation also identifies challenges for applying the framework in a real-life setting when a high density of interface elements is present.
Title: "A screenshot-based task mining framework for disclosing the drivers behind variable human actions" (Information Systems, Volume 121, Article 102340)
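The decision-tree step can be sketched as below, assuming each UI event has already been enriched with boolean screenshot features (the feature names such as `"checkbox"` are hypothetical) and labelled with the process variant it led to. The framework uses full image processing and ML; this one-level split merely illustrates how a screenshot feature can surface as the driver of a decision.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of variant labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(events, features):
    """events: list of (feature_dict, variant_label) pairs.
    Return the boolean screenshot feature whose yes/no split
    minimises the weighted Gini impurity of the variant labels."""
    best, best_imp = None, float("inf")
    n = len(events)
    for f in features:
        yes = [lab for feats, lab in events if feats.get(f)]
        no = [lab for feats, lab in events if not feats.get(f)]
        imp = (len(yes) * gini(yes) + len(no) * gini(no)) / n
        if imp < best_imp:
            best, best_imp = f, imp
    return best
```

If checking a checkbox perfectly separates variant A from variant B, `best_split` identifies `"checkbox"` as the explanatory screenshot element.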
Pub Date: 2023-12-20, DOI: 10.1016/j.is.2023.102339
Tijs Slaats , Søren Debois , Christoffer Olling Back , Axel Kjeld Fjelrad Christfort
Most contemporary process discovery methods take as inputs only positive examples of process executions, and so they are one-class classification algorithms. However, we have found negative examples to also be available in industry, hence we build on earlier work that treats process discovery as a binary classification problem. This approach opens the door to many well-established methods and metrics from machine learning, in particular to improve the distinction between what should and should not be allowed by the output model. Concretely, we (1) present a verified formalisation of process discovery as a binary classification problem; (2) provide cases with negative examples from industry, including real-life logs; (3) propose the Rejection Miner binary classification procedure, applicable to any process notation that has a suitable syntactic composition operator; (4) implement two concrete binary miners, one outputting Declare patterns, the other Dynamic Condition Response (DCR) graphs; and (5) apply these miners to real world and synthetic logs obtained from our industry partners and the process discovery contest, showing increased output model quality in terms of accuracy and model size.
Title: "Foundations and practice of binary process discovery" (Information Systems, Volume 121, Article 102339)
Pub Date : 2023-12-18DOI: 10.1016/j.is.2023.102338
Matteo Francia , Stefano Rizzi , Patrick Marcel
The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, describe and assess have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the explain operator, whose goal is to provide an answer to the user asking “why does measure m show these values?”; specifically, we consider models that explain m in terms of one or more other measures. We propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the relationship between m and the other cube measures via regression analysis and cross-correlation, and (ii) highlighting the most interesting one. Finally, we test the operator implementation in terms of efficiency and effectiveness.
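The core idea behind explain can be sketched in a few lines. This is a minimal illustration under assumed names, not the paper's implementation: regress the target measure m on each other measure and rank candidates by goodness of fit.

```python
# Minimal sketch of explaining a measure m through other cube measures:
# fit a simple least-squares line from each candidate measure to m and
# rank candidates by R^2. Illustrative only; not the IAM implementation.
import numpy as np

def explain(m, other_measures):
    """Return (measure name, R^2) pairs sorted by explanatory power."""
    y = np.asarray(m, dtype=float)
    scores = {}
    for name, values in other_measures.items():
        x = np.asarray(values, dtype=float)
        slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit
        residuals = y - (slope * x + intercept)
        scores[name] = 1 - residuals.var() / y.var()
    return sorted(scores.items(), key=lambda kv: -kv[1])

m = [2.0, 4.1, 6.2, 7.9]                         # target measure values
candidates = {"discount": [1, 2, 3, 4], "returns": [4, 1, 3, 2]}
print(explain(m, candidates)[0][0])              # 'discount' fits m best
```

The most interesting candidate here is simply the best-fitting one; a full system would also consider lagged cross-correlation, as the abstract mentions.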
{"title":"Explaining cube measures through Intentional Analytics","authors":"Matteo Francia , Stefano Rizzi , Patrick Marcel","doi":"10.1016/j.is.2023.102338","DOIUrl":"10.1016/j.is.2023.102338","url":null,"abstract":"<div><p>The Intentional Analytics Model (IAM) has been devised to couple OLAP and analytics by (i) letting users express their analysis intentions on multidimensional data cubes and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of models (e.g., correlations). Five intention operators were proposed to this end; of these, <span>describe</span> and <span>assess</span> have been investigated in previous papers. In this work we enrich the IAM picture by focusing on the <span>explain</span> operator, whose goal is to provide an answer to the user asking “why does measure <span><math><mi>m</mi></math></span> show these values?”; specifically, we consider models that explain <span><math><mi>m</mi></math></span> in terms of one or more other measures. We propose a syntax for the operator and discuss how enhanced cubes are built by (i) finding the relationship between <span><math><mi>m</mi></math></span> and the other cube measures via regression analysis and cross-correlation, and (ii) highlighting the most interesting one. 
Finally, we test the operator implementation in terms of efficiency and effectiveness.</p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102338"},"PeriodicalIF":3.7,"publicationDate":"2023-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0306437923001746/pdfft?md5=23f8fab78fdd903fb8bd9c0b6f06f739&pid=1-s2.0-S0306437923001746-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138742073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-12-13DOI: 10.1016/j.is.2023.102336
Jun-Fen Chen, Lang Sun, Bo-Jun Xie
In recent years, several prominent contrastive learning algorithms, a family of self-supervised learning methods, have been studied extensively; they can efficiently extract useful feature representations from input images by means of data augmentation techniques. How to further partition these representations into meaningful clusters is the issue that deep clustering addresses. In this work, a deep clustering algorithm based on local semantic information and prototypes, referred to as LSPC, is proposed, which aims at learning a group of representative prototypes. Rather than learning the characteristics that distinguish different images, more attention is given to the essential characteristics shared by images that may belong to the same latent category. In the training framework, contrastive learning is combined with the k-means clustering algorithm, and the prediction is transformed into soft assignments for end-to-end training. To enable the model to accurately capture the semantic information between images, we mine similar samples of the training samples in the embedding space as local semantic information, effectively increasing the similarity between samples belonging to the same cluster. Experimental results show that our algorithm achieves state-of-the-art performance on several commonly used public datasets, and additional experiments show that this superior clustering performance also extends to large datasets such as ImageNet.
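One ingredient described above, turning cluster prediction into soft assignments so that k-means-style clustering becomes end-to-end trainable, can be sketched as a softmax over similarities to prototypes. This is an illustration of the general technique under assumed parameter names, not the LSPC code.

```python
# Sketch of soft cluster assignment: softmax over cosine similarities
# between embeddings and learned prototypes. Illustrative of the
# general technique only; not the LSPC implementation.
import numpy as np

def soft_assign(embeddings, prototypes, temperature=0.1):
    """Return an (n_samples, n_prototypes) matrix of soft assignments."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = e @ p.T / temperature               # cosine sims, sharpened
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy prototypes
x = np.array([[0.9, 0.1], [0.1, 0.8]])           # two toy embeddings
q = soft_assign(x, prototypes)
print(q.argmax(axis=1))                          # [0 1]
```

Because the assignments are differentiable in the embeddings and prototypes, a clustering loss on q can be backpropagated through the encoder, which is what makes the combination with contrastive training possible.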
{"title":"LSPC: Exploring contrastive clustering based on local semantic information and prototype","authors":"Jun-Fen Chen, Lang Sun, Bo-Jun Xie","doi":"10.1016/j.is.2023.102336","DOIUrl":"10.1016/j.is.2023.102336","url":null,"abstract":"<div><p>Recently years, several prominent contrastive learning<span><span> algorithms, a kind of self-supervised learning methods, have been extensively studied that can efficiently extract useful feature representations from input images by means of data augmentation techniques. How to further partition the representations into meaningful clusters is the issue that deep clustering is addressing. In this work, a deep </span>clustering algorithm based on local semantic information and prototype is proposed referring to LSPC that aims at learning a group of representative prototypes. Rather than learning the distinguishing characteristics between different images, more attention is given to the essential characteristics of images that are maybe from a potential category. On the training framework, contrastive learning is skillfully combined with k-means clustering algorithm. The prediction is transformed into soft assignments for end-to-end training. In order to enable the model to accurately capture the semantic information between images, we mine similar samples of training samples in the embedded space as local semantic information to effectively increase the similarity between samples belonging to the same cluster. 
Experimental results show that our algorithm achieves state-of-the-art performance on several commonly used public datasets, and additional experiments prove that this superior clustering performance can also be extended to large datasets such as ImageNet.</span></p></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"121 ","pages":"Article 102336"},"PeriodicalIF":3.7,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138628611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}