Pub Date: 2025-01-01 (Epub: 2025-10-07) | DOI: 10.1007/s44248-025-00067-x
A comparative case study on the performance of global sensitivity analysis methods on digit classification
Zahra Sadeghi, Stan Matwin
Discover Data 3(1): 38 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12504340/pdf/
Global sensitivity analysis seeks to detect the influential input factors that contribute to a black-box model's decisions. This aligns with a key objective of AI explainability: clarifying and interpreting the behavior of machine learning algorithms by identifying the features that influence their decisions, an approach that also helps mitigate the computational burden of processing high-dimensional data. Various techniques have been proposed for sensitivity analysis; however, each method focuses on different mathematical aspects, which can lead to differing conclusions about the impact or importance of each feature. It therefore remains unclear which of these algorithms are most suitable for machine learning models and, in particular, for deep learning models. Our goal is to examine the influential features identified by each sensitivity analysis algorithm and to evaluate their role in helping deep learning models make accurate decisions. In this article, we first present the mathematical foundations underlying global sensitivity algorithms and explain the rationale behind the important features each method selects. We then provide a comparative case study of global sensitivity analysis methods and propose a methodology for evaluating their efficacy on MNIST digit classification. Our study highlights the most effective global sensitivity analysis methods for detecting the key factors that influence digit classification.
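Variance-based (Sobol) indices are one of the standard families of global sensitivity analysis methods a comparison like this covers. As a minimal, generic sketch (not the paper's exact pipeline), first-order indices can be estimated with a pick-freeze scheme on the classic Ishigami test function:

```python
import numpy as np

# Pick-freeze Monte Carlo estimate of first-order Sobol indices.
# Generic illustration on the Ishigami function, not the paper's pipeline.
rng = np.random.default_rng(0)

def ishigami(x, a=7.0, b=0.1):
    return np.sin(x[:, 0]) + a * np.sin(x[:, 1]) ** 2 + b * x[:, 2] ** 4 * np.sin(x[:, 0])

n, d = 50_000, 3
A = rng.uniform(-np.pi, np.pi, (n, d))  # two independent sample matrices
B = rng.uniform(-np.pi, np.pi, (n, d))
fA, fB = ishigami(A), ishigami(B)
var_f = np.var(np.concatenate([fA, fB]))

first_order = []
for i in range(d):
    ABi = A.copy()
    ABi[:, i] = B[:, i]  # replace only column i of A with B's column i
    # Saltelli-style estimator of V(E[f | x_i]) / V(f)
    first_order.append(np.mean(fB * (ishigami(ABi) - fA)) / var_f)

# Analytical values for a=7, b=0.1 are approximately 0.31, 0.44, 0.00
print([round(s, 2) for s in first_order])
```

A feature whose first-order index is near zero (here x3) contributes to the output only through interactions, which is exactly the kind of distinction that makes different sensitivity methods rank features differently.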
Pub Date: 2024-05-23 | DOI: 10.1007/s44248-024-00010-6
Benchmarking of Secure Group Communication schemes with focus on IoT
Thomas Prantl, André Bauer, Simon Engel, Lukas Horn, Christian Krupitzer, Lukas Iffländer, Samuel Kounev
Pub Date: 2024-04-02 | DOI: 10.1007/s44248-024-00009-z
TFPsocialmedia: a public dataset for studying Turkish foreign policy
Hakan Mehmetcik, M. Ganiz, Melih Koluk, Galip Yüksel, Muslim Yılmaz, Muhammed Mustafa İnce, Emre Tortumlu
Pub Date: 2024-03-18 | DOI: 10.1007/s44248-024-00006-2
Data sharing and exchanging with incentive and optimization: a survey
Liyuan Liu, Meng Han
Pub Date: 2023-11-30 | DOI: 10.1007/s44248-023-00005-9
An evaluation of NERC learning-based approaches to discover personal data in Brazilian Portuguese documents
Luciano Ignaczak, Márcio Garcia Martins, C. A. da Costa, Bruna Donida, Maria Cristina Peres da Silva
Pub Date: 2023-03-30 | DOI: 10.1007/s44248-023-00004-w
Challenges and approaches when realizing online surface inspection systems with deep learning algorithms
Henrike Stephani, Thomas Weibel, Ronald Rösch, A. Moghiseh
Pub Date: 2023-01-01 | DOI: 10.1007/s44248-023-00003-x
A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning
Paul K Mvula, Paula Branco, Guy-Vincent Jourdan, Herna L Viktor
Discover Data 1(1): 4 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10079755/pdf/
In machine learning, the datasets used to build models are one of the main factors limiting what those models can achieve and how well they predict. Machine learning applications in cyber-security and computer security are numerous, including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. For these applications in particular, the datasets used to build the models must therefore be carefully chosen to be representative of real-world data. However, because labelled data are scarce and manually labelling positive examples is costly, a growing body of literature applies semi-supervised learning to cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used to build semi-supervised computer-security and cyber-security systems, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of these repositories and datasets, and we analyse the performance assessment metrics used to evaluate the resulting models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating the models built upon them.
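The setting the survey targets, many samples but only a handful of labels, can be illustrated with a toy self-training loop (a common semi-supervised baseline; the two-cluster data and nearest-centroid classifier here are illustrative stand-ins, not drawn from the reviewed repositories):

```python
import numpy as np

# Toy self-training: fit on the few labelled points, then iteratively
# pseudo-label the most confident unlabelled points and refit.
rng = np.random.default_rng(1)

# Two well-separated Gaussian clusters, e.g. benign (0) vs. attack (1) traffic.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_true = np.array([0] * 100 + [1] * 100)
y = np.full(200, -1)            # -1 marks unlabelled samples
y[[0, 100]] = y_true[[0, 100]]  # only one label per class is available

for _ in range(20):
    unlabelled = np.flatnonzero(y == -1)
    if unlabelled.size == 0:
        break
    centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    pred = dists.argmin(axis=1)
    margin = np.abs(dists[:, 0] - dists[:, 1])  # confidence proxy
    # pseudo-label the 20 most confident unlabelled points
    take = unlabelled[np.argsort(-margin[unlabelled])[:20]]
    y[take] = pred[take]

accuracy = (y == y_true).mean()
```

On data this clean the loop recovers nearly all labels from just two seeds; the surveyed papers study how far this idea carries on real, noisy cyber-security datasets, and which metrics (e.g. balanced accuracy rather than plain accuracy) report it honestly.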
Pub Date: 2023-01-01 | DOI: 10.1007/s44248-023-00002-y
Evaluating Word Embedding Feature Extraction Techniques for Host-Based Intrusion Detection Systems
Paul K Mvula, Paula Branco, Guy-Vincent Jourdan, Herna L Viktor
Discover Data 1(1): 2 | Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10077957/pdf/
Research into intrusion and anomaly detectors at the host level typically focuses on extracting attributes from system call traces, including window-based, Hidden Markov Model, and sequence-model-based attributes. Recently, several works have turned to sequence-model-based feature extractors, specifically Word2Vec and GloVe, to derive embeddings from system call traces, owing to their ability to capture semantic relationships among system calls. However, due to the nature of the data, these extractors introduce inconsistencies into the extracted features, causing the machine learning models built on them to yield inaccurate and potentially misleading results. In this paper, we first highlight the research challenges these extractors pose. We then experiment with new feature sets, assessing their suitability for addressing the detected issues. Our experiments show that Word2Vec is more prone than GloVe to introducing duplicated samples. Among the proposed solutions, concatenating the embedding vectors generated by Word2Vec and GloVe yields the best overall balanced accuracy. In addition to resolving the data-leakage challenge, this approach improves performance relative to the other alternatives.
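The best-performing feature set above, Word2Vec and GloVe embeddings concatenated per system call, can be sketched as follows (the embedding tables here are random stand-ins; in practice they would come from models trained on the system call corpus, e.g. with gensim's Word2Vec and a GloVe implementation):

```python
import numpy as np

# Concatenate the Word2Vec and GloVe embedding of each system call,
# then mean-pool over a trace to get one fixed-length feature vector.
rng = np.random.default_rng(0)
syscalls = ["open", "read", "write", "close", "mmap"]
dim = 8
w2v = {s: rng.normal(size=dim) for s in syscalls}    # stand-in Word2Vec table
glove = {s: rng.normal(size=dim) for s in syscalls}  # stand-in GloVe table

def trace_features(trace):
    """Concatenate both embeddings per call, then average over the trace."""
    vecs = [np.concatenate([w2v[s], glove[s]]) for s in trace]
    return np.mean(vecs, axis=0)

feat = trace_features(["open", "read", "read", "close"])
print(feat.shape)  # (16,) — twice the single-embedding dimensionality
```

The resulting vector feeds a downstream classifier; because each call's two embeddings come from differently trained models, the concatenation makes distinct calls less likely to collapse onto identical feature rows, which is the duplication problem the paper reports.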