Crowdfunding has become a popular financing method, attracting investors, businesses, and entrepreneurs. However, many campaigns fail to secure funding, making it crucial to reduce participation risks using artificial intelligence (AI). This study investigates the effectiveness of advanced AI techniques in predicting the success of crowdfunding campaigns on Kickstarter by analyzing campaign blurbs. We compare the performance of two widely used text representation models, bidirectional encoder representations from transformers (BERT) and FastText, in conjunction with long short-term memory (LSTM) and gradient boosting machine (GBM) classifiers. Our analysis involves preprocessing campaign blurbs, extracting features using BERT and FastText, and evaluating the predictive performance of these features with LSTM and GBM models. Experimental results show that BERT representations significantly outperform FastText, with the highest accuracy of 0.745 achieved by a fine-tuned BERT model combined with an LSTM. These findings highlight the importance of deep contextual embeddings and the benefits of fine-tuning pre-trained models for domain-specific applications. The results are benchmarked against existing methods, demonstrating the superiority of our approach. This study provides valuable insights for improving predictive models in the crowdfunding domain, offering practical implications for campaign creators and investors.
Hakan Gunduz. "Comparative analysis of BERT and FastText representations on crowdfunding campaign success prediction." PeerJ Computer Science, 2024-09-11. DOI: 10.7717/peerj-cs.2316.
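The feature-extraction step described above can be illustrated with a minimal sketch. The study uses trained FastText and BERT models; here a deterministic hash-derived vector stands in for a pretrained word-embedding table, and mean pooling stands in for the sentence-level feature. All names (`token_vector`, `blurb_features`, `DIM`) are ours, not the paper's:

```python
import hashlib

DIM = 8  # toy embedding dimensionality (real FastText/BERT use 100-768)

def token_vector(token: str) -> list:
    """Derive a fixed pseudo-embedding from the token's hash (illustrative
    stand-in for a trained embedding table)."""
    digest = hashlib.sha256(token.encode()).digest()
    return [b / 255.0 for b in digest[:DIM]]

def blurb_features(blurb: str) -> list:
    """Static sentence feature: mean-pool the token vectors of a blurb
    (assumes at least one whitespace-separated token)."""
    vecs = [token_vector(t) for t in blurb.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

feats = blurb_features("A smart water bottle for hikers")
```

A vector like `feats` would then be fed to the downstream LSTM or GBM classifier; BERT differs in that its token vectors depend on the surrounding context rather than being looked up per token.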
In recent years, social media has become an increasingly popular way for people to express their feelings. Platforms such as X (formerly Twitter) provide a huge amount of data that can be analyzed with sentiment analysis tools to examine public sentiment in an understandable way. Many works study sentiment analysis taking the spatial and temporal dimensions into consideration to provide the most precise analysis of these data and to better understand people's opinions. However, there is a need to facilitate and speed up the search process so that users can find the sentiment of the recent top-k tweets in a specified location, including the temporal aspect. This work aims to provide a general framework for data indexing and search queries that simplifies the search process and returns results efficiently. The proposed query extends the fundamental spatial distance query commonly used in spatio-temporal data analysis. Coupled with sentiment analysis, it operates on an indexed dataset, classifying temporal data as positive, negative, or neutral. The proposed query demonstrates over a tenfold improvement in query time compared to the baseline index across various parameters such as top-k, query distance, and the number of query keywords.
Abdulaziz Almaslukh, Aisha Almaalwy, Nasser Allheeib, Abdulaziz Alajaji, Mohammed Almukaynizi, Yazeed Alabdulkarim. "Top-k sentiment analysis over spatio-temporal data." PeerJ Computer Science, 2024-09-10. DOI: 10.7717/peerj-cs.2297.
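The core query shape, the k most recent tweets within a spatial radius of a query point, can be sketched without the paper's index (the index exists precisely to avoid this linear scan). The tuple layout and function names here are our assumptions:

```python
import math
import heapq

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def top_k_recent(tweets, qlat, qlon, radius_km, k):
    """Return the k most recent tweets within radius_km of the query point.
    Each tweet is (timestamp, lat, lon, sentiment_label)."""
    in_range = [t for t in tweets
                if haversine_km(qlat, qlon, t[1], t[2]) <= radius_km]
    return heapq.nlargest(k, in_range, key=lambda t: t[0])
```

The proposed indexed query answers the same question while pruning most candidates before the distance test, which is where the reported tenfold speedup over the baseline comes from.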
The increased use of artificial intelligence generated content (AIGC) among vast user populations has heightened the risk of private data leaks. Effective auditing and regulation remain challenging, further compounding the risks associated with the leaks involving model parameters and user data. Blockchain technology, renowned for its decentralized consensus mechanism and tamper-resistant properties, is emerging as an ideal tool for documenting, auditing, and analyzing the behaviors of all stakeholders in machine learning as a service (MLaaS). This study centers on biometric recognition systems, addressing pressing privacy and security concerns through innovative endeavors. We conducted experiments to analyze six distinct deep neural networks, leveraging a dataset quality metric grounded in the query output space to quantify the value of the transfer datasets. This analysis revealed the impact of imbalanced datasets on training accuracy, thereby bolstering the system’s capacity to detect model data thefts. Furthermore, we designed and implemented a novel Bio-Rollup scheme, seamlessly integrating technologies such as certificate authority, blockchain layer two scaling, and zero-knowledge proofs. This innovative scheme facilitates lightweight auditing through Merkle proofs, enhancing efficiency while minimizing blockchain storage requirements. Compared to the baseline approach, Bio-Rollup restores the integrity of the biometric system and simplifies deployment procedures. It effectively prevents unauthorized use through certificate authorization and zero-knowledge proofs, thus safeguarding user privacy and offering a passive defense against model stealing attacks.
Jian Yun, Yusheng Lu, Xinyang Liu, Jingdan Guan. "Bio-Rollup: a new privacy protection solution for biometrics based on two-layer scalability-focused blockchain." PeerJ Computer Science, 2024-09-09. DOI: 10.7717/peerj-cs.2268.
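The lightweight auditing via Merkle proofs that the scheme relies on can be sketched generically: commit many records to a single root, then prove membership of one record with only a logarithmic number of sibling hashes. This is a textbook illustration with `hashlib`, not Bio-Rollup's implementation; the odd-leaf duplication rule is our assumption:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of leaf hashes into a single Merkle root."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Collect the sibling hashes on the path from one leaf to the root."""
    proof, level = [], leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = index ^ 1
        proof.append((level[sib], sib < index))   # (sibling, sibling-is-left)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf_hash, proof, root):
    """Recompute the root from a leaf and its proof; True iff they match."""
    node = leaf_hash
    for sib, sib_is_left in proof:
        node = h(sib + node) if sib_is_left else h(node + sib)
    return node == root
```

An auditor holding only the on-chain root can thus check any single record without storing the full dataset, which is why Merkle proofs keep blockchain storage requirements small.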
Kamal Bella, Azidine Guezzaz, Said Benkirane, Mourade Azrour, Yasser Fouad, Mbadiwe S. Benyeogor, Nisreen Innab
The adoption and integration of the Internet of Things (IoT) have become essential for the advancement of many industries, unlocking purposeful connections between objects. However, the surge in IoT adoption and integration has also made it a prime target for malicious attacks. Consequently, ensuring the security of IoT systems and ecosystems has emerged as a crucial research area. Notably, advancements in addressing these security threats include the implementation of intrusion detection systems (IDS), garnering considerable attention within the research community. In this study, aiming to enhance network anomaly detection, we present a novel intrusion detection approach: the Deep Neural Decision Forest-based IDS (DNDF-IDS). The DNDF-IDS incorporates an improved decision forest model coupled with neural networks to achieve heightened accuracy (ACC). Employing four distinct feature selection methods separately, namely principal component analysis (PCA), LASSO regression (LR), SelectKBest, and Random Forest Feature Importance (RFFI), our objective is to streamline training and prediction processes, enhance overall performance, and identify the most correlated features. Evaluation of our model on three diverse datasets (NSL-KDD, CICIDS2017, and UNSW-NB15) reveals impressive ACC values ranging from 94.09% to 98.84%, depending on the dataset and the feature selection method. Notably, our model achieves a remarkable prediction time of 0.1 ms per record. Comparative analyses with other recent random forest and Convolutional Neural Networks (CNN) based models indicate that our DNDF-IDS performs similarly or even outperforms them in certain instances, particularly when utilizing the top 10 features. One key advantage of our novel model lies in its ability to make accurate predictions with only a few features, showcasing an efficient utilization of computational resources.
"An efficient intrusion detection system for IoT security using CNN decision forest." PeerJ Computer Science, 2024-09-09. DOI: 10.7717/peerj-cs.2290.
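The "identify the most correlated features, keep the top 10" idea can be sketched with a simple univariate filter. This is not the paper's PCA/LASSO/SelectKBest/RFFI code, just a minimal correlation-based ranking over numeric feature columns; all names here are ours:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def top_k_features(rows, labels, k):
    """Rank feature columns by |correlation| with the label, keep the top k.
    rows: list of numeric feature vectors; labels: 0/1 per row."""
    n_feats = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_feats)]
    ranked = sorted(range(n_feats), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

Filters like this shrink the input before training, which is how the reduced top-10 feature set keeps prediction time low without sacrificing much accuracy.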
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
近年来,由于基于机器学习的作者识别技术不断进步,根据代码作者的编程风格对其进行自动识别(称为作者归属或代码风格测量)已成为可能。代码风格测量法一旦在规模上可行,就可用于善意或恶意的活动,包括:识别代码中最专业的同事(如果作者信息丢失);对开源开发人员进行指纹识别,以便向他们主动提供工作机会;对非法软件的开发人员进行去匿名化,以便对他们进行追捕。根据各自的目标,利益相关者都希望提高或降低代码风格测量的效率。为了给这些决策提供信息,我们研究了代码风格测量的准确性如何受到两种常见软件开发活动的影响:代码格式化和代码精简。我们使用基于 code2vec 的作者分类器,对输入源文件的具体语法树(CST)表示法,对来自 Google Code Jam 数据集(59 位作者)的 Python 代码进行了代码风格测量。我们同时使用 CST 和 AST(抽象语法树)进行实验。我们比较了各自的分类准确率:(1) 原始数据集;(2) 使用 Black 格式化的数据集;(3) 使用 Python Minifier 简化的数据集。我们的结果表明(1) 基于 CST 的文体测量法比基于 AST 的文体测量法表现更好(51.00%→68%),(2) 代码格式化使代码文体测量法的准确率大幅下降(15%)(68%→53%),而最小化又进一步降低了 3%(68%→50%)。虽然代码格式化和最小化的准确性都有显著下降,但都不足以使开发人员无法通过代码风格测量进行识别。
Stefano Balla, Maurizio Gabbrielli, Stefano Zacchiroli. "Code stylometry vs formatting and minification." PeerJ Computer Science, 2024-09-06. DOI: 10.7717/peerj-cs.2142.
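To make the syntax-tree angle concrete: even a crude AST node-type histogram captures stylistic choices (list comprehension vs. explicit loop) that survive formatting. This is a much simpler feature than the code2vec path contexts the study uses, and the snippets are invented examples:

```python
import ast
from collections import Counter

def ast_node_histogram(source: str) -> Counter:
    """Count AST node types in Python source: a crude stylometric feature
    (the study uses richer code2vec-style path contexts over CSTs/ASTs)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

# Two functionally identical snippets written in different styles.
snippet_a = "def f(x):\n    return [i * i for i in range(x)]\n"
snippet_b = (
    "def f(x):\n"
    "    out = []\n"
    "    for i in range(x):\n"
    "        out.append(i * i)\n"
    "    return out\n"
)

ha, hb = ast_node_histogram(snippet_a), ast_node_histogram(snippet_b)
```

Reformatting with Black changes whitespace but not these node counts, which is consistent with the finding that formatting dents but does not destroy attribution accuracy.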
Routing in vehicular ad hoc networks (VANETs) enables vehicles to communicate for safety and non-safety applications. However, there are limitations in wireless communication that can degrade VANET performance, so it is crucial to optimize the operation of routing protocols to address this. Various routing protocols have employed the expected transmission count (ETX) as one way to achieve the required efficiency and robustness. ETX is used to estimate link quality for improved route selection. While some studies have evaluated the utilization of ETX in specific protocols, they lack a comprehensive analysis across protocols under varied network conditions. This research provides a comprehensive comparative evaluation of ETX-based routing protocols for VANETs using the nomadic community mobility model. It covers a foundational routing protocol, ad hoc on-demand distance vector (AODV), as well as newer variants that utilize ETX, lightweight ETX (LETX), and power-based light reverse ETX (PLR-ETX), which are referred to herein as AODV-ETX, AODV-LETX, and AODV-PLR, respectively. The protocols are thoroughly analyzed via ns-3 simulations under different traffic and mobility scenarios. Our evaluation model considers five performance parameters including throughput, routing overhead, end-to-end delay, packet loss, and underutilization ratio. The analysis provides insight into designing robust and adaptive ETX routing for VANET to better serve emerging intelligent transportation system applications through a better understanding of protocol performance under different network conditions. The key findings show that ETX-optimized routing can provide significant performance enhancements in terms of end-to-end delay, throughput, routing overhead, packet loss and underutilization ratio. The extensive simulations demonstrated that AODV-PLR outperforms its counterparts AODV-ETX and AODV-LETX and the foundational AODV routing protocol across the performance metrics.
Raad Al-Qassas, Malik Qasaimeh. "An empirical evaluation of link quality utilization in ETX routing for VANETs." PeerJ Computer Science, 2024-09-06. DOI: 10.7717/peerj-cs.2259.
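For readers unfamiliar with the metric: classic ETX estimates how many transmissions (including retransmissions) a link needs per delivered packet, from probe-measured forward and reverse delivery ratios, and a route's metric is the sum over its hops. A minimal sketch of the classic formulation (the LETX and PLR-ETX variants refine this; function names are ours):

```python
def etx(df: float, dr: float) -> float:
    """Expected transmission count of one link.
    df: forward delivery ratio, dr: reverse delivery ratio, both in (0, 1],
    typically estimated from periodic probe packets."""
    return 1.0 / (df * dr)

def route_etx(links) -> float:
    """Route metric: sum of per-hop ETX values; lower is better."""
    return sum(etx(df, dr) for df, dr in links)
```

A perfect link (df = dr = 1.0) costs exactly one expected transmission; a link that loses half its packets in one direction costs two, so routing prefers slightly longer paths over lossy shortcuts.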
Classification rule mining represents a significant field of machine learning, facilitating informed decision-making through the extraction of meaningful rules from complex data. Many classification methods cannot optimize explainability and different performance metrics at the same time. Metaheuristic optimization-based solutions, inspired by natural phenomena, offer a potential paradigm shift in this field, enabling the development of interpretable and scalable classifiers. In contrast to classical methods, such rule extraction-based solutions are capable of classification by taking multiple purposes into consideration simultaneously. To the best of our knowledge, although there are limited studies on metaheuristic-based classification, no existing method optimizes more than three objectives while also increasing the explainability and interpretability of the classification task. In this study, data sets are treated as the search space and metaheuristics as the many-objective rule discovery strategy, and a metaheuristic many-objective optimization-based rule extraction approach is proposed for the first time in the literature. Chaos theory is also integrated into the optimization method to improve performance, and the proposed chaotic rule-based SPEA2 algorithm enables the simultaneous optimization of four different success metrics and automatic rule extraction. Another distinctive feature of the proposed algorithm is that, in contrast to classical random search methods, it can mitigate issues such as correlation and poor uniformity between candidate solutions through the use of a chaotic random search mechanism in the exploration and exploitation phases. The efficacy of the proposed method is evaluated using three distinct data sets, and its performance is demonstrated in comparison with classical machine learning results.
Suna Yildirim, Bilal Alatas. "Increasing the explainability and success in classification: many-objective classification rule mining based on chaos integrated SPEA2." PeerJ Computer Science, 2024-09-06. DOI: 10.7717/peerj-cs.2307.
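The chaotic random search mechanism mentioned above typically replaces a uniform random number generator with a chaotic map. A common choice (we are assuming the logistic map here; the paper does not specify it in this abstract) looks like this:

```python
def logistic_map(x0: float, r: float = 4.0, n: int = 10):
    """Generate a chaotic sequence via the logistic map
    x_{t+1} = r * x_t * (1 - x_t); at r = 4 the map is fully chaotic,
    producing values in [0, 1] that can drive exploration/exploitation
    steps in place of uniform random draws."""
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        seq.append(x)
    return seq
```

Because consecutive chaotic values are deterministic yet non-repeating and cover the unit interval densely, candidate solutions drawn from them tend to avoid the clustering and poor uniformity that plain pseudo-random sampling can exhibit.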
Artificial intelligence (AI) and machine learning (ML) aim to mimic human intelligence and enhance decision-making processes across various fields. A key performance determinant of an ML model is the ratio between the training and testing datasets. This research investigates the impact of varying train-test split ratios on machine learning model performance and generalization capabilities using the BraTS 2013 dataset. Logistic regression, random forest, k-nearest neighbors, and support vector machines were trained with split ratios ranging from 60:40 to 95:05. Findings reveal significant variations in accuracy across these ratios, emphasizing the critical need to strike a balance to avoid overfitting or underfitting. The study underscores the importance of selecting an optimal train-test split ratio that considers trade-offs such as model performance metrics, statistical measures, and resource constraints. Ultimately, these insights contribute to a deeper understanding of how ratio selection impacts the effectiveness and reliability of machine learning applications across diverse fields.
{"title":"Trade-off between training and testing ratio in machine learning for medical image processing","authors":"Muthuramalingam Sivakumar, Sudhaman Parthasarathy, Thiyagarajan Padmapriya","doi":"10.7717/peerj-cs.2245","DOIUrl":"https://doi.org/10.7717/peerj-cs.2245","url":null,"abstract":"Artificial intelligence (AI) and machine learning (ML) aim to mimic human intelligence and enhance decision making processes across various fields. A key performance determinant in a ML model is the ratio between the training and testing dataset. This research investigates the impact of varying train-test split ratios on machine learning model performance and generalization capabilities using the BraTS 2013 dataset. Logistic regression, random forest, k nearest neighbors, and support vector machines were trained with split ratios ranging from 60:40 to 95:05. Findings reveal significant variations in accuracies across these ratios, emphasizing the critical need to strike a balance to avoid overfitting or underfitting. The study underscores the importance of selecting an optimal train-test split ratio that considers tradeoffs such as model performance metrics, statistical measures, and resource constraints. Ultimately, these insights contribute to a deeper understanding of how ratio selection impacts the effectiveness and reliability of machine learning applications across diverse fields.","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"19 1","pages":""},"PeriodicalIF":3.8,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142203447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
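The ratio sweep described above can be sketched with a shuffled index split and any cheap classifier. A hedged, stdlib-only sketch on synthetic 1-D data (the paper evaluated logistic regression, random forest, k-NN, and SVM on BraTS 2013; the nearest-centroid stand-in below is ours):

```python
# Sweep train-test split ratios from 60:40 to 95:05 and measure accuracy
# with a trivial nearest-centroid classifier on synthetic 1-D data.
import random

def train_test_split(data, labels, train_ratio, seed=0):
    """Shuffle indices once, then cut at the requested ratio."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_ratio)
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([data[i] for i in train_idx], [labels[i] for i in train_idx],
            [data[i] for i in test_idx], [labels[i] for i in test_idx])

def centroid_accuracy(x_train, y_train, x_test, y_test):
    """Classify each test point by its nearest per-class training mean."""
    groups = {}
    for x, y in zip(x_train, y_train):
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(v) / len(v) for y, v in groups.items()}
    hits = sum(1 for x, y in zip(x_test, y_test)
               if min(centroids, key=lambda c: abs(centroids[c] - x)) == y)
    return hits / len(y_test)

data = [i / 10 for i in range(100)]            # 0.0 .. 9.9
labels = [0 if d < 5.0 else 1 for d in data]
for ratio in (0.60, 0.70, 0.80, 0.90, 0.95):
    acc = centroid_accuracy(*train_test_split(data, labels, ratio))
```

Note that at a 95:05 split the test set here shrinks to five points, which illustrates the paper's caution: accuracy estimates at extreme ratios rest on very few held-out samples.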
Electrocardiograms (ECGs) provide essential data for diagnosing arrhythmias, which can potentially cause serious health complications. Early detection through continuous monitoring is crucial for timely intervention. The Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH) arrhythmia dataset employed for arrhythmia analysis research comprises imbalanced data. It is necessary to create a robust model, independent of data imbalances, to classify arrhythmias accurately. To mitigate the pronounced class imbalance in the MIT-BIH arrhythmia dataset, this study employs advanced augmentation techniques, specifically variational autoencoder (VAE) and conditional diffusion, to augment the dataset. Furthermore, accurately segmenting the continuous heartbeat dataset into individual heartbeats is crucial for confidently detecting arrhythmias. This research compared a model that employed annotation-based segmentation using R-peak labels with a model that used an automated, deep learning-based segmentation method to segment heartbeats. In our experiments, the proposed model, which utilizes MobileNetV2 along with annotation-based segmentation and conditional diffusion augmentation to address minority classes, demonstrated a notable 1.23% improvement in the F1 score and 1.73% in precision compared to the model classifying arrhythmia classes with the original imbalanced dataset. This research presents a model that accurately classifies a wide range of arrhythmias, including minority classes, moving beyond previously limited arrhythmia classification models. It can serve as a basis for better data utilization and model performance improvement in arrhythmia diagnosis and medical service research. These achievements enhance applicability in the medical field and contribute to improving the quality of healthcare services by providing more sophisticated and reliable diagnostic tools.
{"title":"Classification of imbalanced ECGs through segmentation models and augmented by conditional diffusion model","authors":"Jinhee Kwak, Jaehee Jung","doi":"10.7717/peerj-cs.2299","DOIUrl":"https://doi.org/10.7717/peerj-cs.2299","url":null,"abstract":"Electrocardiograms (ECGs) provide essential data for diagnosing arrhythmias, which can potentially cause serious health complications. Early detection through continuous monitoring is crucial for timely intervention. The Massachusetts Institute of Technology-Beth Israel Hospital (MIT-BIH) arrhythmia dataset employed for arrhythmia analysis research comprises imbalanced data. It is necessary to create a robust model independent of data imbalances to classify arrhythmias accurately. To mitigate the pronounced class imbalance in the MIT-BIH arrhythmia dataset, this study employs advanced augmentation techniques, specifically variational autoencoder (VAE) and conditional diffusion, to augment the dataset. Furthermore, accurately segmenting the continuous heartbeat dataset into individual heartbeats is crucial for confidently detecting arrhythmias. This research compared a model that employed annotation-based segmentation, utilizing R-peak labels, and a model that utilized an automated segmentation method based on a deep learning model to segment heartbeats. In our experiments, the proposed model, utilizing MobileNetV2 along with annotation-based segmentation and conditional diffusion augmentation to address minority class, demonstrated a notable 1.23% improvement in the F1 score and 1.73% in the precision, compared to the model classifying arrhythmia classes with the original imbalanced dataset. This research presents a model that accurately classifies a wide range of arrhythmias, including minority classes, moving beyond the previously limited arrhythmia classification models. It can serve as a basis for better data utilization and model performance improvement in arrhythmia diagnosis and medical service research. 
These achievements enhance the applicability in the medical field and contribute to improving the quality of healthcare services by providing more sophisticated and reliable diagnostic tools.","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"15 1","pages":""},"PeriodicalIF":3.8,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142203451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
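Annotation-based segmentation as described above amounts to cutting a fixed-width window around each labelled R-peak sample index. A minimal, illustrative sketch (window size and signal are made up; MIT-BIH stores R-peak sample indices as annotations, but this is not the paper's implementation):

```python
# Cut one fixed-length segment per annotated R-peak, skipping peaks whose
# window would run past either end of the signal.

def segment_by_rpeaks(signal, r_peaks, half_window=90):
    """Return a list of fixed-length heartbeat segments centered on R-peaks."""
    beats = []
    for r in r_peaks:
        start, end = r - half_window, r + half_window
        if start >= 0 and end <= len(signal):
            beats.append(signal[start:end])
    return beats

signal = [0.0] * 1000                       # stand-in for an ECG trace
r_peaks = [100, 300, 550, 800, 990]         # last peak too close to the edge
beats = segment_by_rpeaks(signal, r_peaks)
assert len(beats) == 4                      # 990 + 90 > 1000, so it is dropped
assert all(len(b) == 180 for b in beats)
```

Fixed-length segments are what a classifier such as MobileNetV2 expects as input; the paper's automated alternative replaces the R-peak labels with a learned segmentation model.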
Single-image super-resolution technology based on deep learning is widely used in remote sensing. Non-local features reflect the correlation information between different regions. Most neural networks extract various kinds of non-local information from images in the spatial domain but ignore the similarity characteristics of the frequency distribution, which limits algorithm performance. To solve this problem, we propose a frequency distribution-aware network based on the discrete cosine transformation for remote sensing image super-resolution. This network first introduces a frequency-aware module, which can effectively extract the similarity characteristics of the frequency distribution between different regions by rearranging the frequency feature matrix of the image. A global frequency feature fusion module is also proposed; it can extract the non-local information of feature maps at different scales in the frequency domain at little computational cost. The experiments were conducted on two commonly used remote sensing datasets. The experimental results show that the proposed algorithm can effectively perform image reconstruction and outperforms several advanced super-resolution algorithms. The code is available at https://github.com/Liyszepc/FDANet.
{"title":"Frequency distribution-aware network based on discrete cosine transformation (DCT) for remote sensing image super resolution","authors":"Yunsong Li, Debao Yuan","doi":"10.7717/peerj-cs.2255","DOIUrl":"https://doi.org/10.7717/peerj-cs.2255","url":null,"abstract":"Single-image super-resolution technology based on deep learning is widely used in remote sensing. The non-local feature reflects the correlation information between different regions. Most neural networks extract various non-local information of images in the spatial domain but ignore the similarity characteristics of frequency distribution, which limits the performance of the algorithm. To solve this problem, we propose a frequency distribution aware network based on discrete cosine transformation for remote sensing image super-resolution. This network first proposes a frequency-aware module. This module can effectively extract the similarity characteristics of the frequency distribution between different regions by rearranging the frequency feature matrix of the image. A global frequency feature fusion module is also proposed. It can extract the non-local information of feature maps at different scales in the frequency domain with little computational cost. The experiments were on two commonly-used remote sensing datasets. The experimental results show that the proposed algorithm can effectively complete image reconstruction and performs better than some advanced super-resolution algorithms. 
The code is available at https://github.com/Liyszepc/FDANet.","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"3 1","pages":""},"PeriodicalIF":3.8,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142203452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
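For intuition, the discrete cosine transformation underlying the network maps a signal into frequency coefficients whose distribution the frequency-aware module then compares across regions. A naive, pure-Python DCT-II sketch (the paper's module operates on 2-D feature maps inside a network, which is not reproduced here):

```python
# Naive DCT-II of a 1-D sequence: X_k = sum_n x_n * cos(pi/N * (n + 0.5) * k).
import math

def dct2_1d(x):
    """Return the (unnormalized) DCT-II coefficients of a 1-D sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]

# A constant signal concentrates all energy in the DC (k = 0) coefficient,
# while higher-frequency coefficients vanish by symmetry.
coeffs = dct2_1d([1.0, 1.0, 1.0, 1.0])
assert abs(coeffs[0] - 4.0) < 1e-9
assert all(abs(c) < 1e-9 for c in coeffs[1:])
```

In practice a fast, separable 2-D DCT (e.g., applied along rows then columns) is used rather than this O(N²) loop; the sketch only shows what the frequency coefficients represent.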