Pub Date: 2026-03-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.mlwa.2025.100799
Jeonghoe Lee, Lin Cai
Predicting stock prices is crucial for making informed investment decisions, as stock markets significantly influence the global economy. Although previous studies have explored feature importance methods for stock price prediction, comprehensive comparisons of those methods have been limited. This study provides a detailed comparison of feature importance methods for selecting technical indicators to predict stock prices. Specifically, the research analyzed financial data from the 11 sectors of the NASDAQ. A moving-window forecasting framework was implemented to dynamically capture the evolving patterns of financial markets over time. Model-specific feature importance methods were compared with model-agnostic approaches. Multiple machine learning algorithms, including Random Forest (RF) and Multi-layer Neural Networks (MNNs), were employed to forecast stock prices. Additionally, extensive hyperparameter tuning was conducted to improve model explainability, contributing to the field of Explainable Artificial Intelligence (XAI). The results highlight the predictive effectiveness of different feature importance methods in selecting optimal technical indicators, offering valuable insights for enhancing stock price forecasting accuracy and model transparency. In summary, this research offers a comprehensive comparison of feature importance methods, emphasizing their application to the selection of technical indicators in a dynamic, rolling prediction setting.
"Comparing model-specific and model-agnostic features importance methods using machine learning with technical indicators: A NASDAQ sector-based study," Machine learning with applications, vol. 23, Article 100799.
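The core comparison can be sketched in a few lines: inside each moving window, a model-specific ranking (the forest's impurity importances) is set against a model-agnostic one (permutation importance on held-out data). All data below are synthetic and the window sizes are illustrative; the paper's NASDAQ indicators and tuned models are not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-ins for technical indicators and next-period price;
# only features 0 and 3 are informative by construction.
T, n_feat = 600, 8
X = rng.normal(size=(T, n_feat))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=T)

window, horizon = 200, 50
spec_rank, agno_rank = [], []
for start in range(0, T - window - horizon + 1, horizon):
    tr = slice(start, start + window)
    te = slice(start + window, start + window + horizon)
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[tr], y[tr])
    # Model-specific: impurity-based importances from the fitted forest.
    spec_rank.append(np.argsort(rf.feature_importances_)[::-1])
    # Model-agnostic: permutation importance scored on the held-out window.
    pi = permutation_importance(rf, X[te], y[te], n_repeats=5, random_state=0)
    agno_rank.append(np.argsort(pi.importances_mean)[::-1])

top2_spec = set(spec_rank[-1][:2])
top2_agno = set(agno_rank[-1][:2])
print(top2_spec, top2_agno)
```

On this toy series both rankings should recover the two informative indicators; the study quantifies the analogous agreement (and disagreement) across NASDAQ sectors.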
Pub Date: 2026-03-01 | Epub Date: 2025-12-09 | DOI: 10.1016/j.mlwa.2025.100818
Sara Fanati Rashidi, Maryam Olfati, Seyedali Mirjalili, Crina Grosan, Jan Platoš, Vaclav Snášel
This study integrates Data Envelopment Analysis (DEA) with Machine Learning (ML) to address key limitations of traditional DEA in identifying reference sets for inefficient Decision-Making Units (DMUs). In DEA, inefficient units are evaluated against benchmark units; however, some benchmarks may be inappropriate or even outliers, which can distort the efficiency frontier. Moreover, when a new DMU is added, the entire model must be recalculated, resulting in high computational costs for large datasets. To overcome these issues, we propose a hybrid approach that combines Fuzzy C-Means (FCM) and Possibilistic Fuzzy C-Means (PFCM) clustering. By leveraging Euclidean distance and membership degrees, the method identifies closer and more relevant reference units, while a sensitivity threshold is introduced to control the number of benchmarks according to practical requirements. The effectiveness of the proposed method is validated on two datasets: a banking dataset and a banknote authentication dataset with 1,372 samples. Results show that the reference sets derived from this ML-based framework achieve 71.6%–98.3% agreement with DEA, while overcoming two major drawbacks: (1) sensitivity to dataset size and (2) inclusion of inappropriate reference units. Furthermore, statistical analyses, including confidence intervals and McNemar’s test, confirm the robustness and practical significance of the findings.
"A hybrid DEA–fuzzy clustering approach for accurate reference set identification," Machine learning with applications, vol. 23, Article 100818.
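A minimal sketch of the membership idea, assuming synthetic DMU data and an illustrative sensitivity threshold of 0.8: fuzzy c-means memberships (driven by Euclidean distance) decide which units are close enough to serve as benchmarks for a target unit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy input/output vectors for 12 DMUs; values are synthetic,
# not taken from the paper's banking dataset.
dmus = np.vstack([rng.normal(0, 0.3, (6, 2)), rng.normal(3, 0.3, (6, 2))])

def fcm(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centers and membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        p = 2 / (m - 1)
        # Standard FCM update: u_ik = 1 / sum_j (d_ik / d_ij)^p
        U = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))
    return centers, U

centers, U = fcm(dmus)

# Reference-set rule sketched from the abstract: a candidate benchmark is kept
# for a target DMU only if its membership in the target's cluster exceeds a
# sensitivity threshold, so distant or outlying benchmarks are excluded.
target = 0
cluster = U[target].argmax()
threshold = 0.8  # assumed sensitivity threshold, tunable per application
reference_set = [i for i in range(len(dmus)) if i != target and U[i, cluster] > threshold]
print(reference_set)
```

Raising the threshold shrinks the reference set toward the closest peers, which is how the abstract's "control the number of benchmarks" knob works in this simplified form.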
Efficient identification of informative inputs is critical when training Machine Learning (ML) surrogates on large, multi-sensor datasets. In this paper, we benchmark several input selection methods from the literature alongside new methods proposed here. A baseline method based on expert-driven (human) selection is used as a reference. All methods are evaluated on a challenging inverse problem, in which Computational Fluid Dynamic (CFD) simulations are used to train a Deep Neural Network (DNN) to infer unknown momentum source terms from discrete velocity measurements. The proposed methodology does not explicitly depend on the geometry of the domain and is therefore transferable to other problems involving sparse sensor measurements, although domain-specific validation may still be required. The results show that four input selection methods reduce the number of inputs to as few as five, with minimal impact on the mean average predictive error. This corresponds to a forty-fold reduction relative to the original number of inputs. Analysis of the top four inputs shows that each method selects different locations, indicating that multiple combinations can yield similar accurate results. The top four methods significantly outperform the baseline method based on human selection. This study demonstrates that input selection methods reduce computational costs during both training and inference stages. They also lower experimental demands by identifying high-value sensor locations, thereby reducing the number of required sampling points. These findings suggest that input selection methods should be considered standard practice in ML applications with complex scenarios constrained by limited experimental data.
"Comparison of input selection methods for neural networks applied to complex fluid dynamic inverse problem," Jaume Luis-Gómez, Guillem Monrós-Andreu, Sergio Iserte, Sergio Chiva, Raúl Martínez-Cuenca. Machine learning with applications, vol. 23, Article 100842. DOI: 10.1016/j.mlwa.2026.100842.
Pub Date: 2026-03-01 | Epub Date: 2026-01-31 | DOI: 10.1016/j.mlwa.2026.100858
Anthony Chan Chan
Anomaly detection in multispectral imagery must cope with high-dimensional inputs, scarce labeled anomalies and operational constraints. Tensor decompositions offer a structured way to compress such data, but their impact on anomaly detection performance and cost is not well quantified. This work studies how low-rank tensor representations affect one-class detectors on multispectral imagery.
Two decomposition strategies are evaluated as feature extractors: a global CP (PARAFAC) model fitted on training tiles and a per-tile Tucker model. The resulting coefficients are used to train one-class support vector machines, autoencoders and isolation forests on typical examples only, and anomalies are identified through their detector scores. The study uses multispectral Mastcam images from the Mars Science Laboratory Curiosity rover, a dataset with rare labeled novelties.
Experiments cover two evaluation regimes (randomized across subclasses and subclass-specific) and four training sizes n ∈ {500, 1000, 1500, 3000}. Under randomized sampling, CP and Tucker improve ROC–AUC over PCA for OC-SVM by approximately 10 to 15 percent at n ≤ 1500, while autoencoder gains span approximately 2 to 13 percent depending on decomposition and sample size. In subclass-specific tests for structured subclasses (drill-hole, DRT, dump-pile), CP and Tucker yield larger improvements at n ≤ 1500, with absolute ROC–AUC increases over PCA ranging from approximately 14 to 56 percent, whereas for visually homogeneous subclasses (bedrock, broken rock, float, veins) decompositions rarely improve over PCA and can reduce performance. Computationally, CP can require peak RSS memory above 50 GB, whereas Tucker often remains below 10 GB (8.67 GB vs. 52.53 GB in the reported runs, an 84% reduction), albeit with longer runtimes. Overall, the results indicate that tensor decompositions are most valuable as selective enhancements to PCA in multispectral anomaly detection pipelines when multiway structure is informative and training data are limited.
"From tensors to novelties: Low-dimensional representations for anomaly detection in multispectral imagery," Machine learning with applications, vol. 23, Article 100858.
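The Tucker route can be illustrated with a truncated HOSVD on a synthetic tile (the paper's fitting algorithm and ranks may differ, e.g. HOOI); the flattened core tensor plays the role of the low-dimensional feature vector fed to a one-class detector.

```python
import numpy as np

rng = np.random.default_rng(3)

def unfold(T, mode):
    """Matricize tensor T along one mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated HOSVD: one standard way to fit a Tucker model."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # Mode-n product with U.T shrinks that mode to its Tucker rank.
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Synthetic multispectral tile: 32x32 pixels x 6 bands with low-rank structure.
A = rng.normal(size=(32, 4)) @ rng.normal(size=(4, 32))
tile = A[:, :, None] * rng.normal(size=6)[None, None, :]
core, factors = hosvd(tile, ranks=(8, 8, 3))

# The flattened core is the feature vector a one-class detector
# (OC-SVM, autoencoder, isolation forest) would consume.
features = core.ravel()
compression = tile.size / features.size
print(features.shape, compression)
```

Because the synthetic tile's multilinear ranks fit inside (8, 8, 3), the compression here is lossless; on real Mastcam tiles the truncation is lossy and the choice of ranks trades memory against detection accuracy, which is the trade-off the paper measures.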
Pub Date: 2026-03-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.mlwa.2025.100797
Nicholas Maurer, Mohammed Abdallah
This work presents a novel adaptive framework for soft error mitigation in space-based systems, designed to resolve the fundamental conflict between system performance and radiation protection. By leveraging a Long Short-Term Memory (LSTM) model to predict real-time solar particle flux, our approach dynamically enables or disables software-based mitigation techniques. This contrasts with the static, "always-on" methods of existing systems, offering a significant improvement in computational efficiency. The proposed LSTM model was trained on NASA solar particle flux data, achieving a mean average error of 7.65 × 10⁻⁶ and demonstrating high accuracy in predicting nonlinear particle events. Our simulation, which applies this predictive model to a tiered system of redundant processing, checkpointing, and watchdog timers, shows a substantial reduction in overhead. During the 18,414-second test period, the combined adaptive mitigation methods introduced only 20.75–51.6 s of overhead, a 99.4% reduction compared to continuous, static mitigation. The primary contribution of this research is a demonstrated proof of concept for an intelligent, self-adaptive system that can maintain high reliability while drastically improving performance.
This approach provides a pathway for utilizing more cost-effective commercial-off-the-shelf (COTS) processors in radiation-intensive environments.
"Machine learning based adaptive soft error mitigation efficiency," Machine learning with applications, vol. 23, Article 100797.
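The gating logic, stripped of the LSTM, reduces to a threshold test per time step; the flux values, threshold, and per-step overhead cost below are invented for illustration, with the predicted series standing in for the LSTM forecast of NASA flux data.

```python
import math

def adaptive_overhead(predicted_flux, threshold, step_cost):
    """Total mitigation overhead when protection runs only on flagged steps."""
    return sum(step_cost for f in predicted_flux if f > threshold)

# Quiet background flux with one short solar particle event.
flux = [1e-7 * (1 + 0.1 * math.sin(t / 10)) for t in range(1000)]
for t in range(400, 420):
    flux[t] = 5e-5  # simulated event spike

threshold, step_cost = 1e-6, 0.05  # seconds of overhead per protected step
adaptive = adaptive_overhead(flux, threshold, step_cost)
static = step_cost * len(flux)  # "always-on" mitigation for comparison
reduction = 1 - adaptive / static
print(adaptive, static, round(reduction, 3))
```

Mitigation runs only during the 20-step event, so the adaptive scheme pays a small fraction of the always-on cost; the paper's 99.4% figure comes from the same mechanism driven by a real forecast.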
Pub Date: 2026-03-01 | Epub Date: 2025-11-26 | DOI: 10.1016/j.mlwa.2025.100800
Ali Asghari
As an unsupervised learning method, clustering is a critical technique in artificial intelligence for organizing raw data into meaningful groups. In this process, data are partitioned so that members of the same cluster are internally similar while the distance to other clusters is maximized. Clustering has been widely applied across disciplines, including business analytics, healthcare, and economics. Extracting practical knowledge from large datasets relies on an effective clustering technique, and the main challenges are processing speed (especially for large datasets), handling noisy data and outliers, and ensuring high accuracy. These problems are especially significant in contemporary applications, where heterogeneous and inherently noisy datasets are prevalent. The proposed approach, TQC (Tree-Queue Clustering), addresses these problems by combining the Trees Social Relation (TSR) algorithm with the Queue Learning (QL) algorithm. While the QL algorithm enhances clustering accuracy, the TSR method focuses on accelerating clustering. The approach first divides the data into smaller groups; then, by efficiently computing group memberships, TSR's migration process causes clusters to develop progressively. By handling noise and outliers, the QL algorithm helps avoid local optima and improves clustering efficiency. This hybrid approach ensures the formation of high-quality clusters and accelerates convergence. The method is validated on several real-world datasets of varying sizes and properties. Experimental results, evaluated using five performance metrics (MICD, ARI, NMI, ET, and ODR) and compared with eight state-of-the-art algorithms, demonstrate the proposed method's superior performance in both speed and accuracy.
"TQC: An intelligent clustering approach for large-scale, noisy, and imbalanced data," Machine learning with applications, vol. 23, Article 100800.
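TSR and QL are the paper's own algorithms and are not reproduced here; what can be sketched generically is the divide-then-merge workflow the abstract describes, with the seed choice, the union-find merge, and the merge radius all being assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two well-separated blobs of points.
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(4, 0.2, (30, 2))])

# Step 1: divide the data into many small groups (here: nearest of 8
# deterministic seed points, a crude stand-in for the initial partition).
seeds = X[::8]
group = np.linalg.norm(X[:, None] - seeds[None], axis=2).argmin(axis=1)

# Step 2: progressively merge groups whose centroids are close, so clusters
# develop from the small groups (union-find over centroid distances).
centroids = np.array([X[group == g].mean(axis=0) for g in range(len(seeds))])
parent = list(range(len(seeds)))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i
for i in range(len(seeds)):
    for j in range(i + 1, len(seeds)):
        if np.linalg.norm(centroids[i] - centroids[j]) < 1.0:  # assumed radius
            parent[find(j)] = find(i)

labels = np.array([find(g) for g in group])
print(len(set(labels)))
```

The eight micro-groups collapse into the two underlying clusters; TQC's TSR migration and QL steps replace the naive distance rule with mechanisms that also handle noise and outliers.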
In many domains, including online education, healthcare, security, and human–computer interaction, facial emotion recognition (FER) is essential. Real-world FER remains difficult because of factors like head poses, occlusions, illumination shifts, and demographic diversity. Engagement detection systems, which are essential in virtual learning platforms, are severely challenged by these factors. In this article, we propose ExpressNet-MoE, a novel hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs) with a Mixture of Experts (MoE) framework to address these challenges. The proposed model dynamically selects the most relevant expert networks for each input, thereby improving generalization and adaptability across diverse datasets. Our methodology involves training ExpressNet-MoE independently on several benchmark datasets after preprocessing facial images using BlazeFace for face detection and alignment. To maintain class distribution, stratified sampling is used to divide each dataset into training and testing groups. Our model improves the accuracy of emotion recognition by utilizing multi-scale feature extraction to collect both global and local facial features. ExpressNet-MoE includes numerous CNN-based feature extractors, a MoE module for adaptive feature selection, and a residual network backbone for deep feature learning. To demonstrate the efficacy of our proposed model, we evaluated it on four widely used datasets: AffectNet-7, AffectNet-8, RAF-DB, and FER-2013, and compared it with current state-of-the-art methods. Our model achieves accuracies of 74.40% ± 0.45 on AffectNet-7, 71.98% ± 0.66 on AffectNet-8, 83.41% ± 1.06 on RAF-DB, and 67.05% ± 2.08 on FER-2013. Overall, the findings indicate that adaptive expert selection and multi-scale feature extraction significantly enhance the robustness of facial emotion recognition across diverse real-world conditions and show how the model may be used to develop end-to-end emotion recognition systems in practical settings.
Reproducible code and results are publicly accessible at https://github.com/DeeptimaanB/ExpressNet-MoE.
"ExpressNet-MoE: A hybrid deep neural network for emotion recognition," Deeptimaan Banerjee, Prateek Gothwal, Ashis Kumer Biswas. Machine learning with applications, vol. 23, Article 100830. DOI: 10.1016/j.mlwa.2025.100830.
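The MoE mechanism itself is compact; a toy forward pass (with linear experts standing in for the paper's CNN feature extractors, and all dimensions invented) shows how softmax gate weights adaptively blend expert outputs per input.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy mixture-of-experts layer: a gating network scores each expert per input,
# and the layer output is the gate-weighted sum of the expert outputs.
d_in, d_out, n_experts = 16, 8, 4
W_gate = rng.normal(size=(d_in, n_experts))
W_experts = rng.normal(size=(n_experts, d_in, d_out))

def moe_forward(x):
    gates = softmax(x @ W_gate)                          # (batch, n_experts)
    expert_out = np.einsum('bi,eio->beo', x, W_experts)  # every expert's output
    return np.einsum('be,beo->bo', gates, expert_out), gates

x = rng.normal(size=(3, d_in))
y, gates = moe_forward(x)
print(y.shape, gates.shape)
```

Because the gate weights depend on the input, different samples lean on different experts, which is the "dynamically selects the most relevant expert networks" behavior described in the abstract.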
Pub Date : 2026-03-01Epub Date: 2025-12-30DOI: 10.1016/j.mlwa.2025.100833
Chandramohan Abhishek , Nadimpalli Raghukiran
The present research showcases a machine-interactive approach to decision-making using a pre-trained natural language processing (NLP) model. The method is developed for 4D (4-dimensional) printing technique selection, where many variables are involved, such as process, material, design, and sequence selections. Because numerous options are available, arriving at a preferred technique requires expertise and time. The developed method consolidates this assistance into a single source. The approach incorporates bidirectional encoder representations from transformers (BERT), which accommodates parallel meanings of user requests, such as synonyms and adjectives. The closed-loop system is programmed with a set of 7 prompts. It also introduces additional affirmation prompts to handle both ambiguous phrasing and out-of-scope requests so that the machine returns a meaningful recommendation. A lightweight, rule-governed technique guides the selection of the conforming request at each prompt. The inference-based approach takes user requests, performs objective classification using BERT according to selected criteria, dynamically filters the data, and recommends suggestions, with an inference time of 0.79 s. The modified model also establishes multi-level relationships among prompts for text classification. k-fold validation reached its highest accuracy when the model was trained with optimal hyperparameters. The fine-tuned method, developed in a Python environment, can be generalized to other systems. The present research demonstrates the possibility of adapting an openly accessible model to develop a decision-assistance system with minimal personal computational resources.
{"title":"Machine-interactive decision-assistance using a pre-trained natural language processing model for 4D printing technique selection","authors":"Chandramohan Abhishek , Nadimpalli Raghukiran","doi":"10.1016/j.mlwa.2025.100833","DOIUrl":"10.1016/j.mlwa.2025.100833","url":null,"abstract":"<div><div>The present research showcases a machine-interactive approach to decision-making using a pre-trained natural language processing (NLP) model. The method is developed for 4D (4-dimensional) printing technique selection, where many variables are involved, such as process, material, design, and sequence selections. Because numerous options are available, arriving at a preferred technique requires expertise and time. The developed method consolidates this assistance into a single source. The approach incorporates bidirectional encoder representations from transformers (BERT), which accommodates parallel meanings of user requests, such as synonyms and adjectives. The closed-loop system is programmed with a set of 7 prompts. It also introduces additional affirmation prompts to handle both ambiguous phrasing and out-of-scope requests so that the machine returns a meaningful recommendation. A lightweight, rule-governed technique guides the selection of the conforming request at each prompt. The inference-based approach takes user requests, performs objective classification using BERT according to selected criteria, dynamically filters the data, and recommends suggestions, with an inference time of 0.79 s. The modified model also establishes multi-level relationships among prompts for text classification. k-fold validation reached its highest accuracy when the model was trained with optimal hyperparameters. The fine-tuned method, developed in a Python environment, can be generalized to other systems. 
The present research demonstrates the possibility of adapting an openly accessible model for developing a decision-assistance system with minimal personal computational resources.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100833"},"PeriodicalIF":4.9,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
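The classify-then-filter loop described above can be sketched with a simple bag-of-words similarity standing in for the BERT encoder: the user request is matched to a criterion, and the candidate table is filtered accordingly. All names here (`criteria`, `techniques`, the example entries) are hypothetical illustrations, not the paper's actual prompt set or data.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def classify(request, criteria):
    """Map a free-text request to the best-matching criterion label.
    A real system would embed both sides with BERT; here word-overlap
    similarity stands in for those embeddings."""
    req = Counter(request.lower().split())
    return max(criteria, key=lambda label: cosine(req, Counter(criteria[label].split())))

# Hypothetical criteria descriptions and technique table (not from the paper).
criteria = {
    "material": "polymer hydrogel shape memory alloy material",
    "process": "extrusion stereolithography printing process",
}
techniques = [
    {"name": "SLA", "process": "stereolithography", "material": "resin"},
    {"name": "FDM", "process": "extrusion", "material": "shape memory polymer"},
]
label = classify("which shape memory polymer should I use", criteria)
matches = [t["name"] for t in techniques if "polymer" in t[label]]
print(label, matches)  # → material ['FDM']
```

In the paper's closed-loop design this classify-and-filter step would repeat once per prompt, narrowing the candidate set each time.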
Conducting prior patent searches before developing technologies and filing patent applications in companies or universities is essential for understanding technological trends among competitors and academic institutions, as well as for increasing the likelihood of obtaining patent rights. In these searches, it is important not only to include relevant keywords in the search queries but also to incorporate related terms retrieved from a thesaurus. To support this, methods using word embeddings for automatically extracting such synonyms have recently been proposed. However, patent documents often contain unique expressions and compound terms, such as specialized technical terminology and abstract conceptual terms, which are difficult to accurately capture using existing large language models trained at the token level.
In this study, we investigate a method for extracting synonyms from patent documents by embedding the definition sentences that explain technical terms. The experimental results demonstrate that the proposed method achieves more precise synonym extraction than conventional word embedding approaches, and it can contribute to the expansion of existing thesauri.
Thus, this research is expected to improve the recall of prior art searches and support the automatic extraction of technical elements for identifying technological trends.
{"title":"Synonym extraction from Japanese patent documents using term definition sentences","authors":"Koji Marusaki , Seiya Kawano , Asahi Hentona , Hirofumi Nonaka","doi":"10.1016/j.mlwa.2026.100848","DOIUrl":"10.1016/j.mlwa.2026.100848","url":null,"abstract":"<div><div>Conducting prior patent searches before developing technologies and filing patent applications in companies or universities is essential for understanding technological trends among competitors and academic institutions, as well as for increasing the likelihood of obtaining patent rights. In these searches, it is important not only to include relevant keywords in the search queries but also to incorporate related terms retrieved from a thesaurus. To support this, methods using word embeddings for automatically extracting such synonyms have recently been proposed. However, patent documents often contain unique expressions and compound terms, such as specialized technical terminology and abstract conceptual terms, which are difficult to accurately capture using existing large language models trained at the token level.</div><div>In this study, we investigate a method for extracting synonyms from patent documents by embedding the definition sentences that explain technical terms. 
The experimental results demonstrate that the proposed method achieves more precise synonym extraction than conventional word embedding approaches, and it can contribute to the expansion of existing thesauri.</div><div>Thus, this research is expected to improve the recall of prior art searches and support the automatic extraction of technical elements for identifying technological trends.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100848"},"PeriodicalIF":4.9,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146077087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
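The core idea above, comparing technical terms by embedding the definition sentences that explain them rather than the terms themselves, can be sketched as follows. A bag-of-words vector is a crude stand-in for the sentence embedding, and the glossary is invented for illustration; the paper works with Japanese patent terminology and a learned encoder.

```python
import math
from collections import Counter

def embed(definition):
    """Stand-in embedding: a bag-of-words Counter over the definition.
    Any sentence encoder could be substituted here."""
    return Counter(definition.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def synonyms(term, definitions, top_n=1):
    """Rank the other terms by similarity of their definition embeddings."""
    query = embed(definitions[term])
    others = [t for t in definitions if t != term]
    return sorted(others, key=lambda t: cosine(query, embed(definitions[t])),
                  reverse=True)[:top_n]

# Hypothetical glossary (illustrative, not from the paper's patent corpus).
defs = {
    "accumulator": "a register that stores intermediate arithmetic results",
    "buffer": "a memory region that stores data temporarily during transfer",
    "register": "a small fast storage location that stores intermediate results",
}
print(synonyms("accumulator", defs))  # → ['register']
```

The advantage of comparing definitions is that compound or domain-specific terms with no shared surface tokens can still be linked when their explanations overlap.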
Pub Date : 2026-03-01Epub Date: 2025-12-29DOI: 10.1016/j.mlwa.2025.100828
Jungmin Eom , Minjun Kang , Myungkeun Yoon , Nikil Dutt , Jinkyu Kim , Jaekoo Lee
Deep learning-based medical AI systems are increasingly deployed for disease diagnosis in decentralized healthcare environments where data are siloed across hospitals and IoT devices and cannot be freely shared due to strict privacy and security regulations. However, most existing continual learning and distributed learning approaches either assume centrally aggregated data or overlook incremental clinical changes, leading to catastrophic forgetting when applied to real-world medical data streams.
This paper introduces a novel healthcare-specific framework that integrates continual learning and distributed learning methods to utilize medical AI models effectively by addressing the practical constraints of the healthcare and medical ecosystem, such as data privacy, security, and changing clinical environments. Through the proposed framework, medical clients, such as hospital devices and IoT-based smart devices, can collaboratively train deep learning-based models on distributed computing resources without sharing sensitive data. Additionally, by considering incremental characteristics in medical environments such as mutations, new diseases, and abnormalities, the proposed framework can improve the disease diagnosis of medical AI models in actual clinical scenarios.
We propose Privacy-preserving Rehearsal-based Continual Split Learning (PRCSL), a healthcare-specific continual split learning framework that combines differential-privacy-based exemplar sharing, a mutual information alignment (MIA) module to correct representation shifts induced by noisy exemplars, and a parameter-free nearest-mean-of-exemplars (NME) classifier to mitigate task-recency bias under non-IID data distributions. Across eight benchmark datasets, including four MedMNIST subsets, HAM10000, CCH5000, CIFAR-100, and SVHN, PRCSL achieves competitive performance compared with representative continual learning baselines in terms of average accuracy and average forgetting. In particular, PRCSL achieves up to 3.62 percentage points higher average accuracy than the best baseline. These results indicate that PRCSL enables privacy-preserving, communication-efficient, and continually adaptable medical AI in realistic decentralized clinical and IoT-enabled ecosystems. Our code is publicly available at our repository.
{"title":"PRCSL: A privacy-preserving continual split learning framework for decentralized medical diagnosis","authors":"Jungmin Eom , Minjun Kang , Myungkeun Yoon , Nikil Dutt , Jinkyu Kim , Jaekoo Lee","doi":"10.1016/j.mlwa.2025.100828","DOIUrl":"10.1016/j.mlwa.2025.100828","url":null,"abstract":"<div><div>Deep learning-based medical AI systems are increasingly deployed for disease diagnosis in decentralized healthcare environments where data are siloed across hospitals and IoT devices and cannot be freely shared due to strict privacy and security regulations. However, most existing continual learning and distributed learning approaches either assume centrally aggregated data or overlook incremental clinical changes, leading to catastrophic forgetting when applied to real-world medical data streams.</div><div>This paper introduces a novel healthcare-specific framework that integrates continual learning and distributed learning methods to utilize medical AI models effectively by addressing the practical constraints of the healthcare and medical ecosystem, such as data privacy, security, and changing clinical environments. Through the proposed framework, medical clients, such as hospital devices and IoT-based smart devices, can collaboratively train deep learning-based models on distributed computing resources without sharing sensitive data. 
Additionally, by considering incremental characteristics in medical environments such as mutations, new diseases, and abnormalities, the proposed framework can improve the disease diagnosis of medical AI models in actual clinical scenarios.</div><div>We propose Privacy-preserving Rehearsal-based Continual Split Learning (PRCSL), a healthcare-specific continual split learning framework that combines differential-privacy-based exemplar sharing, a mutual information alignment (MIA) module to correct representation shifts induced by noisy exemplars, and a parameter-free nearest-mean-of-exemplars (NME) classifier to mitigate task-recency bias under non-IID data distributions. Across eight benchmark datasets, including four MedMNIST subsets, HAM10000, CCH5000, CIFAR-100, and SVHN, PRCSL achieves competitive performance compared with representative continual learning baselines in terms of average accuracy and average forgetting. In particular, PRCSL achieves up to 3.62 percentage points higher average accuracy than the best baseline. These results indicate that PRCSL enables privacy-preserving, communication-efficient, and continually adaptable medical AI in realistic decentralized clinical and IoT-enabled ecosystems. Our code is publicly available at our repository.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"23 ","pages":"Article 100828"},"PeriodicalIF":4.9,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145976863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
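The parameter-free nearest-mean-of-exemplars classifier used in PRCSL can be sketched in a few lines: each class is represented by the mean of its stored exemplar features, and a sample is assigned to the class with the closest mean. This is a generic NME sketch under assumed 2-dimensional features; the class labels and exemplar vectors are hypothetical, not the PRCSL implementation.

```python
import math

def class_means(exemplars):
    """Compute the per-class mean feature vector from stored exemplars."""
    means = {}
    for label, vectors in exemplars.items():
        dim = len(vectors[0])
        means[label] = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return means

def nme_predict(feature, means):
    """Nearest-mean-of-exemplars: assign the class whose exemplar mean is
    closest in Euclidean distance. Because there is no trained classifier
    head, the decision rule cannot be biased toward recently seen tasks."""
    def dist(m):
        return math.sqrt(sum((f - x) ** 2 for f, x in zip(feature, m)))
    return min(means, key=lambda label: dist(means[label]))

# Toy exemplar sets in a 2-d feature space (illustrative labels only).
exemplars = {
    "benign": [[0.0, 0.0], [0.2, 0.1]],
    "malignant": [[1.0, 1.0], [0.9, 1.1]],
}
means = class_means(exemplars)
print(nme_predict([0.8, 0.9], means))  # → malignant
```

In a rehearsal-based setting the means are simply recomputed whenever the exemplar memory is updated, so adding a new class requires no retraining of the classifier.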