Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00014
A. Xuan, Mengmeng Yin, Yupei Li, Xiyu Chen, Zhenliang Ma
Choosing an appropriate model for time series prediction is one of the most prominent tasks in temporal data analysis. Because there is no unified standard for matching data to models, the most suitable model is usually selected on empirical evidence. Data characteristics affect model performance to some extent, and they may be where the factors that determine the balance between prediction accuracy and model complexity lie. In this article, a Multi-Criteria Performance Measure method that considers the Mean of Absolute Value of the Residual Autocorrelation is adopted to address this problem. The case studies apply time-series analysis to decompose each dataset into trend, seasonality, and residue, and derive limitations and recommendations from the stochasticity of the residue. The results show that statistical models perform best on datasets with low stochasticity, deep learning models are best suited to forecasting fluctuating and long-term time series, and machine learning models are candidates for datasets whose numerical characteristics fall between those two categories. These conclusions can inform the selection of appropriate models and help the research community focus its effort on more feasible or promising directions.
{"title":"A comprehensive evaluation of statistical, machine learning and deep learning models for time series prediction","authors":"A. Xuan, Mengmeng Yin, Yupei Li, Xiyu Chen, Zhenliang Ma","doi":"10.1109/CDMA54072.2022.00014","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00014","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
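The residual-autocorrelation criterion named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, so the definition below (sample autocorrelation of the forecast residuals, averaged in absolute value over lags 1 to `max_lag`) is an assumption, as are the function names and the default lag range. A value near zero suggests the model has left little structure in the residuals.

```python
from statistics import fmean


def autocorrelation(residuals, lag):
    """Sample autocorrelation of `residuals` at a positive `lag`."""
    n = len(residuals)
    mean = fmean(residuals)
    denom = sum((r - mean) ** 2 for r in residuals)
    if denom == 0:
        return 0.0  # constant residuals carry no correlation structure
    num = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean)
        for t in range(lag, n)
    )
    return num / denom


def mean_abs_residual_autocorrelation(actual, predicted, max_lag=5):
    """Average |autocorrelation| of forecast residuals over lags 1..max_lag.

    Assumed definition (not taken from the paper); needs at least two points.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    lags = range(1, min(max_lag, len(residuals) - 1) + 1)
    return fmean([abs(autocorrelation(residuals, k)) for k in lags])
```

For example, residuals that alternate in sign are strongly autocorrelated at lag 1, which this measure would penalise even if their magnitude is small.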
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00020
Atheer Algherairy, Wadha Almattar, Eman Bakri, Salma A. Albelali
Breast cancer is a common type of cancer among women globally, with high death rates. The survival rate decreases considerably for patients diagnosed at an advanced stage compared with those diagnosed at an early stage. The objective of this study is to investigate breast cancer classification and diagnosis using data from the WBCD dataset. In our methodology, the breast cancer data was first scaled. Then four feature selection methods were used to analyze the features: Pearson's Correlation, Forward Selection, Mutual Information, and Univariate ROC-AUC. Next, different Machine Learning (ML) models were applied, including Support Vector Machine (SVM), Logistic Regression (LR), and XGBoost. Finally, the three models were validated with 5-fold cross-validation. Each combination of model and feature selector was evaluated on several performance measures, including accuracy, precision, recall, and F1-score. The results show that the LR model with Forward Selection was the most successful classifier, with accuracy, precision, and F1-score of 0.982, 0.983, and 0.986, respectively. However, the highest recall, 0.992, was achieved by the SVM model with Correlation feature selection. The developed model could help medical experts diagnose breast cancer early and so decrease potential risk.
{"title":"The Impact of Feature Selection on Different Machine Learning Models for Breast Cancer Classification","authors":"Atheer Algherairy, Wadha Almattar, Eman Bakri, Salma A. Albelali","doi":"10.1109/CDMA54072.2022.00020","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00020","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
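The evaluation protocol described in the abstract above (5-fold cross-validation scored by accuracy, precision, recall, and F1-score) can be sketched in plain Python. This is an illustration of the standard definitions only, not the authors' code: the function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and the choice of which label counts as "positive" is an assumption.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

In a diagnostic setting recall is often the measure to watch, since a false negative is a missed cancer; that may be why the abstract reports the best recall separately from the best overall classifier.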
Pub Date: 2022-03-01 DOI: 10.1109/cdma54072.2022.00001
The proceedings contain 39 papers. The topics discussed include: evaluation of machine learning to early detection of highly cited papers; towards using deep reinforcement learning for better COVID-19 vaccine distribution strategies; an investigation of forecasting Tadawul all share index (TASI) using machine learning; intelligent deep detection method for malicious tampering of cancer imagery; an empirical analysis of health-related campaigns on Twitter Arabic hashtags; the accuracy performance of semantic segmentation network with different backbones; a comprehensive evaluation of statistical, machine learning and deep learning models for time series prediction; depression detection in Arabic using speech language recognition; a deep learning framework for temperature forecasting; improving relevance in a recommendation system to suggest charities without explicit user profiles using dual-autoencoders; and the impact of feature selection on different machine learning models for breast cancer classification.
{"title":"Proceedings 2022 7th International Conference on Data Science and Machine Learning Applications","authors":"","doi":"10.1109/cdma54072.2022.00001","DOIUrl":"https://doi.org/10.1109/cdma54072.2022.00001","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
In the field of computer vision, one-shot learning has proven effective, as it works accurately with a single labeled training example and a small number of training sets. In one-shot learning, predictions must be made accurately from only one sample of each new class. In this paper, we look at a strategy for training Siamese neural networks, which use a distinctive structure to automatically evaluate the similarity between inputs. The goal of this paper is to apply one-shot learning to audio classification by extracting specific features: the model is trained through a Siamese network with triplet loss, and at test time it calculates the similarity between a support set and a query set. We ran our experiments on the LibriSpeech ASR corpus. We evaluated our work on N-way 1-shot learning and obtained strong results for 2-way (100%), 3-way (95%), 4-way (84%), and 5-way (74%) classification, which outperform existing machine learning models by a large margin. To the best of our knowledge, this may be the first paper to look at the possibility of one-shot human speech recognition on the LibriSpeech ASR corpus using a Siamese network.
{"title":"One Voice is All You Need: A One-Shot Approach to Recognize Your Voice","authors":"Priata Nowshin, Shahriar Rumi Dipto, Intesur Ahmed, Deboraj Chowdhury, Galib Abdun Noor, Amitabha Chakrabarty, Muhammad Tahmeed Abdullah, Moshiur Rahman","doi":"10.1109/CDMA54072.2022.00022","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00022","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
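The two ingredients named in the abstract above, triplet loss for training and support-set versus query-set comparison for N-way 1-shot testing, can be sketched as follows. This is a hedged illustration rather than the paper's implementation: embeddings are assumed to be plain lists of floats already produced by some Siamese encoder, Euclidean distance and a margin of 1.0 are assumptions, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on unsquared distances (one common variant).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def n_way_one_shot_predict(query, support):
    """Pick the class whose single support embedding is closest to the query.

    `support` maps each of the N class labels to exactly one embedding.
    """
    return min(support, key=lambda label: euclidean(query, support[label]))
```

Accuracy on an N-way task is then the fraction of query embeddings assigned to their true speaker; it is expected to fall as N grows because each query has more wrong candidates, which matches the trend reported in the abstract.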
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00015
Zainab Alsharif, Salma Elhag, S. Alfakeh
Depression is one of the most common mental illnesses, and inaccurate assessment and misdiagnosis are quite common for such mental disorders. In response, this study discusses the use of speech-language recognition to improve the detection of depression in Arabic speech. After collecting the dataset, we extract speech features. Such features can be linguistic (uttered words) or para-linguistic (acoustic cues); we focus on the latter. We classify the participants into two groups: clinically depressed and non-depressed. To do that, we start by recording speech from interviews with the two groups. We then extract para-linguistic features using MFCC and use a CNN to build the classification model. The classification model reaches an accuracy of 98%, which could help in detecting depression from audio data.
{"title":"Depression Detection in Arabic Using Speech Language Recognition","authors":"Zainab Alsharif, Salma Elhag, S. Alfakeh","doi":"10.1109/CDMA54072.2022.00015","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00015","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00011
Niddal H. Imam, V. Vassilakis, Dimitris Kolovos
Trending hashtags are a primary feature of Twitter, which users regularly visit to get news or chat with each other. However, this valuable feature has been abused by malicious campaigns that use Twitter hashtags to disseminate religious hatred, promote terrorist propaganda, distribute fake financial news, and spread healthcare rumours. In recent years, some health-related campaigns have flooded Arabic trending hashtags on Twitter. These campaigns not only irritate users but also distribute malicious content. In this paper, a comprehensive empirical analysis of the ongoing health-related campaigns on Twitter Arabic hashtags is presented. After collecting and annotating tweets posted by these campaigns, we qualitatively analyzed the characteristics and behaviours of these tweets. We seek to find out what makes some of the tweets posted by these campaigns difficult to detect. Two main findings were identified: (1) these campaigns exhibit some spamming activities, such as using bots and trolls, and (2) they use unique hijacked accounts as adversarial examples to obfuscate detection. This study is the first to qualitatively analyze health-related campaigns on Twitter Arabic hashtags from a security point of view. Our findings suggest that some of the tweets posted by these campaigns need to be treated as adversarial examples that have not only been crafted to evade detection but also to undermine the deployed detection system.
{"title":"An Empirical Analysis of Health-Related Campaigns on Twitter Arabic Hashtags","authors":"Niddal H. Imam, V. Vassilakis, Dimitris Kolovos","doi":"10.1109/CDMA54072.2022.00011","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00011","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00007
F. Trad, Salah El Falou
Vaccination has been the most promising hope for getting back to normal ever since the COVID-19 outbreak started. But as promising as this sounds, vaccinating the whole population at the same time is practically infeasible because of the limited supply of vaccines on one side and the high demand on the other. The process cannot happen overnight, which is why governments have kept thinking about how to distribute vaccines in a way that helps their citizens get back to normal with the least possible damage (infections and deaths). In this study, we investigate how Reinforcement Learning (RL) can be used to distribute vaccines more efficiently among the citizens of a country, given their age and profession. To this end, we created an RL agent that learns vaccine distribution strategies through its interaction with a Monte Carlo (MC) simulation environment that we built. This environment runs an Agent-Based Model (ABM) in which agents interact with each other and with the environment where they live, and the virus spreads according to their behavior. The goal of the RL agent was to find vaccine distribution strategies that minimize the number of infections and deaths in that environment. After training our RL agent for 100 episodes, we compared the best strategy that RL gave us with some of the well-known strategies that countries adopt, and we found that the RL strategy outperformed them.
{"title":"Towards Using Deep Reinforcement Learning for Better COVID-19 Vaccine Distribution Strategies","authors":"F. Trad, Salah El Falou","doi":"10.1109/CDMA54072.2022.00007","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00007","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00037
Lama Al-Bakhat, Sultan Almuhammadi
The introduction of the QUIC protocol brought a major change to the Internet transport layer, one that improves user experience but also introduces some security threats. Developed by Google in 2012, QUIC provides low-latency, connection-oriented, and encrypted transport. In addition to its encryption capability, QUIC overcomes many issues found in current transport protocols, such as the high-latency connection establishment in TCP. On the other hand, studies on the security analysis of QUIC's key establishment have shown several drawbacks. Moreover, the encryption mechanism of the protocol allows adversarial Command & Control (C2) packets to blend with regular QUIC traffic without raising any alarms. Therefore, in this study, we develop a machine learning approach based on fingerprinting that can be used in intrusion detection systems to detect malicious C2 QUIC traffic. To demonstrate the effectiveness of this approach, we conducted an experiment and tested the performance of six machine learning classifiers. The results show that by utilizing the fingerprint, most of the classifiers recognized malicious C2 traffic with an average accuracy of 98%.
{"title":"Intrusion Detection on QUIC Traffic: A Machine Learning Approach","authors":"Lama Al-Bakhat, Sultan Almuhammadi","doi":"10.1109/CDMA54072.2022.00037","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00037","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00039
Saif Kazi, Priyesh Vakharia, Parth Shah, Riya Gupta, Yash Tailor, Palak Mantry, Jash Rathod
Data preprocessing is an important prerequisite for data mining and machine learning. In this paper, we introduce Preprocessy, a Python framework that provides customisable data preprocessing pipelines for processing structured data. Preprocessy pipelines come with sane defaults and the framework also provides low-level functions to build custom pipelines. The paper gives a brief overview of the features and the high-level APIs of Preprocessy along with a performance comparison against Scikit-learn and Pandas on two datasets. Preprocessy provides functions for handling missing data and outliers, data normalisation, feature selection and data sampling. The goal of Preprocessy is to be easy to use, flexible and performant. Preprocessy helps beginners and experts alike by making data preprocessing an easier and faster task.
{"title":"Preprocessy: A Customisable Data Preprocessing Framework with High-Level APIs","authors":"Saif Kazi, Priyesh Vakharia, Parth Shah, Riya Gupta, Yash Tailor, Palak Mantry, Jash Rathod","doi":"10.1109/CDMA54072.2022.00039","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00039","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
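The idea of a preprocessing pipeline built from small, composable steps, as described in the abstract above, can be illustrated without any library. The sketch below deliberately does not use or imitate Preprocessy's actual API, which the abstract does not show; every function name here is hypothetical, and mean imputation and min-max scaling are simply two familiar examples of "handling missing data" and "data normalisation".

```python
from functools import reduce


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the present values.

    Assumes at least one value is present.
    """
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column has no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def make_pipeline(*steps):
    """Compose column-level steps into one callable, applied left to right."""
    return lambda values: reduce(lambda acc, step: step(acc), steps, values)


# Usage: a default-style pipeline for one numeric column.
clean_column = make_pipeline(fill_missing_with_mean, min_max_scale)
```

Keeping each step a plain function is one way to offer both a ready-made pipeline and low-level building blocks, which is the trade-off between defaults and customisation that the abstract emphasises.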
Pub Date: 2022-03-01 DOI: 10.1109/CDMA54072.2022.00040
Abdullatif Al-Najim, Abrar Al-Amoudi, Kenji Ooishi, Mustafa Al-Nasser
Recommendation engine techniques have proven their performance at companies such as Amazon and Netflix. This paper discusses the use of the recommendation engine concept in the industrial field, especially in maintenance operations. Nowadays, a plant maintenance team needs to make a maintenance plan against sudden asset failure to reduce unscheduled production downtime. However, planning takes a lot of time because the appropriate maintenance countermeasures must be chosen from many options depending on the failure condition and asset environment. Therefore, we try to suggest a reliable countermeasure for the failure conditions in order to shorten the planning time. In this work, we propose two approaches to maintenance recommender systems, based on artificial intelligence techniques, for recommending maintenance actions. The first approach is a single-stage recommender system that reads the defect information and its description entered by the operator and recommends the maintenance action taken for similar defects found in the historical data. The second approach is a multi-stage recommender system in which the system starts by estimating one of the maintenance attributes, which is then used as an input to the next stage to estimate the next maintenance attribute. Finally, we evaluate the accuracy of the recommendations using past maintenance reports, which contain the defect conditions and the maintenance actions actually adopted. We found that the multi-stage system outperformed the single-stage system in terms of accuracy, and that the multi-stage system could help the maintenance team respond to sudden asset failure through its maintenance action recommendations.
{"title":"Intelligent Maintenance Recommender System","authors":"Abdullatif Al-Najim, Abrar Al-Amoudi, Kenji Ooishi, Mustafa Al-Nasser","doi":"10.1109/CDMA54072.2022.00040","DOIUrl":"https://doi.org/10.1109/CDMA54072.2022.00040","journal":{"name":"2022 7th International Conference on Data Science and Machine Learning Applications (CDMA)"},"publicationDate":"2022-03-01"}
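The multi-stage idea in the abstract above, where each stage estimates one maintenance attribute and feeds it to the next stage, can be sketched with a toy chain. The abstract does not say which artificial intelligence techniques the authors used, so a simple majority vote over matching historical records stands in for each stage's model here. The record fields (`defect`, `cause`, `action`) and the function names are hypothetical.

```python
from collections import Counter


def most_common_value(records, target, given):
    """Most frequent `target` value among records that match all `given` fields."""
    matches = [
        r[target]
        for r in records
        if all(r.get(key) == value for key, value in given.items())
    ]
    if not matches:
        return None  # no comparable history for this combination
    return Counter(matches).most_common(1)[0][0]


def multi_stage_recommend(records, defect, stages):
    """Estimate each attribute in `stages` in order, feeding earlier estimates forward."""
    known = dict(defect)
    for target in stages:
        known[target] = most_common_value(records, target, known)
    return {target: known[target] for target in stages}


# Hypothetical historical maintenance reports.
history = [
    {"defect": "leak", "cause": "seal", "action": "replace seal"},
    {"defect": "leak", "cause": "seal", "action": "replace seal"},
    {"defect": "leak", "cause": "corrosion", "action": "patch pipe"},
    {"defect": "noise", "cause": "bearing", "action": "lubricate"},
]
recommendation = multi_stage_recommend(history, {"defect": "leak"}, ["cause", "action"])
```

A single-stage system would map the defect straight to an action; chaining lets the intermediate estimate narrow the set of comparable historical cases before the action is chosen, which is one plausible reason for the accuracy gain the abstract reports.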