Pub Date: 2023-08-01 | DOI: 10.1007/s40745-023-00486-0
Shouhao Zhou, Xuelin Huang, Chan Shen, Hagop M. Kantarjian
This work concerns effective personalized prediction of longitudinal biomarker trajectories, motivated by a study of targeted cancer therapy for patients with chronic myeloid leukemia (CML). Continuous monitoring with a confirmed biomarker of residual disease is a key component of CML management for early prediction of disease relapse. However, the longitudinal biomarker measurements exhibit highly heterogeneous trajectories across subjects (patients), with varied shapes and patterns. The trajectory is believed to be clinically related to the development of treatment resistance, but knowledge of the underlying mechanism remains limited. To address this challenge, we propose a novel Bayesian approach to modeling the distribution of subject-specific longitudinal trajectories. It exploits flexible Bayesian learning to accommodate complex patterns of change over time and non-linear covariate effects, and allows real-time prediction for both in-sample and out-of-sample subjects. The generated information can support clinical decision making and consequently enhance personalized treatment management in precision medicine.
Title: "Bayesian Learning of Personalized Longitudinal Biomarker Trajectory". Annals of Data Science 11(3): 1031–1050.
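The paper's trajectory model is far richer than anything reproducible from the abstract alone; as a minimal sketch of the Bayesian machinery it builds on, the following shows a conjugate normal update for a single subject's mean biomarker level (the function name, priors, and variances are illustrative assumptions, not the authors' model):

```python
import numpy as np

def posterior_mean_var(y, sigma2, mu0, tau2):
    """Conjugate normal update for one subject's mean biomarker level.

    y      : observed biomarker values for the subject
    sigma2 : known measurement variance
    mu0    : prior mean of the subject-level mean
    tau2   : prior variance of the subject-level mean
    Returns the posterior mean and posterior variance.
    """
    n = len(y)
    precision = 1.0 / tau2 + n / sigma2      # posterior precision adds up
    post_var = 1.0 / precision
    post_mean = post_var * (mu0 / tau2 + np.sum(y) / sigma2)
    return post_mean, post_var
```

As more measurements arrive, the posterior mean moves from the prior toward the subject's own data, which is the basic mechanism behind real-time, subject-specific prediction.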
In this article, a reliability test plan under a time-truncated life test is considered for the logistic Rayleigh distribution (LRD). A brief discussion of the statistical properties and significance of the LRD is included in the present study. The quality characteristic for the proposed reliability test plan is "the larger the median, the better the quality of the lot". Minimum sample sizes are tabulated for different settings of the specified consumer's risk. Operating characteristic (OC) values are also tabulated for the chosen settings, and the pattern of the OC values is discussed. A comparative analysis of the present study with some other reliability test plans, based on sample sizes, is provided. As an illustration, the performance of the proposed plan for the LRD is shown through real-life examples.
Title: "Applications of Reliability Test Plan for Logistic Rayleigh Distributed Quality Characteristic". Authors: Mahendra Saha, Harsh Tripathi, Anju Devi, Pratibha Pareek. Annals of Data Science 11(5): 1687–1703. Published 2023-07-19. DOI: 10.1007/s40745-023-00473-5.
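The paper derives its minimum sample sizes from the LRD's median-life requirement; as a generic, hedged illustration of how such tables arise, the sketch below computes the smallest sample size for a binomial acceptance plan given a consumer's risk (the function and its parameters are assumptions, not the authors' procedure):

```python
import math

def min_sample_size(p_bad, beta, c=0):
    """Smallest n such that a lot whose items fail before the truncation
    time with probability p_bad is accepted with probability at most beta,
    when the plan accepts on at most c observed failures.

    Acceptance probability is binomial: P(accept) = sum_{k<=c} C(n,k) p^k (1-p)^(n-k).
    """
    n = c + 1
    while True:
        p_accept = sum(math.comb(n, k) * p_bad**k * (1 - p_bad)**(n - k)
                       for k in range(c + 1))
        if p_accept <= beta:
            return n
        n += 1
```

For example, with a zero-acceptance plan (c = 0), a 10% failure probability, and a 25% consumer's risk, the loop returns the smallest n with 0.9**n <= 0.25.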
Pub Date: 2023-07-15 | DOI: 10.1007/s40745-023-00482-4
Tarakashar Das, Sabrina Mobassirin, Syed Md. Minhaz Hossain, Aka Das, Anik Sen, Khaleque Md. Aashiq Kamal, Kaushik Deb
Parkinson’s disease (PD) is one of the most prevalent and harmful neurodegenerative conditions. Even today, PD diagnosis and monitoring remain expensive and inconvenient processes. With the unprecedented progress of artificial intelligence algorithms, there is an opportunity to develop a cost-effective system for diagnosing PD at an earlier stage. No permanent remedy has been established yet; however, an earlier diagnosis helps patients lead a better life. The three categories of symptoms most indicative of Parkinson’s disease are likely tremors, rigidity, and body bradykinesia. Therefore, we investigate the 53 unique features of the Parkinson’s Progression Markers Initiative dataset to determine the significant symptoms, including the three major categories. As feature selection is integral to developing a generalized model, we evaluate models both with and without feature selection. Four feature selection methods are incorporated: low variance filter, Wilcoxon rank-sum test, principal component analysis, and Chi-square test. Furthermore, we utilize machine learning, ensemble learning, and artificial neural networks (ANN) for classification. Experimental evidence shows that not all symptoms are equally important, but no symptom can be completely eliminated. Our proposed ANN model attains the best results with all features: 99.51% mean accuracy, 98.17% mean specificity, 0.9830 mean Kappa score, 0.99 mean AUC, and 99.70% mean F1-score. The efficiency of the suggested technique on diverse data modalities is demonstrated by comparison with recent publications. Finally, we establish a trade-off between classification time and accuracy.
Title: "Patient Questionnaires Based Parkinson’s Disease Classification Using Artificial Neural Network". Annals of Data Science 11(5): 1821–1864. Open access PDF: https://link.springer.com/content/pdf/10.1007/s40745-023-00482-4.pdf
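Of the four feature-selection methods the abstract names, the low variance filter is the simplest; a minimal sketch (the threshold value in the example is a hypothetical choice):

```python
import numpy as np

def low_variance_filter(X, threshold):
    """Drop feature columns whose variance falls below `threshold`.

    X : (n_samples, n_features) array.
    Returns the filtered array and the indices of the retained columns.
    """
    variances = X.var(axis=0)
    keep = np.where(variances >= threshold)[0]
    return X[:, keep], keep
```

A column that is (nearly) constant across patients carries no discriminative signal, so it can be removed before training any classifier.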
Pub Date: 2023-07-11 | DOI: 10.1007/s40745-023-00483-3
Ishfaq S. Ahmad, Rameesa Jan, Poonam Nirwan, Peer Bilal Ahmad
In this paper, a new two-parameter distribution over the bounded support (0,1) is introduced and studied in detail. Interesting statistical properties such as concavity, the hazard rate function, mean residual life, moments, and the quantile function are discussed. The method of moments and maximum likelihood estimation are used to estimate the unknown parameters of the proposed model. The finite-sample performance of the estimation methods is evaluated through a Monte Carlo simulation study. Application of the proposed distribution to real data sets shows a better fit than many known two-parameter distributions on the unit interval. Moreover, a new regression model is introduced as an alternative to various unit-interval regression models.
Title: "A New Class of Distribution Over Bounded Support and Its Associated Regression Model". Annals of Data Science 11(2): 549–569.
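The paper's new two-parameter distribution is not specified in the abstract; purely as an illustration of the method of moments on the unit interval, the sketch below fits a Beta distribution (a stand-in, not the proposed model) by matching the sample mean and variance:

```python
def beta_method_of_moments(sample):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution
    on (0, 1), from the sample mean m and variance v:

        common = m * (1 - m) / v - 1
        alpha  = m * common
        beta   = (1 - m) * common
    """
    n = len(sample)
    m = sum(sample) / n
    v = sum((x - m) ** 2 for x in sample) / n
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common
```

The same recipe applies to any two-parameter family on (0, 1): express the first two moments in terms of the parameters and invert the system.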
Brain tumor segmentation is an important field and a sensitive task in tumor diagnosis. Research in this area has helped specialists detect the tumor’s location in order to treat it in its early stages. Numerous deep learning methods have been proposed, including symmetric U-Net architectures, which have achieved excellent results in medical imaging, particularly brain tumor segmentation. In this paper, we propose an improved U-Net architecture called Inception U-Det, inspired by U-Det. This work employs the inception block instead of the convolution block used in the bi-directional feature pyramid network (Bi-FPN) during the skip-connection phase of U-Det. Furthermore, a comparative study has been performed between our proposed approach and three well-known architectures in medical image segmentation: U-Net, DC-Unet, and U-Det. Several segmentation metrics were computed for these methods on the publicly available BraTS datasets. Our results are promising in terms of accuracy, Dice similarity coefficient (DSC), and intersection-over-union (IoU). Moreover, the proposed method achieved a DSC of 87.9%, 85.5%, and 83.9% on BraTS2020, BraTS2018, and BraTS2017, respectively, calculated from the best fold in the fourfold cross-validation employed in the present approach.
Title: "Inception-UDet: An Improved U-Net Architecture for Brain Tumor Segmentation". Authors: Ilyasse Aboussaleh, Jamal Riffi, Adnane Mohamed Mahraz, Hamid Tairi. Annals of Data Science 11(3): 831–853. Published 2023-07-01. DOI: 10.1007/s40745-023-00480-6.
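The Dice similarity coefficient reported above has a standard definition for binary masks; a minimal sketch:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary segmentation masks:

        DSC = 2 * |A ∩ B| / (|A| + |B|)

    eps guards against division by zero when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

DSC and IoU are monotonically related (DSC = 2·IoU / (1 + IoU)), which is why papers often report both.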
Pub Date: 2023-06-29 | DOI: 10.1007/s40745-023-00481-5
Reetika Sarkar, Sithija Manage, Xiaoli Gao
High-dimensional genomic data studies often exhibit strong correlations, which result in instability and inconsistency in the estimates obtained using common regularization approaches such as the Lasso and MCP. In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure named rPGBS to address stable variable selection in various strong-correlation settings. The approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering and bi-level variable selection. Extensive simulation studies and analyses of real high-dimensional genomic datasets demonstrate the advantage of the proposed rPGBS method over some of the most widely used regularization methods. In particular, rPGBS selects variables more stably across a variety of correlation settings than some recent methods addressing variable selection with strong correlations: Precision Lasso (Wang et al. in Bioinformatics 35:1181–1187, 2019) and Whitening Lasso (Zhu et al. in Bioinformatics 37:2238–2244, 2021). Moreover, rPGBS has been shown to be computationally efficient across various settings.
Title: "Stable Variable Selection for High-Dimensional Genomic Data with Strong Correlations". Annals of Data Science 11(4): 1139–1164.
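rPGBS itself is not reproducible from the abstract; as a sketch of the shrinkage machinery underlying the Lasso-type methods being compared, here is the soft-thresholding operator, the proximal map that drives coordinate-descent Lasso updates:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator:

        S(z, lam) = sign(z) * max(|z| - lam, 0)

    Coefficients with magnitude below lam are set exactly to zero, which is
    how Lasso-type penalties perform variable selection.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

Under strong correlation, which of two near-collinear coefficients survives this thresholding can flip between resamples — precisely the instability the paper targets.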
Pub Date: 2023-06-15 | DOI: 10.1007/s40745-023-00477-1
Kousik Maiti, Suchandan Kayal, Aditi Kar Gangopadhyay
This article addresses estimation of the parameters and reliability characteristics of a generalized X-Exponential distribution based on progressive type-II censored samples. The maximum likelihood estimates (MLEs) are obtained, and their existence and uniqueness are studied. Bayes estimates are obtained under squared error and entropy loss functions, computed via the Markov chain Monte Carlo method. Bootstrap-t and bootstrap-p methods are used to compute interval estimates. Further, a simulation study is performed to compare the performance of the proposed estimates. Finally, a real-life dataset is analysed for illustrative purposes.
Title: "On Progressively Censored Generalized X-Exponential Distribution: (Non) Bayesian Estimation with an Application to Bladder Cancer Data". Annals of Data Science 11(5): 1761–1798.
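The generalized X-Exponential MLEs discussed above require numerical optimization; as a closed-form contrast, the classical exponential MLE under plain (non-progressive) type-II censoring is a one-liner (this is textbook material, not the paper's estimator):

```python
def exp_mle_type2(ordered_failures, n):
    """MLE of the exponential rate under type-II censoring.

    With the r smallest failure times observed out of n items on test,
    the remaining n - r items are censored at the largest observed time:

        lambda_hat = r / (sum of observed times + (n - r) * t_(r))
    """
    r = len(ordered_failures)
    t_r = ordered_failures[-1]
    total_time_on_test = sum(ordered_failures) + (n - r) * t_r
    return r / total_time_on_test
```

The denominator is the total time on test; censored units contribute their survival time up to t_(r) even though their failure times are unknown.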
Pub Date: 2023-06-10 | DOI: 10.1007/s40745-023-00475-3
WeiKang Liu, Yanchun Zhang, Hong Yang, Qinxue Meng
Machine learning methods promote the sustainable development of wise information technology of medicine (WITMED), and the variety of medical data brings high value and convenience to medical analysis. However, applications of medical data are also confronted with the risk of privacy leakage, which is hard to avoid, especially when conducting correlation analysis or sharing data among multiple institutions. Data security and privacy preservation have recently played an essential role in secure and private medical data analysis, where many differential privacy strategies are applied to medical data publishing and mining. In this paper, we survey research on the applications of differential privacy for medical data analysis, discussing the necessity of medical privacy preservation, the advantages of differential privacy, and its applications to typical medical data, such as genomic data and wearable device data. Furthermore, we discuss the challenges and potential future research directions for differential privacy in medical applications.
Title: "A Survey on Differential Privacy for Medical Data Analysis". Annals of Data Science 11(2): 733–747.
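The most common building block of the differential privacy strategies surveyed here is the Laplace mechanism; a minimal sketch for a numeric query (the epsilon and sensitivity values in the example are illustrative):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace(0, sensitivity/epsilon) noise — the
    basic epsilon-differentially-private mechanism for numeric queries."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale)

rng = np.random.default_rng(0)
# A count query has sensitivity 1; epsilon = 0.5 gives noise scale 2.
noisy_count = laplace_mechanism(100, sensitivity=1, epsilon=0.5, rng=rng)
```

Smaller epsilon means a larger noise scale and stronger privacy at the cost of accuracy — the trade-off that recurs throughout the survey.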
Pub Date: 2023-06-09 | DOI: 10.1007/s40745-023-00479-z
Shrawan Kumar, Kavita Gupta, Manya Gupta
In this paper, the machine learning algorithm Naive Bayes classifier is applied to the Kaggle spam mails dataset to classify the emails in an inbox as spam or ham. The dataset comprises two main attributes: type and text. The target variable "Type" has two levels: ham and spam. The text variable contains the messages to be classified as spam or ham. Results are obtained using two different Laplace smoothing values, and the decision maker can choose between them according to the tolerated error rates for ham and spam messages. The computing software R is used for data analysis.
Title: "Naïve Bayes Classifier Model for Detecting Spam Mails". Annals of Data Science 11(6): 1887–1897.
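The paper works in R; as a language-neutral sketch of the same idea, a multinomial Naive Bayes classifier with Laplace smoothing fits in a few lines (the toy documents and the `laplace` parameter value are illustrative assumptions):

```python
from collections import Counter
import math

def train_nb(docs, labels, laplace=1.0):
    """Multinomial Naive Bayes with add-`laplace` smoothing.
    docs: list of token lists; labels: parallel list of class names."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    counts = {c: Counter() for c in classes}
    priors = {c: labels.count(c) / len(labels) for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    return classes, vocab, counts, priors, laplace

def predict_nb(model, doc):
    classes, vocab, counts, priors, laplace = model
    def log_posterior(c):
        # Smoothed word likelihoods: (count + laplace) / (total + laplace * |V|)
        total = sum(counts[c].values()) + laplace * len(vocab)
        return math.log(priors[c]) + sum(
            math.log((counts[c][w] + laplace) / total) for w in doc if w in vocab)
    return max(classes, key=log_posterior)
```

Varying `laplace` shifts the ham/spam error balance, which is exactly the knob the abstract leaves to the decision maker.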
Pub Date: 2023-06-08 | DOI: 10.1007/s40745-023-00474-4
Clemens Tegetmeier, Arne Johannssen, Nataliya Chukhrova
Book recommender systems provide personalized recommendations of books to users based on their previous searches or purchases. As online book trading has become increasingly important in recent years, artificial intelligence (AI) algorithms are needed to recommend suitable books to users and encourage them to make purchasing decisions in the short and long run. In this paper, we consider AI algorithms for so-called collaborative book recommender systems, especially the matrix factorization algorithm using stochastic gradient descent and the book-based k-nearest-neighbor algorithm. We perform a comprehensive case study on the Book-Crossing benchmark dataset and implement various variants of both AI algorithms to predict unknown book ratings and to recommend books to individual users based on the highest predicted ratings. This study evaluates the quality of the implemented methods in recommending books using selected evaluation metrics for AI algorithms.
Title: "Artificial Intelligence Algorithms for Collaborative Book Recommender Systems". Annals of Data Science 11(5): 1705–1739. Open access PDF: https://link.springer.com/content/pdf/10.1007/s40745-023-00474-4.pdf
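The matrix factorization algorithm with stochastic gradient descent mentioned above can be sketched as follows (hyperparameters and the toy ratings matrix are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def mf_sgd(R, k=2, lr=0.02, reg=0.01, epochs=1000, seed=0):
    """Factorize a ratings matrix R (np.nan marks unknown ratings) as
    P @ Q.T by stochastic gradient descent over the observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
    Q = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            pu = P[u].copy()                      # update Q with the old P[u]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

Unknown ratings are then predicted as `P[u] @ Q[i]`, and the items with the highest predicted ratings for a user become that user's recommendations.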