
Journal of Big Data: latest publications

DiabSense: early diagnosis of non-insulin-dependent diabetes mellitus using smartphone-based human activity recognition and diabetic retinopathy analysis with Graph Neural Network
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-03 DOI: 10.1186/s40537-024-00959-w
Md Nuho Ul Alam, Ibrahim Hasnine, Erfanul Hoque Bahadur, Abdul Kadar Muhammad Masum, Mercedes Briones Urbano, Manuel Masias Vergara, Jia Uddin, Imran Ashraf, Md. Abdus Samad

Non-Insulin-Dependent Diabetes Mellitus (NIDDM) is a chronic health condition characterized by high blood sugar levels; if not treated early, it can lead to serious complications such as blindness. Human Activity Recognition (HAR) offers potential for early NIDDM diagnosis, emerging as a key application of HAR technology. This research introduces DiabSense, a state-of-the-art smartphone-based system for early staging of NIDDM. DiabSense combines HAR and Diabetic Retinopathy (DR) analysis, leveraging two different Graph Neural Networks (GNNs). HAR covers a comprehensive array of 23 human activities related to diabetes symptoms, and DR is a prevalent complication of NIDDM. The Graph Attention Network (GAT) for HAR achieved 98.32% accuracy on sensor data, while the Graph Convolutional Network (GCN) on the APTOS 2019 dataset scored 84.48%, surpassing other state-of-the-art models. The trained GCN analyzed retinal images of four experimental human subjects to generate DR reports, and the GAT produced their average durations of daily activities over 30 days. Daily activities of diabetic patients during non-diabetic periods were measured and compared with the daily activities of the experimental subjects, which helped generate risk factors. Fusing the risk factors with DR conditions enabled early diagnosis recommendations for the experimental subjects despite the absence of any apparent symptoms. DiabSense's outcomes were compared against the subjects' clinical diagnosis reports using the A1C test, and the results confirmed that the system accurately assessed the subjects' need for early diagnosis. Overall, DiabSense exhibits significant potential for ensuring early NIDDM treatment, improving millions of lives worldwide.
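Both DiabSense components rest on graph neural networks; the HAR branch uses graph attention. As a rough illustration of the attention mechanism a GAT layer applies (a minimal NumPy sketch of the generic technique, not the DiabSense implementation — the weight shapes and LeakyReLU slope are standard GAT conventions, assumed here):

```python
import numpy as np

def gat_layer(x, adj, w, a_src, a_dst):
    """One graph-attention layer in the spirit of GAT.

    x:   (n, f_in) node features
    adj: (n, n) binary adjacency, self-loops included
    w:   (f_in, f_out) shared linear transform
    a_src, a_dst: (f_out,) attention vectors for the source/target halves
    """
    h = x @ w                                    # transformed node features
    # e[i, j] = LeakyReLU(a_src.h_i + a_dst.h_j), scored for every pair
    e = h @ a_src[:, None] + (h @ a_dst)[None, :]
    e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU, slope 0.2
    e = np.where(adj > 0, e, -np.inf)            # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over neighbours
    return alpha @ h                             # attention-weighted aggregation
```

With zero attention vectors the softmax degenerates to uniform neighbour averaging, which is a useful sanity check when wiring up such a layer.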

Citations: 0
Tc-llama 2: fine-tuning LLM for technology and commercialization applications
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-02 DOI: 10.1186/s40537-024-00963-0
Jeyoon Yeom, Hakyung Lee, Hoyoon Byun, Yewon Kim, Jeongeun Byun, Yunjeong Choi, Sungjin Kim, Kyungwoo Song

This paper introduces TC-Llama 2, a novel application of large language models (LLMs) in the technology-commercialization field. Traditional methods in this field, reliant on statistical learning and expert knowledge, often struggle to process the complex and diverse nature of technology-commercialization data. TC-Llama 2 addresses these limitations by utilizing the advanced generalization capabilities of LLMs, specifically adapting them to this intricate domain. Our model, based on the open-source LLM Llama 2, is customized through instruction tuning using bilingual Korean-English datasets. Our approach involves transforming technology-commercialization data into formats compatible with LLMs, enabling the model to effectively learn detailed technological knowledge and product hierarchies. We introduce a unique model evaluation strategy, leveraging new matching and generation tasks to verify the alignment of the technology-commercialization relationship in TC-Llama 2. Our results, derived from refining task-specific instructions for inference, provide valuable insights into customizing language models for specific sectors, potentially leading to new applications in technology categorization, utilization, and predictive product development.
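The key data-engineering step the abstract describes is converting domain records into instruction-tuning pairs. A hypothetical sketch of such a conversion — the field names (`tech_name`, `description`, `product`) and prompt template are illustrative assumptions, not the TC-Llama 2 schema:

```python
def to_instruction_example(record):
    """Format a technology-commercialization record as a single
    instruction-tuning (prompt, response) pair.

    Assumes a record with hypothetical fields 'tech_name',
    'description', and 'product'; real schemas will differ.
    """
    prompt = (
        "Below is a description of a technology. "
        "Identify a product category it could be commercialized into.\n\n"
        f"### Technology: {record['tech_name']}\n"
        f"### Description: {record['description']}\n"
        "### Product:"
    )
    # The target completion the model is trained to generate
    return {"prompt": prompt, "response": record["product"]}
```

A corpus of such pairs is what a standard instruction-tuning pipeline (e.g. supervised fine-tuning on prompt/response records) would consume.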

Citations: 0
An ensemble machine learning model for predicting one-year mortality in elderly coronary heart disease patients with anemia
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-24 DOI: 10.1186/s40537-024-00966-x
Longcan Cheng, Yan Nie, Hongxia Wen, Yan Li, Yali Zhao, Qian Zhang, Mingxing Lei, Shihui Fu

Objective

This study was designed to develop and validate a robust predictive model for one-year mortality in elderly coronary heart disease (CHD) patients with anemia using machine learning methods.

Methods

Demographics, tests, comorbidities, and drugs were collected for a cohort of 974 elderly patients with CHD. A prospective analysis was performed to evaluate predictive performances of the developed models. External validation of models was performed in a series of 112 elderly CHD patients with anemia.

Results

The overall one-year mortality was 43.6%. Risk factors included heart rate, chronic heart failure, tachycardia, and β-receptor blockers. Protective factors included hemoglobin, albumin, high-density lipoprotein cholesterol, estimated glomerular filtration rate (eGFR), left ventricular ejection fraction (LVEF), aspirin, clopidogrel, calcium channel blockers, angiotensin-converting enzyme inhibitors (ACEIs)/angiotensin receptor blockers (ARBs), and statins. Compared with other algorithms, an ensemble machine learning model performed best, with an area under the curve (95% confidence interval) of 0.828 (0.805–0.870) and a Brier score of 0.170. Calibration and density curves further confirmed the favorable predicted probability and discriminative ability of the ensemble model. External validation of the ensemble model also exhibited good performance, with an area under the curve (95% confidence interval) of 0.825 (0.734–0.916) and a Brier score of 0.185. Patients in the high-risk group had a more than six-fold probability of one-year mortality compared with those in the low-risk group (P < 0.001). SHapley Additive exPlanations (SHAP) identified the top five factors associated with one-year mortality as hemoglobin, albumin, eGFR, LVEF, and ACEIs/ARBs.

Conclusions

This model identifies key risk and protective factors, providing valuable insights for improving risk assessment, informing clinical decision-making, and performing targeted interventions. It outperforms other algorithms in predictive performance and provides significant opportunities for personalized risk-mitigation strategies, with clinical implications for improving patient care.
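The abstract reports Brier scores for a probability-averaging ensemble. A generic sketch of that scoring step — soft voting over member models plus the Brier score formula, not the paper's specific ensemble composition:

```python
import numpy as np

def ensemble_brier(prob_matrix, y_true):
    """Soft-voting ensemble: average member probabilities, then score
    with the Brier score (mean squared error of predicted probability).

    prob_matrix: (n_models, n_patients) predicted mortality probabilities
    y_true:      (n_patients,) observed outcomes in {0, 1}
    """
    p_ens = prob_matrix.mean(axis=0)         # average over member models
    brier = np.mean((p_ens - y_true) ** 2)   # lower is better; 0 is perfect
    return p_ens, brier
```

Brier score complements AUC here: AUC measures ranking (discrimination), while the Brier score also penalizes poorly calibrated probabilities.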

Citations: 0
Hate speech detection in the Bengali language: a comprehensive survey
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-23 DOI: 10.1186/s40537-024-00956-z
Abdullah Al Maruf, Ahmad Jainul Abidin, Md. Mahmudul Haque, Zakaria Masud Jiyad, Aditi Golder, Raaid Alubady, Zeyar Aung

The detection of hate speech (HS) on online platforms has become extremely important for maintaining a safe and inclusive environment. While significant progress has been made in English-language HS detection, methods for detecting HS in other languages, such as Bengali, have received far less attention. In this survey, we outline the key challenges specific to HS detection in Bengali, including the scarcity of labeled datasets, linguistic nuances, and contextual variations. We also examine the approaches and methodologies employed by researchers to address these challenges, including classical machine learning techniques, ensemble approaches, and more recent deep learning advancements. Furthermore, we review the performance metrics used for evaluation, including accuracy, precision, recall, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), sensitivity, specificity, and the F1 score, providing insights into the effectiveness of the proposed models. Additionally, we identify the limitations of and future directions for research in Bengali HS detection, highlighting the need for larger annotated datasets, cross-lingual transfer learning techniques, and the incorporation of contextual information to improve detection accuracy.
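The evaluation metrics the survey covers all derive from the binary confusion matrix; a minimal sketch of the standard formulas (this is textbook material, not survey-specific code):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix
    counts: true/false positives (tp, fp) and false/true negatives
    (fn, tn). Assumes no zero denominators."""
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}
```

ROC/AUC, also discussed in the survey, additionally require the classifier's ranking scores rather than hard predictions, which is why they are reported separately from these count-based metrics.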

Citations: 0
Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-23 DOI: 10.1186/s40537-024-00964-z
Mohammed Bergui, Soufiane Hourri, Said Najah, Nikola S. Nikolov

Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.
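The paper found random forests most effective; as a sketch of the simplest model in its comparison, here is a least-squares linear-regression baseline for runtime prediction. The feature set (input size, mapper/reducer counts, etc.) is an assumption for illustration, not the paper's engineered features:

```python
import numpy as np

def fit_runtime_model(features, runtimes):
    """Fit a linear runtime model by ordinary least squares.

    features: (n_jobs, n_features), e.g. input size, #mappers, #reducers
    runtimes: (n_jobs,) observed execution times
    Returns [intercept, weight_1, ..., weight_k].
    """
    X = np.hstack([np.ones((features.shape[0], 1)), features])  # add intercept
    coef, *_ = np.linalg.lstsq(X, runtimes, rcond=None)
    return coef

def predict_runtime(coef, features):
    """Predict execution times for new jobs with the fitted coefficients."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    return X @ coef
```

Swapping this estimator for a tree ensemble (the paper's winner) changes only the fit/predict calls; the surrounding pipeline of trace collection, preprocessing, and evaluation stays the same.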

Citations: 0
Introducing Mplots: scaling time series recurrence plots to massive datasets
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-20 DOI: 10.1186/s40537-024-00954-1
Maryam Shahcheraghi, Ryan Mercer, João Manuel de Almeida Rodrigues, Audrey Der, Hugo Filipe Silveira Gamboa, Zachary Zimmerman, Kerry Mauck, Eamonn Keogh

Time series similarity matrices (informally, recurrence plots or dot-plots) are useful tools for time series data mining. They can be used to guide data exploration, and various useful features can be derived from them and fed into downstream analytics. However, time series similarity matrices suffer from very poor scalability, taxing both time and memory requirements. In this work, we introduce novel ideas that increase, by several orders of magnitude, the size of the largest time series similarity matrices that can be examined. The first is a novel algorithm that computes the matrices in a way that removes the dependency on subsequence length. This algorithm is so fast that it lets us address datasets where memory limitations begin to dominate. Our second contribution is a multiscale algorithm that computes an approximation of the matrix appropriate for the limitations of the user's memory or screen resolution, then performs a local, just-in-time recomputation of any region the user wishes to zoom in on. Given that this largely removes the time and space barriers, human visual attention becomes the bottleneck. We further introduce algorithms that search massive matrices with quadrillions of cells and prioritize regions for later examination by either humans or algorithms. We demonstrate the utility of our ideas for data exploration, segmentation, and classification in domains as diverse as astronomy, bioinformatics, entomology, and wildlife monitoring.
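For concreteness, this is the object being scaled: the brute-force self-similarity matrix over z-normalized subsequences. This naive O(n²m) sketch is exactly what becomes infeasible at scale, which is the problem the paper addresses (it is not the paper's algorithm):

```python
import numpy as np

def similarity_matrix(ts, m):
    """Brute-force self-similarity matrix of a 1-D time series.

    Entry (i, j) is the Euclidean distance between the z-normalized
    length-m subsequences starting at i and j. Assumes no constant
    subsequences (nonzero standard deviation).
    """
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    # z-normalize each subsequence so similarity reflects shape, not level
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    diff = subs[:, None, :] - subs[None, :, :]   # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))     # (n, n) distance matrix
```

Even at modest n the (n, n) output dominates memory — for a million-point series it would have a trillion cells — which motivates the paper's approximation-plus-local-recomputation strategy.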

Citations: 0
Emotion AWARE: an artificial intelligence framework for adaptable, robust, explainable, and multi-granular emotion analysis
IF 8.1 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-10 DOI: 10.1186/s40537-024-00953-2
Gihan Gamage, Daswin De Silva, Nishan Mills, Damminda Alahakoon, Milos Manic

Emotions are fundamental to human behaviour. How we feel, individually and collectively, determines how humanity evolves and advances into our shared future. The rapid digitalisation of our personal, social and professional lives means we are frequently using digital media to express, understand and respond to emotions. Although recent developments in Artificial Intelligence (AI) can analyse sentiment and detect emotions, they are not effective at comprehending the complexity and ambiguity of digital emotion expressions in knowledge-focused activities of customers, people, and organizations. In this paper, we address this challenge by proposing a novel AI framework for the adaptable, robust, and explainable detection of multi-granular assembles of emotions. This framework consolidates lexicon generation and finetuned Large Language Model (LLM) approaches to formulate multi-granular assembles of two, eight and fourteen emotions. The framework is robust to ambiguous emotion expressions that are implied in conversation, adaptable to domain-specific emotion semantics, and the assembles are explainable using constituent terms and intensity. We conducted nine empirical studies using datasets representing diverse human emotion behaviours. The results of these studies comprehensively demonstrate and evaluate the core capabilities of the framework, which consistently outperforms state-of-the-art approaches in adaptable, robust, and explainable multi-granular emotion detection.
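The explainability claim rests on scoring emotions from constituent terms and their intensities, the lexicon half of the framework. A minimal sketch of that idea — the toy lexicon and whitespace tokenization are illustrative assumptions, not the paper's generated lexicon or pipeline:

```python
def score_emotions(text, lexicon):
    """Lexicon-based emotion scoring: sum per-emotion term intensities
    found in the text.

    lexicon: {emotion: {term: intensity}}; the returned scores are
    explainable because each score decomposes into matched terms.
    """
    tokens = text.lower().split()  # naive tokenization for illustration
    return {
        emotion: sum(terms.get(t, 0.0) for t in tokens)
        for emotion, terms in lexicon.items()
    }
```

Because every score is a sum over matched lexicon terms, the contribution of each constituent term (and its intensity) can be reported back to the user, which is the sense of "explainable" used above.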

Citations: 0
Examining ALS: reformed PCA and random forest for effective detection of ALS
IF 8.1, CAS Q2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS, Pub Date: 2024-07-10, DOI: 10.1186/s40537-024-00951-4
Abdullah Alqahtani, Shtwai Alsubai, Mohemmed Sha, Ashit Kumar Dutta

ALS (Amyotrophic Lateral Sclerosis) is a fatal neurodegenerative disease of the human motor system. It is a group of progressive diseases affecting the nerve cells in the brain and spinal cord that control the body's muscle movement; hence, detecting and classifying ALS at the right time is considered vital for saving human lives. Various studies have therefore applied different AI techniques to ALS detection; however, these methods have proven ineffectual at identifying the disease owing to the ineffective algorithms employed. Hence, the proposed model utilizes Modified Principal Component Analysis (MPCA) and Modified Random Forest (MRF) to perform dimensionality reduction over all potential features for effective classification of the presence or absence of an ALS-causing mutation in the corresponding gene. The MPCA is adapted to capture the Low-Importance (LI) Data Transformation and proceeds in three steps: covariance matrix correlation, eigenvector-eigenvalue decomposition, and selection of the desired principal components. Choosing these components without any loss of features ensures better viability in selecting attributes for ALS-causing gene classification. Classification then follows with the Modified RF, which updates the clump detector technique: the clump detector applies K-means clustering, and the dimension-reduced data are grouped accordingly. These clustered data are analyzed as either causing or not causing ALS. Finally, the model's performance is assessed using evaluation metrics such as accuracy, recall, F1 score, and precision, and the proposed model is compared with existing models to assess its efficacy.
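The PCA pipeline the abstract outlines (covariance matrix, eigenvector-eigenvalue decomposition, selection of the desired principal components) can be sketched as follows. The abstract does not specify the "modified" low-importance transformation of MPCA, so this is plain covariance-based PCA as an illustrative baseline, with synthetic data in place of the gene features.

```python
import numpy as np

# Sketch of dimensionality reduction via covariance eigendecomposition,
# following the three steps named in the abstract. The MPCA-specific
# Low-Importance transformation is not reproduced here.

def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its top principal axes."""
    X_centered = X - X.mean(axis=0)                # center each feature
    cov = np.cov(X_centered, rowvar=False)         # 1) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 2) eigendecomposition
    order = np.argsort(eigvals)[::-1]              # sort by explained variance
    components = eigvecs[:, order[:n_components]]  # 3) keep desired components
    return X_centered @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for the gene feature matrix
Z = pca_reduce(X, 2)
print(Z.shape)  # (100, 2)
```

In the paper's pipeline, the reduced matrix `Z` would then be grouped with K-means inside the clump detector before the Modified RF classification step.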

Citations: 0
Exploring AI-driven approaches for unstructured document analysis and future horizons
IF 8.1, CAS Q2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS, Pub Date: 2024-07-05, DOI: 10.1186/s40537-024-00948-z
Supriya V. Mahadevkar, Shruti Patil, Ketan Kotecha, Lim Way Soong, Tanupriya Choudhury

In the current industrial landscape, a significant number of sectors are grappling with the challenges posed by unstructured data, which incurs financial losses amounting to millions annually. If harnessed effectively, this data has the potential to substantially boost operational efficiency. Traditional methods for extracting information have their limitations; however, solutions powered by artificial intelligence (AI) could provide a more fitting alternative. There is an evident gap in scholarly research concerning a comprehensive evaluation of AI-driven techniques for the extraction of information from unstructured content. This systematic literature review aims to identify, assess, and deliberate on prospective research directions within the field of unstructured document information extraction. It has been observed that prevailing extraction methods primarily depend on static patterns or rules, often proving inadequate when faced with complex document structures typically encountered in real-world scenarios, such as medical records. Datasets currently available to the public suffer from low quality and are tailored for specific tasks only. This underscores an urgent need for developing new datasets that accurately reflect complex issues encountered in practical settings. The review reveals that AI-based techniques show promise in autonomously extracting information from diverse unstructured documents, encompassing both printed and handwritten text. Challenges arise, however, when dealing with varied document layouts. Proposing a framework through hybrid AI-based approaches, this review envisions processing a high-quality dataset for automatic information extraction from unstructured documents. Additionally, it emphasizes the importance of collaborative efforts between organizations and researchers to address the diverse challenges associated with unstructured data analysis.

Citations: 0
New custom rating for improving recommendation system performance
IF 8.1, CAS Q2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS, Pub Date: 2024-07-02, DOI: 10.1186/s40537-024-00952-3
Tora Fahrudin, Dedy Rahman Wijaya

Recommendation systems are currently attracting the interest of many researchers. Various new businesses have surfaced with the rise of online marketing (E-Commerce) in response to the Covid-19 pandemic. This phenomenon allows items to be recommended through a system called Collaborative Filtering (CF), aiming to improve the shopping experience of users. Typically, the effectiveness of CF relies on the precise identification of similar-profile users by similarity algorithms, and traditional similarity measures are based on the user-item rating matrix. Four custom ratings (CR) were used along with a new rating formula, termed New Custom Rating (NCR), derived from the popularity of users and items in addition to the original rating. Specifically, NCR optimized recommendation system performance by using the popularity of users and items to determine the new rating values, rather than relying solely on the original rating. Additionally, the formulas improved the representativeness of the new rating values and the accuracy of similarity algorithm calculations, thereby increasing the accuracy of the recommendation system. The implementation of NCR across the four CR algorithms was examined on five public datasets. The experimental results showed that NCR significantly increased recommendation system accuracy, as evidenced by reductions in RMSE, MSE, and MAE as well as increases in FCP and Hit Rate.
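The core idea of blending an original rating with user and item popularity can be sketched as below. The abstract does not state NCR's actual formula, so the log-scaled popularity bonus and the blend weight `alpha` here are purely illustrative assumptions in the spirit of the approach.

```python
import math
from collections import Counter

# Hypothetical popularity-adjusted rating in the spirit of NCR.
# The paper's exact formula is not given in the abstract; this sketch
# blends the original rating with log-scaled user and item popularity.

def new_custom_ratings(ratings, alpha=0.8):
    """ratings: list of (user, item, rating) triples.
    Returns {(user, item): adjusted_rating}."""
    user_pop = Counter(u for u, _, _ in ratings)  # ratings given per user
    item_pop = Counter(i for _, i, _ in ratings)  # ratings received per item
    adjusted = {}
    for user, item, r in ratings:
        # popularity bonus grows slowly with activity (log scale)
        bonus = math.log1p(user_pop[user]) + math.log1p(item_pop[item])
        adjusted[(user, item)] = alpha * r + (1 - alpha) * bonus
    return adjusted

data = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0)]
print(new_custom_ratings(data))
```

In a CF pipeline, the adjusted values would replace the raw ratings in the user-item matrix before the similarity algorithm runs, which is where the abstract locates NCR's accuracy gains.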

Citations: 0