
Journal of Big Data: Latest Publications

Skyline query under multidimensional incomplete data based on classification tree
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-12 | DOI: 10.1186/s40537-024-00923-8
Dengke Yuan, Liping Zhang, Song Li, Guanglu Sun

A classification-tree-based method for skyline queries over multidimensional incomplete data is proposed to address the large amount of useless data processed by existing skyline queries on such data, which leads to low query efficiency and poor algorithm performance. The method consists of two main parts. The first part is an incomplete-data weighted classification tree algorithm: a weighted classification tree for incomplete data is constructed and used to classify the incomplete data set. The data classified in this first part serves as the basis for the second stage of the query. The second part is a skyline query algorithm for multidimensional incomplete data that introduces the concept of optimal virtual points, which effectively reduces the number of comparisons over large amounts of data and thereby improves query efficiency for incomplete data. Theoretical analysis and experiments show that the proposed method performs skyline queries over multidimensional incomplete data well, with high query efficiency and accuracy.
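
The abstract does not spell out the weighted classification tree or the optimal-virtual-point construction, so the sketch below only illustrates the general idea behind such methods: points are bucketed by their missingness pattern and skyline dominance is tested on the dimensions two points share. All names, data, and structures here are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

MISSING = None  # marker for an unobserved dimension

def dominates(a, b):
    """a dominates b if, on the dimensions both points have observed,
    a is no worse everywhere and strictly better at least once (smaller is better)."""
    shared = [i for i in range(len(a)) if a[i] is not MISSING and b[i] is not MISSING]
    if not shared:
        return False
    return all(a[i] <= b[i] for i in shared) and any(a[i] < b[i] for i in shared)

def bucket_by_missing_pattern(points):
    """Group points by which dimensions are missing; a crude stand-in for the
    paper's weighted classification tree, which classifies the incomplete data first."""
    buckets = defaultdict(list)
    for p in points:
        buckets[tuple(v is MISSING for v in p)].append(p)
    return buckets

def skyline(points):
    """Naive O(n^2) skyline over incomplete points."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

if __name__ == "__main__":
    data = [(1.0, 2.0, MISSING), (2.0, 1.0, 3.0), (3.0, MISSING, 1.0), (2.0, 3.0, 4.0)]
    print(list(bucket_by_missing_pattern(data)))  # missingness patterns found
    print(skyline(data))                          # non-dominated points
```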

Citations: 0
Predicting air quality index using attention hybrid deep learning and quantum-inspired particle swarm optimization
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-11 | DOI: 10.1186/s40537-024-00926-5
Anh Tuan Nguyen, Duy Hoang Pham, Bee Lan Oo, Yonghan Ahn, Benson T. H. Lim

Air pollution poses a significant threat to the health of the environment and human well-being. The air quality index (AQI) is an important measure of air pollution that describes the degree of pollution and its impact on health. Accurate and reliable prediction of the AQI is therefore critical but challenging due to the non-linearity and stochastic nature of air particles. This research proposes a hybrid deep learning model for AQI prediction based on Attention Convolutional Neural Networks (ACNN), the Autoregressive Integrated Moving Average (ARIMA), Quantum Particle Swarm Optimization (QPSO)-enhanced Long Short-Term Memory (LSTM), and XGBoost modelling techniques. Daily air quality data were collected from the official Seoul Air registry for the period 2021 to 2022. The data were first preprocessed with the ARIMA model to capture and fit their linear part, and then passed to a hybrid deep learning architecture, developed in a pretraining–finetuning framework, for the non-linear part. This hybrid model first used convolution to extract deep features from the original air quality data, then used QPSO to optimize the hyperparameters of the LSTM network for mining long-term time-series features, and finally adopted the XGBoost model to fine-tune the AQI prediction. The robustness and reliability of the resulting model were assessed and compared with other widely used models and across meteorological stations. Our proposed model achieves up to a 31.13% reduction in MSE, a 19.03% reduction in MAE, and a 2% improvement in R-squared compared to the best-performing conventional model, indicating a much stronger relationship between predicted and actual values. The overall results show that the attentive hybrid deep quantum-inspired Particle Swarm Optimization model is more feasible and efficient in predicting the air quality index at both city-wide and station-specific levels.
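
As a rough illustration of the linear/non-linear decomposition described above, the sketch below fits ARIMA to a synthetic AQI series and models the residuals with XGBoost standing in for the paper's ACNN and QPSO-tuned LSTM stages; the data, ARIMA order, lag length, and hyperparameters are assumptions, not the authors' setup.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
aqi = 60 + 10 * np.sin(np.arange(400) / 15.0) + rng.normal(0, 3, 400)  # synthetic daily AQI

# Stage 1: ARIMA captures the linear structure of the series.
arima = ARIMA(aqi, order=(2, 1, 2)).fit()
linear_pred = arima.predict(start=1, end=len(aqi) - 1)
residual = aqi[1:] - linear_pred

# Stage 2: a non-linear learner models the residuals from lagged windows
# (XGBoost used here as a simple stand-in for the deep-learning stage).
LAG = 7
X = np.array([residual[i - LAG:i] for i in range(LAG, len(residual))])
y = residual[LAG:]
split = int(0.8 * len(X))
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(X[:split], y[:split])

# Final AQI forecast = linear component + predicted residual.
final = linear_pred[LAG:][split:] + model.predict(X[split:])
print("test MAE:", np.mean(np.abs(final - aqi[1:][LAG:][split:])))
```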

Citations: 0
A proposed hybrid framework to improve the accuracy of customer churn prediction in telecom industry
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-09 | DOI: 10.1186/s40537-024-00922-9
Shimaa Ouf, Kholoud T. Mahmoud, Manal A. Abdel-Fattah

In the telecom sector, predicting customer churn has grown in importance in recent years. Developing a robust and accurate churn prediction model takes time, but it is crucial: early churn prediction avoids revenue loss and improves customer retention. To address this issue, telecom companies must identify these customers before they leave. Researchers have used a variety of applied machine-learning approaches to reveal the hidden relationships between different features. A key aspect of churn prediction is the accuracy level, which affects the learning model's performance. This study aims to clarify several aspects of customer churn prediction accuracy and investigate the performance of state-of-the-art techniques. However, no previous research has investigated performance using a hybrid framework that combines the advantages of suitable data preprocessing, ensemble learning, and resampling techniques. The study introduces a proposed hybrid framework that improves the accuracy of customer churn prediction in the telecom industry. The framework is built by integrating the XGBoost classifier with the hybrid resampling method SMOTE-ENN, together with effective data preprocessing techniques. The proposed framework is used in two experiments with three datasets from the telecom industry. This study determines which features are most crucial and influence customer churn, introduces the impact of data balancing, compares the classifiers' pre- and post-balancing performance, and examines a speed-accuracy trade-off in hybrid classifiers. Many metrics, including accuracy, precision, recall, F1-score, and the ROC curve, are used to analyze the results, and all evaluation criteria are used to identify the most effective experiment. The hybrid framework trained on balanced data achieved higher accuracy than applying the classifier to the imbalanced data alone. In addition, the results of the proposed hybrid framework are compared with previous studies on the same datasets, and this comparison shows that our hybrid framework outperformed those works on all three datasets.
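
A minimal sketch of the core combination described above (SMOTE-ENN resampling of the training split followed by an XGBoost classifier), using a synthetic imbalanced dataset; the paper's telecom datasets, feature selection, and tuning are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.combine import SMOTEENN
from xgboost import XGBClassifier

# Synthetic stand-in for a churn dataset: ~10% positive (churn) class.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Resample only the training split so the test set keeps its natural imbalance.
X_bal, y_bal = SMOTEENN(random_state=42).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

Resampling after the train/test split, as above, keeps the evaluation honest: the classifier is scored on the original class distribution it would face in production.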

Citations: 0
Hybrid topic modeling method based on dirichlet multinomial mixture and fuzzy match algorithm for short text clustering
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-09 | DOI: 10.1186/s40537-024-00930-9
Mutasem K. Alsmadi, Malek Alzaqebah, Sana Jawarneh, Ibrahim ALmarashdeh, Mohammed Azmi Al-Betar, Maram Alwohaibi, Noha A. Al-Mulla, Eman AE Ahmed, Ahmad AL Smadi

Topic modeling methods have proved effective for inferring latent topics from short texts. Dealing with short texts is challenging yet helpful for many real-world applications, due to the sparse terms in the text and the high-dimensional representation. Most topic modeling methods require the number of topics to be defined in advance. Similarly, methods based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of topics before execution, which is hard to determine due to topic uncertainty and the many noises that exist in the dataset. Hence, a new approach called the Topic Clustering algorithm based on Levenshtein Distance (TCLD) is introduced in this paper. TCLD combines DMM models and a fuzzy matching algorithm to address two key challenges in topic modeling: (a) the outlier problem in topic modeling methods, and (b) the problem of determining the optimal number of topics. TCLD uses the initial clustered topics generated by DMM models and then evaluates the semantic relationships between documents using Levenshtein Distance. Subsequently, it determines whether to keep a document in the same cluster, relocate it to another cluster, or mark it as an outlier. The results demonstrate the efficiency of the proposed approach across six English benchmark datasets in comparison to seven topic modeling approaches, with an 83% improvement in purity and a 67% enhancement in Normalized Mutual Information (NMI) across all datasets. The proposed method was also applied to a collected Arabic tweet dataset, and the results showed that only 12% of the Arabic short texts were incorrectly clustered, according to human inspection.
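
The sketch below illustrates only the Levenshtein-based refinement step, assuming cluster labels have already been produced by a DMM-style model (not reproduced here); the similarity threshold, toy documents, and reassignment rule are illustrative assumptions rather than TCLD's exact procedure.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] derived from edit distance."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def refine(docs, labels, threshold=0.4):
    """Re-assign each document to the cluster whose members it matches best;
    mark it as an outlier (-1) if no cluster is similar enough."""
    new_labels = []
    clusters = set(labels)
    for doc in docs:
        best_label, best_score = -1, 0.0
        for c in clusters:
            members = [d for d, l in zip(docs, labels) if l == c and d != doc]
            if not members:
                continue
            score = max(similarity(doc, m) for m in members)
            if score > best_score:
                best_label, best_score = c, score
        new_labels.append(best_label if best_score >= threshold else -1)
    return new_labels

docs = ["cheap flights to rome", "flights to rome deal", "stock market crash", "xyz"]
print(refine(docs, [0, 0, 1, 1]))  # documents that match no cluster become outliers (-1)
```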

Citations: 0
DEMFFA: a multi-strategy modified Fennec Fox algorithm with mixed improved differential evolutionary variation strategies
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-08 | DOI: 10.1186/s40537-024-00917-6
Gang Hu, Keke Song, Xiuxiu Li, Yi Wang

The Fennec Fox algorithm (FFA) is a new meta-heuristic algorithm primarily inspired by the Fennec fox's ability to dig and escape from wild predators. Compared with other classical algorithms, FFA shows strong competitiveness. The "No free lunch" theorem shows that an algorithm behaves differently on different problems; for example, when solving high-dimensional or more complex applications, it faces challenges such as easily falling into local optima and slow convergence. To address these problems in FFA, this paper proposes an improved Fennec fox algorithm, DEMFFA, which adds sin chaotic mapping, formula factor adjustment, Cauchy operator mutation, and differential evolution mutation strategies. Firstly, a sin chaotic mapping strategy is added in the initialization stage to make the population distribution more uniform, thus speeding up convergence. Secondly, to further expedite convergence, the factors of the position-update formula in the first stage are adjusted. Finally, to prevent the algorithm from falling into a local optimum too early and to expand the population's search space, the Cauchy operator mutation strategy and the differential evolution mutation strategy are added after the first and second update stages of the original algorithm. To verify the performance of the proposed DEMFFA, qualitative analysis is carried out on different test sets, and the proposed algorithm is tested against the original FFA, other classical algorithms, improved algorithms, and newly proposed algorithms on three different test sets, with an additional qualitative analysis on the CEC2020 suite. In addition, DEMFFA is applied to 10 practical engineering design problems and a complex 24-bar truss topology optimization problem, and the results show that the DEMFFA algorithm has the potential to solve complex problems.
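
The sketch below illustrates only the three added strategies (sin chaotic initialization, Cauchy operator mutation, and DE/rand/1 mutation) wrapped around a generic greedy population search; the FFA position-update rules, the formula-factor adjustment, and all constants are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sin_chaotic_init(pop_size, dim, lb, ub):
    """Iterate the sin map x_{k+1} = |sin(pi * x_k)| to spread individuals over the bounds."""
    x = rng.uniform(0.1, 0.9, dim)
    pop = np.empty((pop_size, dim))
    for i in range(pop_size):
        x = np.abs(np.sin(np.pi * x))
        pop[i] = lb + x * (ub - lb)
    return pop

def cauchy_mutation(ind, lb, ub, scale=0.1):
    """Heavy-tailed perturbation that helps escape local optima."""
    return np.clip(ind + scale * (ub - lb) * rng.standard_cauchy(ind.shape), lb, ub)

def de_rand_1(pop, i, F=0.5):
    """DE/rand/1 donor vector built from three distinct random individuals."""
    r1, r2, r3 = rng.choice([j for j in range(len(pop)) if j != i], 3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def sphere(x):
    return float(np.sum(x ** 2))

lb, ub, dim = -5.0, 5.0, 10
pop = sin_chaotic_init(30, dim, lb, ub)
for _ in range(200):
    for i in range(len(pop)):
        trial = np.clip(de_rand_1(pop, i), lb, ub)
        if rng.random() < 0.2:               # occasionally apply the Cauchy mutation
            trial = cauchy_mutation(trial, lb, ub)
        if sphere(trial) < sphere(pop[i]):   # greedy selection keeps the better point
            pop[i] = trial
print("best fitness:", min(sphere(p) for p in pop))
```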

Citations: 0
Establishment of an automatic diagnosis system for corneal endothelium diseases using artificial intelligence
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-08 | DOI: 10.1186/s40537-024-00913-w
Jing-hao Qu, Xiao-ran Qin, Zi-jun Xie, Jia-he Qian, Yang Zhang, Xiao-nan Sun, Yu-zhao Sun, Rong-mei Peng, Ge-ge Xiao, Jing Lin, Xiao-yan Bian, Tie-hong Chen, Yan Cheng, Shao-feng Gu, Hai-kun Wang, Jing Hong

Purpose

To use artificial intelligence to establish an automatic diagnosis system for corneal endothelium diseases (CEDs).

Methods

We develop an automatic system for detecting multiple common CEDs involving an enhanced compact convolutional transformer (ECCT). Specifically, we introduce a cross-head relative position encoding scheme into a standard self-attention module to capture contextual information among different regions and employ a token-attention feed-forward network to place greater focus on valuable abnormal regions.

Results

A total of 2723 images from CED patients are used to train our system. It achieves an accuracy of 89.53%, and the area under the receiver operating characteristic curve (AUC) is 0.958 (95% CI 0.943–0.971) on images from multiple centres.

Conclusions

Our system is the first artificial intelligence-based system for diagnosing CEDs worldwide. Images can be uploaded to a specified website, and automatic diagnoses can be obtained; this system can be particularly helpful under pandemic conditions, such as those seen during the recent COVID-19 pandemic.
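
As a rough illustration of the Methods section, the PyTorch sketch below adds a learnable relative position bias to a standard self-attention block, which is one common way of encoding relative positions; it is not the paper's exact cross-head scheme, and the token-attention feed-forward network is not reproduced. Shapes and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, heads, max_len):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative offset, shared across heads (assumption).
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        idx = torch.arange(max_len)
        self.register_buffer("rel_idx", idx[None, :] - idx[:, None] + max_len - 1)

    def forward(self, x):                      # x: (batch, tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.rel_bias[self.rel_idx[:n, :n]]   # inject relative position information
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

x = torch.randn(2, 196, 128)                   # e.g. 14x14 patch tokens from a corneal image
print(RelPosSelfAttention(128, 4, 196)(x).shape)
```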

Citations: 0
Evaluation is key: a survey on evaluation measures for synthetic time series
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-07 | DOI: 10.1186/s40537-024-00924-7
Michael Stenger, Robert Leppich, Ian Foster, Samuel Kounev, André Bauer

Synthetic data generation describes the process of learning the underlying distribution of a given real dataset in a model, which is, in turn, sampled to produce new data objects that still adhere to the original distribution. This approach often finds application where circumstances limit the availability or usability of real-world datasets, for instance in health care due to privacy concerns. While image synthesis has received much attention in the past, time series are key for many practical (e.g., industrial) applications. To date, numerous generative models and measures for evaluating time series syntheses have been proposed. However, regarding the defining features of high-quality synthetic time series and how to quantify quality, no consensus has yet been reached among researchers. Hence, we present a comprehensive survey on evaluation measures for time series generation to assist users in evaluating synthetic time series. For each measure, we provide a brief description or, where applicable, a precise definition. Further, we order the measures in a taxonomy and examine their applicability and usage. To assist in the selection of the most appropriate measures, we provide a concise guide for fast lookup. Notably, our findings reveal the lack of a universally accepted approach for an evaluation procedure, including the selection of appropriate measures. We believe this situation hinders progress and may even erode evaluation standards to a "do as you like" approach to synthetic data evaluation. Therefore, this survey is a preliminary step toward advancing the field of synthetic data evaluation.
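
One widely used kind of measure in this space is the discriminative score: train a classifier to separate real from synthetic windows, where accuracy close to 0.5 indicates the synthetic data is hard to tell apart. The sketch below illustrates that idea on toy data; it is not a measure prescribed by the survey, and the window length and classifier are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def windows(series, length=24):
    """Slice a 1-D series into overlapping fixed-length windows."""
    return np.array([series[i:i + length] for i in range(len(series) - length)])

real = np.sin(np.arange(2000) / 10.0) + rng.normal(0, 0.1, 2000)
synthetic = np.sin(np.arange(2000) / 10.0) + rng.normal(0, 0.25, 2000)  # slightly noisier "fake"

X = np.vstack([windows(real), windows(synthetic)])
y = np.concatenate([np.zeros(len(windows(real))), np.ones(len(windows(synthetic)))])

# Accuracy near 0.5 means real and synthetic windows are indistinguishable.
acc = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5).mean()
print("discriminative score (|acc - 0.5|):", abs(acc - 0.5))
```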

Citations: 0
Assessing the current landscape of AI and sustainability literature: identifying key trends, addressing gaps and challenges
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-06 | DOI: 10.1186/s40537-024-00912-x
Shailesh Tripathi, Nadine Bachmann, Manuel Brunner, Ziad Rizk, Herbert Jodlbauer

The United Nations’ 17 Sustainable Development Goals stress the importance of global and local efforts to address inequalities and implement sustainability. Addressing complex, interconnected sustainability challenges requires a systematic, interdisciplinary approach, where technology, AI, and data-driven methods offer potential solutions for optimizing resources, integrating different aspects of sustainability, and informed decision-making. Sustainability research spans various local, regional, and global challenges, emphasizing the need to identify emerging areas and gaps where AI and data-driven models play a crucial role. The study performs a comprehensive literature survey together with scientometric and semantic analyses, categorizes data-driven methods for sustainability problems, and discusses the sustainable use of AI and big data. The outcomes of the analyses highlight the importance of collaborative and inclusive research that bridges regional differences, the interconnection of AI, technology, and sustainability topics, and the major research themes related to sustainability. The study further emphasizes the significance of developing hybrid approaches combining AI, data-driven techniques, and expert knowledge for multi-level, multi-dimensional decision-making. Furthermore, it recognizes the necessity of addressing ethical concerns and ensuring the sustainable use of AI and big data in sustainability research.

Citations: 0
Amharic spoken digits recognition using convolutional neural network
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-04 | DOI: 10.1186/s40537-024-00910-z
Tewodros Alemu Ayall, Changjun Zhou, Huawen Liu, Getnet Mezgebu Brhanemeskel, Solomon Teferra Abate, Michael Adjeisah

Spoken digits recognition (SDR) is a type of supervised automatic speech recognition that is required in various human–machine interaction applications. It is utilized in phone-based services such as dialing systems, certain bank operations, airline reservation systems, and price extraction. However, the design of SDR is a challenging task that requires the development of labeled audio data, the proper choice of feature extraction method, and the development of the best-performing model. Although several works have been done for various languages, such as English, Arabic, and Urdu, there is no Amharic spoken digits dataset (AmSDD) with which to build an Amharic spoken digits recognition (AmSDR) model for Amharic, the official working language of the government of Ethiopia. Therefore, in this study, we developed a new AmSDD that contains 12,000 utterances of the digits 0 (Zaero) to 9 (zet’enyi), recorded from 120 volunteer speakers of different age groups, genders, and dialects, each of whom repeated each digit ten times. Mel frequency cepstral coefficients (MFCCs) and Mel-Spectrogram feature extraction methods were used to extract trainable features from the speech signal. We conducted different experiments on the development of the AmSDR model using the AmSDD and classical supervised learning algorithms such as Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF) as the baseline. To further improve the recognition performance of AmSDR, we propose a three-layer Convolutional Neural Network (CNN) architecture with batch normalization. The results of our experiments show that the proposed CNN model outperforms the baseline algorithms and achieves accuracies of 99% and 98% using MFCCs and Mel-Spectrogram features, respectively.
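
A minimal sketch of the MFCC-plus-CNN-with-batch-normalization pipeline described above, assuming 1-second 16 kHz recordings and ten digit classes; the AmSDD itself is not bundled here, and the layer sizes, feature shape, and training settings are illustrative assumptions rather than the paper's exact architecture.

```python
import numpy as np
import librosa
import tensorflow as tf

def mfcc_features(path, sr=16000, n_mfcc=40, frames=32):
    """Load a clip and return a fixed-size (n_mfcc, frames, 1) MFCC 'image'."""
    audio, _ = librosa.load(path, sr=sr, duration=1.0)
    m = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    m = librosa.util.fix_length(m, size=frames, axis=1)    # pad/trim the time axis
    return m[..., np.newaxis]

def build_cnn(input_shape=(40, 32, 1), n_classes=10):
    """Three convolutional blocks, each followed by batch normalization."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
# Training would then look like: model.fit(X_train, y_train, validation_split=0.1, epochs=30)
```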

Citations: 0
Xai-driven knowledge distillation of large language models for efficient deployment on low-resource devices
IF 8.1 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-04 | DOI: 10.1186/s40537-024-00928-3
Riccardo Cantini, Alessio Orsino, Domenico Talia

Large Language Models (LLMs) are characterized by their inherent memory inefficiency and compute-intensive nature, making them impractical to run on low-resource devices and hindering their applicability in edge AI contexts. To address this issue, Knowledge Distillation approaches have been adopted to transfer knowledge from a complex model, referred to as the teacher, to a more compact, computationally efficient one, known as the student. The aim is to retain the performance of the original model while substantially reducing computational requirements. However, traditional knowledge distillation methods may struggle to effectively transfer crucial explainable knowledge from an LLM teacher to the student, potentially leading to explanation inconsistencies and decreased performance. This paper presents DiXtill, a method based on a novel approach to distilling knowledge from LLMs into lightweight neural architectures. The main idea is to leverage local explanations provided by an eXplainable Artificial Intelligence (XAI) method to guide the cross-architecture distillation of a teacher LLM into a self-explainable student, specifically a bi-directional LSTM network. Experimental results show that our XAI-driven distillation method allows the teacher explanations to be effectively transferred to the student, resulting in better agreement compared to classical distillation methods and thus enhancing the student's interpretability. Furthermore, it enables the student to achieve performance comparable to the teacher LLM while also delivering a significantly higher compression ratio and speedup than other techniques such as post-training quantization and pruning, which paves the way for more efficient and sustainable edge AI applications.
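
As a rough sketch of the general idea of augmenting distillation with an explanation-alignment term, the PyTorch snippet below combines a hard-label loss, a soft-label KL term, and a cosine alignment between student and teacher token attributions; the weights, attribution sources, and loss form are assumptions and not DiXtill's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_expl, teacher_expl,
                      T=2.0, alpha=0.5, beta=0.3):
    """Hard-label CE + soft-label KL + alignment of per-token explanations."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # Explanation alignment: push the student's token importances (e.g. derived
    # from its attention) toward the teacher's XAI attributions.
    expl = 1.0 - F.cosine_similarity(student_expl, teacher_expl, dim=-1).mean()
    return (1 - alpha) * ce + alpha * kl + beta * expl

# Toy shapes: batch of 4, 3 classes, 16 tokens of attribution per example.
s_logits, t_logits = torch.randn(4, 3), torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
s_expl, t_expl = torch.rand(4, 16), torch.rand(4, 16)
print(distillation_loss(s_logits, t_logits, labels, s_expl, t_expl))
```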

Citations: 0