
Latest publications from the Journal of Big Data

A proposed hybrid framework to improve the accuracy of customer churn prediction in telecom industry
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-09 · DOI: 10.1186/s40537-024-00922-9
Shimaa Ouf, Kholoud T. Mahmoud, Manal A. Abdel-Fattah

In the telecom sector, predicting customer churn has grown in importance in recent years. Developing a robust and accurate churn prediction model takes time, but it is crucial: early churn prediction avoids revenue loss and improves customer retention, so telecom companies must identify at-risk customers before they leave. Researchers have applied a variety of machine-learning approaches to reveal the hidden relationships between different features. A key aspect of churn prediction is the accuracy level, which reflects the learning model's performance. This study aims to clarify several aspects of customer churn prediction accuracy and investigate the performance of state-of-the-art techniques. However, no previous research has investigated performance using a hybrid framework that combines suitable data preprocessing, ensemble learning, and resampling techniques. The study introduces a proposed hybrid framework that improves the accuracy of customer churn prediction in the telecom industry. The framework is built by integrating the XGBOOST classifier with the hybrid resampling method SMOTE-ENN, together with effective data preprocessing techniques. The proposed framework is evaluated in two experiments on three telecom datasets. This study determines which features are most influential on customer churn, introduces the impact of data balancing, compares the classifiers' performance before and after data balancing, and examines a speed-accuracy trade-off in hybrid classifiers. Several metrics, including accuracy, precision, recall, F1-score, and the ROC curve, are used to analyze the results, and all evaluation criteria are used to identify the most effective experiment. In terms of accuracy, the hybrid framework trained on balanced data outperformed the classifier applied directly to imbalanced data. In addition, the proposed hybrid framework is compared against previous studies on the same datasets, and the result of this comparison is reported: on all three datasets, our proposed hybrid framework outperformed the latest published works.
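
To make the framework concrete, here is a minimal sketch of the pattern the abstract names — SMOTE-ENN resampling feeding an XGBoost classifier — using the imbalanced-learn and xgboost packages. The hyperparameters and the `train_churn_model` helper are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of the abstract's core idea: balance the training data with
# SMOTE-ENN, then fit an XGBoost classifier. Hyperparameters are illustrative.
import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from xgboost import XGBClassifier

def train_churn_model(X: np.ndarray, y: np.ndarray, seed: int = 42):
    # Hold out a test set *before* resampling so evaluation reflects the
    # true (imbalanced) class distribution.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    # SMOTE oversamples the minority (churn) class; ENN then removes
    # noisy/borderline samples, yielding a cleaner balanced training set.
    X_bal, y_bal = SMOTEENN(random_state=seed).fit_resample(X_tr, y_tr)

    model = XGBClassifier(n_estimators=300, max_depth=6,
                          learning_rate=0.1, eval_metric="logloss")
    model.fit(X_bal, y_bal)

    proba = model.predict_proba(X_te)[:, 1]
    print(classification_report(y_te, (proba > 0.5).astype(int)))
    print("ROC AUC:", roc_auc_score(y_te, proba))
    return model
```

Resampling only the training split is the design choice that matters here: balancing before the split would leak synthetic minority samples into the test set and inflate the reported accuracy.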

Citations: 0
Hybrid topic modeling method based on Dirichlet multinomial mixture and fuzzy match algorithm for short text clustering
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-09 · DOI: 10.1186/s40537-024-00930-9
Mutasem K. Alsmadi, Malek Alzaqebah, Sana Jawarneh, Ibrahim ALmarashdeh, Mohammed Azmi Al-Betar, Maram Alwohaibi, Noha A. Al-Mulla, Eman AE Ahmed, Ahmad AL Smadi

Topic modeling methods have proved effective for inferring latent topics from short texts. Dealing with short texts is challenging yet valuable for many real-world applications, owing to the sparsity of terms in the text and the high-dimensional representation. Most topic modeling methods require the number of topics to be defined in advance. Similarly, methods based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of topics to be set before execution, which is hard to determine given topic uncertainty and the noise present in the dataset. Hence, this paper introduces a new approach called the Topic Clustering algorithm based on Levenshtein Distance (TCLD). TCLD combines DMM models and a fuzzy matching algorithm to address two key challenges in topic modeling: (a) the outlier problem in topic modeling methods, and (b) the problem of determining the optimal number of topics. TCLD uses the initial clustered topics generated by DMM models and then evaluates the semantic relationships between documents using Levenshtein distance. Subsequently, it determines whether to keep a document in the same cluster, relocate it to another cluster, or mark it as an outlier. The results demonstrate the efficiency of the proposed approach across six English benchmark datasets, in comparison to seven topic modeling approaches, with an 83% improvement in purity and a 67% enhancement in Normalized Mutual Information (NMI) across all datasets. The proposed method was also applied to a collected Arabic tweet dataset; according to human inspection, only 12% of the Arabic short texts were incorrectly clustered.
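
A toy sketch of the refinement step the abstract describes follows. The representative-selection rule (first document per cluster) and the 0.8 outlier threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Toy sketch of TCLD's refinement step: start from initial DMM cluster labels,
# measure each document's normalized Levenshtein distance to a cluster
# representative, then keep it, move it to the closest cluster, or flag it as
# an outlier (-1). Representative rule and threshold are assumptions.
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def refine_clusters(docs, labels, outlier_threshold=0.8):
    labels = np.asarray(labels)
    reps = {c: docs[int(np.flatnonzero(labels == c)[0])]
            for c in np.unique(labels)}
    refined = []
    for doc in docs:
        # Length-normalized distance in [0, 1] to every representative.
        dists = {c: levenshtein(doc, rep) / max(len(doc), len(rep), 1)
                 for c, rep in reps.items()}
        best = min(dists, key=dists.get)
        refined.append(-1 if dists[best] > outlier_threshold else best)
    return refined
```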

Citations: 0
DEMFFA: a multi-strategy modified Fennec Fox algorithm with mixed improved differential evolutionary variation strategies
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-08 · DOI: 10.1186/s40537-024-00917-6
Gang Hu, Keke Song, Xiuxiu Li, Yi Wang

The Fennec Fox algorithm (FFA) is a new meta-heuristic algorithm primarily inspired by the Fennec fox's ability to dig and to escape from wild predators. Compared with other classical algorithms, FFA shows strong competitiveness. The "no free lunch" theorem implies that an algorithm behaves differently on different problems; for example, when solving high-dimensional or more complex applications, FFA faces challenges such as easily falling into local optima and slow convergence. To address these problems, this paper proposes an improved Fennec fox algorithm, DEMFFA, which adds sine chaotic mapping, formula factor adjustment, Cauchy operator mutation, and differential evolution mutation strategies. Firstly, a sine chaotic mapping strategy is added in the initialization stage to make the population distribution more uniform, thus speeding up convergence. Secondly, to further expedite convergence, the factors of the position-update formula in the first stage are adjusted. Finally, to prevent the algorithm from settling into local optima too early and to expand the population's search space, the Cauchy operator mutation strategy and the differential evolution mutation strategy are added after the first and second update stages of the original algorithm. To verify the performance of the proposed DEMFFA, qualitative analysis is carried out on different test sets, and the proposed algorithm is compared with the original FFA, other classical algorithms, improved algorithms, and newly proposed algorithms on three different test sets; a qualitative analysis on the CEC2020 suite is also provided. In addition, DEMFFA is applied to 10 practical engineering design problems and a complex 24-bar truss topology optimization problem, and the results show that the DEMFFA algorithm has the potential to solve complex problems.
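
The three named ingredients can be sketched in a few lines of NumPy; the map and mutation formulas below are common textbook forms, assumed here rather than taken from the paper.

```python
# Illustrative sketch of three ingredients the abstract names: a sine chaotic
# map for more uniform initialization, Cauchy-operator mutation, and a
# DE/rand/1 differential-evolution mutation. Forms and constants are common
# textbook choices, not necessarily the paper's exact formulas.
import numpy as np

rng = np.random.default_rng(0)

def sine_chaotic_init(pop_size, dim, lo, hi):
    x = rng.random(dim)                      # seed state in (0, 1)
    pop = np.empty((pop_size, dim))
    for i in range(pop_size):
        x = np.sin(np.pi * x)                # sine chaotic map iteration
        pop[i] = lo + x * (hi - lo)          # scale into the search box
    return pop

def cauchy_mutation(x, scale=0.1):
    # Heavy-tailed Cauchy jumps help individuals escape local optima.
    return x + scale * rng.standard_cauchy(x.shape)

def de_rand_1(pop, i, F=0.5):
    # DE/rand/1: v = x_r1 + F * (x_r2 - x_r3), with r1, r2, r3 distinct from i.
    r1, r2, r3 = rng.choice([j for j in range(len(pop)) if j != i],
                            size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])
```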

Citations: 0
Establishment of an automatic diagnosis system for corneal endothelium diseases using artificial intelligence
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-08 · DOI: 10.1186/s40537-024-00913-w
Jing-hao Qu, Xiao-ran Qin, Zi-jun Xie, Jia-he Qian, Yang Zhang, Xiao-nan Sun, Yu-zhao Sun, Rong-mei Peng, Ge-ge Xiao, Jing Lin, Xiao-yan Bian, Tie-hong Chen, Yan Cheng, Shao-feng Gu, Hai-kun Wang, Jing Hong

Purpose

To use artificial intelligence to establish an automatic diagnosis system for corneal endothelium diseases (CEDs).

Methods

We develop an automatic system for detecting multiple common CEDs involving an enhanced compact convolutional transformer (ECCT). Specifically, we introduce a cross-head relative position encoding scheme into a standard self-attention module to capture contextual information among different regions and employ a token-attention feed-forward network to place greater focus on valuable abnormal regions.
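
The exact ECCT module is not reproduced here, but the flavor of a cross-head relative position encoding can be sketched as standard multi-head self-attention plus a learned relative-position bias shared by all heads; everything below is a generic PyTorch illustration, not the authors' code.

```python
# Generic sketch (not the authors' exact ECCT code) of self-attention over a
# 1-D token sequence with a learned relative-position bias that is shared
# across heads, illustrating the "cross-head relative position encoding" idea.
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, heads, max_len):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One bias per relative offset in [-(max_len-1), max_len-1],
        # shared by all heads ("cross-head").
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x):                     # x: (batch, seq, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).view(b, n, 3, self.heads, d // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        idx = torch.arange(n)
        rel = idx[None, :] - idx[:, None] + n - 1   # offsets mapped to [0, 2n-2]
        attn = attn + self.rel_bias[rel]            # broadcast over batch, heads
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```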

Results

A total of 2723 images from CED patients are used to train our system. It achieves an accuracy of 89.53%, and the area under the receiver operating characteristic curve (AUC) is 0.958 (95% CI 0.943–0.971) on images from multiple centres.

Conclusions

Our system is the first artificial intelligence-based system for diagnosing CEDs worldwide. Images can be uploaded to a specified website, and automatic diagnoses can be obtained; this system can be particularly helpful under pandemic conditions, such as those seen during the recent COVID-19 pandemic.

Citations: 0
Evaluation is key: a survey on evaluation measures for synthetic time series
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-07 · DOI: 10.1186/s40537-024-00924-7
Michael Stenger, Robert Leppich, Ian Foster, Samuel Kounev, André Bauer

Synthetic data generation describes the process of learning the underlying distribution of a given real dataset in a model, which is, in turn, sampled to produce new data objects that still adhere to the original distribution. This approach often finds application where circumstances limit the availability or usability of real-world datasets, for instance in health care due to privacy concerns. While image synthesis has received much attention in the past, time series are key for many practical (e.g., industrial) applications. To date, numerous generative models and measures for evaluating time series syntheses have been proposed. However, researchers have not yet reached a consensus on the defining features of high-quality synthetic time series or on how to quantify quality. Hence, we present a comprehensive survey of evaluation measures for time series generation to assist users in evaluating synthetic time series. For each measure, we provide a brief description or, where applicable, a precise definition. Further, we order the measures in a taxonomy and examine their applicability and usage. To assist in selecting the most appropriate measures, we provide a concise guide for fast lookup. Notably, our findings reveal the lack of a universally accepted approach to the evaluation procedure, including the selection of appropriate measures. We believe this situation hinders progress and may even erode evaluation standards into a "do as you like" approach to synthetic data evaluation. Therefore, this survey is a preliminary step toward advancing the field of synthetic data evaluation.
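
As a flavor of what such measures look like in practice, the sketch below computes two simple, assumed examples — a marginal-distribution Wasserstein distance and a crude autocorrelation distance. They are illustrative stand-ins, not the survey's recommended measures.

```python
# Two illustrative evaluation measures (not the survey's taxonomy): the 1-D
# Wasserstein distance between the pooled marginal value distributions of real
# and synthetic series, and a crude comparison of mean autocorrelation
# functions. Matching marginals says nothing about temporal structure, which
# is why more than one axis of evaluation is needed.
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_wasserstein(real: np.ndarray, synth: np.ndarray) -> float:
    # real, synth: arrays of shape (num_series, length)
    return wasserstein_distance(real.ravel(), synth.ravel())

def acf_distance(real: np.ndarray, synth: np.ndarray, lags: int = 10) -> float:
    # Mean absolute gap between average autocorrelation functions up to `lags`.
    def mean_acf(x):
        x = x - x.mean(axis=1, keepdims=True)
        var = (x ** 2).mean(axis=1)
        return np.array([(x[:, :-k] * x[:, k:]).mean(axis=1) / var
                         for k in range(1, lags + 1)]).mean(axis=1)
    return float(np.abs(mean_acf(real) - mean_acf(synth)).mean())
```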

Citations: 0
Assessing the current landscape of AI and sustainability literature: identifying key trends, addressing gaps and challenges
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-06 · DOI: 10.1186/s40537-024-00912-x
Shailesh Tripathi, Nadine Bachmann, Manuel Brunner, Ziad Rizk, Herbert Jodlbauer

The United Nations' 17 Sustainable Development Goals stress the importance of global and local efforts to address inequalities and implement sustainability. Addressing complex, interconnected sustainability challenges requires a systematic, interdisciplinary approach, where technology, AI, and data-driven methods offer potential solutions for optimizing resources, integrating different aspects of sustainability, and supporting informed decision-making. Sustainability research spans various local, regional, and global challenges, emphasizing the need to identify emerging areas and gaps where AI and data-driven models play a crucial role. The study performs a comprehensive literature survey with scientometric and semantic analyses, categorizes data-driven methods for sustainability problems, and discusses the sustainable use of AI and big data. The outcomes of the analyses highlight the importance of collaborative and inclusive research that bridges regional differences, the interconnection of AI, technology, and sustainability topics, and the major research themes related to sustainability. The study further emphasizes the significance of developing hybrid approaches that combine AI, data-driven techniques, and expert knowledge for multi-level, multi-dimensional decision-making. Furthermore, it recognizes the necessity of addressing ethical concerns and ensuring the sustainable use of AI and big data in sustainability research.

Citations: 0
Amharic spoken digits recognition using convolutional neural network
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-04 · DOI: 10.1186/s40537-024-00910-z
Tewodros Alemu Ayall, Changjun Zhou, Huawen Liu, Getnet Mezgebu Brhanemeskel, Solomon Teferra Abate, Michael Adjeisah

Spoken digits recognition (SDR) is a type of supervised automatic speech recognition that is required in various human–machine interaction applications. It is utilized in phone-based services such as dialing systems, certain bank operations, airline reservation systems, and price extraction. However, the design of SDR is a challenging task that requires the development of labeled audio data, the proper choice of feature extraction method, and the development of the best-performing model. Although several such systems have been built for various languages, such as English, Arabic, and Urdu, no Amharic spoken digits dataset (AmSDD) existed for building an Amharic spoken digits recognition (AmSDR) model for Amharic, the official working language of the government of Ethiopia. Therefore, in this study, we developed a new AmSDD containing 12,000 utterances of the digits 0 (Zaero) to 9 (zet’enyi), recorded from 120 volunteer speakers of different age groups, genders, and dialects, each of whom repeated every digit ten times. Mel frequency cepstral coefficients (MFCCs) and Mel-spectrogram feature extraction methods were used to extract trainable features from the speech signal. We conducted different experiments on the development of the AmSDR model using the AmSDD, with classical supervised learning algorithms such as Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF) as baselines. To further improve the recognition performance of AmSDR, we propose a three-layer Convolutional Neural Network (CNN) architecture with batch normalization. The results of our experiments show that the proposed CNN model outperforms the baseline algorithms, scoring accuracies of 99% and 98% using MFCC and Mel-spectrogram features, respectively.
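
The feature-extraction and model pattern described can be sketched as follows; the layer sizes and the 10-class head mirror the task but are assumptions, not the paper's exact architecture.

```python
# Sketch of the pattern the abstract describes: MFCCs from a digit utterance,
# fed to a small three-layer CNN with batch normalization. Layer widths and
# the 10-class head are illustrative, not the paper's exact architecture.
import librosa
import numpy as np
import torch
import torch.nn as nn

def mfcc_features(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)

class DigitCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        def block(cin, cout):
            # Conv -> BatchNorm -> ReLU -> Pool, repeated three times.
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(),
                                 nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, n_classes))

    def forward(self, x):          # x: (batch, 1, n_mfcc, frames)
        return self.head(self.features(x))
```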

Citations: 0
XAI-driven knowledge distillation of large language models for efficient deployment on low-resource devices
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-04 · DOI: 10.1186/s40537-024-00928-3
Riccardo Cantini, Alessio Orsino, Domenico Talia

Large Language Models (LLMs) are characterized by their inherent memory inefficiency and compute-intensive nature, making them impractical to run on low-resource devices and hindering their applicability in edge AI contexts. To address this issue, knowledge distillation approaches have been adopted to transfer knowledge from a complex model, referred to as the teacher, to a more compact, computationally efficient one, known as the student. The aim is to retain the performance of the original model while substantially reducing computational requirements. However, traditional knowledge distillation methods may struggle to effectively transfer crucial explainable knowledge from an LLM teacher to the student, potentially leading to explanation inconsistencies and decreased performance. This paper presents DiXtill, a method based on a novel approach to distilling knowledge from LLMs into lightweight neural architectures. The main idea is to leverage local explanations provided by an eXplainable Artificial Intelligence (XAI) method to guide the cross-architecture distillation of a teacher LLM into a self-explainable student, specifically a bi-directional LSTM network. Experimental results show that our XAI-driven distillation method allows the teacher explanations to be effectively transferred to the student, resulting in better agreement compared to classical distillation methods and thus enhancing the student's interpretability. Furthermore, it enables the student to achieve performance comparable to the teacher LLM while delivering a significantly higher compression ratio and speedup than other techniques such as post-training quantization and pruning, which paves the way for more efficient and sustainable edge AI applications.
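
A generic sketch of such an XAI-guided objective is shown below: a Hinton-style soft-label term plus an explanation-alignment term over per-token attributions. The loss weights and the `distill_loss` signature are assumptions, not DiXtill's published formulation.

```python
# Generic sketch (not DiXtill's exact objective) of XAI-guided distillation:
# the student matches the teacher's softened output distribution *and* the
# teacher's per-token attribution scores, so explanations transfer along with
# predictions. `teacher_attr` would come from an XAI method applied to the LLM.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits,
                 student_attr, teacher_attr,
                 labels, T=2.0, alpha=0.5, beta=0.3):
    # Soft-label knowledge-distillation term (scaled by T^2, as is standard).
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # Hard-label cross-entropy keeps the student anchored to ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Explanation-alignment term: per-token attributions (e.g., from a
    # gradient- or attention-based XAI method) should agree.
    xai = F.mse_loss(student_attr, teacher_attr)
    return alpha * kd + (1 - alpha) * ce + beta * xai
```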

Citations: 0
High-performance computing in healthcare: an automatic literature analysis perspective
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-02 · DOI: 10.1186/s40537-024-00929-2
Jieyi Li, Shuai Wang, Stevan Rudinac, Anwar Osseyran

The adoption of high-performance computing (HPC) in healthcare has gained significant attention in recent years, driving advancements in medical research and clinical practice. Exploring the literature on HPC implementation in healthcare is valuable for decision-makers, as it provides insights into potential areas for further investigation and investment. However, manually analyzing the vast number of scholarly articles is a challenging and time-consuming task. Fortunately, topic modeling techniques offer the capacity to process extensive volumes of scientific literature and identify key trends within the field. This paper presents an automatic literature analysis framework based on a state-of-the-art vector-based topic modeling algorithm with multiple embedding techniques, unveiling the research trends surrounding HPC utilization in healthcare. The proposed pipeline consists of four phases: paper extraction, data preprocessing, topic modeling and outlier detection, followed by visualization. It enables the automatic extraction of meaningful topics, exploration of their interrelationships, and identification of emerging research directions in an intuitive manner. The findings highlight the transition of HPC adoption in healthcare from traditional numerical simulation and surgical visualization to emerging topics such as drug discovery, AI-driven medical image analysis, and genomic analysis, as well as correlations and interdisciplinary connections among application domains.
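
A rough stand-in for such a pipeline — sentence embeddings, clustering, and per-cluster TF-IDF keywords — is sketched below; the embedding model name and cluster count are assumptions, and the paper's actual algorithm is more sophisticated.

```python
# Rough stand-in (not the paper's exact pipeline) for vector-based topic
# modeling: embed abstracts, cluster the embeddings, then surface each
# cluster's top TF-IDF terms as a topic label. Model name and k are assumed.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_topics(abstracts, k=8, top_n=8):
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(abstracts)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)

    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(abstracts)
    vocab = np.array(vec.get_feature_names_out())

    for c in range(k):
        rows = np.flatnonzero(labels == c)
        # Mean TF-IDF weight per term within the cluster; top terms = topic.
        centroid = np.asarray(tfidf[rows].mean(axis=0)).ravel()
        print(f"topic {c}:", ", ".join(vocab[centroid.argsort()[::-1][:top_n]]))
    return labels
```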

Citations: 0
Computational 3D topographic microscopy from terabytes of data per sample
IF 8.1 · CAS Region 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-02 · DOI: 10.1186/s40537-024-00901-0
Kevin C. Zhou, Mark Harfouche, Maxwell Zheng, Joakim Jönsson, Kyung Chul Lee, Kanghyun Kim, Ron Appel, Paul Reamey, Thomas Doman, Veton Saliu, Gregor Horstmeyer, Seung Ah Lee, Roarke Horstmeyer

We present a large-scale computational 3D topographic microscope that enables 6-gigapixel profilometric 3D imaging at micron-scale resolution across >110 cm² areas over multi-millimeter axial ranges. Our computational microscope, termed STARCAM (Scanning Topographic All-in-focus Reconstruction with a Computational Array Microscope), features a parallelized, 54-camera architecture with 3-axis translation to capture, for each sample of interest, a multi-dimensional, 2.1-terabyte (TB) dataset, consisting of a total of 224,640 9.4-megapixel images. We developed a self-supervised neural network-based algorithm for 3D reconstruction and stitching that jointly estimates an all-in-focus photometric composite and 3D height map across the entire field of view, using multi-view stereo information and image sharpness as a focal metric. The memory-efficient, compressed differentiable representation offered by the neural network effectively enables joint participation of the entire multi-TB dataset during the reconstruction process. Validation experiments on gauge blocks demonstrate a profilometric precision and accuracy of 10 µm or better. To demonstrate the broad utility of our new computational microscope, we applied STARCAM to a variety of decimeter-scale objects, with applications ranging from cultural heritage to industrial inspection.
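
STARCAM's reconstruction is neural, but the underlying focal-stack principle — score per-pixel sharpness across z, take the argmax as height, gather those pixels into an all-in-focus composite — can be sketched classically; the Laplacian-energy focus measure below is an assumed stand-in for the paper's learned approach.

```python
# Toy sketch of the focal-stack idea behind all-in-focus reconstruction (the
# paper itself uses a self-supervised neural network): score per-pixel
# sharpness with a smoothed Laplacian-energy measure, take the sharpest
# z-slice as the height index, and gather those pixels into a composite.
import cv2
import numpy as np

def focus_stack(stack: np.ndarray, blur: int = 9):
    # stack: (num_z, H, W) grayscale focal stack, z ordered by known height.
    sharp = np.empty(stack.shape, dtype=np.float32)
    for z, img in enumerate(stack):
        lap = cv2.Laplacian(img.astype(np.float32), cv2.CV_32F)
        # Smooth the squared response so the per-pixel argmax is stable.
        sharp[z] = cv2.GaussianBlur(lap * lap, (blur, blur), 0)
    height_idx = sharp.argmax(axis=0)                 # (H, W) height-index map
    rows, cols = np.indices(height_idx.shape)
    all_in_focus = stack[height_idx, rows, cols]      # sharpest pixel per site
    return all_in_focus, height_idx
```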

Citations: 0