Journal of Computational Science: latest articles, page 6

Step-based checkpointing with high-level algorithmic differentiation

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-06 DOI: 10.1016/j.jocs.2024.102405

James R. Maddison

Automated code generation allows for a separation between the development of a model, expressed via a domain-specific language, and lower-level implementation details. Algorithmic differentiation can be applied symbolically at the level of the domain-specific language, and the code generator reused to implement the code required for an adjoint calculation. However, adjoint calculations are complicated by the well-known problem of storing or recomputing the forward data required by the adjoint, and different checkpointing strategies have been developed to tackle this problem. This article considers the combination of high-level algorithmic differentiation with step-based checkpointing schedules, with the primary application being solvers of time-dependent partial differential equations. The focus is on algorithmic differentiation using a dynamically constructed record of forward operations, where the precise structure of the original forward calculation is not known ahead of time. In addition, high-level approaches provide a simplified view of the model itself. This allows the data required to restart and advance the forward calculation, and the data required to advance the adjoint, to be identified. The difference between these two types of data is leveraged here to implement checkpointing strategies with improved performance.
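The store-or-recompute trade-off described in this abstract can be illustrated with a minimal sketch. It uses plain periodic checkpointing on a toy scalar model, which is a simpler scheme than the schedules the article studies and is not the author's implementation. The model `step`, its hand-written adjoint `step_adjoint`, the time step `dt`, and the checkpoint `interval` are all assumptions made for illustration.

```python
def step(x, dt=0.1):
    """One forward step of a toy nonlinear model: x_{n+1} = x_n - dt * x_n**2."""
    return x - dt * x * x


def step_adjoint(x, lam, dt=0.1):
    """Adjoint of `step`, linearised about the forward state x."""
    return (1.0 - 2.0 * dt * x) * lam


def forward(x0, n_steps):
    """Run the forward model and return the final state."""
    x = x0
    for _ in range(n_steps):
        x = step(x)
    return x


def gradient_store_all(x0, n_steps):
    """d x_N / d x_0, storing every forward state the adjoint needs."""
    xs = [x0]
    for _ in range(n_steps - 1):
        xs.append(step(xs[-1]))
    lam = 1.0  # adjoint seed: d x_N / d x_N
    for x_n in reversed(xs):
        lam = step_adjoint(x_n, lam)
    return lam, len(xs)


def gradient_checkpointed(x0, n_steps, interval):
    """Same gradient, storing a restart state only every `interval` steps."""
    checkpoints = {}
    x = x0
    for n in range(n_steps):
        if n % interval == 0:
            checkpoints[n] = x  # data needed to restart the forward run
        x = step(x)
    lam = 1.0
    peak = 0  # largest number of states held at once
    for start in sorted(checkpoints, reverse=True):
        stop = min(start + interval, n_steps)
        # Recompute the forward states of this block from its checkpoint.
        block = [checkpoints[start]]
        for _ in range(start + 1, stop):
            block.append(step(block[-1]))
        peak = max(peak, len(checkpoints) + len(block))
        # Advance the adjoint backwards over the block.
        for x_n in reversed(block):
            lam = step_adjoint(x_n, lam)
    return lam, peak
```

With 100 steps and a checkpoint every 10 steps, this sketch holds at most 20 states at once instead of 100, at the cost of recomputing each block of forward steps once during the reverse pass.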


Volume 82, Article 102405
Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1877750324001984/pdfft?md5=6f935bc44600d9170907d962ee7163e7&pid=1-s2.0-S1877750324001984-main.pdf

Citations: 0

DARSI: A deep auto-regressive time series inference architecture for forecasting of aerodynamic parameters

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-06 DOI: 10.1016/j.jocs.2024.102401

Aayush Pandey , Jeevesh Mahajan , Srinag P. , Aditya Rastogi , Arnab Roy , Partha P. Chakrabarti

In fluid mechanics, where computationally intensive simulations demand significant time, especially when predicting aerodynamic coefficients, time series forecasting techniques become paramount. Existing methods are effective for periodic time series, but the challenge grows for aperiodic or chaotic system responses. To address this challenge, we introduce DARSI (Deep Auto-Regressive Time Series Inference), an architecture that is an efficient hybrid of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) components. Evaluated against established architectures (CNN, DLinear, LSTM, LSTNet, and PatchTST) for forecasting Coefficient of Lift ($C_L$) values corresponding to Angles of Attack (AoAs) across periodic, aperiodic, and chaotic regimes, DARSI shows an average increase of 79.95% in CORR and average reductions of 76.57% in MAPE, 94.70% in MSE, 76.18% in QL, and 75.21% in RRSE. Particularly adept at predicting chaotic aerodynamic coefficients, DARSI emerges as the best in static scenarios, surpassing DLinear and providing heightened reliability. In dynamic scenarios, DLinear takes the lead, with DARSI securing the second position alongside PatchTST. Furthermore, static AoAs at 24.7 are identified as the most chaotic, surpassing those at 24.9, and the study reveals a potential inflection point at AoA 24.7 in static scenarios for both DLinear and DARSI, which warrants further confirmation. This research positions DARSI as an alternative to simulations, offering computational efficiency with implications for time series forecasting applications across industries, particularly for aerodynamic predictions in chaotic scenarios.
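The abstract reports results in terms of CORR, MAPE, MSE, and RRSE without defining them. The sketch below uses the common textbook definitions (Pearson correlation, mean absolute percentage error, mean squared error, root relative squared error). Whether the authors use exactly these forms is an assumption, and QL (quantile loss) is left out because it depends on a quantile level the abstract does not give.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no true value is zero."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def rrse(y_true, y_pred):
    """Root relative squared error: error relative to predicting the mean."""
    mean_t = sum(y_true) / len(y_true)
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((t - mean_t) ** 2 for t in y_true)
    return math.sqrt(num / den)


def corr(y_true, y_pred):
    """Pearson correlation coefficient between truth and forecast."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

Under these definitions a better forecast has a higher CORR and lower MAPE, MSE, and RRSE, which matches the direction of the improvements quoted in the abstract.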


Volume 82, Article 102401

Citations: 0

Resonance modeling of the tsunami caused by the Aegean Sea Earthquake (Mw7.0) of October 30, 2020

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-05 DOI: 10.1016/j.jocs.2024.102398

Olcay Eğriboyun, Lale Balas

The resonance of tsunami waves in semi-enclosed bays is central to understanding and mitigating the impact of seismic events on coastal communities. Semi-enclosed bays, characterized by their partial enclosure, can amplify incoming tsunami waves through resonance, which occurs when the natural frequencies of the bay correspond to those of the incoming waves. Resonance can significantly increase wave height and inundation levels, raising the risk to nearby settlements and infrastructure. Understanding the resonance patterns in these bays is crucial for accurate hazard assessment, early warning systems, and effective disaster preparedness and response strategies. On October 30, 2020, an earthquake occurred between Seferihisar Bay in Turkey and the Greek island of Samos in the Aegean Sea. Long waves generated by the normal-faulting earthquake caused notable damage to settlements within Seferihisar Bay and on the north coast of Samos Island. According to measurements from the Syros mareograph stations, wave heights were between 2 and 20 cm and wave periods between 9 and 20 seconds. According to on-site surveys conducted after the earthquake, inundation was reported in six settlements within Seferihisar Bay. However, inundation was notably higher in Sığacık and Akarca, reaching 2–3 times higher than in other locations, with the water level reaching 2 m. Because this variation in inundation levels is attributed to resonance in Sığacık and Akarca rather than to the propagation of tsunami waves, this study focused on wave resonance modeling in Seferihisar Bay. The resonance modeling was performed using the RIDE wave model. Furthermore, the research was extended to assess the resonance patterns that might emerge in the event of another earthquake or an underwater landslide along the fault line responsible for the seismic event, covering wave periods of T = 1–9 minutes and T = 20–30 minutes. Modeling results revealed that on the day of the earthquake, wave heights in Sığacık Marina and Akarca were amplified 8.5 times relative to the wave height at the epicenter. This amplification is 2 to 2.5 times higher than that calculated for other locations (Demircili, Altınköy, and Tepecik). Consequently, it was concluded that resonance was one of the reasons for the more severe inundation in Sığacık and Akarca. Moreover, supplementary investigations indicated that waves with a period of T < 9 minutes would pose higher risks for Demircili, Altınköy, Sığacık Marina, and Tepecik than those on the day of the earthquake. By comprehensively studying wave resonance in semi-enclosed bays, researchers and policymakers can better anticipate the potential impact of tsunami events and take measures to protect coastal communities, ultimately increasing resilience and reducing the loss of life and property in vulnerable areas.
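The resonance condition in this abstract, where the bay's natural periods match the incoming wave period, can be made concrete with a back-of-the-envelope sketch. It uses Merian's formula for an idealised rectangular basin of constant depth that is open at one end. This is not the RIDE model used in the study, and the bay length, depth, and the 15% matching tolerance are hypothetical values chosen only for illustration.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def open_basin_periods(length_m, depth_m, n_modes=3):
    """Natural periods (s) of an idealised bay open at one end.

    Merian's formula for mode n: T_n = 4 L / ((2 n + 1) sqrt(g h)).
    """
    wave_speed = math.sqrt(G * depth_m)  # shallow-water long-wave speed
    return [4.0 * length_m / ((2 * n + 1) * wave_speed) for n in range(n_modes)]


def is_near_resonance(wave_period_s, natural_periods_s, tolerance=0.15):
    """True if the incoming period lies within `tolerance` of any natural period."""
    return any(
        abs(wave_period_s - t) / t <= tolerance for t in natural_periods_s
    )
```

For a hypothetical bay 10 km long and 25 m deep, `open_basin_periods(10_000.0, 25.0)` gives a fundamental period of roughly 43 minutes, and incoming long waves near that period would be the ones most strongly amplified in this idealisation.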

Volume 82, Article 102398

Citations: 0

Exploring the integration of IoT and Generative AI in English language education: Smart tools for personalized learning experiences

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-04 DOI: 10.1016/j.jocs.2024.102397

Wanjin Dong , Daohua Pan , Soonbae Kim

English language education is undergoing a transformative shift, propelled by advancements in technology. This research explores the integration of the Internet of Things (IoT) and Generative Artificial Intelligence (Generative AI) in the context of English language education, with a focus on developing a personalized oral assessment method. The proposed method leverages real-time data collection from IoT devices and Generative AI's language generation capabilities to create a dynamic and adaptive learning environment. The study addresses historical challenges in traditional teaching methodologies, emphasizing the need for AI approaches. The research objectives encompass a comprehensive exploration of the historical context, challenges, and existing technological interventions in English language education. A novel, technology-driven oral assessment method is designed, implemented, and rigorously evaluated using datasets such as Librispeech and L2Arctic. The ablation study investigates the impact of training dataset proportions and model learning rates on the method's performance. Results from the study highlight the importance of maintaining a balance in dataset proportions, selecting an optimal learning rate, and considering model depth in achieving optimal performance.


Volume 82, Article 102397

Citations: 0

A topological approach for semi-supervised learning

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-03 DOI: 10.1016/j.jocs.2024.102403

A. Inés, C. Domínguez, J. Heras, G. Mata, J. Rubio

Machine Learning and Deep Learning methods have become the state-of-the-art approach to data classification tasks. Using those methods requires acquiring and labelling a considerable amount of data; however, this is not straightforward in some fields, since data annotation is time-consuming and might require expert knowledge. This challenge can be tackled with semi-supervised learning methods that take advantage of both labelled and unlabelled data. In this work, we present new semi-supervised learning methods based on techniques from Topological Data Analysis (TDA). In particular, we have created two semi-supervised learning methods following two topological approaches. In the first, we use a homological approach that consists of studying the persistence diagrams associated with the data using the bottleneck and Wasserstein distances. In the second, we consider the connectivity of the data. In addition, we have carried out a thorough analysis of the developed methods using 9 tabular datasets of low and high dimensionality. The results show that the developed semi-supervised methods outperform models trained only on manually labelled data, and are an alternative to other classical semi-supervised learning algorithms.
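The connectivity-based idea mentioned in this abstract can be illustrated with a small sketch: points closer than a chosen radius are linked, and each unlabelled point takes the majority label of the connected component it falls in. This is only one simple reading of "connectivity of the data" and is not claimed to be the authors' algorithm. The function names and the `radius` parameter are hypothetical.

```python
import math


def connected_components(points, radius):
    """Component id per point for the graph linking points within `radius`."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


def propagate_labels(points, labels, radius):
    """Fill in `None` labels with the majority label of the same component.

    Points in components with no labelled member stay `None`.
    """
    components = connected_components(points, radius)
    votes = {}
    for comp, label in zip(components, labels):
        if label is not None:
            counts = votes.setdefault(comp, {})
            counts[label] = counts.get(label, 0) + 1
    result = []
    for comp, label in zip(components, labels):
        if label is not None:
            result.append(label)
        elif comp in votes:
            result.append(max(votes[comp], key=votes[comp].get))
        else:
            result.append(None)
    return result
```

The pseudo-labelled points could then be added to the training set of any ordinary classifier, which is the general pattern semi-supervised methods follow.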


Volume 82, Article 102403

Citations: 0

Fast model calibration for predicting the response of breast cancer to chemotherapy using proper orthogonal decomposition

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-02 DOI: 10.1016/j.jocs.2024.102400

Chase Christenson , Chengyue Wu , David A. Hormuth II , Casey E. Stowers , Megan LaMonica , Jingfei Ma , Gaiane M. Rauch , Thomas E. Yankeelov

Constructing digital twins for predictive tumor treatment response models can have a high computational demand that presents a practical barrier to their clinical adoption. In this work, we demonstrate that proper orthogonal decomposition, by which a low-dimensional representation of the full model is constructed, can be used to dramatically reduce the computational time required to calibrate a partial differential equation model to magnetic resonance imaging (MRI) data for rapid predictions of tumor growth and response to chemotherapy. In the proposed formulation, the reduction basis is built from each patient’s own MRI data and controls the overall size of the “reduced order model”. Using the full model as the reference, we validate that the reduced order mathematical model can accurately predict response in 50 triple-negative breast cancer patients receiving standard-of-care neoadjuvant chemotherapy. The concordance correlation coefficient between the full and reduced order models was 0.986 ± 0.012 (mean ± standard deviation) for predicting changes in both tumor volume and cellularity across the entire model family, with a corresponding median local error (inter-quartile range) of 4.36% (1.22%, 15.04%). The total time to estimate parameters and to predict response improves dramatically with the reduced framework. Specifically, for a non-mechanically coupled model, the reduced order model accelerates our calibration by a factor of 378.4 ± 279.8 (mean ± standard deviation) compared to the full order model. This enormous reduction in computational time can directly help realize the practical construction of digital twins when access to computational resources is limited.
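The core step of proper orthogonal decomposition, building a low-dimensional basis from snapshot data and projecting fields onto it, can be sketched with a singular value decomposition. The sketch below uses synthetic snapshots in place of patient MRI data, and the 99.9% energy threshold used to pick the basis size is an assumption, not a value taken from the article.

```python
import numpy as np


def pod_basis(snapshots, energy=0.999):
    """Leading left singular vectors capturing `energy` of the snapshot energy.

    `snapshots` has shape (n_dof, n_snapshots), one field per column.
    """
    u, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(cumulative, energy) + 1)
    return u[:, :rank]


def project(basis, field):
    """Reduced coordinates of a full-order field."""
    return basis.T @ field


def reconstruct(basis, coeffs):
    """Full-order field recovered from reduced coordinates."""
    return basis @ coeffs


def demo():
    """Build a basis from synthetic rank-2 snapshots and test it on a new field."""
    x = np.linspace(0.0, 1.0, 50)
    mode_1, mode_2 = np.sin(np.pi * x), np.sin(2.0 * np.pi * x)
    amp_1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    amp_2 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
    snapshots = np.outer(mode_1, amp_1) + np.outer(mode_2, amp_2)
    basis = pod_basis(snapshots)
    field = 0.3 * mode_1 - 0.7 * mode_2  # not one of the snapshots
    error = np.max(np.abs(reconstruct(basis, project(basis, field)) - field))
    return basis.shape[1], float(error)
```

In `demo`, 50 unknowns per field are represented by 2 reduced coordinates. A reduced order model would then evolve and calibrate only those coordinates, which is where the reported speed-up comes from.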


Volume 82, Article 102400

Citations: 0

A generalized framework for integrating machine learning into computational fluid dynamics

IF 3.1; Zone 3 (Computer Science); Q2 in COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-08-02 DOI: 10.1016/j.jocs.2024.102404

Xuxiang Sun , Wenbo Cao , Xianglin Shan , Yilang Liu , Weiwei Zhang

The amalgamation of machine learning algorithms (ML) with computational fluid dynamics (CFD) represents a promising frontier for the advancement of fluid dynamics research. However, the practical integration of CFD with ML algorithms frequently faces challenges related to data transfer and computational efficiency. While CFD programs are conventionally scripted in Fortran or C/C++, the prevalence of Python in the machine learning domain complicates their seamless integration. To tackle these obstacles, this paper proposes a comprehensive solution. Our devised framework primarily leverages Python modules CFFI and dynamic linking library technology to seamlessly integrate ML algorithms with CFD programs, facilitating efficient data interchange between them. Distinguished by its simplicity, efficiency, flexibility, and scalability, our framework is adaptable across various CFD programs, scalable to multi-node parallelism, and compatible with heterogeneous computing systems. In this paper, we showcase a spectrum of CFD+ML algorithms based on this framework, including stability analysis of ML Reynolds stress models, bidirectional coupling between ML turbulence models and CFD programs, and online dimension reduction optimization techniques tailored for resolving unstable steady flow solutions. In addition, our framework has been successfully tested on supercomputer clusters, demonstrating its compatibility with distributed computing architectures and its ability to leverage heterogeneous computing resources for efficient computational tasks.


Citations: 0

Node and edge centrality based failures in multi-layer complex networks

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-07-31 DOI: 10.1016/j.jocs.2024.102396

Dibakar Das, Jyotsna Bapat, Debabrata Das

Multi-layer complex networks (MLCNs) appear in various domains, such as transportation and supply chains. Failures in MLCNs can lead to major disruptions in systems. Several studies have focused on different kinds of failures, such as cascades, their causes, and ways to avoid them. This paper considers failures in a specific type of MLCN where the lower layer provides services to the higher layer without cross-layer interaction, typical of a computer network. A three-layer MLCN is constructed on the same set of nodes, where each layer has different characteristics: the bottommost layer is an Erdős–Rényi (ER) random graph in which the shortest-path hop count among the nodes is Gaussian, the middle layer is an ER graph with more edges than the layer below, and the topmost layer is a preferential attachment graph with even more edges. Both edge and node failures are considered. Failures occur in decreasing order of edge and node centrality, both in a static batch mode and in a mode where the centralities change dynamically as failures progress. The emergent patterns of three key parameters, namely average shortest path length (ASPL), total shortest path count (TSPC), and total number of edges (TNE), are studied for all three layers after node or edge failures. Extensive simulations show that all but one of the parameters exhibit definite degrading patterns. Surprisingly, the ASPL for the middle layer starts to show chaotic behaviour beyond a certain point for all types of failures.
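The static and dynamic failure modes described in this abstract can be illustrated on a single layer with a small sketch. This is not the authors' code. It assumes degree as the centrality measure (the abstract does not say which centrality is used), and it reads TSPC as the number of connected node pairs. The function names `erdos_renyi`, `path_metrics`, and `fail_nodes` are hypothetical.

```python
from collections import deque
import random


def erdos_renyi(n, p, seed=0):
    """Return an undirected G(n, p) random graph as {node: set(neighbours)}."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def path_metrics(adj):
    """Return (ASPL, TSPC, TNE) for an undirected graph.

    ASPL averages hop counts over connected pairs only. TSPC is taken to be
    the number of connected unordered node pairs (an assumed definition).
    """
    total_hops = 0
    connected_pairs = 0  # ordered pairs (s, t), s != t, with t reachable from s
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search gives hop counts from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total_hops += sum(dist.values())
        connected_pairs += len(dist) - 1
    aspl = total_hops / connected_pairs if connected_pairs else 0.0
    tne = sum(len(nb) for nb in adj.values()) // 2
    return aspl, connected_pairs // 2, tne


def remove_node(adj, node):
    """Delete `node` and all edges incident to it."""
    for nb in adj.pop(node):
        adj[nb].discard(node)


def fail_nodes(adj, count, dynamic):
    """Remove `count` nodes by decreasing degree; return metrics after each failure.

    dynamic=False ranks nodes once up front (static batch mode);
    dynamic=True re-ranks the surviving nodes after every failure.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    static_order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    history = []
    for i in range(count):
        if dynamic:
            victim = min(adj, key=lambda v: (-len(adj[v]), v))
        else:
            victim = static_order[i]
        remove_node(adj, victim)
        history.append(path_metrics(adj))
    return history
```

Calling `fail_nodes(erdos_renyi(30, 0.2), 5, dynamic=True)` returns one `(ASPL, TSPC, TNE)` tuple per failure, which is the kind of trajectory the study tracks per layer.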



Citations: 0

Distributed service function chaining in NFV-enabled networks: A game-theoretic learning approach

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-07-30 DOI: 10.1016/j.jocs.2024.102399

Mahsa Alikhani , Vesal Hakami , Marzieh Sheikhi

In network function virtualization (NFV), Service Function Chaining (SFC) provides an ordered sequence of virtual network functions (VNFs) and subsequent steering of traffic flows through them to cater to end-to-end services. This paper addresses the NP-hard problem of minimum-cost SFC deployment to support customer services that access the carrier network’s NFV infrastructure (NFVI) through some edge routers. When determining the mappings of VNFs to physical servers, a challenging aspect is that inter-server latencies may fluctuate over time because of the shared nature of cloud data centers. To construct the SFC, we propose three different formulations, each corresponding to a different informational assumption about the link latencies. First, a centralized integer linear programming (ILP) formulation is given under the assumption of the non-causal availability of exact and instantaneous inter-server latencies. The solution to this ILP can serve as a lower bound to benchmark more scalable and realistic schemes. Next, we give a distributed game-theoretic formulation (with service broker agents as players) which only requires statistical knowledge of link latency fluctuations. The game provably admits a pure Nash equilibrium (NE) and can be solved iteratively through the well-known best response dynamics (BRD) algorithm. Our main novelty lies in the third formulation, in which each service broker has neither instantaneous nor statistical knowledge of the latencies. Instead, it relies on a game-theoretic learning algorithm to compose its VNF chain based only on its own history of adopted decisions and experienced delays on each logical link. We prove that the proposed learning algorithm asymptotically converges to NE and evaluate its performance through simulations in terms of convergence and the impact of network parameters.
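The best response dynamics mentioned in this abstract can be illustrated on a far simpler game than the paper's SFC formulation. The sketch below swaps in a singleton congestion game: each broker picks exactly one server, and its cost is the server's base latency multiplied by the number of brokers on that server. This is a potential game, so best responses are guaranteed to reach a pure Nash equilibrium. The cost model and the names `best_response_dynamics` and `is_pure_nash` are assumptions made for illustration, not taken from the paper.

```python
def best_response_dynamics(num_players, base_latency, max_rounds=1000):
    """Iterate best responses in a singleton congestion game until no player moves.

    Each player picks one server; a player's cost on server s is
    base_latency[s] * (number of players on s). This cost model is an
    illustrative assumption, not the paper's SFC cost.
    """
    choice = [0] * num_players          # everyone starts on server 0
    load = [0] * len(base_latency)
    load[0] = num_players

    def cost_if_on(player, server):
        # Load the player would see on `server`, counting itself exactly once.
        others = load[server] - (1 if choice[player] == server else 0)
        return base_latency[server] * (others + 1)

    for _ in range(max_rounds):
        moved = False
        for player in range(num_players):
            current = choice[player]
            best = min(range(len(base_latency)),
                       key=lambda s: (cost_if_on(player, s), s))
            # Move only on a strict improvement, so the potential strictly drops.
            if cost_if_on(player, best) < cost_if_on(player, current):
                load[current] -= 1
                load[best] += 1
                choice[player] = best
                moved = True
        if not moved:
            return choice, load
    raise RuntimeError("no equilibrium reached within max_rounds")


def is_pure_nash(choice, base_latency):
    """Check that no player can lower its cost by deviating unilaterally."""
    load = [choice.count(s) for s in range(len(base_latency))]
    for server in choice:
        here = base_latency[server] * load[server]
        for alt in range(len(base_latency)):
            if alt != server and base_latency[alt] * (load[alt] + 1) < here:
                return False
    return True
```

A call such as `best_response_dynamics(6, [1.0, 2.0, 3.0])` returns the final assignment and per-server loads, and `is_pure_nash` can then confirm that no broker benefits from switching servers.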



Citations: 0

RuMedSpellchecker: A new approach for advanced spelling error correction in Russian electronic health records

IF 3.1 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Computational Science

Pub Date : 2024-07-29 DOI: 10.1016/j.jocs.2024.102393

Dmitrii Pogrebnoi, Anastasia Funkner, Sergey Kovalchuk

In healthcare, remarkable progress in machine learning has given rise to a diverse range of predictive and decision-making medical models, significantly enhancing treatment efficacy and overall quality of care. These models often rely on electronic health records (EHRs) as fundamental data sources. The effectiveness of these models is contingent on the quality of the EHRs, typically presented as unstructured text. Unfortunately, these records frequently contain spelling errors, diminishing the quality of intelligent systems relying on them. In this research, we propose a method and a tool for correcting spelling errors in Russian medical texts. Our approach combines the Symmetrical Deletion algorithm with a finely tuned BERT model to efficiently correct spelling errors, thereby enhancing the quality of the original medical texts at a minimal cost. In addition, we introduce several fine-tuned BERT models for Russian anamneses. Through rigorous evaluation and comparison with existing spelling error correction tools for the Russian language, we demonstrate that our approach and tool surpass existing open-source alternatives by 7% in correcting spelling errors in sample Russian medical texts and are significantly superior in automatically correcting real-world anamneses. However, the new approach is far inferior to proprietary services such as Yandex Speller and GPT-4. The proposed tool and its source code are available on GitHub and pip repositories. This paper is an extended version of the work presented at ICCS 2023 (Pogrebnoi et al. 2023).
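The candidate-generation idea behind a symmetric deletion approach can be sketched as follows. This is a minimal illustration, not the RuMedSpellchecker implementation: both the dictionary words and the misspelled word are expanded into their delete-variants, and any shared variant yields a candidate. The function names and the toy vocabulary are hypothetical. Ranking the candidates, which the abstract attributes to a fine-tuned BERT model, is left out, as is the final edit-distance check a production implementation would apply.

```python
from itertools import combinations


def deletes(word, max_distance):
    """Return `word` plus every string formed by deleting up to max_distance characters."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(vocabulary, max_distance=1):
    """Map each delete-variant to the set of dictionary words that produce it."""
    index = {}
    for term in vocabulary:
        for variant in deletes(term, max_distance):
            index.setdefault(variant, set()).add(term)
    return index


def candidates(word, index, max_distance=1):
    """Return dictionary words that share at least one delete-variant with `word`."""
    found = set()
    for variant in deletes(word, max_distance):
        found |= index.get(variant, set())
    return found


# Toy vocabulary (hypothetical): "patient", "pressure", "analysis".
index = build_index(["пациент", "давление", "анализ"])
print(candidates("пациет", index))    # missing letter
print(candidates("давлеине", index))  # transposed letters
```

Because only deletions are generated, the index is built once from the dictionary and each lookup costs a handful of hash probes, which is why this family of methods is attractive for large volumes of clinical text.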



Citations: 0