MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning

Artificial Intelligence in Geosciences Pub Date : 2022-12-01 DOI:10.1016/j.aiig.2022.09.002

Tariq Alkhalifah, Hanchen Wang, Oleg Ovcharenko

Volume 3, Pages 101-114. Publisher page: https://www.sciencedirect.com/science/article/pii/S2666544122000260

Citations: 0

Abstract

One of the biggest challenges in using neural networks (NNs) trained on waveform (i.e., seismic, electromagnetic, or ultrasound) data is applying them to real data. The need for accurate labels often forces us to train on synthetic data, where labels are readily available. However, synthetic data often fail to capture the reality of the field experiment, so the trained NNs perform poorly at the inference stage. Synthetic data lack many of the features embedded in real data, including an accurate waveform source signature, realistic noise, and accurate reflectivity. In other words, the real data set is far from being a sample from the distribution of the synthetic training set.

We therefore describe a novel approach (domain adaptation) that enhances supervised NN training on synthetic data with real data features. It targets tasks in which the absolute values of the vertical axis (time or depth) of the input section are either not crucial to the prediction, as in classification, or can be corrected after the prediction, as in velocity model building using a well. For such tasks, we suggest a series of linear operations on the network input so that the training and application data have similar distributions. Two operations are applied to the input data, whether it comes from the synthetic or the real subset domain:

(1) Crosscorrelation of the input data section (e.g., a shot gather or seismic image) with a fixed-location reference trace from the same section.
(2) Convolution of the resulting data with the mean (or a random sample) of the autocorrelated sections from the other subset domain.

In the training stage, the input data come from the synthetic subset domain and the autocorrelated sections (each trace crosscorrelated with itself) come from the real subset domain, with a new random selection of real sections drawn at every epoch. In the inference/application stage, the input data come from the real subset domain and the mean of the autocorrelated sections comes from the synthetic subset domain. Example applications to passive seismic data, for locating microseismic event sources, and to active seismic data, for predicting low frequencies, demonstrate how this approach improves the applicability of trained NNs to real data.
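The two operations described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array layout (sections, traces, samples), the trace-by-trace pairing in the convolution, and the random test data are all hypothetical choices made for the example.

```python
import numpy as np


def crosscorrelate_with_reference(section, ref_index=0):
    """Operation (1): cross-correlate every trace with a fixed-location reference trace.

    section: array of shape (n_traces, n_samples).
    Returns an array of shape (n_traces, 2 * n_samples - 1).
    """
    ref = section[ref_index]
    return np.stack([np.correlate(trace, ref, mode="full") for trace in section])


def autocorrelate_section(section):
    """Cross-correlate each trace of a section with itself."""
    return np.stack([np.correlate(trace, trace, mode="full") for trace in section])


def mean_autocorrelation(sections):
    """Average the autocorrelated sections of one domain.

    sections: array of shape (n_sections, n_traces, n_samples).
    """
    return np.mean([autocorrelate_section(s) for s in sections], axis=0)


def mlreal_transform(section, other_domain_autocorr, ref_index=0):
    """Apply operation (1), then operation (2): convolve each resulting trace
    with the matching trace of the other domain's autocorrelated section.

    Assumption: traces are paired by position, and mode="same" keeps the
    output length equal to the cross-correlation length.
    """
    xcorr = crosscorrelate_with_reference(section, ref_index)
    return np.stack(
        [np.convolve(x, a, mode="same") for x, a in zip(xcorr, other_domain_autocorr)]
    )


# Hypothetical data: 4 synthetic and 5 real sections, 8 traces of 64 samples each.
rng = np.random.default_rng(0)
synthetic = rng.standard_normal((4, 8, 64))
real = rng.standard_normal((5, 8, 64))

# Training stage: synthetic input, autocorrelation taken from a randomly
# selected real section (a new selection would be drawn at every epoch).
random_real = autocorrelate_section(real[rng.integers(len(real))])
train_input = mlreal_transform(synthetic[0], random_real)

# Inference stage: real input, mean autocorrelation taken from the synthetic domain.
synthetic_mean = mean_autocorrelation(synthetic)
inference_input = mlreal_transform(real[0], synthetic_mean)
```

Both stages use the same transform and only swap which domain supplies the input and which supplies the autocorrelation, so each network input carries spectral content from both domains. Because cross-correlation with a reference trace discards absolute time, the sketch only suits tasks where the vertical-axis origin is not needed or can be restored later, as the abstract notes.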
