Geoscience Language Processing for Exploration

H. Denli, Hassan Javed Chughtai, Brian Hughes, Robert Gistri, Peng Xu
{"title":"Geoscience Language Processing for Exploration","authors":"H. Denli, HassanJaved Chughtai, Brian Hughes, Robert Gistri, Peng Xu","doi":"10.2118/207766-ms","DOIUrl":null,"url":null,"abstract":"\n Deep learning has recently been providing step-change capabilities, particularly using transformer models, for natural language processing applications such as question answering, query-based summarization, and language translation for general-purpose context. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully-quantitative and automated analysis of large corpuses of data and gain insights.\n One of the key transformer-based model is BERT (Bidirectional Encoder Representations from Transformers). It is trained with a large amount of general-purpose text (e.g., Common Crawl). Use of such a model for geoscience applications can face a number of challenges. One is due to the insignificant presence of geoscience-specific vocabulary in general-purpose context (e.g. daily language) and the other one is due to the geoscience jargon (domain-specific meaning of words). For example, salt is more likely to be associated with table salt within a daily language but it is used as a subsurface entity within geosciences.\n To elevate such challenges, we retrained a pre-trained BERT model with our 20M internal geoscientific records. We will refer the retrained model as GeoBERT. We fine-tuned the GeoBERT model for a number of tasks including geoscience question answering and query-based summarization.\n BERT models are very large in size. For example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, could result in a substantial latency when all database is processed at every call of the model. To address this challenge, we developed a retriever-reader engine consisting of an embedding-based similarity search as a context retrieval step, which helps the solution to narrow the context for a given query before processing the context with GeoBERT.\n We built a solution integrating context-retrieval and GeoBERT models. Benchmarks show that it is effective to help geologists to identify answers and context for given questions. The prototype will also produce a summary to different granularity for a given set of documents. We have also demonstrated that domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.","PeriodicalId":10959,"journal":{"name":"Day 3 Wed, November 17, 2021","volume":"9 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Day 3 Wed, November 17, 2021","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2118/207766-ms","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Deep learning, particularly with transformer models, has recently provided step-change capabilities for natural language processing applications such as question answering, query-based summarization, and language translation in general-purpose contexts. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully quantitative, and automated analysis of large corpora of data and gain insights.

One of the key transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). It is trained on a large amount of general-purpose text (e.g., Common Crawl). Using such a model for geoscience applications faces a number of challenges. One is the scarce presence of geoscience-specific vocabulary in general-purpose text (e.g., everyday language); another is geoscience jargon, i.e., domain-specific meanings of words. For example, in everyday language "salt" is most likely associated with table salt, but in geoscience it denotes a subsurface entity.

To alleviate these challenges, we retrained a pre-trained BERT model on our 20 million internal geoscientific records. We refer to the retrained model as GeoBERT. We fine-tuned the GeoBERT model for a number of tasks, including geoscience question answering and query-based summarization.

BERT models are very large; BERT-Large, for example, has 340 million trained parameters. Geoscience language processing with these models, including GeoBERT, can incur substantial latency if the entire database is processed at every call of the model. To address this challenge, we developed a retriever-reader engine with an embedding-based similarity search as a context-retrieval step, which narrows the context for a given query before that context is processed with GeoBERT.

We built a solution integrating the context-retrieval and GeoBERT models. Benchmarks show that it effectively helps geologists identify answers and supporting context for given questions. The prototype also produces summaries at different levels of granularity for a given set of documents. We have also demonstrated that the domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.
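The retraining step described in the abstract corresponds to what is commonly called domain-adaptive pretraining: continuing BERT's masked-language-model objective on in-domain text. The sketch below illustrates that general pattern with the Hugging Face transformers library; the corpus file name, hyperparameters, and output path are hypothetical placeholders, not details from the paper.

```python
# A minimal sketch of domain-adaptive pretraining: continue BERT's
# masked-language-model (MLM) training on a geoscience corpus.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")

# Hypothetical file of in-domain text, one record per line.
corpus = load_dataset("text", data_files={"train": "geoscience_records.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="geobert",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()                 # continue pretraining on the domain corpus
trainer.save_model("geobert")   # the retrained checkpoint, a GeoBERT analogue
```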
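The retriever-reader engine can likewise be illustrated with a minimal sketch: a sentence-embedding model ranks passages by cosine similarity to the query, and only the top-ranked passages are handed to an extractive question-answering reader. The embedding model and the public SQuAD-tuned BERT checkpoint below are generic stand-ins for the paper's internal GeoBERT models, and the passage store is a toy example.

```python
# A minimal retriever-reader sketch: embedding-based retrieval narrows the
# context before a BERT reader extracts the answer span.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy passage store; in practice this would be the indexed document database.
passages = [
    "The salt dome forms a structural trap above the reservoir sandstone.",
    "Table salt production rose by three percent last year.",
    "Porosity in the carbonate interval averages twelve percent.",
]

# Retriever: embed the passages once, then rank them by cosine similarity.
retriever = SentenceTransformer("all-MiniLM-L6-v2")
passage_vecs = retriever.encode(passages, normalize_embeddings=True)

# Reader: a public SQuAD-tuned BERT standing in for the fine-tuned GeoBERT.
reader = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

def answer(query, top_k=2):
    q_vec = retriever.encode([query], normalize_embeddings=True)[0]
    scores = passage_vecs @ q_vec              # cosine similarity of unit vectors
    best = np.argsort(scores)[::-1][:top_k]    # indices of the top-k passages
    context = " ".join(passages[i] for i in best)
    return reader(question=query, context=context)

print(answer("What forms the structural trap?"))
```

Restricting the reader to a handful of retrieved passages is what avoids the latency of running the large reader model over the entire database on every query.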