Petro NLP: Resources for natural language processing and information extraction for the oil and gas industry

Computers & Geosciences · IF 4.2 · Zone 2 (Earth Sciences) · JCR Q1 (Computer Science, Interdisciplinary Applications) · Pub Date: 2024-09-05 · DOI: 10.1016/j.cageo.2024.105714
Fábio Corrêa Cordeiro, Patrícia Ferreira da Silva, Alexandre Tessarollo, Cláudia Freitas, Elvis de Souza, Diogo da Silva Magalhaes Gomes, Renato Rocha Souza, Flávio Codeço Coelho
Computers & Geosciences, volume 193, Article 105714. https://www.sciencedirect.com/science/article/pii/S0098300424001973

Abstract

Most companies struggle to find and extract relevant information from their technical documents. In particular, the Oil and Gas (O&G) industry faces the challenge of dealing with large amounts of data hidden within old and new geoscientific reports collected over decades of operation. Making this information available in a structured format can unlock valuable information among these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built from general domain corpora extracted from the Internet and primarily written in English. This paper presents Petro NLP, a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese.

We brought together an interdisciplinary team of geoscientists, linguists, computer scientists, petroleum engineers, librarians, and ontologists to build a knowledge graph and several annotated corpora. The Petro NLP resources comprise: (i) Petro KGraph, a knowledge graph populated with entities and relations commonly found in technical reports; and (ii) Petrolês, PetroGold, PetroNER, and PetroRE, sets of corpora containing raw text and documents annotated with morphosyntactic labels, named entities, and relations. These resources are fundamental infrastructure for future research in natural language processing and information extraction in the oil industry. Our ongoing research uses these datasets to train and enhance pre-trained machine learning models that automatically extract information from geoscientific technical documents.
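
To make the link between annotated corpora and a knowledge graph more concrete, the following is a minimal sketch of how named-entity and relation annotations could be grouped into graph triples. It is not the authors' implementation: the `Entity` and `Relation` classes, the `build_graph` helper, the entity labels, the relation names, and the example entities are all hypothetical and are not taken from Petro KGraph, PetroNER, or PetroRE.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it would appear in a sentence
    label: str  # hypothetical entity type, not a PetroNER label


@dataclass(frozen=True)
class Relation:
    head: Entity
    name: str   # hypothetical relation name, not a PetroRE label
    tail: Entity


def build_graph(relations):
    """Group (relation name, tail entity) pairs under each head entity."""
    graph = defaultdict(set)
    for rel in relations:
        graph[rel.head].add((rel.name, rel.tail))
    return dict(graph)


# Invented annotations, for illustration only.
field = Entity("Campo de Exemplo", "FIELD")
basin = Entity("Bacia de Exemplo", "BASIN")
rock = Entity("arenito", "ROCK")

relations = [
    Relation(field, "located_in", basin),
    Relation(field, "has_lithology", rock),
]

graph = build_graph(relations)

# Print each triple attached to the example field, ordered by relation name.
for name, tail in sorted(graph[field], key=lambda pair: pair[0]):
    print(f"{field.text} --{name}--> {tail.text}")
```

Frozen dataclasses are used here so that entities are hashable and can serve as dictionary keys and set members; a real knowledge graph would more likely use stable identifiers rather than surface strings.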
