Mining Structures from Massive Texts by Exploring the Power of Pre-trained Language Models

Yu Zhang, Yunyi Zhang, Jiawei Han
{"title":"通过探索预训练语言模型的力量从大量文本中挖掘结构","authors":"Yu Zhang, Yunyi Zhang, Jiawei Han","doi":"10.48786/edbt.2023.81","DOIUrl":null,"url":null,"abstract":"Technologies for handling massive structured or semi-structured data have been researched extensively in database communities. However, the real-world data are largely in the form of unstructured text, posing a great challenge to their management and analysis as well as their integration with semi-structured databases. Recent developments of deep learning methods and large pre-trained language models (PLMs) have revolutionized text mining and processing and shed new light on structuring massive text data and building a framework for integrated (i.e., structured and unstructured) data management and analysis. In this tutorial, we will focus on the recently developed text mining approaches empowered by PLMs that can work without relying on heavy human annotations. We will present an organized picture of how a set of weakly supervised methods explore the power of PLMs to structure text data, with the following outline: (1) an introduction to pre-trained languagemodels that serve as new tools for our tasks, (2) mining topic structures: unsupervised and seed-guided methods for topic discovery from massive text corpora, (3) mining document structures: weakly supervised methods for text classification, (4) mining entity structures: distantly supervised and weakly supervised methods for phrase mining, named entity recognition, taxonomy construction, and structured knowledge graph construction, and (5) towards an integrated information processing paradigm. 1 BACKGROUND, GOALS, AND DURATION The massive text data available on the Web, social media, news, scientific literature, government reports, and other information sources contain rich knowledge that can potentially benefit a wide variety of information processing tasks, and they can be potentially structured and analyzed by extended database technologies. For example, one can conduct entity recognition and concept ontology construction on a large collection of scientific papers and extract the factual knowledge for knowledge base construction and subsequent analysis. How to effectively leverage the unstructured massive text data for downstream applications has remained an important and active research question for the past few decades. Recently, pre-trained language models (PLMs) such as BERT [6] have revolutionized the text mining field and brought new inspirations to structuring text data. To be specific, the following paradigm is usually adopted: pre-training neural architectures on large-scale text corpora obtained from the world knowledge (e.g., a combination of Wikipedia, books, scientific corpora, and web content), and then transferring their representations to task-specific data. By doing so, the knowledge encoded in the world corpora can be effectively leveraged to enhance © 2023 Copyright held by the owner/author(s). Published in Proceedings of the 26th International Conference on Extending Database Technology (EDBT), 28th March-31st March, 2023, ISBN 978-3-89318-092-9 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. downstream task performance significantly. However, the major challenge of such a paradigm is that fully supervised fine-tuning of PLMs usually requires abundant human annotations, which may require domain expertise and can be expensive and timeconsuming to acquire in practice. 
In this tutorial, we aim to introduce the recent developments in (1) language model pre-training that turns massive texts into contextualized text representations, and (2) weakly supervised methods that transfer pre-trained representations to various tasks for mining structures of topics, documents, and entities from massive texts. The materials introduced in our tutorial will greatly benefit researchers who work on text mining/natural language processing, data mining, and database systems, as well as practitioners who aim to obtain structured and actionable knowledge for targeted applications without access to abundant annotated data. The tutorial will be presented in 3 hours.
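As a concrete illustration of the weakly supervised theme in outline items (2) and (3), the sketch below pseudo-labels documents using only a few seed words per class, reusing the PLM representations from the previous sketch. The class names, seed words, and documents are hypothetical, and this captures only the spirit of seed-guided methods; the systems covered in the tutorial refine this idea considerably (e.g., with self-training on the pseudo-labels).

```python
# An illustrative sketch of seed-guided weak supervision: each class is
# described by a handful of seed words rather than labeled documents, and
# documents are pseudo-labeled by embedding similarity to the seeds.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT vectors, one row per input text."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (out * mask).sum(1) / mask.sum(1)

# Classes are specified only by seed words -- no labeled documents needed.
seeds = {
    "sports": "basketball football tennis championship",
    "science": "physics experiment laboratory hypothesis",
}
docs = [
    "The team won the championship after a tense final match.",
    "Researchers tested the hypothesis in a controlled laboratory study.",
]

class_vecs = embed(list(seeds.values()))  # one vector per class
doc_vecs = embed(docs)                    # one vector per document
# Cosine similarity between every (document, class) pair.
sims = torch.nn.functional.cosine_similarity(
    doc_vecs.unsqueeze(1), class_vecs.unsqueeze(0), dim=-1
)
labels = list(seeds.keys())
for doc, idx in zip(docs, sims.argmax(dim=1)):
    print(labels[int(idx)], "<-", doc)  # noisy pseudo-labels for training
```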
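Outline item (4), mining entity structures, similarly replaces manual span annotation with distant supervision. The toy sketch below string-matches a small dictionary (standing in for a knowledge base) against raw text to produce noisy BIO tags that an NER model could be trained on; the dictionary entries and entity types are hypothetical, and real pipelines layer phrase mining, type inference, and noise-robust learning on top of this idea.

```python
# A toy sketch of distantly supervised NER labeling: dictionary entries are
# greedily matched against tokenized text to emit BIO tags, so no human
# span annotation is needed (at the cost of label noise).
dictionary = {
    "pre-trained language models": "METHOD",
    "BERT": "METHOD",
    "Wikipedia": "CORPUS",
}

def distant_bio_tags(tokens):
    """Greedy longest-match dictionary tagging that emits BIO labels."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span starting at position i first.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in dictionary:
                etype = dictionary[span]
                tags[i] = "B-" + etype
                for k in range(i + 1, j):
                    tags[k] = "I-" + etype
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return tags

tokens = "BERT is pre-trained on Wikipedia text".split()
print(list(zip(tokens, distant_bio_tags(tokens))))
# [('BERT', 'B-METHOD'), ('is', 'O'), ('pre-trained', 'O'),
#  ('on', 'O'), ('Wikipedia', 'B-CORPUS'), ('text', 'O')]
```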