{"title":"A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification","authors":"Guillaume Lefebvre , Haytham Elghazel , Theodore Guillet , Alexandre Aussem , Matthieu Sonnati","doi":"10.1016/j.datak.2024.102281","DOIUrl":null,"url":null,"abstract":"<div><p><span>In recent years, Natural Language Processing<span> (NLP) has made significant advances through advanced general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification<span>. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex situations, applied on specific-domain texts, such as the Education and professional training domain, general language embedding models often inadequately represent the unique terminologies and contextual nuances of a specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This innovative framework chains multiple classifiers<span>, where each individual classifier is built using a novel sentence-embedding method BERTEPro based on existing Transformer models, whose pre-training has been extended on education and professional training texts, before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates local probability predictions augmented with the input feature vectors to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments over three domain-specific textual HMC datasets indicate the effectiveness of </span></span></span></span><span>HMCCCProbT</span><span> to compare favorably to state-of-the-art HMC algorithms<span> in terms of classification accuracy and also the ability of </span></span><span>BERTEPro</span> to obtain better probability predictions, well suited to <span>HMCCCProbT</span><span>, than three other vector representation techniques.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102281"},"PeriodicalIF":2.7000,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X24000053","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In recent years, Natural Language Processing (NLP) has made significant advances through general language embeddings, enabling breakthroughs in NLP tasks such as semantic similarity and text classification. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. When applied to domain-specific texts, such as those from the education and professional training domain, general language embedding models often fail to adequately represent the unique terminology and contextual nuances of the specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This framework chains multiple classifiers, each built using BERTEPro, a novel sentence-embedding method based on existing Transformer models whose pre-training has been extended with education and professional training texts before fine-tuning on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates its local probability predictions, augmented with the input feature vectors, to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments on three domain-specific textual HMC datasets show that HMCCCProbT compares favorably to state-of-the-art HMC algorithms in terms of classification accuracy, and that BERTEPro obtains better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
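The chained, level-by-level propagation described above can be made concrete with a small sketch. The following is not the authors' implementation: the class name, the scikit-learn base learner, and the use of a generic feature matrix in place of BERTEPro embeddings are illustrative assumptions about how each level's probability predictions might be concatenated with the input features and forwarded to the next level's classifier.

```python
# Hypothetical sketch of a chained hierarchical multi-label classifier
# in the spirit of HMCCCProbT (not the authors' code). The feature matrix
# X stands in for BERTEPro sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


class ChainedHierarchicalClassifier:
    """One multi-label classifier per hierarchy level; each level sees the
    original features concatenated with the previous level's probabilities."""

    def __init__(self, n_levels):
        self.levels = [OneVsRestClassifier(LogisticRegression(max_iter=1000))
                       for _ in range(n_levels)]

    def fit(self, X, Y_per_level):
        # Y_per_level: list of binary label matrices, one per hierarchy level.
        features = X
        for clf, Y in zip(self.levels, Y_per_level):
            clf.fit(features, Y)
            # Propagate this level's local probability predictions,
            # augmented with the original input features, to the next level.
            probs = clf.predict_proba(features)
            features = np.hstack([X, probs])
        return self

    def predict_proba(self, X):
        features, per_level_probs = X, []
        for clf in self.levels:
            probs = clf.predict_proba(features)
            per_level_probs.append(probs)
            features = np.hstack([X, probs])
        return per_level_probs
```

The design choice illustrated here is that each level's classifier receives the original sentence embeddings augmented only with the immediately preceding level's local probabilities, mirroring the propagation scheme outlined in the abstract.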
About the journal:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.