{"title":"A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification","authors":"Guillaume Lefebvre , Haytham Elghazel , Theodore Guillet , Alexandre Aussem , Matthieu Sonnati","doi":"10.1016/j.datak.2024.102281","DOIUrl":null,"url":null,"abstract":"<div><p><span>In recent years, Natural Language Processing<span> (NLP) has made significant advances through advanced general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification<span>. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex situations, applied on specific-domain texts, such as the Education and professional training domain, general language embedding models often inadequately represent the unique terminologies and contextual nuances of a specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This innovative framework chains multiple classifiers<span>, where each individual classifier is built using a novel sentence-embedding method BERTEPro based on existing Transformer models, whose pre-training has been extended on education and professional training texts, before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates local probability predictions augmented with the input feature vectors to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments over three domain-specific textual HMC datasets indicate the effectiveness of </span></span></span></span><span>HMCCCProbT</span><span> to compare favorably to state-of-the-art HMC algorithms<span> in terms of classification accuracy and also the ability of </span></span><span>BERTEPro</span> to obtain better probability predictions, well suited to <span>HMCCCProbT</span><span>, than three other vector representation techniques.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102281"},"PeriodicalIF":2.7000,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X24000053","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In recent years, Natural Language Processing (NLP) has made significant advances through general language embeddings, enabling breakthroughs in NLP tasks such as semantic similarity and text classification. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. When applied to domain-specific texts, such as those from the education and professional training domain, general language embedding models often fail to adequately represent the unique terminology and contextual nuances of the specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This framework chains multiple classifiers, each built using BERTEPro, a novel sentence-embedding method based on existing Transformer models whose pre-training has been extended with education and professional training texts before fine-tuning on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates its local probability predictions, augmented with the input feature vectors, to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. Experiments on three domain-specific textual HMC datasets show that HMCCCProbT compares favorably to state-of-the-art HMC algorithms in terms of classification accuracy, and that BERTEPro obtains better probability predictions, well suited to HMCCCProbT, than three other vector representation techniques.
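The chained, level-by-level propagation described above can be made concrete with a small sketch. The following is not the authors' implementation: the class name, the scikit-learn base learner, and the use of a generic feature matrix in place of BERTEPro embeddings are illustrative assumptions about how each level's probability predictions might be concatenated with the input features and forwarded to the next level's classifier.

```python
# Hypothetical sketch of a chained hierarchical multi-label classifier
# in the spirit of HMCCCProbT (not the authors' code). The feature matrix
# X stands in for BERTEPro sentence embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


class ChainedHierarchicalClassifier:
    """One multi-label classifier per hierarchy level; each level sees the
    original features concatenated with the previous level's probabilities."""

    def __init__(self, n_levels):
        self.levels = [OneVsRestClassifier(LogisticRegression(max_iter=1000))
                       for _ in range(n_levels)]

    def fit(self, X, Y_per_level):
        # Y_per_level: list of binary label matrices, one per hierarchy level.
        features = X
        for clf, Y in zip(self.levels, Y_per_level):
            clf.fit(features, Y)
            # Propagate this level's local probability predictions,
            # augmented with the original input features, to the next level.
            probs = clf.predict_proba(features)
            features = np.hstack([X, probs])
        return self

    def predict_proba(self, X):
        features, per_level_probs = X, []
        for clf in self.levels:
            probs = clf.predict_proba(features)
            per_level_probs.append(probs)
            features = np.hstack([X, probs])
        return per_level_probs
```

The design choice illustrated here is that each level's classifier receives the original sentence embeddings augmented only with the immediately preceding level's local probabilities, mirroring the propagation scheme outlined in the abstract.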
About the journal:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.