Using a self-attention architecture to automate valence categorization of French teenagers' free descriptions of their family relationships. A proof of concept.

Journal of Medical Artificial Intelligence. Pub Date: 2023-01-18. DOI: 10.1101/2023.01.16.23284557

M. Sedki, N. Vidal, P. Roux, C. Barry, M. Speranza, B. Falissard, E. Brunet-Gouet

Abstract

This paper presents a proof of concept for using natural language processing techniques to categorize the valence of family relationships described in free texts written by French teenagers. The study follows the evolution of word embedding techniques. After splitting the available texts into short texts made up of sentences and labeling them manually, we tested several word embedding scenarios to train a multi-label classification model in which a text can take several labels: labels describing the family link between the teenager and the person mentioned in the text, and labels describing the teenager's relationship with that person (positive/negative/neutral valence). The natural baseline for representing our texts as word vectors is to build a TF-IDF representation and train classical classifiers (elastic-net logistic regression, gradient boosting, random forest, support vector classifier), selecting one model per class of machine learning models by cross-validation. We then studied the strengths of word-vector embeddings produced by an advanced language representation technique, the CamemBERT transformer model, and again used them with classical classifiers to compare their respective performances. The last scenario consisted of augmenting CamemBERT with dense output layers (a perceptron) acting as a classifier adapted to multi-label classification, and fine-tuning the original CamemBERT layers. The optimal fine-tuning depth, which achieves a bias-variance trade-off, was obtained by a cross-validation procedure. Comparing the three scenarios on a test dataset shows that the fine-tuning scenario clearly improves classification performance over both the baseline and simple vectorization using CamemBERT without fine-tuning. Despite the moderate size of the dataset and the input texts, fine-tuning to an optimal depth remains the best solution for building a classifier.
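To make the baseline scenario more concrete, the sketch below shows how short texts might be turned into TF-IDF vectors and multi-label targets. It is only an illustration under stated assumptions, not the authors' implementation: the label names in `LABELS`, the three sample sentences, the whitespace tokenizer, and the smoothed IDF formula are all hypothetical choices, since the abstract specifies none of them.

```python
import math
from collections import Counter

# Hypothetical label set: the abstract describes two label families
# (family link, valence) but does not list the family-link categories.
LABELS = ["mother", "father", "sibling", "positive", "negative", "neutral"]

# Invented short texts for illustration; NOT taken from the study's dataset.
SAMPLES = [
    ("je m'entends bien avec ma mère", {"mother", "positive"}),
    ("mon père me crie souvent dessus", {"father", "negative"}),
    ("ma mère et mon frère sont gentils", {"mother", "sibling", "positive"}),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_vector(doc, idf):
    """L2-normalised sparse TF-IDF weights for one tokenized document."""
    tf = Counter(tok for tok in doc if tok in idf)
    weights = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {tok: w / norm for tok, w in weights.items()}


def encode_labels(label_set):
    """Multi-label target: one 0/1 indicator per label; several may be 1."""
    return [1 if label in label_set else 0 for label in LABELS]


docs = [tokenize(text) for text, _ in SAMPLES]
idf = fit_idf(docs)
X = [tfidf_vector(doc, idf) for doc in docs]      # sparse feature vectors
Y = [encode_labels(labels) for _, labels in SAMPLES]  # indicator targets
```

Each row of `Y` can carry a family-link label and a valence label at the same time, which is what distinguishes the multi-label setting from ordinary single-label classification. In the baseline scenario, vectors such as `X` would then be passed to one of the classical classifiers named in the abstract.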
