Dataset Construction and Opinion Holder Detection Using Pre-trained Models

Al-Mahmud, Kazutaka Shimada
International Journal of Service and Knowledge Management, published 2023-01-01. DOI: 10.52731/ijskm.v7.i2.779 (https://doi.org/10.52731/ijskm.v7.i2.779)

Abstract

With the growing prevalence of the Internet, more and more people and entities express opinions on online platforms such as Facebook, Twitter, and Amazon. Because tracking online opinion trends manually is becoming impossible, an automatic approach to detecting opinion holders is essential for identifying specific concerns about a particular topic, product, or problem. Opinion holder detection comprises two steps: determining whether an opinion holder is present in a text, and identifying the opinion holder. The present study examines both steps. First, we approach the detection of presence as a binary classification problem: INSIDE or OUTSIDE. Then, we treat the identification of opinion holders as a sequence labeling task and prepare an appropriate English-language dataset. We employ three pre-trained models for the opinion holder detection task: BERT, DistilBERT, and contextual string embedding (CSE). For the binary classification task, we apply a logistic regression model on the top layers of the BERT and DistilBERT models and compare the models' performance in terms of F1 score and accuracy. Experimental results show that DistilBERT performed better, with an F1 score of 0.901 and an accuracy of 0.924. For the opinion holder identification task, we use both feature-based and fine-tuning-based architectures, and we combine CSE and a conditional random field (CRF) with BERT and DistilBERT. For the feature-based architecture, we use five models: CSE+CRF, BERT+CRF, (BERT&CSE)+CRF, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. For the fine-tuning-based architecture, we use six models: BERT, BERT+CRF, (BERT&CSE)+CRF, DistilBERT, DistilBERT+CRF, and (DistilBERT&CSE)+CRF. All models are evaluated in terms of F1 score and processing time. The experimental results indicate that the feature-based and fine-tuning-based (DistilBERT&CSE)+CRF models both yielded the best performance, with an F1 score of 0.9453. However, feature-based CSE+CRF had the lowest processing time, 49 s, while yielding an F1 score comparable to that of the best-performing models.
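The two steps and the evaluation metrics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tag names `B-HOLDER`, `I-HOLDER`, and `O`, the helper functions, and the example sentence are all hypothetical, since the abstract does not specify the tagging scheme. The sketch also assumes that INSIDE means at least one opinion holder appears in the text, and that INSIDE is the positive class when computing F1.

```python
from typing import List, Tuple

# Hypothetical tag names; the abstract does not state the paper's tagging scheme.
B, I, O = "B-HOLDER", "I-HOLDER", "O"


def extract_holders(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect contiguous B-/I- tagged tokens into opinion holder strings."""
    holders: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == B:
            # A new holder starts; close any holder that was still open.
            if current:
                holders.append(" ".join(current))
            current = [token]
        elif tag == I and current:
            current.append(token)
        else:
            # An O tag (or an orphan I tag) closes the open holder, if any.
            if current:
                holders.append(" ".join(current))
            current = []
    if current:
        holders.append(" ".join(current))
    return holders


def presence_label(tags: List[str]) -> str:
    """Assumed reading of step one: INSIDE if any token carries a holder tag."""
    return "INSIDE" if any(tag != O for tag in tags) else "OUTSIDE"


def accuracy_and_f1(
    gold: List[str], pred: List[str], positive: str = "INSIDE"
) -> Tuple[float, float]:
    """Accuracy and F1 for binary labels, with `positive` as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    tokens = ["The", "senator", "said", "the", "bill", "is", "flawed", "."]
    tags = [B, I, O, O, O, O, O, O]
    print(extract_holders(tokens, tags))
    print(presence_label(tags))
    gold = ["INSIDE", "OUTSIDE", "INSIDE", "OUTSIDE"]
    pred = ["INSIDE", "INSIDE", "OUTSIDE", "OUTSIDE"]
    print(accuracy_and_f1(gold, pred))
```

In the study itself, the labels would come from the pre-trained models (for example, a CRF layer over BERT or CSE embeddings) rather than being written by hand; the sketch only shows how tagged output maps to holders and how the reported metrics are defined.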