Exploring the impact of training datasets on Turkish stance detection

IF 1.2 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Turkish Journal of Electrical Engineering and Computer Sciences | Pub Date: 2023-11-30 | DOI: 10.55730/1300-0632.4043

Muhammed Said Zengin, Berk Utku Yenisey, Mucahid Kutlu

Abstract

Stance detection has garnered considerable attention from researchers due to its broad range of applications, including fact-checking and social computing. While state-of-the-art stance detection models are usually based on supervised machine learning methods, their effectiveness relies heavily on the quality of the training data. This problem is more pronounced in the stance detection task because the stance of a text is intimately tied to the target under consideration. While numerous datasets exist for stance detection, determining their suitability for a specific target can be challenging. In this work, we focus on Turkish stance detection and explore the impact of training data on model performance. In particular, we fine-tune a BERT model with various datasets and assess its performance when the test data matches or differs from the training data in terms of target and domain. In addition, given the scarcity of resources for Turkish stance detection, we investigate i) whether we can use existing datasets in other languages in a cross-lingual setup, and ii) the effectiveness of data augmentation with simple automatic labeling methods. To conduct our experiments, we also create new Turkish stance detection datasets for various targets in different domains. Our comprehensive experiments yield the following findings. 1) Using training data with multiple targets in the same domain yields high performance, as the model is able to learn more characteristics of expressing stance from the additional data. 2) The domain of the training data plays a crucial role in achieving high performance. 3) Automatically generated data enhances performance when combined with manually annotated data. 4) Training solely on Turkish data outperforms training on a combination of Turkish and English data. Overall, our study points out the importance of creating annotated Turkish datasets for different domains to achieve high performance in stance detection.
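The same/different target and domain comparisons described in the abstract can be organized as a grid of train/test pairs. The following is a minimal illustrative sketch, not the authors' code: the `StanceDataset` class, the `relation` and `evaluation_grid` functions, the relation labels, and the placeholder dataset names are all assumptions introduced here, and the actual fine-tuning and scoring steps are left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StanceDataset:
    """Hypothetical description of one stance dataset (names are placeholders)."""
    name: str      # dataset identifier
    target: str    # the entity or claim the stance is about
    domain: str    # topical domain of the target
    language: str  # language code, e.g. "tr" or "en"


def relation(train: StanceDataset, test: StanceDataset) -> str:
    """Label how a training set relates to a test set."""
    if train.language != test.language:
        return "cross-lingual"
    if train.target == test.target:
        # In practice, train and test would be disjoint splits of this data.
        return "same-target"
    if train.domain == test.domain:
        return "cross-target, same-domain"
    return "cross-domain"


def evaluation_grid(datasets):
    """Enumerate every (train, test) pair together with its relation label."""
    return [
        (train.name, test.name, relation(train, test))
        for train, test in product(datasets, repeat=2)
    ]


if __name__ == "__main__":
    # Placeholder datasets; none of these names come from the paper.
    datasets = [
        StanceDataset("tr_a", "target_a", "domain_x", "tr"),
        StanceDataset("tr_b", "target_b", "domain_x", "tr"),
        StanceDataset("tr_c", "target_c", "domain_y", "tr"),
        StanceDataset("en_a", "target_a", "domain_x", "en"),
    ]
    for train_name, test_name, label in evaluation_grid(datasets):
        print(f"train={train_name:5s} test={test_name:5s} -> {label}")
```

Each row of the grid would correspond to one fine-tuning run followed by evaluation on the test set, so results can be grouped by relation label to compare the settings the abstract lists.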

Source journal

Turkish Journal of Electrical Engineering and Computer Sciences (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 2.90

Self-citation rate: 9.10%

Review time: 6.9 months

Journal description: The Turkish Journal of Electrical Engineering & Computer Sciences is published electronically six times a year by the Scientific and Technological Research Council of Turkey (TÜBİTAK). It accepts English-language manuscripts in the areas of power and energy, environmental sustainability and energy efficiency, electronics, industry applications, control systems, information and systems, applied electromagnetics, communications, signal and image processing, tomographic image reconstruction, face recognition, biometrics, speech processing, video processing and analysis, object recognition, classification, feature extraction, parallel and distributed computing, cognitive systems, interaction, robotics, digital libraries and content, personalized healthcare, ICT for mobility, sensors, and artificial intelligence. Contribution is open to researchers of all nationalities.