Whodunit? Learning to Contrast for Authorship Attribution

Q3 Environmental Science AACL Bioflux Pub Date : 2022-09-23 DOI:10.48550/arXiv.2209.11887

Bo Ai, Yuchen Wang, Yugin Tan, Samson Tan

Citations: 5

Abstract

Authorship attribution is the task of identifying the author of a given text. The key is finding representations that can differentiate between authors. Existing approaches typically use manually designed features that capture a dataset’s content and style, but these approaches are dataset-dependent and yield inconsistent performance across corpora. In this work, we propose to learn author-specific representations by fine-tuning pre-trained generic language representations with a contrastive objective (Contra-X). We show that Contra-X learns representations that form highly separable clusters for different authors. It advances the state-of-the-art on multiple human and machine authorship attribution benchmarks, enabling improvements of up to 6.8% over cross-entropy fine-tuning. However, we find that Contra-X improves overall accuracy at the cost of sacrificing performance for some authors. Resolving this tension will be an important direction for future work. To the best of our knowledge, we are the first to integrate contrastive learning with pre-trained language model fine-tuning for authorship attribution.
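The abstract does not give the exact form of the contrastive objective, so the following is only a minimal sketch of a generic supervised contrastive loss, assuming texts by the same author are treated as positives and all other texts in the batch as negatives. The function names, the temperature value, and the toy 2-D embeddings are illustrative assumptions, not details from the paper. In fine-tuning, a term like this would typically be added to the usual cross-entropy classification loss.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supervised_contrastive_loss(embeddings, authors, temperature=0.1):
    """Generic supervised contrastive loss (an assumed form, not the paper's exact one).

    For each anchor text, same-author texts are pulled closer and all other
    texts in the batch are pushed away. Anchors with no same-author partner
    in the batch are skipped.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and authors[j] == authors[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of picking a same-author sample.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy batch: four hypothetical 2-D text embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that match the geometric clusters vs. labels that cut across them.
separable = supervised_contrastive_loss(toy_embeddings, ["a", "a", "b", "b"])
entangled = supervised_contrastive_loss(toy_embeddings, ["a", "b", "a", "b"])
print(separable < entangled)
```

The loss is low when each author's texts already sit close together and high when same-author texts are far apart, which is the pressure that would encourage the separable per-author clusters the abstract describes.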




Source journal

AACL Bioflux Environmental Science-Management, Monitoring, Policy and Law

CiteScore

1.40

Self-citation rate

0.00%

Articles published