Point Break: Surfing Heterogeneous Data for Subtitle Segmentation

Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020. Pub Date: 1900-01-01. DOI: 10.4000/books.aaccademia.8620

Alina Karakanta, Matteo Negri, M. Turchi

Citations: 3

Abstract

To achieve their purpose of transmitting information, subtitles need to be easily readable. Segmenting subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task, and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data to train models that automatically segment a sentence into subtitles. We show that even a minimal amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.


