Sentence Identification with BOS and EOS Label Combinations

Takuma Udagawa, H. Kanayama, Issei Yoshida
DOI: 10.48550/arXiv.2301.13352 (https://doi.org/10.48550/arXiv.2301.13352)
Journal: Findings (Sydney (N.S.W.), volume 1 1, pages 343-358
Published: 2023-01-31

Abstract

The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, in which an input text is split into consecutive sentences, with the end of the sentence (EOS) serving as the boundary. This task formulation relies on the strong assumption that the input text consists only of sentences, or what we call sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs), such as metadata, sentence fragments, and nonlinguistic markers, which are unreasonable or undesirable to treat as part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method that combines beginning of the sentence (BOS) and EOS labels to determine the most probable SUs and NSUs using dynamic programming. To evaluate this task, we design an automatic, language-independent procedure to convert the Universal Dependencies corpora into sentence identification benchmarks. Finally, our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines that use only EOS labels.
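The abstract does not spell out how the BOS and EOS labels are scored or combined, so the following is only a minimal sketch of one way such a dynamic program could look, not the authors' implementation. It assumes each token comes with an independent probability of being a BOS and of being an EOS (for example, from a token classifier), that an SU is a span starting at a BOS token and ending at an EOS token, and that every token outside an SU belongs to an NSU. The function name `identify_sentential_units` and the helper `_logit` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def _logit(p: float, eps: float = 1e-9) -> float:
    """Log-odds of p, clipped so that 0.0 and 1.0 do not produce infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p) - math.log(1.0 - p)


def identify_sentential_units(
    bos_probs: Sequence[float], eos_probs: Sequence[float]
) -> List[Tuple[int, int]]:
    """Return the most probable SU spans as inclusive (start, end) token indices.

    Assumption (not from the paper): BOS and EOS labels are scored
    independently per token. Labeling a token as BOS or EOS then changes the
    total log-probability by that token's log-odds, so the task reduces to
    choosing non-overlapping spans that maximize the summed log-odds of their
    BOS and EOS tokens. Tokens outside every span form the NSUs.
    """
    if len(bos_probs) != len(eos_probs):
        raise ValueError("bos_probs and eos_probs must have the same length")
    n = len(bos_probs)

    # best[k]: best gain over the first k tokens with every SU closed.
    best = [0.0] * (n + 1)
    # back[k]: start index of an SU ending at token k - 1, or None.
    back: List[Optional[int]] = [None] * (n + 1)

    open_score = -math.inf  # best gain with one SU currently open
    open_start = -1         # start token of that open SU

    for t in range(n):
        # Option 1: open a new SU at token t.
        candidate = best[t] + _logit(bos_probs[t])
        if candidate > open_score:
            open_score, open_start = candidate, t
        # Option 2: close the open SU at token t (it may have opened at t).
        closed = open_score + _logit(eos_probs[t])
        if closed > best[t]:
            best[t + 1], back[t + 1] = closed, open_start
        else:
            best[t + 1] = best[t]  # token t stays outside any newly closed SU

    # Walk the back-pointers from the end to recover the chosen spans.
    spans: List[Tuple[int, int]] = []
    k = n
    while k > 0:
        start = back[k]
        if start is None:
            k -= 1
        else:
            spans.append((start, k - 1))
            k = start
    spans.reverse()
    return spans


# Toy input: "Title : Hello world . -- It works ."
bos = [0.10, 0.05, 0.90, 0.05, 0.05, 0.02, 0.90, 0.05, 0.05]
eos = [0.05, 0.10, 0.05, 0.05, 0.90, 0.02, 0.05, 0.05, 0.90]
print(identify_sentential_units(bos, eos))  # [(2, 4), (6, 8)]
```

In this toy input, tokens 0-1 ("Title :") and token 5 ("--") fall outside every span and are therefore left as NSUs. An EOS-only segmenter would have to attach them to a neighboring sentence, which is the limitation the abstract describes.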