Comparison of text preprocessing methods

Natural Language Engineering · Pub Date: 2022-06-13 · DOI: 10.1017/S1351324922000213
Christine P. Chai
Impact Factor 2.3 · JCR Q3 (Computer Science, Artificial Intelligence) · CAS Zone 3 (Computer Science)
Volume 29, Issue 1, pages 509–553 · Citations: 7

Abstract

Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
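The preprocessing steps named in the abstract are typically chained into a single pipeline. The following is a minimal sketch of such a pipeline, assuming the NLTK library and its standard resources (punkt, stopwords, wordnet) are available; the sample sentence and the multiword-expression list are purely illustrative and not taken from the article.

```python
# Minimal text preprocessing pipeline sketch using NLTK (illustrative only).
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import MWETokenizer, word_tokenize
from nltk.util import ngrams

# One-time downloads of the required NLTK resources.
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

text = "Part-of-speech tagging benefits from precise tokenization of New York text."

# 1. Text normalization: lowercase the raw string.
normalized = text.lower()

# 2. Tokenization.
tokens = word_tokenize(normalized)

# 3. Handling punctuation: drop tokens that are pure punctuation.
tokens = [t for t in tokens if t not in string.punctuation]

# 4. Identifying multiword expressions: merge known expressions into single tokens.
mwe_tokenizer = MWETokenizer([("new", "york")], separator="_")
tokens = mwe_tokenizer.tokenize(tokens)

# 5. Removing stopwords.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 6. Lemmatization (stemming would use nltk.stem.PorterStemmer instead).
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

# 7. N-gramming: build bigrams from the cleaned token stream.
bigrams = list(ngrams(tokens, 2))

print(tokens)
print(bigrams)
```

In practice, the exact ordering of these steps (for example, whether stopword removal happens before or after multiword-expression merging) should be adapted to the corpus and the downstream NLP task, which is the point the article argues.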
Source Journal

Natural Language Engineering (Computer Science, Artificial Intelligence)
CiteScore: 5.90 · Self-citation rate: 12.00% · Articles per year: 60 · Review time: >12 weeks

Journal description: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.
Latest articles in this journal

Start-up activity in the LLM ecosystem
Anisotropic span embeddings and the negative impact of higher-order inference for coreference resolution: An empirical analysis
Automated annotation of parallel bible corpora with cross-lingual semantic concordance
How do control tokens affect natural language generation tasks like text simplification
Emerging trends: When can users trust GPT, and when should they intervene?