Exploiting Parts-of-Speech for Improved Textual Modeling of Code-Switching Data

2018 Twenty Fourth National Conference on Communications (NCC). Pub Date: 2018-02-01. DOI: 10.1109/NCC.2018.8600097

Ganji Sreeram, R. Sinha


Citations: 3

Abstract

Code-switching has recently gained considerable attention and has emerged as an active area of research. In bilingual communities, speakers commonly embed words and phrases of a non-native language into the syntax of a native language in their day-to-day communication. Although code-switching is a global phenomenon among multilingual communities, very limited acoustic and linguistic resources are available as yet. For developing effective speech-based applications, the importance of existing language technologies being able to handle code-switched data cannot be overemphasized. Code-switching is broadly classified into two modes: inter-sentential and intra-sentential. In this work, we study the intra-sentential problem in the context of the code-switching language modeling task. The salient contributions of this paper include: (i) the creation of a Hindi-English code-switching text corpus by crawling a few blogging sites that educate readers about the usage of the Internet, and (ii) the exploration of parts-of-speech features for more effective modeling of Hindi-English code-switched data by monolingual language models trained on native (Hindi) language data.
