Hybrid Distance-based, CNN and Bi-LSTM System for Dictionary Expansion

Infocommunications Journal (IF 0.9, Q4, Telecommunications). Pub Date: 2020-01-01. DOI: 10.36244/ICJ.2020.4.2
Béla Benedek Szakács, T. Mészáros

Abstract

Dictionaries like WordNet can help in a variety of Natural Language Processing applications by providing additional morphological data. They can be used in Digital Humanities research, in building knowledge graphs, and in other applications. Creating dictionaries from large corpora of natural-language text has not been a primary focus of research, as other tasks (such as chatbots) have dominated the field, but such dictionaries can be very useful tools for analysing texts. Even for contemporary texts, categorizing words according to their dictionary entries is a complex task, and for less conventional texts (in old or less-researched languages) it is even harder to solve this problem automatically. Our task was to create software that helps in expanding a dictionary containing word forms and in tagging unprocessed text. We used a manually created corpus for training and testing the model. We created a combination of Bidirectional Long Short-Term Memory networks, convolutional networks and a distance-based solution that outperformed other existing solutions. While manual post-processing of the tagged text is still needed, the system significantly reduces the amount required.
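
The abstract does not say which distance measure the distance-based component uses, nor how it is combined with the neural networks. As an illustration only, the sketch below assumes Levenshtein edit distance and shows how an unseen word form might be mapped to the dictionary entry of its closest known form. The names `edit_distance`, `nearest_entry`, `max_distance` and the toy dictionary are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_entry(word_form, dictionary, max_distance=2):
    """Return the headword of the closest known word form.

    Returns None when no known form lies within max_distance edits
    (an assumed cut-off, chosen here only for illustration).
    """
    best_headword, best_dist = None, max_distance + 1
    for known_form, headword in dictionary.items():
        d = edit_distance(word_form, known_form)
        if d < best_dist:
            best_headword, best_dist = headword, d
    return best_headword


# Toy dictionary mapping known word forms to headwords (invented data).
toy_dictionary = {"walked": "walk", "walking": "walk", "houses": "house"}
```

With this toy data, `nearest_entry("walkd", toy_dictionary)` returns `"walk"`, because `"walkd"` is one edit away from `"walked"`, while `nearest_entry("xyzzy", toy_dictionary)` returns `None`. In a hybrid system of the kind the abstract describes, a lookup like this could plausibly handle near-duplicates of known forms and leave harder cases to the learned models, but that division of labour is an assumption here.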