A relational database model and prototype for storing diverse discrete linguistic data

Alexander Magidow
Journal: J. Lang. Technol. Comput. Linguistics
Published: 2015-07-01
DOI: 10.21248/jlcl.30.2015.194 (https://doi.org/10.21248/jlcl.30.2015.194)
Citations: 1

Abstract

This article describes a model for storing multiple forms of linguistic data within a relational database, as developed and tested through a prototype database for storing data from Arabic dialects. A challenge that typically confronts linguistic documentation projects is the need for a flexible data model that can be adapted to the growing needs of a project (Dimitriadis, 2006). Contributors to linguistic databases typically cannot predict exactly which attributes of their data they will need to store, and therefore the initial design of the database may need to change over time. Many projects take advantage of the flexibility of XML and RDF to allow for continuing revisions to the data model. For some projects, there may be a compelling need to use a relational database system, though some approaches to relational database design may not be flexible enough to allow for adaptation over time (Dimitriadis, 2006). The goal of this article is to describe a relational database model which can adapt easily to storing new data types as a project evolves. It both describes a general data model and shows its implementation within a working project. The model is primarily intended for storing discrete linguistic elements (phonemes, morphemes including general lexical data, sentences) as opposed to text corpora, and would be expected to store data on the order of thousands to hundreds of thousands of rows. The relational model described in this paper is centered around the linguistic datum, encoded as a string of characters, associated in a many-to-many relationship with ‘tags,’ and in many-to-many named relationships with other datums. For this reason, the model will be referred to as the ‘tag-and-relationship’ model. The combination of tags and relationships allows the database to store a wide variety of linguistic data.
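The tag-and-relationship idea described above can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions, not the schema used in the article or in DAD: the table names (`datum`, `tag`, `datum_tag`, `relationship`), the helper functions, the tag names, and the relationship name `plural_of` are all hypothetical, and the two sample forms are ordinary Standard Arabic words chosen for the example rather than data from the project.

```python
import sqlite3

# Hypothetical schema: one table of datums (strings), one of tags,
# a junction table for the many-to-many datum/tag link, and a table of
# named, directed relationships between pairs of datums.
SCHEMA = """
CREATE TABLE datum (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL
);
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE datum_tag (
    datum_id INTEGER NOT NULL REFERENCES datum(id),
    tag_id   INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (datum_id, tag_id)
);
CREATE TABLE relationship (
    source_id INTEGER NOT NULL REFERENCES datum(id),
    target_id INTEGER NOT NULL REFERENCES datum(id),
    name      TEXT NOT NULL,
    PRIMARY KEY (source_id, target_id, name)
);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_datum(conn, form, tags=()):
    """Insert a datum and attach tags, creating unseen tags on the fly."""
    datum_id = conn.execute(
        "INSERT INTO datum (form) VALUES (?)", (form,)
    ).lastrowid
    for name in tags:
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        (tag_id,) = conn.execute(
            "SELECT id FROM tag WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO datum_tag (datum_id, tag_id) VALUES (?, ?)",
            (datum_id, tag_id),
        )
    return datum_id


def relate(conn, source_id, target_id, name):
    """Record a named relationship from one datum to another."""
    conn.execute(
        "INSERT INTO relationship (source_id, target_id, name) VALUES (?, ?, ?)",
        (source_id, target_id, name),
    )


def tags_of(conn, datum_id):
    """Return the tag names attached to a datum, sorted alphabetically."""
    rows = conn.execute(
        "SELECT tag.name FROM tag "
        "JOIN datum_tag ON tag.id = datum_tag.tag_id "
        "WHERE datum_tag.datum_id = ? ORDER BY tag.name",
        (datum_id,),
    )
    return [name for (name,) in rows]


def related(conn, source_id, name):
    """Return the forms that a datum points to via the named relationship."""
    rows = conn.execute(
        "SELECT datum.form FROM relationship "
        "JOIN datum ON datum.id = relationship.target_id "
        "WHERE relationship.source_id = ? AND relationship.name = ? "
        "ORDER BY datum.form",
        (source_id, name),
    )
    return [form for (form,) in rows]


if __name__ == "__main__":
    conn = open_db()
    singular = add_datum(conn, "kitāb", ["noun", "singular"])
    plural = add_datum(conn, "kutub", ["noun", "plural"])
    relate(conn, plural, singular, "plural_of")
    print(tags_of(conn, plural))               # ['noun', 'plural']
    print(related(conn, plural, "plural_of"))  # ['kitāb']
```

In a layout of this kind, recording a new kind of attribute or a new kind of link between datums amounts to inserting rows with a new tag name or relationship name, with no change to the table definitions. That is one plausible reading of how such a model adapts as a project grows.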
This data model was developed in tandem with a project to encode linguistic data from Arabic dialects (the “Database of Arabic Dialects”, DAD). Arabic is an extremely diverse language group, with dialects stretching from Mauritania to Afghanistan,