A relational database model and prototype for storing diverse discrete linguistic data

J. Lang. Technol. Comput. Linguistics. Pub Date: 2015-07-01. DOI: 10.21248/jlcl.30.2015.194

Alexander Magidow


Citation count: 1

Abstract

This article describes a model for storing multiple forms of linguistic data within a relational database, as developed and tested through a prototype database for storing data from Arabic dialects. A challenge that typically confronts linguistic documentation projects is the need for a flexible data model that can be adapted to the growing needs of a project (Dimitriadis, 2006). Contributors to linguistic databases typically cannot predict exactly which attributes of their data they will need to store, and therefore the initial design of the database may need to change over time. Many projects take advantage of the flexibility of XML and RDF to allow for continuing revisions to the data model. For some projects, there may be a compelling need to use a relational database system, though some approaches to relational database design may not be flexible enough to allow for adaptation over time (Dimitriadis, 2006). The goal of this article is to describe a relational database model which can adapt easily to storing new data types as a project evolves. It both describes a general data model and shows its implementation within a working project. The model is primarily intended for storing discrete linguistic elements (phonemes, morphemes including general lexical data, sentences) as opposed to text corpora, and would be expected to store data on the order of thousands to hundreds of thousands of rows. The relational model described in this paper is centered around the linguistic datum, encoded as a string of characters, associated in a many-to-many relationship with ‘tags,’ and in many-to-many named relationships with other datums. For this reason, the model will be referred to as the ‘tag-and-relationship’ model. The combination of tags and relationships allows the database to store a wide variety of linguistic data.
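The tag-and-relationship idea described above can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The table names, column names, helper functions, and the sample word pair are all assumptions made for illustration; they are not taken from the article or from the Database of Arabic Dialects schema.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative
# assumptions, not the schema used by the article's prototype.
SCHEMA = """
CREATE TABLE datum (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL              -- the datum, stored as a character string
);
CREATE TABLE tag (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE datum_tag (            -- many-to-many link between datums and tags
    datum_id INTEGER NOT NULL REFERENCES datum(id),
    tag_id   INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (datum_id, tag_id)
);
CREATE TABLE relationship (         -- named many-to-many link between datums
    source_id INTEGER NOT NULL REFERENCES datum(id),
    target_id INTEGER NOT NULL REFERENCES datum(id),
    name      TEXT NOT NULL,
    PRIMARY KEY (source_id, target_id, name)
);
"""


def add_datum(conn, form, tags=()):
    """Insert a datum and attach tags, creating any tag that does not exist yet."""
    datum_id = conn.execute(
        "INSERT INTO datum (form) VALUES (?)", (form,)
    ).lastrowid
    for name in tags:
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO datum_tag (datum_id, tag_id) "
            "SELECT ?, id FROM tag WHERE name = ?",
            (datum_id, name),
        )
    return datum_id


def relate(conn, source_id, target_id, name):
    """Record a named, directed relationship between two datums."""
    conn.execute(
        "INSERT INTO relationship (source_id, target_id, name) VALUES (?, ?, ?)",
        (source_id, target_id, name),
    )


def tags_of(conn, datum_id):
    """Return the tag names attached to a datum, sorted alphabetically."""
    rows = conn.execute(
        "SELECT tag.name FROM tag "
        "JOIN datum_tag ON tag.id = datum_tag.tag_id "
        "WHERE datum_tag.datum_id = ? ORDER BY tag.name",
        (datum_id,),
    )
    return [row[0] for row in rows]


def related(conn, source_id, name):
    """Return the forms of datums reached from source_id via the named relationship."""
    rows = conn.execute(
        "SELECT datum.form FROM relationship "
        "JOIN datum ON datum.id = relationship.target_id "
        "WHERE relationship.source_id = ? AND relationship.name = ? "
        "ORDER BY datum.form",
        (source_id, name),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Illustrative sample data, not drawn from the article.
    singular = add_datum(conn, "kitāb", ["noun", "singular"])
    plural = add_datum(conn, "kutub", ["noun", "plural"])
    relate(conn, plural, singular, "plural_of")
    print(tags_of(conn, plural))
    print(related(conn, plural, "plural_of"))
```

In this sketch, recording a new kind of attribute means inserting a new row in `tag` or using a new relationship name, not altering any table definition, which is the kind of flexibility the abstract attributes to the model.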
This data model was developed in tandem with a project to encode linguistic data from Arabic dialects (the “Database of Arabic Dialects”, DAD). Arabic is an extremely diverse language group, with dialects stretching from Mauritania to Afghanistan,

