Automatic Acquisition of Corpus for Multimedia Applications

Communications of the IBIMA. Pub Date: 2011-02-05. DOI: 10.5171/2011.254926

Najeh Hajlaoui

Citations: 0

Abstract

Evaluations of tools (information retrieval systems, machine learning, speech recognition, machine translation, automatic acquisition of data, etc.) are organized annually through evaluation campaigns (TREC, ELRA, ESTER, IWSLT, etc.). Building an ad hoc evaluation corpus for these campaigns is a complex task that is currently done manually and at high cost. Such a corpus is highly specialized, since it must answer an application need in a precise context, and automating its construction is a challenge whose solution would significantly help the organization of these campaigns. As a contribution to this challenge, we propose, in the context of multimedia information retrieval, an approach for the multilevel extension of a small application corpus into a larger one, based on detecting intersections between the two corpora in terms of lemmas that have the same grammatical label. This yields a list of appropriate terminology. To obtain it, we use several tools (internal and external to our laboratory), and we try to evaluate them in order to keep consistency and coherence with the original corpus.
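
The intersection step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes both corpora have already been lemmatized and tagged by some external tool, and the function name `shared_lemmas`, the tag names, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# A token is assumed to be a (lemma, grammatical label) pair
# produced beforehand by a lemmatizer and tagger of your choice.
Token = Tuple[str, str]


def shared_lemmas(
    small_corpus: Iterable[Token], large_corpus: Iterable[Token]
) -> Dict[Token, Tuple[int, int]]:
    """Return the (lemma, label) pairs present in both corpora.

    Each pair maps to its frequency in the small and the large corpus.
    A lemma only matches when its grammatical label is identical.
    """
    small_counts = Counter(small_corpus)
    large_counts = Counter(large_corpus)
    common = small_counts.keys() & large_counts.keys()
    return {pair: (small_counts[pair], large_counts[pair]) for pair in common}


# Toy data (hypothetical): "search" appears in both corpora,
# but with different labels, so it is not part of the intersection.
small = [("image", "NOUN"), ("search", "VERB"), ("video", "NOUN")]
large = [
    ("image", "NOUN"),
    ("search", "NOUN"),
    ("video", "NOUN"),
    ("video", "NOUN"),
]

result = shared_lemmas(small, large)
for pair in sorted(result):
    print(pair, result[pair])
```

Matching on the pair rather than on the lemma alone is what keeps a verb in the small corpus from being aligned with a homographic noun in the large one.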

