Proceedings of the Workshop on Multilingual Information Access (MIA): Latest Articles


An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language

Proceedings of the Workshop on Multilingual Information Access (MIA)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.2

Salim Sazzed

Discourse modes help readers understand the conventions and purposes of the different forms of language used in communication. In this study, we introduce a discourse-mode-annotated corpus for Bangla (also referred to as Bengali), a low-resource language. The corpus provides sentence-level annotation of three discourse modes (narrative, descriptive, and informative) for text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose linguistic aspects of the discourse modes, such as class distributions and average sentence lengths. To determine the discourse mode automatically, we apply classical machine learning (CML) classifiers with n-gram-based statistical features and a fine-tuned language model based on BERT (Bidirectional Encoder Representations from Transformers). We observe that the fine-tuned BERT-based approach yields more promising results than the n-gram-based CML classifiers. The dataset, the first of its kind in Bangla, and the accompanying evaluation provide baselines for automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.
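To make the n-gram baseline concrete, the following is a minimal sketch of a sentence classifier over word unigrams and bigrams. Everything in it is an assumption for illustration: the abstract does not say which CML classifiers or n-gram orders were used, so a multinomial Naive Bayes with add-one smoothing stands in here, and the English toy sentences are invented (the actual corpus is Bangla).

```python
import math
from collections import Counter, defaultdict


def word_ngrams(sentence, n_max=2):
    """Return all word n-grams of order 1..n_max from a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts (a stand-in, not the paper's classifier)."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = word_ngrams(sentence)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for gram in word_ngrams(sentence):
                if gram in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data covering the three discourse modes named in the abstract.
train = [
    ("she walked to the river and sat down", "narrative"),
    ("then he opened the door and ran away", "narrative"),
    ("the house was old with a red roof", "descriptive"),
    ("the river was wide and calm", "descriptive"),
    ("the town has a population of ten thousand", "informative"),
    ("the school has three hundred students", "informative"),
]
model = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])
print(model.predict("the school has a population of three thousand"))
```

A real baseline would replace the toy data with the annotated corpus and would typically compare several classifiers and feature weightings.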



Citations: 1

Zero-shot cross-lingual open domain question answering

Proceedings of the Workshop on Multilingual Information Access (MIA)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.9

Sumit Agarwal, Suraj Tripathi, T. Mitamura, C. Rosé

People who speak different languages often search for information cross-lingually: they ask questions in their own language and expect the answer in that same language, even when the evidence lies in another language. In this paper, we present our approach to this task of cross-lingual open-domain question answering. Our proposed method employs a passage reranker, the fusion-in-decoder technique for generation, and a wiki data entity-based post-processing system to tackle the inability to generate entities across all languages. Our end-to-end pipeline improves on the baseline CORA model on the XOR-TyDi dataset by 3 points in F1 and 4.6 points in EM. We also evaluate our proposed techniques in the zero-shot setting using the MKQA dataset and show an improvement of 5 points in F1 for high-resource and 3 points for low-resource zero-shot languages. Our team's (CMUmQA) submission to the MIA shared task ranked 1st in the constrained setup on the dev set and 2nd in the test setting.
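For readers unfamiliar with the two metrics reported above, here is a simplified sketch of exact match (EM) and token-level F1 for a single predicted answer. It assumes lowercasing and whitespace tokenization only; the official XOR-TyDi and MIA evaluation scripts apply their own normalization and language-specific tokenization, so the function names and behavior here are illustrative rather than a reproduction of those scripts.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the answers are identical after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris "))
print(token_f1("the city of Paris", "Paris"))
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the two improvements in the abstract differ.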


Citations: 1

Complex Word Identification in Vietnamese: Towards Vietnamese Text Simplification

Proceedings of the Workshop on Multilingual Information Access (MIA)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.6

Phuong-Thai Nguyen, David Kauchak

Text simplification has been extensively researched in English but has not been investigated in Vietnamese. We focus on the Vietnamese-specific Complex Word Identification task, often the first step in Lexical Simplification (Shardlow, 2013). We examine three Vietnamese datasets constructed for other Natural Language Processing tasks and show that, as in other languages, frequency is a strong signal of whether a word is complex, with a mean accuracy of 86.87%. Across the datasets, we find that the 10% most frequent words in many corpora can be labelled as simple and the rest as complex, though this is more variable for smaller corpora. We also examine how human annotators perform at this task. Given its subjective nature, there is a fair amount of variability in which words are seen as difficult, though majority results are more consistent.
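The frequency rule described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the function names, the tie-breaking rule, and the placeholder tokens are assumptions, and the 10% cutoff is applied to distinct word types ranked by corpus frequency.

```python
from collections import Counter


def frequency_baseline(tokens, simple_fraction=0.10):
    """Label the most frequent `simple_fraction` of word types simple, the rest complex."""
    counts = Counter(tokens)
    # Rank word types by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    cutoff = max(1, round(len(ranked) * simple_fraction))
    simple = set(ranked[:cutoff])
    return {w: ("simple" if w in simple else "complex") for w in ranked}


def accuracy(predicted, gold):
    """Share of gold-labelled words whose predicted label matches."""
    shared = [w for w in gold if w in predicted]
    if not shared:
        return 0.0
    return sum(predicted[w] == gold[w] for w in shared) / len(shared)


# Toy corpus with 10 placeholder word types (hypothetical, not Vietnamese data).
tokens = ["w0"] * 6 + ["w1"] * 3 + ["w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9"]
labels = frequency_baseline(tokens)
gold = {"w0": "simple", "w1": "simple", "w9": "complex"}
print(labels["w0"], labels["w1"], accuracy(labels, gold))
```

On a small corpus such as this toy one, a single word type falls under the cutoff, which echoes the abstract's remark that the threshold is less stable for smaller corpora.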


Citations: 0

Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms

Proceedings of the Workshop on Multilingual Information Access (MIA)

Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.7

Gengyu Wang, Cheng Qian, Lin Pan, Haode Qi, L. Kunc, Saloni Potdar

Current virtual assistant (VA) platforms are constrained by the limited number of languages they support. In these intricate platforms, every component, such as the tokenizer and intent classifier, is engineered for specific languages. Thus, supporting a new language is a resource-intensive operation requiring expensive re-training and re-designing. In this paper, we propose a benchmark for evaluating language-agnostic intent classification, the most critical component of VA platforms. To ensure the benchmark is challenging and comprehensive, we include 29 public and internal datasets across 10 low-resource languages and evaluate various training and testing settings, considering both accuracy and training time. The benchmarking results show that Watson Assistant, among 7 commercial VA platforms and pre-trained multilingual language models (LMs), demonstrates close-to-best accuracy with the best accuracy-training time trade-off.


Citations: 2
