
Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval: Latest Publications

Feature Extraction Technique Based on Conv1D and Conv2D Network for Thai Speech Emotion Recognition
Naris Prombut, S. Waijanya, Nuttachot Promrit
Speech Emotion Recognition is one of the challenges in the Natural Language Processing (NLP) area. Many factors are used to identify emotions in speech, such as pitch, intensity, frequency, duration, and the speaker's nationality. This paper implements a speech emotion recognition model specifically for the Thai language, classifying speech into 5 emotions: Angry, Frustrated, Neutral, Sad, and Happy. This research uses a dataset from the VISTEC-depa AI Research Institute of Thailand. There are 21,562 sound clips (scripts), divided into 70% training data and 30% test data. We use the Mel spectrogram and Mel-frequency Cepstral Coefficients (MFCC) techniques for feature extraction, together with a 1D Convolutional Neural Network (Conv1D) and a 2D Convolutional Neural Network (Conv2D), to classify emotions. In the results, MFCC with Conv2D provides the highest accuracy, 80.59%, which is higher than the baseline study's 71.35%.
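As an illustration of the pipeline this abstract describes, the sketch below extracts MFCC features and feeds them to a small Conv2D classifier. It is a minimal sketch assuming librosa and Keras; the 40-coefficient MFCC setting, the fixed-length padding, and all layer sizes are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch: MFCC feature extraction + a Conv2D emotion classifier.
# Assumptions (not from the paper): 40 MFCCs, fixed-length clips, layer sizes.
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_MFCC, MAX_FRAMES, N_EMOTIONS = 40, 300, 5  # Angry/Frustrated/Neutral/Sad/Happy

def extract_mfcc(path, sr=16000):
    """Load a clip and return a fixed-size (N_MFCC, MAX_FRAMES, 1) MFCC image."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    mfcc = mfcc[:, :MAX_FRAMES]                      # truncate long clips
    pad = MAX_FRAMES - mfcc.shape[1]
    mfcc = np.pad(mfcc, ((0, 0), (0, pad)))          # zero-pad short clips
    return mfcc[..., np.newaxis]

model = models.Sequential([
    layers.Input(shape=(N_MFCC, MAX_FRAMES, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(N_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Padding every clip to the same number of frames is one simple way to give the 2D network a fixed-size "image" to convolve over; the paper's actual input shaping may differ.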
Citations: 4
Automated Intention Mining with Comparatively Fine-tuning BERT
Xuan Sun, Luqun Li, F. Mercaldo, Yichen Yang, A. Santone, F. Martinelli
In the field of software engineering, intention mining is an interesting but challenging task, whose goal is to understand user-generated texts well enough to capture requirements that are useful for software maintenance and evolution. Recently, BERT and its variants have achieved state-of-the-art performance on various natural language processing tasks such as machine translation, machine reading comprehension, and natural language inference. However, few studies have investigated the efficacy of pre-trained language models on this task. In this paper, we present a new baseline with a fine-tuned BERT model. Our method achieves state-of-the-art results on three benchmark data sets, outscoring baselines by a substantial margin. We also investigate the efficacy of the pre-trained BERT model at shallower network depths through a simple strategy for layer selection.
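A minimal sketch of fine-tuning BERT for intention classification with the Hugging Face Transformers library follows; the checkpoint, label count, learning rate, and toy example are assumptions, and the final lines only gesture at the layer-selection idea rather than reproduce the paper's strategy.

```python
# Minimal sketch: fine-tune BERT as an intention classifier (Hugging Face
# Transformers). Checkpoint, label count, and hyperparameters are assumptions.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

NUM_INTENTS = 4  # hypothetical number of intention categories
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_INTENTS)

texts = ["Please add an option to export the logs."]  # toy user-generated text
labels = torch.tensor([2])                             # hypothetical intent id
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over intent labels
loss.backward()
optimizer.step()

# A shallower variant in the spirit of layer selection: keep only the first
# k encoder layers before fine-tuning (k = 6 here is an assumption).
model.bert.encoder.layer = model.bert.encoder.layer[:6]
```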
Citations: 0
CBCP: A Method of Causality Extraction from Unstructured Financial Text
Lang Cao, Shihuangzhai Zhang, Juxing Chen
Extracting causality information from unstructured natural language text is a challenging problem in natural language processing, yet there are no mature, dedicated causality extraction systems. Most work uses basic sequence labeling methods, such as the BERT-CRF model, to extract causal elements from unstructured text, and the results are usually not good. At the same time, there are a large number of causal event relations in the field of finance. If we can extract financial causality at scale, this information will help us better understand the relationships between financial events and build related event evolutionary graphs in the future. In this paper, we propose a causality extraction method for this problem, named CBCP (Center-word-based BERT-CRF with Pattern extraction), which can directly extract cause elements and effect elements from unstructured text. Compared to the BERT-CRF model, our model incorporates the information of center words as prior conditions and performs better at entity extraction. Moreover, combining our method with patterns can further improve causality extraction. We then evaluate our method against the basic sequence labeling method, showing that it performs better than other basic extraction methods on causality extraction tasks in the finance field. Finally, we summarize our work and outline future work.
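CBCP couples a center-word-conditioned BERT-CRF tagger with extraction patterns; the tagger is not reproduced here, but the sketch below illustrates the pattern idea with hypothetical cue-word regexes that split a sentence into cause and effect spans. The patterns and the example sentence are my own illustrations, not the paper's.

```python
# Minimal sketch of the pattern component of causality extraction: cue-word
# regexes that split a sentence into cause and effect spans. These patterns
# are hypothetical stand-ins; CBCP pairs such patterns with a BERT-CRF tagger.
import re

PATTERNS = [
    re.compile(r"(?:due to|because of|owing to)\s+(?P<cause>.+?),\s*(?P<effect>.+)", re.I),
    re.compile(r"(?P<cause>.+?)\s+(?:leads to|results in|causes)\s+(?P<effect>.+)", re.I),
]

def extract_causality(sentence: str):
    """Return (cause, effect) if a cue pattern matches, else None."""
    for pat in PATTERNS:
        m = pat.search(sentence)
        if m:
            return m.group("cause").strip(), m.group("effect").strip()
    return None

print(extract_causality(
    "Due to the interest rate hike, bank lending volumes declined sharply."))
# -> ('the interest rate hike', 'bank lending volumes declined sharply.')
```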
Citations: 1
Improved Bi-GRU Model for Imbalanced English Toxic Comments Dataset
Zhongguo Wang, Bao Zhang
Deep learning is widely used in the study of English toxic comment classification. However, most existing studies fail to consider data imbalance. Targeting an imbalanced English Toxic Comments Dataset, we propose an improved bidirectional gated recurrent unit (Bi-GRU) model that combines an oversampling method with a cost-sensitive method. We use random oversampling in the improved model to reduce the data imbalance, introduce a cost-sensitive method, and propose a new loss function for the Bi-GRU model. Experimental results show that the improved Bi-GRU model achieves significantly improved classification performance on the imbalanced English Toxic Comments Dataset.
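A minimal sketch of the two ingredients named in this abstract, random oversampling plus cost-sensitive training, using imbalanced-learn and Keras. The class-weight dictionary stands in for the paper's new loss function, and the data and layer sizes are toy assumptions.

```python
# Minimal sketch: random oversampling + a cost-sensitive bidirectional GRU.
# Class weights stand in for the paper's custom loss; sizes are assumptions.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from tensorflow.keras import layers, models

VOCAB, MAXLEN = 20000, 200
X = np.random.randint(1, VOCAB, size=(1000, MAXLEN))  # toy padded token ids
y = np.random.binomial(1, 0.1, size=1000)             # ~10% toxic: imbalanced

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)

model = models.Sequential([
    layers.Embedding(VOCAB, 128),
    layers.Bidirectional(layers.GRU(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Cost-sensitive step: penalize mistakes on the minority (toxic) class more.
# Combining this with oversampling mirrors the abstract's two-part recipe.
n_neg, n_pos = np.bincount(y)
class_weight = {0: 1.0, 1: n_neg / n_pos}
model.fit(X_res, y_res, epochs=1, batch_size=64, class_weight=class_weight)
```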
Citations: 1
Scored and Error-annotated Essay Dataset of Chinese EFL/ESL Learners
Kai Jin, Wuying Liu
A finely annotated essay dataset of EFL/ESL (English as a foreign language or second language) learners at a certain scale is not only an important language resource for language research and teaching, but also contributes materials to language-related computing science. Unfortunately, data of this type that are open on the Internet are small in quantity and uneven in quality, especially data from Chinese learners. We collected 147 essays by Chinese EFL/ESL learners, had four teachers score them under the same criteria and one teacher annotate major errors, and also had them scored by the Pigai scoring system. We then structured the score file, the error-annotated files, and the essay files together with context information, and built the Scored and Error-annotated Essay Dataset of Chinese EFL/ESL Learners (SeedCel), which is open on the Internet and will be incrementally updated. This paper explains how SeedCel is constructed, what its details are, and where it will be used.
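The abstract does not specify the released file format, so the record layout below is purely hypothetical; every field name is an assumption. It only illustrates how scores from four teachers and Pigai, error annotations, and context information might be structured together for one essay.

```python
# Purely hypothetical record layout for one SeedCel essay; the abstract does
# not specify the released schema, so every field name here is an assumption.
import json

record = {
    "essay_id": "seedcel-0001",
    "context": {"prompt": "...", "learner_level": "..."},  # context info
    "text": "Full essay text ...",
    "scores": {"teacher_1": 82, "teacher_2": 79,
               "teacher_3": 85, "teacher_4": 80, "pigai": 83.5},
    "errors": [  # major errors annotated by one teacher
        {"start": 14, "end": 21, "type": "verb tense", "correction": "went"},
    ],
}
print(json.dumps(record, indent=2, ensure_ascii=False))
```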
Citations: 0
Topic Segmentation for Interview Dialogue System
Taiga Kirihara, Kazuyuki Matsumoto, M. Sasayama, Minoru Yoshida, K. Kita
In this study, topic segmentation was performed with reference to an interview dialogue corpus. Utterance intention tags were added to the existing interview dialogue corpus, and uttered sentences were vectorized using BERT, Sentence-BERT, and DistilBERT. In addition, topic classification was performed using the utterance intention tags and the features of the preceding and following uttered sentences. Consequently, the greatest accuracy was achieved when the utterance intention tags were used with DistilBERT.
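A minimal sketch of the vectorization step, assuming the sentence-transformers library and an off-the-shelf checkpoint. Comparing adjacent utterance embeddings is a common topic-boundary heuristic and is shown only as an illustration; the study's classifier also uses utterance intention tags, which are not modeled here.

```python
# Minimal sketch: encode utterances with a Sentence-BERT model and compare
# adjacent utterances; a similarity drop can signal a topic boundary.
# The checkpoint and the toy dialogue are assumptions.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint works
utterances = [
    "What got you interested in painting?",
    "I started by copying museum pieces as a student.",
    "Let's talk about your travels next.",
]
emb = model.encode(utterances, convert_to_tensor=True)

for i in range(len(utterances) - 1):
    print(i, float(cos_sim(emb[i], emb[i + 1])))  # low value -> likely boundary
```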
Citations: 1
Research on judgment reasoning using natural language inference in Chinese medical texts
Xin Li, Wenping Kong
Machine reading comprehension (MRC) is a task used to test the degree to which a machine understands natural language by asking the machine to answer questions according to a given context. Judgment reasoning is one of the MRC tasks: given a context and questions, the machine gives true or false answers, and for some real-world data there is a third option, unknown. Considering the current state of research, this paper uses natural language inference (NLI) models, whose main job is to judge the semantic relationship between two sentences, to further study this judgment reasoning task. We first explain how the NLI task can be used to train universal sentence encoding models in the judgment reasoning process, and subsequently describe the architectures used in the NLI task, covering a suitable range of sentence encoders currently in use; we take the bi-directional long short-term memory (Bi-LSTM) model with max-pooling over the hidden representations as the example explained in this paper. After comparative experiments, we verified that our NLI models are effective strategies for improving the performance of judgment reasoning in Chinese medical texts, and they effectively improve accuracy.
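A minimal PyTorch sketch of the sentence encoder the abstract names: a bidirectional LSTM with max-pooling over the hidden representations, followed by the standard [u; v; |u-v|; u*v] NLI feature vector. Dimensions and vocabulary size are assumptions, and the classifier head is omitted.

```python
# Minimal sketch: Bi-LSTM sentence encoder with max-pooling over hidden
# states, as commonly used for NLI. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMMaxPool(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out_dim = 2 * hidden

    def forward(self, token_ids):                # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
        return h.max(dim=1).values               # max-pool over time steps

enc = BiLSTMMaxPool()
premise = torch.randint(1, 30000, (2, 20))       # toy token-id batches
hypothesis = torch.randint(1, 30000, (2, 18))
u, v = enc(premise), enc(hypothesis)
# Standard NLI feature vector, fed to an MLP classifier (omitted here).
features = torch.cat([u, v, (u - v).abs(), u * v], dim=1)
print(features.shape)  # torch.Size([2, 2048])
```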
Citations: 0
Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong
Hei Yi Mak, Tan Lee
The majority of inhabitants in Hong Kong are able to read and write standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese, and its rise is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is preparing a good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance on all test sets. Our system outperforms Baidu Fanyi's Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU score. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.
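A minimal sketch of the mining idea: score candidate sentence pairs from parallel Chinese/Cantonese articles with a multilingual sentence encoder and keep only highly similar pairs. The checkpoint, the 0.8 threshold, and the toy sentences are assumptions; the paper's actual extraction procedure may differ.

```python
# Minimal sketch: mine near-parallel sentence pairs from parallel articles by
# multilingual embedding similarity. Model name and threshold are assumptions.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

zh_sents = ["他们明天去香港。", "这本书非常有趣。"]   # from a Chinese article
yue_sents = ["佢哋听日去香港。", "呢本书好好睇。"]    # from the parallel Cantonese article

sim = cos_sim(model.encode(zh_sents), model.encode(yue_sents))
for i in range(len(zh_sents)):
    j = int(sim[i].argmax())
    if float(sim[i][j]) > 0.8:                   # keep highly similar pairs only
        print(zh_sents[i], "<->", yue_sents[j], round(float(sim[i][j]), 3))
```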
Citations: 1
Natural Language Processing Applied on Large Scale Data Extraction from Scientific Papers in Fuel Cells
Feifan Yang
Natural language processing (NLP) has great potential to help scientists automatically extract information from large-scale text datasets. In this paper, we focus on the NLP process (text acquisition, text preprocessing, word embedding training, and named entity recognition) applied to 106,181 abstracts of fuel cell papers. We then evaluate the trained model on its ability to solve analogies, use the model to analyze research trends in fuel cell materials, and predict new materials. To the best of our knowledge, this is the first time that NLP has been applied in the field of fuel cells. This data-driven technique is demonstrated to have the potential to promote the discovery of new fuel cell materials.
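A minimal sketch of the word-embedding-training and analogy steps using gensim; the toy corpus stands in for the 106,181 tokenized fuel-cell abstracts, and the query terms are illustrative, not results from the paper.

```python
# Minimal sketch: train Word2Vec on tokenized abstracts, then run an
# analogy-style query by vector arithmetic. Corpus and query are toys.
from gensim.models import Word2Vec

corpus = [
    ["proton", "exchange", "membrane", "fuel", "cell", "nafion"],
    ["solid", "oxide", "fuel", "cell", "ysz", "electrolyte"],
    ["platinum", "catalyst", "oxygen", "reduction", "reaction"],
] * 200  # repeat the toy corpus so training has enough examples

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# Analogy-style query over material/concept terms (illustrative only).
print(model.wv.most_similar(positive=["nafion", "solid"],
                            negative=["proton"], topn=3))
```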
Citations: 0
Examination of the quality of Conceptnet relations for PubMed abstracts
Rajeswaran Viswanathan, S. Priya
ConceptNet is a crowd-sourced knowledge graph used to find relationships between words and concepts. PubMed is the largest source of documents for the bio-medical domain. From the PubMed abstracts, stop words are removed and the remaining words are used as seed words. For these seed words, "nearest neighbor" words are identified as candidate words using 3 popular word vector (WV) models: Word2Vec, GloVe, and FastText. Similarity is calculated for these words for each stratum of relationship. A bootstrap estimator in a random effects model (REM) is used to study this relationship using the similarity scores. The analysis shows that there is heterogeneity among the relationships, independent of the WV model used as the base.
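A minimal sketch of the candidate-word step, assuming gensim's downloader and a pretrained GloVe checkpoint; the study compares Word2Vec, GloVe, and FastText, and the seed word here is illustrative.

```python
# Minimal sketch: load pretrained word vectors and take nearest neighbors of
# a seed word as candidate words. The GloVe checkpoint is one of the three
# WV families the study compares; the seed word is an assumption.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")    # downloads on first use
seed = "protein"
candidates = wv.most_similar(seed, topn=5)  # "nearest neighbor" candidates
for word, score in candidates:
    print(f"{seed} -> {word}: cosine similarity {score:.3f}")
```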
Citations: 0