
Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval: Latest Publications

Semantic Preserving Siamese Autoencoder for Binary Quantization of Word Embeddings
Wouter Mostard, Lambert Schomaker, M. Wiering
Word embeddings are used as building blocks for a wide range of natural language processing and information retrieval tasks. These embeddings are usually represented as continuous vectors, requiring significant memory capacity and computationally expensive similarity measures. In this study, we introduce a novel method for semantic hashing continuous vector representations into lower-dimensional Hamming space while explicitly preserving semantic information between words. This is achieved by introducing a Siamese autoencoder combined with a novel semantic preserving loss function. We show that our quantization model induces only a 4% loss of semantic information over continuous representations and outperforms the baseline models on several word similarity and sentence classification tasks. Finally, we show through cluster analysis that our method learns binary representations where individual bits hold interpretable semantic information. In conclusion, binary quantization of word embeddings significantly decreases time and space requirements while offering new possibilities through exploiting semantic information of individual bits in downstream information retrieval tasks.
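The Siamese autoencoder itself is not reproduced here, but the target representation can be illustrated with a minimal sketch: quantize continuous vectors to bits (here by naive per-dimension median thresholding, a common baseline rather than the paper's learned codes) and compare them with a cheap Hamming-space similarity.

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Quantize continuous vectors to {0, 1} by thresholding each dimension
    at its median (a simple baseline; the paper instead learns the codes
    with a Siamese autoencoder and a semantic-preserving loss)."""
    thresholds = np.median(embeddings, axis=0)
    return (embeddings > thresholds).astype(np.uint8)

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in Hamming space: the fraction of matching bits.
    Bitwise comparison is far cheaper than cosine over float vectors."""
    return float(np.mean(a == b))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))   # 4 toy "word" vectors, 8 dimensions
codes = binarize(emb)           # (4, 8) binary codes
sim = hamming_similarity(codes[0], codes[1])
```

A usage note: with such codes, nearest-neighbour search reduces to popcounts over packed bit vectors, which is where the time and space savings claimed in the abstract come from.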
DOI: 10.1145/3508230.3508235
Citations: 0
A Contrastive Study on Linguistic Features between HT and MT based on NLPIR-ICTCLAS: A Case Study of Philosophical Text
Yumei Ge, Bin Xu
This paper, with the aid of NLPIR-ICTCLAS, analyzes and compares original English texts with different translated versions of a philosophical text. A 1:6 English-Chinese translation corpus is used to study the linguistic structural features of human translation (HT) and machine translation (MT). The study shows that HT is characterized by more complicated language and more complex sentences. At the same time, in the process of translation, human translators, unlike MT engines, can intentionally avoid using too many functional words and convey the grammatical structures and logical relations of sentences mainly through the meanings of words or clauses. The five MT versions share similarities in their use of notional and functional words.
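One simple proxy for the functional-word comparison described above can be sketched as follows; the token lists and the tiny functional-word set are illustrative stand-ins, not NLPIR-ICTCLAS output.

```python
# Hypothetical illustration: comparing the share of functional (closed-class)
# words in two token streams, one proxy for the HT-vs-MT contrast.
# This functional-word set is a small stand-in, not a real lexicon.
FUNCTIONAL = {"的", "了", "在", "是", "和", "与", "被", "把"}

def functional_ratio(tokens):
    """Fraction of tokens that are functional words."""
    if not tokens:
        return 0.0
    return sum(t in FUNCTIONAL for t in tokens) / len(tokens)

# Toy tokenized sentences (illustrative, not corpus data)
ht_tokens = ["译者", "在", "翻译", "中", "避免", "了", "冗余"]
mt_tokens = ["译者", "在", "翻译", "的", "过程", "中", "是", "避免", "了", "冗余", "的"]
```

Under the paper's finding, one would expect the MT stream to show a higher ratio than the HT stream.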
DOI: 10.1145/3508230.3508240
Citations: 0
Text Sentiment Analysis based on BERT and Convolutional Neural Networks
Ping Huang, Huijuan Zhu, Lei Zheng, Ying Wang
The rapid development of the Internet has accelerated the circulation of information, and analyzing the emotional tendency of online text helps uncover users' needs. However, most existing sentiment classification models rely on manually labeled text features, so the deep semantic features hidden in the text are insufficiently mined and classification performance is difficult to improve significantly. This paper presents a text sentiment classification model combining BERT and convolutional neural networks (CNN). The model uses BERT to produce word embeddings for the text, then uses a CNN to learn deep semantic information about the text, so as to mine its emotional tendency. On the Large Movie Review dataset, the BERT-CNN model achieves an accuracy of 86.67%, significantly better than the traditional textCNN classification method. The results show that the method performs well in this field.
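The CNN stage can be sketched minimally: convolve a bank of filters over the token-embedding matrix and global-max-pool each filter into a fixed-size sentence feature for a classifier head. Random matrices stand in for BERT embeddings and learned filter weights, so this illustrates the shape of the computation, not the trained model.

```python
import numpy as np

def conv1d_maxpool(E: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One convolutional filter bank over a sentence's embedding matrix.
    E: (seq_len, dim) token embeddings (from BERT in the paper's pipeline);
    W: (n_filters, width, dim) filter weights.
    Returns one max-pooled feature per filter."""
    n_filters, width, dim = W.shape
    seq_len = E.shape[0]
    feats = np.empty((n_filters, seq_len - width + 1))
    for i in range(seq_len - width + 1):
        window = E[i:i + width]  # (width, dim) sliding window of tokens
        # Contract each filter against the window -> one activation per filter
        feats[:, i] = np.tensordot(W, window, axes=([1, 2], [0, 1]))
    return feats.max(axis=1)  # global max-pool over positions

rng = np.random.default_rng(1)
E = rng.normal(size=(10, 16))    # 10 tokens, 16-dim toy embeddings
W = rng.normal(size=(4, 3, 16))  # 4 filters of width 3
pooled = conv1d_maxpool(E, W)    # (4,) sentence feature vector
```

In the full model, `pooled` would feed a dense softmax layer producing the positive/negative sentiment prediction.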
DOI: 10.1145/3508230.3508231
Citations: 3
Query Disambiguation to Enhance Biomedical Information Retrieval Based on Neural Networks
Wided Selmi, Hager Kammoun, Ikram Amous
Information Retrieval Systems (IRS) use a query to find relevant documents. Often a query term can have more than one sense; this is known as the ambiguity problem, and it is one cause of poor IRS performance. Word Sense Disambiguation (WSD) deals with choosing the right sense of an ambiguous term, among a set of given candidate senses, according to its context (surrounding text); obtaining all candidate senses is therefore a challenge for WSD. Word Sense Induction (WSI) automatically induces the different senses of a target word from its different contexts. In this work, we propose a biomedical query disambiguation method in which WSI uses the K-means algorithm to cluster the different contexts of an ambiguous query term (a MeSH descriptor) in order to induce its senses. The contexts are sentences extracted from PubMed that contain the target MeSH descriptor. To represent sentences as vectors, we use the contextualized embedding model BioBERT. Our method derives from the intuition that the correct sense is the candidate sense of an ambiguous term with the highest similarity to its context. Experiments conducted on the OHSUMED test collection yielded significant results.
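The WSI step can be sketched with a plain implementation of Lloyd's K-means; synthetic vectors stand in for BioBERT sentence embeddings, and the cluster count is fixed at two for illustration.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's K-means; returns one cluster label per context vector.
    In the paper each row would be a BioBERT embedding of a PubMed sentence
    containing the ambiguous MeSH descriptor; here rows are synthetic."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each vector to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its members
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated synthetic "senses" of a term
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (5, 8)), rng.normal(5, 0.1, (5, 8))])
labels = kmeans(X, k=2)
```

Each resulting cluster centroid then serves as one induced candidate sense, to be matched against the query context by similarity.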
DOI: 10.1145/3508230.3508253
Citations: 0
Retrieval-based End-to-End Tamil language Conversational Agent for Closed Domain using Machine Learning
Kumaran Kugathasan, Uthayasanker Thayasivam
Businesses around the world have started to adopt text-based conversational agents to provide a good customer experience while reducing the cost of human customer-service agents. Building a conversational agent is comparatively easy for businesses serving customers who speak high-resourced languages like English, since plenty of paid and open-source chatbot frameworks are available. For a low-resource language like Tamil, however, there is no such framework support, and the approaches proposed in research on high-resource-language chatbots are not suitable for Tamil due to the lack of language-related resources. This paper proposes a new approach for building a Tamil conversational agent using a dataset scraped from an FAQ corpus and expanded to capture the morphological richness and highly inflectional nature of the Tamil language. Each question is mapped to an intent, and a multiclass intent classifier was built to identify the intent of the user. A CNN-based classifier performed best, with 98.72% accuracy.
DOI: 10.1145/3508230.3508251
Citations: 0
Method of Graphical User Interface Adaptation Using Reinforcement Learning and Automated Testing
Victor Fyodorov, A. Karsakov
Graphical user interface adaptation is becoming an increasingly time-consuming and resource-intensive task due to the complexity of modern programs and the wide variety of information output devices. In this paper we propose a method for adapting a graphical user interface to a person's workflow using a specific implementation of the interface. The method adapts the interface to the peculiarities of the user's workflow by optimizing navigation between program windows.
DOI: 10.1145/3508230.3508255
Citations: 0
Annotation and Evaluation of Utterance Intention Tag for Interview Dialogue Corpus
M. Sasayama, Kazuyuki Matsumoto
In this paper, we propose utterance intention tags for an interview dialogue corpus and construct such a corpus with the tags we designed. Three or five annotators applied the tags to an interview dialogue corpus of 30 dialogues (49,999 utterances in total). We conducted an evaluation experiment using Fleiss's kappa to assess the reliability of the proposed tags. When three annotators applied 18 different tags to the corpus, we obtained a kappa value of 0.55.
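Fleiss's kappa, the agreement statistic cited above, can be computed directly from a count matrix of annotator votes; the numbers below are a toy illustration, not the corpus data.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss's kappa for an (n_items, n_categories) matrix of vote counts,
    each row summing to the (constant) number of annotators.
    kappa = (P_bar - P_e) / (1 - P_e)."""
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Overall proportion of votes falling into each category
    p_cat = ratings.sum(axis=0) / (n_items * n_raters)
    # Per-item observed agreement
    P_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_cat ** 2).sum()
    return float((P_bar - P_e) / (1 - P_e))

# Toy example: 4 utterances, 3 annotators, 3 candidate tags
ratings = np.array([
    [3, 0, 0],   # unanimous
    [0, 3, 0],   # unanimous
    [2, 1, 0],   # 2-vs-1 split
    [0, 0, 3],   # unanimous
])
kappa = fleiss_kappa(ratings)  # 35/47, about 0.745
```

A kappa of 0.55 over 18 tags, as reported, is conventionally read as moderate agreement.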
DOI: 10.1145/3508230.3508236
Citations: 1
STIF: Semi-Supervised Taxonomy Induction using Term Embeddings and Clustering
Maryam Mousavi, Elena Steiner, S. Corman, Scott W. Ruston, Dylan Weber, H. Davulcu
In this paper, we developed a semi-supervised taxonomy induction framework using term embedding and clustering methods for a blog corpus comprising 145,000 posts from 650 Ukraine-related blog domains dated between 2010 and 2020. We extracted 32,429 noun phrases (NPs) and split them into two categories: General/Ambiguous phrases, which might appear under any topic, vs. Topical/Non-Ambiguous phrases, which pertain to a topic's specifics. We used term representation and clustering methods to partition the topical/non-ambiguous phrases into 90 groups chosen with the Silhouette method. Next, a team of 10 communications scientists analyzed the NP clusters and induced a two-level taxonomy alongside its codebook. Upon achieving intercoder reliability of 94%, coders mapped all topical/non-ambiguous phrases into a gold-standard taxonomy. We evaluated a range of term representation and clustering methods using extrinsic and intrinsic measures, and determined that GloVe embeddings with K-Means achieved the highest performance (74% purity) on this real-world dataset.
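The Silhouette criterion used to settle on 90 groups can be sketched directly from its definition, s(i) = (b - a) / max(a, b); the vectors below are toy data, not the phrase embeddings.

```python
import numpy as np

def silhouette_score(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient: a is the mean intra-cluster distance,
    b the mean distance to the nearest other cluster. Higher is better;
    sweeping the cluster count and picking the peak selects k."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        if same.sum() < 2:
            scores.append(0.0)  # singleton-cluster convention
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # exclude self from mean
        b = min(D[i, labels == lj].mean() for lj in set(labels.tolist()) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight toy blobs: correct labels score higher than shuffled labels
X = np.vstack([np.zeros((3, 2)), np.full((3, 2), 5.0)])
score_good = silhouette_score(X, np.array([0, 0, 0, 1, 1, 1]))
score_bad = silhouette_score(X, np.array([0, 1, 0, 1, 0, 1]))
```

In practice one would run the clustering for a range of k values and keep the k maximizing this score.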
DOI: 10.1145/3508230.3508247
Citations: 0
Named Entity Recognition using Knowledge Graph Embeddings and DistilBERT
Shreya R. Mehta, Mansi A. Radke, Sagar Sunkle
Named Entity Recognition (NER) is the Natural Language Processing (NLP) task of identifying entities in natural language text and classifying them into categories such as Person, Location, and Organization. Pre-trained neural language models (PNLMs) based on transformers are state-of-the-art in many NLP tasks, including NER. Analysis of the output of DistilBERT, a popular PNLM, reveals that misclassifications occur when a non-entity word appears in a position contextually suitable for an entity. This paper is based on the hypothesis that the performance of a PNLM can be improved by combining it with Knowledge Graph Embeddings (KGE). We show that fine-tuning DistilBERT together with NumberBatch KGE yields performance improvements on various open-domain as well as biomedical-domain datasets.
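One plausible reading of the combination is feature fusion by concatenation; the sketch below is an illustrative assumption (names, dimensions, and the lookup table are hypothetical), not the paper's exact pipeline.

```python
import numpy as np

# Hypothetical fusion sketch: concatenate a contextual token vector
# (DistilBERT hidden size, 768) with a knowledge-graph embedding looked up
# for that token (NumberBatch-style, 300) before the NER tag classifier.
kge_table = {"paris": np.full(300, 0.1)}  # toy KGE lookup table
UNK_KGE = np.zeros(300)                   # fallback for uncovered tokens

def fuse(token: str, contextual_vec: np.ndarray) -> np.ndarray:
    """Return the (768 + 300)-dim fused feature for one token."""
    kge = kge_table.get(token.lower(), UNK_KGE)
    return np.concatenate([contextual_vec, kge])

fused = fuse("Paris", np.zeros(768))  # shape (1068,)
```

The intuition from the abstract is that the KGE half carries entity-type priors (e.g. "Paris is a place"), which can override a misleading sentence context.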
DOI: 10.1145/3508230.3508252
Citations: 1
A Study of Predicting the Sincerity of a Question Asked Using Machine Learning
T. Nguyen, P. Meesad
The growth of applications in both scientific socialism and naturalism makes it increasingly difficult to assess whether a question is sincere or not, and such assessment is mandatory for many marketing and financial companies. Many applications will be reconfigured beyond recognition, especially around text and images, while others face potential extinction as a corollary of advances in technology and computer science in particular; analyzing text and image data will be truly needed for understanding valuable insights. In this paper, we analyzed the Quora dataset obtained from Kaggle.com to filter insincere and spam content, using different preprocessing algorithms and analysis models provided in PySpark. We also analyzed the manner in which users write their posts via the proposed prediction models. Finally, we identified the most accurate of the selected algorithms for classifying questions on Quora: the Gradient Boosted Tree was the best model, with an accuracy of 79.5%, followed by Long Short-Term Memory (LSTM) at 78.0%. Compared with the same models built in Scikit-Learn and with GRU, BiLSTM, and BiGRU machine learning models, applying models in PySpark gave better results in classifying questions on Quora.
DOI: 10.1145/3508230.3508258
Citations: 4