
Icon: Latest Publications

Center-free intuitionistic fuzzy c-means clustering algorithm based on similarity of hybrid spatial membership for image segmentation
Q3 Arts and Humanities Pub Date : 2023-03-01 DOI: 10.1109/ICNLP58431.2023.00019
Lan Rong, Shumin Wang, He Hu, Zhao Feng, Haiyan Yu, Zhang Lu
In order to address the issues that the center-free fuzzy c-means (CFFCM) clustering algorithm does not consider the texture features or spatial information of pixels, and that its time complexity is too high, a center-free intuitionistic fuzzy c-means clustering algorithm based on the similarity of hybrid spatial membership is proposed for image segmentation. In the proposed algorithm, a voting model is used to generate intuitionistic fuzzy sets (IFS), and the generated hesitation and membership degrees are combined with spatial information to design a spatial intuitionistic membership-degree similarity model. This model handles the similarity between pixels and classes in the gray-level information, improving segmentation efficiency. At the same time, the intuitionistic fuzzy local binary pattern (IFLBP) operator is used to extract image texture information, which is introduced into the objective function; the spatial membership similarity model processes this texture information and improves the segmentation accuracy of the algorithm. Simulation results show that the proposed algorithm has advantages in both visual effect and evaluation indices.
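The hesitation degree that the abstract combines with spatial information can be illustrated with a standard intuitionistic-fuzzy construction. The sketch below uses Sugeno's negation to derive non-membership and hesitation degrees from a membership map; this generator is an illustrative assumption, since the paper itself derives these degrees from a voting model whose exact form the abstract does not give.

```python
import numpy as np

def intuitionistic_from_membership(mu, lam=2.0):
    """Derive non-membership (nu) and hesitation (pi) degrees from
    fuzzy memberships via Sugeno's negation -- an illustrative
    generator, not the paper's voting model."""
    mu = np.asarray(mu, dtype=float)
    nu = (1.0 - mu) / (1.0 + lam * mu)  # non-membership degree
    pi = 1.0 - mu - nu                  # hesitation degree (>= 0)
    return mu, nu, pi

# e.g. a fully ambiguous pixel (mu = 0.5) keeps some hesitation
mu, nu, pi = intuitionistic_from_membership([0.0, 0.5, 1.0])
```

For lam = 2, a pixel with membership 0.5 gets non-membership 0.25 and hesitation 0.25, so mu + nu + pi = 1 as intuitionistic fuzzy sets require.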
Citations: 0
Multimodal Visual Question Answering Model Enhanced with Image Emotional Information
Q3 Arts and Humanities Pub Date : 2023-03-01 DOI: 10.1109/ICNLP58431.2023.00056
Jin Cai, Guoyong Cai
Visual Question Answering is a multimedia understanding task: given an image and natural-language questions about its content, a computer must answer them correctly. Early visual question answering models often ignored the emotional information in the image, so they performed poorly on emotion-related questions; on the other hand, existing visual question answering models that integrate emotional information do not make full use of key image regions and text keywords, and do not understand fine-grained questions deeply enough, resulting in low accuracy. In order to fully integrate image emotional information into the visual question answering model and strengthen its ability to answer questions, a multimodal visual question answering model enhanced with image emotional information (IEMVQA) is proposed, and experiments are carried out on a visual question answering benchmark dataset. The experiments show that the IEMVQA model outperforms the comparison methods on comprehensive indicators, verifying the effectiveness of using emotional information to assist a visual question answering model.
Citations: 0
Bengali Fake Review Detection using Semi-supervised Generative Adversarial Networks
Q3 Arts and Humanities Pub Date : 2023-03-01 DOI: 10.1109/ICNLP58431.2023.00011
Md. Tanvir Rouf Shawon, G. M. Shahariar, F. Shah, Mohammad Shafiul Alam, M. S. Mahbub
This paper investigates the potential of semi-supervised Generative Adversarial Networks (GANs) to fine-tune pretrained language models in order to classify Bengali fake reviews from real reviews with only a few annotated examples. With the rise of social media and e-commerce, the ability to detect fake or deceptive reviews is becoming increasingly important in order to protect consumers from being misled by false information. Any machine learning model will have trouble identifying a fake review, especially for a low-resource language like Bengali. We have demonstrated that the proposed semi-supervised GAN-LM architecture (a generative adversarial network on top of a pretrained language model) is a viable solution for classifying Bengali fake reviews: the experimental results suggest that even with only 1024 annotated samples, BanglaBERT with semi-supervised GAN (SSGAN) achieved an accuracy of 83.59% and an F1-score of 84.89%, outperforming the other pretrained language models - BanglaBERT generator, Bangla BERT Base and BanglaElectra - by almost 3%, 4% and 10% respectively in terms of accuracy. The experiments were conducted on a manually labeled food-review dataset consisting of a total of 6,014 real and fake reviews collected from various social media groups. Researchers who have difficulty recognizing not just fake reviews but other classification problems owing to a lack of labeled data may find a solution in our proposed methodology.
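The semi-supervised discriminator objective underlying SSGAN can be sketched numerically. The loss terms below follow the standard K-real-classes-plus-fake formulation of semi-supervised GANs (Salimans et al., 2016), which SSGAN-style training builds on; the function names and toy shapes are illustrative, not the authors' code.

```python
import numpy as np

def log_sum_exp(logits):
    """Numerically stable log-sum-exp over the class axis."""
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def ssgan_discriminator_losses(logits_lab, labels, logits_unl, logits_fake):
    """Loss terms for a semi-supervised GAN discriminator with K real
    classes, where p(real | x) = Z(x) / (Z(x) + 1) and Z(x) is the sum
    of exponentiated class logits (Salimans et al., 2016)."""
    # supervised term: cross-entropy over the K real classes (labeled data)
    z_lab = log_sum_exp(logits_lab)
    supervised = -(logits_lab[np.arange(len(labels)), labels] - z_lab).mean()
    # unsupervised term: unlabeled reviews should look "real",
    # generator samples should look "fake"
    z_unl, z_fake = log_sum_exp(logits_unl), log_sum_exp(logits_fake)
    unsupervised = (np.logaddexp(0.0, z_unl) - z_unl).mean() \
                   + np.logaddexp(0.0, z_fake).mean()
    return supervised, unsupervised
```

With confident logits on the correct real class, the supervised loss approaches zero, which is the regime a fine-tuned language-model backbone is meant to reach.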
Citations: 0
When to Use Large Language Model: Upper Bound Analysis of BM25 Algorithms in Reading Comprehension Task
Q3 Arts and Humanities Pub Date : 2023-03-01 DOI: 10.1109/icnlp58431.2023.00049
Tingzhen Liu, Qianqian Xiong, Shengxi Zhang
A large language model (LLM) represents a major advance in AI and has been used in multiple natural language processing tasks. Nevertheless, in different business scenarios an LLM requires fine-tuning by engineers to achieve satisfactory performance, and the cost of that fine-tuning may not match the target performance achieved. Based on the Baidu STI dataset, we study the upper bound of the performance that classical information retrieval methods can achieve for a specific business, and compare it with the cost and performance of the participating teams' LLM-based systems. This paper gives insight into the potential of classical computational linguistics algorithms, which can help decision-makers choose reasonably between LLMs and low-cost methods in business R&D.
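For reference, the kind of classical retrieval baseline whose ceiling the paper measures can be written in a few lines. This is a plain BM25 scorer over whitespace tokens, using the common default parameters rather than anything tuned to the Baidu STI data.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against a query with plain BM25 over
    whitespace tokens (k1 and b are the usual textbook defaults)."""
    N = len(docs)
    toks = [d.split() for d in docs]
    avgdl = sum(len(t) for t in toks) / N
    df = Counter(w for t in toks for w in set(t))  # document frequency
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            # term-frequency saturation with document-length normalization
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```

Because of the length normalization, a short document matching the query term outscores a longer one with the same term frequency.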
Citations: 1
Computer-aided Analysis of Conceptual Metaphors in English News Report
Q3 Arts and Humanities Pub Date : 2023-03-01 DOI: 10.1109/ICNLP58431.2023.00058
Tu Ying
Metaphor has been observed to be a widespread method of communication in international news coverage. However, there haven’t been many studies on metaphors in English news reporting, let alone ones that used computer-assisted tools. The study uses computer-assisted tools to analyze and interpret the types of metaphors and their structural models present in 12 news reports about the Belt and Road Initiative (BRI) published in The New York Times. The findings of the research aim to present the Chinese national image depicted in the mainstream western media and reveal the underlying values and attitudes contained in the news reports on BRI.
Citations: 0
Generalization Algorithm of Multimodal Pre-Training Model Based on Graph-Text Self-Supervised Training
Q3 Arts and Humanities Pub Date : 2023-02-16 DOI: 10.1109/ICNLP58431.2023.00066
Xiaobing Zhang, Zhenhao Tang, Zi Long, Xianghua Fu
Recently, a large number of studies have shown that introducing visual information can effectively improve neural machine translation (NMT). Its effectiveness, however, largely depends on the availability of a large number of bilingual parallel sentence pairs with manual image annotation; the scarcity of such images, and how effective they actually are, have been difficult problems to solve. In this paper, a multimodal pre-training generalization algorithm based on self-supervised training is proposed, which overcomes the lack and inaccuracy of visual information and thus extends the applicability of images to NMT. Specifically, we retrieve many candidate images for the existing sentences through a search engine, and then, exploiting the relationship between visual information and text, run a self-supervised image-text training task to obtain visual information that is more effective for the text. We show that when the filtered information is used for fine-tuning multimodal machine translation, translation on the global voice dataset is 0.5 BLEU higher than the baseline.
Citations: 0
Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages
Q3 Arts and Humanities Pub Date : 2023-02-09 DOI: 10.48550/arXiv.2302.04790
Bhavyajeet Singh, Pavan Kandru, Anubhav Sharma, Vasudeva Varma
Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However, a lot of information present in the form of natural text in low-resource languages is often missed out. Cross-Lingual Information Extraction aims at extracting factual information in the form of English triples from low-resource Indian-language text. Despite its massive potential, progress made on this task lags behind monolingual information extraction. In this paper, we propose the task of Cross-Lingual Fact Extraction (CLFE) from text and devise an end-to-end generative approach for it which achieves an overall F1 score of 77.46
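An end-to-end generative extractor needs its target triples serialized as plain text. The helpers below show one plausible linearization with sentinel tokens and its inverse for exact-match or F1 scoring; the delimiter scheme is an illustrative assumption, not the exact format from the paper.

```python
def linearize_triples(triples):
    """Serialize (subject, relation, object) facts into a flat target
    string for a sequence-to-sequence generator -- one plausible
    format; the sentinel tokens here are illustrative."""
    return " ".join(f"<sub> {s} <rel> {r} <obj> {o}" for s, r, o in triples)

def delinearize(text):
    """Invert linearize_triples so generated text can be scored
    against gold triples."""
    triples = []
    for chunk in text.split("<sub>")[1:]:
        subj, rest = chunk.split("<rel>")
        rel, obj = rest.split("<obj>")
        triples.append((subj.strip(), rel.strip(), obj.strip()))
    return triples
```

A round trip through both helpers recovers the original triples, which is what makes the format usable as a supervised generation target.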
Citations: 1
There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering
Q3 Arts and Humanities Pub Date : 2023-01-10 DOI: 10.48550/arXiv.2301.04013
Ankush Agarwal, Sakharam Gawade, Sachin Channabasavarajendra, P. Bhattacharyya
The integration of knowledge graphs with deep learning is thriving in improving the performance of various natural language processing (NLP) tasks. In this paper, we focus on knowledge-infused link prediction and question answering using the language models T5 and BLOOM across three domains: Aviation, Movie, and Web. In this context, we infuse knowledge into large and small language models, study their performance, and find it to be similar. For the link prediction task on the Aviation Knowledge Graph, we obtain a 0.2 hits@1 score using T5-small, T5-base, T5-large, and BLOOM. Using template-based scripts, we create a set of 1 million synthetic factoid QA pairs in the aviation domain from National Transportation Safety Board (NTSB) reports. On our curated QA pairs, the three T5 models achieve a 0.7 hits@1 score. We validate our findings with the paired Student's t test and Cohen's kappa scores. For link prediction on the Aviation Knowledge Graph using T5-small and T5-large, we obtain a Cohen's kappa score of 0.76, showing substantial agreement between the models. Thus, we infer that small language models perform similarly to large language models with the infusion of knowledge.
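Both evaluation quantities the abstract relies on, hits@1 and Cohen's kappa, are short computations. A minimal sketch of the standard definitions (not the authors' evaluation scripts):

```python
from collections import Counter

def hits_at_1(predictions, gold):
    """Fraction of queries whose top-ranked prediction equals the gold
    answer (the hits@1 metric)."""
    return sum(p[0] == g for p, g in zip(predictions, gold)) / len(gold)

def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences: observed agreement
    corrected for the agreement expected by chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)        # chance
    return (po - pe) / (1 - pe)
```

A kappa of 0.76, as reported for T5-small vs. T5-large, falls in the range conventionally read as "substantial agreement".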
Citations: 3
ABB-BERT: A BERT model for disambiguating abbreviations and contractions
Q3 Arts and Humanities Pub Date : 2022-07-08 DOI: 10.48550/arXiv.2207.04008
Prateek Kacker, Andi Cupallari, Aswin Giridhar Subramanian, Nimit Jain
Abbreviations and contractions are commonly found in text across different domains. For example, doctors' notes contain many contractions that can be personalized based on their choices. Existing spelling-correction models are not suitable for handling expansions, because contractions drop many characters from words. In this work, we propose ABB-BERT, a BERT-based model that deals with ambiguous language containing abbreviations and contractions. ABB-BERT can rank expansions from among thousands of options and is designed for scale. It is trained on Wikipedia text, and the algorithm allows it to be fine-tuned with little compute to get better performance for a domain or person. We are publicly releasing the training dataset for abbreviations and contractions derived from Wikipedia.
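The candidate-ranking problem can be illustrated, for intuition, with a simple character-subsequence filter: a contraction keeps its letters in order ("pt" for "patient"). The sketch below is a crude non-neural stand-in for the scoring a fine-tuned BERT would supply; it is not the paper's model.

```python
def is_expansion(abbrev, candidate):
    """True if the abbreviation's characters appear in order inside the
    candidate word, e.g. 'pt' matches 'patient' but not 'doctor'."""
    it = iter(candidate.lower())
    return all(ch in it for ch in abbrev.lower())  # consumes the iterator

def rank_expansions(abbrev, candidates):
    """Keep plausible expansions and prefer shorter ones -- a crude
    heuristic proxy for a learned likelihood score."""
    ok = [c for c in candidates if is_expansion(abbrev, c)]
    return sorted(ok, key=len)
```

A learned ranker replaces the length heuristic with context-dependent scores, which is what lets the same contraction resolve differently in different notes.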
Citations: 0
Towards Multimodal Vision-Language Models Generating Non-Generic Text
Q3 Arts and Humanities Pub Date : 2022-06-28 DOI: 10.1609/aaai.v36i11.21705
Wes Robbins
Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from an image. In this work, we contend that vision-language models can benefit from information that can be extracted from an image but is not used by current models. We modify previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers. In particular, we focus on person names as an additional set of tokens and create a novel image-caption dataset to facilitate captioning with person names. The dataset, Politicians and Athletes in Captions (PAC), consists of captioned images of well-known people in context. By fine-tuning pretrained models on this dataset, we demonstrate a model that can naturally integrate facial-recognition tokens into generated text by training on limited data. For the PAC dataset, we provide a discussion of collection and baseline benchmark scores.
{"title":"Towards Multimodal Vision-Language Models Generating Non-Generic Text","authors":"Wes Robbins","doi":"10.1609/aaai.v36i11.21705","DOIUrl":"https://doi.org/10.1609/aaai.v36i11.21705","url":null,"abstract":"Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from an image. In this work, we contend that vision-language models can benefit from information that can be extracted from an image, but are not used by current models. We modify previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers. In particular, we focus on person names as an additional set of tokens and create a novel image-caption dataset to facilitate captioning with person names. The dataset, Politicians and Athletes in Captions (PAC), consists of captioned images of well-known people in context. By fine-tuning pretrained models with this dataset, we demonstrate a model that can naturally integrate facial recognition tokens into generated text by training on limited data. For the PAC dataset, we provide a discussion on collection and baseline benchmark scores.","PeriodicalId":53637,"journal":{"name":"Icon","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42691546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
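The PAC entry above says the modified framework "accepts relevant information from any number of auxiliary classifiers" and integrates facial-recognition tokens into generated text. One common way to merge such tokens into a decoder is a pointer/copy-style mixture; the abstract does not specify the exact mechanism, so this is an assumption, and all token names, probabilities, and the `gate` value below are invented:

```python
def mix_step(base_dist: dict[str, float],
             aux_dist: dict[str, float],
             gate: float) -> dict[str, float]:
    """One decode step: interpolate the captioner's vocabulary
    distribution with a distribution over auxiliary (person-name)
    tokens. `gate` is the probability mass given to copying an
    auxiliary token instead of emitting a base-vocabulary word."""
    out = {tok: (1.0 - gate) * p for tok, p in base_dist.items()}
    for tok, p in aux_dist.items():
        out[tok] = out.get(tok, 0.0) + gate * p
    return out

# The generic captioner wants "person"; the face classifier is confident
# about "Merkel", so the mixed distribution prefers the name token.
step = mix_step({"person": 0.6, "podium": 0.4},
                {"Merkel": 0.9, "Macron": 0.1},
                gate=0.7)
best = max(step, key=step.get)  # "Merkel" (0.7 * 0.9 = 0.63)
```

In trained copy mechanisms the gate is itself predicted per step from the decoder state; the fixed value here only illustrates how auxiliary-classifier outputs can enter the token stream.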