ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Bang Yang;Fenglin Liu;Yuexian Zou;Xian Wu;Yaowei Wang;David A. Clifton
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5712-5724. Published 2024-02-29. DOI: 10.1109/TPAMI.2024.3371376. URL: https://ieeexplore.ieee.org/document/10453989/

Abstract

Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. As a result, it is necessary to collect and label data-text pairs for training, which is both costly and time-consuming. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of input data in the common space. Within this unified framework, given visual (imaging or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and “believable” outputs and significantly outperforms existing zero-shot methods.
Our code and data are available at https://github.com/yangbang18/ZeroNLG.
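The three training steps and the inference path described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random linear projections, the squared-distance alignment loss, the single-token decoder, and every name and size below (`encode`, `alignment_loss`, `reconstruction_loss`, `generate`, `D_LATENT`, `VOCAB`) are assumptions made only to show the data flow. How corresponding inputs are paired for alignment is not stated in the abstract and is left out.

```python
import math
import random

random.seed(0)
# Hypothetical feature, latent, and vocabulary sizes.
D_IMG, D_TXT, D_LATENT, VOCAB = 8, 6, 4, 10
LANGS = ["en", "zh", "de", "fr"]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vec, W):
    """Multiply a feature vector by a (len(vec) x cols) matrix."""
    return [sum(v * row[j] for v, row in zip(vec, W)) for j in range(len(W[0]))]


# (i) Each domain gets its own projection into the shared latent space.
W_img = rand_matrix(D_IMG, D_LATENT)
W_txt = rand_matrix(D_TXT, D_LATENT)


def encode(vec, W):
    """Return the unit-length coordinate of an input in the shared space."""
    z = project(vec, W)
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z]


# (ii) Alignment: squared distance between coordinates of corresponding
# inputs from two domains (an assumed choice of loss).
def alignment_loss(z_a, z_b):
    return sum((a - b) ** 2 for a, b in zip(z_a, z_b))


# (iii) Auto-encoding: one toy decoder per language maps a coordinate back
# to a token. A real decoder would emit a whole sentence.
W_dec = {lang: rand_matrix(D_LATENT, VOCAB) for lang in LANGS}


def reconstruction_loss(z, token_id, lang):
    """Cross-entropy of the target token under a softmax over the vocabulary."""
    logits = project(z, W_dec[lang])
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[token_id]


# Inference: encode any input (image or text) and decode in a target language.
def generate(vec, W_enc, lang):
    logits = project(encode(vec, W_enc), W_dec[lang])
    return max(range(VOCAB), key=logits.__getitem__)
```

In this sketch, calling `generate(image_features, W_img, "de")` plays the role of zero-shot captioning in German, and `generate(text_features, W_txt, "fr")` plays the role of zero-shot translation into French. The decoder is only ever trained on text coordinates, which is why aligning the domains in step (ii) matters.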