Language, Vision and Action are Better Together

Jason Baldridge
Companion Proceedings of the Web Conference 2021. Published 2021-04-19. DOI: 10.1145/3442442.3451897 (https://doi.org/10.1145/3442442.3451897)

Abstract

Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language (including speech) in the context of other modalities and environments is needed, and there has never been a better time to do it. Without ever invoking the worn-out, overblown phrase “how babies learn” in the talk, I’ll cover three of my team’s efforts involving language, vision and action. First: our work on speech-image representation learning and retrieval, where we demonstrate settings in which directly encoding speech outperforms the hard-to-beat strategy of using automatic speech recognition and strong text encoders. Second: two models for text-to-image generation, a multi-stage model which exploits user guidance in the form of mouse traces and a single-stage one which uses cross-modal contrastive losses. Third: Room-across-Room, a multilingual dataset for vision-and-language navigation, for which we collected spoken navigation instructions, high-quality text transcriptions, and fine-grained alignments between words and pixels in high-definition 360-degree panoramas. I’ll wrap up with some thoughts on how work on computational language grounding more broadly presents new opportunities to enhance and advance our scientific understanding of language and its fundamental role in human intelligence.