Omissions and inferential meaning-making in audio description, and implications for automating video content description

Universal Access in the Information Society | IF 2.1 | Zone 4, Computer Science | Q3, COMPUTER SCIENCE, CYBERNETICS | Pub Date: 2023-10-08 | DOI: 10.1007/s10209-023-01045-3

Kim Starr, Sabine Braun


Citations: 0



Omissions and inferential meaning-making in audio description, and implications for automating video content description

Abstract: There is broad consensus that audio description (AD) is a modality of intersemiotic translation, but there are different views on how AD can be more precisely conceptualised. While Benecke (Audiodeskription als partielle Translation. Modell und Methode, LIT, Berlin, 2014) characterises AD as ‘partial translation’, Braun (T 28: 302–313, 2016) hypothesises that what audio describers appear to ‘omit’ from their descriptions can normally be inferred by the audience, drawing on narrative cues from dialogue, mise-en-scène, kinesis, music or sound effects. The study reported in this paper tested this hypothesis using a corpus of material created during the H2020 MeMAD project. The MeMAD project aimed to improve access to audiovisual (AV) content through a combination of human and computer-based methods of description. One of the MeMAD workstreams addressed human approaches to describing visually salient cues. This included an analysis of the potential impact of omissions in AD, which is the focus of this paper. Using a corpus of approximately 500 audio-described film extracts, we identified the visual elements that can be considered essential for the construction of the filmic narrative and then performed a qualitative analysis of the corresponding audio descriptions to determine how these elements are verbally represented and whether any omitted elements could be inferred from other cues that are accessible to visually impaired audiences. We then identified the most likely source of these inferences and the conditions upon which retrieval could be predicated, preparing the ground for future reception studies to test our hypotheses with target audiences. In this paper, we discuss the methodology used to determine where omissions occur in the analysed audio descriptions, consider worked examples from the MeMAD500 film corpus, and outline the findings of our study, namely that various strategies are relevant to inferring omitted information, including the use of proximal and distal contextual cues, and reliance on the application of common knowledge and iconic scenarios. To conclude, consideration is given to overcoming significant omissions in human-generated AD, such as by using extended AD formats, and to mitigating similar gaps in machine-generated descriptions, where incorporating dialogue analysis and other supplementary data into the computer model could resolve many omissions.


Source journal

Universal Access in the Information Society (COMPUTER SCIENCE, CYBERNETICS)

CiteScore: 6.10

Self-citation rate: 16.70%

Review time: >12 weeks

About the journal: Universal Access in the Information Society (UAIS) is an international, interdisciplinary refereed journal that solicits original research contributions addressing the accessibility, usability, and, ultimately, acceptability of Information Society Technologies by anyone, anywhere, at any time, and through any media and device. Universal access refers to the conscious and systematic effort to proactively apply principles, methods and tools of universal design in order to develop Information Society Technologies that are accessible and usable by all citizens, including the very young and the elderly and people with different types of disabilities, thus avoiding the need for a posteriori adaptations or specialized design. The journal's unique focus is on theoretical, methodological, and empirical research, of both technological and non-technological nature, that addresses equitable access and active participation of potentially all citizens in the information society.