{"title":"改进人工智能辅助视频编辑:通过多任务学习优化镜头分析","authors":"","doi":"10.1016/j.neucom.2024.128485","DOIUrl":null,"url":null,"abstract":"<div><p>In recent years, AI-assisted video editing has shown promising applications. Understanding and analyzing camera language accurately is fundamental in video editing, guiding subsequent editing and production processes. However, many existing methods for camera language analysis overlook computational efficiency and deployment requirements in favor of improving classification accuracy. Consequently, they often fail to meet the demands of scenarios with limited computing power, such as mobile devices. To address this challenge, this paper proposes an efficient multi-task camera language analysis pipeline based on shared representations. This approach employs a multi-task learning architecture with hard parameter sharing, enabling different camera language classification tasks to utilize the same low-level feature extraction network, thereby implicitly learning feature representations of the footage. Subsequently, each classification sub-task independently learns the high-level semantic information corresponding to the camera language type. This method significantly reduces computational complexity and memory usage while facilitating efficient deployment on devices with limited computing power. Furthermore, to enhance performance, we introduce a dynamic task priority strategy and a conditional dataset downsampling strategy. The experimental results demonstrate that achieved a comprehensive accuracy surpassing all previous methods. Moreover, training time was reduced by 66.33%, inference cost decreased by 59.85%, and memory usage decreased by 31.95% on the 2-task dataset MovieShots; on the 4-task dataset AVE, training time was reduced by 95.34%, inference cost decreased by 97.23%, and memory usage decreased by 61.21%.</p></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":null,"pages":null},"PeriodicalIF":5.5000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving AI-assisted video editing: Optimized footage analysis through multi-task learning\",\"authors\":\"\",\"doi\":\"10.1016/j.neucom.2024.128485\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In recent years, AI-assisted video editing has shown promising applications. Understanding and analyzing camera language accurately is fundamental in video editing, guiding subsequent editing and production processes. However, many existing methods for camera language analysis overlook computational efficiency and deployment requirements in favor of improving classification accuracy. Consequently, they often fail to meet the demands of scenarios with limited computing power, such as mobile devices. To address this challenge, this paper proposes an efficient multi-task camera language analysis pipeline based on shared representations. This approach employs a multi-task learning architecture with hard parameter sharing, enabling different camera language classification tasks to utilize the same low-level feature extraction network, thereby implicitly learning feature representations of the footage. Subsequently, each classification sub-task independently learns the high-level semantic information corresponding to the camera language type. 
This method significantly reduces computational complexity and memory usage while facilitating efficient deployment on devices with limited computing power. Furthermore, to enhance performance, we introduce a dynamic task priority strategy and a conditional dataset downsampling strategy. The experimental results demonstrate that achieved a comprehensive accuracy surpassing all previous methods. Moreover, training time was reduced by 66.33%, inference cost decreased by 59.85%, and memory usage decreased by 31.95% on the 2-task dataset MovieShots; on the 4-task dataset AVE, training time was reduced by 95.34%, inference cost decreased by 97.23%, and memory usage decreased by 61.21%.</p></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224012566\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224012566","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Improving AI-assisted video editing: Optimized footage analysis through multi-task learning
In recent years, AI-assisted video editing has shown promising applications. Accurately understanding and analyzing camera language is fundamental to video editing, guiding subsequent editing and production processes. However, many existing methods for camera language analysis overlook computational efficiency and deployment requirements in favor of improving classification accuracy. Consequently, they often fail to meet the demands of scenarios with limited computing power, such as mobile devices. To address this challenge, this paper proposes an efficient multi-task camera language analysis pipeline based on shared representations. The approach employs a multi-task learning architecture with hard parameter sharing, enabling different camera language classification tasks to use the same low-level feature extraction network and thereby implicitly learn feature representations of the footage. Each classification sub-task then independently learns the high-level semantic information corresponding to its camera language type. This design significantly reduces computational complexity and memory usage while facilitating efficient deployment on devices with limited computing power. Furthermore, to enhance performance, we introduce a dynamic task priority strategy and a conditional dataset downsampling strategy. The experimental results demonstrate that the proposed method achieves an overall accuracy surpassing all previous methods. Moreover, on the 2-task dataset MovieShots, training time was reduced by 66.33%, inference cost by 59.85%, and memory usage by 31.95%; on the 4-task dataset AVE, training time was reduced by 95.34%, inference cost by 97.23%, and memory usage by 61.21%.
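To make the hard-parameter-sharing idea concrete, the following minimal PyTorch sketch (not the paper's implementation; the layer sizes, the placeholder backbone, the task names "shot_scale" and "shot_movement", and the focal-style weighting are all illustrative assumptions) shows one shared low-level feature extractor whose output is computed once and reused by independent per-task classification heads, plus a hypothetical loss weighting in the spirit of a dynamic task priority strategy.

import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    # Hard parameter sharing: one shared low-level feature extractor
    # feeds several independent task-specific classification heads.
    def __init__(self, feat_dim=512, task_classes=None):
        super().__init__()
        # Hypothetical sub-tasks and class counts, for illustration only.
        task_classes = task_classes or {"shot_scale": 5, "shot_movement": 4}
        # Placeholder shared backbone; the paper's actual extractor differs.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # One lightweight head per camera-language sub-task.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, n) for name, n in task_classes.items()}
        )

    def forward(self, clip):
        shared = self.backbone(clip)  # shared features, computed once per clip
        return {name: head(shared) for name, head in self.heads.items()}

def dynamic_priority_loss(logits, labels, running_acc, gamma=2.0):
    # Hypothetical dynamic task priority: down-weight tasks that are
    # already accurate so that harder tasks dominate the gradient.
    ce = nn.CrossEntropyLoss()
    total = 0.0
    for name, out in logits.items():
        weight = (1.0 - running_acc.get(name, 0.0)) ** gamma
        total = total + weight * ce(out, labels[name])
    return total

model = SharedBackboneMultiTask()
clip = torch.randn(2, 3, 8, 64, 64)  # (batch, channels, frames, height, width)
outputs = model(clip)                # one logits tensor per sub-task

Because the shared backbone runs once per clip regardless of how many sub-tasks are attached, inference cost grows only with the small per-task heads; this is consistent with the abstract's larger reported savings on the 4-task AVE dataset than on the 2-task MovieShots dataset.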
Journal Introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal covers neurocomputing theory, practice, and applications.