{"title":"A dense video caption dataset of student classroom behaviors and a baseline model with boundary semantic awareness","authors":"Yong Wu , Jinyu Tian , HuiJun Liu , Yuanyan Tang","doi":"10.1016/j.displa.2024.102804","DOIUrl":null,"url":null,"abstract":"<div><p>Dense video captioning automatically locates events in untrimmed videos and describes event contents through natural language. This task has many potential applications, including security, assisting people who are visually impaired, and video retrieval. The related datasets constitute an important foundation for research on data-driven methods. However, the existing models for building dense video caption datasets were designed for the universal domain, often ignoring the characteristics and requirements of a specific domain. In addition, the one-way dataset construction process cannot form a closed-loop iterative scheme to improve the quality of the dataset. Therefore, this paper proposes a novel dataset construction model that is suitable for classroom-specific scenarios. On this basis, the Dense Video Caption Dataset of Student Classroom Behaviors (SCB-DVC) is constructed. Additionally, the existing dense video captioning methods typically utilize only temporal event boundaries as direct supervisory information during localization and fail to consider semantic information. This results in a limited correlation between the localization and captioning stages. This defect makes it more difficult to locate events in videos with oversmooth boundaries (due to the excessive similarity between the foregrounds and backgrounds (temporal domains) of events). Therefore, we propose a fine-grained semantic-aware assisted boundary localization-based dense video captioning method. This method enhances the ability to effectively learn the differential features between the foreground and background of an event by introducing semantic-aware information. It can provide increased boundary perception and achieve more accurate captions. Experimental results show that the proposed method performs well on both the SCB-DVC dataset and public datasets (ActivityNet Captions, YouCook2 and TACoS). We will release the SCB-DVC dataset soon.</p></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"84 ","pages":"Article 102804"},"PeriodicalIF":3.7000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938224001689","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
Abstract
Dense video captioning automatically locates events in untrimmed videos and describes their contents in natural language. This task has many potential applications, including security, assistance for people who are visually impaired, and video retrieval. The related datasets constitute an important foundation for research on data-driven methods. However, the existing models for building dense video caption datasets were designed for the general domain and often ignore the characteristics and requirements of a specific domain. In addition, their one-way construction process cannot form a closed-loop iterative scheme for improving dataset quality. This paper therefore proposes a novel dataset construction model suited to classroom scenarios and, on this basis, builds the Dense Video Caption Dataset of Student Classroom Behaviors (SCB-DVC). Additionally, existing dense video captioning methods typically use only temporal event boundaries as direct supervision during localization and fail to consider semantic information, which limits the correlation between the localization and captioning stages. This defect makes it harder to locate events whose boundaries are over-smooth, i.e., where the temporal foreground and background of an event are overly similar. We therefore propose a dense video captioning method based on fine-grained semantic-aware assisted boundary localization. By introducing semantic-aware information, the method learns the discriminative features between the foreground and background of an event more effectively, providing sharper boundary perception and more accurate captions. Experimental results show that the proposed method performs well on both the SCB-DVC dataset and public datasets (ActivityNet Captions, YouCook2 and TACoS). We will release the SCB-DVC dataset soon.
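To make the boundary-semantic idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' released model: a recurrent localizer scores each clip of an untrimmed video as event foreground or background, while an auxiliary semantic-aware loss pushes the mean foreground and background clip embeddings apart, so boundaries between similar segments become easier to learn. The module names (BoundaryAwareLocalizer, semantic_contrast_loss), feature dimensions, margin, and loss weight are all illustrative assumptions.

```python
# Hypothetical sketch of semantic-aware boundary localization; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BoundaryAwareLocalizer(nn.Module):
    """Predicts per-clip foreground probabilities from pre-extracted clip features."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fg_head = nn.Linear(2 * hidden_dim, 1)      # foreground/background score per clip
        self.sem_proj = nn.Linear(2 * hidden_dim, 128)   # projection used by the semantic-aware loss

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (batch, num_clips, feat_dim)
        enc, _ = self.encoder(clip_feats)
        fg_logits = self.fg_head(enc).squeeze(-1)          # (batch, num_clips)
        sem_emb = F.normalize(self.sem_proj(enc), dim=-1)  # (batch, num_clips, 128)
        return fg_logits, sem_emb


def semantic_contrast_loss(sem_emb, fg_mask, margin: float = 0.5):
    """Hinge loss that encourages foreground and background clip embeddings to differ.

    sem_emb: (batch, num_clips, d) normalized clip embeddings
    fg_mask: (batch, num_clips) binary mask, 1 for clips inside an event
    """
    fg_mask = fg_mask.float()
    bg_mask = 1.0 - fg_mask
    # Mean embedding of foreground and background clips for each video.
    fg_center = (sem_emb * fg_mask.unsqueeze(-1)).sum(1) / fg_mask.sum(1, keepdim=True).clamp(min=1)
    bg_center = (sem_emb * bg_mask.unsqueeze(-1)).sum(1) / bg_mask.sum(1, keepdim=True).clamp(min=1)
    sim = F.cosine_similarity(fg_center, bg_center, dim=-1)
    return F.relu(sim - (1.0 - margin)).mean()  # penalize centers that are too similar


if __name__ == "__main__":
    model = BoundaryAwareLocalizer()
    feats = torch.randn(2, 100, 512)              # 2 videos, 100 clips each
    fg_mask = (torch.rand(2, 100) > 0.7).long()   # toy ground-truth foreground mask
    fg_logits, sem_emb = model(feats)
    loc_loss = F.binary_cross_entropy_with_logits(fg_logits, fg_mask.float())
    sem_loss = semantic_contrast_loss(sem_emb, fg_mask)
    total = loc_loss + 0.5 * sem_loss             # 0.5 is an illustrative weight
    print(float(loc_loss), float(sem_loss), float(total))
```

In a full dense video captioning system, the predicted foreground scores would be converted into event proposals and passed to a caption decoder; only the localization side with its auxiliary semantic objective is sketched here.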
Journal Description:
Displays is the international journal covering the research and development of display technology, the effective presentation and perception of information, and applications and systems including the display-human interface.
Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technologists and human factors engineers new to the field, will also occasionally be featured.