Face recognition in unconstrained surveillance videos is challenging due to varying acquisition settings and face variations. We propose to exploit the complementary correlation between multiple frames to improve face recognition performance. We design an algorithm to build a representative frame set from the video sequence, selecting faces with high quality and large appearance diversity. We also devise a refined Deep Residual Equivariant Mapping (DREAM) block to improve the discriminative power of the extracted deep features. Extensive experiments on two relevant face recognition benchmarks, YouTube Faces and IJB-A, show the effectiveness of the proposed method. Our method is also lightweight and can be easily embedded into existing CNN-based face recognition systems.
{"title":"Improving face recognition in surveillance video with judicious selection and fusion of representative frames","authors":"Zhaozhen Ding, Qingfang Zheng, Chunhua Hou, Guang Shen","doi":"10.1145/3444685.3446259","DOIUrl":"https://doi.org/10.1145/3444685.3446259","url":null,"abstract":"Face recognition in unconstrained surveillance videos is challenging due to the different acquisition settings and face variations. We propose to utilize the complementary correlation between multi-frames to improve face recognition performance. We design an algorithm to build a representative frame set from the video sequence, selecting faces with high quality and large appearance diversity. We also devise a refined Deep Residual Equivariant Mapping (DREAM) block to improve the discriminative power of the extracted deep features. Extensive experiments on two relevant face recognition benchmarks, YouTube Face and IJB-A, show the effectiveness of the proposed method. Our work is also lightweight, and can be easily embedded into existing CNN based face recognition systems.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"1000 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123101760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Domain adaptation has received much attention for its efficiency in dealing with cross-domain learning tasks. Most existing domain adaptation methods adopt strategies that rely on large amounts of source label information, which limits their application in real-world settings where only a few labeled samples are available. We exploit local geometric connections to tackle this problem and propose a Local Structure Alignment (LSA) guided domain adaptation method in this paper. LSA leverages the Nyström method to describe the distribution difference from a geometric perspective and then performs distribution alignment between domains. Specifically, LSA constructs a domain-invariant Hessian matrix to locally connect the data of the two domains by minimizing the Nyström approximation error. It then integrates the domain-invariant Hessian matrix with semi-supervised learning and finally builds an adaptive semi-supervised model. Extensive experimental results validate that the proposed LSA outperforms traditional domain adaptation methods, especially when only sparse source label information is available.
{"title":"Local structure alignment guided domain adaptation with few source samples","authors":"Yuying Cai, Jinfeng Li, Baodi Liu, Weifeng Liu, Kai Zhang, Changsheng Xu","doi":"10.1145/3444685.3446327","DOIUrl":"https://doi.org/10.1145/3444685.3446327","url":null,"abstract":"Domain adaptation has received lots of attention for its high efficiency in dealing with cross-domain learning tasks. Most existing domain adaptation methods adopt the strategies relying on large amounts of source label information, which limits their applications in the real world where only a few label samples are available. We exploit the local geometric connections to tackle this problem and propose a Local Structure Alignment (LSA) guided domain adaptation method in this paper. LSA leverages the Nyström method to describe the distribution difference from the geometric perspective and then perform the distribution alignment between domains. Specifically, LSA constructs a domain-invariant Hessian matrix to locally connect the data of the two domains through minimizing the Nyström approximation error. And then it integrates the domain-invariant Hessian matrix with the semi-supervised learning and finally builds an adaptive semi-supervised model. Extensive experimental results validate that the proposed LSA outperforms the traditional domain adaptation methods especially when only sparse source label information is available.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116463023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, a non-iterative solution to non-perspective pose estimation from line correspondences is proposed. Specifically, the proposed method uses an intermediate camera frame and an intermediate world frame, which simplifies the expression of the rotation matrix R by reducing its degrees of freedom from three to two. The pose estimation problem is then formulated as an optimization problem. Our method solves for the rotation parameters by building fifteenth-order and fourth-order univariate polynomials. The proposed method can also be applied to pose estimation for perspective cameras. We use both simulated data and real data to conduct comparative experiments. The experimental results show that the proposed method is comparable to or better than existing methods in terms of accuracy, stability and efficiency.
{"title":"Intermediate coordinate based pose non-perspective estimation from line correspondences","authors":"Yujia Cao, Zhichao Cui, Yuehu Liu, Xiaojun Lv, K.C.C. Peng","doi":"10.1145/3444685.3446299","DOIUrl":"https://doi.org/10.1145/3444685.3446299","url":null,"abstract":"In this paper, a non-iterative solution to the non-perspective pose estimation from line correspondences was proposed. Specifically, the proposed method uses an intermediate camera frame and an intermediate world frame, which simplifies the expression of rotation matrix by reducing to the two freedoms from three in the rotation matrix R. Then formulate the pose estimation problem into an optimal problem. Our method solve the parameters of rotation matrix by building the fifteenth-order and fourth-order univariate polynomial. The proposed method can be applied into the pose estimation of the perspective camera. We utilize both the simulated data and real data to conduct the comparative experiments. The experimental results show that the proposed method is comparable or better than existing methods in the aspects of accuracy, stability and efficiency.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129576671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing regression-based deep trackers usually localize a target based on a response map, where the highest peak response corresponds to the predicted target location. Nevertheless, when background distractors appear or the target scale changes frequently, the response map is prone to producing multiple sub-peak responses that interfere with model prediction. In this paper, we propose a robust online tracking method via Scale-Aware localization and Peak Response strength (SAPR), which can learn a discriminative model predictor to estimate the target state accurately. Specifically, to cope with large scale variations, we propose a Scale-Aware Localization (SAL) module that provides multi-scale response maps based on a scale pyramid scheme. Furthermore, to focus on the target response, we propose a simple yet effective Peak Response Strength (PRS) module to fuse the multi-scale response maps with the response maps generated by a correlation filter. According to the response map with the maximum classification score, the model predictor iteratively updates its filter weights for accurate target state estimation. Experimental results on three benchmark datasets, OTB100, VOT2018 and LaSOT, demonstrate that the proposed SAPR accurately estimates the target state, achieving favorable performance against several state-of-the-art trackers.
{"title":"Robust visual tracking via scale-aware localization and peak response strength","authors":"Ying Wang, Luo Xiong, Kaiwen Du, Yan Yan, Hanzi Wang","doi":"10.1145/3444685.3446274","DOIUrl":"https://doi.org/10.1145/3444685.3446274","url":null,"abstract":"Existing regression-based deep trackers usually localize a target based on a response map, where the highest peak response corresponds to the predicted target location. Nevertheless, when the background distractors appear or the target scale changes frequently, the response map is prone to produce multiple sub-peak responses to interfere with model prediction. In this paper, we propose a robust online tracking method via Scale-Aware localization and Peak Response strength (SAPR), which can learn a discriminative model predictor to estimate a target state accurately. Specifically, to cope with large scale variations, we propose a Scale-Aware Localization (SAL) module to provide multi-scale response maps based on the scale pyramid scheme. Furthermore, to focus on the target response, we propose a simple yet effective Peak Response Strength (PRS) module to fuse the multi-scale response maps and the response maps generated by a correlation filter. According to the response map with the maximum classification score, the model predictor iteratively updates its filter weights for accurate target state estimation. Experimental results on three benchmark datasets, including OTB100, VOT2018 and LaSOT, demonstrate that the proposed SAPR accurately estimates the target state, achieving the favorable performance against several state-of-the-art trackers.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130979850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most existing text-to-image generation methods focus on synthesizing images from text descriptions alone, which cannot meet the requirement of generating desired objects on given backgrounds. In this paper, we propose a Background-induced Generative Network (BGNet) that combines attention mechanisms, background synthesis, and a multi-level discriminator to generate realistic images with given backgrounds according to text descriptions. BGNet takes multi-stage generation as its basic framework to generate fine-grained images and introduces a hybrid attention mechanism to capture the local semantic correlation between texts and images. To adjust the impact of the given backgrounds on the synthesized images, synthesis blocks are added at each stage of image generation; they appropriately combine the foreground objects generated from the text descriptions with the given background images. Besides, a multi-level discriminator and its corresponding loss function are proposed to optimize the synthesized images. Experimental results on the CUB bird dataset demonstrate the superiority of our method and its ability to generate realistic images with given backgrounds.
{"title":"A background-induced generative network with multi-level discriminator for text-to-image generation","authors":"Ping Wang, Li Liu, Huaxiang Zhang, Tianshi Wang","doi":"10.1145/3444685.3446291","DOIUrl":"https://doi.org/10.1145/3444685.3446291","url":null,"abstract":"Most existing text-to-image generation methods focus on synthesizing images using only text descriptions, but this cannot meet the requirement of generating desired objects with given backgrounds. In this paper, we propose a Background-induced Generative Network (BGNet) that combines attention mechanisms, background synthesis, and multi-level discriminator to generate realistic images with given backgrounds according to text descriptions. BGNet takes a multi-stage generation as the basic framework to generate fine-grained images and introduces a hybrid attention mechanism to capture the local semantic correlation between texts and images. To adjust the impact of the given backgrounds on the synthesized images, synthesis blocks are added at each stage of image generation, which appropriately combines the foreground objects generated by the text descriptions with the given background images. Besides, a multi-level discriminator and its corresponding loss function are proposed to optimize the synthesized images. The experimental results on the CUB bird dataset demonstrate the superiority of our method and its ability to generate realistic images with given backgrounds.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133740910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Texture transfer has been successfully applied in computer vision and computer graphics. Since non-stationary textures are usually complex and anisotropic, it is challenging to transfer them with simple supervised methods. In this paper, we propose a general solution for non-stationary texture transfer that preserves the local structure and visual richness of textures. The inputs to our framework are a source texture and a semantic annotation pair. We treat different semantics as different regions and obtain color and distribution information from each region, which is used to guide the low-level texture transfer algorithm. Specifically, we exploit these local distributions to regularize the texture transfer objective function, which is minimized by iterative search and voting steps. In the search step, we find the nearest-neighbor fields from the source image to the target image using the Generalized PatchMatch (GPM) algorithm. In the voting step, we calculate histogram weights and coherence weights for different semantic regions to ensure color accuracy and texture continuity, and to further transfer the textures from the source to the target. By comparing with state-of-the-art algorithms, we demonstrate the effectiveness and superiority of our technique on various non-stationary textures.
{"title":"Transfer non-stationary texture with complex appearance","authors":"Cheng Peng, Na Qi, Qing Zhu","doi":"10.1145/3444685.3446297","DOIUrl":"https://doi.org/10.1145/3444685.3446297","url":null,"abstract":"Texture transfer has been successfully applied in computer vision and computer graphics. Since non-stationary textures are usually complex and anisotropic, it is challenging to transfer these textures by simple supervised method. In this paper, we propose a general solution for non-stationary texture transfer, which can preserve the local structure and visual richness of textures. The inputs of our framework are source texture and semantic annotation pair. We record different semantics as different regions and obtain the color and distribution information from different regions, which is used to guide the the low-level texture transfer algorithm. Specifically, we exploit these local distributions to regularize the texture transfer objective function, which is minimized by iterative search and voting steps. In the search step, we search the nearest neighbor fields of source image to target image through Generalized PatchMatch (GPM) algorithm. In the voting step, we calculate histogram weights and coherence weights for different semantic regions to ensure color accuracy and texture continuity, and to further transfer the textures from the source to the target. By comparing with state-of-the-art algorithms, we demonstrate the effectiveness and superiority of our technique in various non-stationary textures.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125112176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Remote production is an emerging concept concerning the outside-broadcasting workflow enabled by Internet Protocol (IP)-based production systems, and it is expected to be much more efficient than the conventional workflow. However, long-distance transmission of uncompressed video signals and time synchronization of distributed IP-video devices are challenging. A system architecture for remote production using optical transponders (capable of long-distance and large-capacity optical communication) is proposed. A field experiment confirmed that uncompressed video signals can be transmitted successfully by this architecture. The status monitoring of uncompressed video transmission in remote production is also challenging. To address the challenge, a method for automatically monitoring the status of IP-video devices is also proposed. The monitoring system was implemented by using whitebox transponders, and it was confirmed that the system can automatically register IP-video devices, generate an IP-video flow model, and detect traffic anomalies.
{"title":"A novel system architecture and an automatic monitoring method for remote production","authors":"Yasuhiro Mochida, D. Shirai, Takahiro Yamaguchi, S. Kuwabara, H. Nishizawa","doi":"10.1145/3444685.3446277","DOIUrl":"https://doi.org/10.1145/3444685.3446277","url":null,"abstract":"Remote production is an emerging concept concerning the outside-broadcasting workflow enabled by Internet Protocol (IP)-based production systems, and it is expected to be much more efficient than the conventional workflow. However, long-distance transmission of uncompressed video signals and time synchronization of distributed IP-video devices are challenging. A system architecture for remote production using optical transponders (capable of long-distance and large-capacity optical communication) is proposed. A field experiment confirmed that uncompressed video signals can be transmitted successfully by this architecture. The status monitoring of uncompressed video transmission in remote production is also challenging. To address the challenge, a method for automatically monitoring the status of IP-video devices is also proposed. The monitoring system was implemented by using whitebox transponders, and it was confirmed that the system can automatically register IP-video devices, generate an IP-video flow model, and detect traffic anomalies.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124317499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a multi-focus noisy image fusion algorithm combining gradient regularized convolutional sparse representation and spatial frequency. First, the source image is decomposed into a base layer and a detail layer through two-scale image decomposition. The detail layer uses the Alternating Direction Method of Multipliers (ADMM) to solve for the convolutional sparse coefficients with gradient penalties and complete the fusion of the detail layer coefficients. Then the base layer uses spatial frequency to judge the focus area; spatial frequency and the "choose-max" strategy are applied to achieve the multi-focus fusion result of the base layer. Finally, the fused image is computed as a superposition of the base layer and the detail layer. Experimental results show that, compared with other algorithms, this algorithm provides excellent subjective visual perception and objective evaluation metrics.
{"title":"Multi-focus noisy image fusion based on gradient regularized convolutional sparse representatione","authors":"Xuanjing Shen, Yunqi Zhang, Haipeng Chen, Di Gai","doi":"10.1145/3444685.3446325","DOIUrl":"https://doi.org/10.1145/3444685.3446325","url":null,"abstract":"The method proposes a multi-focus noisy image fusion algorithm combining gradient regularized convolutional sparse representatione and spatial frequency. Firstly, the source image is decomposed into a base layer and a detail layer through two-scale image decomposition. The detail layer uses the Alternating Direction Method of Multipliers (ADMM) to solve the convolutional sparse coefficients with gradient penalties to complete the fusion of detail layer coefficients. Then, The base layer uses the spatial frequency to judge the focus area, the spatial frequency and the \"choose-max\" strategy are applied to achieved the multi-focus fusion result of base layer. Finally, the fused image is calculated as a superposition of the base layer and the detail layer. Experimental results show that compared with other algorithms, this algorithm provides excellent subjective visual perception and objective evaluation metrics.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124366174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Video Browser Showdown (VBS) has influenced the multimedia community for 10 years now. More than 30 unique teams from over 21 countries have participated in the VBS since 2012. In 2021, we are celebrating the 10th anniversary of VBS, where 17 international teams compete against each other in an unprecedented contest of fast and accurate multimedia retrieval. In this tutorial we discuss the motivation and details of the VBS contest, including its history, rules, evaluation metrics, and achievements for multimedia retrieval. We talk about the properties of specific VBS retrieval systems and their unique characteristics, as well as existing open-source tools that can be used as a starting point for first-time participants. Participants of this tutorial gain a detailed understanding of the VBS and its search systems, and see the latest developments in interactive video retrieval.
{"title":"10 years of video browser showdown","authors":"K. Schoeffmann, Jakub Lokoč, W. Bailer","doi":"10.1145/3444685.3450215","DOIUrl":"https://doi.org/10.1145/3444685.3450215","url":null,"abstract":"The Video Browser Showdown (VBS) has influenced the Multimedia community already for 10 years now. More than 30 unique teams from over 21 countries participated in the VBS since 2012 already. In 2021, we are celebrating the 10th anniversary of VBS, where 17 international teams compete against each other in an unprecedented contest of fast and accurate multimedia retrieval. In this tutorial we discuss the motivation and details of the VBS contest, including its history, rules, evaluation metrics, and achievements for multimedia retrieval. We talk about the properties of specific VBS retrieval systems and their unique characteristics, as well as existing open-source tools that can be used as a starting point for participating for the first time. Participants of this tutorial get a detailed understanding of the VBS and its search systems, and see the latest developments of interactive video retrieval.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122570176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the information explosion era, people want to access only the news information they are interested in. Story segmentation for news broadcasts is therefore strongly needed, as an essential basis for personalized delivery and short videos. Existing advanced story boundary segmentation methods utilize the semantic similarity of subtitles, thus entailing complex semantic computation. The title texts of news broadcast programs include headline (or primary) captions, dialogue captions and the channel logo, while a single story clip renders only one primary caption in most news broadcasts. Inspired by this fact, we propose a simple method for story segmentation based on the primary caption, which combines YOLOv3-based primary caption extraction with preliminary localization of boundaries. In particular, we introduce a mean hash to achieve fast and reliable comparison of detected small-size primary caption blocks. We further incorporate scene recognition to refine the preliminary boundaries, because primary captions always appear later than the story boundary. Experimental results on two Chinese news broadcast datasets show that our method achieves high accuracy in terms of R, P and F1 measures.
{"title":"Story segmentation for news broadcast based on primary caption","authors":"Heling Chen, Zhongyuan Wang, Yingjiao Pei, Baojin Huang, Weiping Tu","doi":"10.1145/3444685.3446298","DOIUrl":"https://doi.org/10.1145/3444685.3446298","url":null,"abstract":"In the information explosion era, people only want to access the news information that they are interested in. News broadcast story segmentation is strongly needed, which is an essential basis for personalized delivery and short video. The existing advanced story boundary segmentation methods utilize semantic similarity of subtitles, thus entailing complex semantic computation. The title texts of news broadcast programs include headline (or primary) captions, dialogue captions and the channel logo, while the same story clips only render one primary caption in most news broadcast. Inspired by this fact, we propose a simple method for story segmentation based on the primary caption, which combines YOLOv3 based primary caption extraction and preliminary location of boundaries. In particular, we introduce mean hash to achieve the fast and reliable comparison for detected small-size primary caption blocks. We further incorporate scene recognition to exact the preliminary boundaries, because the primary captions always appear later than the story boundary. Experimental results on two Chinese news broadcast datasets show that our method enjoys high accuracy in terms of R, P and F1-measures.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131264427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}