Memomusic Version 2.0: Extending Personalized Music Recommendation with Automatic Music Generation
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859356
Luntian Mou, Yiyuan Zhao, Quan Hao, Yunhan Tian, Juehui Li, Jueying Li, Yiqi Sun, Feng Gao, Baocai Yin
Music emotion is a highly subjective and personal experience. We therefore previously developed a personalized music recommendation system called MemoMusic, which navigates listeners toward more positive emotional states based not only on music emotion but also on possible memories aroused by the music. In this paper, we extend MemoMusic with automatic music generation based on an LSTM network, which learns the characteristics of a short music clip with particular Valence and Arousal values and predicts a new music sequence in a similar style. We call this enhanced system MemoMusic Version 2.0. For the experiments, a new dataset of 177 MIDI pieces drawn from three categories (Classical, Popular, and Yanni) was collected and labelled using the Valence-Arousal model. Experimental results further demonstrate that memory is an influencing factor in perceived music emotion, and that MemoMusic Version 2.0 can moderately navigate listeners toward better emotional states.
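Below is a minimal sketch of the kind of LSTM next-note model the abstract describes: it learns from short note sequences and autoregressively samples a continuation in a similar style. The class name `NoteLSTM`, the 128-token pitch vocabulary, and the sampling routine are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an LSTM next-note predictor (illustrative; not the authors' code).
# Assumes notes are already encoded as integer tokens, e.g. MIDI pitches 0-127.
import torch
import torch.nn as nn

class NoteLSTM(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)                # (batch, seq, embed)
        out, state = self.lstm(x, state)      # (batch, seq, hidden)
        return self.head(out), state          # logits over the next note

@torch.no_grad()
def continue_sequence(model, seed, length=64, temperature=1.0):
    """Feed a short seed clip (1, seed_len) and autoregressively sample a continuation."""
    model.eval()
    tokens = seed.clone()
    logits, state = model(tokens)
    for _ in range(length):
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)     # sample the next note token
        tokens = torch.cat([tokens, nxt], dim=1)
        logits, state = model(nxt, state)
    return tokens
```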
{"title":"Memomusic Version 2.0: Extending Personalized Music Recommendation with Automatic Music Generation","authors":"Luntian Mou, Yiyuan Zhao, Quan Hao, Yunhan Tian, Juehui Li, Jueying Li, Yiqi Sun, Feng Gao, Baocai Yin","doi":"10.1109/ICMEW56448.2022.9859356","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859356","url":null,"abstract":"Music emotion experience is a rather subjective and personalized issue. Therefore, we previously developed a personalized music recommendation system called MemoMusic to navigate listeners to more positive emotional states based not only on music emotion but also on possible memories aroused by music. In this paper, we propose to extend MemoMusic with automatic music generation based on an LSTM network, which can learn the characteristic of a tiny music clip with particular Valence and Arousal values and predict a new music sequence with similar music style. We call this enhanced system MemoMusic Verison 2.0. For experiment, a new dataset of 177 music in MIDI format was collected and labelled using the Valence-Arousal model from three categories of Classical, Popular, and Yanni music. Experimental results further demonstrate that memory is an influencing factor in determining perceived music emotion, and MemoMusic Version 2.0 can moderately navigate listeners to better emotional states.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122313114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MOFN: Multi-Offset-Flow-Based Network for Video Restoration and Enhancement
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859519
Yiru Chen, Yumei Wang, Yu Liu
Video restoration and enhancement tasks, including video super-resolution (VSR), are designed to convert low-quality videos into high-quality videos to improve the audience's visual experience. In recent years, many deep learning methods using optical flow estimation or deformable convolution have been applied to video super-resolution. However, we find that motion estimation based on a single optical flow struggles to capture sufficient inter-frame information, and methods using deformable convolution lack explicit motion constraints, which limits their ability to handle fast motion. Therefore, we propose a multi-offset-flow-based network (MOFN) that makes more effective use of inter-frame information by using optical flow with offset diversity. We propose an alignment and compensation module that estimates optical flow with multiple offsets for neighbouring frames and performs frame alignment. The aligned frames are then fed into the fusion module, and high-quality frames are obtained after fusion and reconstruction. Extensive results show that the proposed model handles motion well and achieves favorable performance compared with state-of-the-art methods on several benchmark datasets.
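As a rough illustration of the alignment step described above, the sketch below warps a neighbouring frame toward the reference frame with an estimated flow field using bilinear sampling; offset diversity is mimicked by repeating the warp with several candidate flows. The function names and tensor shapes are assumptions for illustration, flow estimation itself is taken as given, and this is not the MOFN module.

```python
# Illustrative sketch of flow-based frame alignment (not the MOFN implementation).
import torch
import torch.nn.functional as F

def warp_by_flow(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels (dx, dy)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)  # base pixel coordinates
    coords = grid.unsqueeze(0) + flow                             # sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)

def align_with_offset_diversity(neighbour, flows):
    """flows: list of (B, 2, H, W) candidate flow fields; returns the stacked warps."""
    return torch.stack([warp_by_flow(neighbour, f) for f in flows], dim=1)
```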
{"title":"MOFN: Multi-Offset-Flow-Based Network for Video Restoration and Enhancement","authors":"Yiru Chen, Yumei Wang, Yu Liu","doi":"10.1109/ICMEW56448.2022.9859519","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859519","url":null,"abstract":"Video restoration and enhancement tasks, including video super-resolution(VSR), are designed to convert low-quality videos into high-quality videos to improve the audience’s visual experience. In recent years, many deep learning methods using optical flow estimation or deformable convolution have been applied to video super-resolution. However, we find that motion estimation based on a single optical flow is difficult to capture enough inter-frame information, and the method using deformable convolution lacks clear motion constraints, which affects its ability to process fast motion. Therefore, we propose a multi-offset-flow-based network (MOFN) to make more effective use of inter-frame information by using optical flow with offset diversity. We proposed an alignment and compensation module that can estimate the optical flow with multiple offsets for neighbouring frames and perform frame alignment. The aligned video frames will be fed into the fusion module, and high-quality video frames will be obtained after fusion and reconstruction. Extensive results show that our proposed model has a good ability to process motion. On several benchmark datasets, our method has achieved favorable performance compared with the most advanced methods.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131738283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pyramid-Context Guided Feature Fusion for RGB-D Semantic Segmentation
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859353
Haoming Liu, Li Guo, Zhongwen Zhou, Hanyuan Zhang
Incorporating depth information into RGB images has proven effective in semantic segmentation. Multi-modal feature fusion, which integrates depth and RGB features, is a crucial component in determining segmentation accuracy. Most existing multi-modal feature fusion schemes enhance multi-modal features via channel-wise attention modules that leverage global context information. In this work, we propose a novel pyramid-context guided fusion (PCGF) module to fully exploit the complementary information from the depth and RGB features. The proposed PCGF utilizes both local and global contexts inside the attention module to provide effective guidance for fusing cross-modal features of inconsistent semantics. Moreover, we introduce a lightweight yet practical multi-level general fusion module to combine features at multiple levels of abstraction and enable high-resolution prediction. Utilizing the proposed feature fusion modules, our Pyramid-Context Guided Network (PCGNet) learns discriminative features by taking full advantage of multi-modal and multi-level information. Comprehensive experiments demonstrate that the proposed PCGNet achieves state-of-the-art performance on two benchmark datasets, NYUDv2 and SUN-RGBD.
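The sketch below illustrates the general idea of attention-guided RGB-D fusion with both a global (pooled) and a local (convolutional) context branch gating the depth features before they are merged with the RGB features. The module structure and names are illustrative assumptions and do not reproduce the PCGF module.

```python
# Illustrative sketch of attention-guided RGB-D feature fusion (not the PCGF module).
import torch
import torch.nn as nn

class SimpleGuidedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.global_ctx = nn.Sequential(       # global context: channel gate from pooled features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.local_ctx = nn.Sequential(        # local context: spatial attention map
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        x = rgb_feat + depth_feat
        gate = self.global_ctx(x) * self.local_ctx(x)   # (B, C, 1, 1) * (B, 1, H, W)
        return rgb_feat + gate * depth_feat              # attention-weighted fusion

# usage (illustrative): fused = SimpleGuidedFusion(256)(rgb_feat, depth_feat)
```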
{"title":"Pyramid-Context Guided Feature Fusion for RGB-D Semantic Segmentation","authors":"Haoming Liu, Li Guo, Zhongwen Zhou, Hanyuan Zhang","doi":"10.1109/ICMEW56448.2022.9859353","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859353","url":null,"abstract":"Incorporating depth information into RGB images has proven its effectiveness in semantic segmentation. The multi-modal feature fusion, which integrates depth and RGB features, is a crucial component determining segmentation accuracy. Most existing multi-modal feature fusion schemes enhance multi-modal features via channel-wise attention modules which leverage global context information. In this work, we propose a novel pyramid-context guided fusion (PCGF) module to fully exploit the complementary information from the depth and RGB features. The proposed PCGF utilizes both local and global contexts inside the attention module to provide effective guidance for fusing cross-modal features of inconsistent semantics. Moreover, we introduce a lightweight yet practical multi-level general fusion module to combine the features at multiple levels of abstraction to enable high-resolution prediction. Utilizing the proposed feature fusion modules, our Pyramid-Context Guided Network (PCGNet) can learn discriminative features by taking full advantage of multi-modal and multi-level information. Our comprehensive experiments demonstrate that the proposed PCGNet achieves state-of-the-art performance on two benchmark datasets NYUDv2 and SUN-RGBD.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132627692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integer Network for Cross Platform Graph Data Lossless Compression
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859525
Ge Zhang, Huanyu He, Haiyang Wang, Weiyao Lin
Learned data compression techniques have been shown to outperform conventional ones. However, non-deterministic floating-point computation makes probability prediction inconsistent between sender and receiver, preventing practical deployment. We propose to use an integer network to address this problem, focusing on lossless compression of graph data. First, we propose an adaptive fixed-point format, AdaFixedPoint, which converts a floating-point model containing graph convolution layers into a fixed-point one with minimal precision loss, enabling deterministic lossless compression of graph data. Second, we propose QbiasFree Compensation and Bin Regularization to quantize the network with fewer bits, reducing the computation cost. Experiments show that the proposed integer network achieves successful cross-platform graph data compression and, compared with the commonly used 8 bits, reduces the average quantization bit width to 5 bits without a performance drop.
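As a toy illustration of the underlying idea, the sketch below quantizes a floating-point tensor to integers with a power-of-two scale chosen from its range, so that sender and receiver can perform identical integer arithmetic. The bit-allocation heuristic is an assumption for illustration and is not the AdaFixedPoint format.

```python
# Toy sketch of fixed-point quantization (illustrative; not AdaFixedPoint itself).
import numpy as np

def to_fixed_point(x, total_bits=8):
    """Pick a fraction length that fits the tensor's range, then quantize."""
    max_abs = np.max(np.abs(x)) + 1e-12
    int_bits = max(0, int(np.ceil(np.log2(max_abs))) + 1)   # sign bit + integer part
    frac_bits = max(0, total_bits - int_bits)
    scale = 2 ** frac_bits
    q = np.clip(np.round(x * scale), -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1)
    return q.astype(np.int32), frac_bits

def from_fixed_point(q, frac_bits):
    return q.astype(np.float32) / (2 ** frac_bits)

w = np.random.randn(4, 4).astype(np.float32)
q, f = to_fixed_point(w, total_bits=8)
print("max abs error:", np.max(np.abs(w - from_fixed_point(q, f))))
```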
{"title":"Integer Network for Cross Platform Graph Data Lossless Compression","authors":"Ge Zhang, Huanyu He, Haiyang Wang, Weiyao Lin","doi":"10.1109/ICMEW56448.2022.9859525","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859525","url":null,"abstract":"It has been witnessed that the learned data compression techniques has outperformed conventional ones. However, the non-deterministic floating-point calculation makes the probability prediction inconsistent between sender and receiver, disabling practical applications. We propose to use the integer network to relieve this problem and focus on graph data lossless compression. Firstly, we propose an adaptive fixed-point format, AdaFixedPoint, which can convert a floating-point model, which has graph convolution layers to a fixed-point one with minimal precision loss and enable deterministic graph data lossless compression. Secondly, we propose QbiasFree Compensation and Bin Regularization to quantize the network with fewer bits, relieving the computation cost. Experiments show that our proposed integer network can achieve successful cross-platform graph data compression. And compared with the commonly used 8 bits, our method remarkably decreases the quantized average bit to 5 bits, without a performance drop.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"40 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120945354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Regularized DTW in Offline Music Score-Following for Sight-Singing Based on Sol-fa Name Recognition
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859398
Rongfeng Li, Kuoxi Yu
Automatic scoring for singing evaluation has been a hot topic in recent years, and improving score following is the first step toward improving evaluation accuracy. Most commonly used methods are based on dynamic time warping (DTW), but for audio with low singing quality and inaccurate pitch, DTW often predicts onsets incorrectly. To address these problems, this paper focuses on offline score following and makes two main improvements: (1) sol-fa name recognition is performed before pitch tracking as a preprocessing step, since we cannot guarantee that a singer's pitch is correct but can assume that the sol-fa names are pronounced correctly; (2) a regularized DTW is proposed on the basis of the sol-fa name recognition. The results show that for general audio, with a tolerance of 20 ms, our algorithm improves accuracy from about 86% for ordinary DTW to about 92%, while reducing the average note prediction error by about 23 ms. For audio with a low signal-to-noise ratio and unstable voice frequency, alignment improves by about 20% compared with ordinary DTW.
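The sketch below shows a plain DTW accumulation with an added regularization term that penalizes alignment cells drifting far from the expected diagonal path (i.e., from the tempo implied by the score). This is one plausible reading of a regularized DTW and is not the paper's exact formulation.

```python
# Minimal DTW with a path-regularization term (illustrative; the paper's exact
# regularizer is not reproduced here).
import numpy as np

def regularized_dtw(cost, lam=0.1):
    """cost: (N, M) local cost matrix between performance frames and score frames."""
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Penalize cells far from the expected diagonal alignment.
            drift = abs(i / (n - 1) - j / (m - 1)) if n > 1 and m > 1 else 0.0
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + lam * drift + best_prev
    return acc[-1, -1]
```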
{"title":"Regularized DTW in Offline Music Score-Following for Sight-Singing Based on Sol-fa Name Recognition","authors":"Rongfeng Li, Kuoxi Yu","doi":"10.1109/ICMEW56448.2022.9859398","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859398","url":null,"abstract":"The automatic scoring of singing evaluation is a hot topic in recent years. Improving the score following effect is the first step to improve the accuracy of evaluation. Most of the commonly used methods are based on DTW, but for audios with low singing quality and inaccurate pitch, DTW often predicts the onset incorrectly. In order to solve the above problems, this paper focus on the offline following, mainly improves from two aspects: 1. Sol-fa name recognition is done before pitch tracking as preprocess. We cannot guarantee that the pitch of the singer is correct, but we can assume that the singer pronounces the sol-fa name correctly, so we use sol-fa name recognition as preprocessing; 2. Regularized DTW is proposed based on the basis of sol-fa name recognition. The results show that for general audio, under the condition of a tolerance of 20ms, compared with about 86% accuracy of ordinary DTW algorithm, our algorithm has improved to about 92%, while the average error of predicted notes is reduced by about 23ms. For audio with low signal-to-noise ratio and unstable voice frequency, the alignment effect is improved by about 20% compared with ordinary DTW.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123729766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAMUS: A Collaborative System for Choreography and Music Composition
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859441
Tiange Zhou, Borou Yu, Jiajian Min, Zeyu Wang
Throughout the history of dance and music collaborations, composers and choreographers have always engaged in separate workflows. Usually, they compose the music and choreograph the moves separately, and the lack of mutual understanding of each other's artistic approach results in long production times. There is a strong need in the performance industry to reduce the time spent establishing a collaborative foundation and allow for more productive creation. We propose DAMUS, a work-in-progress collaborative system for choreography and music composition, to reduce production time and boost productivity. DAMUS is composed of a dance module, DA, and a music module, MUS. DA translates dance motion into MoCap data, Labanotation, and number notation, and sets rules of variation for choreography. MUS produces musical materials that fit the tempo and rhythm of specific dance genres or moves. We applied our system prototype to case studies in three different genres. In the future, we plan to pursue more genres and further develop DAMUS with evolutionary computation and style transfer.
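As a toy illustration of the kind of tempo-constrained material generation MUS is described as performing, the snippet below snaps candidate note onsets to a beat grid derived from a dance tempo. The function and parameters are hypothetical and are not part of DAMUS.

```python
# Toy illustration (not the DAMUS implementation): snap candidate note onsets to a
# beat grid derived from a dance tempo, so generated material fits the dance rhythm.
def quantize_onsets(onsets_sec, bpm=120, subdivisions=4):
    """Snap onset times (seconds) to the nearest subdivision of the beat."""
    step = 60.0 / bpm / subdivisions          # grid spacing in seconds
    return [round(t / step) * step for t in onsets_sec]

print(quantize_onsets([0.03, 0.27, 0.49, 0.74], bpm=120, subdivisions=4))
# -> [0.0, 0.25, 0.5, 0.75]
```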
{"title":"DAMUS: A Collaborative System for Choreography and Music Composition","authors":"Tiange Zhou, Borou Yu, Jiajian Min, Zeyu Wang","doi":"10.1109/ICMEW56448.2022.9859441","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859441","url":null,"abstract":"Throughout the history of dance and music collaborations, composers and choreographers have always engaged in separate workflows. Usually, composers and choreographers complete the music and choreograph the moves separately, where the lack of mutual understanding of their artistic approaches results in a long production time. There is a strong need in the performance industry to reduce the time for establishing a collaborative foundation, allowing for more productive creations. We propose DAMUS, a work-in-progress collaborative system for choreography and music composition, in order to reduce production time and boost productivity.DAMUS is composed of a dance module DA and a music module MUS. DA translates dance motion into MoCap data, Labanotation, and number notation, and sets rules of variations for choreography. MUS produces musical materials that fit the tempo and rhythm of specific dance genres or moves. We applied our system prototype to case studies in three different genres. In the future, we plan to pursue more genres and further develop DAMUS with evolutionary computation and style transfer.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127675449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emotion Recognition Based on Representation Dissimilarity Matrix
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859269
Hongjian Bo, Cong Xu, Boying Wu, Lin Ma, Haifeng Li
Emotion recognition based on electroencephalography (EEG) has attracted wide attention because EEG can reflect intrinsic emotional information. Although much progress has been made, great challenges remain; for example, strict recording conditions make it difficult to apply in real life. Therefore, this article proposes an emotion-induction experiment based on everyday sounds, which is closer to real working environments. A feature optimization method based on the representation dissimilarity matrix is then proposed, feature evaluation criteria are established, and emotion-related features are identified. EEG data were collected from 16 volunteers listening to different emotional sounds, and three types of EEG features were extracted: higher-order crossings, power spectral density, and differential asymmetry. After feature optimization and model construction, the recognition rate for high versus low valence reached 69%. This study explores listeners' dynamic responses to sound and shows that environmental sounds can effectively induce emotional states that can be recognized from EEG, which could help AI better understand people's preferences and needs.
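A representation dissimilarity matrix is conventionally built from pairwise dissimilarities (for example, 1 minus Pearson correlation) between condition-wise feature vectors, as in the minimal sketch below; how the paper scores individual features against the RDM is not reproduced here.

```python
# Minimal sketch of a representation dissimilarity matrix (RDM): pairwise
# dissimilarity (1 - Pearson correlation) between condition feature vectors.
import numpy as np

def rdm(features):
    """features: (n_conditions, n_features) -> (n_conditions, n_conditions) RDM."""
    corr = np.corrcoef(features)      # correlation between condition vectors
    return 1.0 - corr

feats = np.random.randn(6, 32)        # e.g. 6 sound conditions, 32 EEG features each
print(np.round(rdm(feats), 2))
```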
{"title":"Emotion Recognition Based on Representation Dissimilarity Matrix","authors":"Hongjian Bo, Cong Xu, Boying Wu, Lin Ma, Haifeng Li","doi":"10.1109/ICMEW56448.2022.9859269","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859269","url":null,"abstract":"Emotion recognition based on electroencephalogram (EEG) has been widely concerned because it could reflect intrinsic emotional information. Although a large number of achievements have been made, great challenges still exist. For example, strict identification conditions make it difficult to apply in real life. Therefore, an experimental method of emotion induction based on daily sounds is proposed in this article, which is closer to the everyday work environment. Then, a feature optimization method based on the representation dissimilarity matrix is proposed. Finally, the feature evaluation criteria are established and the emotion-related features are found. In this article, EEG data of 16 volunteers in different emotional sounds were collected. Three types of EEG feature: high-order crossing, power spectral density and difference asymmetry were extracted. After feature optimization, and model construction, the recognition rate of high and low valence was up to 69%. This study explores the dynamic response of people listening to sound and shows that the environmental sound could effectively induce and recognize emotional status, which could better help AI understand people’s preferences and needs.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127387583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demusa: Demo for Multimodal Sentiment Analysis
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859289
Soyeon Hong, Jeonghoon Kim, Donghoon Lee, Hyunsouk Cho
Recently, many Multimodal Sentiment Analysis (MSA) models have appeared for understanding opinions in multimedia. To accelerate MSA research, CMU-MOSI and CMU-MOSEI were released as open datasets. However, it is hard to inspect the input data elements in detail and to analyze each video clip's prediction results for qualitative evaluation. For these reasons, this paper presents DeMuSA, a demo for multimodal sentiment analysis that lets users explore raw data instances and compare prediction models at the utterance level.
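The snippet below is a generic example of the kind of utterance-level, side-by-side comparison such a demo enables: gold labels and two models' sentiment predictions are tabulated per utterance and the better model is flagged. The column names and values are hypothetical and do not reflect the DeMuSA interface.

```python
# Generic sketch of utterance-level model comparison (hypothetical data, not DeMuSA).
import pandas as pd

df = pd.DataFrame({
    "utterance_id": ["clip1_u1", "clip1_u2", "clip2_u1"],
    "gold":    [ 1.8, -0.6,  0.4],
    "model_a": [ 1.5, -1.1,  0.9],
    "model_b": [ 2.1,  0.2,  0.3],
})
df["err_a"] = (df["model_a"] - df["gold"]).abs()
df["err_b"] = (df["model_b"] - df["gold"]).abs()
df["better"] = df[["err_a", "err_b"]].idxmin(axis=1)   # which model is closer per utterance
print(df)
```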
{"title":"Demusa: Demo for Multimodal Sentiment Analysis","authors":"Soyeon Hong, Jeonghoon Kim, Donghoon Lee, Hyunsouk Cho","doi":"10.1109/ICMEW56448.2022.9859289","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859289","url":null,"abstract":"Recently, a lot of Multimodal Sentiment Analysis (MSA) models appeared to understanding opinions in multimedia. To accelerate MSA researches, CMU-MOSI and CMU-MOSEI were released as the open-datasets. However, it is hard to observe the input data elements in detail and analyze the prediction model results with each video clip for qualitative evaluation. For these reasons, this paper suggests DeMuSA, demo for multimodal sentiment analysis to explore raw data instance and compare prediction models by utterance-level.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133956038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diversity-Based Media Search
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859474
P. Aarabi
In this paper, we outline a method for searching a set of information based on both the individual diversity of the set's constituent elements and its overall ensemble diversity. Using the example of searching user accounts on Instagram, we perform searches based on the representative diversity of the posts (across race, age, gender, body type, skin tone, and disability) as well as the overall diversity of the search results.
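One simple way to combine the two notions is to average per-item diversity scores and add an ensemble term such as the entropy of attribute distributions over the result set, as in the toy sketch below; the attributes, weights, and scoring function are illustrative assumptions, not the paper's method.

```python
# Toy sketch of combining individual and ensemble diversity when scoring a result
# set (attributes, weights, and scoring are illustrative, not the paper's method).
import math
from collections import Counter

def ensemble_entropy(results, attribute):
    """Shannon entropy of an attribute's distribution across the result set."""
    counts = Counter(r[attribute] for r in results)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def score_result_set(results, w_individual=0.5, w_ensemble=0.5):
    individual = sum(r["individual_diversity"] for r in results) / len(results)
    ensemble = ensemble_entropy(results, "age_group") + ensemble_entropy(results, "skin_tone")
    return w_individual * individual + w_ensemble * ensemble

results = [
    {"individual_diversity": 0.7, "age_group": "18-25", "skin_tone": "light"},
    {"individual_diversity": 0.4, "age_group": "36-50", "skin_tone": "dark"},
    {"individual_diversity": 0.9, "age_group": "18-25", "skin_tone": "medium"},
]
print(score_result_set(results))
```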
{"title":"Diversity-Based Media Search","authors":"P. Aarabi","doi":"10.1109/ICMEW56448.2022.9859474","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859474","url":null,"abstract":"In this paper, we outline a method for searching a set of information based on both the individual diversity of the constituent elements of the set as well as its overall ensemble diversity. Using the example of searching user accounts on Instagram, we are able to perform searches based on the representative diversity (across race, age, gender, body type, skin tone, and disability) of the posts as well as the overall diversity of the search results.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132819165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video Object Segmentation with Online Mask Refinement
Pub Date: 2022-07-18 | DOI: 10.1109/ICMEW56448.2022.9859386
Tomoya Sawada, Teng-Yok Lee, Masahiro Mizuno
This paper proposes a simple and effective video object instance segmentation method that requires no fine-tuning, named the Mask Refinement Module (MRM). Many works address the labeling problem of separating foreground objects, but most require retraining their networks on the target data. In real scenarios, it is not easy to collect and label data from the target environment due to security policies or cost, especially in industry. We solve this problem by refining object masks with a video-based online learning method that adapts to various changes frame by frame. Extensive experiments show that our approach is highly effective compared with modern methods, improving F-measure by up to 13.9% on large video surveillance datasets such as CDNet (118K images).
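Since the reported gains are in F-measure on CDNet, the sketch below shows the standard foreground F-measure computation for binary masks; it illustrates the evaluation metric, not the Mask Refinement Module itself.

```python
# Standard foreground F-measure for binary masks (the metric cited in the abstract).
import numpy as np

def f_measure(pred, gt, eps=1e-8):
    """pred, gt: boolean arrays of the same shape (True = foreground)."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
print(round(float(f_measure(pred, gt)), 3))
```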
{"title":"Video Object Segmentation with Online Mask Refinement","authors":"Tomoya Sawada, Teng-Yok Lee, Masahiro Mizuno","doi":"10.1109/ICMEW56448.2022.9859386","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859386","url":null,"abstract":"This paper proposes a simple and effective video object instance segmentation method without fine-tuning named Mask Refinement Module(MRM). Many papers settle a labeling problem aiming to separate foreground objects, but most of them require training their networks again on target data. In a real scenario, it is not easy to collect dataset on the target environment and to label them as well due to security policies or a cost problem, especially for industry. We solve the problem by reshaping object masks with a video based online-learning method that enables us to adapt various changes frame by frame. In extensive experiments, results show that our approach is highly effective compared to modern methods by up to 13.9% improving of F-measure on large video surveillance dataset such as CDNet (118K images).","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128074320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}