Exploring the Application of AI-generated Artworks for the Study of Aesthetic Processing
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00073
Vanessa Utz, S. DiPaola
In this paper we outline the need for increased control over the stimuli used within the field of empirical aesthetics. Since artworks are highly complex stimuli, and traditional man-made artworks vary across many dimensions (such as color palette, subject matter, and style), it is difficult to isolate the effect a single variable has on the aesthetic processing that occurs in a viewer. We therefore propose exploring the use of computer-generated artworks as stimuli instead, due to the high degree of control that experimenters have over the generated output. We describe how computational creativity systems work by outlining our own cognitive-based multi-module AI system, and then discuss the benefits of these systems as well as some preliminary work in this space. We conclude the paper by addressing the limitation of reduced ecological validity.
{"title":"Exploring the Application of AI-generated Artworks for the Study of Aesthetic Processing","authors":"Vanessa Utz, S. DiPaola","doi":"10.1109/MIPR51284.2021.00073","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00073","url":null,"abstract":"In this paper we outline the need for increased control over the stimuli that are used within the field of empirical aesthetics. Since artworks are highly complex stimuli and traditional man-made artworks vary across many different dimensions (such as color palette, subject matter, style) it is difficult to isolate the effect a single variable has on the aesthetic processing that occurs in a viewer. We therefore propose to explore the use of computer-generated artworks as stimuli instead due to the high degree of control that experimenters have over the generated output. We describe how computational creativity systems work by outlining our own cognitive based multi-module AI system, and then discuss the benefits of these systems as well as some preliminary work in this space. We conclude the paper by addressing the limitation of reduced ecological validity.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"360 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131785210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Manifold Semantic Canonical Correlation Framework for Effective Feature Fusion
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00010
Zheng Guo, Lei Gao, L. Guan
In this paper, we present a manifold semantic canonical correlation (MSCC) framework with application to feature fusion. In the proposed framework, a manifold method is first employed to preserve the local structural information of multi-view feature spaces. Afterwards, a semantic canonical correlation algorithm is integrated with the manifold method to accomplish the task of feature fusion. Since the semantic canonical correlation algorithm is capable of measuring the global correlation across multiple variables, both the local structural information and the global correlation are incorporated into the proposed framework, resulting in a new feature representation of high quality. To demonstrate the effectiveness and generality of the proposed solution, we conduct experiments on audio emotion recognition and object recognition by utilizing classic and deep neural network (DNN) based features, respectively. Experimental results show the superiority of the proposed solution for feature fusion.
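As a concrete illustration of the manifold-then-correlation pipeline described above, the sketch below first embeds each feature view with a local-structure-preserving manifold method and then correlates the two embeddings with CCA before concatenating the projections. It is only a minimal sketch built on scikit-learn's LocallyLinearEmbedding and CCA; the authors' MSCC framework, including its label-aware ("semantic") correlation measure and its exact fusion rule, is not reproduced here.

```python
# Minimal sketch: manifold embedding per view, then canonical correlation, then fusion.
# This is an illustrative stand-in, not the paper's MSCC algorithm.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 64))    # view 1: e.g., classic audio features
X_visual = rng.normal(size=(200, 128))  # view 2: e.g., DNN-based features

# Step 1: preserve local structure of each view with a manifold embedding.
emb_a = LocallyLinearEmbedding(n_neighbors=10, n_components=16).fit_transform(X_audio)
emb_v = LocallyLinearEmbedding(n_neighbors=10, n_components=16).fit_transform(X_visual)

# Step 2: capture global correlation across the two views with CCA
# (the "semantic", label-aware part of the paper is omitted in this sketch).
cca = CCA(n_components=8)
Z_a, Z_v = cca.fit_transform(emb_a, emb_v)

# Step 3: fuse by concatenating the correlated projections.
fused = np.concatenate([Z_a, Z_v], axis=1)
print(fused.shape)  # (200, 16)
```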
{"title":"A Manifold Semantic Canonical Correlation Framework for Effective Feature Fusion","authors":"Zheng Guo, Lei Gao, L. Guan","doi":"10.1109/MIPR51284.2021.00010","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00010","url":null,"abstract":"In this paper, we present a manifold semantic canonical correlation (MSCC) framework with application to feature fusion. In the proposed framework, a manifold method is first employed to preserve the local structural information of multi-view feature spaces. Afterwards, a semantic canonical correlation algorithm is integrated with the manifold method to accomplish the task of feature fusion. Since the semantic canonical correlation algorithm is capable of measuring the global correlation across multiple variables, both the local structural information and the global correlation are incorporated into the proposed framework, resulting in a new feature representation of high quality. To demonstrate the effectiveness and the generality of the proposed solution, we conduct experiments on audio emotion recognition and object recognition by utilizing classic and deep neural network (DNN) based features, respectively. Experimental results show the superiority of the proposed solution on feature fusion.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133411768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distinguishing the "strong/weak" in the 60 Jingfang tones and their optimal distribution
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00063
Gen-Fang Chen
This paper first discusses the representation of infinite decimals, and then compares and analyzes two different "strong/weak" distributions of the 60 Jingfang tones, one from the Book of the Later Han and one from Chen Yingshi's work, Interpreting of 60 Jingfang Tones. The paper asserts that the "strong/weak" distribution is based on the tuning system of the "Three-scale Rise/Fall Tuning" and constructs the optimal "strong/weak" distribution of infinite decimals according to the "Three-scale Rise/Fall Tuning" using the least-squares method. Finally, it obtains the optimal "strong/weak" distribution of the 60 Jingfang tones using a dynamic programming algorithm from artificial intelligence.
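The combination of a least-squares criterion with dynamic programming can be illustrated generically: label each generation step "strong" or "weak" so that the accumulated pitch deviates as little as possible, in the squared sense, from a target. The step factors, targets, and step count in the sketch below are illustrative placeholders, not the values of the Three-scale Rise/Fall Tuning or the paper's actual 60-tone construction.

```python
# Generic least-squares + dynamic-programming sketch for a strong/weak labeling.
# All numeric values are placeholders, NOT the Jingfang or Three-scale Rise/Fall values.
import math

STRONG, WEAK = math.log(3 / 2), math.log(4 / 3)            # hypothetical per-step log-ratios
N = 12                                                      # hypothetical number of steps
target = [i * math.log(2) / 12 for i in range(1, N + 1)]    # illustrative target log-pitch per step

INF = float("inf")
dp_prev = {0: 0.0}   # best cost after the previous step, keyed by number of "strong" labels so far
back = []            # back[i-1][k]: was step i labeled "strong" on the best path with k strongs?
for i in range(1, N + 1):
    dp_cur, back_i = {}, {}
    for k in range(i + 1):
        best, was_strong = INF, False
        if k in dp_prev:                                    # step i labeled "weak"
            best, was_strong = dp_prev[k], False
        if k - 1 in dp_prev and dp_prev[k - 1] < best:      # step i labeled "strong"
            best, was_strong = dp_prev[k - 1], True
        pitch = k * STRONG + (i - k) * WEAK                 # log-pitch depends only on (i, k)
        dp_cur[k] = best + (pitch - target[i - 1]) ** 2     # accumulate squared deviation
        back_i[k] = was_strong
    dp_prev, back = dp_cur, back + [back_i]

# Backtrack the optimal strong/weak labeling.
k = min(dp_prev, key=dp_prev.get)
labels = []
for i in range(N, 0, -1):
    was_strong = back[i - 1][k]
    labels.append("strong" if was_strong else "weak")
    k -= 1 if was_strong else 0
labels.reverse()
print(labels)
print("total squared deviation:", min(dp_prev.values()))
```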
{"title":"Distinguishing the \"strong/weak\" in the 60 Jingfang tones and their optimal distribution","authors":"Gen-Fang Chen","doi":"10.1109/MIPR51284.2021.00063","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00063","url":null,"abstract":"This paper first discusses the representation of infinite decimals, and then compares and analyzes the two different distributions of \"strong/weak\" of the 60 Jingfang tones, respectively, from Book of the Later Han and the work of Chen Yingshi, Interpreting of 60 Jingfang Tones. The paper asserts that the \"strong/weak\" distribution is based on the tuning system of the \"Three-scale Rise/Fall Tuning\" and constructs the optimal \"strong/weak\" distribution of infinite decimals according to the \"Three-scale Rise/Fall Tuning\" by using the least square method. Finally, it obtains the optimal distribution of \"strong/weak\" of the 60 Jingfang tones by using a dynamic planning algorithm of artificial intelligence.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133780347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
XM2A: Multi-Scale Multi-Head Attention with Cross-Talk for Multi-Variate Time Series Analysis
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00030
Yash Garg, K. Candan
Advances in sensory technologies are enabling the capture of a diverse spectrum of real-world data streams. The increasing availability of such data, especially in the form of multi-variate time series, opens new opportunities for applications that rely on identifying and leveraging complex temporal patterns. A particular challenge such algorithms face is that complex patterns consist of multiple simpler patterns of varying scales (temporal lengths). While several recent works (such as multi-head attention networks) recognized that complex patterns need to be understood in terms of multiple simpler patterns, we note that existing works lack the ability to represent the interactions across these constituent patterns. To tackle this limitation, in this paper, we propose a novel Multi-scale Multi-head Attention with Cross-Talk (XM2A) framework designed to represent the multi-scale patterns that make up a complex pattern by configuring each attention head to learn a pattern at a particular scale and by accounting for the co-existence of patterns at multiple scales through a cross-talk mechanism among the heads. Experiments show that XM2A outperforms state-of-the-art attention mechanisms, such as Transformer and MSMSA, on benchmark datasets such as SADD, AUSLAN, and MOCAP.
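The two ingredients named in the abstract, per-head temporal scales and cross-talk among heads, can be sketched compactly. In the numpy sketch below, each head attends over an average-pooled version of the sequence at its own scale, and a learned mixing matrix lets the heads exchange information before their outputs are concatenated; the pooling trick, the mixing rule, and all shapes are assumptions for illustration rather than the actual XM2A architecture.

```python
# Illustrative numpy sketch: one attention head per temporal scale, plus a cross-talk
# mixing step across heads. Not the actual XM2A implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pool(x, scale):
    """Average-pool a (T, D) sequence over non-overlapping windows of length `scale`."""
    T, D = x.shape
    T2 = T // scale
    return x[:T2 * scale].reshape(T2, scale, D).mean(axis=1)

def multi_scale_attention_with_crosstalk(x, scales=(1, 2, 4), d_head=16, seed=0):
    rng = np.random.default_rng(seed)
    T, D = x.shape
    heads = []
    for s in scales:                          # one head per temporal scale
        Wq, Wk, Wv = (rng.normal(scale=D ** -0.5, size=(D, d_head)) for _ in range(3))
        ctx = pool(x, s)                      # keys/values computed at a coarser scale
        q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv
        attn = softmax(q @ k.T / np.sqrt(d_head))
        heads.append(attn @ v)                # (T, d_head) per head/scale
    H = np.stack(heads, axis=1)               # (T, n_heads, d_head)
    mix = softmax(rng.normal(size=(len(scales), len(scales))), axis=-1)
    H = np.einsum("hg,tgd->thd", mix, H)      # cross-talk: each head mixes in the others
    return H.reshape(T, -1)                   # concatenate heads -> (T, n_heads * d_head)

out = multi_scale_attention_with_crosstalk(np.random.default_rng(1).normal(size=(32, 8)))
print(out.shape)  # (32, 48)
```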
{"title":"XM2A: Multi-Scale Multi-Head Attention with Cross-Talk for Multi-Variate Time Series Analysis","authors":"Yash Garg, K. Candan","doi":"10.1109/MIPR51284.2021.00030","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00030","url":null,"abstract":"Advances in sensory technologies are enabling the capture of a diverse spectrum of real-world data streams. In-creasing availability of such data, especially in the form of multi-variate time series, allows for new opportunities for applications that rely on identifying and leveraging complex temporal patterns A particular challenge such algorithms face is that complex patterns consist of multiple simpler patterns of varying scales (temporal length). While several recent works (such as multi-head attention networks) recognized the fact complex patterns need to be understood in the form of multiple simpler patterns, we note that existing works lack the ability of represent the interactions across these constituting patterns. To tackle this limitation, in this paper, we propose a novel Multi-scale Multi-head Attention with Cross-Talk (XM2A) framework designed to represent multi-scale patterns that make up a complex pattern by configuring each attention head to learn a pattern at a particular scale and accounting for the co-existence of patterns at multiple scales through a cross-talking mechanism among the heads. Experiments show that XM2A outperforms state-of-the-art attention mechanisms, such as Transformer and MSMSA, on benchmark datasets, such as SADD, AUSLAN, and MOCAP.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114759827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stochastic Observation Prediction for Efficient Reinforcement Learning in Robotics
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00027
Shisheng Wang, Hideki Nakayama
Although recent progress in deep learning has enabled reinforcement learning (RL) algorithms to achieve human-level performance in retro video games within a short training time, their application to real-world robotics remains limited. The conventional RL procedure requires agents to interact with the environment, yet interactions with the physical world cannot be easily parallelized or accelerated as in other tasks. Moreover, the gap between the real world and simulation makes it harder to transfer a policy trained in simulators to physical robots. Thus, we propose a model-based method to mitigate the interaction overheads of real-world robotic tasks. In particular, our model incorporates an autoencoder, a recurrent network, and a generative network to make stochastic predictions of observations. We conduct experiments on a collision-avoidance task for disc-like robots and show that the generative model can serve as a virtual RL environment. Our method has the benefit of lower interaction overheads, as inference of deep neural networks on GPUs is faster than observing transitions in the real environment, and it can replace the real RL environment for rollouts of limited length.
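The model structure described above (autoencoder, recurrent network, generative prediction head) can be sketched as a compact world-model-style module in which imagined rollouts replace real robot interaction. The PyTorch sketch below uses assumed layer sizes and a simple Gaussian latent head; the paper's exact architecture, losses, and training procedure are not reproduced.

```python
# Minimal world-model-style sketch: encode an observation, roll the dynamics forward in
# latent space with stochastic (reparameterized Gaussian) predictions, decode at the end.
# Layer sizes and the Gaussian head are illustrative assumptions.
import torch
import torch.nn as nn

class StochasticObservationModel(nn.Module):
    def __init__(self, obs_dim=32, act_dim=2, latent_dim=8, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, obs_dim))
        self.rnn = nn.GRUCell(latent_dim + act_dim, hidden_dim)
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)   # predicts mean and log-variance

    def step(self, z, action, h):
        """One imagined transition: (latent, action, rnn state) -> (next latent sample, next state)."""
        h = self.rnn(torch.cat([z, action], dim=-1), h)
        mean, logvar = self.prior(h).chunk(2, dim=-1)
        z_next = mean + torch.randn_like(mean) * (0.5 * logvar).exp()   # reparameterized sample
        return z_next, h

model = StochasticObservationModel()
obs = torch.randn(1, 32)
z = model.encoder(obs)
h = torch.zeros(1, 64)
for _ in range(10):                    # a short imagined rollout replacing real interaction
    action = torch.randn(1, 2)         # in practice, chosen by the policy being trained
    z, h = model.step(z, action, h)
predicted_obs = model.decoder(z)
print(predicted_obs.shape)  # torch.Size([1, 32])
```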
{"title":"Stochastic Observation Prediction for Efficient Reinforcement Learning in Robotics","authors":"Shisheng Wang, Hideki Nakayama","doi":"10.1109/MIPR51284.2021.00027","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00027","url":null,"abstract":"Although the recent progress of deep learning has enabled reinforcement learning (RL) algorithms to achieve human-level performance in retro video games within a short training time, the application of real-world robotics remains limited. The conventional RL procedure requires agents to interact with the environment. Meanwhile, the interactions with the physical world can not be easily parallelized or accelerated as in other tasks. Moreover, the gap between the real world and simulation makes it harder to transfer the policy trained in simulators to physical robots. Thus, we propose a model-based method to mitigate the interaction overheads for real-world robotic tasks. In particular, our model incorporates an autoencoder, a recurrent network, and a generative network to make stochastic predictions of observations. We conduct the experiments on a collision avoidance task for disc-like robots and show that the generative model can serve as a virtual RL environment. Our method has the benefit of lower interaction overheads as inference of deep neural networks on GPUs is faster than observing the transitions in the real environment, and it can replace the real RL environment with limited rollout length.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116441713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Interactive Cooking Support System for Short Recipe Videos based on User Browsing Behavior
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00011
Takuya Yonezawa, Yuanyuan Wang, Yukiko Kawai, K. Sumiya
Recently, short recipe videos such as Kurashiru and DELISH KITCHEN have become popular. These short recipe videos can help people learn many cooking skills in a brief time. However, it is difficult for users to understand all cooking operations by viewing these videos only once, and short recipe videos do not account for users' cooking skills (cooking levels), since anyone may view the same video. Therefore, in this work, we propose an interactive cooking support system for short recipe videos that extracts and weights cooking operations for each cooking genre based on user browsing behavior. The system then recommends various supplementary recipe videos based on the weights of the cooking operations and the user's browsing behavior. The system also provides a user interface, called the Dynamic Video Tag Cloud, for visualizing the supplementary recipe videos, which can be dynamically changed based on the user's browsing behavior. As a result, users can intuitively and easily understand cooking operations suited to their cooking preferences. Finally, we verified the effectiveness of the weighting of cooking operations and discussed the usefulness of the proposed user interface using the System Usability Scale (SUS) score.
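The abstract does not specify how cooking operations are weighted per genre, so the sketch below shows one plausible instantiation only: a TF-IDF-style weight for each operation within a genre, with candidate supplementary videos scored by their weighted operations scaled by a dwell-time signal standing in for browsing behavior. All names and numbers are invented for illustration.

```python
# Hedged sketch of operation weighting and video scoring; not the paper's actual formulas.
import math
from collections import Counter

genre_videos = {                                   # toy corpus of operation lists per genre
    "stir-fry": [["cut", "fry", "season"], ["cut", "fry"]],
    "baking":   [["mix", "knead", "bake"], ["mix", "bake", "season"]],
}

def operation_weights(genre):
    """TF-IDF-like weight of each cooking operation within a genre."""
    tf = Counter(op for video in genre_videos[genre] for op in video)
    n_genres = len(genre_videos)
    weights = {}
    for op, count in tf.items():
        df = sum(any(op in v for v in vids) for vids in genre_videos.values())
        weights[op] = count * math.log((1 + n_genres) / (1 + df))
    return weights

def score_video(video_ops, weights, dwell_time):
    """Rank a candidate supplementary video by its weighted operations, scaled by how long
    the user dwelt on related segments (a stand-in for 'browsing behavior')."""
    return dwell_time * sum(weights.get(op, 0.0) for op in video_ops)

w = operation_weights("stir-fry")
print(sorted(w.items(), key=lambda kv: -kv[1]))
print(score_video(["cut", "fry", "plate"], w, dwell_time=1.5))
```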
{"title":"An Interactive Cooking Support System for Short Recipe Videos based on User Browsing Behavior","authors":"Takuya Yonezawa, Yuanyuan Wang, Yukiko Kawai, K. Sumiya","doi":"10.1109/MIPR51284.2021.00011","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00011","url":null,"abstract":"Recently, short recipe videos such as Kurashiru and DELISH KITCHEN have become popular. These short recipe videos can help people learn many cooking skills in a brief time. However, it is difficult for users to understand all cooking operations by viewing these videos only once. These short recipe videos do not consider users’ cooking skills (cooking levels) since anyone may view the same video. Therefore, in this work, we propose an interactive cooking support system for short recipe videos by extracting and weighting cooking operations for each cooking genre based on user browsing behavior. The system then recommends various supplementary recipe videos based on the weights of cooking operations and user browsing behavior. Also, the system provides a user interface, called Dynamic Video Tag Cloud for visualizing the supplementary recipe videos, and the supplementary recipe videos can be dynamically changed based on the user browsing behavior. As a result, users can intuitively and easily understand cooking operations suited to their cooking favorites. Finally, we verified the effectiveness of the weighting of cooking operations and discussed the usefulness of our proposed user interface using the SUS score.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123182768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking of Intangible Cultural Heritage Teaching with Creative Programming in China
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00056
Peng Tan, Yi Ji, Yuqing Xu
Creative programming has become a mature mode of innovation outside China, and it offers a new approach to teaching intangible cultural heritage. This paper explores the application of a visual programming tool (Scratch) to the teaching of an intangible cultural heritage (Cantonese Porcelain) through three teaching modules. The research shows that this integrated teaching method can effectively stimulate participants' interest and creative thinking in learning intangible cultural heritage, and it provides a new way of thinking and a practical reference for current innovative teaching of intangible cultural heritage.
{"title":"Rethinking of Intangible Cultural Heritage Teaching with Creative Programming in China","authors":"Peng Tan, Yi Ji, Yuqing Xu","doi":"10.1109/MIPR51284.2021.00056","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00056","url":null,"abstract":"Creative programming has become a mature mode of innovation in foreign countries, It provides a new way of intangible cultural heritage teaching. This paper will explore the application of visual programming tool (Scratch) in the teaching of intangible cultural heritage (Cantonese Porcelain) from three teaching modules. The research shows that this integrated teaching method can effectively stimulate participant’s interest and creative thinking in learning of intangible cultural heritage, and provide a new way of thinking and practical reference for the current innovative teaching of intangible cultural heritage.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"231 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132343668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hybrid Image Segmentation Approach for Thermal Barrier Coating Quality Assessments
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00033
Zanyah Ailsworth, Wei-bang Chen, Yongjin Lu, Xiaoliang Wang, Melissa Tsui, H. Al-Ghaib, Ben Zimmerman
Thermal barrier coating, an advanced manufacturing technique widely used in various industries, provides thermal insulation and surface protection to a substrate by spraying melted coating materials onto its surface. As the melted coating materials solidify, they create microstructures that affect the coating quality. An important quality-assessment metric that determines a coating's effectiveness is porosity, the quantity of microstructures within the coating. In this article, we aim to build a novel algorithm to identify the microstructures in a thermal barrier coating, which are then used to calculate porosity. The hybrid approach combines the efficiency of thresholding-based techniques with the accuracy of convolutional neural network (CNN) based techniques to perform binary semantic segmentation. We evaluate the performance of the proposed hybrid approach on coating images generated from two different types of coating powders, which exhibit various texture features. The experimental results show that the proposed hybrid approach outperforms both the thresholding-based approach and the CNN-based approach in terms of accuracy on both types of images. In addition, the time complexity of the hybrid approach is greatly reduced compared to the CNN-based approach.
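The hybrid idea of combining a cheap threshold with a CNN refinement can be sketched as follows: an Otsu threshold labels the confidently dark (pore) and bright (coating) pixels, a network is consulted only for pixels near the threshold, and porosity is the resulting pore-pixel fraction. The `cnn_segment` function below is a hypothetical stand-in for the paper's trained network, and the margin-based splitting rule is an assumption for illustration.

```python
# Hedged sketch of a thresholding + CNN hybrid for pore segmentation and porosity.
# `cnn_segment` is a dummy placeholder, not the paper's trained model.
import numpy as np
from skimage.filters import threshold_otsu

def cnn_segment(image_gray):
    """Placeholder for a trained CNN returning a binary pore mask (dummy logic for this sketch)."""
    return image_gray < image_gray.mean()

def hybrid_porosity(image_gray, margin=0.05):
    t = threshold_otsu(image_gray)
    lo, hi = t - margin, t + margin
    pore = image_gray < lo                           # confidently pore (dark) pixels
    ambiguous = (image_gray >= lo) & (image_gray <= hi)
    if ambiguous.any():
        pore |= ambiguous & cnn_segment(image_gray)  # refine only the uncertain pixels
    return pore.mean()                               # porosity = pore pixels / all pixels

img = np.clip(np.random.default_rng(0).normal(0.6, 0.2, size=(256, 256)), 0, 1)
print(f"estimated porosity: {hybrid_porosity(img):.3f}")
```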
{"title":"A Hybrid Image Segmentation Approach for Thermal Barrier Coating Quality Assessments","authors":"Zanyah Ailsworth, Wei-bang Chen, Yongjin Lu, Xiaoliang Wang, Melissa Tsui, H. Al-Ghaib, Ben Zimmerman","doi":"10.1109/MIPR51284.2021.00033","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00033","url":null,"abstract":"Thermal barrier coating, a widely used advanced manufacturing technique in various industries, provides thermal insulation and surface protection to a substrate by spraying melted coating materials on to the surface of the substrate. As the melted coating materials solidify, it creates microstructures that affect the coating quality. An important coating quality assessment metric that determines its effectiveness is porosity, the quantity of microstructures within the coating. In this article, we aim to build a novel algorithm to determine the microstructures in a thermal barrier coating, which is used to calculate porosity. The hybrid approach combines the efficiency of thresholding-based techniques and the accuracy of convolutional neural network (CNN) based techniques to perform a binary semantic segmentation. We evaluate the performance of the proposed hybrid approach on coating images generated from two different types of coating powders. These images exhibit various texture features. The experimental results show that the proposed hybrid approach outperforms the thresholding-based approach and the CNN-based approach in terms of accuracy on both types of images. In addition, the time complexity of the hybrid approach is also greatly optimized compared to the CNN-based approach.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127425225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multi-modal dataset for analyzing the imageability of concepts across modalities
Pub Date: 2021-09-01 | DOI: 10.1109/MIPR51284.2021.00039
Marc A. Kastner, Chihaya Matsuhira, I. Ide, S. Satoh
Recently, multi-modal applications have brought a need for a human-like understanding of the differences in perception across modalities. For example, while something might evoke a clear image in a visual context, it might be perceived as too technical in a textual context. Such differences, related to a semantic gap, make transfer between modalities or the combination of modalities in multi-modal processing a difficult task. Imageability, a concept from psycholinguistics, gives promising insight into the human perception of vision and language. In order to understand cross-modal differences in semantics, we create and analyze a cross-modal dataset for imageability. We estimate three imageability values grounded in 1) a visual space from a large set of images, 2) a textual space from Web-trained word embeddings, and 3) a phonetic space based on word pronunciations. A subset of the corpus is evaluated against an existing imageability dictionary to ensure basic generalization, but the dataset otherwise targets finding cross-modal differences and outliers. We visualize the dataset and analyze it with regard to outliers and differences for each modality. As additional sources of knowledge, the part of speech and etymological origin of all words are estimated and analyzed in the context of the modalities. The dataset of multi-modal imageability values and a link to an interactive browser with visualizations are made available on the Web.
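One way a visual imageability value can be grounded in a large set of images is to treat concepts whose images cluster tightly in a feature space as more imageable than concepts whose images are visually scattered. The sketch below illustrates that dispersion-based idea under assumed inputs; it is not the paper's estimation pipeline, which additionally covers the textual and phonetic spaces.

```python
# Hedged sketch of a dispersion-based visual imageability proxy; inputs are assumed
# CNN-style image embeddings per concept, and the score mapping is illustrative.
import numpy as np

def visual_imageability(image_features):
    """image_features: (n_images, d) array of embeddings for the images of one concept."""
    centered = image_features - image_features.mean(axis=0, keepdims=True)
    dispersion = np.linalg.norm(centered, axis=1).mean()   # mean distance to the centroid
    return 1.0 / (1.0 + dispersion)                        # tighter cluster -> higher score

rng = np.random.default_rng(0)
concrete = rng.normal(0, 0.2, size=(50, 512))   # e.g., images of "banana": visually similar
abstract = rng.normal(0, 2.0, size=(50, 512))   # e.g., images of "freedom": visually diverse
print(visual_imageability(concrete), visual_imageability(abstract))
```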
{"title":"A multi-modal dataset for analyzing the imageability of concepts across modalities","authors":"Marc A. Kastner, Chihaya Matsuhira, I. Ide, S. Satoh","doi":"10.1109/MIPR51284.2021.00039","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00039","url":null,"abstract":"Recently, multi-modal applications bring a need for a human-like understanding of the perception differences across modalities. For example, while something might have a clear image in a visual context, it might be perceived as too technical in a textual context. Such differences related to a semantic gap make a transfer between modalities or a combination of modalities in multi-modal processing a difficult task. Imageability as a concept from Psycholinguistics gives promising insight to the human perception of vision and language. In order to understand cross-modal differences of semantics, we create and analyze a cross-modal dataset for imageability. We estimate three imageability values grounded in 1) a visual space from a large set of images, 2) a textual space from Web-trained word embeddings, and 3) a phonetic space based on word pronunciations. A subset of the corpus is evaluated with an existing imageability dictionary to ensure a basic generalization, but otherwise targets finding cross-modal differences and outliers. We visualize the dataset and analyze it regarding outliers and differences for each modality. As additional sources of knowledge, part-of-speech and etymological origin of all words are estimated and analyzed in context of the modalities. The dataset of multi-modal imageability values and a link to an interactive browser with visualizations are made available on the Web.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132985388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do as we do: Multiple Person Video-To-Video Transfer
Pub Date: 2021-04-10 | DOI: 10.1109/MIPR51284.2021.00020
Mickael Cormier, Houraalsadat Mortazavi Moshkenan, Franz Lörch, J. Metzler, J. Beyerer
Our goal is to transfer the motion of real people from a source video to a target video with realistic results. While recent advances have significantly improved image-to-image translation, only few works account for body motions and temporal consistency, and those focus only on video retargeting for a single actor. In this work, we propose a marker-less approach for multiple-person video-to-video transfer using pose as an intermediate representation. Given a source video with multiple persons dancing or working out, our method transfers the body motion of all actors to a new set of actors in a different video. Unlike recent "do as I do" methods, we focus specifically on transferring multiple persons at the same time and tackle the related identity-switch problem. Our method is able to convincingly transfer body motion to the target video while preserving specific features of the target video, such as feet touching the floor and the relative positions of the actors. The evaluation is performed with visual quality and appearance metrics, using publicly available videos with the permission of their owners.
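One common way to keep actor identities stable when pose is the intermediate representation is to match the skeletons detected in consecutive frames by minimizing total keypoint distance with a Hungarian assignment, so each actor keeps a consistent index before motion transfer. The sketch below illustrates that matching step under an assumed pose format; it is not necessarily the paper's tracking procedure.

```python
# Hedged sketch of frame-to-frame identity matching for multi-person pose transfer,
# using Hungarian assignment on mean keypoint distance. Pose format is assumed.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_identities(prev_poses, cur_poses):
    """prev_poses, cur_poses: (n_people, n_keypoints, 2) arrays of 2D joint positions.
    Returns, for each current detection, the index of the matching previous identity
    (assumes the same number of people in both frames)."""
    cost = np.linalg.norm(
        prev_poses[:, None] - cur_poses[None, :], axis=-1
    ).mean(axis=-1)                        # (n_prev, n_cur) mean per-joint distance
    prev_idx, cur_idx = linear_sum_assignment(cost)
    mapping = np.empty(len(cur_poses), dtype=int)
    mapping[cur_idx] = prev_idx
    return mapping

rng = np.random.default_rng(0)
frame_prev = rng.uniform(0, 1, size=(3, 17, 2))                            # three actors, 17 joints
frame_cur = frame_prev[[2, 0, 1]] + rng.normal(0, 0.01, size=(3, 17, 2))   # shuffled + small motion
print(match_identities(frame_prev, frame_cur))                             # expected: [2 0 1]
```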
{"title":"Do as we do: Multiple Person Video-To-Video Transfer","authors":"Mickael Cormier, Houraalsadat Mortazavi Moshkenan, Franz Lörch, J. Metzler, J. Beyerer","doi":"10.1109/MIPR51284.2021.00020","DOIUrl":"https://doi.org/10.1109/MIPR51284.2021.00020","url":null,"abstract":"Our goal is to transfer the motion of real people from a source video to a target video with realistic results. While recent advances significantly improved image-to-image translations, only few works account for body motions and temporal consistency. However, those focus only on video retargeting for a single actor/ for single actors. In this work, we propose a marker-less approach for multiple-person video-to-video transfer using pose as an intermediate representation. Given a source video with multiple persons dancing or working out, our method transfers the body motion of all actors to a new set of actors in a different video. Differently from recent \"do as I do\" methods, we focus specifically on transferring multiple person at the same time and tackle the related identity switch problem. Our method is able to convincingly transfer body motion to the target video, while preserving specific features of the target video, such as feet touching the floor and relative position of the actors. The evaluation is performed with visual quality and appearance metrics using publicly available videos with the permission of their owners.","PeriodicalId":139543,"journal":{"name":"2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124731061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}