In this work, we propose FLNet, a generative approach to synthesizing floor layout plans guided by user constraints. Our approach takes user inputs in the form of a boundary, room types, and spatial relationships, and generates a layout design that satisfies these requirements. We evaluate our approach on RPLAN, a dataset of 80,000 vector-graphics floor plans of residential buildings designed by professional architects. We perform both qualitative and quantitative analysis along three metrics (layout generation accuracy, realism, and quality) to evaluate the generated layout designs. We compare our approach with existing baselines and outperform them on all three metrics. The layout designs generated by our approach are more realistic and of better quality.
{"title":"FLNet: Graph Constrained Floor Layout Generation","authors":"Abhinav Upadhyay, Alpana Dubey, Veenu Arora, Mani Suma Kuriakose, Shaurya Agarawal","doi":"10.1109/ICMEW56448.2022.9859350","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859350","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
Product disassembly has become an area of active research because it supports sustainable development by enabling effective end-of-life (EOL) strategies such as reuse, remanufacturing, and recycling. In this work, we propose a new approach, 3D-DSPNet, that uses 3D data from CAD assembly models to generate a feasible disassembly sequence. Our approach uses graph-based learning to process the graph representation of CAD models. Currently available 3D CAD model datasets lack ground-truth disassembly sequences, so we propose and curate a new dataset, the 3D-DSP dataset, which includes ground-truth disassembly sequences for 3D product models. We evaluate and analyze the results to demonstrate the efficacy of the proposed method. Our approach significantly outperforms the existing baseline. We also develop an Autodesk Fusion 360 plug-in that generates a disassembly sequence animation, allowing intuitive analysis of the disassembly plan.
{"title":"3D-DSPnet: Product Disassembly Sequence Planning","authors":"Abhinav Upadhyay, Bharat Ladrecha, Alpana Dubey, Suma Mani Kuriakose, P. Goenka","doi":"10.1109/ICMEW56448.2022.9859434","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859434","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859330
Mengyuan Guan, Suncheng Xiang, Ting Liu, Yuzhuo Fu
Unsupervised domain adaptation (UDA) for person re-identification (ReID) aims to fine-tune a model trained on a labelled source-domain dataset for a target-domain dataset, and the field has advanced by leaps and bounds thanks to deep convolutional neural networks (CNNs). However, traditional CNN-based methods mainly focus on learning small discriminative features in local pedestrian regions, which fails to exploit the potential of rich structural patterns and suffers from loss of detail caused by convolution operators. To tackle this challenge, this work exploits valuable fine-grained attributes using Transformers. We propose a Cross-Domain Transformer network, CDTnet, to enhance robust feature learning in connection with pedestrian attributes. As far as we are aware, this is among the first attempts to adopt a pure transformer for cross-domain ReID research. Comprehensive experiments on several ReID benchmarks demonstrate that our method achieves performance comparable to the state of the art.
{"title":"CDTNET: Cross-Domain Transformer Based on Attributes for Person Re-Identification","authors":"Mengyuan Guan, Suncheng Xiang, Ting Liu, Yuzhuo Fu","doi":"10.1109/ICMEW56448.2022.9859330","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859330","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859286
Weipeng Wang, Xiaobing Li, Cong Jin, Di Lu, Qingwen Zhou, Tie Yun
Many recent deep music generation algorithms can produce good-sounding music, but there have been few studies on controlled generation. In the generation process, the user's sense of participation is usually very weak, and it is difficult to integrate one's own musical motivation into the creation. In this study, we introduce CPS (Compound word with style), a model that accepts a target style and generates a complete musical composition from scratch. We first add genre meta-information to the music representation and distinguish it from other low-level music representations, thus strengthening the influence of the control signal. We model the music with a linear transformer and use an adaptive strategy with different settings for different types of music tokens to reduce the probability of disharmonic music. The experiments show that, compared to the baseline model, our model performs better in terms of basic music metrics as well as metrics for evaluating controllability.
{"title":"CPS: Full-Song and Style-Conditioned Music Generation with Linear Transformer","authors":"Weipeng Wang, Xiaobing Li, Cong Jin, Di Lu, Qingwen Zhou, Tie Yun","doi":"10.1109/ICMEW56448.2022.9859286","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859286","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859303
Yunbin Deng, Ryan Campbell, Piyush Kumar
It is critical that real-time gun and fire detection from video be accurate in order to protect life, property, and the environment. Recent advances in deep machine learning have greatly improved detection accuracy in this domain. In this paper, a semantic embedding-based method is developed for zero-shot gun and fire detection. Using a pre-trained Contrastive Language-Image Pre-Training (CLIP) model, input images and arbitrary texts can be mapped to semantic vectors and their similarity can be computed. By defining object classes using the semantic vector of each class's description, highly accurate object detection can be achieved without training any new model. Evaluation of this method on the public-domain FireNet and IMFDB datasets demonstrates fire and gun detection accuracy of 99.8% and 97.3%, respectively, which significantly outperforms the state-of-the-art FireNet and You Only Look Once (YOLO) algorithms. Semantic embedding enables open-set semantic search in video and simplifies deploying and maintaining object detection applications.
{"title":"Fire and Gun Detection Based on Sematic Embeddings","authors":"Yunbin Deng, Ryan Campbell, Piyush Kumar","doi":"10.1109/ICMEW56448.2022.9859303","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859303","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
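The zero-shot step described in the fire and gun detection abstract above (compare an image embedding with the embedding of each class description and pick the closest) can be sketched as follows. This is a minimal illustration, not the authors' code: the 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, and the function names `cosine_similarity` and `zero_shot_classify` and the class descriptions are assumptions chosen for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, class_vecs):
    """Return the class description most similar to the image, plus all scores."""
    scores = {label: cosine_similarity(image_vec, vec)
              for label, vec in class_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (hypothetical values); a real system would obtain these
# from a pre-trained image encoder and text encoder.
class_vecs = {
    "a photo of fire": [0.9, 0.1, 0.0],
    "a photo of a gun": [0.1, 0.9, 0.0],
    "a photo of something else": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(image_vec, class_vecs)
print(label)
```

Because classes are defined only by their text descriptions, adding a new class in this sketch means adding one more entry to `class_vecs`; no model is retrained.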
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859348
Maik Simon, Erik Bochinski, Markus Küchhold, T. Sikora
Bottleneck situations can occur in overcrowded areas such as entrances or narrowed passages and pose a great danger to the life and health of the people involved. The automated detection of such bottlenecks is the first crucial step in mitigating these dangers. In this work, we exploit the dynamics of motion, using the Lagrangian approach from the analysis of dynamical systems to analyze profiles of groups of people. The derived features, which are observed by the long-term dependent motion dynamics, are described by two-dimensional Lagrangian fields. We extend the underlying Lagrangian framework with a novel measure that captures the density of motion, and hence of people, in the context of crowd analysis. Further, we show how this novel density measure can be combined with the established arc length measure to detect bottlenecks in videos. Experimental evaluations show a 5% improvement over the state of the art for spatiotemporal bottleneck detection.
{"title":"Bottleneck Detection in Crowded Video Scenes Utilizing Lagrangian Motion Analysis Via Density and Arc Length Measures","authors":"Maik Simon, Erik Bochinski, Markus Küchhold, T. Sikora","doi":"10.1109/ICMEW56448.2022.9859348","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859348","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
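The bottleneck detection abstract above refers to an arc length measure without defining it. One plausible reading, assumed here, is the length of the path a tracked point travels over a time window, computed as the sum of its frame-to-frame displacements. The sketch below illustrates only that reading; the function name `arc_length` and the sample trajectory are hypothetical, and the paper's actual definition may differ.

```python
import math

def arc_length(trajectory):
    """Sum of Euclidean distances between consecutive (x, y) positions."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])
    )

# Hypothetical positions of one tracked point over three frames.
path = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
length = arc_length(path)
print(length)  # 11.0
```

Under this reading, points that barely move over the window yield small arc lengths, which is the kind of signal a congestion detector could combine with a density estimate.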
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859507
Sheng-Po Tseng, Jan-Yue Lin, Wei-Chien Cheng, L. Yeh, Chih-Ya Shen
We present a decentralized federated learning (FL) framework based on blockchain. Traditional federated learning requires a centralized third-party server to aggregate the gradients uploaded by all participants, but such a trusted third party may not always exist. We address this issue with a decentralized blockchain and encrypt the neural network model parameters and gradients.
{"title":"Decentralized Federated Learning with Enhanced Privacy Preservation","authors":"Sheng-Po Tseng, Jan-Yue Lin, Wei-Chien Cheng, L. Yeh, Chih-Ya Shen","doi":"10.1109/ICMEW56448.2022.9859507","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859507","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
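For context on the federated learning abstract above, the aggregation step that a central server performs in traditional FL can be sketched as an element-wise average of the gradients uploaded by participants. This shows only the plain, unencrypted averaging that the framework aims to decentralize; it does not model the blockchain or the encryption, and the names `average_gradients` and `apply_update`, the learning rate, and the sample values are assumptions for illustration.

```python
def average_gradients(client_gradients):
    """Element-wise mean of equally sized gradient vectors, one per participant."""
    if not client_gradients:
        raise ValueError("need at least one participant")
    length = len(client_gradients[0])
    if any(len(g) != length for g in client_gradients):
        raise ValueError("all gradient vectors must have the same length")
    n = len(client_gradients)
    return [sum(g[i] for g in client_gradients) / n for i in range(length)]

def apply_update(weights, avg_grad, lr):
    """One gradient-descent step on the shared model weights."""
    return [w - lr * g for w, g in zip(weights, avg_grad)]

# Hypothetical gradients uploaded by three participants.
uploads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = average_gradients(uploads)
new_weights = apply_update([0.5, 0.5], avg, lr=0.1)
print(avg)  # [3.0, 4.0]
```

In this sketch whoever runs `average_gradients` sees every participant's raw gradients, which is the trust problem the abstract raises.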
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859414
Beiji Zou, Min Wang, Lingzi Jiang, Yue Zhang, Shu Liu
Surveillance video anomaly detection is a challenging problem because of the diversity of abnormal events. Current prediction-based methods outperform reconstruction-based methods, but they have the following issues: 1) Using optical flow to represent motion hinders real-time detection. 2) Distinguishing abnormal events only by local relationships leads to ambiguity. 3) Semantic information and spatiotemporal constraints are not fully utilized. To address these problems, we propose FECP-Net, a network with feature enhancement and consistency frame prediction for surveillance video anomaly detection. We use the RGB difference between consecutive frames rather than optical flow to achieve real-time detection. Meanwhile, we design a feature enhancement module to enrich the semantic and global context information in the features. In addition, we add a spatiotemporal consistency constraint and a consistency loss to strengthen the consistency of predictions. Extensive experiments on standard benchmarks demonstrate the effectiveness of our method.
{"title":"Surveillance Video Anomaly Detection with Feature Enhancement and Consistency Frame Prediction","authors":"Beiji Zou, Min Wang, Lingzi Jiang, Yue Zhang, Shu Liu","doi":"10.1109/ICMEW56448.2022.9859414","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859414","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
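The anomaly detection abstract above replaces optical flow with the RGB difference between consecutive frames as a motion cue. A minimal sketch of that operation is shown below, assuming frames are stored as nested lists of `(r, g, b)` tuples; the function names `rgb_difference` and `motion_energy` and the toy frames are hypothetical, and how FECP-Net consumes the difference is not specified here.

```python
def rgb_difference(prev_frame, next_frame):
    """Per-pixel, per-channel absolute difference between two frames."""
    return [
        [tuple(abs(a - b) for a, b in zip(p, q))
         for p, q in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]

def motion_energy(diff):
    """Mean absolute difference over all pixels and channels."""
    values = [c for row in diff for pixel in row for c in pixel]
    return sum(values) / len(values)

# Two hypothetical 2x2 frames; only the top-right pixel changes.
frame_a = [[(10, 10, 10), (10, 10, 10)],
           [(10, 10, 10), (10, 10, 10)]]
frame_b = [[(10, 10, 10), (40, 10, 10)],
           [(10, 10, 10), (10, 10, 10)]]

diff = rgb_difference(frame_a, frame_b)
print(diff[0][1])           # (30, 0, 0)
print(motion_energy(diff))  # 2.5
```

The difference costs one subtraction per channel per pixel, which is why it is a cheaper motion representation than estimating a dense flow field.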
In recent years, self-supervised learning has been studied as a way to deal with the limited availability of labeled datasets. Among the major components of self-supervised learning, the data augmentation pipeline is one key factor in enhancing the resulting performance. However, most researchers design the augmentation pipeline manually, and a limited collection of transformations may reduce the robustness of the learned feature representation. In this work, we propose Multi-Augmentations for Self-Supervised Representation Learning (MA-SSRL), which searches over various augmentation policies to build the entire pipeline and improve the robustness of the learned feature representation. MA-SSRL successfully learns an invariant feature representation and provides an efficient, effective, and adaptable data augmentation pipeline for self-supervised pre-training on datasets from different distributions and domains. MA-SSRL outperforms previous state-of-the-art methods on transfer and semi-supervised benchmarks while requiring fewer training epochs. Code is available on GitHub.
{"title":"Multi-Augmentation for Efficient Self-Supervised Visual Representation Learning","authors":"Van-Nhiem Tran, Chi-En Huang, Shenyao Liu, Kai-Lin Yang, Timothy Ko, Yung-Hui Li","doi":"10.1109/ICMEW56448.2022.9859465","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859465","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
Pub Date: 2022-07-18; DOI: 10.1109/ICMEW56448.2022.9859415
Zhongyu Jiang, Haorui Ji, Samuel Menaker, Jenq-Neng Hwang
With the rapid development of computer vision and deep learning technologies, artificial intelligence plays an increasingly important role in sports analysis. In this paper, to automate golf swing analysis, we propose a lightweight temporal-based 2D human pose estimation (HPE) method, called GolfPose, which achieves better performance than state-of-the-art image-based HPE methods. Unlike traditional image-based methods, our temporal-based method, designed for efficient and effective golf swing analysis, takes advantage of temporal information to improve the estimation accuracy of fast-moving and partially self-occluded keypoints. Furthermore, to ensure that golf swing analysis can run on mobile devices, we optimize the model architecture to achieve real-time inference. With around 10% of the parameters and half of the GFLOPs of the state-of-the-art HRNet, our proposed GolfPose model achieves a mean pixel error (MPE) of 9.16 on our golf swing dataset, compared with 9.20 MPE for HRNet. In addition, the proposed temporal-based method, aided by golf club detection (GCD), significantly improves the accuracy of keypoints on the golf club, from 13.98 to 9.21 MPE.
{"title":"GolfPose: Golf Swing Analyses with a Monocular Camera Based Human Pose Estimation","authors":"Zhongyu Jiang, Haorui Ji, Samuel Menaker, Jenq-Neng Hwang","doi":"10.1109/ICMEW56448.2022.9859415","DOIUrl":"https://doi.org/10.1109/ICMEW56448.2022.9859415","journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)"},"publicationDate":"2022-07-18"}
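The GolfPose abstract above reports results as mean pixel error (MPE) but does not spell out the formula. A common reading, assumed here, is the average Euclidean distance in pixels between predicted and ground-truth keypoints. The sketch below follows that assumption; the function name `mean_pixel_error` and the sample coordinates are hypothetical, and the paper may average over frames or keypoints differently.

```python
import math

def mean_pixel_error(predicted, ground_truth):
    """Average Euclidean distance (in pixels) between matching keypoints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("keypoint lists must be non-empty and the same length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    )
    return total / len(predicted)

# Hypothetical predictions for two keypoints: one is 5 px off, one is exact.
pred = [(3.0, 4.0), (10.0, 10.0)]
truth = [(0.0, 0.0), (10.0, 10.0)]
mpe = mean_pixel_error(pred, truth)
print(mpe)  # 2.5
```

Lower is better under this metric, so the reported drop from 13.98 to 9.21 on golf club keypoints corresponds to predictions landing closer to the annotated positions on average.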