In this paper, we propose a new algorithm for High Efficiency Video Coding (HEVC) based on multi-objective particle swarm optimization (MOPSO) that enhances the visual quality of the region of interest (ROI) while maintaining a certain overall quality. Based on the R-λ model of the detected ROI, the fitness functions in MOPSO are designed as the distortion of the ROI and the distortion of the overall frame. Each particle consists of the ROI's rate and the rate of the remaining region. Iterating the MOPSO algorithm yields the Pareto front. The final bit allocation, that is, the appropriate bit rates for the ROI and the non-ROI, is then selected from this set. Finally, the coding parameters are determined from the R-λ model. Experimental results show that the proposed algorithm improves the visual quality of the ROI while guaranteeing overall visual quality.
{"title":"Multi-Objective Particle Swarm Optimization for ROI based Video Coding","authors":"Guangjie Ren, Feiyang Liu, Daiqin Yang, Yiyong Zha, Yunfei Zhang, Xin Liu","doi":"10.1145/3338533.3366608","DOIUrl":"https://doi.org/10.1145/3338533.3366608","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"15 1","pages":"0"},"publicationDate":"2019-12-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"121846440"}
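The ROI bit-allocation abstract above describes obtaining a Pareto front over two objectives (ROI distortion and whole-frame distortion) and then picking one allocation from it. The following is a minimal sketch of that last step only, not the authors' implementation: the candidate values are made up, and the selection rule (lowest ROI distortion subject to a cap on frame distortion) is an assumption, since the abstract does not state how the final point is chosen.

```python
from typing import List, Optional, Tuple

# (roi_rate, non_roi_rate, roi_distortion, frame_distortion)
# All numbers below are hypothetical and only illustrate the mechanics.
Candidate = Tuple[float, float, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b in both distortions and better in one."""
    no_worse = a[2] <= b[2] and a[3] <= b[3]
    strictly_better = a[2] < b[2] or a[3] < b[3]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def select(front: List[Candidate], max_frame_distortion: float) -> Optional[Candidate]:
    """Assumed rule: best ROI quality among points meeting the frame cap."""
    feasible = [c for c in front if c[3] <= max_frame_distortion]
    return min(feasible, key=lambda c: c[2]) if feasible else None


candidates = [
    (0.60, 0.40, 2.0, 9.0),
    (0.50, 0.50, 3.0, 7.0),
    (0.40, 0.60, 5.0, 6.0),
    (0.45, 0.55, 5.5, 7.5),   # dominated by the second candidate
    (0.70, 0.30, 1.5, 12.0),
]
front = pareto_front(candidates)
best = select(front, max_frame_distortion=8.0)
print(len(front), best)
```

In a full MOPSO loop the candidates would come from the particle archive, and each distortion would be estimated from the rate through the R-λ model rather than listed by hand.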
Weiming Zhang, Yi Huang, Wanting Yu, Xiaoshan Yang, Wei Wang, J. Sang
Human Activity Recognition (HAR) automatically recognizes human activities, such as those of daily life and work, from digital records, and is of great significance to the medical and health fields. Egocentric video and human acceleration data describe human activity patterns from different aspects and have laid a foundation for activity recognition based on multimodal behavior data. However, two problems limit technical progress in this field. First, the low-level multimodal signals differ greatly in structure, and their mapping to high-level activities is complicated. Second, labeling activities in multimodal behavior data is costly, so the amount of labeled data is limited. In this paper, we propose MAFE, an activity recognition model based on multimodal attribute feature embedding. Before activity recognition, middle-level attribute features are extracted from the low-level signals of the different modalities. This reduces the complexity of the mapping from low-level signals to high-level activities, and it allows a large amount of middle-level attribute labeling data to be used, which reduces the dependency on activity labeling data. We conducted experiments on the Stanford-ECM dataset to verify the effectiveness of the proposed MAFE method.
{"title":"Multimodal Attribute and Feature Embedding for Activity Recognition","authors":"Weiming Zhang, Yi Huang, Wanting Yu, Xiaoshan Yang, Wei Wang, J. Sang","doi":"10.1145/3338533.3366592","DOIUrl":"https://doi.org/10.1145/3338533.3366592","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"1 1","pages":"0"},"publicationDate":"2019-12-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129276184"}
Segmentation of a specific organ or tissue plays an important role in medical image analysis, especially with the rapid development of clinical decision support systems. Segmenting lung nodules in images acquired with medical imaging equipment can help physicians diagnose lung cancer and formulate appropriate schemes. Lung nodule segmentation has therefore attracted considerable attention in recent years. However, the task faces several challenges, including the similar intensity of lung nodules and vessels, inaccurate boundaries, and the presence of noise in most images. In this paper, we propose an automated segmentation method for lung nodules in CT images. In the first stage, a nodule detection network generates region proposals and locates the bounding boxes of nodules, which serve as the initial input for the subsequent segmentation. In the second stage, the nodules are segmented within the bounding boxes. Because locating the nodule in advance reduces the image area over which region growing operates, segmentation becomes more efficient. Localizing the nodule before segmentation also excludes some tissues of similar intensity from the object region. The proposed method is evaluated on a public lung nodule dataset, and the experimental results indicate its effectiveness and efficiency.
{"title":"An Automated Lung Nodule Segmentation Method Based On Nodule Detection Network and Region Growing","authors":"Yanhao Tan, K. Lu, Jian Xue","doi":"10.1145/3338533.3366604","DOIUrl":"https://doi.org/10.1145/3338533.3366604","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"25 1","pages":"0"},"publicationDate":"2019-12-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"125561309"}
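The lung nodule abstract above relies on restricting region growing to a detected bounding box. The sketch below illustrates that idea on a toy 2D intensity grid. It is not the authors' code: the seed choice (box centre), the 4-connectivity, the fixed intensity tolerance, and the image values are all assumptions made for illustration.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def region_grow(image: List[List[int]],
                box: Tuple[int, int, int, int],
                tol: int) -> Set[Pixel]:
    """Grow a region from the box centre, never leaving the box.

    box is (row_start, col_start, row_end, col_end), end-exclusive.
    A neighbour joins if its intensity is within tol of the seed.
    """
    r0, c0, r1, c1 = box
    seed = ((r0 + r1) // 2, (c0 + c1) // 2)   # assumed seed: box centre
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connectivity
            nr, nc = r + dr, c + dc
            inside_box = r0 <= nr < r1 and c0 <= nc < c1
            if (inside_box and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tol):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy image: a bright blob near the centre, plus a vessel-like bright
# structure (columns 4-5) that touches the blob at row 2.
image = [
    [0,  0,  0,  0,  0, 90],
    [0,  0, 80, 85,  0, 90],
    [0, 82, 90, 88, 89, 90],
    [0,  0, 84,  0,  0, 90],
    [0,  0,  0,  0,  0, 90],
]
nodule = region_grow(image, box=(1, 1, 4, 4), tol=10)
print(sorted(nodule))
```

Here the bright pixels in columns 4 and 5 are connected to the blob and have similar intensity, so unrestricted growing would leak into them; limiting growth to the box keeps them out and also bounds the number of pixels visited.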
A typical tag/keyword-based search system retrieves the documents in which a given query term q occurs. However, applying such systems to a real-world font web community raises practical challenges: font tags are more subjective than the tags in benchmark datasets, which magnifies the tag mismatch problem. To address these challenges, we propose a tag dictionary space built on word embeddings, which relates undefined words that have similar meanings. Even if a query is not defined in the tag dictionary, we can represent it as a vector in the tag dictionary space. The proposed system supports multi-modal input, accepting both textual and image queries. By integrating a visual sentiment concept model that classifies the affective concepts of a given image as adjective-noun pairs and uses them as a query, users can interact with the search system in a multi-modal way. We used crowdsourcing to collect user ratings for the retrieved fonts and observed that the fonts retrieved by the proposed methods obtained higher scores than those retrieved by other methods.
{"title":"Social Font Search by Multimodal Feature Embedding","authors":"Saemi Choi, Shun Matsumura, K. Aizawa","doi":"10.1145/3338533.3366595","DOIUrl":"https://doi.org/10.1145/3338533.3366595","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"33 1","pages":"0"},"publicationDate":"2019-12-15","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"115034939"}
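The font search abstract above says that a query missing from the tag dictionary can still be represented as a vector in the tag dictionary space. One plausible reading, sketched below, is to describe a word by its cosine similarity to every dictionary tag. This is an assumption about the construction, not the authors' definition, and the three-dimensional word vectors are invented stand-ins for real word embeddings.

```python
import math
from typing import Dict, List

# Hypothetical 3-d word embeddings; a real system would load trained vectors.
word_vecs: Dict[str, List[float]] = {
    "elegant":  [0.9, 0.1, 0.0],
    "playful":  [0.0, 0.9, 0.2],
    "bold":     [0.1, 0.2, 0.9],
    "graceful": [0.8, 0.2, 0.1],   # query word, not a dictionary tag
}
tag_dictionary = ["elegant", "playful", "bold"]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def to_tag_space(word: str) -> List[float]:
    """Represent a word by its similarity to each dictionary tag."""
    return [cosine(word_vecs[word], word_vecs[tag]) for tag in tag_dictionary]


def nearest_tag(word: str) -> str:
    """Dictionary tag most similar to the (possibly undefined) query word."""
    sims = to_tag_space(word)
    return tag_dictionary[max(range(len(sims)), key=sims.__getitem__)]


print(to_tag_space("graceful"))
print(nearest_tag("graceful"))
```

Fonts tagged with dictionary words could then be ranked by comparing their tag-space vectors with the query's, which is how an undefined word such as "graceful" can still retrieve fonts tagged "elegant".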
Lip reading aims to decode text from the movement of a speaker's mouth. In recent years, lip reading methods have made great progress for English, at both the word level and the sentence level. Unlike English, however, Chinese Mandarin is a tonal language that relies on pitch to distinguish lexical or grammatical meaning, which significantly increases the ambiguity of the lip reading task. In this paper, we propose a Cascade Sequence-to-Sequence Model for Chinese Mandarin (CSSMCM) lip reading, which explicitly models tones when predicting sentences. Tones are modeled based on visual information and syntactic structure, and are then used, along with visual information and syntactic structure, to predict sentences. To evaluate CSSMCM, we collected and released a dataset called CMLR (Chinese Mandarin Lip Reading), consisting of over 100,000 natural sentences from the China Network Television website. When trained on the CMLR dataset, the proposed CSSMCM surpasses the performance of state-of-the-art lip reading frameworks, which confirms the effectiveness of explicitly modeling tones for Chinese Mandarin lip reading.
{"title":"A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading","authors":"Ya Zhao, Rui Xu, Mingli Song","doi":"10.1145/3338533.3366579","DOIUrl":"https://doi.org/10.1145/3338533.3366579","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"20 1","pages":"0"},"publicationDate":"2019-08-14","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"131393602"}
Typical convolutional networks are trained and run on RGB images. However, in real-world applications images are often compressed to save memory and to enable efficient transmission. In this paper, we explore methods for performing semantic segmentation on the discrete cosine transform (DCT) representation defined by the JPEG standard. We first rearrange the DCT coefficients to form a preferred input type, and then tailor an existing network to the DCT inputs. The proposed method achieves accuracy close to that of the RGB model at about the same network complexity. Moreover, we investigate the impact of selecting different DCT components on segmentation performance. With a proper selection, one can achieve the same level of accuracy using only 36% of the DCT coefficients. We further show the robustness of our method to quantization errors. To our knowledge, this paper is the first to explore semantic segmentation on the DCT representation.
{"title":"Exploring Semantic Segmentation on the DCT Representation","authors":"Shao-Yuan Lo, H. Hang","doi":"10.1145/3338533.3366557","DOIUrl":"https://doi.org/10.1145/3338533.3366557","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"50 1","pages":"0"},"publicationDate":"2019-07-23","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"127886660"}
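The DCT segmentation abstract above mentions rearranging DCT coefficients into a network input and then selecting a subset of components. The abstract does not specify the layout, so the sketch below shows one common arrangement as an assumption: every 8×8 block of a coefficient plane becomes one spatial position with 64 channels, and a subset of channels is then kept. The input plane, the row-major channel order (not the JPEG zigzag order), and the choice of the first 23 channels (about 36% of 64) are all illustrative.

```python
from typing import Iterable, List

Plane = List[List[float]]
Tensor = List[List[List[float]]]   # [block_row][block_col][channel]


def blocks_to_channels(plane: Plane, block: int = 8) -> Tensor:
    """Turn an H x W plane of block-wise coefficients into an
    (H/block) x (W/block) x (block*block) tensor.

    Channel index is block*dy + dx for the coefficient at offset
    (dy, dx) inside its block.
    """
    height, width = len(plane), len(plane[0])
    if height % block or width % block:
        raise ValueError("plane size must be a multiple of the block size")
    tensor: Tensor = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            row.append([plane[by + dy][bx + dx]
                        for dy in range(block) for dx in range(block)])
        tensor.append(row)
    return tensor


def keep_channels(tensor: Tensor, indices: Iterable[int]) -> Tensor:
    """Keep only the listed coefficient channels at every position."""
    idx = list(indices)
    return [[[cell[i] for i in idx] for cell in row] for row in tensor]


# Stand-in 16 x 16 plane whose values encode their own position.
plane = [[float(y * 16 + x) for x in range(16)] for y in range(16)]
tensor = blocks_to_channels(plane)          # 2 x 2 x 64
reduced = keep_channels(tensor, range(23))  # 2 x 2 x 23
print(len(tensor), len(tensor[0]), len(tensor[0][0]), len(reduced[0][0]))
```

With this layout the spatial resolution drops by a factor of 8 in each direction while the channel count grows to 64, which is why an RGB network typically needs its early layers adapted before it can take such input.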
Although skeleton-based action recognition has achieved great success in recent years, most existing methods may suffer from large model size and slow execution speed. To alleviate this issue, we analyze the properties of skeleton sequences and propose a Double-feature Double-motion Network (DD-Net) for skeleton-based action recognition. With a lightweight network structure (0.15 million parameters), DD-Net reaches very high speeds: 3,500 FPS on an ordinary GPU (e.g., GTX 1080Ti) or 2,000 FPS on an ordinary CPU (e.g., Intel E5-2620). By employing robust features, DD-Net achieves state-of-the-art performance on our experimental datasets: SHREC (hand actions) and JHMDB (body actions). Our code is available at https://github.com/fandulu/DD-Net.
{"title":"Make Skeleton-based Action Recognition Model Smaller, Faster and Better","authors":"Fan Yang, S. Sakti, Yang Wu, Satoshi Nakamura","doi":"10.1145/3338533.3366569","DOIUrl":"https://doi.org/10.1145/3338533.3366569","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"28 1","pages":"0"},"publicationDate":"2019-07-23","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"114690715"}
Real-time semantic segmentation plays an important role in practical applications such as self-driving and robotics. Most semantic segmentation research focuses on improving estimation accuracy with little consideration for efficiency. Previous studies that emphasize high-speed inference often fail to produce high-accuracy segmentation results. In this paper, we propose a novel convolutional network named Efficient Dense modules with Asymmetric convolution (EDANet), which employs an asymmetric convolution structure and incorporates dilated convolution and dense connectivity to achieve high efficiency at low computational cost and model size. EDANet is 2.7 times faster than the existing fast segmentation network ICNet, while achieving a similar mIoU score without any additional context module, post-processing scheme, or pretrained model. We evaluate EDANet on the Cityscapes and CamVid datasets and compare it with other state-of-the-art systems. Our network can process high-resolution inputs at 108 FPS on one GTX 1080Ti.
{"title":"Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation","authors":"Shao-Yuan Lo, H. Hang, S. Chan, Jing-Jhih Lin","doi":"10.1145/3338533.3366558","DOIUrl":"https://doi.org/10.1145/3338533.3366558","PeriodicalId":273086,"journal":{"name":"Proceedings of the ACM Multimedia Asia","volume":"11 1","pages":"0"},"publicationDate":"2018-09-17","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"127259121"}
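The EDANet abstract above attributes its low cost partly to asymmetric and dilated convolutions. The arithmetic behind that claim can be checked with a short sketch. It only counts weights and kernel extents for generic layers; the channel count of 40 is a hypothetical value, and nothing here reproduces the actual EDANet architecture.

```python
def conv_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    """Number of weights in a k_h x k_w convolution (bias omitted)."""
    return c_in * c_out * k_h * k_w


def asymmetric_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a k x 1 convolution followed by a 1 x k convolution,
    the factorized replacement for a single k x k convolution."""
    return conv_params(c_in, c_out, k, 1) + conv_params(c_out, c_out, 1, k)


def effective_kernel(k: int, dilation: int) -> int:
    """Extent covered by a dilated kernel: k + (k - 1) * (dilation - 1)."""
    return k + (k - 1) * (dilation - 1)


channels = 40  # hypothetical layer width, for illustration only
full = conv_params(channels, channels, 3, 3)
factorized = asymmetric_params(channels, channels, 3)
print(full, factorized, factorized / full)
print(effective_kernel(3, 1), effective_kernel(3, 2), effective_kernel(3, 4))
```

For equal input and output widths, the 3×1 plus 1×3 pair uses two thirds of the weights of a 3×3 layer, and dilation enlarges the covered extent without adding any weights.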