
Proceedings of the 2nd ACM International Conference on Multimedia in Asia: Latest Publications

Hungry networks: 3D mesh reconstruction of a dish and a plate from a single dish image for estimating food volume
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446275
Shu Naritomi, Keiji Yanai
Dietary calorie management has been an important topic in recent years, and various methods and applications for image-based food calorie estimation have been published in the multimedia community. Most existing methods of estimating food calorie amounts use 2D-based image recognition. In this paper, by contrast, we make inferences based on 3D volume for more accurate estimation. We performed 3D reconstruction of a dish (food and plate) and a plate (without food) from a single image. We succeeded in restoring the 3D shape with high accuracy while maintaining consistency between the plate part of the estimated 3D dish and the estimated 3D plate. To achieve this, this paper makes the following contributions. (1) We propose "Hungry Networks," a new network that generates two kinds of 3D volumes from a single image. (2) We introduce a plate consistency loss that matches the shapes of the plate parts of the two reconstructed models. (3) We create a new dataset of 3D food models obtained by 3D scanning actual foods and plates. We also conducted an experiment to infer the volume of only the food region from the difference of the two reconstructed volumes. The results show that the new loss function not only matches the 3D shape of the plate but also contributes to obtaining the volume with higher accuracy. Although some existing studies consider the 3D shapes of foods, this is the first study to generate a 3D mesh volume from a single dish image.
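The paper reconstructs meshes, but the food-volume idea can be illustrated on binary voxel occupancy grids. A hedged sketch (the function names, mask, and toy grids below are hypothetical, not from the paper) of a plate consistency term and of taking the food volume as the difference of the two reconstructions:

```python
import numpy as np

def plate_consistency_loss(dish_vox, plate_vox, plate_mask):
    """Mean squared difference between the plate regions of the two
    reconstructed occupancy volumes (dish = food + plate, plate only)."""
    diff = (dish_vox - plate_vox)[plate_mask]
    return float(np.mean(diff ** 2))

def food_volume(dish_vox, plate_vox, voxel_size=1.0):
    """Estimate food volume as the difference between the two volumes."""
    return (dish_vox.sum() - plate_vox.sum()) * voxel_size ** 3

# Toy example: a 4x4x4 grid where the bottom layer is the plate.
dish = np.zeros((4, 4, 4)); dish[0] = 1.0; dish[1, 1:3, 1:3] = 1.0  # plate + 4 food voxels
plate = np.zeros((4, 4, 4)); plate[0] = 1.0                          # plate only
mask = np.zeros((4, 4, 4), dtype=bool); mask[0] = True               # plate region

assert plate_consistency_loss(dish, plate, mask) == 0.0  # plate parts agree
assert food_volume(dish, plate) == 4.0                   # 4 food voxels
```

When the two reconstructions disagree on the plate region, the loss grows, which is the property the paper's consistency term also enforces.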
Citations: 5
Learning intra-inter semantic aggregation for video object detection
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446273
Jun Liang, Haosheng Chen, Kaiwen Du, Yan Yan, Hanzi Wang
Video object detection is a challenging task due to appearance deterioration in video frames. Object features extracted from different frames of a video are therefore usually deteriorated to varying degrees. Currently, some state-of-the-art methods enhance the deteriorated object features in a reference frame by aggregating the undeteriorated object features extracted from other frames, based simply on the learned appearance relation among object features. In this paper, we propose a novel intra-inter semantic aggregation method (ISA) to learn more effective intra and inter relations for semantically aggregating object features. Specifically, in the proposed ISA, we first introduce an intra semantic aggregation module (Intra-SAM) to enhance deteriorated spatial features based on the learned intra relation among features at different positions of an individual object. Then, we present an inter semantic aggregation module (Inter-SAM) to enhance deteriorated object features in the temporal domain based on the learned inter relation among object features. As a result, by leveraging Intra-SAM and Inter-SAM, the proposed ISA can generate discriminative features from the novel perspective of intra-inter semantic aggregation for robust video object detection. We conduct extensive experiments on the ImageNet VID dataset to evaluate ISA. The proposed ISA obtains 84.5% mAP with ResNet-101 and 85.2% mAP with ResNeXt-101, achieving superior performance compared with several state-of-the-art video object detectors.
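The exact Intra-SAM/Inter-SAM designs are given in the paper; as an illustrative sketch only, relation-weighted aggregation of support-frame features onto a deteriorated reference feature might look like the following (function and variable names are hypothetical):

```python
import numpy as np

def aggregate_features(ref, supports):
    """Enhance a (possibly deteriorated) reference feature by
    attention-weighted aggregation of support features from other frames."""
    sims = supports @ ref / np.sqrt(ref.size)        # scaled relation scores
    w = np.exp(sims - sims.max()); w /= w.sum()      # softmax over supports
    return w @ supports                              # weighted aggregation

rng = np.random.default_rng(0)
ref = rng.normal(size=8)             # reference-frame object feature
supports = rng.normal(size=(5, 8))   # features of the same object in 5 other frames
out = aggregate_features(ref, supports)
assert out.shape == (8,)
```

Supports that are more similar to the reference receive larger weights, so undeteriorated observations dominate the aggregated feature.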
Citations: 1
Classification of multimedia SNS posts about tourist sites based on their focus toward predicting eco-friendly users
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446272
Naoto Kashiwagi, Tokinori Suzuki, Jounghun Lee, Daisuke Ikeda
Overtourism has had a negative impact on many aspects of tourist sites. One of the most serious problems is environmental issues, such as littering, caused by too many visitors. To improve this situation, it is important to change people's mindset to be more environmentally aware. In particular, if we can find people with comparatively high awareness of the environmental issues caused by overtourism, we can work effectively to promote eco-friendly behavior. However, grasping a person's awareness is inherently difficult. To address this challenge, we introduce a new task, called Detecting Focus of Posts about Tourism, which takes users' SNS posts (pictures and comments) about tourist sites and classifies them into focus types that reflect such awareness. Once such posts are classified, the results reveal tendencies in user awareness, so we can discern users' awareness of environmental issues at tourist sites. Specifically, we define four labels for the focus of SNS posts about tourist sites. Based on these labels, we create an evaluation dataset. We present experimental results for the classification task with a CNN classifier for pictures and an LSTM classifier for comments, which serve as baselines for the task.
Citations: 1
A treatment engine by multimodal EMR data
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446254
Zhaomeng Huang, Liyan Zhang, Xu Xu
In recent years, with the development of electronic medical record (EMR) systems, it has become possible to mine patients' clinical data to improve the quality of medical care. After the treatment engine learns knowledge from EMR data, it can automatically recommend the next stage of prescriptions and provide treatment guidelines for doctors and patients. However, this task is challenged by the multi-modality of EMR data. To predict the next stage of treatment prescription more effectively by using multimodal information and the connections between modalities, we propose a cross-modal shared-specific feature complementary generation and attention fusion algorithm. In the feature extraction stage, specific information and shared information are obtained through a shared-specific feature extraction network. To capture the correlation between modalities, we propose a sorting network. In the multimodal feature fusion stage, we use an attention fusion network to assign different weights to the multimodal features at different stages, obtaining a more complete patient representation. Considering the redundancy between specific and shared modal information, we introduce a complementary feature learning strategy, including modality adaptation for shared features, project adversarial learning for specific features, and reconstruction enhancement. Experimental results on the real EMR dataset MIMIC-III demonstrate the superiority of the method and the effectiveness of each component.
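A minimal sketch of the stage-wise attention fusion idea, where learned scores weight per-stage multimodal features before they are combined into one patient representation (the weighting scheme and all names below are assumptions, not the authors' network):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(stage_feats, scores):
    """Fuse per-stage multimodal features with attention weights,
    giving different stages different importance."""
    w = softmax(np.asarray(scores, dtype=float))
    return sum(wi * f for wi, f in zip(w, stage_feats))

# Three stage features of equal dimension; equal scores give their mean.
f1, f2, f3 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
fused = attention_fusion([f1, f2, f3], scores=[0.0, 0.0, 0.0])
assert np.allclose(fused, 2 * np.ones(4))
```

In practice the scores would themselves be predicted from the features, so the network learns which stages matter most for the next-stage prescription.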
Citations: 0
A multi-scale human action recognition method based on Laplacian pyramid depth motion images
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446284
Chang Li, Qian Huang, Xing Li, Qianhan Wu
Human action recognition is an active research area in computer vision. To address the lack of spatial multi-scale information in human action recognition, we present a novel framework that recognizes human actions from depth video sequences using multi-scale Laplacian pyramid depth motion images (LP-DMI). Each depth frame is projected onto three orthogonal Cartesian planes. Under the three views, we generate depth motion images (DMI) and construct Laplacian pyramids as structured multi-scale feature maps, which enhance the multi-scale dynamic information of motions and reduce redundant static information in human bodies. We further extract a multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features. Finally, we utilize an extreme learning machine (ELM) for action classification. Through extensive experiments on the public MSRAction3D dataset, we show that our method outperforms state-of-the-art benchmarks.
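As a rough illustration of the pipeline (not the authors' implementation), a DMI can be accumulated from absolute frame differences and then decomposed into a Laplacian pyramid; the 2x2 mean-pool blur used here is a simplification of the usual Gaussian filtering:

```python
import numpy as np

def depth_motion_image(frames):
    """Accumulate absolute frame-to-frame depth differences (one view)."""
    return sum(np.abs(frames[i + 1] - frames[i]) for i in range(len(frames) - 1))

def laplacian_pyramid(img, levels=3):
    """Each level stores the detail lost by downsampling; the last level
    keeps the coarsest image, so the pyramid is multi-scale and invertible."""
    pyr, cur = [], img.astype(float)
    for _ in range(levels - 1):
        h, w = cur.shape
        down = cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # 2x2 mean pool
        up = np.kron(down, np.ones((2, 2)))                          # nearest upsample
        pyr.append(cur - up)
        cur = down
    pyr.append(cur)
    return pyr

# Uniform motion: every pixel changes by 1 per frame, for 3 transitions.
frames = [np.full((8, 8), float(t)) for t in range(4)]
dmi = depth_motion_image(frames)
pyr = laplacian_pyramid(dmi, levels=3)
assert np.allclose(pyr[0], 0) and np.allclose(pyr[-1], 3)
```

Detail levels of a constant image are zero, while real DMIs concentrate motion energy in the finer levels, which is what the multi-scale descriptor exploits.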
Citations: 3
Video scene detection based on link prediction using graph convolution network
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446293
Yingjiao Pei, Zhongyuan Wang, Heling Chen, Baojin Huang, Weiping Tu
With the development of the Internet, multimedia data is growing exponentially. The demand for video organization, summarization and retrieval has been increasing, and scene detection plays an essential role in these tasks. Existing shot clustering algorithms for scene detection usually treat the temporal shot sequence as unconstrained data. Graph-based scene detection methods can locate scene boundaries by taking the temporal relations among shots into account, but most of them rely only on low-level features to determine whether connected shot pairs are similar. Optimized algorithms that consider the temporal sequence of shots or combine multi-modal features introduce parameter-tuning difficulties and computational burden. In this paper, we propose a novel temporal clustering method based on a graph convolution network and the link transitivity of shot nodes, without involving complicated steps or prior parameter settings such as the number of clusters. In particular, the graph convolution network is used to predict the link probability of node pairs that are close in the temporal sequence. The shots are then clustered into scene segments by merging all predicted links. Experimental results on the BBC and OVSD datasets show that our approach is more robust and effective than the comparison methods in terms of F1-score.
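The merging step can be sketched with union-find: once the GCN predicts positive links between shot pairs, transitively connected shots end up in one scene, and no cluster count is needed up front (the link list below is a made-up example, not the paper's data):

```python
def scenes_from_links(n_shots, links):
    """Cluster shots into scenes by merging all shot pairs whose link is
    predicted positive (transitive closure via union-find)."""
    parent = list(range(n_shots))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in links:
        parent[find(a)] = find(b)

    labels, scenes = {}, []
    for i in range(n_shots):
        r = find(i)
        labels.setdefault(r, len(labels))
        scenes.append(labels[r])
    return scenes

# 6 shots; hypothetical positive links (0,1), (1,2) and (4,5):
assert scenes_from_links(6, [(0, 1), (1, 2), (4, 5)]) == [0, 0, 0, 1, 2, 2]
```

Shot 2 joins shots 0 and 1 through link transitivity even though (0, 2) was never predicted directly.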
Citations: 4
An autoregressive generation model for producing instant basketball defensive trajectory
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446300
Huan-Hua Chang, Wen-Cheng Chen, Wan-Lun Tsai, Min-Chun Hu, W. Chu
Learning basketball tactics in a virtual reality environment requires real-time feedback to improve realism and interactivity. For example, the virtual defender should move immediately according to the player's movement. In this paper, we propose an autoregressive generative model for basketball defensive trajectory generation. To learn the continuous Gaussian distribution of player positions, we adopt a differentiable sampling process to sample candidate locations, together with a standard deviation loss that preserves the diversity of the trajectories. Furthermore, we design several additional loss functions based on basketball domain knowledge to make the generated trajectories match real situations in basketball games. Experimental results show that the proposed method achieves better performance than previous works across different evaluation metrics.
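The differentiable sampling is presumably the standard reparameterization trick, where the sample is written as mean plus noise scaled by the standard deviation. A minimal sketch, with a hypothetical target value in the std loss:

```python
import numpy as np

def sample_position(mu, log_std, rng):
    """Reparameterized sample from N(mu, std): differentiable w.r.t.
    mu and log_std because the noise eps is drawn independently."""
    std = np.exp(log_std)
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps

def std_loss(log_std, target_std=0.5):
    """Penalize collapse (or explosion) of the predicted standard
    deviation, preserving trajectory diversity (target is illustrative)."""
    return float(np.mean((np.exp(log_std) - target_std) ** 2))

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])                        # predicted defender position
pos = sample_position(mu, np.log(np.full(2, 1e-8)), rng)
assert np.allclose(pos, mu, atol=1e-6)           # tiny std -> sample ~= mean
assert std_loss(np.log(np.full(2, 0.5))) < 1e-12
```

Without the std penalty, minimizing a position error alone would push std toward zero and make every generated trajectory identical.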
Citations: 3
Text-based visual question answering with knowledge base
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446306
Fang Zhou, Bei Yin, Zanxia Jin, Heran Wu, Dongyang Zhang
Text-based Visual Question Answering (VQA) usually needs to analyze and understand the text in a picture to give a correct answer to the given question. In this paper, a generic text-based VQA with a Knowledge Base (KB) is proposed, which performs text-based search over the text obtained by optical character recognition (OCR) in images, constructs task-oriented knowledge information, and integrates it into existing models. Due to the complexity of image scenes, OCR accuracy is not very high, and individual characters in words are often misrecognized, resulting in inaccurate text information; with the help of the KB, some of these words can be corrected and the correct image text information added. Moreover, the knowledge information constructed with the KB better explains the image information, allowing the model to fully understand the image and find the appropriate text answer. Experimental results on the TextVQA dataset show that our method improves accuracy, with a maximum improvement of 39.2%.
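The paper does not spell out its correction mechanism here; one plausible minimal sketch uses fuzzy string matching of OCR tokens against a KB vocabulary (the vocabulary, threshold, and function name are illustrative assumptions):

```python
import difflib

def correct_ocr_tokens(tokens, kb_vocab, cutoff=0.8):
    """Replace OCR tokens that closely match a knowledge-base entry;
    tokens with no sufficiently close match are kept unchanged."""
    corrected = []
    for tok in tokens:
        match = difflib.get_close_matches(tok.lower(), kb_vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else tok)
    return corrected

kb = ["starbucks", "coffee", "avenue"]           # hypothetical KB vocabulary
ocr = ["starbvcks", "cofee", "xyz123"]           # noisy OCR output
assert correct_ocr_tokens(ocr, kb) == ["starbucks", "coffee", "xyz123"]
```

Tokens with a single misrecognized character are recovered, while strings unrelated to the KB pass through untouched, so correct OCR output is never overwritten.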
Citations: 0
Cross-modal learning for saliency prediction in mobile environment 移动环境下显著性预测的跨模态学习
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446304
Dakai Ren, X. Wen, Xiao-Yang Liu, Shuai Huang, Jiazhong Chen
Existing research shows that viewing conditions have a significant impact on visual perception when media are viewed on mobile screens. This raises two issues in the area of visual saliency that we need to address: how saliency models perform under mobile conditions, and how to take mobile conditions into account when designing a saliency model. To investigate the performance of saliency models in a mobile environment, eye fixations under four typical mobile conditions are collected as the mobile ground truth in this work. To account for mobile conditions when designing a saliency model, we combine viewing factors and visual stimuli as two modalities, and a cross-modal deep learning architecture is proposed for visual attention prediction. Experimental results demonstrate that a model that considers mobile viewing factors often outperforms models without such consideration.
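The abstract does not specify its evaluation metrics, but saliency predictions are commonly scored against collected eye fixations with metrics such as Normalized Scanpath Saliency (NSS); a minimal sketch under that assumption, with `nss` as an illustrative helper name:

```python
import numpy as np

def nss(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: the mean of the z-scored saliency map
    sampled at fixated locations (fixations is a binary fixation map).
    Higher is better; chance level is around 0."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())
```

A map that is high exactly where observers fixated scores well above 0, while a uniform fixation map (every pixel fixated) scores approximately 0 by construction.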
Citations: 0
Real-time arbitrary video style transfer 实时任意视频风格转换
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446301
Xingyu Liu, Zongxing Ji, Piao Huang, Tongwei Ren
Video style transfer aims to synthesize a stylized video that shares its content structure with a content video and is rendered in the style of a style image. Existing video style transfer methods cannot simultaneously achieve high efficiency, arbitrary styles, and temporal consistency. In this paper, we propose the first real-time arbitrary video style transfer method that uses only one model. Specifically, we adopt a three-network architecture consisting of a prediction network, a stylization network, and a loss network. The prediction network extracts style parameters from a given style image; the stylization network generates the corresponding stylized video; and the loss network trains the prediction and stylization networks with a loss function that combines content loss, style loss, and temporal consistency loss. We conduct three experiments and a user study to test the effectiveness of our method. The experimental results show that our method outperforms the state of the art.
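A hedged sketch of two of the loss terms named above: the Gram-matrix style loss commonly used in neural style transfer, and a temporal consistency loss that compares each stylized frame with the previous frame warped by optical flow. The exact formulation in the paper may differ, and all names here are illustrative:

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Gram matrix of a C x H x W feature map, normalized by its size."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_stylized: np.ndarray, feat_style: np.ndarray) -> float:
    """Mean squared difference between Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(feat_stylized) - gram_matrix(feat_style)) ** 2))

def temporal_loss(frame_t: np.ndarray, warped_prev: np.ndarray) -> float:
    """Penalize flicker: warped_prev is assumed to be the previous stylized
    frame warped into the current frame by optical flow."""
    return float(np.mean((frame_t - warped_prev) ** 2))
```

Both terms are zero when the stylized result already matches the target statistics (style) or the warped previous frame (temporal consistency), so minimizing their weighted sum trades off the two goals.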
Citations: 0