
Proceedings of the 2nd ACM International Conference on Multimedia in Asia: Latest Publications

Hungry networks: 3D mesh reconstruction of a dish and a plate from a single dish image for estimating food volume
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446275
Shu Naritomi, Keiji Yanai
Dietary calorie management has been an important topic in recent years, and various methods and applications for image-based food calorie estimation have been published in the multimedia community. Most existing methods of estimating food calorie amounts use 2D-based image recognition. In this paper, by contrast, we make inferences based on 3D volume for more accurate estimation. We performed 3D reconstruction of a dish (food and plate) and a plate (without food) from a single image. We succeeded in restoring the 3D shape with high accuracy while maintaining consistency between the plate part of the estimated 3D dish and the estimated 3D plate. To achieve this, this paper makes the following contributions. (1) We propose "Hungry Networks," a new network that generates two kinds of 3D volumes from a single image. (2) We introduce a plate consistency loss that matches the shapes of the plate parts of the two reconstructed models. (3) We create a new dataset of 3D food models obtained by 3D scanning actual foods and plates. We also conducted an experiment to infer the volume of only the food region from the difference of the two reconstructed volumes. The results show that the new loss function not only matches the 3D shape of the plate but also contributes to obtaining the volume with higher accuracy. Although some existing studies consider the 3D shapes of foods, this is the first study to generate a 3D mesh volume from a single dish image.
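The paper reconstructs meshes, but the food-volume idea can be illustrated on binary voxel occupancy grids. A hedged sketch (the function names, mask, and toy grids below are hypothetical, not from the paper) of a plate consistency term and of taking the food volume as the difference of the two reconstructions:

```python
import numpy as np

def plate_consistency_loss(dish_vox, plate_vox, plate_mask):
    """Mean squared difference between the plate regions of the two
    reconstructed occupancy volumes (dish = food + plate, plate only)."""
    diff = (dish_vox - plate_vox)[plate_mask]
    return float(np.mean(diff ** 2))

def food_volume(dish_vox, plate_vox, voxel_size=1.0):
    """Estimate food volume as the difference between the two volumes."""
    return (dish_vox.sum() - plate_vox.sum()) * voxel_size ** 3

# Toy example: a 4x4x4 grid where the bottom layer is the plate.
dish = np.zeros((4, 4, 4)); dish[0] = 1.0; dish[1, 1:3, 1:3] = 1.0  # plate + 4 food voxels
plate = np.zeros((4, 4, 4)); plate[0] = 1.0                          # plate only
mask = np.zeros((4, 4, 4), dtype=bool); mask[0] = True               # plate region

assert plate_consistency_loss(dish, plate, mask) == 0.0  # plate parts agree
assert food_volume(dish, plate) == 4.0                   # 4 food voxels
```

When the two reconstructions disagree on the plate region, the loss grows, which is the property the paper's consistency term also enforces.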
Citations: 5
Learning intra-inter semantic aggregation for video object detection
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446273
Jun Liang, Haosheng Chen, Kaiwen Du, Yan Yan, Hanzi Wang
Video object detection is a challenging task due to appearance deterioration in video frames. Object features extracted from different frames of a video are therefore usually deteriorated to varying degrees. Currently, some state-of-the-art methods enhance the deteriorated object features in a reference frame by aggregating the undeteriorated object features extracted from other frames, based simply on the learned appearance relation among object features. In this paper, we propose a novel intra-inter semantic aggregation method (ISA) to learn more effective intra and inter relations for semantically aggregating object features. Specifically, in the proposed ISA, we first introduce an intra semantic aggregation module (Intra-SAM) to enhance deteriorated spatial features based on the learned intra relation among features at different positions of an individual object. Then, we present an inter semantic aggregation module (Inter-SAM) to enhance deteriorated object features in the temporal domain based on the learned inter relation among object features. As a result, by leveraging Intra-SAM and Inter-SAM, the proposed ISA can generate discriminative features from the novel perspective of intra-inter semantic aggregation for robust video object detection. We conduct extensive experiments on the ImageNet VID dataset to evaluate ISA. The proposed ISA obtains 84.5% mAP with ResNet-101 and 85.2% mAP with ResNeXt-101, achieving superior performance compared with several state-of-the-art video object detectors.
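The exact Intra-SAM/Inter-SAM designs are given in the paper; as an illustrative sketch only, relation-weighted aggregation of support-frame features onto a deteriorated reference feature might look like the following (function and variable names are hypothetical):

```python
import numpy as np

def aggregate_features(ref, supports):
    """Enhance a (possibly deteriorated) reference feature by
    attention-weighted aggregation of support features from other frames."""
    sims = supports @ ref / np.sqrt(ref.size)        # scaled relation scores
    w = np.exp(sims - sims.max()); w /= w.sum()      # softmax over supports
    return w @ supports                              # weighted aggregation

rng = np.random.default_rng(0)
ref = rng.normal(size=8)             # reference-frame object feature
supports = rng.normal(size=(5, 8))   # features of the same object in 5 other frames
out = aggregate_features(ref, supports)
assert out.shape == (8,)
```

Supports that are more similar to the reference receive larger weights, so undeteriorated observations dominate the aggregated feature.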
Citations: 1
Classification of multimedia SNS posts about tourist sites based on their focus toward predicting eco-friendly users
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446272
Naoto Kashiwagi, Tokinori Suzuki, Jounghun Lee, Daisuke Ikeda
Overtourism has had a negative impact on many aspects of tourist sites. One of the most serious problems is environmental issues, such as littering, caused by too many visitors. To improve this situation, it is important to change people's mindset to be more environmentally aware. In particular, if we can find people with comparatively high awareness of the environmental issues caused by overtourism, we can work effectively to promote eco-friendly behavior. However, grasping a person's awareness is inherently difficult. To address this challenge, we introduce a new task, called Detecting Focus of Posts about Tourism, which takes users' SNS posts (pictures and comments) about tourist sites and classifies them into focus types that reflect such awareness. Once such posts are classified, the results reveal tendencies in user awareness, so we can discern users' awareness of environmental issues at tourist sites. Specifically, we define four labels for the focus of SNS posts about tourist sites. Based on these labels, we create an evaluation dataset. We present experimental results for the classification task with a CNN classifier for pictures and an LSTM classifier for comments, which serve as baselines for the task.
Citations: 1
A treatment engine by multimodal EMR data
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446254
Zhaomeng Huang, Liyan Zhang, Xu Xu
In recent years, with the development of electronic medical record (EMR) systems, it has become possible to mine patients' clinical data to improve the quality of medical care. After the treatment engine learns knowledge from EMR data, it can automatically recommend the next stage of prescriptions and provide treatment guidelines for doctors and patients. However, this task is challenged by the multi-modality of EMR data. To predict the next stage of treatment prescription more effectively by using multimodal information and the connections between modalities, we propose a cross-modal shared-specific feature complementary generation and attention fusion algorithm. In the feature extraction stage, specific information and shared information are obtained through a shared-specific feature extraction network. To capture the correlation between modalities, we propose a sorting network. In the multimodal feature fusion stage, we use an attention fusion network to assign different weights to the multimodal features at different stages, obtaining a more complete patient representation. Considering the redundancy between specific and shared modal information, we introduce a complementary feature learning strategy, including modality adaptation for shared features, project adversarial learning for specific features, and reconstruction enhancement. Experimental results on the real EMR dataset MIMIC-III demonstrate the superiority of the method and the effectiveness of each component.
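A minimal sketch of the stage-wise attention fusion idea, where learned scores weight per-stage multimodal features before they are combined into one patient representation (the weighting scheme and all names below are assumptions, not the authors' network):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(stage_feats, scores):
    """Fuse per-stage multimodal features with attention weights,
    giving different stages different importance."""
    w = softmax(np.asarray(scores, dtype=float))
    return sum(wi * f for wi, f in zip(w, stage_feats))

# Three stage features of equal dimension; equal scores give their mean.
f1, f2, f3 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
fused = attention_fusion([f1, f2, f3], scores=[0.0, 0.0, 0.0])
assert np.allclose(fused, 2 * np.ones(4))
```

In practice the scores would themselves be predicted from the features, so the network learns which stages matter most for the next-stage prescription.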
Citations: 0
A multi-scale human action recognition method based on Laplacian pyramid depth motion images
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446284
Chang Li, Qian Huang, Xing Li, Qianhan Wu
Human action recognition is an active research area in computer vision. To address the lack of spatial multi-scale information in human action recognition, we present a novel framework that recognizes human actions from depth video sequences using multi-scale Laplacian pyramid depth motion images (LP-DMI). Each depth frame is projected onto three orthogonal Cartesian planes. Under the three views, we generate depth motion images (DMI) and construct Laplacian pyramids as structured multi-scale feature maps, which enhance the multi-scale dynamic information of motions and reduce redundant static information in human bodies. We further extract a multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features. Finally, we utilize an extreme learning machine (ELM) for action classification. Through extensive experiments on the public MSRAction3D dataset, we show that our method outperforms state-of-the-art benchmarks.
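As a rough illustration of the pipeline (not the authors' implementation), a DMI can be accumulated from absolute frame differences and then decomposed into a Laplacian pyramid; the 2x2 mean-pool blur used here is a simplification of the usual Gaussian filtering:

```python
import numpy as np

def depth_motion_image(frames):
    """Accumulate absolute frame-to-frame depth differences (one view)."""
    return sum(np.abs(frames[i + 1] - frames[i]) for i in range(len(frames) - 1))

def laplacian_pyramid(img, levels=3):
    """Each level stores the detail lost by downsampling; the last level
    keeps the coarsest image, so the pyramid is multi-scale and invertible."""
    pyr, cur = [], img.astype(float)
    for _ in range(levels - 1):
        h, w = cur.shape
        down = cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # 2x2 mean pool
        up = np.kron(down, np.ones((2, 2)))                          # nearest upsample
        pyr.append(cur - up)
        cur = down
    pyr.append(cur)
    return pyr

# Uniform motion: every pixel changes by 1 per frame, for 3 transitions.
frames = [np.full((8, 8), float(t)) for t in range(4)]
dmi = depth_motion_image(frames)
pyr = laplacian_pyramid(dmi, levels=3)
assert np.allclose(pyr[0], 0) and np.allclose(pyr[-1], 3)
```

Detail levels of a constant image are zero, while real DMIs concentrate motion energy in the finer levels, which is what the multi-scale descriptor exploits.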
Citations: 3
Video scene detection based on link prediction using graph convolution network
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446293
Yingjiao Pei, Zhongyuan Wang, Heling Chen, Baojin Huang, Weiping Tu
With the development of the Internet, multimedia data is growing exponentially. The demand for video organization, summarization and retrieval has been increasing, and scene detection plays an essential role in these tasks. Existing shot clustering algorithms for scene detection usually treat the temporal shot sequence as unconstrained data. Graph-based scene detection methods can locate scene boundaries by taking the temporal relations among shots into account, but most of them rely only on low-level features to determine whether connected shot pairs are similar. Optimized algorithms that consider the temporal sequence of shots or combine multi-modal features introduce parameter-tuning difficulties and computational burden. In this paper, we propose a novel temporal clustering method based on a graph convolution network and the link transitivity of shot nodes, without involving complicated steps or prior parameter settings such as the number of clusters. In particular, the graph convolution network is used to predict the link probability of node pairs that are close in the temporal sequence. The shots are then clustered into scene segments by merging all predicted links. Experimental results on the BBC and OVSD datasets show that our approach is more robust and effective than the comparison methods in terms of F1-score.
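The merging step can be sketched with union-find: once the GCN predicts positive links between shot pairs, transitively connected shots end up in one scene, and no cluster count is needed up front (the link list below is a made-up example, not the paper's data):

```python
def scenes_from_links(n_shots, links):
    """Cluster shots into scenes by merging all shot pairs whose link is
    predicted positive (transitive closure via union-find)."""
    parent = list(range(n_shots))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in links:
        parent[find(a)] = find(b)

    labels, scenes = {}, []
    for i in range(n_shots):
        r = find(i)
        labels.setdefault(r, len(labels))
        scenes.append(labels[r])
    return scenes

# 6 shots; hypothetical positive links (0,1), (1,2) and (4,5):
assert scenes_from_links(6, [(0, 1), (1, 2), (4, 5)]) == [0, 0, 0, 1, 2, 2]
```

Shot 2 joins shots 0 and 1 through link transitivity even though (0, 2) was never predicted directly.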
Citations: 4
An autoregressive generation model for producing instant basketball defensive trajectory
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446300
Huan-Hua Chang, Wen-Cheng Chen, Wan-Lun Tsai, Min-Chun Hu, W. Chu
Learning basketball tactics in a virtual reality environment requires real-time feedback to improve realism and interactivity. For example, the virtual defender should move immediately according to the player's movement. In this paper, we propose an autoregressive generative model for basketball defensive trajectory generation. To learn the continuous Gaussian distribution of player positions, we adopt a differentiable sampling process to sample candidate locations, together with a standard deviation loss that preserves the diversity of the trajectories. Furthermore, we design several additional loss functions based on basketball domain knowledge to make the generated trajectories match real situations in basketball games. Experimental results show that the proposed method achieves better performance than previous works across different evaluation metrics.
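The differentiable sampling is presumably the standard reparameterization trick, where the sample is written as mean plus noise scaled by the standard deviation. A minimal sketch, with a hypothetical target value in the std loss:

```python
import numpy as np

def sample_position(mu, log_std, rng):
    """Reparameterized sample from N(mu, std): differentiable w.r.t.
    mu and log_std because the noise eps is drawn independently."""
    std = np.exp(log_std)
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps

def std_loss(log_std, target_std=0.5):
    """Penalize collapse (or explosion) of the predicted standard
    deviation, preserving trajectory diversity (target is illustrative)."""
    return float(np.mean((np.exp(log_std) - target_std) ** 2))

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])                        # predicted defender position
pos = sample_position(mu, np.log(np.full(2, 1e-8)), rng)
assert np.allclose(pos, mu, atol=1e-6)           # tiny std -> sample ~= mean
assert std_loss(np.log(np.full(2, 0.5))) < 1e-12
```

Without the std penalty, minimizing a position error alone would push std toward zero and make every generated trajectory identical.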
Citations: 3
Text-based visual question answering with knowledge base
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446306
Fang Zhou, Bei Yin, Zanxia Jin, Heran Wu, Dongyang Zhang
Text-based Visual Question Answering (VQA) usually needs to analyze and understand the text in a picture to give a correct answer to the given question. In this paper, a generic text-based VQA with a Knowledge Base (KB) is proposed, which performs text-based search over the text obtained by optical character recognition (OCR) in images, constructs task-oriented knowledge information, and integrates it into existing models. Due to the complexity of image scenes, OCR accuracy is not very high, and individual characters in words are often misrecognized, resulting in inaccurate text information; with the help of the KB, some of these words can be corrected and the correct image text information added. Moreover, the knowledge information constructed with the KB better explains the image information, allowing the model to fully understand the image and find the appropriate text answer. Experimental results on the TextVQA dataset show that our method improves accuracy, with a maximum improvement of 39.2%.
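The paper does not spell out its correction mechanism here; one plausible minimal sketch uses fuzzy string matching of OCR tokens against a KB vocabulary (the vocabulary, threshold, and function name are illustrative assumptions):

```python
import difflib

def correct_ocr_tokens(tokens, kb_vocab, cutoff=0.8):
    """Replace OCR tokens that closely match a knowledge-base entry;
    tokens with no sufficiently close match are kept unchanged."""
    corrected = []
    for tok in tokens:
        match = difflib.get_close_matches(tok.lower(), kb_vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else tok)
    return corrected

kb = ["starbucks", "coffee", "avenue"]           # hypothetical KB vocabulary
ocr = ["starbvcks", "cofee", "xyz123"]           # noisy OCR output
assert correct_ocr_tokens(ocr, kb) == ["starbucks", "coffee", "xyz123"]
```

Tokens with a single misrecognized character are recovered, while strings unrelated to the KB pass through untouched, so correct OCR output is never overwritten.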
Citations: 0
Cross-modal learning for saliency prediction in mobile environment 移动环境下显著性预测的跨模态学习
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446304
Dakai Ren, X. Wen, Xiao-Yang Liu, Shuai Huang, Jiazhong Chen
Existing research shows that viewing conditions have a significant impact on visual perception when media are viewed on mobile screens. This raises two issues in the area of visual saliency that we need to address: how saliency models perform under mobile conditions, and how to take mobile conditions into account when designing a saliency model. To investigate the performance of saliency models in a mobile environment, eye fixations under four typical mobile conditions are collected as the mobile ground truth in this work. To account for mobile conditions when designing a saliency model, we combine viewing factors and visual stimuli as two modalities, and a cross-modal deep learning architecture is proposed for visual attention prediction. Experimental results demonstrate that a model that considers mobile viewing factors often outperforms models without such consideration.
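The abstract does not specify its evaluation metrics, but saliency predictions are commonly scored against collected eye fixations with metrics such as Normalized Scanpath Saliency (NSS); a minimal sketch under that assumption, with `nss` as an illustrative helper name:

```python
import numpy as np

def nss(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: the mean of the z-scored saliency map
    sampled at fixated locations (fixations is a binary fixation map).
    Higher is better; chance level is around 0."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())
```

A map that is high exactly where observers fixated scores well above 0, while a uniform fixation map (every pixel fixated) scores approximately 0 by construction.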
Citations: 0
Real-time arbitrary video style transfer 实时任意视频风格转换
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446301
Xingyu Liu, Zongxing Ji, Piao Huang, Tongwei Ren
Video style transfer aims to synthesize a stylized video that shares its content structure with a content video and is rendered in the style of a style image. Existing video style transfer methods cannot simultaneously achieve high efficiency, arbitrary styles, and temporal consistency. In this paper, we propose the first real-time arbitrary video style transfer method that uses only one model. Specifically, we adopt a three-network architecture consisting of a prediction network, a stylization network, and a loss network. The prediction network extracts style parameters from a given style image; the stylization network generates the corresponding stylized video; and the loss network trains the prediction and stylization networks with a loss function that combines content loss, style loss, and temporal consistency loss. We conduct three experiments and a user study to test the effectiveness of our method. The experimental results show that our method outperforms the state of the art.
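A hedged sketch of two of the loss terms named above: the Gram-matrix style loss commonly used in neural style transfer, and a temporal consistency loss that compares each stylized frame with the previous frame warped by optical flow. The exact formulation in the paper may differ, and all names here are illustrative:

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Gram matrix of a C x H x W feature map, normalized by its size."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_stylized: np.ndarray, feat_style: np.ndarray) -> float:
    """Mean squared difference between Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(feat_stylized) - gram_matrix(feat_style)) ** 2))

def temporal_loss(frame_t: np.ndarray, warped_prev: np.ndarray) -> float:
    """Penalize flicker: warped_prev is assumed to be the previous stylized
    frame warped into the current frame by optical flow."""
    return float(np.mean((frame_t - warped_prev) ** 2))
```

Both terms are zero when the stylized result already matches the target statistics (style) or the warped previous frame (temporal consistency), so minimizing their weighted sum trades off the two goals.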
Citations: 0