Recognition of Human-object Interaction in Video through a Two-stream Network Integrating Multiple Features
Lunzheng Tan, Rui Ding
2022 8th Annual International Conference on Network and Information Systems for Computers (ICNISC), published 2022-09-01
DOI: https://doi.org/10.1109/ICNISC57059.2022.00050
To understand a scene, a machine must learn not only to recognize individual object instances but also to pair them up and recognize the visual relationships between them. Human-object interaction (HOI) recognition is one of the fundamental tasks in understanding the visual world. In this paper, we address the task of understanding and recognizing HOI in videos, and represent HOI as a doublet. We propose a two-stream network model that fuses multiple features. Because appearance information (the appearance features of instances), spatial information (the distance between each key part of the human body and the interacting object), and motion information (optical flow) in the video are all essential cues for recognizing HOI, our model uses two streams to fuse this information and complete the HOI recognition task. We validate the effectiveness of the model through experiments on two recently proposed public video datasets (Charades and CAD-120), and perform ablation experiments to show the effect of each component.
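The spatial cue and the two-stream fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not say how distances are measured, how features are assigned to the two streams, or how the streams are combined, so everything below is an assumption made for illustration: the function names (`spatial_feature`, `fuse_streams`), the use of the object box center as the reference point, normalization by the box diagonal, and late fusion by a weighted average of softmax scores. It is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def spatial_feature(keypoints: Sequence[Point], object_box: Box) -> List[float]:
    """Distance from each body keypoint to the object box center.

    Assumption: distances are divided by the box diagonal so the feature
    does not depend on image scale. The paper may normalize differently.
    """
    x_min, y_min, x_max, y_max = object_box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    # Fall back to 1.0 for a degenerate (zero-size) box to avoid dividing by zero.
    diagonal = math.hypot(x_max - x_min, y_max - y_min) or 1.0
    return [math.hypot(x - cx, y - cy) / diagonal for x, y in keypoints]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(
    scores_a: Sequence[float], scores_b: Sequence[float], weight_a: float = 0.5
) -> List[float]:
    """Late fusion (an assumption): weighted average of the two streams'
    per-class probabilities."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both streams must score the same set of classes")
    probs_a, probs_b = softmax(scores_a), softmax(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]


if __name__ == "__main__":
    # Two hypothetical keypoints and one hypothetical object box.
    print(spatial_feature([(5.0, 5.0), (0.0, 0.0)], (0.0, 0.0, 6.0, 8.0)))
    # Hypothetical raw class scores from the two streams.
    print(fuse_streams([2.0, 0.5, 0.1], [1.0, 1.5, 0.2]))
```

Averaging class probabilities is only one way to combine streams; fusing intermediate features before classification is an equally plausible reading of the abstract.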