Toward Using Multi-Modal Machine Learning for User Behavior Prediction in Simulated Smart Home for Extended Reality
Powen Yao, Yu Hou, Yuan He, Da Cheng, Huanpu Hu, Michael Zyda
2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), March 2022
DOI: 10.1109/VRW55335.2022.00195
Abstract
In this work, we propose a multi-modal approach to manipulating smart home devices in a smart home environment simulated in virtual reality (VR). We determine the user's target device and the desired action from their utterance, from spatial information (gestures, positions, etc.), or from a combination of the two. Since the information contained in the user's utterance and the spatial information can be disjoint or complementary, we process the two sources in parallel using our array of machine learning models. We use ensemble modeling to aggregate the results of these models and improve the quality of the final prediction. We present our preliminary architecture, models, and findings.
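To illustrate the ensemble-aggregation step described above, the following is a minimal sketch (not the authors' implementation) of combining per-modality predictions by weighted soft voting. The function name `ensemble_predict`, the probability dictionaries, and the device labels are all hypothetical; the sketch only assumes that each modality-specific model emits class probabilities for candidate devices or actions, which may cover disjoint label sets.

```python
# Illustrative sketch only: weighted soft-voting ensemble over two
# modality-specific prediction distributions (utterance vs. spatial).
# All names and labels here are hypothetical, not from the paper.
from typing import Dict


def ensemble_predict(utterance_probs: Dict[str, float],
                     spatial_probs: Dict[str, float],
                     utterance_weight: float = 0.5) -> str:
    """Return the label with the highest weighted-average probability.

    Because the two information sources can be disjoint, a label seen by
    only one modality keeps that modality's score unchanged.
    """
    labels = set(utterance_probs) | set(spatial_probs)
    scores = {}
    for label in labels:
        u = utterance_probs.get(label)
        s = spatial_probs.get(label)
        if u is None:
            scores[label] = s
        elif s is None:
            scores[label] = u
        else:
            scores[label] = utterance_weight * u + (1.0 - utterance_weight) * s
    return max(scores, key=scores.get)


# Example: the utterance model favors the lamp, while spatial cues
# (e.g., pointing direction) favor the fan; the ensemble arbitrates.
print(ensemble_predict({"lamp": 0.7, "fan": 0.3}, {"fan": 0.6, "tv": 0.4}))
```

The weighting parameter is one simple way to trade off the two modalities; the paper itself only states that model outputs are aggregated by ensemble modeling, without specifying this particular scheme.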