Abacus Gestures
Md Ehtesham-Ul-Haque, Syed Masum Billah
Designing an extensive set of mid-air gestures that are both easy to learn and quick to perform presents a significant challenge. Further complicating this challenge is achieving high-accuracy detection of such gestures using commonly available hardware, such as a 2D commodity camera. Previous work often proposed smaller, application-specific gesture sets that required specialized hardware and struggled to adapt across diverse environments. Addressing these limitations, this paper introduces Abacus Gestures, a comprehensive collection of 100 mid-air gestures. Drawing on the metaphor of Finger Abacus counting, the gestures are formed from combinations of open and closed fingers, with each finger assigned a different value. We developed an algorithm using an off-the-shelf computer vision library that detects these gestures from a 2D commodity camera feed with an accuracy exceeding 98% for palms facing the camera and 95% for palms facing the body. We assessed the detection accuracy, ease of learning, and usability of these gestures in a user study involving 20 participants. The study found that participants could learn Abacus Gestures within five minutes after executing just 15 gestures and could recall them after a four-month interval. Additionally, most participants developed motor memory for these gestures after performing 100 gestures. Most gestures were easy to execute with the designated finger combinations, and the flexibility of executing a gesture with multiple finger combinations further enhanced usability. Based on these findings, we created a taxonomy that categorizes Abacus Gestures into five groups based on motor memory development and three difficulty levels according to ease of execution. Finally, we provide design guidelines and propose potential use cases for Abacus Gestures in the realm of mid-air interaction.
{"title":"Abacus Gestures","authors":"Md Ehtesham-Ul-Haque, Syed Masum Billah","doi":"10.1145/3610898","DOIUrl":"https://doi.org/10.1145/3610898","url":null,"abstract":"Designing an extensive set of mid-air gestures that are both easy to learn and perform quickly presents a significant challenge. Further complicating this challenge is achieving high-accuracy detection of such gestures using commonly available hardware, like a 2D commodity camera. Previous work often proposed smaller, application-specific gesture sets, requiring specialized hardware and struggling with adaptability across diverse environments. Addressing these limitations, this paper introduces Abacus Gestures, a comprehensive collection of 100 mid-air gestures. Drawing on the metaphor of Finger Abacus counting, gestures are formed from various combinations of open and closed fingers, each assigned different values. We developed an algorithm using an off-the-shelf computer vision library capable of detecting these gestures from a 2D commodity camera feed with an accuracy exceeding 98% for palms facing the camera and 95% for palms facing the body. We assessed the detection accuracy, ease of learning, and usability of these gestures in a user study involving 20 participants. The study found that participants could learn Abacus Gestures within five minutes after executing just 15 gestures and could recall them after a four-month interval. Additionally, most participants developed motor memory for these gestures after performing 100 gestures. Most of the gestures were easy to execute with the designated finger combinations, and the flexibility in executing the gestures using multiple finger combinations further enhanced the usability. Based on these findings, we created a taxonomy that categorizes Abacus Gestures into five groups based on motor memory development and three difficulty levels according to their ease of execution. Finally, we provided design guidelines and proposed potential use cases for Abacus Gestures in the realm of mid-air interaction.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135536094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MicroCam
Yongquan Hu, Hui-Shyong Yeo, Mingyue Yuan, Haoran Fan, Don Samitha Elvitigala, Wen Hu, Aaron Quigley
The primary focus of this research is the discreet and subtle everyday contact interactions between mobile phones and their surrounding surfaces. Such interactions are anticipated to facilitate mobile context awareness, encompassing aspects such as delivering medication updates, intelligently switching modes (e.g., silent mode), or initiating commands (e.g., deactivating an alarm). We introduce MicroCam, a contact-based sensing system that employs smartphone IMU data to detect the routine state of phone placement and utilizes a built-in microscope camera to capture intricate surface details. In particular, we collect a naturalistic dataset of authentic surface textures in situ for training and testing. Moreover, we optimize the deep neural network component of the algorithm with continual learning to accurately discriminate between object categories (e.g., tables) and material constituents (e.g., wood). Experimental results highlight the superior accuracy, robustness, and generalization of the proposed method. Lastly, we discuss our prototype in depth, covering system performance as well as potential applications and scenarios.
{"title":"MicroCam","authors":"Yongquan Hu, Hui-Shyong Yeo, Mingyue Yuan, Haoran Fan, Don Samitha Elvitigala, Wen Hu, Aaron Quigley","doi":"10.1145/3610921","DOIUrl":"https://doi.org/10.1145/3610921","url":null,"abstract":"The primary focus of this research is the discreet and subtle everyday contact interactions between mobile phones and their surrounding surfaces. Such interactions are anticipated to facilitate mobile context awareness, encompassing aspects such as dispensing medication updates, intelligently switching modes (e.g., silent mode), or initiating commands (e.g., deactivating an alarm). We introduce MicroCam, a contact-based sensing system that employs smartphone IMU data to detect the routine state of phone placement and utilizes a built-in microscope camera to capture intricate surface details. In particular, a natural dataset is collected to acquire authentic surface textures in situ for training and testing. Moreover, we optimize the deep neural network component of the algorithm, based on continual learning, to accurately discriminate between object categories (e.g., tables) and material constituents (e.g., wood). Experimental results highlight the superior accuracy, robustness and generalization of the proposed method. Lastly, we conducted a comprehensive discussion centered on our prototype, encompassing topics such as system performance and potential applications and scenarios.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135536449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PATCH
Juexing Wang, Guangjing Wang, Xiao Zhang, Li Liu, Huacheng Zeng, Li Xiao, Zhichao Cao, Lin Gu, Tianxing Li
Recent advancements in deep learning have shown that multimodal inference can be particularly useful in tasks like autonomous driving, human health, and production line monitoring. However, deploying state-of-the-art multimodal models in distributed IoT systems poses unique challenges, since sensor data from low-cost edge devices can be corrupted, lost, or delayed before reaching the cloud. These problems are magnified by asymmetric data generation rates across sensor modalities, wireless network dynamics, and unpredictable sensor behavior, leading to either increased latency or degraded inference accuracy, which can disrupt normal system operation with severe consequences such as human injury or car accidents. In this paper, we propose PATCH, a speculative-inference framework that adapts to these complex scenarios. PATCH serves as a plug-in module for existing multimodal models and enables speculative inference with these off-the-shelf deep learning models. PATCH consists of 1) a Masked-AutoEncoder-based cross-modality imputation module that imputes missing data using the partially available sensor data, 2) a lightweight feature-pair ranking module that effectively limits the search space for the optimal imputation configuration with low computation overhead, and 3) a data alignment module that aligns multimodal heterogeneous data streams without accurate timestamps or external synchronization mechanisms. We implement PATCH in nine popular multimodal models using five public datasets and one self-collected dataset. The experimental results show that PATCH achieves up to 13% mean accuracy improvement over the state-of-the-art method while using only 10% of the training data and reducing training overhead by 73% compared to the original cost of retraining the model.
{"title":"PATCH","authors":"Juexing Wang, Guangjing Wang, Xiao Zhang, Li Liu, Huacheng Zeng, Li Xiao, Zhichao Cao, Lin Gu, Tianxing Li","doi":"10.1145/3610885","DOIUrl":"https://doi.org/10.1145/3610885","url":null,"abstract":"Recent advancements in deep learning have shown that multimodal inference can be particularly useful in tasks like autonomous driving, human health, and production line monitoring. However, deploying state-of-the-art multimodal models in distributed IoT systems poses unique challenges since the sensor data from low-cost edge devices can get corrupted, lost, or delayed before reaching the cloud. These problems are magnified in the presence of asymmetric data generation rates from different sensor modalities, wireless network dynamics, or unpredictable sensor behavior, leading to either increased latency or degradation in inference accuracy, which could affect the normal operation of the system with severe consequences like human injury or car accident. In this paper, we propose PATCH, a framework of speculative inference to adapt to these complex scenarios. PATCH serves as a plug-in module in the existing multimodal models, and it enables speculative inference of these off-the-shelf deep learning models. PATCH consists of 1) a Masked-AutoEncoder-based cross-modality imputation module to impute missing data using partially-available sensor data, 2) a lightweight feature pair ranking module that effectively limits the searching space for the optimal imputation configuration with low computation overhead, and 3) a data alignment module that aligns multimodal heterogeneous data streams without using accurate timestamp or external synchronization mechanisms. We implement PATCH in nine popular multimodal models using five public datasets and one self-collected dataset. The experimental results show that PATCH achieves up to 13% mean accuracy improvement over the state-of-art method while only using 10% of training data and reducing the training overhead by 73% compared to the original cost of retraining the model.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135536453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can You Ear Me?
Dennis Stanke, Tim Duente, Kerem Can Demir, Michael Rohs
The earlobe is a well-known location for wearing jewelry, but it might also be a promising site for electronic output, such as presenting notifications. This work elaborates on the pros and cons of different notification channels for the earlobe. Notifications on the earlobe can be private (noticeable only by the wearer) as well as public (noticeable in the immediate vicinity in a given social situation). A user study with 18 participants showed that the reaction times for the private channels (Poke, Vibration, Private Sound, Electrotactile) were on average less than 1 s, with an error rate (missed notifications) of less than 1%. Thermal Warm and Cold took significantly longer, and Cold was the least reliable (26% error rate). The participants preferred Electrotactile and Vibration. Among the public channels, recognition time did not differ significantly between Sound (738 ms) and LED (828 ms), but Display took much longer (3175 ms). At 22%, the error rate of Display was the highest. Participants generally felt comfortable wearing notification devices on their earlobe. The results show that the earlobe is indeed a suitable location for wearable technology, if properly miniaturized, which is possible for Electrotactile and LED. We present application scenarios and discuss design considerations. A small field study in a fitness center demonstrates the suitability of the earlobe notification concept in a sports context.
{"title":"Can You Ear Me?","authors":"Dennis Stanke, Tim Duente, Kerem Can Demir, Michael Rohs","doi":"10.1145/3610925","DOIUrl":"https://doi.org/10.1145/3610925","url":null,"abstract":"The earlobe is a well-known location for wearing jewelry, but might also be promising for electronic output, such as presenting notifications. This work elaborates the pros and cons of different notification channels for the earlobe. Notifications on the earlobe can be private (only noticeable by the wearer) as well as public (noticeable in the immediate vicinity in a given social situation). A user study with 18 participants showed that the reaction times for the private channels (Poke, Vibration, Private Sound, Electrotactile) were on average less than 1 s with an error rate (missed notifications) of less than 1 %. Thermal Warm and Cold took significantly longer and Cold was least reliable (26 % error rate). The participants preferred Electrotactile and Vibration. Among the public channels the recognition time did not differ significantly between Sound (738 ms) and LED (828 ms), but Display took much longer (3175 ms). At 22 % the error rate of Display was highest. The participants generally felt comfortable wearing notification devices on their earlobe. The results show that the earlobe indeed is a suitable location for wearable technology, if properly miniaturized, which is possible for Electrotactile and LED. We present application scenarios and discuss design considerations. A small field study in a fitness center demonstrates the suitability of the earlobe notification concept in a sports context.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135536454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Headar
Xiaoying Yang, Xue Wang, Gaofeng Dong, Zihan Yan, Mani Srivastava, Eiji Hayashi, Yang Zhang
Nods and shakes of the head are intuitive and universal gestures in communication. As smartwatches become increasingly intelligent through advances in user activity sensing technologies, many smartwatch use scenarios demand quick responses from users in confirmation dialogs to accept or dismiss proposed actions. Such proposed actions include making emergency calls, taking service recommendations, and starting or stopping exercise timers. Head gestures in these scenarios could be preferable to touch interactions because they are hands-free and easy to perform. We propose Headar to recognize these gestures on smartwatches using wearable millimeter wave sensing. We first surveyed head gestures to understand how they are performed in conversational settings. We then investigated the positions and orientations to which users raise their smartwatches. Insights from these studies guided the implementation of Headar. Additionally, we conducted modeling and simulation to verify our sensing principle. We developed a real-time sensing and inference pipeline using contemporary deep learning techniques and demonstrated the feasibility of our proposed approach with a user study (n=15) and a live test (n=8). Our evaluation yielded an average accuracy of 84.0% in the user study across 9 classes, including nod and shake as well as seven other signals: still, speech, touch interaction, and four non-gestural head motions (i.e., head up, left, right, and down). Furthermore, we obtained an accuracy of 72.6% in the live test, which reveals rich insights into the performance of our approach in various realistic conditions.
{"title":"Headar","authors":"Xiaoying Yang, Xue Wang, Gaofeng Dong, Zihan Yan, Mani Srivastava, Eiji Hayashi, Yang Zhang","doi":"10.1145/3610900","DOIUrl":"https://doi.org/10.1145/3610900","url":null,"abstract":"Nod and shake of one's head are intuitive and universal gestures in communication. As smartwatches become increasingly intelligent through advances in user activity sensing technologies, many use scenarios of smartwatches demand quick responses from users in confirmation dialogs, to accept or dismiss proposed actions. Such proposed actions include making emergency calls, taking service recommendations, and starting or stopping exercise timers. Head gestures in these scenarios could be preferable to touch interactions for being hands-free and easy to perform. We propose Headar to recognize these gestures on smartwatches using wearable millimeter wave sensing. We first surveyed head gestures to understand how they are performed in conversational settings. We then investigated positions and orientations to which users raise their smartwatches. Insights from these studies guided the implementation of Headar. Additionally, we conducted modeling and simulation to verify our sensing principle. We developed a real-time sensing and inference pipeline using contemporary deep learning techniques, and proved the feasibility of our proposed approach with a user study (n=15) and a live test (n=8). Our evaluation yielded an average accuracy of 84.0% in the user study across 9 classes including nod and shake as well as seven other signals -- still, speech, touch interaction, and four non-gestural head motions (i.e., head up, left, right, and down). Furthermore, we obtained an accuracy of 72.6% in the live test which reveals rich insights into the performance of our approach in various realistic conditions.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135535368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CrowdQ
Tieqi Shou, Zhuohan Ye, Yayao Hong, Zhiyuan Wang, Hang Zhu, Zhihan Jiang, Dingqi Yang, Binbin Zhou, Cheng Wang, Longbiao Chen
Hospital Emergency Departments (EDs) are essential for providing emergency medical services, yet they are often overwhelmed by increasing healthcare demand. Current methods for monitoring ED queue states, such as manual monitoring, video surveillance, and front-desk registration, are inefficient, invasive, and too slow to provide real-time updates. To address these challenges, this paper proposes a novel framework, CrowdQ, which harnesses spatiotemporal crowdsensing data for real-time ED demand sensing, queue state modeling, and prediction. By utilizing vehicle trajectory and urban geographic environment data, CrowdQ can accurately estimate emergency visits from noisy traffic flows. Furthermore, it employs queueing theory to model the complex emergency service process with medical service data, effectively accounting for spatiotemporal dependencies and the impact of event context on ED queue states. Experiments conducted on large-scale crowdsensed urban traffic datasets and hospital information system datasets from Xiamen City demonstrate the framework's effectiveness. It achieves an F1 score of 0.93 in ED demand identification, effectively models the ED queue state of key hospitals, and reduces queue state prediction error by 18.5%-71.3% compared to baseline methods. CrowdQ therefore offers valuable alternatives for public disclosure of emergency treatment information and for maximizing medical resource allocation.
{"title":"CrowdQ","authors":"Tieqi Shou, Zhuohan Ye, Yayao Hong, Zhiyuan Wang, Hang Zhu, Zhihan Jiang, Dingqi Yang, Binbin Zhou, Cheng Wang, Longbiao Chen","doi":"10.1145/3610875","DOIUrl":"https://doi.org/10.1145/3610875","url":null,"abstract":"Hospital Emergency Departments (EDs) are essential for providing emergency medical services, yet often overwhelmed due to increasing healthcare demand. Current methods for monitoring ED queue states, such as manual monitoring, video surveillance, and front-desk registration are inefficient, invasive, and delayed to provide real-time updates. To address these challenges, this paper proposes a novel framework, CrowdQ, which harnesses spatiotemporal crowdsensing data for real-time ED demand sensing, queue state modeling, and prediction. By utilizing vehicle trajectory and urban geographic environment data, CrowdQ can accurately estimate emergency visits from noisy traffic flows. Furthermore, it employs queueing theory to model the complex emergency service process with medical service data, effectively considering spatiotemporal dependencies and event context impact on ED queue states. Experiments conducted on large-scale crowdsensing urban traffic datasets and hospital information system datasets from Xiamen City demonstrate the framework's effectiveness. It achieves an F1 score of 0.93 in ED demand identification, effectively models the ED queue state of key hospitals, and reduces the error in queue state prediction by 18.5%-71.3% compared to baseline methods. CrowdQ, therefore, offers valuable alternatives for public emergency treatment information disclosure and maximized medical resource allocation.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135535539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Environment-aware Multi-person Tracking in Indoor Environments with MmWave Radars
Weiyan Chen, Hongliu Yang, Xiaoyang Bi, Rong Zheng, Fusang Zhang, Peng Bao, Zhaoxin Chang, Xujun Ma, Daqing Zhang
Device-free indoor localization and tracking using commercial millimeter wave (mmWave) radars have attracted much interest lately due to their non-intrusive nature and high spatial resolution. However, it is challenging to achieve high tracking accuracy due to rich multipath reflection and occlusion in indoor environments. Static objects with non-negligible reflectance of mmWave signals interact with moving human subjects and generate time-varying multipath ghosts and shadow ghosts, which can easily be confused with real subjects. To characterize these complex interactions, we first develop a geometric model that estimates the locations of multipath ghosts given the locations of humans and static reflectors. Based on this model, the locations of static reflectors, which together form a reflection map, are automatically estimated from received radar signals as a single person traverses the environment along arbitrary trajectories. The reflection map allows for the elimination of multipath and shadow ghost interference as well as the augmentation of weakly reflected human subjects in occluded areas. The proposed environment-aware multi-person tracking system generates reflection maps with a mean error of 15.5 cm and a 90th-percentile error of 30.3 cm, and achieves multi-person tracking with a mean error of 8.6 cm and a 90th-percentile error of 17.5 cm, in four representative indoor spaces with diverse subjects using a single mmWave radar.
{"title":"Environment-aware Multi-person Tracking in Indoor Environments with MmWave Radars","authors":"Weiyan Chen, Hongliu Yang, Xiaoyang Bi, Rong Zheng, Fusang Zhang, Peng Bao, Zhaoxin Chang, Xujun Ma, Daqing Zhang","doi":"10.1145/3610902","DOIUrl":"https://doi.org/10.1145/3610902","url":null,"abstract":"Device-free indoor localization and tracking using commercial millimeter wave radars have attracted much interest lately due to their non-intrusive nature and high spatial resolution. However, it is challenging to achieve high tracking accuracy due to rich multipath reflection and occlusion in indoor environments. Static objects with non-negligible reflectance of mmWave signals interact with moving human subjects and generate time-varying multipath ghosts and shadow ghosts, which can be easily confused as real subjects. To characterize the complex interactions, we first develop a geometric model that estimates the location of multipath ghosts given the locations of humans and static reflectors. Based on this model, the locations of static reflectors that form a reflection map are automatically estimated from received radar signals as a single person traverses the environment along arbitrary trajectories. The reflection map allows for the elimination of multipath and shadow ghost interference as well as the augmentation of weakly reflected human subjects in occluded areas. The proposed environment-aware multi-person tracking system can generate reflection maps with a mean error of 15.5cm and a 90-percentile error of 30.3cm, and achieve multi-person tracking accuracy with a mean error of 8.6cm and a 90-percentile error of 17.5cm, in four representative indoor spaces with diverse subjects using a single mmWave radar.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135535737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LAUREATE
Matias Laporte, Martin Gjoreski, Marc Langheinrich
The latest developments in wearable sensors have resulted in a wide range of devices available to consumers, allowing users to monitor and improve their physical activity, sleep patterns, cognitive load, and stress levels. However, the lack of out-of-the-lab labelled data hinders the development of advanced machine learning models for predicting affective states. Furthermore, to the best of our knowledge, there are no publicly available datasets in the area of Human Memory Augmentation. This paper presents a dataset we collected during a 13-week study in a university setting. The dataset, named LAUREATE, contains the physiological data of 42 students during 26 classes (including exams), daily self-reports about their lifestyle habits (e.g., studying hours, physical activity, and sleep quality), and their performance across multiple examinations. In addition to the raw data, we provide expert features computed from the physiological data, along with baseline machine learning models for estimating self-reported affect, for recognising classes vs. breaks, and for user identification. Besides the use cases presented in this paper, among them Human Memory Augmentation, the dataset represents a rich resource for the UbiComp community in various domains, including affect recognition, behaviour modelling, user privacy, and activity and context recognition.
{"title":"LAUREATE","authors":"Matias Laporte, Martin Gjoreski, Marc Langheinrich","doi":"10.1145/3610892","DOIUrl":"https://doi.org/10.1145/3610892","url":null,"abstract":"The latest developments in wearable sensors have resulted in a wide range of devices available to consumers, allowing users to monitor and improve their physical activity, sleep patterns, cognitive load, and stress levels. However, the lack of out-of-the-lab labelled data hinders the development of advanced machine learning models for predicting affective states. Furthermore, to the best of our knowledge, there are no publicly available datasets in the area of Human Memory Augmentation. This paper presents a dataset we collected during a 13-week study in a university setting. The dataset, named LAUREATE, contains the physiological data of 42 students during 26 classes (including exams), daily self-reports asking the students about their lifestyle habits (e.g. studying hours, physical activity, and sleep quality) and their performance across multiple examinations. In addition to the raw data, we provide expert features from the physiological data, and baseline machine learning models for estimating self-reported affect, models for recognising classes vs breaks, and models for user identification. Besides the use cases presented in this paper, among which Human Memory Augmentation, the dataset represents a rich resource for the UbiComp community in various domains, including affect recognition, behaviour modelling, user privacy, and activity and context recognition.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135535924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VAX
Prasoon Patidar, Mayank Goel, Yuvraj Agarwal
The use of audio and video modalities for Human Activity Recognition (HAR) is common, given the richness of the data and the availability of ML models pre-trained on large corpora of labeled training data. However, audio and video sensors also raise significant consumer privacy concerns. Researchers have thus explored alternate modalities that are less privacy-invasive, such as mmWave Doppler radars, IMUs, and motion sensors. However, the key limitation of these approaches is that most of them do not readily generalize across environments and require significant in-situ training data. Recent work has proposed cross-modality transfer learning approaches to alleviate the lack of labeled training data with some success. In this paper, we generalize this concept to create a novel system called VAX (Video/Audio to 'X'), where training labels acquired from existing Video/Audio ML models are used to train ML models for a wide range of 'X' privacy-sensitive sensors. Notably, in VAX, once the ML models for the privacy-sensitive sensors are trained, with little to no user involvement, the Audio/Video sensors can be removed altogether to better protect the user's privacy. We built and deployed VAX in ten participants' homes while they performed 17 common activities of daily living. Our evaluation results show that after training, VAX can use its onboard camera and microphone to detect approximately 15 out of 17 activities with an average accuracy of 90%. For the activities that can be detected using a camera and a microphone, VAX trains a per-home model for the privacy-preserving sensors. These models (average accuracy = 84%) require no in-situ user input. In addition, when VAX is augmented with just one labeled instance for the activities not detected by the VAX A/V pipeline (~2 out of 17), it can detect all 17 activities with an average accuracy of 84%. Our results show that VAX is significantly better than a baseline supervised-learning approach that uses one labeled instance per activity in each home (average accuracy of 79%), since VAX reduces the user burden of providing activity labels by 8x (~2 labels vs. 17 labels).
{"title":"VAX","authors":"Prasoon Patidar, Mayank Goel, Yuvraj Agarwal","doi":"10.1145/3610907","DOIUrl":"https://doi.org/10.1145/3610907","url":null,"abstract":"The use of audio and video modalities for Human Activity Recognition (HAR) is common, given the richness of the data and the availability of pre-trained ML models using a large corpus of labeled training data. However, audio and video sensors also lead to significant consumer privacy concerns. Researchers have thus explored alternate modalities that are less privacy-invasive such as mmWave doppler radars, IMUs, motion sensors. However, the key limitation of these approaches is that most of them do not readily generalize across environments and require significant in-situ training data. Recent work has proposed cross-modality transfer learning approaches to alleviate the lack of trained labeled data with some success. In this paper, we generalize this concept to create a novel system called VAX (Video/Audio to 'X'), where training labels acquired from existing Video/Audio ML models are used to train ML models for a wide range of 'X' privacy-sensitive sensors. Notably, in VAX, once the ML models for the privacy-sensitive sensors are trained, with little to no user involvement, the Audio/Video sensors can be removed altogether to protect the user's privacy better. We built and deployed VAX in ten participants' homes while they performed 17 common activities of daily living. Our evaluation results show that after training, VAX can use its onboard camera and microphone to detect approximately 15 out of 17 activities with an average accuracy of 90%. For these activities that can be detected using a camera and a microphone, VAX trains a per-home model for the privacy-preserving sensors. These models (average accuracy = 84%) require no in-situ user input. In addition, when VAX is augmented with just one labeled instance for the activities not detected by the VAX A/V pipeline (~2 out of 17), it can detect all 17 activities with an average accuracy of 84%. Our results show that VAX is significantly better than a baseline supervised-learning approach of using one labeled instance per activity in each home (average accuracy of 79%) since VAX reduces the user burden of providing activity labels by 8x (~2 labels vs. 17 labels).","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135535933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PoseSonic
Saif Mahmud, Ke Li, Guilin Hu, Hao Chen, Richard Jin, Ruidong Zhang, François Guimbretière, Cheng Zhang
In this paper, we introduce PoseSonic, an intelligent acoustic sensing solution for smartglasses that estimates upper-body poses. Our system requires only two pairs of microphones and speakers on the hinges of the eyeglasses to emit FMCW-encoded inaudible acoustic signals and receive the reflected signals for body pose estimation. Using a customized deep learning model, PoseSonic estimates the 3D positions of 9 body joints: the shoulders, elbows, wrists, hips, and nose. We adopt a cross-modal supervision strategy to train our model using synchronized RGB video frames as ground truth. We conducted in-lab and semi-in-the-wild user studies with 22 participants to evaluate PoseSonic, and our user-independent model achieved a mean per-joint position error of 6.17 cm in the lab setting and 14.12 cm in the semi-in-the-wild setting when predicting the 9 body joint positions in 3D. Our further studies show that performance was not significantly impacted by different surroundings, by remounting the device, or by real-world environmental noise. Finally, we discuss the opportunities, challenges, and limitations of deploying PoseSonic in real-world applications.
{"title":"PoseSonic","authors":"Saif Mahmud, Ke Li, Guilin Hu, Hao Chen, Richard Jin, Ruidong Zhang, François Guimbretière, Cheng Zhang","doi":"10.1145/3610895","DOIUrl":"https://doi.org/10.1145/3610895","url":null,"abstract":"In this paper, we introduce PoseSonic, an intelligent acoustic sensing solution for smartglasses that estimates upper body poses. Our system only requires two pairs of microphones and speakers on the hinges of the eyeglasses to emit FMCW-encoded inaudible acoustic signals and receive reflected signals for body pose estimation. Using a customized deep learning model, PoseSonic estimates the 3D positions of 9 body joints including the shoulders, elbows, wrists, hips, and nose. We adopt a cross-modal supervision strategy to train our model using synchronized RGB video frames as ground truth. We conducted in-lab and semi-in-the-wild user studies with 22 participants to evaluate PoseSonic, and our user-independent model achieved a mean per joint position error of 6.17 cm in the lab setting and 14.12 cm in semi-in-the-wild setting when predicting the 9 body joint positions in 3D. Our further studies show that the performance was not significantly impacted by different surroundings or when the devices were remounted or by real-world environmental noise. Finally, we discuss the opportunities, challenges, and limitations of deploying PoseSonic in real-world applications.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135536106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}