Ambient Intelligence (AmI) focuses on creating environments capable of proactively and transparently adapting to users and their activities. Traditionally, AmI focused on the availability of computational devices, the pervasiveness of networked environments, and means to interact with users. In this paper, we propose a renewed AmI architecture that takes into account current technological advancements while focusing on proactive adaptation for assisting and protecting users. This architecture consists of four phases: Perceive, Interpret, Decide, and Interact. The AmI systems we propose, called Digital Companions (DCs), can be embodied in a variety of ways (e.g., through physical robots or virtual agents) and are structured according to these phases to assist and protect their users. We further categorize DCs into Expert DCs and Personal DCs, and show that this induces a favorable separation of concerns in AmI systems, where user concerns (including personal user data and preferences) are handled by Personal DCs and environment concerns (including interfacing with environmental artifacts) are assigned to Expert DCs; this separation has favorable privacy implications as well. Herein, we introduce this architecture and validate it through a prototype in an industrial scenario where robots and humans collaborate to perform a task.
{"title":"A Digital Companion Architecture for Ambient Intelligence","authors":"Kimberly García, Jonathan Vontobel, Simon Mayer","doi":"10.1145/3659610","DOIUrl":"https://doi.org/10.1145/3659610","url":null,"abstract":"Ambient Intelligence (AmI) focuses on creating environments capable of proactively and transparently adapting to users and their activities. Traditionally, AmI focused on the availability of computational devices, the pervasiveness of networked environments, and means to interact with users. In this paper, we propose a renewed AmI architecture that takes into account current technological advancements while focusing on proactive adaptation for assisting and protecting users. This architecture consist of four phases: Perceive, Interpret, Decide, and Interact. The AmI systems we propose, called Digital Companions (DC), can be embodied in a variety of ways (e.g., through physical robots or virtual agents) and are structured according to these phases to assist and protect their users. We further categorize DCs into Expert DCs and Personal DCs, and show that this induces a favorable separation of concerns in AmI systems, where user concerns (including personal user data and preferences) are handled by Personal DCs and environment concerns (including interfacing with environmental artifacts) are assigned to Expert DCs; this separation has favorable privacy implications as well. Herein, we introduce this architecture and validate it through a prototype in an industrial scenario where robots and humans collaborate to perform a task.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140982078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate road networks play a crucial role in modern mobile applications such as navigation and last-mile delivery. Most existing studies focus on generating road networks in open areas such as main roads and avenues, but little attention has been paid to generating community road networks in closed areas such as residential communities, which are becoming increasingly important due to the growing demand for door-to-door services such as food delivery. This lack of research is primarily attributed to challenges related to sensing data availability and quality. In this paper, we design a novel framework called SmallMap that leverages ubiquitous multi-modal sensing data from last-mile delivery to automatically generate community road networks at low cost. SmallMap consists of two key modules: (1) a Trajectory of Interest Detection module enhanced by multi-modal sensing data collected during the delivery process; and (2) a Dual Spatio-temporal Generative Adversarial Network module that incorporates Trajectories of Interest through unsupervised road network adaptation to generate road networks automatically. To evaluate the effectiveness of SmallMap, we use a two-month dataset from one of the largest logistics companies in China. Extensive evaluation results demonstrate that our framework significantly outperforms state-of-the-art baselines, achieving a precision of 90.5%, a recall of 87.5%, and an F1-score of 88.9%. Moreover, we conduct three case studies in Beijing on courier workload estimation, Estimated Time of Arrival (ETA) prediction in last-mile delivery, and fine-grained order assignment.
{"title":"SmallMap: Low-cost Community Road Map Sensing with Uncertain Delivery Behavior","authors":"Zhiqing Hong, Haotian Wang, Yi Ding, Guang Wang, Tian He, Desheng Zhang","doi":"10.1145/3659596","DOIUrl":"https://doi.org/10.1145/3659596","url":null,"abstract":"Accurate road networks play a crucial role in modern mobile applications such as navigation and last-mile delivery. Most existing studies primarily focus on generating road networks in open areas like main roads and avenues, but little attention has been given to the generation of community road networks in closed areas such as residential areas, which becomes more and more significant due to the growing demand for door-to-door services such as food delivery. This lack of research is primarily attributed to challenges related to sensing data availability and quality. In this paper, we design a novel framework called SmallMap that leverages ubiquitous multi-modal sensing data from last-mile delivery to automatically generate community road networks with low costs. Our SmallMap consists of two key modules: (1) a Trajectory of Interest Detection module enhanced by exploiting multi-modal sensing data collected from the delivery process; and (2) a Dual Spatio-temporal Generative Adversarial Network module that incorporates Trajectory of Interest by unsupervised road network adaptation to generate road networks automatically. To evaluate the effectiveness of SmallMap, we utilize a two-month dataset from one of the largest logistics companies in China. The extensive evaluation results demonstrate that our framework significantly outperforms state-of-the-art baselines, achieving a precision of 90.5%, a recall of 87.5%, and an F1-score of 88.9%, respectively. Moreover, we conduct three case studies in Beijing City for courier workload estimation, Estimated Time of Arrival (ETA) in last-mile delivery, and fine-grained order assignment.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140984189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrea Cuadra, Justine Breuch, Samantha Estrada, David Ihim, Isabelle Hung, Derek Askaryar, Marwan Hassanien, Kristen L. Fessele, James A. Landay
Digital forms help us access services and opportunities, but they are not equally accessible to everyone; older adults and people with sensory impairments, for example, often face barriers. Large language models (LLMs) and multimodal interfaces offer a unique opportunity to increase form accessibility. Informed by prior literature and needfinding, we built a holistic multimodal LLM agent for health data entry. We describe the process of designing and building our system, and the results of a study with older adults (N=10). All participants, regardless of age or disability status, were able to complete a standard 47-question form independently using our system---one blind participant said it was "a prayer answered." Our video analysis revealed how different modalities provided alternative interaction paths in complementary ways (e.g., buttons helped resolve transcription errors, and speech provided more options when the pre-canned answer choices were insufficient). We highlight key design guidelines, such as designing systems that dynamically adapt to individual needs.
{"title":"Digital Forms for All: A Holistic Multimodal Large Language Model Agent for Health Data Entry","authors":"Andrea Cuadra, Justine Breuch, Samantha Estrada, David Ihim, Isabelle Hung, Derek Askaryar, Marwan Hassanien, Kristen L. Fessele, James A. Landay","doi":"10.1145/3659624","DOIUrl":"https://doi.org/10.1145/3659624","url":null,"abstract":"Digital forms help us access services and opportunities, but they are not equally accessible to everyone, such as older adults or those with sensory impairments. Large language models (LLMs) and multimodal interfaces offer a unique opportunity to increase form accessibility. Informed by prior literature and needfinding, we built a holistic multimodal LLM agent for health data entry. We describe the process of designing and building our system, and the results of a study with older adults (N =10). All participants, regardless of age or disability status, were able to complete a standard 47-question form independently using our system---one blind participant said it was \"a prayer answered.\" Our video analysis revealed how different modalities provided alternative interaction paths in complementary ways (e.g., the buttons helped resolve transcription errors and speech helped provide more options when the pre-canned answer choices were insufficient). We highlight key design guidelines, such as designing systems that dynamically adapt to individual needs.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140985269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since sleep plays an important role in people's daily lives, sleep monitoring has attracted the attention of many researchers. Physical and physiological activities occurring during sleep exhibit unique patterns in different sleep stages, which indicates that recognizing a wide range of sleep activities (events) can provide more fine-grained information for sleep stage detection. However, most prior works are designed to capture only limited sleep events and coarse-grained information, which cannot meet the needs of fine-grained sleep monitoring. In our work, we leverage ubiquitous in-ear microphones on sleep earbuds to design a sleep monitoring system, named EarSleep, which interprets in-ear body sounds induced by various representative sleep events into sleep stages. Based on differences among the physical mechanisms underlying these sleep activities, EarSleep extracts unique acoustic response patterns from in-ear body sounds to recognize a wide range of sleep events, including body movements, sound activities, heartbeat, and respiration. With the help of sleep medicine knowledge, interpretable acoustic features are derived from these representative sleep activities. EarSleep leverages a carefully designed deep learning model to establish the complex correlation between acoustic features and sleep stages. We conduct extensive experiments covering 48 nights from 18 participants over three months to validate the performance of our system. The experimental results show that our system can accurately detect a rich set of sleep activities. Furthermore, in terms of sleep stage detection, EarSleep outperforms state-of-the-art solutions by 7.12% and 9.32% in average precision and average recall, respectively.
{"title":"EarSleep: In-ear Acoustic-based Physical and Physiological Activity Recognition for Sleep Stage Detection","authors":"Feiyu Han, Panlong Yang, Yuanhao Feng, Weiwei Jiang, Youwei Zhang, Xiang-Yang Li","doi":"10.1145/3659595","DOIUrl":"https://doi.org/10.1145/3659595","url":null,"abstract":"Since sleep plays an important role in people's daily lives, sleep monitoring has attracted the attention of many researchers. Physical and physiological activities occurring in sleep exhibit unique patterns in different sleep stages. It indicates that recognizing a wide range of sleep activities (events) can provide more fine-grained information for sleep stage detection. However, most of the prior works are designed to capture limited sleep events and coarse-grained information, which cannot meet the needs of fine-grained sleep monitoring. In our work, we leverage ubiquitous in-ear microphones on sleep earbuds to design a sleep monitoring system, named EarSleep1, which interprets in-ear body sounds induced by various representative sleep events into sleep stages. Based on differences among physical occurrence mechanisms of sleep activities, EarSleep extracts unique acoustic response patterns from in-ear body sounds to recognize a wide range of sleep events, including body movements, sound activities, heartbeat, and respiration. With the help of sleep medicine knowledge, interpretable acoustic features are derived from these representative sleep activities. EarSleep leverages a carefully designed deep learning model to establish the complex correlation between acoustic features and sleep stages. We conduct extensive experiments with 48 nights of 18 participants over three months to validate the performance of our system. The experimental results show that our system can accurately detect a rich set of sleep activities. Furthermore, in terms of sleep stage detection, EarSleep outperforms state-of-the-art solutions by 7.12% and 9.32% in average precision and average recall, respectively.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140983909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cybersickness, a factor that hinders user immersion in VR, has been the subject of ongoing attempts at AI-based prediction. Previous studies have used CNNs and LSTMs for prediction models and attention mechanisms and XAI for data analysis, yet none has explored a transformer, which can better reflect the spatial and temporal characteristics of the data and thereby benefit both prediction and feature importance analysis. In this paper, we propose cybersickness prediction models based on a transformer architecture that use multimodal time-series sensor data (i.e., eye movement, head movement, and physiological signals), considering sensor data pre-processing and multimodal data fusion methods. We constructed the MSCVR dataset, consisting of normalized sensor data, spectrogram-formatted sensor data, and cybersickness levels collected from 45 participants in a user study. We propose two methods for embedding multimodal time-series sensor data into the transformer: modality-specific spatial and temporal transformer encoders for normalized sensor data (MS-STTN) and modality-specific spatial-temporal transformer encoders for spectrograms (MS-STTS). MS-STTN yielded the highest performance in the ablation study and in the comparison with existing models. Furthermore, by analyzing the importance of data features, we determined their relevance to cybersickness over time, especially the salience of eye movement features. Our results and insights derived from multimodal time-series sensor data and the transformer model provide a comprehensive understanding of cybersickness and its association with sensor data. Our MSCVR dataset and code are publicly available: https://github.com/dayoung-jeong/PRECYSE.git.
{"title":"PRECYSE: Predicting Cybersickness using Transformer for Multimodal Time-Series Sensor Data","authors":"Dayoung Jeong, Kyungsik Han","doi":"10.1145/3659594","DOIUrl":"https://doi.org/10.1145/3659594","url":null,"abstract":"Cybersickness, a factor that hinders user immersion in VR, has been the subject of ongoing attempts to predict it using AI. Previous studies have used CNN and LSTM for prediction models and used attention mechanisms and XAI for data analysis, yet none explored a transformer that can better reflect the spatial and temporal characteristics of the data, beneficial for enhancing prediction and feature importance analysis. In this paper, we propose cybersickness prediction models using multimodal time-series sensor data (i.e., eye movement, head movement, and physiological signals) based on a transformer algorithm, considering sensor data pre-processing and multimodal data fusion methods. We constructed the MSCVR dataset consisting of normalized sensor data, spectrogram formatted sensor data, and cybersickness levels collected from 45 participants through a user study. We proposed two methods for embedding multimodal time-series sensor data into the transformer: modality-specific spatial and temporal transformer encoders for normalized sensor data (MS-STTN) and modality-specific spatial-temporal transformer encoder for spectrogram (MS-STTS). MS-STTN yielded the highest performance in the ablation study and the comparison of the existing models. Furthermore, by analyzing the importance of data features, we determined their relevance to cybersickness over time, especially the salience of eye movement features. Our results and insights derived from multimodal time-series sensor data and the transformer model provide a comprehensive understanding of cybersickness and its association with sensor data. Our MSCVR dataset and code are publicly available: https://github.com/dayoung-jeong/PRECYSE.git.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140985603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Kovačević, Christian Holz, M. Gross, Rafael Wampfler
Large language models such as GPT-3 and ChatGPT can mimic human-to-human conversation with unprecedented fidelity, which enables many applications such as conversational agents for education and non-player characters in video games. In this work, we investigate the underlying personality structure that a GPT-3-based chatbot expresses during conversations with a human. We conducted a user study to collect 147 chatbot personality descriptors from 86 participants while they interacted with the GPT-3-based chatbot over three weeks. Then, 425 new participants rated the 147 personality descriptors in an online survey. We conducted an exploratory factor analysis on the collected descriptors and show that, though overlapping, human personality models do not fully transfer to the chatbot's personality as perceived by humans. We also show that the perceived personality differs significantly from that of virtual personal assistants, where users focus rather on serviceability and functionality. We discuss the implications of ever-evolving large language models and the change they effect in users' perception of agent personalities.
{"title":"The Personality Dimensions GPT-3 Expresses During Human-Chatbot Interactions","authors":"N. Kovačević, Christian Holz, M. Gross, Rafael Wampfler","doi":"10.1145/3659626","DOIUrl":"https://doi.org/10.1145/3659626","url":null,"abstract":"Large language models such as GPT-3 and ChatGPT can mimic human-to-human conversation with unprecedented fidelity, which enables many applications such as conversational agents for education and non-player characters in video games. In this work, we investigate the underlying personality structure that a GPT-3-based chatbot expresses during conversations with a human. We conducted a user study to collect 147 chatbot personality descriptors from 86 participants while they interacted with the GPT-3-based chatbot for three weeks. Then, 425 new participants rated the 147 personality descriptors in an online survey. We conducted an exploratory factor analysis on the collected descriptors and show that, though overlapping, human personality models do not fully transfer to the chatbot's personality as perceived by humans. We also show that the perceived personality is significantly different from that of virtual personal assistants, where users focus rather on serviceability and functionality. We discuss the implications of ever-evolving large language models and the change they affect in users' perception of agent personalities.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140983158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The increasing availability of low-cost wearable devices and smartphones has significantly advanced the field of sensor-based human activity recognition (HAR), attracting considerable research interest. One of the major challenges in HAR is the domain shift problem in cross-dataset activity recognition, which occurs due to variations in users, device types, and sensor placements between the source and target datasets. Although domain adaptation methods have shown promise, they typically require access to the target dataset during training, which might not be practical in some scenarios. To address these issues, we introduce CrossHAR, a new HAR model designed to improve performance on unseen target datasets. CrossHAR involves three main steps: (i) CrossHAR explores the sensor data generation principle to diversify the data distribution and augment the raw sensor data. (ii) CrossHAR then employs a hierarchical self-supervised pretraining approach with the augmented data to develop a generalizable representation. (iii) Finally, CrossHAR fine-tunes the pretrained model with a small set of labeled data in the source dataset, enhancing its performance in cross-dataset HAR. Our extensive experiments across multiple real-world HAR datasets demonstrate that CrossHAR outperforms current state-of-the-art methods by 10.83% in accuracy, showing its effectiveness in generalizing to unseen target datasets.
{"title":"CrossHAR: Generalizing Cross-dataset Human Activity Recognition via Hierarchical Self-Supervised Pretraining","authors":"Zhiqing Hong, Zelong Li, Shuxin Zhong, Wenjun Lyu, Haotian Wang, Yi Ding, Tian He, Desheng Zhang","doi":"10.1145/3659597","DOIUrl":"https://doi.org/10.1145/3659597","url":null,"abstract":"The increasing availability of low-cost wearable devices and smartphones has significantly advanced the field of sensor-based human activity recognition (HAR), attracting considerable research interest. One of the major challenges in HAR is the domain shift problem in cross-dataset activity recognition, which occurs due to variations in users, device types, and sensor placements between the source dataset and the target dataset. Although domain adaptation methods have shown promise, they typically require access to the target dataset during the training process, which might not be practical in some scenarios. To address these issues, we introduce CrossHAR, a new HAR model designed to improve model performance on unseen target datasets. CrossHAR involves three main steps: (i) CrossHAR explores the sensor data generation principle to diversify the data distribution and augment the raw sensor data. (ii) CrossHAR then employs a hierarchical self-supervised pretraining approach with the augmented data to develop a generalizable representation. (iii) Finally, CrossHAR fine-tunes the pretrained model with a small set of labeled data in the source dataset, enhancing its performance in cross-dataset HAR. Our extensive experiments across multiple real-world HAR datasets demonstrate that CrossHAR outperforms current state-of-the-art methods by 10.83% in accuracy, demonstrating its effectiveness in generalizing to unseen target datasets.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140983579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luca-Maxim Meinhardt, Maximilian Rück, Julian Zähnle, Maryam Elhaidary, Mark Colley, Michael Rietzler, Enrico Rukzio
Highly Automated Vehicles offer a new level of independence to people who are blind or visually impaired. However, because of their limited vision, these users can find it challenging to gain awareness of the surrounding traffic. To address this issue, we conducted an interactive, participatory workshop (N=4) to develop an auditory interface and OnBoard, a tactile interface with expandable elements, to convey traffic information to visually impaired people. In a user study with N=14 participants, we explored usability, situation awareness, predictability, and engagement with OnBoard and the auditory interface. Our qualitative and quantitative results show that tactile cues, similar to auditory cues, are able to convey traffic information to users. In particular, there is a trend that participants with reduced visual acuity showed increased engagement with both interfaces. However, the diversity of visual impairments and individual information needs underscores the importance of a highly tailored multimodal approach as the ideal solution.
{"title":"Hey, What's Going On?","authors":"Luca-Maxim Meinhardt, Maximilian Rück, Julian Zähnle, Maryam Elhaidary, Mark Colley, Michael Rietzler, Enrico Rukzio","doi":"10.1145/3659618","DOIUrl":"https://doi.org/10.1145/3659618","url":null,"abstract":"Highly Automated Vehicles offer a new level of independence to people who are blind or visually impaired. However, due to their limited vision, gaining knowledge of the surrounding traffic can be challenging. To address this issue, we conducted an interactive, participatory workshop (N=4) to develop an auditory interface and OnBoard- a tactile interface with expandable elements - to convey traffic information to visually impaired people. In a user study with N=14 participants, we explored usability, situation awareness, predictability, and engagement with OnBoard and the auditory interface. Our qualitative and quantitative results show that tactile cues, similar to auditory cues, are able to convey traffic information to users. In particular, there is a trend that participants with reduced visual acuity showed increased engagement with both interfaces. However, the diversity of visual impairments and individual information needs underscores the importance of a highly tailored multimodal approach as the ideal solution.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140983041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federated Learning (FL) has emerged as a privacy-preserving paradigm for collaborative deep learning model training across distributed data silos. Despite its importance, FL faces challenges such as high latency and less effective global models. In this paper, we propose ShuffleFL, an innovative framework stemming from hierarchical FL that introduces a user layer between the FL devices and the FL server. ShuffleFL naturally groups devices based on their affiliations, e.g., belonging to the same user, to ease the strict privacy restriction that data at the FL devices cannot be shared with others, thereby enabling the exchange of local samples among them. The user layer assumes a multi-faceted role, not just aggregating local updates but also coordinating data shuffling among affiliated devices. We formulate this data shuffling as an optimization problem, with the objectives of aligning local data closely with device computing capabilities and ensuring a more balanced data distribution across the intra-user devices. Through extensive experiments using realistic device profiles and five non-IID datasets, we demonstrate that ShuffleFL can improve inference accuracy by 2.81% to 7.85% and speed up convergence by 4.11x to 36.56x when reaching the target accuracy.
{"title":"ShuffleFL: Addressing Heterogeneity in Multi-Device Federated Learning","authors":"Ran Zhu, Mingkun Yang, Qing Wang","doi":"10.1145/3659621","DOIUrl":"https://doi.org/10.1145/3659621","url":null,"abstract":"Federated Learning (FL) has emerged as a privacy-preserving paradigm for collaborative deep learning model training across distributed data silos. Despite its importance, FL faces challenges such as high latency and less effective global models. In this paper, we propose ShuffleFL, an innovative framework stemming from the hierarchical FL, which introduces a user layer between the FL devices and the FL server. ShuffleFL naturally groups devices based on their affiliations, e.g., belonging to the same user, to ease the strict privacy restriction-\"data at the FL devices cannot be shared with others\", thereby enabling the exchange of local samples among them. The user layer assumes a multi-faceted role, not just aggregating local updates but also coordinating data shuffling within affiliated devices. We formulate this data shuffling as an optimization problem, detailing our objectives to align local data closely with device computing capabilities and to ensure a more balanced data distribution at the intra-user devices. Through extensive experiments using realistic device profiles and five non-IID datasets, we demonstrate that ShuffleFL can improve inference accuracy by 2.81% to 7.85% and speed up the convergence by 4.11x to 36.56x when reaching the target accuracy.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140983399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Silent Speech Interfaces (SSIs) on mobile devices offer a privacy-friendly alternative to conventional voice input methods. Previous research has primarily focused on smartphones. In this paper, we introduce Lipwatch, a novel system that uses acoustic sensing techniques to enable SSI on smartwatches. Lipwatch leverages inaudible waves emitted by the watch's speaker to capture lip movements and then analyzes the echo to enable SSI. In contrast to acoustic-sensing-based SSI on smartphones, our development of Lipwatch takes into full consideration the specific scenarios and requirements associated with smartwatches. First, we design a wake-up-free mechanism that allows users to interact without a wake-up phrase or button presses. The mechanism uses the inertial sensors on the smartwatch to detect gestures, in combination with acoustic signals that detect lip movements, to determine whether SSI should be activated. Second, we design a flexible silent speech recognition mechanism that extends limited-vocabulary recognition to comprehend a broader range of user commands, even those not present in the training dataset, relieving users from strict adherence to predefined commands. We evaluate Lipwatch with 15 participants using a set of the 80 most common interaction commands on smartwatches. The system achieves a Word Error Rate (WER) of 13.7% in a user-independent test. Even when users utter commands containing words absent from the training set, Lipwatch still demonstrates a remarkable 88.7% top-3 accuracy. We implement a real-time version of Lipwatch on a commercial smartwatch. The user study shows that Lipwatch can be a practical and promising option for enabling SSI on smartwatches.
{"title":"Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing","authors":"Qian Zhang, Yubin Lan, Kaiyi Guo, Dong Wang","doi":"10.1145/3659614","DOIUrl":"https://doi.org/10.1145/3659614","url":null,"abstract":"Silent Speech Interfaces (SSI) on mobile devices offer a privacy-friendly alternative to conventional voice input methods. Previous research has primarily focused on smartphones. In this paper, we introduce Lipwatch, a novel system that utilizes acoustic sensing techniques to enable SSI on smartwatches. Lipwatch leverages the inaudible waves emitted by the watch's speaker to capture lip movements and then analyzes the echo to enable SSI. In contrast to acoustic sensing-based SSI on smartphones, our development of Lipwatch takes into full consideration the specific scenarios and requirements associated with smartwatches. Firstly, we elaborate a wake-up-free mechanism, allowing users to interact without the need for a wake-up phrase or button presses. The mechanism utilizes the inertial sensors on the smartwatch to detect gestures, in combination with acoustic signals that detecting lip movements to determine whether SSI should be activated. Secondly, we design a flexible silent speech recognition mechanism that explores limited vocabulary recognition to comprehend a broader range of user commands, even those not present in the training dataset, relieving users from strict adherence to predefined commands. We evaluate Lipwatch on 15 individuals using a set of the 80 most common interaction commands on smartwatches. The system achieves a Word Error Rate (WER) of 13.7% in user-independent test. Even when users utter commands containing words absent in the training set, Lipwatch still demonstrates a remarkable 88.7% top-3 accuracy. We implement a real-time version of Lipwatch on a commercial smartwatch. The user study shows that Lipwatch can be a practical and promising option to enable SSI on smartwatches.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140983148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}