User interfaces (UIs) of desktop, web, and mobile applications involve a hierarchy of objects (e.g., applications, screens, view classes, and other types of design objects) with multimodal (e.g., textual, visual) and positional (e.g., spatial location, sequence order, and hierarchy level) attributes. We can therefore represent a set of application UIs as a heterogeneous network with multimodal and positional attributes. Such a network not only represents how users understand the visual layout of UIs, but also influences how users interact with applications through these UIs. To model UI semantics well for different UI annotation, search, and evaluation tasks, this paper proposes the novel Heterogeneous Attention-based Multimodal Positional (HAMP) graph neural network model. HAMP combines graph neural networks with the scaled dot-product attention used in transformers to learn the embeddings of heterogeneous nodes and their associated multimodal and positional attributes in a unified manner. HAMP is evaluated on classification and regression tasks over three distinct real-world datasets. Our experiments demonstrate that HAMP significantly outperforms other state-of-the-art models on these tasks. To further interpret how heterogeneous network information contributes to the relationships between UI structure and prediction tasks, we propose Adaptive HAMP (AHAMP), which adaptively learns the importance of the different edges linking UI objects. Our experiments demonstrate AHAMP’s superior performance over HAMP on a number of tasks, as well as its ability to interpret the contributions of multimodal and positional attributes and heterogeneous network information to different tasks.
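The abstract above mentions combining graph neural networks with the scaled dot-product attention used in transformers. As a minimal illustrative sketch (not the paper's actual HAMP implementation), the snippet below shows standard scaled dot-product attention applied to aggregate the embeddings of a UI node's graph neighbors; the array names and dimensions are hypothetical.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention (Vaswani et al., 2017).
    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d_v) values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                       # weighted neighbor sum

# Toy example: update one UI node's embedding from its 5 graph neighbors.
rng = np.random.default_rng(0)
node = rng.normal(size=(1, 8))        # query: the node being updated
neighbors = rng.normal(size=(5, 8))   # keys/values: its neighbors' embeddings
out, w = scaled_dot_product_attention(node, neighbors, neighbors)
```

In a heterogeneous-graph setting such as the one the abstract describes, one such attention-weighted aggregation would typically be computed per edge type, which is where an adaptive variant like AHAMP could learn per-edge importance.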
Distracted driving is a leading cause of accidents worldwide. The tasks of distraction detection and recognition have traditionally been addressed as computer vision problems. However, distracted behaviors are not always expressed in a visually observable way. In this work, we introduce a novel multimodal dataset of distracted driver behaviors, consisting of data collected over twelve information channels spanning visual, acoustic, near-infrared, thermal, physiological, and linguistic modalities. The data were collected from 45 subjects while they were exposed to four different distractions (three cognitive and one physical). For the purposes of this paper, we performed experiments with visual, physiological, and thermal information to explore the potential of multimodal modeling for distraction recognition. In addition, we analyze the value of the different modalities by identifying the specific visual, physiological, and thermal feature groups that contribute most to distraction characterization. Our results highlight the advantage of multimodal representations and reveal valuable insights into the roles the three modalities play in identifying different types of driving distraction.
While EXplainable Artificial Intelligence (XAI) approaches aim to improve human-AI collaborative decision-making by improving model transparency and supporting users' mental model formation, experiential factors associated with human users can cause challenges in ways system designers do not anticipate. In this article, we first present a user study on how anchoring bias can affect mental model formation when users initially interact with an intelligent system, and on the role of explanations in addressing this bias. Using a video activity recognition tool in the cooking domain, we asked participants to verify whether a set of kitchen policies was being followed, with each policy focusing on a weakness or a strength of the system. We controlled the order of the policies and the presence of explanations to test our hypotheses. Our main finding shows that those who observed system strengths early on were more prone to automation bias and made significantly more errors due to positive first impressions of the system, even though they built a more accurate mental model of the system's competencies. In contrast, those who encountered weaknesses earlier made significantly fewer errors, since they tended to rely more on themselves, but they also underestimated the model's competencies due to a more negative first impression of it. Motivated by these findings and similar existing work, we formalize and present a conceptual model of users' past experiences that examines the relations between users' backgrounds, experiences, and human factors in XAI systems based on usage time. Our work presents strong findings and implications, aiming to raise AI designers' awareness of biases associated with user impressions and backgrounds.
Texting relies on screen-centric prompts designed for sighted users, which still pose significant barriers to people who are blind and visually impaired (BVI). Can we re-imagine texting untethered from a visual display? In an interview study, 20 BVI adults shared situations surrounding their texting practices, recurrent topics of conversation, and challenges. Informed by these insights, we introduce TextFlow, a mixed-initiative context-aware system that generates entirely auditory message options relevant to the user's location, activity, and time of day. Users can browse and select suggested aural messages using finger taps supported by an off-the-shelf finger-worn device, without having to hold or attend to a mobile screen. In an evaluative study, 10 BVI participants successfully interacted with TextFlow to browse and send messages in screen-free mode. The users' experiential responses shed light on the importance of bypassing the phone and accessing rapidly controllable messages at their fingertips, while preserving privacy and accuracy relative to speech or screen-based input. We discuss how non-visual access to proactive, contextual messaging can support BVI users in a variety of daily scenarios.
The opaque nature of many intelligent systems violates established usability principles and thus presents a challenge for human-computer interaction. Research in the field therefore highlights the need for transparency, scrutability, intelligibility, interpretability, and explainability, among others. While all of these terms carry a vision of supporting users in understanding intelligent systems, the underlying notions and assumptions about users and their interaction with the system often remain unclear.
We review the literature in HCI through the lens of implied user questions to synthesise a conceptual framework integrating user mindsets, user involvement, and knowledge outcomes, so as to reveal, differentiate, and classify current notions in prior work. This framework aims to resolve conceptual ambiguity in the field and to enable researchers to clarify their assumptions and become aware of those made in prior work. We further discuss related aspects such as stakeholders and trust, and provide material for applying our framework in practice (e.g., ideation/design sessions). We thus hope to advance and structure the dialogue on supporting users in understanding intelligent systems.