Deciphering Entrepreneurial Pitches: A Multimodal Deep Learning Approach to Predict Probability of Investment
Pepijn Van Aken, Merel M. Jung, Werner Liebregts, Itir Onal Ertugrul
DOI: https://doi.org/10.1145/3577190.3614146
Acquiring early-stage investments for the purpose of developing a business is a fundamental aspect of the entrepreneurial process, which regularly entails pitching the business proposal to potential investors. Previous research suggests that business viability data and the perception of the entrepreneur play an important role in the investment decision-making process. This perception of the entrepreneur is shaped by verbal and non-verbal behavioral cues produced in investor-entrepreneur interactions. This study explores the impact of such cues on decisions that involve investing in a startup on the basis of a pitch. A multimodal approach is developed in which acoustic and linguistic features are extracted from recordings of entrepreneurial pitches to predict the likelihood of investment. The acoustic and linguistic modalities are represented using both hand-crafted and deep features. The capabilities of deep learning models are exploited to capture the temporal dynamics of the inputs. The findings show promising results for the prediction of the likelihood of investment using a multimodal architecture consisting of acoustic and linguistic features. Models based on deep features generally outperform hand-crafted representations. Experiments with an explainable model provide insights about the important features. The most predictive model is found to be a multimodal one that combines deep acoustic and linguistic features using an early fusion strategy and achieves an MAE of 13.91.
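As a rough illustration of the early-fusion idea described above, the sketch below concatenates per-timestep deep acoustic and linguistic features, encodes them with a recurrent layer, and regresses the likelihood of investment with an MAE (L1) objective. The feature dimensions, the GRU encoder, and the 0-100 target scale are assumptions for illustration; the abstract does not specify the exact architecture.

```python
# Minimal sketch of an early-fusion regressor over deep acoustic and linguistic
# features (illustrative dimensions and encoder, not the authors' exact model).
import torch
import torch.nn as nn

class EarlyFusionRegressor(nn.Module):
    def __init__(self, acoustic_dim=128, linguistic_dim=768, hidden_dim=128):
        super().__init__()
        # Early fusion: concatenate the modalities before any further modelling.
        self.encoder = nn.GRU(acoustic_dim + linguistic_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, acoustic, linguistic):
        # acoustic:   (batch, time, acoustic_dim)
        # linguistic: (batch, time, linguistic_dim), aligned to the same time steps
        fused = torch.cat([acoustic, linguistic], dim=-1)
        _, h = self.encoder(fused)                   # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0)).squeeze(-1)   # predicted likelihood of investment

model = EarlyFusionRegressor()
pred = model(torch.randn(4, 200, 128), torch.randn(4, 200, 768))
loss = nn.L1Loss()(pred, torch.rand(4) * 100)        # MAE objective, matching the reported metric
```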
ViFi-Loc: Multi-modal Pedestrian Localization using GAN with Camera-Phone Correspondences
Hansi Liu, Hongsheng Lu, Kristin Dana, Marco Gruteser
DOI: https://doi.org/10.1145/3577190.3614119
In Smart City and Vehicle-to-Everything (V2X) systems, acquiring pedestrians’ accurate locations is crucial to traffic and pedestrian safety. Current systems adopt cameras and wireless sensors to estimate people’s locations via sensor fusion. Standard fusion algorithms, however, become inapplicable when multi-modal data is not associated, for example when pedestrians are out of the camera field of view or data from the camera modality is missing. To address this challenge and produce more accurate location estimations for pedestrians, we propose a localization solution based on a Generative Adversarial Network (GAN) architecture. During training, it learns the underlying linkage between pedestrians’ camera-phone data correspondences. During inference, it generates refined position estimations based only on pedestrians’ phone data, which consists of GPS, IMU, and FTM. Results show that our GAN produces 3D coordinates with 1 to 2 meters of localization error across 5 different outdoor scenes. We further show that the proposed model supports self-learning: the generated coordinates can be associated with pedestrians’ bounding box coordinates to obtain additional camera-phone data correspondences, which allows automatic data collection during inference. Results show that after fine-tuning the GAN model on the expanded dataset, localization accuracy is further improved by up to 26%.
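The sketch below illustrates the adversarial setup suggested by the abstract: a generator refines a pedestrian's phone measurements (GPS, IMU, FTM) into 3D coordinates, while a discriminator learns to separate generated positions from camera-derived ground truth available through camera-phone correspondences. The flat input encoding and layer sizes are assumptions for illustration, not the authors' architecture.

```python
# Minimal GAN sketch for phone-data-to-position refinement (illustrative only).
import torch
import torch.nn as nn

PHONE_DIM = 3 + 6 + 1   # assumed encoding: GPS (lat, lon, alt), IMU (acc + gyro), FTM range

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PHONE_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3),                      # refined (x, y, z) position
        )
    def forward(self, phone):
        return self.net(phone)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PHONE_DIM + 3, 64), nn.ReLU(),
            nn.Linear(64, 1),                      # real/fake score for a (phone, position) pair
        )
    def forward(self, phone, pos):
        return self.net(torch.cat([phone, pos], dim=-1))

G, D = Generator(), Discriminator()
phone = torch.randn(8, PHONE_DIM)
camera_pos = torch.randn(8, 3)                     # camera-phone correspondences (training only)
bce = nn.BCEWithLogitsLoss()
d_loss = bce(D(phone, camera_pos), torch.ones(8, 1)) + \
         bce(D(phone, G(phone).detach()), torch.zeros(8, 1))
g_loss = bce(D(phone, G(phone)), torch.ones(8, 1)) # generator tries to fool the discriminator
```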
Cross-Device Shortcuts: Seamless Attention-guided Content Transfer via Opportunistic Deep Links between Apps and Devices
Marilou Beyeler, Yi Fei Cheng, Christian Holz
DOI: https://doi.org/10.1145/3577190.3614145
Although users increasingly spread their activities across multiple devices—even to accomplish a single task—information transfer between apps on separate devices still incurs non-negligible effort and time overhead. These interaction flows would considerably benefit from more seamless cross-device interaction that directly connects the information flow between the involved apps across devices. In this paper, we propose cross-device shortcuts, an interaction technique that enables direct and discoverable content exchange between apps on different devices. When users switch their attention between multiple engaged devices as part of a workflow, our system establishes a cross-device shortcut—a deep link between apps on separate devices that presents itself through feed-forward previews, inviting and facilitating quick content transfer. We explore the use of this technique in four scenarios spanning multiple devices and applications, and highlight the potential, limitations, and challenges of its design with a preliminary evaluation.
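To make the notion of a cross-device shortcut concrete, here is a purely hypothetical sketch of the kind of payload such a deep link might carry, together with the feed-forward preview shown to the user. All field names are invented for illustration; the paper does not describe a data format.

```python
# Hypothetical payload for a cross-device shortcut (illustrative only).
from dataclasses import dataclass

@dataclass
class CrossDeviceShortcut:
    source_device: str      # device currently holding the user's attention
    target_device: str      # device the content would be transferred to
    target_app_link: str    # deep link into the receiving app
    preview: str            # feed-forward preview describing the outcome

    def describe(self) -> str:
        # In a real system this text would be rendered as the feed-forward preview.
        return f"{self.preview} ({self.target_app_link} on {self.target_device})"

shortcut = CrossDeviceShortcut("phone", "laptop", "notes://new-note", "Paste selection into Notes")
print(shortcut.describe())
```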
Increasing Heart Rate and Anxiety Level with Vibrotactile and Audio Presentation of Fast Heartbeat
Ruoqi Wang, Haifeng Zhang, Shaun Alexander Macdonald, Patrizia Di Campli San Vito
DOI: https://doi.org/10.1145/3577190.3614161
The heartbeat is not only an indicator of our physical health, but also plays an important role in our emotional changes. Previous investigations have repeatedly examined the soothing effects of low-frequency vibrotactile cues that evoke a slow heartbeat in stressful situations. The impact of stimuli that evoke faster heartbeats on users’ anxiety or heart rate is, however, poorly understood. We conducted two studies to evaluate the influence of the presentation of a fast heartbeat via vibration and/or sound, both in calm and stressed states. Results showed that the presentation of fast heartbeat stimuli can induce increased anxiety levels and heart rate. We use these results to inform how future designers could carefully present fast heartbeat stimuli in multimedia applications to enhance feelings of immersion, effort, and engagement.
GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition
Yingxue Gao, Huan Zhao, Yufeng Xiao, Zixing Zhang
DOI: https://doi.org/10.1145/3577190.3614177
Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, the application of GCNs to speech emotion recognition (SER) has not yet been widely studied. Meanwhile, recent studies have shown that GCNs may not be able to adaptively capture long-range contextual emotional information over the whole audio. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model that can extract both local and global emotional information. Specifically, we construct a cyclic graph and perform concise graph convolution operations to obtain spatial local features. Then, a consecutive transformer network further strives to learn higher-level representations and their global temporal correlation. Finally, the serialized representations learned by the transformer are mapped into a vector through a gated recurrent unit (GRU) pooling layer for emotion classification. Experimental results obtained on two public emotional datasets demonstrate that the proposed GCFormer performs significantly better than other GCN-based models in terms of prediction accuracy, and surpasses other state-of-the-art deep learning models in terms of both prediction accuracy and model efficiency.
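The sketch below follows the pipeline described in the abstract: graph convolution over a cyclic frame graph for local context, a transformer encoder for global temporal correlation, and a GRU pooling layer feeding an emotion classifier. Feature sizes, layer counts, and the normalisation of the cyclic adjacency are assumptions for illustration, not the authors' configuration.

```python
# Minimal GCFormer-style sketch: cyclic graph conv -> transformer -> GRU pooling -> classifier.
import torch
import torch.nn as nn

def cyclic_adjacency(num_nodes):
    # Each frame links to its predecessor and successor with wrap-around, plus a
    # self-loop; rows are normalised so the operator averages over neighbours.
    adj = torch.eye(num_nodes)
    idx = torch.arange(num_nodes)
    adj[idx, (idx + 1) % num_nodes] = 1.0
    adj[idx, (idx - 1) % num_nodes] = 1.0
    return adj / adj.sum(dim=1, keepdim=True)

class GCFormerSketch(nn.Module):
    def __init__(self, feat_dim=64, num_frames=100, num_emotions=4):
        super().__init__()
        self.register_buffer("adj", cyclic_adjacency(num_frames))
        self.gc = nn.Linear(feat_dim, feat_dim)                     # graph convolution weights
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)     # GRU "pooling" over time
        self.classifier = nn.Linear(feat_dim, num_emotions)

    def forward(self, x):                        # x: (batch, num_frames, feat_dim) frame features
        x = torch.relu(self.adj @ self.gc(x))    # local context from the cyclic graph
        x = self.transformer(x)                  # global temporal correlation
        _, h = self.gru(x)                       # collapse the sequence into one vector
        return self.classifier(h.squeeze(0))     # emotion logits

logits = GCFormerSketch()(torch.randn(2, 100, 64))
```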
Acoustic and Visual Knowledge Distillation for Contrastive Audio-Visual Localization
Ehsan Yaghoubi, Andre Peter Kelm, Timo Gerkmann, Simone Frintrop
DOI: https://doi.org/10.1145/3577190.3614144
This paper introduces an unsupervised model for audio-visual localization, which aims to identify regions in the visual data that produce sounds. Our key technical contribution is to demonstrate that using distilled prior knowledge of both sounds and objects in an unsupervised learning phase can improve performance significantly. We propose an Audio-Visual Correspondence (AVC) model consisting of an audio and a vision student, which are respectively supervised by an audio teacher (audio recognition model) and a vision teacher (object detection model). Leveraging a contrastive learning approach, the AVC student model extracts features from sounds and images and computes a localization map, discovering the regions of the visual data that correspond to the sound signal. Simultaneously, the teacher models provide feature-based hints from their last layers to supervise the AVC model in the training phase. In the test phase, the teachers are removed. Our extensive experiments show that the proposed model outperforms the state-of-the-art audio-visual localization models on 10k and 144k subsets of the Flickr and VGGS datasets, including cross-dataset validation.
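The sketch below illustrates the two training signals described above under stated assumptions: a batch-wise contrastive objective between pooled visual and audio student embeddings (which also yields a per-location localization map), and a feature-based "hint" loss aligning student features with a frozen teacher's last-layer features. Backbones, dimensions, and the exact loss formulations are illustrative, not the authors' implementation.

```python
# Minimal sketch of contrastive audio-visual localization with teacher hints.
import torch
import torch.nn.functional as F

def localization_map(vis_feat, aud_feat):
    # vis_feat: (B, D, H, W) per-location visual embeddings from the vision student
    # aud_feat: (B, D)       clip-level audio embedding from the audio student
    vis = F.normalize(vis_feat, dim=1)
    aud = F.normalize(aud_feat, dim=1)
    return torch.einsum("bdhw,bd->bhw", vis, aud)         # cosine similarity per location

def contrastive_loss(vis_feat, aud_feat, tau=0.07):
    # Matched image/audio pairs should score higher than mismatched pairs in the batch.
    vis = F.normalize(vis_feat.mean(dim=(2, 3)), dim=1)   # pooled visual embedding
    aud = F.normalize(aud_feat, dim=1)
    logits = vis @ aud.t() / tau                          # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(vis.size(0)))

def hint_loss(student_feat, teacher_feat):
    # Distillation "hint": align student features with the frozen teacher's features.
    return F.mse_loss(student_feat, teacher_feat.detach())

vis_feat, aud_feat = torch.randn(4, 128, 14, 14), torch.randn(4, 128)
teacher_audio_feat = torch.randn(4, 128)                  # stand-in for the audio teacher's output
loss = contrastive_loss(vis_feat, aud_feat) + hint_loss(aud_feat, teacher_audio_feat)
heatmap = localization_map(vis_feat, aud_feat)            # (4, 14, 14), used at test time
```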
Explainable Depression Detection using Multimodal Behavioural Cues
Monika Gahalawat
DOI: https://doi.org/10.1145/3577190.3614227
Depression is a severe mental illness that not only affects the patient but also has major social and economic implications. Recent studies have employed artificial intelligence using multimodal behavioural cues to objectively investigate depression and alleviate the subjectivity involved in the current depression diagnostic process. However, head motion has received fairly limited attention as a behavioural marker for detecting depression, and the lack of explainability of "black box" approaches has restricted their widespread adoption. Consequently, the objective of this research is to examine the utility of fundamental head-motion units termed kinemes and to explore the explainability of multimodal behavioural cues for depression detection. To this end, the research to date has evaluated depression classification performance on the BlackDog and AVEC2013 datasets using multiple machine learning methods. Our findings indicate that: (a) head motion patterns are effective cues for depression assessment, and (b) explanatory kineme patterns can be observed for the two classes, consistent with prior research.
Explainable Depression Detection via Head Motion Patterns
Monika Gahalawat, Raul Fernandez Rojas, Tanaya Guha, Ramanathan Subramanian, Roland Goecke
DOI: https://doi.org/10.1145/3577190.3614130
While depression has been studied via multimodal non-verbal behavioural cues, head motion behaviour has not received much attention as a biomarker. This study demonstrates the utility of fundamental head-motion units, termed kinemes, for depression detection by adopting two distinct approaches, and employing distinctive features: (a) discovering kinemes from head motion data corresponding to both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls, and computing statistics derived from reconstruction errors for both the patient and control classes. Employing machine learning methods, we evaluate depression classification performance on the BlackDog and AVEC2013 datasets. Our findings indicate that: (1) head motion patterns are effective biomarkers for detecting depressive symptoms, and (2) explanatory kineme patterns consistent with prior findings can be observed for the two classes. Overall, we achieve peak F1 scores of 0.79 and 0.82, respectively, over BlackDog and AVEC2013 for binary classification over episodic thin-slices, and a peak F1 of 0.72 over videos for AVEC2013.
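As a rough illustration of approach (b) above, the sketch below learns kineme-like prototypes by clustering fixed-length head-pose windows from healthy controls only, then summarises each recording by statistics of its reconstruction error (distance to the nearest prototype) and trains a binary classifier on these features. The window length, number of kinemes, synthetic data, and SVM classifier are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch: kineme prototypes from controls + reconstruction-error features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

WIN = 30          # assumed window length (frames) of a head-pose segment
N_KINEMES = 16    # assumed number of kineme prototypes

def windows(head_pose):
    # head_pose: (T, 3) pitch/yaw/roll over time -> flattened fixed-length segments
    n = len(head_pose) // WIN
    return head_pose[: n * WIN].reshape(n, WIN * 3)

def reconstruction_features(segments, kinemes):
    # Distance of each segment to its nearest kineme, summarised per recording.
    dists = np.min(np.linalg.norm(segments[:, None] - kinemes[None], axis=-1), axis=1)
    return np.array([dists.mean(), dists.std(), dists.max()])

rng = np.random.default_rng(0)
control_pose = [rng.standard_normal((600, 3)) for _ in range(10)]        # healthy-control recordings
patient_pose = [rng.standard_normal((600, 3)) + 0.5 for _ in range(10)]  # synthetic patient recordings

kinemes = KMeans(n_clusters=N_KINEMES, n_init=10, random_state=0).fit(
    np.vstack([windows(p) for p in control_pose])
).cluster_centers_

X = np.stack([reconstruction_features(windows(p), kinemes) for p in control_pose + patient_pose])
y = np.array([0] * 10 + [1] * 10)
clf = SVC().fit(X, y)   # binary depressed-vs-control classifier
```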
Multimodal, Interactive Interfaces for Education
Daniel C. Tozadore, Lise Aubin, Soizic Gauthier, Barbara Bruno, Salvatore M. Anzalone
DOI: https://doi.org/10.1145/3577190.3616881
In the rapidly evolving landscape of education, the integration of technology and innovative pedagogical approaches has become imperative to engage learners effectively. Our workshop aimed to delve into the intersection of technology, cognitive psychology, and educational theory to explore the potential of multimodal interfaces in transforming the learning experience in both regular and special education. Its interdisciplinary nature brought together experts from the fields of human-computer interaction, education, cognitive science, and computer science. To further inform the participants' discussions, the workshop featured 3 keynotes from experts in the field, 6 presentations of accepted short papers from participants, and 6 in-loco demos of relevant projects. The high-level content covered is intended to guide future work in this area.
Crowd Behaviour Prediction using Visual and Location Data in Super-Crowded Scenarios
Antonius Bima Murti Wijaya
DOI: https://doi.org/10.1145/3577190.3614230
Predicting the future trajectory of a crowd is important for safety, to prevent disasters such as stampedes or collisions. Extensive research has been conducted to explore trajectory prediction in typical crowd scenarios, where the majority of individuals can be easily identified. However, this study focuses on a more challenging scenario known as the super-crowd scene, wherein individuals within the crowd can only be annotated based on their heads. In this particular scenario, person re-identification for tracking does not perform well due to a lack of clear image data. Our research proposes a clustering strategy to overcome the re-identification problem and predict the trajectories of crowd clusters. Two-dimensional (2D) maps and multiple cameras will be used to capture full pictures of crowds at a location and extract the venue’s spatial data (see figure 1). The research methodology encompasses several key steps, including evaluating data extraction with state-of-the-art methods, estimating crowd clusters, integrating 2D maps and multi-view fusion, and evaluating the proposed method on a dataset of multi-view videos collected in a real-world super-crowded scenario.
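The sketch below illustrates the clustering idea under stated assumptions: head detections projected onto the 2D map are grouped with DBSCAN, sidestepping per-person re-identification, and each cluster centroid's future trajectory is extrapolated. The DBSCAN parameters and the constant-velocity predictor are placeholders for illustration, not the proposed method's actual components.

```python
# Minimal sketch: cluster head detections on the 2D map and predict cluster centroids.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_centroids(head_positions, eps=1.0, min_samples=3):
    # head_positions: (N, 2) map coordinates of detected heads in one frame
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(head_positions)
    return {k: head_positions[labels == k].mean(axis=0) for k in set(labels) if k != -1}

def predict_centroid(track, horizon=5):
    # track: past centroid positions of one cluster (oldest first); a constant-velocity
    # extrapolation stands in for a learned trajectory model.
    track = np.asarray(track)
    velocity = track[-1] - track[-2]
    return np.array([track[-1] + (i + 1) * velocity for i in range(horizon)])

frame = np.random.rand(50, 2) * 20                 # synthetic head detections on the map
centroids = cluster_centroids(frame)               # cluster id -> current centroid
history = [[1.0, 1.0], [1.2, 1.1], [1.4, 1.2]]
future = predict_centroid(history)                 # (5, 2) predicted centroid positions
```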