{"title":"TS2ACT","authors":"Kang Xia, Wenzhong Li, Shiwei Gan, Sanglu Lu","doi":"10.1145/3631445","DOIUrl":null,"url":null,"abstract":"Human Activity Recognition (HAR) based on embedded sensor data has become a popular research topic in ubiquitous computing, which has a wide range of practical applications in various fields such as human-computer interaction, healthcare, and motion tracking. Due to the difficulties of annotating sensing data, unsupervised and semi-supervised HAR methods are extensively studied, but their performance gap to the fully-supervised methods is notable. In this paper, we proposed a novel cross-modal co-learning approach called TS2ACT to achieve few-shot HAR. It introduces a cross-modal dataset augmentation method that uses the semantic-rich label text to search for human activity images to form an augmented dataset consisting of partially-labeled time series and fully-labeled images. Then it adopts a pre-trained CLIP image encoder to jointly train with a time series encoder using contrastive learning, where the time series and images are brought closer in feature space if they belong to the same activity class. For inference, the feature extracted from the input time series is compared with the embedding of a pre-trained CLIP text encoder using prompt learning, and the best match is output as the HAR classification results. We conducted extensive experiments on four public datasets to evaluate the performance of the proposed method. The numerical results show that TS2ACT significantly outperforms the state-of-the-art HAR methods, and it achieves performance close to or better than the fully supervised methods even using as few as 1% labeled data for model training. The source codes of TS2ACT are publicly available on GitHub1.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"1 10","pages":"1 - 22"},"PeriodicalIF":3.6000,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TS2ACT\",\"authors\":\"Kang Xia, Wenzhong Li, Shiwei Gan, Sanglu Lu\",\"doi\":\"10.1145/3631445\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human Activity Recognition (HAR) based on embedded sensor data has become a popular research topic in ubiquitous computing, which has a wide range of practical applications in various fields such as human-computer interaction, healthcare, and motion tracking. Due to the difficulties of annotating sensing data, unsupervised and semi-supervised HAR methods are extensively studied, but their performance gap to the fully-supervised methods is notable. In this paper, we proposed a novel cross-modal co-learning approach called TS2ACT to achieve few-shot HAR. It introduces a cross-modal dataset augmentation method that uses the semantic-rich label text to search for human activity images to form an augmented dataset consisting of partially-labeled time series and fully-labeled images. Then it adopts a pre-trained CLIP image encoder to jointly train with a time series encoder using contrastive learning, where the time series and images are brought closer in feature space if they belong to the same activity class. For inference, the feature extracted from the input time series is compared with the embedding of a pre-trained CLIP text encoder using prompt learning, and the best match is output as the HAR classification results. We conducted extensive experiments on four public datasets to evaluate the performance of the proposed method. The numerical results show that TS2ACT significantly outperforms the state-of-the-art HAR methods, and it achieves performance close to or better than the fully supervised methods even using as few as 1% labeled data for model training. The source codes of TS2ACT are publicly available on GitHub1.\",\"PeriodicalId\":20553,\"journal\":{\"name\":\"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies\",\"volume\":\"1 10\",\"pages\":\"1 - 22\"},\"PeriodicalIF\":3.6000,\"publicationDate\":\"2024-01-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3631445\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3631445","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
基于嵌入式传感器数据的人类活动识别(HAR)已成为泛在计算领域的热门研究课题,在人机交互、医疗保健和运动跟踪等多个领域有着广泛的实际应用。由于感知数据注释的困难,无监督和半监督 HAR 方法被广泛研究,但其性能与全监督方法相比差距明显。在本文中,我们提出了一种名为 TS2ACT 的新型跨模态协同学习方法,以实现少点 HAR。它引入了一种跨模态数据集增强方法,利用语义丰富的标签文本搜索人类活动图像,形成一个由部分标签时间序列和完全标签图像组成的增强数据集。然后,它采用对比学习方法,将预先训练好的 CLIP 图像编码器与时间序列编码器联合训练,如果时间序列和图像属于同一活动类别,则在特征空间中将它们拉近。在推理过程中,从输入时间序列中提取的特征会与预先训练好的 CLIP 文本编码器的嵌入进行比较,然后输出最佳匹配结果作为 HAR 分类结果。我们在四个公共数据集上进行了大量实验,以评估所提出方法的性能。数值结果表明,TS2ACT 的性能明显优于最先进的 HAR 方法,即使只使用 1% 的标注数据进行模型训练,它也能达到接近或优于完全监督方法的性能。TS2ACT 的源代码可在 GitHub 上公开获取1。
Human Activity Recognition (HAR) based on embedded sensor data has become a popular research topic in ubiquitous computing, which has a wide range of practical applications in various fields such as human-computer interaction, healthcare, and motion tracking. Due to the difficulties of annotating sensing data, unsupervised and semi-supervised HAR methods are extensively studied, but their performance gap to the fully-supervised methods is notable. In this paper, we proposed a novel cross-modal co-learning approach called TS2ACT to achieve few-shot HAR. It introduces a cross-modal dataset augmentation method that uses the semantic-rich label text to search for human activity images to form an augmented dataset consisting of partially-labeled time series and fully-labeled images. Then it adopts a pre-trained CLIP image encoder to jointly train with a time series encoder using contrastive learning, where the time series and images are brought closer in feature space if they belong to the same activity class. For inference, the feature extracted from the input time series is compared with the embedding of a pre-trained CLIP text encoder using prompt learning, and the best match is output as the HAR classification results. We conducted extensive experiments on four public datasets to evaluate the performance of the proposed method. The numerical results show that TS2ACT significantly outperforms the state-of-the-art HAR methods, and it achieves performance close to or better than the fully supervised methods even using as few as 1% labeled data for model training. The source codes of TS2ACT are publicly available on GitHub1.