{"title":"和聪明的伴侣进行一次文化之旅","authors":"A. Bimbo","doi":"10.1145/3078971.3079005","DOIUrl":null,"url":null,"abstract":"Digital and mobile technologies have become increasingly popular to support and improve the quality of experience during cultural visits. The portability of the device, the daily adaptation of most people to its usage, the easy access to information and the opportunity of interactive augmented reality have been key factors of this popularity. We believe that computer vision may help to improve such quality of experience, by making the mobile device smarter and capable of inferring the visitor interests directly from his/her behavior, so triggering the delivery of the appropriate information at the right time without any specific user actions. At MICC University of Florence, we have developed two prototypes of smart audio guides, respectively for indoor and outdoor cultural visits, that exploit the availability of multi-core CPUs and GPUs on mobile devices and computer vision to feed information according to the interests of the visitor, in a non intrusive and natural way. In the first one [Seidenari et al. 2017], the YOLO network [Redmon et al. 2016] is used to distinguish between artworks and people in the camera view. If an artwork is detected, it predicts a specific artwork label. The artwork's description is hence given in audio in the visitor's language. In the second one, the GPS coordinates are used to search Google Places and obtain the interest points closeby. To determine what landmark the visitor is actually looking at, the actual view of the camera is matched against the Google Street Map database using SIFT features. Matched views are classified as either artwork or background and for artworks, descriptions are obtained from Wikipedia. Both prototypes were conceived as a smart mate for visits in museums and outdoor sites or cities of art, respectively. In both prototypes, voice activity detection provides hints about what is happening in the surrounding context of the visitor and triggers the audio description only when the visitor is not talking with the accompanying persons. They were developed on NVIDIA Jetson TK1 and deployed on a NVIDIA Shield K1 Tablet, run in real time and were tested in real contexts in a musum and the city of Florence.","PeriodicalId":93291,"journal":{"name":"ICMR'17 : proceedings of the 2017 ACM International Conference on Multimedia Retrieval : June 6-9, 2017, Bucharest, Romania. ACM International Conference on Multimedia Retrieval (2017 : Bucharest, Romania)","volume":"66 1","pages":"2"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Making a Cultural Visit with a Smart Mate\",\"authors\":\"A. Bimbo\",\"doi\":\"10.1145/3078971.3079005\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Digital and mobile technologies have become increasingly popular to support and improve the quality of experience during cultural visits. The portability of the device, the daily adaptation of most people to its usage, the easy access to information and the opportunity of interactive augmented reality have been key factors of this popularity. 
We believe that computer vision may help to improve such quality of experience, by making the mobile device smarter and capable of inferring the visitor interests directly from his/her behavior, so triggering the delivery of the appropriate information at the right time without any specific user actions. At MICC University of Florence, we have developed two prototypes of smart audio guides, respectively for indoor and outdoor cultural visits, that exploit the availability of multi-core CPUs and GPUs on mobile devices and computer vision to feed information according to the interests of the visitor, in a non intrusive and natural way. In the first one [Seidenari et al. 2017], the YOLO network [Redmon et al. 2016] is used to distinguish between artworks and people in the camera view. If an artwork is detected, it predicts a specific artwork label. The artwork's description is hence given in audio in the visitor's language. In the second one, the GPS coordinates are used to search Google Places and obtain the interest points closeby. To determine what landmark the visitor is actually looking at, the actual view of the camera is matched against the Google Street Map database using SIFT features. Matched views are classified as either artwork or background and for artworks, descriptions are obtained from Wikipedia. Both prototypes were conceived as a smart mate for visits in museums and outdoor sites or cities of art, respectively. In both prototypes, voice activity detection provides hints about what is happening in the surrounding context of the visitor and triggers the audio description only when the visitor is not talking with the accompanying persons. They were developed on NVIDIA Jetson TK1 and deployed on a NVIDIA Shield K1 Tablet, run in real time and were tested in real contexts in a musum and the city of Florence.\",\"PeriodicalId\":93291,\"journal\":{\"name\":\"ICMR'17 : proceedings of the 2017 ACM International Conference on Multimedia Retrieval : June 6-9, 2017, Bucharest, Romania. ACM International Conference on Multimedia Retrieval (2017 : Bucharest, Romania)\",\"volume\":\"66 1\",\"pages\":\"2\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ICMR'17 : proceedings of the 2017 ACM International Conference on Multimedia Retrieval : June 6-9, 2017, Bucharest, Romania. ACM International Conference on Multimedia Retrieval (2017 : Bucharest, Romania)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3078971.3079005\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICMR'17 : proceedings of the 2017 ACM International Conference on Multimedia Retrieval : June 6-9, 2017, Bucharest, Romania. ACM International Conference on Multimedia Retrieval (2017 : Bucharest, Romania)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3078971.3079005","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Digital and mobile technologies have become increasingly popular as a means to support and improve the quality of experience during cultural visits. The portability of the device, the familiarity most people have with its everyday use, the easy access to information and the opportunities offered by interactive augmented reality have been key factors in this popularity. We believe that computer vision can help improve this quality of experience by making the mobile device smarter and capable of inferring the visitor's interests directly from his or her behavior, thus triggering the delivery of the appropriate information at the right time without any specific user action. At MICC, University of Florence, we have developed two prototypes of smart audio guides, for indoor and outdoor cultural visits respectively, that exploit the multi-core CPUs and GPUs available on mobile devices, together with computer vision, to deliver information according to the interests of the visitor in a non-intrusive and natural way. In the first [Seidenari et al. 2017], the YOLO network [Redmon et al. 2016] is used to distinguish between artworks and people in the camera view; if an artwork is detected, a specific artwork label is predicted and the artwork's description is then played as audio in the visitor's language. In the second, the GPS coordinates are used to query Google Places and obtain the nearby points of interest. To determine which landmark the visitor is actually looking at, the current camera view is matched against the Google Street Map database using SIFT features; matched views are classified as either artwork or background, and for artworks the descriptions are obtained from Wikipedia. The two prototypes were conceived as a smart mate for visits to museums and to outdoor sites or cities of art, respectively. In both prototypes, voice activity detection provides hints about what is happening around the visitor and triggers the audio description only when the visitor is not talking with accompanying persons. Both guides were developed on an NVIDIA Jetson TK1, deployed on an NVIDIA Shield K1 tablet, run in real time, and were tested in real contexts in a museum and in the city of Florence.
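To make the indoor pipeline concrete, the following Python sketch outlines the detect-then-recognize loop described above. It is not the MICC implementation: run_yolo, classify_artwork and play_audio_description are hypothetical placeholders standing in for the YOLO detector, the artwork-specific classifier and the audio playback component of the prototype.

```python
# Hedged sketch of the indoor smart-guide loop: detect artworks vs. people with an
# object detector, recognize the specific artwork, and trigger its audio description.
# run_yolo, classify_artwork and play_audio_description are hypothetical placeholders,
# not the actual MICC prototype code.
import cv2  # OpenCV, used here only for camera capture


def run_yolo(frame):
    """Placeholder detector: return a list of (label, (x, y, w, h)) boxes,
    where label is either 'artwork' or 'person'."""
    raise NotImplementedError


def classify_artwork(crop):
    """Placeholder classifier: map an artwork crop to a specific artwork id, or None."""
    raise NotImplementedError


def play_audio_description(artwork_id, language="en"):
    """Placeholder: fetch and play the description in the visitor's language."""
    raise NotImplementedError


def indoor_guide_loop(camera_index=0, language="en"):
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for label, (x, y, w, h) in run_yolo(frame):
                if label != "artwork":
                    continue  # detected people are used only to filter them out here
                artwork_id = classify_artwork(frame[y:y + h, x:x + w])
                if artwork_id is not None:
                    play_audio_description(artwork_id, language)
    finally:
        cap.release()
```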
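The outdoor view-matching step can similarly be sketched with OpenCV's SIFT implementation. In this hedged example, the street-level reference images of nearby landmarks (which in the prototype come from the lookup around the visitor's GPS position) are reduced to a plain dictionary of grayscale images; best_matching_landmark and the ratio-test threshold are choices of this sketch, not details taken from the paper.

```python
# Hedged sketch of the outdoor matching step: compare the current camera view against
# street-level reference images of nearby landmarks using SIFT features and Lowe's
# ratio test. The reference images would come from a lookup around the visitor's GPS
# position; here they are simply a dict mapping landmark names to grayscale images.
import cv2


def best_matching_landmark(query_img, reference_imgs, min_good_matches=25):
    """Return the name of the reference image with the most good SIFT matches,
    or None if the view is better explained as background."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()  # L2 norm, appropriate for SIFT descriptors

    _, query_desc = sift.detectAndCompute(query_img, None)
    if query_desc is None:
        return None

    best_name, best_count = None, 0
    for name, ref_img in reference_imgs.items():
        _, ref_desc = sift.detectAndCompute(ref_img, None)
        if ref_desc is None:
            continue
        matches = matcher.knnMatch(query_desc, ref_desc, k=2)
        # Lowe's ratio test keeps only distinctive correspondences.
        good = [pair[0] for pair in matches
                if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
        if len(good) > best_count:
            best_name, best_count = name, len(good)

    # Too few matches: classify the view as background rather than a landmark.
    return best_name if best_count >= min_good_matches else None
```

A real pipeline would likely also verify the geometric consistency of the matches (for example with a RANSAC-estimated homography) before accepting a landmark and fetching its Wikipedia description.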
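Finally, the voice-activity gate can be illustrated as follows. The webrtcvad package is an assumption of this sketch (the abstract does not name the VAD component actually used): the description is played only when no speech is detected in the most recent microphone frames.

```python
# Hedged sketch of the voice-activity gate: the audio description is triggered only
# when the visitor is not talking with the people around them. webrtcvad is an assumed
# library choice; it expects 10/20/30 ms frames of 16-bit mono PCM audio.
import webrtcvad


def visitor_is_silent(pcm_frames, sample_rate=16000, aggressiveness=2):
    """Return True if none of the recent audio frames contain speech."""
    vad = webrtcvad.Vad(aggressiveness)
    return not any(vad.is_speech(frame, sample_rate) for frame in pcm_frames)


def maybe_describe(artwork_id, recent_audio_frames, play_fn):
    """Play the description only if the visitor is not currently speaking."""
    if visitor_is_silent(recent_audio_frames):
        play_fn(artwork_id)
```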