Multimedia fog computing: minions in the cloud and crowd
Cheng-Hsin Hsu, Hua-Jun Hong, Tarek Elgamal, K. Nahrstedt, N. Venkatasubramanian
DOI: 10.1145/3122865.3122876
In cloud computing, minions refer to the virtual or physical machines that carry out the actual workload. Minions in the cloud hide in faraway data centers, which makes cloud computing less friendly to multimedia applications. The fog computing paradigm pushes minions toward edge networks. We adopt a generalized definition in which minions run on end devices owned by the crowd. Serious uncertainty, such as dynamic network conditions, limited battery levels, and unpredictable minion availability, makes multimedia fog platforms harder to manage than cloud platforms. In this chapter, we share our experience in utilizing resources from the crowd to optimize multimedia applications. The lessons learned shed some light on the optimal design of a unified multimedia fog platform for distributed multimedia applications.
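The abstract gives no algorithmic detail, but the minion-selection trade-off it alludes to (latency vs. battery vs. availability) can be sketched minimally as below. The `Minion` fields, the scoring weights, and the greedy policy are illustrative assumptions, not the authors' model.

```python
from dataclasses import dataclass

@dataclass
class Minion:
    """A crowd-owned end device that can accept offloaded work (illustrative)."""
    name: str
    rtt_ms: float        # measured round-trip time to the device
    battery: float       # remaining battery level in [0, 1]
    availability: float  # estimated probability the device stays online

def score(m: Minion, w_latency=0.5, w_battery=0.3, w_avail=0.2) -> float:
    """Higher is better. The weights are assumptions, not values from the chapter."""
    latency_term = 1.0 / (1.0 + m.rtt_ms / 100.0)  # normalize RTT around 100 ms
    return w_latency * latency_term + w_battery * m.battery + w_avail * m.availability

def pick_minion(minions: list[Minion]) -> Minion:
    """Greedy selection: dispatch the next task to the best-scoring device."""
    return max(minions, key=score)

minions = [
    Minion("phone-a", rtt_ms=20, battery=0.4, availability=0.6),
    Minion("tablet-b", rtt_ms=80, battery=0.9, availability=0.9),
]
print(pick_minion(minions).name)
```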
{"title":"Multimedia fog computing: minions in the cloud and crowd","authors":"Cheng-Hsin Hsu, Hua-Jun Hong, Tarek Elgamal, K. Nahrstedt, N. Venkatasubramanian","doi":"10.1145/3122865.3122876","DOIUrl":"https://doi.org/10.1145/3122865.3122876","url":null,"abstract":"In cloud computing, minions refer to virtual or physical machines that carry out the actual workload. Minions in the cloud hide in faraway data centers and thus cloud computing is less friendly to multimedia applications. The fog computing paradigm pushes minions toward edge networks. We adopt a generalized definition, where minions get into end devices owned by the crowd. The serious uncertainty, such as dynamic network conditions, limited battery levels, and unpredictable minion availability in multimedia fog platforms makes them harder to be managed than cloud platforms. In this chapter, we share our experience on utilizing resources from the crowd to optimize multimedia applications. The learned lessons shed some light on the optimal design of a unified multimedia fog platform for distributed multimedia applications.","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130483581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audition for multimedia computing
G. Friedland, P. Smaragdis, Josh H. McDermott, B. Raj
DOI: 10.1145/3122865.3122868
What do the fields of robotics, human-computer interaction, AI, video retrieval, privacy, cybersecurity, Internet of Things, and big data all have in common? They all work with various sources of data: visual, textual, time stamps, links, records. But there is one source of data that has been almost completely ignored by the academic community---sound.

Our comprehension of the world relies critically on audition---the ability to perceive and interpret the sounds we hear. Sound is ubiquitous, and is a unique source of information about our environment and the events occurring in it. Just by listening, we can determine whether our child's laughter originated inside or outside our house, how far away they were when they laughed, and whether the window through which the sound passed was open or shut. The ability to derive information about the world from sound is a core aspect of perceptual intelligence.

Auditory inferences are often complex and sophisticated despite their routine occurrence. The number of possible inferences is typically not enumerable, and the final interpretation is not merely one of selection from a fixed set. And yet humans perform such inferences effortlessly, based only on sounds captured using two sensors, our ears.

Electronic devices can also "perceive" sound. Every phone and tablet has at least one microphone, as do most cameras. Any device or space can be equipped with microphones at minimal expense. Indeed, machines can not only "listen"; they have potential advantages over humans as listening devices, in that they can communicate and coordinate their experiences in ways that biological systems simply cannot. Collections of devices that can sense sound and communicate with each other could instantiate a single electronic entity that far surpasses humans in its ability to record and process information from sound.

And yet machines at present cannot truly hear. Apart from well-developed efforts to recover structure in speech and music, the state of the art in machine hearing is limited to relatively impoverished descriptions of recorded sounds: detecting occurrences of a limited pre-specified set of sound types, and their locations. Although researchers typically envision artificially intelligent agents such as robots to have human-like hearing abilities, at present the rich descriptions and inferences humans can make about sound are entirely beyond the capability of machine systems.

In this chapter, we suggest establishing the field of Computer Audition to develop the theory behind artificial systems that extract information from sound. Our objective is to enable computer systems to replicate and exceed human abilities. This chapter describes the challenges of this field.
{"title":"Audition for multimedia computing","authors":"G. Friedland, P. Smaragdis, Josh H. McDermott, B. Raj","doi":"10.1145/3122865.3122868","DOIUrl":"https://doi.org/10.1145/3122865.3122868","url":null,"abstract":"What do the fields of robotics, human-computer interaction, AI, video retrieval, privacy, cybersecurity, Internet of Things, and big data all have in common? They all work with various sources of data: visual, textual, time stamps, links, records. But there is one source of data that has been almost completely ignored by the academic community---sound. \u0000 \u0000Our comprehension of the world relies critically on audition---the ability to perceive and interpret the sounds we hear. Sound is ubiquitous, and is a unique source of information about our environment and the events occurring in it. Just by listening, we can determine whether our child's laughter originated inside or outside our house, how far away they were when they laughed, and whether the window through which the sound passed was open or shut. The ability to derive information about the world from sound is a core aspect of perceptual intelligence. \u0000 \u0000Auditory inferences are often complex and sophisticated despite their routine occurrence. The number of possible inferences is typically not enumerable, and the final interpretation is not merely one of selection from a fixed set. And yet humans perform such inferences effortlessly, based only on sounds captured using two sensors, our ears. \u0000 \u0000Electronic devices can also \"perceive\" sound. Every phone and tablet has at least one microphone, as do most cameras. Any device or space can be equipped with microphones at minimal expense. Indeed, machines can not only \"listen\"; they have potential advantages over humans as listening devices, in that they can communicate and coordinate their experiences in ways that biological systems simply cannot. Collections of devices that can sense sound and communicate with each other could instantiate a single electronic entity that far surpasses humans in its ability to record and process information from sound. \u0000 \u0000And yet machines at present cannot truly hear. Apart from well-developed efforts to recover structure in speech and music, the state of the art in machine hearing is limited to relatively impoverished descriptions of recorded sounds: detecting occurrences of a limited pre-specified set of sound types, and their locations. Although researchers typically envision artificially intelligent agents such as robots to have human-like hearing abilities, at present the rich descriptions and inferences humans can make about sound are entirely beyond the capability of machine systems. \u0000 \u0000In this chapter, we suggest establishing the field of Computer Audition to develop the theory behind artificial systems that extract information from sound. Our objective is to enable computer systems to replicate and exceed human abilities. 
This chapter describes the challenges of this field.","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123744981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient similarity search
H. Jégou
DOI: 10.1145/3122865.3122871

This chapter addresses one of the fundamental problems in multimedia systems, namely efficient similarity search over large collections of multimedia content. The problem has received a lot of attention from several research communities; in particular, it is a historical line of research in computational geometry and databases. The computer vision and multimedia communities have adopted pragmatic approaches guided by practical requirements: the large sets of features required to describe image collections make visual search a highly demanding task. As a result, early works in image indexing [Flickner et al. 1995, Fagin 1998, Beis and Lowe 1997] foresaw the interest in approximate algorithms, especially after the dissemination of methods based on local description in the 1990s, since any improvement to this indexing component improves the whole visual search system.

Among existing approximate nearest neighbor (ANN) strategies, the popular framework of Locality-Sensitive Hashing (LSH) [Indyk and Motwani 1998, Gionis et al. 1999] provides theoretical guarantees on search quality under limited assumptions on the underlying data distribution. It was first proposed [Indyk and Motwani 1998] for the Hamming and l1 spaces, and was later extended to the Euclidean/cosine cases [Charikar 2002, Datar et al. 2004] and to the earth mover's distance [Charikar 2002, Andoni and Indyk 2006]. LSH has been successfully used for local descriptors [Ke et al. 2004], 3D object indexing [Matei et al. 2006, Shakhnarovich et al. 2006], and other fields such as audio retrieval [Casey and Slaney 2007, Ryynanen and Klapuri 2008]. It has also received some attention in the context of private information retrieval [Pathak and Raj 2012, Aghasaryan et al. 2013, Furon et al. 2013].

A few years ago, approaches inspired by compression, and more specifically quantization-based approaches [Jégou et al. 2011], were shown to be a viable alternative to hashing methods and to search billion-sized datasets efficiently.

This chapter discusses these different trends and is organized as follows. Section 5.1 gives some background references and concepts, including evaluation issues. Most of the methods and variants are presented within the LSH framework; it is worth mentioning that LSH is more a concept than a particular algorithm. The search algorithms associated with LSH follow two distinct search mechanisms, the cell-probe model and sketches, which are discussed in Sections 5.2 and 5.3, respectively. Section 5.4 describes methods inspired by compression algorithms, while Section 5.5 discusses hybrid approaches combining the non-exhaustiveness of the cell-probe model with the advantages of sketches or compression-based algorithms. Metrics other than Euclidean and cosine are briefly discussed in Section 5.6.
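To make the LSH idea concrete for the cosine case [Charikar 2002]: each hash bit is the sign of a projection onto a random hyperplane, so vectors with a small angle between them collide in most bits. The following toy sketch illustrates that principle; it is not the chapter's implementation, and the dimensions and bit counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim: int, n_bits: int):
    """Random-hyperplane LSH for cosine similarity: each bit is the sign
    of the projection of x onto a random Gaussian direction."""
    planes = rng.standard_normal((n_bits, dim))
    def h(x: np.ndarray) -> tuple:
        return tuple((planes @ x > 0).astype(int))
    return h

h = make_hash(dim=64, n_bits=16)
x = rng.standard_normal(64)
y = x + 0.1 * rng.standard_normal(64)   # a near neighbor of x
z = rng.standard_normal(64)             # an unrelated vector

# Near neighbors agree on most bits; unrelated vectors agree on about half.
print(sum(a == b for a, b in zip(h(x), h(y))))
print(sum(a == b for a, b in zip(h(x), h(z))))
```

The collision probability of a single bit is 1 − θ/π for vectors at angle θ, which is what makes bucketing on such signatures a locality-sensitive scheme.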
{"title":"Efficient similarity search","authors":"H. Jégou","doi":"10.1145/3122865.3122871","DOIUrl":"https://doi.org/10.1145/3122865.3122871","url":null,"abstract":"This chapter addresses one of the fundamental problems involved in multimedia systems, namely efficient similarity search for large collections of multimedia content. This problem has received a lot of attention from various research communities. In particular, it is a historical line of research in computational geometry and databases. The computer vision and multimedia communities have adopted pragmatic approaches guided by practical requirements: the large sets of features required to describe image collections make visual search a highly demanding task. As a result, early works [Flickner et al. 1995, Fagin 1998, Beis and Lowe 1997] in image indexing have foreseen the interest in approximate algorithms, especially after the dissemination of methods based on local description in the 90s, as any improvement obtained on this indexing part improves the whole visual search system. \u0000 \u0000Among the existing approximate nearest neighbors (ANN) strategies, the popular framework of Locality-Sensitive Hashing (LSH) [Indyk and Motwani 1998, Gionis et al. 1999] provides theoretical guarantees on the search quality with limited assumptions on the underlying data distribution. It was first proposed [Indyk and Motwani 1998] for the Hamming and l1 spaces, and was later extended to the Euclidean/ cosine cases [Charikar 2002, Datar et al. 2004] or the earth mover's distance [Charikar 2002, Andoni and Indyk 2006]. LSH has been successfully used for local descriptors [Ke et al. 2004], 3D object indexing [Matei et al. 2006, Shakhnarovich et al. 2006], and other fields such as audio retrieval [Casey and Slaney 2007, Ryynanen and Klapuri 2008]. It has also received some attention in a context of private information retrieval [Pathak and Raj 2012, Aghasaryan et al. 2013, Furon et al. 2013]. \u0000 \u0000A few years ago, approaches inspired by compression and more specifically quantization-based approaches [Jǵou et al. 2011] were shown to be a viable alternative to hashing methods, and shown successful for efficiently searching in a billion-sized dataset. \u0000 \u0000This chapter discusses these different trends. It is organized as follows. Section 5.1 gives some background references and concepts, including evaluation issues. Most of the methods and variants are exposed within the LSH framework. It is worth mentioning that LSH is more of a concept than a particular algorithm. The search algorithms associated with LSH follow two distinct search mechanisms, the probe-cell model and sketches, which are discussed in Sections 5.2 and 5.3, respectively. Section 5.4 describes methods inspired by compression algorithms, while Section 5.5 discusses hybrid approaches combining the non-exhaustiveness of the cell-probe model with the advantages of sketches or compression-based algorithms. 
Other metrics than Euclidean and cosine are briefly discussed in Section 5.6.","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114157877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hawkes processes for events in social media
Marian-Andrei Rizoiu, Young Lee, Swapnil Mishra
DOI: 10.1145/3122865.3122874

This chapter provides an accessible introduction to point processes, and especially Hawkes processes, for modeling discrete, interdependent events over continuous time. We start by reviewing the definitions and key concepts of point processes. We then introduce the Hawkes process and its event intensity function, as well as schemes for event simulation and parameter estimation. We also describe a practical example drawn from social media data: we show how to model retweet cascades using a Hawkes self-exciting process. We present a design of the memory kernel, along with results on estimating parameters and predicting popularity. The code and sample event data are available in an online repository.
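As a minimal illustration of the two ingredients named above, the sketch below computes a Hawkes conditional intensity with an exponential memory kernel and simulates events by Ogata-style thinning. The parameter names (mu, alpha, beta) follow common usage and are not necessarily the chapter's notation or kernel design.

```python
import numpy as np

def intensity(t, history, mu, alpha, beta):
    """Hawkes conditional intensity with an exponential kernel:
    lambda(t) = mu + sum_{t_i < t} alpha * beta * exp(-beta * (t - t_i))."""
    past = np.asarray([ti for ti in history if ti < t])
    return mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))

def simulate(mu, alpha, beta, t_max, seed=0):
    """Thinning: draw candidate points from an upper bound on the intensity
    (valid because the exponential kernel decays between events) and accept
    each candidate with probability lambda(t) / bound."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < t_max:
        bound = intensity(t, events, mu, alpha, beta) + alpha * beta
        t += rng.exponential(1.0 / bound)
        if t < t_max and rng.uniform() * bound <= intensity(t, events, mu, alpha, beta):
            events.append(t)
    return events

# With this kernel, alpha is the branching factor (expected offspring per
# event); alpha < 1 keeps the cascade from exploding.
events = simulate(mu=0.5, alpha=0.6, beta=1.0, t_max=100.0)
print(len(events), "events simulated")
```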
{"title":"Hawkes processes for events in social media","authors":"Marian-Andrei Rizoiu, Young Lee, Swapnil Mishra","doi":"10.1145/3122865.3122874","DOIUrl":"https://doi.org/10.1145/3122865.3122874","url":null,"abstract":"This chapter provides an accessible introduction for point processes, and especially Hawkes processes, for modeling discrete, inter-dependent events over continuous time. We start by reviewing the definitions and key concepts in point processes. We then introduce the Hawkes process and its event intensity function, as well as schemes for event simulation and parameter estimation. We also describe a practical example drawn from social media data---we show how to model retweet cascades using a Hawkes self-exciting process.We present a design of the memory kernel, and results on estimating parameters and predicting popularity. The code and sample event data are available in an online repository.","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116493447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Situation recognition using multimodal data","authors":"Vivek K. Singh","doi":"10.1145/3122865.3122873","DOIUrl":"https://doi.org/10.1145/3122865.3122873","url":null,"abstract":"","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130320065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud gaming
Kuan-Ta Chen, Wei Cai, R. Shea, Chun-Ying Huang, Jiangchuan Liu, Victor C. M. Leung, Cheng-Hsin Hsu
DOI: 10.1145/3122865.3122877
{"title":"Cloud gaming","authors":"Kuan-Ta Chen, Wei Cai, R. Shea, Chun-Ying Huang, Jiangchuan Liu, Victor C. M. Leung, Cheng-Hsin Hsu","doi":"10.1145/3122865.3122877","DOIUrl":"https://doi.org/10.1145/3122865.3122877","url":null,"abstract":"","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128808785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Free-standing conversational groups" are what we call the elementary building blocks of social interactions formed in settings when people are standing and congregate in groups. The automatic detection, analysis, and tracking of such structural conversational units captured on camera poses many interesting challenges for the research community. First, although delineating these formations is strongly linked to other behavioral cues such as head and body poses, finding methods that successfully describe and exploit these links is not obvious. Second, the use of visual data is crucial, but when analyzing crowded scenes, one must account for occlusions and low-resolution images. In this regard, the use of other sensing technologies such as wearable devices can facilitate the analysis of social interactions by complementing the visual information. Yet the exploitation of multiple modalities poses other challenges in terms of data synchronization, calibration, and fusion. In this chapter, we discuss recent advances in multimodal social scene analysis, in particular for the detection of conversational groups or F-formations [Kendon 1990]. More precisely, a multimodal joint head and body pose estimator is described and compared to other recent approaches for head and body pose estimation and F-formation detection. Experimental results on the recently published SALSA dataset are reported, they evidence the long road toward a fully automated high-precision social scene analysis framework.
{"title":"Multimodal analysis of free-standing conversational groups","authors":"Xavier Alameda-Pineda, E. Ricci, N. Sebe","doi":"10.1145/3122865.3122869","DOIUrl":"https://doi.org/10.1145/3122865.3122869","url":null,"abstract":"\"Free-standing conversational groups\" are what we call the elementary building blocks of social interactions formed in settings when people are standing and congregate in groups. The automatic detection, analysis, and tracking of such structural conversational units captured on camera poses many interesting challenges for the research community. First, although delineating these formations is strongly linked to other behavioral cues such as head and body poses, finding methods that successfully describe and exploit these links is not obvious. Second, the use of visual data is crucial, but when analyzing crowded scenes, one must account for occlusions and low-resolution images. In this regard, the use of other sensing technologies such as wearable devices can facilitate the analysis of social interactions by complementing the visual information. Yet the exploitation of multiple modalities poses other challenges in terms of data synchronization, calibration, and fusion. In this chapter, we discuss recent advances in multimodal social scene analysis, in particular for the detection of conversational groups or F-formations [Kendon 1990]. More precisely, a multimodal joint head and body pose estimator is described and compared to other recent approaches for head and body pose estimation and F-formation detection. Experimental results on the recently published SALSA dataset are reported, they evidence the long road toward a fully automated high-precision social scene analysis framework.","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127149919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilizing implicit user cues for multimedia analytics","authors":"Subramanian Ramanathan, S. O. Gilani, N. Sebe","doi":"10.1145/3122865.3122875","DOIUrl":"https://doi.org/10.1145/3122865.3122875","url":null,"abstract":"","PeriodicalId":408764,"journal":{"name":"Frontiers of Multimedia Research","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127271110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}