Cortina: a system for large-scale, content-based web image retrieval
Till Quack, U. Mönich, L. Thiele, B. S. Manjunath. DOI: 10.1145/1027527.1027650

Recent advances in the processing and networking capabilities of computers have led to an accumulation of immense amounts of multimedia data such as images. One of the largest repositories for such data is the World Wide Web (WWW). We present Cortina, a large-scale image retrieval system for the WWW that handles over 3 million images to date. The system retrieves images based on visual features and collateral text. We show that a search process consisting of an initial query-by-keyword or query-by-image followed by relevance feedback on the visual appearance of the results is feasible for large-scale data sets, and that it is superior to the pure text retrieval commonly used in large-scale systems. Semantic relationships in the data are explored and exploited by data mining, and multiple feature spaces are included in the search process.
Generating 3D views of facial expressions from frontal face video based on topographic analysis
L. Yin, K. Weiss. DOI: 10.1145/1027527.1027611

In this paper, we report a newly developed 3D face modeling system that produces arbitrary expressions at a high level of detail using topographic analysis and a mesh instantiation process. Given a sequence of images of facial expressions at frontal views, we automatically generate 3D expressions at arbitrary views. Our face modeling system consists of two major components: facial surface representation using topographic analysis, and generic model individualization based on labeled surface features and surface curvatures. The realism of the generated individual model is demonstrated through 3D views of facial expressions in videos. This work targets the accurate modeling of faces and facial expressions for human-computer interaction and 3D face recognition.
Detecting image near-duplicate by stochastic attributed relational graph matching with learning
Dong-Qing Zhang, Shih-Fu Chang. DOI: 10.1145/1027527.1027730

Detecting Image Near-Duplicates (INDs) is an important problem in a variety of applications, such as copyright infringement detection and multimedia linking. Traditional image similarity models often fail to identify INDs because they cannot capture scene composition and semantics. We present a part-based image similarity measure derived from stochastic matching of Attributed Relational Graphs that represent the compositional parts and part relations of image scenes. Such a similarity model is fundamentally different from traditional approaches using low-level features or image alignment. Its advantage is the ability to accommodate spatial attributed relations and to support supervised and unsupervised learning from training data. Our experiments compare the presented model with several prior similarity models, such as color histograms and local edge descriptors; the presented model outperforms these prior approaches by a large margin.
Towards an integrated multimedia service hosting overlay
Dongyan Xu, Xuxian Jiang. DOI: 10.1145/1027527.1027545

With the proliferation of multimedia data sources on the Internet, we envision an increasing demand for value-added and function-rich multimedia services that transport, process, and analyze multimedia data on behalf of end users. More importantly, multimedia services are expected to be easily accessible and composable by users. In this paper, we propose MSODA, a service-oriented platform that hosts a wide spectrum of media services provided by different parties. From the user's point of view, MSODA is a shared "market" for media service access and composition. For a media service provider, MSODA creates a virtual dedicated environment for service deployment and management. Finally, the underlying MSODA middleware performs the key functions of service composition, configuration, and mapping for users. We discuss key challenges in the design of MSODA and present preliminary results towards its full realization.
MobShare: controlled and immediate sharing of mobile images
R. Sarvas, Mikko Viikari, Juha Pesonen, H. Nevanlinna. DOI: 10.1145/1027527.1027690

In this paper we describe the design and implementation of MobShare, a mobile phone picture sharing system that enables immediate, controlled, and organized sharing of mobile pictures, as well as the browsing, combining, and discussion of the shared pictures. The design combines research on photography, personal image management, mobile phone camera use, and mobile picture publishing with an interview study we conducted on mobile phone camera users. The system is based on a client-server architecture and uses current mobile phone and web technology. The implementation presents novel solutions for immediately sharing mobile images to an organized web album and for providing full control over whom the images are shared with. We also describe new ways of promoting discussion around shared images and of enabling the combination and comparison of personal and shared pictures. The system demonstrates that the designed solutions can be implemented with current technology, and it provides novel approaches to general issues in sharing digital images.
Automatic pan control system for broadcasting ball games based on audience's face direction
Shinji Daigo, S. Ozawa. DOI: 10.1145/1027527.1027634

We propose an automatic pan control system for broadcasting ball games that tracks the face direction of the audience. Presuming that the audience's faces are directed toward a notable play, we can shoot a broadcast video by panning the camera toward the direction the audience is facing. In our method, a court sensor, which detects the rough region where players are located, is used in addition to the face sensor to obtain higher accuracy. Based on these sensors, the broadcast video is generated off-line by cylindrical mosaicing. We conducted objective and subjective evaluation experiments, and the results show that our approach outperforms previously proposed methods in terms of tracking stability and ease of viewing. We conclude that our method is as effective as the optimal camerawork.
Disruption-tolerant content-aware video streaming
Tiecheng Liu, Srihari Nelakuditi. DOI: 10.1145/1027527.1027627

Communication between a pair of nodes in a network may be disrupted by link or node failures, resulting in zero effective bandwidth between them during the recovery period. It has been observed that such disruptions are not uncommon and may last from tens of seconds to minutes. Even an occasional disruption can drastically degrade the viewing experience of a participant in a video streaming session, particularly when a sequence of frames central to the story is lost. The conventional approach of prefetching video frames and patching lost ones with retransmissions is not always viable when disruptions are localized and experienced by only a few among many receivers. Error-spreading approaches that distribute the losses across the video work well only when the disruptions are quite short. As a better alternative, we propose a disruption-tolerant, content-aware video streaming approach that combines content summarization and error spreading to enhance the viewer's experience even when disruptions are long. We introduce the notion of "substitutable content summary frames" and provide a method to select these frames, as well as their transmission order, to mitigate the impact of a disruption. In the event of a disruption, the already received summary frames are played by the client, and near-normal playback resumes after the disruption. We evaluate our approach and demonstrate that it provides an acceptable viewing experience with minimal startup latency and client buffering.
Possibilities and limitations of immersive free-hand expression: a case study with professional artists
Wille Mäkelä, M. Reunanen, T. Takala. DOI: 10.1145/1027527.1027649

We have studied the usability and artistic potential of an immersive 3D painting system in its early state. The system allows one to draw lines, meshes, and particle clouds using a one-handed wand in a virtual room with a stereoscopic display. In its more mature state, the software will allow two-handed interaction with new interaction devices. Ten professional artists each participated in a two-day test, performing both given tasks and free artistic sketching. Their experiences were collected through observation and interviews. At this stage, we found that common technical limitations of virtual environments, such as latency, tracking inaccuracy, and the clumsiness of the hardware devices, may considerably hinder handicraft work. On the other hand, every single participant felt that immersion offers new potential for artistic expression and was definitely willing to continue in the second phase of the test later this year.
Picture quality improvement in MPEG-4 video coding using simple adaptive filter
Kee-Koo Kwon, Sung-Ho Im, Dong-Sun Lim. DOI: 10.1145/1027527.1027592

In this paper, we propose a novel post-filtering algorithm with low computational complexity that improves the visual quality of decoded images using block boundary classification and a simple adaptive filter (SAF). First, each side of a block boundary is classified as a smooth or complex sub-region. For smooth-smooth sub-regions, the existence of blocking artifacts is determined using a blocky-strength measure. Simple adaptive filtering is then applied at each block boundary. The proposed method filters adaptively: a nonlinear 1-D 8-tap filter is applied to smooth-smooth sub-regions with blocking artifacts; for smooth-complex or complex-smooth sub-regions, a nonlinear 1-D variant filter is applied to the block boundary pixels to reduce blocking and ringing artifacts; and for complex-complex sub-regions, a nonlinear 1-D 2-tap filter is applied only to the two block boundary pixels so as to preserve image details. Experimental results show that the proposed algorithm produces better results than conventional algorithms from both subjective and objective viewpoints.
Networked multimedia event exploration
Preetha Appan, H. Sundaram. DOI: 10.1145/1027527.1027536

This paper describes a novel, interactive multimodal framework that enables a network of friends to effectively visualize and browse a shared image collection. The framework is particularly useful for geographically separated friends who want to share experiences. Our solution involves three components: (a) an event model, (b) three new spatio-temporal event exploration schemes, and (c) a novel technique for summarizing user interaction. We develop a simple multimedia event model that additionally incorporates the idea of user viewpoints, along with new dissimilarity measures between events that incorporate user context. We develop three task-driven event exploration environments: (a) spatio-temporal evolution, (b) event cones, and (c) viewpoint-centric interaction. An original contribution of this paper is summarizing the user interaction within an interactive framework. We conjecture that an interactive summary helps users recall the original content better than a static image-based summary. Our user studies indicate that the exploratory environment performs very well.