{"title":"Description of audiovisual virtual 3D scenes: MPEG-4 perceptual parameters in the auditory domain","authors":"A. Dantele, U. Reiter","doi":"10.1109/ISCE.2004.1375910","DOIUrl":null,"url":null,"abstract":"A high level of immei-siwi cart he provided f o r the user of virtrial irudiovis~rul erivirorimeiits when sorrnd and visuirl irnlwessiori get coordinated on 11 high quality level. Tlierefore. a coniprehensive scene rlescription Iangrruge is rieeiieil for both. the auditory and the visital purt. The mrrltiriredia sr(rfiilard MPEG-4 provides a powerfiil tool-set f o r the sceiie decription of 2D ond SD virtiiiil environments. fiir the undio part, apart from a coriventiorial yhysicul description, u novel approach is available which is based on perceptual pnrameters which hove been derived from psycho-acoustic e.rperirnents. The practical qualijcarion of this method is discussed when applied to auditory und audiovisual 30 scenes. Enhancements of-e proposed to an cxample application of the perceptrial upproacli which is included in the MPEG-4 stanrlurd arid an implementution f o r 30 rrrrdio rendering is introduced. Index Terms Auditory Scene Description, MPEG-4, Perceptual Parameters, Virtual Acoustics 1. AUDIOVISUAL SCENE DESCRIPTION I MPEG-4 Moving Picture Expens Group T E P E G ) has established novel approaches for the coding of multimedia content in the international standard MPEG-4. Auditory, visual and other content is subdivided into media objects which together build a 2D or 3D scene. Thus the most efficient coding scheme for each object can be chosen according to its type of media, e.g. video, audio, graphics, etc. [I]. For the combination of the objects MPEG-4 provides a powerful tool-set for scene description, the so-called BIFS (Binary Format for Scene Description) [ 2 ] . Here all the elements describing media objects and their properties are put together as nodes in a scene graph. 
The resulting structure reflects the mutual dependency of the single objects. This concept is based on the scene graph of the Virtual Reality Modeling Language (VRML) standard [3]. The audio part of this scene description (AudioBIFSj allows to specify the behavior of sound emitting objects in the scene (e.g. their position, level, directivity). These basic fuuctionalities have been extended in version 2 of AudioBIFS where new nodes. mainly for virtual acoustics in a 3D ‘This work was conductcd in the research group IAVAS (Intcmztivu AudioVirual Application Systmmr) which i s funded by lhr Thuringim Minisuy 01 Scicnce. Resrmh and thc Ans. Erlun. Germany. Andrcns Dnnlele and Ulnch Rcitcr are with 1hr Institute of Media Technology at 1he Technischc Univcrsicil Ilmcnau. 0.98614 Ilmenau. Germany (e-mail: andruas.dantrIr@lu-iImennu.dc. uhch.ruitcr@tuilnlmnu.duJ. environment, have been added [4]. These are often referred to as Advanced AudioBlFS (AABIFS) and are of main interest for the work described here. In general, the auralization of virtual scenes not only has to reproduce sound sources which are placed in the scenery but also to add ambient sound effects like reverberation. Thus the user can feel the surrounding virtual space by listening to the acoustic cues. The auditory impression for a listener in a real room can he described thoroughly by an impulse response. This represents the acoustic transfer function for a given pair of source and listener at their particular locations in a certain room [5]. When convolving a unreverberated source signal with an impulse response. the output yields a reverberated sound signal. The result sounds as if perceived at the specific location in the room, where the impulse response has been recorded. 
The characteristic features of any impulse response can he extracted from the temporal distribution and the energy content of its characteristic components, which are the direct sound, the early reflections, and the late reverberation. For a virtual space the task of auralization can he defined as the synthesis of these components in order to model an artificial impulse response. The main problem is to derive a suitable parameter set from virtual scene description which can he used to synthesize a desired impulse response. When using Advanced AudioBIFS the author of a virtual scene can choose among two different approaches of auditoly modelling: ( I ) Based on physical properties like frequency dependent directivity of sound sources and absorption coefficients of material, it is possible to specify what the resulting sound impression should he like. ( 2 ) Another approach is based on psychoacoustic parameters which express the acoustic sensation perceived by the user. Therefore, a set of perceptual parameters which iue based on psycho-experimental research has been introduced 161. 11. THE PERCEPTUAL APPROACH OF MPEG-4 This approach is quite challenging, since the scope is shifted towards the user and thus to the perception of the human senses. Therefore, parameters have to he found which satisfy human needs and which can easily he explained and understood, even with little theoretical knowledge of sound propagation or electroacoustics. Although the field of auditory scene analysis in psychoacoustics is well elaborated, a widely spread language of subjective attributes has not emerged from it yet (but the discussion is going on, e.g. [7]). Nevertheless, the underlying technique of MPEG-4 perceptual approach is already used, e.g. for creating a virtual acoustic space [XI. 
We want to take a closer look at the perceptual approach given in the MPEG-4 standard, because it can be useful 0-7803-8526-81041S20.00 02004 IEEE 87 especially for audiovisual environments: the reproduction of audiovisual applications is often exhausting in terms of processing power, and the visual and graphic p m consumes most of it. This is especially true for interactive systems, which have to react in real-time to every new user demand. So for the rendering of the auditory part it is sometimes not possible to render a very detailed description of frequency dependent behavior of sound sources or of reflections within virtual acoustics. Thus. instead of a definite and thus measurable description which can not be satisfied, a perceptual description with emphasis on the subjective quality of a perceived acoustic sensation seems to be more appropriate. A. Peir.eptua1 Parameters In the MPEG-4 perceptual approach a set of nine parameters has been chosen which should enable the author of an auditory scene to thoroughly describe the acoustic impression of a sound event. These are high level parameters which are based on human perception. As will be explained in the following, they are related to objective criteria which correspond to the characteristic features of an impulse response. The perceptual parameters can be divided into three groups: I ) Source-related attributes: 2) Room-related attributes: 3) SourceRoom interaction: SowcePresence, Sourre Warmth, and SourceBrilliance DlteReverberance, Heaviness, Liveness RoomPresencc. RunningReverl,erance, Envelopment The first group covers propelties which are directly connected to the source and thus to the impression of the direct sound this includes the amount of energy related to the direct sound (SonrcePresence. which gives the listener a clue about distance and directivity of the source) as well as the early amount of energy in low (SoarceWarmth) and high (SorrrceBrilliance) frequency bands. 
In the second group features of the surrounding acoustical space are put together. which cover the damping propenies during the late reverberation (LnteReverberance) and the relative damping properties for low (Heuvinrss) and high (Livencss) frequency bands. The third group contains parameters describing the behavior of a sound source within the room: the distribution of late energy in the room (RoomPrcsence), the early decay time (RuiiriingReverbernnce), and the early energy distribution within the room in relation to the direct sound (Enwlopment).","PeriodicalId":169376,"journal":{"name":"IEEE International Symposium on Consumer Electronics, 2004","volume":"30 4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE International Symposium on Consumer Electronics, 2004","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCE.2004.1375910","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
A high level of immersion can be provided for the user of virtual audiovisual environments when sound and visual impression are coordinated at a high quality level. Therefore, a comprehensive scene description language is needed for both the auditory and the visual part. The multimedia standard MPEG-4 provides a powerful tool-set for the scene description of 2D and 3D virtual environments. For the audio part, apart from a conventional physical description, a novel approach is available which is based on perceptual parameters derived from psycho-acoustic experiments. The practical qualification of this method is discussed when applied to auditory and audiovisual 3D scenes. Enhancements are proposed to an example application of the perceptual approach which is included in the MPEG-4 standard, and an implementation for 3D audio rendering is introduced.

Index Terms: Auditory Scene Description, MPEG-4, Perceptual Parameters, Virtual Acoustics

I. AUDIOVISUAL SCENE DESCRIPTION IN MPEG-4

The Moving Picture Experts Group (MPEG) has established novel approaches for the coding of multimedia content in the international standard MPEG-4. Auditory, visual and other content is subdivided into media objects which together build a 2D or 3D scene. Thus the most efficient coding scheme for each object can be chosen according to its type of media, e.g. video, audio, graphics, etc. [1]. For the combination of the objects MPEG-4 provides a powerful tool-set for scene description, the so-called BIFS (Binary Format for Scene Description) [2]. Here all the elements describing media objects and their properties are put together as nodes in a scene graph. The resulting structure reflects the mutual dependency of the single objects. This concept is based on the scene graph of the Virtual Reality Modeling Language (VRML) standard [3].
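The scene-graph idea can be sketched in a few lines of Python. This is an illustrative data structure only: the `Node` class, its field names, and the example node kinds are assumptions for the sketch, not the actual BIFS node set.

```python
# Minimal scene-graph sketch in the spirit of BIFS/VRML: media objects are
# nodes with properties, composed into a tree that reflects their mutual
# dependencies. Node kinds and field names here are illustrative only.
class Node:
    def __init__(self, kind, **fields):
        self.kind = kind          # e.g. "Group", "Shape", "AudioSource"
        self.fields = fields      # node properties (position, level, ...)
        self.children = []        # child nodes grouped under this node

    def add(self, child):
        self.children.append(child)
        return child

# A tiny audiovisual scene: one visual object and one sound-emitting object
# grouped under a common root.
root = Node("Group")
root.add(Node("Shape", geometry="box"))
root.add(Node("AudioSource", position=(1.0, 0.0, 2.0), level=0.8))
```

Moving an object then means changing one node's fields; everything grouped below it follows, which is exactly what the tree structure buys over a flat object list.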
The audio part of this scene description (AudioBIFS) allows the author to specify the behavior of sound-emitting objects in the scene (e.g. their position, level, directivity). These basic functionalities have been extended in version 2 of AudioBIFS, where new nodes, mainly for virtual acoustics in a 3D environment, have been added [4]. These are often referred to as Advanced AudioBIFS (AABIFS) and are of main interest for the work described here. (This work was conducted in the research group IAVAS (Interactive AudioVisual Application Systems), which is funded by the Thuringian Ministry of Science, Research and the Arts, Erfurt, Germany. Andreas Dantele and Ulrich Reiter are with the Institute of Media Technology at the Technische Universität Ilmenau, D-98614 Ilmenau, Germany; e-mail: andreas.dantele@tu-ilmenau.de, ulrich.reiter@tu-ilmenau.de.)

In general, the auralization of virtual scenes not only has to reproduce sound sources which are placed in the scenery but also to add ambient sound effects like reverberation. Thus the user can feel the surrounding virtual space by listening to the acoustic cues. The auditory impression for a listener in a real room can be described thoroughly by an impulse response. This represents the acoustic transfer function for a given pair of source and listener at their particular locations in a certain room [5]. When an unreverberated source signal is convolved with an impulse response, the output is a reverberated sound signal. The result sounds as if perceived at the specific location in the room where the impulse response has been recorded. The characteristic features of any impulse response can be extracted from the temporal distribution and the energy content of its characteristic components, which are the direct sound, the early reflections, and the late reverberation. For a virtual space the task of auralization can be defined as the synthesis of these components in order to model an artificial impulse response.
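The convolution step described above can be sketched numerically. This is a toy example, not tied to any MPEG-4 tooling: the sample rate, delays, and amplitudes of the synthetic impulse response are invented for illustration, but its three components mirror the ones named in the text (direct sound, early reflections, late reverberation).

```python
import numpy as np

fs = 1000  # toy sample rate in Hz; real audio would use 44.1 kHz or more

# "Unreverberated" source signal: a single unit impulse (a click).
dry = np.zeros(fs)
dry[0] = 1.0

# Synthetic impulse response with the three characteristic components:
ir = np.zeros(fs // 2)
ir[0] = 1.0                               # direct sound
ir[50] = 0.5                              # early reflection at 50 ms
ir[90] = 0.3                              # early reflection at 90 ms
t = np.arange(100, len(ir))
ir[t] = 0.1 * np.exp(-(t - 100) / 80.0)   # decaying late reverberation

# Convolving the dry signal with the impulse response yields the
# reverberated signal as heard at the listener position.
wet = np.convolve(dry, ir)
```

Because the dry signal is a unit impulse here, the output simply reproduces the impulse response; with a real recording as `dry`, the same line produces the signal "as perceived at the location where the impulse response was recorded".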
The main problem is to derive a suitable parameter set from the virtual scene description which can be used to synthesize a desired impulse response. When using Advanced AudioBIFS the author of a virtual scene can choose between two different approaches of auditory modelling: (1) Based on physical properties like the frequency-dependent directivity of sound sources and the absorption coefficients of materials, it is possible to specify what the resulting sound impression should be like. (2) Another approach is based on psychoacoustic parameters which express the acoustic sensation perceived by the user. For this purpose, a set of perceptual parameters based on psycho-experimental research has been introduced [6].

II. THE PERCEPTUAL APPROACH OF MPEG-4

This approach is quite challenging, since the scope is shifted towards the user and thus to the perception of the human senses. Therefore, parameters have to be found which satisfy human needs and which can easily be explained and understood, even with little theoretical knowledge of sound propagation or electroacoustics. Although the field of auditory scene analysis in psychoacoustics is well elaborated, a widely spread language of subjective attributes has not emerged from it yet (but the discussion is going on, e.g. [7]). Nevertheless, the underlying technique of the MPEG-4 perceptual approach is already used, e.g. for creating a virtual acoustic space [8]. We want to take a closer look at the perceptual approach given in the MPEG-4 standard, because it can be useful especially for audiovisual environments: the reproduction of audiovisual applications is often exhausting in terms of processing power, and the visual and graphic part consumes most of it. This is especially true for interactive systems, which have to react in real time to every new user demand.
So for the rendering of the auditory part it is sometimes not possible to render a very detailed description of the frequency-dependent behavior of sound sources or of reflections within virtual acoustics. Thus, instead of a definite and thus measurable description which cannot be satisfied, a perceptual description with emphasis on the subjective quality of a perceived acoustic sensation seems to be more appropriate.

A. Perceptual Parameters

In the MPEG-4 perceptual approach a set of nine parameters has been chosen which should enable the author of an auditory scene to thoroughly describe the acoustic impression of a sound event. These are high-level parameters which are based on human perception. As will be explained in the following, they are related to objective criteria which correspond to the characteristic features of an impulse response. The perceptual parameters can be divided into three groups:

1) Source-related attributes: SourcePresence, SourceWarmth, and SourceBrilliance
2) Room-related attributes: LateReverberance, Heaviness, and Liveness
3) Source/Room interaction: RoomPresence, RunningReverberance, and Envelopment

The first group covers properties which are directly connected to the source and thus to the impression of the direct sound. This includes the amount of energy related to the direct sound (SourcePresence, which gives the listener a clue about the distance and directivity of the source) as well as the early amount of energy in low (SourceWarmth) and high (SourceBrilliance) frequency bands. In the second group, features of the surrounding acoustical space are put together, which cover the damping properties during the late reverberation (LateReverberance) and the relative damping properties for low (Heaviness) and high (Liveness) frequency bands.
The third group contains parameters describing the behavior of a sound source within the room: the distribution of late energy in the room (RoomPresence), the early decay time (RunningReverberance), and the early energy distribution within the room in relation to the direct sound (Envelopment).
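The nine parameters and their three-group structure can be captured in a small container. The parameter names follow the standard as listed above; the Python class itself, its field types, and the grouping helper are assumptions made for this sketch, not an interface defined by MPEG-4.

```python
from dataclasses import dataclass

@dataclass
class PerceptualParams:
    """Hypothetical container for the nine MPEG-4 perceptual parameters."""
    # Source-related: direct-sound energy and its low/high spectral balance
    source_presence: float
    source_warmth: float
    source_brilliance: float
    # Room-related: late-reverberation damping and its low/high variants
    late_reverberance: float
    heaviness: float
    liveness: float
    # Source/room interaction: late energy, early decay, early envelopment
    room_presence: float
    running_reverberance: float
    envelopment: float

    def groups(self):
        """Return the parameters in the three groups described in the text."""
        return {
            "source": (self.source_presence, self.source_warmth,
                       self.source_brilliance),
            "room": (self.late_reverberance, self.heaviness, self.liveness),
            "interaction": (self.room_presence, self.running_reverberance,
                            self.envelopment),
        }

params = PerceptualParams(1.0, 0.5, 0.4, 1.2, 0.6, 0.7, 0.9, 0.8, 0.3)
```

A renderer working from such a description would map each group back onto the corresponding components of the synthetic impulse response: the source group onto the direct sound, the room group onto the late reverberation, and the interaction group onto the early/late energy distribution.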