UsdAudio Proposal

Copyright (C) 2019, Pixar Animation Studios
Revision 1.1


The goal of this proposal is to define a standard schema for adding audio support to USD.  The schema must support basic, timed-start audio playback, as well as the ability to identify sounds as having well-defined locations (and orientations) in space. The primary purpose of the new schema is to provide an interchange solution for audio elements embedded in USD scenes and animations. Because interchange is the main goal, we want to keep the schema simple, providing only what we believe are the minimal requirements for playing audio elements in the scene. The schema is not intended as a document format for an audio editing tool, since more complex operations on sound elements are not supported.

The addition of such a schema may lead to thoughts about describing interactive environments in USD, including such features as triggers for sounds.  While encoding triggers for sounds and other actions may be a worthwhile area of future investigation, we defer any consideration of triggers or interactive environments for now, in the interests of simplicity and resources.

Initial Requirements

  • The USD schema should not directly contain any audio data but should refer to an audio file or stream via an assetPath
  • The USD schema should not directly define the supported audio file formats, but see Usdz Considerations below for usdz-related restrictions 
  • Any number of audio elements should be allowed in a USD stage
    • Each audio element is consumed as a whole: the USD schema does not provide for isolating specific sub-channels or tracks within an audio asset.
  • The same audio file should be referenceable by multiple USD primitives
  • Audio elements should be positionable (and orientable) in space with a given 3D transformation that can change over time
    • Audio elements should also be able to represent "ambient" sound where the mono or stereo sound is played regardless of spatial positioning
  • Audio should be able to be amplified/attenuated over time
    • This does not provide any absolute measure of sound pressure (i.e. dB)
    • Each audio element should be able to be amplified/attenuated independently. At a later date we may introduce a "grouping" feature to modulate multiple audio streams at once
    • There is no explicit support for "fadein" or "fadeout". If fading is not embedded directly in the referenced audio streams, it can be encoded via time-varying attenuation
  • The start of playback within a UsdStage's time-range should be independently specifiable for each audio element, as should an optional endpoint.
  • It should be possible to specify an offset into each referenced audio file from which playback should begin
  • It should be possible to loop the audio stream easily, if desired
  • Each audio stream should be able to have its own arbitrary sample rate. It should be the responsibility of the playback tool to properly determine and play the sample rate specified in a given stream
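
The fading requirement above says that fades can be encoded as time-varying attenuation. As a rough illustration only, the Python sketch below evaluates such an attenuation envelope at a given time code. It assumes linear interpolation between authored samples, with the first and last values held outside the sampled range. The function name evaluate_envelope and the sample values are hypothetical and not part of the proposal.

```python
from bisect import bisect_right


def evaluate_envelope(samples, time_code):
    """Evaluate a {timeCode: multiplier} envelope at time_code.

    Assumes linear interpolation between samples and held values
    before the first and after the last sample.
    """
    times = sorted(samples)
    if not times:
        raise ValueError("envelope needs at least one sample")
    if time_code <= times[0]:
        return samples[times[0]]
    if time_code >= times[-1]:
        return samples[times[-1]]
    # Index of the first sample strictly after time_code.
    i = bisect_right(times, time_code)
    t0, t1 = times[i - 1], times[i]
    frac = (time_code - t0) / (t1 - t0)
    return samples[t0] + frac * (samples[t1] - samples[t0])


# Hypothetical one-second fade-in at 24 time codes per second.
fade_in = {1.0: 0.0, 25.0: 1.0}
halfway = evaluate_envelope(fade_in, 13.0)  # 0.5
```

Two samples are enough to describe a linear fade under these assumptions. A fade-out is the same envelope with the multiplier values swapped.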

Proposed Prim Schema


The SpatialAudio schema defines basic properties for encoding playback of an audio file or stream within a USD Stage. The SpatialAudio schema derives from UsdGeomXformable since it can support full spatial audio, while also supporting non-spatial mono and stereo sounds. One or more SpatialAudio prims can be placed anywhere in the namespace, though it is advantageous to place truly spatial audio prims under/inside the models from which the sound emanates, so that the audio prim need only be transformed relative to the model, rather than copying its animation.

Below we define the builtin properties of the schema, with their fallback values.  Note that the only non-uniform (i.e. animatable) property is gain.


  • uniform asset filePath = @@
    Path to the audio file
  • uniform token auralMode = "spatial"
    Determines how audio should be consumed. Allowed values:

    spatial : Play the audio in 3D space if the device can support spatial audio. If not, fall back to mono.
    nonSpatial  : Play the audio without regard to the SpatialAudio prim's position.  If the audio media contains any form of stereo or other multi-channel sound, it is left to the application to determine whether the listener's position should be taken into account.  We expect nonSpatial to be the choice for ambient sounds and music sound-tracks.

    For now, we consider oriented and spread-limited sounds, which are supported by major game engines, to be beyond scope.  However, in anticipation of future support for these features, and because authoring tools need to author transformations for SpatialAudio prims now, we stipulate that the emission direction for SpatialAudio prims is the -Z axis, which matches the orientation of directional lights in UsdLux.
  • uniform timeCode startTime = 0
    Expressed in the timeCodesPerSecond of the containing stage, startTime specifies when the audio stream will start playing during animation playback.  Note that startTime is expressed as an SdfTimeCode (new as of USD release 19.11) so that the stage can properly apply SdfLayerOffsets when resolving its value.
  • uniform timeCode endTime = 0
    Expressed in the timeCodesPerSecond of the containing stage, endTime specifies when the audio stream will cease playing during animation playback, if the referenced audio clip is longer than desired.  If endTime is less than or equal to startTime, then play the audio stream to its completion.  Note that endTime is expressed as an SdfTimeCode (new as of USD release 19.11) so that the stage can properly apply SdfLayerOffsets when resolving its value.
  • uniform double mediaOffset = 0
    Expressed in seconds, mediaOffset specifies the offset from the referenced audio file's beginning at which we should begin playback when stage playback reaches the prim's startTime.
  • uniform token playbackMode = "once"
    For the fallback value of "once", play the referenced audio once, beginning at startTime and continuing until endTime if endTime is greater than startTime, otherwise until the audio completes.  For any value other than "once", loop the audio file to fill time, as described below.  If specified, mediaOffset is applied only to the first run-through of the audio clip; the second and all subsequent loops begin from the start of the clip.  In the future we may add attributes that identify cut-points within the referenced media, so that a subsection of the clip can be looped continuously, but for now we choose to start more simply.  Non-fallback values of playbackMode determine when looping should begin and end:

    once : Play the audio once, from startTime to endTime (or to the end of the audio if endTime is not greater than startTime).
    loopFromStage : Loop between the containing stage's authored startTimeCode and endTimeCode.  This can be useful for ambient sounds that should always be active.
    loopFromStartToEnd : Loop between the authored startTime and endTime.

  • double gain = 1.0
    Multiplier on the incoming audio signal. A value of 0 "mutes" the signal. Negative values will be clamped to 0.  Although gain is commonly expressed in dB in the audio world, that formulation would require an extra animatable mute property, since dB can only express ratios of signal and cannot express an absolute scale factor of zero, except theoretically as -∞ dB.  Further, given the intended use of SpatialAudio for content delivery in usdz assets, commonality with the Web Audio API, which also expresses gain as a linear multiplier, seems relevant.
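
To make the relationship between the linear gain multiplier and decibels concrete, here is a small Python sketch. It assumes the usual amplitude convention of 20·log10(gain); the proposal does not state a conversion, so this is an assumption. All function names are hypothetical. The sketch also applies the clamping of negative values described above.

```python
import math


def effective_gain(authored_gain):
    """Clamp negative authored values to 0, as the gain description requires."""
    return max(0.0, authored_gain)


def db_to_gain(db):
    """Convert decibels to a linear multiplier (amplitude convention, assumed)."""
    return 10.0 ** (db / 20.0)


def gain_to_db(gain):
    """Convert a linear multiplier to decibels; a muted signal maps to -inf."""
    gain = effective_gain(gain)
    return -math.inf if gain == 0.0 else 20.0 * math.log10(gain)
```

Under this convention a gain of 1.0 corresponds to 0 dB, and a gain of 0 has no finite dB value. That is why a dB-based property would need a separate mute control.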

USD Sample

Here is an example of two different kinds of sounds encoded in USD with the SpatialAudio schema.

#usda 1.0
(
    upAxis = "Z"
    endTimeCode = 200
    startTimeCode = 1
    timeCodesPerSecond = 24
)

def Xform "Sounds"
{
    def SpatialAudio "AmbientSound"
    {
        # We need not encode startTime, mediaOffset, or gain as the fallback
        # values suffice for ambient sound.  Playback will begin at timeCode 1
        uniform asset filePath       = @AmbientSound.mp3@
        uniform token auralMode      = "nonSpatial"
        uniform token playbackMode   = "loopFromStage"
    }

    def SpatialAudio "WoodysVoice"
    {
        # SpatialAudio xform.  This prim might typically be located
        # as a child of the "Woody" model so that its location need
        # only be specified relative to Woody, rather than replicating
        # Woody's animation
        double3 xformOp:translate     = (3.0, -3.0, 2)
        uniform token[] xformOpOrder  = ["xformOp:translate"]

        # SpatialAudio Properties.  We have not authored endTime, so its
        # fallback value of zero will cause the sound to play to completion
        uniform asset  filePath       = @WoodysVoice.mp3@
        uniform token  auralMode      = "spatial"
        uniform timeCode startTime    = 65.0
        # Skip the first third of a second in WoodysVoice.mp3
        uniform double mediaOffset    = 0.33333333333
    }
}
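
As a rough illustration of how a playback tool might combine startTime, endTime, mediaOffset, and playbackMode, the Python sketch below maps a stage time code to a position (in seconds) within the audio file, using the values from the sample above. It is only one reading of the property descriptions, not a reference implementation. The function name media_position, the clip durations, and the stage_range parameter are hypothetical. The sketch also assumes that the end of a playback window is exclusive and that a loop with no valid end continues indefinitely; the proposal does not specify either point.

```python
def media_position(time_code, *, media_duration, start_time=0.0, end_time=0.0,
                   media_offset=0.0, playback_mode="once",
                   time_codes_per_second=24.0, stage_range=None):
    """Return seconds into the audio file at time_code, or None when silent."""
    if playback_mode == "loopFromStage":
        if stage_range is None:
            raise ValueError("loopFromStage needs the stage's (start, end) time codes")
        window_start, window_end = stage_range
    else:
        window_start, window_end = start_time, end_time

    # An end that is not after the start means "no explicit end".
    bounded = window_end > window_start
    if time_code < window_start or (bounded and time_code >= window_end):
        return None

    # Convert elapsed time codes to seconds of playback.
    elapsed = (time_code - window_start) / time_codes_per_second
    # mediaOffset shortens only the first run-through of the clip.
    first_pass = media_duration - media_offset
    if elapsed < first_pass:
        return media_offset + elapsed
    if playback_mode == "once":
        return None  # the clip has played to completion
    # Later loops restart from the beginning of the clip.
    return (elapsed - first_pass) % media_duration


# Values from the sample; the clip durations (in seconds) are assumed.
woody = dict(media_duration=5.0, start_time=65.0, media_offset=0.33333333333)
ambient = dict(media_duration=3.0, playback_mode="loopFromStage",
               stage_range=(1.0, 200.0))

media_position(64.0, **woody)    # None: WoodysVoice has not started yet
media_position(97.0, **ambient)  # 1.0: one second into the second loop
```

With timeCodesPerSecond = 24, WoodysVoice starts at time code 65 already a third of a second into its file. The ambient clip restarts from zero on every loop because it has no mediaOffset.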

Other Notes/Questions

Extent/Falloff for Ambient Sounds

Ambient sounds may be associated with sub-sections (levels, areas, etc.) of a scene.  A common way to represent such location-based effects is with a parameterized spatial attenuation curve.  We may consider adding such controls in the future, but leave them out for now in the interest of simplicity.

Usdz Considerations

In general, the formats allowed for audio files are no more constrained by USD than image formats are.  As with images, however, usdz has stricter requirements based on DMA and format support in browsers and consumer devices.  We propose that the allowed audio filetypes for usdz be M4A, MP3, and WAV (in order of preference); the usdz specification has been updated accordingly.
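
A packaging tool could check the proposed restriction along the following lines. This Python sketch assumes that the file type can be inferred from the file extension, which the proposal does not state. The names USDZ_AUDIO_EXTENSIONS and is_usdz_audio are hypothetical.

```python
from pathlib import PurePosixPath

# Proposed usdz audio file types, in order of preference.
USDZ_AUDIO_EXTENSIONS = (".m4a", ".mp3", ".wav")


def is_usdz_audio(file_path):
    """Return True if the path's extension is one of the proposed usdz types."""
    return PurePosixPath(file_path).suffix.lower() in USDZ_AUDIO_EXTENSIONS
```

A stricter check would inspect the file contents rather than the name, since an extension alone does not guarantee the encoding.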

Reference Implementation

For the majority of 3D-graphics-related features added to USD, we strive to provide computation APIs (if relevant) and reference imaging support in Hydra and one or more of its renderers.  The SpatialAudio schema, by contrast, represents "external" data, for which we have neither an existing in-house solution that could be easily adapted nor existing workflows that would exercise it (spatial audio, especially).  Therefore it may be some time before we are able to provide a reference implementation.

A nearer-term, graphical representation of SpatialAudio prims in Hydra viewports seems reasonable: SpatialAudio prims would have a fallback purpose of "guide", with a simple, oriented icon/card representation indicating the position and orientation of the emitter.

