UsdAudio Proposal

Note

This proposal has been implemented. This document exists for historical reference and may be outdated. For up-to-date documentation, see the UsdMediaSpatialAudio page.

Goal 

The goal of this proposal is to define a standard schema for adding audio support to USD. The schema must support basic, timed-start audio playback, as well as the ability to identify sounds as having well-defined locations (and orientations) in space. The primary purpose of the new schema is to provide an interchange solution to audio elements embedded in USD scenes and animations. Given that interchange is the main goal, we want to keep the following schema simple, providing what we believe are the minimal requirements to play audio elements in the scene. This schema is not intended as a document format for an audio editing tool since more complex operations on sound elements are not supported. The addition of such a schema may lead to thoughts about describing interactive environments in USD, including such features as triggers for sounds. While encoding triggers for sounds and other actions may be a worthwhile area of future investigation, we defer any consideration of triggers or interactive environments for now in the interests of simplicity and resources .

Initial Requirements 

The USD schema should not directly contain any audio data but should refer to an audio file or stream via an assetPath

The USD schema should not directly define the supported audio file formats, but see Usdz Considerations below for usdz-related restrictions

Any number of audio elements should be allowed in a USD stage

Each audio element is consumed as a whole: the USD schema does not provide for isolating specific sub-channels or tracks within an audio asset.

The same audio file should be referenceable by multiple USD primitives

Audio elements should be positionable (and orientable) in space with a given 3D transformation that can change over time

Audio elements should also be able to represent “ambient” sound where the mono or stereo sound is played regardless of spatial positioning

Audio should be able to be amplified/attenuated over time

This does not provide any absolute measure of sound pressure (i.e. dB)

Each audio element should be able to be amplified/attenuated independently. At a later date we may introduce a “grouping” feature to modulate multiple audio streams at once

No explicit support for “fadein” or “fadeout”, which could be encoded via time-varying attenuation, should fading not be embedded directly in the audio streams being referenced

The start of playback within a UsdStage’s time-range should be independently specifiable for each audio element, as should an optional endpoint.

Should be able to specify an offset into each referenced audio file from which playback should begin

Should be able to easily loop the audio stream, if desired

Each audio stream should be able to have its own arbitrary sample rate. It should be the responsibility of the playback tool to properly determine and play the sample rate specified in a given stream

Proposed Prim Schema 

SpatialAudio 

The SpatialAudio schema defines basic properties for encoding playback of an audio file or stream within a USD Stage. The SpatialAudio schema should live in a new domain, UsdMedia, as we expect the set of media-related schemas to grow over time (e.g. imagine a move-clip-related schema). The UsdMediaSpatialAudio schema derives from UsdGeomXformable since it can support full spatial audio, while also supporting non-spatial mono and stereo sounds. One or more SpatialAudio prims can be placed anywhere in the namespace, though it is advantageous to place truly spatial audio prims under/inside the models from which the sound emanates, so that the audio prim need only be transformed relative to the model, rather than copying its animation.

Below we define the builtin properties of the schema, with their fallback values. Note that the only non-uniform (i.e. animatable) property is gain.

Properties

uniform asset filePath = @@

Path to the audio file.

uniform token auralMode = “spatial”

Determines how audio should be consumed. Allowed values:

spatial: Play the audio in 3D space if the device can support spatial audio. If not, fall back to mono.

nonSpatial : Play the audio without regard to the SpatialAudio prim’s position. If the audio media contains any form of stereo or other multi-channel sound, it is left to the application to determine whether the listener’s position should be taken into account. We expect nonSpatial to be the choice for ambient sounds and music sound-tracks.

For now, we consider oriented and spread-limited sounds, which are supported by major game engines, as beyond scope. However, in anticipation of future support of these features, in consideration to authoring tools needing to author transformations for SpatialAudio prims now , we stipulate that the emission direction for SpatialAudio prims is the -Z axis , which matches the orientation of directional lights in UsdLux .

uniform token playbackMode = “onceFromStart”

For the fallback value of “once” play the referenced audio once, beginning at startTime, continuing until endTime, if endTime is greater than startTime, otherwise until the audio completes. For any value other than “once”, loop the audio file to fill time, as described below. If specified, mediaOffset is applied to the first run-through of the audio clip with the second and all other loops beginning from the start of the audio clip. In future we can add additional attributes that can identify cut-points within the referenced media such that we can continuously loop a subsection of the referenced clip, but for now we choose to start more simply. Non-fallback values of playbackMode determine when looping should begin and end:

onceFromStart : Play the audio once from startTime, continuing until the audio clip completes.

onceFromStartToEnd : Play the audio once from startTime to endTime or until the audio clip completes, whichever comes first. See SdfTimeCode and TimeScaling to understand how this mode is affected by time-scaling.

loopFromStart : Start playing the audio at startTime and continue looping through to the stage’s authored endTimeCode.

loopFromStartToEnd : Start playing the audio at startTime and continue looping through, stopping the audio at endTime. See SdfTimeCode and TimeScaling to understand how this mode is affected by time-scaling.

loopFromStage : Start playing the audio at the containing stage’s a uthored startTimeCode, and continue looping through to the stage’s authored endTimeCode. This can be useful for ambient sounds that should always be active.

uniform timeCode startTime = 0

Expressed in the timeCodesPerSecond of the containing stage, *startTime* specifies when the audio stream will start playing during animation playback. See SdfTimeCode and TimeScaling to understand how this property resolves on a stage.

uniform timeCode endTime = 0

Expressed in the timeCodesPerSecond of the containing stage, *endTime* specifies when the audio stream will cease playing during animation playback, if the length of the referenced audio clip is longer than desired. If e *ndTime* is less than or equal to startTime , then play the audio stream to its completion. See SdfTimeCode and TimeScaling to understand how this property resolves on a stage.

uniform double mediaOffset = 0

Expressed in seconds , mediaOffset specifies the offset from the referenced audio file’s beginning at which we should begin playback when stage playback reaches the prim’s startTime. If the prim’s playbackMode is a looping mode, mediaOffset is applied only to the first run-through of the audio clip; the second and all other loops begin from the start of the audio clip.

double gain = 1.0

Float multiplier on the incoming audio signal. A value of 0 “mutes” the signal. Negative values will be clamped to 0. Although gain is commonly expressed in dB in the audio world, that formulation requires the addition of an extra animatable mute property since dB can only express ratios of signal, not an absolute scale factor, except theoretically as \(- inf\) dB. Further, given the intended use of SpatialAudio for content delivery in usdz assets, the commonality with the Web Audio API seems relevant.

USD Sample 

Here is an example of two different kinds of sounds encoded in USD with the SpatialAudio schema.

#usda 1.0
(
   upAxis = "Z"
   endTimeCode = 200
   startTimeCode = 1
   timeCodesPerSecond = 24
)

def Xform "Sounds"
{
    def SpatialAudio "AmbientSound"
    {
        # We need not encode startTime, mediaOffset, or level as the fallback
        # values suffice for ambient sound.  Playback will begin at timeCode 1
        uniform asset filePath       = @AmbientSound.mp3@
        uniform token auralMode      = "nonSpatial"
        uniform token playbackMode   = "loopFromStage"
    }

    def SpatialAudio "WoodysVoice"
    {
        # SpatialAudio xform.  This prim might typically be located
        # as a child of the "Woody" model so that it's location need
        # only be specified relative to Woody, rather than replicating
        # Woody's animation
        double3 xformOp:translate     = (3.0, -3.0, 2)
        uniform token[] xformOpOrder  = ["xformOp:translate"]


        # SpatialAudio Properties.  We have left the playbackMode at its
        # fallback of "onceFromStart", so we do not require an endTime:
        # the sound will play to completion
        uniform asset  filePath       = @WoodysVoice.mp3@
        uniform token  auralMode      = "spatial"
        uniform timeCode startTime =  65.0
        # Skip the first third of a second in WoodysVoice.mp3
        uniform double mediaOffset    =  0.33333333333
    }
}

Other Notes/Questions 

SdfTimeCode and Time Scaling 

USD release 19.11 introduced a new value type, SdfTimeCode, to accommodate the needs of the SpatialAudio schema, in which we need to encode time as attribute values in USD layers, and expect that those values be properly adjusted during value resolution such that any SdfLayerOffsets affecting the layer in which the SdfTimeCode values are authored are applied to the values themselves. This is necessary so that any audio that is synchronized to animation remains synchronized as the animation is affected by SdfLayerOffsets in the composition. This behavior brings with it some interesting edge-cases and caveats:

Inverting startTime and endTime

Although we do not expect it to be common, it is possible to apply a negative time scale to USD layers, which mostly has the effect of reversing animation in the affected composition. If a negative scale is applied to a composition that contains authored startTime and endTime, it will reverse their relative ordering in time. Therefore, we stipulate for the playbackModes of “onceFromStartToEnd” and “loopFromStartToEnd”, if endTime is less than startTime , then begin playback at endTime , and continue until startTime. When startTime and endTime are inverted, we do not, however, stipulate that playback of the audio media itself be inverted, since doing so “successfully” would require perfect knowledge of when, within the audio clip, relevant audio ends (so that we know how to offset the reversed audio to align it so that we reach the “beginning” at startTime), and sounds played in reverse are not likely to produce desirable results.
SdfLayerOffsets do not Affect Media Dilation

Although authored time-transformations will affect the value and ordering of startTime and endTime, we make no attempt to infer any playback dilation of the actual audio media itself. Given that startTime and endTime can be independently authored in different layers with differing time-scales, it is not possible, in general, to define an “original timeframe” from which we can compute a dilation to composed stage-time. Even if we could compute a composed dilation this way, it would still be impossible to flatten a stage or layer stack into a single layer and still retain the composed audio dilation using this schema.

Extent/Falloff for Ambient Sounds 

Ambient sounds may be associated with sub-sections (levels, areas, etc) of a scene. A common way to represent such location-based effects is with a parameterized spatial attenuation curve. We may consider adding such controls in the future, but leave them out for simplicity, currently.

Usdz Considerations 

In general, the formats allowed for audio files is no more constrained by USD than is image-type. As with images, however, Usdz has stricter requirements based on DMA and format support in browsers and consumer devices. We propose the allowed audio filetypes for usdz be M4A, MP3, !WAV (in order of preference); the Usdz Specification has been accordingly updated.

Reference Implementation 

Unlike the majority of 3D-graphics-related features added to USD, for which we strive to provide computation API’s (if relevant) and reference imaging support in Hydra and one or more of its renderers, the SpatialAudio schema represents “external” data, for which we do not have an existing in-house solution that could be easily adapted, nor existing workflows that would exercise it (spatial audio, especially). Therefore it may be some time before we are able to provide a reference implementation.

A nearer-term, graphical representation of SpatialAudio prims in Hydra viewports seems reasonable: SpatialAudio prims would have a fallback purpose of “guide”, with a simple, oriented icon/card representation indicating the position and orientation of the emitter.