Copyright (C) 2019, Pixar Animation Studios Revision 1.1
The goal of this proposal is to define a standard schema for adding audio support to USD. The schema must support basic, timed-start audio playback, as well as the ability to identify sounds as having well-defined locations (and orientations) in space. The primary purpose of the new schema is to provide an interchange solution for audio elements embedded in USD scenes and animations. Given that interchange is the main goal, we want to keep the schema simple, providing what we believe are the minimal requirements to play audio elements in the scene. This schema is not intended as a document format for an audio editing tool, since more complex operations on sound elements are not supported.
The addition of such a schema may lead to thoughts about describing interactive environments in USD, including such features as triggers for sounds. While encoding triggers for sounds and other actions may be a worthwhile area of future investigation, we defer any consideration of triggers or interactive environments for now in the interests of simplicity and resources.
- The USD schema should not directly contain any audio data but should refer to an audio file or stream via an assetPath
- The USD schema should not directly define the supported audio file formats, but see Usdz Considerations below for usdz-related restrictions
- Any number of audio elements should be allowed in a USD stage
- Each audio element is consumed as a whole: the USD schema does not provide for isolating specific sub-channels or tracks within an audio asset.
- The same audio file should be referenceable by multiple USD primitives
- Audio elements should be positionable (and orientable) in space with a given 3D transformation that can change over time
- Audio elements should also be able to represent "ambient" sound where the mono or stereo sound is played regardless of spatial positioning
- Audio should be able to be amplified/attenuated over time
- This does not provide any absolute measure of sound pressure (i.e. dB)
- Each audio element should be able to be amplified/attenuated independently. At a later date we may introduce a "grouping" feature to modulate multiple audio streams at once
- No explicit support for "fadein" or "fadeout"; if fading is not embedded directly in the referenced audio streams, it can be encoded via time-varying attenuation
- The start of playback within a UsdStage's time-range should be independently specifiable for each audio element, as should an optional endpoint.
- Should be able to specify an offset into each referenced audio file from which playback should begin
- Should be able to easily loop the audio stream, if desired
- Each audio stream should be able to have its own arbitrary sample rate. It is the responsibility of the playback tool to determine the sample rate specified in a given stream and play the stream back at that rate
Proposed Prim Schema
The SpatialAudio schema defines basic properties for encoding playback of an audio file or stream within a USD Stage. The SpatialAudio schema derives from UsdGeomXformable since it can support full spatial audio, while also supporting non-spatial mono and stereo sounds. One or more SpatialAudio prims can be placed anywhere in the namespace, though it is advantageous to place truly spatial audio prims under/inside the models from which the sound emanates, so that the audio prim need only be transformed relative to the model, rather than copying its animation.
Below we define the builtin properties of the schema, with their fallback values. Note that the only non-uniform (i.e. animatable) property is gain.
uniform asset filePath = @@
Path to the audio file
uniform token auralMode = "spatial"
Determines how audio should be consumed. Allowed values:
spatial : Play the audio in 3D space if the device can support spatial audio. If not, fall back to mono.
nonSpatial : Play the audio without regard to the SpatialAudio prim's position. If the audio media contains any form of stereo or other multi-channel sound, it is left to the application to determine whether the listener's position should be taken into account. We expect nonSpatial to be the choice for ambient sounds and music sound-tracks.
For now, we consider oriented and spread-limited sounds, which are supported by major game engines, as beyond scope. However, in anticipation of future support for these features, and in consideration of authoring tools that need to author transformations for SpatialAudio prims now, we stipulate that the emission direction for SpatialAudio prims is the -Z axis, which matches the orientation of directional lights in UsdLux.
uniform double stageStartTime = 0
Expressed in the timeCodesPerSecond of the containing stage, stageStartTime specifies when the audio stream will start playing during animation playback. Note that stageStartTime is expressed in the timeframe of the composed Stage, not the timeframe of the layer(s) in which it is authored - in other words SdfLayerOffsets are not applied to these values. We consider this a limitation that we will discuss below.
uniform double stageEndTime = 0
Expressed in the timeCodesPerSecond of the containing stage, stageEndTime specifies when the audio stream will cease playing during animation playback, if the length of the referenced audio clip is longer than desired. If stageEndTime is less than or equal to stageStartTime, then play the audio stream to its completion. Note that stageEndTime is expressed in the timeframe of the composed Stage, not the timeframe of the layer(s) in which it is authored - in other words SdfLayerOffsets are not applied to these values. We consider this a limitation that we will discuss below.
uniform double mediaOffset = 0
Expressed in seconds, mediaOffset specifies the offset from the referenced audio file's beginning at which we should begin playback when stage playback reaches the prim's stageStartTime.
uniform token playbackMode = "once"
For the fallback value of "once", play the referenced audio once, beginning at stageStartTime and continuing until stageEndTime, if authored, otherwise until the audio completes. For any value other than "once", loop the audio file to fill time, as described below. If specified, mediaOffset is applied to the first run-through of the audio clip, with the second and all subsequent loops beginning from the start of the audio clip. In the future we can add attributes that identify cut-points within the referenced media so that we can continuously loop a subsection of the referenced clip, but for now we choose to start more simply. Non-fallback values of playbackMode determine when looping should begin and end:
once : Play the audio once from stageStartTime to stageEndTime.
loopFromStage : Loop between the containing stage's authored startTimeCode and endTimeCode. This can be useful for ambient sounds that should always be active.
loopFromStartToEnd : Loop between the authored stageStartTime and stageEndTime.
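The interaction of stageStartTime, stageEndTime, mediaOffset, and playbackMode can be sketched as a function that maps a stage time to a position in the referenced media. This is a minimal illustration, not a reference implementation: the names `SpatialAudioTiming` and `media_time_at` are hypothetical, the media duration is assumed to be known to the player, stage playback is assumed to run at timeCodesPerSecond in real time, and a looping mode with no valid end time is assumed to loop indefinitely (the text above does not specify that case).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpatialAudioTiming:
    """Timing properties of one SpatialAudio prim, with the schema fallbacks."""
    stage_start_time: float = 0.0   # timeCodes, stage timeframe
    stage_end_time: float = 0.0     # timeCodes, stage timeframe
    media_offset: float = 0.0       # seconds into the audio file
    playback_mode: str = "once"     # "once", "loopFromStage", "loopFromStartToEnd"


def media_time_at(stage_time: float,
                  timing: SpatialAudioTiming,
                  media_duration: float,
                  time_codes_per_second: float = 24.0,
                  stage_start_code: float = 0.0,
                  stage_end_code: float = 0.0) -> Optional[float]:
    """Return the position (seconds) in the media to play, or None if silent."""
    # Assumption: mediaOffset must fall inside the media.
    if not 0.0 <= timing.media_offset < media_duration:
        raise ValueError("mediaOffset must lie inside the media")

    # loopFromStage uses the stage's own range; other modes use the prim's.
    if timing.playback_mode == "loopFromStage":
        start, end = stage_start_code, stage_end_code
    else:
        start, end = timing.stage_start_time, timing.stage_end_time

    # An end time at or before the start means "no end point".
    has_end = end > start
    if stage_time < start or (has_end and stage_time >= end):
        return None

    # Seconds of playback elapsed since the start of the active range.
    elapsed = (stage_time - start) / time_codes_per_second

    # mediaOffset shortens only the first run-through.
    first_run = media_duration - timing.media_offset
    if elapsed < first_run:
        return timing.media_offset + elapsed
    if timing.playback_mode == "once":
        return None  # the clip has played to completion
    # Subsequent loops restart from the beginning of the media.
    return (elapsed - first_run) % media_duration
```

For a 3-second clip with mediaOffset = 1 and stageStartTime = 24 at 24 timeCodesPerSecond, this sketch starts playback one second into the clip at stage time 24 and, in "once" mode, falls silent at stage time 72.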
double gain = 1.0
Floating-point multiplier on the incoming audio signal. A value of 0 "mutes" the signal. Negative values will be clamped to 0. Although gain is commonly expressed in dB in the audio world, that formulation requires the addition of an extra animatable mute property, since dB can only express ratios of signal, not an absolute scale factor, except theoretically as -∞ dB. Further, given the intended use of SpatialAudio for content delivery in usdz assets, the commonality with the Web Audio API, which also expresses gain as a linear multiplier, seems relevant.
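The relationship between the linear gain multiplier and a dB figure can be sketched as follows. The function names are hypothetical, and the conversion assumes gain scales signal amplitude (hence the factor of 20); the sketch also shows why a gain of 0 has no finite dB equivalent.

```python
import math


def effective_gain(gain: float) -> float:
    """Clamp negative gain values to 0, as the schema stipulates."""
    return max(0.0, gain)


def gain_to_db(gain: float) -> float:
    """Convert a linear amplitude multiplier to decibels.

    Assumption: gain multiplies amplitude, so dB = 20 * log10(gain).
    A muted signal (gain 0) maps to negative infinity.
    """
    g = effective_gain(gain)
    return -math.inf if g == 0.0 else 20.0 * math.log10(g)


def apply_gain(samples, gain: float):
    """Scale a sequence of samples by the clamped gain."""
    g = effective_gain(gain)
    return [s * g for s in samples]
```

Under that assumption, a gain of 0.5 is roughly -6 dB, and a time-sampled gain ramping from 0 to 1 encodes a fade-in without any dedicated fade attribute.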
Here is an example of two different kinds of sounds encoded in USD with the SpatialAudio schema.
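The listing below is an illustrative sketch of such a layer, held in a Python string so that it can be written out or inspected programmatically. It is not a normative example: the prim names, file names, time values, and transform are all hypothetical, and only the attribute names and types are taken from the schema described above. It pairs a spatial sound parented under a model with a non-spatial ambient loop.

```python
# Hypothetical example layer: prim names, asset paths, and values are invented
# for illustration; attribute names and types follow the proposed schema.
EXAMPLE_USDA = '''#usda 1.0
(
    startTimeCode = 1
    endTimeCode = 240
    timeCodesPerSecond = 24
)

def Xform "Robot"
{
    def SpatialAudio "Voice"
    {
        uniform asset filePath = @robotVoice.m4a@
        uniform token auralMode = "spatial"
        uniform double stageStartTime = 48
        uniform double mediaOffset = 0.5
        double gain.timeSamples = {
            48: 0,
            60: 1,
        }
        double3 xformOp:translate = (0, 1.5, 0)
        uniform token[] xformOpOrder = ["xformOp:translate"]
    }
}

def SpatialAudio "Ambience"
{
    uniform asset filePath = @forestLoop.mp3@
    uniform token auralMode = "nonSpatial"
    uniform token playbackMode = "loopFromStage"
    double gain = 0.6
}
'''

if __name__ == "__main__":
    print(EXAMPLE_USDA)
```

In this sketch "Voice" starts at stage time 48, half a second into its clip, and fades in over twelve timeCodes through time-sampled gain, while "Ambience" loops across the whole stage range regardless of position.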
Future Deprecation of stageStartTime/stageEndTime
Through the application of SdfLayerOffset to references or subLayers, one can non-destructively shift and scale the animation in a referenced layer, in the context of a consuming layer. This can be very useful for merging work of different artists, addressing director notes, or generally making animation more broadly/easily reusable. When animation "reaches" a final composed stage via multiple subLayers and potentially nested references, the time-scale implied by SdfLayerOffsets on each of the composition arcs is a composite, "chained" function. The USD core takes care of shifting timeSamples automatically so that they appear to clients in the "stage timeframe", regardless of the referenced timeframe in which they were authored.
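The "chained" function can be sketched with a small stand-in for SdfLayerOffset. The `LayerOffset` class here is a hypothetical illustration, not the USD API; it assumes each offset is an affine map that applies its scale first and then its offset, and that offsets on nested arcs compose by function composition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerOffset:
    """Hypothetical stand-in for a layer offset: t_outer = scale * t_inner + offset."""
    offset: float = 0.0
    scale: float = 1.0

    def apply(self, time_code: float) -> float:
        """Map a time from the inner (referenced) layer to the outer layer."""
        return self.scale * time_code + self.offset

    def compose(self, inner: "LayerOffset") -> "LayerOffset":
        """Return the single offset equivalent to applying `inner`, then `self`."""
        return LayerOffset(offset=self.offset + self.scale * inner.offset,
                           scale=self.scale * inner.scale)


# Two nested arcs: `outer` is closer to the stage, `inner` is further away.
outer = LayerOffset(offset=10.0, scale=2.0)
inner = LayerOffset(offset=5.0, scale=0.5)
chained = outer.compose(inner)

# A timeSample authored at time 4 in the innermost layer lands at the same
# stage time whether the arcs are applied one by one or as a composite.
assert outer.apply(inner.apply(4.0)) == chained.apply(4.0) == 24.0
```

The composite is what shifts timeSamples into the stage timeframe; a plain double-valued attribute such as stageStartTime passes through unchanged.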
However, USD currently has no way to apply "the correct" composite offset to attribute values, because we have no way to robustly identify which double-valued attributes might actually be expressing a timeCode. The SpatialAudio schema is the first we have designed that actually wants to encode timeCodes as attribute values. We propose to add a new SdfValueType of "TimeCode" so that we can express attributes whose values should have time-offsets applied to them when querying their value at the UsdStage-level.
Deploying a new ValueType and ensuring that the desired transformations happen in all and only the proper places, however, is a decently-sized project, and there is some urgency in deploying the SpatialAudio schema. Therefore we have chosen to deploy the stageStartTime and stageEndTime properties initially as simple double values expressed in locked-off "stage time". We acknowledge that offsetting the layers that host such prims will lead to loss of sync between animation and sound, and we intend to eventually deprecate these properties in favor of more functional startTime and endTime attributes that are timeCode-valued, once that value type exists. From its initial deployment, the SpatialAudio schema will provide UsdTimeCode-valued GetStageStartTime() and GetStageEndTime() methods that we encourage all clients to use. For now they will simply return the authored stageStartTime/stageEndTime values, but in the future they will return the properly scaled/offset startTime/endTime values instead, if those are authored.
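The intended behavior of the accessor across the transition can be sketched as below. This is a hypothetical model rather than the schema's implementation: the prim is represented as a plain dictionary of authored values, and the composite layer offset is passed in explicitly as `offset` and `scale`, whereas in USD the core would apply it when resolving a timeCode-valued attribute.

```python
def get_stage_start_time(authored: dict,
                         offset: float = 0.0,
                         scale: float = 1.0) -> float:
    """Return the stage-timeframe start time for a SpatialAudio prim.

    `authored` holds the prim's authored attribute values (hypothetical
    representation). A future timeCode-valued startTime takes precedence and
    has the composite layer offset applied; otherwise the locked-off
    stageStartTime is returned as-is, falling back to 0.
    """
    if "startTime" in authored:
        return scale * authored["startTime"] + offset
    return authored.get("stageStartTime", 0.0)
```

Clients that call the accessor rather than reading stageStartTime directly would pick up the offset-aware value without code changes once startTime is authored.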
Extent/Falloff for Ambient Sounds
Ambient sounds may be associated with sub-sections (levels, areas, etc.) of a scene. A common way to represent such location-based effects is with a parameterized spatial attenuation curve. We may consider adding such controls in the future, but leave them out for now in the interest of simplicity.
Usdz Considerations
In general, the formats allowed for audio files are no more constrained by USD than are image formats. As with images, however, Usdz has stricter requirements based on DMA and format support in browsers and consumer devices. We propose that the allowed audio filetypes for usdz be M4A, MP3, and WAV (in order of preference). Upon acceptance of this proposal, we will update the usdz specification.
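A packaging tool could check audio assets against the proposed list along these lines. The sketch identifies formats by file extension only, which is an assumption (the proposal does not say how a filetype is determined), and the function name is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Proposed usdz audio filetypes, in order of preference.
USDZ_AUDIO_EXTENSIONS = ("m4a", "mp3", "wav")


def usdz_audio_preference(asset_path: str) -> Optional[int]:
    """Return the preference rank (0 = most preferred) of an audio asset,
    or None if its extension is not in the proposed usdz list.

    Assumption: the filetype is inferred from the extension, case-insensitively.
    """
    extension = PurePosixPath(asset_path).suffix.lower().lstrip(".")
    if extension in USDZ_AUDIO_EXTENSIONS:
        return USDZ_AUDIO_EXTENSIONS.index(extension)
    return None
```

A validator might reject any SpatialAudio filePath for which this returns None when writing a usdz package, while leaving plain USD layers unconstrained.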
Unlike the majority of 3D-graphics-related features added to USD, for which we strive to provide computation APIs (if relevant) and reference imaging support in Hydra and one or more of its renderers, the SpatialAudio schema represents "external" data, for which we have neither an existing in-house solution that could be easily adapted nor existing workflows that would exercise it (spatial audio, especially). Therefore it may be some time before we are able to provide a reference implementation.
A nearer-term, graphical representation of SpatialAudio prims in Hydra viewports seems reasonable: SpatialAudio prims would have a fallback purpose of "guide", with a simple, oriented icon/card representation indicating the position and orientation of the emitter.