For more than 20 years, audio descriptions (also known as video descriptions) have been delivered using human narration. Traditionally, a describer first writes a script that describes key visual elements, such as costumes, scenery, scene changes and on-screen text, that would otherwise not be available to viewers unable to see the screen. These descriptions are normally carefully timed to fit into the natural pauses in the dialog or narration. The script is recorded by a human narrator, and the description audio track is then mixed with the regular program-audio soundtrack before the program or movie is broadcast. How descriptions reach the audience depends on the medium. In television programming, descriptions are usually delivered to the viewer via a separate audio channel and can be turned on and off. In theatrical presentations, such as first-run movies, descriptions can be delivered via wireless transmitter to patrons wearing special headsets. In an online environment, descriptions are often delivered as part of the regular program-audio soundtrack and cannot be turned off (these open-described movies are often offered as alternatives to the undescribed versions).

IBM-Research Tokyo recently partnered with the Carl and Ruth Shapiro Family National Center for Accessible Media (NCAM) at WGBH to research ways to deliver online audio descriptions via text-to-speech (TTS) methods, rather than using human recordings. IBM and NCAM explored two approaches that exploit the new HTML5 media elements (video, audio and track) as well as JavaScript:

  1. Writing and time-stamping a description script, then delivering the descriptions as hidden text in real time in such a way that a user's screen reader will read them aloud. The descriptions remain otherwise invisible and inaudible to non-screen-reader users.
  2. Writing and time-stamping descriptions, then recording them using TTS technology. At the time of playback, each description is individually retrieved and played aloud at intervals corresponding to the time-stamped script.
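
As a rough illustration of what both approaches share, the sketch below holds a time-stamped description script in memory and looks up the description that applies at a given playback position. The sample texts, the begin/end/text field names and the activeDescription helper are hypothetical and are not taken from the demonstration pages.

```javascript
// Hypothetical time-stamped description script (times in seconds).
const descriptionScript = [
  { begin: 2.0, end: 5.5, text: "A woman walks into a sunlit kitchen." },
  { begin: 12.0, end: 14.0, text: "Title: Chapter One." },
];

// Return the description text active at playback time t, or null if none.
function activeDescription(script, t) {
  const cue = script.find((c) => t >= c.begin && t < c.end);
  return cue ? cue.text : null;
}

console.log(activeDescription(descriptionScript, 3)); // "A woman walks into a sunlit kitchen."
console.log(activeDescription(descriptionScript, 8)); // null
```

In the first approach the returned text would be handed to the screen reader; in the second it would identify which recorded TTS clip to play.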

The Described Videos

This Web site showcases five brief videos, linked below, each of which demonstrates the two methods outlined above. Each video shows off a different type of description technique, as described in the following list.

  1. A Change for Life shows how on-screen textual information can be made accessible using TTS descriptions.
  2. Simply Ming illustrates how descriptions can be compressed into the briefest of pauses, at times competing with the program audio.
  3. Sintel demonstrates how longer descriptions can be used in dramatic settings.
  4. The Sense of Taste shows descriptions in a documentary setting.
  5. Snort Sniffle Sneeze illustrates brief descriptions.

What's on Each Page

Each page contains two video players: the TTS player on the left side of the screen and the pre-recorded player on the right.

Each player can be controlled with a mouse or from the keyboard using the buttons located below each player. You can also use each player's built-in controls, but the screen-reader accessibility of these controls will vary depending on the browser you are using. If you find that you cannot identify or use the built-in controls, use the buttons instead (all of which are screen-reader and keyboard accessible).

Each player also makes use of keyboard shortcuts, or access keys. These shortcuts are listed below.

Keyboard shortcuts (access keys) for controlling the players

TTS player:

Pre-recorded player:

You will need to press certain modifier keys along with the numbers above in order to activate the shortcuts. Which modifiers you press will depend on the browser you're using. A list of browsers and their modifier keys is shown below. (Note: some screen readers will announce the modifier keys for you.)

Keyboard-shortcut modifier keys:

Please see special notes and troubleshooting if you are having problems hearing the descriptions, seeing the video or controlling the players.

System Requirements

To access the descriptions in the TTS player, you will need the following:

To access the descriptions in the pre-recorded player, you can use any of the browsers listed above as well as other browsers, such as Opera. You do not need a screen reader in order to hear the pre-recorded descriptions.

iOS users, take note: you can also access the TTS descriptions using VoiceOver on the iPhone or iPad. The iPad will play videos in the Web page, but the iPhone will launch the videos separately in the QuickTime player. In both instances, however, VoiceOver will read the TTS descriptions aloud. For best results, make sure your handheld device is running the most current version of the operating system (5.0.1 as of this writing). At this time, the pre-recorded descriptions will not play on iOS devices because Mobile Safari is not capable of playing more than one media file (e.g., separate video and audio files) at a time.


TTS player problems

Can't hear the descriptions

  1. First, review the system requirements. Using a Mac? You must be running OS 10.7/Lion, not 10.6/Snow Leopard, in order to hear the TTS descriptions. Note that if you are not using a screen reader, you will not be able to hear the TTS descriptions provided by the TTS player.
  2. If your screen reader is not reading the TTS descriptions, reloading the page and shutting down/restarting the screen reader usually causes the descriptions to be read aloud. Note that the screen reader will only read descriptions aloud in the TTS player, not the pre-recorded player.
  3. On the Mac, VoiceOver will only work with Safari and Chrome. It will not work with Firefox or Opera.
  4. On Windows, you must use either JAWS or NVDA. Currently, Window-Eyes does not provide support for text displayed in live regions and so cannot be used to read the TTS descriptions.

Also remember that the speed at which the descriptions are delivered depends entirely on your screen-reader settings. If you find that the descriptions are competing with or obscuring the program audio, increase the reading speed.
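
To see why reading speed matters, the following back-of-the-envelope sketch estimates how long a description takes to speak at a given rate and whether it fits in a pause. The word-count model, the function names and the sample sentence are assumptions made for illustration; real screen readers vary in how they pace speech.

```javascript
// Estimate how long a description takes to speak at a given rate.
// wordsPerMinute stands for whatever rate the screen reader is set to.
function estimatedSeconds(text, wordsPerMinute) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return (words / wordsPerMinute) * 60;
}

// True when the description is estimated to fit inside the pause.
function fitsInPause(text, wordsPerMinute, pauseSeconds) {
  return estimatedSeconds(text, wordsPerMinute) <= pauseSeconds;
}

const sampleDescription = "Ming slices the ginger into thin strips."; // hypothetical text
console.log(fitsInPause(sampleDescription, 180, 2)); // false: about 2.3 seconds
console.log(fitsInPause(sampleDescription, 300, 2)); // true: 1.4 seconds
```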

VoiceOver reads the descriptions twice

If you find that VoiceOver always reads the TTS descriptions twice, shut down VoiceOver and restart it. (You should not need to close and re-open your browser.) This should solve the problem.

Pre-recorded player problems

Video and descriptions play unevenly or are truncated

  1. Firefox browsers (Mac and Windows) may occasionally truncate the pre-recorded audio descriptions. If this happens, pause or stop the video, wait a few seconds and then resume playback.
  2. Firefox browsers (Mac and Windows) may occasionally display erratic video, video and program audio that are somewhat out of sync, or stuttering audio. If this happens, pause or stop the video, wait a few seconds and then resume playback.

Browser problems

Can't see video or hear audio

  1. The videos are supplied in MP4 and Ogg/Theora formats using the HTML5 video element. The track element and JavaScript are used to deliver the descriptions, and ARIA markup is used to help screen readers locate and read the TTS descriptions. For best results, use the most current versions of the browsers and screen readers listed in the system requirements.
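
When a video element lists several sources, the browser normally picks the first one it can decode. The sketch below spells out equivalent selection logic in script, which can help when diagnosing a blank player. The file names and the pickSource helper are hypothetical; canPlayType is the standard HTMLMediaElement method, passed in as a function so the example also runs outside a browser.

```javascript
// Candidate encodings of one video; the file names are hypothetical.
const sources = [
  { src: "video.mp4", type: "video/mp4" },
  { src: "video.ogv", type: "video/ogg" },
];

// canPlayType mirrors HTMLMediaElement.canPlayType, which returns
// "", "maybe" or "probably" for a given MIME type.
function pickSource(candidates, canPlayType) {
  return candidates.find((s) => canPlayType(s.type) !== "") || null;
}

// In a browser, one could pass:
//   (type) => document.createElement("video").canPlayType(type)
// Here, a stand-in for a browser that only decodes Ogg/Theora:
const oggOnly = (type) => (type === "video/ogg" ? "maybe" : "");
console.log(pickSource(sources, oggOnly).src); // "video.ogv"
```

If pickSource returns null, the browser supports neither format and the video will not display.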
  2. Firefox browsers (Mac and Windows) may occasionally display erratic video, video and program audio that are somewhat out of sync, or stuttering audio. If this happens, pause or stop the video, wait a few seconds and then resume playback.

VoiceOver isn't passing the keyboard-shortcut commands for the players

VoiceOver will take control of the keyboard-shortcut combinations by default. To use VoiceOver with the keyboard shortcuts, first press Control+Option+Tab to activate the pass-through command, then press the keyboard-shortcut commands you want to use.

Technical Notes

The videos in both the TTS and pre-recorded demonstrations are displayed using the HTML5 media elements video and audio, for which there is now reliable browser support.

TTS Description (left-side player) Notes

The TTS descriptions can be thought of as timed text, similar to captions except that they are used to describe visual elements rather than display a transcription of narration, dialog or non-speech information. And instead of being displayed visually they're "displayed" aurally.

While the HTML5 specification does include a track element for displaying captions, there's currently no browser support for it. In these demonstrations, track element support is emulated using the Popcorn.js HTML5 Media Framework. The Popcorn.js library is designed to manage timed events related to media playback. Media playback has been further enhanced with custom JavaScript that makes use of the Popcorn.js library and is encapsulated in the popcornPlayer.js code. Additionally, both of those libraries make use of the jQuery library, which provides convenient access to DOM elements and various other operations.

The TTS descriptions flow logically as follows:

When browsers implement support for track elements directly within the HTML5 media framework, most of the JavaScript around the TTS playback will be unnecessary. At that point, an author would simply create the div to hold the description text and let the media element implementation handle its display, with the screen reader handling the audio rendering. The author would still need to employ ARIA markup (such as the aria-live attribute) so that screen readers intercept and read the descriptions in a timely manner.
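
The general idea of pushing timed text into a live region can be sketched without any library. This is not the popcornPlayer.js code; it is a minimal stand-alone illustration in which the cue format, the createDescriptionUpdater helper and the sample text are hypothetical. The live region is assumed to be a visually hidden element carrying an aria-live attribute, represented here by a plain object so the sketch runs outside a browser.

```javascript
// Build an updater that writes the active description into a live region.
// liveRegion is any object with a textContent property; in a browser it
// would be a hidden element such as <div aria-live="polite"> (the value
// "polite" is an assumption, not taken from the demonstration pages).
function createDescriptionUpdater(cues, liveRegion) {
  let current = null;
  return function update(currentTime) {
    const cue = cues.find((c) => currentTime >= c.begin && currentTime < c.end) || null;
    if (cue === current) return false; // same cue (or still none): write nothing
    current = cue;
    liveRegion.textContent = cue ? cue.text : ""; // clear once the cue has ended
    return cue !== null; // true only when a new description was written
  };
}

const region = { textContent: "" }; // stand-in for the live-region element
const update = createDescriptionUpdater(
  [{ begin: 1, end: 3, text: "A dragon lands on the tower." }],
  region
);
update(0.5); // before the cue: nothing written
update(1.2); // cue becomes active: text written once
update(1.4); // same cue: not rewritten
console.log(region.textContent);
// In a browser:
//   video.addEventListener("timeupdate", () => update(video.currentTime));
```

Writing the text only when the active cue changes matters because every change to a live region may be announced, so rewriting the same text on each timeupdate event could cause repeated speech.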

Pre-Recorded Description (right-side player) Notes

The pre-recorded descriptions follow a path similar to that of the TTS descriptions, with a few deviations. Using a TTS library such as Text Aloud™ or MS SAPI, the TTML textual descriptions written for the TTS player are each turned into audio clips. These clips are then compiled sequentially into a single audio file. At this point, each discrete audio description can be thought of as a sub-clip within the larger compilation file.

To coordinate description playback at the correct moment in the timeline, a TTML file is used as a "trigger" file. Each trigger is made up of two timecodes: the first corresponds to the spot in the compilation file where the specific description sub-clip begins, and the second corresponds to the point in the main timeline at which the sub-clip must stop playing. The popcornPlayer.js code programmatically adds an audio element to the Web page and uses it to play back the sub-clips as needed.
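
The mapping from the main timeline to a position in the compilation file can be sketched as follows. This is an illustration under stated assumptions, not the popcornPlayer.js logic: the field names are hypothetical, and the startAt field (the main-timeline time at which a sub-clip begins) is an added assumption, since the text above only names the sub-clip's offset and its stop time.

```javascript
// Hypothetical triggers. clipBegin is the sub-clip's offset in the
// compilation audio file; stopAt is the main-timeline time at which it
// must stop. startAt is an assumption added for this sketch.
const triggers = [
  { startAt: 4, stopAt: 7, clipBegin: 0 },
  { startAt: 15, stopAt: 17.5, clipBegin: 3 },
];

// Where should the description audio be positioned at video time t?
// Returns a time in the compilation file, or null when no sub-clip plays.
function compilationTimeAt(list, t) {
  const trigger = list.find((x) => t >= x.startAt && t < x.stopAt);
  return trigger ? trigger.clipBegin + (t - trigger.startAt) : null;
}

console.log(compilationTimeAt(triggers, 16)); // 4
console.log(compilationTimeAt(triggers, 10)); // null
// In a browser, a non-null result could be assigned to the description
// audio element's currentTime before calling play(); a null result
// would mean pausing that element.
```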

The flow logic of the pre-recorded descriptions is shown below.

Note that the pre-recorded player also uses the volume attribute of the media elements to manage "ducking," which allows for lowering the volume of the program-audio track while raising the volume of the description audio sub-clip. This process is managed as a fade-in/out to make the transitions more pleasant to hear. Authors may also use volume attributes in the TTML file, so that ducking can be managed on a per-description (as opposed to a global) basis, if necessary.
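
A fade-based duck can be expressed as a simple function of time. The sketch below computes the program-audio volume around one description using linear ramps; the function name, the half-second fade and the volume levels are assumptions chosen for illustration, not values from the demonstration pages.

```javascript
// Program-audio volume at time t around a description that plays from
// start to end (seconds). The volume ramps down linearly during the
// `fade` seconds before start, holds at `ducked` while the description
// plays, then ramps back up during the `fade` seconds after end.
function programVolume(t, start, end, fade = 0.5, normal = 1.0, ducked = 0.3) {
  if (t <= start - fade || t >= end + fade) return normal;
  if (t < start) return normal - (normal - ducked) * ((t - (start - fade)) / fade);
  if (t <= end) return ducked;
  return ducked + (normal - ducked) * ((t - end) / fade);
}

console.log(programVolume(5, 10, 12));  // 1 (well before the description)
console.log(programVolume(11, 10, 12)); // 0.3 (description playing)
// In a browser, the result could be assigned to the program video's
// volume property, which accepts values from 0 to 1.
```

Per-description ducking, as mentioned above, would amount to passing different normal and ducked values for each description.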

Note that, unlike the track element, which the HTML5 specification at least defines, synchronous media-element playback currently has no support in HTML5 at all. SMIL provides such support, but HTML5 has not yet embraced a similar timing model.

URL Parameters

For convenience, the JavaScript on each page recognizes three specific parameters in the URL:

For example, the following URL will display the page with the TTS text visible, the button invisible, and the jump forward/back value set to 20 seconds:


Download the full archive of the Web pages and media. Expand into a single directory. (Approximately 160 MB.)

Your Comments

To comment on the videos and technology behind them, visit our blog.