Developer's Guide to Creating Talking Menus for Set-top Boxes and DVDs
For the Set-top Box Developer

A set-top box, or STB, is any device inserted between the cable or satellite feed and the user's television set. These devices can select and display individual channels. Satellite television today follows the MPEG-2 digital-compression standard, which is the basis of both the Digital Video Broadcasting (DVB) standard and the Advanced Television Systems Committee (ATSC) standard. Most U.S. cable providers have either adopted the MPEG-2 standard or are in the process of upgrading their services from analog to digital in order to do so. MPEG-2 allows the delivery of any information that can be digitized, including audio, video and text. That material can then be displayed on digital High Definition television sets, or in analog NTSC, PAL or SECAM formats.

STB Operating Constraints

Given the flexibility of the DVB standard, one would expect that adding an audio-navigation capability to an STB would be a simple matter of writing the proper software to include in the STB's operating system. However, despite the power and flexibility of the standard, it is currently next to impossible for American cable or satellite services to offer audio-navigation services. The computers inside American STBs are simply too primitive to support this additional capability.

Most STBs in the United States are loaned to consumers by their cable or satellite providers. The cost of the device is essentially included in the subscription fee. In some countries, third-party manufacturers of STBs market their products directly to consumers. However, this trend has not carried over into the U.S. market. This single fact is significant to any discussion of the future of audio-navigation services available to cable and satellite customers. As long as service providers pay for the development and manufacture of STBs, they will likely choose to manufacture boxes at the lowest price possible. Because they face no competition from third-party manufacturers, they have little incentive to make the computers inside those boxes more powerful than necessary to satisfy the majority of their customers.

Today, most STBs contain a single CPU similar or identical to the Motorola 68000, a chip introduced in 1979. The American STB also contains between 2 and 4 megabytes of RAM but no mechanical storage device such as a hard drive. Most of the RAM is devoted to the operating-system code required to carry out basic interactive functions. Given the limited power of the typical STB computer, it is unlikely that a software-only solution to audio navigation could be added to that computer's functionality.

Synthesized or Recorded Speech Constraints

As discussed earlier, there are two approaches to creating the audio-navigation interface: speech synthesis and recorded voice samples. In the idealized scenario, the program guide spoke to the user as the user moved through the program selections. To create this spoken voice, the cable provider would have to either record an actor's voice or somehow provide enough computer power to synthesize that voice from text entries appearing in the program guide.

Because the contents of the program guide are in constant flux, and listings can change at any moment, the job of recording and assembling sufficient audio samples to speak every possible program title would be unmanageable. Unlike pre-recorded telephone navigation systems that piece together words or parts of words to speak times, dates and other regularized information, an STB system based on recorded audio would never be able to stay abreast of the tremendous variety of titles and other information one would likely find contained in a typical program guide (to say nothing of the problem of parsing non-English words or the ever-expanding number of advertisements and product offerings showing up in program guides). It soon becomes obvious that the only possible means of delivery for audio navigation is to employ speech synthesis to convert written text to spoken text on the fly.

Possible Approaches

From the service provider's point of view, there are two ways to approach voice synthesis and delivery. In the first, the provider could install speech-synthesis software on the central server and feed the digitized speech to the user along with the program guide information. Even assuming that sufficient bandwidth exists to carry this extra traffic, this approach would likely not work, because the STB has only limited control over the program guide information that flows out of the central server. The interactivity enjoyed by the user is largely an illusion. The central server feeds program information in a "carousel": a one-way stream that cycles from beginning to end. The STB "drinks" from that stream and downloads blocks of data, ready to be displayed one listing at a time to the viewer. When users move from one program guide entry to another with the remote control, they do not interact with the central server. Rather, they interact with the STB, scrolling through information already downloaded into the STB's limited memory.

To add audio navigation to this system, the synthesized audio would have to flow into the STB in the same "carousel" fashion. The STB would then have to store large blocks of synthesized audio, just as it stores large blocks of text data. Because digitized audio samples are orders of magnitude larger than their text equivalents, the standard STB's RAM would quickly be exhausted and the system would cease to operate properly. Unless the operating system changes to allow users greater interactivity, server-centered speech synthesis is not a viable option.
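The carousel model and the memory problem it creates can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the sample listings, the speaking rate of 15 characters per second, and the 8 kHz, 8-bit uncompressed audio format are not taken from any real STB or provider. A real carousel carries binary tables, not strings.

```python
from itertools import cycle, islice

# Hypothetical guide listings, invented for this sketch.
LISTINGS = [
    "8:00 PM Channel 2 Evening News",
    "8:00 PM Channel 4 Nature Documentary",
    "8:30 PM Channel 2 Game Show",
]

def carousel(listings):
    """Return an endless one-way stream that cycles through the listings."""
    return cycle(listings)

def fill_cache(stream, count):
    """Model the STB 'drinking' a fixed number of blocks from the stream."""
    return list(islice(stream, count))

def text_bytes(listing):
    """Size of one listing stored as UTF-8 text."""
    return len(listing.encode("utf-8"))

def audio_bytes(listing, chars_per_second=15, sample_rate=8000, bytes_per_sample=1):
    """Estimate the size of the same listing as uncompressed speech.

    The speaking rate and audio format are assumed values, not measurements.
    """
    seconds = len(listing) / chars_per_second
    return int(seconds * sample_rate * bytes_per_sample)

# The STB never asks the server for a listing; it only reads what cycles past.
cache = fill_cache(carousel(LISTINGS), 5)
total_text = sum(text_bytes(item) for item in cache)
total_audio = sum(audio_bytes(item) for item in cache)
print(f"text: {total_text} bytes, audio estimate: {total_audio} bytes")
```

Under these assumed figures each character of text costs several hundred bytes of audio, which is why caching spoken listings in 2 to 4 megabytes of RAM is impractical.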

Alternatively, it should be possible for the STB itself to synthesize speech from text. Speech synthesis is a trivial task for today's desktop computers, and users who are blind rely on text-to-speech synthesizers when using the screen readers that make computers accessible. However, as noted above, the American STB's internal computer does not yet have the processing muscle or the memory to deliver this functionality. Until the STB catches up with the kind of computing power found in even low-end desktop computers, audio navigation is unlikely to appear in cable or satellite systems.

There is another alternative, however. In the U.S., the STBs provided by cable and satellite companies are not the only STBs in use. Personal digital recorders, such as those marketed under the TiVo brand, are in fact STBs in disguise, and they contain considerably more powerful computers.

Personal recorders are designed to automatically choose channels, display program information and record television programs for later playback. To do so, they must download and interact with program guide information. Today, users of TiVo and other personal recorders download those program guides via modem from servers maintained by the vendors of the recorders. Users pay a subscription fee for this privilege. Theoretically, this program guide information could be fed to a text-to-speech synthesizer as part of an audio-navigation feature. Under the hood, these boxes do seem to have enough computing horsepower to be able to add a voice-navigation feature.
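As a minimal sketch of what feeding guide data to a synthesizer might involve, the code below turns one guide entry into a sentence suitable for speech. The `GuideEntry` fields, the phrasing, and the `speak` callback are all hypothetical; no real recorder's data format or text-to-speech API is being described, and `print` stands in for whatever synthesizer a device might offer.

```python
from dataclasses import dataclass

@dataclass
class GuideEntry:
    """Hypothetical program guide record, invented for this sketch."""
    channel: int
    title: str
    start: str  # 24-hour time as "HH:MM"

def spoken_time(hhmm):
    """Convert "HH:MM" into a 12-hour phrase that reads naturally aloud."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    suffix = "AM" if hour < 12 else "PM"
    hour12 = hour % 12 or 12  # 0 and 12 both read as "12"
    if minute == 0:
        return f"{hour12} {suffix}"
    return f"{hour12}:{minute:02d} {suffix}"

def to_utterance(entry):
    """Build the text that would be handed to a speech synthesizer."""
    return f"Channel {entry.channel}. {entry.title}. Starts at {spoken_time(entry.start)}."

def announce(entry, speak=print):
    """Send the utterance to a synthesizer; `speak` is a stand-in callback."""
    speak(to_utterance(entry))

announce(GuideEntry(channel=4, title="Evening News", start="20:30"))
```

Keeping the text formatting separate from the `speak` callback means the same utterance builder could be paired with any synthesizer, or tested without one.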

Another alternative is to dispense altogether with the STB and feed the cable signal directly into a desktop computer. Today, PC users can purchase a Digital Video Broadcast card that allows them to watch basic cable programs on their computers. These cards cannot decode encrypted premium channels, such as HBO, but because the program guide information delivered by the service provider is not encrypted, the PC should be able to display it. An enterprising developer should be able to write software that is compatible with the screen-reading software used by people who are blind, and thus deliver an audio-navigation system for cable or satellite TV to the desktop.

Program guide developers who seek to make their products accessible to screen-reading software should follow these rules:

  • Use standard system tools to draw and erase all on-screen text and to display all cursors and pointers. This means using true text, not graphics that display text.
  • Structure the text appropriately, identifying headings and other structural elements.
  • Embed descriptive text in graphic images in such a way as to make the text known to screen-reading software. Screen readers can read text even if that text is written to the screen invisibly.
  • Assign logical names to controls, even if the names are not visible on the screen. Screen readers can access this information and use it to describe the type and function of the control on the screen.
  • Use consistent and predictable screen layouts.
  • Avoid assigning unlabeled "hot spots" in graphic images that are used as controls, unless they are redundant with menu selections.
  • Avoid non-redundant graphic tool bars. Make any tool bar command available in a menu.
  • Caption all auditory content.
  • Provide a text transcription of auditory content.
  • Provide audio-description tracks for multimedia that describe visual aspects of the content.
Because some of these recommendations involve embedding "hidden" text for the benefit of the screen reader, and because these considerations are likely new to the developers who create program guides, some reverse engineering of the basic program guide architecture may be necessary to accommodate them.
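Some of the rules above lend themselves to automated checking. The following is a rough sketch, not a real accessibility toolkit: the `Control` record, its fields, and the `audit` function are hypothetical names chosen for illustration. It checks three of the rules: every control has a logical name, unlabeled hot spots are allowed only when a menu selection duplicates them, and every tool bar command is also available in a menu.

```python
from dataclasses import dataclass

@dataclass
class Control:
    """Hypothetical description of one on-screen control."""
    kind: str              # e.g. "button", "hotspot", "toolbar_item"
    name: str = ""         # logical name exposed to screen readers
    in_menu: bool = False  # True if the same command is also in a menu

def audit(controls):
    """Return a list of human-readable problems found in the controls."""
    problems = []
    for index, control in enumerate(controls):
        named = bool(control.name.strip())
        if control.kind == "hotspot":
            # Unlabeled hot spots are acceptable only if a menu duplicates them.
            if not named and not control.in_menu:
                problems.append(f"control {index}: unlabeled hot spot with no menu equivalent")
        elif not named:
            problems.append(f"control {index}: {control.kind} has no logical name")
        if control.kind == "toolbar_item" and not control.in_menu:
            problems.append(f"control {index}: tool bar command is not available in a menu")
    return problems

screen = [
    Control("button", "Record"),
    Control("hotspot"),                # flagged: no name, no menu equivalent
    Control("hotspot", in_menu=True),  # allowed: redundant with a menu
    Control("toolbar_item", "Guide"),  # flagged: not in any menu
]
for problem in audit(screen):
    print(problem)
```

A check like this could run against a guide's screen definitions during development, so that missing names are caught before a screen reader user encounters them.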