
DAWN - the Internet Edition



Science.com

August 2, 2003



SALT: the future speech technology for the web



By Javeriah Zahir


SPEECH is one of the ultimate user interfaces: it is far easier to speak than to type and move a mouse to input information. Many users would also rather hear a news item than read it. Recognizing this, Intel’s Research and Development has started work on speech technologies that make the internet more accessible.

In this regard, adding Speech Application Language Tags (SALT) to web pages will allow users to speak commands (replacing or supplementing the keyboard, mouse, or stylus) to access information online, order products, and so on. The telephony call-control feature will also let users make or receive calls through their computer, or take part in phone conferences directly from a web page in their browser.


Mechanism

According to the SALT Forum, SALT is a small set of XML elements, with associated attributes and DOM object properties, events, and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document. SALT can therefore be used equally well with HTML and all its flavours, with WML, or with any other SGML-derived markup language.

SALT elements: The main top-level elements of SALT are:

<prompt> for speech synthesis configuration and prompt playing

<listen> for speech recognizer configuration, recognition execution and post-processing, and recording

<dtmf> for configuration and control of DTMF collection

<smex> for general-purpose communication with platform components

The input elements <listen> and <dtmf> contain sub-controls:

<grammar> for specifying input grammar resources

<bind> for processing of recognition results

<listen> also contains the facility to record audio input:

<record> for recording audio input.

In addition, a call control object is provided for control of telephony functionality.
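As a rough illustration of how these elements nest, the sketch below uses Python's standard XML library to build a listen element holding a grammar and a bind child, then prints the resulting markup. The namespace URI, the attribute names (src, targetelement, value), the ids and the grammar file name are assumptions recalled from SALT examples or invented for illustration; they are not taken from this article.

```python
import xml.etree.ElementTree as ET

# Assumed SALT namespace URI; check it against the specification before relying on it.
SALT_NS = "http://www.saltforum.org/2002/SALT"
ET.register_namespace("salt", SALT_NS)


def salt(tag: str) -> str:
    """Return the namespace-qualified name of a SALT element."""
    return f"{{{SALT_NS}}}{tag}"


def build_city_listen() -> ET.Element:
    """Build a listen element with one grammar and one bind sub-control."""
    listen = ET.Element(salt("listen"), {"id": "listenCity"})
    # Hypothetical grammar resource listing the city names the recognizer accepts.
    ET.SubElement(listen, salt("grammar"), {"src": "city.grxml"})
    # Hypothetical binding: copy the recognized city into a text box named txtCity.
    ET.SubElement(listen, salt("bind"), {"targetelement": "txtCity", "value": "//city"})
    return listen


listen = build_city_listen()
child_tags = [child.tag.split("}")[1] for child in listen]
markup = ET.tostring(listen, encoding="unicode")
print(markup)
```

In a real page this markup would sit alongside the HTML form it serves, rather than being generated by a separate program.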

There are several advantages of using this technology with a mature display language, such as HTML. Some of them are:

1. The event and scripting models supported by visual browsers can be used by SALT applications to implement dialog flow and other forms of interaction processing without the need for extra markup.

2. The addition of speech capabilities to the visual page provides a simple and intuitive means of creating multimodal applications.

The SALT Forum describes SALT as a lightweight specification that adds a powerful speech interface to web pages while maintaining and leveraging all the advantages of the web application model.

Another salient feature of the technology is its DTMF and call-control capability: telephony browsers running voice-only applications can use it through a set of DOM object properties, methods, and events.

 

Functionality

The two major scenarios for the use of SALT are outlined below:

Multimodal: For multimodal applications, SALT can be added to a visual page to support speech input and/or output. This is a way to speech-enable individual controls for “push-to-talk” form-filling scenarios, or to add more complex mixed-initiative capabilities if necessary.

For example, a SALT recognition may be started by a browser event such as pen-down on a textbox. The event activates a grammar relevant to that textbox and binds the recognition result into it.
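The push-to-talk flow just described can be imitated in a few lines of Python. This is only a toy model under stated assumptions: TextBox, Listen, the "pendown" event name and the city list are all hypothetical, and the recognizer is replaced by a plain string comparison against the grammar.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class TextBox:
    """Hypothetical stand-in for a form field on the visual page."""
    name: str
    value: str = ""
    handlers: Dict[str, Callable[[], object]] = field(default_factory=dict)

    def fire(self, event: str) -> None:
        """Run the handler registered for a browser event, if there is one."""
        handler = self.handlers.get(event)
        if handler is not None:
            handler()


@dataclass
class Listen:
    """Hypothetical stand-in for a recognition object tied to one field."""
    grammar: Set[str]            # phrases the active grammar accepts
    target: TextBox              # field that receives the recognition result
    heard: Optional[str] = None  # stands in for what the user says

    def start(self) -> bool:
        """Recognize once; bind the result only if the grammar accepts it."""
        if self.heard is not None and self.heard.lower() in self.grammar:
            self.target.value = self.heard
            return True
        return False


city_box = TextBox("txtCity")
city_listen = Listen(grammar={"karachi", "lahore", "islamabad"}, target=city_box)
city_box.handlers["pendown"] = city_listen.start  # pen-down starts recognition

city_listen.heard = "Karachi"  # the user taps the field and speaks
city_box.fire("pendown")
print(city_box.value)
```

The point of the sketch is the wiring: the visual control owns the event, the event starts recognition, and the grammar decides whether anything is written back to the control.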

Telephony: For applications without a visual display, SALT manages the interactional flow of the dialog and the extent of user initiative by using the HTML eventing and scripting model. In this way, the full programmatic control of client-side (or server-side) code is available to application authors for the management of prompt playing and grammar activation.
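For the voice-only case, the script-driven dialog flow might look roughly like the following Python sketch: the script finds the first field that is still empty, plays its prompt, and stores the caller's reply, repeating a prompt when nothing usable is heard. The field names, prompt wording and run_dialog function are hypothetical, and the caller's replies are supplied as a list instead of coming from a recognizer.

```python
from typing import Dict, List, Tuple

# Hypothetical prompts, listed in the order the script should ask them.
PROMPTS: Dict[str, str] = {
    "origin": "Which city are you leaving from?",
    "destination": "Which city are you going to?",
}


def run_dialog(caller_replies: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Play prompts for unfilled fields until all are filled or the caller stops."""
    fields: Dict[str, str] = {name: "" for name in PROMPTS}
    played: List[str] = []
    replies = iter(caller_replies)
    while True:
        # Pick the first field that has no value yet.
        pending = next((name for name in PROMPTS if not fields[name]), None)
        if pending is None:
            break  # every field is filled, so the dialog is complete
        played.append(PROMPTS[pending])
        reply = next(replies, None)
        if reply is None:
            break  # no more input, e.g. the caller hung up
        fields[pending] = reply.strip()  # an empty reply leaves the field pending
    return fields, played


fields, played = run_dialog(["Karachi", "", "Lahore"])
print(fields)
print(played)
```

Here the second prompt is played twice because the first reply to it was silence, which is the kind of decision the article says is left to script code.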

 

Architecture

According to Microsoft, there are four key components to its implementation:

Web server: The web server generates web pages containing HTML, SALT, and embedded script. The script controls the dialog flow for voice-only interactions. For example, when a page contains several audio prompts, the script defines the order in which they are played to the caller.

Telephony server: The telephony server connects to the telephone network. It incorporates a voice browser that interprets the HTML, SALT, and script. The browser can run in a separate process or thread for each caller. The voice browser interprets only a subset of HTML, since much of HTML concerns the graphical interface and is not relevant to a voice browser.

Speech server: The speech server recognizes speech and plays audio prompts and responses back to the user.

Client device: Clients include, for example, a Pocket PC or desktop PC running a version of Internet Explorer capable of interpreting HTML and SALT.

 

Design principles

The design of SALT is based on the following fundamental principles:

Integration of speech with web pages: SALT is designed to support the DOM execution model for web pages. Because it integrates easily, it reuses the knowledge and skills web developers already have. This keeps the design of SALT simple, since it does not need to reinvent page execution or programming models.

Separation of the speech interface: SALT does not extend any individual markup language directly; rather, it applies the speech interface as a separate layer that is extensible across different markup languages.

The dialog framework that drives the SALT speech interface can be as loosely or as tightly coupled as necessary to the underlying data structure (for instance, an HTML form), so that speech and dialog components can be reused across pages and across applications.

Flexible programming model: Flexibility in programming the speech interface is crucial for high-quality speech applications. SALT offers fine-grained control of dialog execution through the DOM event and scripting model. This allows the SALT elements to remain simple and clear, while leveraging the benefits of a rich and well-understood execution environment.

Reusability: SALT reuses existing standards for speech output, grammar formats and semantic results, so it remains a lightweight application-level markup that builds on industry standards.

Compatibility: This technology is not designed for any particular device type, but rather for a range of architectural scenarios. Its way of adding speech to web pages is generic, so the whole continuum of devices from PCs to mobile devices to the telephone can be speech-enabled. For instance, personal computers might run all speech recognition and speech output processes locally, whereas smaller devices with limited processing capabilities will have a SALT-capable browser but use remote servers for speech recognition or synthesis services.

Traditional telephones without processors will make telephone calls to a server running a voice-only SALT browser. SALT will be modularized and page profiles will be defined according to the modal and environmental capabilities of clients.
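The three deployment scenarios in this principle can be summed up as a small decision rule. The Python sketch below is an illustration only: the Client fields and the SpeechPlan names are hypothetical labels for the cases the article describes, not part of SALT itself.

```python
from dataclasses import dataclass
from enum import Enum


class SpeechPlan(Enum):
    LOCAL = "browser and speech engines both run on the device"
    REMOTE_SPEECH = "browser runs on the device; speech services run on remote servers"
    VOICE_ONLY_SERVER = "device phones a server that runs a voice-only SALT browser"


@dataclass(frozen=True)
class Client:
    """Hypothetical description of a client device's capabilities."""
    name: str
    has_salt_browser: bool
    can_run_speech_locally: bool


def plan_for(client: Client) -> SpeechPlan:
    """Map a client's capabilities to one of the scenarios described above."""
    if not client.has_salt_browser:
        return SpeechPlan.VOICE_ONLY_SERVER  # e.g. a traditional telephone
    if client.can_run_speech_locally:
        return SpeechPlan.LOCAL              # e.g. a desktop PC
    return SpeechPlan.REMOTE_SPEECH          # e.g. a small handheld device


for device in (
    Client("desktop PC", True, True),
    Client("handheld", True, False),
    Client("telephone", False, False),
):
    print(device.name, "->", plan_for(device).value)
```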

Authoring across modes and devices: A further goal follows from all the above principles, and it is becoming increasingly important for web developers as the types and numbers of web client devices proliferate: minimizing the authoring overhead for different modes and different clients.

This enables two important classes of application scenario:

a. multimodal, where a visual page can be seamlessly enhanced with a speech interface on the same device; and

b. cross-modal, where a single application page can be reused for different modes on different devices, for instance, visual-only and voice-only. In this way, the principles of clean integration and speech interface separation allow maximum reuse of developers’ work.

SALT’s method of adding the speech mode to web pages reuses to the greatest extent possible the relevant pages, forms and fields, scripts and back-end logic of existing web applications.

The writer is a young scholar of SZABIST, Karachi.



© DAWN Group of Newspapers, 2005