Speech input

Voice or speech input permits the user to speak directly to a device, with no intermediate keying or handwriting steps. The ideals for a speech recognition system are speaker independence, continuous speech, a large vocabulary and natural language processing.

Speaker independence means that the system can accept and recognise, with high accuracy, the speech of many talkers, including voices that were not part of its training set. Speaker independent systems require no prior training for an individual user. In contrast, a speaker dependent system requires samples of speech from each individual user prior to system use (i.e. the user has to train the system).

Continuous speech allows a system to deal with words as normally spoken in fluent speech. Two other categories of speech recogniser are isolated word and connected word. Isolated word speech recognisers are the cheapest solution for voice input but require a short pause of approximately 1/5 second between each word. Connected word speech recognisers fall between isolated word and continuous speech recognisers. They recognise words spoken continuously, provided the words do not vary as they run together, i.e. they require clear pronunciation.

Guidelines for using speech input are as follows:

- Structure the vocabulary to give a small number of possible inputs at each stage (i.e. a low 'branching factor'). This will improve recognition accuracy.
- Try to base the input language on a set of acoustically different words. This will simplify training and make recognition more robust. Words which are clearly distinguishable as text or to the human ear are not necessarily so distinct to the speech recogniser.
- Users should be able to turn speech recognition on and off and fall back on more traditional input modes or on a human intermediary.
- Give the device a key phrase, e.g. "video, wake up", to put it into a standby mode in which it is ready to receive voice inputs. A similar phrase such as "video, sleep" could then be used to stop it reacting to inputs.
- Provide a keyword to halt or undo incorrectly interpreted actions.
- Provide adequate feedback on how the system has interpreted the user's voice input, using either a sound or a visual signal. If necessary, allow the user to correct errors before 'sending' the input, or to 'undo' previous inputs.
- Give users some degree of control over the size and content of the vocabulary, e.g. allow them to add user-defined synonyms and names.
- Sometimes the recogniser module will not be able to decide between two or more candidate words. In that case, give the user a choice list from which to confirm the intended input (tie breaking).
- To improve recognition accuracy, provide the user with a hand-held microphone (perhaps located within the remote controller).
- Structure the voice input so that only one- or two-word commands are required. This will avoid the need for users to speak longer passages with unnatural pauses between words. However, it may be useful to include one or two longer phrases which, as they contain more information, will be more distinguishable from the rest.

Speech output

Speech output can be used as a means of prompting user input, to provide instructions about using the system, or to explain a displayed item, e.g. a spoken commentary to accompany a picture of the Taj Mahal. To help users distinguish between different conditions of speech output (e.g. presentation of information, a prompt for input, or a warning), it is useful to employ a different voice for each condition.

It is important to give the user the option to adjust the volume of audio or speech output from within the program (not just the computer's set-up software), to turn it off completely, and to repeat the audio sequence.

Music can provide extra information. For example, a multimedia presentation about Mozart might include excerpts from his works to supplement the pictures and text, and one about John Kennedy might include short sections from his speeches to add impact. Relevant sounds can also be provided to add atmosphere to a video sequence, say of a jungle or dinosaur world.

Guidelines for using speech output are as follows:

- Speaking should be limited to about 45 seconds if it occurs without anything happening on the screen.
- Spoken sequences should be three or four sentences long so that they do not seem too abrupt in a multimedia context.
- Synthetic speech should be used if the text is generated at run time. Digitised speech recorded by professional speakers should be used for text which is known at design time.
- Use different voices in order to give the impression of a realistic scene or to clarify different contexts of information. For example, give warning messages and help messages different voices which can easily be matched to their meaning.
- Use original sound in order to achieve an authentic impression. For example, use the sound from a plant as background for an interview with workers in the plant.
- Show the current position and the total length of the speech sequence on a time scale.
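
Several of the speech input guidelines above (a small vocabulary, wake and sleep key phrases, an undo keyword, feedback on every input, and tie breaking) can be illustrated with a short sketch. It assumes a hypothetical recogniser that returns a list of (phrase, score) candidates; the command set, the "undo" keyword, the tie margin and the feedback strings are illustrative assumptions, not part of the guidelines.

```python
WAKE_PHRASE = "video, wake up"        # key phrases quoted in the guidelines
SLEEP_PHRASE = "video, sleep"
UNDO_KEYWORD = "undo"                 # assumed keyword, chosen for illustration
COMMANDS = {"play", "pause", "stop"}  # assumed small vocabulary (low branching factor)
TIE_MARGIN = 0.1                      # assumed: scores closer than this count as a tie


class VoiceController:
    """Turns recogniser candidates into actions and returns feedback text."""

    def __init__(self):
        self.listening = False  # the device starts asleep
        self.history = []       # accepted commands, most recent last

    def handle(self, candidates):
        # candidates: (phrase, score) pairs from a hypothetical recogniser.
        known = COMMANDS | {WAKE_PHRASE, SLEEP_PHRASE, UNDO_KEYWORD}
        ranked = sorted(
            (c for c in candidates if c[0] in known),
            key=lambda c: c[1],
            reverse=True,
        )
        best = ranked[0][0] if ranked else None

        # While asleep, react to nothing except the wake phrase.
        if not self.listening:
            if best == WAKE_PHRASE:
                self.listening = True
                return "listening"
            return "ignored (asleep)"

        if best is None:
            return "not recognised"

        # Tie breaking: offer a choice list when the top scores are too close.
        best_score = ranked[0][1]
        tied = [p for p, s in ranked if best_score - s < TIE_MARGIN]
        if len(tied) > 1:
            return "did you mean: " + " / ".join(tied)

        if best == WAKE_PHRASE:
            return "listening"
        if best == SLEEP_PHRASE:
            self.listening = False
            return "asleep"
        if best == UNDO_KEYWORD:
            if not self.history:
                return "nothing to undo"
            return "undone: " + self.history.pop()

        self.history.append(best)
        return "accepted: " + best
```

Every call returns a feedback string, so the caller can always show or speak how the input was interpreted. In a real device the choice list would be presented to the user for confirmation rather than returned as text.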
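
The speech output guidelines above can likewise be read as a small set of rules: one voice per condition, recorded speech for text known at design time, synthetic speech for text generated at run time, and a limit of about 45 seconds when nothing is happening on the screen. The sketch below is one possible encoding of those rules; the voice names, the `SpeechItem` fields and the `plan` function are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

MAX_SECONDS_WITHOUT_VISUALS = 45  # "about 45 seconds" in the guidelines

# Assumed voice names: one distinct voice per condition of speech output.
VOICES = {
    "information": "narrator",
    "prompt": "assistant",
    "warning": "alert",
}


@dataclass
class SpeechItem:
    text: str
    condition: str              # "information", "prompt" or "warning"
    known_at_design_time: bool  # False if the text is generated at run time
    seconds: float              # estimated spoken duration
    screen_active: bool         # True if something happens on screen meanwhile


def plan(item):
    """Return the voice, the speech source and a length warning for one item."""
    # Recorded speech for fixed text, synthetic speech for run-time text.
    source = "recorded" if item.known_at_design_time else "synthetic"
    # Long speech is only flagged when the screen is otherwise static.
    too_long = (not item.screen_active
                and item.seconds > MAX_SECONDS_WITHOUT_VISUALS)
    return {
        "voice": VOICES[item.condition],
        "source": source,
        "too_long": too_long,
    }
```

For example, a run-time warning lasting 5 seconds would be planned with the "alert" voice and synthetic speech, while a 60-second recorded commentary over a static screen would be flagged as too long.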
Copyright EMMUS 1999.