The basic strategy in auditory rehabilitation is to improve speech intelligibility. Devices such as the hearing aid (HA), the bone-anchored hearing aid (BAHA, a type of HA based on bone conduction), and the middle ear implant (MEI, a direct-drive implantable middle ear device that mechanically stimulates the ossicles, mimicking the natural hearing process) increase the energy of the sound vibrations transmitted to the damaged inner ear. In contrast, the cochlear implant (CI), auditory brainstem implant (ABI), and auditory midbrain implant (AMI) stimulate the auditory system electrically. The CI restores hearing by direct electrical stimulation of the cochlear nerve, whereas the ABI and AMI directly stimulate the auditory pathway at the cochlear nucleus and the inferior colliculus of the midbrain, respectively. These electrical stimulation devices can considerably improve auditory information, but the result is usually insufficient. Therefore, rehabilitation, and sometimes visual information (known as the lip-reading effect), are usually necessary to restore auditory communication to an adequate level.
Therefore, it is important to improve both the quality of the auditory information that each prosthesis can provide ((1) in Fig. 10.1) and the rehabilitation process for individual patients ((2) in Fig. 10.1). Moreover, the complementary role of visual cues is also important ((3) in Fig. 10.1). The “lip-reading” phenomenon is well known in patients with degraded speech perception; i.e., speech perception that is reduced under poor auditory conditions, such as background noise or hearing loss, is improved by the combined presentation of visual speech [12, 15]. If the degraded speech can be perceived as bimodal audio-visual stimuli, the visual information from the speaker’s face can be effectively utilized to compensate for the inadequate auditory information [2, 9, 13]. In addition to such conventional lip-reading, audio-visual speech has another beneficial role in the auditory rehabilitation process; i.e., the visual cue enhances auditory adaptation to the degraded speech sound.
Here, these two aspects of audio-visual speech in auditory rehabilitation are reviewed.
10.2 Recruitment of Visual Cues in Degraded Speech Conditions
Perception of external signals is followed by integration of information from multiple sensory modalities in the brain. Such multimodal processing results in fast and accurate recognition of the perceived signals. Speech perception effectively utilizes visual information from the speaker’s face not only in patients with hearing loss but also in normal-hearing subjects; i.e., speech perception under degraded conditions such as background noise can be improved by visual information obtained from the speaker’s face [12, 15]. Therefore, visual cues (the speaker’s face) presented together with auditory cues (the speech sound) can be utilized to complement the auditory information whenever they are available. However, the degree of recruitment of visual cues will depend on the degree of deterioration of speech perception.
Positron emission tomography (PET) was used to evaluate whether this recruitment of visual cues activates additional brain areas when the auditory input is degraded, as presented in Fig. 10.2. This PET study compared brain activation caused by the presentation of a visual cue (facial movement during speech) with control conditions (visual noise) under two different audio conditions, normal speech and degraded speech. Lip-reading for degraded speech caused more activation than for normal speech in areas V2 and V3 of the visual cortex as well as in the right fusiform gyrus of the temporal lobe (see Kawase et al. [2005] for details). The right fusiform gyrus of the temporal lobe is well known as the fusiform face area (FFA). The FFA, together with the inferior occipital gyri and the superior temporal sulcus, is one of the three important regions in the occipitotemporal visual extrastriate cortex related to human face perception [3–8, 11]. Activation of the FFA during audio-visual speech perception is therefore to be expected. This study indicated that the degree of FFA activation depends on the degree of degradation of the auditory cues, consistent with the hypothesis that more visual information than usual is recruited when auditory information is degraded.
Fig. 10.2 Additional recruitment of brain areas caused by degradation of auditory input (unpublished figure using our previously published data). Positron emission tomography (PET) was used to compare brain activation caused by the presentation of a visual cue (facial movement during speech) with control conditions (visual noise) under two different audio conditions, normal speech and degraded speech. Significant brain activation during lip-reading is shown for degraded speech compared with normal speech. Suprathreshold voxels (P < 0.001, uncorrected for multiple comparisons, k > 20 voxels) are superimposed on the 3D-rendered surface image. Lip-reading for degraded speech caused more activation than for normal speech in areas V2 and V3 of the visual cortex as well as in the right fusiform gyrus of the temporal lobe (see Kawase et al. 2005 for details)
10.3 Auditory Training with Bimodal Audio-Visual Stimuli
These investigations of the perception of bimodal audio-visual stimuli under degraded speech conditions show that visual information from the speaker’s face can be effectively utilized to make up for inadequate auditory information. Therefore, the combined presentation of visual speech information is important for speech communication under degraded auditory conditions, such as background noise, and in patients with hearing loss.
On the other hand, audio-visual speech cues have another beneficial role in the auditory rehabilitation process; i.e., the visual cue enhances auditory adaptation to the degraded speech sound. In that study, auditory training was examined in normal volunteers using highly degraded noise-vocoded speech sound (NVSS), which is often used to simulate the effects of cochlear implant processing on speech [1, 14].