Abstract
Oral squamous cell carcinoma (OSCC) and its treatment impair speech intelligibility by alteration of the vocal tract. The aim of this study was to identify the factors of oral cancer treatment that influence speech intelligibility by means of an automatic, standardized speech-recognition system. The study group comprised 71 patients (mean age 59.89 years, range 35–82 years) with OSCC ranging from stage T1 to T4 (TNM staging). Tumours were located on the tongue (n = 23), lower alveolar crest (n = 27), and floor of the mouth (n = 21). Reconstruction was conducted through local tissue plasty or microvascular transplants. Adjuvant radiotherapy was performed in 49 patients. Speech intelligibility was evaluated before, and at 3, 6, and 12 months after tumour resection, and compared to that of a healthy control group (n = 40). Postoperatively, significant influences on speech intelligibility were tumour localization (P = 0.010) and resection volume (P = 0.019). Additionally, adjuvant radiotherapy (P = 0.049) influenced intelligibility at 3 months after surgery. At 6 months after surgery, influences were resection volume (P = 0.028) and adjuvant radiotherapy (P = 0.034). The influence of tumour localization (P = 0.001) and adjuvant radiotherapy (P = 0.022) persisted after 12 months. Tumour localization, resection volume, and radiotherapy are crucial factors for speech intelligibility. Radiotherapy significantly impaired word recognition rate (WR) values with a progression of the impairment for up to 12 months after surgery.
The therapy of patients suffering from oral squamous cell carcinoma (OSCC) focuses on the functional and aesthetic rehabilitation of swallowing, chewing, speech, and facial appearance. Due to the anatomical complexity of the orofacial region, reconstruction aimed at rehabilitating these functions remains a major challenge. Speech in particular is a key human resource and an important means of social interaction, and it correlates highly with the patient’s quality of life. OSCC and its treatment are known to cause persistent speech disorders. Speech intelligibility can be altered by the tumour itself, by surgical resection and reconstruction, and/or by radiotherapy/radiochemotherapy.
A variety of factors influence speech intelligibility in patients undergoing tumour resection in the oral cavity. Still, the influence of single treatment parameters on speech intelligibility remains controversial owing to a lack of objective speech analysis data and of adequate study designs. In fact, a method that allows for the objective and independent assessment of speech has been missing so far, and there is no standardized method for the assessment of speech disorders in adults or children at the national or international level. A recent literature review emphasized this lack of uniformity and consensus in speech evaluation with regard to patients with oral and oropharyngeal cancer. Thus, questionnaires on speech outcome still represent the most common tool for the postoperative evaluation of speech in oral cancer patients and are used in over 50% of publications in this field. Additionally, auditory-perceptual evaluation by speech therapists is the state of the art in analysing speech intelligibility as the overall phonetic outcome of patients suffering from OSCC. Yet the assessment of speech disorders or intelligibility by professionals remains subjective and shows limited reliability because of differences in speech therapist experience and inconsistent test conditions. This is accompanied by a certain lack of repeatability of the evaluation results. Subjective ratings of speech intelligibility are found in the literature, but owing to the above-mentioned limitations they still lack uniformity and generalizability with regard to the global outcome of speech. Transcription tasks and multiple-choice tasks performed by multiple evaluators are considered suitable for obtaining more reliable results. Still, the use of multiple listeners is rather time-consuming and has thus mainly been used in research projects and not under clinical conditions.
To overcome these problems, tests with semi-standardized diagnostic tools for the assessment of speech outcome after treatment for oral carcinomas have been conducted. However, these objective and independent diagnostic tools for the assessment of speech outcome have only been applied to single acoustic patterns of distorted speech.
A new computer-based technique for the objective evaluation of speech intelligibility has been introduced by our work group as a diagnostic tool in adult patients who suffer from neurological diseases or who stutter, in laryngectomees with tracheoesophageal speech, in children with cleft lip and palate, and in patients wearing dental prostheses, with high correlations to expert listener ratings. In particular, the validation of this system in the area of oral and maxillofacial surgery demonstrated a significant correlation (P < 0.01) between the experts’ ratings of intelligibility and the automatic assessment of word accuracy, with a Pearson’s rank coefficient of −0.93. Promising preliminary results in this medical field regarding automatic speech recognition have recently been published by our research group. However, prospectively assessed objective speech intelligibility data of patients with OSCC before and after treatment have been missing so far.
In the current study, speech intelligibility was prospectively evaluated in patients with OSCC, before and for up to 12 months after surgical resection, by means of automated speech analysis; further focus was placed on T-stage, tumour localization, resected tissue volume, hard tissue resection, the surgical defect reconstruction technique, and adjuvant radiotherapy and their influence on speech outcome.
Materials and methods
Patients
The patient cohort consisted of 71 patients (16 females/55 males, mean age 59.89 ± 10.11 years) suffering from OSCC (Table 1). Preoperative classification of the cohort by tumour localization showed 23 patients with cancer of the tongue, 21 with cancer of the anterior and lateral floor of the mouth, and 27 with cancer of the lower alveolar crest. The TNM classification showed 32 patients with stage T1 cancers, 27 with T2, and 12 with T4. T3 carcinomas were not present in our cohort. All patients underwent tumour resection at our clinic, with a safety margin around the tumour of 1 cm. In 37 patients, a hard tissue resection of the lower jaw was conducted due to tumour infiltration into the jaw bone. With regard to surgical reconstruction, local tissue plasty was performed in 10 patients, a microvascular reanastomosed soft tissue flap plasty in 37 patients, and a combined microvascular reanastomosed soft and hard tissue flap plasty in 24 patients. Soft tissue reconstruction was performed using either a radial forearm flap (n = 30) or a lateral arm flap (n = 7).
Demographic data | |
Patients ( n = 71), n | |
Female | 16 |
Male | 55 |
Controls ( n = 40), n | |
Female | 10 |
Male | 30 |
Patient age, years, mean ± SD, range | |
Female | 60.81 ± 10.97, 35–76 |
Male | 59.62 ± 9.94, 39–82 |
Overall | 59.89 ± 10.11, 35–82 |
Control age, years, mean ± SD, range | |
Female | 62 ± 11, 34–82 |
Male | 53 ± 11, 44–79 |
Overall | 59 ± 12, 34–82 |
Patient clinical data | |
Tumour localization, n | |
Tongue | 23 |
Floor of mouth | 21 |
Lower alveolar crest | 27 |
T-stage, n | |
T1 | 32 |
T2 | 27 |
T3 | 0 |
T4 | 12 |
Resected volume, n | |
0–27 cm³ | 31 |
27–125 cm³ | 32 |
>125 cm³ | 8 |
Hard tissue resection, n | |
Yes | 37 |
No | 34 |
Surgical reconstruction, n | |
Local tissue plasty | 10 |
With hard tissue resection | 3 |
Without hard tissue resection | 7 |
Microvascular soft tissue transplants | 37 |
With hard tissue resection | 10 |
Without hard tissue resection | 27 |
Microvascular combined tissue transplants | 24 |
With hard tissue resection | 24 |
Without hard tissue resection | 0 |
Adjuvant radiotherapy, n | |
Yes | 49 |
No | 22 |
Adjuvant radiotherapy by T-stage, n | |
T1 | 16 |
T2 | 21 |
T3 | 0 |
T4 | 12 |
To calculate the resection volume of the cancer, the excised tissue volume was determined in accordance with the standardized measurement protocol used by the staff of the department of pathology. In order to classify the cancer volume into reproducible groups, the TNM classification of carcinomas in the oral cavity (T1: <2 cm; T2: 2–4 cm; T3/T4: >4 cm) was used as a basis for calculating the corresponding tumour size. Since the resection safety margin adds 1 cm in each dimension to the resected volume, 1 cm was added to the TNM-dependent size limits before raising them to the third power ((2 + 1)³ = 27 cm³; (4 + 1)³ = 125 cm³), resulting in the following groups: group 1: 0–27 cm³; group 2: 27–125 cm³; group 3: >125 cm³.
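The grouping rule can be illustrated with a short sketch. The function and constant names are hypothetical, and the handling of values that fall exactly on a boundary (27 or 125 cm³) is an assumption, since the text lists these values in two adjacent groups.

```python
# Sketch of the resection volume grouping; names are illustrative only.
TNM_SIZE_LIMITS_CM = (2.0, 4.0)  # T1/T2 and T2/T3-T4 size limits from the text
SAFETY_MARGIN_CM = 1.0           # margin added to each limit, as stated in the text


def volume_thresholds():
    """Cube (size limit + margin) to obtain the group boundaries in cm^3."""
    return [(limit + SAFETY_MARGIN_CM) ** 3 for limit in TNM_SIZE_LIMITS_CM]


def volume_group(volume_cm3):
    """Return group 1, 2 or 3 for a resected volume in cm^3.

    Assumption: a value exactly on a boundary is placed in the lower group.
    """
    lower, upper = volume_thresholds()
    if volume_cm3 <= lower:
        return 1
    if volume_cm3 <= upper:
        return 2
    return 3


if __name__ == "__main__":
    print(volume_thresholds())
    print(volume_group(100.0))
```

Deriving the boundaries from the size limits, instead of hard-coding 27 and 125, keeps the link to the TNM categories visible.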
To identify the impact of different factors on speech intelligibility at each study group recording session, speech outcome was analysed according to T-stage, tumour localization, resected tissue volume, hard tissue resection, surgical defect reconstruction technique, and adjuvant radiotherapy.
All patients provided written consent for their participation in the present investigation. The study respected the principles of the ethics committee in charge, as well as the Declaration of Helsinki of 1975/1983, and was approved by the local ethics committee.
Control group
Forty healthy subjects (10 females, 30 males) with a mean age of 59 ± 12 years and without any speech disorders or oral diseases served as the control group (Table 1). All subjects in the control group were native German speakers and spoke in a local dialect similar to that of the patient group.
Treatment protocol
Patients were treated according to a standard protocol: radical surgical resection of the pathological tissue with a safety margin of 1 cm in all dimensions around the tumour, a modified radical neck dissection on the ipsilateral side, and a suprahyoid neck dissection on the contralateral side. If the frozen section showed the lymph nodes on the contralateral side to be affected, a modified complete neck dissection was additionally conducted. According to the surgical records, cervical branches of the cranial nerves were preserved in all patients included in the study. Reconstruction of the defective area was performed by means of local tissue plasty, microvascular reanastomosed soft tissue plasty, or combined microvascular reanastomosed soft/hard tissue transplants, depending on the type and volume of the anatomical structure to be reconstructed. The soft tissue flaps were taken from the radial forearm or the upper lateral arm region. Microvascular reanastomosed scapular/parascapular flaps were used for combined soft and hard tissue reconstructions.
The therapy provided after surgical resection of the tumour was specified by an interdisciplinary tumour board, including a radio-oncologist. Adjuvant radiotherapy was carried out depending on the tumour size, tumour infiltration, and whether the cervical lymph nodes were affected. Chemotherapy or combined radiochemotherapy was not suggested by the interdisciplinary tumour board for any of the patient cases in this study. During the first 6 months after cancer treatment, all patients were referred to an outpatient speech therapist.
Speech data
The first speech data were recorded after admission of the patient to the hospital, prior to the surgical intervention. The second speech recording was conducted 14–20 days after the tumour resection, prior to discharge. The three follow-up recordings, at 3, 6, and 12 months after the surgical tumour intervention, were collected during regular oncology outpatient examinations. The participants were recorded reading a standardized text out loud: the German version of the text ‘The North Wind and the Sun’, a fable by Aesop that is the reference text of the International Phonetic Association and is widely used for phonetic research on an international level. This phonetically balanced text contains 108 words, of which 71 are unique, and 172 syllables. It includes all possible phonemes of the German language. For the speech recording procedure, the text was divided into 10 sequences (10.8 ± 2.4 words each) according to syntactic boundaries and displayed on a computer screen in large, easy-to-read letters. The recording software automatically segments the audio data according to these boundaries. A single speech recording session, including the speech quality analysis, took 15 min per patient on average. All speech samples were recorded at 16 kHz with 16-bit quantization, using a close-talking microphone (Call4U Comfort-Headset, DNT GmbH, Dietzenbach, Germany). All participants were native German speakers. Criteria for exclusion were speech disorders caused by medical pathologies other than oral carcinomas, recurrent tumours of the oral cavity, hearing impairments, and permanent tracheostomy after the treatment.
Automatic speech recognition system and standardized speech assessment
For measuring intelligibility on an objective and independent basis, a state-of-the-art speech recognition system developed at the Chair for Pattern Recognition, Department of Computer Science, University of Erlangen-Nuremberg (version 1.7.4), as described in detail in Stemmer, was used in the present study (PEAKS: Programme for Evaluation and Analysis for all Kinds of Speech Disorders). The automatic speech recognition system was trained with all phonemes of the German language as well as with its morphosyntactic rules. Speech samples from the German Verbmobil project (11,714 utterances, 257,810 words) with a total of 27 h of speech from 578 training speakers (304 male, 274 female) were used for the training set, and 48 utterances (1042 words) for the validation set. The speakers were from all over Germany and thus covered most regional dialects. The automatic, standardized speech recognition (ASR) system uses triphone models, and recognition is performed with semi-continuous hidden Markov models (SCHMMs). The computed temporal and spectral characteristics of the speech signal are compared to word models derived from acoustic speech samples. These models describe the likelihood of an acoustic signal corresponding to a certain phoneme, from which the probability of each word can be obtained. Finally, the recognized word chain is calculated as the sequence of words most likely to match the spoken text. The codebook contains 500 full covariance Gaussian densities, which are shared by all HMM states. A unigram language model was used, which assumes that the current word is independent of previously spoken words. The only linguistic parameter included is the frequency of each word in the recognition vocabulary. Thus, recognition mainly depends on the acoustic signal of each single word. For the purpose of this study, the system was validated for the study cohort, showing high correlations with the subjective ratings of an expert group.
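The unigram assumption can be made concrete with a minimal sketch. It is not the PEAKS implementation: the function names and the toy word list are hypothetical, and the acoustic scores that the recognizer combines with the language model are left out entirely.

```python
import math
from collections import Counter


def unigram_log_probs(training_words):
    """Estimate log P(word) from relative word frequencies alone."""
    counts = Counter(training_words)
    total = sum(counts.values())
    return {word: math.log(count / total) for word, count in counts.items()}


def sequence_log_prob(words, log_probs):
    """Score a word sequence as a sum of independent per-word log probabilities.

    Because no word depends on its predecessors, word order does not
    affect the language model score.
    """
    return sum(log_probs[word] for word in words)


if __name__ == "__main__":
    # Hypothetical toy vocabulary, not taken from the study material.
    model = unigram_log_probs(["wind", "sonne", "wind", "mantel"])
    print(sequence_log_prob(["wind", "sonne"], model))
    print(sequence_log_prob(["sonne", "wind"], model))
```

Both calls print the same score, which illustrates why recognition with such a model depends mainly on the acoustic signal of each word: the language model contributes no contextual information that could compensate for a poorly articulated word.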
We computed the so-called ‘word recognition rate’ (WR) of each patient’s speech data using the PEAKS evaluation software. The WR describes the percentage of correctly recognized words in the entire text. It is calculated as follows: WR (%) = C / R × 100%, where C is the number of correctly recognized words and R is the number of words in the reference text.
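The WR formula can be sketched as follows. How PEAKS aligns the recognized word chain with the reference text is not specified here, so this sketch assumes that C is the length of the longest common subsequence of the two word lists; the function names and the whitespace tokenization are likewise illustrative.

```python
def count_correct_words(reference, recognized):
    """Return the longest common subsequence length of two word lists.

    Assumption: this stands in for the number of correctly recognized
    words C; the alignment used by the actual system may differ.
    """
    rows, cols = len(reference) + 1, len(recognized) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if reference[i - 1] == recognized[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def word_recognition_rate(reference_text, recognized_text):
    """Compute WR (%) = C / R x 100 for whitespace-separated texts."""
    reference = reference_text.lower().split()
    recognized = recognized_text.lower().split()
    if not reference:
        raise ValueError("reference text must contain at least one word")
    correct = count_correct_words(reference, recognized)
    return 100.0 * correct / len(reference)


if __name__ == "__main__":
    # Hypothetical four-word reference with one misrecognized word.
    print(word_recognition_rate("a b c d", "a x c d"))
```

Because R is the number of words in the reference text, the WR stays between 0% and 100% and does not penalize additionally inserted words.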
Statistics
The Levene test was used to test the homogeneity of variance and the Shapiro–Wilk test to test for normal distribution. Unless indicated otherwise, the statistical preconditions were met for all tests conducted. Multiple stepwise linear regression analysis was carried out to model the relationship between the WR and multiple explanatory variables. Student’s t-test was used to compare the WR values of the patients with those of the control group. Comparison of the WR values between subgroups was conducted by univariate analysis of variance (ANOVA), followed by post hoc tests with Bonferroni correction. P-values equal to or less than 0.05 were considered statistically significant. All statistical tests were performed using SPSS version 19 for Windows (IBM, Armonk, NY, USA).