Measuring orthodontic treatment impact: Description or judgment, challenge or result


Determination of improvement in orthodontic treatment may depend on the measurement method used and the purpose.


Improvement after orthodontic treatment (from T1 to T2 [beginning to end of treatment]) was assessed 3 ways from a set of 98 patient records: (1) calculated by subtracting judges’ assessments at T2 from T1 for records presented in random order, (2) judged as a holistic impression viewing T1 and T2 records side by side, and (3) determined from proxies (American Board of Orthodontics Discrepancy Index, the American Board of Orthodontics Objective Grading System, and the Peer Assessment Rating index).


High levels of intramethod consistency were observed, with intraclass correlation coefficients clustering around 0.900, and distributions were normal. Calculated and judged improvements correlated at r = 0.606. Calculated or judged improvements were correlated at a lower level with proxies. Calculated improvement was significantly associated with “challenge” (T1) scores and judged improvement associated with “results” (T2) scores. Common method bias was observed, with higher correlations among similar indexes than among indexes at the same time that used various methods. Relative to differences in Peer Assessment Rating scores, calculated improvement overestimated low scores and underestimated high ones. The same effect, but statistically greater, was observed using direct judgment of improvement.


These findings are consistent with decision science and measurement theory. In some circumstances, such as third-party reimbursement and research, operationally defined measures of occlusion are appropriate. In practice, the determination of occlusion and improvement is best performed by judgment that naturally corrects for biases in proxies and incorporates background information.


  • Proxies and professional judgment both contained large measurement error.

  • Professional judgment deviated in ways consistent with measurement and decision theory.

  • When T1 and T2 records are considered together, T2 is given more weight.

  • Independent assessment with numerical calculation of improvement gives T1 more weight.

It is accepted that generalizations regarding patients’ oral characteristics include sampling and measurement variances. The same is true when data from individual patients are obtained and interpreted, minus interpatient variance. Less often studied is the variation across practitioners who interpret and treat on the basis of their perceptions and interpretations of patient data. The extent and nature of such operator variation is the focus of this study.

There is literature on the characteristics of operationally defined proxies for malocclusion. Early indexes such as the Index of Complexity, Outcome and Need, the Occlusion Index, the Dental Aesthetic Index, and the system by Yuliya et al for clinical evaluation have been proposed with the hope of substantially replacing subjective assessment of treatment need using objective standards. In wider use today are the Peer Assessment Rating (PAR) and the American Board of Orthodontics’ (ABO) Discrepancy Index (DI) and Objective Grading System (OGS). Whereas DI was specifically designed as a measure of T1 treatment need and OGS as a measure of outcomes at T2, PAR has been used to assess both pretreatment and posttreatment occlusion (T1 and T2 [beginning to end of treatment]). Easily measured features, such as commonly used indexes, that stand in for more complex assessments are called proxies.

Validation of these proxies has focused on assessments of internal reliability and correlation with clinicians’ judgments taken as the “gold standard.” There is little research, however, clarifying the extent and nature of the variation within the “gold standards,” despite reports of intra- and interjudge differences. At the same time, there is a body of literature indicating that orthodontists exhibit a wide range of clinical decisions given identical treatment records.

This study sought to contrast operationally defined characteristics of occlusion with professional judgments. Specifically, comparisons were made among 3 assessments of “the same” patient trait: (1) clinical assessment based on calculated differences, (2) judgments of treatment improvement in side-by-side records at T1 and T2, and (3) operationally defined proxies. For all 3 measures, the same sets of records were used. Thus, differences in characterization of the effects of treatment reflect differences in the processes by which practitioners arrive at their opinions regarding improvement in care. The relationships among the 3 assessments of improvement in occlusion after treatment are viewed in the context of the contemporary field of human decision theory and classical measurement theory.

It is hypothesized that these 3 measures of improvement from treatment will provide different pictures of the same patients from the same records. It is further hypothesized that these differences will be consistent with current theory regarding human judgment.

Material and methods

Evaluations were performed on a sample of 98 records of adolescent and adult Chinese patients. A staggered sampling procedure was used, beginning with 2383 patients drawn from 6 orthodontic treatment centers in China, each providing T1 and T2 records for at least 300 patients. A stratified random sample was taken of 108 patients. The sample contained 68 female and 30 male patients. Their average age was 15.8 years, with a standard deviation of 4.1. Neither of these patient characteristics was associated with any measures of occlusion or improvement in occlusion. The sample also included approximately equal numbers of patients with Angle Class I, II, and III occlusions. Inclusion criteria were full pretreatment and posttreatment records consisting of digital dental scans, intraoral and extraoral photographs, a panoramic radiograph, and a cephalometric radiograph; patients who had undergone surgery or who had a craniofacial syndrome were excluded. Because 10 models were broken or lost, the final number of photographed models available for evaluation was 98.

These records were scored by 3 orthodontic residents according to DI and PAR standards for T1 records and the OGS and PAR for T2 (proxy measurements). Training and calibration are described in a previous article by Liu et al. Residents were used to assess the proxy scores because these are regarded as operationally defined. The analysis reported in this article was conducted on the average of the 3 ratings for each subject.

Professional opinions of improvement from treatment were assessed by orthodontists in 2 ways. Calculated improvement was determined by 15 orthodontists practicing in the United States who evaluated the malocclusion of 98 subjects from before and after records. All 196 records (98 each for T1 and T2) were presented in fully randomized order, and there were no markings indicating patient identification or whether the record was before or after treatment. Calculated improvement was determined by a researcher (S.E.H.) reordering the dataset and subtracting the T2 scores from the T1 scores. The same 5-point scale was used for all presented patients: 1 = mild, 2 = mildly moderate, 3 = moderate, 4 = moderately severe, and 5 = severe. The judges were all faculty members in the [Arthur A. Dugoni School of Dentistry, University of the Pacific]. All were ABO diplomates, and 13 of the 15 had at least 10 years of clinical practice experience.
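The calculation described above can be sketched as follows. This is only a minimal illustration, assuming a hypothetical list of blinded ratings (judge, patient, time point, severity on the 1 to 5 scale); the record layout, the function name `calculated_improvement`, and all values are invented for illustration and are not the study's data or software.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical blinded ratings: (judge_id, patient_id, time_point, severity 1-5).
# Values are invented; the study's data are not reproduced here.
ratings = [
    ("J1", "P01", "T1", 4), ("J1", "P01", "T2", 2),
    ("J2", "P01", "T1", 5), ("J2", "P01", "T2", 2),
    ("J1", "P02", "T1", 3), ("J1", "P02", "T2", 1),
    ("J2", "P02", "T1", 2), ("J2", "P02", "T2", 2),
]

def calculated_improvement(ratings):
    """Re-pair T1 and T2 ratings, then return the mean (T1 - T2) per patient."""
    paired = defaultdict(dict)  # (judge, patient) -> {"T1": score, "T2": score}
    for judge, patient, time_point, severity in ratings:
        paired[(judge, patient)][time_point] = severity

    per_patient = defaultdict(list)
    for (judge, patient), scores in paired.items():
        # Positive values mean the malocclusion was rated less severe at T2.
        per_patient[patient].append(scores["T1"] - scores["T2"])

    # Average the per-judge differences to get one score per patient.
    return {patient: mean(diffs) for patient, diffs in per_patient.items()}

improvement = calculated_improvement(ratings)
```

A negative value under this convention would correspond to a patient rated as worse after treatment.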

Judged improvement from orthodontic treatment was determined by 12 practitioners who evaluated 98 of the same records a year later. Ten records were lost during the intervening year between the 2 forms of testing. An analysis was performed on the judgments of malocclusion from the original study of the 98 records used in this study. The average score for patients was 3.19 in the calculated condition and 3.28 in the judgment condition (t = 0.956). A 9-point scale was used in the study based on side-by-side judgment, anchored at 0 = markedly worse and 8 = markedly improved, with 4 = no difference.

The scores of the 15 judges in the calculated improvement group were averaged, as were the scores for the 12 judges in the judged improvement group. A post hoc analysis suggested that the possibility that the results were attributable to judge dropout was negligibly small. The number of judges in both cases was determined by extending the observed intraclass correlation coefficient (ICC) score from the study by Liu et al using the Spearman-Brown formula with the standard of ICC = 0.900.
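To illustrate how the Spearman-Brown formula can be used to size a panel of judges, the sketch below projects the reliability of the mean of k judges from a single-judge reliability and solves for the smallest k that reaches a target of 0.900. The single-judge ICC of 0.40 is a hypothetical value chosen only for illustration, since the observed value from Liu et al is not given here, and the function names are likewise assumptions.

```python
import math

def spearman_brown(single_icc, k):
    """Projected reliability of the mean of k judges (Spearman-Brown formula)."""
    return k * single_icc / (1 + (k - 1) * single_icc)

def judges_needed(single_icc, target=0.900):
    """Smallest whole number of judges whose mean reaches the target reliability."""
    # Spearman-Brown solved for k: k = R(1 - r) / (r(1 - R)).
    k = target * (1 - single_icc) / (single_icc * (1 - target))
    return math.ceil(k - 1e-9)  # small tolerance guards against float overshoot

# Hypothetical single-judge ICC; not a value reported in this article.
single_icc = 0.40
k = judges_needed(single_icc)
```

Because the projected reliability rises with diminishing returns as judges are added, small changes in the single-judge value can change the required panel size considerably.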

This research was approved by the institutional review board at [Arthur A. Dugoni School of Dentistry, University of the Pacific], #18-117 and #20-125.

Statistical analysis

A single dataset was created containing age, sex, and Angle classification for each of the 98 subjects included in all 3 studies. T1, T2, and calculated improvement and directly judged improvement (expressed as raw and normalized scores) were also available for each subject. Finally, the proxies of total DI score and the scores for 12 of the subscales, total OGS and 7 subscales, and PAR scores and 8 subscales for T1 (PAR1) and T2 (PAR2) were included in the dataset.

All variables were evaluated by skew tests and goodness of fit using chi-square analysis for fitting the normal distribution. Means and standard deviations were calculated. Tests of hypotheses were performed using significance of correlation coefficients and controlled-entry multiple regression analysis. The fact that scales with different numbers of potential values were used in all 3 sets of measures did not affect the analyses because correlation and multiple regression are unit-free tests and the calculated and judged scores were normalized before analysis.
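The point that correlation is unit-free can be checked with a small sketch. The example below standardizes scores to z-scores and computes the Pearson coefficient as the mean product of z-scores; the scores are hypothetical, and the helper names `z_scores` and `pearson_r` are assumptions for illustration rather than the software used in the study.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def pearson_r(x, y):
    """Pearson correlation computed as the mean product of z-scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

# Hypothetical scores for five patients (not study data).
calculated = [1.0, 2.5, 1.5, 3.0, 2.0]  # T1 - T2 difference on the 5-point scale
judged = [5.5, 7.0, 6.5, 7.5, 6.0]      # 9-point side-by-side scale

r_raw = pearson_r(calculated, judged)
# Linearly rescaling one variable (here 0-8 mapped onto 0-100) leaves r unchanged.
r_rescaled = pearson_r(calculated, [12.5 * v for v in judged])
```

The same invariance holds for any positive linear rescaling of either variable, which is why scales with different numbers of categories can be correlated directly.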


The internal consistency of the measurement scales used in this study (judged improvement, calculated improvement, DI, OGS, and PAR proxies measured at T1 and T2) was determined by calculating the ICC. These values are shown in Table I and uniformly clustered around 0.900. The distributions of scores on these measures were all normal. That was, however, not the case for subscale scores on the DI. Crossbite, open bite, and ANB were markedly skewed in the positive direction. The appropriate adjustment was made in analyzing these data, as indicated in Tables II and III, but these variables did not figure in the overall findings of the study. Although both T1 and T2 records were presented in completely random order for determining calculated improvement, it was obvious in some cases, for example, those in which extraction was performed, which ones were posttreatment patients. However, 2% of the cases were calculated to have been worse after treatment. Reverse order of treatment effect was less common using indexes in which 3 of the 98 scores between the average PAR scores at T2 were lower than those at T1. Only a single case of 1176 (98 patients × 12 professionals) was judged to have been worse when the cases were compared side by side.

Table I
Descriptive characteristics of improvement dataset

| Source | Mean | SD | ICC | Comp | DI | PAR1 | OGS | PAR2 | PAR1 − PAR2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Judged | 6.427 | 0.587 | 0.881 | 0.606∗∗∗ | 0.288∗∗ | 0.196 | −0.324∗∗∗ | −0.291∗∗ | −0.152 |
| Computed | 1.642 | 0.690 | 0.914 | | 0.621∗∗∗ | 0.370∗∗∗ | −0.119 | −0.046 | −0.044 |
| DI | 10.037 | 4.702 | 0.919 | | | 0.177 | 0.633∗∗∗ | 0.271∗∗ | 0.038 |
| PAR1 | 12.548 | 4.878 | 0.947 | | | | 0.123 | 0.566∗∗∗ | 0.067 |
| OGS | 19.132 | 8.397 | 0.894 | | | | | 0.121 | −0.177 |
| PAR2 | 2.868 | 1.442 | 0.872 | | | | | | −0.039 |
| PAR1 − PAR2 | 9.689 | 4.936 | | | | | | | |

Note. N = 98. Judged mean (mean = 6.427) is a quarter of the way between “improved” and “greatly improved.” Calculated mean (mean = 1.642) is approximately one and a half categories of improvement on a 5-point scale, with an SD of just over half a category.
SD, standard deviation; Comp, computed.
∗P <0.05; ∗∗P <0.01; ∗∗∗P = 0.001.

Table II
Correlations between judges’ estimates of improvement and ratings on DI and OGS

| Index subscale or total | Judged | Computed |
| --- | --- | --- |
| DI overjet | 0.367∗∗∗ | 0.651∗∗∗ |
| DI overbite | 0.083 | 0.267∗∗ |
| DI# anterior open bite | 0.000 | 0.060 |
| DI# lateral open bite | 0.016 | −0.036 |
| DI crowding | 0.012 | 0.041 |
| DI occlusal relation | 0.160 | 0.476∗∗∗ |
| DI# posterior crossbite | 0.011 | 0.081 |
| DI# buccal crossbite | 0.104 | −0.090 |
| DI total | 0.288∗∗ | 0.621∗∗∗ |
| OGS alignment | −0.201∗ | −0.110 |
| OGS marginal ridge | −0.200∗ | −0.014 |
| OGS buccal inclination | −0.086 | −0.075 |
| OGS occlusal contacts | −0.130 | −0.150 |
| OGS occlusal relation | −0.386∗∗∗ | −0.111 |
| OGS overjet | −0.011 | 0.015 |
| OGS∗ interproximal contact | 0.014 | 0.195 |
| OGS total | −0.324∗∗∗ | −0.119 |

Note. N = 98.
∗P <0.05; ∗∗P <0.01; ∗∗∗P = 0.001; #Index subscales that showed highly skewed distributions. For these cases, the nonparametric Spearman correlation coefficient was calculated instead of the parametric Pearson coefficient.
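As a sketch of the adjustment described in the note, the Spearman coefficient can be computed as the Pearson coefficient of ranks, with tied values sharing the mean of their rank positions (ties are common in positively skewed subscales where many patients score 0). The data below are hypothetical, and the helper names are assumptions for illustration, not the study's software.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical positively skewed subscale scores and improvement scores.
open_bite = [0, 0, 0, 1, 0, 8]
improvement = [1.0, 1.5, 1.2, 2.0, 0.8, 2.4]
rho = spearman_rho(open_bite, improvement)
```

Because only rank order enters the calculation, a single extreme score such as the 8 above has no more influence than any other highest-ranked value.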

Table III
Correlations between judges’ estimates of improvement and ratings on PAR1, PAR2, and change in PAR score

| Index subscale or total | PAR1 Judged | PAR1 Computed | PAR2 Judged | PAR2 Computed | PAR1 − PAR2 Judged | PAR1 − PAR2 Computed |
| --- | --- | --- | --- | --- | --- | --- |
| PAR maxillary anterior segment | 0.091 | 0.063 | −0.050 | 0.104 | −0.140 | −0.170 |
| PAR mandibular anterior segment | 0.034 | 0.005 | −0.190 | −0.126 | −0.207∗ | −0.083 |
| PAR right buccal occlusion | 0.069 | 0.217∗ | −0.199∗ | −0.072 | 0.233∗ | 0.151 |
| PAR left buccal occlusion | 0.012 | 0.225∗ | −0.180 | 0.090 | −0.017 | 0.109 |
| PAR overjet | 0.339∗∗∗ | 0.594∗∗∗ | −0.206∗ | −0.087 | −0.133 | −0.071 |
| PAR overbite | 0.172 | 0.344∗∗∗ | 0.123 | 0.013 | 0.010 | 0.130 |
| PAR midline | −0.062 | 0.083 | −0.292∗∗ | 0.152 | −0.068 | 0.110 |
| PAR total | 0.196 | 0.370∗∗∗ | −0.291∗∗ | −0.046 | −0.152 | 0.044 |
Jun 12, 2021 | Posted by in Orthodontics | Comments Off on Measuring orthodontic treatment impact: Description or judgment, challenge or result