In reporting reliability, duplicate measurements are often needed to determine whether measurements are sufficiently in agreement among observers (interobserver agreement) and/or within the same observer (intraobserver agreement). These data are often analyzed inappropriately with paired t tests and/or correlation coefficients. The aim of this article is to highlight the statistical problems of reliability testing with paired t tests and correlation coefficients and to encourage good reliability reporting in orthodontic research. For the complex issue of reliability, no simple, single statistical approach is available; however, some methods are better than others. A graphic technique based on the Bland-Altman plot that can be applied simultaneously to both intraobserver and interobserver reliability will also be discussed.
No matter how reliable a method has been found to be in the past, reliability should be assessed again before a new study. There is no guarantee that the reliability of that same measurement will continue to be high for a new group of observers obtaining measurements of new subjects or samples. For example, when measuring and comparing pretreatment and posttreatment cephalometric variables, the potential for error is nearly limitless. Errors can occur during superimposition, while determining the reference plane of measurement, and when identifying specific landmarks. Ironically, uncertainty is one of the most certain truths in the universe (Werner Heisenberg's uncertainty principle, 1927). Some errors are not due to any correctable fault of the investigators but are simply inevitable. Because errors will always exist, the magnitude and the pattern of the errors become more important than the errors themselves.
To report how similar multiple sets of data from different methods are to each other, 2 methods have been used more frequently than any others in the orthodontic literature: Dahlberg's formula (a root-mean-square type of statistic) and the intraclass correlation coefficient. Traditionally, in orthodontics, Dahlberg's formula has been the most popular method. It is a simple twist on the standard deviation of the differences, dividing it by the square root of 2 (standard deviation of the differences ÷ √2). Consequently, there really is no reason that Dahlberg's formula is any better than the standard deviation of the differences or the residual standard deviation.
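For reference, Dahlberg's formula is conventionally written as

$$ D = \sqrt{\frac{\sum_{i=1}^{n} d_i^{2}}{2n}} $$

where $d_i$ is the difference between the first and second measurements of the $i$th subject and $n$ is the number of duplicated subjects. When the mean difference is close to zero, $D$ is approximately the standard deviation of the differences divided by $\sqrt{2}$, which is why it offers no real advantage over that simpler statistic.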
Unfortunately, it is also true that improper methods are commonly applied in reporting reliability: for example, reporting a high correlation coefficient and/or nonsignificance from a paired t test. When reporting a reliability measure or the result of a methods comparison, it is common to find the following improper statement in the orthodontic literature: “A paired t test showed that the 2 measurements did not differ significantly, and the correlation coefficient demonstrated a significant correlation. Therefore, the magnitude of the differences was within clinically acceptable limits.” These are inadequate methods for reporting reliability. To maintain its level of competency and professionalism, the orthodontic literature should embrace and benefit from the advances of contemporary comparative statistical techniques that are far more powerful than the t test and/or correlation analysis alone.
This article is organized as follows. First, we highlight the statistical problems of testing reliability via a comparison of means, such as the t test, and/or the correlation coefficient. Second, we demonstrate a simple modification of the Bland-Altman plot that shows the pattern of error, the magnitude of error, and between-group differences in the errors, if present; this can be done with ease by assigning a different style to the error points, as sketched below. In this discussion, we will not venture into reliability methods involving qualitative categorical data, including nominal or ordered categories.
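As an illustration of this graphic technique, the following is a minimal sketch in Python (numpy and matplotlib); the simulated measurement arrays are purely hypothetical stand-ins for real duplicate measurements, not the authors' data, and the point styles distinguish intraobserver from interobserver differences on a single Bland-Altman plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical duplicate measurements (mm); substitute real data here.
rng = np.random.default_rng(1)
truth = rng.uniform(40, 60, 30)
obs1_t1 = truth + rng.normal(0, 0.5, 30)   # observer 1, first pass
obs1_t2 = truth + rng.normal(0, 0.5, 30)   # observer 1, repeat (intraobserver)
obs2 = truth + rng.normal(0.3, 0.5, 30)    # observer 2 (interobserver)

def bland_altman(a, b, marker, label):
    """Plot differences against means with a distinct point style."""
    mean, diff = (a + b) / 2, a - b
    plt.scatter(mean, diff, marker=marker, label=label)
    return diff

d_intra = bland_altman(obs1_t1, obs1_t2, "o", "intraobserver")
d_inter = bland_altman(obs1_t1, obs2, "x", "interobserver")

# Bias and 95% limits of agreement, pooled over all plotted differences
d = np.concatenate([d_intra, d_inter])
bias, half_width = d.mean(), 1.96 * d.std(ddof=1)
for y in (bias, bias - half_width, bias + half_width):
    plt.axhline(y, linestyle="--", linewidth=0.8, color="gray")
plt.xlabel("Mean of paired measurements (mm)")
plt.ylabel("Difference between measurements (mm)")
plt.legend()
plt.show()
```

The dashed lines mark the mean difference (bias) and the 95% limits of agreement (bias ± 1.96 standard deviations of the differences), so the magnitude and the pattern of the errors, and any difference between the observer groups, are visible at a glance.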
Criticism of the correlation coefficient as a method of measuring reliability
As mentioned, a popular method of reporting reliability has been to calculate the correlation coefficient between the 2 measurements. However, a more stringent method is needed for measuring agreement between observers. The null hypothesis of a correlation test is always that the measurements are not linearly related. The correlation coefficient does not indicate the actual agreement between 2 measurements, only the strength of the linear relationship between 2 variables. For example, 2 measurements can have perfect correlation yet never agree: suppose that 1 measurement is always 1 mm shorter than the other. The correlation coefficient is therefore an inappropriate and overly liberal measure of reliability that usually overestimates the true reliability.
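This point is easy to verify numerically; a minimal sketch (the readings below are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

a = np.array([20.0, 22.5, 25.0, 27.5, 30.0])  # method A readings (mm)
b = a - 1.0                                   # method B: always 1 mm shorter
print(pearsonr(a, b)[0])   # 1.0 -- perfect correlation
print(np.mean(a - b))      # 1.0 mm -- yet a constant disagreement remains
```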
When the P value is small, we can safely conclude that the 2 measurements are related. However, a high correlation does not mean that the 2 measurements agree. The main problem with the P value of a correlation coefficient is that researchers test the null hypothesis of zero correlation, whereas a more meaningful null hypothesis would be that the correlation between the 2 methods is unity. In addition, small P values are often obtained for a correlation coefficient, but they have no practical value: a small P value does not characterize the clinical importance of the data. Unfortunately, investigations of publication bias (the preferential acceptance of articles reporting statistically significant results) have found that journals prefer significant results over nonsignificant ones. The misuse of P values in the literature can therefore contribute to the misinterpretation of research data and to the improper implementation of findings in clinical practice.
Bland and Altman helped to clarify this misuse by further explaining that a change in the scale of measurement does not affect the correlation, but it certainly affects the agreement. Correlation depends on the range of the true quantity in the sample. A test of significance might show that 2 measurements are related, but it would be amazing if 2 measurements designed to measure the same quantity were not related. Consequently, data that seem to have poor agreement can produce quite high correlations.
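A minimal sketch of the scale argument, again with invented numbers:

```python
import numpy as np
from scipy.stats import pearsonr

mm = np.array([42.0, 47.5, 51.0, 55.5, 60.0])  # readings in mm
cm = mm / 10.0                                 # identical readings rescaled to cm
print(pearsonr(mm, cm)[0])  # 1.0 -- rescaling leaves the correlation untouched
print(np.mean(mm - cm))     # large mean difference -- the agreement is gone
```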
Criticism of the paired t test as a method of measuring reliability
As stated, researchers often report the results of paired t tests to measure reliability. Paired t tests should not be used as a reliability statistic; they should be applied to demonstrate difference or bias. Biomedical researchers too often fail to follow this rule. There are 2 reasons why this use is inappropriate. First, failure to reject the hypothesis that 2 means are equal should not lead to the conclusion that the 2 means are equal. The correct conclusion is that the difference between the 2 means is not large enough to be detected with the given sample size. Proving a null hypothesis such as “there is no difference between the measurements” is therefore a difficult challenge; this is why a number of reliability statistics have been developed. Conversely, when the null hypothesis is rejected, the alternative hypothesis, that “there is a significant difference between the measurements,” is readily accepted.
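The first point can be demonstrated with a small simulation; a sketch assuming simulated data (the 0.8-mm bias and the sample of 8 are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(7)
truth = rng.uniform(40, 60, 8)           # only 8 duplicated subjects
m1 = truth + rng.normal(0.0, 1.0, 8)     # first measurement
m2 = truth + rng.normal(0.8, 1.0, 8)     # repeat with a built-in 0.8-mm bias
print(ttest_rel(m1, m2).pvalue)  # likely > 0.05: the bias is real but undetected
```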
Second, a paired t test demonstrating statistical significance does not always indicate clinical significance. Likewise, in a reliability test, a probability value less than 0.05 does not necessarily have clinical meaning. The result of a t test indicates only whether the difference between 2 means is distinguishable from zero. Significance tests, whether a paired t test or others, assess difference, not concordance. What is really needed instead is a method for determining the magnitude and the pattern of the difference between the 2 measurements; this is the main point of this discussion. The t test should be applied only when a graphic display suggests that a systematic constant difference is involved. Unfortunately, many authors uncritically apply the classic paired t test in reporting reliability.
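The converse is just as easy to show: with a large enough sample, a clinically trivial bias becomes statistically significant. A sketch under the same simulated-data assumptions:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(3)
truth = rng.uniform(40, 60, 2000)          # a very large sample
m1 = truth + rng.normal(0.00, 0.5, 2000)
m2 = truth + rng.normal(0.05, 0.5, 2000)   # a negligible 0.05-mm bias
print(ttest_rel(m1, m2).pvalue)  # likely < 0.05, yet 0.05 mm means nothing clinically
```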