A recent article is a good example of the abundance and perpetuation of studies that might fail to resolve clinical questions because of inherent design deficiencies (Barrett AAF, Baccetti T, McNamara JA Jr. Treatment effects of the light-force chincup. Am J Orthod Dentofacial Orthop 2010;138:468-76). A response from the authors to the following specific points of concern might help readers appreciate the validity of the reported evidence.
1. Was this an observational retrospective cohort study or a matched case-control study? Observational studies, although useful, are prone to selection bias, information bias, and confounding, and therefore require rigorous methodology and transparent reporting.
2. Influential factors such as age might be important confounders, and the 1-year mean age difference between groups is not accounted for in the data analysis (Table II).
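For illustration only, one standard way to account for such a difference would be an analysis of covariance with age as a covariate, in a model of the form

\[ y = \beta_0 + \beta_1\,\text{group} + \beta_2\,\text{age} + \varepsilon, \]

in which \(\beta_1\) estimates the treatment effect adjusted for age; the notation here is illustrative and is not taken from the article.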
3. With respect to baseline characteristics, small differences in subjectively selected cephalometric angles might be questionable criteria for establishing pretreatment group similarity.
4. Sample-size calculations, based on assumptions from previous research, should consider all outcomes. Breaking the sample down into subgroups, as with the chincup and quad-helix groups, reduces power. Tables IV and V show 72 comparison tests! Is the reported power applicable to all of these tests?
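As a rough illustration of the multiple-testing problem, if the 72 comparisons were independent and each were performed at the 0.05 level (an assumption made here purely for arithmetic convenience), the probability of at least one false-positive result would be

\[ 1 - (1 - 0.05)^{72} \approx 0.98, \]

so at least one spuriously "significant" finding would be all but guaranteed by chance alone.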
5. As is well known, conventional cephalometrics do not always record treatment outcomes correctly. Using a 3-category score (positive, negative, neutral) as an outcome is a step in the right direction; however, all inferences and conclusions seem to be based on P values from comparison testing of cephalometric measurements.
6. Subgroup testing lacks power and increases the chance of false-positive and spurious results, whereas interaction tests are more appropriate for assessing the influence of the quad-helix on the effect of the chincup. Unrestricted subgroup testing invites selective reporting of "interesting" results and inflates publication bias.
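As a minimal sketch of the distinction (with illustrative notation, not the article's), an interaction is assessed with a single coefficient in a model such as

\[ y = \beta_0 + \beta_1 x_{\text{chincup}} + \beta_2 x_{\text{quad-helix}} + \beta_3 \left( x_{\text{chincup}} \times x_{\text{quad-helix}} \right) + \varepsilon, \]

where a test of \(\beta_3\) directly addresses whether the quad-helix modifies the chincup effect, instead of comparing P values across separately analyzed subgroups.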
7. Only P values are reported, with no effect estimates and confidence intervals, which convey clinical relevance. Interpreting results on the basis of multiple testing and overreliance on P values is misleading and considered an error. A P value indicates only the probability of obtaining the observed measurement, or a more extreme one, when the null hypothesis is true; it does not indicate the size of the effect or the clinical importance of the intervention. In Table III, the absolute between-group difference in the ANB angle is 1.2° (not significant), whereas in Table IV the difference is 1.1° (significant). Does the second ANB measurement bear more weight in a clinical setting?
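By contrast, a confidence interval conveys both the magnitude and the precision of an effect; under a normal approximation, a 95% confidence interval for a between-group mean difference takes the familiar form

\[ \left( \bar{x}_1 - \bar{x}_2 \right) \pm 1.96 \times SE\left( \bar{x}_1 - \bar{x}_2 \right), \]

which would let readers judge directly whether, say, a 1.1° to 1.2° difference in the ANB angle is clinically meaningful.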
To conclude, generating data for the sake of showing results does not meet the stringent requirements of evidence-based practice, because the latter must rest on evidence produced by valid methodology. The foregoing issues do not represent my opinion but, rather, constitute the long-established fundamentals of clinical study methodology and reporting in the biomedical field, as presented in the references below.