Small sample sizes are a common problem in biomedical research, and the periodontal literature is no exception. A small sample leads not only to reduced statistical power but also to inappropriate statistical inference about a treatment effect: using statistical methods with an insufficient sample size may increase the chance of falsely detecting treatment efficacy. This article provides guidelines for coping with the small sample size problem. The authors discuss adequate sample sizes for several statistical tests and then suggest alternative statistical methods that remain valid with a small sample size.
Key points
- The authors provide some rules of thumb regarding sufficient sample sizes for a few statistical methods.
- The authors examine distributional characteristics of data from periodontal studies and relevant sampling distributions.
- The authors provide some strategies to perform statistically sound data analysis with a small sample size.
Introduction
A key feature of scientific research is the design of an experiment, the collection of data from the realization of that experiment, and the testing of statistical hypotheses based on the accumulated data. The data themselves may take a variety of forms, so a broad knowledge of statistical methods is crucial for performing scientifically sound research. Basic statistics texts explain which statistical techniques are appropriate for different characteristics of the data and the various hypotheses of interest. Less emphasized is whether those methods remain suitable regardless of the sample size. What needs to be considered are the 2 types of errors involved in statistical testing, termed type I and type II errors, as well as the statistical power (1 minus the probability of a type II error). The type I error rate is the probability of rejecting the statistical hypothesis of interest (termed the null hypothesis) when in fact it is true. The type II error rate is the probability of failing to reject the null hypothesis when in fact it is false. The statistical power is the probability of rejecting the null hypothesis when in fact it is false.
Many scientific studies suffer from a small sample size owing to limitations on available resources or poor planning. Periodontal research outcomes include, for example, bacterial colonization, measures of plaque, pocket depth, and clinical attachment loss. With these types of data, one may ask whether it is appropriate to test the difference between 2 treatments based on only 10 subjects per group. One may also ask whether several predictors can be included in a regression analysis when the total sample size is only, for example, 30. It is possible through trial and error to arrive at a model that incorrectly overfits the data when the number of covariates is large or the sample size is small. This is due to an inflation of the type I error across multiple tests of multiple models: if we test enough models, by chance alone we are likely to find one that fits the data well even though there is no true relationship. These issues must be considered before carrying out the data analysis. Many statistical textbooks extensively address the power of hypothesis testing; however, few sufficiently address the validity (especially in terms of the type I error) of statistical inference with a small sample size, even though hypothesis testing with a small sample size may be of great interest to many researchers. The type I error rate also determines the confidence level of a confidence interval (a test at level α corresponds to a 100(1 − α)% confidence interval); thus, the validity of a test in terms of its type I error directly addresses the validity of the corresponding confidence interval. This article provides some reasonable guidelines for researchers who want to draw statistically sound results, and relevant interpretations, from studies with small sample sizes.
It is well understood that studies with a small sample size have low statistical power to detect a true effect. The statistical literature offers rules of thumb in which the minimum sample size requirement assures a certain statistical power for detecting a moderate effect size (eg, Refs.). Sample sizes chosen for statistical power are usually large enough that the accompanying statistical methods are also justified in terms of maintaining the desired type I error rate. In practice, however, the available samples are often limited because of time, budget, or ethical constraints. For researchers with limited resources, the statistical analysis must be performed based on the available sample size rather than on the sample size required for a certain study power. As a result, the right question to ask concerns the suitability of the statistical methods given a limited sample size. Because many statistical methods rely on so-called large sample properties, liberal use of statistical tests, regardless of inadequate sample sizes, may lead to inflated type I errors (more detail on this point is given in the section Caution on Using the Bootstrap Method for a Small Sample Size), meaning that nonexistent study effects may too often be declared significant. Statistical software packages print out results but in general do not warn users that the results may be inaccurate because of small sample sizes. Inflated type I errors in turn give rise to low reproducibility of the results in future research. The problem of inflated type I errors can be made less common in practice by incorporating computationally intensive exact statistical methods (details are given in the section Alternative Methods for Small Sample Sizes), although exact methods are difficult to adapt to a variety of complex modeling schemes.
This article therefore discusses the sample sizes adequate for several popular statistical methods. It examines distributional characteristics of data from a study aiming to reduce the colonization of pathogens in the oral cavity and from other periodontal studies, followed by the bootstrap method, which can be used to cope with small sample size problems. Alternative statistical methods that remain valid with small sample sizes are also described.
General requirement of sample sizes for the validity of statistical tests
Adequate sample sizes are often determined to assure a certain chance of detecting the true effect size. For example, analysis of variance (ANOVA) may require 30 observations per cell to detect a medium effect size with about 80% power. Green suggests that the required sample size in a regression analysis is greater than 50 + 8m, where m is the number of predictors, assuming a medium-sized association. These discussions, however, rest on a target standardized effect size, which is often difficult to justify, and do not answer questions about the minimum sample size needed for the validity of the statistical methods themselves. Because the validity of many statistical tests (eg, likelihood ratio tests) depends on the accuracy of large sample approximations to the true distribution of the test statistic, in this section we discuss how large the sample size should be for the related inference to be trusted with respect to the desired type I error rate. Different statistical approaches may require different sample sizes; hence, sample sizes for several statistical methods are discussed in what follows.
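For illustration, power-based sample sizes such as these can be computed with standard software. The following is a minimal sketch in Python using statsmodels (our choice of tool, not the article's), showing the per-group sample size for a two-sample t test with a medium standardized effect size, alongside Green's regression rule of thumb.

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

# Per-group n for a two-sample t test: medium effect (Cohen's d = 0.5),
# two-sided alpha = 0.05, target power = 0.80
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(ceil(n_per_group))  # about 64 subjects per group

# Green's rule of thumb for regression: n > 50 + 8m with m predictors
m = 5  # hypothetical number of predictors
print(50 + 8 * m)  # more than 90 observations needed
```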
Tests for Categorical Data
Many outcome variables of interest are categorical, that is, either ordinal or nominal. For example, one may want to assess the effect of an antibiotic on the successful management of harmful microorganisms. The outcome can be expressed as binary data (ie, successful management or failure). When a study compares the effects of k therapies, the binary outcome variable can be summarized in a k × 2 contingency table, which consists of 2k cells. The test statistic for comparing the equality of the response proportions across the k therapies can be obtained based on the normal approximation of the binomial distribution or the Poisson distribution. The resulting statistics approximately follow a normal distribution or a χ² distribution in a large sample.
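As a hypothetical illustration (the counts below are invented; scipy is our assumed tool), a k × 2 table can be tested with the χ² test of homogeneity. The routine also returns the expected cell counts that the rule of thumb discussed next is based on.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 3 x 2 table: success/failure counts for k = 3 therapies
table = np.array([[18, 12],
                  [22,  8],
                  [15, 15]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)  # test statistic, p value, degrees of freedom
print(expected)      # expected cell counts (used by the np > 5 rule)
```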
A common rule of thumb for a sufficient sample size in normal-based hypothesis testing or confidence interval construction is that each cell should satisfy np > 5, where n is the sample size and p is the true or estimated proportion (rate) that corresponds to the cell. Note that np is the expected cell size. Suppose that one expects full-mouth disinfectant to successfully manage Porphyromonas gingivalis infection in 90% of patients. This means that the expected rate of management failure is 10%. Then, a study with n = 51 total subjects would expect to observe 5.1 failures (10% of 51) and 45.9 successes (90% of 51). Notice that this considers only one treatment group; if one deals with k different groups, the sample size consideration should address each group separately. A similar calculation finds that the required sample size for a rate of 50% is 11. This indicates that the sample size requirement increases drastically for rare events. For outcome variables with more than 2 levels (eg, mild, moderate, and severe), the same argument applies based on the expected cell sizes.
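The rule is easy to automate. Here is a minimal sketch in plain Python; it reproduces the sample sizes of 51 (for a 10% failure rate) and 11 (for a 50% rate) given above.

```python
def smallest_valid_n(p, threshold=5):
    """Smallest n at which every expected cell count (np and n(1 - p))
    exceeds the threshold of the np > 5 rule of thumb."""
    n = 1
    while min(n * p, n * (1 - p)) <= threshold:
        n += 1
    return n

print(smallest_valid_n(0.90))  # 51: expected failures = 5.1 > 5
print(smallest_valid_n(0.50))  # 11: expected counts = 5.5 > 5
```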
In summary, categorical data analysis requires a fairly large sample.
Logistic Regression with Binary Outcomes
Logistic regression is a popular approach for modeling the relationship between a binary outcome and a set of explanatory variables based on the likelihood method. Logistic regression defines the relationship between the binary outcome and the predictors through the log odds, or logit, function, although a suitable transformation can also express the relationship between the probability of the outcome of interest and the values of the explanatory variables. The method is popular among researchers because it allows multiple factors to be investigated in predicting a binary outcome. To ensure that the likelihood ratio tests for the model follow a χ² distribution, the sample size needs to be sufficiently large. In addition, if the outcome response rate is close to the extremes, that is, 0 or 1, the required sample size for logistic regression is even larger.
Some discussion of the minimum sample size for logistic regression can be found in the related literature. Peduzzi and colleagues and Hosmer and colleagues suggest that the minimum number of events per independent variable is approximately 10. According to these guidelines, if 2 predictors are included, the total number of events should be 20. If the event happens about 10% of the time and the other 90% of nonevent cases are also taken into account, the recommended total sample size is 20 + 180 = 200.
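The arithmetic behind the events-per-variable guideline can be sketched in plain Python as follows (the 10-events-per-variable figure is from the guidelines above; the 10% event rate is the example just given).

```python
import math

def logistic_min_sample(n_predictors, event_rate, events_per_variable=10):
    """Minimum events and total sample size under the
    events-per-variable rule of thumb."""
    events_needed = events_per_variable * n_predictors
    total_needed = math.ceil(events_needed / event_rate)
    return events_needed, total_needed

events, total = logistic_min_sample(n_predictors=2, event_rate=0.10)
print(events, total)  # 20 events -> 200 subjects in total
```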
In summary, a large sample size is required for proper inferences in logistic regression.
Tests for Numeric Data
Under certain distributional assumptions, some statistical tests are based on exactly correct distributions; these are often called exact tests. One example of an exact test is the t test (or Student's t test). The t statistic follows the t distribution exactly if all observations are independent and come from the same normal distribution, and the test is then valid even with a sample size as small as 2. However, this also means that the t test is not an exact test when these assumptions are violated. When the underlying distribution is not normal, the t test statistic approximately follows a normal distribution with a sufficiently large sample, as dictated by the central limit theorem. Thus, for the t test approximation to be accurate regardless of the underlying distribution, a sufficient sample is required.
A quick rule of thumb indicates 25 to 30 as a sufficient sample size. This criterion is reasonable because the sampling distribution of the sample mean is often well approximated by a normal distribution when the sample size is of this order. Of course, if the underlying distribution of the data is symmetric, the approximation to the normal distribution may be achieved at sample sizes well below 25 to 30. On the other hand, if the underlying distribution is extremely skewed, even much larger sample sizes may not be sufficient for the normal approximation of the summary statistics. For the subsequent discussions, we accept this rule of thumb and consider a sample size of 30 as a general guideline for a large sample in testing based on numeric data.
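The effect of skewness can be checked directly by simulation. The following is a minimal sketch in Python with scipy (our tool choice, not the article's): it estimates the actual type I error rate of a one-sample t test at the nominal 0.05 level when the data come from a skewed exponential distribution and the null hypothesis is true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

def t_test_type1_error(n, n_sim=5000):
    """Rejection rate of a one-sample t test at alpha = 0.05 when data
    are exponential (skewed) and the null mean of 1.0 is correct."""
    rejections = 0
    for _ in range(n_sim):
        x = rng.exponential(scale=1.0, size=n)  # true mean is 1.0
        _, p = stats.ttest_1samp(x, popmean=1.0)
        rejections += p < 0.05
    return rejections / n_sim

for n in (5, 10, 30, 100):
    print(n, t_test_type1_error(n))  # should approach 0.05 as n grows
```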
Now, consider the two-sample t test, the most commonly used test statistic for comparing the means of 2 independent groups. When the sample size is large, the two-sample t statistic is well approximated by the standard normal distribution despite a nonnormal underlying distribution. First, consider sufficient sample sizes (for a normal approximation) for the t test assuming equal variances between the 2 groups. The two-sample t test has degrees of freedom equal to the total sample size minus 2. The degrees of freedom can be interpreted as the effective amount of independent data: we subtract 2 from the total sample size because 2 parameters (one mean per group) are estimated, giving rise to dependence between the summands in the sample variance formula. In this context, and analogously to the 1-sample problem, one may say that when the underlying distribution is not normal, the required total sample size for the 2-group comparison is 32 or greater, so that the degrees of freedom are not smaller than 30.
Now, consider the two-sample t test with unequal variances, which can be used when the 2 groups have different variances. In this case, the degrees of freedom are a function of both the sample sizes and the variances. Suppose that one group variance is k times the other group variance, and consider a balanced design, that is, n1 = n2 = n (where ni is the sample size of group i). Then, based on the popular Satterthwaite approximation, the degrees of freedom are (n − 1)(1 + k)²/(1 + k²). If k = 2, the degrees of freedom are 1.8(n − 1), or roughly 1.8n. That is, to achieve 30 degrees of freedom, a sample size of about 17 to 18 per group, or a total of about 34 to 36, is needed. A severely unbalanced design may require much larger sample sizes. For example, when the sample size of the group with the smaller variance is 2 times that of the group with the larger variance, the required sample size to reach 30 degrees of freedom with k = 2 is a total of 63 (21 for the larger variance group and 42 for the smaller variance group).
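These degrees-of-freedom calculations are easy to verify. The following is a small sketch of the Satterthwaite approximation in plain Python, covering both the balanced formula and the unbalanced example with a total of 63 subjects.

```python
def welch_df_balanced(n, k):
    """Satterthwaite degrees of freedom for a balanced design with n per
    group, when one group variance is k times the other."""
    return (n - 1) * (1 + k) ** 2 / (1 + k ** 2)

def welch_df(v1, n1, v2, n2):
    """General Satterthwaite degrees of freedom from group variances."""
    a, b = v1 / n1, v2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Balanced design with k = 2: df = 1.8 * (n - 1)
for n in (17, 18):
    print(n, welch_df_balanced(n, k=2))  # 28.8 and 30.6

# Unbalanced example: variance 2 with n = 21, variance 1 with n = 42
print(welch_df(2.0, 21, 1.0, 42))        # about 30
```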
In summary, based on the sample size of 30 as the rule of thumb, group comparisons of numeric data may require slightly more than 30 observations in the balanced or equal variance cases. However, if the inequality in variances between the 2 groups is extreme, the sample size requirement may be considerably larger than in the equal variance case. We again emphasize that this rule of thumb is only a guideline, not a definitive number. In some difficult cases, the test statistics may not perform reasonably even with a fairly large sample size; such a difficult example is dealt with when discussing the bootstrap method. Sufficient sample sizes for ANOVA are discussed in relation to regression.
Regression and Analysis of Variance with Numeric Outcomes
The validity of regression methods depends on assumptions such as normality, constant variance, and independence of the observations. When a numeric predictor is of interest, the t test can be used to assess its significance. If a predictor consists of more than 2 categories (eg, ANOVA), the relevant test statistic takes the form of an F test. These test statistics already take into account the dependence structure of the residuals; thus, dependence between residuals is less of a concern if the raw observations are independent. Another important question is whether the underlying distribution of the data is normal, because the t statistic and F statistic are based on the normal distribution assumption.
Checking the underlying distribution can be carried out based on the residuals (this itself can be a problem with a small sample size; see more detail in the section Goodness-of-Fit Testing). If the normality assumption (with equal variances) is met, the t test and F test are exact tests; that is, they provide the promised type I error rate that users specify under the null hypothesis (ie, coefficient of a predictor = 0). If the normality assumption is not met, those tests are not exact, and inferences based on them rely largely on the test statistics' approximation to the normal or χ² distributions. For the moment, assume a constant variance of the outcome variable across different predictor values. If the underlying distribution is not normal, then analogously to the discussion in the section Tests for Numeric Data, the approximation to a normal distribution may require the degrees of freedom of the t test to be greater than 30. Because the degrees of freedom equal the sample size minus the number of coefficients in the model (including the intercept), a regression model with only 1 numeric predictor requires a total sample size of 32. If the predictor variable is categorical with k levels, as in ANOVA, it counts as the equivalent of k − 1 predictors (excluding the intercept term). If the constant variance assumption is not met, regression based on ordinary least squares may not perform well; other techniques, such as the generalized least squares method, then need to be considered.
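To make the degrees-of-freedom bookkeeping concrete, here is a minimal regression sketch in Python with statsmodels (our assumed tool, with invented data): with n = 32 observations, an intercept, and one numeric predictor, the residual degrees of freedom are exactly 30.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 32
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)  # invented linear relationship

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.df_resid)  # 30.0 = n - 2 (intercept + one slope)
```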
If the number of predictors in the regression model is large, the model itself can be highly significant even if none of the predictors has a relationship with the outcome variable. Freedman and Pee suggest that keeping the ratio of the number of predictors to the number of observations at about 25% or less may maintain the desired type I error of the model. Individually significant predictors that are correlated with each other may lead to so-called multicollinearity issues when combined in a model. In cases of near collinearity, the direction of an effect may be reversed or the variables may no longer be significant.
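Multicollinearity can be screened with variance inflation factors (VIFs). The sketch below (Python with statsmodels, invented data) builds 2 nearly collinear predictors; their VIFs become very large, flagging the problem described above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

for i in (1, 2):  # columns 1 and 2; column 0 is the intercept
    print(i, variance_inflation_factor(X, i))  # large VIFs (>> 10)
```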
Goodness-of-Fit Testing
Goodness-of-fit tests for distributions quantify the differences between the observations and their expected values under the assumed distribution and are commonly used as evidence that observations follow a certain distribution. Many standard statistical programs provide these tests. For example, SAS provides the Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, and Cramér-von Mises tests of normality in PROC UNIVARIATE, among tests for many other distributions.
What is often overlooked is that the power of these tests depends on the sample size. For testing normality, these tests take as the null hypothesis that the data are normally distributed. When the sample size is small, there may not be enough power to reject the null hypothesis even if the data do not follow a normal distribution. Consider the Shapiro-Wilk test, known to be the most powerful of the goodness-of-fit tests for normality, and examine its power to detect nonnormality. In Table 1, we generate random numbers from various distributions with sample sizes of 5, 10, 20, and 30 (5000 simulations per scenario). The result for the normal distribution is the type I error of the test (the target type I error = 0.05), and it shows that the Shapiro-Wilk test has accurate type I error control even with a sample size of 5. However, with the Gamma(5,1) distribution, the power to detect nonnormality with a sample size of 20 is only 24%; that is, about 76% of the time the Shapiro-Wilk test reaches the wrong conclusion, failing to reject the hypothesis that the data have a normal distribution. The simulation results show that goodness-of-fit tests may not be a practical way to demonstrate that data are normally distributed when sample sizes are small. The authors' recommendation is to apply statistical methods that are valid for small samples, rather than to use goodness-of-fit tests to establish the normality of the data.
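A simulation in the style of Table 1 can be sketched as follows (Python with scipy, our assumed tools; the article used its own simulation setup). It estimates the Shapiro-Wilk rejection rate under a normal distribution (the type I error) and under a Gamma(5,1) distribution (the power) for each sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def shapiro_rejection_rate(sampler, n, n_sim=5000, alpha=0.05):
    """Proportion of simulated samples that the Shapiro-Wilk test
    rejects at the given alpha level."""
    rejections = 0
    for _ in range(n_sim):
        _, p = stats.shapiro(sampler(n))
        rejections += p < alpha
    return rejections / n_sim

for n in (5, 10, 20, 30):
    type1 = shapiro_rejection_rate(lambda m: rng.normal(size=m), n)
    power = shapiro_rejection_rate(lambda m: rng.gamma(shape=5.0, size=m), n)
    print(n, type1, power)  # type I error near 0.05; power grows with n
```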