Validation of cervical vertebral maturation stages: Artificial intelligence vs human observer visual analysis


This study aimed to develop an artificial neural network (ANN) model for cervical vertebral maturation (CVM) analysis and validate the model’s output with the results of human observers.


A total of 647 lateral cephalograms were selected from patients aged 10-30 years (mean ± standard deviation, 15.36 ± 4.13 years). New software with a decision support system was developed for manual labeling of the dataset. A total of 26 points were marked on each radiograph. The CVM stages were saved on the basis of the final decision of the observer. Fifty-four image features were saved in text format. A new subset of 72 radiographs was created according to the classification results, and these 72 radiographs were visually evaluated by 4 observers. Weighted kappa (wκ) and Cohen’s kappa (cκ) coefficients and percentage agreement were calculated to evaluate the compatibility of the results.


Intraobserver agreement ranges were as follows: wκ = 0.92-0.98, cκ = 0.65-0.85, and 70.8%-87.5%. Interobserver agreement ranges were as follows: wκ = 0.76-0.92, cκ = 0.4-0.65, and 50%-72.2%. Agreement between the ANN model and observers 1, 2, 3, and 4 was as follows: wκ = 0.85 (cκ = 0.52, 59.7%), wκ = 0.8 (cκ = 0.4, 50%), wκ = 0.87 (cκ = 0.55, 62.5%), and wκ = 0.91 (cκ = 0.53, 61.1%), respectively (P <0.001). An average of 58.3% agreement was observed between the ANN model and the human observers.


This study demonstrated that the developed ANN model performed close to, if not better than, human observers in CVM analysis. With the development of new algorithms, automatic classification of CVM with artificial intelligence may replace conventional evaluation methods in the future.


  • We developed an artificial neural network (ANN) model to determine skeletal age.

  • The ANN model was compared with human observers in cervical vertebral maturation staging.

  • Repeatability and reproducibility of the ANN model were in the range of human observers.

  • Human interaction is still required in the clinical decision-making process.

  • Artificial intelligence interpretation of radiographs may someday replace other methods.

In treating skeletal misalignments of the jaws, the ability to individualize and estimate the bone growth rate is important for achieving better treatment outcomes. Although evaluation of hand–wrist radiographs is the most conventional method used to determine skeletal age, one of the main drawbacks of this technique is the increased dose of ionizing radiation to the patient. Determining skeletal development by evaluating changes in the size and shape of the cervical vertebrae was first proposed by Lamparski, after which several subsequent studies concluded that the cervical vertebral maturation (CVM) method was effective for individual growth estimation and identification of the mandibular growth spurt. Flores-Mir et al evaluated skeletal maturation on the basis of the hand–wrist radiographs using Fishman’s method compared with CVM staging and reported a moderately high correlation between the 2 methods. Kucukkeles et al also argued that the 2 methods are correlated.

Accurate determination of the direction and amount of mandibular growth is critical in cases in which growth modifications are planned—especially using functional appliances. This makes it necessary for the methods used to determine skeletal maturation and growth spurts in such cases to provide reliable results that can be reproduced by clinicians. In their visual assessment of the CVM method, Perinetti et al reported high diagnostic accuracy and reproducibility. In contrast, Gabriel et al worked with 10 orthodontists to test the reproducibility of the CVM method and reported that interobserver agreement might be lower than 50%. Similar results were also reported by Nestman et al, who emphasized that the low reproducibility may limit its use as a clinical guide.

Artificial intelligence (AI) can be described as a branch of computer science aimed at designing systems that can perform tasks that require human intelligence. AI should have the capacity to evaluate information received from sources and decide how accurate it is, deal with incomplete or inaccurate information, and thus manage the sources. Learning and expanding the knowledge base are among the main components of AI. Machine learning (ML) is the subfield of AI in which algorithms are trained using a learning model generated from available data instead of being specifically coded to perform a function. The artificial neural network (ANN) algorithm, a subcomponent of ML, can be described as a mathematical model inspired by the biological nervous system’s process of learning and processing information. Specifically, ANN aims to find solutions to problems that require the natural skills of thinking and observation. These systems include an input layer, any number of hidden intermediate layers, and an output layer. Connections, each of which is associated with a numerical weight, are established between processing units (neurons). Thus, with repeated adjustment of the weights, the network gains the ability to “learn.” There are 2 main approaches to teaching an AI model. In supervised learning, human experts extract features from the samples (or label them) to be used with the models. In unsupervised learning, the features of the samples are extracted automatically by the designed algorithms, and manual extraction is not needed.

In their study aiming to detect interproximal caries in bitewing images using an artificial multilayer perceptron neural network, Devito et al reported that the algorithms showed 39.4% higher performance than humans on average. Additional studies in the literature have aimed to detect root fractures. Niño-Sandoval et al conducted a study using AI to evaluate the mandibular morphology of skeletal Class I, II, and III subjects. Tajmir et al reported that supporting radiologists with AI applications can increase accuracy and decrease variability and root mean squared error in radiological bone age assessment. On the basis of our premise that AI applications could reduce interobserver differences in the CVM method, we have developed 5 different ML models (ANN, decision tree, logistic regression, support vector machine, and random forest) to compare their success in determining the CVM stage of subjects in images. Cervical vertebral features were extracted in text format from lateral cephalometric images, and it was determined that ANN was the model that most successfully classified the radiographs.

Although there are studies in the current literature that evaluate CVM with various software, there is no previous study that validates the results obtained in these studies with human observers. Therefore, we performed the present study to validate the results of the ANN model developed for CVM analysis by comparing the model’s output with the results of human observers.

Material and methods

Ethical approval was obtained from the Ethics Committee of Suleyman Demirel University. A total of 647 digital lateral cephalometric radiographs of patients with a chronologic age between 10 and 30 years (mean age ± standard deviation, 15.36 ± 4.13 years) were selected from the archive of Suleyman Demirel University, Faculty of Dentistry, Department of Dentomaxillofacial Radiology. Patients with no congenital or acquired malformation of the cervical vertebrae, and radiographs with good visualization of the C2, C3, C4, and C5 vertebrae were included. Those with evidence of current orthodontic treatment, missing permanent incisors or first molars, erupted or supernumerary teeth overlying incisor apices, gross skeletal asymmetries, or bone disease were excluded from the study. Selected radiographs were exported in Joint Photographic Experts Group format. The CVM stages of the patients were determined as described by Baccetti et al (Table I).

Table I
Cervical vertebrae morphologic features of each maturation stage

Cervical vertebral    Concavity         Body shape
maturation stage      C2   C3   C4      C3                  C4
Cervical stage 1      −    −    −       Trapezoid           Trapezoid
Cervical stage 2      +    −    −       Trapezoid           Trapezoid
Cervical stage 3      +    +    −       Trapezoid or RHO    Trapezoid or RHO
Cervical stage 4      +    +    +       RHO                 RHO
Cervical stage 5      +    +    +       Square or RHO       Square or RHO
Cervical stage 6      +    +    +       RVE or Square       RVE or Square

−, absence; +, presence; RHO, rhomboid horizontal; RVE, rhomboid vertical.

Characteristic vertebral body shape of the stage: at least 1 of the C3 or C4 body shapes must be the characteristic shape.

In our previous study using these data, the performance of 5 different ML models (ANN, decision tree, logistic regression, random forest, and support vector machine) was compared for CVM assessment. New software with a clinical decision support system (CDSS) was developed to label the dataset. A total of 26 points were marked on each radiograph (Fig). The CDSS estimated the morphology of the cervical vertebrae and suggested the CVM stage. The novel cervical vertebrae morphology prediction algorithm used in the CDSS was developed from the analysis of a subset of 100 radiographs (chronologic age between 10 and 19 years, equally distributed across ages) by 2 dentomaxillofacial radiologists (D.Y. and K.O., with 11 and 16 years of experience at the time of development). The CVM stages and cervical vertebrae morphologies were saved on the basis of the final decision of another dentomaxillofacial radiologist (H.A., with 4 years of experience at the time of development) with the aid of the CDSS. Fifty-four image features were extracted in text format. The ML models were developed with the Keras, scikit-learn, NumPy, and pandas libraries using the Python programming language (Python Software Foundation). The ANN model consisted of 1 hidden layer using the softmax activation function and was trained for 120 epochs. The extracted image features were classified with these supervised ML classifier models. The ANN model was found to be the most successful model in classifying the lateral cephalograms.
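To make the classification step concrete, the following is a minimal sketch of a forward pass through a network with 54 input features, 1 hidden layer, and a softmax output over the 6 cervical stages. It is not the authors' Keras model. The hidden-layer width (`N_HIDDEN`), the ReLU hidden activation, the placement of softmax on the output, and the helper names `predict_stage` and `random_params` are all assumptions made for illustration, and the weights are random placeholders rather than trained values.

```python
import math
import random

N_FEATURES = 54  # image features per radiograph (from the text)
N_STAGES = 6     # cervical stages 1-6 (from Table I)
N_HIDDEN = 16    # hypothetical hidden-layer width; not reported in the text


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Weighted sum of the inputs, plus a bias, for every unit in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def predict_stage(features, params):
    """Forward pass: input -> hidden (ReLU, assumed) -> softmax output."""
    hidden = [max(0.0, h) for h in dense(features, params["w1"], params["b1"])]
    probs = softmax(dense(hidden, params["w2"], params["b2"]))
    stage = probs.index(max(probs)) + 1  # stages are numbered from 1
    return stage, probs


def random_params(seed=0):
    """Untrained placeholder weights, for illustration only."""
    rng = random.Random(seed)

    def layer(n_out, n_in):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                for _ in range(n_out)]

    return {"w1": layer(N_HIDDEN, N_FEATURES), "b1": [0.0] * N_HIDDEN,
            "w2": layer(N_STAGES, N_HIDDEN), "b2": [0.0] * N_STAGES}
```

Training would repeatedly adjust `w1`, `b1`, `w2`, and `b2` to reduce the classification error on the labeled radiographs. The sketch covers only the prediction step that maps a feature vector to a stage and its class probabilities.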

A magnified view of the anatomic landmarks used for cervical vertebral feature extraction using the labeling software.

As a next step, in the present study, we compared the results of this ANN model with those of 4 independent observers (K.O., D.Y., E.C., H.A.; O1, O2, O3, and O4, respectively). For this purpose, a new subset of radiographs was created (mean age ± standard deviation, 15.36 ± 4.45 years). Power analysis (G*Power 3.1.0, Universität Düsseldorf, Germany) was conducted to determine the minimum sample size. The power analysis indicated that at least 70 images were needed to detect differences between the ANN and human observers at a power of 0.8 (α = 0.05). Thus, the comparison was conducted using 72 randomly selected good-quality digital lateral cephalometric images with equal numbers for each CVM stage. If the images had been sampled by chronological age, some maturation stages (as determined by the model) might have been overrepresented, which could have biased the results if the model’s success differed between classes. To prevent this bias, we made a random selection based on the classification results only, without knowing how sure the model was in its decision. The radiographs were evaluated visually by 3 dentomaxillofacial radiologists (H.A., D.Y., and K.O., with 5, 12, and 17 years of experience) and an orthodontist (E.C., with 8 years of experience). One month later, 24 radiographs were evaluated again for intraobserver agreement. Before the study, a calibration session was held to determine and reach a consensus on how the cervical vertebrae morphologies should be evaluated. Sixty lateral cephalometric images that were not part of the study sample were used for this purpose.
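The balanced selection described above can be sketched as follows. With 72 images and 6 stages, equal numbers per stage imply 12 images per stage. The function name `balanced_sample`, the dictionary input, and the fixed seed are hypothetical choices for illustration; the text does not describe how the authors performed the random draw.

```python
import random
from collections import defaultdict


def balanced_sample(predicted_stage, per_stage=12, seed=0):
    """Pick the same number of image ids at random from every predicted stage.

    predicted_stage: dict mapping image id -> stage assigned by the model.
    """
    by_stage = defaultdict(list)
    for image_id, stage in predicted_stage.items():
        by_stage[stage].append(image_id)

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    chosen = []
    for stage in sorted(by_stage):
        pool = sorted(by_stage[stage])
        if len(pool) < per_stage:
            raise ValueError(f"stage {stage} has only {len(pool)} images")
        # Selection uses the predicted class only, not the model's confidence.
        chosen.extend(rng.sample(pool, per_stage))
    return chosen
```

Sampling within each predicted class, rather than across ages, keeps every stage equally represented in the comparison set.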

Statistical analysis

Weighted kappa (wκ) and Cohen’s kappa (cκ) coefficients (with 95% confidence intervals) were calculated together with the percent agreement to evaluate the compatibility of human visual analysis and ANN classification results. The same statistical methods were used to evaluate the intraobserver agreement. In the calculation of wκ, ratings that differ between observers still receive partial credit, and the coefficient is reduced according to the magnitude of the difference. In the calculation of cκ and percent agreement, ratings that are not exactly the same count as full disagreement, regardless of the magnitude of the difference (even when only 1 stage apart). The wκ and cκ results were interpreted according to Table II.
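The difference between the 3 measures can be illustrated with a short sketch that computes them from 2 lists of stage ratings. The weighting scheme is not stated in the text, so linear weights are an assumption here, and the function name `agreement_stats` is hypothetical. Confidence intervals are not computed.

```python
def agreement_stats(rater_a, rater_b, n_stages=6):
    """Return (percent agreement, Cohen's kappa, linearly weighted kappa)."""
    n = len(rater_a)
    # Matrix of observed proportions; stages are numbered 1..n_stages.
    observed = [[0.0] * n_stages for _ in range(n_stages)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1.0 / n
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_stages))
           for j in range(n_stages)]

    def kappa(weight):
        # weight(i, j) is the penalty for rating stages i and j differently.
        # Undefined (division by zero) if both raters use a single stage only.
        disagree_obs = sum(weight(i, j) * observed[i][j]
                           for i in range(n_stages) for j in range(n_stages))
        disagree_exp = sum(weight(i, j) * row[i] * col[j]
                           for i in range(n_stages) for j in range(n_stages))
        return 1.0 - disagree_obs / disagree_exp

    percent = 100.0 * sum(observed[i][i] for i in range(n_stages))
    unweighted = kappa(lambda i, j: 0.0 if i == j else 1.0)
    weighted = kappa(lambda i, j: abs(i - j) / (n_stages - 1))  # assumed linear
    return percent, unweighted, weighted


# Made-up ratings: half agree exactly, the other half are 1 stage apart.
a = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
b = [1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 5]
percent, ck, wk = agreement_stats(a, b)
```

For these made-up ratings the percent agreement is 50%, cκ is 0.4, and wκ is about 0.74. Because every disagreement is only 1 stage apart, wκ is much higher than cκ, the same pattern seen when the reported wκ values are compared with the cκ and percent agreement values.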

Table II
Interpretation of the weighted κ and Cohen κ coefficients
κ Interpretation
0.01-0.2 Slight
0.21-0.4 Fair
0.41-0.6 Moderate
0.61-0.8 Substantial
0.81-1 Almost perfect
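The bands in Table II can be applied with a small lookup such as the sketch below. The function name `interpret_kappa` is hypothetical. The treatment of values that fall between the printed bands (for example, 0.205) and of zero or negative values is an assumption, because the table does not cover them.

```python
def interpret_kappa(kappa):
    """Map a kappa coefficient to the verbal label listed in Table II."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    # Each band is open at its lower bound, so 0.4 is "Fair" and 0.8 is
    # "Substantial", matching the printed ranges 0.21-0.4 and 0.61-0.8.
    bands = [(0.8, "Almost perfect"), (0.6, "Substantial"),
             (0.4, "Moderate"), (0.2, "Fair"), (0.0, "Slight")]
    for lower, label in bands:
        if kappa > lower:
            return label
    return "Not covered by Table II"  # zero or negative values (assumption)
```

For example, `interpret_kappa(0.76)` returns "Substantial" and `interpret_kappa(0.92)` returns "Almost perfect".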


Results

The wκ coefficients for the intraobserver agreement were almost perfect (wκ = 0.92-0.98). Interobserver agreement, including the ANN model, ranged from substantial to almost perfect (wκ = 0.76-0.92) (Table III; P <0.001). The wκ coefficients for agreement between the ANN model and O1, O2, O3, and O4 were 0.85, 0.8, 0.87, and 0.91, respectively. The highest consistency was between O1 and O3 (wκ = 0.92).
