Limitations of Value-added Modeling

As a norm-referenced evaluation system, the model compares a teacher's performance with the results achieved by other teachers in the chosen comparison group. It is therefore possible to use this model to infer that a teacher is better than, worse than, or about the same as the typical teacher, but it is not possible to use it to determine whether a given level of performance is desirable in absolute terms.

Because each student's expected score is largely derived from the student's actual scores in previous years, it is difficult to use this model to evaluate teachers of kindergarten and first grade, who teach students with no prior test history. Some research limits the model to teachers of third grade and above.
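The core mechanism can be sketched as a regression of current scores on prior scores, with a teacher's value-added estimated as the average residual of that teacher's students. This is a minimal illustration with hypothetical data, not the specification of any operational value-added system, which typically uses many more covariates and multiple prior years.

```python
# Minimal value-added sketch: expected scores come from a simple
# one-predictor linear model fit on district-wide data.
# All numbers below are hypothetical.

def ols_fit(x, y):
    """Ordinary least squares for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

def value_added(prior, current, a, b):
    """Mean residual of a class: actual score minus expected score."""
    residuals = [c - (a + b * p) for p, c in zip(prior, current)]
    return sum(residuals) / len(residuals)

# Fit the expected-score model on district-wide (prior, current) pairs.
prior_scores = [50, 60, 70, 80, 90]
actual_scores = [55, 62, 71, 83, 89]
a, b = ols_fit(prior_scores, actual_scores)

# One teacher's class: a positive value means the class outperformed
# what prior scores alone would predict.
va = value_added([60, 70, 80], [68, 75, 84], a, b)
```

The kindergarten problem is visible directly in this sketch: without a `prior` score for each student, there is no expected score to subtract from.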

Schools may not be able to obtain new students' prior scores from the students' former schools, or the scores may not be usable because the tests are not comparable. A school with high student turnover may therefore have difficulty collecting sufficient data to apply this model. When students change schools in the middle of the year, their progress during that year is not solely attributable to their final teachers.

Value-added scores are more sensitive to teacher effects for mathematics than for language. This may be due to widespread use of poorly constructed tests for reading and language skills, or it may be because teachers ultimately have less influence over language development. Students learn language skills from many sources, especially their families, while they learn math skills primarily in school.

There is some variation in scores from year to year and from class to class. This variation is similar to that seen in performance measures in other fields, such as Major League Baseball, and thus may reflect real, natural variation in a teacher's performance. Because of this variation, scores are most accurate when they are derived from a large number of students (typically 50 or more). As a result, it is difficult to use this model to evaluate first-year teachers, especially in elementary school, as they may have taught only 20 students. A ranking based on a single classroom is likely to classify the teacher correctly about 65% of the time; this figure rises to 88% if ten years of data are available. Additionally, because the confidence interval is wide, the method is most reliable for identifying teachers who are consistently in the top or bottom 10%. It is less suited to drawing fine distinctions among teachers who produce more or less typical achievement, such as determining whether a teacher should be rated slightly above or slightly below the median.
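The relationship between sample size and classification accuracy described above can be illustrated with a toy simulation: a teacher's observed class average is the true effect plus noise that shrinks with the square root of the number of students. The effect and noise spreads below are illustrative assumptions, not parameters from any published value-added study, so the resulting rates only demonstrate the direction of the trend, not the 65%/88% figures themselves.

```python
import random

random.seed(0)

def classification_rate(n_students, trials=2000,
                        effect_sd=1.0, student_sd=5.0):
    """Fraction of simulated teachers whose observed class mean falls
    on the same side of average as their true effect."""
    correct = 0
    for _ in range(trials):
        true_effect = random.gauss(0, effect_sd)
        # Class-mean noise shrinks as 1/sqrt(n): more students,
        # more stable estimate.
        noise = random.gauss(0, student_sd / n_students ** 0.5)
        observed = true_effect + noise
        correct += (observed > 0) == (true_effect > 0)
    return correct / trials

small = classification_rate(20)    # roughly one elementary classroom
large = classification_rate(500)   # roughly ten years of classes
```

Under these assumptions, `small` comes out well below `large`, mirroring the pattern that single-classroom rankings misclassify teachers far more often than rankings built on many years of data.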