Criticism of Value Added Modeling via Student Testing as a Means to Evaluate Teacher Effectiveness
There is a debate at the heart of educational public policy around the idea of incentive-based pay. One of the leading ways to determine the basis of such pay is value-added testing. However, as this author in Ed Week points out, such testing is a form of tunnel vision that tells only one part of the story (while encouraging a teach-to-the-test methodology):
The whole thing brings to my mind the collateralized debt bubble, in which incredibly complex models were built atop a pretty narrow set of assumptions and the simple conviction that assumptions could be taken as givens. In 2004, questioning underlying assumptions about real estate valuation would get an analyst dismissed as unsophisticated.
Edu-econometricians are eagerly building intricate models stacked atop value-added scores. Yet, today’s value-added measures are, at best, a pale measure of teacher quality. There are legitimate concerns about test quality; the noisiness and variability of calculations; the fact that metrics don’t account for the impact of specialists, support staff, or shared instruction; and the degree to which value-added calculations rest upon a narrow, truncated conception of good teaching. Value-added does tell us something useful and I’m in favor of integrating it into evaluation and pay decisions, accordingly, but I worry when it becomes the foundation upon which everything else is constructed.
When well-run public or private sector firms evaluate employees, they incorporate managerial judgment, peer feedback, and so forth, without assuming that these will or should reflect project completion, sales, assembly line performance, or what-have-you. The whole point of these other measures is to get a fuller picture of performance; and that would be self-defeating if these other measures were supposed to measure one underlying thing.
The one downside to having a slew of first-rate econometricians engaged in edu-research nowadays is that in their eagerness for outcomes to analyze, they tend to care less about the caliber of the numbers than whether they can count them. In the housing bubble, rocket scientists crunched decades of housing data to build complex models. Their job wasn’t to sweat the quality of the data, its appropriateness, or the real-world utility of their assumptions; it was to build dazzling models. The problem is that even the cleverest of models is only as good as the data. And it turned out that the data and assumptions were rife with overlooked problems.
Edu-econometricians love test scores because they can find increasingly sophisticated ways to model them. But if the scores are flawed, biased, or incomplete measures of learning or teacher effectiveness, the models won’t pick that up. Yet those raising such questions are at risk of being dismissed as unsophisticated and retrograde.
In addition, even economists, who understand the nature of data, have pointed to the intrinsic problems of relying solely on these models. For instance, scholars convened by the Economic Policy Institute have published “Problems with the Use of Student Test Scores to Evaluate Teachers” (its authors include Eva L. Baker, Paul E. Barton, Linda Darling-Hammond, Edward Haertel, Helen F. Ladd, Robert L. Linn, Diane Ravitch, Richard Rothstein, Richard J. Shavelson, and Lorrie A. Shepard). The researchers highlight the problems of value added modeling via testing alone:
For a variety of reasons, analyses of VAM results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach. One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a “teacher effect” or the effect of a wide variety of other factors.
The researchers continue:
VAM’s instability can result from differences in the characteristics of students assigned to particular teachers in a particular year, from small samples of students (made even less representative in schools serving disadvantaged students by high rates of student mobility), from other influences on student learning both inside and outside school, and from tests that are poorly lined up with the curriculum teachers are expected to cover, or that do not measure the full range of achievement of students in the class.
For these and other reasons, the research community has cautioned against the heavy reliance on test scores, even when sophisticated VAM methods are used, for high stakes decisions such as pay, evaluation, or tenure. For instance, the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences stated,
…VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.
A review of VAM research from the Educational Testing Service’s Policy Information Center concluded,
VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.
And RAND Corporation researchers reported that,
The estimates from VAM modeling of achievement will often be too imprecise to support some of the desired inferences…

and that
The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers
The researchers further point out:
A number of factors have been found to have strong influences on student learning gains, aside from the teachers to whom their scores would be attached. These include the influences of students’ other teachers—both previous teachers and, in secondary schools, current teachers of other subjects—as well as tutors or instructional specialists, who have been found often to have very large influences on achievement gains. These factors also include school conditions—such as the quality of curriculum materials, specialist or tutoring supports, class size, and other factors that affect learning. Schools that have adopted pull-out, team teaching, or block scheduling practices will only inaccurately be able to isolate individual teacher “effects” for evaluation, pay, or disciplinary purposes.
Student test score gains are also strongly influenced by school attendance and a variety of out-of-school learning experiences at home, with peers, at museums and libraries, in summer programs, on-line, and in the community. Well educated and supportive parents can help their children with homework and secure a wide variety of other advantages for them. Other children have parents who, for a variety of reasons, are unable to support their learning academically. Student test score gains are also influenced by family resources, student health, family mobility, and the influence of neighborhood peers and of classmates who may be relatively more advantaged or disadvantaged.
Teachers’ value-added evaluations in low-income communities can be further distorted by the summer learning loss their students experience between the time they are tested in the spring and the time they return to school in the fall. Research shows that summer gains and losses are quite substantial. A research summary concludes that while students overall lose an average of about one month in reading achievement over the summer, lower-income students lose significantly more, and middle-income students may actually gain in reading proficiency over the summer, creating a widening achievement gap. Indeed, researchers have found that three-fourths of schools identified as being in the bottom 20% of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account. Similar conclusions apply to the bottom 5% of all schools.
For these and other reasons, even when methods are used to adjust statistically for student demographic factors and school differences, teachers have been found to receive lower “effectiveness” scores when they teach new English learners, special education students, and low-income students than when they teach more affluent and educationally advantaged students. The nonrandom assignment of students to classrooms and schools—and the wide variation in students’ experiences at home and at school—mean that teachers cannot be accurately judged against one another by their students’ test scores, even when efforts are made to control for student characteristics in statistical models.
In fact, the experience of No Child Left Behind underscores the need to move beyond value added modeling:
The limited existing indirect evidence on this point, which emerges from the country’s experience with the No Child Left Behind (NCLB) law, does not provide a very promising picture of the power of test-based accountability to improve student learning. NCLB has used student test scores to evaluate schools, with clear negative sanctions for schools (and, sometimes, their teachers) whose students fail to meet expected performance standards.
In fact, the researchers document:
Yet although there has been some improvement in NAEP scores for African Americans since the implementation of NCLB, the rate of improvement was not much better in the post- than in the pre-NCLB period, and in half the available cases, it was worse.
This contributes to a host of other problems: misanalysis and blowback in the form of lower teacher morale:
As we show in what follows, research and experience indicate that approaches to teacher evaluation that rely heavily on test scores can lead to narrowing and over-simplifying the curriculum, and to misidentifying both successful and unsuccessful teachers. These and other problems can undermine teacher morale, as well as provide disincentives for teachers to take on the neediest students. When attached to individual merit pay plans, such approaches may also create disincentives for teacher collaboration. These negative effects can result both from the statistical and practical difficulties of evaluating teachers by their students’ test scores.
A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy—the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education. Rather, private-sector managers almost always evaluate their professional and lower-management employees based on qualitative reviews by supervisors; quantitative indicators are used sparingly and in tandem with other evidence. Management experts warn against significant use of quantitative measures for making salary or bonus decisions.
Additional Research on Value Added Modeling (VAM), Value Added Testing (VAT), & Teacher Quality Assessment
To Be Added