September 5, 2011 / compassioninpolitics

Criticism of Value Added Modeling via Student Testing as a Means to Evaluate Teacher Effectiveness

There is a debate at the heart of educational public policy around the idea of incentive-based pay. One of the leading ways to determine the basis of that pay is value-added modeling of student test scores. However, this author in Ed Week points out that such testing is a form of tunnel vision that tells only one part of the story (and encourages a teach-to-the-test methodology):

The whole thing brings to my mind the collateralized debt bubble, in which incredibly complex models were built atop a pretty narrow set of assumptions and the simple conviction that assumptions could be taken as givens. In 2004, questioning underlying assumptions about real estate valuation would get an analyst dismissed as unsophisticated.

Edu-econometricians are eagerly building intricate models stacked atop value-added scores. Yet, today’s value-added measures are, at best, a pale measure of teacher quality. There are legitimate concerns about test quality; the noisiness and variability of calculations; the fact that metrics don’t account for the impact of specialists, support staff, or shared instruction; and the degree to which value-added calculations rest upon a narrow, truncated conception of good teaching. Value-added does tell us something useful and I’m in favor of integrating it into evaluation and pay decisions, accordingly, but I worry when it becomes the foundation upon which everything else is constructed.

When well-run public or private sector firms evaluate employees, they incorporate managerial judgment, peer feedback, and so forth, without assuming that these will or should reflect project completion, sales, assembly line performance, or what-have-you. The whole point of these other measures is to get a fuller picture of performance; and that would be self-defeating if these other measures were supposed to measure one underlying thing.

The one downside to having a slew of first-rate econometricians engaged in edu-research nowadays is that in their eagerness for outcomes to analyze, they tend to care less about the caliber of the numbers than whether they can count them. In the housing bubble, rocket scientists crunched decades of housing data to build complex models. Their job wasn’t to sweat the quality of the data, its appropriateness, or the real-world utility of their assumptions; it was to build dazzling models. The problem is that even the cleverest of models is only as good as the data. And it turned out that the data and assumptions were rife with overlooked problems.

Edu-econometricians love test scores because they can find increasingly sophisticated ways to model them. But if the scores are flawed, biased, or incomplete measures of learning or teacher effectiveness, the models won’t pick that up. Yet those raising such questions are at risk of being dismissed as unsophisticated and retrograde.

In addition, even economists, who understand the nature of data, have pointed to the intrinsic problems of relying solely on these models. For instance, scholars convened by the Economic Policy Institute have published “Problems with the Use of Student Test Scores to Evaluate Teachers” (the authors include Eva L. Baker, Paul E. Barton, Linda Darling-Hammond, Edward Haertel, Helen F. Ladd, Robert L. Linn, Diane Ravitch, Richard Rothstein, Richard J. Shavelson, and Lorrie A. Shepard). The researchers highlight the problems of value-added modeling via testing alone:

For a variety of reasons, analyses of VAM results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach. One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a “teacher effect” or the effect of a wide variety of other factors.
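To make the two stability figures above easier to compare, here is a minimal simulation sketch. It rests on an assumption that is not in the report: each teacher's ratings in two consecutive years are treated as standard normal scores whose squared correlation equals the share of variation predicted (4% to 16%, so a correlation of roughly 0.2 to 0.4). The function name and sample size are made up for illustration. Under that assumption, the sketch estimates how many top-20% teachers in year one would land in the top 20% again in year two.

```python
import math
import random


def top_quintile_persistence(r_squared, n_teachers=50_000, seed=0):
    """Share of year-1 top-20% teachers who are in the top 20% again in year 2.

    Assumption (not from the report): ratings in the two years are standard
    normal scores with correlation sqrt(r_squared).
    """
    rng = random.Random(seed)
    rho = math.sqrt(r_squared)               # year-to-year correlation
    noise_scale = math.sqrt(1.0 - r_squared)  # keeps year-2 variance at 1

    year1, year2 = [], []
    for _ in range(n_teachers):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + noise_scale * rng.gauss(0.0, 1.0)
        year1.append(z1)
        year2.append(z2)

    # 80th-percentile cutoffs define the "top 20%" in each year.
    cut_index = int(0.8 * n_teachers)
    cut1 = sorted(year1)[cut_index]
    cut2 = sorted(year2)[cut_index]

    top_year1 = [i for i in range(n_teachers) if year1[i] >= cut1]
    stayed = sum(1 for i in top_year1 if year2[i] >= cut2)
    return stayed / len(top_year1)


for r2 in (0.04, 0.16):
    share = top_quintile_persistence(r2)
    print(f"R^2 = {r2:.2f}: correlation = {math.sqrt(r2):.1f}, "
          f"share staying in top 20% = {share:.2f}")
```

If ratings were pure noise, about 20% of the top group would repeat by chance alone, so the closer the simulated share sits to 0.20, the less the ranking says about a stable teacher trait.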

The researchers continue:

VAM’s instability can result from differences in the characteristics of students assigned to particular teachers in a particular year, from small samples of students (made even less representative in schools serving disadvantaged students by high rates of student mobility), from other influences on student learning both inside and outside school, and from tests that are poorly lined up with the curriculum teachers are expected to cover, or that do not measure the full range of achievement of students in the class.

For these and other reasons, the research community has cautioned against the heavy reliance on test scores, even when sophisticated VAM methods are used, for high stakes decisions such as pay, evaluation, or tenure. For instance, the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences stated,

…VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.

A review of VAM research from the Educational Testing Service’s Policy Information Center concluded,

VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.

And RAND Corporation researchers reported that,

The estimates from VAM modeling of achievement will often be too imprecise to support some of the desired inferences…

and that

The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers or schools.

The researchers further point out:

A number of factors have been found to have strong influences on student learning gains, aside from the teachers to whom their scores would be attached. These include the influences of students’ other teachers—both previous teachers and, in secondary schools, current teachers of other subjects—as well as tutors or instructional specialists, who have been found often to have very large influences on achievement gains. These factors also include school conditions—such as the quality of curriculum materials, specialist or tutoring supports, class size, and other factors that affect learning. Schools that have adopted pull-out, team teaching, or block scheduling practices will only inaccurately be able to isolate individual teacher “effects” for evaluation, pay, or disciplinary purposes.

Student test score gains are also strongly influenced by school attendance and a variety of out-of-school learning experiences at home, with peers, at museums and libraries, in summer programs, on-line, and in the community. Well educated and supportive parents can help their children with homework and secure a wide variety of other advantages for them. Other children have parents who, for a variety of reasons, are unable to support their learning academically. Student test score gains are also influenced by family resources, student health, family mobility, and the influence of neighborhood peers and of classmates who may be relatively more advantaged or disadvantaged.

Teachers’ value-added evaluations in low-income communities can be further distorted by the summer learning loss their students experience between the time they are tested in the spring and the time they return to school in the fall. Research shows that summer gains and losses are quite substantial. A research summary concludes that while students overall lose an average of about one month in reading achievement over the summer, lower-income students lose significantly more, and middle-income students may actually gain in reading proficiency over the summer, creating a widening achievement gap. Indeed, researchers have found that three-fourths of schools identified as being in the bottom 20% of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account. Similar conclusions apply to the bottom 5% of all schools.

For these and other reasons, even when methods are used to adjust statistically for student demographic factors and school differences, teachers have been found to receive lower “effectiveness” scores when they teach new English learners, special education students, and low-income students than when they teach more affluent and educationally advantaged students. The nonrandom assignment of students to classrooms and schools—and the wide variation in students’ experiences at home and at school—mean that teachers cannot be accurately judged against one another by their students’ test scores, even when efforts are made to control for student characteristics in statistical models.

In fact, the country's experience with No Child Left Behind suggests the need to move beyond test-based accountability such as value-added modeling:

The limited existing indirect evidence on this point, which emerges from the country’s experience with the No Child Left Behind (NCLB) law, does not provide a very promising picture of the power of test-based accountability to improve student learning. NCLB has used student test scores to evaluate schools, with clear negative sanctions for schools (and, sometimes, their teachers) whose students fail to meet expected performance standards.

The researchers document:

Yet although there has been some improvement in NAEP scores for African Americans since the implementation of NCLB, the rate of improvement was not much better in the post- than in the pre-NCLB period, and in half the available cases, it was worse.

Heavy reliance on test scores also contributes to a host of other problems, including misidentification of effective and ineffective teachers and blowback in the form of lower teacher morale:

As we show in what follows, research and experience indicate that approaches to teacher evaluation that rely heavily on test scores can lead to narrowing and over-simplifying the curriculum, and to misidentifying both successful and unsuccessful teachers. These and other problems can undermine teacher morale, as well as provide disincentives for teachers to take on the neediest students. When attached to individual merit pay plans, such approaches may also create disincentives for teacher collaboration. These negative effects can result both from the statistical and practical difficulties of evaluating teachers by their students’ test scores.

A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy—the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education. Rather, private-sector managers almost always evaluate their professional and lower-management employees based on qualitative reviews by supervisors; quantitative indicators are used sparingly and in tandem with other evidence. Management experts warn against significant use of quantitative measures for making salary or bonus decisions.

Additional Research on Value Added Modeling (VAM), Value Added Testing (VAT), & Teacher Quality Assessment
To Be Added

4 Comments

  1. compassioninpolitics / Sep 5 2011 11:02 pm

    Other research on value-added assessment (these sources will likely be more in favor of value-added assessment):

    http://www.cgp.upenn.edu/ope_new_system.html
    http://www.cgp.upenn.edu/ctr_pubs.html#ope
    http://www.cgp.upenn.edu/ope_techreports.html

    And this book:
    http://www.hepg.org/hep/book/105/AGrandBargainForEducationReform
    (It's about $32 used to $50 new on Amazon)

    You can find a summary here:
    http://www.aasa.org/SchoolAdministratorArticle.aspx?id=12534

    The New Teacher Project:
    http://tntp.org/publications/reports/

    Specifically, Teacher Evaluation 2.0 (the author is one of the main proponents of the value added & pay for performance models):
    http://tntp.org/publications/issue-analysis/teacher-evaluation-2.0/

    It uses 3 sets of criteria (although this is just a sample):
    50% objective learning measures
    30% classroom observation
    20% other student learning measures

    Versus the 60/40 split that they claim represents the current model (see the diagram on page 6).
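As a rough illustration of how a weighting like the 50/30/20 sample above turns into a single rating, here is a minimal sketch. The function name, the 0 to 100 scale, and the component scores are hypothetical and are not taken from the TNTP report; only the three weights come from the sample.

```python
import math

# Weights from the sample criteria above; the dictionary keys are made-up labels.
SAMPLE_WEIGHTS = {
    "objective_learning_measures": 0.50,
    "classroom_observation": 0.30,
    "other_student_learning_measures": 0.20,
}


def composite_score(component_scores, weights=SAMPLE_WEIGHTS):
    """Weighted average of component scores (all on the same scale)."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    if set(component_scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    return sum(weights[name] * component_scores[name] for name in weights)


# Hypothetical teacher, each component scored from 0 to 100.
example_teacher = {
    "objective_learning_measures": 70,
    "classroom_observation": 85,
    "other_student_learning_measures": 80,
}
print(composite_score(example_teacher))
```

Because test-based measures carry at least half the weight in this sample, any noise in those measures passes through to the final rating at half strength or more.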

    “Teachers will need clear information about how the system works and how they can suggest improvements. This will likely require directing more resources and personnel toward teacher evaluations and relieving administrators of less critical responsibilities.”

    “Are teachers receiving useful feedback based on clear expectations?”

    “Do teachers believe they are being evaluated fairly?”

    “Are school leaders getting the support they need to conduct accurate evaluations?”

    As we showed in our 2009 report, The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness, most teacher evaluation systems suffer from a slew of design flaws.

    Infrequent: Many teachers—especially more experienced teachers—aren’t evaluated every year. These teachers might go years between receiving any meaningful feedback on their performance.

    Unfocused: A teacher’s most important responsibility is to help students learn, yet student academic progress rarely factors directly into evaluations. Instead, teachers are often evaluated based on superficial judgments about behaviors and practices that may not have any impact on student learning—like the presentation of their bulletin boards.

    Undifferentiated: In many school districts, teachers can earn only two possible ratings: “satisfactory” or “unsatisfactory.” This pass/fail system makes it impossible to distinguish great teaching from good, good from fair, and fair from poor. To make matters worse, nearly all teachers—99 percent in many districts—earn the “satisfactory” rating. Even in districts where evaluations include more than two possible ratings, most teachers earn top marks.

    Unhelpful: In many of the districts we studied, teachers overwhelmingly reported that evaluations don’t give them useful feedback on their performance in the classroom.

    Inconsequential: The results of evaluations are rarely used to make important decisions about development, compensation, tenure or promotion. In fact, most of the school districts we studied considered teachers’ performance only when it came time to dismiss them.

    Taken together, these shortcomings reflect and reinforce a pervasive but deeply flawed belief that all teachers are essentially the same—interchangeable parts rather than individual professionals.

    This provides a robust critique of Teacher Evaluation 2.0 (specifically pp. 3 to 6):
    http://nepc.colorado.edu/thinktank/review-teach-eval-TNTP

  2. compassioninpolitics / Sep 6 2011 1:16 am

    “Moreover, teachers contribute to other valued student outcomes that are more difficult to measure—for example, socio emotional wellness, civic engagement, moral character, open-mindedness, and motivation for continued learning. A teacher appraisal system based solely on value-added models would exclude these other important contributions.”
    Source: TBA

    This study differs from Jacob and Lefgren (2008) and the previous literature as it is the first to study how well past subjective and objective ratings predict future productivity. We also build on previous work in other ways. First, we consider a broader range of teacher characteristics, one that is based on previous theories and evidence of teacher productivity. We include personality traits such as “caring,” “enthusiastic,” and “intelligent,” as well as evaluations of subject matter knowledge and teaching skill. Second, we analyze the relationship between each of these measures and both the overall evaluation by principals and the teacher value added. Finally, we analyze teacher ratings and student performance in middle and high school, in addition to elementary school, and allow the relationship between teacher characteristics and teacher ratings or teacher value added to vary across these grade groupings.

    Pages 10, 16, and 24 of the paper quoted above list some teacher traits (page 24 seems to offer multipliers for the skills over time):
    http://www.caldercenter.org/upload/CALDER-Working-Paper-30_FINAL.pdf

  3. compassioninpolitics / Sep 6 2011 1:28 am

    Gallagher, H. Alix. 2004. “Vaughan Elementary’s Innovative Teacher Evaluation System: Are Teacher Evaluation Scores Related to Growth in Student Achievement.” Peabody Journal of Education 79(4): 79–107.
    [I think this is a themed issue]

    Jepsen, Christopher. 2005. “Teacher Characteristics and Student Achievement: Evidence from Teacher Surveys.” Journal of Urban Economics 57(2): 302–19.

    Podgursky, Michael J., and Matthew G. Springer. 2007. “Teacher Performance Pay: A Review.” Journal of Policy Analysis and Management 26(4): 909–49.

  4. compassioninpolitics / Sep 6 2011 1:37 am

    Finally, this guy from the University of Wisconsin seems to be a leader in the movement:
    http://eps.education.wisc.edu/faculty/harris.asp

    Here is a large bibliography on the teaching effectiveness & value added debate:
    http://www.tqsource.org/webcasts/evaluateEffectiveness/resources.php

    This includes a critique of the test-based model of assessment:
    Goe, L., Bell, C., Little, O. (2008). Approaches to evaluating teacher effectiveness: A research synthesis. Washington, DC: National Comprehensive Center for Teacher Quality. http://www.tqsource.org/publications/EvaluatingTeachEffectiveness.pdf
