1. What started the movement to evaluate teacher performance by test scores?
For the last several decades, researchers have been studying teachers' impact on their students' standardized test scores, and have found substantial variation among teachers. The students of some teachers consistently made more growth on tests than the students of other teachers did. Researchers also generally found that teachers' ability to raise test scores was only modestly correlated with concrete characteristics such as experience. This helped galvanize a focus among school reformers on improving teacher quality (often referred to as the most important in-school factor affecting student achievement), as well as the belief that long-standing certification and personnel practices were falling short.
In 2009, the reform-minded nonprofit TNTP (then known as the New Teacher Project) released an influential report called the "Widget Effect," which studied a dozen school districts in four states. It found that "on paper, almost every teacher is a great teacher." In other words, districts' evaluation systems did not differentiate between good and bad teaching, resulting in limited help for struggling teachers or recognition for high performers. The report recommended that districts "adopt a comprehensive performance evaluation system" that would rate teachers "based on their effectiveness in promoting student achievement." The report, along with other evidence, made an impression on the Obama administration, which used its newly created Race to the Top initiative to award money to states that created teacher and principal evaluation systems that included "student growth," i.e., test scores. The administration has also used waivers from the tough requirements of the federal No Child Left Behind law to spur similar policies; one state, Washington, lost its waiver because it refused to use test scores in teacher evaluation.
2. Has the move to evaluate teachers based on test scores created a backlash?
Yes, indeed. The move to judge teachers by student growth has helped lead to a proliferation of new tests, partly because of the desire to evaluate teachers in grades and subjects that have traditionally lacked standardized tests, such as social studies and grades K-2. Frustration about overtesting spurred an opt-out movement across the country, most prominently in New York, where roughly one in five students declined to sit for the most recent state test. Teachers unions have also pushed back forcefully against testing. The Obama administration has announced plans to reduce testing, but says that student growth should remain a part of teacher evaluation systems. It is unclear, however, whether such plans will actually lead to fewer tests.
3. Are all teachers now evaluated on test scores?
Not all, but most. A 2013 report by the National Council on Teacher Quality found that 41 states now require student test scores to be a part of teachers' evaluations, up from just 15 states in 2009.
4. What are value-added measures or VAM?
Value-added measures (VAM) are among the most common ways to evaluate teachers using test scores. They are statistical models that attempt to isolate a given teacher's impact on (or 'added value' to) student learning. The models work by comparing a student's estimated score on a standardized test to the student's actual score; the difference between the two is the value the teacher is credited with adding for that student. The estimated score is based on past student test scores and sometimes other factors such as poverty and disability status. A teacher's overall VAM score is computed by averaging together the value added for each of his or her individual students. Not all teachers receive VAM scores, since they are only computed for those who teach a grade and subject that ends in a standardized test. In New York, for instance, about one in five teachers receive a growth rating from the state.
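To make the arithmetic concrete, here is a minimal sketch of the calculation described above. It is an illustration under strong assumptions, not any state's actual model: the records are invented, the estimated score comes from a simple straight-line fit on a single prior score (real models typically use several prior scores and other controls), and the names `records`, `fit_line`, and `value_added` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented records: (teacher, prior-year score, actual current-year score).
records = [
    ("Teacher A", 60.0, 68.0),
    ("Teacher A", 70.0, 77.0),
    ("Teacher A", 80.0, 86.0),
    ("Teacher B", 60.0, 62.0),
    ("Teacher B", 70.0, 73.0),
    ("Teacher B", 80.0, 84.0),
]


def fit_line(xs, ys):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def value_added(rows):
    """Average (actual - estimated) score gap for each teacher."""
    priors = [prior for _, prior, _ in rows]
    actuals = [actual for _, _, actual in rows]
    # Assumption: the estimated score depends only on the prior score.
    slope, intercept = fit_line(priors, actuals)

    gaps = defaultdict(list)
    for teacher, prior, actual in rows:
        estimated = intercept + slope * prior
        gaps[teacher].append(actual - estimated)
    # A teacher's overall score is the mean gap across his or her students.
    return {teacher: mean(values) for teacher, values in gaps.items()}


scores = value_added(records)
for teacher, score in sorted(scores.items()):
    print(f"{teacher}: {score:+.2f}")
```

In this toy data, Teacher A's students score above their estimates on average and Teacher B's students score below them, so A receives a positive value-added score and B a negative one.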
5. Are teachers who don't receive VAM scores still evaluated based on tests?
In many cases, yes. There are two common approaches for such teachers:
- Group measures of performance, in which teachers are evaluated based on test scores of students or subjects they don't teach. A common example is teachers being judged on the entire school's math or English score even if they teach, say, art. This has occurred in several places and generated significant controversy.
- Student learning objectives (SLOs), in which teachers set goals for student performance on a test, either one they create themselves or a standardized one. The goals are approved by their supervisor, who then assesses the teacher based on how well the students meet those goals. One study of schools in Austin, Texas, found no correlation between a teacher's SLO score and his or her VAM score, while another in Denver, Colorado, found a moderate correlation. These results may arise because SLOs and VAMs assess different aspects of teacher quality, but they might also call into question whether SLOs are valid measures of teacher performance.
6. Are there different types of growth models?
There are.
VAMs are among the most common. Another common model is known as student growth percentiles (SGPs), which, like VAM, measures student test score growth, but with a different mathematical technique. These models rank students with similar prior achievement based on how much growth they make. Such models, unlike VAM, often do not include controls for student characteristics like poverty, and so may unfairly disadvantage teachers of at-risk students.
Different VAMs also draw on different academic and demographic factors to create students' estimated scores. In general, models that account for more student characteristics do a better job of ensuring a level playing field for teachers of academically challenged students.
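The ranking idea (comparing each student only with peers who had similar prior achievement) can be sketched as follows. This is a deliberate simplification: it groups students into fixed-width bands of prior scores and ranks them within each band, whereas operational models typically rely on more sophisticated statistical techniques. The data, the `band_width` parameter, and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Invented records: (teacher, prior-year score, current-year score).
students = [
    ("A", 42, 55), ("A", 45, 60), ("B", 41, 50), ("B", 48, 52),
    ("A", 81, 90), ("B", 85, 88), ("B", 83, 95), ("A", 88, 85),
]


def growth_percentiles(rows, band_width=10):
    """Percentile rank of each student's current score among peers whose
    prior score falls in the same band (a crude stand-in for 'similar
    prior achievement')."""
    bands = defaultdict(list)
    for _, prior, current in rows:
        bands[int(prior // band_width)].append(current)

    ranked = []
    for teacher, prior, current in rows:
        peers = bands[int(prior // band_width)]
        below = sum(1 for score in peers if score < current)
        ranked.append((teacher, 100.0 * below / len(peers)))
    return ranked


def median_growth_by_teacher(rows, band_width=10):
    """Summarize each teacher by the median percentile of his or her students."""
    by_teacher = defaultdict(list)
    for teacher, percentile in growth_percentiles(rows, band_width):
        by_teacher[teacher].append(percentile)
    return {teacher: median(values) for teacher, values in by_teacher.items()}


summary = median_growth_by_teacher(students)
print(summary)
```

Note that nothing in this sketch adjusts for poverty or other student characteristics, which is the source of the fairness concern raised above.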
Some models compare teachers only to other teachers in the same school, though most compare teachers across a given state. Generally, different models produce at least results.
7. What are some potential uses of VAM?
The most controversial question is whether to use VAM for individual teacher evaluation. It can be (and often is) used for other purposes as well. For example, it has long been used in research to evaluate the effectiveness of a given program or to look for teacher characteristics that are associated with student achievement. Some advocate that VAM also be used to evaluate principals and schools, among others.
8. What are some of the arguments for and against using VAM in teacher evaluation?
Significant debate exists as to whether (and to what extent) VAM should be used in individual teachers' evaluations. Supporters argue that it directly measures teachers' effects on student achievement, is free of some of the bias that is part of other measures of teacher performance (like principal observations), is connected to long-run student outcomes, and is particularly effective in identifying high- and low-performing teachers. Critics counter that scores fluctuate significantly from year to year, that test scores provide a narrow sense of teacher quality, and that attaching stakes to tests will lead to teaching to the test and even cheating.
9. Is VAM a valid measure of teacher performance?
This is a controversial and complex question, and the research to date has not reached a clear conclusion. The answer also depends on subjective views on which student outcomes are important and how they should be measured. Even among those who support VAM in teacher evaluation, there is disagreement on how heavily it should be weighted. There is evidence that teachers who have high VAM scores produce lasting gains for their students in terms of college enrollment and adult earnings (though these results have been contested). On the other hand, it is clear that teachers' influence extends well beyond test scores. Studies have shown that teachers can affect students' non-cognitive skills and behaviors (such as attendance and discipline) and that teachers who do well in this respect are not necessarily the same ones who raise test scores the most. VAM also tends to be correlated, but not highly, with other measures of teacher quality, though some studies have found no correlation. Together, this research suggests that although VAM does not capture all aspects of quality teaching, it is capturing at least some meaningful information.
There are also concerns about whether VAM can accurately measure teachers who work with students who are particularly high- or low-performing, though some research suggests this is rarely a major problem. There is also debate about whether the way students are assigned to classrooms can bias VAM scores. It is also probably fair to assume that validity varies from test to test: a low-quality exam is unlikely to be a particularly strong measure of teacher performance or student knowledge.
*Note that 'validity' here is used in the statistical sense, meaning a measure's success in measuring what it purports to measure, which in this case is teacher effectiveness.
10. Is VAM reliable?
VAM scores can and do fluctuate from year to year, and much of this fluctuation is the result of imprecise measurement (also known as "error"). For example, one study found that 57 percent of teachers who were in the bottom fifth of performance in one year had moved to another level in the subsequent year, and 8 percent of the bottom-level teachers were in the top performance category in the following year. In general, the correlation from year to year ranges between .2 (weak) and .7 (fairly high).
The reliability is generally higher for math teachers than for English teachers. Some (but not all) of this instability can be reduced by averaging multiple years of data. The year-to-career correlation of a given teacher's VAM is higher (ranging from .55, medium, to .78, high, in one study) than the year-to-year correlation. Finally, it's crucial to note that all performance measures have some degree of instability. There is less evidence about the reliability of alternative measures, but what exists generally suggests principal observations are somewhat more stable than VAM, though stability (reliability) does not imply validity. In other words, a measure could be consistent over time, like a teacher's height, but not a very valid one for judging how well that teacher teaches.
*Note that 'reliability' here is used in the statistical sense, meaning a measure's consistency.
*In statistical terms, a correlation coefficient ranges between -1 and 1. A correlation of 0 means there is no linear association whatsoever; 1 means a perfect positive correlation; and -1 means a perfectly negative correlation.
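For readers who want to see how such a correlation is computed, the sketch below calculates a Pearson correlation coefficient between two years of scores. The scores are invented for five hypothetical teachers and are chosen only so that the result lands inside the range discussed above; the function name `pearson` is likewise just an illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Invented value-added scores for the same five teachers in two years.
year_one = [-2.0, -1.0, 0.0, 1.0, 2.0]
year_two = [-1.0, 1.0, -2.0, 2.0, 0.0]

r = pearson(year_one, year_two)
print(f"year-to-year correlation: {r:.2f}")
```

A teacher's ranking can change noticeably between the two years even when the correlation is positive, which is the kind of instability the reliability debate is about.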
11. Does using tests for high-stakes decisions in teacher evaluation lead to negative unintended consequences? Will it lead to positive consequences?
We don't know for sure yet, though there's certainly a possibility that it will, and there is some evidence suggesting both positive and negative outcomes.
There is research showing that holding schools accountable for student test scores has led to teaching to the test and cheating. At the same time, there is evidence that test-based accountability for schools has in many circumstances increased student achievement both on high-stakes tests, like the yearly standardized tests, and on low-stakes exams, like the National Assessment of Educational Progress test given every two years.
However, the gains on the low-stakes tests are not as dramatic as those on the high-stakes exams, which gets back to whether teachers are teaching to the high-stakes tests or cheating on them. Schools can adopt policies that discourage cheating, and there may be ways of designing tests that make teaching to them less of a concern.
12. Didn't the American Statistical Association (ASA) say that VAM should not be used?
Not quite, even though some news outlets have reported it that way. The ASA, the country's largest organization of statisticians, does urge significant caution in how VAM is used. The ASA put out a statement summarizing research on VAM, but it does not say at any point that VAM should not be used. In fact, the statement says, "When used appropriately, VAMs may provide quantitative information that is relevant for improving education processes." The statement warns, "Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes." The statement adds, "Ranking teachers by their VAM scores can have unintended consequences that reduce quality." Some researchers in the field have criticized parts of the ASA's statement, suggesting the group left out recent research that addressed many of its own concerns about VAM.
13. Has the use of VAM led to improved results for students?
It's too early to tell.
There have been relatively few studies on how the use of VAM in districts and schools affects students. The few pieces of research that do exist offer reasons for both caution and optimism.
- A study found that providing districts with value-added data did not lead to improved student outcomes (relative to similar districts that did not have access to such data).
- A program that offered teachers with high VAM scores a $20,000 bonus for transferring to a high-poverty school produced significant student achievement gains in elementary grades but no effect in middle school.
- A study of New York City's tenure system, which was made more rigorous partly by using VAM scores, found that the reforms likely led to improvements in teacher quality.
- An experiment in which a group of New York City principals were given VAM scores produced small improvements in student achievement (relative to students of principals who were not given such data).
14. What do teachers unions say about using test scores in teacher evaluations?
Teachers unions have generally been skeptical about the use of test scores in teacher evaluation, and such skepticism has increased in recent years. Randi Weingarten, president of the American Federation of Teachers (AFT), originally expressed openness to the use of test scores in teacher evaluation, saying in 2010 that student progress should be used alongside other measures; Weingarten also supported a Colorado law that required half of teachers' evaluations to be based on student assessments. However, in 2014 Weingarten came out strongly against VAM, saying its use had led to an overemphasis on testing. The National Education Association (NEA) has followed a similar path. In 2011, the union passed a policy statement signaling openness to using test scores in teacher evaluation in theory. But union leaders said at the time that no tests were high-quality enough to be used for that purpose in practice. The NEA backed further away from the practice in 2014, stating that "standardized tests, even if deemed valid and reliable, may not be used to support any employment action against a teacher." NEA president Lily Eskelsen Garcia has been a sharp critic of standardized testing, referring to VAM as "voodoo."
15. Where can I find additional information about VAM?
- Carnegie Knowledge Network on Value-Added Measures in Education
- Economic Policy Institute, "Problems with use of student test scores to evaluate teachers"
- Brookings, "Evaluating Teachers: The Important Role of Value-Added"
- Brookings, "New Evidence Requires New Thinking"
- American Statistical Association (ASA) Statement on VAM
- Response to ASA Statement
- Shanker Institute, "Value-Added Versus Observations"
- Shanker Institute, "About Value-Added and 'Junk Science'"
- American Enterprise Institute, "Teacher Quality 2.0"
- Doug Harris, Value-Added Measures in Education