Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. The way SET are used is statistically unsound; worse, SET are biased and unreliable. Observational evidence shows that student ratings vary with instructor gender, ethnicity, and attractiveness; with course rigor, mathematical content, and format; and with students' grade expectations. Experiments show that the majority of student responses to some objective questions can be demonstrably false. A recent randomized experiment shows that giving students cookies increases SET scores. Randomized experiments show that SET are negatively associated with objective measures of teaching effectiveness and biased against female instructors by an amount that can cause more effective female instructors to get lower SET than less effective male instructors. Gender bias also affects how students rate objective aspects of teaching. It is not possible to adjust for the bias, because it depends on many factors, including course topic and student gender. Students are uniquely situated to observe some aspects of teaching, and students' opinions matter. But for the purposes of evaluating and improving teaching quality, SET are biased, unreliable, and subject to strategic manipulation. Reliance on SET for employment decisions disadvantages protected groups and may violate federal law. For some administrators, risk mitigation might be a more persuasive argument than equity for ending reliance on SET in employment decisions: union arbitration and civil litigation over institutional use of SET are on the rise. Several major universities in the U.S. and Canada have already de-emphasized, substantially reworked, or abandoned reliance on SET for personnel decisions.
Philip B. Stark is Professor of Statistics and Associate Dean of Mathematical and Physical Sciences at the University of California, Berkeley. He works on inference and the quantification of uncertainty in the physical, biological, and social sciences, from astrophysics, particle physics, earthquakes, and climate, to elections, food safety, gender bias, and teaching evaluations. He is interested in foundational questions in the philosophy of science, such as the meaning of probability and the role of reproducibility and replicability. Methods he developed for auditing elections ("risk-limiting audits") are in law in California, Colorado, and Rhode Island, and in pending U.S. federal legislation. Methods he developed or co-developed are part of the Ørsted geomagnetic satellite data pipeline and the Global Oscillations Network Group helioseismic data pipeline. He has been an expert witness in civil and criminal cases, including matters involving elections, employment discrimination, equal protection, the First Amendment, food safety, jury selection, federal legislation, patents, public utilities, trade secrets, truth in advertising, vaccines, and whistleblower claims.