Article appearing in Shiken 18.1 (August 2014) pp. 34-35.
Author: Jeffrey Durand
Opening paragraph:
A few years ago, I had to put together a speaking test for all the students (about 2,000) at my university. About 60 teachers were available to rate the students, who were tested in groups of four. Two teachers worked together to rate all the students in each group. In speaking tests, raters are often not equally strict (some tend to give slightly higher scores than others), and they occasionally give an unusually high or low score. These problems can be detected with software such as Facets (Linacre, 2012), after which scores can be adjusted or students retested. To do this, however, there needs to be a way to know how strict each rater is compared with the others. This is only possible if all the raters (and tasks and prompts) are connected together in what is called a judging plan (Linacre, 1997; Sick, 2013).
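The idea of raters being "connected" can be made concrete as a graph question: two raters are linked if they scored the same group of students, and a rater's strictness can only be compared with raters reachable through such links. The sketch below, assuming a hypothetical input format in which each group is a tuple of the raters who scored it, uses union-find to list the disconnected subsets of raters. It only checks raters (not tasks or prompts) and is an illustration of the concept, not how Facets itself checks connectedness. The rater labels are made up.

```python
from itertools import combinations


def find(parent, x):
    # Follow parent links to the root, halving the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def rater_subsets(groups):
    """Return the disconnected subsets of raters as a list of sets.

    groups: iterable of tuples, each listing the raters who scored one
    group of students (hypothetical input format).
    """
    parent = {}
    for raters in groups:
        for r in raters:
            parent.setdefault(r, r)
        # Raters who scored the same students are linked together
        for a, b in combinations(raters, 2):
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[rb] = ra
    subsets = {}
    for r in parent:
        subsets.setdefault(find(parent, r), set()).add(r)
    return list(subsets.values())


def is_connected(groups):
    # All raters can be compared only if they form a single subset
    return len(rater_subsets(groups)) == 1


# Fixed pairs: T1/T2 and T3/T4 never share students, so the two pairs
# cannot be compared with each other.
fixed_pairs = [("T1", "T2"), ("T3", "T4"), ("T1", "T2"), ("T3", "T4")]
# Rotating pairs: each rater shares a group with a different partner.
rotating = [("T1", "T2"), ("T2", "T3"), ("T3", "T4"), ("T4", "T1")]

print(is_connected(fixed_pairs))  # False
print(is_connected(rotating))     # True
```

With fixed pairs, each pair forms its own island, so a lenient pair cannot be distinguished from a pair that simply rated stronger students; rotating partners links everyone into one subset.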