Is 85% a B? Grading by percentages is not the way to go

In a strictly non-scientific survey, 89% of all students and teachers indicated that they believe in traditional percentage-based grading, where an 85% would be a middling B, a 75% a middling C, and so forth.

Actually, I just made up that 89% figure. But it’s probably in the right ballpark, and it makes just as much sense as the common belief about the meaning of that 89%: obviously a high B, and very close to an A–. Now I suppose that I could devise a test in which a student who earned 89% of the available points really was doing work at the top of the B range — maybe. But devising such a test would be ridiculously time-consuming, and the validity of its results would be highly questionable anyway. In reality, if the objective quality of a student’s work was at the top of the Bs, s/he might get a 92 on one test and a 79 on another, depending on the difficulty of the questions.

So what do most teachers do? There seem to be three solutions to this problem:

  1. Use percentages anyway, and live with the questionable validity.
  2. Grade on a curve.
  3. Use your professional judgment to determine an appropriate scale.

Probably the most common solution is option #1. And what’s wrong with that? Well, by far the biggest flaw is that it discourages teachers from asking challenging questions. If all the questions on a test are easy enough, then it’s reasonable to require a minimum of 80% in order to earn the minimum B– (the mark of basic competence these days). But as soon as the questions get difficult, the class median starts dropping, and otherwise competent students get scores of 72% and the like. Then you get complaints from students and their parents, and in some well-known cases the teacher even gets fired; in any case no one can live with the results. The inevitable consequence is a drift toward mediocrity: ask a mixture of easy and moderate questions, and the minimally competent students will get somewhere around 80%, and everyone is happy.

What about option #2 then? Grading on a curve lets the teacher ask challenging questions and still ensure that the median grade is a B– (or any other desired median). But it brings a whole host of undesirable and probably undesired consequences. In particular, it discourages cooperative learning by pitting one student against another, and it makes the incorrect assumption that all groups of students are equivalent. In reality, all teachers know that it’s perfectly possible for the majority of one class to deserve As and the majority of another to deserve Cs, whether by the luck of the draw or by the vagaries of the master schedule (where honors chemistry meets at the same time as one math class and conceptual chemistry meets at the same time as another). In the former case, grading on a curve unfairly gives excellent students low grades, and in the latter case it unfairly gives mediocre or poor students decent grades. Clearly not the way to go.
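To make the objection concrete, here is a minimal sketch of rank-based curving. The function name, the 20/30/30/20 split, and the two sets of scores are all made up for illustration; real curves vary. The point is only that the letters depend on rank within the class, not on the scores themselves.

```python
def curve_grades(raw_scores, shares=(("A", 20), ("B", 30), ("C", 30), ("D", 20))):
    """Assign letters purely by rank within the class.

    `shares` is a hypothetical split: each letter goes to that percentage
    of the class, starting from the highest raw score.
    """
    n = len(raw_scores)
    # Indices of students, best raw score first (ties keep original order).
    order = sorted(range(n), key=lambda i: raw_scores[i], reverse=True)
    grades = [None] * n
    start = 0
    cumulative = 0
    for letter, percent in shares:
        cumulative += percent
        end = cumulative * n // 100  # integer arithmetic avoids rounding surprises
        for i in order[start:end]:
            grades[i] = letter
        start = end
    return grades


# Two invented classes of ten: one uniformly strong, one uniformly weak.
strong_class = [98, 96, 95, 94, 93, 92, 91, 90, 89, 88]
weak_class = [70, 68, 66, 64, 62, 60, 58, 56, 54, 52]

print(curve_grades(strong_class))
print(curve_grades(weak_class))
```

Both classes come out with the same list of letters: the student with an 88 in the strong class gets a D, and the student with a 70 in the weak class gets an A.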

So we must go with option #3. It clearly allows for challenging questions and variations in the distribution of student populations. This choice in turn branches into three sub-options: determining a scale in advance, determining a scale after looking at students’ raw scores, and determining a scale after looking at student work. Combinations of these three are also possible.

Determining a scale in advance has certain attractions: principally, it keeps the teacher honest by preventing excessive generosity when student results are disappointing. Ideally I think this is the way to go, but it also requires unrealistic amounts of forethought and accuracy in predicting what good students will do; I like it in theory, but I have had to abandon it in practice. Going through each problem and deciding in advance how many points a B student is likely to earn feels too much like guessing.

The second sub-option, determining a scale after looking at the raw scores, is deservedly popular with many teachers, and it’s what most of my colleagues and I have done with final exams for decades. It lets one draw lines between the As and the Bs, between the Ds and the Fs, and so forth, without regard to raw percentages and conscientiously avoiding excessive harshness or excessive generosity. But it still tends to even out the true differences in populations, so I’m not convinced by it.

My choice is the third sub-option, looking at student work in order to determine the scale. This is what we’ve been encouraged to do at Weston, and I’m sold on the idea. It is based on research and recommendations of the Annenberg Foundation, Harvard Project Zero, and others; it takes the most time but has the biggest payoff. The idea is to determine raw scores first without regard to letter grades, and then to examine in detail a reasonable sample of student papers. For instance, sort them by raw scores and then pick three from the middle of the pack, three around the third quartile, and three around the first quartile; go through them problem by problem, and use professional judgment to determine whether those students have “got it” or not. Sometimes it’s hard to sort out conceptual misunderstandings from skill-based errors, but it’s always informative. In this way, we can make an informed decision that says that on this particular test a raw score of 68 is worth a low B, while on another test it might be a 79. This makes it possible to give truly challenging problems while assigning fair and meaningful grades for them. I think it’s a clear win all the way around, except for the fact that it takes more time for the teacher, as do many good ideas.
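The sampling step is mechanical enough to sketch in a few lines. This is only one way to do it: the function name, the `(student, raw_score)` pairs, and the rule for locating "around" a quartile are my own assumptions, and the judgment about whether those students have “got it” still has to be made by reading the papers.

```python
def pick_sample(papers, per_group=3):
    """Pick papers near the first quartile, the median, and the third quartile.

    `papers` is a list of (student, raw_score) pairs (a hypothetical format).
    Returns a dict mapping each label to `per_group` papers, lowest score first.
    """
    ranked = sorted(papers, key=lambda p: p[1])  # lowest raw score first
    n = len(ranked)
    sample = {}
    for label, k in (("first quartile", 1), ("median", 2), ("third quartile", 3)):
        center = (n - 1) * k // 4          # index of the paper nearest this quartile
        lo = max(0, center - per_group // 2)
        hi = min(n, lo + per_group)
        lo = max(0, hi - per_group)        # shift the window back if it ran off the end
        sample[label] = ranked[lo:hi]
    return sample


# Invented data: 21 papers with raw scores 50 through 70.
papers = [(f"student{i}", 50 + i) for i in range(21)]
for label, group in pick_sample(papers).items():
    print(label, [score for _, score in group])
```

With these invented scores, the three groups are the papers scoring 54 to 56, 59 to 61, and 64 to 66: nine papers to read closely before deciding where the letter-grade lines fall on this particular test.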
