How Reliable Is Peer Assessment In Higher Education?

Published by Simon James

School of Information Technology, Deakin University

These findings are described in the article entitled Identifying items for moderation in a peer assessment framework, recently published in the journal Knowledge-Based Systems (volume 162, 2018, pages 211-219). This work was conducted by S. James, E. Lanham, V. Mak-Hau, L. Pan, T. Wilkin, and G. Wood-Bradley from Deakin University.

Peer assessment can help alleviate one of the major bottlenecks affecting higher education today – the marking workload – but is it reliable? For those worried about biased or incompetent peers deciding their academic fortune, modern aggregation techniques are available to help make peer assessment fairer; however, its widespread acceptance requires a shift in our attitude to grades.

In higher education, peer assessment means that the marking of work traditionally done by the teaching professor is instead undertaken by students in the same course. This is by no means new – peer assessment goes back to at least the beginning of higher education institutions in the 17th century – but today we have the technology and methods to implement it on a much larger scale. The changing landscape of universities has also meant that instructors rely on non-traditional marking methods to handle their workload.

Outside higher education, as consumers, we are becoming increasingly comfortable trusting the opinions of our peers – for example, in the ratings of hotels, restaurants, and mobile apps. In other contexts, peer assessment is seen as a way of being inclusive: the NBA All-Star Game starters are at least partially (25%) decided by player votes. But how would you feel if the grade on your latest essay, presentation, or mathematics problem-solving assignment were based entirely on the grades given by a random selection of your peers?

There are two standard criticisms. First, other students may not be perceived to have the aptitude required to mark fairly. You – an A-grade student – might be marked by a C-grade student who doesn't understand what you've written. They're not qualified, the thinking goes, and part of what you pay the university for is expert feedback. Second, some students in the course may not, for whatever reason, approach the task of marking others earnestly. They might give high, low, or average marks across the board without putting thought into the individual assignments, or worse, they might downgrade an assignment they suspect belongs to someone they dislike or upgrade the assignments of their friends.

In regard to the first criticism, such concerns can usually be addressed by ensuring the marking criteria are clearly explained and that some limited training is provided on how to mark against them. In fact, this exercise can make the criteria clearer when you complete the assignment yourself: by preparing to give feedback to others, you gain a better idea of what is expected. There are also some further points worth reflecting on.

The nature of the professor-student relationship in universities is changing. The idea that only the professor's expertise can provide meaningful commentary on your work may not always hold true, and worse, such an attitude may limit your opportunity to develop the evaluative judgment skills necessary to survive in the real world. Whether an assignment measures learning outcomes related to communicating an idea, responding to questions, or reporting results, it should usually be understandable and clear to anyone in the course. Any formal requirements (addressing certain points, referencing properly, etc.) should also be easy enough for anyone in the course to assess.

Furthermore, students may be unaware of this, but a great majority of marking work is already outsourced to sessional staff, teaching assistants, and so on. Such work is usually paid on the assumption that a certain number of scripts are marked per hour, so assignments may rarely receive the attention we imagine they deserve when writing and submitting them. Even when the teaching professor does the marking, constraints on their time often mean that assignments are skimmed and marked against broad criteria that can be easily identified. For a richer feedback experience in large classes, it may be preferable to receive feedback from five other students than from one professor or teaching assistant.

On the problem of bias, our research (alongside others') has developed methods for reliably aggregating marks in a way that minimizes the impact of marks that differ substantially from others given for the same assignment, or that are given by peers who seem unreliable. Assuming the presence of a moderator, the aim is to automatically identify items (assignments) that might require expert intervention, while grades for the remaining assignments are assigned automatically. We want to reduce the overall work required of the moderator while still providing a reliable evaluation of the students' work. There are three anomalies for which we want to develop automatic detection methods.

The first is any item of work over which there is low consensus in the grades given. Quantifying consensus is related to measuring variability, so methods often look at the standard deviation or some similar measure of spread. If this is high enough, it might be best for the expert to look at the assignment and decide on a fair grade.
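As a rough illustration (not the framework from the paper), a consensus check of this kind might flag items whose marks spread too widely. The sketch below assumes marks on a 0-100 scale, and the threshold is an assumed value chosen for demonstration:

```python
import statistics

def flag_low_consensus(marks, spread_threshold=15.0):
    """Flag an item for moderation if the peer marks disagree too much.

    marks: list of peer marks for one assignment (assumed 0-100 scale).
    spread_threshold: maximum acceptable standard deviation
        (an illustrative value, not one from the paper).
    """
    if len(marks) < 2:
        return True  # too few marks to establish any consensus
    return statistics.stdev(marks) > spread_threshold

# Widely disagreeing marks get flagged for the moderator; close marks do not.
print(flag_low_consensus([45, 82, 90, 38]))  # True
print(flag_low_consensus([70, 73, 68, 71]))  # False
```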

The second is any particular mark that seems abnormal when compared to the rest, i.e. an outlier. The presence of one outlier might not be enough to set off the alarm on an overall measure of consensus, but a single unfair grade can be enough to bring down or pull up the overall score. We therefore need methods that can either identify outliers or downweight their importance in the final grade.
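A simple and well-known way to downweight extremes is a trimmed mean, which discards the most extreme marks before averaging. The sketch below illustrates the idea; it stands in for, rather than reproduces, the aggregation functions studied in the paper:

```python
def trimmed_mean(marks, trim=1):
    """Average the marks after discarding the `trim` lowest and `trim`
    highest values, so a single unfair mark cannot drag the final
    grade up or down.
    """
    if len(marks) <= 2 * trim:
        return sum(marks) / len(marks)  # too few marks to trim safely
    kept = sorted(marks)[trim:len(marks) - trim]
    return sum(kept) / len(kept)

# A lone mark of 20 among otherwise consistent marks is discarded:
print(trimmed_mean([20, 71, 74, 76, 72]))  # about 72.3, not the plain mean of 62.6
```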

Lastly, we want to identify those peers who may be inauthentic in the grades they provide. In our work, we developed a number of calculations to quantify notions such as bias (over-marking or under-marking), internal variability (are they just providing a lazy set of Bs to everyone they're marking?), and consistency with respect to the grades given by other peers to the same items.
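To make these notions concrete, here is a minimal sketch using simple assumed definitions – bias as a marker's average deviation from the other peers' mean on each item, internal variability as the spread of the marks they give, and (in)consistency as their average absolute disagreement with those peer means. These formulas are illustrative assumptions, not the exact calculations from the paper:

```python
import statistics

def marker_profile(own_marks, peer_means):
    """Summarize one peer's marking behaviour.

    own_marks:  marks this peer gave, one per item they marked.
    peer_means: for the same items, the mean of the marks given
                by the other peers.
    Returns (bias, internal_variability, mean_abs_disagreement).
    """
    deviations = [m - p for m, p in zip(own_marks, peer_means)]
    bias = sum(deviations) / len(deviations)    # > 0: over-marking; < 0: under-marking
    variability = statistics.pstdev(own_marks)  # near 0: "a lazy set of Bs"
    disagreement = sum(abs(d) for d in deviations) / len(deviations)
    return bias, variability, disagreement

# A peer who gives everyone roughly 75 regardless of the work:
print(marker_profile([75, 74, 75, 76], [55, 80, 90, 62]))
# (3.25, 0.707..., 13.75) - low variability, high disagreement with peers
```

A marker showing near-zero variability across many items, or a large positive or negative bias, would be a natural candidate for having their marks downweighted or reviewed by the moderator.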

In synthetic experiments, we found that such methods are capable of providing fair marks even in the presence of a high percentage of unreliable markers. In practice, if the peer marks are visible to the moderator, experience has shown that most students approach the task authentically (it's not as insidious as a Panopticon, but students generally aim to do the right thing by their peers when they know the professor is watching).

A shift towards systems of peer marking could also help facilitate an important change in perspective when it comes to education. Rather than treating assignments as a means to a pat on the back or validation from experts, we can approach them as opportunities for feedback – developing a sense of where we sit compared to other students in the class, while also gaining insight into their perspectives and ways to improve.