By Louise Benson, Lead Psychometrician
Thursday 4 February 2021
When asked why I chose to pursue a career in assessment, my response can be summed up in a single word: fairness. There are many technical terms used to talk about assessments, but they all ultimately boil down to one thing: fairness.
In the pre-Covid world I felt confident I knew what this meant: high-quality test instruments which give learners the opportunity to demonstrate what they can do, and stakeholders (teachers, awarding bodies, further and higher education institutions, employers) useful information about what those learners know and can do.
Characteristics of high-quality assessment
A high-quality assessment is generally one which can be demonstrated to be valid and reliable. Validity is an umbrella term, often described as meaning that a test measures the thing it is supposed to measure, but really what this means is that the results from a test can be interpreted in a meaningful way. This is by no means straightforward. Academic tests seek to measure something inside the heads of students: a so-called 'latent trait' which influences performance on a test but not in a perfectly direct or uniform way (the discrepancy between the two is referred to as measurement error). The job of test developers and psychometricians is to gather evidence to build a reasonable argument that the test scores are sufficiently related to what the test is designed to measure (the ‘test construct’) and, conversely, to minimise any influences on test results which are irrelevant to the construct. Ensuring that assessments are free from bias (that is, that no group of test-takers is likely to perform better than another due to factors unrelated to the construct being measured) is a key component of validity. So is comparability: ensuring that outcomes from different versions of the same assessment (for example, the test papers for the same qualification in two different years) can be used interchangeably, and that outcomes are not influenced by the year in which a student happened to be examined.
Reliability is also a key component of validity, but can sometimes be in conflict with it too, as striving for reliability can reduce the authenticity of assessment (for example, a test made up entirely of multiple-choice questions reduces marking variability but may also limit the depth of understanding which can be assessed). A reliable test is one in which a student would get the same outcome regardless of the exact circumstances of the test itself, assuming their knowledge of the subject remained constant. Outcomes can, of course, be affected to some degree by all manner of circumstances: the selection of questions in the test, the length of the test, who marked the paper, how the candidate was feeling on the morning of the test, the noise in the examination hall. Some fluctuation is inevitable, but a good test minimises it as far as possible within the practical constraints of the assessment, while striking a balance such that validity is not unduly compromised.
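To make the link between test length and reliability a little more concrete, here is a minimal sketch using the Spearman-Brown formula from classical test theory. It is an illustration only, not a description of how any exam board works: the function name and the example figures are hypothetical, and the formula assumes that any questions added to (or removed from) the test are comparable in quality to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict the reliability of a test whose length is changed.

    reliability   -- reliability of the current test, between 0 and 1
    length_factor -- new length divided by current length
                     (2 = twice as long, 0.5 = half as long)

    Assumes the added or removed questions are comparable to the
    existing ones, which is an idealisation.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    current = 0.70  # hypothetical reliability of an existing paper
    for factor in (0.5, 1, 2, 3):
        predicted = spearman_brown(current, factor)
        print(f"length x{factor}: predicted reliability {predicted:.2f}")
```

Under these assumptions, shortening a paper lowers the predicted reliability and lengthening it raises it, but with diminishing returns. This is one reason why gathering more evidence helps only up to a point, and has to be weighed against the practical constraints.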
What does high-quality assessment mean for qualifications in the current situation?
And so to qualifications in a Covid world. Never before has the concept of fairness felt so in conflict with that of good assessment. Yet that is exactly the challenge that a year like 2021 brings. The principles of good assessment say we should expect the same standard of performance to achieve a particular grade for the 2020 and 2021 qualifications cohorts as in previous years, such that they truly reflect what students know and can do, and enable stakeholders to interpret results meaningfully - this is, after all, the key component of assessment validity! But to enforce that seems grossly unfair on young people who have missed sizeable chunks of learning through no fault of their own. Why shouldn't they be compensated to enable them to earn the qualifications they would have got, had Covid not darkened our doors early in 2020? It's easy to criticise the plans drawn up by the DfE and Ofqual attempting to address this problem, but there is no easy answer when the issue reaches right to the heart of the purpose of assessment. How, then, do the current plans in Ofqual’s consultation stack up against the usual criteria we would use for good quality assessment?
Ofqual say that “a breadth of evidence should inform a teacher’s assessment of their student’s deserved grade” and I wholeheartedly agree with this. It is a crucial component of validity that an assessment samples well from across the whole content domain (that is, the curriculum or syllabus that a test is designed to cover). How far this will be possible this year is difficult to predict. There have been many calls for optionality of test papers so that students can be assessed on the areas of the syllabus they have covered in greater depth, and the consultation proposes this level of choice in papers that would be provided by exam boards. This sounds fair but also means that students with the same eventual grade may have learned very different content to get there. This is, of course, true to some extent in any year as students will have strengths and weaknesses in different areas within a subject, but this will be exacerbated considerably this year, making interpretations of eventual grades more difficult. While we recognise the need for some degree of choice in the evidence that teachers will use to inform their assessments this year, the areas of the curriculum assessed for each student should be as broad as possible to maximise validity.
Reliability is maximised by making assessments as objective as possible and gathering as much evidence as possible, within practical constraints. The provision of papers by exam boards in the proposals would go some way to achieving this, but some level of external marking or moderation of these papers would make a big difference to reliability if it were feasible. Marking for qualifications and national assessments usually involves an extensive process of recruiting and training a large team of markers, assessing their readiness to mark to the standard in a ‘standardisation’ process, and some form of ongoing quality assurance, as well as support and peer review through a hierarchy of team leaders. Of course, this still doesn’t result in marking which is perfectly reliable (that’s simply not possible when it involves humans making judgements), but the process ensures a much higher level of objectivity and consistency than can be achieved by simply providing teachers with a mark scheme. We certainly agree that teachers should be given as much support as possible by exam boards in marking the papers provided by them, but it is difficult to imagine how this support could replicate the level of consistency that would usually be achieved in the marking process. Teachers would need to undertake this training, mark the papers, award the grades and carry out internal moderation, all while continuing to teach every year group in an exceptionally challenging year, which would add hugely to their workload. In addition, teachers would bear a considerable burden of responsibility in being accountable for the judgements themselves, with the need to provide compelling evidence supporting those judgements in the proposed appeals process.
We do understand that external marking or moderation may just not be practical in the timescales required to inform teacher assessments, but if it is possible then the benefits, both in improving the reliability of the assessments and in reducing the burden on teachers, would be significant.
The potential for bias in human judgement is an emotive topic, but the literature does contain some evidence, albeit very mixed, of bias in teacher assessments. An external element of marking, moderation or peer review would help to minimise the likelihood of bias affecting the results of some groups of students, and would also help to protect teachers from accusations of bias.
Ofqual rightly state in the consultation document (p. 29) that “the usual assurances of comparability between years, between individual students, between schools and colleges and between exam boards will not be possible”. It is important to recognise this issue: it means that grades from 2021 should not be compared directly with those from 2020, and neither should be compared with those from 2019 and earlier. The grade inflation that arose in 2020 due to the eventual method used to assign grades is unlikely to have reflected a genuine improvement in performance at whole cohort level of the magnitude seen in the outcomes. It is difficult to anticipate what the eventual distribution of grades will look like in 2021, but I agree that Ofqual are sensible not to apply statistical moderation at cohort level in an attempt to align outcomes with previous years: this would perhaps provide some perception of comparability but, ultimately, would be unlikely to achieve it. The lack of comparability is problematic from a validity perspective, though, given that validity is concerned with how reported outcomes (i.e. grades) can be interpreted. It is important that stakeholders using qualifications to inform decisions (further and higher education sectors and future employers) understand this caveat, but I am concerned, as is generally the case with statistical caveats, that this understanding will be lost and direct comparisons will continue to be made across different cohorts. Provision of performance descriptors and extensive guidance and training on how teachers should use them, as well as externally provided assessments, would help to maximise comparability of standards across schools and across year groups as far as possible.
But does that make it ‘fair’?
Whilst in this blog I have focused on the quality of assessment itself, we mustn’t, of course, lose sight of the impact of lost learning and uncertainty on the students who were due to take examinations this year. Assessment, however ‘good’ it is, cannot undo the social injustice arising from the fact that different people will have been affected to different degrees by circumstances outside of their control during the course of the pandemic. It would be impossible for grades to be adjusted somehow to take account of this fairly across the whole cohort of students with such a vast array of experiences. This remains the biggest challenge to be addressed, and one which stakeholders selecting students for the next stage in their education or employment careers must be mindful of.