Validity and baseline assessment
Monday 20 April 2015
Much of the recent controversy about baseline assessment has centred on arguments about its validity. However, this term is widely used and abused with little attention to its real meaning – for example, the phrase ‘statistically invalid’ in a recent letter to the Guardian is literally meaningless.
Validity is a powerful and subtle concept which deserves better. As something of a validity geek, I have been listening to conference presentations and reading articles on this fascinating topic for over two decades. It is an important consideration in the baseline debate, but one that needs to be better understood.
In any assessment, the outcome, whether expressed as a score, level, grade or descriptor, has to stand for something much more important – the actual abilities, skills and understanding that the learner has acquired. The validity of the assessment is the extent to which the outcome accurately represents those important abilities. But just as the learner’s abilities themselves cannot be made instantly visible in their pure form, nor can the validity of an assessment be displayed as a quality that is clearly present or absent. Instead, recent scholarship has emphasised validation as a process: a research investigation that takes place during the development of the assessment and continues when it is in use. The validation process identifies threats to validity and seeks evidence to demonstrate that the threats have been avoided.
Broadly, validation evidence is of two types: judgmental and statistical. In developing the NFER Reception Baseline Assessment, we systematically gathered these two types of evidence from: a review of research; an advisory panel; a practitioner panel; a cultural reviewer; and the Reception teachers and children who trialled the assessment in over 500 schools. These are some of the threats we investigated.
Threat 1: does the context of the assessment prevent children from showing what they know and can do? This is the biggest and most important threat. Early on, we rejected the idea of an on-screen assessment: although it would be highly manageable, it would be too far from the everyday experience of the classroom. Instead, the assessment is a set of face-to-face activities with the child, guided by the teacher or another familiar adult and using practical resources. From the trials, we learned that 92 per cent of teachers thought it was important to have practical resources, and that 83 per cent agreed that the activities allowed children to demonstrate the skills and abilities they bring to school. Statistical analysis showed that the sample of children obtained a wide range of scores, further evidence that their varied abilities were reflected in the varied outcomes.
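To make that kind of distribution check concrete, here is a minimal sketch of the summary one might compute from trial scores. It is not NFER’s actual analysis: the data, the 0 to 60 score scale and the sample size are invented purely for illustration.

```python
import numpy as np

# Fabricated trial scores on a hypothetical 0-60 scale, for illustration only.
rng = np.random.default_rng(seed=1)
scores = rng.normal(loc=30, scale=10, size=500).clip(0, 60).round()

# A wide, well-populated spread of outcomes suggests the assessment
# discriminates across the ability range rather than bunching children
# at the floor or ceiling of the scale.
print(f"min={scores.min():.0f}  max={scores.max():.0f}")
print(f"mean={scores.mean():.1f}  sd={scores.std(ddof=1):.1f}")
print("deciles:", np.percentile(scores, range(10, 100, 10)).round(1))
```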
Threat 2: is there too much variety in the way the assessment is conducted and scored? If the activities are not carried out in the same way for everyone, the outcome may reflect the circumstances of the assessment rather than the capabilities of the children. We therefore opted for mainly structured tasks, set wording and clearly defined criteria. Teachers’ responses to the questionnaire showed that they valued this consistency, and the statistical evidence took the form of reliability coefficients established in the trials: 0.90 to 0.92 for internal consistency and 0.93 to 0.94 for inter-rater reliability, all indicating high reliability.
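For readers curious about what sits behind such figures, the sketch below shows one standard way of computing them: Cronbach’s alpha for internal consistency and, as one simple option for inter-rater reliability, the correlation between two raters’ total scores. The post does not say which coefficients NFER used, so treat these as illustrative choices.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (children x items) matrix of item scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return float((k / (k - 1)) * (1 - item_vars.sum() / total_var))

def inter_rater_r(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
    """A simple inter-rater statistic: the Pearson correlation between the
    total scores that two raters assign to the same children."""
    return float(np.corrcoef(rater_a, rater_b)[0, 1])
```

By convention, coefficients around 0.9, as reported here, are regarded as high reliability for an assessment of this kind.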
Threat 3: is the assessment biased against particular groups? For example, if an assessment is set in a context that is more familiar to boys than girls, the score may reflect familiarity with the context rather than the abilities of the children. For much of our assessment, the context is the familiar classroom, which avoids this kind of bias. Where we provide pictures, they are carefully designed to be familiar to as wide a range of children as possible. There was judgmental evidence from a cultural reviewer whose brief was to detect any bias, and statistical evidence comparing the performance of different groups.
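The post does not describe the statistical method used, but the simplest kind of group comparison is a standardised gap between two groups’ scores; a serious bias investigation would go further, for example with differential item functioning analysis. The function below is a hypothetical sketch and assumes roughly similar group sizes.

```python
import numpy as np

def group_gap(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Standardised mean difference (Cohen's d) between two groups' scores,
    using a simple pooled standard deviation that assumes similar group sizes.
    A gap on a particular item that is much larger than the overall gap can
    flag content that favours one group over the other."""
    pooled_sd = np.sqrt((scores_a.var(ddof=1) + scores_b.var(ddof=1)) / 2)
    return float((scores_a.mean() - scores_b.mean()) / pooled_sd)
```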
Threat 4: does implementing the assessment have undesirable consequences that can be traced back to its design? This can happen when teachers narrow the curriculum they teach to focus on what is assessed. The NFER research team worked to avoid this by linking the assessment closely to the existing Early Years Foundation Stage curriculum and by including a Foundations of Learning Checklist to embrace a broader range of development. Empirical evidence about this can only be gathered once the assessment is in use, and follow-up investigations will look at these questions.
This post can give only a brief summary of the main lines of validation; there is much more detail behind each of these points. The Carter review identified assessment as the area in which teacher preparation was most in need of improvement. It would be unrealistic to expect all teachers to find validity debates as interesting as I do, but they do need an overview of what validity does and does not mean, and some grasp of its power as a tool for evaluating the worth of an assessment. There is a danger in the current climate that teachers will become less well informed about this, rather than improving their expertise.