Centre for Assessment
Information about assessment
A discussion of reliability
Although it is traditional to distinguish between reliability and validity, it is probably more accurate to say that reliability is a facet of validity; specifically, of construct-related validity. Construct-related validity poses the question: how well do scores on a particular test represent the actual distribution of the characteristic that the test is supposed to assess? Reliability poses the logically prior question: how well do scores on a particular test represent the distribution of some underlying characteristic, whatever it may be? Putting it more straightforwardly, while construct-related validity is concerned with whether test scores represent the ‘correct’ characteristic, reliability is concerned with whether test scores can genuinely be said to represent a characteristic at all.
If, for example, people tended to score radically differently from one administration of a test to the next, then we would seriously question whether the test was measuring anything significant at all (let alone the characteristic that we wanted it to measure). Test scores are reliable to the extent that they are not erratic or random, i.e., to the extent that they appear to measure a real quality, characteristic or trait. This is why the investigation of reliability is centrally concerned with consistency. There are a number of key questions in this respect, each tackling the issue of consistency from a different angle.
First, do all items (i.e., questions) within a test assess the same thing? Strictly speaking, in a test of characteristic X, all individual test items should measure characteristic X and only characteristic X. In reality, the qualities that we attempt to assess are often quite broad, which means that we often end up assessing a number of related characteristics. However, when the marks on a test are to be added together to form a single mark total, then we are often implying the presence of a single coherent quality and, as such, it might still be considered important to demonstrate that the individual test items are actually assessing similar things.
One way of investigating this begins by splitting a test in half: one sub-test is constructed from one half of the items and another from the remaining items. If test takers score similar marks in each sub-test, then this is evidence that the items in each sub-test are assessing similar things. One problem with this approach is that there are usually many ways in which two sub-tests could be constructed from the items of a single test. For this reason, the NFER tends to employ a statistic, known as the KR20 (or its sibling, the KR21), for investigating reliability. These formulae effectively average the correlations that would be obtained from all possible ways of splitting the items into two sub-tests. For practical purposes, the values of KR20 and KR21 range from 0 to +1. In practice, this coefficient depends upon factors such as test length, test content, and how easily the questions can be guessed. It typically varies from around 0.75 to 0.96, the latter being associated with verbal reasoning tests of around 90 items.
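The calculation described above can be sketched in a few lines of Python. The formula is the standard Kuder-Richardson 20; the 0/1 response matrix is invented for illustration and is not drawn from any NFER test.

```python
# A sketch of the KR20 coefficient for dichotomously scored (0/1) items,
# using the standard Kuder-Richardson formula 20:
#   KR20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(total scores))

def kr20(responses):
    """responses: one list of 0/1 item scores per test taker."""
    n = len(responses)      # number of test takers
    k = len(responses[0])   # number of items
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # p = proportion answering each item correctly; q = 1 - p.
    sum_pq = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Five test takers, five items (1 = correct, 0 = incorrect); made-up data.
scores = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
print(round(kr20(scores), 3))
```

A real analysis would use many more items and test takers; with so few, the coefficient is unstable, which is one reason the quoted values above come from tests of up to around 90 items.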
Second, does a test measure the same thing from one administration to the next? Assuming that there really is a stable underlying characteristic to be assessed, then if we happened to administer the same test twice, we would expect people to obtain a similar mark. If they didn’t, then either the characteristic was not as stable as we had assumed or the test was not as sensitive as it should have been. The NFER uses this test-retest technique for investigating the reliability of scores on a number of its tests. Once again, the evidence is presented as a correlation coefficient, theoretically ranging from -1 to +1. For example, the coefficients presented in the User Guide for the NFER’s General Ability Tests were +0.81 and +0.87, for the Non-Verbal Test and Numerical Test, respectively. The one-week re-test reliabilities for Mathematics 11 and Progress in English 10 were both 0.93.
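The test-retest approach can be sketched as follows: the same test is administered twice and reliability is expressed as the Pearson correlation between the two sets of scores. The marks below are invented for illustration and are not taken from any NFER test.

```python
# A sketch of test-retest reliability: correlate scores from two
# administrations of the same test to the same people.

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Marks for seven test takers on two sittings of the same test (made up).
first_sitting = [42, 55, 61, 48, 70, 66, 53]
second_sitting = [45, 53, 63, 50, 68, 69, 51]
print(round(pearson(first_sitting, second_sitting), 3))
```

A coefficient near +1 indicates that test takers kept roughly the same rank order across the two sittings, which is what the quoted values of 0.81 to 0.93 reflect.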
Third, will test scripts receive the same marks from one rating to the next? When multiple choice tests are marked by machines, this aspect of reliability is relatively unproblematic. However, consistency of marking can become a major issue, particularly for large-scale tests involving many markers and requiring the evaluation of extended answers. Attempts are made to minimise inconsistencies by, for example: developing clear and thorough marking schemes; co-ordinating the marking of scripts through hierarchical training and monitoring systems; and systematically checking for clerical errors such as the incorrect addition of marks. Procedures like these are employed when national curriculum tests are marked. Evidence relating to marking reliability is typically investigated either with the same marker marking identical scripts at two points in time or with different markers marking identical scripts at the same time. In both instances, correlation coefficients are generally used to indicate the degree of marking reliability obtained.
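The second of these designs, different markers marking identical scripts, can be sketched in the same way: the correlation between the two markers’ mark totals indicates the degree of marking reliability. Both markers and all question-level marks below are invented for illustration.

```python
# A sketch of marker reliability: two markers mark the same six scripts
# and their mark totals are correlated.

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Question-level marks awarded to the same six scripts by each marker.
marker_a = [[3, 5, 2], [4, 6, 4], [2, 3, 1], [5, 6, 5], [1, 2, 2], [4, 4, 3]]
marker_b = [[3, 4, 2], [4, 6, 5], [2, 2, 1], [5, 6, 4], [2, 2, 2], [4, 5, 3]]

# Totalling per-script marks in code also guards against the kind of
# clerical addition error mentioned above.
totals_a = [sum(script) for script in marker_a]
totals_b = [sum(script) for script in marker_b]
print(round(pearson(totals_a, totals_b), 3))
```

The same-marker, two-occasions design would be computed identically, with the two lists of totals coming from the same marker at two points in time.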