Centre for Assessment
Information about assessment
For a test to be acceptable, it must be fair to all sub-groups of the population for which it is to be used. Often, the purpose of a test is to discriminate between pupils/participants, for example between pupils working at different levels of the National Curriculum in specific subject areas. However, the test should not discriminate between sub-groups on any basis unrelated to the purpose of the test. In other words, there should be no unfair advantage to any sub-group based on attributes such as gender or ethnic group. A test is therefore unfair if it is not as accurate for one sub-group as it is for the rest of the test population, in that the validity is different for different groups.
However, a test is not necessarily unfair if members of one sub-group, e.g. boys, do less well than another, e.g. girls. At the present time, girls on average tend to score more highly than boys do in English reading comprehension tests. (The reasons for this performance difference are not addressed here.) If the average scores of the two groups are in line with expectations based on similar tests or other assessments, the test as a whole may be considered to be fair to both groups. However, a test item is considered to be unfair or biased if it is more likely to be answered correctly by one sub-group of individuals for reasons other than differences in the overall ability or attainment being assessed. For example, items set in a context that is more familiar or of more interest to one group than another, such as rugby, may be answered correctly by more boys than girls, although this could only be established by a statistical analysis of the functioning of the test.
In order to minimise the possibility of bias, particular care must therefore be exercised in the choice of items. Items with a likely gender bias or items requiring knowledge or understanding specific to a particular culture or ethnic group should be avoided. Also, where reading is not being assessed, for example in a test of mathematics or science, care should be taken to make the language used as accessible as possible for those test takers whose first language is not that of the test. Equality in terms of accessibility and level of demand is a particular issue where the test is being developed in more than one language, or where tests are being subsequently modified for particular groups of test takers (e.g. Braille versions).
Generally, items that are thought likely to favour a particular sub-group would not be included in the initial selection of items for a test. However, it is always possible that some items that do not exhibit any apparent bias may favour one or more sub-groups. Therefore, in order to ensure that tests developed by the NFER are as fair as possible, statistical analyses are carried out to detect any items that may be potentially biased. Depending on the test population, such analyses can be carried out across a number of different sub-samples: gender, ethnicity, language fluency, etc. Where the performance of a group of test takers on one item is significantly different from its performance on the test as a whole, the item is said to exhibit 'differential item functioning'. This can sometimes, although not always, be an indicator of item bias. Sometimes there is no apparent or obvious reason for the differential item functioning identified. For example, a context-free mathematics item would not obviously be expected to favour a sub-group disproportionately to its overall test performance. Depending on the magnitude of the difference between the groups, the size of the sample taking the test and the nature of the item, the test developers will use their judgement to decide whether to omit the item.
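The text above does not say which statistical procedure is used to detect differential item functioning. As an illustration only, the sketch below assumes the Mantel-Haenszel procedure, one widely used approach: test takers are matched on their overall score, and within each score band the odds of answering the item correctly are compared between a reference group and a focal group. The function name, the group labels and the record layout are hypothetical and chosen for this example.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio for one item across matched score bands.

    `records` is an iterable of (group, matching_score, correct) tuples,
    where group is "reference" or "focal" (hypothetical labels),
    matching_score is the overall score used to match test takers,
    and correct is True if the item was answered correctly.

    A value near 1 suggests the item behaves similarly for both groups
    once overall attainment is taken into account. Values well above 1
    favour the reference group; values well below 1 favour the focal group.
    Returns None when the ratio is undefined (zero denominator).
    """
    # Per score band: [ref correct, ref incorrect, focal correct, focal incorrect]
    bands = defaultdict(lambda: [0, 0, 0, 0])
    for group, matching_score, correct in records:
        offset = 0 if group == "reference" else 2
        bands[matching_score][offset + (0 if correct else 1)] += 1

    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in bands.values():
        total = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total

    if denominator == 0:
        return None
    return numerator / denominator


if __name__ == "__main__":
    # Invented illustrative data: one score band, eight test takers.
    sample = (
        [("reference", 10, True)] * 3 + [("reference", 10, False)] * 1
        + [("focal", 10, True)] * 1 + [("focal", 10, False)] * 3
    )
    print(mantel_haenszel_odds_ratio(sample))
```

A ratio far from 1 would only flag the item for review. As noted above, whether the item is omitted remains a matter of judgement that also takes account of the sample size and the nature of the item.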