On the relative weight of item types in a mixed-item test

© 2006 Paul Cooijmans

Tests containing a mixture of item types — such as verbal, spatial and numerical — tend to be the most satisfactory tests for several reasons which are not the topic of this article. A question that arises is what the relative weights of the various item types in such a test are. Often one sees equal numbers of items in the various categories, for instance 20 verbal, 20 spatial and 20 numerical. Or there is a balance between verbal and non-verbal, as in 40 verbal, 20 spatial and 20 numerical. This form of equality or balance is highly deceptive. To think the relative weights are proportional to the numbers of items in each category is the classical mistake.

In reality, the relative weights of the item types depend exclusively on the standard deviations of the raw scores on the various item types. If the verbal, spatial and numerical standard deviations are, say, 3, 4 and 6, then those are the relative weights, regardless of the number of items in each category. So in that case, the numerical item type has twice the weight of the verbal item type, regardless of the numbers of numerical and verbal items.
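
To make this concrete, here is a minimal simulation sketch in Python. The subscores are idealised and drawn independently, with the illustrative standard deviations of 3, 4 and 6 from above; none of the numbers come from a real test.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000

    # Idealised raw subscores with the SDs used in the text (3, 4 and 6),
    # drawn independently purely to keep the sketch short.
    verbal    = rng.normal(12, 3, n)
    spatial   = rng.normal(12, 4, n)
    numerical = rng.normal(10, 6, n)
    total = verbal + spatial + numerical

    for name, sub in (("verbal", verbal), ("spatial", spatial),
                      ("numerical", numerical)):
        # With independent subscores, cov(sub, total) = var(sub), so each
        # category's correlation with the total is proportional to its SD.
        r = np.corrcoef(sub, total)[0, 1]
        print(f"{name:9} SD = {sub.std():.1f}  r with total = {r:.2f}")

The correlations come out near 0.38, 0.51 and 0.77: the numerical category dominates the total in proportion to its standard deviation, although the three categories could have contained any numbers of items.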

This means the relative weights are not only a property of the test, but also of the answering behaviour of the testees, and they may differ across populations. In other words, it is impossible to construct a test with equal weight in each category, should one want that. A balance in the numbers of items per category is merely a symbolic gesture.
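
The population dependence can be illustrated with a similarly hedged sketch: the very same set of items yields a different subscore standard deviation, and therefore a different weight, in a narrower than in a wider population. The logistic answering model and all numbers are assumptions made only for the illustration.

    import numpy as np

    rng = np.random.default_rng(7)

    def subscore_sd(ability_sd, n_items=20, n_testees=100_000):
        """SD of the raw subscore on one fixed set of items, for a
        population with the given spread of the relevant ability."""
        ability = rng.normal(0, ability_sd, n_testees)
        p = 1 / (1 + np.exp(-ability))  # chance of a correct answer per item
        return rng.binomial(n_items, p).std()

    # Identical items, two hypothetical populations:
    print(round(subscore_sd(ability_sd=0.5), 1))  # narrow population, small SD
    print(round(subscore_sd(ability_sd=1.5), 1))  # wide population, large SD

In this toy model the wider population roughly doubles the subscore standard deviation, and with it the weight of the category, without a single item having changed.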

This is not a great problem, because small differences in weight per item type have no significant impact on the ranking of total scores. Big differences, such as a doubling, do change that ranking. If big differences exist, they may be corrected by artificially weighting the items in one or more of the categories (e.g., two points per correct answer instead of one), but the disadvantage thereof is that the resulting score no longer reflects the number of items solved correctly. This is a great loss, as a score that does reflect it is psychologically much more satisfactory. The loss is even greater when individual items are given different weights.
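
What the artificial correction amounts to can be shown in the same sketch form; the points-per-item factors below are not prescribed anywhere, they simply equalise the subscore standard deviations of the earlier example:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000

    # The same idealised subscores as before (SDs 3, 4 and 6).
    subs = {
        "verbal":    rng.normal(12, 3, n),
        "spatial":   rng.normal(12, 4, n),
        "numerical": rng.normal(10, 6, n),
    }

    # Scale every category up to the largest SD, so all three carry
    # equal weight in the corrected total.
    target = max(s.std() for s in subs.values())
    factors = {name: target / s.std() for name, s in subs.items()}
    print({name: round(f, 1) for name, f in factors.items()})
    # -> roughly {'verbal': 2.0, 'spatial': 1.5, 'numerical': 1.0}

    # The corrected total no longer counts items solved correctly:
    weighted_total = sum(f * subs[name] for name, f in factors.items())

A verbal item now scores about two points, and the weighted total is exactly the kind of score the paragraph above warns against: informative about rank, but no longer a count of correct answers.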

So if the relative weights of the item types are not naturally in balance, there is a dilemma between artificially correcting the balance (losing the advantage of a simple, informative raw score) and accepting the natural state of affairs (retaining such a score). The second is far preferable, and the bottom line therefore is that one should try to construct a test so that the natural balance between the item types is acceptable.