On the relative weight of item types in a mixed-item test

Weight is proportional not to the number of items but to the raw-score standard deviation

I.Q. tests containing a mixture of item types, such as verbal, spatial, and numerical, tend to be the most satisfactory tests, for reasons that are not the topic of this article. This raises the question of how much weight each item type carries in such a test. Often one sees equal numbers of items in the various categories, for instance 20 verbal, 20 spatial, and 20 numerical. Or there is a balance between verbal and non-verbal, as in 40 verbal, 20 spatial, and 20 numerical. This form of equality or balance is deceptive. It is a mistake to think that the relative weights are proportional to the numbers of items in each category.

In reality, the relative weights of the item types depend exclusively on the standard deviations of the raw scores on the various item types. If the verbal, spatial, and numerical standard deviations are, say, 3, 4, and 6, then those are the relative weights, regardless of the number of items in each category. So in that case, the numerical item type has twice the weight of the verbal item type, regardless of the numbers of numerical and verbal items.
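
One way to see why the standard deviations act as weights: each raw subscore equals its mean plus its standard deviation times the candidate's standardized score, so the total raw score is a sum of standardized subscores multiplied by the respective standard deviations. The following is a minimal sketch in Python of that idea, not a procedure taken from this article. The subscores and the function names relative_weights and total_from_standard_scores are hypothetical and serve only as illustration; both item types are assumed to have 20 items.

```python
from statistics import mean, pstdev

# Hypothetical raw subscores for five candidates (assumed 20 items per type).
subscores = {
    "verbal":  [13, 14, 15, 16, 17],
    "spatial": [10, 12, 14, 16, 18],
}

def relative_weights(subscores):
    """Return each item type's raw-score SD divided by the smallest SD."""
    sds = {name: pstdev(scores) for name, scores in subscores.items()}
    smallest = min(sds.values())
    return {name: sd / smallest for name, sd in sds.items()}

def total_from_standard_scores(subscores, candidate):
    """Rebuild one candidate's total as the sum of mean + SD * z per item type."""
    total = 0.0
    for scores in subscores.values():
        m, sd = mean(scores), pstdev(scores)
        z = (scores[candidate] - m) / sd  # standardized subscore
        total += m + sd * z               # the SD is the coefficient of z
    return total

weights = relative_weights(subscores)
print(weights)
print(total_from_standard_scores(subscores, 4))
```

In this made-up data the spatial scores are spread twice as widely as the verbal scores, so the spatial item type weighs twice as much even though both types have the same number of items.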

This means that the relative weights are a property not only of the test but also of the answering behaviour of the candidates, and they may differ across populations. In other words, it is impossible to construct a test with equal weight in each category for every population, should one want that. A balance in the numbers of items per category is merely a symbolic gesture.

This is not a great problem, because small differences in weight per item type have no significant impact on the ranking of total scores. Big differences, such as a doubling, do change that ranking. If big differences exist, they may be corrected by artificially weighting the items in one or more of the categories (e.g., two points per correct answer instead of one), but the disadvantage is that the resulting score no longer reflects the number of items solved correctly. This is a loss, because a score that does reflect the number of solved items is psychologically more satisfactory. The loss is even greater when individual items are given different weights.
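
The trade-off described above can be illustrated with a short Python sketch. The numbers of correct answers and the helper weighted_scores are hypothetical, chosen so that the verbal scores have half the spread of the spatial scores; this is an illustration under those assumptions, not a recommended scoring scheme.

```python
from statistics import pstdev

# Hypothetical numbers of correct answers for five candidates.
verbal_correct = [13, 14, 15, 16, 17]
spatial_correct = [10, 12, 14, 16, 18]

def weighted_scores(correct_counts, points_per_item):
    """Score each candidate at a fixed number of points per correct answer."""
    return [points_per_item * count for count in correct_counts]

# Two points per correct verbal answer instead of one.
verbal_weighted = weighted_scores(verbal_correct, 2)

sd_before = pstdev(verbal_correct)   # spread of the unweighted verbal scores
sd_after = pstdev(verbal_weighted)   # doubled by the two-point weighting

# For the first candidate, the weighted total no longer equals
# the number of items solved.
items_solved = verbal_correct[0] + spatial_correct[0]
weighted_total = verbal_weighted[0] + spatial_correct[0]

print(sd_before, sd_after, pstdev(spatial_correct))
print(items_solved, weighted_total)
```

Doubling the points per verbal item doubles the verbal standard deviation, which in this made-up data brings it level with the spatial one, but the first candidate's total then differs from the count of items actually solved.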

So if the relative weights of the item types are not naturally in balance, one must choose between artificially correcting the balance (losing the advantage of a simple, informative raw score) and accepting the natural state of affairs (retaining a simple, informative raw score). The second is preferable, and the bottom line therefore is that one should try to construct a test so that the natural balance between item types is acceptable.