Norming of high-range I.Q. tests

Introduction

High-range tests aim to measure high levels of mental ability, and are used for admission to higher-I.Q. societies, for entertainment, and for research into high intelligence, in particular to find out whether, up to which point, and to what extent the general factor g is present and measurable in the high range. The goals of high-range tests are formulated more precisely at I.Q. Tests for the High Range.

Overview

Once a high-range test has been created and submissions are coming in, two basic norming methods come into view:

1 - Norming based on the candidates' known scores on other tests (called "equation" or "anchoring");
2 - Direct norming based on the actual scores on the object test.

A more complex method is also possible:

3 - Direct norming based on the norms resulting from method 1, by combining several tests.

For method 1, about 30 submissions with at least 70 usable scores on other tests are needed; when relatively few scores on other tests are known, 30 submissions will not suffice. Method 2 requires about 200 submissions, and method 3 requires at least 200 submissions in total, spread over a number of tests that have already been normed with method 1. These methods are described in some detail below; problems that are known to occur in the process of norming are reported in Issues in the norming of high-range tests.

Method 1

1.1 - Each known score on another test forms a score pair with the corresponding raw score on the object test;
1.2 - The tests to be used for norming are selected;
1.3 - The scores from the other tests are converted to the same scale;
1.4 - The distributions of the "other" and the raw scores are equated to obtain norms.

Step 1.1

This is self-evident, but a quick example to be certain: if a candidate has two other scores, say 133 on Test X and 140 on Test Y, and has a raw score of 27, two score pairs are formed, to wit 27-(Test X 133) and 27-(Test Y 140). So the number of score pairs is not the same as the number of submissions/candidates. In practice it may not be necessary to write these pairs out like this; a good working method is to keep a form or record for each submission, containing that candidate's raw score and that candidate's other scores.
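As an illustration of the record-per-submission approach, here is a minimal Python sketch. The names Submission and score_pairs are hypothetical, chosen for this example only; they are not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One record per submission: raw score plus known scores on other tests."""
    candidate: str
    raw_score: int
    other_scores: dict = field(default_factory=dict)  # test name -> score


def score_pairs(submissions):
    """Return one (raw score, test name, other score) tuple per known other score."""
    return [
        (s.raw_score, test, score)
        for s in submissions
        for test, score in s.other_scores.items()
    ]


submissions = [
    Submission("A", 27, {"Test X": 133, "Test Y": 140}),
    Submission("B", 19),  # no other scores known: contributes no pairs
]
pairs = score_pairs(submissions)
print(len(submissions), "submissions,", len(pairs), "score pairs")
```

The candidate from the example above yields two pairs, while a candidate without known other scores yields none, which is why the count of pairs differs from the count of submissions.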

Step 1.2

In selecting tests to be used, one should be as objective and consistent as possible and avoid arbitrary decisions, especially decisions based on subjective notions. E.g., "this is a bad test so I had better not use it" is not the right approach. Neither is "this candidate probably did not do their best on this particular other test, so I had better not use it". One should not judge, but let objective statistics prevail.

A few selection methods:

1.2.1 - Use all pairs;
1.2.2 - Use all pairs, except those from self-scored tests and free online automatically scored tests;
1.2.3 - Use all scores from one or more particular tests for which there is solid, reliable data, like the test constructor's own tests or college admission tests, and exclude all other pairs;
1.2.4 - Use all pairs from the tests that correlate highest with the object test; set the correlation threshold so that enough pairs remain for norming;
1.2.5 - Use all pairs from all the tests with positive correlations with the object test, but with weights attached to the tests that are based on the correlations of the tests with the object test.

I myself have, after experimentation, ended up with 1.2.4 as the basic method. 1.2.5 is also excellent, but more complex and labour-intensive.

Step 1.3

The I.Q. scale most used in the high-I.Q. world is that with a total population mean of 100 and standard deviation of 15. The "other" scores to be used have to be converted to this scale, if not already expressed on it. So one needs to know which scale is employed by every particular test. This is a matter of collecting information. For college admission tests, which do not give I.Q.'s, one may need to study available statistics and derive centiles therefrom, and convert those to I.Q.'s using the normal distribution. For tests that only report general population centiles, those can be converted to I.Q.'s in the same way.
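The conversions described above can be sketched in Python with the standard library's NormalDist. The function names are hypothetical, and the example of a scale with standard deviation 16 is an assumption made purely for illustration.

```python
from statistics import NormalDist

# Target scale: total population mean 100, standard deviation 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)


def centile_to_iq(centile):
    """Convert a general population centile (0 < centile < 100) to I.Q."""
    if not 0 < centile < 100:
        raise ValueError("centile must lie strictly between 0 and 100")
    return IQ_SCALE.inv_cdf(centile / 100)


def rescale_iq(score, source_sd, source_mean=100):
    """Convert a deviation I.Q. from a scale with another S.D. to S.D. 15."""
    return 100 + (score - source_mean) * 15 / source_sd


print(round(centile_to_iq(98), 1))    # a 98th centile score expressed as I.Q.
print(rescale_iq(132, source_sd=16))  # a score of 132 on an S.D. 16 scale
```

The rescaling function only applies to scores that are already deviation I.Q.'s on a known scale; scores reported as centiles go through the normal distribution instead.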

For conversion of childhood ratio scores (mental age divided by biological age), one may use "Sare's Prediction". Such ratio scores cannot be used as they stand, because the distribution of ratio I.Q.'s in the high range deviates sharply from normal.

A remark: note that (adult, deviation) I.Q.'s are really centiles in disguise. They are linked to centiles by the normal distribution. Use of the actual mean and standard deviation of college admission tests will not result in proper I.Q.'s. Actual distributions of scores rarely have a true normal curve, especially not beyond about 1.5 or 2 standard deviations above or below the mean.

Step 1.4

To equate distributions of raw and prior scores, two basic methods exist:

1.4.1 - Equating ranks;
1.4.2 - Equating z-scores (= means and standard deviations).

Step 1.4.1

Rank equation is explained at its page on Statistics explained.

Step 1.4.2

Equating z-scores always and necessarily results in a linear norm table and therefore is only recommended when there is a reason to assume that the object test scores are in linear relation to I.Q. This is the case when the object test has internal item weighting, where items of different difficulty get different weights based on their statistical behaviour (as in, for example, Kevin Langdon's LSFIT and LIGHT). It is also the case when the object test has been constructed carefully to contain equal numbers of items of each level of difficulty (this is only possible by selecting them statistically from a larger pool of items for which certain statistics are already known; but even then the problems may behave differently in their new environment).

As in rank equation, two (conceptual, hypothetical) columns are formed with the paired raw and "other" scores. Weighting of score pairs is possible in the same way as described for rank equation. For each column, mean and standard deviation (S.D.) are computed.

To obtain norms, the mean raw score is set equal to the mean I.Q. (or whichever scale one uses for the "other" tests). For each raw score point above or below that mean, I.Q.S.D./R.S.D. (I.Q. S.D. divided by Raw S.D.) is added to or subtracted from the mean I.Q. For practicality, norms may be expressed in a formula like:

I.Q. = RawScore × (I.Q.S.D./R.S.D.) + [I.Q. for RawScore = 0]
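A minimal Python sketch of this z-score equation follows, assuming unweighted score pairs held in two parallel lists; the function name and the data are invented for illustration. The intercept is the "I.Q. for RawScore = 0" term of the formula.

```python
from statistics import mean, pstdev


def linear_norm(raw_scores, iq_scores):
    """Return (slope, intercept) so that I.Q. = raw * slope + intercept.

    Both columns have the same length, so the S.D. ratio is the same
    whether population or sample standard deviations are used.
    """
    slope = pstdev(iq_scores) / pstdev(raw_scores)  # I.Q.S.D. / R.S.D.
    intercept = mean(iq_scores) - slope * mean(raw_scores)  # I.Q. at raw 0
    return slope, intercept


# Invented paired columns: raw scores and "other" scores on the I.Q. scale.
raw = [10, 20, 30, 40]
iq = [130, 140, 145, 165]

slope, intercept = linear_norm(raw, iq)
norm_table = {r: round(r * slope + intercept, 1) for r in range(10, 41, 5)}
print(norm_table)
```

By construction the mean raw score maps onto the mean I.Q., and every additional raw score point adds the same I.Q. increment, which is what makes the resulting norm table linear.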

As a rule of thumb, extrapolating outside the raw score distribution becomes uncertain when done over a distance greater than half the square root of the actual range of the scores. Also, norms tend to become uncertain at the "edges" of the test, which as a rule of thumb may be defined as the top and bottom half square root of the total possible score range (the total possible score range is the difference between the maximum and minimum possible scores, regardless of whether those have been reached).
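To make these rules of thumb concrete, here is a small Python sketch. It reads "half the square root" as the square root of the range divided by two, which is an interpretation; the function names and numbers are hypothetical.

```python
from math import sqrt


def extrapolation_margin(observed_scores):
    """Distance beyond the observed scores over which extrapolation is still reasonable."""
    actual_range = max(observed_scores) - min(observed_scores)
    return sqrt(actual_range) / 2


def edge_width(min_possible, max_possible):
    """Width of the uncertain zone at the top and at the bottom of the test."""
    return sqrt(max_possible - min_possible) / 2


observed = [8, 15, 20, 31, 44]     # actual range 36
margin = extrapolation_margin(observed)
width = edge_width(0, 64)          # total possible range 64
print(margin, width)
```

With these invented numbers, norms would be extrapolated no further than 3 points beyond the observed scores of 8 and 44, and the raw score zones 0 to 4 and 60 to 64 would count as the "edges" of the test.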

Method 2

The simplest way of direct norming is to compute the "proportions outscored" for each raw score based on the actual test submissions. This is explained at its page at Statistics explained.

As proportions do not have a linear character they are intuitively unsatisfactory. To arrive at a (more or less) linear scale, two methods are conceivable:

2.1 - Convert the proportions to standard scores via the normal distribution ("normalization");
2.2 - Use the actual z-scores of the raw score distribution (the z-score of a particular raw score is its signed distance from the mean, expressed in S.D.; for instance -1.75 when the score is 1.75 S.D. below the mean).

Step 2.1

Following the assumption that if a distribution is normal, its underlying scores can be taken as a linear scale, converting the proportions to standard scores via the normal distribution should result in a linear scale of high-range intelligence. The mean and S.D. of the scale can be defined in any desired way, for example with a mean of 50 and S.D. of 10 (t-scores). There are arguments against normalization and against the assumption of linearity under a normal distribution, but that discussion falls outside the scope of this article.
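Normalization can be sketched in Python as below. Note the assumption: the proportion outscored is computed here as the fraction of submissions scoring lower plus half of those with the same score, a common mid-rank convention that keeps proportions away from 0 and 1; this is not necessarily the definition used at Statistics explained. The function names and data are invented.

```python
from collections import Counter
from statistics import NormalDist


def proportions_outscored(raw_scores):
    """Assumed mid-rank convention: (number below + half the ties) / total."""
    n = len(raw_scores)
    counts = Counter(raw_scores)
    below = 0
    result = {}
    for score in sorted(counts):
        result[score] = (below + counts[score] / 2) / n
        below += counts[score]
    return result


def normalized_scores(raw_scores, scale_mean=50, scale_sd=10):
    """Map each raw score to a standard score via the normal distribution."""
    scale = NormalDist(scale_mean, scale_sd)
    return {
        score: scale.inv_cdf(p)
        for score, p in proportions_outscored(raw_scores).items()
    }


raw = [1, 2, 2, 3]  # invented submissions
props = proportions_outscored(raw)
t_scores = normalized_scores(raw)  # mean 50, S.D. 10 (t-scores)
print(props)
```

Changing scale_mean and scale_sd gives any other desired scale, in line with the remark that the mean and S.D. can be defined freely.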

Step 2.2

To base a standard scale directly on the actual raw score mean and S.D. seems only wise if the test has been constructed purposely to yield linear scores as meant in step 1.4.2. Here too, the mean and S.D. of the resulting scale can be defined in any desired way.
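A short Python sketch of this direct z-score scale, with hypothetical function names and invented data; the choice of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Signed distance of each distinct raw score from the mean, in S.D."""
    m, sd = mean(raw_scores), pstdev(raw_scores)
    return {score: (score - m) / sd for score in sorted(set(raw_scores))}


def standard_scale(raw_scores, scale_mean=50, scale_sd=10):
    """Express the z-scores on a scale with the desired mean and S.D."""
    return {
        score: scale_mean + scale_sd * z
        for score, z in z_scores(raw_scores).items()
    }


raw = [10, 20, 30, 40, 50]  # invented submissions
z = z_scores(raw)
scaled = standard_scale(raw)
print(scaled)
```

Unlike normalization, this transformation keeps the shape of the raw score distribution unchanged; it only shifts and stretches it.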

Method 3

To enable scores on tests normed with method 1 to be expressed as high-range proportions outscored as meant in Method 2, the scores from a number of tests already normed with method 1 can be combined to arrive at a sample size sufficient to make a direct norming as in method 2. The I.Q. norms (or whichever type of norms one used) from the original normings are treated as "raw scores" in the new method 3 norming, arriving at a test-independent high-range norming (that is, a norming that does not rely on the particular sample of candidates that happened to take one particular test).

If the scores from the already normed tests are combined like that, a problem is that some candidates may occur more than once in the sample. This may or may not be seen as a serious problem.
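Pooling the normed scores and checking for repeated candidates might look like the Python sketch below. The data layout (a mapping from test name to candidate scores) and the function names are assumptions for illustration; whether and how to handle repeated candidates is left open, as in the text.

```python
from collections import Counter


def pool_normed_scores(tests):
    """Flatten {test name: {candidate: normed score}} into one list of records."""
    return [
        (candidate, test, score)
        for test, scores in tests.items()
        for candidate, score in scores.items()
    ]


def repeated_candidates(pooled):
    """Candidates occurring more than once in the pooled sample, with counts."""
    counts = Counter(candidate for candidate, _, _ in pooled)
    return {c: n for c, n in counts.items() if n > 1}


# Invented norms from two tests already normed with method 1.
tests = {
    "Test P": {"A": 150, "B": 142},
    "Test Q": {"A": 155, "C": 160},
}
pooled = pool_normed_scores(tests)
repeats = repeated_candidates(pooled)
print(len(pooled), "pooled scores; repeated candidates:", repeats)
```

The pooled scores then serve as the "raw scores" of a method 2 norming once the total reaches the required sample size.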