The quality of a test is traditionally reflected in two statistics called validity and reliability. In short, validity consists of correlations between the test and the world outside it, while reliability consists of correlations within the test.
In addition, there is a quality of tests which I call "robustness", and which I find even more important than either validity or reliability, especially with regard to selection purposes. A brief explanation of the three statistics follows. More detailed explanations of many test statistics are in the section Statistics explained.
Robustness is a test's resistance to score inflation from any cause: practice effects, fraud, answer leakage, the increasing quality of research materials like the Internet, unauthorized publication, and so on. I have not had a good measure of robustness until now [2006], and have therefore not yet studied it in detail. An experimental measure of robustness, which I will use in the following years, is based on the correlation between the chronological ranks (or test dates) of test submissions and the raw scores of those submissions. This correlation denotes the extent to which the raw scores rise over time, so a value near zero points to a robust test and a clearly positive value points to inflation.
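A minimal sketch of how such a correlation could be computed, assuming each submission is stored as a pair of ISO-formatted date and raw score. The function names, the data layout, and the choice of a Pearson correlation on chronological ranks are illustrative assumptions, not a fixed definition of the measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

def score_drift(submissions):
    """Correlate chronological rank with raw score.

    submissions: list of (iso_date, raw_score) pairs (hypothetical layout).
    Submissions on the same date keep their input order; ties are not
    averaged in this sketch.
    """
    ordered = sorted(submissions, key=lambda s: s[0])
    ranks = list(range(1, len(ordered) + 1))
    scores = [score for _, score in ordered]
    return pearson(ranks, scores)

# Hypothetical data: raw scores that rise steadily with submission date.
rising = [("2006-03-01", 14), ("2006-01-01", 10), ("2006-02-01", 12)]
drift = score_drift(rising)
```

With real data the value would lie somewhere between -1 and 1, and only a sufficiently large number of submissions would make it meaningful.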
Especially when it comes to selection purposes, robustness is the single most important quality of a test. Ignoring this, that is, using non-robust tests in selection, leads to an inflation of "giftedness", "high intelligence", and membership in I.Q. societies, and weakens one's organization.
Validity is the answer to the question, "what does the test measure or predict, and how good is it at doing so?" This answer lies in two groups of correlations with concepts outside the test:
These correlations are the "fingerprint" of a test. From these one can infer what the test measures or predicts. The distinction between "measure" and "predict" lies in the intention of the test designer: a test's correlation with what it was intended to measure is called "construct validity", and its correlation with anything else is called "predictive validity" or "empirical validity". (Note: sometimes the term "predictive validity" is used for a test's ability to literally predict the future value of some real-world variable, but often the statistical term "predict" simply refers to a correlation with some variable, regardless of whether that variable lies in the past, present, or future.)
For instance, say an I.Q. test is designed to measure g, the general factor in mental testing. The I.Q. test's validity would be its correlation with g, also called g factor loading. The test's correlation with performance on a particular job would be its predictive validity with regard to performance on that job.
It is normal for psychological tests to have validity in areas different from what they were meant and designed for. This is so because these tests work largely "under the surface", in ways that are not transparent even to the constructor. The distinction between "measure" and "predict" therefore is not a true one.
"Face validity" is what a test appears to measure as judged by its outer appearance. The idea that one can see what a test measures by looking at it has been termed the "topographical fallacy". In other words, face validity is a euphemism for mistake. A test that shows its validity on the outside would be easy to manipulate by the candidate. Designing psychological tests has been described as "the art of fooling people twice".
A notorious example of mistaken face validity is the Pavlovian reflex of saying "cultural bias" when a test contains knowledge-related verbal items, or even any verbal items at all. A lesser-known example is the mistaken notion that a test with only visual-spatial items is "culture fair".
Validities of I.Q. tests are mostly higher than .7. Those for personality tests tend to be much lower.
When it comes to I.Q. tests, greater validity is not always better. Tests with very high validity, say over .9, are the most vulnerable to the rise of raw scores over the years (Flynn-type inflationary effects). This is true both for regular psychological tests and for high-range tests. Tests with validities between .7 and .9 are more robust.
Reliability is the answer to the question, "if it were possible to take this test again without any learning effect between the two test administrations, to what extent would I get the same score?" Reliability can also be understood as the correlation between two hypothetical very similar versions of a test ("parallel forms").
In practice, one estimates reliability by looking at internal correlations based on just one test administration (that is, one does not do actual retesting). The basic method is to split the test into halves, for instance the odd- and even-numbered items. The correlation between those reflects the reliability of half the test. One then applies a correction formula to extend this to the reliability of the full length of the test. There are also more complicated methods that estimate the mean of all possible split-half reliabilities.
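The odd-even procedure described above can be sketched as follows. The text does not name the correction formula; this sketch assumes the standard Spearman-Brown formula for doubling test length, r_full = 2 * r_half / (1 + r_half). The function names and the data layout (one list of item scores per candidate) are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

def spearman_brown(r_half):
    """Extend a half-test correlation to the full test length (assumed formula)."""
    if r_half <= -1:
        raise ValueError("correction is undefined for r_half <= -1")
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """Odd-even split-half reliability from a single test administration.

    item_scores: one list per candidate, one numeric score per item.
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))

# Hypothetical scores of four candidates on a four-item test (1 = correct).
scores = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
reliability = split_half_reliability(scores)
```

In this toy data the two halves correlate .5, which the correction raises to about .67 for the full test. The methods that average over all possible splits are not shown here.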
Given a certain quality of items, it can be said that reliability increases in proportion to the square root of the number of items.
The correlation between different sections of a test, such as verbal and non-verbal sections, does not properly reflect reliability, but rather reflects the test's internal consistency.
Reliabilities of full-scale I.Q. tests are mostly higher than .9 (subtests obviously have lower reliabilities, if only because they are shorter than the full-scale test). Reliabilities of personality tests tend to be much lower.
Robustness, validity, and reliability are affected by the sheer number of items of a test (apart from obviously depending on the quality of the items). The more items, the higher the robustness, validity, and reliability.
Greater diversity in item types of I.Q. tests tends to be good for robustness (less fraud as cheaters prefer one-sided tests) and validity (higher g loading).
Finally, I observe that statistics like these depend only partly on the test in question, the other source of influence being the group of candidates from which the scores come. For these computations always require the test to be taken first; they cannot be calculated without actual scores. And one group of candidates ("sample") may yield a different outcome for reliability than another group. My hypothesis is that with conscientious candidates one gets higher reliability and validity than with unconscientious candidates. This is self-evident if one understands the process. Implicit in this hypothesis is that, looking at individual candidates who take multiple tests, conscientious candidates will have their scores closer together (smaller intra-individual spread in I.Q.) than unconscientious candidates.
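One way the intra-individual spread mentioned above might be quantified is as the standard deviation of each candidate's I.Q. scores across tests. This is only a sketch: the choice of the sample standard deviation, the function name, and the example data are all assumptions made for illustration.

```python
from statistics import stdev

def intra_individual_spread(scores_by_candidate):
    """Sample standard deviation of each candidate's I.Q. scores.

    scores_by_candidate: dict mapping a candidate label to a list of I.Q.
    scores on different tests (hypothetical layout). Candidates with fewer
    than two scores are skipped, since no spread can be computed for them.
    """
    return {
        candidate: stdev(scores)
        for candidate, scores in scores_by_candidate.items()
        if len(scores) >= 2
    }

# Hypothetical scores: candidate A is consistent, candidate B is erratic.
spread = intra_individual_spread({
    "A": [130, 132, 134],
    "B": [120, 132, 144],
    "C": [140],
})
```

Under the hypothesis, candidates rated as more conscientious would tend to show smaller values, as candidate A does in this made-up data.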