The first problem one meets when trying to norm a high-range test is simply that it is hard to get enough people to take a very difficult test. Only a small fraction of the population is willing to do that, and the beginning test designer tends to underestimate both the difficulty of his test and the limitations of his candidates. Experience shows that the harder a test is, the fewer people will take it. "Hardness" here means the percentage of the items on a test that candidates on average fail to solve. Therefore, relatively easy items must be included alongside very hard ones to get enough submissions for norming.
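The definition of hardness given above can be written out as a small calculation. This is only a sketch under the assumption that every item counts for one point and that raw scores are the number of items solved; the function name `hardness` is hypothetical.

```python
def hardness(raw_scores, n_items):
    """Percentage of items that candidates on average fail to solve.

    Assumes one point per item, so a raw score is the number of
    items solved (an assumption made for this illustration).
    """
    if not raw_scores or n_items <= 0:
        raise ValueError("need at least one score and a positive item count")
    mean_score = sum(raw_scores) / len(raw_scores)
    return 100.0 * (1.0 - mean_score / n_items)
```

A test of 40 items on which candidates solve 20 on average would thus have a hardness of 50 percent.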
Next is the phenomenon of candidates not doing their best, e.g. not using reference aids although these are allowed, or taking too little time, and thus scoring below their true level. This occurs especially when the test is free, and also when there is nothing to gain by scoring high (like society membership). Also, when candidates take the test mainly to aid in norming, rather than out of a desire to score high, they tend to underperform. A norming based on such submissions will be (far) too generous. Therefore it seems advisable to charge a fee, and to offer some kind of "reward" to high scorers. And it is better to wait patiently for truly interested candidates than to actively recruit a "norming sample". The best norming is one that is based on submissions from people who did their best (regardless of how high they thus scored). Such submissions typically drip in slowly over the years.
Then there is the problem that candidates, when reporting prior scores for norming, tend to withhold their lower ones. In fact, when people have taken many tests before, it is rare to find them honestly reporting all those scores. This selective reporting, too, has an inflationary effect on any resulting norming. The best solution is to keep records of all candidates' scores, and consult those whenever a candidate appears in a norming sample. Over the years, more and more candidates will emerge of whom all scores are known, so that this type of inflation is prevented. It also helps to select the tests to be used for norming by their correlation with the object test; tests that suffer the most from score withholding tend to have lower correlations and are thus kept from exercising their boosting effect on the norms.
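The selection of tests by their correlation with the object test might be sketched as follows. This is a minimal illustration, not a description of any actual norming procedure: the function names, the data layout (pairs of prior score and object-test score per candidate) and the cutoff of 0.5 are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)

def select_tests(paired_scores, min_r=0.5):
    """Keep only the prior tests that correlate at least min_r with the object test.

    paired_scores maps a test name to a list of
    (prior_score, object_test_score) pairs, one pair per candidate.
    min_r is a hypothetical cutoff chosen for illustration only.
    """
    selected = {}
    for name, pairs in paired_scores.items():
        prior = [p for p, _ in pairs]
        obj = [o for _, o in pairs]
        r = pearson(prior, obj)
        if r >= min_r:
            selected[name] = r
    return selected
```

In practice the number of score pairs per test will often be small, so the correlations themselves are uncertain and any cutoff should be treated with caution.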
Something to be aware of is the difference in approach between high-range tests and regular psychological tests. High-range tests basically have one set of norms for all who take them, regardless of nationality, sex or age. They are "absolute".
But regular tests, like WAIS, have different sets of norms for different age groups, and are normed on the local (national) population. A WAIS I.Q. of 100 is not the same for a 20-year-old as for a 40-year-old, and not the same for an American as for a Frenchman or Korean. It's an entirely different concept; the norms are relative to a narrow population segment. On top of that, individual psychologists may deviate from the test manual in score reporting. Using prior scores from regular tests is therefore a bit like throwing dice to determine the norms. Again, selecting tests by their correlation helps, as regular tests often have lower correlations with a high-range test than do other high-range tests. When it comes to standardization of I.Q. at high levels, high-range tests are superior to regular tests.
One more thing one encounters with high-range tests is the discrepancy between male and female scores. In the past this was rarely mentioned, as sex differences were more or less taboo, but more recently societies have emerged - e.g. East Coast Mega, UltraHIQ, Grail and Mega HIQ Girls - that have separate norms per sex (some prefer the term "gender", but that is best used only to refer to one's position on a masculine/feminine scale, while "sex" is the appropriate term for the biological male/female distinction).
It is known that males and females on average have about the same level of general intelligence or g, but that males have greater variance, so that at the low and high ends one finds more males than females. At or above the 98th percentile there are almost twice as many males as females, and at the 99.9th centile about fifteen times as many. On an I.Q. scale with a standard deviation (σ) of 15, the male σ may really be about 16, and the female σ about 12. This is a fundamental biological difference that has its cause mainly in the first months of pregnancy, when the brain is formed. The different testosterone levels in males and females regulate the formation and lateralization of the brain in different ways (regardless of genetic disposition). A male and female with (apart from their sex chromosomes) identical D.N.A. would still have very different brains. From puberty onward, there is a second period wherein the sex hormones influence the male and female brains in different ways.
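How a difference in σ translates into a male/female ratio above a given I.Q. can be explored with a short calculation. The sketch below rests on strong assumptions that are not claims of the text: both sexes are taken to be exactly normally distributed with the same mean of 100 and to be equally numerous, and the function name `tail_ratio` is hypothetical. The outcome is very sensitive to the σ values put in, so it need not reproduce the ratios quoted above.

```python
from statistics import NormalDist

def tail_ratio(cutoff_iq, mean=100.0, male_sd=16.0, female_sd=12.0):
    """Ratio of males to females at or above cutoff_iq.

    Assumes (for illustration only) two normal distributions with the
    same mean and equal numbers of males and females. The default
    standard deviations are the approximate values mentioned in the text.
    """
    male_tail = 1.0 - NormalDist(mean, male_sd).cdf(cutoff_iq)
    female_tail = 1.0 - NormalDist(mean, female_sd).cdf(cutoff_iq)
    return male_tail / female_tail
```

Calling the function with a series of rising cutoffs shows the general pattern: the ratio is 1 at the mean and grows faster and faster toward the high end.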
This confronts those who deal with high-range tests with a few dilemmas: should pass levels of societies be set within-sex or not? Should possible group centiles directly based on high-range candidates be normed within-sex or not? Should perhaps even I.Q.'s be normed within-sex? So far, only within-sex high-range centiles have been put into practice without problems. Within-sex pass levels are still an experiment of a small number of groups. Within-sex I.Q.'s seem unwise as they would destroy the group-independent nature of the concept of I.Q. on high-range tests. In regular psychology though, there are tests that have some built-in compensation for sex differences, either in the construction of the test or in the norms.
Perhaps the most important question is: do high-range tests measure g, the general factor in mental tests? Or does g break down into its factors at high levels, so that high-range tests measure various kinds of group factors or specificity rather than g? To answer this one needs to obtain a correlation matrix between a number of high-range tests of different content types, and perform some kind of factor analysis. This is a process that will take years, because it requires that each test in the matrix be taken by each individual in a group of testees.
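Once such a correlation matrix exists, a first rough look at how much a single common factor accounts for might be obtained as sketched below. This is not a full factor analysis: it extracts only the first principal component of the matrix by power iteration, which is a simpler technique sometimes used as a first approximation. The function name `first_component` and the iteration count are assumptions made for the example.

```python
from math import sqrt

def first_component(corr, iterations=200):
    """First principal component of a correlation matrix by power iteration.

    corr is a square, symmetric list of lists with ones on the diagonal.
    Returns (eigenvalue, loadings); eigenvalue / len(corr) is the share
    of total variance carried by this component.
    """
    n = len(corr)
    if n == 0 or any(len(row) != n for row in corr):
        raise ValueError("corr must be a non-empty square matrix")
    vector = [1.0 / sqrt(n)] * n  # start from an equal-weight unit vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in product))
        if norm == 0.0:
            raise ValueError("matrix maps the start vector to zero")
        vector = [x / norm for x in product]
        eigenvalue = norm  # converges to the largest eigenvalue
    loadings = [sqrt(eigenvalue) * x for x in vector]
    return eigenvalue, loadings
```

If the first component carries a large share of the variance and all tests load on it substantially, that is at least consistent with a general factor; whether it really is g at high levels is a question a proper factor analysis on adequate data would have to settle.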
Finally there is the question of the exact distribution of I.Q. scores in the high range. There are good reasons to assume a more or less "normal" distribution between plus and minus about two standard deviations from the mean (I.Q. 70 to 130), but above that the distribution is unknown, as regular psychological tests end there, and also because there exists no absolute (ratio) scale for mental ability. To study the distribution in the high range, such a scale will be needed.