Statistical and Non-Statistical Interpretation of Test Results

Author:

Samuel W. Fernberger, Ph.D.,

Assistant Professor of Psychology, University of Pennsylvania.

Some time ago, a survey was made to determine standards for children of the fifteen-year-old level at the University of Pennsylvania.* This age is a critical one for vocational guidance and the study is thus relatively important. From a preliminary study, it was decided to test three groups of children as follows: Group I, children applying to a Junior Employment Bureau for assistance in obtaining jobs; Group II, children who were working and also attending Continuation School; Group III, children in High School who were not working. A battery of tests was given each child, which included: shortened Binet, Witmer Cylinders, Dearborn Formboard, Monroe Reading, Courtis Arithmetic, and Woodworth and Wells’ Hard Directions Test. One hundred boys and one hundred girls were tested in each of Groups II and III. One hundred and thirty boys and seventy girls were tested in Group I.

One of the important results of this study was the discovery that the performances of the High School children (Group III) were considerably better than those of either of the other two groups. The performances of Group II were slightly better than those of Group I, but these two groups may be considered as parts of a larger group which contrasts sharply with Group III. In all three groups, the performances of the boys were slightly better than those of the girls. The High School children give higher intellectual and more intelligent performances than those in the two working groups.

It was thought advisable to determine the degree of statistical validity of these differences. For this purpose, the Binet I. Q. and the times of the first trial of the Dearborn Formboard were selected. The probable errors of the averages were calculated separately for boys and girls for all three groups. These results are given in Table I for the Binet I. Q. and in Table II for the Dearborn Formboard. It will be noted that the variations in the size of the averages are relatively great in both cases when one compares the values for Group III with those for Groups I and II. The probable errors, however, are quite large, indicating great variability within each group. The probable errors for the Dearborn Formboard results are relatively very much larger than those for the Binet I. Q.

Table I. I. Q.

  Group     Average
  I B         87.6
  I G         83.8
  II B        92.6
  II G        92.2
  III B      117.0
  III G      107.2

Table II. Dearborn First Trial (seconds).

  Group     Average     P. E.
  I B         194        72.75
  I G         227       110.68
  II B        178        93.75
  II G        197        75.96
  III B       150        65.14
  III G       170        63.42

Then the probable errors of the differences were calculated by the formula $\mathrm{P.E.}_d = \sqrt{\mathrm{P.E.}_a^2 + \mathrm{P.E.}_b^2}$. Then a value z was calculated by the formula $z = D / \mathrm{P.E.}_d$, i. e., the difference divided by the probable error of the difference. One then looks up in a special table* a value Pz corresponding to the calculated z. The Pz is the probability that the difference will not vary by an amount greater than itself. It is, therefore, a reliable “index of significance.” For mathematical certainty that the difference is significant, the value Pz must be 0.9999 or unity.

Table III. I. Q. Boys-Girls.

  Groups       D      P. E.d     Pz
  I-I         3.8     16.8     0.1286
  II-II       0.2     14.2     0.0161
  III-III     9.8     12.0     0.4198

Table IV. I. Q. Boys.

  Groups       D      P. E.d     Pz
  I-II        5.0     13.8     0.1919
  I-III      29.4     12.7     0.8824
  II-III     24.4     12.5     0.8116

Table V. I. Q. Girls.

  Groups       D      P. E.d
  I-II        8.4     16.4
  I-III      23.4     15.5
  II-III     15.0     13.6

Table VI. Dearborn. Boys-Girls.

  Groups       D      P. E.d     Pz
  I-I         33      133      0.1339
  II-II       19      121      0.0859
  III-III     20       91      0.1180

Table VII. Dearborn. Boys.

  Groups       D      P. E.d     Pz
  I-II        16      119      0.0699
  I-III       44       98      0.2385
  II-III      28      114      0.1301

Table VIII. Dearborn. Girls.

  Groups       D      P. E.d     Pz
  I-II        30      135      0.1180
  I-III       57      128      0.2385
  II-III      27       99      0.1445

These values are to be found in Tables III-VIII. Tables III-V contain the values for the Binet I. Q., while Tables VI-VIII contain those for the Dearborn Formboard. Tables III and VI compare the boys and girls in each group. Tables IV and VII compare the different groups for boys and Tables V and VIII compare the different groups for girls. Each table is similarly constructed. In the first columns are indicated the groups compared. In the second columns are given the differences, with the probable errors of the differences in the third columns. In the last columns are given the values of Pz, the index of significance.
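The calculation described above can be sketched in Python. This is a minimal sketch and not the author's own procedure: it assumes that the “special table” tabulates the normal probability integral in units of the probable error, so that Pz = erf(0.4769 z), and that z was rounded to two decimals before the table was consulted. The function names are hypothetical. The inputs are the Dearborn averages and probable errors printed in Table II for the boys of Groups I and III.

```python
import math

# Constant rho such that erf(rho) = 0.5; a deviation of one probable
# error is exceeded half the time under the normal law.
RHO = 0.4769362762


def pe_of_difference(pe_a, pe_b):
    """Probable error of a difference: sqrt(P.E.a^2 + P.E.b^2)."""
    return math.hypot(pe_a, pe_b)


def index_of_significance(diff, pe_d, round_z=True):
    """Return (z, Pz) with z = D / P.E.d.

    Pz = erf(RHO * z) is an ASSUMPTION about what the special table
    contains; rounding z to two decimals mimics a table lookup.
    """
    z = abs(diff) / pe_d
    if round_z:
        z = round(z, 2)
    return z, math.erf(RHO * z)


# Table II, Dearborn first trial, boys of Groups I and III.
avg = {"I": 194, "III": 150}
pe = {"I": 72.75, "III": 65.14}

diff = avg["I"] - avg["III"]
pe_d = pe_of_difference(pe["I"], pe["III"])
z, pz = index_of_significance(diff, pe_d)

print(f"D = {diff}, P.E.d = {pe_d:.0f}, z = {z:.2f}, Pz = {pz:.4f}")
```

Under these assumptions the sketch gives D = 44, a P. E.d of about 98 and a Pz of about 0.2385, which agrees with the row I-III of Table VII.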

In the consideration of the I. Q. for boys and girls (Table III) none of the differences are at all significant. That for Group III is the largest but this indicates less than a fifty per cent probability that the difference is not due to chance factors.

Comparing the I. Q. for boys (Table IV), the differences between Groups I and II are very insignificant. Group III shows a much greater significance; indeed, comparing Groups I and III there is better than a four to five chance that the difference is not due to chance. Comparing the groups of girls (Table V), none of the differences show a great degree of significance. That for Groups I and II is the lowest, however, and that for Groups I and III is the highest.

In the case of the Dearborn Formboard (Tables VI, VII and VIII), none of the differences are at all statistically significant; none in fact show a difference as significant as a one to four chance. The obvious reason for the low degree of significance of the differences is to be found in the size of the probable errors rather than in the size of the differences themselves. Take, for example, Groups I and III Boys for the Dearborn Formboard. The difference is 44 seconds, which is nearly one-third as large as the lower average, a relatively large difference indeed. But there is so much variability within each group that the probable errors are large, and the value of z = D/P. E.d is correspondingly small.

It was therefore thought advisable to try to correct the values so as to obtain greater statistical significance. The obvious method is to lower the time limits beyond which the performance is considered a failure. In the present experiment, a period of 10 minutes was allowed for the Dearborn Formboard and, if the task was not completed, it was scored a failure. Five failures were recorded for Group I; one failure for Group III. It is obviously the extreme cases which give the large variations and which give extremely large values when the variations are squared. Hence we cut off the upper “tails” of the distribution curves by arbitrarily calling all performances over 400 seconds failures. In this new procedure, we record 17 failures for Group I and six failures for Group III. The averages and probable errors for these new “corrected” values are to be found in Table IX. We have reprinted in this table the original “uncorrected” values for comparison.

Table IX.

  Group                         Average
  I                               194
  I - 12 cases (corrected)        162
  III                             150
  III - 5 cases (corrected)       135

Table X. Boys.

  Groups                 D      P. E.d     Pz
  I-III                 44       98      0.2385
  I-III (corrected)     27       74      0.1971
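The effect of lowering the failure limit can be illustrated with a small Python sketch. The times below are hypothetical and are not the study's data; they only show the mechanism. The spread is measured here as 0.6745 times the standard deviation (the probable error of a single observation under the normal law), which is an assumption, since the formula actually used for the tabulated probable errors is not stated. The function names are likewise hypothetical.

```python
from statistics import mean, pstdev

# HYPOTHETICAL completion times in seconds, chosen only for illustration.
times = [95, 110, 120, 135, 150, 160, 180, 210, 260, 430, 520, 560]


def summarize(values):
    """Return (average, 0.6745 * standard deviation) of the times."""
    return mean(values), 0.6745 * pstdev(values)


def drop_failures(values, limit):
    """Keep only the performances completed within the time limit."""
    return [v for v in values if v <= limit]


raw_avg, raw_pe = summarize(times)
kept = drop_failures(times, 400)  # 400-second limit, as in the text
cut_avg, cut_pe = summarize(kept)

print(f"all cases:      n = {len(times)}, average = {raw_avg:.1f}, spread = {raw_pe:.1f}")
print(f"400-sec. limit: n = {len(kept)}, average = {cut_avg:.1f}, spread = {cut_pe:.1f}")
```

Removing the few long times lowers the measure of spread, but it lowers the average as well, which is the pattern shown by the corrected rows of Table IX.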

The size of the probable errors is considerably reduced by the assumption of a shorter time limit for failure. But it will also be noticed that the elimination of the few longer values from the calculation of the averages also considerably reduces the size of the averages themselves. The subsequent calculations, described above, were carried out for the “corrected” values and the results are given in Table X. It will be noted that the probable error of the differences is considerably smaller for the corrected values. But the size of the difference is relatively even more decreased. Hence the index of significance of the difference is not so large as for the uncorrected results, being reduced from 0.2385 to 0.1971.

Such a procedure as we have applied to these results can only be applied to performance tests where a time record is kept. In the case of the Binet I. Q. the final result is a ratio and no extreme values can be properly eliminated.

All of this means that the differences calculated do not seem to be great enough to have mathematical significance. The two tests calculated were chosen because they seemed to show the greatest and most clean-cut differences. And, indeed, when one considers only the differences and disregards the probable errors, these differences are very marked. But the extreme variability within each group increases the size of the probable errors so that, mathematically, the significance of the difference is obscured. These results also indicate the extreme relative importance of the “tails” of the curves of distribution both for the size of the average and of the probable errors.

In the non-mathematical interpretation of the results, one fact gave added conviction to the significance of the differences; namely, that when one compared two groups, the differences for all of the tests in the battery were invariably in the same direction. Hence, for both boys and girls, the averages for Group II showed greater competency than for Group I and the averages for Group III showed much greater competency than for Group II in every test of the battery. And, in comparing the sex results, the boys were almost invariably, although only slightly, better in all the tests than the girls of the same group.
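The statement that the difference shrinks relatively more than its probable error under the 400-second limit can be checked from the figures of Table X alone. The following Python sketch is only an arithmetic check; the variable and function names are hypothetical.

```python
# Figures from Table X: Groups I and III, Dearborn Formboard,
# before and after the 400-second limit was imposed.
uncorrected = {"D": 44, "PE_d": 98}
corrected = {"D": 27, "PE_d": 74}


def z_value(row):
    """z = D / P.E.d, the difference in units of its probable error."""
    return row["D"] / row["PE_d"]


# Relative reduction of the difference and of its probable error.
drop_in_d = 1 - corrected["D"] / uncorrected["D"]
drop_in_pe = 1 - corrected["PE_d"] / uncorrected["PE_d"]

z_before = z_value(uncorrected)
z_after = z_value(corrected)

print(f"difference reduced by {drop_in_d:.0%}, probable error by {drop_in_pe:.0%}")
print(f"z before = {z_before:.2f}, z after = {z_after:.2f}")
```

Because the difference falls by a larger fraction than its probable error, z becomes smaller, and the index of significance falls with it.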

Furthermore, we do not see how one can expect mathematical significance in a mental test which is going to be of any value for purposes of diagnosis. It is upon the variability within the group that the examiner is able to make his diagnosis and prognosis. The greater the variability within the group, the more chance the examiner has to form the basis of his diagnosis, and hence, the better the test. Hence the ideal for mental tests is great variability within the group, which is utterly incompatible with mathematical significance when one compares the results of one group with another.

It seems possible to draw several conclusions from these results:

  1. Mental tests, as they are now developed, show such a degree of variability within a relatively homogeneous group that the differences between two groups do not have statistical significance, even though the differences may be great. In this connection, the “tails” of the curves of distribution assume a relatively great importance both with regard to the size of the averages and of the probable errors. Such a degree of variability is the thing desired in a mental test; with the reduction of the variability one has a correspondingly great reduction in the value of the test for the purposes of differentiating the members of the group.

  2. From this it would seem that, if tests are to have a diagnostic value, which means great variability within the group, we can never hope to obtain differences between groups which will have statistical significance.

  3. The modern tendency to over-statisticize test results seems, therefore, to be erroneous. It would seem better to treat the raw material with as little statisticizing as possible. In so doing one is, of course, forced to a non-statistical interpretation of the differences.

  4. This emphasizes the point of view that less weight is to be put on final test scores. From this viewpoint, mental tests become merely a standardized means of having the subject do something so that the trained examiner may observe his behavior and thus may arrive at a qualitative analytic diagnosis of the individual case.
