Preliminary Impressions of the Stanford Revision of the Binet-Simon Scale

The Psychological Clinic. Copyright, 1918, Lightner Witmer, Editor. Vol. XII, No. 1. March 15, 1918.

By J. E. Wallace Wallin, Ph.D., Psycho-educational Clinic and Special Schools, St. Louis, Mo.

For a period of eight years we have utilized the Binet-Simon Scale in the clinical examination of thousands of juveniles and adults. We have used this scale because, in spite of all its imperfections, it has remained the best available single scale for the measurement of general intelligence. Although clearly recognizing their shortcomings, we have continued to use the first two American revisions, commonly referred to as the 1908 and the 1911 scales. We did not deem it advisable to attempt to use any of the later American versions, because of the limitations under which they were made,1 for purposes of practical mental classification or diagnosis, until the Stanford revision appeared.2 And we did not deem it expedient to use this modification until the original data on which it is based had been given to the public.3 This revision represents the greatest attempt thus far made to modify and extend the scale, to establish more reliable age standards and to standardize the method of administering the tests. We began to use this scale in the summer of 1917 in a clinic practicum, conducted at a state university, which was attended by graduate and under-graduate students. The impressions which we have formed regarding this version are based not only upon our constant use of it during somewhat less than a year but upon a very much longer use of the 1908 and 1911 Binet-Simon scales (American versions), and upon conclusions which we have reached respecting conditions which should be observed in successful individual mental examination of children. We propose in this article to give some of our preliminary impressions of the scale:

1 We are not now considering the Point Scale, which is constructed on a different principle from the Binet-Simon scale. Some of the claims advanced in favor of the Point Scale are amusing, to say the least.

2 Terman, Lewis M., The Measurement of Intelligence. Boston: Houghton, Mifflin Co., 1916.

3 Lewis M. Terman, et al., The Stanford Revision and Extension of the Binet-Simon Scale for Measuring Intelligence, 1917.

detailed analysis of results must await the accumulation of more extensive data. The justification for this preliminary discussion of the Stanford Revision should be obvious. The belief has become very general that an individual’s mental capacity can be accurately determined by means of intelligence tests. In practice we have for several years been diagnosing subjects as feebleminded primarily on the basis of their “intelligence age,” “years of retardation” or “intelligence quotient,” as determined by some form of the Binet-Simon scale. On the basis of such determinations1 subjects have been committed to institutions, assigned to special schools for mental defectives and diagnosed as incompetents, dependents or irresponsibles. It is evident, therefore, that if our measuring devices are faulty and our standards wrong, our psychological diagnoses will be wrong or at most only roughly correct. If the standards are too easy, the percentage of mental defectives found will be too small, while if the standards are too difficult, the percentage will be too large. The question is just as important from the educational as from the social or criminological point of view, for it is unwise to require backward children to spend their school days in classes which were established for feebleminded children, as is now done quite generally throughout the country.2 Moreover, later investigations have invariably shown that measuring scales of intelligence which have been said to be about as accurate as they can be made, have in fact proved to be quite inaccurate in certain regions. Nor do we, in point of fact, know as yet the degree of accuracy which we should demand of reliable intelligence scales. Because of limitations of space, we must restrict our discussion to three or four points.

1. Mastery of the administrative technique of the Stanford version requires far more study or training than the mastery of the original scale (1908 or 1911), due not only to the multiplication of the tests but to the elaborated and rigidly circumscribed conditions under which the tests must be given. The import of this statement can be aptly illustrated from the record of a class of university students, quite equal, we believe, to the average in ability,3 who pursued a course in mental tests during a six-weeks’ period. The class met during three one-and-a-half hour periods per week. Practically the whole course was devoted to the study of the Stanford Revision and the mastery of its technique. One child was tested

1 Of course, other relevant data are always considered by the critical workers.

2 Problems of Subnormality, 1917, pp. 46-96.

3 Three Ph.D., four M.A. and one M.D. candidates, and six undergraduates. Three of these attended merely as observers.

each period before the class by a student by means of this scale. At the conclusion of the examination the mistakes made by the examiner were systematically pointed out by the other members of the class and by the writer. In order to assist the student in the mastery of the testing technique, the writer prepared a “Condensed Guide” of about a dozen typewritten pages. In this Guide all the directions and conditions bearing on the administration of a given test, and the exact phraseology in which it should be presented, were concisely given in one place. Although they possessed “The Measurement of Intelligence,” every student in the class preferred to follow the Guide rather than the book when testing the pupils, owing to the convenient, condensed arrangement of the material in the Guide, as compared with the scattered presentation of the directions and the qualifying conditions in the book. In the book one must frequently read through paragraphs or even pages of comments or discussions to discover all the conditions which must be observed in giving a test. The students all agreed that the discursive treatment in the book made it more difficult for the learner to master the technique with the aid of the unabridged text.1 Under these favorable conditions and with these special aids we found that not a single student was proficient in the use of the scale at the end of the course. That is, no one was able to administer all the tests used in the examination of any given subject, in an errorless, free, confident, aggressive manner without the aid of the Guide.2 With the aid of the Guide, most of the errors made toward the close of the course were not very serious, but many of the tests were put too slowly. The number of errors made by five students whose records we have available were 18, 16, 16, 13 and 8, the two latter made in the second trials toward the close of the term. 
Some of these errors, of course, were minor ones, and probably did not materially affect the rating. The time that it takes to train students to give psychological tests correctly is well known to the experimental psychologist. When the student finds it difficult to learn to administer a single test correctly in the psychological laboratory, is it to be wondered at that the effective mastery of the Stanford scale, which contains ninety tests, cannot be secured except as the result of prolonged study and extensive practice in the actual testing of cases?

1 We laid the above facts before the publishers and the author of The Measurement of Intelligence and offered the Guide for publication. But permission to publish was refused on the expressed ground that the author “is entitled to the credit and any financial return for developing his modification of the Binet-Simon tests,” nor would the author give his endorsement. We also prepared an abbreviated record blank, now in use in our clinic, which serves every need of the practical examiner and which can be issued at a fraction of the cost of the ridiculously elaborate “Record Booklet,” but the publishers likewise declined to publish this blank, or to permit us to publish it, on the same ground. We understand that the War Department has since published an abbreviated guide and record blank.

2 Our record blanks, of course, furnished less aid than the elaborate “Booklet.”

The practical bearing of our conclusion should be obvious. The amateur will find it much more difficult to administer the Stanford scale than the old Binet scale, while anyone who merely tests subjects occasionally will certainly not be able to do satisfactory testing. He will not be able “to keep himself in fit condition.” He will be unable to test a subject without the use of the book.1 While the increased difficulty of its administration is not a valid objection to the value of the Stanford Scale itself, it emphasizes the need of the skilled, experienced examiner and renders less respectable than ever the work of the dilettante.

2. The assumption is implicit that the methods which have been adopted for administering the tests are the best methods to use. This assumption admits of challenge. An examination of the book often fails to reveal any convincing reason for giving a test in the form in which it is given, or for adopting the particular method of scoring that is used. In some cases it appears that the particular method of administration and scoring used was arbitrarily adopted in order to give an arbitrary result, namely, a correspondence between the median chronological age and the median mental age. The responses were re-scored “according to any desired standard,” in order to “fit a test more perfectly to the age level assigned it,” the standard of “fitness,” however, being a variable quantity for the different ages. The original data were subjected to three successive revisions and re-scorings. We have no doubt that it is possible to produce any desired result by such a process of optional accommodation. It is evident, however, that the method according to which the tests were to be given had necessarily been determined upon before the results could be re-scored, and that it is not possible to determine the best method of giving a test by a process of accommodation or alteration of the method of scoring. That can only be done, or at least can best be done, by actually testing large masses of children, that is, by making objective studies of the psychological reactions of children and determining from first-hand experience the best methods of giving a particular test. We do not know to what extent this method was used in the Stanford study. It appears that the examiners who actually tested the 989 cases2 up to age 14, which were examined by the Stanford revision, got the methods ready-made, and, we may add, were adequately trained on the ready-made methods so as to secure uniformity. The compiler of the revision appears not to have tested any of these cases, but only thirty-two high school pupils between 16 and 21. He had, however, previously tested 135 pupils in one group, and 90 pupils in a second group which are used as check series, but these pupils were given tests which “were practically identical with those in Whipple’s Manual” (save for certain exceptions which are noted). It is evident that the revision then attempted (in 1912) was based upon procedures of administration which do not coincide with the procedures used with many tests in the Stanford revision.

We are unable to find any facts justifying the inference that the changed methods of administering some of the Binet-Simon tests had first been “tried on,” and had been proved to be the most satisfactory methods. The changes apparently were based more or less arbitrarily upon theoretical considerations. Be that as it may, we do not think that sufficient study has yet been given to the question of the best method of giving the tests. Most of the investigators have largely confined their attention to the establishment of norms.

1 We have seen a psychologist laboriously thumbing through “The Measurement of Intelligence” while examining children. From the standpoint of an efficient mental examination, the whole performance was ridiculous. One can no more administer an intelligence scale satisfactorily without automatic command of the technique than one can perform surgical operations without automatic control of the mechanical operations.

2 The Stanford Revision and Extension of the Binet-Simon Scale for Measuring Intelligence, p. 9. On p. 164 the number is given as 992, and Terman’s name is here included among the examiners.

We doubt whether some of the tests should be administered according to the Stanford formulas, while we also question the propriety of even including some of the tests. Let us point out for a number of specific tests some of the questionable procedures and objectionable features, and suggest possible lines of improvement.1

IV, 1. There are no “middle and lower pairs of lines” on the cardboard supplied in the set of materials. Hence reverse cardboard and ask: “Which one is the longer (longest) now?” It is a question whether the words “long” or “big” should not be permitted in this test (in addition to “biggest” and “longest”), and the word “heavy” in V, 1, comparison of weights (“which is the heavy one?”). The word “heaviest” is used interchangeably with “heavier” in the latter test. Moreover, the introductory statements to the weight test could profitably be eliminated.

IV, 4. There is no need of giving a second or third trial if the child passes the first time, as the last two trials will not affect the rating. The same remark applies to VII, 6, third trial, when the first two are satisfactory. Nor is there any necessity of using a whole minute in the IX, 6, if three rhymes can be given in shorter time. It is sufficient to ask the subject to “Find as many words as you can that rhyme with ‘day,’ ” and close the test as soon as three rhymes have been given, with one minute as the maximum. It requires more time to administer the Stanford scale than the original Binet and the busy examiner cannot afford to waste time on procedures which do not affect the rating. The fact that it requires considerably more time to test a subject by the Stanford than

1 Some statements made in “The Measurement of Intelligence” concerning the individual tests are based on assumption and theory and probably would not be accepted by many workers who have had extensive experience with the tests in the actual examination of children.

by the old Binet-Simon scale is a valid objection to the Stanford scale only in so far as the experimenter is obliged to waste time on profitless procedures or procedures which do not affect the rating, and on tests permitting accidental failures. While all the “waste time” must be eliminated from the Stanford scale, if it is to be of practical service to the routine examiner whose time is limited, we view with suspicion the evident attempt to construct scales of tests that can be given in from five to ten minutes. The mind of man is so complicated and subtle, and frequently so unevenly developed, with special abilities and disabilities, that it cannot be adequately weighed, measured, rated or analyzed by a few tests which can be given in a few minutes.
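The stopping rule proposed above for IX, 6 (close the test as soon as three rhymes have been given, with one minute as the maximum) can be sketched in a few lines. This is only an illustration and not part of the Stanford procedure: the function name `rhyme_test`, the representation of each response as a pair (seconds elapsed, whether the examiner accepts the word as a rhyme), and the sample responses are all assumptions made for the example.

```python
def rhyme_test(responses, needed=3, limit_seconds=60):
    """Apply the proposed early-stopping rule to one administration of IX, 6.

    responses: (elapsed_seconds, accepted) pairs in the order spoken, where
    `accepted` is the examiner's own judgment of the word (hypothetical input).
    Returns (passed, seconds_used).
    """
    accepted_count = 0
    for elapsed, accepted in responses:
        if elapsed > limit_seconds:
            break  # anything said after the time limit is ignored
        if accepted:
            accepted_count += 1
            if accepted_count >= needed:
                return True, elapsed  # stop at once; later words cannot change the rating
    return False, limit_seconds


# Hypothetical subject: the third acceptable rhyme arrives at 22 seconds.
print(rhyme_test([(4, True), (9, False), (15, True), (22, True), (40, True)]))
# Hypothetical subject: only two acceptable rhymes fall within the minute.
print(rhyme_test([(10, True), (50, True), (70, True)]))
```

In the first call the test would close at 22 seconds instead of running the full minute, which is the saving of time argued for above.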

V, 2. The directions are to give the colors in the order red, yellow, blue, green. The colors are not arranged in this order on the cardboard, as they should have been. The error will at least affect the rate of the response. Purely accidental misses penalize the subject, in this test, as a second trial is not permitted. We have tested a number of subjects whose misses the first time were due to momentary amnesia, or excitement or too great impulsiveness. They gave the colors correctly the second time. These subjects had to be scored minus, although they knew the colors. There are other tests, to which we shall refer later, in which accidental failures occur because of the arbitrary conditions under which the tests are given. The argument that the tests are properly administered because when they are given as directed the median chronological and mental ages coincide is not satisfying.

We may illustrate the influence of accidental failures by the discrepancies in the age rating of subjects tested twice within a short period of time by means of the Stanford Scale. The same experimenter tested the same subject on both occasions.

Case 1 graded 6.50 years on June 26th, and 6.66 on July 5th. The additional test passed the second time was VII, 5, differences. Case 2 tested 5.83 on June 27th and 6.33 on July 20th, a difference of half a year in about three weeks. The first time he failed on VI, 1, right and left, VI, 2, omissions, VI, 4, comprehension, and VIII, 4, similarities, all of which he passed the second time. The second time he failed on VI, 5, coins, which he passed the first time. Case 3 tested 7.16 on July 18th and 8.66 on July 24th, advancing a year and a half, or from an I. Q. of .56 to .67, in less than a week. The tests which were failed the first time but passed the second time were: VII, 2, descriptions; VII, 3, five digits; VIII, 3, comprehension; VIII, 6, vocabulary; IX, 2, weights; IX, 3, change; and X, 4, reading.
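The size of these retest discrepancies can be checked by simple arithmetic on the ratings quoted above. The following is a minimal sketch: the helper names `gain_in_months` and `quotient` are illustrative, and the chronological age of Case 3 is not reported in the text, so the value used here is back-calculated from the first rating and the first quotient and should be read as an assumption.

```python
# Ratings (mental age in years) as quoted in the text: (first test, second test).
ratings = {
    "Case 1": (6.50, 6.66),
    "Case 2": (5.83, 6.33),
    "Case 3": (7.16, 8.66),
}


def gain_in_months(first, second):
    """Change in rated mental age, rounded to the nearest whole month."""
    return round((second - first) * 12)


def quotient(mental_age, chronological_age):
    """Intelligence quotient written as a ratio, in the style of the text (.56, .67)."""
    return mental_age / chronological_age


gains = {case: gain_in_months(a, b) for case, (a, b) in ratings.items()}

# ASSUMPTION: Case 3's chronological age is not given in the text; it is
# inferred here from the first rating (7.16) and the first quotient (.56).
implied_age = 7.16 / 0.56
second_quotient = quotient(8.66, implied_age)

for case, months in gains.items():
    print(case, "gained", months, "months of mental age")
print("Case 3, quotient implied by the second rating:", round(second_quotient, 3))
```

On these figures the second quotient for Case 3 comes out close to the .67 reported, so the two ratings and the two quotients agree with one another.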

We expect to find large variations like these in the intelligence level of epileptics,1 but when they are found in the simple types of mentally defective or backward children, we have reason to suspect that unusual alterations in the physiological condition or state of responsiveness of the individual have occurred, or that the testing has been inefficiently done, or that the scale is not maximally reliable, permitting of accidental failures. There were no observable changes in the physical or emotional states of these subjects. We are inclined to ascribe the difference in rating chiefly to the fact that some of the tests are so constructed as to put a premium upon accidental failures.

1 Some epileptics whom we have tested have varied four or five mental ages during a few months. Such variations, however, are exceptional.

V, 3. Children of foreign parents who do not quite understand the words “prettier” or “prettiest” have scored plus on a second trial after they have gotten the idea of the test. Children with poor vision have also scored plus on a second trial. Some impulsive subjects have scored failures under the conditions of the test, although they certainly were able to make correct distinctions. It is a question whether these results should be scored minus, in conformity with the text, which does not permit corrections or even uncertainty.

V, 5. The test permits no sign of approval, even through facial expression, when the cards are properly placed. The result is that when the child is asked to try it again he is led to believe that he did it wrong the first time. Naturally he acts upon this suggestion and tries some other method of approximation, gets confused, and fails. We have examined many children who did the test almost immediately the first time, but required five times as much time the second time or failed entirely, because they were designedly trying some other method. When told: “Why, you did it right the first time,” many of these children did the test at once during the third trial. These children had to be scored minus, according to the artificial conditions of the test. One cannot expect young children to function at their maximum by withholding legitimate approbation or by arousing apprehensions. Those who have tested college students know how readily they succumb to positive or negative suggestions. We must not demand of children what we do not obtain from youths. Either the test should be given according to the original Binet formula (one success passes), or the subject should be told when he has done it correctly once: “Good!” or “All right!” “Now, see if you can do it again.”

VI, 2. We infer that the “single question” in b, c, and d cannot be repeated. If repeated, some subjects will pass the test who would otherwise fail. The test allows chance misses, affecting particularly the impulsive and easily distracted types, even when they possess fair intelligence. Would it not be well to allow a second trial on the failed picture? A test of intelligence increases in reliability as the possibilities of chance errors decrease.

VI, 4, and VIII, 3. Some children are not familiar with the formula “What’s the thing to do?” This formula must “not under any circumstances be altered.” Alternative, equivalent questions should be permitted: “What ought you to do?” or “What should you do?” Exception may be taken to the scoring given in the text to some answers. Some subjects, particularly children of foreigners who do not understand English well, will score plus if given a second trial, especially when the form of the question is changed as above indicated. Thus the first answer of an Italian boy to VIII, 3, a, was: “To her mother.” But on repeating the question: “Buy another.” To VI, 4, b, an 11-year old boy first said, “Fire.” On repeating the question, he said, “The firemen.” But when asked, “But what should you do?” he said: “Ring up the firemen.” To VIII, 3, a, he first replied, “To me.” But on being asked, “What should you do?” he replied: “Pay.” A 12-year old girl replied to VI, 4, c, “Can’t go.” But when asked, “But what should you do then?” replied, “Wait for another.” According to the arbitrary rules of the test these subjects scored failures, and yet when the questions were stated so as to be comprehended they were properly answered. We could cite many similar instances in which, according to the conditions of the test, we have been obliged to rate the child by his snap-shot reply rather than by his reflective answer. According to the test the question may be repeated in the original form only, and then only “if there is no response, or if the child looks puzzled.”

VII, 1. Since the scoring is entered for the right and left hand, it would seem logical to ask, “How many fingers have you on the right hand?” Then, “On the left hand?” However, we prefer the following procedure: “How many fingers on this hand?” “On this hand?” “On both hands?” The experimenter momentarily holds up his right hand, then his left hand, and then both hands. The tendency to count is lessened by this method. Some pupils make accidental errors which they immediately correct if given a second chance. Thus a girl who said she had five fingers on each hand and five on both hands answered “ten” immediately when asked, “On both hands?” One boy who responded rapidly, “5,” “10,” “10,” when asked, “How many have you on each hand?” immediately replied, “Five.” One girl who responded similarly replied correctly when asked, “How many fingers have you on that hand?” (pointing) “And on that hand?” All of these failed on the test, although all knew how many fingers they had. The test penalizes purely accidental failures. It permits repetition of the question only when the child has begun to count.

III, 3; VII, 2; and XII, 7. The instructions are incomplete, owing to the fact that a three-fold procedure is authorized. The text does not make clear when the question is to be asked as in III, VII, or XII, but leaves this to the judgment of the tester. If the question was asked as in III, producing enumerative responses,1 and the subject later proved to have a VII- or VIII-year intelligence, should the pictures be presented again, using the VII-year formula? According to the text, the test has been ruined for VII if it has been given as in III. Accordingly we cannot determine in a case like the above whether the subject is or is not able to pass the test. We have found VIII- and IX-year old subjects who responded as in VII when the VII-year formula was used but as in XII when the XII-year formula was later used. Should the XII-year formula also be given to all VIII- and IX-year olds? Moreover, why not start directly with the XII-year formula with higher grade pupils, instead of wasting time on the VII-year formula? It is awkward, to say the least, to give the VII-year formula, and then immediately use the XII-year formula with the same pictures. The requirement of three enumerations in III should only be expected when the question is asked, “Tell me everything you can see in this picture.” The rule in VII, that the reply must be chiefly descriptive, is impractical. Many pupils only enumerate or describe the outstanding characteristic of each picture. In some pictures some subjects chiefly enumerate, in others they chiefly describe. According to the directions, pupils who give one enumeration and one description for each picture would grade III. We have many records in which the pupil enumerates two or three objects, but gives only one description. They describe the most important action. Such a descriptive response is worth more than several enumerations, and should be scored VII. The original Binet rule did not require more than one descriptive response. This basis of rating is to be preferred.

VII, 3. The question “Was it right?” may make some subjects more cautious, but it tends to excite others. It might be eliminated. It is, of course, necessary to secure the child’s close attention to each series. The following formula will usually accomplish this without arousing apprehension: “Now, then, see if you can say these numbers. Listen carefully.” The same remarks apply to all the memory-span tests.

VII, 4. Some children fail on this test merely because they have never been taught to tie a bowknot, while others fail because of the unnaturalness of the situation (tying the string around a stick). Some girls who have failed on the test have instantly tied a bowknot with a hair ribbon in place, while some boys who have failed have had no difficulty in tying the knot in their shoestring. In our judgment this test should be discarded.

VII, 5. Many pupils who fail on this test will pass if asked: “You know what a fly is?” “And a butterfly?” “Are they the same?” (“No.”) “Well, then, what is the difference between them?” This form of approach was wisely permitted by some of the earlier versions, in order properly to prepare the child to apprehend the concept of difference. Of course, many foreign-language children who now fail will succeed on the test as given above.

VIII, 1. Many VIII- and IX-year old pupils respond with a vacant stare to this test. Some of the foreign-born children (Italians) have no idea what is intended by the word “path.” With other children, the trouble is that the test does not adequately prepare the child’s imagination to receive the suggestion that the circle is a field. To do this, the test should begin with the statement: “Suppose this is a round field with a growth of short grass.” (Or high grass. The test is silent on what is in the field.) The following statements serve no purpose; they are time-consuming and should be eliminated: “You don’t know what direction it came from, how it got there, or with what force it came.” The scoring of this test presents considerable difficulty. The grading of the sample responses supplied on the score card is not satisfactory, if the object is the finding of the ball. This may be illustrated by the criticisms made of the score card by our class of university students.

1 On page 146 of The Measurement of Intelligence the author states that “enumeration” rarely occurs “before 9 or 10.”

K. (the initials refer to the subjects): No. 12, mark failure; 7, place lower down; 10, place in superior class, “It is perfectly natural to walk thus in looking for a ball”; 5 and 10, exchange.

L. S.: No. 2 should not be in superior class compared with 10, 14, or 15; 5 is better than 4, “because the path begins at the gate”; if 1 to 5 are to be retained in superior class, would rank them as 1, 3, 5, 4, and 2; 12, mark failure, or place below 14, 15, 19, and 20; a better order would be 10, 7, 14, 15, 19, 20, 6, 8 and 11.

A. S.: 7, 8, and 10 are equal to 2; 12 is not a satisfactory search; 17, “though lacking in plan covers the field satisfactorily”; 22 perhaps equal to 17; “if the object is to find the ball,” 21 would do it.

M. S.: 7 and 10, and 8 very nearly, fulfil the requirements of “No intersections or breaks,” while 2 does not. A “person would naturally walk irregularly as in 10, or turn square corners as in 7.”

B. D.: Mark 12 failure; 2 and 10 might be interchanged.

The above judgments illustrate the disagreements which one finds in practice with the theoretical rating given on the score card. Of course, designs are frequently received which do not resemble any of the patterns on the score card, and it is frequently difficult to determine whether they should be rated as VIII, XII or zero. We have found subjects who draw a line outside of the field (many do this), but who, when pressed to tell what they would do, say they would walk around and hunt until they find the ball. This test may well be eliminated. It impresses us as “freakish,” and it so impresses many subjects.

VIII, 4. Some pupils respond properly to this test if asked, “How are they the same?” But this form of question is not permitted.

VIII, 5. When the child has merely given a functional definition in response to the stereotyped question of the text, we have frequently received a descriptive definition by asking the further question: “Yes, but what is a ____?” The following are illustrative answers to the text question (1) and the additional question (2) given above: football: (1) “you can kick”; (2) “a ball you can kick”; balloon: (1) “round”; (2) “a round thing that goes up in the air”; soldier: (1) “he helps to fight for us”; (2) “a man who fights.” It is not clear just what value a test has as a test of intelligence which is so administered as to allow fortuitous failures. By a slight twist of the question or the proper placing of an accent, the response can often be transformed from a failure to a success.

VIII, 6. The illustrative scoring given in the text shows that different degrees of perfection are exacted for different words. Since the scale contains tests which are designed to appraise the quality of the subject’s definitions, the only object of this test should be to determine whether the child knows what the word means. It should be immaterial whether he indicates this by an illustration or by a good or poor definition. The test permits the use of articles with nouns. Does it permit the use of prepositions with verbs? Judging by the illustrations given, No. But this is not clearly so stated. The test is too long (it seems to have been shortened one-half, according to a recent record booklet).1 The selection of chance words is objectionable. It would be better to follow the plan of the Ayres spelling list in the selection of words. The vocabulary formula with this selection of accidental words does not, we fancy, give a reliable index of the size of a pupil’s vocabulary.

IX, 1. May the following supplementary or alternate questions be used? (a) “What day is it today?” or, “What day is this?” (b) “What month is it now?” or, “What month is this?” (c) “What date is it today?” “January what?” (d) “What year is it now?” Some children pass these questions who fail on those given in the text, while others pass on a second trial who fail the first time. The text merely allows a second trial when the day of the month and the week are interchanged.

IX, 2. The question is too lengthy. The following part adds nothing and tends to confuse the subject: “Find the heaviest one and put it here, the next heaviest here, and lighter, lighter, until you have the very lightest here.” Omit also the introduction: “See these boxes. They all look alike, don’t they? But they are not alike.”

The square cubes with the reversed numbers on one face are inferior to the metal pill boxes with the weights indicated by initials on the bottom, owing to the time required to find the numbers on the cubes. The initials on the pill boxes can be found at once by turning up the bottom. Moreover, if the child examines the reverse numbers and arranges the boxes according to the size of the numbers he will score plus. Many subjects fail who misplace only one box. The factor of accidental error operates in this test. We have elsewhere presented facts indicating that this test might well be differently administered.2

IX, 5. The form of the test does not sufficiently emphasize that all of the words are to be used in one sentence, or the same sentence. The combination desert, rivers and lakes is more difficult than the other two, and many fail because of this fact.

1 The assumption, of course, is that the difficulty of the words in the two lists supplied is equal. That this assumption may be without warrant is indicated by the number of words defined in each list by the following subjects:

Subject:   A    B    C    D    E    F     G     H
List I:    10   18   9    17   7    18    23    20
List II:   6    10   10   15   5    14.5  13.5  17

The differences are sometimes considerable.

2 The Individual Tests in the Binet-Simon Scale. I. Psychological Clinic, 1917, 79f.

X, 2. The matter-of-fact, colorless, smileless manner in which this test must be given renders the administration forced and unnatural and does not tend to stimulate the child to put forth his best effort.

X, 3. In spite of the aid given from the elaborate instructions and the score card, the marking of the reproductions is not always as easy as might be inferred from the text. Not all of our students fully agreed with the ratings of the samples on the score card. We have also found a lack of agreement regarding the marking of some of the sample answers given in tests VIII, 3, and X, 5.

X, 4. The units in the rating are of unequal value. Are words pronounced for the subject counted as errors? We question the advisability of the rigid two-errors standard. Children who have made three, four or more errors have reproduced from ten to fifteen memories.

XI. The increase in the number and value of the tests in XII is not an adequate substitute for the lack of tests in XI, even though by a process of artificial scoring we may cause 11-year-olds to score XI. In testing children with a base of, say, VIII years, it frequently is not worth while giving the XII-year tests, all of which are too difficult for age 11, according to the tabulated data, while it would be quite worth while giving tests which are not too difficult for 11-year-old children. Artificial sur-scores will not take the place of tests actually adapted to a given age during the preadolescent period. We regard this lacuna as a serious defect in the scale, causing all children to grade too low who cannot do any of the XII-year tests but who could do easier XI-year tests.

XII, 4. It is not clear why the instructions may be repeated and a new trial allowed when the subject adds new words, but not when he leaves out words. Many impulsive subjects who make minor mistakes succeed within the time limit when told: “Look carefully and try it again.” Their failures are accidental. They have the capacity to do the test when made to deliberate, but must be marked minus.1

XII, 5. This test requires more time than it is worth. It should be abbreviated or eliminated.

We have had less experience with the tests above XII. Some, however, are too cumbersome to administer or require too much time.

To recapitulate: Not sufficient attention has yet been devoted to the determination of the best method of formulating the tests; the technique of the tests is not yet instrumentally perfect. In some tests there is a discrepancy between the text and the test materials; in some tests time must be wasted on details which do not affect the rating; in some the instructions are unnecessarily long and involved; in a few the directions are too meagre, or not sufficiently specific (however, the conditions are more minutely set forth for most of the tests than in any earlier version); in some, the conditions are impractical; alternate questions should be permitted and should be provided; in several tests a premium is put upon accidental failures; in some the procedure is calculated needlessly to mislead or confuse the subject; in some the experimenter finds it difficult to give the test with natural expressiveness, so as to encourage the subject to exert his best efforts; and some tests should be discarded. It is evident that to introduce desirable changes in the procedure would necessitate the establishment of new standards.

1 Of course, we are rating our subjects according to the prescribed formulas, whether or not we consider them proper.

3. Previous analyses have shown that the “scattering” is extensive in both the 1908 and 1911 scales but that, from the standpoint of scattering, the 1911 is inferior to the 1908 edition.1 We are not yet ready to analyze our Stanford records, but it is evident that the scattering is very extensive in this scale, apparently more extensive than in the 1908 and 1911 editions. The practical significance of this fact is a problem for future study. The large amount of scattering in this scale is partly due, we believe, to the rigid conditions under which some tests are given, which results in accidental failures, to the rigid base requirements (all tests plus) and to the misplacement of some of the tests. We expect to make detailed analyses of our data in future.

4. The most fundamental question, after all is said and done, is whether the Stanford version actually grades American subjects more accurately than the earlier American versions. It has been generally conceded that the 1908 and the 1911 editions were too difficult in the higher ages (say above X), and that they were too easy in the lowest two or three ages. On the other hand, it has been felt by the most cautious workers that the age standards have been approximately correct between VI and IX. Our previous analyses2 have shown that the 1908 scale gives a higher rating than the 1911, that the smallest difference is in the lowest ages (especially), from III to V, and the middle ages, from VII to IX, that the largest difference is in the highest ages, from X to XIII, and that the 1911 scale is probably more accurate in the lowest ages, and possibly in the middle ages, while the 1908 is more accurate in the higher ages.

1 The Phenomenon of Scattering in the Binet-Simon Scale. Psychological Clinic, 1917, p. 179ff. Wide Range versus Narrow Range Binet-Simon Testing. Journal of Delinquency, 1917, p. 315ff. A Further Comparison of Scattering and of the Mental Rating by the 1908 and 1911 Binet-Simon Scales. Journal of Delinquency, 1918, p. 12f.

2 A Further Comparison of Scattering and of the Mental Rating by the 1908 and 1911 Binet-Simon Scales. Journal of Delinquency, 1918.

The claim is put forth that the Stanford scale is accurate in all ages, because the tests have been so placed, so administered and so scored that the median mental ages are identical with the corresponding median chronological ages. Whether with the particular tests and the particular processes of scoring used, and the particular groups of children tested, we have obtained age standards which are maximally accurate and generally applicable to American children, is a question to which a positive reply cannot as yet be given. Our own use of the scale has been largely restricted to the ages from III to XII. For practically all of these ages in which we have comparative scores for subjects who have been tested by the 1908, 1911 and Stanford scales, the Stanford rating is appreciably lower. We may cite a few cases which include all the mental ages (Stanford) from IV to XI, except VIII, IX, and X.

A, 8.16 yrs. Stanford, 4.83 yrs. 1908, 5.8 and 1911, 5.6 yrs. Difference (between the Stanford and 1908 and 1911; given in this order throughout), 1. yr., and .8 yr. Seguin, 5.3 or 4. The first age rating for the Seguin formboard is according to the average scores of three investigators and the second rating according to the writer’s norms, as given in table XLIX, of Psycho-Motor Norms for Practical Diagnosis, 1916.

B, 8.41 yrs. Stanford 4.83 yrs. from the lower base (III) and 5.33 yrs. from the upper base (V). 1908, 5.4 and 1911, 5.6 yrs. Difference, based on the lower base, .6 and .8 yr.; or, based on the upper base, .1 and .3 (Stanford lower). Seguin, 5.5 or 4.8.

C, 11.25 yrs. Stanford, 5.66 yrs. 1908, 6.6 and 1911, 6.2 (or 6. yrs. from lower base). Difference, 1. and .6 yrs. (or .4 from lower base). Seguin, 8. or 7.2.

D, 10.16 yrs. Stanford, 5.66. 1908, 6.8 and 1911, 6.6. Difference, 1.2 and 1. yr. Seguin, 5.5 or 4.6.

E, 9.75 yrs. Stanford, 6.16. 1908, 7.6 and 1911, 7.2 (or 7. from lower base). Difference, 1.5 and 1. yr. (or .9). Seguin, 10.5 or 9.

F, examined May, 1916, age 7.66. 1908, 6.6, and 1911, 5.6. Re-examined, January, 1918, age 9. Stanford, 6.33. 1908, 7.2 (7.6 from lower base) and 1911, 6.8. Difference, .9 and .5. Seguin, 9.2 or 7.5.

G, 9.33. Stanford, 6.66. 1908, 8. and 1911, 7.8 (7.6 from lower base). Difference, 1.4 and 1.2 (or 1.). Seguin, 9.8 or 8.

H, 8.41. Stanford, 6.6. 1908, 7.6 and 1911, 7. Difference, 1. and .4 yr. Seguin, 5.5 or 5.

I, examined October, 1916, age 11.5. 1908, 7.4. Re-examined, January, 1918, age 12.83. Stanford, 7.75. 1908 and 1911, 7.8. Difference, .1 yr. Seguin, 9.5 or 8.

J, 14.66. Stanford, 11. 1908, 11.8 and 1911, 11.6 (11. from lower base). Difference, .8 and .6. Seguin, meaningless because of spastic paralysis.

The Stanford rating is lower in every one of these cases, and in every age. It is decidedly lower than the 1908 rating. In only one or two cases is the difference insignificant. The difference between the 1908 and the Stanford rating ranges from .8 to 1.5 of a year in 8 cases, and between the 1911 and Stanford rating from 0.5 to 1.2 of a year in seven cases.

While the diagnosis would not differ for some of these cases no matter which rating was followed, in spite of the large age differences, in other cases the child would be adjudged feebleminded according to the Stanford age, while he could only be diagnosed as borderline or backward or “deferred” according to the 1908 or 1911 scale. Thus subject E would be “definitely feebleminded” by the Stanford scale (I. Q., 63), while according to the 1908 (I. Q., 77) and 1911 (I. Q., 73) he would be borderline or backward, or “possibly feebleminded.” Subject G would be borderline and probably feebleminded by the Stanford scale (I. Q., 71) but only “dull” by the 1908 (I. Q., 85) and 1911 (I. Q., 83). It is perfectly evident that the criteria of diagnosis of “The Measurement of Intelligence” have no value with any scale of intelligence except the Stanford, and their value with this scale is still “sub judice.”1
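The I. Q. figures quoted for subjects E and G can be checked from the ages listed in the case records above. The following is only a sketch: it takes the I. Q. to be 100 times the mental age divided by the chronological age, and it assumes that the fractional part is dropped rather than rounded, an assumption adopted here solely because it reproduces the quoted figures. The names `ratio_iq`, `cases` and `iq_table` are illustrative and do not come from the text.

```python
import math


def ratio_iq(mental_age, chronological_age):
    """Ratio I.Q.: 100 * mental age / chronological age.

    Assumption: the fraction is dropped (floor), not rounded.
    """
    return math.floor(100 * mental_age / chronological_age)


# Chronological age and mental ages (in years) as listed for subjects E and G.
cases = {
    "E": {"age": 9.75, "Stanford": 6.16, "1908": 7.6, "1911": 7.2},
    "G": {"age": 9.33, "Stanford": 6.66, "1908": 8.0, "1911": 7.8},
}

# Build one I.Q. per subject and scale.
iq_table = {
    subject: {
        scale: ratio_iq(record[scale], record["age"])
        for scale in ("Stanford", "1908", "1911")
    }
    for subject, record in cases.items()
}

for subject, row in iq_table.items():
    print(subject, row)
```

Under this assumption the sketch yields 63, 77 and 73 for E and 71, 85 and 83 for G, in agreement with the text. Rounding to the nearest whole number would leave the Stanford quotients unchanged but raise each of the 1908 and 1911 quotients by one point.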

Which age rating of the above cases is the most accurate? Taking the Seguin formboard performance as a criterion, and using as criteria the average scores of three investigators, the 1908 and 1911 results gave a more accurate measure in six cases, the 1908 in one, and the Stanford in two. Based on the writer’s formboard norms, the 1908 and the 1911 ratings were more accurate in five cases and the Stanford in four. From the standpoint of this single criterion, the old scales have a slight advantage.

We may further evaluate the mental ages by the pedagogical status of the children. In estimating the pedagogical status it is of importance to rate the child according to the grade of work he is capable of doing successfully and not according to the grade in which he is classified. When so rated (by the teachers and principals on our Forms 13-A and 13-H; the detailed reports cannot here be given), the Stanford mental rating seems to be more accurate for two pupils, the 1908 for one, and the 1911 for two, while the results are approximately the same for one and uncertain for four pupils. The pedagogical data for the latter four were not furnished with sufficient accuracy to enable us to reach a satisfactory conclusion.

It is evident that our analyses do not justify the categorical assertion that the Stanford scale is much superior to the other scales (in the ages we have considered) because its age standards are more accurate (our results, however, do show that they are more difficult). Certainly an analysis of the Stanford data in table I2 does not justify the conclusion that the age standards are absolutely correct, for only in three ages does the number of children tested equal or exceed 100, while in age 4 only 17 were tested and in age 5 only 54. According to the table on p. 164, only 10 pupils were tested in age 3. In spite of all that has been said upon the subject, we are not ready to accept implicitly the dictum that norms based on such limited numbers are entirely trustworthy and that it would be a pure waste of time to increase the number of examinees.

1 No demonstrable proof has been furnished of the correctness of the standards of feeblemindedness there given. We do not subscribe to the sweeping statement that a Stanford I. Q. “below 70” invariably indicates “definite feeblemindedness.”

2 The Stanford Revision and Extension of the Binet-Simon Scale for Measuring Intelligence, p. 9.

The more we have used the scale the more we have felt that the standards in IV and V (and possibly in III) are too difficult, while we suspect that those standards above age VI which we have been considering are somewhat too difficult, due, among other things, as we have intimated, to the administrative technique in some tests, which permits chance failures, and to the absence of tests in age XI.

That the examiner must be wary of implicitly following the Stanford age norms and standards appears from the following consideration. Many young children who would have to be called “definitely feebleminded” by the rigid Stanford standards have graded exactly like children whom we examined years ago by the old scales, who were diagnosed at that time as backward or borderline and who have subsequently developed beyond the feebleminded status.1 These young children graded from a year and a half to two and a half years retarded. By the Stanford scale (judging by our records) they perhaps would have averaged another year backward, and would thus have tested indubitably feebleminded according to the conventional standards.

Altogether, we are inclined to feel that the Stanford revision marks some advance, but we feel that it would be unfortunate if the scale were to be regarded as a finality. Intelligence testing is still in its early stages. We have now had a considerable number of private revisions, to which we owe much. But the limitations of such revisions are inevitable. We believe that the next revision should be undertaken on an extensive scale by a subsidized bureau or by a psychological association. One of the incidental fruits of such a revision would be that it would serve to check the profiteering tendency in connection with the exploitation and sale of testing manuals, record booklets, and materials.

The best form of scale for the measurement of intelligence is still an open question. Many of the arguments for the point scale are specious.

1 Others, on the other hand, have lacked the potentiality for development and have eventually proved to be feebleminded.
