IQ tests hurt kids, schools -- and don't measure intelligence

1991: As I settle into my seat in the back of the classroom, I can’t take my eyes off the perfect girl. She is the lead in every play, the soloist in every choir performance, and the winner of every writing award. Quite simply, she is the pride and joy of every teacher at the school. She also happens to be beautiful, and I am infatuated. I decide I’m going to talk to her after class. It’s sixth grade and I’m back in the public school system. A fresh start. A new, improved and, I hope, suaver me.

“Is Scott Kaufman here?” the teacher asks. My trance is interrupted. Without hesitation I raise my hand. “Can you come sit up front please?” she requests. Confused, I pick up my backpack and move down, inching closer and closer to the perfect girl, who is sitting in the front row. As I get closer, my heart starts beating faster. Why am I being asked to move to the front? What if I have to sit next to her? What would I say? Walk smooth, Scott. Smooth. I start to slow down. I put on a big, confident smile. Finally I reach my destination. The desk right next to hers.

She is writing in her notebook. Probably composing the next great sonata. I try to look cool. I nod my head a lot. I think that’s a cool thing to do. The teacher seems impressed with my coolness, as she is smiling. She kneels down beside me and within earshot of the perfect girl, whispers, “Scott, your Mom requested that you sit at the front of the classroom since you have a serious learning disability. Thanks for changing seats.”

The room starts to spin. Did the perfect girl hear? She must have heard. Humiliated, I sink down in my chair. I no longer feel cool. I feel trapped. It seems that no matter what I want to achieve, I am imprisoned by my label.

* * *

As early as the nineteenth century in Europe, case reports of children with learning disabilities in reading, writing, and arithmetic cropped up. Here’s a description in 1896 from the physician W. Pringle Morgan of a 14-year-old named Percy F.: “I might add that the boy is bright and of average intelligence in conversation. . . . The schoolmaster who has taught him for some years says that he would be the smartest lad in school if the instruction were entirely oral.”

The history of learning disabilities is a tale of multiple conceptualizations, spanning several continents. In the United States, physician Samuel Orton studied children with reading disabilities who had at least average IQ scores. Orton conceptualized language and motor disabilities as brain dysfunction in spite of normal or even above average intelligence. He believed that to adequately diagnose learning disabilities, it was important to combine a variety of sources of information, including IQ test scores, achievement test scores, family histories, and school histories. For those who then warranted the learning disability diagnosis, Orton believed the proper intervention consisted of directly targeting the specific area of weakness and using the child’s “spared” abilities to help remediate the disability.

In Germany, the neurologist Kurt Goldstein studied the deficits of soldiers who sustained head injuries. His focus was on their deficits in visual perception and attention. Goldstein’s student Alfred Strauss applied this approach to adolescents with learning difficulties. Together with educator Laura Lehtinen, Strauss developed remediation techniques that involved providing students with a distraction-free environment and training perceptual deficits. They merely inferred brain damage, though. They didn’t actually peer inside the head.

The Goldstein-Strauss approach was widespread in the 1950s and 1960s. Thousands of children were identified as having “minimal brain dysfunction” by the use of a checklist, which included things such as academic difficulty, aggression, and “acting-out.” If a student exhibited 9 out of 37 possible symptoms, they received treatment, which typically meant they spent hours a day doing perceptual tasks such as connecting dots and learning how to distinguish between a foreground and background. Although a systematic review of 81 studies concluded that these techniques were useless, many public schools in the United States continued to rely on perceptual training to remediate learning difficulties.

In the 1950s and 1960s, a number of psychologists and speech and language specialists, including William Cruickshank, Helmer Myklebust, and Doris Johnson, began focusing more on the specific cognitive processes relating to academic difficulties. Their focus was much more targeted on specific areas of academic weakness. But this hodgepodge of different approaches created much confusion in the schools, because children with distinctly different areas of academic weakness were lumped together, and no one knew what to call them. Children who were having difficulties learning in school were given a number of different labels, including “dyslexia,” “learning disorder,” “perceptual disorder,” and “minimal brain dysfunction.”

On Saturday, April 6, 1963, parents and professionals met in Chicago to explore the “problems” of the perceptually handicapped child. All were struggling to integrate these various approaches. At this historic conference Samuel Kirk, professor of special education at the University of Illinois, coined the term “learning disabilities,” noting, “I have used the term ‘learning disabilities’ to describe a group of children who have disorders in the development of language, speech, reading, and associated communication skills needed for social interaction. In this group, I do not include children who have sensory handicaps, such as blindness, because we have methods of managing and training the deaf and blind. I also excluded from this group children who have generalized mental retardation.”

Professionals, educators, and parents rejoiced. Finally they had a single, unified label.

* * *

Kirk’s speech was highly influential on the first federal definition of learning disabilities, found in the 1969 “Children with Specific Learning Disabilities Act.” The act’s definition was essentially Kirk’s definition:

The term “specific learning disability” means a disorder in one or more of the basic psychological processes involved in understanding or in using language, spoken or written, which may manifest itself in imperfect ability to listen, think, speak, read, write, spell, or do mathematical calculations. The term includes such conditions as perceptual handicaps, brain injury, minimal brain dysfunction, dyslexia, and developmental aphasia. The term does not include children who have learning disabilities, which are primarily the result of visual, hearing, or motor handicaps, or mental retardation, or emotional disturbance, or of environmental, cultural, or economic disadvantage.

Notice there’s no actual mention of “intelligence” in this definition. There’s the fuzzy term “basic psychological processes.” The core of the definition is that those with a specific learning disability (SLD) show “unexpected” low achievement in a specific academic area that cannot be explained by other factors. This definition of specific learning disability remains in place today, virtually unchanged from its 1969 formulation, so it’s important to understand its origins: It was literally a definition created by a committee.

But defining the term was only the first step. Educators needed to know how they should identify children with a specific learning disability. Beginning with the “Right to Education for All Handicapped Children Act” of 1975, the following guidelines were included for identification:

The child does not achieve commensurate with age and ability when provided with appropriate educational experiences.

The child has a severe discrepancy between levels of ability and achievement in one or more of seven areas that are specifically listed (basic reading skills, reading comprehension, mathematics calculation, mathematics reasoning, oral expression, listening comprehension, and written expression).

The first guideline was intended to make sure that low educational achievement was due to an intrinsic characteristic of the student, and not just a reflection of bad teaching. The second guideline was an attempt to measure “unexpected” low achievement. But the drafters had a problem. There was no good way for educators to measure the “basic psychological processes” mentioned in the definition. What were these mysterious processes? Theory-based IQ tests, grounded in neuropsychological processes, hadn’t yet arrived on the scene.

Their solution: use a “severe discrepancy” between IQ and achievement. This decision was largely based on the Isle of Wight studies conducted in the early 1970s. Michael Rutter and William Yule found tentative evidence that there are meaningful differences between two different groups of poor readers—those whose low reading was unexpected based on their IQ (“specific reading retardation”) and those whose low reading was “expected” based on their low IQ score (“general reading backwardness”). Rutter and Yule concluded their study with the following: “The next question clearly is: ‘do the two groups need different types of remedial help with their reading?’ No data are available on this point but the other findings suggest that the matter warrants investigation.”

But the U.S. government needed guidelines and couldn’t wait for more research. So they left their guidelines open-ended, leaving it up to each state to decide what constituted a “severe discrepancy” between IQ and achievement. Of course, states differed quite a bit, creating a situation in which parents who wanted to gain a specific learning disability diagnosis for their child could pack up and move to a state whose guidelines required a smaller discrepancy! States also disagreed on which IQ test should be used and whether a global IQ score or subscale should be used. As we’ll see, these aren’t trivial differences.

Thus was born one of the most unintelligent methods of identifying learning disabilities ever invented.

* * *

Despite the high reliability of IQ test scores across most of the lifespan, IQ testing is not an exact science. One of Binet’s key insights was that you can’t measure someone’s IQ, or any psychological trait for that matter, to the same level of precision as you can measure a person’s height or weight. There are many reasons why a person’s test score can change from one testing session to the next. One major source of IQ fluctuation is measurement error. Sometimes a score can be seriously underestimated because the test taker zoned out or temporarily became distracted. For instance, perhaps just before one IQ testing session, the test taker had a traumatic breakup that affected his or her concentration. It’s also possible for a person’s IQ score to be artificially inflated, which can happen with lucky guessing or cheating. There are some cases on record of parents feeding their children the answers ahead of time.

But the source of measurement error isn’t always the test taker. There’s plenty of room for administration errors, such as two different test examiners scoring answers differently, or one examiner making a clerical mistake and accidentally omitting the third digit in a child’s IQ score. Just how prevalent are these errors? One study found that about 90 percent of examiners made at least one error, and two-thirds of the errors resulted in a different IQ score. Also, despite IQ test administrators reporting confidence in their scoring accuracy, the average level of agreement was only 42.1 percent. As Kevin McGrew notes, “This level of examiner error is alarming, particularly in the context of important decision-making.”

To account for measurement error, most modern IQ tests provide an examiner with a confidence interval: a range of IQ scores that is likely to contain a person’s “true” IQ. Of course, a true IQ score can never be observed directly. The only way we’d actually be able to find it would be to give a person the same IQ test an infinite number of times. But it’s clearly not feasible to give the same person the same test even a handful of times, so most IQ test manuals provide a range of IQ scores, leaving it up to the examiner to choose his or her confidence level.

A commonly chosen confidence level is 68 percent. Suppose you are trying to predict what an 11-year-old child’s IQ score will be at the age of 21 and you know that there’s a .70 correlation in the general population between IQ measured at age 11 and IQ measured at age 21 (this correlation is at the upper end of what is typically found). Based solely on that information, what range of IQ scores can you expect he will obtain on his twenty-first birthday?

It depends how confident you want to be. If you are content with 68 percent confidence, you can expect that the child’s score at age 21 will fall somewhere within about 10 points of his 11-year-old score (in both directions: 10 points higher or 10 points lower than his original IQ score). But that’s with only 68 percent confidence. As Alan Kaufman notes, “I wouldn’t cross a busy intersection if I had only a 65% to 70% probability of making it to the other side.”

For high-stakes decisions, test administrators have the option of increasing their confidence interval to 90 percent or even 95 percent. Of course, higher confidence comes at a cost: it widens the range of possible IQ scores. In the example of this 11-year-old boy, if you want to be 95 percent confident of what this child’s IQ score will be at age 21, you’d have to expect a range of 20 points in either direction.
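The 10-point and 20-point figures can be roughly reproduced with the standard error of estimate from simple linear regression. The following is a minimal sketch, not the author's own calculation: it assumes the conventional IQ scale with a standard deviation of 15 (the text does not state it), normally distributed prediction errors, and a hypothetical helper named `prediction_half_width`.

```python
import math

IQ_SD = 15.0  # assumption: conventional IQ standard deviation, not given in the text


def prediction_half_width(r, z, sd=IQ_SD):
    """Half-width of the band around a predicted later score.

    r  -- correlation between the earlier and later scores
    z  -- normal quantile for the desired confidence (1.0 ~ 68%, 1.96 ~ 95%)
    """
    standard_error_of_estimate = sd * math.sqrt(1.0 - r ** 2)
    return z * standard_error_of_estimate


r = 0.70  # age-11 to age-21 correlation used in the text
half_68 = prediction_half_width(r, z=1.0)
half_95 = prediction_half_width(r, z=1.96)
print(round(half_68, 1), round(half_95, 1))  # prints 10.7 21.0
```

Under these assumptions the band is about 11 points either way at 68 percent confidence and about 21 points at 95 percent, close to the rounded figures in the text. Strictly, a regression band is centered on the predicted score, which sits closer to 100 than the age-11 score does.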

Most contemporary IQ tests are a bit more reliable, but no test exists that is perfectly reliable. Even using the most reliable IQ tests available today, the expected spread is significant. Kevin McGrew reviewed IQ fluctuations among today’s most frequently administered IQ tests and estimated that the full range of expected IQ differences for most of the general population is 16 to 26 points.

The problem gets even worse when you realize that those who are most impacted by high-stakes decisions, those at the extreme low and high ends of the bell curve, are also the ones who are most likely to show the largest test score fluctuations. The technical term for this phenomenon is “regression to the mean.” Let’s say you learn a new game, such as Scrabble, and the first time you play you do really well (beginner’s luck). All else being equal, the next time you play you’ll probably perform closer to average. Same thing if you performed really poorly the first time. Chances are, you’ll perform better next time. This applies to any form of measurement. Sports rookies who have an amazing first year are rarely as hot the second year. Even the “Linsanity” of Jeremy Lin cooled down. It’s a statistical fact that initial expectations, based on a single number, can’t be trusted.
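Regression to the mean can be seen in a small simulation. This is an illustrative sketch only: it assumes each observed score is a stable "true" score plus independent random error, an IQ scale with mean 100 and standard deviation 15, and a test reliability of .90. The reliability value, the cutoffs of 130 and 70, and the function name are all assumptions chosen for illustration, not figures from the text.

```python
import random
import statistics


def simulate_retest(n=20000, reliability=0.90, seed=0):
    """Return (first, second) observed scores for n simulated test takers."""
    rng = random.Random(seed)
    true_sd = 15.0 * reliability ** 0.5           # spread of stable ability
    error_sd = 15.0 * (1.0 - reliability) ** 0.5  # spread of session-to-session noise
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(100.0, true_sd)
        first = true_score + rng.gauss(0.0, error_sd)
        second = true_score + rng.gauss(0.0, error_sd)
        pairs.append((first, second))
    return pairs


pairs = simulate_retest()

# Select people by their FIRST score only, then look at their second score.
high = [(a, b) for a, b in pairs if a >= 130]
low = [(a, b) for a, b in pairs if a <= 70]

high_first = statistics.mean(a for a, _ in high)
high_second = statistics.mean(b for _, b in high)
low_first = statistics.mean(a for a, _ in low)
low_second = statistics.mean(b for _, b in low)

print("high scorers:", round(high_first, 1), "->", round(high_second, 1))
print("low scorers: ", round(low_first, 1), "->", round(low_second, 1))
```

In this model nobody's underlying ability changes between sessions. Even so, the group selected for very high first scores averages lower on the retest, and the group selected for very low first scores averages higher.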

But measurement error isn’t the only culprit in IQ test score fluctuations. There are lots of other reasons why a person’s IQ score might differ from one testing session to the next. One important (but often overlooked) cause is the format of the test. Different IQ tests measure a different mixture of cognitive abilities, and school psychologists often find different IQ scores if they administer more than one IQ test to the same person (even if the test manual says they are measuring the same skills).

Just how much can scores fluctuate from one IQ test battery to the next? During 2002–2003, as part of validation for their new IQ test, the KABC-II, Alan and Nadeen Kaufman looked at IQ test scores from a dozen children who were tested on three different contemporary IQ tests. Consider the IQ profiles of a representative sample of those children, aged 12–13. The first thing to note is that those exposed to greater opportunities for learning (higher SES, based on parents’ education) tended to score higher on IQ tests than those from lower-SES backgrounds. But even collapsing across SES, every single preadolescent had a different IQ score based on which test they took. The differences for the dozen children ranged from 1 to 22 points, with an average difference of 12 points. Leo earned IQs that ranged from 102 to 124. Brianna ranged from 105 to 125. In some districts, Brianna would have qualified as “gifted” based solely on her KABC-II score. But if the district looked at her WJ III score, she would be considered as having only average intelligence. Therefore, an ideal IQ score would be one that not only averages across multiple testing sessions but also uses multiple test batteries to get at what you are trying to measure and averages those scores as well.

But not all test score fluctuations are the result of measurement error or changes in test battery. Various school factors significantly influence IQ scores, such as quality of instruction, enriching classes and afterschool activities, entering school late, intermittent attendance, length of schooling, and summer vacations. All of these factors influence genuine brain maturation. An inconvenient truth for educators who employ rigid IQ cutoff scores when making important decisions is that the environment matters, and people really do grow at different rates.

There are also important personal factors that can affect IQ scores, such as changes in test anxiety and test motivation. In a recent analysis of multiple studies based on 2,008 participants, Angela Duckworth and her colleagues found that offering material incentives increased IQ scores substantially (the effects ranged from medium to large). The effect of incentives depended on the person’s IQ score, however. Offering external rewards boosted IQ scores much more among those with below-average IQs than among those with above-average IQ scores. These results don’t mean IQ scores are meaningless as indicators of cognitive ability. It’s pretty difficult to obtain a high IQ score based solely on passion. Indeed, IQ remained a significant predictor of life outcomes, even taking motivation into account, and IQ scores were a better predictor of academic achievement than test motivation.

Nevertheless, the study highlights the fact that motivation is an important contributor to IQ test performance. No test administrator knows all of the causes of a test taker’s low IQ score. Test examiners are trained to look out for signs of low motivation and high test anxiety, but they don’t always take their qualitative observations into account when interpreting a child’s score.

* * *

If IQ is such a fallible measuring stick, can we really predict an individual’s future level of academic achievement from their current IQ score? Averaging over many students, we can. The most reliable IQ tests typically show correlations with academic achievement ranging from the mid-.60s to the mid-.70s. These correlations offer some of the most reliable predictions in all of psychology.

But even with correlations this high, about 40 to 60 percent of differences in academic outcomes are not related to IQ scores. There are numerous factors that contribute to academic achievement (many of which we will review throughout this book). These include specific cognitive abilities; other student characteristics such as motivation, persistence, self-control, mindset, and self-regulation strategies; classroom practices; the design and delivery of curriculum and instruction; school demographics, climate, politics, and practices; home and community environments; and, indirectly, state and school district organization and governance.
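The "40 to 60 percent" figure follows from squaring the correlation: the share of variance in achievement that is not linearly related to IQ is one minus r squared. A minimal sketch, reading "mid-.60s" as .65 and "mid-.70s" as .75 (that reading, and the function name, are assumptions):

```python
def unexplained_share(r):
    """Share of variance in one variable not linearly accounted for by the other."""
    return 1.0 - r ** 2


for r in (0.65, 0.75):
    percent = 100 * unexplained_share(r)
    print(f"r = {r:.2f}: about {percent:.0f} percent of variance not related to IQ")
```

With these two values the unexplained share comes out at roughly 58 and 44 percent, which is the range the text rounds to 40 to 60 percent.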

In fact, the strength of the relationship between IQ and academic achievement depends heavily on just how you define “academic achievement.” Consider two recent studies conducted by Angela L. Duckworth, Patrick D. Quinn, and Eli Tsukayama on middle school students. They found that self-control predicted changes in report card grades better than IQ, whereas IQ predicted changes in standardized achievement test scores better than self-control. The teachers indicated that they factored in completion of homework assignments, class participation, effort, and attendance when determining report card grades. These results suggest that GPA highlights a broader range of key life skills than standardized test performance.

Even the very highest correlations between IQ and academic achievement leave plenty of room for error. Kevin McGrew found a correlation of .75 between IQ and standardized academic achievement test performance in a representative sample from the most recent edition of the Woodcock-Johnson tests of cognitive abilities and achievement. Based on this correlation, which is on the very high end of what is typically found, just how well could he predict the academic achievement of those with IQs within the 70–80 range (those often labeled “slow learners”)?

Consider a scatter plot of the relationship between IQ and academic achievement (averaged across tests of reading, math, and written language). Each little circle represents a real, live, breathing person. Even for individuals within that small range of IQ scores, expected achievement scores ranged quite a bit, from about 40 to 110. Half of the individuals within the 70–80 IQ range achieved at or below their expected achievement, but importantly, the other half scored at or above their predicted achievement. This finding has some pretty striking implications! Back in 1937, Cyril Burt made his famous pint of milk analogy: “Capacity must obviously limit content. It is impossible for a pint jug to hold more than a pint of milk, and it is equally impossible for a child’s educational attainment to rise higher than his educable capacity.”

McGrew concludes that the correct metaphor for the association between IQ and academic achievement is not that the jug can’t hold more milk, but that the “cup can flow over.” According to McGrew, “the carte blanche assumption that all students with disabilities should have an alternative set of educational standards and an assessment system is inconsistent with empirical data. . . . The current reality is that despite being one of the flagship developments in all of psychology, intelligence tests are fallible predictors of academic achievement.”

McGrew’s findings don’t apply only to those with IQs in the 70–80 range. No matter what IQ band you pull out from McGrew’s analysis, you’ll find the same thing. In fact, a law emerges. Using the most reliable IQ tests available today, McGrew notes that “for any given IQ test score, half of the students will obtain achievement scores at or below their IQ score. Conversely, and frequently not recognized, is that for any given IQ test score, half of the students will obtain achievement scores at or above their IQ score.” Clearly a child’s current discrepancy between IQ and achievement doesn’t necessarily indicate a learning disability.
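The spread within a single IQ band can be illustrated with simulated data. The sketch below uses no real test data and does not reproduce McGrew's analysis. It assumes IQ and achievement are both scaled to mean 100 and standard deviation 15, are jointly normal, and correlate at .75; the sample size, seed, and function name are arbitrary choices for illustration.

```python
import random


def simulate_band(r=0.75, n=50000, low=70.0, high=80.0, seed=1):
    """Simulate (predicted, achieved) scores for people whose IQ falls in one band."""
    rng = random.Random(seed)
    residual_sd = 15.0 * (1.0 - r ** 2) ** 0.5  # scatter around the regression line
    band = []
    for _ in range(n):
        iq = rng.gauss(100.0, 15.0)
        predicted = 100.0 + r * (iq - 100.0)  # achievement predicted from IQ
        achieved = predicted + rng.gauss(0.0, residual_sd)
        if low <= iq <= high:
            band.append((predicted, achieved))
    return band


band = simulate_band()
share_at_or_above = sum(a >= p for p, a in band) / len(band)
scores = sorted(a for _, a in band)

print("people in band:", len(band))
print("share at or above predicted achievement:", round(share_at_or_above, 2))
print("lowest and highest achievement:", round(scores[0]), round(scores[-1]))
```

Because the scatter around the prediction is symmetric in this model, about half of the simulated students land at or above their predicted achievement whichever band is chosen, and the achievement scores inside a 10-point IQ band still span several dozen points.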

But perhaps the biggest flaw in the severe discrepancy method is that it’s a fundamentally unintelligent method. It treats single IQ scores as the arbiter of truth, without looking at the person’s history and understanding the numbers in context. Responsible and intelligent use of IQ tests requires us to consider the student’s overall pattern of strengths and weaknesses (not just on the IQ test but even more generally in terms of talents, and social and emotional functioning), life aspirations, developmental history, environmental circumstances, and opportunities to learn.

Robert Sternberg and Elena Grigorenko sum the situation up nicely: “The use of difference scores in diagnosing reading disabilities is analogous to the building of a house of cards. Millions of high-stake decisions are being made on the basis of a procedure that is flawed and greatly in need of modification.” You would think that every single school in the United States would have firmly placed the severe discrepancy method in the dustbin by now. Alas, this isn’t the case. A recent survey by Perry Zirkel and Lisa Thomas found that the severe discrepancy approach remains a viable approach in the vast majority of states in the United States, with the decision to use the method left up to the local districts. If you are one of those districts still relying on the severe discrepancy approach, you may want to seriously rethink your procedures for identifying learning disabilities.

The research proves that IQ tests poorly predict learning disabilities. So why are schools still using them?

By Scott Barry Kaufman

Published July 7, 2013 3:30PM (EDT)
