Language Policy Research Unit (LPRU)


EPSL-0204-101-LPRU

"Accountability" Versus Science in the
Bilingual Education Debate

James Crawford

How should we judge the success of bilingual education, structured English immersion, and other programs for English language learners? Many members of the public, and most journalists, seem to rely primarily on standardized tests of student achievement. Thanks to the "accountability" movement, scores from a growing number of tests—reported by district, school, grade, and numerous demographic categories—are easily accessible via the Internet. For those interested in educational issues, the temptation to download and analyze these numbers can be irresistible. That's especially true when the experts are divided about what works. It seems that research evidence can always be cited to support one conflicting theory or another. Frustrated laypersons tend to ask: Why not draw our own conclusions based on "real world" test results?

Following this logic, the Boston Globe (2002) recently editorialized about the need to "reform" bilingual education in Massachusetts. As proof it cited data from the MCAS, a state-mandated test, showing "that children with limited English proficiency failed at more than three times the rate of other students." Such a disparity must mean that something is terribly wrong with the way these children are being taught, the newspaper concluded. What could be more obvious?

Unfortunately, the meaning of test scores is seldom transparent. In this case it is especially clouded by language—so much so that the Globe editors might want to reconsider their reasoning. The MCAS is a test designed to assess the academic skills of English speakers and it is administered entirely in English. Children who do not understand the language of the test will have trouble, to varying degrees, in showing what they have learned. If limited-English-proficient (LEP) students scored well on the MCAS, it would be reasonable to conclude that something was terribly wrong with the test—or that these students were no longer LEP. In other words, the MCAS is the wrong yardstick for measuring the academic achievement of English language learners and for evaluating programs that serve them.

Let me emphasize that certain tests, such as assessments of English proficiency or of academic skills in Spanish or Korean, may be quite useful in gauging the progress of LEP children. But testing their academic knowledge in English, often required these days in the name of accountability, cannot reliably serve that goal. Simply put, for students who have not mastered English, such tests are not very meaningful. Complaints about their lagging scores are therefore misplaced.

As Stephen Krashen (2002), a researcher in second-language acquisition, explained in a letter to the Globe, the relevant question for policymaking is "how well these children do after they [leave English learner classrooms and] enter the mainstream." The newspaper has yet to investigate that issue, or to publish Krashen's response.

In fairness, the Globe's mistake is hardly unique. It is one of countless examples from the bilingual education debate, in which journalists (not to mention advocates) believe themselves qualified to practice social science without a license. People who would never presume to challenge the findings of medical researchers or physicists or even meteorologists seem to give little credence to the experts when it comes to education. If the proof of the pudding is in the eating, they reason, the proof of a pedagogical approach must be in the test-taking. Or, as one opponent of bilingual education puts it, "Reality trumps theory." The news media, known for their pragmatist mindset, generally respond favorably to such arguments.

What accounts for this trend and how is it likely to affect education policy? To answer these questions, we need to consider the political context.


Holding schools accountable for student performance sounds like a fine idea to the average taxpayer. So does relying on "scientifically based research"—rather than, say, the latest fad—to guide educational programs and policies. Such matters used to be seen as judgment calls, best left to professional educators and local school boards. But in recent years, public trust in our educational system has been eroded. A steady stream of negative reports, flowing from policy centers and media outlets, has convinced a majority of Americans that the public schools are in trouble. This, in turn, has produced a bipartisan tide of reform, aimed at perceived mismanagement and misguided methodologies in the classroom, culminating in the No Child Left Behind Act passed earlier this year. The Elementary and Secondary Education Act, which had defined (and limited) the federal role in this field since 1965, has been mostly dismantled. In its place, the new law creates a command structure for American education founded on the twin priorities of accountability and science. Along with a modest increase in federal funding, it authorizes an unprecedented expansion of federal power. A mandate for the annual testing of students in grades 3 through 8 is designed to provide the necessary leverage.

To enhance the accountability of state and local education agencies, the legislation introduces an array of management controls: planning, reporting, deadlines, assurances, high standards, measurable goals, progress indicators, financial rewards, corrective actions, and of course, sanctions. Success or failure under this regime will be gauged almost entirely on the basis of test scores, raw and unadjusted for social and resource inequities. Holding schools accountable will mean accepting no excuses. Career advancement for teachers and principals will depend largely on how their students perform on tests like the MCAS, as will students' chances for promotion to the next grade and for graduation from high school. To legitimize this "high stakes" system, test results are given enormous weight and credibility.

Simultaneously, No Child Left Behind will require all recipients of federal funding—that is to say, virtually every school district in the United States—to employ scientific "research-based" instructional methodologies, classroom materials, academic assessments, teacher training, and remedial tutoring, as well as anti-drug, school safety, dropout prevention, gifted-and-talented, parent involvement, English language learner, and Indian education programs. Experts will need to back up their claims with hard evidence from scientific research before they will be authorized to design programs or train teachers. To qualify as scientific, research will have to be rigorous, empirical, systematic, objective, experimental, replicable, and peer-reviewed—in other words, highly controlled to ensure validity and relevance.

It would be hard to overstate the magnitude of these changes, at least on paper. How far and how fast the federal government will go in enforcing them are political questions that remain to be answered. But from an educational perspective, one thing is clear: the goals of quick accountability and rigorous science are on a collision course, with the potential to do serious harm. Indeed, the crash is already under way. Let's return to the policy debate over the schooling of English learners.


A campaign to replace bilingual education with all-English "immersion" programs, approved by voters in California (1998) and Arizona (2000), continues to spread. Similar ballot initiatives are being organized this year in Massachusetts and Colorado, generating bitter debate and substantial media interest. One point of contention, not surprisingly, has been the impact of the first of these measures, California's Proposition 227. As usual, test scores are at the center of the controversy.

Ron Unz, the Silicon Valley millionaire behind these initiatives, says his campaign has been vindicated by English learners' performance on the Stanford 9 achievement test. As he argued in a debate at Harvard University last fall:

"The facts are now in. The largest controlled educational experiment in the history of the world took place a few years ago involving over a million students in California who were largely shifted away from bilingual education to English immersion. ... The average test scores of over a million immigrant students have gone up by 50 percent in less than three years. Those school districts that most strictly followed the initiative and got rid of their bilingual programs doubled their test scores in three years. Don't believe me. Believe the New York Times, the Washington Post, CBS News—every major media source. The war is over. Or at least it should be over if academics were willing to look at the reality of the world rather than at their own research" (Unz et al., 2001).

In response, Harvard professor Catherine Snow noted that not a single expert in language education or psychometrics has endorsed this interpretation of the Stanford 9 results (for reasons to be discussed below). Why, then, should we take the journalists' word for it? Since Unz describes himself as "a theoretical physicist by training"—he dropped out of graduate school to pursue a political career—Snow wondered whether, on the basis of media reports, he also believes in "cold fusion." Unz countered:

"I think academics should look at the reality of the world rather than at theories published in a lot of books, which may or may not be correct. ... Reality trumps theory. Theory cannot defeat reality. You really have to ask yourself whether you believe the reality of your own senses, the test scores of a million immigrant students, or four or five books written by some professors at Harvard" (Unz et al., 2001).

Reality versus theory, experience versus books, a million immigrant children versus a few elite academics ... which side are you on? Unz may not have made it as a physicist, but he deserves a doctorate in demagoguery. Judging by the public reaction, his tactics seem to be working. The mainstream media used to laugh at George Wallace when he attacked "pointy-headed intellectuals." They are not laughing at Ron Unz. By and large, they have embraced both his reasoning and his conclusions about the success of Proposition 227 (for a detailed critique of this coverage, see Thompson et al., 2002).

The most influential account appeared in the New York Times, which highlighted "striking rates" of improvement for English learners on the Stanford 9 "after Californians voted to end bilingual education" (Steinberg, 2000). In particular, the article contrasted scores in the Oceanside Unified School District (see Table I), which Unz had hailed as a showcase for English immersion, with those in a neighboring district where bilingual programs continued in some schools.

"In Oceanside, the average score of third graders who primarily speak Spanish improved by 11 percentage points in reading over the last two years, to the 22nd percentile; in Vista, (Note 1) the gain was a more modest 5 percentage points, to the 18th percentile. In fifth grade in Oceanside, limited English speakers gained 10 percentage points in reading, with the average in the 19th percentile; in Vista, there was no increase, the average of limited English speakers staying flat, in the 12th percentile." (Steinberg, 2000)

Table I
Stanford 9 Reading Scores and Redesignation Rates* for
English Language Learners, Oceanside Unified School District and
California State Average, 1998-2001

                      Reading score (percentile), by grade
Year     District      2    3    4    5    6    7    8    9   10   11    Rate*

1997-98  Oceanside    12    9    8    6    9    4    9    5    2    3     5.4%
1997-98  Statewide    19   14   15   14   16   12   15   10    8   10     7.0%

1998-99  Oceanside    26   15   16   16   16   12   15    9    6    6     6.6%
1998-99  Statewide    23   18   17   16   18   14   17   11    9   11     7.6%

1999-00  Oceanside    32   22   23   19   20   13   18   11    6    8     4.1%
1999-00  Statewide    28   21   20   17   19   15   18   12    9   11     7.8%

2000-01  Oceanside    32   22   19   16   16   12   15    8    7    7    17.8%
2000-01  Statewide    31   23   21   18   21   16   19   12    9   11     9.0%

(Source: California Department of Education; Proposition 227 took effect in 1998-99.)
*Redesignation rates represent the percentage of English language learners who are redesignated as "fully English proficient" each year.

One educational researcher, Kenji Hakuta of Stanford University, was quoted briefly by the Times, cautioning that no scientific conclusions about Proposition 227 could be drawn from these data. But it was Unz's interpretation that received the lion's share of attention. "The test scores these last two years have risen, and risen dramatically," he said. "Something has gone tremendously right for immigrants being educated in California" (Steinberg, 2000).

As it happened, Hakuta and some colleagues had already conducted an extensive analysis of the California test results, which went unmentioned in the Times article. Their conclusion: "Scores rose for all students, and in no clear pattern that could be attributable to Proposition 227" (Orr et al., 2000). White and minority, rich and poor, language-minority and native English-speaking—virtually all groups of students had improved their performance on the Stanford 9. At parents' request, about 12 percent of California's English learners remained in bilingual classrooms under Proposition 227. So the researchers sampled a cross-section of school districts, including those that had eliminated all bilingual programs in 1998, those that had continued some bilingual programs, and those that had never offered any bilingual programs. They found that average gains among English language learners were about the same in each.

Among numerous possible explanations for the pattern of rising scores, Hakuta (2000) cited a California initiative to reduce class size in the early grades, a movement toward higher standards and accountability, and more effective preparation for the Stanford 9 as teachers become more familiar with the test. This last factor may be especially important, since 1998—the year before Proposition 227 took effect—was the first year of California's statewide testing program. As Krashen (2001) explains, "Typical test score inflation is about 1.5 to 2 points per year, which accounts for a great deal of the gains seen in grades 2-6 in California," both for English learners and for English-proficient students.
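Krashen's point can be checked roughly against the statewide English-learner figures in Table I. The sketch below is only a back-of-the-envelope illustration: it assumes that the "points" in Krashen's estimate can be compared directly with the percentile ranks in the table, and the variable names are my own.

```python
# Statewide Stanford 9 reading percentiles for English learners (Table I),
# keyed by grade level, for grades 2 through 6.
statewide_1998 = {2: 19, 3: 14, 4: 15, 5: 14, 6: 16}  # 1997-98, before Proposition 227
statewide_2000 = {2: 28, 3: 21, 4: 20, 5: 17, 6: 19}  # 1999-00, two years later

# Krashen's estimate of typical test score inflation, in points per year.
# Assumption: these points are comparable to the percentile ranks above.
INFLATION_LOW, INFLATION_HIGH = 1.5, 2.0
YEARS = 2

expected_low = INFLATION_LOW * YEARS    # inflation over two years, low estimate
expected_high = INFLATION_HIGH * YEARS  # inflation over two years, high estimate

# Observed two-year gain in each grade.
gains = {g: statewide_2000[g] - statewide_1998[g] for g in statewide_1998}

for grade, gain in gains.items():
    # Share of the observed gain that inflation alone could explain, capped at 100%.
    share_low = min(expected_low / gain, 1.0)
    share_high = min(expected_high / gain, 1.0)
    print(f"Grade {grade}: gain of {gain} points; "
          f"inflation alone could explain {share_low:.0%} to {share_high:.0%}")
```

Under these assumptions, inflation of 3 to 4 points over two years would cover the entire statewide gain in grades 5 and 6 and a sizable share of the gain in grades 2 through 4.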

But what about Oceanside? English learners' scores did rise substantially there in the two years after the initiative passed. Could this mean that English immersion is working "miracles" in that district? Based on the Stanford 9 results, it is impossible to say. Numerous other explanations are equally plausible, however, and they have nothing to do with the dismantling of bilingual education. First, the school district started from a dismal position in 1998—with reading scores sliding from the 12th percentile (grade 2) to the 6th percentile (grade 5) to the 2nd percentile (grade 10)—well below statewide averages for English learners (see Table I). With intensive test preparation, such results can be improved significantly. Second, Oceanside superintendent Ken Noonan (2000) has reported that, before Proposition 227, these students were taught entirely in Spanish for the first four years, sometimes longer. With no substantial exposure to English in the classroom, it's no wonder they did so poorly on English-language tests. Finally, there's the statistical phenomenon of "regression to the mean." As Hakuta (2001) notes, "Oceanside finally managed to drag its test scores from rock bottom up to the statewide average for EL students. This is not a story about excellence, hardly a miracle."
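Hakuta's description can be illustrated with the Table I figures. The sketch below computes, for each year, the average difference between Oceanside and the statewide English-learner percentile across grades 2 through 11. Averaging percentile ranks across grades is a crude summary, used here only for illustration, and the function name `mean_gap` is my own.

```python
# Stanford 9 reading percentiles for English learners, grades 2-11 (Table I).
oceanside = {
    "1997-98": [12, 9, 8, 6, 9, 4, 9, 5, 2, 3],
    "1998-99": [26, 15, 16, 16, 16, 12, 15, 9, 6, 6],
    "1999-00": [32, 22, 23, 19, 20, 13, 18, 11, 6, 8],
    "2000-01": [32, 22, 19, 16, 16, 12, 15, 8, 7, 7],
}
statewide = {
    "1997-98": [19, 14, 15, 14, 16, 12, 15, 10, 8, 10],
    "1998-99": [23, 18, 17, 16, 18, 14, 17, 11, 9, 11],
    "1999-00": [28, 21, 20, 17, 19, 15, 18, 12, 9, 11],
    "2000-01": [31, 23, 21, 18, 21, 16, 19, 12, 9, 11],
}


def mean_gap(year):
    """Average of (Oceanside - statewide) across grades 2-11 for one year."""
    diffs = [o - s for o, s in zip(oceanside[year], statewide[year])]
    return sum(diffs) / len(diffs)


gaps = {year: mean_gap(year) for year in oceanside}
for year, gap in gaps.items():
    print(f"{year}: Oceanside minus statewide = {gap:+.1f} percentile points")
```

By this rough measure, Oceanside trailed the statewide average by about 6.6 points in 1997-98, drew roughly level with it (+0.2) in 1999-00, and fell back to about 2.7 points below it in 2000-01. That pattern is consistent with a district climbing from the bottom toward the average, though the arithmetic alone cannot say why.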

Subsequent events have cast further doubts on the role of Proposition 227 in raising the district's scores. A month after the laudatory New York Times story appeared, the California Department of Education (2000) cited Oceanside for violating the civil rights of LEP students. Among numerous infractions, investigators found that the district had no coherent English immersion program; that it was failing to train teachers in immersion methodologies; and that, after one year of the immersion treatment, most English learners had been arbitrarily reassigned to mainstream classrooms, where they received little if any help in overcoming language barriers. Soon after, the federal Office for Civil Rights reached similar conclusions.

In 2001, the reading scores for Oceanside's English learners leveled off in grades 2 and 3 and declined in grades 4 through 9 and 11. The only "good news" was posted in grade 10—an increase from the 6th to the 7th percentile. What does this all mean? Probably not very much, just another regression to the mean, as Hakuta (2001) argues. But it does highlight another reason to dismiss the Stanford 9 scores for English learners as mostly meaningless. No confidence can be placed in these year-to-year comparisons because students may differ in ways that cannot be statistically controlled. For example, because of a recent wave of immigrants, this year's second graders may come from poorer, less educated backgrounds and start out with less knowledge of English than last year's second graders. Such factors are certain to affect performance on the Stanford 9, but with the data now available there is no way to adjust for them. Therefore, it is impossible to draw valid conclusions about year-to-year fluctuations in second grade scores.
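The 2001 pattern described at the start of this passage can be verified directly from Table I. The short sketch below subtracts Oceanside's 1999-00 reading percentiles from its 2000-01 percentiles, grade by grade. It confirms only the arithmetic; for the reasons just given, the differences compare two different groups of students and support no conclusion about cause.

```python
GRADES = list(range(2, 12))  # grades 2 through 11

# Oceanside English learners' Stanford 9 reading percentiles (Table I).
oceanside_2000 = [32, 22, 23, 19, 20, 13, 18, 11, 6, 8]  # 1999-00
oceanside_2001 = [32, 22, 19, 16, 16, 12, 15, 8, 7, 7]   # 2000-01

# Year-to-year change for each grade level (not for the same students).
changes = {
    grade: new - old
    for grade, old, new in zip(GRADES, oceanside_2000, oceanside_2001)
}

flat = [g for g, c in changes.items() if c == 0]
declined = [g for g, c in changes.items() if c < 0]
improved = [g for g, c in changes.items() if c > 0]

print("No change in grades:", flat)
print("Declined in grades: ", declined)
print("Improved in grades: ", improved)
```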

Moreover, the English learner category itself is constantly changing. Students enter and exit at varying rates, depending on how much English they arrive with and how long it takes them to be "redesignated" as fully fluent in the language. Naturally those who have acquired more English will do better on English-language achievement tests than those who have acquired less English. This means that when students are redesignated as fluent in English, their scores are no longer counted in the LEP group, usually lowering the overall average. In effect, a district's successes in teaching English count against it when Stanford 9 scores are calculated, even if students are doing well. Because there is no statewide gauge of English proficiency—criteria and procedures vary considerably among California districts—test results for English learners can be easily manipulated. For example, to boost average scores on the Stanford 9, all a clever administrator would need to do is slow down the official redesignation of students as fluent in English, which would automatically retain many high-scorers in the LEP category. That stratagem could well account for Oceanside's "striking" performance reported in 2000. That year only 4 percent of these students were deemed ready for the mainstream—about half the statewide average. Then, in 2001, test scores fell off as the district's redesignation rate jumped to nearly 18 percent.
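The effect of redesignation on a group average is easy to demonstrate with made-up numbers. In the sketch below, the ten scores are hypothetical, not real data, and the function `lep_average` is my own illustration. It makes the simplifying assumption that the highest scorers are the ones redesignated first.

```python
def lep_average(scores, n_redesignated):
    """Average score of students still classified as LEP after the
    n_redesignated highest scorers have been moved out of the category."""
    if not 0 <= n_redesignated < len(scores):
        raise ValueError("at least one student must remain in the LEP group")
    remaining = sorted(scores)[: len(scores) - n_redesignated]
    return sum(remaining) / len(remaining)


# Hypothetical percentile scores for ten English learners (not real data).
scores = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

for n in (0, 1, 2):
    print(f"{n} redesignated: LEP group average = {lep_average(scores, n):.1f}")
```

No individual student's score changes in this example, yet the group average drops from 27.5 to 25.0 to 22.5 as more high scorers are redesignated. Delaying redesignation has the opposite effect, which is why a low redesignation rate can flatter a district's reported scores for English learners.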

It is also important to note that none of the year-to-year comparisons are based on the scores of individual students receiving distinct educational treatments. They are based on aggregate scores for school districts using greater or lesser amounts of students' native language for instruction. This is a crude way to analyze program alternatives, considering that there are no "pure" bilingual education districts in California, only districts where a percentage of English learners—usually a minority—have been granted "waivers" of the English-only rule. (For other statistical anomalies, see Thompson et al., 2002.)

To sum up, there are excellent reasons to suspect that rising scores for English learners in Oceanside and other California districts have more to do with extraneous factors than with what is happening in the classroom. But who knows? Available data are insufficient to prove, or disprove, any hypothesis about the impact of Proposition 227 on English learners' achievement. What is needed is a truly controlled educational experiment that tracks the academic progress of individual children over several years. The California legislature has authorized a study along these lines, with interim results due to be reported soon. Unfortunately, in drawing conclusions about student outcomes, the researchers are relying on a single assessment tool: the Stanford 9.


None of this has inspired the New York Times or other media to reassess their verdict. According to the conventional wisdom, Proposition 227 remains a great success. Meanwhile, Ron Unz is urging voters and legislators elsewhere to follow California's lead—in fact, to go even further—in mandating English-only instruction. Bilingual programs could be restricted or banned outright in numerous states, largely on the basis of claims about the Stanford 9. A high-stakes test indeed.

How should defenders of bilingual education respond? There are two basic choices. Either we criticize this use of raw test scores as scientifically unsupportable. Or we use raw test scores to make contrary claims. Many of us have taken the latter course, and it's not hard to see why. Journalists tend to ignore the scientific arguments of Hakuta, Krashen, and others, which can be tedious to explain and rarely produce flashy headlines. Conversely, they give lots of attention and credence to the bold assertions of Ron Unz—always good copy. Advocates for bilingual education are tempted to respond in kind.

Recently I participated in a press conference sponsored by opponents of Unz's initiative in Massachusetts. Our side argued that, according to the latest Stanford 9 scores, English learners were losing ground in California. We pointed out that since 1998, in virtually every grade, the "achievement gap" has widened between these children and their English-speaking peers. In other words, scores of students overall are rising faster than those of LEP students—thus demonstrating the failure of Proposition 227 (Note 2) (Vaishnav, 2002). This is a plausible claim. Based on everything I have heard anecdotally and read in ethnographic studies of the initiative's impact (e.g., Gándara et al., 2000), I expect that a scientific study will one day render such a verdict. Yet, alas, it cannot be proved with existing Stanford 9 data, for the reasons explained above. In particular, English language learners make up an inconsistent and unstable category. Raising or lowering the bar for English proficiency, or altering procedures for redesignating students, could have a significant impact on the achievement gap. Since these decisions are made separately by each of the nearly 1,100 school districts in the state, there is no way to control for them. Ergo, no valid conclusions can be drawn.

This illustrates the dilemma for bilingual education supporters today. Credibility is a valuable commodity for advocates and researchers alike, especially when political assets are in short supply. Using persuasive but scientifically disreputable arguments could easily squander what little advantage we have in the public debate. On the other hand, when voters want answers, merely dismissing test scores as irrelevant is likely to make us irrelevant. What's the solution?

Often overlooked in the debates over Proposition 227 are the numerous research studies, from California and other states, showing the effectiveness of well designed bilingual education programs. These far outnumber the studies showing the benefits of all-English immersion programs (see, e.g., August & Hakuta, 1997). There is no study, in this country or abroad, that reports any promise whatsoever for the one-year immersion model prescribed by Proposition 227. Under the circumstances, it's hardly surprising that Unz derides all research in the field as unproven "theory" and elevates Stanford 9 scores as the ultimate "reality." He knows that most journalists like a simple story-line, with few subplots or caveats, and he has constructed a clever one.

We who oppose him have been far less adept in making our case, despite the ample evidence at our disposal. No doubt this reflects our inexperience in media manipulation and political chicanery. Mainly, however, it demonstrates the low priority that bilingual educators and researchers have placed on making scientific findings accessible to the public. As a direct result, policies on how to teach English language learners are increasingly based on what is politically, not pedagogically, effective. All this must change, or we ourselves should be held accountable.

Notes

1. It turned out that because of a computation error, Vista's highest-scoring English learners were wrongly counted as fully English-proficient. Thus its average scores for LEP students were considerably higher than those the Times reported. Nevertheless, Unz has continued to use the erroneous results, insisting they had been "officially" reported (Zehr, 2000; Unz et al., 2001).

2. There is no question that the initiative has failed to deliver on its promise of teaching children English in one school year. Since 1998, the statewide redesignation rate has budged only slightly, from 7 percent to 9 percent, continuing an upward trend that began in the early 1990s. By the (absurd) standard that Unz used for judging bilingual education during the campaign, Proposition 227 had a 91 percent "failure rate" in teaching English last year.

References

August, D. & Hakuta, K. (Eds). 1997. Improving schooling for language-minority students: A research agenda. Washington, DC: National Academy Press.

Boston Globe. 2002. Threatening language. April 8.

California Department of Education. 2000. Report on the investigation of complaints against the Oceanside Unified School District. Sacramento, September 29.

Gándara, P., Maxwell-Jolly, J., Garcia, E., Asato, J., Gutierrez, K., Stritikus, T. & Curry, J. 2000. The initial impact of Proposition 227 on the instruction of English learners. University of California, Linguistic Minority Research Institute, April. [Online] Available: http://lmri.ucsb.edu/RESDISS/prop227effects.pdf

Hakuta, K. 2000. Points on SAT-9 performance and Proposition 227. Stanford University, August 22. [Online] Available:

Hakuta, K. 2001. Silence from Oceanside and the future of bilingual education. Stanford University, August 18. [Online] Available: http://www.stanford.edu/~hakuta/SAT9/Silence%20from%20Oceanside.htm

Krashen, S. 2001. Why did test scores go up in California? A response to Unz/Reinhard. NYSABE Newsletter, 1 (3), 21-23. [Online] Available:

Krashen, S. 2002. Letter to the Boston Globe [unpublished]. April 8.

Noonan, K. 2000. I believed that bilingual education was best ... until the kids proved me wrong. Washington Post, September 3.

Orr, J.E., Butler, Y.G., Bousquet, M. & Hakuta, K. 2000. What can we learn about the impact of Proposition 227 from SAT-9 scores? An analysis of results from 2000. Stanford University, August 15. [Online] Available:

Steinberg, J. 2000. Increase in test scores counters dire forecasts for bilingual ban. New York Times, August 20.

Thompson, M.S., DiCerbo, K.E., Mahoney, K. & MacSwan, J. 2002. ¿Exito en California? A validity critique of language program evaluations and analysis of English learner test scores. Education Policy Analysis Archives 10 (7). [Online] Available: http://epaa.asu.edu/epaa/v10n7/

Unz, R., Snow, C. & Randolph, T. 2001. Bilingual education: A necessary help or a failed hindrance. [Videotape]. Harvard Graduate School of Education, October 15.

Vaishnav, A. 2002. Backers step up bilingual ed fight. Boston Globe, March 13.

Zehr, M.A. 2000. Cause of higher Calif. scores sore point in bilingual ed. debate. Education Week, September 6.


Copyright © 2002 by James Crawford. All rights reserved. No permission is required for personal use of this article, or to quote it in research papers. But republication of this material in any form and for any purpose, including course use and Internet postings, is prohibited, except by permission of the author at jwcrawford@compuserve.com. Before writing, please read his permissions FAQ, http://ourworld.compuserve.com/homepages/JWCRAWFORD/copy.htm.

 



Language Policy Research Unit - Mary Lou Fulton College of Education - Arizona State University
© Arizona Board of Regents (ABOR)