Essay questions

How well can computers judge prose -- and would you want one grading your exam?

By Christopher Ott

Published May 25, 1999 4:00PM (EDT)

Forget No. 2 pencils and Scantron ovals: Some educators are beginning to use computers to grade essays. Already a system called E-rater evaluates every essay
written as part of the Graduate Management Admission Test (GMAT) — or about 800,000
compositions crafted by 400,000 business school applicants this year.

And some professors, bogged down by the stacks of student papers they must
read, eagerly anticipate computerized readers that can help them slog
through the flood of words that crosses their desks each semester.

“It is becoming increasingly difficult to manage the load associated with
essay grading, and lecturers are gradually shifting the focus of their
assessment to multiple-choice questions,” says Chris Janeke, a senior psychology lecturer at
the University of South Africa, a 120,000-student university experimenting
with a computerized grading system called the Intelligent Essay Assessor.
Such software, he says, “offers the possibility of automatizing at least some aspects of essay grading and may present a technological solution to our logistic problems.”

But hold on: If a student writes an essay that is graded by a computer, has it really been “read” at all? Well, sort of. A machine obviously can’t
comprehend a student’s argument — but it can determine whether a
composition addresses a specific question, and it can judge an essay’s structure. Electronic grading systems analyze hundreds of sample answers to a specific question (something
like “Should a government be able to censor the media?”), then compare the content and semantic structure of the students’ answers to the sample essays.
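To make the idea of comparing a student answer against graded samples concrete, here is a minimal sketch in Python. It is not how E-rater or the Intelligent Essay Assessor actually work; it assumes a much cruder stand-in (word counts compared by cosine similarity), and the function names and the two sample answers are invented for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its words (a crude bag-of-words model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_score(essay, scored_samples):
    """Return (score, similarity) of the pre-graded sample closest to the essay."""
    essay_counts = word_counts(essay)
    best_score, best_sim = None, -1.0
    for sample_text, sample_score in scored_samples:
        sim = cosine_similarity(essay_counts, word_counts(sample_text))
        if sim > best_sim:
            best_score, best_sim = sample_score, sim
    return best_score, best_sim


# Hypothetical sample answers with made-up human-assigned scores.
samples = [
    ("A government should not censor the media because a free press "
     "holds power accountable.", 6),
    ("Censor media is bad.", 2),
]
score, sim = predict_score(
    "A free press holds the government accountable, so censoring the media is wrong.",
    samples,
)
```

A real system would train on hundreds of samples per question and weigh structure as well as vocabulary; this sketch only shows the "nearest graded sample" idea.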

If this sounds like a lifeless way to examine a student’s thoughtful writing, it is. But it’s actually little different from the decades-old system that depends on people to grade the essay portion of standardized tests. Human graders, too, are required to read sample essays and judge student responses based on qualities prescribed by the testing service.

“The procedures are actually identical,” says Fred McHale, vice president for assessment and research at the Educational Testing Service (ETS), which
developed E-rater. “Once the scoring rubrics are created by expert readers
from the sample responses, those samples are used to train human readers —
or programmed into E-rater.” (GMAT essays have been submitted electronically since 1997, so neither people nor software has to read
handwriting.)

Every electronically graded essay also gets a second read by a real live human. Even so, the notion that computers play any part in evaluating student
essays hasn’t gone down well with everyone in the academic community. “I think it’s silly,” says Dennis Baron, head of the English Department at the
University of Illinois at Urbana-Champaign, with an edge of derision. Computerized grading undermines the very purpose of essays, he adds. “Like
the teacher says, ‘I’m not just talking to hear myself talk.’ We don’t ask students to write just to have them jump through a hoop.”

Writing for a computerized audience is, to some critics’ thinking, an absurd waste of time that can only warp the educational process. (ETS is also considering making E-rater available to score practice essays for students preparing to take the GMAT.) “Even before this, the pressure was there to teach to the test,” says Baron. If students know their essays are being graded by a machine that can parse
semantics and syntax, they “will learn to write for the formula.”

Critics have long lodged similar complaints against all standardized testing, arguing that the tests measure students’ ability to take tests, not their ability to learn and produce ideas. Some also maintain that the tests often include subtle racial, class or gender biases — benefiting students who are white, middle-class and male.

But could we use technology to eliminate bias from the grading process, and to promote fairness and consistency? “In essence [the technology] is doing what a person is trained to do when they’re doing holistic grading,” says Darrell Laham, chief scientist for Knowledge Analysis Technologies, which developed the Intelligent
Essay Assessor. “You see samples of
what an excellent essay is supposed to look like, or a medium essay, or a very bad essay. With a person, their criteria may shift a little bit.” The
software, on the other hand, is 100 percent consistent: “You give it the same set of parameters, and it will always give the same results.”

But Monty Neill, executive director of FairTest — an advocacy group that fights for fairness in standardized
testing — says the software’s lack of bias doesn’t mean electronic grading will be free of prejudice: It all depends on how the software is programmed. “If you’re looking for things that are not really relevant but are associated with a particular demographic group, then certainly that would reinforce a bias,” he says.

A question assuming knowledge of stock dividends, for example, could penalize test-takers whose
families never owned securities. But Neill agrees that a computerized grading system, properly programmed, could eliminate other forms of bias. “You might have someone who identifies black writing as automatically bad,
whereas a machine might not,” he says.

Laham says the best way to escape grading bias is to choose the model essays with care: “The underlying comparison set of essays should represent the population that the grades are meant to represent.” To be fair to the test-takers, he says, his Intelligent Essay Assessor is designed to know
its limits and not give a student a poor mark when the software can’t “read” an essay, for stylistic or other reasons. “What the technology will do when it sees an essay that is completely unlike what it has seen before is to flag it and tell a teacher to look at it … It won’t be able to grade it, but it will know it can’t grade it.”
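The "know its limits" behavior Laham describes can be illustrated with a small Python sketch. It is an assumption-laden toy, not the Intelligent Essay Assessor's method: it measures shared vocabulary (Jaccard overlap) between an essay and the sample essays, and flags the essay for a teacher when the best overlap falls below a cutoff. The function names and the 0.1 threshold are hypothetical.

```python
import re


def vocabulary(text):
    """Set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all words used in either text."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def needs_human_review(essay, sample_essays, min_overlap=0.1):
    """True when the essay resembles none of the samples closely enough to grade."""
    essay_vocab = vocabulary(essay)
    best = max(
        (jaccard(essay_vocab, vocabulary(s)) for s in sample_essays),
        default=0.0,
    )
    return best < min_overlap


# Hypothetical sample essay and two test inputs.
samples = ["The press should be free from government censorship."]
on_topic = needs_human_review("Government censorship of the press is wrong.", samples)
off_topic = needs_human_review("My favorite recipe uses tomatoes and basil.", samples)
```

The point of the design is that an unfamiliar essay is routed to a person rather than given a low mark by default.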

E-rater examines 50 linguistic features, including transitional phrases,
vocabulary and the ratio of complement clauses to the total number of
sentences. “For each essay, about eight to 12 of the features turn out to be
particularly predictive and explain why an essay should get a certain
score,” says Jill Burstein, a developmental scientist who invented the
E-rater prototype and led the ETS development team.

E-rater is surprisingly consistent with human graders. The E-rater scores
agree with scores given by a human grader about 90 percent of the time —
or as often as a second human reader would, according to ETS statistics.
And when a second human grader does score a disputed essay, he or she agrees with
E-rater about 97 percent of the time. In other words, the electronic
graders seem to do the job about as well as their human counterparts.
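For readers who want to see what an agreement figure like "90 percent" means arithmetically, here is a short Python sketch. The scores below are invented, and the article does not say exactly how ETS defines agreement, so the `tolerance` parameter (how many points apart two scores may be and still count as agreeing) is an assumption.

```python
def agreement_rate(scores_a, scores_b, tolerance=0):
    """Fraction of essays on which two graders' scores differ by at most `tolerance`."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both graders must score the same essays")
    if not scores_a:
        raise ValueError("need at least one essay")
    agree = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return agree / len(scores_a)


# Made-up scores for ten essays; the graders differ only on the last one.
machine = [4, 5, 3, 6, 2, 4, 5, 3, 4, 1]
human = [4, 5, 3, 6, 2, 4, 5, 3, 4, 3]
rate = agreement_rate(machine, human)
```

With these invented numbers the two graders match on nine of ten essays, so `rate` is 0.9.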

Computerized grading could cut student fees by $5 to $10 per test, according to ETS; readers who score the GMATs currently earn $23.75 per hour. And at Knowledge Analysis Technologies, Laham argues that essay-grading software can improve education by helping to eliminate multiple-choice
testing. His company’s Web site says: “Students need many more
opportunities to put their knowledge into words and find out how well
they’ve done and how to do better”; and Laham asserts that student writing,
even when produced for a computerized reader, demonstrates “a much deeper level of learning” than multiple-choice exams do.

But he is conscious of his product’s limitations. “When you start getting
into the creativity types of things, that’s not really our focus,” says
Laham. “This technology is not appropriate for looking at term papers where
every student is writing on a unique topic. We see it as a way to provide
students with the opportunity to write and revise their writing and to get
immediate feedback that they simply can’t have right now. A person can’t
always look at what a student produces.”

Baron, the University of Illinois professor, remains critical. He says he has gotten surprisingly good grades after
submitting completely off-topic essays to an online demonstration of the
Intelligent Essay Assessor. “If you don’t care
about what might be in the text that doesn’t match your template, then I
suppose you can go ahead and use it,” he says. “But it seems to me that
it’s also an insult to the writer. You’re asking these test-takers to write
connected prose, but you’re having it graded by an entity that has no sense
of what’s good about connected prose and how to evaluate it.” (Laham defends the product, saying that the version of IEA currently online does not yet have the system’s full battery of validity checks.)

Meanwhile, won’t students rebel against computerized readers?

Test-takers haven’t been troubled by the electronic grading of GMATs, says
McHale. “We were expecting more negative reaction, but we’ve had minimal
complaints, and just a single response of ‘I don’t want a computer grading
my essay,’ which someone wrote in one of their essays.” Part of the reason
for the subdued response may be that a person still reads each submission
— a procedure that McHale expects to continue. “For the large-scale,
high-stakes kind of testing that we do, I don’t see a human reader being
taken out of the loop,” he says. “The small discrepancies that we do see
could be very creative responses that we really do want to allow in the
testing.”

So far, there’s no plan to employ E-rater as a judge of literary merit or
creative writing, but ETS is researching the possibility of computerized
grading for the Test of English as a Foreign Language and the
Graduate Record Examinations. The GMAT was the first to employ the
software because the test had already phased out handwritten essays in
favor of typed ones.

While it’s unlikely that computerized grading will ever replace the careful
eye of a teacher, technology proponents like Laham say it can be a great
addition to the current academic system. “The reality is that teachers can’t
read enough to provide the student with enough feedback,” says Laham.

So instead of comparing the software to a human reader — where it can’t help
but appear a poor substitute — Laham argues critics should view electronic grading as a great benefit to students who want to write more than their teachers can read.
Dismissing the technology’s detractors, Laham says, “There aren’t as
many of the critics as there are teachers who want this system.”

Christopher Ott is a writer in Madison, Wis.
