The trouble with Google Books

How rampant errors threaten the scholarly mission of the vast digital library

Depending on who you ask, Google Books — the pioneering tech company’s ambitious plan to “digitally scan every book in the world” and make them searchable over the Web and in libraries — is either a marvelous, utopian scheme or an unprecedented copyright power-grab. The people who can claim to fully understand the Google Books Search Settlement — the resolution of a class-action suit filed against the company by the Authors Guild and the Association of American Publishers — may be as few as those who comprehend the theory of special relativity.

Nearly everyone seems to agree, though, that Google Book Search represents a revolutionary boon to scholars, especially people embarked on specialized research but without ready access to a university library. But is it? As UC Berkeley professor Geoffrey Nunberg pointed out in an article for the Chronicle of Higher Education last year (expanded from a post on the blog Language Log), a research library is only as useful as the tools required to extract its riches. And there are some serious problems with the bibliographic information attached to many of the digital texts in Google Books.

Nunberg, a linguist interested in how word usage changes over time, noticed “endemic” errors in Google Books, especially when it comes to publication dates. A search for books published before 1950 and containing the word “Internet” turned up the unlikely bounty of 527 results. Woody Allen is mentioned in 325 books ostensibly published before he was born.

Other errors include misattributed authors — Sigmund Freud is listed as a co-author of a book on the Mosaic Web browser and Henry James is credited with writing “Madame Bovary.” Even more puzzling are the many subject misclassifications: an edition of “Moby Dick” categorized under “Computers,” and “Jane Eyre” as “Antiques and Collectibles” (“Madame Bovary” got that label, too).

Although Google representatives did respond to Nunberg’s article, blaming the bulk of the errors on outside contractors, much of the incorrect information remains in place. Looking at listings for “The Golden Bough” by James Frazer, a seminal work on comparative religion with a complex and fascinating publication history, I found one edition characterized as “Life Sciences.” The 12 volumes of what is arguably the most authoritative edition of the book (published between 1910 and 1915) aren’t grouped together or searchable as a whole, and the foremost search result is a dubious reprint of the bowdlerized 1922 edition with an introduction lifted from Wikipedia and a publication date of 1947, although the text itself claims a publication date of 2008.



I’ve already written about inadequate metadata — specifically how it can curtail readers’ choices. So I gave Nunberg a call to find out how flawed metadata affects historians and other scholars.

What is metadata?

Metadata is data about a text or work. The card for a book from an old card catalog is metadata: The title, the author, the publisher, the date of publication, the number of pages and so on. In the future, it could also include all sorts of other information, such as how many people have read it, or how many copies of it have sold.
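To make the card-catalog idea concrete, here is a minimal sketch in Python of a catalog card as a data record. The class name `CatalogCard`, the helper `shelve`, and the publisher and page-count values are all hypothetical placeholders invented for illustration; only the titles, authors and first-publication years are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogCard:
    """One card from a hypothetical card catalog: data about a book, not the book itself."""
    title: str
    author: str
    publisher: str  # placeholder values below, not real imprints
    year: int       # year of first publication
    pages: int      # placeholder values below, not real page counts


def shelve(cards):
    """Order cards the way a fiction shelf does: by author surname, then title."""
    return sorted(cards, key=lambda c: (c.author.split()[-1], c.title))


cards = [
    CatalogCard("Barchester Towers", "Anthony Trollope", "Placeholder Press", 1857, 500),
    CatalogCard("Madame Bovary", "Gustave Flaubert", "Placeholder Press", 1857, 400),
    CatalogCard("Moby-Dick", "Herman Melville", "Placeholder Press", 1851, 600),
]

for card in shelve(cards):
    print(f"{card.author}: {card.title} ({card.year})")
```

Everything a reader does with such a collection, from shelving to searching by date, depends on these fields being right; a wrong `author` or `year` silently puts the book in the wrong place.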

When you’re dealing with any collection of books — whether it’s a research library or your local Barnes and Noble — you need something like metadata. Say you’re looking for a children’s book on antelope or birds, so you go to the children’s section and within that section you look for the “nature” category. Similarly, if you want a novel by Anthony Trollope, you use the metadata of the retail space to find it. It’s in the fiction or literature sections, shelved alphabetically by author.

So the actual physical organization and shelves in a bookstore are a form of metadata, since they provide you with information about the books contained in each section of the store?

Yes. Even at home, if you’ve got more than, say, 100 books (or more than 100 of anything, really) you have a system for organizing them in some way so you can find what you want. Everyone has metadata, even if it’s just alphabetical order. And that’s even more important with a scholarly collection.

And what are the problems with the way Google Books handles metadata about the books in its collection?

Google Books was conceived of in two ways. The first is as a new library — I call it the “last library” — an aggregate of all the libraries in the world. The second is as a big database, a storehouse of information that you could search the way you search Google. The idea behind that is that books are just stored information. If I want to know who wrote Roosevelt’s inaugural speech, I can do a search and look it up.

But those two ideas are at odds with each other, which is something that Google didn’t realize. The beauty of Google is that you don’t need metadata, after all. You just barrel into the text and pull out what you want. So metadata — information about the source text — was not something they focused on.

How is that inattention a problem? Why is metadata important to, for example, scholars?

Metadata includes information about a particular text, and sometimes also about a particular copy: Zhou Enlai’s personal copy of Marx, for example, might be of special interest to a scholar. I might want to search for the first sentence of a Henry Fielding novel across different editions. That information can’t be derived from Googling. And I might want to search across collections: How often was a word used in a particular historical period? In that case, the accuracy of metadata about each book is crucial.

Even though you’re not looking at each individual book, one at a time. I see. Can you give me an example?

There’s this observation that “United States” was used first as a plural noun, but now it’s invariably used as a singular noun, which reflects an evolution in how people viewed the nation. Supposedly, this changed with the Civil War, but it’s actually more complicated than that. If you don’t have the correct metadata — in this case, the publication date — attached to the texts, then you can’t do an accurate search on how the word was used (before and after 1865).
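A rough sketch in Python shows why the date matters for this kind of search. The three-sentence corpus, the function name `usage_by_period` and the search pattern are invented for illustration and say nothing about how Google Books works internally; the point is only that a single wrong publication date moves a hit across the 1865 line.

```python
import re
from collections import Counter


def usage_by_period(records, pattern, cutoff_year):
    """Count matches of `pattern` in texts dated before vs. from `cutoff_year` onward.

    `records` is a list of (publication_year, text) pairs; the year is the metadata.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter(before=0, after=0)
    for year, text in records:
        period = "before" if year < cutoff_year else "after"
        counts[period] += len(regex.findall(text))
    return counts


# Invented sample sentences with invented dates, for illustration only.
corpus = [
    (1850, "The United States are a confederation."),
    (1880, "The United States is a nation."),
    (1890, "The United States is at peace."),
]

singular = r"United States is\b"

correct = usage_by_period(corpus, singular, 1865)

# Same texts, but the 1890 record carries a wrong date of 1850.
misdated = corpus[:2] + [(1850, corpus[2][1])]
skewed = usage_by_period(misdated, singular, 1865)

print("correct dates:", dict(correct))
print("one bad date: ", dict(skewed))
```

With correct dates, both singular uses fall after 1865. With one misdated record, a singular use appears to predate the war, which is exactly the kind of false evidence a researcher cannot detect without checking each book by hand.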

Google has also included in its metadata a system of subject-matter classification designed for the book trade known as BISAC. How is that a problem?

Well, Google may not be applying the risible BISAC subject categories anymore, at least not to older titles that weren’t given BISAC categories when they were published. (The BISAC standard is only a couple of decades old.) So “Madame Bovary” is no longer classified in Antiques and Collectibles!

BISAC is just right for a local Barnes and Noble store, but even when it’s correctly applied, it’s hopeless for a larger collection. It was designed as a way for publishers to tell booksellers where to shelve a book so that their customers could find it. That’s why BISAC has 20 subcategories for children’s books about various animals — books about bears, or about monkeys, for example — but only one category for European poetry. In a retail bookstore, you’re not going to have a section for 18th century Italian poetry or 17th century German poetry; all the European poetry is going to be shelved together. But that’s a ridiculous way to classify the collection of the Harvard Library.

How did some of the more outrageous mistakes happen, such as categorizing Walt Whitman’s “Leaves of Grass” as a book about botany or listing Henry James as the author of “Madame Bovary”?

I still don’t know what the story is. Several people at Google took pains to respond to my original blog posting about this issue, and they claim that many of these errors originated with the providers (libraries or commercial services hired to provide metadata about books), not Google. It’s true that no metadata source is perfect. The Harvard Library makes mistakes, too. But nothing on the scale I found in Google Books. The Harvard Library does not have Henry James as the author of “Madame Bovary.”

My guess would be that there was an edition of “Madame Bovary” that had James’ name on it somewhere, maybe as the author of an introduction, and in the automated process of scanning the books, the wrong name got identified as that of the author.

I thought it was a machine error, too, but Google assured me that they had people doing this by hand. In some cases, they got their metadata from a provider in Armenia. They say that they want to have a diversity of sources to get a more complete classification for every book, but that’s just silly. The metadata at the Harvard Library was done by hand by smart people who know how to catalog.

People at Google are also saying, “Let’s crowdsource this,” but that is a stupid idea. You and I are both smart, knowledgeable people, but I wouldn’t trust either of us to do the skilled work of cataloging an 1890 edition of “Madame Bovary.” It’s very difficult. It has to be coordinated by uniform standards. An example of the kind of mess you get when you don’t use uniform standards is Wiktionary (the lexical counterpart of Wikipedia). Unlike an encyclopedia, a dictionary isn’t useful unless it’s consistent in style. And metadata is hard to fix if you don’t get it right in the first place. Someone has to spend a lot of money to properly catalog a research library, and I don’t know if Google understood that going into it.

But surely these books are already cataloged by the libraries whose collections Google is scanning?

Yes, but Google isn’t using an alternate, more comprehensive system, such as the Library of Congress cataloging system. They could license that. In time, you could generate all kinds of interesting new classifications, too, but you have to have the old ones.

What are some of the other problems you’ve had trying to do linguistic research using Google Books?

I can’t find all the volumes of the Century Dictionary (an important lexical reference first published in the late 1800s) in a particular edition at once. Sometimes a volume comes up and sometimes it doesn’t. Sometimes I get volumes from different editions. Serial works are also difficult. I’ve been researching the changing use of the word “sensitivity.” I’d get hits for numbers of a journal that began publishing in the 1950s, so all of them are dated in the ’50s, even though the issue where the word was found is actually from the 1970s.

Then there are problems with the scanning itself. I was researching the history of the word “cad,” and got a result in the Transactions of the Philological Society from the late 19th century challenging the OED definition. But I can’t read the first four pages of it because all four pages are bunched together and there’s someone’s thumb in the image. Now, no one is going to go back and rescan those pages — it would cost more than scanning the whole shelf — so that’s it. As far as the digital collection is concerned, those pages are lost. I could find them by going to the Bodleian Library (in Oxford, England) and asking them to pull that out of whatever deep storage they have it in, but realistically, I’m not going to do that. It’s too difficult to get to.

It’s not like the information is actually lost, however, and it’s not like that information wasn’t just as difficult to get to before Google Books came along.

You’re absolutely right. People have accused me of looking a gift horse in the mouth. Let me be clear: I love Google Books. It’s an amazing resource for scholars. I don’t think they knew what they were getting into, though. Of course, if they hadn’t been insensitive to the subtleties of the task, maybe they wouldn’t have taken it on. A friend who’s worked there told me that it’s a culture that rewards innovation, even if it’s something relatively useless, like a map function that shows you all the place-names mentioned in a book. You get less credit at Google for making sure that old things continue to work well.

Since my initial blog posting, however, Google has shown themselves to be aware of what they’re dealing with. They want to see themselves in the right light and they don’t want to be seen as criticizing librarians. My goal was really to get the librarians to talk to Google, because until recently they’ve been taking it for granted that Google Books will do it right.

Because if this really is the “last library,” as I put it, and no one is going to go back and do all this scanning again, which I think we can all agree is probably the case, then it’s really important that it be done right. And it’s going to cost a lot of money to do it. A disproportionate percentage of the resources have to go to a relatively small percentage of users. That’s what a research library is all about. That is the nature of scholarship.

Referenced in this article:

Geoffrey Nunberg’s original post on the blog Language Log, with comments from Google representatives.

Geoffrey Nunberg’s article about Google Books in the Chronicle of Higher Education.

Laura Miller

Laura Miller is a senior writer for Salon. She is the author of "The Magician's Book: A Skeptic's Adventures in Narnia" and has a Web site, magiciansbook.com.
