Tuesday, September 22, 2009

Google Book a disaster for scholars

After the Department of Justice formally raised concerns about the Google Book settlement last week, the deal is in the news and generating lots of heat and light on listservs, blogs, and in the standard news media.

Most of the criticism is focussed on the Google monopoly and privacy issues, so I thought it was worth referring back to Geoffrey Nunberg's piece in the Chronicle at the end of last month, Google's Book Search: A Disaster for Scholars, where he outlines his concerns about the poor handling of metadata, which will lead to usability problems with Google Book Search.
"we're sometimes interested in finding a book for reasons that have nothing to do with the information it contains, and for those purposes googling is not a very efficient way to search. If you're looking for a particular edition of Leaves of Grass and simply punch in, "I contain multitudes," that's what you'll get. For those purposes, you want to be able to come in via the book's metadata, the same way you do if you're trying to assemble all the French editions of Rousseau's Social Contract published before 1800 or books of Victorian sermons that talk about profanity.

Or you may be interested in books simply as records of the language as it was used in various periods or genres. Not surprisingly, that's what gets linguists and assorted wordinistas adrenalized at the thought of all the big historical corpora that are coming online. But it also raises alluring possibilities for social, political, and intellectual historians and for all the strains of literary philology, old and new. With the vast collection of published books at hand, you can track the way happiness replaced felicity in the 17th century, quantify the rise and fall of propaganda or industrial democracy over the course of the 20th century, or pluck out all the Victorian novels that contain the phrase "gentle reader."

But to pose those questions, you need reliable metadata about dates and categories, which is why it's so disappointing that the book search's metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.

Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux's La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams's Culture and Society 1780-1950, and Robert Shelton's biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf's letters is dated 1900, when she would have been 8 years old. Tom Wolfe's Bonfire of the Vanities is dated 1888, and an edition of Henry James's What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google's book search, but these errors are endemic...

I have the sense that a lot of the initial problems are due to Google's slightly clueless fumbling as it tried to master a domain that turned out to be a lot more complex than the company first realized. It's clear that Google designed the system without giving much thought to the need for reliable metadata. In fact, Google's great achievement as a Web search engine was to demonstrate how easy it could be to locate useful information without attending to metadata or resorting to Yahoo-like schemes of classification. But books aren't simply vehicles for communicating information, and managing a vast library collection requires different skills, approaches, and data than those that enabled Google to dominate Web searching."
