Google Books “Science” Article February 26, 2011

Posted by Christopher Lemery in Google.
A while back I read the much-discussed Science article entitled “Quantitative Analysis of Culture Using Millions of Digitized Books” (thanks to Librarian.net, you can read it here), which describes some lexicographic and cultural analysis the authors performed on the Google Books database (or “corpus,” as the authors call it). It was a really, really interesting article, and the graphs they provide are great. One graph in particular illustrates that, of their estimate of 1 million words in the English lexicon, only about half are in the Oxford English Dictionary. This reminded me how ridiculously time-consuming compiling a dictionary is (read the superb The Meaning of Everything for a good description of this) and of Erin McKean’s great TED talk about the evolution of the dictionary. But I digress. In any event, the Science article illustrates once again how useful and potentially revolutionary Google Books is, and at first I thought this field of “culturomics” could be earth-shattering.

As usual, the excellent Geoff Nunberg put the Science article into the proper perspective. In an article in the Chronicle Review (which I just got around to reading), Nunberg notes that quantitative methods have been around for a long time and that this use of Google Books is a jump in scale rather than in kind from previous efforts. He also notes that the ways the data can be searched and manipulated leave a lot to be desired, particularly in comparison with similar tools such as the Corpus of Historical American English, which I’d never even heard of! Nunberg notes that culturomics will likely be subsumed into already-present fields and won’t replace the need for literary criticism or for scholars to evaluate and understand the datasets that culturomics produces. I tend to agree with Nunberg’s conclusions, but it’s also true that there will be ways to use Google Books that we probably haven’t thought of yet, so the jury is still out on “culturomics.”

The Chronicle article also noted that metadata errors are a continuing problem with Google Books, again reminding me why the BIP project is so vital. As sophisticated as computers get, the maxim of “garbage in, garbage out” still holds and even Watson gets things really wrong!

My Tenure with the BIP/Google Team February 10, 2011

Posted by Christopher Lemery in BIP, Google.
I thought it would be a good idea to go into some detail about what my current position with the Penn State Libraries entails. I am currently the head of the Barcoding Inventory Project (BIP) Team. I began this position in January of 2009 serving under Jackie Dillon-Fast. The Barcoding Inventory Project was started in 2005 with the goal of placing barcode labels on all items in the three Libraries Annex facilities that lacked labels. The vast majority of the monographs were already done, but the serials were almost entirely untouched. As you can imagine, this was a huge project. (You can find the presentation Jackie and I did in 2009 on the BIP project here.) By the time I arrived, the largest annex facility, Cato I, was done, thanks to Jackie’s hard work. When I started, work on the items in the Academic Activities Annex had just begun. And yes, “Academic Activities” is the single most generic name for a campus building ever. But we’re next to the building with a (really small) nuclear reactor, so maybe there’s a limit on the number of exciting buildings per block.

Anyway, the day-to-day work involves pulling serial titles that need to be barcoded off the shelves. We then place a label on each bound item in the series and add the barcode number and item information for each item to the correct catalog record. The items then go back on the shelf. It’s not very exciting, but it is vital to Library users’ ability to find and use our collection. The job does provide a unique sense of accomplishment, though. It’s great to be able to fix really messed-up records and know that the fixes I’ve made will make things easier to find. It’s also nice to know that a certain amount of physical material is done. Hard numbers make it easier to see how astonishing our progress has been.

The BIP project intersects with the CIC Google Books project in that we’ve had to process some stuff before it can go off to Google to be scanned. In fact, Google is sort of paying my salary, so they’re really integral to the BIP project. People ask me what types of things Google wants, but I never have a definite answer, because their stated answer to that question is “everything.” They seem to be doing a lot of grey literature (technical reports and such), though, so having that stuff out there will be a huge boon to researchers. Of course, there is also the occasional item that I can’t imagine anyone wanting to look at, but having it digitized is better than having it just sit there, too.

More than anything, my tenure has again reinforced how many vital behind-the-scenes jobs there are in libraries that no one knows about but without which libraries wouldn’t be nearly as cool as they are!