EndUser 2006 notes on opening session [Updated]
[Through a series of missteps that I won't go into here, I discovered that I had accidentally deleted this post, first published a few weeks ago. I feel pretty dumb. When I figured out what happened, I sat here, stunned, wondering what to do. Then I remembered Google's good ol' caching capability, did a quick search to call up the cached version of this post, did a quick copy and paste, and voila, problem solved. Well, almost. My error wiped out the original post entirely, meaning that it automatically broke the link to that post as well. There's nothing I can do about that. In the process of reconstituting the content, I made some editorial tweaks throughout.]
(Warning, this is a pretty lengthy post.)
Yesterday was the start of EndUser 2006, Endeavor’s customer conference. Somewhere around 1,000 customers have shown up for this event, some coming from as far away as Australia, New Zealand, several European countries, as well as Canada, Latin America, and of course, the U.S. As I’ve noted before, there are several conference sessions dealing with topics of interest, but yesterday’s highlight was the opening general session featuring a representative from Google who spoke in depth about Google’s Book Search project. Tom Turvey, Head, Google Book Search Partnerships, gave a brief overview of Google and how it makes money, defined the elements of Google Book Search, described the Google Book Search Partner Program (which he oversees), and finally discussed the Library Program portion of Google Book Search. Tom has a long history of working with online content, serving in numerous roles in the publishing industry relating to online delivery, including launching Barnes & Noble’s ebook offerings and most recently holding a senior post at HarperCollins.
Tom began by describing Google’s business. He mentioned that Google now provides 59% of all Internet search referrals. Google’s oft-repeated mission is “to organize the world’s information and make it universally accessible and useful.” Its core business, i.e. how the company makes money, is advertising revenue generated via paid search ads using Google AdSense. Tom also mentioned that Google is the leader, by far, in referrals to book sites (currently it processes about 60% of all such referrals). In describing Google’s business, Tom pointed out some interesting statistics about book purchasing. Thirteen percent of all book purchases are now done online; schools/libraries make up about 24% of the book buying market; direct to consumer purchasing (direct from publishers) is about 2%; and the biggest growth area recently has been in non-bookstore retail (books being purchased in Costco, Sam’s Club, Wal-Mart, etc.).
The next portion of the presentation focused on an explanation of Google Book Search. Tom pointed out that in his experience, never has there been so much misinformation about a product as there has been with Google Book Search (GBS). He commented that 90% of what has been published in the news media is false, thus the importance of explaining exactly what it’s about. GBS, at its heart, is an attempt to associate book content with what searchers are looking for in search engines. There are two main parts to GBS: the Partner Program, and the Library Program. The Partner Program involves relationships and agreements between Google and publishers. GBS launched in October 2004 at the Frankfurt Book Fair. As of now there are literally thousands of publisher partners spanning seven languages. One of the most frequent questions publishers ask Google is, what books are good choices for discovery via GBS? One of Tom’s funnier statements was “we don’t need to help Harry Potter find an audience.” What Google is mostly interested in is the arcane, the obscure, and bringing this material to light via searching GBS. Every page is searchable; users are searching books from cover to cover. There are two ways of providing search on book content: a dedicated search (books.google.com), and integrating book content within the general Google search. The main intent of working with publishers is to drive book sales. Content is protected in a variety of ways (Tom mentioned that as you can imagine, this element of agreements with publishers often gets “into the weeds”). Only 20% of a book is viewable by one user during the course of a month. Print, copy, and save are disabled. Scanned images are purposely low resolution. Publishers can add/remove their material at any time. There is page level security as well. A percentage of pages is never visible at one time.
Google’s process for receiving publisher content is pretty straightforward: the publisher usually sends either a PDF or a print copy. If the latter, Google digitizes it. As an interesting aside to closing out this portion of the talk, Tom mentioned “Oh by the way, the five publishers who are suing Google over the Library Project are actually members of the Partner Program.”
In turning to the third and last portion of the presentation, Tom outlined the elements of the Library Project. Partner libraries, as most people are aware by now, include Stanford, NYPL, Oxford, Michigan, and Harvard. In researching and comparing collections from each partner library, Google discovered that 60% of books are held in only one of the partner libraries. For legal and other reasons, Google began the project by focusing on public domain books. However, public domain books make up only about 20% of a typical library collection. Ten percent of a typical collection is made up of books that are still in print (i.e. the stuff that is handled via the Partner Program). Most books, 90%, are in a fuzzy area in which they may be out of print but still in copyright, or perhaps out of copyright. Seventy percent of collections were published after 1923 and fall into three categories: in copyright, in public domain, or the rights may have reverted. Obviously Google needed to figure out how to solve or address these complexities. Their solution was to offer to scan everything but provide three views: sample pages (partner view), snippet view (book under copyright without an agreement with a publisher partner), and full book view (book is in public domain). The snippet view means that the full text of each book is indexed; users can only view three snippets from the book; there are links to “buy this book” as well as “find in a library”; different categories of books are handled in different ways; and copyright holders may opt out of display and/or scanning.
Obviously a critical factor for Google is optimizing and streamlining the workflow. For example, a key consideration was figuring out how long it takes to scan a typical book. Tom mentioned that in the early days of the project, founder Larry Page and another staff member would use a metronome to time each other over and over again as they tried to figure out how best to scan a book. (Why a metronome? I have no idea and neither did Tom.) Books are scanned as is, including scribbles, marginalia, notes, whatever. Google is aiming to build a comprehensive collection of indexed books but has a long way to go yet on achieving that goal. Some of the challenges they face on a daily basis are 100% OCR accuracy; 100% image quality; search and integration with web search; the accuracy of any affiliated metadata; the existence of lots of “edge cases” in terms of how to process and display the scanned results; how to address books that contain multiple languages and/or scripts; and how best to achieve a good level of speed/automation of the entire process. As with their much-vaunted (and top secret) search algorithms, Google is constantly tweaking the process to try to improve the quality. How do they handle math formulas, spelling correction (Tom used the example of vernacular language that is meant to be spelled a certain way but which looks wrong to a typical spell checker), etc.? What is the best way to deal with automated metadata extraction? Can they figure out an automated way to detect (and appropriately handle) different languages and/or scripts?
Tom made a big point of the fact that Google is actively engaging the library community. Librarians tell Google the good and the bad about GBS (e.g. of bad: too overwhelming for users, hard to know which stuff is authoritative and what is junk, desire to know exactly how the process for scanning and indexing works). Google wants to ensure that GBS works for libraries by making information more discoverable, driving more library usage, and supporting a worldwide community, which is especially relevant for remote and distributed library users. Google has no desire whatsoever to put libraries out of business; in fact, Tom claims that the opposite is true.
[One of the things that I thought was particularly striking was that at one point during the session, Mr. Turvey asked for a show of hands from the audience of those people who were aware of the facts and details he had provided about Google Book Search. To my astonishment, I was one of the few people to raise their hands. Maybe this was just due to some people not fully understanding the question or to some people's innate shyness, who knows. But if it was an indicator of professional ignorance of these matters, then we're in big trouble.]
After concluding his prepared remarks, Tom invited the audience to pose questions. This was perhaps the most interesting portion of the session and Tom handled the questions with aplomb and a dose of wit. Below are my notes of the substance of some of the questions posed, followed by the substance of what I could jot down of Tom’s answers.
Question: When a user sees a link to “find in a library” which leads to Open WorldCat, what librarians want is to have that user come to us rather than use Google and/or buy the book from the publisher. What is your view on this?
Answer: It appears that this is in fact what is happening. Logs show that adding the “find in a library” link, directed to Open WorldCat, has driven a tremendous growth in traffic to WorldCat. Presumably this leads to higher library use.
Question: I’d like to see much more powerful search options, including things like truncation, proximity searching, and boolean capabilities. Is this something Google is considering?
Answer: That’s a very good question, what I’d expect from a librarian <laughter from the audience>. Some of these capabilities are things we are indeed working on, while some of them are already available via the Advanced Search option.
Question: I believe that in search results from publisher content, there is no link to “find in a library” when there is such a link provided in the library search. Why is that?
Answer: Good question. Remember that the goal of GBS is to have a relevant search. The vast majority of books available in GBS at this time are from publishers. Over the next few years, that proportion will flip to emphasize library-owned material. Honestly there is a constant tug and pull between publishers and Google over this issue of how to direct users. Publishers, obviously, participate in GBS to sell more books.
Question: Is there any plan to include Library of Congress Subject Headings (LCSH) as part of the GBS search?
Answer: LCSH and other taxonomies are already used to some extent behind the scenes to assist with determining relevance as well as identifying relationships between books (linking from one book to a related book).
Question: Can you speak about why you are being sued by some of your publisher partners?
Answer: Attorneys love it when you talk publicly about their litigation <much laughter from audience>. Seriously, though, no, I can’t answer that.
Question: Are you indexing each book cover to cover (i.e. full text)? How do you determine relevancy? [Editorial aside: Was this person paying attention? This question was clearly answered in the context of the presentation.]
Answer: Yes, we are doing full text. The ranking/relevancy algorithms used in GBS are pretty much the same as those used in the regular Google search. Some tweaking is of course necessary to make the algorithms relevant for book search. We do user interface testing every month and as a result, we constantly tweak/change the algorithms.
Question: Do you have a formal digital preservation strategy?
Answer: We have agreements with our library partners that cover preservation to whatever degree they have specified in their legal agreements. It really depends on what partner libraries want. Other than that, no, we do not have a formal preservation strategy and do not feel that that is a role we should assume.
Question: Elaborate on how relevant metadata is in GBS.
Answer: Well, first of all, metadata does play a role in GBS but our bias is always toward full text, with metadata/abstracts thought of as secondary. This is probably the opposite of how most libraries would prioritize things.
Question: I have a question on the issue of fair use. Are you working to expand the concept of fair use in terms of scholarly material in particular?
Answer: We feel that our stance on fair use and GBS is very, very significant. We do not have any formal focus on scholarly material in GBS, though.
Question: What is Google’s stance toward the Open Content Alliance? Does Google view them as partners, or competitors?
Answer: We have an open door, a desire to partner and share in digitizing material. We believe that initiatives such as the Open Content Alliance are worthy of our support. However, as you can imagine, there are certain complexities and a lot of politics involved in this kind of interaction. We want to participate in initiatives like this in as open a way as possible.
Question: “Find in a library” links only to WorldCat at present. Does Google have any plans for directing traffic to other bibliographic (i.e. library) databases (this is particularly important for those libraries who aren’t linked from WorldCat)?
Answer: We’d be interested in any other worthwhile bibliographic databases, but WorldCat is it for now.
Question: A single search box is very attractive, but when you expand your data sources (as Google is doing), the simplicity and relevance of this one search become more difficult to maintain. How do you handle this?
Answer: We constantly reevaluate the one box concept and it is an ongoing problem to solve. There is no ready answer.
Question: How do you handle materials from publishers once those materials have gone out of print?
Answer: Good question. Once a publisher’s book goes out of print, they request that it be removed from the index and then it no longer appears in the search. The exception to this would be if there happens to be a copy of that same book that has been scanned and indexed as part of the Library Project. In that case, the book would remain in the index.
Question: Do you have plans for providing regional Google book searches (e.g. one for New Zealand imprints)? This is important for those outside of the U.S. because currently there is such a predominance of U.S. imprints in GBS.
Answer: We already do this, e.g. currently we have 65 regional book searches.
Question: The exposure from GBS for libraries is great, but it needs to be more two way, e.g. to direct users looking for material in a local library catalog to GBS and/or elsewhere. Are there any plans to extend the Google API to be used by libraries for integration into their online catalogs?
Answer: Something like this functionality is present in Google Scholar. We are very happy with this integration with library services and we want to figure out ways to extend this further.
Question: What’s your view on libraries’ development of customized Greasemonkey scripts to integrate library results with GBS?
Answer: Anything that doesn’t violate copyright, we’re all for.
Question: GBS is very exciting. What about developing Google Journals?
Answer: <tongue in cheek> …So we have this thing called Google Scholar…Actually we are working on ways to better integrate or link between GBS and Google Scholar.
Question: There is clearly a balance of power issue relating to the premise that allowing Google to do all this scanning and digitizing of book content puts the burden of proof on the content creator rather than the user. What are your thoughts about this?
Answer: We believe that this is a very important issue and our stance on this hinges on the belief that we are simply being consistent between the indexing of website content and indexing the content of books.
Question: What about working to include government documents, because they do not present a copyright problem?
Answer: Yes, we have a team devoted to this very issue. It is a bigger challenge to do this than it may at first appear because in order to do it we need to work out who is responsible (i.e. the publisher) for each of the multitude of gov docs. Expect progress on this front.


