Wednesday, October 21, 2009

Brief Thoughts on EPUB Books at Google

I've been playing with downloading EPUB books from Google. EPUB is a format for digital publications aimed at portable readers. That's not what I care about right now. What's cool is that it uses plain old XHTML and standard image formats to represent the contents of a book. That means if you can unpack an EPUB file, which is very easy, you have access to the text and images in readily consumable form.

I'm not the first to point this out. See Greg Crane's "What Do You Do with a Million Books?" for early thinking on the large-scale implications of Google's work.

In terms of playing, here's what's fun. If you go to the Google Books page for H. Chase's Catalogue of Arretine pottery from the MFA, you'll see a link to download the "EPUB" version.

Once you've downloaded that file, it's easy to unpack. I'm a Mac/Linux user. If you are too, and you like the command line, 'unzip Catalogue_of_Arretine_pottery.epub' will do the trick. Otherwise, change the extension to ".zip" and double-click on the file. I'm sure something similar will work in Windows.
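If you'd rather script the unpacking, here is a minimal Python sketch that does the same thing as 'unzip'. It relies only on the fact that an EPUB file is a zip archive. The function name 'unpack_epub' and the output directory are my own choices for illustration, not anything defined by the format.

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path, dest_dir):
    """Extract every member of an EPUB (a zip archive) into dest_dir.

    Returns the list of member names found in the archive.
    unpack_epub is a hypothetical helper name used for this sketch.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling unpack_epub("Catalogue_of_Arretine_pottery.epub", "Catalogue_of_Arretine_pottery") should leave the unpacked files in a directory of that name, with no need to rename the file to ".zip" first.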

Once unpacked, you have two directories, 'OEBPS' and 'META-INF'. The first is the one with all the goodies in it. Open 'OEBPS/images' and you'll see the plates from the book. Those files aren't hi-res, but better than nothing.

The text is in the 'Content-###.xml' files. These can be opened in a browser directly.
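Because those files are XHTML, it's also easy to pull out plain text with a script. The following is a rough sketch using only Python's standard library. It assumes the files match the pattern 'Content-*.xml' and are UTF-8 encoded, which I haven't verified for every Google EPUB. The names 'content_text' and 'search_book' are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects the non-blank text chunks found between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def content_text(path):
    """Return the text of one content file with the markup stripped."""
    parser = _TextCollector()
    # Assumption: the file is UTF-8; undecodable bytes are replaced.
    parser.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks)


def search_book(oebps_dir, term):
    """Return names of Content-*.xml files whose text contains term.

    The match is case-insensitive and runs on the stripped text.
    """
    hits = []
    for path in sorted(Path(oebps_dir).glob("Content-*.xml")):
        if term.lower() in content_text(path).lower():
            hits.append(path.name)
    return hits
```

Something like search_book("OEBPS", "stamp") would then list the content files that mention a word. The parser is deliberately lenient, so it also picks up any text inside title or style elements; that seemed an acceptable trade-off for a quick look through a book.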

As people like Greg have noted, cool things will happen when communities, such as scholars/enthusiasts of the ancient Mediterranean world, take these files and add value to them. In the meantime, I like being able to get at the images, and to have the text on my hard drive so it's available for searching. On the Mac, Spotlight does a good job of indexing the Content files. It also indexes the compressed archives when their extensions are ".zip". It seems to ignore the ".epub" files but I bet that will change soon enough.

3 comments:

Jeremy said...

Excellent. Gives me an idea of what else to do with Tomber and Dore's "Roman Pottery Fabrics" when we re-publish it online shortly. I've put it all in XML and will transform it to HTML but looks like we could pack it up as EPUB pretty easily. I wanted to take a look at the EPUB version of the Chase book you mentioned to learn a bit more about it but unfortunately I can't see the link to it. Perhaps I need to buy it?

Sebastian Heath said...

Firstly, Tomber and Dore online! Nice. I look forward to that. And in nicely structured/reusable formats? Even better!

Here's a direct link to the EPUB version http://books.google.com/books/download/Catalogue_of_Arretine_pottery.epub?id=kEAOAQAAIAAJ&output=epub&source=gbs_v2_summary_r&cad=0 . I don't know if Google does different things for different regions (I'm assuming you're in England). It's free here in the US.

And, yes, I too would rather reverse engineer than RTFM.

Jeremy said...

:-) to the reverse engineering point! Yup, I've had a look at the Wikipedia page you linked to and it's pretty clear, but there's nothing like browsing through a complete example. The link you sent (same one as in your post, I guess) doesn't work for me but I suspect you're onto the real reason because yes, I'm in the UK and it looks like we don't get to see it. Hey ho.
Hope to get Tomber and Dore up in the next few months. It's very delayed already but when it goes live it will be, well, at least no longer out-of-print! I welcome any ideas on how to make it more useful to all, so I'll let you know when it's available.