Wednesday, February 23, 2011

Test Bed for (X)HTML Conventions for Scholarly Publication

The main reason I joined the Institute for the Study of the Ancient World at NYU was to be part of initiating a program of digital publication of peer-reviewed scholarship. We haven't announced anything formally and this blog post isn't that announcement. It is the beginning of a nuts-and-bolts conversation about the markup of digital scholarship that is intended to encourage long-term viability, flexible re-use, and easy display (among many other things).

To get right down to business, http://dl.dropbox.com/u/17002562/isaw-papers-preprint.xhtml is the very temporary URL for a preprint version of "Review of Ptolemaic Numismatics, 1996 to 2007" by Catherine Lorber and Andrew Meadows. I'm very grateful to Andy and Cathy for their willingness to be part of this experiment. Their work is largely done. Now it's up to me to make progress on the markup and I'm hoping to do that in a very public way.

But where to begin the conversation? I think the best approach is to admit I'm in the middle of things and just start laying out issues and thoughts. Keep in mind that everything is subject to change...
  • The format for ISAW digital publications is XHTML with RDFa. XHTML (for now 1.1 but moving to XHTML5) is a widely supported standard with excellent tooling that is directly viewable in many contexts. That makes it appropriate for long-term archival storage of born-digital scholarship.
  • Internal reference structures are important. For now this means each <p> element has an @id. <div> elements of class 'section' also have @id attributes. This is in anticipation of using the semantic elements of HTML5.
  • Named entities will be tagged with links to stable resources describing those entities. For geography, Pleiades. For many other entities, Wikipedia. See below for RDFa patterns.
  • Existing ontologies/vocabularies will be used whenever possible. Geographic entities are typed as "dcterms:Location". That sort of thing.
  • Basic constructs for marking up bibliography and footnote-like structures are lacking in HTML-based markup languages. There are lots of semi-complete "best practices" but narrowing these down to a consistent and flexible convention will be an important process.
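
The internal-reference convention in the list above (every <p> and every div of class 'section' carries an @id) is easy to check mechanically. Here is a minimal sketch using only Python's standard library. The sample fragment and the function name missing_ids are made up for illustration and are not taken from the preprint. A real XHTML 1.1 file that uses named entities such as &nbsp; would likely need a more capable parser.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical fragment for illustration; not taken from the actual preprint.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="section" id="s1">
      <p id="p1">First paragraph.</p>
      <p>Second paragraph, missing its id.</p>
    </div>
    <div class="section">
      <p id="p3">Third paragraph.</p>
    </div>
  </body>
</html>"""


def missing_ids(xhtml_text):
    """Return the tag names of <p> and div.section elements that lack @id,
    in document order."""
    root = ET.fromstring(xhtml_text)
    missing = []
    for el in root.iter():
        # ElementTree reports namespaced tags as "{namespace}localname".
        tag = el.tag.replace("{%s}" % XHTML_NS, "")
        is_section = tag == "div" and "section" in el.get("class", "").split()
        if (tag == "p" or is_section) and el.get("id") is None:
            missing.append(tag)
    return missing


print(missing_ids(SAMPLE))
```

Run against the sample, this reports one <p> and one section div without an @id.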

Looking ahead:
  • Multiple formats will be supported. We will distribute this text as "raw" valid XHTML. It will also be hosted in a more interactive environment that does slick things like make maps, etc. EPUB, PDF... all those are coming. Again, the ease with which a base XHTML representation can be converted to these other formats is one reason to use XHTML.
  • We'll use CC licenses. Right now the document is CC-BY-NC-ND. We'll drop the ND eventually, perhaps the NC as well. The preprint is ND as a signal that a better version is coming from us.


A word on RDFa (a standardized way of embedding information in XHTML pages)...

The basic pattern that I'm using to mark up named entities is illustrated by the sentence:
In a study of tax receipts from early Ptolemaic <a class="citation"
href="http://pleiades.stoa.org/places/991398"
typeof="dcterms:Location" rel="iana:describedby"
property="rdfs:label">Thebes</a>...


That produces the RDF/Turtle
[ a dcterms:Location ;
rdfs:label "Thebes"@en ;
iana:describedby <http://pleiades.stoa.org/places/991398>].
You can see the turtle for the whole document at http://bit.ly/hJjgcx.

An "English" equivalent of the turtle snippet is 'There is a site in the text with label "Thebes" and a description at http://pleiades.stoa.org/places/991398.'
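
To make the mapping from attributes to triples concrete, here is a small Python sketch that reads the anchor element shown above and prints the corresponding Turtle. It is emphatically not an RDFa processor: it handles only this one pattern, the function name anchor_to_turtle is my own, and the "en" language tag is assumed to come from the enclosing document rather than from the anchor itself.

```python
import xml.etree.ElementTree as ET

# The anchor from the example sentence above.
ANCHOR = (
    '<a class="citation" href="http://pleiades.stoa.org/places/991398" '
    'typeof="dcterms:Location" rel="iana:describedby" '
    'property="rdfs:label">Thebes</a>'
)


def anchor_to_turtle(anchor_xml, lang="en"):
    """Turn one <a> element using this markup pattern into a Turtle blank node.

    Assumption: lang stands in for the xml:lang a real RDFa processor
    would inherit from the surrounding document.
    """
    a = ET.fromstring(anchor_xml)
    lines = [
        "[ a %s ;" % a.get("typeof"),                          # @typeof -> rdf:type
        '  %s "%s"@%s ;' % (a.get("property"), a.text, lang),  # @property -> literal
        "  %s <%s> ] ." % (a.get("rel"), a.get("href")),       # @rel + @href -> link
    ]
    return "\n".join(lines)


print(anchor_to_turtle(ANCHOR))
```

Each attribute contributes exactly one statement about the blank node: @typeof gives the type, @property labels the element's text, and @rel together with @href gives the link to the description.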

I like the use of the 'describedby' @rel value here. It's defined in the IANA's register of rel values (http://www.iana.org/assignments/link-relations/link-relations.xml). I take the semantics to be "I'm not saying I'm linking to Thebes itself, only to a description of it." That seems nice and "semantic webby".

There's more to come but I'm getting this out there just to get the ball rolling...

Monday, February 21, 2011

Quick poll: Worldcat, Library of Congress, or Both

There are lots of ways of encoding bibliographic data on the web, but this post isn't about that problem. Instead, I'm wondering what is "the community's" preference between Worldcat and the Library of Congress when creating Semantic Web/Linked Open Data references.

As an example, the URIs http://www.worldcat.org/oclc/829279 and http://lccn.loc.gov/74155758 each lead to information about John Hayes' Late Roman Pottery published in 1972.

Which one of these is preferable as the long-term description of this volume: Worldcat or LOC? The use-case is a digital publication with a bibliography that ideally includes a link to one or the other, or both, for all printed volumes or other appropriate entities.

Perhaps a discussion will ensue in the comments but here are some quick issues:
  • There are multiple URIs for that one volume in Worldcat. http://www.worldcat.org/oclc/462730938 gets you to the Danish Union Catalog.
  • There are still concerns about the licensing of Worldcat data.
  • The LOC record describes a physical volume in a single national library and may not be intended as a description of the abstract concept (e.g. a FRBR Work). I don't know that Worldcat URIs solve this problem but they have the implication of a higher level of abstraction.


Votes and/or comments are appreciated.

Monday, February 7, 2011

Quick poll: Wikipedia or DBPedia?

I've created a poll near the upper right of this page. In longer form: when making persistent "Linked Data/Semantic Web" references to concepts described in Wikipedia, is it "best practice" to link to Wikipedia or to DBPedia? As in, "http://en.wikipedia.org/wiki/Augustus" or "http://dbpedia.org/resource/Augustus"?
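
For what it's worth, the two forms in that example differ only in their prefix, so moving from one to the other is mechanical. A minimal Python sketch, assuming an English Wikipedia article URL and the DBpedia /resource/ pattern shown in the example above (the function name is my own, and nothing here checks that the DBpedia resource actually exists):

```python
from urllib.parse import urlparse

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(url):
    """Map an English Wikipedia article URL to the matching DBpedia URI.

    Assumption: the article title carries over unchanged, as it does in
    the Augustus example.
    """
    parsed = urlparse(url)
    wiki_path = "/wiki/"
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith(wiki_path):
        raise ValueError("not an English Wikipedia article URL: %s" % url)
    return DBPEDIA_PREFIX + parsed.path[len(wiki_path):]


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Augustus"))
```

So choosing one form in the markup doesn't lock a publication out of the other.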

Friday, February 4, 2011

Access to Roman Art: Observations by Peter Stewart

The last few times I've gone to speak about issues of scholarly communication/digital humanities/digital archaeology/etc, I've opened up with a quote from Peter Stewart's 2008 book The Social History of Roman Art [Worldcat]. That's a great little book, and I was particularly pleased when reading it that Stewart is explicit about the effects of access to evidence and images on his selection and narrative. And I was further pleased that he talks about his personal efforts to solve those problems. I'll illustrate this by a series of passages given in their order of appearance:
Unfortunately, my comments in the Introduction about the problems of acquiring images were born out in the book's preparation, and I had very considerable difficulties and delays in acquiring most of the images reproduced here. I therefore owe a special debt to those who helped me to obtain pictures, and to those image-providers who waived or reduced reproduction fees. (p. xv)
Then from that introduction:

To an extent, however, these are all obvious problems of evidence and interpretation which are familiar in any branch of historical study. Other problems are insidious and lie unremarked in the methodological hinterland of books like this one. I have said that the use of examples must be highly selective. But behind any book on Roman art, there are processes of selection that are largely beyond the author’s control. Most Roman art historians will never, in their lifetime, see more than a tiny percentage even of the more significant works that survive. This is not simply because of the magnitude of this great body of material. It is also because most pieces are inaccessible. Many of the finest and most interesting Roman antiquities are in private collections, and many of these are unpublished, sometimes because of scholars’ anxieties about the legality of their origins. However works preserved in museums can be at least as difficult to access. Few museums are able to exhibit more than a small minority of the objects they hold. It is not infrequent (or surprising) for some of the objects in storage to be, effectively, lost, and for other reasons it may be hard for specialists to see material, particularly if it has been excavated recently. New discoveries may take many years to become familiar within the field, and even longer to filter into general, synoptic studies of Roman art.

So, for a variety of reasons, authors depend heavily on other people's publications of Roman art, where they exist, and on their illustrations. The photographs themselves are usually supplied by the museums that own the work concerned, or sometimes by commercial agencies. In many cases no photograph exists, and new photography may not be permitted. In other cases, the acquisition of photographs proves lengthy or impossible. Moreover, the photographs (especially colour images) and the permission to reproduce them in print can be extremely costly both for individual authors and for their publishers. (p. 8)

The passages need to be read in context. It's not an angry book, and these introductory comments are followed by an interesting and challenging extended essay on the topic indicated by the title. I can highly recommend it. But back to the issue of access, here's a passage from the closing Bibliographical essay:
Finally, the photo-sharing website flickr.com contains thousands of images relevant to Roman art, many of them with 'Creative Commons' copyright licenses that make them easy to use legitimately for, e.g. educational purposes. Within that site the 'Chiron' group especially is dedicated to making images available for classical teaching and research. This site carries many of my own photographs (under the screen name 'Tintern'), including colour images of the House of the Vettii and other sites mentioned in this book. (p. 174)
So mad props to Dr. Stewart for raising the issue of access and then doing something about it. A book from CUP in which the author cites his flickr.com account? That's progress.