Tuesday, May 25, 2010

Me @ NYU/ISAW

Briefly... I'm sitting in an office at NYU's Institute for the Study of the Ancient World, where I am now a Visiting Scholar. This is preliminary to a more permanent position with details to come.

My main goal is to work on issues of digital publication and on integration of diverse digital resources. I had started collaborating with ISAW-folk on these issues some time back, which is why I've been blogging about them.

I'm extremely excited to be working with my new colleagues here - a veritable dream-team of digital humanists - and am looking forward to making real progress when it comes to sharing well-structured, semantically-rich, open-licensed scholarship about the Ancient World.

And I'll still be collaborating with my long-time friends at the ANS, particularly on Nomisma.org. And field-work goes on.

Back to work...

Wednesday, May 19, 2010

RDFa Document Metadata: Authors in PLOS One

Brief follow up to yesterday's post.

Here's the HTML that indicates authorship from an example PLOS One article.
<p xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="authors" xpathlocation="noSelect">
<span rel="dc:creator"><span property="foaf:name">Harold C. Sox</span></span><sup><a href="#aff1">1</a></sup>, <span rel="dc:creator"><span property="foaf:name">Mark Helfand</span></span><sup><a href="#aff2">2</a></sup><sup><a href="#cor1" class="fnoteref">*</a></sup>,
<span rel="dc:creator"><span property="foaf:name">Jeremy Grimshaw</span></span><sup><a href="#aff3">3</a></sup>,
<span rel="dc:creator"><span property="foaf:name">Kay Dickersin</span></span><sup><a href="#aff4">4</a></sup>, <span class="capture-id">the <i>PLoS Medicine</i> Editors</span>,
<span rel="dc:creator"><span property="foaf:name">David Tovey</span></span><sup><a href="#aff5">5</a></sup>, <span rel="dc:creator"><span property="foaf:name">J. André Knottnerus</span></span><sup><a href="#aff6">6</a></sup>,
<span rel="dc:creator"><span property="foaf:name">Peter Tugwell</span></span><sup><a href="#aff7">7</a></sup>
</p>
<p xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:aml="http://topazproject.org/aml/" class="affiliations" xpathlocation="noSelect">
<a name="aff1" id="aff1"></a><strong>1</strong> Dartmouth Institute, Dartmouth Medical School, Hanover, New Hampshire, United States of America,
<a name="aff2" id="aff2"></a><strong>2</strong> Portland VA Medical Center and Department of Medicine, Oregon Health &amp; Science University, Portland, Oregon, United States of America,
<a name="aff3" id="aff3"></a><strong>3</strong> Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, Ontario, Canada,
<a name="aff4" id="aff4"></a><strong>4</strong> Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America,
<a name="aff5" id="aff5"></a><strong>5</strong> The Cochrane Library, London, United Kingdom, <a name="aff6" id="aff6"></a>
<strong>6</strong> Department of General Practice, University of Maastricht, Maastricht, The Netherlands,
<a name="aff7" id="aff7"></a><strong>7</strong> Departments of Medicine, and Epidemiology and Community Medicine, University of Ottawa, Ottawa, Ontario, Canada
</p>


The basic structure is two 'p' elements, one with a 'class="authors"', the second with 'class="affiliations"'. I am trying to avoid using @class to indicate document structure and metadata, so yesterday I adopted the 'bibo:authorList' convention. But it is useful to see another instance of the nested 'rel="dc:creator"'->'property="foaf:*"' pattern. Is that beginning to look like a trend?

The relationship between author and affiliation is a little broken. The reference from each author to his/her affiliation is actually to an 'a' element with no content. An automatic agent might return an empty string as the affiliation unless it had ad hoc code to pull the text as far as the next '<a>' or '</p>' tag. That's not particularly helpful.
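To make the problem concrete, here is a minimal Python sketch (standard library only) of what an agent has to do. The markup is a trimmed copy of the affiliations paragraph above; the class and function names are my own illustration, not anything PLOS provides. The text inside each 'a' anchor is empty, so the affiliation has to be gathered from whatever follows, up to the next anchor or the closing 'p'.

```python
from html.parser import HTMLParser

# Trimmed copy of the affiliations paragraph quoted above (two entries only).
AFFILIATIONS_HTML = (
    '<p class="affiliations">'
    '<a name="aff1" id="aff1"></a><strong>1</strong> Dartmouth Institute, '
    'Dartmouth Medical School, Hanover, New Hampshire, United States of America, '
    '<a name="aff3" id="aff3"></a><strong>3</strong> Clinical Epidemiology Program, '
    'Ottawa Hospital Research Institute, Ottawa, Ontario, Canada'
    '</p>'
)


class AffiliationParser(HTMLParser):
    """Hypothetical helper: gather the text that follows each empty anchor."""

    def __init__(self):
        super().__init__()
        self.anchor_text = {}   # text found inside each <a>; empty in this markup
        self.affiliations = {}  # text gathered up to the next <a> or </p>
        self._current = None    # id of the anchor we are collecting for
        self._in_anchor = False
        self._in_strong = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "id" in attrs:
            self._current = attrs["id"]
            self._in_anchor = True
            self.anchor_text[self._current] = ""
            self.affiliations[self._current] = ""
        elif tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False
        elif tag == "strong":
            self._in_strong = False
        elif tag == "p":
            self._current = None

    def handle_data(self, data):
        if self._current is None or self._in_strong:
            return  # skip the visible "1", "3" labels
        if self._in_anchor:
            self.anchor_text[self._current] += data
        else:
            self.affiliations[self._current] += data


def parse_affiliations(html):
    """Return (text inside each anchor, ad hoc affiliation text per anchor id)."""
    parser = AffiliationParser()
    parser.feed(html)
    parser.close()
    cleaned = {
        key: value.strip().rstrip(",").strip()
        for key, value in parser.affiliations.items()
    }
    return parser.anchor_text, cleaned


anchor_text, affiliations = parse_affiliations(AFFILIATIONS_HTML)
print(repr(anchor_text["aff1"]))  # what following the '#aff1' reference literally yields
print(affiliations["aff1"])       # what the ad hoc code recovers
```

The second dictionary only exists because of the "collect until the next anchor" rule, which is exactly the sort of ad hoc code I'd rather not have to write.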

It is important to be clear that this HTML is rendered from XML encoded in the National Institutes of Health's Journal Publishing Tag Set Version 2.0. That's my way of acknowledging that the markup delivered to your browser doesn't bear the full weight of being a well-structured archival version.

Tuesday, May 18, 2010

Towards a metadata header for XHTML5+RDFa1.1 Digital Publications

XHTML5 defines elements such as 'header' and 'summary' that improve the constructs for indicating document metadata. But it is not a finished solution for embedding these concepts in a born-digital scholarly publication. In this post I take an initial crack at a decent way of doing this.

To cut to the chase, here's a sample document:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:dctypes="http://purl.org/dc/dcmitype/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
about="http://example.org/digpub"
typeof="dctypes:Text"
>


<head>
<title property="dc:title">Guidelines for Using XHTML5 to encode Digital Publications</title>
<base href="http://example.org/digpub"/>
</head>
<body>
<header>
<div rel="bibo:authorList">
<ul rel="rdf:Seq">
<li rel="rdf:li">
By <span rel="dc:creator">
<span rel="foaf:Person">
<span property="foaf:name" rel="owl:sameAs" resource="http://en.wikipedia.org/wiki/Albert_Gallatin">Albert Gallatin</span>
</span>
</span>
</li>

<li rel="rdf:li">
and <span rel="dc:creator">
<span rel="foaf:Person">
<span property="foaf:name" rel="owl:sameAs" resource="http://en.wikipedia.org/wiki/William_Alexander_Hammond">William Alexander Hammond</span>
</span>
</span>
</li>
</ul>
</div>
<summary property="dc:description" xml:lang="en">An abstract in English.</summary>
<summary property="dc:description" xml:lang="fr">Un résumé en Française.</summary>
</header>
<section>
<h1>Section 1</h1>
<p>Your text here.</p>
</section>
</body>
</html>
And here's the turtle representation of the embedded RDF:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://www.w3.org/1999/xhtml> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://example.org/digpub>
dc:description "An abstract in English."@en, "Un résumé en Française."@fr ;
dc:title "Guidelines for Using XHTML5 to encode Digital Publications" ;
bibo:authorList [
rdf:Seq _:bnode1
] ;
a dctypes:Text .

_:bnode1
rdf:li [
dc:creator [
foaf:Person [
owl:sameAs <http://en.wikipedia.org/wiki/Albert_Gallatin> ;
foaf:name "Albert Gallatin"
]
]
], [
dc:creator [
foaf:Person [
owl:sameAs <http://en.wikipedia.org/wiki/William_Alexander_Hammond> ;
foaf:name "William Alexander Hammond"
]
]
] .


Even if you don't read turtle that might make some sense.

  • <http://example.org/digpub> after the prefixes means we're defining attributes of a document at that URI.
  • The last line of this indented section just says that the document is 'a dctypes:Text' resource.
  • dc:description "...." means the RDF extractor has found the Dublin Core description (more or less used as 'abstract'). The abstract is available in two languages as indicated by the 'xml:lang' attribute in the document.
  • Same for dc:title. Note that I don't put the title in the 'body' element because html-family encoding schemes want it up in the 'head'. Perhaps the title should be repeated in the 'body/header'. I'm inclined to think that can be done on delivery to a browser when a document is published via a web-server. The archival version should not have such repetition.
  • We then come to a 'bibo:authorList', the contents of which are specified following the line beginning '_:bnode1'. The "Bibliographic Ontology" (here 'bibo') uses this construct for multi-authored works. I'm not sure I like it. Especially since it imposes the extra nesting of rdf:Seq and rdf:li. But if 'bibo' is widely adopted (which it sort of is) then it's not my place to complain. Conform to the standard and move on. The contents of each rdf:li in a bibo:authorList are not well defined in the spec. I looked through the bibo examples, adopted its use of dc:creator and foaf:Person, and then added an owl:sameAs for good measure.
My point in doing all this is to make use of existing standards that allow a corpus of born-digital scholarship to represent metadata in a machine-recognizable fashion that also allows the "text parts" to be human readable. I'm just at the beginning of this project so I welcome suggestions of where I can look for good models.
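
As a rough illustration of what "machine-recognizable" can mean in practice, here is a small Python sketch that reads a trimmed copy of the sample document with the standard library's XML parser and pulls out the title, authors and abstract. It is emphatically not an RDFa processor: it just looks for the @property values used in my sample, and the function name 'extract_metadata' is made up for this example. A real extractor would resolve the prefixes and produce the triples shown in the Turtle above.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the sample document (English abstract only, no doctype).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:bibo="http://purl.org/ontology/bibo/"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"
      xmlns:owl="http://www.w3.org/2002/07/owl#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      about="http://example.org/digpub">
<head><title property="dc:title">Guidelines for Using XHTML5 to encode Digital Publications</title></head>
<body><header>
<div rel="bibo:authorList"><ul rel="rdf:Seq">
<li rel="rdf:li">By <span rel="dc:creator"><span rel="foaf:Person">
<span property="foaf:name" rel="owl:sameAs" resource="http://en.wikipedia.org/wiki/Albert_Gallatin">Albert Gallatin</span>
</span></span></li>
<li rel="rdf:li">and <span rel="dc:creator"><span rel="foaf:Person">
<span property="foaf:name" rel="owl:sameAs" resource="http://en.wikipedia.org/wiki/William_Alexander_Hammond">William Alexander Hammond</span>
</span></span></li>
</ul></div>
<summary property="dc:description" xml:lang="en">An abstract in English.</summary>
</header></body>
</html>"""


def extract_metadata(xhtml_text):
    """Collect title, authors and abstracts by matching literal @property values."""
    root = ET.fromstring(xhtml_text)
    meta = {
        "about": root.get("about"),
        "title": None,
        "authors": [],       # (name, owl:sameAs target) in document order
        "descriptions": {},  # language code -> abstract text
    }
    for element in root.iter():
        prop = element.get("property")
        text = (element.text or "").strip()
        if prop == "dc:title":
            meta["title"] = text
        elif prop == "foaf:name":
            meta["authors"].append((text, element.get("resource")))
        elif prop == "dc:description":
            meta["descriptions"][element.get(XML_LANG)] = text
    return meta


metadata = extract_metadata(SAMPLE)
print(metadata["title"])
for name, same_as in metadata["authors"]:
    print(name, same_as)
```

Because the sample is well-formed XML, an ordinary XML parser is enough for this kind of quick check, and the same file still reads as a normal page in a browser.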

Wednesday, May 12, 2010

New KML files for hoards and mints on nomisma.org

Nomisma.org is the project I work on with my colleagues at the ANS and elsewhere to establish stable URIs for numismatic concepts. Development sometimes moves slowly but I've recently added new functionality for the mapping side of things. I'm facing one annoying bug but I think it's worth reporting this progress. So...

  1. http://nomisma.org/id/eretria will bring up the html page for the mint of Eretria in Greece. You'll see a very brief label for the site, co-ordinates, and a link to the relevant Wikipedia article.
  2. http://nomisma.org/kml/eretria.kml is a kml file that just shows the location of the mint.
  3. http://nomisma.org/kml/eretria-all.kml is much more fun. It shows the location of the mint plus all the mappable hoards that have Eretrian coins in them. 'Mappable' is just an indication that we haven't entered findspots for all hoards. But we're moving as fast as we can.

This pattern is generalized. http://nomisma.org/kml/babylon.kml and http://nomisma.org/kml/babylon-all.kml do what you would expect.

http://nomisma.org/kml/igch0262.kml shows just the findspot of the hoard. http://nomisma.org/kml/igch0262-all.kml shows findspot and location of the mints of the coins found in the hoard.
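
The URL pattern is simple enough to sketch in a few lines of Python. The function name 'nomisma_urls' is only for illustration, and I am assuming here that the '/id/' page pattern shown for Eretria holds for the other identifiers as well.

```python
def nomisma_urls(identifier):
    """Build the page and KML URLs for a nomisma.org id such as 'eretria'."""
    base = "http://nomisma.org"
    return {
        "page": f"{base}/id/{identifier}",             # html page (assumed for all ids)
        "kml": f"{base}/kml/{identifier}.kml",         # just the mint or findspot
        "kml_all": f"{base}/kml/{identifier}-all.kml"  # plus related hoards or mints
    }


for identifier in ("eretria", "babylon", "igch0262"):
    for kind, url in nomisma_urls(identifier).items():
        print(f"{identifier:10} {kind:8} {url}")
```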

Open these files in Google Earth for best effect.

There are links to the related kml files on each page and I've also put <link> elements in each page's head (cf. S. Gillies' blog post for discussion).

The annoying bug is that when I show those maps on the site using the Google Maps API, not all mints or findspots appear. Not sure why that is, but I'm guessing I've got something incorrectly formatted. Or there is some limit in how many Network Links the Maps API will load in a short period of time. I'll investigate and fix.

More concisely, nomisma.org will show you a mint and findspots for its coins. As noted, not all information is entered; but we can begin to talk about the site and its data as a resource for mapping economic connections within the Ancient Mediterranean and Near East.

Tuesday, May 11, 2010

Document and Concept: '#this' and how DBpedia does it

I'm following up on yesterday's post in which I looked at the distinction between 'concept' and 'document' as well as its implications for scholarly practice. To be honest, I'm not sure I've really addressed the scholarly practice aspect of this thread but that's where I'm heading. I'll give a preview at the very end of this post.

Yesterday I asked, "Is there an unambiguous and widely-accepted convention for indicating the concept lying behind a document?". Gabriel Bodard left a comment noting the convention of appending '#this' to indicate that a URI is a reference to the real-world concept rather than the document describing that concept. This is definitely worth considering.

As an aside, Gabby (if I may) is correct that it's hard to look for documentation of the convention since 'this' is understandably ignored by search engines. There's the W3 document 'Cool URIs for the Semantic Web', which does discuss '#this'. I'm not sure if that's the original citation but that title is definitely on the suggested reading list for this topic. As is 'Linked Data Tutorial - NG: Publishing and consuming linked data with RDFa', which I was reminded to look at anew by Sean Gillies.

I have reservations about '#this'. Some of them are aesthetic but that's not a strong leg to stand on. Practically, I don't like having to inspect the internal characters of a URI to figure out its semantics. I also wonder whether the convention has really taken off. The 'Linked Data Tutorial' was published after 'Cool URIs' so it may be indicative that it doesn't discuss '#this'. I'm also not sure it's good to devote the '#' mechanism (aka fragment identifiers) to representing metadata rather than maintaining its original purpose of specifying internal portions of a document. But if '#this' comes to rule the world, I'll happily use it.

The 'Linked Data Tutorial' does use DBpedia in its examples so I want to look more closely at how that site handles the 'Document/Concept' distinction. In truth, I didn't find an explicit discussion of the topic on the DBpedia site itself. Maybe I just didn't come across it so I'd welcome a link. I did find the following on the OpenLink site: "the URI prefixes http://dbpedia.org/resource/..., http://dbpedia.org/page/... and http://dbpedia.org/data/... distinguish between a resource and its HTML or RDF description documents". OpenLink is the creator of Virtuoso, the software that powers DBpedia's SPARQL-endpoint, so I'll take that statement as definitive until I find something more authoritative.

Time to get into details... http://dbpedia.org/resource/Antioch is the URI for the concept 'Antioch: the ancient city'. Clicking on that URI will cause your browser to be redirected to the document http://dbpedia.org/page/Antioch . That's great. We have a clean separation between concept and document.
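
The prefix convention can be sketched as plain string handling in Python. Nothing here dereferences a URI or follows the redirect; the function names are hypothetical and the three prefixes are simply those named in the OpenLink statement quoted above.

```python
from urllib.parse import urlsplit, urlunsplit

# The three path prefixes named in the OpenLink statement.
DBPEDIA_KINDS = ("resource", "page", "data")


def dbpedia_variant(uri, kind):
    """Rewrite a DBpedia URI to its 'resource', 'page' or 'data' form."""
    if kind not in DBPEDIA_KINDS:
        raise ValueError(f"unknown kind: {kind}")
    parts = urlsplit(uri)
    segments = parts.path.split("/")  # "/resource/Antioch" -> ["", "resource", "Antioch"]
    if (
        parts.netloc != "dbpedia.org"
        or len(segments) < 3
        or segments[1] not in DBPEDIA_KINDS
    ):
        raise ValueError(f"not a DBpedia resource/page/data URI: {uri}")
    segments[1] = kind
    return urlunsplit(
        (parts.scheme, parts.netloc, "/".join(segments), parts.query, parts.fragment)
    )


def is_concept(uri):
    """True when the URI uses the 'resource' prefix, i.e. names the concept."""
    return urlsplit(uri).path.startswith("/resource/")


concept = "http://dbpedia.org/resource/Antioch"
document = dbpedia_variant(concept, "page")
print(document)
print(is_concept(concept), is_concept(document))
```

Of course this is precisely the "inspect the internal characters of a URI" approach I grumbled about above, which is why an explicit statement inside the document is more attractive.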

Looking at the source of 'page/Antioch' (I'll use that shorthand going forward) shows that this document uses RDFa to embed semantic information in human-readable html. We could switch that around. RDFa allows human-readable text to be embedded in machine-parsable data. I'm not sure it matters, which is the main point.

DBpedia even references the RDFa 1.0 DTD: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">. That's very cool and very correct. When RDFa 1.1 is published, I'm counting on DBpedia to be at the forefront of adoption.

The 'resource/Antioch' URL appears three times in the 'page/Antioch' document. The following link elements are in the header:
  • <link rel="foaf:primarytopic" href="http://dbpedia.org/resource/Antioch"/>
  • <link rev="describedby" href="http://dbpedia.org/resource/Antioch"/>

The body start tag looks like this:
  • <body onload="init();" about="http://dbpedia.org/resource/Antioch">
Ignore the @onload, it's the @about that's interesting. It's just RDFa to say that all the parsable information in the document describes the resource http://dbpedia.org/resource/Antioch .

But far more interesting to me is the 'rev="describedby"' in the quoted link element of the document's head. Note that it's 'rev', not 'rel'. The meaning of the whole element is "The current document describes the resource at http://dbpedia.org/resource/Antioch". Yes, that's similar to the @about of the body. I really like the distinctiveness of using @rev . It's easily accessible by javascript or by an RDFa extractor. And I like that I can point to a major player in the Linked Data world as a precedent. That gives it a sense of de facto standard. And a little googling of 'describedby' found instances on the W3 site. It seems it's not quite an officially accepted standard but, again, it's nice to see a major player possibly getting behind 'describedby'.
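
To show how accessible the @rev value is, here is a minimal Python sketch using only the standard library. The markup is a trimmed copy of the elements quoted above; the class and function names are my own, and a real RDFa extractor would of course do much more than this.

```python
from html.parser import HTMLParser

# Trimmed copy of the link and body elements quoted from 'page/Antioch'.
PAGE_ANTIOCH = """<head>
<link rel="foaf:primarytopic" href="http://dbpedia.org/resource/Antioch"/>
<link rev="describedby" href="http://dbpedia.org/resource/Antioch"/>
</head>
<body onload="init();" about="http://dbpedia.org/resource/Antioch"></body>"""


class DescribedByFinder(HTMLParser):
    """Hypothetical helper: collect @href from <link rev="describedby"> elements."""

    def __init__(self):
        super().__init__()
        self.concepts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link .../> elements also arrive here by default.
        attrs = dict(attrs)
        rev_values = (attrs.get("rev") or "").lower().split()
        if tag == "link" and "describedby" in rev_values and attrs.get("href"):
            self.concepts.append(attrs["href"])


def described_concepts(html):
    """Return the concept URIs that this document says it describes."""
    finder = DescribedByFinder()
    finder.feed(html)
    finder.close()
    return finder.concepts


print(described_concepts(PAGE_ANTIOCH))
```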

So it's worth asking if this is a convention that others might be willing to adopt. Any takers or comments? Is @rev too obscure? Other objections?

I also want to briefly point out that the DBpedia 'page/...' documents make some effort to be clear to human readers that they are describing resources. The link at the top of 'page/Antioch' is to 'resource/Antioch'. This could be clearer but is a start.

And as for scholarly practice, I'll just briefly say that this discussion is in part inspired by the observation that Concepts should be permanent, Documents may be temporary. Looking back to the Geonames discussion of yesterday, I will not hold it against geonames.org if it stops responding to the URL http://www.geonames.org/3020251/embrun.html . Maybe html will fall out of use someday. It will be annoying if the string of characters http://sws.geonames.org/3020251/ ceases to mean anything. Actually, I wish they'd remove the 'sws' cruft from that URL but that's their choice. Scholarship likes permanence and to the extent that the distinction between document and concept is clearly maintained, scholarly practice will be well served.

Monday, May 10, 2010

Concept and Document in the Ancient World Semantic Web

This post is really just me taking some notes on semantic web usage. Apologies if it's too discursive but I'm just at the gathering info stage right now.

Along with some colleagues, I've been thinking about the relationships between concrete action and scholarly intent that are inherent in the links we make when creating digital publications.

First some background. Here's a "test" sentence, along with its html.

Themistocles was born in Athens.
Or:
<a href="http://en.wikipedia.org/wiki/Themistocles">Themistocles</a> was born in <a href="http://en.wikipedia.org/wiki/Athens">Athens</a>.


http://en.wikipedia.org/wiki/Athens is a document found on the Internet. As used in our sentence, it is a placeholder for Athens - nebulously defined, I admit - as a concept. Asking the question, "What is the latitude and longitude of Athens?", focuses the issue. It is not useful to respond with the location(s) of the Wikipedia servers. We clearly want to know the location of the site in "the real world", or 37° 58′ 0″ N, 23° 43′ 0″ E.

Links point to documents, but we often mean the underlying concept. Often this distinction doesn't matter. Sometimes it does, as in:

My source for the longitude and latitude of Athens is the Wikipedia article for Athens.

That sentence has the same link appearing two times, one meaning the concept, the other meaning the document. Wikipedia provides no mechanism for distinguishing between these meanings.

DBpedia does implement this distinction. But first, here's the intro sentence from the DBpedia website:
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.

In DBpedia, the following URLs are both valid:

  • http://dbpedia.org/resource/Athens
  • http://dbpedia.org/page/Athens

The first refers to the concept, the second is a specific document. This allows for the following useful HTML:
My source for the longitude and latitude of <a href="http://dbpedia.org/resource/Athens">Athens</a> is the <a href="http://dbpedia.org/page/Athens">DBpedia page</a>.

Looking at the DBpedia page http://dbpedia.org/page/Athens is useful because it gives a list of resources that are each related to dbpedia:Athens via owl:sameAs. These are:

Before looking at one of these, what is owl:sameAs? The OWL Web Ontology language is described here. Among the descriptions of owl:sameAs given there is that a "...typical use of sameAs would be to equate individuals defined in different documents to one another, as part of unifying two ontologies". So the DBpedia usage, which is paralleled in many other semantic web resources, is spot on.

The geonames.org reference is interesting. In part because the site has a discussion that explicitly addresses the difference between concept and document: http://www.geonames.org/ontology/. That page also has a link to a good blog post.

DBpedia follows the Geonames guidelines in using owl:sameAs to qualify its link to http://sws.geonames.org/264371/ , which is the Geonames URI for the concept "Athens". Clicking on that redirects you to the page http://www.geonames.org/264371/athens.html. Note the change of host to 'www.geonames.org' and the addition of 'athens.html'. The serial number remains the same.

Here is a screen grab of the "balloon" that is displayed next to the icon indicating the location of Athens.


There are two interesting links shown in this image: 'perma link' and 'semantic web rdf':

http://www.geonames.org/264371/athens.html is just the link to the page. http://sws.geonames.org/264371/about.rdf is an RDF document. It's worth looking at the source to see the attribute 'rdf:about="http://sws.geonames.org/264371/"'. URLs of the pattern 'http:...about.rdf' are documents. http://sws.geonames.org/264371/ is a concept.

Even with this soup of web addresses, there is a lot that Geonames is doing right. The only missed opportunity I see is that the "264371/athens.html" page gives no explicit indication of the concept address. There is the following: '<link rel="alternate" type="application/rdf+xml" title="RDF Version" href="http://sws.geonames.org/264371/about.rdf" />'. This is a link to a document, not a concept. And 'alternate' is too vague for me to know that I can parse that RDF to find its @about value.
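
For what it's worth, the parsing step itself is short once you know to do it. Here is a Python sketch; the RDF/XML below is a hypothetical, heavily reduced stand-in for 'about.rdf' (the 'ex:' vocabulary is invented for the example), and only the rdf:about value is taken from the real file as described above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical stand-in for about.rdf; the ex: terms are not Geonames' own.
ABOUT_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <ex:Place rdf:about="http://sws.geonames.org/264371/">
    <ex:label>Athens</ex:label>
  </ex:Place>
</rdf:RDF>"""


def about_values(rdf_xml):
    """Return every rdf:about value in the RDF/XML, in document order."""
    root = ET.fromstring(rdf_xml)
    about = f"{{{RDF_NS}}}about"
    return [element.get(about) for element in root.iter() if element.get(about)]


print(about_values(ABOUT_RDF))
```

The catch is the extra round trip: I have to fetch and parse a second document just to learn the concept URI.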

It would be nice if there were something like '<link rel="concept" type="application/rdf+xml" title="Concept URI" href="http://sws.geonames.org/264371" />'. I'm not too concerned with what's in @type so I left it as is. But 'concept' is not in any way standard. I just made it up.

If this post has a point, that's it. Make it really easy for me to figure out which URI is for the concept, because that's the one I really want to use. Or maybe I should end with a question. Is there an unambiguous and widely-accepted convention for indicating the concept lying behind a document? If not, we need one.