Tuesday, September 21, 2010

Discussing Citation by Example

I've started a set of pages at the Digital Classicist Wiki on the topic of Citation in digital scholarship. In progress, under construction, etc., etc., etc.

The goal is to move existing practice towards a broad understanding of how to make citations to such categories of evidence as primary written sources, geographic entities, cataloged objects, and secondary scholarship so that those citations are:
  • Clearly identified in a robust yet rich fashion
  • Recognizable by automatic agents
  • To resources that are stable over the long term

But I don't think it will be possible to establish and drive adoption of one very detailed standard. Better to have a simple notation - I follow others in suggesting 'class="citation"' for (x)html - that can indicate the presence of more detailed markup. I'm a fan of RDFa so I further discuss that on the page "Citations with added RDFa."
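To make "recognizable by automatic agents" concrete, here is a minimal sketch of how a script might pick out elements marked with class="citation". It assumes well-formed, XHTML-style markup in which every non-void element is closed. The class name CitationFinder and the sample markup are hypothetical, and the sketch does not read any RDFa attributes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the open-element stack stays balanced.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class CitationFinder(HTMLParser):
    """Collect the text of every element whose class list includes 'citation'."""

    def __init__(self):
        super().__init__()
        self.citations = []  # finished citation strings
        self._stack = []     # one flag per open element: did it open a citation?
        self._buffers = []   # text buffers for citations that are still open

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        is_citation = "citation" in classes
        self._stack.append(is_citation)
        if is_citation:
            self._buffers.append([])

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        if self._stack.pop():
            text = "".join(self._buffers.pop())
            self.citations.append(" ".join(text.split()))

    def handle_data(self, data):
        # Text belongs to every citation element that is currently open.
        for buf in self._buffers:
            buf.append(data)


sample = ('<p>See <span class="citation">Hom. Il. 23.100</span> and '
          '<cite class="citation primary">John 20:24</cite>.</p>')
finder = CitationFinder()
finder.feed(sample)
print(finder.citations)  # ['Hom. Il. 23.100', 'John 20:24']
```

The point of the simple class is that an agent needs nothing more than this to know a citation is present; richer markup such as RDFa can then be layered on for agents that understand it.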

The Digital Classicist community is pretty open and I'm very grateful to G. Bodard (a.k.a. palaeofuturist) for saying the equivalent of "Go for it." when I raised the possibility of hosting these materials in his realm.

There's a category for all the pages and I hope that list will grow.

Wednesday, September 8, 2010

References that just work (but I understand it's not that simple...)

Go to Google. Type in "John 20:24", then hit return. You can even click the "I'm Feeling Lucky" button. Or here's a direct link.

Or try the same thing in Bing (which provides results for Yahoo) and Altavista.

As you'll see, all three searches get you to the relevant passage of the Gospel According to John. And if you poke around on the Biblegateway site, you'll see various translations (but where's the Vulgate?).

That's impressive. It indicates that human-readable references can become so stable that automated agents are able to correctly translate them into links to particular chunks of primary text.

Here are some variations on the theme (all in Google):
"jean 20:24" (at google.fr): Not spot on, but pretty close.
"1 John 2:1": That's a reference to the first epistle of John. Entered into the "address bar" in Chrome. Seems to work.
"John 3": Unqualified chapter reference. Good to go.
"ephesians 1:2": Works.
"eph. 1:2": That abbreviation is OK.
"eph 1.2": Things become fuzzier if I don't use the ':' that is conventional in references to Christian scripture.
"Ephesians 2:4-10": Spans work as well, when properly formatted.
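The variations above suggest how little machinery is needed to read this citation scheme. The following is a minimal sketch, not a description of what Google or any other search engine actually does. The alias table BOOK_ALIASES and the function parse_reference are hypothetical and cover only the books tried in this post.

```python
import re

# Hypothetical alias table: only the book names and abbreviations tried above.
BOOK_ALIASES = {
    "john": "John",
    "1 john": "1 John",
    "ephesians": "Ephesians",
    "eph": "Ephesians",
}

REFERENCE = re.compile(
    r"""^\s*
    (?P<book>(?:[1-3]\s+)?[A-Za-z]+)\.?   # optional ordinal, book name, optional full stop
    \s+(?P<chapter>\d+)                   # chapter number
    (?:[:.](?P<verse>\d+)                 # colon or full stop, then verse
       (?:-(?P<end>\d+))?                 # optional hyphen and end verse for a span
    )?\s*$""",
    re.VERBOSE,
)


def parse_reference(text):
    """Return (book, chapter, first_verse, last_verse), or None if not recognized."""
    m = REFERENCE.match(text)
    if m is None:
        return None
    key = " ".join(m.group("book").lower().split())
    book = BOOK_ALIASES.get(key)
    if book is None:
        return None
    chapter = int(m.group("chapter"))
    verse = int(m.group("verse")) if m.group("verse") else None
    end = int(m.group("end")) if m.group("end") else verse
    return (book, chapter, verse, end)


print(parse_reference("John 20:24"))        # ('John', 20, 24, 24)
print(parse_reference("eph 1.2"))           # ('Ephesians', 1, 2, 2)
print(parse_reference("Ephesians 2:4-10"))  # ('Ephesians', 2, 4, 10)
print(parse_reference("John 3"))            # ('John', 3, None, None)
```

This sketch accepts both ':' and '.' as the separator, which is more lenient than the search results seemed to be. It returns None for "jean 20:24" because the table holds no French names.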

Again, I think this is interesting. Taking the New Testament as a corpus of Ancient Mediterranean texts that were written between the mid-first and third (at the latest: the Epistle of James 1 isn't definitively quoted until Origen) centuries AD makes it relevant to the study of the Ancient World as a whole. As a corpus, it's been around for a long time. Athanasius's letter of AD 367 is one conventional date for the determination of what was in, and what was out.

Those comments aside, the point remains that it is possible to automatically reverse engineer the citation scheme of a very stable corpus. I guess one caveat is that I don't absolutely know that Google, Bing, etc. haven't special cased strings that are plausibly references to the NT. Any ideas?

My larger goal is to think about references to so-called "primary texts" that just work. Given the above, my ad hoc, working definition of "primary text" is any text with a sufficiently stable name and citation scheme that search engines can find it. Sure, that's circular and incomplete, but it will do for now.

Let's try some others:
"gilgamesh 3": Muddled.
"gilgamesh tablet 3": Better.
"Iliad 23": Not bad. No Greek.
"Iliad 23.100": Individual line references don't work.
"Homer Iliad 23.100": Not better.
"Quran 32": I see it as the third link.
"hemingway, the old man and the sea": For comparison. Wikipedia is the top page for me; that's not the text itself. Amazon is up there too: the work is in copyright, so I'd have to pay. Not sure I want to follow the links that say I can download the text for free.

A major distinction between references to NT texts and the second group is the ability of Google to handle full chapter and verse ('n:n') references. That doesn't seem to work for the Iliad. That's worth exploring.

If I go to Perseus and use the search box at the upper right, "homer iliad 23.100" doesn't work directly. Nor does "iliad 23.100". But "Hom. Il. 23.100" does. If I try that string in Google (link), it gets me to the Chicago version of the Perseus texts (via the 2nd-ranked link when I tried it). [I'll take this opportunity to note that the Chicago Perseus is wicked, and that it's likewise wicked cool that Perseus texts are licensed so this redundancy is possible.]
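One way to bridge that gap is to normalize spelled-out names to the abbreviated form before searching. This is a minimal sketch under stated assumptions: the tables TITLE_ALIASES, AUTHOR_OF and AUTHOR_NAMES and the function abbreviate are hypothetical, cover only Homer, and say nothing about how Perseus itself resolves references.

```python
import re

# Hypothetical lookup tables covering only Homer.
TITLE_ALIASES = {
    "iliad": "Il.", "il": "Il.", "il.": "Il.",
    "odyssey": "Od.", "od": "Od.", "od.": "Od.",
}
AUTHOR_OF = {"Il.": "Hom.", "Od.": "Hom."}
AUTHOR_NAMES = {"homer", "hom", "hom."}

# A passage is a book number, optionally followed by '.' and a line number.
PASSAGE = re.compile(r"\d+(\.\d+)?")


def abbreviate(reference):
    """Rewrite e.g. 'homer iliad 23.100' as 'Hom. Il. 23.100'; None if unknown."""
    words = reference.split()
    if len(words) < 2:
        return None
    *names, passage = words
    if not PASSAGE.fullmatch(passage):
        return None
    names = [w.lower() for w in names]
    # Drop a leading author name; the author is implied by the title anyway.
    if names[0] in AUTHOR_NAMES:
        names = names[1:]
    if len(names) != 1:
        return None
    title = TITLE_ALIASES.get(names[0])
    if title is None:
        return None
    return f"{AUTHOR_OF[title]} {title} {passage}"


print(abbreviate("homer iliad 23.100"))  # Hom. Il. 23.100
print(abbreviate("iliad 23.100"))        # Hom. Il. 23.100
print(abbreviate("gilgamesh 3"))         # None
```

The interesting design question is who maintains such tables. For a corpus with a stable, widely shared abbreviation scheme they stay small; for anything else they grow quickly.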

That kind of variation is one of the reasons I parenthetically qualified the title of this post. References to "primary texts" - and other texts for that matter - are not simple. In this post - as is often my wont - I've let myself be drawn along by current practice. I really do like to see what people are actually doing and how data actually works on the Internet. If you want a more substantive discussion of the problems of citation, I highly recommend Neel Smith's "Citation in Classical Studies" in DHQ 2009. Here's the abstract:
Citation practice reflects a model of a scholarly domain. This paper first considers traditional citation practice in the humanities as a description of our subjects of study. It then describes work at the Center for Hellenic Studies on an architecture for digital scholarship that is explicitly based on this model, and proposes a machine-actionable but technologically independent notation for citing texts, the Canonical Text Services URN.
For now, let me say that it is correct for Google (via Biblegateway) to dereference a citation to John 7:53-8:11 (the Pericope Adulterae) or 1 John 5:7 (the Comma Johanneum). Neither may have been in the "original" text of the Gospel or First Epistle of John, but references to them are semantically clear and have been used "in the wild" so need to be handled. But note that Google prioritizes discussion of the CJ over the text (or at least does when I'm trying it now). Again, see N. Smith on the implications of such variation.
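For a sense of what a machine-actionable notation looks like next to the human-readable strings above, here is a minimal sketch that splits a CTS URN into its colon-separated parts. It is a simplified reading of the notation, not an implementation of the specification: it ignores subreferences and does no validation of identifiers. The name parse_cts_urn is hypothetical, the Iliad identifier is assumed to follow the values commonly used for Homer, and the second URN uses made-up placeholder identifiers purely to show a range that crosses chapters.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    work: str
    start: Optional[str]  # first passage cited, or None for the whole work
    end: Optional[str]    # last passage cited (same as start for a single passage)


def parse_cts_urn(urn):
    """Split 'urn:cts:<namespace>:<work>[:<passage>]' into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 else ""
    if not passage:
        return CtsUrn(namespace, work, None, None)
    # A hyphen separates the two ends of a range.
    start, _, end = passage.partition("-")
    return CtsUrn(namespace, work, start, end or start)


# Assumed identifier for the Iliad; treat it as illustrative.
print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:23.100"))

# Placeholder namespace and work identifiers, invented for this example.
print(parse_cts_urn("urn:cts:example:group.work:7.53-8.11"))
```

The contrast with "Hom. Il. 23.100" is the point: the URN needs no alias table or guesswork to parse, at the cost of being something no reader would type into a search box.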

Clearly it helps to have a committed body of believers and/or scholars working on very old texts. Energy and time make for stable references. But there is variability in functionality even within that group. I guess the long-term question is how do we move more texts into the category of "just working"? I am assuming we want to. And how do we support co-existence of the simple "reference following" alongside what Neel describes? Both are useful.