Friday, February 1, 2008

PRAP, xhtml 2.0 and Archaeological Databases

For some years now I have been working in collaboration with many colleagues on data from the Pylos Regional Archaeological Project (PRAP). As with most more-or-less recent field-projects, one result of all our work is a collection of database files. Ours happen to be in FileMaker but that's a detail in terms of this discussion. What I really want to focus on is our decision to package all the data into an archival format that can be made available for use by all-comers and for storage by institutions that want to help ensure access to this resource over the long term. But what format should we use?

There are a couple of options that are specific to archaeology and/or cultural heritage management. Open Context is using a subset of the ArchaeoML datamodel. There's also the CIDOC Conceptual Reference Model. Right now, I am moving the PRAP data into an xhtml 2.0 based representation and thought I'd take the time to say what I like about it. Of course, I don't mean to reject other options. It's just that I'm looking for a lightweight standard in which to test the data and relationships that we want to archive in as accessible a format as possible. Who knows what will come in the future so for now I'm focusing on the present.

PRAP is a survey project, so the first phase of fieldwork was to collect material by tracts, defined roughly as units of similar surface conditions: an olive grove (perhaps divided), a terrace, a fallow field. On the basis of density, Places Of Special Interest (POSIs) were identified - I'll call them sites from now on. Sites were then collected by a grid or by subdividing tracts. Either way, the material from a site consisted of both the tract collection material and the site collection material. Both tracts and the divisions of the site pickup are generically known as "collection units". Collection units held pottery, and collected sherds could be numbered by extending the id of the collection unit from which they came. In this system, "A92-001" is a tract collection unit, "A93-901001" is a site collection unit, and "A92-001-01" is a sherd from the tract, whereas "A93-901001-01" is from the site collection.
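The numbering convention can be sketched in a few lines of Python. This is only an illustration: the function name collection_unit_of is hypothetical, and it assumes that a sherd id is always its collection unit id plus one final hyphen-separated sequence number, which is what the four examples above suggest but may not hold for every record.

```python
def collection_unit_of(sherd_id: str) -> str:
    """Return the collection unit id that a sherd id extends.

    Assumption: a sherd id is its collection unit id followed by "-" and a
    numeric sequence number, as in the examples in this post.
    """
    unit, sep, seq = sherd_id.rpartition("-")
    # A collection unit id such as "A92-001" has only one hyphen, so after
    # splitting off the last segment there must still be a hyphen left.
    if not sep or not seq.isdigit() or "-" not in unit:
        raise ValueError(f"not a sherd id: {sherd_id!r}")
    return unit


print(collection_unit_of("A92-001-01"))     # A92-001
print(collection_unit_of("A93-901001-01"))  # A93-901001
```

Passing a collection unit id such as "A92-001" raises ValueError rather than silently returning "A92".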

I'll now go over some broad ideas for representing this data model in xhtml.

Version 2.0 of xhtml includes the Core, Embedding, and Metainformation attribute modules. In combination with the div and span elements, these modules make the following hypothetical xml fragments/stubs (almost) valid as part of a more complete document.

First, two sherds:

<div class="pottery" id="prap:pottery:A92-001-01">
<span property="ware">African Red Slip</span>
<span property="part">Rim</span>
<span property="quantity">1</span>
<span property="collectionunit" src="prap:collectionunit:A92-001"/>
</div>

<div class="pottery" id="prap:pottery:A93-901001-01">
<span property="ware">African Red Slip</span>
<span property="part">Rim</span>
<span property="quantity">1</span>
<span property="collectionunit" src="prap:collectionunit:A93-901001"/>
</div>

Now, two collection units:

<div class="collectionunit" id="prap:collectionunit:A92-001">
<span property="method">tract collection</span>
<span property="site" src="prap:site:A01"/>
</div>

<div class="collectionunit" id="prap:collectionunit:A93-901001">
<span property="method">site collection</span>
<span property="site" src="prap:site:A01"/>
</div>

Now the site A01:

<div class="site" id="prap:site:A01">
<span property="description">An ancient site.</span>
</div>

In the above model, divs have classes and a unique id and are analogous to records in a column-oriented database. Divs consist of spans, which have properties and either content or a src attribute. Spans are analogous to database columns/fields. If a span has content, that's the value of the property. If it has a src attribute, that's a reference to the id of an existing div. In this near xml, each sherd is said to come from a collection unit and each collection unit is assigned to a site. Therefore, it is possible to know that these sherds both came from site A01.
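As a rough check that the references really do chain together, here is a minimal Python sketch using only the standard library. It is not part of the PRAP tooling: the wrapping body element, the function names follow and site_of, and the trimmed-down spans are all assumptions made so that the fragments parse as one small document.

```python
import xml.etree.ElementTree as ET

# Fragments modelled on the ones above, wrapped in a hypothetical <body>
# root so that they parse as a single document.
DATA = """
<body>
  <div class="pottery" id="prap:pottery:A92-001-01">
    <span property="ware">African Red Slip</span>
    <span property="collectionunit" src="prap:collectionunit:A92-001"/>
  </div>
  <div class="pottery" id="prap:pottery:A93-901001-01">
    <span property="ware">African Red Slip</span>
    <span property="collectionunit" src="prap:collectionunit:A93-901001"/>
  </div>
  <div class="collectionunit" id="prap:collectionunit:A92-001">
    <span property="method">tract collection</span>
    <span property="site" src="prap:site:A01"/>
  </div>
  <div class="collectionunit" id="prap:collectionunit:A93-901001">
    <span property="method">site collection</span>
    <span property="site" src="prap:site:A01"/>
  </div>
  <div class="site" id="prap:site:A01">
    <span property="description">An ancient site.</span>
  </div>
</body>
"""

root = ET.fromstring(DATA)

# Every div is a record; index the records by their unique id.
by_id = {div.get("id"): div for div in root.iter("div")}


def follow(div, prop):
    """Return the div referenced by the src of the span with this property."""
    for span in div.findall("span"):
        if span.get("property") == prop and span.get("src"):
            return by_id[span.get("src")]
    return None


def site_of(sherd_id):
    """Walk sherd -> collection unit -> site and return the site's id."""
    unit = follow(by_id[sherd_id], "collectionunit")
    site = follow(unit, "site")
    return site.get("id")


for sherd in ("prap:pottery:A92-001-01", "prap:pottery:A93-901001-01"):
    print(sherd, "->", site_of(sherd))
```

Both sherds resolve to prap:site:A01, one by way of the tract collection unit and one by way of the site collection unit.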

I like the fact that each div is self-documenting as to its structure. That's better than a line in a tab-separated text file. I also like that the metainformation is strongly typed. Take for example the following snippet of xslt:

<xsl:key name="classes" match="//*[@class]" use="@class"/>
<xsl:key name="ids" match="//*[@id]" use="@id"/>
<xsl:key name="properties" match="//*[@property]" use="@property"/>
<xsl:key name="srcs" match="//*[@src]" use="@src"/>

When applied to a large repository of xhtml data, this code will build quickly searchable indexes of all the class, id, property, and src attributes. That will in turn allow for efficient navigation of the database structure. You can see this in practice if you download and unpack the file at prapdigitalarchive-prerelease.tar.gz. Pay attention to the prerelease in the filename. What you're getting are my very preliminary efforts to put the thoughts expressed above into action. You are also getting a mass of unedited field data, so don't be picky. And note the CC by-nc-nd license on the files.
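For readers who don't use xslt, the idea behind those four keys can be approximated in Python. This is a sketch of the same indexing idea, not a description of how an xslt processor implements xsl:key; the function name build_indexes and the two-record sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A two-record sample in the style of the fragments above, wrapped in a
# hypothetical <body> root element.
SAMPLE = """
<body>
  <div class="pottery" id="prap:pottery:A92-001-01">
    <span property="ware">African Red Slip</span>
    <span property="collectionunit" src="prap:collectionunit:A92-001"/>
  </div>
  <div class="collectionunit" id="prap:collectionunit:A92-001">
    <span property="method">tract collection</span>
    <span property="site" src="prap:site:A01"/>
  </div>
</body>
"""


def build_indexes(root, attributes=("class", "id", "property", "src")):
    """Map attribute name -> attribute value -> list of matching elements."""
    indexes = {name: defaultdict(list) for name in attributes}
    for element in root.iter():
        for name in attributes:
            value = element.get(name)
            if value is not None:
                indexes[name][value].append(element)
    return indexes


indexes = build_indexes(ET.fromstring(SAMPLE))

# Which elements point at collection unit A92-001? This is roughly the
# lookup that key('srcs', 'prap:collectionunit:A92-001') performs in xslt.
for span in indexes["src"]["prap:collectionunit:A92-001"]:
    print(span.get("property"))
```

One pass over the document builds all four indexes, after which each lookup is a dictionary access rather than a scan of the whole tree.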

Some points of interest. Unpack the file in its own directory. If you want to just see output, look in the pdf folder for sitegazetteer.pdf. If you want to generate this file yourself, you'll need to execute something like the following sequence of commands from within the directory you created for unpacking:

xmllint -xinclude data/include.xml > data/prap.xml
xsltproc xslt/sitegazetteer.xsl data/prap.xml > sg.fo
fop sg.fo sitegazetteer.pdf

[Fop is available here.]

Or look at xslt/test.xsl to see a simple manipulation of class, id, property and src attributes. Run that against prap.xml for some not very interesting output.

I don't know that anybody will actually download the file and run these transformations. My point is that you can. That's another advantage of using an xml based data model/work flow. We can publish not only the data but the scripts that we use to manipulate that data. That's an important principle being embraced by researchers, particularly in the sciences.

So... I'm at the beginning of developing the PRAP Digital Archive and this is the first public announcement of it. There are many little things to do and big decisions to make before it gets anywhere near being "finished". But I'll update the tar ball as I go along and will highlight the ceramic bits as appropriate.
