
It was with a fair amount of interest that I read via AWOL that Corpus Scriptorum Ecclesiasticorum Latinorum (CSEL) texts are now available in XML (TEI/Epidoc) format through GitHub – just the sorts of texts I’m interested in adding to De Commentariis.

But it turns out there’s a fair bit of work to do on the texts before they’re usable in a programmatic way. The format of the XML raises two questions for me. It’s always confused me that people talk about using “epidoc” (“Epigraphic documents in TEI XML”) to encode literary texts. Why is it used in this way, to encode documents it is apparently not designed to encode?

The second question follows on from this. I don’t know whether this is an artefact of using Epidoc or an artefact of the particular choices made in encoding the CSEL. The standard numbering systems of the critical editions of these texts are effectively lost in the Epidoc versions online, which makes it hard for a program to retrieve a passage by its standard scholarly reference.

Different texts have different breakdowns, for example, Book/Poem/Line, Book/Line, Letter Number/Line, and so on depending on the particular text and the choices made by the editor of the critical edition. In the Perseus format (the “old” format?) the TEI documents have a header that tells my programs on De Commentariis the structure of the document breakdown, thus:

<encodingdesc> <refsdecl doctype="TEI.2">
    <state delim="." unit="book"></state>
    <state unit="chapter"></state>
    <state unit="section"></state>
</refsdecl> </encodingdesc>

This tells me that this particular text is encoded in book.chapter.section format, e.g. 5.3.2. Then the text body itself has those very book/chapter/section divisions in it:

<div1 type="book" n="1">
  <head>COMMENTARIUS PRIMUS</head>
  <div2 type="chapter" n="1">
      <div3 type="section" n="1">
          <p>Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.</p>
      </div3>
      …
  </div2>
  … 
</div1>

This gives the document a structured, hierarchical view of the content. Everything contained within the div1 element with the attributes type=“book” and n=“1” is a part of Book 1, the div2 element inside that with type=“chapter” and n=“1” is 1.1, and inside that the div3 with type=“section” and n=“1” is 1.1.1. The abstract document structure (according to the standardised referencing established by the critical edition) is encoded directly onto the data structure. It’s an excellent XML structure that directly reflects the way the data is referenced, with enough flexibility to encode many different types of referencing schema, as long as it’s laid out in the metadata and the relationship is hierarchical. It’s easily navigable with standardised XML tools like xpath/xquery or simple XML DOM (document object model) manipulation.
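To make that concrete, here is a minimal sketch in Python of how a program might read the header and then walk down to a reference such as 1.1.1. It is not the actual De Commentariis code: the `<tei>` and `<body>` wrappers, the shortened sample text, and the function names `reference_units` and `resolve` are all assumptions made for illustration, and the element names are lower-cased only to match the excerpts above.

```python
import xml.etree.ElementTree as ET

# A cut-down document in the layout quoted above. The <tei> and <body>
# wrappers are hypothetical; they are only here so the sample parses.
SAMPLE = """
<tei>
  <encodingdesc>
    <refsdecl doctype="TEI.2">
      <state delim="." unit="book"></state>
      <state unit="chapter"></state>
      <state unit="section"></state>
    </refsdecl>
  </encodingdesc>
  <body>
    <div1 type="book" n="1">
      <head>COMMENTARIUS PRIMUS</head>
      <div2 type="chapter" n="1">
        <div3 type="section" n="1">
          <p>Gallia est omnis divisa in partes tres.</p>
        </div3>
      </div2>
    </div1>
  </body>
</tei>
"""


def reference_units(root):
    """Return the citation units declared in the header, outermost first."""
    return [state.get("unit") for state in root.iter("state")]


def resolve(root, reference):
    """Follow a dotted reference such as '1.1.1' down the nested divs."""
    node = root.find("body")
    parts = reference.split(".")
    for depth, (unit, number) in enumerate(zip(reference_units(root), parts), start=1):
        # div1 holds the first unit, div2 the second, and so on.
        node = node.find(f"div{depth}[@type='{unit}'][@n='{number}']")
        if node is None:
            return None  # no such book/chapter/section in this document
    return node


root = ET.fromstring(SAMPLE)
print(reference_units(root))   # ['book', 'chapter', 'section']
section = resolve(root, "1.1.1")
print(section.find("p").text)  # Gallia est omnis divisa in partes tres.
```

Because the containment in the XML mirrors the reference scheme, the lookup is just one step down the tree per component of the reference, and nothing in it is specific to book/chapter/section.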

On the other hand, this is not:

<p>Sancto episcopo Salonio Saluianus salutem in domino. <note type="chapter"> 1 </note> </p>
<p>Omnes admodum homines, qui pertinere ad humani officii <lb n="5"></lb>

In this style of format, the presentation of the text (the original page it was scanned from) is confused with the data structure, and the critical data structure information is presented in the form of an annotation attached to a particular line (rather than enclosing all the lines which belong to chapter 1). This style of document is incredibly difficult to use with standard tools like xpath. This is highlighted if we go down just a little further into the text:

tantum laudem aucupantes tam indignis rebus curam impen­ <lb></lb>
derent, non tam inlustrasse mihi ipsa ingenia quam damnasse<note type="chapter"> 3 </note> <lb n="10"></lb>
uideantur. nos autem, qui rerum magis quam uerborum ama­ <lb></lb>

Where does chapter 3 start? Clearly not half-way through impenderent and most likely not at the word break in damnasse uideantur. Is it at the comma after impenderent? At the full stop after uideantur? A human, familiar with the original text, might be able to decide: a simple algorithm inside a computer program, probably not.
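A small Python sketch shows what a program can and cannot recover here. It assumes the three lines above are wrapped in a hypothetical `<p>` element so that they parse, that the attribute quotes are straight, and that plain hyphens stand in for the line-end hyphens; the function name `split_at_chapter_notes` is mine, not part of any library.

```python
import xml.etree.ElementTree as ET

# The lines quoted above, wrapped in a hypothetical <p> so they form
# well-formed XML. Plain hyphens stand in for the line-end hyphens.
FRAGMENT = (
    "<p>tantum laudem aucupantes tam indignis rebus curam impen- <lb></lb>\n"
    "derent, non tam inlustrasse mihi ipsa ingenia quam damnasse"
    '<note type="chapter"> 3 </note> <lb n="10"></lb>\n'
    "uideantur. nos autem, qui rerum magis quam uerborum ama- <lb></lb></p>"
)


def split_at_chapter_notes(paragraph):
    """Return (chapter label, text before the marker, text after it) tuples."""
    # Flatten the paragraph into text runs and chapter markers, in document order.
    pieces = [("text", paragraph.text or "")]
    for child in paragraph:
        if child.tag == "note" and child.get("type") == "chapter":
            pieces.append(("marker", (child.text or "").strip()))
        pieces.append(("text", child.tail or ""))

    results = []
    for i, (kind, value) in enumerate(pieces):
        if kind == "marker":
            before = "".join(v for k, v in pieces[:i] if k == "text")
            after = "".join(v for k, v in pieces[i + 1:] if k == "text")
            results.append((value, before, after))
    return results


for label, before, after in split_at_chapter_notes(ET.fromstring(FRAGMENT)):
    print(f"chapter {label} marker falls between:")
    print("  before:", " ".join(before.split()[-3:]))
    print("  after: ", " ".join(after.split()[:3]))
```

The marker can be located, but all it yields is a position between damnasse and uideantur, that is, the printed line on which the chapter number appeared in the margin. Nothing in the markup says which word opens chapter 3, so any boundary the program picks is a guess.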

I bring this up – I know it may seem churlish, after all any open XML version of an ancient text has to be a good thing – because I feel that in the “official” digital classics circles there is a certain enthusiasm for recoding existing XML texts to the Epidoc format, but if this is the result, it’s a definite step backwards. Forgive me if I am wrong and this is merely the first step in getting from the presentation layer (the scan of the book) to the data layer (a properly structured XML version of the text). But I certainly hope this style of markup isn’t regarded as the standard way to proceed into the future.

Update: an interesting set of notes by @paolomonella on Epidoc and the difference between “documents” and “texts” is found here and there is a twitter conversation here.